Modern organizations work with huge amounts of data. That data can be available in various forms, including documents, spreadsheets, audio recordings, emails, JSON, and many, many more. One of the most common ways in which such data is recorded is via text. That text is typically quite similar to the natural language that we use day-to-day.
Natural Language Processing (NLP) is the study of programming computers to process and analyse large amounts of natural textual data. Knowledge of NLP is important for Data Scientists since text is such an easy-to-use and common container for storing data.
Faced with the task of performing analysis and building models from textual data, one must know how to perform the essential Data Science tasks. That includes cleaning, formatting, parsing, analysing, visualizing, and modelling the text data. All of these require a few extra steps in addition to the standard way these tasks are done when the data is made up of raw numbers.
This article will teach you the importance of NLP when used in Data Science. Here we will cover some of the most common techniques you can use to handle your text data, including code examples with NLTK, the Natural Language Toolkit.
Table of contents
∙ Tokenization
∙ Stop Word Removal
∙ Stemming
∙ Lemmatization
∙ Sentiment Analysis
∙ Tokenization:
Tokenization is the process of splitting sentences into words. This is not as simple as it looks. For instance, a phrase like “New York” would be separated into two tokens.
However, “New York” is a proper noun and could be quite important in our analysis. We might be better off keeping it as a single token. As such, care must be taken during this step (a sketch addressing this follows the example below).
The main advantage of Tokenization is that it converts the text into a format that is easier to convert to raw numbers, which can actually be used for processing. It is a natural first step when analysing text data.
Let us consider a simple example.
Code:
import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models
sentence = "My name is Aniket and I love NLP"
tokens = nltk.word_tokenize(sentence)
print(tokens)
o/p: ['My', 'name', 'is', 'Aniket', 'and', 'I', 'love', 'NLP']
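As mentioned above, we may want a multi-word phrase such as “New York” to survive as a single token. Here is a minimal sketch of one way to do this with NLTK's MWETokenizer; the phrase list is our own illustrative choice, not part of the original example:
import nltk
from nltk.tokenize import MWETokenizer
# Declare the multi-word expressions we want to keep together
mwe_tokenizer = MWETokenizer([('New', 'York')], separator=' ')
tokens = nltk.word_tokenize("I love New York")
print(mwe_tokenizer.tokenize(tokens))
o/p: ['I', 'love', 'New York']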
∙ Stop Word Removal:
After tokenization we have to remove stop words. Stop Word Removal has a similar goal to Tokenization: convert the text data into a format that is more suitable for processing. In this case, stop word removal removes common words like “and”, “the”, “a”, and so on in English. This way, when we analyse our data, we will be able to cut through the noise and focus on the words that have actual real-world meaning.
Stop word removal can be done easily by removing words that are in a pre-defined list. An important thing to note is that there is no universal list of stop words. As such, the list is often created from scratch and tailored to the application being worked on.
Code:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download of the stop word lists
sentence = "This is a sentence for removing stop words"
tokens = nltk.word_tokenize(sentence)
stop_words = stopwords.words('english')
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)
Output:
['This', 'sentence', 'removing', 'stop', 'words']
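Note that the comparison above is case-sensitive, which is why the capitalized “This” survives even though “this” is in the list. A common refinement, offered here as our own suggestion rather than part of the original example, is to lowercase each token before checking:
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print(filtered_tokens)
o/p: ['sentence', 'removing', 'stop', 'words']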
∙ Stemming:
Stemming is another technique for cleaning up text data for processing. Stemming is the process of reducing words to their root form. The purpose of this is to reduce words which are spelled slightly differently due to context, but have the same meaning, to the same token for processing. For example, consider using the word “cook” in a sentence. There are quite a lot of ways we can write the word “cook”, depending on the context:
cook ===> cook
cooks ===> cook
cooked ===> cook
cooking ===> cook
In the above example, the common root word is “cook”.
All of these different forms of the word cook have essentially the same definition. So, ideally, when we are doing our analysis, we would want them all to be mapped to the same token. In this case, we mapped all of them to the token for the word “cook”.
Code:
import nltk
snowball_stemmer = nltk.stem.SnowballStemmer('english')
s_1 = snowball_stemmer.stem("cook")
s_2 = snowball_stemmer.stem("cooks")
s_3 = snowball_stemmer.stem("cooked")
s_4 = snowball_stemmer.stem("cooking")
print(s_1, s_2, s_3, s_4)
Output:
s_1, s_2, s_3, and s_4 all have the same value, i.e. “cook”.
∙ Lemmatization:
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.
Lemmatization is similar to stemming, but it brings context to the words, so it links words with similar meanings to one word. One major difference from stemming is that the lemmatizer takes a part-of-speech parameter, “pos”. If it is not supplied, the default is “noun”.
Code:
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
# Verb
print(lemmatizer.lemmatize('playing', pos='v'))
o/p: play
# Noun
print(lemmatizer.lemmatize('playing', pos='n'))
o/p: playing
# Adjective
print(lemmatizer.lemmatize('playing', pos='a'))
o/p: playing
# Adverb
print(lemmatizer.lemmatize('playing', pos='r'))
o/p: playing
∙ Which one is better: Stemming or Lemmatization?
Stemming works on words without knowing their context, which is why stemming has lower accuracy but is faster than lemmatization. Lemmatizing is generally considered better than stemming: a lemmatizer returns a real word even if it is not the same word (it might be a synonym), but at least it is a real word. Sometimes you do not care about this level of accuracy and all you need is speed; in that case, stemming is better.
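To make the trade-off concrete, here is a minimal side-by-side sketch; the word “studies” is our own illustrative choice, not from the examples above. The stemmer chops the word down to a non-dictionary form, while the lemmatizer returns a real word:
import nltk
from nltk.stem import WordNetLemmatizer
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))          # studi  (fast, but not a real word)
print(lemmatizer.lemmatize("studies"))  # study  (slower, but a real word)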
∙ Sentiment Analysis:
Sentiment Analysis is a broad range of subjective analysis which uses natural language processing techniques to perform tasks such as identifying the sentiment of a customer review, detecting positive or negative feeling in a sentence, judging mood via voice analysis or transcription analysis, and so on.
Example:
“I did not like the chocolate milk-shake” – is a negative experience of the milk-shake.
“I did not hate the chocolate milk-shake” – may be considered a neutral experience.
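As a quick illustration, NLTK ships with the VADER sentiment analyzer, which handles simple negation like the examples above. A minimal sketch, assuming the vader_lexicon resource has been downloaded:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()
# Each call returns a dict of neg/neu/pos/compound scores
print(sia.polarity_scores("I did not like the chocolate milk-shake"))
print(sia.polarity_scores("I did not hate the chocolate milk-shake"))
# The first sentence should receive a negative compound score,
# the second a noticeably less negative one.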
∙ Some approaches in Sentiment Analysis:
- Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into pre-set groups. Examples of such groups include names of people and names of places (see the sketch after this list).
- Word sense disambiguation: It involves giving meaning to a word based on the context.
- Natural language generation: It involves using databases to derive semantic intentions and convert them into human-readable language.
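For named entity recognition specifically, NLTK provides a basic chunker. A minimal sketch, assuming the tagger and chunker resources have been downloaded; the sentence is our own example:
import nltk
# One-time downloads (assumed): 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', 'words'
sentence = "Aniket moved to New York"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tag for each token
tree = nltk.ne_chunk(tagged)    # groups tagged tokens into entities
print(tree)
# Named entities such as PERSON and GPE appear as labelled subtrees.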
∙ Why is NLP difficult?
Natural Language Processing is considered a difficult problem in computer science. It is the nature of human language that makes NLP difficult.
The rules that dictate the passing of information using natural languages are not easy for computers to understand.
Some of these rules can be high-level and abstract; for instance, when someone uses a sarcastic remark to pass information. On the other hand, some of these rules can be low-level; for instance, using the character “s” to signify the plurality of items.