NLP

Summary: An NLTK implementation of basic text cleaning and normalization techniques using a POS tagger.

https://towardsdatascience.com/building-a-text-normalizer-using-nltk-ft-pos-tagger-e713e611db8

NLP<3 I : Text Normalization using POS Taggers


In [1]:
import nltk
In [2]:
hermione_said = '''Books! And cleverness! There are more important things - friendship and bravery and - oh Harry - be careful!'''

I. Tokenization Processes


Sentence Tokenize

A collection of sequential groups of words, where each sequence corresponds roughly to one sentence.

In [3]:
from nltk import sent_tokenize, word_tokenize
sequences = sent_tokenize(hermione_said)
sequences
Out[3]:
['Books!',
 'And cleverness!',
 'There are more important things - friendship and bravery and - oh Harry - be careful!']

Word Tokenize

A collection of words - or more appropriately, tokens - that form a sequence (or sentence). While a sentence composed of lexical words should have a meaning, a sequence of tokens might not bear any significant meaning at first glance.

In [4]:
seq_tokens = [word_tokenize(seq) for seq in sequences]
seq_tokens
Out[4]:
[['Books', '!'],
 ['And', 'cleverness', '!'],
 ['There',
  'are',
  'more',
  'important',
  'things',
  '-',
  'friendship',
  'and',
  'bravery',
  'and',
  '-',
  'oh',
  'Harry',
  '-',
  'be',
  'careful',
  '!']]

A token by itself does not necessarily carry meaning; tokens consisting only of punctuation, for example, carry none. Isolated words like "books", "and" and "friendship" do have dictionary meanings, but human communication is always contextual, and context is difficult to decipher from single words.

Remove Punctuations

In [5]:
import string
no_punct_seq_tokens = []

for seq_token in seq_tokens:
    no_punct_seq_tokens.append([token for token in seq_token if token not in string.punctuation])
In [6]:
no_punct_seq_tokens
Out[6]:
[['Books'],
 ['And', 'cleverness'],
 ['There',
  'are',
  'more',
  'important',
  'things',
  'friendship',
  'and',
  'bravery',
  'and',
  'oh',
  'Harry',
  'be',
  'careful']]

II. Normalization Techniques - Stemming and Lemmatization


Resolving ambiguity by reducing tokens in inflectional or derivational forms to a common base form.

Stemming

Stemming is used to reduce the different grammatical forms of a word (noun, adjective, verb, adverb, etc.) to its root form. Computationally, it is a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. It is important for information retrieval systems.

Ref: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [7]:
# Using Porter Stemmer implementation in nltk
from nltk.stem import PorterStemmer
In [8]:
stemmer = PorterStemmer()
In [9]:
stemmed_tokens = [stemmer.stem(token) for seq in no_punct_seq_tokens for token in seq]
stemmed_tokens
Out[9]:
['book',
 'and',
 'clever',
 'there',
 'are',
 'more',
 'import',
 'thing',
 'friendship',
 'and',
 'braveri',
 'and',
 'oh',
 'harri',
 'be',
 'care']

Note: Stemming is rule based, which is why we see forms like "braveri" and "harri" that make no sense. Also notice that all the words have been transformed to lower case. This poses a challenge for proper noun detection, because the most obvious surface cue - the capitalized first letter - is no longer in the data (one workaround is sketched below).
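If case matters downstream, a simple workaround is to keep the original surface form next to its stem, so the capitalization cue survives stemming. A minimal sketch, reusing the stemmer and token lists defined above:

In [ ]:
# Keep the original token alongside its lower-cased stem so the
# capital-letter cue for proper nouns is not lost.
token_stem_pairs = [(token, stemmer.stem(token))
                    for seq in no_punct_seq_tokens for token in seq]
token_stem_pairs[:3]  # e.g. [('Books', 'book'), ('And', 'and'), ('cleverness', 'clever')]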

Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Ref: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [10]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lm = WordNetLemmatizer()
In [11]:
lemmatized_tokens = [lm.lemmatize(token) for seq in no_punct_seq_tokens for token in seq]
lemmatized_tokens
Out[11]:
['Books',
 'And',
 'cleverness',
 'There',
 'are',
 'more',
 'important',
 'thing',
 'friendship',
 'and',
 'bravery',
 'and',
 'oh',
 'Harry',
 'be',
 'careful']

This looks quite naive. The only thing that has changed is "things" becoming "thing".

So now we will make use of the pos argument, lemmatize again, and test a few variations:

In [12]:
lm.lemmatize("running", pos="v"), lm.lemmatize("running", pos="n")
Out[12]:
('run', 'running')
In [13]:
lm.lemmatize("Harry", pos="n"), lm.lemmatize("Harry", pos="v"), lm.lemmatize("harry", pos="v")
Out[13]:
('Harry', 'Harry', 'harry')
In [14]:
lm.lemmatize("Books", pos="n"), lm.lemmatize("Books", pos="v")
Out[14]:
('Books', 'Books')
In [15]:
lm.lemmatize("books", pos="v"), lm.lemmatize("books", pos="n")
Out[15]:
('book', 'book')
In [16]:
lm.lemmatize("more", pos="a"), lm.lemmatize("better", pos="a"), lm.lemmatize("best", pos="a")
Out[16]:
('more', 'good', 'best')

III. POS TAGGER


Penn Treebank tag Explanation
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non 3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WRB Wh-adverb
WP$ Possessive wh-pronoun

To learn more about the tags, see the Penn Treebank tag set documentation; NLTK can also print the definitions directly, as shown below.
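For quick reference from inside a session, nltk.help.upenn_tagset prints the definition and examples for a tag (it needs the 'tagsets' resource, downloadable via nltk.download):

In [ ]:
# Look up Penn Treebank tag definitions from within NLTK.
# Requires: nltk.download('tagsets')
nltk.help.upenn_tagset('NNP')    # definition and examples for NNP
nltk.help.upenn_tagset('VB.*')   # a regular expression matches the whole verb family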

In [17]:
import os
from nltk.tag import StanfordPOSTagger

path = os.getcwd()
path_to_stnfrd_core_nlp = path + '/stanford-postagger/'

jar = path_to_stnfrd_core_nlp + 'stanford-postagger.jar'
model = path_to_stnfrd_core_nlp + 'models/english-bidirectional-distsim.tagger'

st = StanfordPOSTagger(model, jar, encoding='utf8')
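Note that the Stanford tagger needs Java plus the jar and model files on disk. If you just want to experiment without that dependency, NLTK's built-in perceptron tagger is a rough stand-in (its tags may differ slightly from the Stanford output shown below):

In [ ]:
# Fallback: NLTK's averaged perceptron tagger, no Java required.
# Requires: nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(lemmatized_tokens)   # same Penn Treebank tag set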
In [18]:
text_tags_lemmatized_tokens = st.tag(lemmatized_tokens)
In [19]:
text_tags_lemmatized_tokens
Out[19]:
[('Books', 'NNS'),
 ('And', 'CC'),
 ('cleverness', 'NN'),
 ('There', 'EX'),
 ('are', 'VBP'),
 ('more', 'RBR'),
 ('important', 'JJ'),
 ('thing', 'NN'),
 ('friendship', 'NN'),
 ('and', 'CC'),
 ('bravery', 'NN'),
 ('and', 'CC'),
 ('oh', 'UH'),
 ('Harry', 'NNP'),
 ('be', 'VB'),
 ('careful', 'JJ')]
In [20]:
text_tags_stemmed_tokens = st.tag(stemmed_tokens)
In [21]:
text_tags_stemmed_tokens
Out[21]:
[('book', 'NN'),
 ('and', 'CC'),
 ('clever', 'JJ'),
 ('there', 'EX'),
 ('are', 'VBP'),
 ('more', 'JJR'),
 ('import', 'NN'),
 ('thing', 'NN'),
 ('friendship', 'NN'),
 ('and', 'CC'),
 ('braveri', 'NN'),
 ('and', 'CC'),
 ('oh', 'UH'),
 ('harri', 'NNS'),
 ('be', 'VB'),
 ('care', 'NN')]

What to choose - Stemming or Lemma or both?

This little snippet gives a taste of the larger challenges that NLP practitioners face when dealing with tokens.

For example, take the first token: the stemmed form "book" is tagged NN (noun) while the lemmatized form "Books" is tagged NNS (plural noun). Which seems better? To answer this question, we need to take a step back and answer questions like:

  • What is the problem statement?
  • What features are important to address the problem statement?
  • Is this feature a computational overhead?

Second, Harry: the stemmer first wrongly reduces it to "harri", so the POS tagger then fails to identify it as a proper noun, whereas the lemmatized token keeps "Harry" and is classified correctly. Similarly, tokens like "braveri" are not in the English lexicon at all and should arguably have been tagged FW (foreign word).

EVALUATING POS TAGGED TOKENS

In [22]:
# Token sequences tagging
st.tag_sents(sentences=no_punct_seq_tokens)
Out[22]:
[[('Books', 'NNS')],
 [('And', 'CC'), ('cleverness', 'NN')],
 [('There', 'EX'),
  ('are', 'VBP'),
  ('more', 'RBR'),
  ('important', 'JJ'),
  ('things', 'NNS'),
  ('friendship', 'NN'),
  ('and', 'CC'),
  ('bravery', 'NN'),
  ('and', 'CC'),
  ('oh', 'UH'),
  ('Harry', 'NNP'),
  ('be', 'VB'),
  ('careful', 'JJ')]]

Let's create a gold set - the expected results - to measure the performance of the POS tagger model. Here I am happy with all the tags, so the gold set simply mirrors the tagger's output.

In [23]:
gold = [[('Books', 'NNS')],
 [('And', 'CC'), ('cleverness', 'NN')],
 [('There', 'EX'),
  ('are', 'VBP'),
  ('more', 'RBR'),
  ('important', 'JJ'),
  ('things', 'NNS'),
  ('friendship', 'NN'),
  ('and', 'CC'),
  ('bravery', 'NN'),
  ('and', 'CC'),
  ('oh', 'UH'),
  ('Harry', 'NNP'),
  ('be', 'VB'),
  ('careful', 'JJ')]]
In [24]:
gold
Out[24]:
[[('Books', 'NNS')],
 [('And', 'CC'), ('cleverness', 'NN')],
 [('There', 'EX'),
  ('are', 'VBP'),
  ('more', 'RBR'),
  ('important', 'JJ'),
  ('things', 'NNS'),
  ('friendship', 'NN'),
  ('and', 'CC'),
  ('bravery', 'NN'),
  ('and', 'CC'),
  ('oh', 'UH'),
  ('Harry', 'NNP'),
  ('be', 'VB'),
  ('careful', 'JJ')]]
In [25]:
st.evaluate(gold)
Out[25]:
1.0
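evaluate re-tags the words from the gold sentences and reports token-level accuracy; since gold was copied from the tagger's own output, 1.0 is expected. The same number can be computed by hand, which makes the metric explicit (recent NLTK releases deprecate evaluate in favour of an equivalent accuracy method):

In [ ]:
# Token-level accuracy computed manually: re-tag the gold words and
# count matching (token, tag) pairs.
predicted = st.tag_sents([[token for token, _ in sent] for sent in gold])
pairs = [(g, p) for g_sent, p_sent in zip(gold, predicted)
                for g, p in zip(g_sent, p_sent)]
sum(g == p for g, p in pairs) / len(pairs)   # 1.0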

SOLVING THE PROBLEMS DISCUSSED


Some Observations:

  • Lemmatization worked fine, but the stemming results were sometimes incorrect
  • The POS tags worked well when the case of the words was preserved; otherwise the outcomes were incorrect
  • Lemmatization also takes a "pos" argument, which can be used to get better results

OBJECTIVE: To normalize correctly

Step 1: To access the pos tags

In [26]:
for each_seq in st.tag_sents(sentences=no_punct_seq_tokens):
    for tuples in each_seq:
        print(tuples[0], tuples[1])
Books NNS
And CC
cleverness NN
There EX
are VBP
more RBR
important JJ
things NNS
friendship NN
and CC
bravery NN
and CC
oh UH
Harry NNP
be VB
careful JJ

Step 2: To create a mapper from the Treebank POS tag codes to the WordNet pos arguments

In [27]:
from nltk.corpus.reader.wordnet import VERB, NOUN, ADJ, ADV
In [28]:
dict_pos_map = {
    # Look for NN in the POS tag because all noun tags begin with NN
    'NN': NOUN,
    # Look for VB in the POS tag because all verb tags begin with VB
    'VB': VERB,
    # Look for JJ in the POS tag because all adjective tags begin with JJ
    'JJ': ADJ,
    # Look for RB in the POS tag because all adverb tags begin with RB
    'RB': ADV
}
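A quick sanity check of the mapping: the two-character tag prefix is the lookup key used in Steps 3 and 4, and the WordNet constants are just the single-character pos codes the lemmatizer expects:

In [ ]:
# The first two characters of a Treebank tag select the WordNet POS.
dict_pos_map['NNS'[:2]]                            # NOUN, i.e. 'n'
dict_pos_map['VBP'[:2]]                            # VERB, i.e. 'v'
lm.lemmatize('are', pos=dict_pos_map['VBP'[:2]])   # 'be'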

Step 3: To get the lemmas accordingly (NO STEMMER)

In [31]:
normalized_sequence = []
for each_seq in st.tag_sents(sentences=no_punct_seq_tokens):
    normalized_tokens = []
    for tuples in each_seq:
        temp = tuples[0]
        if tuples[1] == "NNP" or tuples[1] == "NNPS":
            continue
        if tuples[1][:2] in dict_pos_map.keys():
            temp = lm.lemmatize(tuples[0].lower(), 
                                pos=dict_pos_map[tuples[1][:2]])
        normalized_tokens.append(temp)
    normalized_sequence.append(normalized_tokens)
normalized_sequence
Out[31]:
[['book'],
 ['And', 'cleverness'],
 ['There',
  'be',
  'more',
  'important',
  'thing',
  'friendship',
  'and',
  'bravery',
  'and',
  'oh',
  'be',
  'careful']]

Step 4: Adding stemmer

In [30]:
normalized_sequence = []
for each_seq in st.tag_sents(sentences=no_punct_seq_tokens):
    normalized_tokens = []
    for tuples in each_seq:
        temp = tuples[0]
        if tuples[1] == "NNP" or tuples[1] == "NNPS":
            continue
        if tuples[1][:2] in dict_pos_map.keys():
            temp = lm.lemmatize(tuples[0].lower(), 
                                pos=dict_pos_map[tuples[1][:2]])
        temp = stemmer.stem(temp)
        normalized_tokens.append(temp)
    normalized_sequence.append(normalized_tokens)
normalized_sequence
Out[30]:
[['book'],
 ['and', 'clever'],
 ['there',
  'be',
  'more',
  'import',
  'thing',
  'friendship',
  'and',
  'braveri',
  'and',
  'oh',
  'be',
  'care']]

Looks better. One caveat: the NNP/NNPS check above skips proper nouns entirely, so "Harry" is dropped from the output rather than being mangled into "harri"; to retain it verbatim, append the token before the continue, as the sketch below does.
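Putting it all together, here is one way to wrap the whole pipeline into a reusable function. This is a sketch that reuses the st, lm, stemmer and dict_pos_map objects defined above and keeps proper nouns verbatim instead of dropping them:

In [ ]:
import string
from nltk import sent_tokenize, word_tokenize

def normalize(text, stem=False):
    """Tokenize, POS-tag, lemmatize by POS, optionally stem; keep proper nouns."""
    sequences = [[tok for tok in word_tokenize(sent) if tok not in string.punctuation]
                 for sent in sent_tokenize(text)]
    normalized_sequence = []
    for tagged_seq in st.tag_sents(sequences):
        normalized_tokens = []
        for token, tag in tagged_seq:
            if tag in ("NNP", "NNPS"):
                # keep proper nouns exactly as they appear
                normalized_tokens.append(token)
                continue
            temp = token
            if tag[:2] in dict_pos_map:
                temp = lm.lemmatize(token.lower(), pos=dict_pos_map[tag[:2]])
            if stem:
                temp = stemmer.stem(temp)
            normalized_tokens.append(temp)
        normalized_sequence.append(normalized_tokens)
    return normalized_sequence

normalize(hermione_said, stem=True)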

Thanks for visiting!