Named Entity Recognition (NER) 18 Class
NER ONTO example
nlu.load('ner').predict('Angela Merkel from Germany and the American Donald Trump dont share many opinions')
| embeddings | ner_tag | entities |
|---|---|---|
| [[-0.563759982585907, 0.26958999037742615, 0.3… | PER | Angela Merkel |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | GPE | Germany |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | NORP | American |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | PER | Donald Trump |
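Since predict() returns a pandas DataFrame, the entity rows above can be filtered directly. A minimal sketch, assuming the ner_tag and entities column names shown in the table:

```python
import nlu

# predict() returns a pandas DataFrame with one row per detected entity
df = nlu.load('ner').predict(
    'Angela Merkel from Germany and the American Donald Trump dont share many opinions'
)

# keep only the person entities (column names follow the table above)
people = df[df['ner_tag'] == 'PER']['entities']
print(people)
```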
Named Entity Recognition (NER) 5 Class
NER CONLL example
nlu.load('ner.conll').predict('Angela Merkel from Germany and the American Donald Trump dont share many opinions')
| embeddings | ner_tag | entities |
|---|---|---|
| [[-0.563759982585907, 0.26958999037742615, 0.3… | PER | Angela Merkel |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | LOC | Germany |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | MISC | American |
| [[-0.563759982585907, 0.26958999037742615, 0.3… | PER | Donald Trump |
Part of speech (POS)
POS classifies each token with a grammatical tag (Penn Treebank tags such as NN, DT, and JJ).
Part of Speech example
nlu.load('pos').predict('Part of speech assigns each token in a sentence a grammatical label')
| token | pos |
|---|---|
| Part | NN |
| of | IN |
| speech | NN |
| assigns | NNS |
| each | DT |
| token | NN |
| in | IN |
| a | DT |
| sentence | NN |
| a | DT |
| grammatical | JJ |
| label | NN |
Emotion Classifier
Emotion Classifier example
Classifies text as one of 4 categories (joy, fear, surprise, sadness)
nlu.load('emotion').predict('I love NLU!')
| sentence_embeddings | emotion_confidence | sentence | emotion |
|---|---|---|---|
| [0.027570432052016258, -0.052647676318883896, …] | 0.976017 | I love NLU! | joy |
Sentiment Classifier
Sentiment Classifier Example
Classifies binary sentiment for every sentence, either positive or negative.
nlu.load('sentiment').predict("I hate this guy Sami")
| sentiment_confidence | sentence | sentiment | checked |
|---|---|---|---|
| 0.5778 | I hate this guy Sami | negative | [I, hate, this, guy, Sami] |
Question Classifier 50 class
50 Class Questions Classifier example
Classifies questions into 50 classes; trained on the TREC-50 dataset.
When setting predict(meta=True), NLU will also output the probabilities for the other 49 question classes (see the sketch after the table).
nlu.load('en.classify.trec50').predict('How expensive is the Watch?')
| sentence_embeddings | question_confidence | sentence | question |
|---|---|---|---|
| [0.051809534430503845, 0.03128402680158615, -0…] | 0.919436 | How expensive is the watch? | NUM_count |
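A minimal sketch of the predict(meta=True) call mentioned above; the exact names of the added probability columns depend on the model, so inspect df.columns first:

```python
import nlu

# meta=True additionally returns the confidences for the other 49 TREC classes
df = nlu.load('en.classify.trec50').predict('How expensive is the Watch?', meta=True)

# inspect which metadata columns were added before selecting any of them
print(df.columns)
```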
Fake News Classifier
Fake News Classifier example
nlu.load('en.classify.fakenews').predict('Unicorns have been sighted on Mars!')
| sentence_embeddings | fake_confidence | sentence | fake |
|---|---|---|---|
| [-0.01756167598068714, 0.015006818808615208, -…] | 1.000000 | Unicorns have been sighted on Mars! | FAKE |
Cyberbullying Classifier
Cyberbullying Classifier example
Detects sexism and racism in text
nlu.load('en.classify.cyberbullying').predict('Women belong in the kitchen.') # sorry we really don't mean it
| sentence_embeddings | cyberbullying_confidence | sentence | cyberbullying |
|---|---|---|---|
| [-0.054944973438978195, -0.022223370149731636,…] | 0.999998 | Women belong in the kitchen. | sexism |
Spam Classifier
Spam Classifier example
nlu.load('en.classify.spam').predict('Please sign up for this FREE membership it costs $$NO MONEY$$ just your mobile number!')
| sentence_embeddings | spam_confidence | sentence | spam |
|---|---|---|---|
| [0.008322705514729023, 0.009957313537597656, 0…] | 1.000000 | Please sign up for this FREE membership it cos… | spam |
Sarcasm Classifier
Sarcasm Classifier example
nlu.load('en.classify.sarcasm').predict('gotta love the teachers who give exams on the day after halloween')
| sentence_embeddings | sarcasm_confidence | sentence | sarcasm |
|---|---|---|---|
| [-0.03146284446120262, 0.04071342945098877, 0….] | 0.999985 | gotta love the teachers who give exams on the… | sarcasm |
IMDB Movie Sentiment Classifier
Movie Review Sentiment Classifier example
nlu.load('en.sentiment.imdb').predict('The Matrix was a pretty good movie')
| document | sentence_embeddings | sentiment_negative | sentiment_positive | sentiment |
|---|---|---|---|---|
| The Matrix was a pretty good movie | [[0.04629608988761902, -0.020867452025413513, …] | [2.7235753918830596e-07] | [0.9999997615814209] | [positive] |
Twitter Sentiment Classifier
Twitter Sentiment Classifier example
nlu.load('en.sentiment.twitter').predict('@elonmusk Tesla stock price is too high imo')
| document | sentence_embeddings | sentiment_negative | sentiment_positive | sentiment |
|---|---|---|---|---|
| @elonmusk Tesla stock price is too high imo | [[0.08604438602924347, 0.04703635722398758, -0…] | [1.0] | [1.692714735043349e-36] | [negative] |
Language Classifier
Languages Classifier example
Classifies the following 20 languages:
Bulgarian, Czech, German, Greek, English, Spanish, Finnish, French, Croatian, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Turkish, and Ukrainian
nlu.load('lang').predict(['NLU is an open-source text processing library for advanced natural language processing for the Python.','NLU est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python.'])
| language_confidence | document | language |
|---|---|---|
| 0.985407 | NLU is an open-source text processing library … | en |
| 0.999822 | NLU est une bibliothèque de traitement de text… | fr |
E2E Classifier
E2E Classifier example
A multi-class classifier trained on the E2E dataset for natural language generation.
nlu.load('e2e').predict('E2E is a dataset for training generative models')
| sentence_embeddings | e2e | e2e_confidence | sentence |
|---|---|---|---|
| [0.021445205435156822, -0.039284929633140564, …] | customer rating[high] | 0.703248 | E2E is a dataset for training generative models |
| None | name[The Waterman] | 0.703248 | None |
| None | eatType[restaurant] | 0.703248 | None |
| None | priceRange[£20-25] | 0.703248 | None |
| None | familyFriendly[no] | 0.703248 | None |
| None | familyFriendly[yes] | 0.703248 | None |
Toxic Classifier
Toxic Text Classifier example
nlu.load('en.classify.toxic').predict('You are to stupid')
| toxic_confidence | toxic | sentence_embeddings | document |
|---|---|---|---|
| 0.978273 | [toxic, insult] | [[-0.03398505970835686, 0.0007853527786210179,…] | You are to stupid |
Word Embeddings Bert
BERT Word Embeddings example
nlu.load('bert').predict('NLU offers the latest embeddings in one line ')
| token | bert_embeddings |
|---|---|
| NLU | [0.3253086805343628, -0.574441134929657, -0.08…] |
| offers | [-0.6660361886024475, -0.1494743824005127, -0…] |
| the | [-0.6587662696838379, 0.3323703110218048, 0.16…] |
| latest | [0.7552685737609863, 0.17207926511764526, 1.35…] |
| embeddings | [-0.09838500618934631, -1.1448147296905518, -1…] |
| in | [-0.4635896384716034, 0.38369956612586975, 0.0…] |
| one | [0.26821616291999817, 0.7025910019874573, 0.15…] |
| line | [-0.31930840015411377, -0.48271292448043823, 0…] |
Word Embeddings Biobert
BIOBERT Word Embeddings example
BERT model pretrained on a biomedical dataset
nlu.load('biobert').predict('Biobert was pretrained on a medical dataset')
| token | biobert_embeddings |
|---|---|
| NLU | [0.3253086805343628, -0.574441134929657, -0.08…] |
| offers | [-0.6660361886024475, -0.1494743824005127, -0…] |
| the | [-0.6587662696838379, 0.3323703110218048, 0.16…] |
| latest | [0.7552685737609863, 0.17207926511764526, 1.35…] |
| embeddings | [-0.09838500618934631, -1.1448147296905518, -1…] |
| in | [-0.4635896384716034, 0.38369956612586975, 0.0…] |
| one | [0.26821616291999817, 0.7025910019874573, 0.15…] |
| line | [-0.31930840015411377, -0.48271292448043823, 0…] |
Word Embeddings Covidbert
COVIDBERT Word Embeddings
BERT model pretrained on a COVID-19 dataset
nlu.load('covidbert').predict('He was suprised by the diversity of NLU')
| token | covid_embeddings |
|---|---|
| He | [-1.0551927089691162, -1.534174919128418, 1.29…] |
| was | [-0.14796507358551025, -1.3928604125976562, 0….] |
| suprised | [1.0647121667861938, -0.3664901852607727, 0.54…] |
| by | [-0.15271103382110596, -0.6812090277671814, -0…] |
| the | [-0.45744237303733826, -1.4266574382781982, -0…] |
| diversity | [-0.05339818447828293, -0.5118572115898132, 0….] |
| of | [-0.2971905767917633, -1.0936176776885986, -0….] |
| NLU | [-0.9573594331741333, -0.18001675605773926, -1…] |
Word Embeddings Albert
ALBERT Word Embeddings example
nlu.load('albert').predict('Albert uses a collection of many berts to generate embeddings')
| token | albert_embeddings |
|---|---|
| Albert | [-0.08257609605789185, -0.8017427325248718, 1…] |
| uses | [0.8256351947784424, -1.5144840478897095, 0.90…] |
| a | [-0.22089454531669617, -0.24295514822006226, 3…] |
| collection | [-0.2136894017457962, -0.8225528597831726, -0…] |
| of | [1.7623294591903687, -1.113651156425476, 0.800…] |
| many | [0.6415284872055054, -0.04533941298723221, 1.9…] |
| berts | [-0.5591965317726135, -1.1773797273635864, -0…] |
| to | [1.0956681966781616, -1.4180747270584106, -0.2…] |
| generate | [-0.6759272813796997, -1.3546931743621826, 1.6…] |
| embeddings | [-0.0035803020000457764, -0.35928264260292053,…] |
Electra Embeddings
ELECTRA Word Embeddings example
nlu.load('electra').predict('He was suprised by the diversity of NLU')
| token | electra_embeddings |
|---|---|
| He | [0.29674115777015686, -0.21371933817863464, -0…] |
| was | [-0.4278327524662018, -0.5352768898010254, -0….] |
| suprised | [-0.3090559244155884, 0.8737565279006958, -1.0…] |
| by | [-0.07821277529001236, 0.13081523776054382, 0….] |
| the | [0.5462881922721863, 0.0683358758687973, -0.41…] |
| diversity | [0.1381239891052246, 0.2956242859363556, 0.250…] |
| of | [-0.5667567253112793, -0.3955455720424652, -0….] |
| NLU | [0.5597224831581116, -0.703249454498291, -1.08…] |
Word Embeddings Elmo
ELMO Word Embeddings example
nlu.load('elmo').predict('Elmo was trained on Left to right masked to learn its embeddings')
| token | elmo_embeddings |
|---|---|
| Elmo | [0.6083735227584839, 0.20089012384414673, 0.42…] |
| was | [0.2980785369873047, -0.07382500916719437, -0…] |
| trained | [-0.39923471212387085, 0.17155063152313232, 0…] |
| on | [0.04337821900844574, 0.1392083466053009, -0.4…] |
| Left | [0.4468783736228943, -0.623046875, 0.771505534…] |
| to | [-0.18209676444530487, 0.03812692314386368, 0…] |
| right | [0.23305709660053253, -0.6459438800811768, 0.5…] |
| masked | [-0.7243442535400391, 0.10247116535902023, 0.1…] |
| to | [-0.18209676444530487, 0.03812692314386368, 0…] |
| learn | [1.2942464351654053, 0.7376189231872559, -0.58…] |
| its | [0.055951207876205444, 0.19218483567237854, -0…] |
| embeddings | [-1.31377112865448, 0.7727609872817993, 0.6748…] |
Word Embeddings Xlnet
XLNET Word Embeddings example
nlu.load('xlnet').predict('XLNET computes contextualized word representations using combination of Autoregressive Language Model and Permutation Language Model')
| token | xlnet_embeddings |
|---|---|
| XLNET | [-0.02719488926231861, -1.7693557739257812, -0…] |
| computes | [-1.8262947797775269, 0.8455266356468201, 0.57…] |
| contextualized | [2.8446314334869385, -0.3564329445362091, -2.1…] |
| word | [-0.6143839359283447, -1.7368144989013672, -0…] |
| representations | [-0.30445945262908936, -1.2129613161087036, 0…] |
| using | [0.07423821836709976, -0.02561005763709545, -0…] |
| combination | [-0.5387097597122192, -1.1827564239501953, 0.5…] |
| of | [-1.403516411781311, 0.3108177185058594, -0.32…] |
| Autoregressive | [-1.0869172811508179, 0.7135171890258789, -0.2…] |
| Language | [-0.33215752243995667, -1.4108021259307861, -0…] |
| Model | [-1.6097160577774048, -0.2548254430294037, 0.0…] |
| and | [0.7884324789047241, -1.507911205291748, 0.677…] |
| Permutation | [0.6049966812133789, -0.157279372215271, -0.06…] |
| Language | [-0.33215752243995667, -1.4108021259307861, -0…] |
| Model | [-1.6097160577774048, -0.2548254430294037, 0.0…] |
Word Embeddings Glove
GLOVE Word Embeddings example
nlu.load('glove').predict('Glove embeddings are generated by aggregating global word-word co-occurrence matrix from a corpus')
| token | glove_embeddings |
|---|---|
| Glove | [0.3677999973297119, 0.37073999643325806, 0.32…] |
| embeddings | [0.732479989528656, 0.3734700083732605, 0.0188…] |
| are | [-0.5153300166130066, 0.8318600058555603, 0.22…] |
| generated | [-0.35510000586509705, 0.6115900278091431, 0.4…] |
| by | [-0.20874999463558197, -0.11739999800920486, 0…] |
| aggregating | [-0.5133699774742126, 0.04489300027489662, 0.1…] |
| global | [0.24281999468803406, 0.6170300245285034, 0.66…] |
| word-word | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …] |
| co-occurrence | [0.16384999454021454, -0.3178800046443939, 0.1…] |
| matrix | [-0.2663800120353699, 0.4449099898338318, 0.32…] |
| from | [0.30730998516082764, 0.24737000465393066, 0.6…] |
| a | [-0.2708599865436554, 0.04400600120425224, -0…] |
| corpus | [0.39937999844551086, 0.15894000232219696, -0…] |
Multiple Token Embeddings at once
Compare 6 Embeddings at once with NLU and T-SNE example (see the T-SNE sketch after the table below)
# This takes around 10 GB of RAM, watch out!
nlu.load('bert albert electra elmo xlnet use glove').predict('Get all of them at once! Watch your RAM tough!')
| xlnet_embeddings | use_embeddings | elmo_embeddings | electra_embeddings | glove_embeddings | sentence | albert_embeddings | biobert_embeddings | bert_embeddings |
|---|---|---|---|---|---|---|---|---|
| [[-0.003953204490244389, -1.5821468830108643, …] | [-0.019299551844596863, -0.04762779921293259, …] | [[0.04002974182367325, -0.43536433577537537, -…] | [[0.19559216499328613, -0.46693214774131775, -…] | [[0.1443299949169159, 0.4395099878311157, 0.58…] | Get all of them at once, watch your RAM tough! | [[-0.4743960201740265, -0.581386387348175, 0.7…] | [[-0.00012563914060592651, -1.372296929359436,…] | [[-0.7687976360321045, 0.8489367961883545, -0….] |
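As referenced above, here is a minimal T-SNE sketch; it assumes each *_embeddings column holds the per-token vectors shown in the table and projects the GloVe vectors to 2D with scikit-learn:

```python
import numpy as np
import nlu
from sklearn.manifold import TSNE

df = nlu.load('bert albert electra elmo xlnet use glove').predict(
    'Get all of them at once! Watch your RAM tough!'
)

# stack the per-token GloVe vectors into one (num_tokens, dim) matrix
vectors = np.vstack([np.array(v) for v in df['glove_embeddings']])

# project to 2D; perplexity must stay below the number of tokens
coords = TSNE(n_components=2, perplexity=3).fit_transform(vectors)
print(coords)
```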
Bert Sentence Embeddings
BERT Sentence Embeddings example
nlu.load('embed_sentence.bert').predict('He was suprised by the diversity of NLU')

| sentence | bert_sentence_embeddings |
|---|---|
| He was suprised by the diversity of NLU | [-1.0726687908172607, 0.4481312036514282, -0.0…] |
Electra Sentence Embeddings
ELECTRA Sentence Embeddings example
nlu.load('embed_sentence.electra').predict('He was suprised by the diversity of NLU')
| sentence | electra_sentence_embeddings |
|---|---|
| He was suprised by the diversity of NLU | [0.005376118700951338, 0.18036000430583954, -0…] |
Sentence Embeddings Use
USE Sentence Embeddings example
nlu.load('use').predict('USE is designed to encode whole sentences and documents into vectors that can be used for text classification, semantic similarity, clustering or other NLP tasks')
| sentence | use_embeddings |
|---|---|
| USE is designed to encode whole sentences and … | [0.03302069380879402, -0.004255455918610096, -…] |
Spell Checking
Spell checking example
nlu.load('spell').predict('I liek pentut buttr ant jely')
| token | checked |
|---|---|
| I | I |
| liek | like |
| pentut | peantut |
| buttr | buttr |
| ant | and |
| jely | jelli |
Dependency Parsing Unlabeled
Untyped Dependency Parsing example
nlu.load('dep.untyped').predict('Untyped Dependencies represent a grammatical tree structure')
| token | pos | dependency |
|---|---|---|
| Untyped | NNP | ROOT |
| Dependencies | NNP | represent |
| represent | VBD | Untyped |
| a | DT | structure |
| grammatical | JJ | structure |
| tree | NN | structure |
| structure | NN | represent |
Dependency Parsing Labeled
Typed Dependency Parsing example
nlu.load('dep').predict('Typed Dependencies represent a grammatical tree structure where every edge has a label')
| token | pos | dependency | labled_dependency |
|---|---|---|---|
| Typed | NNP | ROOT | root |
| Dependencies | NNP | represent | nsubj |
| represent | VBD | Typed | parataxis |
| a | DT | structure | nsubj |
| grammatical | JJ | structure | amod |
| tree | NN | structure | flat |
| structure | NN | represent | nsubj |
| where | WRB | structure | mark |
| every | DT | edge | nsubj |
| edge | NN | where | nsubj |
| has | VBZ | ROOT | root |
| a | DT | label | nsubj |
| label | NN | has | nsubj |
Tokenization
Tokenization example
nlu.load('tokenize').predict('Each word and symbol will generate a token.')
| token |
|---|
| Each |
| word |
| and |
| symbol |
| will |
| generate |
| a |
| token |
| . |
Stemmer
Stemmer example
nlu.load('stemm').predict('NLU can get you the stem of a word')
| token | stem |
|---|---|
| NLU | nlu |
| can | can |
| get | get |
| you | you |
| the | the |
| stem | stem |
| of | of |
| a | a |
| word | word |
Stopwords Removal
Stopwords Removal example
nlu.load('stopwords').predict('I want you to remove stopwords from this sentence please')
| token | cleanTokens |
|---|---|
| I | remove |
| want | stopwords |
| you | sentence |
| to | None |
| remove | None |
| stopwords | None |
| from | None |
| this | None |
| sentence | None |
| please | None |
Lemmatization
Lemmatization example
nlu.load('lemma').predict('Lemmatizing generates a less noisy version of the inputted tokens')
| token | lemma |
|---|---|
| Lemmatizing | Lemmatizing |
| generates | generate |
| a | a |
| less | less |
| noisy | noisy |
| version | version |
| of | of |
| the | the |
| inputted | input |
| tokens | token |
Normalizers
Normalizing example
nlu.load('norm').predict('@CKL_IT says that #normalizers are pretty useful to clean #structured_strings in #NLU like tweets')
| normalized | token |
|---|---|
| CKLIT | @CKL_IT |
| says | says |
| that | that |
| normalizers | #normalizers |
| are | are |
| pretty | pretty |
| useful | useful |
| to | to |
| clean | clean |
| structuredstrings | #structured_strings |
| in | in |
| NLU | #NLU |
| like | like |
| tweets | tweets |
NGrams
NGrams example
nlu.load('ngram').predict('To be or not to be')
| document | ngrams | pos |
|---|---|---|
| To be or not to be | [To, be, or, not, to, be, To be, be or, or not…] | [TO, VB, CC, RB, TO, VB] |
Date Matching
Date Matching example
nlu.load('match.datetime').predict('In the years 2000/01/01 to 2010/01/01 a lot of things happened')
| document | date |
|---|---|
| In the years 2000/01/01 to 2010/01/01 a lot of things happened | [2000/01/01, 2010/01/01] |
Entity Chunking
See the Part of Speech (POS) section for all possible POS labels.
Splits text into rows based on matched grammatical entities.
Entity Chunking Example
```python
# First we load the pipeline
pipe = nlu.load('match.chunks')
# Now we print the info to see at which index which component is and what parameters we can configure on them
pipe.generate_class_metadata_table()
# Let's set our chunker to only match NN and JJ sequences
pipe['default_chunker'].setRegexParsers(['<NN>+', '<JJ>+'])
# Now we can predict with the configured pipeline
pipe.predict("Jim and Joe went to the big blue market next to the town hall")
```
The output of the metadata call above:
```
The following parameters are configurable for this NLU pipeline (you can copy-paste the examples):
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('disabled') | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : disabled
>>> component_list['sentence_detector'] has settable params:
component_list['sentence_detector'].setCustomBounds([]) | Info: characters used to explicitly mark sentence bounds | Currently set to : []
component_list['sentence_detector'].setDetectLists(True) | Info: whether detect lists during sentence detection | Currently set to : True
component_list['sentence_detector'].setExplodeSentences(False) | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
component_list['sentence_detector'].setMaxLength(99999) | Info: Set the maximum allowed length for each sentence | Currently set to : 99999
component_list['sentence_detector'].setMinLength(0) | Info: Set the minimum allowed length for each sentence. | Currently set to : 0
component_list['sentence_detector'].setUseAbbreviations(True) | Info: whether to apply abbreviations at sentence detection | Currently set to : True
component_list['sentence_detector'].setUseCustomBoundsOnly(False) | Info: Only utilize custom bounds in sentence detection | Currently set to : False
>>> component_list['regex_matcher'] has settable params:
component_list['regex_matcher'].setCaseSensitiveExceptions(True) | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
component_list['regex_matcher'].setTargetPattern('\S+') | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
component_list['regex_matcher'].setMaxLength(99999) | Info: Set the maximum allowed length for each token | Currently set to : 99999
component_list['regex_matcher'].setMinLength(0) | Info: Set the minimum allowed length for each token | Currently set to : 0
>>> component_list['sentiment_dl'] has settable params:
>>> component_list['default_chunker'] has settable params:
component_list['default_chunker'].setRegexParsers(['<DT>?<JJ>*<NN>+']) | Info: an array of grammar based chunk parsers | Currently set to : ['<DT>?<JJ>*<NN>+']
```
| chunk | pos |
|---|---|
| market | [NNP, CC, NNP, VBD, TO, DT, JJ, JJ, NN, JJ, TO…] |
| town hall | [NNP, CC, NNP, VBD, TO, DT, JJ, JJ, NN, JJ, TO…] |
| big blue | [NNP, CC, NNP, VBD, TO, DT, JJ, JJ, NN, JJ, TO…] |
| next | [NNP, CC, NNP, VBD, TO, DT, JJ, JJ, NN, JJ, TO…] |
Sentence Detection
Sentence Detection example
nlu.load('sentence_detector').predict('NLU can detect things. Like beginning and endings of sentences. It can also do much more!', output_level ='sentence')
| sentence | word_embeddings | pos | ner |
|---|---|---|---|
| NLU can detect things. | [[0.4970400035381317, -0.013454999774694443, 0…] | [NNP, MD, VB, NNS, ., IN, VBG, CC, NNS, IN, NN…] | [O, O, O, O, O, B-sent, O, O, O, O, O, O, B-se…] |
| Like beginning and endings of sentences. | [[0.4970400035381317, -0.013454999774694443, 0…] | [NNP, MD, VB, NNS, ., IN, VBG, CC, NNS, IN, NN…] | [O, O, O, O, O, B-sent, O, O, O, O, O, O, B-se…] |
| It can also do much more! | [[0.4970400035381317, -0.013454999774694443, 0…] | [NNP, MD, VB, NNS, ., IN, VBG, CC, NNS, IN, NN…] | [O, O, O, O, O, B-sent, O, O, O, O, O, O, B-se…] |
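The output_level argument used here also accepts other granularities; a minimal sketch, assuming the standard NLU levels 'token', 'sentence', and 'document':

```python
import nlu

pipe = nlu.load('sentence_detector')
text = 'NLU can detect things. Like beginning and endings of sentences. It can also do much more!'

# one row per sentence (as in the table above) vs. one row per input document
sentences = pipe.predict(text, output_level='sentence')
documents = pipe.predict(text, output_level='document')
print(len(sentences), len(documents))
```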