NLU release notes


NLU 3.0.1 Release Notes

We are very excited to announce that NLU 3.0.1 has been released! This is one of the most visually appealing releases yet, integrating the Spark-NLP-Display library with visualizations for dependency trees, entity resolution, entity assertion, relationships between entities, and named entity recognition. In addition, the schema by which NLU names output columns has been reworked, and all 140+ tutorial notebooks have been updated to reflect the latest changes in NLU 3.0.0+. Finally, new multilingual models for Afrikaans, Welsh, Maltese, Tamil, and Vietnamese are now available.

New Features and Enhancements

  • 1-line visualizations for NER, Dependency, Resolution, Assertion and Relation via the Spark-NLP-Display integration
  • Improved column-naming schema
  • Over 140 NLU tutorial notebooks updated and improved to reflect the latest changes in NLU 3.0.0+
  • New multilingual models for Afrikaans, Welsh, Maltese, Tamil, and Vietnamese

Improved Column Name Generation

  • NLU now categorizes each internal component with the boolean labels name_deductable and always_name_deductable.
  • Before generating column names, NLU checks whether each component is unique in the pipeline. If a component is not unique, i.e. there are multiple components of the same type such as multiple NER models, NLU deduces a base name for the final output columns from the NLU reference each NER model points to.
  • If, on the other hand, there is only one NER model in the pipeline, only the default ner column prefix is generated.
  • Some components, such as embeddings and classifiers, are defined as always_name_deductable; for those, NLU always tries to infer a meaningful base name for the output columns.
  • Columns output by newly trained components are now prefixed with trained_<type>, for the types pos, ner, classifier, sentiment and multi_classifier.
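The deduction rules above can be sketched in plain Python. This is a hypothetical illustration of the described behaviour, not NLU's actual implementation; the component dicts and the resulting names are assumptions:

```python
# Hypothetical sketch of the column-name deduction described above.
# Each component carries a type and the NLU reference it was loaded from.
def deduce_column_names(components):
    """Return an output-column base name for every component in a pipeline."""
    names = []
    for comp in components:
        same_type = [c for c in components if c["type"] == comp["type"]]
        if len(same_type) > 1:
            # Multiple components of the same type: deduce a base name
            # from the NLU reference the component points to.
            base = comp["nlu_ref"].split(".")[-1]
            names.append(f"{comp['type']}_{base}")
        else:
            # Unique in the pipeline: keep the plain default name.
            names.append(comp["type"])
    return names

pipe = [
    {"type": "ner", "nlu_ref": "en.med_ner.clinical"},
    {"type": "ner", "nlu_ref": "en.ner.onto"},
    {"type": "pos", "nlu_ref": "en.pos"},
]
print(deduce_column_names(pipe))  # ['ner_clinical', 'ner_onto', 'pos']
```

With two NER models, each output column gets a base name derived from its NLU reference; the lone POS model keeps the plain default.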

Enhanced offline mode

  • You can still load a model from a path as usual with nlu.load(path=model_path); output columns will be suffixed with from_disk.
  • You can now optionally pass the request parameter when loading a model from HDD; it is used to deduce a more meaningful column-name suffix than from_disk, e.g. nlu.load(request='en.embed_sentence.biobert.pubmed_pmc_base_cased', path=model_path)
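The suffix choice described above can be pictured with a small helper. This is an illustrative sketch of the behaviour, not NLU's actual code; the function name is an assumption:

```python
# Hypothetical sketch: pick a column suffix when loading a model from disk.
# If a `request` reference is given, its last segment yields a meaningful
# suffix; otherwise the generic 'from_disk' suffix is used.
def column_suffix(request=None):
    if request:
        return request.split(".")[-1]
    return "from_disk"

print(column_suffix())  # from_disk
print(column_suffix("en.embed_sentence.biobert.pubmed_pmc_base_cased"))
# pubmed_pmc_base_cased
```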

NLU visualization

The latest NLU release integrates the beautiful Spark-NLP-Display package for visualizations. You do not need to worry about installing it: when you try to visualize something, NLU checks whether Spark-NLP-Display is installed and, if it is missing, installs it dynamically into your Python environment.
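The install-on-demand pattern can be sketched in plain Python (a simplified illustration assuming pip is available; this is not NLU's actual bootstrap code):

```python
# Minimal sketch of "install only if missing", as described above.
import importlib.util
import subprocess
import sys

def ensure_installed(package):
    """Install `package` into the current interpreter only if it is missing."""
    if importlib.util.find_spec(package) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        return "installed"
    return "already present"

# 'json' ships with Python, so nothing is installed here.
print(ensure_installed("json"))  # already present
```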

See the visualization tutorial notebook and visualization docs for more info.

Cheat Sheet visualization

NER visualization

Applicable to any of the 100+ NER models! See here for an overview

nlu.load('ner').viz("Donald Trump from America and Angela Merkel from Germany don't share many opinions.")

NER visualization

Dependency tree visualization

Visualizes the structure of the labeled dependency tree and part of speech tags

nlu.load('dep.typed').viz("Billy went to the mall")

Dependency Tree visualization

#Bigger Example
nlu.load('dep.typed').viz("Donald Trump from America and Angela Merkel from Germany don't share many opinions but they both love John Snow Labs software")

Dependency Tree visualization

Assertion status visualization

Visualizes asserted statuses and entities.
Applicable to any of the 10+ Assertion models! See here for an overview

nlu.load('med_ner.clinical assert').viz("The MRI scan showed no signs of cancer in the left lung")

Assert visualization

#bigger example
data ='This is the case of a very pleasant 46-year-old Caucasian female, seen in clinic on 12/11/07 during which time MRI of the left shoulder showed no evidence of rotator cuff tear. She did have a previous MRI of the cervical spine that did show an osteophyte on the left C6-C7 level. Based on this, negative MRI of the shoulder, the patient was recommended to have anterior cervical discectomy with anterior interbody fusion at C6-C7 level. Operation, expected outcome, risks, and benefits were discussed with her. Risks include, but not exclusive of bleeding and infection, bleeding could be soft tissue bleeding, which may compromise airway and may result in return to the operating room emergently for evacuation of said hematoma. There is also the possibility of bleeding into the epidural space, which can compress the spinal cord and result in weakness and numbness of all four extremities as well as impairment of bowel and bladder function. However, the patient may develop deeper-seated infection, which may require return to the operating room. Should the infection be in the area of the spinal instrumentation, this will cause a dilemma since there might be a need to remove the spinal instrumentation and/or allograft. There is also the possibility of potential injury to the esophageus, the trachea, and the carotid artery. There is also the risks of stroke on the right cerebral circulation should an undiagnosed plaque be propelled from the right carotid. She understood all of these risks and agreed to have the procedure performed.'
nlu.load('med_ner.clinical assert').viz(data)

Assert visualization

Relationship between entities visualization

Visualizes the extracted relationships between entities.
Applicable to any of the 20+ Relation Extractor models! See here for an overview

nlu.load('med_ner.jsl.wip.clinical relation.temporal_events').viz('The patient developed cancer after a mercury poisoning in 1999 ')

Entity Relation visualization

# bigger example
data = 'This is the case of a very pleasant 46-year-old Caucasian female, seen in clinic on 12/11/07 during which time MRI of the left shoulder showed no evidence of rotator cuff tear. She did have a previous MRI of the cervical spine that did show an osteophyte on the left C6-C7 level. Based on this, negative MRI of the shoulder, the patient was recommended to have anterior cervical discectomy with anterior interbody fusion at C6-C7 level. Operation, expected outcome, risks, and benefits were discussed with her. Risks include, but not exclusive of bleeding and infection, bleeding could be soft tissue bleeding, which may compromise airway and may result in return to the operating room emergently for evacuation of said hematoma. There is also the possibility of bleeding into the epidural space, which can compress the spinal cord and result in weakness and numbness of all four extremities as well as impairment of bowel and bladder function. However, the patient may develop deeper-seated infection, which may require return to the operating room. Should the infection be in the area of the spinal instrumentation, this will cause a dilemma since there might be a need to remove the spinal instrumentation and/or allograft. There is also the possibility of potential injury to the esophageus, the trachea, and the carotid artery. There is also the risks of stroke on the right cerebral circulation should an undiagnosed plaque be propelled from the right carotid. She understood all of these risks and agreed to have the procedure performed'
pipe = nlu.load('med_ner.jsl.wip.clinical relation.clinical').viz(data)

Entity Relation visualization

Entity Resolution visualization for chunks

Visualizes resolutions of entities. Applicable to any of the 100+ Resolver models! See here for an overview

nlu.load('med_ner.jsl.wip.clinical resolve_chunk.rxnorm.in').viz("He took Prevacid 30 mg  daily")

Chunk Resolution visualization

# bigger example
data = "This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."
nlu.load('med_ner.jsl.wip.clinical resolve_chunk.rxnorm.in').viz(data)

Chunk Resolution visualization

Entity Resolution visualization for sentences

Visualizes resolutions of entities in sentences. Applicable to any of the 100+ Resolver models! See here for an overview

nlu.load('med_ner.jsl.wip.clinical resolve.icd10cm').viz('She was diagnosed with a respiratory congestion')

Sentence Resolution visualization

# bigger example
data = 'The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion'
nlu.load('med_ner.jsl.wip.clinical resolve.icd10cm').viz(data)

Sentence Resolution visualization

Configure visualizations

Define custom colors for labels

Some entity and relation labels will be highlighted with a pre-defined color, which you can find here.
For labels that have no color defined, a random color will be generated.
You can define colors for labels manually via the viz_colors parameter, by passing a dictionary that maps labels to hex color codes.

data = 'Dr. John Snow suggested that Fritz takes 5mg penicillin for his cough'
# Define custom colors for labels
viz_colors = {'STRENGTH':'#800080', 'DRUG_BRANDNAME':'#77b5fe', 'GENDER':'#77ffee'}
nlu.load('med_ner.jsl.wip.clinical').viz(data, viz_colors=viz_colors)

define colors labels
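The fallback for labels with no defined color can be sketched as follows (purely illustrative; the random-color scheme shown here is an assumption, not Spark-NLP-Display's actual code):

```python
# Hypothetical sketch: labels present in viz_colors keep their color,
# all other labels get a (reproducible, if seeded) random hex color.
import random

def color_for(label, viz_colors, seed=None):
    if label in viz_colors:
        return viz_colors[label]
    rng = random.Random(seed)
    return "#{:06x}".format(rng.randrange(0x1000000))

viz_colors = {"STRENGTH": "#800080"}
print(color_for("STRENGTH", viz_colors))       # #800080
print(color_for("SYMPTOM", viz_colors, seed=42))  # a generated hex color
```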

Filter entities that get highlighted

By default, every entity class is visualized.
The labels_to_viz parameter can be used to define a set of labels to highlight.
Applicable for ner, resolution and assert.

data = 'Dr. John Snow suggested that Fritz takes 5mg penicillin for his cough'
# Filter which NER labels to visualize
labels_to_viz = ['SYMPTOM']
nlu.load('med_ner.jsl.wip.clinical').viz(data, labels_to_viz=labels_to_viz)

filter labels

New models

New multilingual models for Afrikaans, Welsh, Maltese, Tamil, and Vietnamese

nlu.load() reference Spark NLP reference
vi.lemma lemma
mt.lemma lemma
ta.lemma lemma
af.lemma lemma
af.pos pos_afribooms
cy.lemma lemma

Reworked and updated NLU tutorial notebooks

All of the 140+ NLU tutorial Notebooks have been updated and reworked to reflect the latest changes in NLU 3.0.0+

Bugfixes

  • Fixed a bug that caused the output level of resolution algorithms to be inferred incorrectly
  • Fixed a bug that caused stranger columns to be dropped
  • Fixed a bug that caused end positions to be missing when .predict(position=True) was specified
  • Fixed a bug that caused pd.Series to be converted incorrectly internally
  • Fixed a bug that caused output-level transformations to crash
  • Fixed a bug that caused verbose mode not to turn off properly after being turned on
  • Fixed a bug that caused some models to crash when loaded from HDD

  • 140+ updated tutorials
  • Updated visualization docs
  • Models Hub with new models
  • Spark NLP publications
  • NLU in Action
  • NLU documentation
  • Discussions Engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark==3.0.1

200+ State of the Art Medical Models for NER, Entity Resolution, Relation Extraction, Assertion, Spark 3 and Python 3.8 support in NLU 3.0 Release and much more

We are incredibly excited to announce the release of NLU 3.0.0, which makes most of John Snow Labs' medical healthcare models available in just 1 line of code in NLU. These models are the most accurate in their domains and highly scalable in Spark clusters.
In addition, Spark 3.0.x and Spark 3.1.x are now supported, together with Python 3.8.

This is enabled by the amazing Spark NLP 3.0.1 and Spark NLP for Healthcare 3.0.1 releases.

New Features

  • Over 200 new models for the healthcare domain
  • 6 new classes of models: Assertion, Sentence Resolvers, Chunk Resolvers, Relation Extractors, Medical NER, and De-Identification models
  • Spark 3.0.X and 3.1.X support
  • Python 3.8 Support
  • New Output level relation
  • 1 Line to install NLU just run !wget https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh -O - | bash
  • Various new EMR and Databricks versions supported
  • GPU mode, with more than 600% speedup when enabled
  • Authorized mode for licensed features

New Documentation

New Notebooks

AssertionDLModels

Language nlu.load() reference Spark NLP Model reference
English assert assertion_dl
English assert.biobert assertion_dl_biobert
English assert.healthcare assertion_dl_healthcare
English assert.large assertion_dl_large

New Word Embeddings

Language nlu.load() reference Spark NLP Model reference
English embed.glove.clinical embeddings_clinical
English embed.glove.biovec embeddings_biovec
English embed.glove.healthcare embeddings_healthcare
English embed.glove.healthcare_100d embeddings_healthcare_100d
English en.embed.glove.icdoem embeddings_icdoem
English en.embed.glove.icdoem_2ng embeddings_icdoem_2ng

Sentence Entity resolvers

Language nlu.load() reference Spark NLP Model reference
English embed_sentence.biobert.mli sbiobert_base_cased_mli
English resolve sbiobertresolve_cpt
English resolve.cpt sbiobertresolve_cpt
English resolve.cpt.augmented sbiobertresolve_cpt_augmented
English resolve.cpt.procedures_augmented sbiobertresolve_cpt_procedures_augmented
English resolve.hcc.augmented sbiobertresolve_hcc_augmented
English resolve.icd10cm sbiobertresolve_icd10cm
English resolve.icd10cm.augmented sbiobertresolve_icd10cm_augmented
English resolve.icd10cm.augmented_billable sbiobertresolve_icd10cm_augmented_billable_hcc
English resolve.icd10pcs sbiobertresolve_icd10pcs
English resolve.icdo sbiobertresolve_icdo
English resolve.rxcui sbiobertresolve_rxcui
English resolve.rxnorm sbiobertresolve_rxnorm
English resolve.snomed sbiobertresolve_snomed_auxConcepts
English resolve.snomed.aux_concepts sbiobertresolve_snomed_auxConcepts
English resolve.snomed.aux_concepts_int sbiobertresolve_snomed_auxConcepts_int
English resolve.snomed.findings sbiobertresolve_snomed_findings
English resolve.snomed.findings_int sbiobertresolve_snomed_findings_int

RelationExtractionModel

Language nlu.load() reference Spark NLP Model reference
English relation.posology posology_re
English relation redl_bodypart_direction_biobert
English relation.bodypart.direction redl_bodypart_direction_biobert
English relation.bodypart.problem redl_bodypart_problem_biobert
English relation.bodypart.procedure redl_bodypart_procedure_test_biobert
English relation.chemprot redl_chemprot_biobert
English relation.clinical redl_clinical_biobert
English relation.date redl_date_clinical_biobert
English relation.drug_drug_interaction redl_drug_drug_interaction_biobert
English relation.humen_phenotype_gene redl_human_phenotype_gene_biobert
English relation.temporal_events redl_temporal_events_biobert

NERDLModels

Language nlu.load() reference Spark NLP Model reference
English med_ner.ade.clinical ner_ade_clinical
English med_ner.ade.clinical_bert ner_ade_clinicalbert
English med_ner.ade.ade_healthcare ner_ade_healthcare
English med_ner.anatomy ner_anatomy
English med_ner.anatomy.biobert ner_anatomy_biobert
English med_ner.anatomy.coarse ner_anatomy_coarse
English med_ner.anatomy.coarse_biobert ner_anatomy_coarse_biobert
English med_ner.aspect_sentiment ner_aspect_based_sentiment
English med_ner.bacterial_species ner_bacterial_species
English med_ner.bionlp ner_bionlp
English med_ner.bionlp.biobert ner_bionlp_biobert
English med_ner.cancer ner_cancer_genetics
English med_ner.cellular ner_cellular
English med_ner.cellular.biobert ner_cellular_biobert
English med_ner.chemicals ner_chemicals
English med_ner.chemprot ner_chemprot_biobert
English med_ner.chemprot.clinical ner_chemprot_clinical
English med_ner.clinical ner_clinical
English med_ner.clinical.biobert ner_clinical_biobert
English med_ner.clinical.noncontrib ner_clinical_noncontrib
English med_ner.diseases ner_diseases
English med_ner.diseases.biobert ner_diseases_biobert
English med_ner.diseases.large ner_diseases_large
English med_ner.drugs ner_drugs
English med_ner.drugsgreedy ner_drugs_greedy
English med_ner.drugs.large ner_drugs_large
English med_ner.events_biobert ner_events_biobert
English med_ner.events_clinical ner_events_clinical
English med_ner.events_healthcre ner_events_healthcare
English med_ner.financial_contract ner_financial_contract
English med_ner.healthcare ner_healthcare
English med_ner.human_phenotype.gene_biobert ner_human_phenotype_gene_biobert
English med_ner.human_phenotype.gene_clinical ner_human_phenotype_gene_clinical
English med_ner.human_phenotype.go_biobert ner_human_phenotype_go_biobert
English med_ner.human_phenotype.go_clinical ner_human_phenotype_go_clinical
English med_ner.jsl ner_jsl
English med_ner.jsl.biobert ner_jsl_biobert
English med_ner.jsl.enriched ner_jsl_enriched
English med_ner.jsl.enriched_biobert ner_jsl_enriched_biobert
English med_ner.measurements ner_measurements_clinical
English med_ner.medmentions ner_medmentions_coarse
English med_ner.posology ner_posology
English med_ner.posology.biobert ner_posology_biobert
English med_ner.posology.greedy ner_posology_greedy
English med_ner.posology.healthcare ner_posology_healthcare
English med_ner.posology.large ner_posology_large
English med_ner.posology.large_biobert ner_posology_large_biobert
English med_ner.posology.small ner_posology_small
English med_ner.radiology ner_radiology
English med_ner.radiology.wip_clinical ner_radiology_wip_clinical
English med_ner.risk_factors ner_risk_factors
English med_ner.risk_factors.biobert ner_risk_factors_biobert
English med_ner.i2b2 nerdl_i2b2
English med_ner.tumour nerdl_tumour_demo
English med_ner.jsl.wip.clinical jsl_ner_wip_clinical
English med_ner.jsl.wip.clinical.greedy jsl_ner_wip_greedy_clinical
English med_ner.jsl.wip.clinical.modifier jsl_ner_wip_modifier_clinical
English med_ner.jsl.wip.clinical.rd jsl_rd_ner_wip_greedy_clinical

De-Identification Models

Language nlu.load() reference Spark NLP Model reference
English med_ner.deid.augmented ner_deid_augmented
English med_ner.deid.biobert ner_deid_biobert
English med_ner.deid.enriched ner_deid_enriched
English med_ner.deid.enriched_biobert ner_deid_enriched_biobert
English med_ner.deid.large ner_deid_large
English med_ner.deid.sd ner_deid_sd
English med_ner.deid.sd_large ner_deid_sd_large
English med_ner.deid nerdl_deid
English med_ner.deid.synthetic ner_deid_synthetic
English med_ner.deid.dl ner_deidentify_dl
English en.de_identify deidentify_rb
English de_identify.rules deid_rules
English de_identify.clinical deidentify_enriched_clinical
English de_identify.large deidentify_large
English de_identify.rb deidentify_rb
English de_identify.rb_no_regex deidentify_rb_no_regex

Chunk resolvers

Language nlu.load() reference Spark NLP Model reference
English resolve_chunk.athena_conditions chunkresolve_athena_conditions_healthcare
English resolve_chunk.cpt_clinical chunkresolve_cpt_clinical
English resolve_chunk.icd10cm.clinical chunkresolve_icd10cm_clinical
English resolve_chunk.icd10cm.diseases_clinical chunkresolve_icd10cm_diseases_clinical
English resolve_chunk.icd10cm.hcc_clinical chunkresolve_icd10cm_hcc_clinical
English resolve_chunk.icd10cm.hcc_healthcare chunkresolve_icd10cm_hcc_healthcare
English resolve_chunk.icd10cm.injuries chunkresolve_icd10cm_injuries_clinical
English resolve_chunk.icd10cm.musculoskeletal chunkresolve_icd10cm_musculoskeletal_clinical
English resolve_chunk.icd10cm.neoplasms chunkresolve_icd10cm_neoplasms_clinical
English resolve_chunk.icd10cm.poison chunkresolve_icd10cm_poison_ext_clinical
English resolve_chunk.icd10cm.puerile chunkresolve_icd10cm_puerile_clinical
English resolve_chunk.icd10pcs.clinical chunkresolve_icd10pcs_clinical
English resolve_chunk.icdo.clinical chunkresolve_icdo_clinical
English resolve_chunk.loinc chunkresolve_loinc_clinical
English resolve_chunk.rxnorm.cd chunkresolve_rxnorm_cd_clinical
English resolve_chunk.rxnorm.in chunkresolve_rxnorm_in_clinical
English resolve_chunk.rxnorm.in_healthcare chunkresolve_rxnorm_in_healthcare
English resolve_chunk.rxnorm.sbd chunkresolve_rxnorm_sbd_clinical
English resolve_chunk.rxnorm.scd chunkresolve_rxnorm_scd_clinical
English resolve_chunk.rxnorm.scdc chunkresolve_rxnorm_scdc_clinical
English resolve_chunk.rxnorm.scdc_healthcare chunkresolve_rxnorm_scdc_healthcare
English resolve_chunk.rxnorm.xsmall.clinical chunkresolve_rxnorm_xsmall_clinical
English resolve_chunk.snomed.findings chunkresolve_snomed_findings_clinical

New Classifiers

Language nlu.load() reference Spark NLP Model reference
English classify.icd10.clinical classifier_icd10cm_hcc_clinical
English classify.icd10.healthcare classifier_icd10cm_hcc_healthcare
English classify.ade.biobert classifierdl_ade_biobert
English classify.ade.clinical classifierdl_ade_clinicalbert
English classify.ade.conversational classifierdl_ade_conversational_biobert
English classify.gender.biobert classifierdl_gender_biobert
English classify.gender.sbert classifierdl_gender_sbert
English classify.pico classifierdl_pico_biobert

German Medical models

nlu.load() reference Spark NLP Model reference
embed w2v_cc_300d
embed.w2v w2v_cc_300d
resolve_chunk chunkresolve_ICD10GM
resolve_chunk.icd10gm chunkresolve_ICD10GM
resolve_chunk.icd10gm.2021 chunkresolve_ICD10GM_2021
med_ner.legal ner_legal
med_ner ner_healthcare
med_ner.healthcare ner_healthcare
med_ner.healthcare_slim ner_healthcare_slim
med_ner.traffic ner_traffic

Spanish Medical models

nlu.load() reference Spark NLP Model reference
embed.scielo.150d embeddings_scielo_150d
embed.scielo.300d embeddings_scielo_300d
embed.scielo.50d embeddings_scielo_50d
embed.scielowiki.150d embeddings_scielowiki_150d
embed.scielowiki.300d embeddings_scielowiki_300d
embed.scielowiki.50d embeddings_scielowiki_50d
embed.sciwiki.150d embeddings_sciwiki_150d
embed.sciwiki.300d embeddings_sciwiki_300d
embed.sciwiki.50d embeddings_sciwiki_50d
med_ner ner_diag_proc
med_ner.neoplasm ner_neoplasms
med_ner.diag_proc ner_diag_proc

GPU Mode

You can now enable NLU GPU mode by setting gpu=True while loading a model, i.e. nlu.load('train.sentiment', gpu=True). You must restart your kernel if you already loaded an NLU pipeline without GPU mode.

Output Level Relation

This new output level is used for relation extractors and will give you 1 row per relation extracted.
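One row per relation can be pictured as a simple explosion of a document's relation list (purely illustrative; the field names below are assumptions, not NLU's actual schema):

```python
# Sketch of the 'relation' output level: a document with several extracted
# relations becomes one row per relation.
doc = {
    "text": "The patient developed cancer after a mercury poisoning in 1999",
    "relations": [
        ("cancer", "AFTER", "mercury poisoning"),
        ("mercury poisoning", "OVERLAP", "1999"),
    ],
}

rows = [
    {"document": doc["text"], "entity1": e1, "relation": rel, "entity2": e2}
    for e1, rel, e2 in doc["relations"]
]
print(len(rows))  # 2 rows, one per extracted relation
```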

Bug fixes

  • Fixed a bug that caused loading NLU models in offline mode to fail on some occasions

1 line Install NLU

!wget https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh -O - | bash

Install via PIP

! pip install nlu pyspark==3.0.1

Additional NLU resources

Intent and Action Classification, analyze Chinese News and the Crypto market, train a classifier that understands 100+ languages, translate between 200+ languages, answer questions, summarize text, and much more in NLU 1.1.3

NLU 1.1.3 Release Notes

We are very excited to announce that the latest NLU release comes with a new pretrained Intent Classifier and NER Action Extractor for text related to music, restaurants, and movies trained on the SNIPS dataset. Make sure to check out the models hub and the easy 1-liners for more info!

In addition to that, new NER and embedding models for Bengali are now available.

Finally, there is a new NLU webinar with 9 accompanying tutorial notebooks, segmented into the following parts:

  • Part1: Easy 1 Liners
    • Spell checking/Sentiment/POS/NER/BERTology embeddings
  • Part2: Data analysis and NLP tasks on Crypto News Headline dataset
    • Preprocessing and extracting Emotions, Keywords, Named Entities and visualize them
  • Part3: NLU Multi-Lingual 1 Liners with Microsoft’s Marian Models
    • Translate between 200+ languages (and classify the language afterward)
  • Part 4: Data analysis and NLP tasks on Chinese News Article Dataset
    • Word Segmentation, Lemmatization, Keyword extraction, Named Entities, and translation to English
  • Part 5: Train a sentiment Classifier that understands 100+ Languages
  • Part 6: Question answering, Summarization, Squad and more with Google’s T5

New Models

NLU 1.1.3 New Non-English Models

Language nlu.load() reference Spark NLP Model reference Type
Bengali bn.ner.cc_300d bengaliner_cc_300d NerDLModel
Bengali bn.embed bengali_cc_300d Word Embeddings Model
Bengali bn.embed.cc_300d bengali_cc_300d Word Embeddings Model (Alias)
Bengali bn.embed.glove bengali_cc_300d Word Embeddings Model (Alias)

NLU 1.1.3 New English Models

Language nlu.load() reference Spark NLP Model reference Type
English en.classify.snips nerdl_snips_100d NerDLModel
English en.ner.snips classifierdl_use_snips ClassifierDLModel

New NLU Webinar

State-of-the-art Natural Language Processing for 200+ Languages with 1 Line of code

Talk Abstract

Learn to harness the power of 1,000+ production-grade & scalable NLP models for 200+ languages - all available with just 1 line of Python code by leveraging the open-source NLU library, which is powered by the widely popular Spark NLP.

John Snow Labs has delivered over 80 releases of Spark NLP to date, making it the most widely used NLP library in the enterprise and providing the AI community with state-of-the-art accuracy and scale for a variety of common NLP tasks. The most recent releases include pre-trained models for over 200 languages - including languages that do not use spaces for word segmentation, like Chinese, Japanese, and Korean, and languages written from right to left, like Arabic, Farsi, Urdu, and Hebrew. All software and models are free and open source under an Apache 2.0 license.

This webinar will show you how to leverage the multi-lingual capabilities of Spark NLP & NLU - including automated language detection for up to 375 languages, and the ability to perform translation, named entity recognition, stopword removal, lemmatization, and more in a variety of language families. We will create Python code in real-time and solve these problems in just 30 minutes. The notebooks will then be made freely available online.

You can watch the video here.

NLU 1.1.3 New Notebooks and tutorials

New Webinar Notebooks

  1. NLU basics, easy 1-liners (Spellchecking, sentiment, NER, POS, BERT)
  2. Analyze Crypto News dataset with Keyword extraction, NER, Emotional distribution, and stemming
  3. Translate Crypto News dataset between 300 Languages with the Marian Model (German, French, Hebrew examples)
  4. Translate Crypto News dataset between 300 Languages with the Marian Model (Hindi, Russian, Chinese examples)
  5. Analyze Chinese News Headlines with Chinese Word Segmentation, Lemmatization, NER, and Keyword extraction
  6. Train a Sentiment Classifier that will understand 100+ languages on just a French Dataset with the powerful Language Agnostic Bert Embeddings
  7. Summarize text and Answer Questions with T5
  8. Solve any task in 1 line from SQUAD, GLUE and SUPER GLUE with T5
  9. Overview of models for various languages

New easy NLU 1-liners in NLU 1.1.3

nlu.load("en.classify.snips").predict("book a spot for nona gray  myrtle and alison at a top-rated brasserie that is distant from wilson av on nov  the 4th  2030 that serves ouzeri",output_level = "document")

outputs :

ner_confidence entities document Entities_Classes
[1.0, 1.0, 0.9997000098228455, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9990000128746033, 1.0, 1.0, 1.0, 0.9965000152587891, 0.9998999834060669, 0.9567000269889832, 1.0, 1.0, 1.0, 0.9980000257492065, 0.9991999864578247, 0.9988999962806702, 1.0, 1.0, 0.9998999834060669] [‘nona gray myrtle and alison’, ‘top-rated’, ‘brasserie’, ‘distant’, ‘wilson av’, ‘nov the 4th 2030’, ‘ouzeri’] book a spot for nona gray myrtle and alison at a top-rated brasserie that is distant from wilson av on nov the 4th 2030 that serves ouzeri [‘party_size_description’, ‘sort’, ‘restaurant_type’, ‘spatial_relation’, ‘poi’, ‘timeRange’, ‘cuisine’]

Named Entity Recognition (NER) Model in Bengali (bengaliner_cc_300d)

# Bengali for: 'Iajuddin Ahmed passed Matriculation from Munshiganj High School in 1947 and Intermediate from Munshiganj Horganga College in 1950.'
nlu.load("bn.ner.cc_300d").predict("১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন",output_level = "document")

outputs :

ner_confidence entities Entities_Classes document
[0.9987999796867371, 0.9854000210762024, 0.8604000210762024, 0.6686999797821045, 0.5289999842643738, 0.7009999752044678, 0.7684999704360962, 0.9979000091552734, 0.9976000189781189, 0.9930999875068665, 0.9994000196456909, 0.9879000186920166, 0.7407000064849854, 0.9215999841690063, 0.7657999992370605, 0.39419999718666077, 0.9124000072479248, 0.9932000041007996, 0.9919999837875366, 0.995199978351593, 0.9991999864578247] [‘সালে’, ‘ইয়াজউদ্দিন আহম্মেদ’, ‘মুন্সিগঞ্জ উচ্চ বিদ্যালয়’, ‘সালে’, ‘মুন্সিগঞ্জ হরগঙ্গা কলেজ’] [‘TIME’, ‘PER’, ‘ORG’, ‘TIME’, ‘ORG’] ১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন

Identify intent in general text - SNIPS dataset

nlu.load("en.ner.snips").predict("I want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area",output_level = "document")

outputs :

document snips snips_confidence
I want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area BookRestaurant 1

Word Embeddings for Bengali (bengali_cc_300d)

# Bengali for : 'Iajuddin Ahmed passed Matriculation from Munshiganj High School in 1947 and Intermediate from Munshiganj Horganga College in 1950.'
nlu.load("bn.embed").predict("১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন",output_level = "document")

outputs :

document bn_embed_embeddings
১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন [-0.0828 0.0683 0.0215 … 0.0679 -0.0484…]

NLU 1.1.3 Enhancements

  • Added automatic conversion of word embeddings to sentence embeddings when no sentence embeddings are available and a model needs the converted version to run.
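The conversion amounts to pooling word vectors into a single sentence vector, commonly by averaging (a minimal sketch of that idea; NLU wires the actual conversion up internally):

```python
# Sketch: derive a sentence embedding by averaging its word embeddings.
def average_embedding(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

words = [[1.0, 2.0], [3.0, 4.0]]  # two word vectors, dimension 2
print(average_embedding(words))  # [2.0, 3.0]
```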

NLU 1.1.3 Bug Fixes

  • Fixed a bug that caused ur.sentiment NLU pipeline to build incorrectly
  • Fixed a bug that caused sentiment.imdb.glove NLU pipeline to build incorrectly
  • Fixed a bug that caused en.sentiment.glove.imdb NLU pipeline to build incorrectly
  • Fixed a bug that caused Spark 2.3.X environments to crash.

NLU Installation

# PyPi
!pip install nlu pyspark==2.4.7
#Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu

Additional NLU resources

NLU 1.1.2 Release Notes

Hindi Word Embeddings, Bengali Named Entity Recognition (NER), 30+ new models, and crypto news analysis with John Snow Labs NLU 1.1.2

We are very happy to announce NLU 1.1.2 has been released with the integration of 30+ models and pipelines, including Bengali Named Entity Recognition, Hindi Word Embeddings, and state-of-the-art transformer-based OntoNotes models and pipelines from the incredible Spark NLP 2.7.3 release, in addition to a few bug fixes.
There is also a new NLU webinar video showcasing in detail how to use NLU to analyze a crypto news dataset, extract keywords unsupervised, predict the sentiment and emotion distributions of the dataset, and much more!

Python’s NLU library: 1,000+ models, 200+ Languages, State of the Art Accuracy, 1 Line of code - NLU NYC/DC NLP Meetup Webinar

All of this is done using just 1 line of Python code, by leveraging the NLU library, which is powered by the award-winning Spark NLP.

This webinar covers, using live coding in real-time, how to deliver summarization, translation, unsupervised keyword extraction, emotion analysis, question answering, spell checking, named entity recognition, document classification, and other common NLP tasks. This is all done with a single line of code that works directly on Python strings or pandas data frames. Since NLU is based on Spark NLP, no code changes are required to scale processing to multi-core or cluster environments, integrating natively with Ray, Dask, or Spark data frames.

The recent releases for Spark NLP and NLU include pre-trained models for over 200 languages and language detection for 375 languages. This includes 20 language families; non-Latin alphabets; languages that do not use spaces for word segmentation like Chinese, Japanese, and Korean; and languages written from right to left like Arabic, Farsi, Urdu, and Hebrew. We’ll also cover some of the algorithms and models that are included. The code notebooks will be freely available online.

NLU 1.1.2 New Models and Pipelines

NLU 1.1.2 New Non-English Models

Language nlu.load() reference Spark NLP Model reference Type
Bengali bn.ner ner_jifs_glove_840B_300d NerDLModel
Bengali bn.ner.glove ner_jifs_glove_840B_300d NerDLModel (Alias)
Hindi hi.embed hindi_cc_300d Word Embeddings Model
Bengali bn.lemma lemma Lemmatizer
Japanese ja.lemma lemma Lemmatizer
Bihari bh.lemma lemma Lemmatizer
Amharic am.lemma lemma Lemmatizer

NLU 1.1.2 New English Models and Pipelines

Language nlu.load() reference Spark NLP Model reference Type
English en.ner.onto.bert.small_l2_128 onto_small_bert_L2_128 NerDLModel
English en.ner.onto.bert.small_l4_256 onto_small_bert_L4_256 NerDLModel
English en.ner.onto.bert.small_l4_512 onto_small_bert_L4_512 NerDLModel
English en.ner.onto.bert.small_l8_512 onto_small_bert_L8_512 NerDLModel
English en.ner.onto.bert.cased_base onto_bert_base_cased NerDLModel
English en.ner.onto.bert.cased_large onto_bert_large_cased NerDLModel
English en.ner.onto.electra.uncased_small onto_electra_small_uncased NerDLModel
English en.ner.onto.electra.uncased_base onto_electra_base_uncased NerDLModel
English en.ner.onto.electra.uncased_large onto_electra_large_uncased NerDLModel
English en.ner.onto.bert.tiny onto_recognize_entities_bert_tiny Pipeline
English en.ner.onto.bert.mini onto_recognize_entities_bert_mini Pipeline
English en.ner.onto.bert.small onto_recognize_entities_bert_small Pipeline
English en.ner.onto.bert.medium onto_recognize_entities_bert_medium Pipeline
English en.ner.onto.bert.base onto_recognize_entities_bert_base Pipeline
English en.ner.onto.bert.large onto_recognize_entities_bert_large Pipeline
English en.ner.onto.electra.small onto_recognize_entities_electra_small Pipeline
English en.ner.onto.electra.base onto_recognize_entities_electra_base Pipeline
English en.ner.onto.large onto_recognize_entities_electra_large Pipeline
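The nlu.load() references in the table are aliases for underlying Spark NLP model and pipeline names. A hypothetical lookup sketch using a few rows copied from the table above (the helper and dict names are illustrative, not NLU API):

```python
# A few rows from the table above: nlu.load() reference -> Spark NLP reference.
ONTO_NER_REFERENCES = {
    "en.ner.onto.bert.small_l2_128": "onto_small_bert_L2_128",
    "en.ner.onto.bert.cased_base": "onto_bert_base_cased",
    "en.ner.onto.electra.uncased_small": "onto_electra_small_uncased",
    "en.ner.onto.bert.tiny": "onto_recognize_entities_bert_tiny",
}

def spark_nlp_reference(nlu_reference):
    # Resolve an nlu.load() alias to its underlying Spark NLP reference.
    return ONTO_NER_REFERENCES[nlu_reference]

print(spark_nlp_reference("en.ner.onto.bert.tiny"))  # → onto_recognize_entities_bert_tiny
```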

New Tutorials and Notebooks

NLU 1.1.2 Bug Fixes

  • Fixed a bug that caused NER confidences not to be extracted
  • Fixed a bug that caused nlu.load(‘spell’) to crash
  • Fixed a bug that caused Uralic/Estonian/ET language models not to be loaded properly

New Easy NLU 1-liners in 1.1.2

Named Entity Recognition for Bengali (GloVe 840B 300d)

#Bengali for :  It began to be widely used in the United States in the early '90s.
nlu.load("bn.ner").predict("৯০ এর দশকের শুরুর দিকে বৃহৎ আকারে মার্কিন যুক্তরাষ্ট্রে এর প্রয়োগের প্রক্রিয়া শুরু হয়'")

output :

entities token Entities_classes ner_confidence
[‘মার্কিন যুক্তরাষ্ট্রে’] ৯০ [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] এর [‘LOC’] 0.9999
[‘মার্কিন যুক্তরাষ্ট্রে’] দশকের [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] শুরুর [‘LOC’] 0.9969
[‘মার্কিন যুক্তরাষ্ট্রে’] দিকে [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] বৃহৎ [‘LOC’] 0.9994
[‘মার্কিন যুক্তরাষ্ট্রে’] আকারে [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] মার্কিন [‘LOC’] 0.9602
[‘মার্কিন যুক্তরাষ্ট্রে’] যুক্তরাষ্ট্রে [‘LOC’] 0.4134
[‘মার্কিন যুক্তরাষ্ট্রে’] এর [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] প্রয়োগের [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] প্রক্রিয়া [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] শুরু [‘LOC’] 0.9999
[‘মার্কিন যুক্তরাষ্ট্রে’] হয় [‘LOC’] 1
[‘মার্কিন যুক্তরাষ্ট্রে’] [‘LOC’] 1
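The token-level output above carries one ner_confidence per row; a small post-processing sketch that pairs tokens with their confidences and keeps only high-confidence rows (plain Python over a few values copied from the table; the helper name is hypothetical):

```python
def filter_by_confidence(tokens, confidences, threshold=0.9):
    # Keep (token, confidence) pairs whose NER confidence meets the threshold.
    return [(t, c) for t, c in zip(tokens, confidences) if c >= threshold]

tokens = ["মার্কিন", "যুক্তরাষ্ট্রে", "এর"]
confidences = [0.9602, 0.4134, 1.0]

print(filter_by_confidence(tokens, confidences))  # → [('মার্কিন', 0.9602), ('এর', 1.0)]
```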

Bengali Lemmatizer

#Bengali for :  One morning in the marble-decorated building of Vaidyanatha, an obese monk was engaged in the enchantment of Duis and the milk service of one and a half Vaidyanatha. Give me two to eat
nlu.load("bn.lemma").predict("একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও")

output :

lemma document
[‘একদিন’, ‘প্রাতঃ’, ‘বৈদ্যনাথ’, ‘মার্বলমণ্ডিত’, ‘দালান’, ‘এক’, ‘স্থূলউদর’, ‘সন্ন্যাসী’, ‘দুইসের’, ‘মোহনভোগ’, ‘এবং’, ‘দেড়সের’, ‘দুগ্ধ’, ‘সেবা’, ‘নিযুক্ত’, ‘আছে’, ‘বৈদ্যনাথ’, ‘গা’, ‘একখান’, ‘চাদর’, ‘দেওয়া’, ‘জোড়কর’, ‘একান্ত’, ‘বিনীতভাব’, ‘ভূতল’, ‘বসা’, ‘ভক্তিভরা’, ‘পবিত্র’, ‘ভোজনব্যাপার’, ‘নিরীক্ষণ’, ‘করা’, ‘এমন’, ‘সময়’, ‘কোনোমত’, ‘দ্বারী’, ‘দৃষ্টি’, ‘এড়ানো’, ‘জীর্ণদেহ’, ‘বালক’, ‘সহিত’, ‘এক’, ‘অতি’, ‘শীর্ণকায়া’, ‘রমণী’, ‘গৃহ’, ‘প্রবেশ’, ‘বিশ্বাস’, ‘ক্ষীণস্বর’, ‘কহা’, ‘বাবু’, ‘দুই’, ‘খাওয়া’, ‘দাওয়া’] একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও

Japanese Lemmatizer

#Japanese for :  Some residents were uncomfortable with this, but it seems that no one is now openly protesting or protesting.
nlu.load("ja.lemma").predict("これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。")

output :

lemma document
[‘これ’, ‘にる’, ‘不快’, ‘感’, ‘を’, ‘示す’, ‘住民’, ‘はる’, ‘いる’, ‘まする’, ‘たる’, ‘がる’, ‘,’, ‘現在’, ‘,’, ‘表立つ’, ‘てる’, ‘反対’, ‘やる’, ‘抗議’, ‘のる’, ‘声’, ‘を’, ‘挙げる’, ‘てる’, ‘いる’, ‘住民’, ‘はる’, ‘いる’, ‘なぐ’, ‘よう’, ‘です’, ‘。’] これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。

Amharic Lemmatizer

#Amharic for :  Bookmark the permalink.
nlu.load("am.lemma").predict("መጽሐፉን መጽሐፍ ኡ ን አስያዛት አስያዝ ኧ ኣት ።")

output :

lemma document
[‘’, ‘መጽሐፍ’, ‘ኡ’, ‘ን’, ‘’, ‘አስያዝ’, ‘ኧ’, ‘ኣት’, ‘።’] መጽሐፉን መጽሐፍ ኡ ን አስያዛት አስያዝ ኧ ኣት ።

Bhojpuri Lemmatizer

#Bhojpuri for : In this event, participation of World Bhojpuri Conference, Purvanchal Ekta Manch, Veer Kunwar Singh Foundation, Purvanchal Bhojpuri Mahasabha, and Herf - Media.
nlu.load("bh.lemma").predict("एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।")

output :

lemma document
[‘एह’, ‘आयोजन’, ‘में’, ‘विश्व’, ‘भोजपुरी’, ‘सम्मेलन’, ‘COMMA’, ‘पूर्वांचल’, ‘एकता’, ‘मंच’, ‘COMMA’, ‘वीर’, ‘कुँवर’, ‘सिंह’, ‘फाउन्डेशन’, ‘COMMA’, ‘पूर्वांचल’, ‘भोजपुरी’, ‘महासभा’, ‘COMMA’, ‘अउर’, ‘हर्फ’, ‘-‘, ‘मीडिया’, ‘को’, ‘सहभागिता’, ‘बा’, ‘।’] एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।

Named Entity Recognition - BERT Tiny (OntoNotes)

nlu.load("en.ner.onto.bert.small_l2_128").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence Entities_classes entities
[0.8536999821662903, 0.7195000052452087, 0.746…] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘1970s’, ‘1980s’, ‘Seattle’, ‘Washington’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’]
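Since the entities and Entities_classes columns are parallel lists, pairing and tallying them is a one-step zip. A sketch over a hypothetical subset of the output above (helper name is illustrative):

```python
from collections import Counter

def count_entity_classes(entities, classes):
    # Pair each entity with its predicted class and tally the classes.
    pairs = list(zip(entities, classes))
    return pairs, Counter(classes)

entities = ["William Henry Gates III", "October 28, 1955", "American", "Microsoft Corporation"]
classes = ["PERSON", "DATE", "NORP", "ORG"]

pairs, tally = count_entity_classes(entities, classes)
print(tally["DATE"])  # → 1
```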

Named Entity Recognition - BERT Mini (OntoNotes)

nlu.load("en.ner.onto.bert.small_l4_256").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.835099995136261, 0.40450000762939453, 0.331…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘1970s and 1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘ORG’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘GPE’, ‘GPE’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - BERT Small (OntoNotes)

nlu.load("en.ner.onto.bert.small_l4_512").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.964900016784668, 0.8299000263214111, 0.9607…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘the 1970s and 1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - BERT Medium (OntoNotes)

nlu.load("en.ner.onto.bert.small_l8_512").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.916700005531311, 0.5873000025749207, 0.8816…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘the 1970s and 1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - BERT Base (OntoNotes)

nlu.load("en.ner.onto.bert.cased_base").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.504800021648407, 0.47290000319480896, 0.462…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘the 1970s and 1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - BERT Large (OntoNotes)

nlu.load("en.ner.onto.bert.cased_large").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.7213000059127808, 0.6384000182151794, 0.731…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘1970s’, ‘1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - ELECTRA Small (OntoNotes)

nlu.load("en.ner.onto.electra.uncased_small").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence Entities_classes entities
[0.8496000170707703, 0.4465999901294708, 0.568…] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘1970s’, ‘1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’]

Named Entity Recognition - ELECTRA Base (OntoNotes)

nlu.load("en.ner.onto.electra.uncased_base").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.5134000182151794, 0.9419000148773193, 0.802…] [‘William Henry Gates III’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘the 1970s’, ‘1980s’, ‘Seattle’, ‘Washington’, ‘Gates’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’]

Named Entity Recognition - ELECTRA Large (OntoNotes)


nlu.load("en.ner.onto.electra.uncased_large").predict("""William Henry Gates III (born October 28, 1955) is an American business magnate,
 software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft,
  Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect,
   while also being the largest individual shareholder until May 2014.
    He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico;
     it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect.
     During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time
      role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.
 He gradually transferred his duties to Ray Ozzie and Craig Mundie.
  He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.""",output_level = "document")

output :

ner_confidence entities Entities_classes
[0.8442000150680542, 0.26840001344680786, 0.57…] [‘William Henry Gates’, ‘October 28, 1955’, ‘American’, ‘Microsoft Corporation’, ‘Microsoft’, ‘Gates’, ‘May 2014’, ‘one’, ‘1970s’, ‘1980s’, ‘Seattle’, ‘Washington’, ‘Gates co-founded’, ‘Microsoft’, ‘Paul Allen’, ‘1975’, ‘Albuquerque’, ‘New Mexico’, ‘largest’] [‘PERSON’, ‘DATE’, ‘NORP’, ‘ORG’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘CARDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘GPE’, ‘PERSON’, ‘ORG’, ‘PERSON’, ‘DATE’, ‘GPE’, ‘GPE’, ‘GPE’]

Recognize Entities OntoNotes - BERT Tiny


nlu.load("en.ner.onto.bert.tiny").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.994700014591217, 0.9412999749183655, 0.9685…] [‘Johnson’, ‘first’, ‘2001’, ‘Parliament’, ‘eight years’, ‘London’, ‘2008 to 2016’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘ORG’, ‘DATE’, ‘GPE’, ‘DATE’]

Recognize Entities OntoNotes - BERT Mini

nlu.load("en.ner.onto.bert.mini").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.996399998664856, 0.9733999967575073, 0.8766…] [‘Johnson’, ‘first’, ‘2001’, ‘eight years’, ‘London’, ‘2008 to 2016’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘DATE’]

Recognize Entities OntoNotes - BERT Small

nlu.load("en.ner.onto.bert.small").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.9987999796867371, 0.9610000252723694, 0.998…] [‘Johnson’, ‘first’, ‘2001’, ‘eight years’, ‘London’, ‘2008 to 2016’, ‘Parliament’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘DATE’, ‘ORG’]

Recognize Entities OntoNotes - BERT Medium


nlu.load("en.ner.onto.bert.medium").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.9969000220298767, 0.8575999736785889, 0.995…] [‘Johnson’, ‘first’, ‘2001’, ‘eight years’, ‘London’, ‘2008 to 2016’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘DATE’]

Recognize Entities OntoNotes - BERT Base

nlu.load("en.ner.onto.bert.base").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.996999979019165, 0.933899998664856, 0.99930…] [‘Johnson’, ‘first’, ‘2001’, ‘Parliament’, ‘eight years’, ‘London’, ‘2008 to 2016’, ‘Parliament’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘ORG’, ‘DATE’, ‘GPE’, ‘DATE’, ‘ORG’]

Recognize Entities OntoNotes - BERT Large

nlu.load("en.ner.onto.bert.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.9786999821662903, 0.9549000263214111, 0.998…] [‘Johnson’, ‘first’, ‘2001’, ‘Parliament’, ‘eight years’, ‘London’, ‘2008 to 2016’, ‘Parliament’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘ORG’, ‘DATE’, ‘GPE’, ‘DATE’, ‘ORG’]

Recognize Entities OntoNotes - ELECTRA Small

nlu.load("en.ner.onto.electra.small").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.9952999949455261, 0.8589000105857849, 0.996…] [‘Johnson’, ‘first’, ‘2001’, ‘eight years’, ‘London’, ‘2008 to 2016’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘DATE’]

Recognize Entities OntoNotes - ELECTRA Base

nlu.load("en.ner.onto.electra.base").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output :

ner_confidence entities Entities_classes
[0.9987999796867371, 0.9474999904632568, 0.999…] [‘Johnson’, ‘first’, ‘2001’, ‘Parliament’, ‘eight years’, ‘London’, ‘2008’, ‘2016’] [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘ORG’, ‘DATE’, ‘GPE’, ‘DATE’, ‘DATE’]

Recognize Entities OntoNotes - ELECTRA Large

nlu.load("en.ner.onto.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.",output_level="document")

output:

| ner_confidence | entities | Entities_classes |
|---|---|---|
| [0.9998000264167786, 0.9613999724388123, 0.998…] | [‘Johnson’, ‘first’, ‘2001’, ‘eight years’, ‘London’, ‘2008 to 2016’] | [‘PERSON’, ‘ORDINAL’, ‘DATE’, ‘DATE’, ‘GPE’, ‘DATE’] |

NLU Installation

# PyPi
!pip install nlu pyspark==2.4.7
#Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu

Additional NLU resources

NLU 1.1.1 Release Notes

We are very excited to release NLU 1.1.1! This release features 3 new tutorial notebooks for open/closed book question answering with Google's T5, intent classification, and aspect-based NER. In addition, NLU 1.1.1 comes with 25+ pretrained models and pipelines in Amharic, Bengali, Bhojpuri, Japanese, and Korean from the Spark NLP 2.7.2 release. Finally, NLU now supports running on Spark 2.3 clusters.

NLU 1.1.0 New Non-English Models

| Language | nlu.load() reference | Spark NLP Model reference | Type |
|---|---|---|---|
| Arabic | ar.ner | arabic_w2v_cc_300d | Named Entity Recognizer |
| Arabic | ar.embed.aner | aner_cc_300d | Word Embedding |
| Arabic | ar.embed.aner.300d | aner_cc_300d | Word Embedding (Alias) |
| Bengali | bn.stopwords | stopwords_bn | Stopwords Cleaner |
| Bengali | bn.pos | pos_msri | Part of Speech |
| Thai | th.segment_words | wordseg_best | Word Segmenter |
| Thai | th.pos | pos_lst20 | Part of Speech |
| Thai | th.sentiment | sentiment_jager_use | Sentiment Classifier |
| Thai | th.classify.sentiment | sentiment_jager_use | Sentiment Classifier (Alias) |
| Chinese | zh.pos.ud_gsd_trad | pos_ud_gsd_trad | Part of Speech |
| Chinese | zh.segment_words.gsd | wordseg_gsd_ud_trad | Word Segmenter |
| Bhojpuri | bh.pos | pos_ud_bhtb | Part of Speech |
| Amharic | am.pos | pos_ud_att | Part of Speech |

NLU 1.1.1 New English Models and Pipelines

| Language | nlu.load() reference | Spark NLP Model reference | Type |
|---|---|---|---|
| English | en.sentiment.glove | analyze_sentimentdl_glove_imdb | Sentiment Classifier |
| English | en.sentiment.glove.imdb | analyze_sentimentdl_glove_imdb | Sentiment Classifier (Alias) |
| English | en.classify.sentiment.glove.imdb | analyze_sentimentdl_glove_imdb | Sentiment Classifier (Alias) |
| English | en.classify.sentiment.glove | analyze_sentimentdl_glove_imdb | Sentiment Classifier (Alias) |
| English | en.classify.trec50.pipe | classifierdl_use_trec50_pipeline | Language Classifier |
| English | en.ner.onto.large | onto_recognize_entities_electra_large | Named Entity Recognizer |
| English | en.classify.questions.atis | classifierdl_use_atis | Intent Classifier |
| English | en.classify.questions.airline | classifierdl_use_atis | Intent Classifier (Alias) |
| English | en.classify.intent.atis | classifierdl_use_atis | Intent Classifier (Alias) |
| English | en.classify.intent.airline | classifierdl_use_atis | Intent Classifier (Alias) |
| English | en.ner.atis | nerdl_atis_840b_300d | Aspect based NER |
| English | en.ner.airline | nerdl_atis_840b_300d | Aspect based NER (Alias) |
| English | en.ner.aspect.airline | nerdl_atis_840b_300d | Aspect based NER (Alias) |
| English | en.ner.aspect.atis | nerdl_atis_840b_300d | Aspect based NER (Alias) |

New Easy NLU 1-liner Examples:

Extract aspects and entities from airline questions (ATIS dataset)

	
nlu.load("en.ner.atis").predict("i want to fly from baltimore to dallas round trip")
output:  ["baltimore"," dallas", "round trip"]

Intent Classification for Airline Traffic Information System queries (ATIS dataset)


nlu.load("en.classify.questions.atis").predict("what is the price of flight from newyork to washington")
output:  "atis_airfare"	

Recognize Entities OntoNotes - ELECTRA Large


nlu.load("en.ner.onto.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London.")	
output:  ["Johnson", "first", "2001", "eight years", "London"]	

Question classification of open-domain and fact-based questions Pipeline - TREC50

nlu.load("en.classify.trec50.pipe").predict("When did the construction of stone circles begin in the UK? ")
output:  LOC_other

Traditional Chinese Word Segmentation

# 'However, this treatment also creates some problems' in Chinese
nlu.load("zh.segment_words.gsd").predict("然而,這樣的處理也衍生了一些問題。")
output:  ["然而",",","這樣","的","處理","也","衍生","了","一些","問題","。"]

Part of Speech for Traditional Chinese

# 'However, this treatment also creates some problems' in Chinese
nlu.load("zh.pos.ud_gsd_trad").predict("然而,這樣的處理也衍生了一些問題。")

Output:

| Token | POS |
|---|---|
| 然而 | ADV |
| , | PUNCT |
| 這樣 | PRON |
| 的 | PART |
| 處理 | NOUN |
| 也 | ADV |
| 衍生 | VERB |
| 了 | PART |
| 一些 | ADJ |
| 問題 | NOUN |
| 。 | PUNCT |

Thai Word Segment Recognition

# 'Mona Lisa is a 16th-century oil painting created by Leonardo held at the Louvre in Paris' in Thai
nlu.load("th.segment_words").predict("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส")

Output:

token
M
o
n
a
Lisa
เป็น
ภาพ
สีน้ำ
มัน
ใน
ศตวรรษ
ที่
16
ที่
สร้าง
L
e
o
n
a
r
d
o
จัด
ขึ้น
ที่
พิพิธภัณฑ์
ลูฟร์
ใน
ปารีส

Part of Speech for Bengali (POS)

# 'The village is also called 'Mod' in Tora language' in Bengali
nlu.load("bn.pos").predict("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷")

Output:

| token | pos |
|---|---|
| বাসস্থান-ঘরগৃহস্থালি | NN |
| তোড়া | NNP |
| ভাষায় | NN |
| গ্রামকেও | NN |
| বলে | VM |
| ` | SYM |
| মোদ | NN |
| ' | SYM |
| ৷ | SYM |

Stop Words Cleaner for Bengali

# 'This language is not enough' in Bengali 
df = nlu.load("bn.stopwords").predict("এই ভাষা যথেষ্ট নয়")

Output:

| cleanTokens | token |
|---|---|
| ভাষা | এই |
| যথেষ্ট | ভাষা |
| নয় | যথেষ্ট |
| None | নয় |

Part of Speech for Bhojpuri


# 'The people of Ohu know that the foundation of Bhojpuri was shaken' in Bhojpuri
nlu.load('bh.pos').predict("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई")

Output:

| pos | token |
|---|---|
| DET | ओहु |
| NOUN | लोग |
| ADP | के |
| NOUN | मालूम |
| VERB | बा |
| SCONJ | कि |
| ADJ | श्लील |
| VERB | होखते |
| PROPN | भोजपुरी |
| ADP | के |
| NOUN | नींव |
| VERB | हिल |
| AUX | जाई |

Amharic Part of Speech (POS)

# ' "Son, finish the job," he said.' in Amharic
nlu.load('am.pos').predict('ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"')

Output:

| pos | token |
|---|---|
| NOUN | ልጅ |
| DET | ኡ |
| PART | ን |
| NOUN | ሥራ |
| DET | ው |
| PART | ን |
| VERB | አስጨርስ |
| PRON | ኧው |
| AUX | ኣል |
| PRON | ኧሁ |
| PUNCT | ። |
| NOUN | " |

Thai Sentiment Classification

#  'I love peanut butter and jelly!' in thai
nlu.load('th.classify.sentiment').predict('ฉันชอบเนยถั่วและเยลลี่!')[['sentiment','sentiment_confidence']]

Output:

| sentiment | sentiment_confidence |
|---|---|
| positive | 0.999998 |

Arabic Named Entity Recognition (NER)

# 'In 1918, the forces of the Arab Revolt liberated Damascus with the help of the British' in Arabic
nlu.load('ar.ner').predict('في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز',output_level='chunk')[['entities_confidence','ner_confidence','entities']]

Output:

| entity_class | ner_confidence | entities |
|---|---|---|
| ORG | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | قوات الثورة العربية |
| LOC | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | دمشق |
| PER | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | الإنكليز |

NLU 1.1.1 Enhancements:

  • Spark 2.3 compatibility

New NLU Notebooks and Tutorials

Installation

# PyPi
!pip install nlu pyspark==2.4.7
#Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu

Additional NLU resources

NLU 1.1.0 Release Notes

We are incredibly excited to release NLU 1.1.0! This release integrates the 720+ new models from the latest Spark NLP 2.7.0+ releases. You can now achieve state-of-the-art results with Sequence2Sequence transformers on problems like text summarization, question answering, and translation between 192+ languages, and extract named entities in various right-to-left written languages like Korean, Japanese, Chinese, and many more in 1 line of code!
These new features are made possible by the integration of Google's T5 models and Microsoft's Marian transformer models.

NLU 1.1.0 has over 720+ new pretrained models and pipelines while extending the support of multi-lingual models to 192+ languages such as Chinese, Japanese, Korean, Arabic, Persian, Urdu, and Hebrew.

NLU 1.1.0 New Features

  • 720+ new models: you can find an overview of all NLU models here and further documentation in the models hub
  • NEW: Introducing MarianTransformer annotator for machine translation based on MarianNMT models. Marian is an efficient, free Neural Machine Translation framework mainly being developed by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages)
  • NEW: Introducing T5Transformer annotator for Text-To-Text Transfer Transformer (Google T5) models to achieve state-of-the-art results on multiple NLP tasks such as Translation, Summarization, Question Answering, Sentence Similarity, and so on
  • NEW: Introducing brand new and refactored language detection and identification models. The new LanguageDetectorDL is faster, more accurate, and supports up to 375 languages
  • NEW: Introducing WordSegmenter model for word segmentation of languages without any rule-based tokenization such as Chinese, Japanese, or Korean
  • NEW: Introducing DocumentNormalizer component for cleaning content from HTML or XML documents, applying either data cleansing with an arbitrary number of custom regular expressions or data extraction according to the configured parameters

NLU 1.1.0 New Notebooks, Tutorials and Articles

NLU 1.1.0 New Training Tutorials

Binary Classifier training Jupyter tutorials

Multi Class text Classifier training Jupyter tutorials

NLU 1.1.0 New Medium Tutorials

Translation

Translation example
You can translate between more than 192 language pairs with the Marian models. You need to specify the language your data is in as start_language and the language you want to translate to as target_language.
The language references must be ISO language codes.

nlu.load('<start_language>.translate_to.<target_language>')

Translate Turkish to English:
nlu.load('tr.translate_to.en')

Translate English to French:
nlu.load('en.translate_to.fr')

Translate French to Hebrew:
nlu.load('fr.translate_to.he')

translate_pipe = nlu.load('en.translate_to.de')
df = translate_pipe.predict('Billy likes to go to the mall every sunday')
df

| sentence | translation |
|---|---|
| Billy likes to go to the mall every sunday | Billy geht gerne jeden Sonntag ins Einkaufszentrum |

T5

Example of every T5 task

Overview of every task available with T5

The T5 model is trained on various datasets for 18 different tasks which fall into 8 categories.

  1. Text summarization
  2. Question answering
  3. Translation
  4. Sentiment analysis
  5. Natural Language inference
  6. Coreference resolution
  7. Sentence Completion
  8. Word sense disambiguation

Every T5 Task with explanation:

| Task Name | Explanation |
|---|---|
| 1. CoLA | Classify whether a sentence is grammatically correct |
| 2. RTE | Classify whether a statement can be deduced from a sentence |
| 3. MNLI | Classify for a hypothesis and premise whether they contradict each other, entail each other, or neither (3 classes). |
| 4. MRPC | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent) |
| 5. QNLI | Classify whether the answer to a question can be deduced from an answer candidate. |
| 6. QQP | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent) |
| 7. SST2 | Classify the sentiment of a sentence as positive or negative |
| 8. STSB | Score how similar two sentences are on a scale from 0 to 5 (21 similarity classes) |
| 9. CB | Classify for a premise and a hypothesis whether they contradict each other or not (binary). |
| 10. COPA | Classify for a question, premise, and 2 choices which choice is correct (binary). |
| 11. MultiRc | Classify for a question, a paragraph of text, and an answer candidate, whether the answer is correct (binary). |
| 12. WiC | Classify for a pair of sentences and an ambiguous word whether the word has the same meaning in both sentences. |
| 13. WSC/DPR | Predict what an ambiguous pronoun in a sentence refers to. |
| 14. Summarization | Summarize text into a shorter representation. |
| 15. SQuAD | Answer a question for a given context. |
| 16. WMT1 | Translate English to German |
| 17. WMT2 | Translate English to French |
| 18. WMT3 | Translate English to Romanian |

Refer to this notebook to see how to use every T5 task.

Question Answering

Question answering example

Predict an answer to a question based on input context.
This is based on SQuAD - Context based question answering

| Predicted Answer | Question | Context |
|---|---|---|
| carbon monoxide | What does increased oxygen concentrations in the patient’s lungs displace? | Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment. |
| pie | What did Joey eat for breakfast? | Once upon a time, there was a squirrel named Joey. Joey loved to go outside and play with his cousin Jimmy. Joey and Jimmy played silly games together, and were always laughing. One day, Joey and Jimmy went swimming together 50 at their Aunt Julie’s pond. Joey woke up early in the morning to eat some food before they left. Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast. After he ate, he and Jimmy went to the pond. On their way there they saw their friend Jack Rabbit. They dove into the water and swam for several hours. The sun was out, but the breeze was cold. Joey and Jimmy got out of the water and started walking home. Their fur was wet, and the breeze chilled them. When they got home, they dried off, and Jimmy put on his favorite purple shirt. Joey put on a blue shirt with red and green dots. The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed. |
# Set the task on T5
t5['t5'].setTask('question: ')


# define Data, add additional tags between sentences
data = ['''
What does increased oxygen concentrations in the patient’s lungs displace? 
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
''']


#Predict on text data with T5
t5.predict(data)

How to configure T5 task parameter for Squad Context based question answering and pre-process data

.setTask('question: ') and prefix the context, which can be made up of multiple sentences, with context:

Example pre-processed input for T5 Squad Context based question answering:

question: What does increased oxygen concentrations in the patient’s lungs displace? 
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
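The input layout above can be sketched with a small helper (hypothetical, not part of NLU; in NLU the 'question: ' prefix is normally added via setTask, it is inlined here only to show the final string):

```python
# Hypothetical helper assembling the SQuAD-style input string T5 consumes:
# the question first, then the context prefixed with 'context:'.
def to_t5_squad_input(question: str, context: str) -> str:
    return f"question: {question}\ncontext: {context}"

example = to_t5_squad_input(
    "What does increased oxygen concentrations in the patient's lungs displace?",
    "Hyperbaric (high-pressure) medicine uses special oxygen chambers ...",
)
```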

Text Summarization

Summarization example

Summarizes a paragraph into a shorter version with the same semantic meaning, based on Text summarization

# Load the T5 summarization pipeline
pipe = nlu.load('summarize')

# define Data, add additional tags between sentences
data = [
'''
The belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .
''',
'''  Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.'''
]


#Predict on text data with T5
pipe.predict(data)
| Predicted summary | Text |
|---|---|
| manchester united face newcastle in the premier league on wednesday . louis van gaal’s side currently sit two points clear of liverpool in fourth . the belgian duo took to the dance floor on monday night with some friends . | the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . |

Binary Sentence similarity/ Paraphrasing

Binary sentence similarity example. Classifies whether one sentence is a re-phrasing of or similar to another sentence.
This is a sub-task of GLUE and based on MRPC - Binary Paraphrasing/ sentence similarity classification

t5 = nlu.load('en.t5.base')
# Set the task on T5
t5['t5'].setTask('mrpc ')

# define Data, add additional tags between sentences
data = [
''' sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said .
sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 "
'''
,
'''  
sentence1: I like to eat peanutbutter for breakfast
sentence2: 	I like to play football.
'''
]

#Predict on text data with T5
t5.predict(data)

| Sentence1 | Sentence2 | prediction |
|---|---|---|
| We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , “ Rumsfeld said . | Rather , the US acted because the administration saw “ existing evidence in a new light , through the prism of our experience on September 11 “ . | equivalent |
| I like to eat peanutbutter for breakfast | I like to play football | not_equivalent |

How to configure T5 task for MRPC and pre-process text

.setTask('mrpc sentence1:') and prefix the second sentence with sentence2:

Example pre-processed input for T5 MRPC - Binary Paraphrasing/ sentence similarity

mrpc 
sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . 
sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11",
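The layout above can be sketched with a small helper (hypothetical, not part of NLU; in NLU the 'mrpc sentence1: ' prefix is normally applied via setTask):

```python
# Hypothetical helper showing the final MRPC input layout T5 receives.
def to_t5_mrpc_input(sentence1: str, sentence2: str) -> str:
    return f"mrpc sentence1: {sentence1} sentence2: {sentence2}"

pair = to_t5_mrpc_input(
    "I like to eat peanutbutter for breakfast",
    "I like to play football.",
)
```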

Regressive Sentence similarity/ Paraphrasing

Measures how similar two sentences are on a scale from 0 to 5, with 21 classes representing a regressive label.
This is a sub-task of GLUE and based on STSB - Regressive semantic sentence similarity.

t5 = nlu.load('en.t5.base')
# Set the task on T5
t5['t5'].setTask('stsb ') 

# define Data, add additional tags between sentences
data = [
             
              ''' sentence1:  What attributes would have made you highly desirable in ancient Rome?  
                  sentence2:  How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?'
              '''
             ,
             '''  
              sentence1: What was it like in Ancient rome?
              sentence2: 	What was Ancient rome like?
              ''',
              '''  
              sentence1: What was live like as a King in Ancient Rome??
              sentence2: 	What was Ancient rome like?
              '''

             ]



#Predict on text data with T5
t5.predict(data)

| Question1 | Question2 | prediction |
|---|---|---|
| What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | 0 |
| What was it like in Ancient rome? | What was Ancient rome like? | 5.0 |
| What was live like as a King in Ancient Rome?? | What is it like to live in Rome? | 3.2 |

How to configure T5 task for stsb and pre-process text

.setTask('stsb sentence1:') and prefix the second sentence with sentence2:

Example pre-processed input for T5 STSB - Regressive semantic sentence similarity

stsb
sentence1: What attributes would have made you highly desirable in ancient Rome?        
sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?',

Grammar Checking

Grammar checking with T5 example. Judges whether a sentence is grammatically acceptable.
Based on CoLA - Binary Grammatical Sentence acceptability classification

pipe = nlu.load('grammar_correctness')
# Set the task on T5
pipe['t5'].setTask('cola sentence: ')
# define Data
data = ['Anna and Mike is going skiing and they is liked is','Anna and Mike like to dance']
#Predict on text data with T5
pipe.predict(data)

| sentence | prediction |
|---|---|
| Anna and Mike is going skiing and they is liked is | unacceptable |
| Anna and Mike like to dance | acceptable |

Document Normalization

Document Normalizer example
The DocumentNormalizer extracts content from HTML or XML documents, applying either data cleansing with an arbitrary number of custom regular expressions or data extraction according to the configured parameters

pipe = nlu.load('norm_document')
data = '<!DOCTYPE html> <html> <head> <title>Example</title> </head> <body> <p>This is an example of a simple HTML page with one paragraph.</p> </body> </html>'
df = pipe.predict(data,output_level='document')
df

| text | normalized_text |
|---|---|
| <!DOCTYPE html> <html> <head> <title>Example</title> </head> <body> <p>This is an example of a simple HTML page with one paragraph.</p> </body> </html> | Example This is an example of a simple HTML page with one paragraph. |

Word Segmenter

Word Segmenter Example
The WordSegmenter segments languages without any rule-based tokenization such as Chinese, Japanese, or Korean

pipe = nlu.load('ja.segment_words')
# japanese for 'Donald Trump and Angela Merkel dont share many opinions'
ja_data = ['ドナルド・トランプとアンゲラ・メルケルは多くの意見を共有していません']
df = pipe.predict(ja_data, output_level='token')
df

token
ドナルド
トランプ
アンゲラ
メルケル
多く
意見
共有
ませ

Installation

# PyPi
!pip install nlu pyspark==2.4.7
#Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu

Additional NLU resources

NLU 1.0.6 Release Notes

Trainable Multi Label Classifiers, predict Stackoverflow Tags and much more in 1 Line of code with NLU 1.0.6

We are glad to announce NLU 1.0.6 has been released! NLU 1.0.6 comes with the Multi Label classifier, which can learn to map strings to multiple labels. The Multi Label Classifier uses Bidirectional GRUs and CNNs inside TensorFlow and supports up to 100 classes.

NLU 1.0.6 New Features

  • Multi Label Classifier
    • The Multi Label Classifier learns a 1 to many mapping between text and labels. This means it can predict multiple labels at the same time for a given input string. This is very helpful for tasks similar to content tag prediction (HashTags/RedditTags/YoutubeTags/Toxic/E2e etc..)
    • Support up to 100 classes
    • Pre-trained Multi Label Classifiers are already available as Toxic and E2E classifiers

Multi Label Classifier

By default Universal Sentence Encoder Embeddings (USE) are used as sentence embeddings for training.

fitted_pipe = nlu.load('train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)

If you add a nlu sentence embeddings reference, before the train reference, NLU will use that Sentence embeddings instead of the default USE.

#Train on BERT sentence emebddings
fitted_pipe = nlu.load('embed_sentence.bert train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)

Configure a custom label separator

# Use ';' as the label separator (note the parameter is spelled label_seperator)
fitted_pipe = nlu.load('embed_sentence.electra train.multi_classifier').fit(train_df, label_seperator=';')
preds = fitted_pipe.predict(train_df)
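For reference, a multi-label training dataframe for the snippet above could look like this (toy data with hypothetical labels; each row's labels are joined with the chosen separator):

```python
import pandas as pd

# Toy multi-label dataset: the 'y' column holds several labels per row,
# joined with ';' to match label_seperator=';'.
train_df = pd.DataFrame({
    'text': ['How do I merge two dicts in Python?',
             'Segfault when freeing a pointer twice in C'],
    'y':    ['python;dictionary', 'c;pointers;memory'],
})
```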

NLU 1.0.6 Enhancements

  • Improved outputs for Toxic and E2E Classifier.
    • by default, all predicted classes and their confidences which are above the threshold will be returned inside of a list in the Pandas dataframe
    • by configuring meta=True, the confidences for all classes will be returned.

NLU 1.0.6 New Notebooks and Tutorials

NLU 1.0.6 Bug-fixes

  • Fixed a bug that caused en.ner.dl.bert to be inaccessible
  • Fixed a bug that caused pt.ner.large to be inaccessible
  • Fixed a bug that caused USE embeddings to not be properly configured for document level output when using multiple embeddings at the same time

NLU 1.0.5 Release Notes

Trainable Part of Speech Tagger (POS), Sentiment Classifier with BERT/USE/ELECTRA sentence embeddings in 1 Line of code! Latest NLU Release 1.0.5

We are glad to announce NLU 1.0.5 has been released!
This release comes with a trainable Sentiment classifier and a Trainable Part of Speech (POS) models!
These Neural Network Architectures achieve the state of the art (SOTA) on most binary Sentiment analysis and Part of Speech Tagging tasks!
You can train the Sentiment Model on any of the 100+ Sentence Embeddings which include BERT, ELECTRA, USE, Multi Lingual BERT Sentence Embeddings and many more!
Leverage this and achieve the state of the art in any of your datasets, all of this in just 1 line of Python code

NLU 1.0.5 New Features

  • Trainable Sentiment DL classifier
  • Trainable POS

NLU 1.0.5 New Notebooks and Tutorials

Sentiment Classifier Training

Sentiment Classification Training Demo

To train the Binary Sentiment classifier model, you must pass a dataframe with a ‘text’ column and a ‘y’ column for the label.

By default Universal Sentence Encoder Embeddings (USE) are used as sentence embeddings.
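A minimal training dataframe can be built with plain pandas (toy data, hypothetical labels; any Pandas/Spark/Modin dataframe with these columns works):

```python
import pandas as pd

# Toy binary sentiment dataset: 'text' holds the input strings,
# 'y' holds the labels, as required by train.sentiment.
train_df = pd.DataFrame({
    'text': ['I love this movie', 'This was a terrible film',
             'What a great performance', 'I would not watch it again'],
    'y':    ['positive', 'negative', 'positive', 'negative'],
})
```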

fitted_pipe = nlu.load('train.sentiment').fit(train_df)
preds = fitted_pipe.predict(train_df)

If you add a nlu sentence embeddings reference, before the train reference, NLU will use that Sentence embeddings instead of the default USE.

#Train Classifier on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
#Train Classifier on ELECTRA sentence embeddings
fitted_pipe = nlu.load('embed_sentence.electra train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)

Part Of Speech Tagger Training

Part Of Speech Tagger Training demo

fitted_pipe = nlu.load('train.pos').fit(train_df)
preds = fitted_pipe.predict(train_df)

NLU 1.0.5 Installation changes

Starting from version 1.0.5, NLU will no longer automatically install pyspark for users.
This makes it easier to customize the PySpark version and to use NLU in various cluster environments.

To install NLU from now on, please run

pip install nlu pyspark==2.4.7 

or install any pyspark version with pyspark>=2.4.0 and pyspark<3

NLU 1.0.5 Improvements

  • Improved Databricks path handling for loading and storing models.

NLU 1.0.4 Release Notes

John Snow Labs NLU 1.0.4 : Trainable Named Entity Recognizer (NER) , achieve SOTA in 1 line of code and easy scaling to 100’s of Spark nodes

We are glad to announce that NLU 1.0.4 releases the state-of-the-art neural network architecture for NER: Char CNNs - BiLSTM - CRF!

#fit and predict in 1 line!
nlu.load('train.ner').fit(dataset).predict(dataset)


#fit and predict in 1 line with BERT!
nlu.load('bert train.ner').fit(dataset).predict(dataset)


#fit and predict in 1 line with ALBERT!
nlu.load('albert train.ner').fit(dataset).predict(dataset)


#fit and predict in 1 line with ELMO!
nlu.load('elmo train.ner').fit(dataset).predict(dataset)

Any NLU pipeline stored can now be loaded as pyspark ML pipeline

# Ready for big Data with Spark distributed computing
import pyspark
nlu_pipe.save(path)
pyspark_pipe = pyspark.ml.PipelineModel.load(stored_model_path)
pyspark_pipe.transform(spark_df)

NLU 1.0.4 New Features

NLU 1.0.4 New Notebooks,Tutorials and Docs

NLU 1.0.4 Bug Fixes

  • Fixed a bug where NER token confidences did not appear. They now appear when nlu.load('ner').predict(df, meta=True) is called.
  • Fixed a bug that caused some Spark NLP models to not be loaded properly in offline mode

1.0.3 Release Notes

We are happy to announce that NLU 1.0.3 comes with a lot of new features: training classifiers, saving them and loading them offline, enabling running NLU with no internet connection, and new notebooks and articles!

NLU 1.0.3 New Features

  • Train a Deep Learning classifier in 1 line! The popular ClassifierDL, which can achieve state-of-the-art results on any multi class text classification problem, is now trainable! All it takes is just nlu.load('train.classifier').fit(dataset). Your dataset can be a Pandas/Spark/Modin/Ray/Dask dataframe and needs to have a column named x for text data and a column named y for labels
  • Saving pipelines to HDD is now possible with nlu.save(path)
  • Loading pipelines from disk now possible with nlu.load(path=path).
  • NLU offline mode: Loading from disk makes running NLU offline now possible, since you can load pipelines/models from your local hard drive instead of John Snow Labs AWS servers.
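As a sketch, a minimal classifier training dataframe matching the x/y convention above could look like this (toy data, hypothetical labels; fitting requires nlu and Spark installed):

```python
import pandas as pd

# Toy multi-class dataset: column 'x' holds the text, column 'y' the label.
train_df = pd.DataFrame({
    'x': ['The stock market rallied today',
          'The team won the championship',
          'New smartphone released this week'],
    'y': ['business', 'sports', 'tech'],
})
# fitted_pipe = nlu.load('train.classifier').fit(train_df)  # needs nlu + Spark
```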

NLU 1.0.3 New Notebooks and Tutorials

NLU 1.0.3 Bug fixes

  • Sentence Detector bugfix

NLU 1.0.2 Release Notes

We are glad to announce nlu 1.0.2 is released!

NLU 1.0.2 Enhancements

  • More semantically concise output levels sentence and document enforced:
    • If a pipe is set to output_level='document':
      • Every Sentence Embedding will generate 1 embedding per document/row in the input dataframe, instead of 1 embedding per sentence.
      • Every Classifier will classify an entire document/row
      • Each row in the output DF is a 1 to 1 mapping of a row in the original input DF. 1 to 1 mapping from input to output.
    • If a pipe is set to output_level='sentence':
      • Every Sentence Embedding will generate 1 embedding per sentence,
      • Every Classifier will classify exactly one sentence
      • Each row in the output DF is mapped to one row in the input DF, but one row in the input DF can have multiple corresponding rows in the output DF. 1 to N mapping from input to output.
  • Improved generation of column names for classifiers. based on input nlu reference
  • Improved generation of column names for embeddings, based on input nlu reference
  • Improved automatic output level inference
  • Various test updates
  • Integration of CI pipeline with Github Actions
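
The document-level versus sentence-level mapping rules above can be sketched with a small helper (hypothetical, not part of the NLU API) that computes how many rows the output DataFrame will have:

```python
def expected_output_rows(sentence_counts, output_level):
    """Illustrative helper: given the number of sentences detected in each
    input row, return how many rows predict() emits at this output level."""
    if output_level == 'document':
        # 1-to-1: one output row per input document/row
        return len(sentence_counts)
    if output_level == 'sentence':
        # 1-to-N: one output row per detected sentence
        return sum(sentence_counts)
    raise ValueError(f'unknown output_level: {output_level}')

# Two input documents containing 2 and 3 sentences respectively
print(expected_output_rows([2, 3], 'document'))  # → 2
print(expected_output_rows([2, 3], 'sentence'))  # → 5
```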

New Documentation is out!

Check it out here: http://nlu.johnsnowlabs.com/

NLU 1.0.1 Release Notes

NLU 1.0.1 Bugfixes

  • Fixed a bug that caused NER pipelines to crash in NLU when an input string caused the NER model to predict without additional metadata

1.0 Release Notes

  • Automatic conversion of embeddings to NumPy arrays
  • Added various testing classes
  • New "6 embeddings at once" notebook with t-SNE and a Medium article
  • Integration of Spark NLP 2.6.2 enhancements and bugfixes: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.6.2
  • Updated old t-SNE notebooks with more elegant and simpler generation of t-SNE embeddings

0.2.1 Release Notes

  • Various bugfixes
  • Improved output column names when using multiple classifiers at once

0.2 Release Notes

  • Improved output column names for classifiers

0.1 Release Notes

We are glad to announce that NLU 0.1 has been released! NLU makes the 350+ models and annotators in Spark NLP's arsenal available in just 1 line of Python code, and it works with Pandas dataframes! A picture says more than a thousand words, so here is a demo clip of the 12 coolest features in NLU, all in just 1 line!

NLU in action

What does NLU 0.1 include?

  • NLU provides everything a data scientist might wish for in one line of code!
  • 350+ pre-trained models
  • 100+ of the latest NLP word embeddings (BERT, ELMo, ALBERT, XLNet, GloVe, BioBERT, ELECTRA, CovidBERT) and variations of them
  • 50+ of the latest NLP sentence embeddings (BERT, ELECTRA, USE) and variations of them
  • 50+ classifiers (NER, POS, Emotion, Sarcasm, Questions, Spam)
  • 40+ supported languages
  • Labeled and unlabeled dependency parsing
  • Various text cleaning and pre-processing methods like stemming, lemmatizing, normalizing, filtering, cleaning pipelines and more

NLU 0.1 Features Google Colab Notebook Demos

Last updated