OCR models overview

This page gives you an overview of every OCR model in NLU. All of them are provided by Spark OCR.

Additionally, you can refer to the OCR tutorial notebooks.

Overview of all OCR features

Overview of OCR Text Extractors
These models grab the text directly from your input file and return it as a Pandas DataFrame.

NLU Spell	Transformer Class
nlu.load('img2text')	ImageToText
nlu.load('pdf2text')	PdfToText
nlu.load('doc2text')	DocToText
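
When a batch of files mixes several types, the spell can be chosen from the file ending. The following is a minimal sketch and not part of NLU: `SPELL_BY_SUFFIX` and `pick_text_spell` are hypothetical names, and the suffix-to-spell mapping is an assumption based on the sample files used on this page.

```python
from pathlib import Path

# Assumed mapping from file ending to text-extractor spell.
# The endings come from the sample files on this page; extend as needed.
SPELL_BY_SUFFIX = {
    ".png": "img2text",
    ".pdf": "pdf2text",
    ".docx": "doc2text",
}


def pick_text_spell(path):
    """Return the spell name for a file path, or None if the ending is unknown."""
    suffix = Path(path).suffix.lower()
    return SPELL_BY_SUFFIX.get(suffix)
```

With NLU installed and authorized, the result could then be used as `nlu.load(pick_text_spell(p)).predict(p)` for a path `p` with a known ending.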

Overview of OCR Table Extractors
These models grab all table data from the detected files and return a list of Pandas DataFrames, one DataFrame for every table detected.

NLU Spell	Transformer Class
nlu.load('pdf2table')	PdfToTextTable
nlu.load('ppt2table')	PptToTextTable
nlu.load('doc2table')	DocToTextTable
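
Because a table spell returns one DataFrame per detected table, the result is typically handled in a loop. The sketch below assumes `tables` is the list returned by `predict()` and that each element is a Pandas DataFrame, as described above. `save_tables` and the `table_<n>.csv` naming scheme are hypothetical.

```python
from pathlib import Path


def save_tables(tables, out_dir):
    """Write each extracted table to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables):
        target = out_dir / f"table_{index}.csv"
        # DataFrame.to_csv is the standard Pandas export;
        # index=False leaves the row index out of the file
        table.to_csv(target, index=False)
        written.append(target)
    return written
```

A possible call would be `save_tables(nlu.load('pdf2table').predict('/path/to/sample.pdf'), 'tables_out')`.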

File Path handling for OCR Models

When your NLU pipeline contains an OCR spell, the predict() method accepts the following inputs:

a string pointing to a folder or to a file
a list, numpy array or Pandas Series containing paths pointing to folders or files
a Pandas DataFrame or Spark DataFrame containing a column named path, with one path entry per row pointing to a folder or file
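
To make the accepted input shapes concrete, here is a minimal sketch of how they could be flattened into a plain list of path strings. It illustrates the idea only and is not NLU's implementation. `to_path_list` is a hypothetical helper, and it covers strings, iterables and Pandas-style DataFrames; a Spark DataFrame would need to be collected first and is not handled here.

```python
def to_path_list(data):
    """Flatten the accepted input shapes into a plain list of path strings."""
    if isinstance(data, str):
        # A single string pointing to a folder or to a file
        return [data]
    if hasattr(data, "columns"):
        # DataFrame-like input: read the column named 'path'
        if "path" not in list(data.columns):
            raise ValueError("expected a column named 'path'")
        return [str(p) for p in data["path"]]
    # list, tuple, NumPy array or Pandas Series: iterate over the entries
    return [str(p) for p in data]
```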

For every path in the input passed to the predict() method, NLU distinguishes between two cases:

If the path points to a file, NLU applies the OCR transformers to it, provided the file type can be processed by the currently loaded OCR pipeline.
If the path points to a folder, NLU recursively searches the folder and its sub-folders for files whose types can be processed by the loaded OCR pipeline.

NLU checks the file endings, for example .pdf or .png, to determine whether the OCR models can be applied. If your files lack these endings, NLU will not process them.
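
The two cases and the file-ending check can be sketched in plain Python as follows. This is an illustration of the behavior described above, not NLU's own code. `collect_ocr_files` and `SUPPORTED_SUFFIXES` are hypothetical, and the set of endings is an assumption; the real set depends on the loaded OCR pipeline.

```python
from pathlib import Path

# Assumed set of file endings; the real set depends on the loaded pipeline.
SUPPORTED_SUFFIXES = frozenset({".pdf", ".png", ".docx"})


def collect_ocr_files(path, suffixes=SUPPORTED_SUFFIXES):
    """Return the files under `path` whose ending is in `suffixes`."""
    root = Path(path)
    if root.is_file():
        # Case 1: a single file is kept only if its ending is supported
        return [root] if root.suffix.lower() in suffixes else []
    if root.is_dir():
        # Case 2: search the folder and all of its sub-folders
        return sorted(
            p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in suffixes
        )
    # The path does not exist
    return []
```

Running such a check before `predict()` is one way to spot files that would be skipped because their ending is missing or unsupported.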

Image to Text

Sample image:

nlu.load('img2text').predict('path/to/haiku.png')

Output of IMG OCR:

text
“The Old Pond” by Matsuo Basho
An old silent pond
A frog jumps into the pond—
Splash! Silence again.

PDF to Text

Sample PDF:

nlu.load('pdf2text').predict('path/to/haiku.pdf')

Output of PDF OCR:

text
“Lighting One Candle” by Yosa Buson
The light of a candle
Is transferred to another candle—
Spring twilight

DOCX to Text

Sample DOCX:

nlu.load('doc2text').predict('path/to/haiku.docx')

Output of DOCX OCR:

text
“In a Station of the Metro” by Ezra Pound
The apparition of these faces in the crowd;
Petals on a wet, black bough.

PDF with Tables

Sample PDF:

nlu.load('pdf2table').predict('/path/to/sample.pdf')

Output of PDF Table OCR:

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear
21	6	160	110	3.9	2.62	16.46	0	1	4
21	6	160	110	3.9	2.875	17.02	0	1	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4
21.4	6	258	110	3.08	3.215	19.44	1	0	3
18.7	8	360	175	3.15	3.44	17.02	0	0	3
13.3	8	350	245	3.73	3.84	15.41	0	0	3
19.2	8	400	175	3.08	3.845	17.05	0	0	3
27.3	4	79	66	4.08	1.935	18.9	1	1	4
26	4	120.3	91	4.43	2.14	16.7	0	1	5
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5
15.8	8	351	264	4.22	3.17	14.5	0	1	5
19.7	6	145	175	3.62	2.77	15.5	0	1	5
15	8	301	335	3.54	3.57	14.6	0	1	5
21.4	4	121	109	4.11	2.78	18.6	1	1	4

DOCX with Tables

Sample DOCX:

nlu.load('doc2table').predict('/path/to/sample.docx')

Output of DOCX Table OCR:

Screen Reader	Responses	Share
JAWS	853	49%
NVDA	238	14%
Window-Eyes	214	12%
System Access	181	10%
VoiceOver	159	9%

PPT with Tables

Sample PPT with two tables:

nlu.load('ppt2table').predict('/path/to/sample.pptx')

Output of PPT Table OCR:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

and

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
6.7	3.3	5.7	2.5	virginica
6.7	3	5.2	2.3	virginica
6.3	2.5	5	1.9	virginica
6.5	3	5.2	2	virginica
6.2	3.4	5.4	2.3	virginica
5.9	3	5.1	1.8	virginica

Combine OCR and NLP models

Sample image containing named entities from the Wikipedia article on U.S. Presidents:

nlu.load('img2text ner').predict('path/to/presidents.png')

Output of image OCR and NER NLP:

entities_ner	entities_ner_class	entities_ner_confidence
Four	CARDINAL	0.9986
Abraham Lincoln	PERSON	0.705514
John F. Kennedy),	PERSON	0.966533
one	CARDINAL	0.9457
Richard Nixon,	PERSON	0.71895
John Tyler	PERSON	0.9929
first	ORDINAL	0.9811
The Twenty-fifth Amendment	LAW	0.548033
Constitution	LAW	0.9762
Tyler’s	CARDINAL	0.5329
1967	DATE	0.8926
Richard Nixon	PERSON	0.99515
first	ORDINAL	0.9588
Gerald Ford	PERSON	0.996
Spiro Agnew’s	PERSON	0.99165
1973	DATE	0.9438
Ford	PERSON	0.8337
second	ORDINAL	0.9119
Nelson Rockefeller	PERSON	0.98615
1967	DATE	0.589
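
The output above shows that entities extracted from OCR text can carry stray punctuation, such as `John F. Kennedy),`, and that confidence varies widely. The following sketch filters and cleans the PERSON entities. It assumes the result has been converted to a list of row dictionaries, for example with Pandas' `df.to_dict('records')`, and that the column names match the table above. `confident_persons` and the default threshold are hypothetical choices.

```python
def confident_persons(records, min_confidence=0.9):
    """Return cleaned PERSON entities at or above the confidence threshold."""
    persons = []
    for row in records:
        if row["entities_ner_class"] != "PERSON":
            continue
        if float(row["entities_ner_confidence"]) < min_confidence:
            continue
        # OCR text can leave punctuation attached, e.g. 'John F. Kennedy),'
        name = row["entities_ner"].strip(" ,;()")
        # Keep the first occurrence of each name, preserving order
        if name not in persons:
            persons.append(name)
    return persons
```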

Authorize NLU for OCR

You need a set of credentials to access the licensed OCR features. You can grab one here

Authorize anywhere by providing a JSON file

If you provide a JSON file with credentials, NLU checks whether it contains only OCR secrets or also Healthcare secrets. If both are present, NLU gives you access to the Healthcare and OCR features. If only one of them is present, you are authorized for that set of features only. You can specify the location of your secrets.json like this:

import nlu
path = '/path/to/secrets.json'
# 'licensed_model' and data are placeholders for your own spell and input
nlu.auth(path).load('licensed_model').predict(data)
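
Before calling `nlu.auth`, it can be useful to check which feature sets a secrets file appears to cover. The sketch below is an assumption-laden illustration, not NLU's own check: `available_features` is hypothetical, and the key names are modelled on the variable names used in the string-parameter example on this page, so a real license file may use different keys.

```python
import json

# Assumed key names; check them against your own secrets.json.
OCR_KEYS = {"OCR_SECRET", "OCR_LICENSE"}
HEALTHCARE_KEYS = {"JSL_SECRET", "SPARK_NLP_LICENSE"}


def available_features(secrets_path):
    """Report which feature sets a secrets file appears to cover."""
    with open(secrets_path, encoding="utf-8") as handle:
        secrets = json.load(handle)
    # Only count keys that have a non-empty value
    present = {key for key, value in secrets.items() if value}
    features = set()
    if OCR_KEYS <= present:
        features.add("ocr")
    if HEALTHCARE_KEYS <= present:
        features.add("healthcare")
    return features
```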

Authorize by providing string parameters

You can also enter your secrets manually to authorize NLU for the OCR and Healthcare features:

import nlu
AWS_ACCESS_KEY_ID = 'YOUR_SECRETS'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRETS'
OCR_SECRET = 'YOUR_SECRETS'
JSL_SECRET = 'YOUR_SECRETS'
OCR_LICENSE = 'YOUR_SECRETS'
SPARK_NLP_LICENSE = 'YOUR_SECRETS'
# This will automatically install the OCR library and the NLP Healthcare library when credentials are provided
nlu.auth(SPARK_NLP_LICENSE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, JSL_SECRET, OCR_LICENSE, OCR_SECRET)
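
To avoid hard-coding secrets in a script, the six values could be read from environment variables instead. This is a minimal sketch under stated assumptions: `secrets_from_env` is a hypothetical helper, and the environment variable names are chosen to match the variable names above rather than taken from NLU. The tuple order follows the `nlu.auth(...)` call shown above.

```python
import os

# Order matches the positional arguments of the nlu.auth(...) call above.
SECRET_NAMES = (
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "JSL_SECRET",
    "OCR_LICENSE",
    "OCR_SECRET",
)


def secrets_from_env(environ=os.environ):
    """Read the six secrets from environment variables, in nlu.auth order."""
    missing = [name for name in SECRET_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in SECRET_NAMES)
```

With all six variables set, the call would become `nlu.auth(*secrets_from_env())`.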
