How could I use machine learning to detect text and non-text regions in scanned documents?

Artificial Intelligence · Asked on December 22, 2021

I have a collection of scanned documents (from newspapers, books, and magazines) with complex text layouts, i.e. the text can appear at any angle with respect to the page. I can do a lot of preprocessing for feature extraction, but I would like to know about robust methods that do not require many hand-crafted features.

Can machine learning be helpful for this purpose? How could I use machine learning to detect text and non-text regions in these scanned documents?

2 Answers

Since the document is scanned, it will not be in an open document format, so no associated API can be used directly.

Approach 1

Evaluate TextBridge Pro, FreeOCR, and other alternatives that claim to support page-layout detection. If any of them work, drive them programmatically (preferably headless) to read the scanned document, detect the page layout, OCR the text, and export to a document in an open format; then use that format's API.

With this approach, the object-recognition AI is built into the product, and development time and resources are saved.
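As a minimal sketch of that workflow, using the Tesseract CLI as a scriptable stand-in (the products named above are GUI-oriented, so treat this as illustrative): export the page to hOCR, an open HTML-based format, then read the detected layout back through a standard XML API. "page.png" is a placeholder file name.

    # Sketch: drive a headless OCR tool (the Tesseract CLI here, standing in
    # for any scriptable product) and read its open-format output via an XML API.
    import subprocess
    from xml.etree import ElementTree

    # "tesseract page.png page hocr" writes page.hocr (hOCR, an open,
    # HTML-based format); "page.png" is an assumed input file name.
    subprocess.run(["tesseract", "page.png", "page", "hocr"], check=True)

    tree = ElementTree.parse("page.hocr")
    for element in tree.iter():
        if element.get("class") in ("ocr_carea", "ocr_par"):  # layout blocks
            # hOCR stores coordinates as "bbox x0 y0 x1 y1" in the title
            # attribute, before any semicolon-separated extras.
            bbox = element.get("title", "").split(";")[0].replace("bbox ", "")
            print(element.get("class"), bbox)

The same pattern applies to any OCR product that can be driven headless and can export an open format.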

Approach 2

Do a windowed 2D FFT, sliding through the page in both directions. Look at the cosine, trapezoidal, Hamming, and Hann windows, and apply them in the horizontal and vertical directions. Use Approach 1 (assuming those products work with the scanned documents) to label the examples, and then train a DCNN (deep convolutional neural network) to recognize from the 2D FFT output spectra where the pictures are. By interpolation, a close-to-perfect crop of the image and text regions can be obtained with some hyper-parameter tuning of the resulting model.
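A minimal sketch of the windowed 2D FFT feature extraction, assuming NumPy; the tile size and stride are illustrative, and each spectrum would become one training input for the DCNN:

    # Sketch: slide a windowed tile across the page and compute 2D FFT
    # magnitude spectra as input features for a downstream DCNN.
    # The tile size and stride are illustrative, not tuned values.
    import numpy as np

    def fft_spectra(page, tile=64, stride=32):
        """Yield (row, col, log-magnitude spectrum) for each windowed tile.

        page is a 2D float array (a grayscale scan). The 2D Hamming window
        is the outer product of two 1D Hamming windows, so the taper is
        applied in both the horizontal and vertical directions.
        """
        window = np.outer(np.hamming(tile), np.hamming(tile))
        for r in range(0, page.shape[0] - tile + 1, stride):
            for c in range(0, page.shape[1] - tile + 1, stride):
                patch = page[r:r + tile, c:c + tile] * window
                spectrum = np.fft.fftshift(np.fft.fft2(patch))
                yield r, c, np.log1p(np.abs(spectrum))

    # Usage: stack the spectra into a training batch for the DCNN.
    page = np.random.rand(256, 256)  # stand-in for a real grayscale scan
    features = np.stack([s for _, _, s in fft_spectra(page)])

Printed text tends to produce strong periodic energy near the line- and character-spacing frequencies, which is the kind of signal the DCNN can learn to separate from picture regions.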

Approach 3

This approach is just Approach 2, but with the labeled example data set prepared by hand. That may be necessary because the existing software products may not handle layouts rotated at angles other than 0, 90, 180, or 270 degrees.
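Hand-labeling every rotation is expensive, so one way to stretch a small hand-labeled set is rotation augmentation. A minimal sketch with Pillow, where the angle range and copy count are assumptions:

    # Sketch: augment a small hand-labeled set with arbitrary rotations so
    # the model sees text at angles other than 0, 90, 180, or 270 degrees.
    import random
    from PIL import Image

    def rotated_copies(patch, n=8):
        """Return n copies of a labeled patch at random angles.

        The text / non-text label is rotation-invariant, so every copy
        keeps the original patch's label. fillcolor=255 pads the corners
        with white, matching a typical scan background.
        """
        return [
            patch.rotate(random.uniform(0.0, 360.0), expand=True, fillcolor=255)
            for _ in range(n)
        ]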

Approach 4

Create an architecture based on feature extraction, and use font-rendering libraries to build the back half (the decoder) of an auto-encoder. Portions of the image that do not auto-encode well can be preserved as x-y coordinate pairs, which lets the pipeline skip over the pictures if convergence is set up correctly.
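A minimal sketch of that idea, assuming Pillow for the font rendering and Keras for the auto-encoder; the font path, patch size, training set, and error threshold are all placeholders rather than anything prescribed above:

    # Sketch: synthesize text patches with a font-rendering library, train a
    # small auto-encoder on them, and treat patches that reconstruct poorly
    # as non-text (picture) regions to be recorded by their x-y coordinates.
    import numpy as np
    from PIL import Image, ImageDraw, ImageFont
    from tensorflow import keras

    PATCH = 32

    def render_text_patch(text):
        """Render a grayscale text patch; "DejaVuSans.ttf" is an assumed path."""
        img = Image.new("L", (PATCH, PATCH), color=255)
        ImageDraw.Draw(img).text(
            (2, 2), text, fill=0, font=ImageFont.truetype("DejaVuSans.ttf", 24)
        )
        return np.asarray(img, dtype=np.float32) / 255.0

    # Tiny convolutional auto-encoder; the decoder is the "back half" that
    # learns to reproduce rendered glyphs.
    inp = keras.Input(shape=(PATCH, PATCH, 1))
    x = keras.layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = keras.layers.MaxPooling2D()(x)
    x = keras.layers.Conv2D(8, 3, activation="relu", padding="same")(x)
    x = keras.layers.UpSampling2D()(x)
    out = keras.layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
    autoencoder = keras.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")

    # Illustrative training set; a real one would vary fonts, sizes, angles.
    patches = np.stack([render_text_patch(c) for c in "abcdefgh"])[..., None]
    autoencoder.fit(patches, patches, epochs=5, verbose=0)

    def is_text_patch(patch, threshold=0.02):
        """Flag a patch as text if the auto-encoder reconstructs it well."""
        recon = autoencoder.predict(patch[None, ..., None], verbose=0)[0, ..., 0]
        return float(np.mean((recon - patch) ** 2)) < threshold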

Final Note

One can offload some processing to a learning process so that the actual document processing runs faster, but sometimes the preparation of the example data set and the learning itself consume more resources than they save. That is why people who can assess which approach will cost less, and recommend the best one with some reliability, are highly paid.

Answered by Douglas Daseeco on December 22, 2021

TextDetector, Tesseract, and other open-source packages implement text detection (object detection for text). There is also a pretrained TensorFlow model that does text detection. A text detector will give you the bounding boxes in your image for any text that it recognizes; in the case of Tesseract, it will also output the text itself (OCR is built in). So you can read the code in these packages to get ideas for your own machine learning pipeline. Basically, you need both a regressor (for the bounding boxes) and a classifier (to detect whether a box contains text or not). The sketch below shows the Tesseract route.
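A minimal sketch with the pytesseract wrapper (assuming the tesseract binary and the pytesseract package are installed; "scan.png" is a placeholder file name):

    # Sketch: word-level text detection with Tesseract via pytesseract.
    # Requires the tesseract binary and the pytesseract package installed.
    import pytesseract
    from PIL import Image

    page = Image.open("scan.png")  # placeholder input file name
    data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

    for i, word in enumerate(data["text"]):
        # conf is -1 for structural (non-word) entries; keep confident words.
        if word.strip() and float(data["conf"][i]) > 60:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            print(box, word)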

Answered by hobs on December 22, 2021
