TransWikia.com

Extracting structure and content from invoices

Data Science Asked by Don Draper on June 20, 2021

Lately, I have been inspired by https://rossum.ai/, which is able to extract text from invoice documents.

Do you have any ideas on how this could be implemented? It’s clear that they did a lot of research to reach this performance level, but in my case I am interested in the overall approach to such problems.

If I understand correctly, the first stage of the pipeline extracts the different blocks from the document. In that case, is object detection the right approach for getting bounding boxes around the blocks? I suspect it would not be very good at extracting tabular data.

If not object detection, what is the correct way to tackle the problem?

Thanks.

One Answer

I think commercial applications that extract details from invoices certainly involve sophisticated algorithms. You may be right that they identify the relevant regions first and extract the details afterwards.

However, my starting point would be to get all the text from the invoice with an OCR engine such as Tesseract. Given a reasonably sharp, well-lit photo or scan, Tesseract should be able to read the content. The next step is to identify the relevant fields, such as payment amount, names, and bank account numbers. Hardcoded rules (for example, regular expressions) can do this to some extent. Alternatively, one could use NLP models to detect certain sequences. With some effort, this should work reasonably well, since invoices are relatively structured documents.
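As a rough illustration of the OCR-then-rules idea, here is a minimal sketch in Python. It assumes pytesseract and Pillow are installed along with the Tesseract binary. The field names, the regular expressions, the sample text, and the assumption that amounts use "," for thousands and "." for decimals are all made up for illustration; real invoices would need patterns tuned to their layout and locale.

```python
import re
from decimal import Decimal
from typing import Optional

# Illustrative rules only (assumptions, not a general solution).
INVOICE_NO_RE = re.compile(r"(?im)^\s*invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*(\S+)")
# Anchored at the line start so that "Subtotal" is not mistaken for "Total".
TOTAL_RE = re.compile(r"(?im)^\s*(?:total|amount\s+due)\b[^\d\n]*?(\d[\d,]*\.\d{2})")
# Loose IBAN shape: country code, two check digits, then groups of up to four characters.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def ocr_invoice(image_path: str) -> str:
    """Run Tesseract on an image file and return the raw text."""
    # Imported lazily so the parsing rules below work without OCR installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def _first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def extract_fields(text: str) -> dict:
    """Apply the hardcoded rules to OCR text; missing fields come back as None."""
    total_raw = _first_group(TOTAL_RE, text)
    iban_match = IBAN_RE.search(text)
    return {
        "invoice_number": _first_group(INVOICE_NO_RE, text),
        # Assumes "," is a thousands separator and "." the decimal point.
        "total": Decimal(total_raw.replace(",", "")) if total_raw else None,
        "iban": iban_match.group(0).replace(" ", "") if iban_match else None,
    }


# Hypothetical OCR output, standing in for ocr_invoice("invoice.png").
SAMPLE_TEXT = """ACME Corp
Invoice No: INV-2021-0042
Subtotal: 1,000.00
Total: EUR 1,234.56
IBAN: DE89 3704 0044 0532 0130 00
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Rules like these are brittle against OCR errors and unusual layouts, which is where a learned model would likely take over. Still, they give a quick baseline to measure anything more elaborate against.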

https://pypi.org/project/pytesseract/

https://github.com/tesseract-ocr/tesseract/wiki

Answered by Peter on June 20, 2021
