Looking for an OCR library for WinForms

Software Recommendations Asked by KSA on September 25, 2021

Is there any free or paid OCR library that can capture invoice data from PDF files?
It needs to have a low error rate.
We need to take that data and do some further processing.

3 Answers

Take a look at this article: https://bitmiracle.com/blog/ocr-pdf-in-net

Basically, you need 2 tools:

  1. A PDF library to convert PDF pages to images. There are many paid .NET libraries for that. If you are looking for free tools, look at C/C++ libraries like Ghostscript, Xpdf, or MuPDF.
  2. An OCR engine to recognize text in images. Tesseract (free and open source) is the leader here.
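
As a rough sketch of those two steps using only the free tools, the Ghostscript and Tesseract command-line programs can be driven from C#. This is not from the linked article. It assumes both programs are installed and on the PATH, and that the Ghostscript console executable is named gswin64c (its usual name on 64-bit Windows). The class and method names are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

public static class FreeToolsOcrSketch
{
    // Step 1 arguments: Ghostscript renders each PDF page to a PNG at the given DPI.
    // outputPattern should contain %03d, which Ghostscript replaces with the page number.
    public static string BuildGhostscriptArgs(string pdfPath, string outputPattern, int dpi)
    {
        return $"-dNOPAUSE -dBATCH -sDEVICE=png16m -r{dpi} \"-sOutputFile={outputPattern}\" \"{pdfPath}\"";
    }

    // Step 2 arguments: "stdout" as the output base makes Tesseract print the recognized text.
    public static string BuildTesseractArgs(string imagePath, string language)
    {
        return $"\"{imagePath}\" stdout -l {language}";
    }

    // Runs an external tool and returns whatever it wrote to standard output.
    private static string Run(string exe, string args)
    {
        var startInfo = new ProcessStartInfo(exe, args)
        {
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException($"{exe} exited with code {process.ExitCode}");
            return output;
        }
    }

    // Renders the PDF to page images in workDir, then OCRs them in page order.
    // The executable names are assumptions; adjust them to your installation.
    public static string OcrPdf(string pdfPath, string workDir,
                                string gsExe = "gswin64c", string tesseractExe = "tesseract")
    {
        Directory.CreateDirectory(workDir);
        Run(gsExe, BuildGhostscriptArgs(pdfPath, Path.Combine(workDir, "page_%03d.png"), 300));

        string[] images = Directory.GetFiles(workDir, "page_*.png");
        Array.Sort(images, StringComparer.Ordinal); // zero-padded names sort in page order

        var text = new StringBuilder();
        foreach (string image in images)
        {
            text.Append(Run(tesseractExe, BuildTesseractArgs(image, "eng")));
            File.Delete(image);
        }
        return text.ToString();
    }
}
```

Calling FreeToolsOcrSketch.OcrPdf("Partner.pdf", "ocr-temp") would return the recognized text of all pages. Unlike a PDF library, this approach OCRs every page, including pages that already contain searchable text.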

Here is the sample code from the article above, which uses Tesseract with the paid Docotic.Pdf library:

using System;
using System.IO;
using System.Text;
using BitMiracle.Docotic.Pdf;
using Tesseract;

namespace OCR
{
    public static class OcrAndExtractText
    {
        public static void Main()
        {
            // BitMiracle.Docotic.LicenseManager.AddLicenseData("temporary or permanent license key here");
        
            var documentText = new StringBuilder();
            using (var pdf = new PdfDocument("Partner.pdf"))
            {
                using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
                {
                    for (int i = 0; i < pdf.PageCount; ++i)
                    {
                        if (documentText.Length > 0)
                            documentText.Append("\r\n\r\n");

                        PdfPage page = pdf.Pages[i];
                        string searchableText = page.GetText();

                        // Simple check if the page contains searchable text.
                        // We do not need to perform OCR in that case.
                        if (!string.IsNullOrEmpty(searchableText.Trim()))
                        {
                            documentText.Append(searchableText);
                            continue;
                        }

                        // This page is not searchable.
                        // Save the page as a high-resolution image
                        PdfDrawOptions options = PdfDrawOptions.Create();
                        options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                        options.HorizontalResolution = 300;
                        options.VerticalResolution = 300;

                        string pageImage = $"page_{i}.png";
                        page.Save(pageImage, options);

                        // Perform OCR
                        using (Pix img = Pix.LoadFromFile(pageImage))
                        {
                            using (Page recognizedPage = engine.Process(img))
                            {
                                Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");

                                string recognizedText = recognizedPage.GetText();
                                documentText.Append(recognizedText);
                            }
                        }
                        
                        File.Delete(pageImage);
                    }
                }
            }

            using (var writer = new StreamWriter("result.txt"))
                writer.Write(documentText.ToString());
        }
    }
}
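
Since a low error rate matters for the question, the mean confidence printed above could also be used to flag pages for manual review. The following is a minimal sketch and not part of the article's sample. It assumes the values come from GetMeanConfidence(), which to my knowledge returns a value between 0 and 1 in the Tesseract .NET wrapper. The helper name and the 0.80 threshold are arbitrary choices for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class OcrReviewSketch
{
    // Returns the zero-based indexes of pages whose mean confidence is below the threshold.
    public static List<int> PagesNeedingReview(IReadOnlyList<float> meanConfidences, float threshold)
    {
        var flagged = new List<int>();
        for (int i = 0; i < meanConfidences.Count; ++i)
        {
            if (meanConfidences[i] < threshold)
                flagged.Add(i);
        }
        return flagged;
    }
}
```

To use it, collect one confidence value per page inside the loop (pages with searchable text can be recorded as 1.0f) and call OcrReviewSketch.PagesNeedingReview(confidences, 0.80f) after the loop.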

Answered by Vitaliy Shibaev on September 25, 2021

Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images is converted to a searchable and selectable document.

You can get the data from an invoice PDF or image using the OCR processor in our Essential PDF. Please refer to the link below for more details:
https://www.syncfusion.com/blogs/post/optical-character-recognition-in-pdf-using-tesseract-open-source-engine.aspx

The OCR processor product setup and the required NuGet package are both available for download from Syncfusion.

Note: Essential PDF supports running OCR on PDF documents and images on the ASP.NET Core platform.

The following code demonstrates how to get the OCR'ed text from an existing invoice document:

//Initialize the OCR processor by providing the path of tesseract binaries
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries"))
{
    //Load a PDF document
    PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");

    //Set OCR language to process
    processor.Settings.Language = Languages.English;

    //Process OCR by providing the PDF document and Tesseract data
    string extractedText = processor.PerformOCR(lDoc, @"TessData");

    //Save the OCR processed PDF document in the disk
    lDoc.Save("Sample.pdf");
    lDoc.Close(true);
}
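
For the further processing mentioned in the question, text like extractedText above can be searched for invoice fields with regular expressions. This is only a sketch using the standard .NET Regex class and is not part of Essential PDF. The labels it looks for ("Invoice No", "Total") and the class name are hypothetical, so the patterns would need adapting to the real invoice layout.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvoiceFieldSketch
{
    // Returns the value after a label such as "Invoice No:" or "Invoice #", or null if none is found.
    public static string FindInvoiceNumber(string ocrText)
    {
        Match m = Regex.Match(ocrText,
            @"\bInvoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the amount after a standalone "Total" label, or null if none is found.
    // The word boundary keeps "Subtotal" from matching.
    public static decimal? FindTotal(string ocrText)
    {
        Match m = Regex.Match(ocrText,
            @"\bTotal\b\s*[:\-]?\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)",
            RegexOptions.IgnoreCase);
        if (!m.Success)
            return null;
        return decimal.Parse(m.Groups[1].Value, NumberStyles.Number, CultureInfo.InvariantCulture);
    }
}
```

OCR output often contains misread characters (for example O instead of 0), so values extracted this way are worth validating before they are used.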

Note: I am working for Syncfusion.

Answered by Sowmiya on September 25, 2021

The LEADTOOLS toolkit is a professional SDK that provides the ability to recognize multiple field types using OCR for detection and extraction.

If the invoice image or document you are extracting from has an organized and defined structure, you can use the LEADTOOLS Forms Recognition and Processing features (https://www.leadtools.com/sdk/forms) to create a single template with multiple data fields defined in it, and then use that template to extract data from multiple filled forms. (Disclaimer: I am an employee of the vendor of this toolkit.)

The code to extract text from an invoice master form template with defined fields would look like this:

using (RasterCodecs codecs = new RasterCodecs())
{
    string masterFormRepository = @"Invoice Master Form Path";
    string filledFormDirectory = @"Filled Invoice Path";
    using (IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
    {
        ocrEngine.Startup(codecs, null, null, null);
        IMasterFormsRepository repository = new DiskMasterFormsRepository(codecs, masterFormRepository);
        using (AutoFormsEngine engine = new AutoFormsEngine(repository, ocrEngine, null, AutoFormsRecognitionManager.Default | AutoFormsRecognitionManager.Ocr, 30, 80, false))
        {
            foreach (var file in Directory.EnumerateFiles(filledFormDirectory))
            {
                using (RasterImage image = codecs.Load(file))
                {
                    //Run the recognition
                    AutoFormsRunResult runResult = engine.Run(image, null, null, null);
                }
            }
        }
    }
}

Answered by Hussam Barouqa on September 25, 2021
