error when using syncfusion.pdf.ocr

Question

I am attempting to OCR a PDF document in a .NET MAUI application written in C#. When executing the following code, Syncfusion.Pdf throws an exception and I am not sure how to proceed. Any clues would be appreciated!

Code:

private async void OCRFileButton_Clicked(object sender, EventArgs e)
{
    try
    {
        var basePath = AppDomain.CurrentDomain.BaseDirectory;
        using (OCRProcessor processor = new OCRProcessor(basePath + @"TesseractBinaries\"))
        {
            FileStream stream = new FileStream(sourcePdfPath, FileMode.Open, FileAccess.Read);
            PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(stream);
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Process OCR by providing the PDF document.
            processor.PerformOCR(pdfLoadedDocument, basePath + @"Tessdata\");
        }
        SaveFileButton.IsEnabled = true;
    }
    catch (Exception ex)
    {
        await Toast.Make($"Could not OCR this PDF file. Error: " + ex.Message).Show(CancellationToken.None);
    }
}

Exception:

Syncfusion.Pdf.PdfException: Exception has been thrown by the target of an invocation.
   at Syncfusion.OCRProcessor.OCRProcessor.ProcessOCR(String[] args, String[] imagePathList, Int32& orientation)
   at Syncfusion.OCRProcessor.OCRProcessor.GetHOCR(String imagePath, String dataPath, Boolean multiPageTiff, String[] imagePathList)
   at Syncfusion.OCRProcessor.OCRProcessor.PerformOCR(PdfLoadedDocument lDoc, Int32 startIndex, Int32 endIndex, String dataPath)
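
The message "Exception has been thrown by the target of an invocation." is the default text of System.Reflection.TargetInvocationException, which normally wraps the real failure. A minimal sketch for surfacing the whole chain in the toast instead of only the outer message; it assumes the PdfException keeps the underlying error in InnerException, which this thread does not confirm. ExceptionDetails and Flatten are hypothetical names.

```csharp
using System;
using System.Text;

public static class ExceptionDetails
{
    // Joins the type and message of an exception with those of every nested
    // InnerException, outermost first.
    public static string Flatten(Exception ex)
    {
        var sb = new StringBuilder();
        for (var current = ex; current != null; current = current.InnerException)
        {
            if (sb.Length > 0)
            {
                sb.Append(" --> ");
            }
            sb.Append(current.GetType().Name).Append(": ").Append(current.Message);
        }
        return sb.ToString();
    }
}
```

In the catch block above, ExceptionDetails.Flatten(ex) could be shown in place of ex.Message.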

Karmegam Seerangan · Accepted Answer

We have checked the provided documents on our end. Our OCR processor library internally extracts the images from the PDF document and then sends those images to Tesseract. The provided document does not contain any images: some of its text is selectable, while the rest is embedded within graphics, which is the cause of the problem you are experiencing.

However, we have a workaround for this issue: convert each PDF page to an image and then send that image to Tesseract. Please follow the documentation below.

Documentation: How to Extract the Text from Image Free PDF Documents Using WinForms OCR Processor? | Syncfusion
NuGet: NuGet Gallery | Syncfusion.Maui.PdfToImageConverter 25.1.35

Kindly try the provided solution and let us know the result.

Karmegam Seerangan · Answer

Hi Edward,

Thank you for reaching out to Syncfusion support.

The Syncfusion .NET Optical Character Recognition (OCR) library extracts text from scanned PDFs and images. It uses the Tesseract OCR engine. The Syncfusion OCR library does not work with the Tesseract engine on mobile platforms, so starting from version 20.3.0.47 we added support for using any external OCR service, such as Azure Cognitive Services OCR, with our existing OCR library to perform OCR on mobile platforms. Please refer to the documentation on using the ExternalOCREngine in a MAUI application:
https://www.syncfusion.com/blogs/post/ocr-in-net-maui-building-an-image-processing-application.aspx

Regards,
Karmegam

Edward Alexander · Answer

Can the Tesseract engine be used with .NET MAUI on Windows desktop?

Karmegam Seerangan · Answer

Yes, our OCR processor is supported in .NET MAUI on Windows desktop. We have attached a .NET MAUI sample for your reference.

Sample: https://www.syncfusion.com/downloads/support/directtrac/general/ze/OCRMaui2127323075

Kindly try the sample and let us know the result.

Edward Alexander · Answer

I was able to get this working by copying the files at runtime to a location I can access through a common path, both in development and when installed via MSIX (from the Store):

public async Task<bool> CopyTesseractFiles()
{
    if (!filesCopied)
    {
        string directoryName = Path.GetDirectoryName(new Uri(Assembly.GetExecutingAssembly().CodeBase).PathAndQuery);
        var localFolder = FileSystem.AppDataDirectory;
        // Create <AppData>\tessdata and <AppData>\win-x64\native.
        var tessdataFolder = System.IO.Path.Combine(localFolder, "tessdata");
        System.IO.Directory.CreateDirectory(tessdataFolder);
        var tesseractFolder = System.IO.Path.Combine(localFolder, "win-x64");
        System.IO.Directory.CreateDirectory(tesseractFolder);
        tesseractFolder = System.IO.Path.Combine(tesseractFolder, "native");
        System.IO.Directory.CreateDirectory(tesseractFolder);
        ShowStatusMessage = true;
        // No "repos" in the path: treated as the installed (MSIX) build.
        if (directoryName.IndexOf("repos") < 0)
        {
            string path = directoryName.Replace("%20", " ");
            var sourceFilePath = Path.Combine(path, "runtimes\\tessdata\\eng.traineddata");
            var destinationFilePath = System.IO.Path.Combine(tessdataFolder, "eng.traineddata");
            try
            {
                System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            }
            catch (Exception ex)
            {
                StatusMessage = "Error: " + ex.ToString();
                await Task.Delay(1);
                return false;
            }
            sourceFilePath = Path.Combine(path, "runtimes\\win-x64\\native\\leptonica-1.80.0.dll");
            destinationFilePath = System.IO.Path.Combine(tesseractFolder, "leptonica-1.80.0.dll");
            try
            {
                System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            }
            catch (Exception ex)
            {
                StatusMessage = "Error: " + ex.ToString();
                await Task.Delay(1);
                return false;
            }
            sourceFilePath = Path.Combine(path, "runtimes\\win-x64\\native\\libSyncFusionTesseract.dll");
            destinationFilePath = System.IO.Path.Combine(tesseractFolder, "libSyncFusionTesseract.dll");
            try
            {
                System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            }
            catch (Exception ex)
            {
                StatusMessage = "Error: " + ex.ToString();
                await Task.Delay(1);
                return false;
            }
            filesCopied = true;
            return true;
        }
        else
        {
            // Development build: the runtimes folder sits five levels above the output directory.
            string path = Path.GetFullPath(Path.Combine(directoryName, "../../../../../"));
            var sourceFilePath = Path.Combine(path, "runtimes/tessdata/eng.traineddata");
            var destinationFilePath = System.IO.Path.Combine(tessdataFolder, "eng.traineddata");
            System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            sourceFilePath = Path.Combine(path, "runtimes/win-x64/native/leptonica-1.80.0.dll");
            destinationFilePath = System.IO.Path.Combine(tesseractFolder, "leptonica-1.80.0.dll");
            System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            sourceFilePath = Path.Combine(path, "runtimes/win-x64/native/libSyncFusionTesseract.dll");
            destinationFilePath = System.IO.Path.Combine(tesseractFolder, "libSyncFusionTesseract.dll");
            System.IO.File.Copy(sourceFilePath, destinationFilePath, overwrite: true);
            filesCopied = true;
            return true;
        }
    }
    else
    {
        // Files were already copied earlier in this session.
        filesCopied = true;
        return true;
    }
}
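
The copy-and-catch blocks above repeat the same pattern for each file. A minimal sketch of folding them into one helper that uses only System.IO. RuntimeFileCopier and TryCopyAll are hypothetical names, and the sketch assumes the files sit under one source root with the same relative layout that is wanted under the destination root, as in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RuntimeFileCopier
{
    // Copies each relative path from sourceRoot to the same relative location
    // under destinationRoot, creating directories as needed and overwriting
    // existing files. Stops at the first failure and reports it through error.
    public static bool TryCopyAll(
        string sourceRoot,
        string destinationRoot,
        IEnumerable<string> relativePaths,
        out string error)
    {
        error = string.Empty;
        foreach (var relativePath in relativePaths)
        {
            try
            {
                var source = Path.Combine(sourceRoot, relativePath);
                var destination = Path.Combine(destinationRoot, relativePath);
                var directory = Path.GetDirectoryName(destination);
                if (!string.IsNullOrEmpty(directory))
                {
                    Directory.CreateDirectory(directory);
                }
                File.Copy(source, destination, overwrite: true);
            }
            catch (Exception ex)
            {
                error = ex.ToString();
                return false;
            }
        }
        return true;
    }
}
```

With the layout from this thread, sourceRoot would be the runtimes folder, destinationRoot would be FileSystem.AppDataDirectory, and the relative paths would be tessdata/eng.traineddata plus the two DLLs under win-x64/native.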

Then to OCR:

private async Task<bool> OCRFile()
{
    await Task.Delay(1000);
    string path = FileSystem.AppDataDirectory;
    try
    {
        using (OCRProcessor processor = new OCRProcessor())
        {
            //Assembly assembly = typeof(MainPage).GetTypeInfo().Assembly;
            FileStream inputStream = new FileStream(sourcePdfPath, FileMode.Open, FileAccess.Read);
            //Load an existing PDF document.
            PdfLoadedDocument workingDocument = new PdfLoadedDocument(inputStream);
            processor.TessDataPath = Path.Combine(path, @"tessdata\");
            processor.TesseractPath = Path.Combine(path, @"win-x64\native\");
            //Set OCR language.
            processor.Settings.Language = Languages.English;
            processor.Settings.PageSegment = PageSegMode.AutoOsd;
            processor.Settings.IsImageStraighteningEnabled = true;
            //Perform OCR with input document and tessdata (Language packs).
            processor.PerformOCR(workingDocument);
            //Saves the PDF to the memory stream.
            using MemoryStream ms = new();
            workingDocument.Save(ms);
            //Close the PDF document
            workingDocument.Close(true);
            ms.Position = 0;
            // Save the OCR PDF to the Cache directory
            var fileName = "OCRFile.pdf";
            var filePath = Path.Combine(FileSystem.CacheDirectory, fileName);
            //PdfViewer.DocumentSource = null;
            document.Close(true);
            docStream?.Close();
            File.WriteAllBytes(filePath, ms.ToArray());
            docStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
            document = new PdfLoadedDocument(ms.ToArray());
        }
        return true;
    }
    catch
    {
        return false;
    }
}
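
The two OCRProcessor paths above are written as verbatim strings with backslashes. A minimal sketch of building the same folders segment by segment with Path.Combine, which sidesteps escape-sequence mistakes such as a stray \n when the string is not verbatim. OcrPaths and its members are hypothetical names. The trailing separator is carried over from the code above; whether the library needs it is an assumption.

```csharp
using System;
using System.IO;

public static class OcrPaths
{
    // <appDataDirectory>/tessdata/ (language data such as eng.traineddata).
    public static string TessData(string appDataDirectory) =>
        WithTrailingSeparator(Path.Combine(appDataDirectory, "tessdata"));

    // <appDataDirectory>/win-x64/native/ (native Tesseract binaries).
    public static string TesseractBinaries(string appDataDirectory) =>
        WithTrailingSeparator(Path.Combine(appDataDirectory, "win-x64", "native"));

    // Appends the platform directory separator unless it is already present.
    private static string WithTrailingSeparator(string path)
    {
        var separator = Path.DirectorySeparatorChar.ToString();
        return path.EndsWith(separator, StringComparison.Ordinal) ? path : path + separator;
    }
}
```

The results could then be assigned to processor.TessDataPath and processor.TesseractPath, passing FileSystem.AppDataDirectory as the argument.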

Edward Alexander · Answer

I notice that not all content seems to be processed. For example, on pages with both text and images, the text inside the images does not become searchable. Is there a way to ensure that all images, pages, and content on a page become searchable?

Karmegam Seerangan · Answer

We have checked the reported issue on our end and it is working well. We suspect the reported issue may occur because the input documents contain Unicode characters. Our library cannot directly draw Unicode characters in the output PDF document. The Unicode character issue can be resolved by setting the UnicodeFont property on the OCRProcessor.

Please find the documentation and code snippet below.
Performing OCR with Unicode characters

//Initialize the OCR processor.
OCRProcessor processor = new OCRProcessor();
//Load an existing PDF document.
FileStream stream = new FileStream("Input.pdf", FileMode.Open);
PdfLoadedDocument document = new PdfLoadedDocument(stream);
//Set OCR language.
processor.Settings.Language = Languages.English;
//Sets the Unicode font to preserve the Unicode characters in a PDF document.
FileStream fontStream = new FileStream(@"ARIALUNI.ttf", FileMode.Open);
processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
//Perform OCR with input document and tessdata (Language packs).
processor.PerformOCR(document);

Kindly try the provided solution and let us know the result. If you are still facing the issue, we kindly request you to share the modified sample and input documents with us so that we can replicate the issue on our end. This will help us analyze it and provide you with a prompt solution.

Edward Alexander · Answer

Thanks for the great response! Your support has been fantastic (a rare thing these days).

Unfortunately, using the arialuni.ttf Unicode font did not solve my particular issue. I am including a copy of a PDF file created by "print to pdf" from a web page. "Print to pdf" creates a much different PDF content file than "save to pdf". After performing OCR on this document, I can only select text from the bottom portions, which were already selectable text and needed no OCR.

Attachment: sample_916205c3.zip

Edward Alexander · Answer

Once again FANTASTIC support! This was a big help and solved the major portion of my issue.

private async Task<bool> OCRFile()
{
    await Task.Delay(1000);
    string path = FileSystem.AppDataDirectory;
    try
    {
        using (OCRProcessor processor = new OCRProcessor())
        {
            PdfToImageConverter imageConverter = new PdfToImageConverter();
            imageConverter.Load(docStream);
            PdfDocument doc = new PdfDocument();
            for (int i = 0; i < imageConverter.PageCount; i++)
            {
                Stream outputStream = imageConverter.Convert(i);
                PdfBitmap pdfImage = new PdfBitmap(outputStream);
                PdfSection section = doc.Sections.Add();
                section.PageSettings.Margins.All = 0;
                section.PageSettings.Size = new Syncfusion.Drawing.SizeF(pdfImage.PhysicalDimension.Width, pdfImage.PhysicalDimension.Height);
                PdfPage page = section.Pages.Add();
                PdfGraphics graphics = page.Graphics;
                graphics.DrawImage(pdfImage, 0, 0, page.Size.Width, page.Size.Height);
            }
            MemoryStream file = new MemoryStream();
            doc.Save(file);
            doc.Close(true);
            PdfLoadedDocument workingDocument = new PdfLoadedDocument(file);
            //Load an existing PDF document.
            //PdfLoadedDocument workingDocument = new PdfLoadedDocument(inputStream);
            processor.TessDataPath = Path.Combine(path, @"tessdata");
            processor.TesseractPath = Path.Combine(path, @"win-x64\native");
            //Set OCR language.
            processor.Settings.Language = Languages.English;
            processor.Settings.PageSegment = PageSegMode.AutoOsd;
            processor.Settings.IsImageStraighteningEnabled = true;
            //Sets the Unicode font to preserve the Unicode characters in a PDF document.
            FileStream fontStream = new FileStream(Path.Combine(path, @"arialuni.ttf"), FileMode.Open);
            processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
            //Perform OCR with input document and tessdata (Language packs).
            processor.PerformOCR(workingDocument);
            //Create file stream.
            //Saves the PDF to the memory stream.
            using MemoryStream ms = new();
            workingDocument.Save(ms);
            //Close the PDF document
            workingDocument.Close(true);
            ms.Position = 0;
            // Save the OCR PDF to the Cache directory
            var fileName = "OCRFile.pdf";
            var filePath = Path.Combine(FileSystem.CacheDirectory, fileName);
            processor.Dispose();
            document.Close(true);
            docStream?.Close();
            fontStream?.Close();
            File.WriteAllBytes(filePath, ms.ToArray());
            docStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
            document = new PdfLoadedDocument(ms.ToArray());
        }
        return true;
    }
    catch
    {
        return false;
    }
}

Edward Alexander · Answer

I have noticed a few things after converting pages to images and creating a new document. Sometimes, when trying to select text after OCR, the selection rectangle is not well aligned with the actual text in the image. In some instances with larger fonts it is just too low, and sometimes you can only select across multiple lines.

I experimented with the following:

1. Perform OCR (using the code in my prior reply).
2. Save the document.
3. Select text: the selection is misaligned.
4. Open the saved document.
5. Perform OCR again (second pass).
6. Select text: much of the text is now better aligned.

The application can be found at https://www.microsoft.com/store/apps/9P820BX16GZJ (latest release with the added convert to image functionality is still pending, but should be available later today)

Karmegam Seerangan · Answer

We are trying to replicate the problem on our end using previously shared documents and we are not able to reproduce it. We suspect that the issue is document-specific. Therefore, we request you to share the input PDF document with us so that we can replicate the problem on our end. It will be more helpful for us to analyze further and provide you with a prompt solution. However, we have attached the output document for your reference.

Output: https://www.syncfusion.com/downloads/support/directtrac/general/ze/MauiOutput1497806629

Edward Alexander · Answer

Looks like my last update to the store is failing on OCR... I will post back when that issue is resolved.

Karmegam Seerangan · Answer

Thank you for the update. Please get back to us if you need further assistance on this.

Edward Alexander · Answer

OK... New version in the store working fine now (there was an issue finding the ttf file). Again the application is available at https://www.microsoft.com/store/apps/9P820BX16GZJ.

After performing OCR, notice that selecting text in the second line is offset. I also notice that after OCR the ZoomFactor seems to scale much differently. I had to add a manual zoom feature to offset this issue.

Image: Image_5203_1711372729074

Attachment: Andouille_Creole_(GOOD)_d362e66b.zip

Karmegam Seerangan · Answer

We are unable to select the reported text on our end. We suspect the reported issue may occur due to the low quality of the image produced when converting the PDF page. Please refer to the screenshot below.

We can increase the image quality while converting the PDF to an image. Please use the code snippet below.

PdfToImageConverter imageConverter = new PdfToImageConverter();
imageConverter.Load(inputStream);
for (int i = 0; i < imageConverter.PageCount; i++)
{
    Stream outputStream = imageConverter.Convert(i, new SizeF(1836, 2372));
    PdfBitmap pdfImage = new PdfBitmap(outputStream);
    // ... the rest of the loop body stays as in your existing code.
}

Additionally, OCR recognition can be improved by using tessdata-fast or tessdata-best. Please find the tessdata links below.
https://www.syncfusion.com/downloads/support/directtrac/general/ze/tessdata-fast187458364
https://github.com/tesseract-ocr/tessdata_fast
https://github.com/tesseract-ocr/tessdata_best

Kindly try the provided solution and let us know the result. If you are still facing issues, we kindly request you to share the modified sample and the complete code snippet, including redactions, with us so that we can replicate the issue on our end. This will help us analyze it and provide you with a prompt solution.
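
The size passed to Convert sets the resolution of the rendered page. A minimal sketch of deriving pixel dimensions from a page size in points and a target DPI, rather than hard-coding them. It relies only on the PDF convention of 72 points per inch. RenderSize and FromPoints are hypothetical names, and how the page size in points is obtained from the document is left out.

```csharp
using System;

public static class RenderSize
{
    // PDF user space uses 72 points per inch, so pixels = points / 72 * dpi.
    public static (int Width, int Height) FromPoints(double widthPoints, double heightPoints, double dpi)
    {
        if (dpi <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(dpi), "DPI must be positive.");
        }
        int width = (int)Math.Round(widthPoints / 72.0 * dpi);
        int height = (int)Math.Round(heightPoints / 72.0 * dpi);
        return (width, height);
    }
}
```

For a US Letter page (612 x 792 points) at 216 DPI this gives 1836 x 2376, close to the 1836 x 2372 used in the snippet above. Deriving both dimensions from the same DPI keeps the aspect ratio of the rendered image equal to that of the page.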