OCRProcessor only reads the PDF page's title image even though the page contains plenty of text?

Using the code below, I read the text from a one-page document. The only text returned is:

"QQD COMFORT C700 LETTINGS "

No other text is returned.

Note: Please see attached PDF (zipped) for source document.

NuGet package versions are at the bottom of this message.


 using System.IO;
 using Syncfusion.OCRProcessor;
 using Syncfusion.Pdf.Parsing;

 private static string OCRprocessPDF(string pdfIn)
 {
     //Initialize the OCR processor.
     using (OCRProcessor processor = new OCRProcessor())
     //Open the input file; the stream is disposed when the method returns.
     using (FileStream stream = new FileStream(pdfIn, FileMode.Open, FileAccess.Read))
     {
         //Load an existing PDF document.
         PdfLoadedDocument pdfLoadedDocument = new PdfLoadedDocument(stream);
         //Set OCR language to process.
         processor.Settings.Language = Languages.English;
         //Process OCR by providing the PDF document.
         string textFound = processor.PerformOCR(pdfLoadedDocument);
         //Close the document.
         pdfLoadedDocument.Close(true);
         return textFound;
     }
 }

// Visual Studio packages
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net7.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Emgu.CV" Version="4.8.0.5324" />
    <PackageReference Include="Emgu.CV.Bitmap" Version="4.8.0.5324" />
    <PackageReference Include="Emgu.CV.runtime.windows" Version="4.8.0.5324" />
    <PackageReference Include="SkiaSharp" Version="2.88.5" />
    <PackageReference Include="Syncfusion.EJ2.PdfViewer.AspNet.Core.Windows" Version="23.1.36" />
    <PackageReference Include="Syncfusion.PDF.OCR.Net.Core" Version="23.1.36" />
    <PackageReference Include="Microsoft.AspNetCore.OpenApi" Version="7.0.11" />
    <PackageReference Include="Swashbuckle.AspNetCore" Version="6.5.0" />
    <PackageReference Include="System.Drawing.Common" Version="7.0.0" />
  </ItemGroup>
</Project>

Attachment: Dallas_2a210f6.zip


Karmegam Seerangan, Syncfusion Team, September 29, 2023 02:46 PM UTC

Hi Russell,


We have validated the reported issue on our end. The OCR processor recognizes text from images only. The input document contains only one image, so only the text from that image is returned. However, we have attached a sample that produces your expected output by converting the PDF page to an image and then passing that image to PerformOCR.
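
For readers who cannot open the attachment, the approach described above (render the page to an image, then run OCR on that image) might look roughly like the sketch below. This is not the attached sample. It assumes the separate Syncfusion.PdfToImageConverter.Net package, which is not in the project file posted in the question, and it assumes a PerformOCR overload that accepts an image Stream. The method name OcrFirstPageAsImage is hypothetical. Please check the overloads available in your installed version.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.PdfToImageConverter;

// Hypothetical helper: OCR the first page of a PDF by rasterizing it first,
// so that text stored as PDF text (not as an image) is also recognized.
static string OcrFirstPageAsImage(string pdfPath)
{
    using (FileStream pdfStream = new FileStream(pdfPath, FileMode.Open, FileAccess.Read))
    {
        // Assumption: PdfToImageConverter comes from Syncfusion.PdfToImageConverter.Net.
        PdfToImageConverter converter = new PdfToImageConverter();
        converter.Load(pdfStream);

        // Render page index 0 to an image stream
        // (arguments: page index, keep transparency, skip annotations).
        using (Stream pageImage = converter.Convert(0, false, false))
        using (OCRProcessor processor = new OCRProcessor())
        {
            // Rewind in case the converter leaves the position at the end.
            if (pageImage.CanSeek)
            {
                pageImage.Position = 0;
            }

            processor.Settings.Language = Languages.English;

            // Assumption: an overload taking an image stream exists in this version.
            return processor.PerformOCR(pageImage);
        }
    }
}
```

For a multi-page document, the same call can be repeated per page index and the results concatenated.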


Sample Link : https://www.syncfusion.com/downloads/support/directtrac/general/ze/PerformOCR_Image786831718.zip


If your input image quality is very low, we recommend trying the OCR processor with tessdata_best to get better results. You can get tessdata_best from the link below:

tessdata_best:
https://github.com/tesseract-ocr/tessdata_best

Regards,

Karmegam S



Marked as answer