
Update OCR to use tesseract 4.0.0

Tesseract has been updated to 4.0.0 as of October 29, 2018.
https://github.com/tesseract-ocr/tesseract

The Syncfusion OCR library currently uses 3.02/5. I've found 4.0.0 to be much better for OCR of PDF documents. Is there a work item for upgrading the OCR library to a newer Tesseract version?

9 Replies

DB Dilli Babu Nandha Gopal Syncfusion Team November 12, 2018 04:17 PM UTC

Hi Jason, 
 
We have tested the new Tesseract version and found that it performs slower than the older version. Please find the details in the table below.
Tesseract Version 4.0 OCR Process Time Taken Table Report (Syncfusion Tesseract .dll), OCR times in mm.ss.mmm:

Document                Size     Page count  Tesseract 4.0 (LSTM engine)  Tesseract 4.0 (Tesseract engine)  Tesseract 3.05
Input.pdf               1.8 MB   1           00.27.604 sec                00.13.496 sec                     00.12.870 sec
Defect_143275.pdf       106 KB   1           00.19.612 sec                00.25.264 sec                     00.18.931 sec
DefectID_WF11781.pdf    3.0 MB   8           00.58.832 sec                00.42.624 sec                     00.44.559 sec
SpecialCharacters.pdf   15.7 KB  1           00.09.172 sec                00.07.523 sec                     00.09.441 sec
DefectID_WF13618_1.pdf  3.6 MB   2           00.23.752 sec                00.37.851 sec                     00.34.297 sec
WF25811.pdf             24.8 MB  15          04.59.782 sec                03.41.819 sec                     03.50.944 sec
DefectID_WF32606.pdf    4.27 MB  12          01.16.325 sec                01.01.061 sec                     01.04.171 sec
Defect_139301.pdf       5.4 MB   32          05.14.944 sec                04.26.121 sec                     04.18.701 sec
 
  
At present, we don't have any immediate plans to provide support for this newer version. We have logged a feature request for it and will let you know once the feature has been implemented.
 
Regards, 
Dilli babu. 
 
 



JM Jason Morse November 12, 2018 07:15 PM UTC

Thank you for the update. Although Tesseract 4.0 is generally slower at present, performance is not my primary concern; recognition quality is. I am more than willing to accept slower processing in exchange for much better recognition with the LSTM engine.

OCR times in seconds; "Perf improvement" is relative to the Tesseract 3.05 baseline (negative means slower):

Document                Size     Page count  4.0 LSTM engine  Perf improvement  4.0 Tesseract engine  Perf improvement  3.05 (baseline)
Input.pdf               1.8 MB   1           27.604           -114%             13.496                -5%               12.87
Defect_143275.pdf       106 KB   1           19.612           -4%               25.264                -33%              18.931
DefectID_WF11781.pdf    3.0 MB   8           58.832           -32%              42.624                4%                44.559
SpecialCharacters.pdf   15.7 KB  1           9.172            3%                7.523                 20%               9.441
DefectID_WF13618_1.pdf  3.6 MB   2           23.752           31%               37.851                -10%              34.297
WF25811.pdf             24.8 MB  15          299.782          -30%              221.819               4%                230.944
DefectID_WF32606.pdf    4.27 MB  12          76.325           -19%              61.061                5%                64.171
Defect_139301.pdf       5.4 MB   32          314.944          -22%              266.121               -3%               258.701
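
For anyone who wants to reproduce the percentages above, here is a minimal Python sketch. It assumes the timings in the Syncfusion report are written as mm.ss.mmm and that "Perf improvement" means (baseline - candidate) / baseline, rounded to a whole percent. The helper names `to_seconds` and `improvement_pct` are made up for illustration and are not part of any library.

```python
def to_seconds(stamp: str) -> float:
    """Convert a 'mm.ss.mmm' timing such as '04.59.782' to seconds (assumed format)."""
    minutes, seconds, millis = stamp.split(".")
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000


def improvement_pct(baseline: float, candidate: float) -> int:
    """Percent change relative to the baseline; a negative value means slower."""
    return round((baseline - candidate) / baseline * 100)


# WF25811.pdf: Tesseract 3.05 baseline vs. Tesseract 4.0 LSTM engine
baseline = to_seconds("03.50.944")   # about 230.944 s
lstm = to_seconds("04.59.782")       # about 299.782 s
print(improvement_pct(baseline, lstm))  # -30
```

With this definition, DefectID_WF13618_1.pdf comes out at 31% for the LSTM engine, which matches the row in the table.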


DB Dilli Babu Nandha Gopal Syncfusion Team November 13, 2018 09:07 AM UTC

Hi Jason, 
 
Thank you for your update. 
  
We have considered your request and logged a feature request for it. We will implement this feature in one of our upcoming releases. The implementation will also depend greatly on factors such as product design, code compatibility, and complexity. We request you to visit our website periodically for feature-related updates.
 
Regards, 
Dilli babu. 
 



EF Effy August 5, 2019 07:31 PM UTC

+1 on this request,
We've started using Tesseract as an external process instead of using SyncFusion due to the older version.
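
A rough sketch of what the external-process approach can look like in Python, not Effy's actual setup. It assumes the `tesseract` 4.x command-line tool is installed and on PATH, and that each PDF page has already been rasterized to an image, since the CLI reads images rather than PDFs. The function names and file names are hypothetical; `--oem 1` selects the LSTM engine and `-l` the language.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", oem=1):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG --oem N."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "--oem", str(oem)]


def run_ocr(image, output_base, lang="eng"):
    """Run Tesseract as a child process and return the recognized text.

    By default Tesseract writes its result to '<output_base>.txt'.
    """
    cmd = build_tesseract_command(image, output_base, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical usage, assuming 'page1.png' exists:
# text = run_ocr("page1.png", "page1_ocr")
```

Keeping the command construction separate from the process call makes it easy to check the arguments without Tesseract installed.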


SL Sowmiya Loganathan Syncfusion Team August 6, 2019 07:34 AM UTC

Hi Effy, 
 
At present the feature for “Update PDF OCR to Tesseract version 4.0” is not implemented. We will let you know once the feature is implemented. We request you to visit our website periodically for feature related updates.  
 
Regards, 
Sowmiya L 



AJ aJeff August 8, 2019 04:05 PM UTC

I would also very much like to have version 4.0 incorporated. The improved pre-processing and OCR output are well worth the slight performance hit.


SL Sowmiya Loganathan Syncfusion Team August 9, 2019 01:46 PM UTC

Hi aJeff, 
 
Thank you for the update. At present we do not have any immediate plans to “Update PDF OCR to Tesseract version 4.0”. Please visit our website periodically for feature related updates. 
 
Regards, 
Sowmiya L 



DP Daniel Persidok September 5, 2019 01:37 PM UTC

Hello!
We're also interested in the 4.0.
The current version (3.05) cannot even read simple numbers correctly, such as the tax ID on invoices.
BR
Daniel


SK Surya Kumar Syncfusion Team September 6, 2019 07:15 AM UTC

Hi Daniel, 

We have logged the feature request for using Tesseract 4.0 in OCR and you can track the status of this feature from below link: 

We will let you know once this feature is implemented. 

Regards, 
Surya Kumar 

