
ExtractText not working for pdfDocument object?

Hi,

I am using version 8.303.0.21 of Sync PDF. I have a few questions here:

1) Does the ExtractText function work only on a PdfLoadedDocument object, and not on a PdfDocument?

2) It seems that after importing a page with "pFinalDoc.ImportPage(pTempDoc, j)", I am not able to call ExtractText on the PdfDocument. It returns "Nothing".


Any advice is appreciated. Thanks.



' Namespaces the unqualified names below rely on
Imports System.IO
Imports Syncfusion.Pdf

Sub test(ByVal msInputFile As MemoryStream)

    Dim pDoc As Syncfusion.Pdf.Parsing.PdfLoadedDocument = New Parsing.PdfLoadedDocument(msInputFile)
    Dim found As Boolean
    Dim searchKey As String
    Dim searchList As New SortedList(Of String, Byte())
    Dim m As MemoryStream
    Dim pFinalDoc As PdfDocument
    Dim pTempDoc As Syncfusion.Pdf.Parsing.PdfLoadedDocument
    Dim s As String = String.Empty

    For i As Integer = 0 To pDoc.Pages.Count - 1

        'create a new PDF doc
        pFinalDoc = New Syncfusion.Pdf.PdfDocument()

        'search if there is any existing PDF having the same key info
        searchKey = pDoc.Pages(i).ExtractText().Substring(0, 10)
        found = searchList.Keys.Contains(searchKey)

        If (found = True) Then
            'already existing, load existing pages
            pTempDoc = New Parsing.PdfLoadedDocument(searchList(searchKey))

            For j As Integer = 0 To pTempDoc.Pages.Count - 1
                pFinalDoc.ImportPage(pTempDoc, j)

                'this ExtractText call on the PdfDocument page is the one that returns Nothing
                s &= pFinalDoc.Pages(j).ExtractText()
            Next

            If (pFinalDoc.Pages(0).ExtractText() = Nothing) Then
                MsgBox("Error")
            End If

        End If

        'add current page
        pFinalDoc.ImportPage(pDoc, i)

        'save final doc to memory in order to get byte array
        m = New MemoryStream()
        pFinalDoc.Save(m)

        If (found = True) Then
            'set to the new value
            searchList(searchKey) = m.ToArray()
        Else
            searchList.Add(searchKey, m.ToArray())
        End If

        pFinalDoc.Close()

        m = Nothing
        pFinalDoc = Nothing
    Next

End Sub

3 Replies

AG Angappan G Syncfusion Team August 13, 2010 10:28 AM UTC

Hi HY,

Thank you for your interest in Essential PDF.

We regret the delay in getting back to you.

1. The ExtractText method works only with PdfLoadedDocument objects, not with PdfDocument objects.

2. ExtractText cannot be used with a PdfDocument even after importing the contents of an existing document, because the method works only with the PdfLoadedDocument class.

Please let us know if you have any queries.

Regards,
Angappan.
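
Given the answer above, one way to keep the loop from the question working is to read the text from the loaded source document instead of from the PdfDocument that receives the pages. The following is only a minimal sketch under that assumption: the module name and the helper ImportPagesAndCollectText are hypothetical, and only calls already used in the question (ImportPage, Pages, ExtractText, the PdfLoadedDocument byte-array constructor) appear in it.

```vb
Imports System.Text
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

Module ExtractFromLoadedSketch

    ' Hypothetical helper: imports every page stored in pdfBytes into target
    ' and returns the text read from the loaded source pages.
    Function ImportPagesAndCollectText(ByVal target As PdfDocument, ByVal pdfBytes As Byte()) As String
        Dim source As New PdfLoadedDocument(pdfBytes)
        Dim text As New StringBuilder()

        For j As Integer = 0 To source.Pages.Count - 1
            ' Copy the page into the new document, as in the question.
            target.ImportPage(source, j)

            ' Extract from the PdfLoadedDocument page, not from target.Pages(j).
            text.Append(source.Pages(j).ExtractText())
        Next

        ' Assumption: source is left open here, as in the question's code,
        ' because target has not been saved yet.
        Return text.ToString()
    End Function

End Module
```

In the question's loop this would replace the inner For block, for example s &= ImportPagesAndCollectText(pFinalDoc, searchList(searchKey)).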


RT Rodrigo T December 15, 2017 04:55 PM UTC

Hi, this topic is very useful; please add it to the main documentation for pdf.PdfDocument.ExtractText.

Using pdf.PdfDocument.ExtractText, formatted text and other content comes back dirty.

Using pdf.PdfLoadedDocument.ExtractText, everything works fine.

Or is there still (in 2017) a bug in pdf.PdfDocument.ExtractText compared to the output of pdf.PdfLoadedDocument.ExtractText?

Thanks!


SA Sabari Anand Senthamarai Kannan Syncfusion Team December 18, 2017 12:45 PM UTC

Hi Rodrigo, 

Thank you for contacting Syncfusion products. 

Text cannot be extracted through the PdfDocument class after pages have been imported from a PdfLoadedDocument object. Extraction can only be performed using the PdfLoadedDocument class. We will update our UG documentation accordingly, and it will be refreshed within a week.

Please let us know if you need any further assistance. 

Regards, 
Sabari Anand 
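
If the text of an assembled PdfDocument is needed, a possible workaround consistent with the replies in this thread is to save the document to memory and reopen the bytes as a PdfLoadedDocument, which is what the question's code already does when it stores and reloads the byte arrays. This is a sketch, not an official recommendation: the module and function names are hypothetical, and it assumes the saved bytes reload cleanly in your version of Essential PDF.

```vb
Imports System.IO
Imports System.Text
Imports Syncfusion.Pdf
Imports Syncfusion.Pdf.Parsing

Module ReloadAndExtractSketch

    ' Hypothetical helper: saves doc to memory, reloads it as a
    ' PdfLoadedDocument, and returns the text of all pages.
    Function ExtractTextAfterSave(ByVal doc As PdfDocument) As String
        Using ms As New MemoryStream()
            ' Serialize the in-memory document, as the question does.
            doc.Save(ms)

            ' Reopen the saved bytes with the class that supports ExtractText.
            Dim reloaded As New PdfLoadedDocument(ms.ToArray())
            Try
                Dim text As New StringBuilder()
                For i As Integer = 0 To reloaded.Pages.Count - 1
                    text.Append(reloaded.Pages(i).ExtractText())
                Next
                Return text.ToString()
            Finally
                ' Release the reloaded copy; doc itself is left open for the caller.
                reloaded.Close(True)
            End Try
        End Using
    End Function

End Module
```

The round trip costs an extra save per call, so for large documents it may be cheaper to extract the text from the source PdfLoadedDocument pages before importing them.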

