VB.NET - Reading the contents of a PDF

WhapSumi · August 2, 2006

Hey all,

I’m currently creating a document management system in VB/ASP.NET for a hospital in the Cleveland area. I'm using a SQL database to store information about each document, such as contents, location, author, and other info. I've been able to read the contents of almost every document type programmatically using the MS Office Clipboard. However, I have not been able to read PDFs. I know there are some add-ins for .NET that may give me the ability to do this, but they are not open source or free to use. I’ve tried using IE to first open the PDF, but I have not been able to figure out how to copy the contents to the clipboard programmatically using IE. Does anyone have an idea how to do this, or maybe another way of doing it? Any help would be greatly appreciated!

Zythryn · August 2, 2006

Sorry, can't help offhand, although I am pretty sure we did something like this at work.

Try this thread, some of the information there may be helpful for you:

http://groups.google.com/group/microsoft.public.dotnet.languages.vb.data/browse_thread/thread/99a3fbc1608f396e/236c45f7cf2713b5%23236c45f7cf2713b5

alexander · August 2, 2006

this may possibly help:

http://www.componentsource.com/relevance/a100/xpdfco/index.html?q=pdf+visual

C1ay · August 2, 2006

Sorry, I think you're up a creek without a paddle. Generating PDFs automatically from text or marked-up content is pretty straightforward; going the other way is not. In PDFs that I have tried to convert back to text, I have found quite a variety of techniques used to prevent this.

In one document I worked on I found that when I selected all of the text in the PDF and pasted it to a file editor all of the words appeared backwards even though they displayed correctly in the pdf. I had to write a script to actually reverse each word.
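For what it's worth, the word-reversal fix can be quite small. The following is only a sketch of the idea, not the script used on that document: it assumes words are separated by single spaces, and the names `WordReversal` and `ReverseEachWord` are made up for illustration.

```vbnet
Imports System

Module WordReversal
    ' Reverses the characters inside each space-separated word,
    ' leaving the order of the words on the line unchanged.
    Function ReverseEachWord(ByVal line As String) As String
        Dim words() As String = line.Split(" "c)
        For i As Integer = 0 To words.Length - 1
            Dim chars() As Char = words(i).ToCharArray()
            Array.Reverse(chars)
            words(i) = New String(chars)
        Next
        Return String.Join(" ", words)
    End Function

    Sub Main()
        ' Prints: this is reversed
        Console.WriteLine(ReverseEachWord("siht si desrever"))
    End Sub
End Module
```

Punctuation attached to a word is reversed along with it, which is normally what you want if the whole token was stored backwards.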

In another I found that the text transferred via the clipboard had hidden characters inserted between all of the characters in the text. They didn't display, but they were embedded in the binary information.
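A filter along these lines might deal with that case. It is a sketch that assumes the hidden characters fall into the Unicode "control" or "format" categories (NUL bytes, zero-width characters and the like); which characters a given PDF actually inserts is not known here, and `StripHiddenChars` is a hypothetical name.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text
Imports Microsoft.VisualBasic

Module HiddenCharFilter
    ' Drops control and format characters, but keeps tabs and line breaks
    ' so the layout of the extracted text survives.
    Function StripHiddenChars(ByVal raw As String) As String
        Dim sb As New StringBuilder(raw.Length)
        For Each ch As Char In raw
            Dim cat As UnicodeCategory = Char.GetUnicodeCategory(ch)
            Dim isLayoutChar As Boolean = (ch = ControlChars.Cr OrElse _
                                           ch = ControlChars.Lf OrElse _
                                           ch = ControlChars.Tab)
            If isLayoutChar OrElse _
               (cat <> UnicodeCategory.Control AndAlso cat <> UnicodeCategory.Format) Then
                sb.Append(ch)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' A NUL character hidden between the visible ones. Prints: Hi!
        Console.WriteLine(StripHiddenChars("H" & ChrW(0) & "i" & ChrW(0) & "!"))
    End Sub
End Module
```

If the inserted characters turn out to be ordinary printable ones, this filter will not catch them and you would need a rule specific to that document.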

In yet another I found the whole document was actually an inverted mirror image of the file displayed as a pdf, i.e. the whole document was turned around backwards and then flipped upside down. It required a custom script to straighten out as well.

For plain PDFs where such techniques are not used you might be successful, but given the great variety of techniques I have seen for preventing document copying, you will need an equal variety of program functions to handle them.

WhapSumi · August 2, 2006

Well, I figured out a way to do it, but it requires that you have the full version of Adobe Acrobat installed on your PC so that you can gain access to the Adobe APIs (which doesn't technically qualify as a free way to do it). Here is the code I used to read the contents of a PDF. You will have to add a reference to the Adobe APIs in your project:

Dim objPDFPage As AcroPDPage
Dim objPDFDoc As New AcroPDDoc
Dim objPDFAVDoc As AcroAVDoc
Dim objAcroApp As AcroApp
Dim objPDFRectTemp As Object
Dim objPDFRect As New AcroRect
Dim lngTextRangeCount As Long
Dim objPDFTextSelection As AcroPDTextSelect
Dim temptextcount As Long
Dim strText As String
Dim lngPageCount As Long
Dim Fora As Long

' doctext is not declared in this snippet; it is assumed to be a String declared elsewhere
objPDFDoc.Open(tbdocdisplaypath.Text)
lngPageCount = objPDFDoc.GetNumPages

' Walk every page (page numbers are zero-based)
For Fora = 0 To lngPageCount - 1
    objPDFPage = objPDFDoc.AcquirePage(Fora)

    ' Build a rectangle that covers the whole page
    objPDFRectTemp = objPDFPage.GetSize
    objPDFRect.Left = 0
    objPDFRect.Right = objPDFRectTemp.x
    objPDFRect.Top = objPDFRectTemp.y
    objPDFRect.Bottom = 0

    objPDFTextSelection = objPDFDoc.CreateTextSelect(Fora, objPDFRect)

    ' Skip pages where no text selection could be created
    If Not objPDFTextSelection Is Nothing Then
        ' Get The Text Of The Range
        temptextcount = objPDFTextSelection.GetNumText
        For lngTextRangeCount = 1 To objPDFTextSelection.GetNumText
            doctext = doctext & objPDFTextSelection.GetText(lngTextRangeCount - 1)
        Next
    End If
Next

' Close the document once, after all pages have been read
objPDFDoc.Close()

C1ay · August 2, 2006

Try it on this one: http://cepi.auctionsourceonline.com/pdf/gearboxes.pdf

alexander · August 2, 2006

you could look into how gpdf or xpdf do it... except that you would need to maybe write some API integration of the gpdf libraries with VB
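One way to get there without writing any library bindings is to shell out to xpdf's `pdftotext` command-line tool and capture its output. This is a rough sketch: it assumes `pdftotext` is installed and on the PATH, and the function name `ExtractTextWithPdfToText` is made up for illustration.

```vbnet
Imports System
Imports System.Diagnostics

Module XpdfExtract
    ' Runs "pdftotext <file> -" and returns whatever it writes to stdout.
    ' Assumption: the pdftotext executable from xpdf is on the PATH.
    Function ExtractTextWithPdfToText(ByVal pdfPath As String) As String
        Dim psi As New ProcessStartInfo()
        psi.FileName = "pdftotext"
        ' Quote the path in case it contains spaces; "-" sends the text to stdout
        psi.Arguments = """" & pdfPath & """ -"
        psi.UseShellExecute = False
        psi.RedirectStandardOutput = True
        psi.CreateNoWindow = True

        Using proc As Process = Process.Start(psi)
            ' Read the output before waiting, so a full pipe cannot stall the tool
            Dim output As String = proc.StandardOutput.ReadToEnd()
            proc.WaitForExit()
            If proc.ExitCode <> 0 Then
                Throw New ApplicationException( _
                    "pdftotext exited with code " & proc.ExitCode.ToString())
            End If
            Return output
        End Using
    End Function
End Module
```

A call such as `doctext = ExtractTextWithPdfToText(tbdocdisplaypath.Text)` would then take the place of the page-by-page loop. A non-zero exit code covers files the tool cannot open, so protected PDFs will still fail here.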

WhapSumi · August 3, 2006

C1ay,

That PDF has security enabled with password protection, so my program can't read it (unless you know the password). I would first have to write a program to hack it or use a 3rd-party hacker tool in my code. This, however, would make the import process unbearably slow, which it already seems to be. It took my program 7 hours to import the contents of 15 PDFs of 450 pages each, totalling about 36 MB. I'm pretty sure it's not my program. It seems that the Adobe APIs are causing the slowness. I'm also pretty sure that the azzholes at Adobe put this lovely feature in by design (I've always hated Adobe applications. There is a special place in hell set aside for Adobe programmers, I'm sure of it).
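To at least avoid wasting import time on protected files, a cheap pre-check is possible. Encrypted PDFs carry an `/Encrypt` entry in their trailer, so scanning the raw bytes for that name flags most of them. This is a heuristic sketch, not a parser: the string could also appear in unrelated content, and `LooksEncrypted` is a hypothetical name.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module PdfEncryptCheck
    ' Returns True when the raw file contains the name "/Encrypt".
    ' Heuristic only: no PDF structure is parsed.
    Function LooksEncrypted(ByVal pdfPath As String) As Boolean
        Dim bytes() As Byte = File.ReadAllBytes(pdfPath)
        ' Latin-1 maps every byte to exactly one character, so binary data
        ' in the file cannot break the string search.
        Dim content As String = Encoding.GetEncoding("iso-8859-1").GetString(bytes)
        Return content.IndexOf("/Encrypt", StringComparison.Ordinal) >= 0
    End Function
End Module
```

Files flagged this way could be logged and skipped during the import. PDFs that open without a password but restrict copying are also encrypted, so they get flagged as well.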

I still need to find a better solution for extracting text from PDFs so my quest continues tomorrow... I'm still open to any suggestions from anyone.

C1ay · August 3, 2006

I'm only pointing out yet another pdf copy protection. The data you're after is in the pdf file and you'll be OK as long as you're only trying to extract the data that isn't protected one way or another. OTOH, I've seen a variety of methods employed to prevent what you're trying to do.

WhapSumi · August 3, 2006

Thanks. I wasn't aware of how extensive Adobe's copy protection was. I'm going to tinker with OCR today to see if I can convert protected Adobe files to a tiff and then use OCR to "read" the document. I'll let you know how it goes.

C1ay · August 3, 2006

I've also encountered PDFs where the text was displayed with a poor contrast to the background like a dark gray text on a light gray background. It was visibly readable but it threw OCR off enough that it wasn't usable.
