Science Forums

Posted

Hey all,

 

I'm currently building a document management system in VB/ASP.NET for a hospital in the Cleveland area. I'm using a SQL database to store information about each document, such as its contents, location, author, and other details. I've been able to read the contents of almost every document type programmatically using the MS Office Clipboard. However, I have not been able to read PDFs. I know there are some add-ins for .NET that may let me do this, but they are not open source or free to use. I've tried opening the PDF in IE first, but I have not been able to figure out how to copy its contents to the clipboard programmatically from IE. Does anyone have an idea how to do this, or maybe another way of doing it? Any help would be greatly appreciated!

Posted

Sorry, I think you're up a creek without a paddle. Generating PDFs automatically from text or marked-up content is pretty straightforward; going the other way is not. In the PDFs that I have tried to convert back to text, I have found quite a variety of techniques used to prevent this.

In one document I worked on, I found that when I selected all of the text in the PDF and pasted it into a text editor, all of the words appeared backwards even though they displayed correctly in the PDF. I had to write a script to reverse each word.
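A script of that kind can be quite short. The sketch below is in Python rather than the VB.NET used elsewhere in this thread, and `reverse_each_word` is a hypothetical helper name. It assumes each word's characters come out reversed while word order and spacing stay intact; it is not the script that was actually used on that document.

```python
import re


def reverse_each_word(text: str) -> str:
    """Reverse the characters of every word while keeping all whitespace in place."""
    # Splitting on a captured whitespace group keeps the separators in the result,
    # so spaces, tabs and line breaks are preserved exactly.
    parts = re.split(r"(\s+)", text)
    return "".join(part if part.isspace() else part[::-1] for part in parts)


# Example: text as it might arrive from the clipboard (assumed input).
print(reverse_each_word("olleH dlrow"))  # prints "Hello world"
```

Punctuation attached to a word is reversed along with it, so ",dlrow" becomes "world,", which is usually what you want if the whole token was stored backwards.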

 

In another, I found that the text transferred via the clipboard had hidden characters inserted between all of the characters in the text. They didn't display, but they were embedded in the binary data.
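Cleaning up text like that could look something like the Python sketch below. It doesn't reflect which characters were actually inserted in that document, since I can't say for certain; it assumes they were Unicode control (Cc) or format (Cf) characters such as NUL or zero-width spaces. `strip_hidden_chars` is a hypothetical helper name.

```python
import unicodedata

# Control characters that carry layout and should survive the cleanup.
KEEP = {"\n", "\r", "\t"}


def strip_hidden_chars(text: str) -> str:
    """Drop non-printing control (Cc) and format (Cf) characters from text."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )


# Example: "Hello" with zero-width spaces and a NUL between the letters (assumed input).
print(strip_hidden_chars("H\u200be\x00l\u200bl\u200bo"))  # prints "Hello"
```

If the padding turned out to be ordinary visible characters instead, a category filter would not catch it and you would need to look at the raw bytes to find the pattern.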

 

In yet another, I found that the whole document was actually an inverted mirror image of the file displayed as a PDF, i.e. the whole document was turned around backwards and then flipped upside down. It required a custom script to straighten out as well.

For plain PDFs where such techniques are not used you might be successful, but given the great variety of techniques I have seen for preventing document copying, you will need an equal variety of program functions to handle them.

Posted

Well, I figured out a way to do it, but it requires the full version of Adobe Acrobat to be installed on your PC so that you can access the Adobe APIs (which doesn't technically qualify as a free way to do it). Here is the code I used to read the contents of a PDF. You will have to add a reference to the Adobe APIs in your project:

' Requires a project reference to the Adobe Acrobat type library.
' doctext and doctype are assumed to be String variables declared elsewhere,
' and tbdocdisplaypath is the text box holding the path to the PDF.
Dim objPDFDoc As New AcroPDDoc
Dim objPDFPage As AcroPDPage
Dim objPDFRectTemp As Object
Dim objPDFRect As New AcroRect
Dim objPDFTextSelection As AcroPDTextSelect
Dim lngPageCount As Long
Dim lngTextRangeCount As Long
Dim Fora As Long

' Only continue if Acrobat managed to open the file.
If objPDFDoc.Open(tbdocdisplaypath.Text) Then
    lngPageCount = objPDFDoc.GetNumPages

    For Fora = 0 To lngPageCount - 1
        ' Build a rectangle that covers the whole page.
        objPDFPage = objPDFDoc.AcquirePage(Fora)
        objPDFRectTemp = objPDFPage.GetSize
        objPDFRect.Left = 0
        objPDFRect.right = objPDFRectTemp.x
        objPDFRect.Top = objPDFRectTemp.y
        objPDFRect.bottom = 0

        ' Select all of the text on the page; skip the page if nothing could be selected.
        objPDFTextSelection = objPDFDoc.CreateTextSelect(Fora, objPDFRect)
        If objPDFTextSelection IsNot Nothing Then
            ' Append each text run in the selection.
            For lngTextRangeCount = 0 To objPDFTextSelection.GetNumText - 1
                doctext = doctext & objPDFTextSelection.GetText(lngTextRangeCount)
            Next
        End If

        ' Separate pages with a line break.
        doctext = doctext & vbCrLf
    Next

    doctype = "PDF"
    objPDFDoc.Close()
End If

Posted

C1ay,

 

That PDF has security enabled with password protection, so my program can't read it (unless you know the password). I would first have to write a program to hack it or use a 3rd-party hacker tool in my code. This, however, would make the import process unbearably slow, which it already seems to be. It took my program 7 hours to import the contents of 15 PDFs of 450 pages each, totalling about 36 MB. I'm pretty sure it's not my program. It seems that the Adobe APIs are causing the slowness. I'm also pretty sure that the azzholes at Adobe put this lovely feature in by design (I've always hated Adobe applications. There is a special place in hell set aside for Adobe programmers, I'm sure of it).
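To put those numbers in perspective, here is the arithmetic as a small Python sketch. It reads the figures as 15 PDFs of 450 pages each, which is an assumption about the wording above, and `import_rate` is a hypothetical helper name.

```python
def import_rate(pdf_count: int, pages_per_pdf: int, hours: float, total_mb: float) -> dict:
    """Turn the overall import figures into per-page and per-hour rates."""
    pages = pdf_count * pages_per_pdf
    seconds = hours * 3600
    return {
        "pages": pages,
        "seconds_per_page": seconds / pages,
        "mb_per_hour": total_mb / hours,
    }


# Figures from the post: 15 PDFs, 450 pages each (assumed reading), 7 hours, about 36 MB.
rate = import_rate(15, 450, 7, 36)
print(f"{rate['pages']} pages, {rate['seconds_per_page']:.2f} s/page, "
      f"{rate['mb_per_hour']:.2f} MB/hour")
# prints "6750 pages, 3.73 s/page, 5.14 MB/hour"
```

Roughly 3.7 seconds per page is a useful baseline: timing a single page extraction in isolation would show how much of that is spent inside the Adobe calls and how much in the surrounding import code.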

 

I still need to find a better solution for extracting text from PDFs so my quest continues tomorrow... I'm still open to any suggestions from anyone.

Posted

I'm only pointing out yet another PDF copy protection. The data you're after is in the PDF file, and you'll be OK as long as you're only trying to extract data that isn't protected one way or another. OTOH, I've seen a variety of methods employed to prevent what you're trying to do.

Posted

Thanks. I wasn't aware of how extensive Adobe's copy protection was. I'm going to tinker with OCR today to see if I can convert protected PDF files to TIFF images and then use OCR to "read" the documents. I'll let you know how it goes.

Posted

I've also encountered PDFs where the text was displayed with poor contrast against the background, such as dark gray text on a light gray background. It was visibly readable, but it threw OCR off enough that the output wasn't usable.
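One thing that might be worth trying before OCR in a case like that is a simple contrast stretch, so the darkest gray becomes black and the lightest becomes white. The Python sketch below works on a plain list of rows of 0-255 grayscale values; `stretch_contrast` is a hypothetical helper name, and whether this would have rescued the document described above is untested.

```python
def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in pixels]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row] for row in pixels]


# Example: dark gray text (90) on a light gray background (160), assumed values.
page = [[160, 90, 160],
        [90, 160, 90]]
print(stretch_contrast(page))  # prints [[255, 0, 255], [0, 255, 0]]
```

A real page image would come from whatever library renders the PDF to a bitmap, and noisy scans may need thresholding on top of this.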
