WhapSumi Posted August 2, 2006
Hey all, I'm currently creating a document management system in VB/ASP.NET for a hospital in the Cleveland area. I'm using a SQL database to store information about each document, such as contents, location, author, and other info. I've been able to read the contents of almost every document type programmatically using the MS Office Clipboard. However, I have not been able to read PDFs. I know there are some add-ins for .NET that may give me the ability to do this, but they are not open source or free to use. I've tried using IE to open the PDF first, but I have not been able to figure out how to copy the contents to the clipboard programmatically from IE. Does anyone have an idea how to do this, or maybe another way of doing it? Any help would be greatly appreciated!
Zythryn Posted August 2, 2006
Sorry, I can't help offhand, although I am pretty sure we did something like this at work. Try this thread; some of the information there may be helpful for you: http://groups.google.com/group/microsoft.public.dotnet.languages.vb.data/browse_thread/thread/99a3fbc1608f396e/236c45f7cf2713b5%23236c45f7cf2713b5
alexander Posted August 2, 2006
This may possibly help: http://www.componentsource.com/relevance/a100/xpdfco/index.html?q=pdf+visual
C1ay Posted August 2, 2006
Sorry, I think you're up a creek without a paddle. Generating PDFs automatically from text or marked-up content is pretty straightforward; going the other way is not. In PDFs that I have tried to convert back to text, I have found quite a variety of techniques used to prevent this.
In one document I worked on, when I selected all of the text in the PDF and pasted it into a text editor, all of the words appeared backwards even though they displayed correctly in the PDF. I had to write a script to reverse each word.
In another, the text transferred via the clipboard had hidden characters inserted between all of the characters in the text. They didn't display, but they were embedded in the binary information.
In yet another, the whole document was actually an inverted mirror image of the file displayed as a PDF, i.e. the whole document was turned around backwards and then flipped upside down. It required a custom script to straighten out as well.
For plain PDFs where such techniques are not used you might be successful, but given the variety of copy-prevention techniques I have seen, you will need an equal variety of program functions to handle them.
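For illustration, two of the cleanups described above (stripping hidden characters and reversing each word) might look roughly like this in VB.NET. This is only a sketch, not the script C1ay actually wrote: the module and function names are made up for the example, and it assumes the hidden characters fall into the Unicode "control" or "format" categories (a zero-width space, for instance) and that words are separated by single spaces.

```vb
Imports System
Imports System.Globalization
Imports System.Text
Imports Microsoft.VisualBasic

Module PdfTextCleanup   ' hypothetical name, for illustration only

    ' Drops control and invisible formatting characters (assumed to be the
    ' "hidden" ones) but keeps tabs and line breaks.
    Function StripHiddenCharacters(ByVal rawText As String) As String
        Dim sb As New StringBuilder(rawText.Length)
        For Each ch As Char In rawText
            Dim category As UnicodeCategory = Char.GetUnicodeCategory(ch)
            Dim isHidden As Boolean = (category = UnicodeCategory.Control) OrElse _
                                      (category = UnicodeCategory.Format)
            Dim isLayout As Boolean = (ch = ControlChars.Tab) OrElse _
                                      (ch = ControlChars.Cr) OrElse _
                                      (ch = ControlChars.Lf)
            If isLayout OrElse Not isHidden Then sb.Append(ch)
        Next
        Return sb.ToString()
    End Function

    ' Reverses the letters inside each space-separated word; word order is kept.
    Function ReverseEachWord(ByVal line As String) As String
        Dim words As String() = line.Split(" "c)
        For i As Integer = 0 To words.Length - 1
            Dim letters As Char() = words(i).ToCharArray()
            Array.Reverse(letters)
            words(i) = New String(letters)
        Next
        Return String.Join(" ", words)
    End Function

    Sub Main()
        ' "olleH", a zero-width space (U+200B), then " dlrow"
        Dim pasted As String = "olleH" & ChrW(&H200B) & " dlrow"
        Console.WriteLine(ReverseEachWord(StripHiddenCharacters(pasted)))
        ' prints: Hello world
    End Sub

End Module
```

Each protection scheme would need its own function like these, which is the point made above: there is no single fix that covers them all.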
WhapSumi Posted August 2, 2006
Well, I figured out a way to do it, but it requires the full version of Adobe Acrobat installed on your PC so that you can access the Adobe APIs (which doesn't technically qualify as a free way to do it). Here is the code I used to read the contents of a PDF. You will have to add a reference to the Adobe APIs in your project:

    ' doctext and doctype are assumed to be declared elsewhere (not shown here).
    Dim objPDFDoc As New AcroPDDoc
    Dim objPDFPage As AcroPDPage
    Dim objPDFRectTemp As Object
    Dim objPDFRect As New AcroRect
    Dim objPDFTextSelection As AcroPDTextSelect
    Dim lngTextRangeCount As Long
    Dim lngPageCount As Long
    Dim Fora As Long

    objPDFDoc.Open(tbdocdisplaypath.Text)
    lngPageCount = objPDFDoc.GetNumPages

    For Fora = 0 To lngPageCount - 1
        ' Build a rectangle that covers the whole page.
        objPDFPage = objPDFDoc.AcquirePage(Fora)
        objPDFRectTemp = objPDFPage.GetSize
        objPDFRect.Left = 0
        objPDFRect.right = objPDFRectTemp.x
        objPDFRect.Top = objPDFRectTemp.y
        objPDFRect.bottom = 0

        ' Select all of the text on this page.
        objPDFTextSelection = objPDFDoc.CreateTextSelect(Fora, objPDFRect)

        ' Skip pages where no selection comes back (e.g. image-only pages).
        If objPDFTextSelection IsNot Nothing Then
            ' Append each piece of text in the selection.
            For lngTextRangeCount = 1 To objPDFTextSelection.GetNumText
                doctext = doctext & objPDFTextSelection.GetText(lngTextRangeCount - 1)
            Next
        End If
        doctext = doctext & vbCrLf
    Next

    doctype = "PDF"
    objPDFDoc.Close()
C1ay Posted August 2, 2006
Try it on this one... http://cepi.auctionsourceonline.com/pdf/gearboxes.pdf
alexander Posted August 2, 2006
You could look into how gpdf or xpdf do it... except that you would probably need to write some API integration between the gpdf libraries and VB.
WhapSumi Posted August 3, 2006
C1ay, that PDF has security enabled with password protection, so my program can't read it (unless you know the password). I would first have to write a program to hack it or use a third-party hacker tool in my code. That, however, would make the import process unbearably slow, which it already seems to be. It took my program 7 hours to import the contents of fifteen 450-page PDFs totalling about 36 MB. I'm pretty sure it's not my program; the Adobe APIs seem to be causing the slowness. I'm also pretty sure that the azzholes at Adobe put this lovely feature in by design. (I've always hated Adobe applications. There is a special place in hell set aside for Adobe programmers, I'm sure of it.) I still need to find a better solution for extracting text from PDFs, so my quest continues tomorrow... I'm still open to suggestions from anyone.
C1ay Posted August 3, 2006
I'm only pointing out yet another PDF copy protection. The data you're after is in the PDF file, and you'll be OK as long as you're only trying to extract data that isn't protected one way or another. OTOH, I've seen a variety of methods employed to prevent what you're trying to do.
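One way to act on that advice is to flag password-protected files before the import starts and skip them. A PDF that uses the standard security handlers carries an /Encrypt entry in its trailer, so a rough first pass is to scan the raw bytes for that name. The sketch below is a heuristic under that assumption, not a full PDF parser: the module and function names are made up, it can give a false positive if "/Encrypt" happens to appear as plain text elsewhere in the file, and it will not catch the other tricks mentioned in this thread (reversed words, hidden characters, mirrored pages).

```vb
Imports System
Imports System.IO
Imports System.Text

Module PdfEncryptionCheck   ' hypothetical name, for illustration only

    ' Heuristic: returns True when the raw file bytes contain the name "/Encrypt".
    Function LooksEncrypted(ByVal pdfBytes As Byte()) As Boolean
        ' Latin-1 maps every byte to exactly one character, so binary data
        ' survives the conversion and the search stays byte-accurate.
        Dim raw As String = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes)
        Return raw.IndexOf("/Encrypt", StringComparison.Ordinal) >= 0
    End Function

    ' Usage: pass one or more PDF paths on the command line.
    Sub Main(ByVal args As String())
        For Each pdfPath As String In args
            Dim isProtected As Boolean = LooksEncrypted(File.ReadAllBytes(pdfPath))
            Console.WriteLine("{0}: {1}", pdfPath, _
                              If(isProtected, "looks protected, skipping", "ok to import"))
        Next
    End Sub

End Module
```

Reading the whole file into memory is fine for files of a few megabytes like the ones described above; very large files would call for a streamed search instead.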
WhapSumi Posted August 3, 2006
Thanks. I wasn't aware of how extensive Adobe's copy protection was. I'm going to tinker with OCR today to see if I can convert protected PDF files to TIFF and then use OCR to "read" the document. I'll let you know how it goes.
C1ay Posted August 3, 2006
I've also encountered PDFs where the text was displayed with poor contrast against the background, like dark gray text on a light gray background. It was readable by eye, but it threw OCR off enough that the result wasn't usable.