How to Extract Text from a PDF Document
By Amanda Morin
It can be very frustrating to try to extract text from a PDF file for use in another application. Graphics often get in the way, or the layout of the document makes it difficult for the text to be transferred in meaningful sentences. Though it's possible to extract text with a copy-and-paste approach, doing so can be time-consuming and doesn't allow the PDF file's text to be exported in a different format. There are, however, a few other ways to extract text from a PDF file.
Extract Text Using Acrobat Reader
Open the file in Acrobat Reader. In Windows, select "File-->Export Document to Text," name the document and save it.
Copy the text on a Mac or Linux OS by first opening the View menu and choosing "Continuous" or "Continuous-Facing." (The former will provide you with the text in one column, while the latter will format the text as side-by-side pages.) Then go to "Edit-->Select All" and then "Edit-->Copy."
Use the Select tool if you only want to extract some of the text. Click on the "Text Select" tool and then choose the information you want. In a document formatted in multiple columns, you'll need to use the "Column Select" tool first. Go to "Edit-->Copy."
Convert PDF to HTML or Text
Use Gmail as a shortcut. Attach the PDF file to an email and send it to your Gmail account. When you open the email you will see a number of options next to the attachment. Choose "View as HTML" and save the file that opens in a separate window. Though you won't be able to view any graphics, the HTML file will retain the document's text formatting.
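Extract and convert files on the command line. Linux users can use a basic conversion command that changes a .pdf file to a .txt file: "pdftotext filename.pdf" (without the quotation marks). Be sure to replace "filename" with the name of your PDF file. If you don't name an output file, the text is saved as filename.txt alongside the original.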
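If you have many files to convert, the command-line step above can be scripted. What follows is a minimal Python sketch, not a definitive tool. It assumes the pdftotext utility (distributed with Xpdf and Poppler) is installed and on your PATH. The function names build_pdftotext_command and pdf_to_text are made up for this illustration, and the "-layout" option, which asks pdftotext to preserve the page layout, is an optional extra.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, keep_layout=False):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    # pdftotext would pick filename.txt by itself; we spell the target out
    # so the caller always knows where the text ends up.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def pdf_to_text(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    command = build_pdftotext_command(pdf_path, txt_path, keep_layout)
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Calling pdf_to_text("filename.pdf") would then be the scripted equivalent of typing the command by hand, with the result written to filename.txt.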
Download a PDF-to-text conversion program. A number of open source and freeware programs are available, such as PDFBox and Easy PDF to Text Converter (see Resources below). Many of these programs can also convert PDF files to HTML.
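If you end up with an HTML version of the document, whether from a webmail viewer or from a converter program, you may still want only the words. The sketch below shows one way to strip the markup using Python's standard html.parser module. The class name TextCollector, the function html_to_text and the choice of which tags start a new line are assumptions made for this illustration; real converter output may need different handling.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text.
SKIPPED_TAGS = {"script", "style"}
# Tags that end a visual block, so a line break is emitted after them.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collect visible text from HTML (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        lines = "".join(self._chunks).split("\n")
        # Collapse runs of whitespace and drop empty lines.
        cleaned = (" ".join(line.split()) for line in lines)
        return "\n".join(line for line in cleaned if line)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()
```

To use it on a saved file, read the file into a string and pass that string to html_to_text.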
- Determine whether the document contains actual text or only images. The Acrobat Reader approach works only if the PDF file contains real text, with or without graphics; it won't work for image-only files. In some cases the text in a PDF document is actually stored as an image. This often happens when an original document is scanned and the PDF file is created from the scanned image.
- Be prepared to reformat some of the text when using Acrobat Reader. This manner of extraction simply exports the PDF file to a text file; it won't necessarily retain the formatting. However, if you just need the words, this shouldn't be a problem.
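Both tips can be partly automated once you have the extracted text in hand. The following is a rough Python sketch under stated assumptions: has_extractable_text guesses that a file was image-only when almost nothing came out (the 20-character threshold is arbitrary), and rejoin_wrapped_lines assumes that blank lines separate paragraphs. Both function names are invented for this example, and neither approach is guaranteed to suit every document.

```python
import re


def has_extractable_text(extracted, min_chars=20):
    """Guess whether extraction found real text.

    min_chars is an arbitrary threshold chosen for this sketch; a scanned,
    image-only PDF typically yields little or nothing but whitespace.
    """
    visible = re.sub(r"\s+", "", extracted)
    return len(visible) >= min_chars


def rejoin_wrapped_lines(extracted):
    """Merge hard-wrapped lines into paragraphs.

    Assumes blank lines mark paragraph breaks in the extracted text.
    """
    paragraphs = re.split(r"\n\s*\n", extracted.strip())
    # Replace every run of whitespace (including line breaks) with one space.
    cleaned = (" ".join(paragraph.split()) for paragraph in paragraphs)
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)
```

If has_extractable_text returns False, the PDF was probably created from a scan and the methods described here won't recover its words.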
Amanda Morin served as a kindergarten teacher and early intervention specialist for 10 years, working with special-needs children and teaching parenting classes. Since becoming a freelance writer, she has written for a number of publications, including Education.com, the Maine Department of Education, ModernMom and others. Morin holds a Bachelor of Science in elementary education from the University of Maine, Orono.