How to Extract Text from a PDF Document

By Amanda Morin

Extracting text from a PDF file for use in another application can be frustrating. Graphics often get in the way, and the layout of the document can keep the text from transferring as meaningful sentences. A copy-and-paste approach works, but it is time-consuming and does not let you export the text in a different format. There are, however, a few other ways to extract text from a PDF file.

Extract Text Using Acrobat Reader

Open the file in Acrobat Reader. In Windows, select "File-->Export Document to Text," name the document and save it.

Copy the text on a Mac or Linux system by opening the View menu and choosing "Continuous" or "Continuous-Facing." (The former gives you the text in one column, while the latter formats the text as side-by-side pages.) Go to "Edit-->Select All" and then "Edit-->Copy."

Use the Select tool if you only want to extract some of the text. Click on the "Text Select" tool and then choose the information you want. In a document formatted in multiple columns, you'll need to use the "Column Select" tool first. Go to "Edit-->Copy."

Convert PDF to HTML

Use Gmail as a shortcut. Attach the PDF file to an email and send it to your Gmail account. When you open the email you will see a number of options next to the attachment. Choose "View as HTML" and save the file that opens in a separate window. Though you won't be able to view any graphics, the HTML file will retain the document's text formatting.

Extract and convert files on the command line. Linux users can use a basic conversion command that converts a .pdf file to a .txt file: "pdftotext filename.pdf" (replace "filename.pdf" with the name of your PDF file). By default, the text is written to a file with the same name and a .txt extension.
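
If you need to run the same command from a script, for example to convert many files, the step above can be wrapped in a few lines of Python. This is only a sketch: it assumes the pdftotext program is installed and on your PATH, and the function names (build_pdftotext_command, extract_text) are made up for illustration. The "-layout" option is a pdftotext flag that tries to preserve the page's physical layout.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, keep_layout=False):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    if txt_path is None:
        # Mirror pdftotext's default: same name, .txt extension.
        txt_path = pdf_path.with_suffix(".txt")
    command = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to keep the original physical layout where it can.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def extract_text(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    command = build_pdftotext_command(pdf_path, txt_path, keep_layout)
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Calling extract_text("filename.pdf") would then write filename.txt next to the PDF, just as the command typed by hand does.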

Download a PDF-to-text conversion program. A number of open source and freeware programs are available, such as PDFBox and Easy PDF to Text Converter (see Resources below). Many of these programs can also convert PDF files to HTML.
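
To give a rough idea of what the simplest text-to-HTML step involves once the text has been extracted, here is a minimal Python sketch. It is not how PDFBox or any of the programs above work internally. It assumes that paragraphs in the extracted text are separated by blank lines, and the function name text_to_html is hypothetical.

```python
import html


def text_to_html(text, title="Extracted text"):
    """Wrap plain text in a minimal HTML page, one <p> per paragraph."""
    paragraphs = []
    # Assumption: a blank line marks the boundary between paragraphs.
    for block in text.split("\n\n"):
        # Join hard-wrapped lines and collapse runs of whitespace.
        cleaned = " ".join(block.split())
        if cleaned:
            # Escape <, > and & so the text cannot break the markup.
            paragraphs.append(f"<p>{html.escape(cleaned)}</p>")
    body = "\n".join(paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )
```

The result contains the text only. As with the Gmail method, graphics are not carried over.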
