Extracting Text With GIMP
By Darrin Koltow
The main benefit of extracting text from image files is that you can then search that text. For example, if you have a batch of business letters scanned in as JPEG files, extracting the text from those JPEGs lets you find letters from a particular customer by searching for her name. How you extract text with GIMP depends on the type of text. Rasterized text is stored as pixels, like the rest of the image, so use GIMP's Threshold command or a similar tool to prepare it for optical character recognition (OCR). For vector text, the editable kind that word processors work with, use a free plugin from GIMP's plugin registry.
To extract vector text, download the Extract Text script from the GIMP Plugin Registry (link in Resources). Close GIMP, and then use Windows Explorer to copy or move the script to the scripts folder under your GIMP installation directory. The folder's path resembles this example:
C:\Program Files\GIMP 2\Share\Gimp\2.0\Scripts
Open GIMP, and then click the "contrib" menu that the Extract Text script added to GIMP's default menus. Click the "Extract" command, and then click the "Input File" button to display a file selection dialog box. Select a GIMP file -- which has the extension XCF -- that has at least one text layer.
Click the "Text File" button, and then enter a file name for the output text file. Click "OK" to extract the text, and then use Windows Explorer to navigate to and open the output text file you specified. The text file displays the extracted text.
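The file-copy part of the script installation described above can also be automated. The following is a minimal Python sketch, not part of the Extract Text script itself; the function name install_script and the file name in the usage comment are hypothetical, and the scripts folder path is the sample path from this article, which may differ on your system.

```python
import shutil
from pathlib import Path


def install_script(script_path, scripts_dir):
    """Copy a downloaded GIMP script into the given scripts folder.

    Returns the path of the copied file. Raises FileNotFoundError if the
    scripts folder does not exist, so a mistyped path is caught early.
    """
    scripts_dir = Path(scripts_dir)
    if not scripts_dir.is_dir():
        raise FileNotFoundError(f"Scripts folder not found: {scripts_dir}")
    # copy2 keeps the file's timestamps and returns the destination path.
    return Path(shutil.copy2(script_path, scripts_dir))


# Example call (hypothetical download location and file name):
# install_script(r"C:\Users\me\Downloads\extract-text-script",
#                r"C:\Program Files\GIMP 2\Share\Gimp\2.0\Scripts")
```

Close GIMP before copying, as in the manual steps. Writing under Program Files usually requires administrator rights on Windows.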
To extract rasterized text, open in GIMP an image file that contains it. All JPEGs, GIFs and PNGs, for example, have only rasterized text. By contrast, GIMP and Photoshop files -- extensions XCF and PSD, respectively -- can have vector text.
Click the "Threshold" command under the Colors menu to display the Threshold dialog box. This command maps every pixel to black or white, depending on whether the pixel's brightness falls inside the threshold range that you specify. The resulting image shows white text on a black background, which means it has extremely high contrast. This attribute greatly aids in text recognition.
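To make the thresholding idea concrete, here is a minimal Python sketch of the operation on a tiny grayscale image. It is an illustration, not GIMP's own code: it assumes 8-bit brightness values (0 to 255), and the function name threshold and the sample data are hypothetical. As in GIMP's dialog box, pixels whose value falls inside the chosen range become white and all others become black.

```python
def threshold(pixels, low=127, high=255):
    """Map each grayscale value to white (255) if low <= value <= high, else black (0)."""
    return [[255 if low <= value <= high else 0 for value in row] for row in pixels]


# A tiny 2x4 "scan": dark text pixels (around 30) on a light page (around 220).
scan = [
    [220, 30, 35, 215],
    [225, 210, 28, 230],
]

# Selecting the dark range turns the text white and the page black.
text_white = threshold(scan, low=0, high=100)
# text_white == [[0, 255, 255, 0], [0, 0, 255, 0]]
```

Moving the low and high values corresponds to dragging the two sliders: a range that covers only the text's brightness isolates the text from the page.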
Drag the dialog box's black and white arrow sliders left or right until the image's text is clearly legible. Click "OK" to close the dialog box. Click the "Colors" menu's "Value Invert" command to swap the black and white values; the resulting black text on a white background is easier for OCR programs to recognize.
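For an image that is already pure black and white, swapping the two values amounts to subtracting each pixel from the maximum. The sketch below illustrates that arithmetic under the assumption of 8-bit grayscale values; the function name invert and the sample data are hypothetical. GIMP's Value Invert works on the brightness component of each color, which gives the same result as this sketch only for pixels without color, such as thresholded ones.

```python
def invert(pixels):
    """Swap dark and light by replacing each 8-bit value v with 255 - v."""
    return [[255 - value for value in row] for row in pixels]


# White text (255) on a black background (0), as left by the Threshold step.
white_on_black = [
    [0, 255, 255, 0],
    [0, 0, 255, 0],
]

black_on_white = invert(white_on_black)
# black_on_white == [[255, 0, 0, 255], [255, 255, 0, 255]]
```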
Click the "File" menu and choose the "Export" command. Enter a file name that ends in ".jpg" or ".png," and then click "Export" to save the image to disk.
Use an optical character recognition (OCR) resource such as Google Drive, Tesseract or FreeOCR (links in Resources) to convert the rasterized text to selectable, searchable text.
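If you choose Tesseract, the OCR step can be scripted. The sketch below is one possible approach, assuming the tesseract command-line program is installed and on your PATH; it relies on Tesseract's basic usage, in which the program takes an image file and an output base name and writes the recognized text to a file with ".txt" appended. The function names and the file names in the usage comment are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    """Return the argument list for the command `tesseract IMAGE OUTPUTBASE`."""
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run Tesseract on an exported image and return the recognized text.

    Requires the tesseract program; raises CalledProcessError if it fails.
    """
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract writes its plain-text result to OUTPUTBASE.txt.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


def mentions(text, name):
    """Case-insensitive check for a name in the extracted text."""
    return name.lower() in text.lower()


# Example call (hypothetical file names):
# letter_text = run_ocr("letter.png", "letter")
# print(mentions(letter_text, "Jane Smith"))
```

The mentions helper shows the search that motivates the whole process: once the letter exists as plain text, finding a customer's name is a simple string check.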
Information in this article applies to GIMP 2.8. It may vary slightly or significantly with other versions or products.
Darrin Koltow wrote about computer software until graphics programs reawakened his lifelong passion of becoming a master designer and draftsman. He has now committed to acquiring the training for a position designing characters, creatures and environments for video games, movies and other entertainment media.