Tuesday, October 11, 2011

Ubuntu pdf image to text (OCR) - Extract all text from PDF

sudo apt-get install ghostscript

gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=file-name.tif file-name.pdf

# Install Tesseract plus the language pack for the document's language (tesseract-ocr-eng is English):
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# The second argument is the output name without extension; tesseract appends .txt itself
tesseract file-name.tif file-name-txt-without-extension -l eng
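
The two steps above could be wrapped in a small shell function so one call goes from PDF to text. This is only a sketch: the names pdf_base and pdf_to_text are made up for illustration, and it assumes gs and tesseract are installed as shown above and that the input file name ends in .pdf.

```shell
#!/bin/sh
# Sketch only: pdf_base and pdf_to_text are hypothetical helper names.

# Strip a trailing .pdf from a file name (file-name.pdf -> file-name).
pdf_base() {
    printf '%s\n' "${1%.pdf}"
}

# Render the PDF to a 600 dpi TIFF, then OCR that TIFF.
# $1 = PDF file, $2 = tesseract language code (assumed default: eng).
pdf_to_text() {
    base=$(pdf_base "$1")
    lang=${2:-eng}
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$base.tif" "$1" || return 1
    # tesseract appends .txt to the output name it is given
    tesseract "$base.tif" "$base" -l "$lang"
}

# Example call (not run automatically):
# pdf_to_text file-name.pdf eng
```

Calling pdf_to_text file-name.pdf should leave file-name.tif and file-name.txt next to the PDF. The intermediate TIFF is kept on purpose so the OCR step can be rerun with a different language.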

