Let’s scan those lines from that image

Kevin Tewouda
5 min read · Feb 25, 2024

Pytesseract to the rescue

Photo by Mahrous Houses on Unsplash

You may have wondered how printers recognize the text in the paper documents we feed them and give us back a PDF file we can save and use later.

Today, we will use a similar approach, a technique known as Optical Character Recognition (OCR), to recognize the text in a document, but with Python code. 🥸

We will use the pytesseract library, a Python wrapper around the Tesseract OCR engine, which was initially developed at Hewlett-Packard Laboratories.

Installation

Before installing pytesseract, we need to install the tesseract binary. The method varies by platform. For the most common ones, we can use one of the following commands.

# Ubuntu
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev

# macOS (MacPorts)
$ sudo port install tesseract
# or with Homebrew
$ brew install tesseract

On Windows, you can install a binary by following the documentation on this page.

For other distributions, or if you run into issues with your installation, I recommend reading this documentation.

To install pytesseract, you will need Python 3.7 or higher.

$ pip install pytesseract

# or with poetry
$ poetry add pytesseract
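Once both pieces are installed, a quick sanity check can confirm that the Python wrapper is able to find the tesseract binary. The snippet below is a minimal sketch and not part of the official documentation: check_ocr_setup is a hypothetical helper name chosen for this example. It relies on pytesseract.get_tesseract_version() and on the pytesseract.pytesseract.tesseract_cmd attribute, which the library uses to locate the executable.

```python
import shutil


def check_ocr_setup(tesseract_cmd=None):
    """Report whether the tesseract binary and the pytesseract wrapper are usable.

    Hypothetical helper written for this article, not a pytesseract function.
    """
    report = {"binary": None, "wrapper": False, "version": None}

    # Use the explicit path if one is given, otherwise search the PATH.
    report["binary"] = tesseract_cmd or shutil.which("tesseract")

    try:
        import pytesseract
    except ImportError:
        # The wrapper is not installed yet: run `pip install pytesseract`.
        return report
    report["wrapper"] = True

    if tesseract_cmd:
        # pytesseract reads the executable location from this module attribute.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    try:
        report["version"] = str(pytesseract.get_tesseract_version())
    except Exception:
        # Typically raised when the binary cannot be found or executed.
        pass
    return report


if __name__ == "__main__":
    print(check_ocr_setup())
```

If the "version" entry is still None, the binary is probably not on your PATH, which seems to happen most often on Windows. In that case you can pass the full path to the executable, for example check_ocr_setup(r"C:\Program Files\Tesseract-OCR\tesseract.exe"). That path is only an assumption about where the installer put the binary, so adapt it to your machine.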


Kevin Tewouda

Cameroonian deserter now living in France. Passionate about programming, sports, cinema, and manga. I write in French and English because of my origins.