Let’s scan those lines from that image
Pytesseract to the rescue
You may have wondered how a scanner or multifunction printer recognizes the text on a paper document you feed it and gives you a PDF file you can save and use later.
Today, we will use a similar approach, a technique known as Optical Character Recognition (OCR), to recognize text from a document, this time with Python code. 🥸
We will use the pytesseract library, a Python wrapper around the Tesseract OCR engine, which was initially developed at Hewlett-Packard Laboratories.
Installation
Before installing pytesseract, we need to install the tesseract binary. The method varies by platform; for the most common ones, we can use one of these commands.
# Ubuntu
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev
# macOS (MacPorts)
$ sudo port install tesseract
# or with Homebrew
$ brew install tesseract
On Windows, you can install a binary by following the documentation on this page.
For other distributions, or if you have issues with your installation, I recommend reading this documentation.
To install pytesseract, you will need Python 3.7 or higher.
$ pip install pytesseract
# or with poetry
$ poetry add pytesseract
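Once both pieces are installed, it can be handy to confirm that Python actually sees them. Here is a minimal sketch of such a check; the helper name `check_ocr_setup` is my own and is not part of pytesseract, and it assumes the binary is installed under the default name `tesseract`.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report whether the tesseract binary and the pytesseract package are available.

    Hypothetical helper for illustration; not part of pytesseract itself.
    """
    status = {
        # Full path to the tesseract executable, or None if it is not on PATH
        "tesseract_binary": shutil.which("tesseract"),
        # True if the pytesseract package can be found by the import system
        "pytesseract_installed": importlib.util.find_spec("pytesseract") is not None,
        # Filled in below only when both pieces are present
        "tesseract_version": None,
    }

    if status["tesseract_binary"] and status["pytesseract_installed"]:
        try:
            import pytesseract

            # Asks the binary for its version through the wrapper
            status["tesseract_version"] = str(pytesseract.get_tesseract_version())
        except Exception:
            # Leave the version as None if the wrapper cannot talk to the binary
            pass

    return status


if __name__ == "__main__":
    for name, value in check_ocr_setup().items():
        print(f"{name}: {value}")
```

If the binary lives outside your PATH, which is common on Windows, pytesseract lets you point to it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.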