How To Set up Tesseract OCR on AlmaLinux 8

This article explains how to set up Tesseract OCR on AlmaLinux 8 by building it from source.

Tesseract is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly from the command line or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages. Tesseract doesn’t have a built-in GUI, but several are available from the 3rdParty page. Wrappers make Tesseract usable from many programming languages and frameworks.

It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single text line.


How To Set up Tesseract OCR on AlmaLinux 8

To set up Tesseract, log in to your server as a non-root user with sudo privileges. If you have not created such a user yet, follow our article Initial Server Setup with AlmaLinux 8.

Now follow the steps below to install Tesseract OCR on AlmaLinux 8.

Install Tesseract OCR on AlmaLinux 8

In this section, you will install Tesseract on AlmaLinux 8 from source.

First, update your system packages with the following command:

sudo dnf update -y

Then, install the packages required to build Tesseract on AlmaLinux 8:

sudo dnf install git automake make autoconf libtool clang gcc-c++.x86_64 wget

Install the image libraries that Leptonica depends on with the following command:

sudo dnf install zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel libtiff libtiff-devel libpng libpng-devel

Now copy the shared libraries into /usr/local/lib with the following commands:

cd /usr/local/lib
sudo cp /usr/lib64/libjpeg.so.62 .
sudo cp /usr/lib64/libwebp.so.7 .
sudo cp /usr/lib64/libtiff.so.5 .
sudo cp /usr/lib64/libpng16.so.16 .
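
Before building, it may help to confirm that all four files actually landed in /usr/local/lib, because the version suffixes of these libraries can differ between systems. The following is a minimal sketch; the function name missing_libs is hypothetical and is not part of any package installed above.

```shell
# Hypothetical helper: print every library name from the argument list
# that is NOT present in the given directory.
missing_libs() {
    dir="$1"
    shift
    for lib in "$@"; do
        [ -e "$dir/$lib" ] || printf '%s\n' "$lib"
    done
}

# Check the four files copied above; no output means all are in place.
missing_libs /usr/local/lib libjpeg.so.62 libwebp.so.7 libtiff.so.5 libpng16.so.16
```

If a name is printed, listing /usr/lib64 for that library should show which version suffix your system provides.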

Next, clone the Leptonica repository from GitHub with the following commands:

cd ~
git clone https://github.com/DanBloomberg/leptonica.git --depth 1

Switch to your Leptonica directory:

cd leptonica

At this point, you can configure, compile, and install Leptonica with the following commands:

./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp --with-libtiff --with-libpng --disable-dependency-tracking
make
sudo make install
sudo ldconfig

Once Leptonica is installed, download the latest Tesseract OCR release from GitHub. The commands below look up the latest release tag and fetch the matching source archive:

cd ~
VER=$(curl -s https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest | grep tag_name | cut -d '"' -f 4)
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz -O tesseract-5.tar.gz
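
The grep and cut pipeline above depends on how the GitHub API formats its JSON response. As a minimal sketch, the same extraction can be wrapped in a small function that reads JSON from standard input, so it can be tried on sample text before it is used for a download. The function name extract_tag is hypothetical, and the JSON fragment is hand-written rather than real API output.

```shell
# Hypothetical helper: print the value of the first "tag_name" field
# found in the JSON text supplied on standard input.
extract_tag() {
    grep -m 1 '"tag_name"' | cut -d '"' -f 4
}

# Example with a hand-written JSON fragment (not real API output):
printf '%s\n' '{' '  "tag_name": "1.2.3",' '  "name": "example"' '}' | extract_tag
```

With network access, piping the curl command from the step above into extract_tag should set VER in the same way. Checking that VER is not empty before running wget avoids requesting a malformed URL.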

Then, extract your downloaded file:

tar zxvf tesseract-5.tar.gz

Switch to your Tesseract directory on AlmaLinux 8:

cd tesseract-*/

Next, generate the build scripts and configure the Tesseract build with the following commands:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/include/

At this point, you can build and install Tesseract on AlmaLinux 8 with the commands below:

make
sudo make install
sudo ldconfig
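
A quick check can confirm that the build ended up on your PATH. The sketch below assumes the /usr/local prefix used in the configure step; the function name report_tool is hypothetical.

```shell
# Hypothetical helper: report whether a command is available on PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found at %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s: not found on PATH\n' "$1"
    fi
}

report_tool tesseract

# If the binary is present, print its version and the libraries it was built with.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
fi
```

If tesseract is reported as not found, check that /usr/local/bin is included in your PATH.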

When your installation is completed, you can load Tesseract languages.

Load Tesseract Languages on AlmaLinux 8

First, create a directory for the language data in your home directory with the following command:

mkdir -p ~/tess/traineddata

Then, tell Tesseract where to find the language data by adding the following line to ~/.bashrc:

export TESSDATA_PREFIX=/home/$USER/tess/traineddata

Note: You can replace $USER with the exact username on the system.

Now reload the file with the following command:

source ~/.bashrc

At this point, you can add any trained data available in the tessdata repository on GitHub to this directory. For example, to add English and French:

cd $TESSDATA_PREFIX
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata
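
When several languages are needed, the wget calls can be generated from a list of language codes. The following is a minimal sketch that assumes TESSDATA_PREFIX is set as described above and follows the same URL pattern as the wget commands. The function names traineddata_url and fetch_traineddata are hypothetical.

```shell
# Hypothetical helper: build the download URL for one language code,
# following the URL pattern used in the wget commands above.
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# Hypothetical helper: download each listed language into TESSDATA_PREFIX,
# skipping files that are already present. Each file is fetched under a
# temporary name so that a failed download does not leave a broken file.
fetch_traineddata() {
    for lang in "$@"; do
        target="$TESSDATA_PREFIX/$lang.traineddata"
        if [ -e "$target" ]; then
            echo "skip $lang (already present)"
        elif wget -q "$(traineddata_url "$lang")" -O "$target.part"; then
            mv "$target.part" "$target"
        else
            rm -f "$target.part"
            echo "failed to download $lang" >&2
        fi
    done
}

# Show the URLs that would be fetched; no download happens here.
for lang in eng fra ces; do
    traineddata_url "$lang"
done

# To download for real, uncomment:
# fetch_traineddata eng fra ces
```

After downloading, running tesseract --list-langs should show the languages found in the data directory.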

Now let’s see how to use Tesseract OCR.

How To Use Tesseract OCR

Now that Tesseract OCR is installed on AlmaLinux 8, you can start extracting text from scanned documents or images.

To convert an image to a text file, you can use the syntax below:

tesseract <image_name> <output_file_name>

For example:

tesseract image.png new

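The output is a text file named new.txt that contains the text recognized in image.png. Tesseract adds the .txt extension to the output name itself.

You can specify the recognition language with the -l flag. The matching .traineddata file must be present in your TESSDATA_PREFIX directory. For example, to use Czech (ces):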

tesseract image.png new -l ces

You can also specify multiple languages by joining their codes with a plus sign:

tesseract image.png new -l ces+eng
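
To process many images at once, the same command can be run in a loop. The sketch below assumes a directory of PNG files and uses only the tesseract syntax shown above. The function names join_langs, output_base, and ocr_dir, as well as the ./scans directory, are hypothetical.

```shell
# Hypothetical helper: join language codes with "+" for the -l flag.
# The body runs in a subshell so the IFS change does not leak out.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

# Hypothetical helper: strip the file extension to get the output base name,
# since tesseract appends ".txt" itself.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR every PNG in a directory with the given languages.
ocr_dir() {
    dir="$1"
    shift
    langs=$(join_langs "$@")
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue   # no PNG files matched
        tesseract "$img" "$(output_base "$img")" -l "$langs"
    done
}

join_langs ces eng            # prints ces+eng
output_base scans/page1.png   # prints scans/page1

# To process a directory for real, uncomment:
# ocr_dir ./scans ces eng
```

Each image then gets a text file next to it, for example scans/page1.txt for scans/page1.png.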

Conclusion

At this point, you have learned how to set up Tesseract OCR on AlmaLinux 8.

We hope you enjoyed it.

You may also be interested in this article:

How To Set up Tesseract OCR on Debian 11