Install Tesseract OCR on AlmaLinux 9

In this guide, you will learn how to install Tesseract OCR on AlmaLinux 9.

Tesseract is an open-source optical character recognition (OCR) engine and one of the most widely used OCR libraries.

OCR uses artificial intelligence to find and recognize text in images.

Tesseract looks for patterns in pixels, letters, words, and sentences. It uses a two-pass approach called adaptive recognition: the first pass recognizes characters, and the second pass fills in any letters it was unsure of with letters that fit the word or sentence context.

Steps To Install Tesseract OCR on AlmaLinux 9

To complete this guide, you must log in to your server as a non-root user with sudo privileges. To do this, you can follow our guide on Initial Server Setup with AlmaLinux 9.

Set up Tesseract OCR on AlmaLinux 9

In this section, you will install Tesseract on AlmaLinux 9 from source.

First, update your installed packages with the following command:

sudo dnf update -y

Install Required Packages and Dependencies

Then, install the packages required to build Tesseract on AlmaLinux 9:

sudo dnf install git automake make autoconf libtool clang gcc-c++.x86_64 wget -y

Install the Leptonica dependencies with the following command:

sudo dnf install zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel libtiff libtiff-devel libpng libpng-devel -y
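
Before moving on, it may help to confirm that the build tools are actually on your PATH. The helper below is a minimal sketch: missing_tools is a hypothetical name introduced here for illustration, and the command list simply mirrors the packages installed above (assuming gcc-c++ provides g++ and libtool provides libtoolize).

```shell
# missing_tools: print each given command name that is NOT found on PATH.
# (Hypothetical helper for illustration; not part of dnf or Tesseract.)
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands assumed to come from the packages installed above.
missing=$(missing_tools git automake make autoconf libtoolize clang g++ wget)
if [ -n "$missing" ]; then
    printf 'Missing build tools:\n%s\n' "$missing"
else
    echo "All build tools found."
fi
```

If anything is reported as missing, rerun the dnf install command for the matching package before you continue.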

Now copy the shared libraries into /usr/local/lib with the following commands:

cd /usr/local/lib
sudo cp /usr/lib64/libjpeg.so.62 .
sudo cp /usr/lib64/libwebp.so.7 .
sudo cp /usr/lib64/libtiff.so.5 .
sudo cp /usr/lib64/libpng16.so.16 .

Clone Leptonica From Git

Next, clone Leptonica from GitHub with the following commands:

cd ~
git clone https://github.com/DanBloomberg/leptonica.git --depth 1

Switch to your Leptonica directory:

cd leptonica

Compile and Build Leptonica

At this point, you can compile and install Leptonica with the following commands. Only the install step needs sudo:

./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp --with-libtiff --with-libpng --disable-dependency-tracking
make
sudo make install
sudo ldconfig

Download Tesseract OCR on AlmaLinux 9

When the Leptonica installation is complete, you can download the latest Tesseract OCR release from GitHub. To do this, run the commands below:

cd ~
VER=$(curl -s https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest | grep tag_name | cut -d '"' -f 4)
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz -O tesseract-5.tar.gz
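
The grep and cut pipeline in the download step keeps the tag_name line of the API response and then takes its fourth field when the line is split on double quotes. The sketch below applies the same pipeline to a hard-coded sample so you can see what ends up in VER. The sample response is made up and heavily abridged, and extract_tag is a hypothetical name used only for illustration.

```shell
# extract_tag: read GitHub "latest release" JSON on stdin and print the tag.
# Same grep/cut pipeline as the download command, wrapped in a function.
extract_tag() {
    grep tag_name | cut -d '"' -f 4
}

# Made-up, abridged sample; the real API response has many more fields.
sample='{
  "tag_name": "5.3.4",
  "name": "5.3.4"
}'

VER=$(printf '%s\n' "$sample" | extract_tag)
echo "Would download: https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz"
```

If the API call fails, VER ends up empty and the wget URL becomes invalid, so it is worth checking that VER is non-empty (for example with [ -n "$VER" ]) before downloading.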

Then, extract your downloaded file:

tar zxvf tesseract-5.tar.gz

Switch to your Tesseract directory on AlmaLinux 9:

cd tesseract-*/

Compile Tesseract OCR

Now configure the Tesseract OCR build with the following commands:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/include/

Build and Install Tesseract OCR

At this point, you can build and install Tesseract on AlmaLinux 9 with the commands below:

make
sudo make install
sudo ldconfig
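
To confirm that the build landed where you expect, you can ask the binary for its version. This is a minimal sketch that assumes the install prefix /usr/local used above, so the binary should be in /usr/local/bin; tesseract_status is a hypothetical helper name that only wraps tesseract --version.

```shell
# tesseract_status: report whether the tesseract binary is reachable.
# (Hypothetical helper name; it only wraps `tesseract --version`.)
tesseract_status() {
    if command -v tesseract >/dev/null 2>&1; then
        # Print only the first line, which names the installed version
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract: not found on PATH"
        return 1
    fi
}

tesseract_status || echo "Check that /usr/local/bin is on your PATH."
```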

When your installation is completed, you can load Tesseract languages.

Load Tesseract Languages

First, create a directory for the language data with the following command:

mkdir -p /home/$USER/tess/traineddata

Then, export the Tesseract data path by adding the line below to ~/.bashrc:

export TESSDATA_PREFIX=/home/$USER/tess/traineddata

Note: You can replace $USER with the exact username on the system.

Now source the profile with the following command:

source ~/.bashrc

At this point, you can add any trained data available in the tessdata repository on GitHub to this path:

cd "$TESSDATA_PREFIX"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata
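
If you need more than a couple of languages, the same download can be wrapped in a small loop. The following is a sketch under stated assumptions: traineddata_url and fetch_langs are hypothetical helper names, the URL pattern is the one used in the wget commands above, and TESSDATA_PREFIX is expected to be set as described earlier.

```shell
# traineddata_url: build the download URL for a language code such as "eng".
# Uses the same tessdata repository path as the wget commands above.
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# fetch_langs: download each requested language into TESSDATA_PREFIX,
# skipping files that are already present. (Hypothetical helper.)
fetch_langs() {
    local lang
    for lang in "$@"; do
        if [ -f "$TESSDATA_PREFIX/$lang.traineddata" ]; then
            echo "$lang: already present"
        else
            wget -q -P "$TESSDATA_PREFIX" "$(traineddata_url "$lang")"
        fi
    done
}

# Example call (commented out because it needs network access):
# fetch_langs eng fra ces
traineddata_url ces
```

Skipping files that already exist keeps repeated runs from downloading the same data again.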

Now let’s see how to use Tesseract OCR.

How To Use Tesseract OCR on AlmaLinux 9

Once Tesseract OCR is installed on AlmaLinux 9, you can start extracting text from scanned documents or images.

To convert an image to a text file, you can use the syntax below:

tesseract <image_name> <output_file_name>

For example:

tesseract image.png new

This writes the text recognized in image.png to a text file named new.txt. Tesseract adds the .txt extension to the output name for you.
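
To process a whole folder of images, you can loop over the files and derive each output name from the image name. This is only a sketch: output_base and ocr_dir are hypothetical helper names, and it assumes the images are PNG files and that the text files should go into the current directory.

```shell
# output_base: derive the output name from an image path by dropping
# the directory and the final extension, e.g. scans/page1.png -> page1.
output_base() {
    local name
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# ocr_dir: run tesseract on every PNG in a directory, writing
# <name>.txt into the current directory. (Hypothetical helper.)
ocr_dir() {
    local img
    for img in "$1"/*.png; do
        [ -e "$img" ] || continue      # the directory had no PNG files
        tesseract "$img" "$(output_base "$img")"
    done
}

# Example call (requires tesseract and a directory of images):
# ocr_dir ./scans
```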

When using Tesseract OCR, you can specify the language with the -l flag. For example, to use Czech (the ces.traineddata file must be present in your TESSDATA_PREFIX directory):

tesseract image.png new -l ces

You can specify multiple languages as well.

tesseract image.png new -l ces+eng
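
Each language named after -l needs a matching .traineddata file, so a quick check before running OCR can save a confusing error. The helper below is a minimal sketch; missing_langs is a hypothetical name, and it assumes the language files sit directly in the directory you pass in (for this guide, $TESSDATA_PREFIX).

```shell
# missing_langs: given a -l style spec such as "ces+eng" and a tessdata
# directory, print each language that has no .traineddata file there.
missing_langs() {
    local spec="$1" dir="$2" lang
    # Split the spec on "+" by turning the separators into newlines
    printf '%s\n' "$spec" | tr '+' '\n' | while IFS= read -r lang; do
        [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
    done
}

# Example call:
# missing_langs ces+eng "$TESSDATA_PREFIX"
```

Tesseract can also report the languages it finds itself with tesseract --list-langs.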

Conclusion

At this point, you have learned how to install Tesseract OCR on AlmaLinux 9.

Hope you enjoy it.