Unlocking the Power of PaddleOCR
An Introduction to Text Detection and Recognition
Optical Character Recognition (OCR) is a powerful technology that enables machines to recognize and extract text from images or scanned documents. OCR finds applications in various fields, including document digitization, text extraction from images, and text-based data analysis. In this article, we will explore how to use PaddleOCR, an advanced OCR toolkit based on deep learning, for text detection and recognition tasks. We will walk through a code snippet that demonstrates the process step-by-step.
Prerequisites
Before we dive into the code, let’s ensure we have everything set up to run the PaddleOCR library. Make sure you have the following prerequisites installed on your machine:
- Python (3.6 or higher)
- PaddleOCR library
- Other necessary dependencies (e.g., NumPy and pandas)
You can install PaddleOCR using the following pip command:
pip install paddleocr
Setting up PaddleOCR
Once you have Python and the required libraries installed, let’s set up PaddleOCR. You can use PaddleOCR’s pre-trained models, which are available for text detection and recognition.
Code Overview
The code snippet for text detection and recognition using PaddleOCR consists of the following main components:
- Image Preprocessing: Load the input image and perform any necessary preprocessing steps, such as resizing or normalization.
- Text Detection: Utilize the PaddleOCR text detection model to locate bounding boxes around the text regions in the input image.
- Text Recognition: For each detected bounding box, use the PaddleOCR text recognition model to extract the corresponding text.
- Post-processing: Organize the detected text and recognition results for further analysis or display.
Step-by-Step Implementation
Let’s break down the code snippet and explain each step in detail:
1. Text Detection
The code is part of a class named DecMain, which is designed for Optical Character Recognition (OCR) evaluation using ground truth data. It uses PaddleOCR to extract text from images and then calculates metrics like precision, recall, and Character Error Rate (CER) to evaluate the performance of the OCR system.
class DecMain:
    def __init__(self, image_folder_path, label_file_path, output_file):
        self.image_folder_path = image_folder_path
        self.label_file_path = label_file_path
        self.output_file = output_file

    def run_dec(self):
        # Check and update the ground truth file
        CheckAndUpdateGroundTruth(self.label_file_path).check_and_update_ground_truth_file()
        df = OcrToDf(image_folder=self.image_folder_path, label_file=self.label_file_path,
                     det=True, rec=True, cls=False).ocr_to_df()
        ground_truth_data = ReadGroundTruthFile(self.label_file_path).read_ground_truth_file()
        # Get the extracted text as a list of dictionaries (representing the OCR results)
        ocr_results = df.to_dict(orient="records")
        # Calculate precision, recall, and CER
        precision, recall, total_samples = CalculateMetrics(ground_truth_data, ocr_results).calculate_precision_recall()
        CreateSheet(dataframe=df, precision=precision, recall=recall, total_samples=total_samples,
                    file_name=self.output_file).create_sheet()
Let's break down the code and explain each part:
class DecMain:
    def __init__(self, image_folder_path, label_file_path, output_file):
        self.image_folder_path = image_folder_path
        self.label_file_path = label_file_path
        self.output_file = output_file
- The DecMain class has an __init__ method that initializes the object with the following parameters:
  - image_folder_path: The path to the folder containing the input images for OCR.
  - label_file_path: The path to the ground truth label file that contains the actual text content of the images.
  - output_file: The filename of the output file where the evaluation results will be saved.
    def run_dec(self):
        # Check and update the ground truth file
        CheckAndUpdateGroundTruth(self.label_file_path).check_and_update_ground_truth_file()
- The run_dec method is responsible for running the OCR evaluation process. It first checks and updates the ground truth file using the CheckAndUpdateGroundTruth class.
df = OcrToDf(image_folder=self.image_folder_path, label_file=self.label_file_path, det=True, rec=True, cls=False).ocr_to_df()
- The OcrToDf class converts the OCR results into a pandas DataFrame (df). It takes the following parameters:
  - image_folder: The path to the folder containing the input images for OCR.
  - label_file: The path to the ground truth label file.
  - The parameters det=True and rec=True indicate that both text detection and recognition results will be included in the DataFrame, while cls=False disables the text-angle classifier.
ground_truth_data = ReadGroundTruthFile(self.label_file_path).read_ground_truth_file()
- The ReadGroundTruthFile class reads the ground truth label file and loads its contents into the ground_truth_data variable.
# Get the extracted text as a list of dictionaries (representing the OCR results)
ocr_results = df.to_dict(orient="records")
- The OCR results in the DataFrame df are converted to a list of dictionaries (ocr_results), with each dictionary representing the OCR result for a single image.
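To make the to_dict(orient="records") step concrete, here is a small, self-contained sketch with made-up OCR rows. The column names are illustrative, not necessarily the ones OcrToDf produces:

```python
import pandas as pd

# Hypothetical OCR rows; the real OcrToDf output may use different columns.
df = pd.DataFrame([
    {"image": "img_001.jpg", "transcription": "215mm 18", "confidence": 0.98},
    {"image": "img_002.jpg", "transcription": "XZE SA", "confidence": 0.95},
])

# orient="records" yields one dict per row -- the shape the metrics step consumes.
ocr_results = df.to_dict(orient="records")
print(ocr_results[0]["transcription"])  # -> 215mm 18
```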
# Calculate precision, recall, and CER
precision, recall, total_samples = CalculateMetrics(ground_truth_data, ocr_results).calculate_precision_recall()
- The CalculateMetrics class calculates the OCR evaluation metrics: precision, recall, and the total number of samples evaluated. It takes the ground truth data and OCR results as inputs.
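The source of CalculateMetrics is not shown here, but a simplified, exact-match version of precision and recall over OCR results might look like the sketch below. It assumes (purely for illustration) that ground truth and predictions are dicts mapping image names to transcriptions:

```python
# Simplified precision/recall over exact text matches (illustrative only,
# not the actual CalculateMetrics implementation).
def calculate_precision_recall(ground_truth, predictions):
    # ground_truth / predictions: dicts mapping image name -> transcription
    true_positives = sum(
        1 for name, text in predictions.items()
        if ground_truth.get(name) == text
    )
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall, len(ground_truth)

gt = {"a.jpg": "215mm 18", "b.jpg": "XZE SA", "c.jpg": "LOT 42"}
pred = {"a.jpg": "215mm 18", "b.jpg": "XZE 5A"}
print(calculate_precision_recall(gt, pred))  # -> (0.5, 0.3333333333333333, 3)
```

Here one of the two predictions matches exactly (precision 0.5), and one of the three ground truth entries is recovered (recall 1/3).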
        CreateSheet(dataframe=df, precision=precision, recall=recall, total_samples=total_samples,
                    file_name=self.output_file).create_sheet()
- The CreateSheet class creates an output sheet (e.g., Excel or CSV) with the evaluation metrics and OCR results. It takes the DataFrame df, precision, recall, total samples, and the output filename as inputs.
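CreateSheet's implementation is not reproduced here, but writing a summary row plus per-image rows to CSV with the standard library conveys the general idea. The column names are illustrative assumptions:

```python
import csv
import io

def create_sheet(rows, precision, recall, total_samples, file_obj):
    # Write a metrics summary, a blank spacer row, then the per-image OCR rows.
    writer = csv.writer(file_obj)
    writer.writerow(["precision", "recall", "total_samples"])
    writer.writerow([precision, recall, total_samples])
    writer.writerow([])
    writer.writerow(["image", "transcription"])
    for row in rows:
        writer.writerow([row["image"], row["transcription"]])

buf = io.StringIO()
create_sheet([{"image": "a.jpg", "transcription": "215mm 18"}], 0.5, 0.33, 3, buf)
print(buf.getvalue().splitlines()[0])  # -> precision,recall,total_samples
```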
Overall, the DecMain class provides a structured way to evaluate OCR performance using ground truth data and PaddleOCR's text detection and recognition capabilities. It calculates important evaluation metrics and stores the results in a specified output file for further analysis.
Note: Format of the Ground Truth Label File
To perform OCR evaluation using the DecMain class and the provided code, it's crucial to format the ground truth label file correctly. Each line of the file represents one image's OCR ground truth: the image filename, followed by a tab, followed by a JSON-style list of annotation objects, as shown below:
image_name.jpg	[{"transcription": "215mm 18", "points": [[199, 6], [357, 6], [357, 33], [199, 33]], "difficult": False, "key_cls": "digits"}, {"transcription": "XZE SA", "points": [[15, 6], [140, 6], [140, 36], [15, 36]], "difficult": False, "key_cls": "text"}]
Each annotation object has the following keys:
- "transcription": The ground truth text transcription of the text region.
- "points": A list of four points representing the bounding box coordinates of the text region in the image.
- "difficult": A boolean value indicating whether the text region is difficult to recognize.
- "key_cls": The class label of the annotation, e.g., "digits" or "text".
Make sure to follow this format while creating the ground truth label file for accurate OCR evaluation.
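Parsing one of these lines is straightforward with the standard library. Note that the sample above uses Python-style literals (False rather than JSON's false), so ast.literal_eval is a safer choice than json.loads for files written this way:

```python
import ast

# One line of the label file: "<image name>\t<list of annotation dicts>".
# The sample uses Python-style literals (False), which json.loads would
# reject -- ast.literal_eval handles them.
line = ('image_name.jpg\t[{"transcription": "215mm 18", '
        '"points": [[199, 6], [357, 6], [357, 33], [199, 33]], '
        '"difficult": False, "key_cls": "digits"}]')

name, raw = line.split("\t", 1)
annotations = ast.literal_eval(raw)
print(name, annotations[0]["transcription"])  # -> image_name.jpg 215mm 18
```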
If you’re eager to explore the full implementation of the OCR evaluation using PaddleOCR, you’re in luck! I’ve made the entire code available on my public Git repository; you can access it here. The repository contains the DecMain class along with the other necessary classes that enable you to perform OCR, calculate evaluation metrics, and generate output sheets. Feel free to clone the repository, try out the code with your own data, and even contribute to its improvement!
2. Text Recognition
The code defines a class named RecMain, which is designed to run text recognition (OCR) using a pre-trained OCR model on a folder of images and generate an evaluation Excel sheet.
class RecMain:
    def __init__(self, image_folder, rec_file, output_file):
        self.image_folder = image_folder
        self.rec_file = rec_file
        self.output_file = output_file

    def run_rec(self):
        image_paths = GetImagePathsFromFolder(self.image_folder, self.rec_file). \
            get_image_paths_from_folder()
        ocr_model = LoadRecModel().load_model()
        results = ProcessImages(ocr=ocr_model, image_paths=image_paths).process_images()
        ground_truth_data = ConvertTextToDict(self.rec_file).convert_txt_to_dict()
        model_predictions, ground_truth_texts, image_names, precision, recall, \
            overall_model_precision, overall_model_recall, cer_data_list = \
            EvaluateRecModel(results, ground_truth_data).evaluate_model()
        # Create Excel sheet
        CreateMetricExcel(image_names, model_predictions, ground_truth_texts,
                          precision, recall, cer_data_list, overall_model_precision,
                          overall_model_recall, self.output_file).create_excel_sheet()
Let's break down the code and explain each part:
class RecMain:
    def __init__(self, image_folder, rec_file, output_file):
        self.image_folder = image_folder
        self.rec_file = rec_file
        self.output_file = output_file
- The RecMain class has an __init__ method that initializes the object with the following parameters:
  - image_folder: The path to the folder containing the input images for text recognition.
  - rec_file: The path to the ground truth label file that contains the actual text content of the images.
  - output_file: The filename of the output Excel sheet where the evaluation results will be saved.
    def run_rec(self):
        image_paths = GetImagePathsFromFolder(self.image_folder, self.rec_file).get_image_paths_from_folder()
- The run_rec method is responsible for running the text recognition process. It first uses the GetImagePathsFromFolder class to get a list of image paths within the specified image_folder. This step ensures that the OCR model will process all images within the given directory.
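A minimal stand-in for that path-collection step, using only the standard library, might look like this (the real class may also filter against the rec file, which this sketch omits):

```python
from pathlib import Path

# Illustrative stand-in for GetImagePathsFromFolder: collect files with
# common image extensions from a folder, sorted for reproducible order.
def get_image_paths_from_folder(folder):
    extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.suffix.lower() in extensions
    )
```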
ocr_model = LoadRecModel().load_model()
- The LoadRecModel class loads the pre-trained OCR model for text recognition. It may use PaddleOCR or any other OCR library to load the model.
results = ProcessImages(ocr=ocr_model, image_paths=image_paths).process_images()
- The ProcessImages class processes the images using the loaded OCR model. It takes the OCR model (ocr_model) and the list of image paths (image_paths) as inputs.
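Conceptually, ProcessImages just maps the OCR engine over every image path. The sketch below shows that shape with a stub engine standing in for a real PaddleOCR instance (the stub and its fixed prediction are purely illustrative):

```python
# Illustrative version of ProcessImages: run the engine on each path and
# collect (path, result) pairs.
def process_images(ocr, image_paths):
    return [(path, ocr.ocr(path)) for path in image_paths]

# Stub standing in for a real PaddleOCR instance, whose ocr() method
# would return detected text for the given image.
class StubOcr:
    def ocr(self, path):
        return [("dummy text", 0.99)]  # fixed fake prediction

results = process_images(StubOcr(), ["a.jpg", "b.jpg"])
print(len(results))  # -> 2
```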
ground_truth_data = ConvertTextToDict(self.rec_file).convert_txt_to_dict()
- The ConvertTextToDict class reads the ground truth label file and converts it into a dictionary (ground_truth_data). This conversion prepares the ground truth data for comparison with the OCR model predictions.
        model_predictions, ground_truth_texts, image_names, precision, recall, \
            overall_model_precision, overall_model_recall, cer_data_list = \
            EvaluateRecModel(results, ground_truth_data).evaluate_model()
- The EvaluateRecModel class compares the OCR model predictions with the ground truth data and calculates evaluation metrics such as precision, recall, and Character Error Rate (CER). It takes the OCR model predictions (results) and the ground truth data (ground_truth_data) as inputs.
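CER is commonly defined as the Levenshtein (edit) distance between the prediction and the ground truth, divided by the ground truth length. A standard-library sketch of that definition (not necessarily the repository's exact implementation):

```python
# Character Error Rate via Levenshtein distance (illustrative; the
# repository's EvaluateRecModel may compute it differently).
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction, ground_truth):
    if not ground_truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, ground_truth) / len(ground_truth)

print(cer("XZE 5A", "XZE SA"))  # one substitution over six characters
```

A CER of 0 means a perfect transcription; values above 1 are possible when the prediction needs more edits than the ground truth has characters.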
        # Create Excel sheet
        CreateMetricExcel(image_names, model_predictions, ground_truth_texts,
                          precision, recall, cer_data_list, overall_model_precision,
                          overall_model_recall, self.output_file).create_excel_sheet()
- The CreateMetricExcel class creates an output Excel sheet with the evaluation metrics and OCR results. It takes various input data, including image names, model predictions, ground truth texts, evaluation metrics, and the output filename (self.output_file).
Overall, the RecMain class orchestrates the entire text recognition process, from loading the OCR model to generating the evaluation Excel sheet with detailed metrics. It provides an organized and reusable way to evaluate the performance of an OCR model on a given set of images with ground truth data.
Note: Format of the Ground Truth Text File
To perform OCR evaluation using the RecMain class and the provided code, it's essential to format the ground truth (GT) text file correctly. Each line of the file represents one image's GT text: the filename of the image, followed by a tab character (\t), and then the GT text for that image, as shown below:
image_name.jpg	text
Ensure that the GT text file contains an entry for every image present in the image folder specified in the RecMain class, and that the GT text matches the actual text content of the images. This format is necessary for accurate evaluation of the OCR model's performance.
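Parsing this tab-separated format into a dictionary, as ConvertTextToDict does, can be sketched with the standard library alone (the function below is an illustrative stand-in, not the repository's code):

```python
# Illustrative stand-in for ConvertTextToDict: parse "filename\ttext" lines
# into a dict mapping image name -> ground truth text.
def convert_txt_to_dict(lines):
    ground_truth = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, text = line.split("\t", 1)
        ground_truth[name] = text
    return ground_truth

sample = ["img_001.jpg\t215mm 18", "img_002.jpg\tXZE SA"]
print(convert_txt_to_dict(sample)["img_001.jpg"])  # -> 215mm 18
```

Splitting on the first tab only (maxsplit=1) keeps any tabs inside the GT text itself intact.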
If you’re eager to explore the full implementation of the OCR evaluation using PaddleOCR, you’re in luck! I’ve made the entire code available on my public Git repository; you can access it here. The repository contains the RecMain class along with the other necessary classes that enable you to perform OCR, calculate evaluation metrics, and generate output sheets. Feel free to clone the repository, try out the code with your own data, and even contribute to its improvement!
Conclusion
In this article, we explored the process of text detection and recognition using PaddleOCR, an advanced OCR toolkit based on deep learning. We walked through a code snippet that demonstrates the step-by-step implementation of text detection and recognition. With PaddleOCR’s powerful pre-trained models and easy-to-use API, performing OCR on images has never been easier.
Now it’s your turn to try out the code snippet and experiment with different images or text recognition scenarios.
You can find the source code here.