Training an OCR Model for Russian License Plate Recognition on PyTorch

TL;DR

The task is recognizing text from Russian license plate images (cropped plate regions).

CRNN architecture (CNN + BiLSTM + CTC) — a classic lightweight OCR approach, ~9M parameters.

Dataset: 37,775 training and 4,891 validation images. Alphabet — 22 characters (digits + Cyrillic).

Mixed Precision (AMP) on NVIDIA RTX 5090 provides ~2x training speedup.

Result: plate accuracy 95%+ after 50–100 epochs with warmup and cosine annealing.

Problem statement

Recognizing text from Russian license plate images is an OCR (Optical Character Recognition) task, formulated here as sequence-to-sequence prediction: the input is a fixed-size plate image, the output is a character sequence.

Alphabet

22 characters: 0–9 (digits) and АВЕКМНОРСТУХ (Cyrillic with Latin visual analogs).

Input format

Grayscale image 1×32×256 (1 channel, height 32px, width 256px).

Output format

Character sequence (8 or 9 characters for Russian plates).

Approach

CRNN + CTC — training without character-level annotation, knowing only the final plate text.

Where it's used: as part of the Video Analytics Platform for real-time license plate recognition in video streams.

Model architecture

Image (1×32×256)
    ↓
[CNN Backbone] — Feature Extraction
    ↓
Tensor (B, 512, 1, 63) → (B, 63, 512)
    ↓
[BiLSTM x2] — Sequence Modeling
    ↓
[Linear 512→23] — Classification
    ↓
[CTC Decoding] — Greedy Decode
    ↓
Plate text

CRNN (Shi et al., 2015) is a classic OCR architecture. Three blocks: CNN extracts features, BiLSTM models the sequence, CTC trains without character-level labels.

CNN backbone (Feature Extraction)

A convolutional neural network extracts feature sequences from the image. The architecture consists of 5 blocks with BatchNorm after each convolutional layer.

Key design decisions:

1) BatchNorm after each layer — stabilizes training and allows higher learning rates.

2) MaxPool(2,1) in blocks 3 and 4 — reduces height only, preserving width (horizontal resolution is critical for character sequences).

3) Final Conv with kernel=2 and no padding — collapses height to 1, preparing data for RNN.

The resulting tensor (B, 512, 1, 63) is reshaped to (B, 63, 512) — 63 time steps, each with 512 features.

Block	Layers	Output (H×W)
Block 1	Conv(1→64)+BN+ReLU, Conv(64→64)+BN+ReLU, MaxPool(2×2)	16×128
Block 2	Conv(64→128)+BN+ReLU, Conv(128→128)+BN+ReLU, MaxPool(2×2)	8×64
Block 3	Conv(128→256)+BN+ReLU, Conv(256→256)+BN+ReLU, MaxPool(2,1)	4×64
Block 4	Conv(256→512)+BN+ReLU, Conv(512→512)+BN+ReLU, MaxPool(2,1)	2×64
Block 5	Conv(512→512, kernel=2)+BN+ReLU	1×63
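
The table above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the author's code: the 3×3 kernels with padding 1 in blocks 1 to 4, the convolution biases, and the names `conv_bn_relu` and `Backbone` are assumptions chosen so that the output shapes match the table.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch: int, out_ch: int, kernel: int = 3, padding: int = 1) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU. Kernel 3 with padding 1 is an assumption."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, padding=padding),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class Backbone(nn.Module):
    """Hypothetical CNN backbone following the block table."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 32x256 -> 16x128
            conv_bn_relu(1, 64), conv_bn_relu(64, 64), nn.MaxPool2d(2, 2),
            # Block 2: 16x128 -> 8x64
            conv_bn_relu(64, 128), conv_bn_relu(128, 128), nn.MaxPool2d(2, 2),
            # Block 3: 8x64 -> 4x64 (height halved, width preserved)
            conv_bn_relu(128, 256), conv_bn_relu(256, 256), nn.MaxPool2d((2, 1), (2, 1)),
            # Block 4: 4x64 -> 2x64 (height halved, width preserved)
            conv_bn_relu(256, 512), conv_bn_relu(512, 512), nn.MaxPool2d((2, 1), (2, 1)),
            # Block 5: kernel 2, no padding: 2x64 -> 1x63
            conv_bn_relu(512, 512, kernel=2, padding=0),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x)        # (B, 512, 1, 63)
        feats = feats.squeeze(2)        # (B, 512, 63): drop the height axis
        return feats.permute(0, 2, 1)   # (B, 63, 512): 63 time steps, 512 features


if __name__ == "__main__":
    batch = torch.zeros(2, 1, 32, 256)
    print(Backbone()(batch).shape)
```

Under these assumptions the backbone has about 5.7M parameters, which leaves roughly 3.2M for the BiLSTM and the classifier out of the ~8.9M total.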

BiLSTM (Sequence Modeling)

A bidirectional LSTM processes the feature sequence, capturing context in both directions. This is critical for understanding character boundaries.

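import torch.nn as nn

# Two stacked bidirectional LSTM layers over the 63-step feature sequence.
rnn = nn.LSTM(
    input_size=512,      # features per time step from the CNN
    hidden_size=256,     # per direction; both directions are concatenated to 512
    num_layers=2,
    bidirectional=True,
    dropout=0.2,         # applied between the two layers
    batch_first=True,    # assumption: input is (B, 63, 512), as in the diagram
)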

Bidirectional means each time step receives context from both the left and the right; the two 256-dimensional directional states are concatenated into a 512-dimensional output, which matches the Linear 512→23 classifier. Two layers provide sufficient depth, and dropout of 0.2 between them adds regularization.
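
To make the shapes concrete, here is a minimal sketch of the sequence part: the BiLSTM followed by the per-step linear classifier. The class name `SequenceHead` and `batch_first=True` are assumptions; the layer sizes come from the configuration above.

```python
import torch
import torch.nn as nn


class SequenceHead(nn.Module):
    """Hypothetical BiLSTM x2 followed by a per-time-step linear classifier."""

    def __init__(self, num_classes: int = 23) -> None:
        super().__init__()
        self.rnn = nn.LSTM(
            input_size=512,
            hidden_size=256,
            num_layers=2,
            bidirectional=True,
            dropout=0.2,
            batch_first=True,  # assumption: features arrive as (B, T, C)
        )
        # Forward and backward states are concatenated: 2 * 256 = 512.
        self.fc = nn.Linear(512, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(feats)   # (B, 63, 512)
        return self.fc(out)        # (B, 63, 23): class scores per time step


if __name__ == "__main__":
    feats = torch.randn(2, 63, 512)   # stand-in for CNN features
    print(SequenceHead()(feats).shape)
```

With these sizes the head holds about 3.2M parameters, almost all of them in the LSTM.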

CTC Loss and decoding

CTC (Connectionist Temporal Classification) is the key component for training without character-level annotation. We only know the final plate text, not which pixels correspond to which characters.

A blank token (index 0) is introduced for "no character". The model predicts a distribution over all 23 classes (22 chars + blank) for each of the 63 time steps.
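
A small sketch of how plate text could be mapped to class indices with the blank at index 0. The order of characters inside `ALPHABET` and the helper names `encode` and `decode_indices` are assumptions; only the character set and the blank index come from the text.

```python
# 10 digits + 12 Cyrillic letters = 22 characters (the order is an assumption).
ALPHABET = "0123456789АВЕКМНОРСТУХ"
BLANK = 0  # index 0 is reserved for the CTC blank token

# Real characters start at index 1, so the model has 23 output classes.
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}


def encode(text: str) -> list[int]:
    """Convert plate text to a list of class indices (raises KeyError on unknown characters)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode_indices(indices: list[int]) -> str:
    """Convert class indices (without blanks) back to text."""
    return "".join(IDX_TO_CHAR[i] for i in indices)


if __name__ == "__main__":
    plate = "А001АА50"  # Cyrillic letters
    print(encode(plate))
    print(decode_indices(encode(plate)))
```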

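CTC loss sums the probabilities of all alignment paths that collapse to the target text. zero_infinity=True replaces infinite losses and their gradients with zero; such losses arise mainly when a target cannot be aligned to the available time steps, and zeroing them keeps a single bad sample from destabilizing training.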
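
The loss call can be sketched with `torch.nn.CTCLoss`. The random tensors below are stand-ins for real model outputs and labels, and the second half illustrates what `zero_infinity` does on a deliberately infeasible sample; the variable names are illustrative.

```python
import torch
import torch.nn as nn

T, B, C = 63, 4, 23                      # time steps, batch size, classes (22 chars + blank)
logits = torch.randn(B, T, C)            # stand-in for the model output

# nn.CTCLoss expects log-probabilities shaped (T, B, C).
log_probs = logits.log_softmax(dim=2).permute(1, 0, 2)

targets = torch.randint(1, C, (B, 9))                    # padded labels; 0 is reserved for blank
target_lengths = torch.tensor([8, 9, 8, 9])              # real length of each label
input_lengths = torch.full((B,), T, dtype=torch.long)    # every sample uses all 63 steps

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

# Infeasible case: 5 labels cannot be aligned to only 2 time steps.
short_log_probs = torch.randn(2, 1, C).log_softmax(dim=2)
long_target = torch.tensor([[1, 2, 3, 4, 5]])
args = (short_log_probs, long_target, torch.tensor([2]), torch.tensor([5]))
raw_loss = nn.CTCLoss(blank=0, reduction="sum")(*args)
safe_loss = nn.CTCLoss(blank=0, reduction="sum", zero_infinity=True)(*args)

if __name__ == "__main__":
    print(loss.item(), raw_loss.item(), safe_loss.item())
```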

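Decoding (greedy): for each time step, the highest-probability class is selected; consecutive duplicates are then collapsed, and finally blanks are removed. The order matters: a blank between two identical characters is what lets the decoder keep both (as in the "00" of a plate number).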
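
Greedy decoding is short enough to sketch in plain Python. The function names and the alphabet order are assumptions; the collapse-then-drop-blanks logic is the standard CTC rule described above.

```python
from itertools import groupby

ALPHABET = "0123456789АВЕКМНОРСТУХ"   # order is an assumption
IDX_TO_CHAR = {i + 1: ch for i, ch in enumerate(ALPHABET)}
BLANK = 0


def greedy_path(scores: list[list[float]]) -> list[int]:
    """Pick the highest-scoring class index at every time step."""
    return [max(range(len(step)), key=step.__getitem__) for step in scores]


def ctc_greedy_decode(path: list[int], idx_to_char: dict[int, str], blank: int = BLANK) -> str:
    """Collapse consecutive duplicates first, then drop blanks."""
    collapsed = [idx for idx, _ in groupby(path)]
    return "".join(idx_to_char[i] for i in collapsed if i != blank)


if __name__ == "__main__":
    # The blank between the two runs of "1" (character "0") keeps both zeros.
    path = [0, 11, 11, 0, 1, 1, 0, 1, 2, 2, 0]
    print(ctc_greedy_decode(path, IDX_TO_CHAR))
```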

Data preparation

Dataset: 37,775 training and 4,891 validation images (~88.5% / 11.5%). Labels are extracted from filenames (e.g., A001AA50.png → A001AA50).
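
Label extraction from filenames might look like the sketch below. The function name `label_from_path` is hypothetical, and the Latin-to-Cyrillic normalization is an assumption (the text does not say which script the filenames use); the 8 or 9 character check comes from the output format described earlier.

```python
from pathlib import Path

LATIN = "ABEKMHOPCTYX"
CYRILLIC = "АВЕКМНОРСТУХ"
# Assumption: filenames may use Latin look-alikes, so map them to the Cyrillic alphabet.
TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)
VALID_CHARS = set("0123456789" + CYRILLIC)


def label_from_path(path: str) -> str:
    """Return the plate text encoded in a filename such as A001AA50.png."""
    label = Path(path).stem.upper().translate(TO_CYRILLIC)
    if not 8 <= len(label) <= 9:
        raise ValueError(f"unexpected label length in {path!r}: {label!r}")
    if not set(label) <= VALID_CHARS:
        raise ValueError(f"unexpected characters in {path!r}: {label!r}")
    return label


if __name__ == "__main__":
    print(label_from_path("data/train/A001AA50.png"))
```

Validating labels at load time catches mislabeled files before they reach the CTC loss.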

Preprocessing: images are loaded with OpenCV and converted from BGR to RGB, reduced to grayscale (1 channel), resized to 32×256 (height × width), and normalized to the [-1, 1] range.

Augmentations (train only)

To improve robustness to real-world capture conditions:

Augmentation	Probability	Purpose
Gaussian Noise	40%	Simulating camera sensor noise
Brightness/Contrast	40%	Varying lighting conditions
Gaussian Blur	25%	Defocus, motion blur

Training process

Adam optimizer with learning rate 1e-4. Warmup + Cosine Annealing: first 5 epochs LR ramps from 0 to 1e-4 (smooth start), then cosine decay to 1e-12.
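
The schedule can be written as a plain function of the epoch. This sketch assumes a linear warmup and per-epoch updates (the text does not specify either); the function name `lr_at_epoch` is hypothetical, and the constants come from the hyperparameters in this article.

```python
import math


def lr_at_epoch(
    epoch: int,
    total_epochs: int = 200,
    warmup_epochs: int = 5,
    base_lr: float = 1e-4,
    min_lr: float = 1e-12,
) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 1, 5, 50, 100, 200):
        print(epoch, lr_at_epoch(epoch))
```

In PyTorch the same curve could be plugged into `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base LR by the value the lambda returns, so the lambda would return `lr_at_epoch(epoch) / base_lr`.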

Mixed Precision (AMP): forward pass in float16, GradScaler prevents gradient underflow. On RTX 5090 — ~2x speedup and larger batch size.

Gradient Clipping by norm max_norm=5.0 — prevents gradient explosion in LSTM, especially during early training.

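The model is saved whenever validation loss (not training loss) improves, so the checkpoint that is kept is the one that generalizes best rather than the one that fits the training set best.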
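
The checkpoint rule reduces to tracking the best validation loss seen so far. A minimal sketch, with the hypothetical helper `BestValTracker`; the file name in the comment is also an assumption.

```python
class BestValTracker:
    """Remember the lowest validation loss and report when it improves."""

    def __init__(self) -> None:
        self.best = float("inf")

    def improved(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


if __name__ == "__main__":
    tracker = BestValTracker()
    for epoch, val_loss in enumerate([1.00, 0.80, 0.90, 0.70]):
        if tracker.improved(val_loss):
            # In a real loop: torch.save(model.state_dict(), "best.pt")
            print(f"epoch {epoch}: new best val loss {val_loss}")
```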

Hyperparameters

Parameter	Value
Batch size	64
Epochs	200
Learning rate	1e-4
Warmup epochs	5
Gradient clip	5.0
LSTM layers / hidden	2 / 256
Model parameters	~8.9M

Metrics and results

Plate Accuracy

Share of plates recognized completely without errors — the main metric. 95%+ after 50–100 epochs.

Character Accuracy

Share of individual characters recognized correctly. 90%+ as early as epoch 10–20.

CTC Loss

Loss function for optimization. Drops rapidly in early epochs, plateaus around epoch 100.
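
The two accuracy metrics above can be computed as in the sketch below. Plate accuracy is an exact string match. The article does not define character accuracy precisely, so the position-wise comparison here is an assumption (an edit-distance based definition is another common choice).

```python
def plate_accuracy(preds: list[str], targets: list[str]) -> float:
    """Share of plates whose predicted text matches the target exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def char_accuracy(preds: list[str], targets: list[str]) -> float:
    """Share of characters that match position by position.

    Missing or extra characters count as errors because the denominator
    uses the longer of the two strings.
    """
    correct = sum(pc == tc for p, t in zip(preds, targets) for pc, tc in zip(p, t))
    total = sum(max(len(p), len(t)) for p, t in zip(preds, targets))
    return correct / total


if __name__ == "__main__":
    preds = ["A001AA50", "B123CC777"]
    targets = ["A001AA50", "B123CC177"]
    print(plate_accuracy(preds, targets), char_accuracy(preds, targets))
```

The example shows why plate accuracy is the stricter metric: one wrong character out of 17 already costs half of the plate accuracy on two plates.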

Training dynamics

In the early epochs, CTC loss drops fast. By epochs 10–20, character accuracy reaches 90%+. By epochs 50–100, plate accuracy reaches 95%+. Cosine LR decay ensures smooth convergence.

Comparison with alternatives

Two approaches were tested during development:

	PyTorch CRNN (v2)	TensorFlow EfficientNetV2L
Framework	PyTorch 2.11	TensorFlow 2.21
Backbone	Custom CNN (~9M params)	EfficientNetV2L (~120M params)
GPU on Windows	CUDA 12.8 works	CPU only (no native Windows GPU support)
Input	Grayscale 32×256	RGB 200×50

Why PyTorch: TensorFlow 2.10 was the last release with native Windows GPU support; from 2.11 onward, GPU on Windows requires WSL2. PyTorch with CUDA 12.8 fully utilizes the RTX 5090 natively, and training is tens of times faster than on CPU.

Tech stack

Component	Technology
Framework	PyTorch 2.11 + CUDA 12.8
GPU	NVIDIA GeForce RTX 5090 (32 GB VRAM)
Image processing	OpenCV, torchvision
Monitoring	TensorBoard
Dataset	42,666 images (plate crops): 37,775 train + 4,891 validation

Common OCR training mistakes

1) Training without augmentations: the model fails under real-world conditions (noise, lighting, blur).

2) Using RGB instead of grayscale: plate text doesn't depend on color, so three channels triple the input size without adding useful signal.

3) No gradient clipping: LSTMs are prone to gradient explosion during early training.

4) Selecting the saved model by train loss instead of val loss: the checkpoint that is kept may be an overfitted one.

5) Starting at the full LR without warmup: causes destructive weight updates early on.

6) Ignoring the variable plate length (8 or 9 characters): without a collate function that passes per-sample target lengths to the CTC loss, CTC won't train.
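
For the last point, a collate function for variable-length labels might look like this. It is a sketch under assumptions: the name `ctc_collate` is hypothetical, each dataset item is assumed to be an `(image_tensor, list_of_class_indices)` pair, and targets are concatenated into one 1-D tensor, which is one of the two layouts `nn.CTCLoss` accepts.

```python
import torch


def ctc_collate(batch):
    """Stack images and flatten variable-length labels for nn.CTCLoss."""
    images, labels = zip(*batch)
    images = torch.stack(images)                                   # (B, 1, 32, 256)
    # Per-sample label lengths tell CTC where each target ends.
    target_lengths = torch.tensor([len(label) for label in labels], dtype=torch.long)
    # All labels concatenated into one 1-D tensor of size sum(target_lengths).
    targets = torch.cat([torch.tensor(label, dtype=torch.long) for label in labels])
    return images, targets, target_lengths


if __name__ == "__main__":
    batch = [
        (torch.zeros(1, 32, 256), [11, 1, 1, 2, 11, 11, 6, 1]),        # 8 characters
        (torch.zeros(1, 32, 256), [12, 2, 3, 4, 13, 13, 8, 8, 8]),     # 9 characters
    ]
    images, targets, target_lengths = ctc_collate(batch)
    print(images.shape, targets.shape, target_lengths.tolist())
```

The function would be passed to the loader as `DataLoader(dataset, batch_size=64, collate_fn=ctc_collate)`.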

Training monitoring

TensorBoard integration for visualization: train/val loss curves, plate accuracy, character accuracy, and learning rate changes.

Launch: tensorboard --logdir runs/v2_ocr

FAQ

Why CRNN instead of Transformer?

CRNN + CTC is a lightweight architecture (~9M params) that works great for fixed domains (plates, receipts). Transformers require significantly more data and compute.

How much data do you need?

~38k training examples were sufficient for Russian plates. Quality depends heavily on augmentation diversity, not just volume.

Can it be used for other plate types?

Yes — replace the alphabet and fine-tune on new data. The architecture is universal for OCR sequences.

How fast is inference?

On RTX 5090 — real-time for video streams. On CPU — batch file processing works fine.

Why not use off-the-shelf OCR (Tesseract, EasyOCR)?

General-purpose OCR tends to be less accurate on narrow domains. A custom model is more accurate on Russian plates, lighter, and faster.

Key Takeaways

1) CRNN + CTC is an efficient lightweight architecture for plate OCR (~9M parameters).

2) Mixed Precision + RTX 5090 provide significant training speedup.

3) Warmup + Cosine Annealing ensure stable convergence without oscillations.

4) Augmentations are critical for robustness to noise and lighting in real conditions.

5) On native Windows, PyTorch was the practical choice for GPU training; TensorFlow offers GPU support there only through WSL2.