Training an OCR Model for Russian License Plate Recognition on PyTorch
A practical walkthrough, from the OCR problem statement for license plates to training a CRNN + CTC model from scratch on 42,666 images.
TL;DR
The task is recognizing text from Russian license plate images (cropped plate regions).
CRNN architecture (CNN + BiLSTM + CTC): a classic lightweight OCR approach, ~9M parameters.
Dataset: 37,775 training and 4,891 validation images. Alphabet: 22 characters (digits + Cyrillic).
Mixed Precision (AMP) on NVIDIA RTX 5090 provides ~2x training speedup.
Result: plate accuracy 95%+ after 50-100 epochs with warmup and cosine annealing.
Problem statement
Recognizing text from Russian license plate images is an OCR (Optical Character Recognition) task formulated as sequence-to-sequence prediction: the input is a fixed-size plate image, the output is a character sequence.
Alphabet
22 characters: 0-9 (digits) and АВЕКМНОРСТУХ (the 12 Cyrillic letters with Latin visual analogs).
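The alphabet maps directly onto integer class indices. Below is a minimal sketch, assuming index 0 is kept free for the CTC blank token described later in this article. The names `ALPHABET`, `encode`, and `decode` are illustrative and not taken from the project code.

```python
# Illustrative names; index 0 is reserved for the CTC blank token.
DIGITS = "0123456789"
LETTERS = "АВЕКМНОРСТУХ"  # the 12 Cyrillic letters used on Russian plates
ALPHABET = DIGITS + LETTERS  # 22 characters in total

CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # classes 1..22
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}


def encode(text: str) -> list[int]:
    """Turn plate text into class indices (raises KeyError on unknown chars)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices: list[int]) -> str:
    """Turn class indices back into plate text."""
    return "".join(IDX_TO_CHAR[i] for i in indices)


label = "А001АА50"  # Cyrillic letters, not Latin look-alikes
print(encode(label))
```

Keeping the mapping in one place makes it harder for the training targets and the decoder to drift apart.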
Input format
Grayscale image 1×32×256 (1 channel, height 32 px, width 256 px).
Output format
Character sequence (8 or 9 characters for Russian plates).
Approach
CRNN + CTC: training without character-level annotation, knowing only the final plate text.
Where it's used: as part of the Video Analytics Platform for real-time license plate recognition in video streams.
Model architecture
Image (1x32x256)
↓
[CNN Backbone]: Feature Extraction
↓
Tensor (B, 512, 1, 63) → (B, 63, 512)
↓
[BiLSTM x2]: Sequence Modeling
↓
[Linear 512→23]: Classification
↓
[CTC Decoding]: Greedy Decode
↓
Plate text
CRNN (Shi et al., 2015) is a classic OCR architecture. Three blocks: CNN extracts features, BiLSTM models the sequence, CTC trains without character-level labels.
CNN backbone (Feature Extraction)
A convolutional neural network extracts feature sequences from the image. The architecture consists of 5 blocks with BatchNorm after each convolutional layer.
Key design decisions:
1) BatchNorm after each layer: stabilizes training and allows higher learning rates.
2) MaxPool(2,1) in blocks 3 and 4: reduces height only, preserving width (horizontal resolution is critical for character sequences).
3) Final Conv with kernel=2 and no padding: collapses the height to 1 (and shortens the width by one step, from 64 to 63), preparing the data for the RNN.
The resulting tensor (B, 512, 1, 63) is reshaped to (B, 63, 512): 63 time steps, each with 512 features.
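The shape flow above can be checked with a small sketch. The article does not list channel widths or the number of convolutions per block, so the values here (64, 128, 256, 512, 512 with one 3×3 convolution per block) are assumptions. The 2×2 pooling in blocks 1 and 2 is inferred from the stated input and output shapes. The class name `Backbone` is hypothetical.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out, pool):
    """Conv 3x3 -> BatchNorm -> ReLU -> MaxPool (assumed block layout)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=pool, stride=pool),
    )


class Backbone(nn.Module):
    """Hypothetical 5-block CNN; channel widths are assumptions."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64, pool=(2, 2)),     # 32x256 -> 16x128
            conv_block(64, 128, pool=(2, 2)),   # 16x128 -> 8x64
            conv_block(128, 256, pool=(2, 1)),  # 8x64  -> 4x64 (height only)
            conv_block(256, 512, pool=(2, 1)),  # 4x64  -> 2x64 (height only)
            nn.Conv2d(512, 512, kernel_size=2, padding=0),  # 2x64 -> 1x63
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        f = self.features(x)                  # (B, 512, 1, 63)
        return f.squeeze(2).permute(0, 2, 1)  # (B, 63, 512)


if __name__ == "__main__":
    with torch.no_grad():
        out = Backbone().eval()(torch.zeros(2, 1, 32, 256))
    print(out.shape)
```

This sketch is only meant to verify the tensor shapes; it makes no claim about matching the ~9M parameter count of the real model.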
BiLSTM (Sequence Modeling)
A bidirectional LSTM processes the feature sequence, capturing context in both directions. This is critical for understanding character boundaries.
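import torch.nn as nn

nn.LSTM(
    input_size=512,    # features per time step from the CNN
    hidden_size=256,   # per direction
    num_layers=2,
    bidirectional=True,
    dropout=0.2,       # applied between the two LSTM layers
)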
Bidirectional means each time step receives information from both the left and the right. With hidden_size=256 per direction, the output has 2 × 256 = 512 features per step, which matches the Linear 512→23 classifier. Two layers provide sufficient depth, and dropout=0.2 (applied between the LSTM layers) adds regularization.
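To make the shapes concrete, here is a short sketch that runs a random feature sequence through the recurrent part and the classifier. `batch_first=True` is an assumption, chosen so that the input matches the (B, 63, 512) layout produced by the CNN. The variable names are illustrative.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(
    input_size=512,
    hidden_size=256,
    num_layers=2,
    bidirectional=True,
    dropout=0.2,
    batch_first=True,  # assumption: inputs are laid out as (B, T, features)
)
head = nn.Linear(2 * 256, 23)  # 22 characters + 1 blank

features = torch.randn(4, 63, 512)  # stand-in for CNN output
seq, _ = rnn(features)              # (4, 63, 512): forward + backward states
logits = head(seq)                  # (4, 63, 23): class scores per time step
print(seq.shape, logits.shape)
```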
CTC Loss and decoding
CTC (Connectionist Temporal Classification) is the key component for training without character-level annotation. We only know the final plate text, not which pixels correspond to which characters.
A blank token (index 0) is introduced for "no character". The model predicts a distribution over all 23 classes (22 chars + blank) for each of the 63 time steps.
CTC loss sums the probabilities of all alignment paths that collapse to the target text. Setting zero_infinity=True zeroes out infinite losses and their gradients, which could otherwise turn into NaNs during early epochs.
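A minimal sketch of how this loss can be computed with `nn.CTCLoss`, using random tensors in place of real model output. The two target sequences are hypothetical plates of 8 and 9 characters, encoded with classes 1..22 and passed in concatenated form.

```python
import torch
import torch.nn as nn

T, B, C = 63, 2, 23  # time steps, batch size, classes (22 chars + blank)

# Stand-in for model output; nn.CTCLoss expects log-probabilities as (T, B, C).
logits = torch.randn(T, B, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Two hypothetical plates (8 and 9 characters) concatenated into one 1-D tensor.
targets = torch.tensor([11, 1, 1, 2, 11, 11, 6, 1,
                        12, 2, 3, 4, 17, 18, 8, 8, 8])
target_lengths = torch.tensor([8, 9])
input_lengths = torch.full((B,), T, dtype=torch.long)  # every sample uses all 63 steps

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```

With 63 time steps and at most 9 target characters there is always room for a valid alignment, even when a character repeats and needs a blank between the copies.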
Decoding (greedy): for each time step, the highest-probability class is selected; consecutive duplicates are then collapsed and blanks are removed, in that order.
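The decoding rule can be sketched in plain Python. The function name `greedy_decode` and the `one_hot` helper are illustrative. The input is assumed to be one row of class scores per time step, with class 0 as the blank.

```python
ALPHABET = "0123456789АВЕКМНОРСТУХ"  # class i maps to ALPHABET[i - 1]


def greedy_decode(step_scores, alphabet=ALPHABET, blank=0):
    """Pick the best class per step, collapse repeats, then drop blanks."""
    chars = []
    prev = blank
    for scores in step_scores:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        # Emit only when the class changes and is not the blank.
        if best != blank and best != prev:
            chars.append(alphabet[best - 1])
        prev = best
    return "".join(chars)


def one_hot(idx, n=23):
    """Helper for the demo: a score row with a single winning class."""
    return [1.0 if i == idx else 0.0 for i in range(n)]


# Best path: blank, 11, 11, blank, 1, blank, 1, 1, 2
path = [0, 11, 11, 0, 1, 0, 1, 1, 2]
print(greedy_decode([one_hot(i) for i in path]))
```

The blank between the two `1` classes is what lets the decoder output a genuinely repeated character. Without it, the two steps would collapse into one.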
Data preparation
Dataset: 37,775 training and 4,891 validation images (~88.5% / 11.5%). Labels are extracted from filenames (e.g., A001AA50.png → A001AA50).
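A small sketch of label extraction from a filename. The article does not say whether filenames use Latin or Cyrillic letters, so the Latin-to-Cyrillic mapping below is an assumption that can be dropped if the files are already Cyrillic. The function name `label_from_path` is hypothetical.

```python
from pathlib import Path

# Assumption: filenames may use Latin look-alikes, while the model alphabet is Cyrillic.
LATIN = "ABEKMHOPCTYX"
CYRILLIC = "АВЕКМНОРСТУХ"
LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)
VALID_CHARS = set("0123456789" + CYRILLIC)


def label_from_path(path: str) -> str:
    """Return the plate text encoded in the file name, validated against the alphabet."""
    label = Path(path).stem.upper().translate(LATIN_TO_CYRILLIC)
    if len(label) not in (8, 9) or not set(label) <= VALID_CHARS:
        raise ValueError(f"unexpected plate label in {path!r}: {label!r}")
    return label


print(label_from_path("data/train/A001AA50.png"))
```

Validating labels while the dataset is being built surfaces mislabeled files early, before they show up as unexplained loss spikes.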
Preprocessing: load with OpenCV (BGR→RGB), convert to grayscale (1 channel), resize to 32×256 (height × width), normalize to the [-1, 1] range.
Augmentations (train only)
Augmentations are applied to improve robustness to real-world capture conditions such as noise, lighting changes, and blur.
Training process
Adam optimizer with learning rate 1e-4. Warmup + cosine annealing: over the first 5 epochs the LR ramps from 0 to 1e-4 (smooth start), then follows a cosine decay down to 1e-12.
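The schedule can be written as a single function of the epoch index. This is a sketch under two assumptions that the article does not state: the warmup is linear, and the run lasts 100 epochs. The function name `lr_at_epoch` is illustrative.

```python
import math


def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5,
                base_lr=1e-4, min_lr=1e-12):
    """Linear warmup from 0 to base_lr, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 1, 5, 50, 100):
    print(epoch, lr_at_epoch(epoch))
```

If the schedule is stepped once per epoch, the very first epoch runs with an LR of 0. Stepping per iteration, or starting the ramp from a small non-zero value, avoids that wasted epoch.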
Mixed Precision (AMP): the forward pass runs in float16, and GradScaler prevents gradient underflow. On the RTX 5090 this gives a ~2x speedup and allows a larger batch size.
Gradient clipping by norm (max_norm=5.0) prevents gradient explosion in the LSTM, especially during early training.
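The model is saved when validation loss improves (not training loss), so the checkpoint that is kept is the one that generalizes best rather than one that has overfit the training set.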
Hyperparameters
Metrics and results
Plate Accuracy
Share of plates recognized completely without errors; this is the main metric. 95%+ after 50-100 epochs.
Character Accuracy
Share of individual characters recognized correctly. 90%+ as early as epochs 10-20.
CTC Loss
Loss function for optimization. Drops rapidly in early epochs, plateaus around epoch 100.
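Both accuracy metrics are easy to compute from decoded strings. The article does not say how character accuracy handles predictions of the wrong length. The sketch below assumes a position-wise comparison in which characters missing from a short prediction count as errors. The function names are illustrative.

```python
def plate_accuracy(preds, truths):
    """Share of plates where the predicted text matches exactly."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


def char_accuracy(preds, truths):
    """Share of ground-truth characters matched at the same position (assumed definition)."""
    correct = sum(
        pc == tc
        for p, t in zip(preds, truths)
        for pc, tc in zip(p, t)
    )
    total = sum(len(t) for t in truths)
    return correct / total


preds = ["A001AA50", "B123OP77"]    # second prediction lost its last digit
truths = ["A001AA50", "B123OP777"]
print(plate_accuracy(preds, truths), char_accuracy(preds, truths))
```

An edit-distance-based character error rate would be a stricter alternative when insertions and deletions are common.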
Training dynamics
In early epochs, CTC loss drops fast. By epochs 10-20, character accuracy reaches 90%+. By epochs 50-100, plate accuracy reaches 95%+. Cosine LR decay ensures smooth convergence.
Comparison with alternatives
Two approaches were tested during development:
Why PyTorch: TensorFlow dropped native Windows GPU support after version 2.10. PyTorch with CUDA 12.8 fully utilizes the RTX 5090, and training is tens of times faster than on CPU.
Tech stack
Common OCR training mistakes
1) Training without augmentations: the model fails in real-world conditions (noise, lighting, blur).
2) Using RGB instead of grayscale: plates don't need color information, and 3 channels add parameters and compute for no benefit.
3) No gradient clipping: LSTMs are prone to gradient explosion during early training.
4) Saving the model by train loss instead of val loss: this tends to keep overfit checkpoints.
5) Starting at full LR without warmup: causes destructive weight updates early on.
6) Ignoring the plate format (variable length): plates have 8 or 9 characters, so without a collate function that passes the true target lengths, CTC won't train.
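For the last point, a collate function along the following lines handles variable-length targets. It is a sketch that assumes each dataset item is a pair of an image tensor shaped (1, 32, 256) and a list of class indices. The function name `ctc_collate` is hypothetical.

```python
import torch


def ctc_collate(samples):
    """Stack fixed-size images and concatenate variable-length targets for CTC."""
    images = torch.stack([img for img, _ in samples])  # (B, 1, 32, 256)
    target_lengths = torch.tensor(
        [len(t) for _, t in samples], dtype=torch.long
    )
    # nn.CTCLoss accepts all targets concatenated into one 1-D tensor.
    targets = torch.cat(
        [torch.as_tensor(t, dtype=torch.long) for _, t in samples]
    )
    return images, targets, target_lengths


samples = [
    (torch.zeros(1, 32, 256), [11, 1, 1, 2, 11, 11, 6, 1]),     # 8 characters
    (torch.zeros(1, 32, 256), [12, 2, 3, 4, 17, 18, 8, 8, 8]),  # 9 characters
]
images, targets, target_lengths = ctc_collate(samples)
print(images.shape, targets.shape, target_lengths.tolist())
```

It would be passed to the loader as `DataLoader(dataset, batch_size=..., collate_fn=ctc_collate)`. Concatenating targets avoids having to choose a padding value.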
Training monitoring
TensorBoard integration for visualization: train/val loss curves, plate accuracy, character accuracy, and learning rate changes.
Launch: tensorboard --logdir runs/v2_ocr
FAQ
Why CRNN instead of Transformer?
CRNN + CTC is a lightweight architecture (~9M params) that works great for fixed domains (plates, receipts). Transformers require significantly more data and compute.
How much data do you need?
~38k training examples were sufficient for Russian plates. Quality depends heavily on augmentation diversity, not just volume.
Can it be used for other plate types?
Yes: replace the alphabet, resize the output layer to the new number of classes, and fine-tune on new data. The architecture is universal for OCR sequences.
How fast is inference?
On an RTX 5090, inference runs in real time on video streams. On CPU, it is fast enough for batch file processing.
Why not use off-the-shelf OCR (Tesseract, EasyOCR)?
General-purpose OCR is worse on narrow domains. A custom model is more accurate on Russian plates, lighter, and faster.
Key Takeaways
1) CRNN + CTC is an efficient lightweight architecture for plate OCR (~9M parameters).
2) Mixed Precision + RTX 5090 provide significant training speedup.
3) Warmup + Cosine Annealing ensure stable convergence without oscillations.
4) Augmentations are critical for robustness to noise and lighting in real conditions.
5) With TensorFlow no longer supporting GPUs on native Windows, PyTorch is the practical choice for GPU training on Windows.
Who this is for
Computer Vision teams, video surveillance developers, OCR specialists, and ML engineers who need license plate recognition or other character sequence OCR from images.