I participated in the Denoising ShabbyPages competition on Kaggle, where the goal was to denoise dirty documents. You are given image pairs: one is a cutout of a clean document, the other a dirty version of the same cutout. These snippets were made ‘unclean’ with the Augraphy Python library.
Augraphy is a Python library that randomly distorts document images to mimic the artifacts of printing, faxing, scanning, and photocopying. You can build very realistic pipelines that help with data augmentation for your deep learning projects.
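Augraphy composes such distortions from configurable phases; the sketch below is not Augraphy's API but a minimal numpy illustration of the underlying idea, starting from a clean page and injecting scanner-like artifacts (the `dirty_page` helper and its parameters are my own invention):

```python
import numpy as np

def dirty_page(clean: np.ndarray, seed: int = 0) -> np.ndarray:
    """Add speckle noise and random dark blotches to a grayscale page.

    A rough stand-in for the kind of scanner/copier artifacts Augraphy
    produces; Augraphy's own pipelines are far more realistic.
    """
    rng = np.random.default_rng(seed)
    dirty = clean.astype(np.float64)

    # Speckle: per-pixel Gaussian noise, like sensor grain on a scan.
    dirty += rng.normal(0, 15, size=clean.shape)

    # Blotches: a few random dark ellipses, like toner smudges or stains.
    h, w = clean.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(5):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        ry, rx = rng.integers(5, 20), rng.integers(5, 20)
        mask = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 < 1.0
        dirty[mask] -= rng.uniform(40, 120)

    return np.clip(dirty, 0, 255).astype(np.uint8)

# A blank white "page" gets visibly dirtier.
clean = np.full((128, 128), 255, dtype=np.uint8)
dirty = dirty_page(clean)
```

Given a clean/dirty generator like this, training pairs for a denoiser fall out for free, which is exactly what makes Augraphy-style augmentation attractive.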
The task was to clean the images up as much as possible, effectively reversing the Augraphy noise. This has practical real-world applications: an effective way to clean up scans of physical documents helps capture them better in digital form. Here is a typical example:
Some time ago I wrote about image manipulations for data augmentation in OCR. That post was aimed more at individual words or sentences, while Augraphy is tailored to complete pages.
Note that this is applicable to more traditional OCR engines. One could argue that transformer-based models for document understanding are trained to deal with noise inherently, making preprocessing of text images unnecessary in that scenario. See for example these examples from GPT-4V(ision):