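A synthetic data generator for text recognition
https://github.com/Belval/TextRecognitionDataGenerator
Notes:
Its functionality is similar to the text image generation tool introduced in the previous post.
After installing the dependencies, you can run the demo by following the instructions.
You can generate text from a corpus of your own choosing, and you can add your own backgrounds.
For example, for train tickets you can put all possible station names, train numbers, and other fixed fields into the input file, then randomly generate the sample data you need.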
python run.py -l cn --output_dir MY_samples -i texts/city.txt -c 1000 -b 3 -w 5
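One way to prepare such an input file is a short script that combines the fixed fields into lines. The sketch below is only an illustration: the station names, train numbers, and the helper names `build_corpus` and `write_corpus` are made-up placeholders, and it assumes the file passed to `-i` holds one text sample per line.

```python
import random
from pathlib import Path


def build_corpus(stations, trains, count, seed=0):
    """Return `count` lines, each pairing a train number with two distinct stations."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        # sample() guarantees origin and destination differ
        origin, destination = rng.sample(stations, 2)
        lines.append(f"{rng.choice(trains)} {origin} {destination}")
    return lines


def write_corpus(path, lines):
    """Write one sample per line as UTF-8 (assumed input format for -i)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Placeholder values; replace with the real lists of stations and trains.
    stations = ["北京南", "上海虹桥", "广州南", "深圳北"]
    trains = ["G1", "G102", "D2281"]
    write_corpus("texts/city.txt", build_corpus(stations, trains, count=1000))
```

The resulting texts/city.txt can then be passed to the command above through `-i`.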
I am still working out how to change the Chinese fonts (for example Heiti or Songti).
Generated samples look like this:
TextRecognitionDataGenerator
A synthetic data generator for text recognition
What is it for?
Generating text image samples to train OCR software. Now supporting non-Latin text!
What do I need to make it work?
I use Archlinux so I cannot tell if it works on Windows yet.
- Python 3.X
- OpenCV 3.2 (It probably works with 2.4)
- Pillow
- Numpy
- Requests
- BeautifulSoup
- tqdm
You can also simply run:
pip install -r requirements.txt
New
- Specify text color range using -tc min,max
- Explicit alignment using -al with fixed width (0: Left, 1: Center, 2: Right)
- Fixed width using -wd
- Generate random strings with letters, numbers and symbols (Thank you @FHainzl)
- Save the labels in a file instead of in the file name (Thank you @FHainzl)
- Add support for Simplified and Traditional Chinese
How does it work?
python run.py -w 5 -f 64
You get 1000 randomly generated images with random text on them like:
What if you want random skewing? Add -k and -rk:
python run.py -w 5 -f 64 -k 5 -rk
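As a rough illustration of what these two flags control, the sketch below picks the angle applied to one sample. It assumes that with `-rk` the angle is drawn uniformly between the negative and positive value given to `-k`, and that without it the fixed value is used; the function name `pick_skew_angle` is hypothetical and not taken from run.py.

```python
import random


def pick_skew_angle(skew_angle, random_skew, rng=random):
    """Return the skew angle (in degrees) to apply to one generated image."""
    if random_skew:
        # Assumption: a uniform draw in [-skew_angle, +skew_angle]
        return rng.uniform(-skew_angle, skew_angle)
    return float(skew_angle)


if __name__ == "__main__":
    rng = random.Random(0)
    # Mirrors "-k 5 -rk": a different angle for each sample
    print([round(pick_skew_angle(5, True, rng), 2) for _ in range(3)])
```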
But scanned documents usually aren't that clear, are they? Add -bl and -rbl to get Gaussian blur on the generated image with a user-defined radius (here 0, 1, 2, 4):
Maybe you want another background? Add -b to select one of the four available backgrounds: Gaussian noise (0), plain white (1), quasicrystal (2) or picture (3).
When using the picture background (3), a picture from the pictures/ folder is selected at random and the text is written on it.
Or maybe you are working on an OCR for handwritten text? Add -hw! (Experimental)
It uses a TensorFlow model trained using this excellent project by Grzego.
The project does not require TensorFlow to run if you aren't using this feature.
You can also add distortion to the generated text with -d and -do.
The text is chosen at random in a dictionary file (that can be found in the dicts folder) and drawn on a white background made with Gaussian noise. The resulting image is saved as [text]_[index].jpg
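The selection and naming scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code in run.py: the helper names are hypothetical, the dictionary file is assumed to hold one word per line, and words are assumed to be joined with single spaces.

```python
import random


def load_words(path):
    """Read a dictionary file, assuming one word per line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def random_text(words, word_count, rng):
    """Join `word_count` randomly chosen words with single spaces."""
    return " ".join(rng.choice(words) for _ in range(word_count))


def sample_filename(text, index, extension="jpg"):
    """Build a name following the [text]_[index].jpg pattern."""
    return f"{text}_{index}.{extension}"


if __name__ == "__main__":
    rng = random.Random(42)
    # Stand-in word list; load_words() would read a file from the dicts folder.
    words = ["hello", "world", "ocr", "sample"]
    for index in range(3):
        print(sample_filename(random_text(words, 2, rng), index))
```

Because the label is stored in the file name, text containing characters that are not valid in file names would need the option that saves labels to a separate file instead.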
There are a lot of parameters that you can tune to get the results you want, so I recommend checking out python run.py -h for more information.
How to create images with Chinese (both simplified and traditional) text
It is simple! Just run:
python run.py -l cn -c 1000 -w 5
Unfortunately I do not speak Chinese so you may have to edit texts/cn.txt to include some meaningful words instead of random glyphs.
Here are examples of what I could make with it:
Traditional:
Simplified:
Can I add my own font?
Yes, the script picks a font at random from the fonts directory.
- fonts/latin: English, French, Spanish, German
- fonts/cn: Chinese
Simply add / remove fonts until you get the desired output.
If you want to add a new non-latin language, the amount of work is minimal.
- Create a new folder named with your language's two-letter code
- Add a .ttf font in it
- Edit run.py to add an if statement in load_fonts()
- Add a text file in dicts with the same two-letter code
- Run the tool as you normally would but add -l with your two-letter code
It only supports .ttf for now.
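The kind of change the load_fonts() step asks for might look like the sketch below. It is not the function from run.py: the signature, the `fonts_root` parameter, and the `ja` language code are assumptions made for illustration, and the actual function may be organized differently.

```python
import os


def load_fonts(lang, fonts_root="fonts"):
    """Return the paths of the .ttf fonts available for a language code."""
    if lang == "cn":
        folder = os.path.join(fonts_root, "cn")
    elif lang == "ja":
        # Hypothetical new language: this branch is the added "if statement"
        folder = os.path.join(fonts_root, "ja")
    else:
        # Everything else falls back to the Latin fonts
        folder = os.path.join(fonts_root, "latin")
    # Only .ttf files are kept, matching the limitation stated above
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".ttf")
    )
```

With a font placed in fonts/ja and a matching dictionary file in dicts, the tool would then be run with `-l ja`.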
Benchmarks
- Intel Core i7-4710HQ @ 2.50GHz + SSD (-c 1000 -w 1)
  -t 1: 363 img/s
  -t 2: 694 img/s
  -t 4: 1300 img/s
  -t 8: 1500 img/s
-t 1: 558 img/s
-t 2: 1045 img/s
-t 4: 2107 img/s
-t 8: 3297 img/s
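To read these numbers as scaling rather than raw throughput, each figure can be divided by the single-thread result. The short calculation below does that for the two sets of measurements listed above; the function and variable names are only illustrative.

```python
def speedups(throughput_by_threads):
    """Map thread count to speedup relative to the single-thread throughput."""
    base = throughput_by_threads[1]
    return {
        threads: round(rate / base, 2)
        for threads, rate in sorted(throughput_by_threads.items())
    }


# img/s per value of -t, copied from the two benchmark sets above
first_set = {1: 363, 2: 694, 4: 1300, 8: 1500}
second_set = {1: 558, 2: 1045, 4: 2107, 8: 3297}

if __name__ == "__main__":
    for name, data in (("first set", first_set), ("second set", second_set)):
        print(name, speedups(data))
```

In the first set the gain flattens after four threads (about 4.1x at eight threads), while the second set keeps scaling to roughly 5.9x.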
Contributing
- Create an issue describing the feature you'll be working on
- Code said feature
- Create a pull request
Feature request & issues
If anything is missing, unclear, or simply not working, open an issue on the repository.
What is left to do?
- Better background generation
- Better handwritten text generation
- More customization parameters (mostly regarding background)