Training a Chinese Language Data File for Tesseract 3.02
2014-06-21 11:31
Download the chi_sim.traineddata language data file
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files
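0. Preparation
For convenience, name the image file using the format [lang].[fontname].exp[num].tif
lang is the language name and fontname is the font name.
For example, suppose we want to train a custom language named mjorcen with the font name normal.
We then rename the image file to mjorcen.normal.exp0.jpg (this walkthrough uses a .jpg image).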
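The naming convention above can be sketched in Python. This is only an illustration: the helper names build_training_name and parse_training_name are made up for this example and are not part of Tesseract.

```python
import re
from typing import NamedTuple


class TrainingName(NamedTuple):
    lang: str
    fontname: str
    num: int
    ext: str


# Matches [lang].[fontname].exp[num].<ext>; hypothetical helper, not a Tesseract API.
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


def build_training_name(lang, fontname, num, ext="tif"):
    """Build a file name such as mjorcen.normal.exp0.jpg."""
    return f"{lang}.{fontname}.exp{num}.{ext}"


def parse_training_name(filename):
    """Split a training file name back into its parts."""
    m = _NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"not a [lang].[fontname].exp[num].ext name: {filename!r}")
    return TrainingName(m["lang"], m["font"], int(m["num"]), m["ext"])


if __name__ == "__main__":
    print(build_training_name("mjorcen", "normal", 0, "jpg"))
    print(parse_training_name("mjorcen.normal.exp0.jpg"))
```

Parsing the name up front makes it easy to reject a misnamed image before any training command is run.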
Image:
Now start training the language data:
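1. Generate the .box file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Keep the image file and the box file in the same directory.
2. Open the image file with jTessBoxEditor.jar and correct the box file as needed
3. Generate the .tr file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
4. Generate a unicharset file
unicharset_extractor mjorcen.normal.exp0.box
5. Create a new font_properties file
Write the line normal 0 0 0 0 0 into it, which marks the font as a plain default font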
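Each line of font_properties is the font name followed by five 0/1 flags, in the order italic, bold, fixed, serif, fraktur. A minimal sketch for producing such lines follows; the function names are illustrative and not part of any Tesseract tool.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Return one font_properties line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(path, lines):
    """Write the given lines to a font_properties file, one font per line."""
    with open(path, "w", encoding="ascii", newline="\n") as fh:
        for line in lines:
            fh.write(line + "\n")


if __name__ == "__main__":
    # The plain font used in this walkthrough.
    print(font_properties_line("normal"))
    # A fixed-pitch font, as an example of setting one flag.
    print(font_properties_line("ocra", fixed=True))
```

The font name in this file has to match the fontname part of the training file name, otherwise mftraining cannot look up its properties.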
6. Run the following commands
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
The output is as follows:
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
7. Add the prefix normal. to the five files unicharset, inttemp, pffmtable, shapetable and normproto in the directory (for example, unicharset becomes normal.unicharset)
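The renaming step can also be done programmatically. The following is a rough sketch using only the Python standard library; prefix_training_files is a hypothetical helper name, and it silently skips any of the five files that are missing.

```python
from pathlib import Path

# The five training outputs that combine_tessdata expects to carry the language prefix.
COMPONENTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def prefix_training_files(directory, lang):
    """Rename e.g. unicharset -> <lang>.unicharset; return the new names."""
    directory = Path(directory)
    renamed = []
    for name in COMPONENTS:
        src = directory / name
        if src.exists():
            dst = directory / f"{lang}.{name}"
            # replace() overwrites a leftover file from an earlier run.
            src.replace(dst)
            renamed.append(dst.name)
    return renamed


if __name__ == "__main__":
    print(prefix_training_files(".", "normal"))
```

Using replace() rather than rename() means a stale normal.* file from a previous training run does not cause an error on Windows.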
8. Run combine_tessdata normal.
9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory
10. Test
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
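The command-line steps above can be collected into one list, which is handy for scripting. This sketch only builds the argument lists and does not run anything; training_commands is an illustrative name, and the manual steps (editing the box file, writing font_properties, renaming the five files, copying the .traineddata file) still have to happen between the commands in the same order as in the walkthrough.

```python
def training_commands(image, lang, bootstrap_lang="chi_sim"):
    """Return the Tesseract 3.02 training commands from this walkthrough as argv lists."""
    base = image.rsplit(".", 1)[0]  # mjorcen.normal.exp0.jpg -> mjorcen.normal.exp0
    tr = base + ".tr"
    return [
        # Step 1: generate the .box file with an existing language.
        ["tesseract", image, base, "-l", bootstrap_lang, "batch.nochop", "makebox"],
        # Step 3: generate the .tr file (after the box file has been corrected).
        ["tesseract", image, base, "nobatch", "box.train"],
        # Step 4: extract the character set.
        ["unicharset_extractor", base + ".box"],
        # Step 6: clustering and training (font_properties must exist by now).
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset", "-O", "unicharset", tr],
        ["cntraining", tr],
        # Step 8: combine (after the five files have been given the lang prefix).
        ["combine_tessdata", lang + "."],
        # Step 10: test with the new language.
        ["tesseract", image, base, "-l", lang],
    ]


if __name__ == "__main__":
    for cmd in training_commands("mjorcen.normal.exp0.jpg", "normal"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) if the Tesseract tools are on the PATH.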
Debug output:
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica
Result with the trained normal data:
应收:119
Result with the stock Chinese data (chi_sim):
应收=II苜
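The combine_tessdata output in the log above lists one offset per component type, and an offset of -1 means that component is absent from the combined file. A quick way to check which components made it in is to parse those lines. This is a sketch under the assumption that the lines look like the ones in this log; the regular expression also tolerates missing spaces, and the function names are illustrative.

```python
import re

# Tolerant of missing whitespace, e.g. "Offsetfortype1is140".
_OFFSET_RE = re.compile(r"Offset\s*for\s*type\s*(\d+)\s*is\s*(-?\d+)")


def parse_offsets(log_text):
    """Map component type number -> byte offset (-1 when the component is absent)."""
    return {int(t): int(off) for t, off in _OFFSET_RE.findall(log_text)}


def present_types(log_text):
    """Return the sorted type numbers whose offset is not negative."""
    return sorted(t for t, off in parse_offsets(log_text).items() if off >= 0)


if __name__ == "__main__":
    sample = (
        "Offset for type 0 is -1\n"
        "Offset for type 1 is 140\n"
        "Offset for type 3 is 489\n"
        "Offset for type 13 is 123920\n"
    )
    print(present_types(sample))
```

In the run shown here five types have non-negative offsets (1, 3, 4, 5 and 13), which is consistent with the five files that were prefixed in step 7; the mapping from type number to file name is not stated in this post and should be checked against the Tesseract source.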
Script (requires a Java environment):
The directory structure is as follows:
The script is as follows:
Windows
@echo off
set "src=%1"
set "font_name=%2"
set "desc=%3"
if not defined src set /p src="please pass your file name:"
if not defined font_name set /p font_name="please pass your font_name:"
rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit
rem Default output name: the file name without its 4-character extension (e.g. .jpg)
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%
rem Create font_properties if it does not exist in this directory.
rem The redirection is written first so the trailing 0 is not parsed as a handle number.
if exist font_properties (
    echo font_properties exist
) else (
    >font_properties ECHO %font_name% 0 0 0 0 0
)
rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties
rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox
rem Open jTessBoxEditor so the box file can be corrected by hand
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
ECHO Please change your results, and press any key to continue
pause>nul
tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr
rem Give the new files the font_name prefix
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto
combine_tessdata %font_name%.
if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
ECHO press any key to continue
pause>nul
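Usage:
Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output file name; if argument 3 is omitted it defaults to the file name without its extension.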
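The default for the third argument is derived in the batch script by cutting the last four characters off the file name, which only works for three-letter extensions such as .jpg or .tif. The sketch below contrasts that with splitting on the extension itself; both function names are hypothetical and exist only for this comparison.

```python
import os


def batch_style_base(src):
    """Mimic the batch expression %src:~0,-4%: drop the last four characters."""
    return src[:-4]


def default_output_base(src):
    """Drop the real extension, whatever its length."""
    base, _ext = os.path.splitext(src)
    return base


if __name__ == "__main__":
    for name in ("mjorcen.normal.exp0.jpg", "mjorcen.normal.exp0.tiff"):
        print(name, batch_style_base(name), default_output_base(name))
```

For mjorcen.normal.exp0.jpg both approaches give mjorcen.normal.exp0; for a .tiff file the four-character cut leaves a trailing dot, so passing the third argument explicitly is the safer choice in that case.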
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
Example run:
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results, and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
Linux (from the documentation):
#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" > font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz