
Training a Chinese Language Data File for Tesseract 3.02

2014-06-21 11:31


Download the chi_sim.traineddata language data file
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files


0. Preparation


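To keep things simple, name the image file in the form [lang].[fontname].exp[num].tif
lang is the language name and fontname is the font name.
For example, to train a custom language named mjorcen with the font name normal,
rename the image file to mjorcen.normal.exp0.jpg (this walkthrough uses a JPEG image rather than a TIFF).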
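
The naming convention above can be checked mechanically before training starts. The following is a minimal sketch, assuming the four dot-separated parts never contain a dot themselves; the names TrainingName, parse_training_name, and base_name are hypothetical helpers, not part of Tesseract.

```python
import re
from typing import NamedTuple


class TrainingName(NamedTuple):
    lang: str
    fontname: str
    num: int
    ext: str


# Matches [lang].[fontname].exp[num].<ext>; assumes no part contains a dot.
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


def parse_training_name(filename: str) -> TrainingName:
    """Split a training image name into its parts, or raise ValueError."""
    m = _NAME_RE.match(filename)
    if m is None:
        raise ValueError(
            f"not in [lang].[fontname].exp[num].ext form: {filename!r}"
        )
    return TrainingName(m["lang"], m["font"], int(m["num"]), m["ext"])


def base_name(filename: str) -> str:
    """Return the name without its extension, as used for the output base."""
    t = parse_training_name(filename)
    return f"{t.lang}.{t.fontname}.exp{t.num}"
```

For the file used in this walkthrough, base_name("mjorcen.normal.exp0.jpg") gives mjorcen.normal.exp0, which is the output base passed to tesseract in the steps that follow.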




Image:



Now start training the language data:

1. Generate the .box file

tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox


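Put the image file and the box file in the same directory.

2. Open the image file with jTessBoxEditor.jar, then correct the box file to match what the image actually shows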
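
Before or after editing in jTessBoxEditor, it can help to sanity-check the box file programmatically. The sketch below assumes the Tesseract 3 box format of one symbol per line followed by left, bottom, right, top, and page number, with the origin at the bottom-left of the image. The Box type, the helper names, and the sample lines in the usage note are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the form '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    char, *nums = parts
    left, bottom, right, top, page = (int(n) for n in nums)
    return Box(char, left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    """Parse every non-empty line of a box file's contents."""
    return [parse_box_line(ln) for ln in text.splitlines() if ln.strip()]


def suspicious_boxes(boxes: List[Box]) -> List[Box]:
    """Return boxes with zero or negative width or height."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]
```

For example, given the made-up lines "应 10 20 40 50 0" and "x 80 30 80 45 0", suspicious_boxes flags the second one because its left and right edges coincide.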



3. Generate the .tr file

tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train


4. Generate a unicharset file

unicharset_extractor mjorcen.normal.exp0.box


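5. Create a font_properties file

Write the line normal 0 0 0 0 0 into it. The line is the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur), so all zeros describes a plain default font.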
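
The font_properties line is easy to get wrong by hand, so here is a small sketch that builds it from named flags. The function name font_properties_line is a hypothetical helper; the flag order (italic, bold, fixed, serif, fraktur) is the one used by the Tesseract 3 training tools.

```python
def font_properties_line(
    fontname: str,
    italic: bool = False,
    bold: bool = False,
    fixed: bool = False,
    serif: bool = False,
    fraktur: bool = False,
) -> str:
    """Build one font_properties entry: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(f)) for f in flags])


def write_font_properties(path: str, lines: list) -> None:
    """Write the entries to a file, one per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Calling font_properties_line("normal") yields the all-zero entry used in this step, and font_properties_line("ocra", fixed=True) yields the entry used by the Linux script at the end of this post.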

6. Run the following commands

shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr

mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr

cntraining mjorcen.normal.exp0.tr


The output is as follows:

E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0

E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!

E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr

Reading mjorcen.normal.exp0.tr ...
Clustering ...

Writing normproto ...
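
Steps 3 through 6 always take the same arguments derived from the image name, so the command lines can be generated rather than typed. This is a sketch only: training_commands and run_all are hypothetical helpers, and the arguments are exactly the ones shown in the steps above. Running the commands still requires the Tesseract 3.02 training tools on the PATH.

```python
import subprocess
from typing import Callable, List


def training_commands(
    image: str, base: str, font_properties: str = "font_properties"
) -> List[List[str]]:
    """Return the argument lists for steps 3 to 6, in order."""
    tr = base + ".tr"
    return [
        ["tesseract", image, base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", tr],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", "unicharset", tr],
        ["cntraining", tr],
    ]


def run_all(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so later steps never see partial output.
        runner(cmd, check=True)
```

Passing the commands as lists avoids shell quoting problems when the working directory or file names contain spaces.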


7. Add the prefix normal. to the five files unicharset, inttemp, pffmtable, shapetable, and normproto in the directory (for example, unicharset becomes normal.unicharset)
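
The renaming in step 7 can be sketched as follows. The helper name add_prefix is hypothetical; it uses os.replace, which overwrites an existing destination, so stale files from an earlier run do not block the rename.

```python
import os
from typing import List

# The five component files produced by the training tools in this walkthrough.
COMPONENTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory: str, prefix: str) -> List[str]:
    """Rename each existing component file to '<prefix>.<name>'.

    Returns the new file names, in the order they were renamed.
    """
    renamed = []
    for name in COMPONENTS:
        src = os.path.join(directory, name)
        if os.path.exists(src):
            new_name = f"{prefix}.{name}"
            os.replace(src, os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed
```

Calling add_prefix(".", "normal") in the working directory prepares the files that combine_tessdata expects in the next step.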

8. Run combine_tessdata normal. (the trailing dot is part of the argument)

9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory

10. Test

tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal


debug:

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words

E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.


E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica


Result with the newly trained normal data:

应收:119


Result with the stock Chinese data (chi_sim):

应收=II苜
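
The offset lines printed by combine_tessdata in the debug log give a quick way to confirm that all five component files made it into the traineddata file. The sketch below parses those lines; it assumes that an offset of -1 means the corresponding entry is absent, which is my reading of the output rather than something the log itself states. The helper name present_entries is hypothetical, and the sample is an excerpt of the log above.

```python
import re
from typing import Dict

# Tolerates both "Offset for type 1 is 140" and the space-stripped form.
_OFFSET_RE = re.compile(r"Offset\s*for\s*type\s*(\d+)\s*is\s*(-?\d+)")

SAMPLE_LOG = """\
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 13 is 123920
"""


def present_entries(log: str) -> Dict[int, int]:
    """Map entry type number to offset for entries whose offset is not negative."""
    return {
        int(kind): int(offset)
        for kind, offset in _OFFSET_RE.findall(log)
        if int(offset) >= 0
    }
```

In the full log above, five entry types (1, 3, 4, 5, and 13) have non-negative offsets, which matches the five files that were renamed in step 7.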


Script (requires a Java environment):

The directory structure is as follows:



The script is as follows:

Windows

@echo off

rem Arguments: 1 = image file name, 2 = font name, 3 = output base name (optional)
set "src=%~1"
set "font_name=%~2"
set "desc=%~3"

if not defined src set /p src="please pass your file name:"

if not defined font_name set /p font_name="please pass your font_name:"

rem Validate the arguments

if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit /b 1

if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit /b 1

rem Default output name: the file name without its last four characters (e.g. ".jpg")
if not defined desc set "desc=%src:~0,-4%"

echo desc %desc%

rem Create font_properties if it does not exist yet.
rem The redirection comes first so that the trailing 0 is not parsed as a stream handle.
if exist font_properties (
echo font_properties exist
) else (
>"font_properties" echo %font_name% 0 0 0 0 0
)

rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties

rem makebox

tesseract "%src%" "%desc%" -l chi_sim batch.nochop makebox

java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar

ECHO Please change your results, and press any key to continue

pause>nul

tesseract "%src%" "%desc%" nobatch box.train

unicharset_extractor "%desc%.box"

shapeclustering -F font_properties -U unicharset "%desc%.tr"

mftraining -F font_properties -U unicharset -O unicharset "%desc%.tr"

cntraining "%desc%.tr"

rem Add the font name prefix to the generated files
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto

combine_tessdata %font_name%.

if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties

ECHO press any key to continue
pause>nul


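Usage:

Note: argument 1 is the full image file name, argument 2 is the font name, and argument 3 is the output base name. If argument 3 is omitted, it defaults to the image file name with its last four characters (the extension) removed.

E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal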
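
The script derives the default output name with %src:~0,-4%, which drops exactly four characters. That works for .jpg and .tif but not for longer extensions. The sketch below contrasts that behavior with an extension-aware version; both function names are hypothetical.

```python
import os


def default_desc_fixed_width(src: str) -> str:
    """Mirror the batch expression %src:~0,-4%: drop the last four characters."""
    return src[:-4]


def default_desc(src: str) -> str:
    """Drop the extension, whatever its length."""
    return os.path.splitext(src)[0]
```

Both functions return mjorcen.normal.exp0 for mjorcen.normal.exp0.jpg. For a name ending in .tiff, the fixed-width version leaves a trailing dot, so with the batch script it is safer to pass argument 3 explicitly in that case.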


Example run:

E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results, and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...

Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties

E:\data\Users\Administrator\Desktop\ocrBuider3>



Linux (taken from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):

#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" > font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz
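
Both the Windows and the Linux script fail partway through if one of the training tools is not installed. A small pre-flight check, sketched below, reports which of the executables used in this post cannot be found on the PATH. The helper name missing_tools is hypothetical; shutil.which is the standard-library lookup.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# The command-line tools invoked by the scripts in this post.
TOOLS = (
    "tesseract",
    "unicharset_extractor",
    "shapeclustering",
    "mftraining",
    "cntraining",
    "combine_tessdata",
)


def missing_tools(
    tools: Iterable[str] = TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that the lookup function cannot find."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All training tools found.")
```

The lookup function is a parameter so the check can be exercised without Tesseract installed.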