
Tessnet2 a .NET 2.0 Open Source OCR assembly using Tesseract engine

2015-08-31 10:32
Keywords: Open source, OCR, Tesseract, .NET, DOTNET, C#, VB.NET, C++/CLI
Current version : 2.04.0, 02SEP09 (see version history)
The big picture

Tesseract is a C++ open source OCR engine. Tessnet2 is a .NET assembly that exposes very simple methods to do OCR.

Tessnet2 is multithreaded. It uses the engine the same way Tesseract.exe does. Tessdll uses another method (no thresholding).
License

Tessnet2 is under the Apache 2 license (like Tesseract), meaning you can use it however you want, including in commercial products. You can read the full license info in the source files.
Quick Tessnet2 usage

Download the binary here, then add a reference to the Tessnet2.dll assembly in your .NET project.

Download the language data definition file here and put it in the tessdata directory. The tessdata directory and your exe must be in the same directory.

Look at the Program.cs sample.

Note: Tessnet2.dll needs the Visual C++ 2008 Runtime. When deploying your application, be sure to install the C++ runtime (x86, x64).
Tessnet2 usage
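using System;
using System.Collections.Generic;
using System.Drawing;

Bitmap image = new Bitmap("eurotext.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
// Restrict recognition to digits (only if the image contains digits only)
ocr.SetVariable("tessedit_char_whitelist", "0123456789");
// First argument selects the tessdata location, second the language
ocr.Init(@"c:\temp", "fra", false);
// Rectangle.Empty is passed as the region argument, as in the original sample
List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
foreach (tessnet2.Word word in result)
    Console.WriteLine("{0} : {1}", word.Confidence, word.Text);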
Tessnet2 source code and recompiling

Download the Tesseract source code here and extract it into a directory.

Download the Tessnet2 source code here and extract it into the Tesseract source code root directory (it should create a dotnet subdirectory).

Open the project solution tessnet2.sln. It's a Visual Studio 2008 C++/CLI project.

Memory leak
The Tesseract C++ source code is full of memory leaks. Using the tessnet2 assembly repeatedly will cause a memory overflow. This is not a tessnet2 leak but a Tesseract leak, and I spent two days in the Tesseract source code trying to improve this with no success. See what I think about this.
Tessnet2 demo

The Tessnet2 source code contains two C# demo projects. TesseractOCR is a multithreaded WinForms demo with a progress bar. TesseractConsole is a console demo.



The confidence score is shown in brackets. A value below 160 means the result is not bad.
Version History
07JUN08: First release on Tesseract 2.03
10JUN08: Version 2.03.1. Changed Confidence behavior: it is now calculated from every letter of the word and not from the first letter only. Type changed from byte to double.
0 = perfect, 100 = reject
13JUN08 : Version 2.03.2
After 3 days in the Tesseract code (urgh), here is Tessnet2 version 2.03.2.

This release corrects the following problems:

* Confidence was not very useful; the value was strange. This has been corrected by setting the variable tessedit_write_ratings=true. After many tests I found this mode gives the best confidence accuracy. Values range from 0 (perfect) to 255 (reject). When the value goes over 160, the OCR result was really bad.

* Calling DoOCR twice was not giving the same result. It was, as expected, a problem with global variables. The problem is almost fixed; sometimes it still doesn't work, but right now I can't find what is not correctly reinitialized.
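The confidence scale above (0 is perfect, values over 160 are bad) lends itself to a simple cut-off. The following is only a sketch: ConfidenceFilter and Keep are hypothetical names that are not part of tessnet2, and the choice of a strict "below the threshold" comparison is an assumption. The helper is written against a generic item type so that it compiles without the tessnet2 assembly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of tessnet2.
public static class ConfidenceFilter
{
    // Returns the items whose confidence is strictly below maxConfidence.
    // Lower is better on the 0 (perfect) to 255 (reject) scale.
    public static List<T> Keep<T>(IEnumerable<T> items,
                                  Converter<T, double> confidence,
                                  double maxConfidence)
    {
        if (items == null) throw new ArgumentNullException("items");
        if (confidence == null) throw new ArgumentNullException("confidence");

        List<T> kept = new List<T>();
        foreach (T item in items)
        {
            // Keep only items that score better (lower) than the cut-off.
            if (confidence(item) < maxConfidence)
                kept.Add(item);
        }
        return kept;
    }
}
```

With a word list returned by DoOCR, a call might look like ConfidenceFilter.Keep<tessnet2.Word>(result, delegate(tessnet2.Word w) { return w.Confidence; }, 160). The 160 cut-off comes from the note above; the right value for a given set of images may differ.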

Some improvements:

* Tesseract variables are now exposed, along with a GetVariableList() method. Interesting variables are tessedit_char_whitelist and tessedit_char_blacklist, which should be set before calling Tessnet2.Init().

* The misspelled Width member of Word (thanks Lothar) has been corrected.

I didn't implement a character array with confidence info, simply because all characters in a word have the same confidence value. Internally, Tesseract builds words and creates characters from these words.
21JUN08 : Version 2.03.3
No bug fixes; tessnet2.Character has been added. Each tessnet2.Word now has a Character list that gives the position of each character.

// Inside a Paint handler: draw a box around each word, then around each of its characters
foreach (tessnet2.Word word in m_words)
{
    e.Graphics.DrawRectangle(pen, word.Left + panel2.AutoScrollPosition.X, word.Top + panel2.AutoScrollPosition.Y, word.Right - word.Left, word.Bottom - word.Top);
    foreach (tessnet2.Character c in word.CharList)
        e.Graphics.DrawRectangle(Pens.BlueViolet, c.Left + panel2.AutoScrollPosition.X, c.Top + panel2.AutoScrollPosition.Y, c.Right - c.Left, c.Bottom - c.Top);
}
11AUG08: Version 2.03.4
Added a suggestion from Julien Benoit: handle 32-bit images
28AUG08: Version 2.03.5
Bottom-up images, for example those created with Pegasus Imaging, are now handled correctly (Ed Brown)
17SEP08: Version 2.03.6
A method to get the thresholded image is now available. This can be useful to see what the OCR engine really sees.

Bitmap tessnet2.Tesseract.GetThresholdedImage (Bitmap bitmap, Rectangle rect)
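As a rough illustration, the thresholded image can be written to disk for inspection. This sketch makes several assumptions that the text does not confirm: that GetThresholdedImage is called on an initialized tessnet2.Tesseract instance, that Rectangle.Empty is accepted here as it is for DoOCR, and that Init takes the same three arguments as in the usage sample. The class name and output file name are placeholders. It requires a reference to the tessnet2 assembly.

```csharp
using System.Drawing;
using System.Drawing.Imaging;

class ThresholdDump
{
    static void Main()
    {
        using (Bitmap image = new Bitmap("eurotext.tif"))
        {
            tessnet2.Tesseract ocr = new tessnet2.Tesseract();
            // Same Init call as in the usage sample (assumed to be required first).
            ocr.Init(@"c:\temp", "fra", false);

            // Assumption: instance method, Rectangle.Empty passed as in the DoOCR sample.
            using (Bitmap thresholded = ocr.GetThresholdedImage(image, Rectangle.Empty))
            {
                // Pass the format explicitly so the file content matches its extension.
                thresholded.Save("eurotext_thresholded.png", ImageFormat.Png);
            }
        }
    }
}
```

Comparing the saved file with the source image can help explain poor confidence values, for example when thin strokes disappear after thresholding.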
21APR09: Version 2.03.7
New method void SetRootPath (string rootPath, string lang); to force the tessdata path. For example SetRootPath(@"c:\temp\tessdata", "eng");

Call this method before calling Init().
12MAY09: Version 2.03.8
Project recompiled in 32 and 64 bits with Visual Studio 2008; you need the corresponding C++ runtime.

The C++ part has been marked as "unmanaged" to avoid this problem.
06JUN09: Version 2.04.0
Project based on tesseract svn 2.0.4. This version includes the necessary corrections. SetRootPath disappears and Init now takes an additional argument, the tessdata path. If you set this value to null, it works like the previous version.

The assemblies are now named tessnet2_32.dll and tessnet2_64.dll for the 32-bit and 64-bit versions.
02SEP09: Version 2.04.1
Signed version assembly (strong name). Use "sn.exe -v tessnet2_32.dll" or "x64\sn.exe -v tessnet2_64.dll" to check signature.