tesseractocr's Introduction

What is TesseractOCR

TesseractOCR is a .NET wrapper for Tesseract 5.1.0. It was originally copied from Charles Weld's wrapper (https://github.com/charlesw/tesseract) and modified for my own needs.

How to use

You need a trained data file for each language you want to recognize, placed in a tessdata folder. You can get these files at https://github.com/tesseract-ocr/tessdata or https://github.com/tesseract-ocr/tessdata_fast
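A missing trained data file is an easy mistake to make, so it can help to check for the file before creating the engine. The sketch below is only an illustration: `HasTrainedData` is a hypothetical helper (not part of TesseractOCR), and it assumes the usual Tesseract naming of `<language code>.traineddata`, for example `eng.traineddata` for English.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the TesseractOCR API:
// returns true when <tessdataDir>/<languageCode>.traineddata exists.
static bool HasTrainedData(string tessdataDir, string languageCode)
{
    var path = Path.Combine(tessdataDir, languageCode + ".traineddata");
    return File.Exists(path);
}

// Assumed folder and language code; adjust both to your own setup.
if (!HasTrainedData("./tessdata", "eng"))
    Console.WriteLine("Missing ./tessdata/eng.traineddata, download it from one of the tessdata repositories.");
```

The same folder path is what you would then pass to the engine.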

OCR a page

using var engine = new Engine(@"./tessdata", Language.English, EngineMode.Default);
using var img = TesseractOCR.Pix.Image.LoadFromFile(testImagePath);
using var page = engine.Process(img);
Console.WriteLine("Mean confidence: {0}", page.MeanConfidence);
Console.WriteLine("Text: \r\n{0}", page.Text);

Iterate through the layout of a page

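// Create the engine as in the previous example and collect the output in a StringBuilder
using var engine = new Engine(@"./tessdata", Language.English, EngineMode.Default);
using var img = Pix.Image.LoadFromFile(testImagePath);
using var page = engine.Process(img);
var result = new StringBuilder();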

foreach (var block in page.Layout)
{
    result.AppendLine($"Block confidence: {block.Confidence}");
    if (block.BoundingBox != null)
    {
        var boundingBox = block.BoundingBox.Value;
        result.AppendLine($"Block bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                          $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
    }
    result.AppendLine($"Block text: {block.Text}");

    foreach (var paragraph in block.Paragraphs)
    {
        result.AppendLine($"Paragraph confidence: {paragraph.Confidence}");
        if (paragraph.BoundingBox != null)
        {
            var boundingBox = paragraph.BoundingBox.Value;
            result.AppendLine($"Paragraph bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                              $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
        }
        var info = paragraph.Info;
        result.AppendLine($"Paragraph info justification: {info.Justification}");
        result.AppendLine($"Paragraph info is list item: {info.IsListItem}");
        result.AppendLine($"Paragraph info is crown: {info.IsCrown}");
        result.AppendLine($"Paragraph info first line ident: {info.FirstLineIdent}");
        result.AppendLine($"Paragraph text: {paragraph.Text}");
        
        foreach (var textLine in paragraph.TextLines)
        {
            if (textLine.BoundingBox != null)
            {
                var boundingBox = textLine.BoundingBox.Value;
                result.AppendLine($"Text line bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                                  $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
            }
            result.AppendLine($"Text line confidence: {textLine.Confidence}");
            result.AppendLine($"Text line text: {textLine.Text}");

            foreach (var word in textLine.Words)
            {
                result.AppendLine($"Word confidence: {word.Confidence}");
                if (word.BoundingBox != null)
                {
                    var boundingBox = word.BoundingBox.Value;
                    result.AppendLine($"Word bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                                      $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
                }
                result.AppendLine($"Word is from dictionary: {word.IsFromDictionary}");
                result.AppendLine($"Word is numeric: {word.IsNumeric}");
                result.AppendLine($"Word language: {word.Language}");
                result.AppendLine($"Word text: {word.Text}");

                foreach (var symbol in word.Symbols)
                {
                    result.AppendLine($"Symbol confidence: {symbol.Confidence}");
                    if (symbol.BoundingBox != null)
                    {
                        var boundingBox = symbol.BoundingBox.Value;
                        result.AppendLine($"Symbol bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                                          $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
                    }
                    result.AppendLine($"Symbol is superscript: {symbol.IsSuperscript}");
                    result.AppendLine($"Symbol is dropcap: {symbol.IsDropcap}");
                    result.AppendLine($"Symbol text: {symbol.Text}");
                }
            }
        }
    }
}

For more examples see https://github.com/Sicos1977/TesseractOCR/wiki/examples.md

Supported input formats

Tesseract uses the Leptonica library to read images in one of these formats:

  • PNG - requires libpng, libz
  • JPEG - requires libjpeg / libjpeg-turbo
  • TIFF - requires libtiff, libz
  • JPEG 2000 - requires libopenjp2
  • GIF - requires libgif (giflib)
  • WebP (including animated WebP) - requires libwebp
  • BMP - no library required*
  • PNM - no library required*

* Except Leptonica

I have dropped support for System.Drawing.Image since it only works well on Windows and not on other systems. You should be fine with Leptonica.
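If you want to reject obviously unsupported files before handing them to the engine, a simple extension check is one option. This is a rough sketch under assumptions: the extension list is my own mapping of the formats listed above, `LooksSupported` is a hypothetical helper, and a matching extension does not guarantee that the file can actually be decoded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Assumed mapping from the formats listed above to common file extensions.
var supportedExtensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".jp2", ".gif", ".webp", ".bmp", ".pnm"
};

// Hypothetical helper: true when the file extension is in the list above.
bool LooksSupported(string path) => supportedExtensions.Contains(Path.GetExtension(path));

Console.WriteLine(LooksSupported("scan.TIFF")); // True
Console.WriteLine(LooksSupported("scan.pdf"));  // False
```

The check is case-insensitive, so scan.TIFF and scan.tiff are treated the same.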

Installing via NuGet

The easiest way to install TesseractOCR is via NuGet.

In Visual Studio's Package Manager Console, simply enter the following command:

Install-Package TesseractOCR

License Information

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
