September 13, 2011

OCR in C# using MODI - Microsoft Office Document Imaging (Fetch Text From Image in C#)

Check my previous post to see what OCR is:
http://www.dotnetissues.com/2011/09/ocr-in-c-using-googles-tessnet2-fetch.html

In this post I am going to use MODI (Microsoft Office Document Imaging). On my test image it gave 100% correct results, and it worked perfectly fine with digits too.

1.) Install MS Office 2007.
2.) In Visual Studio, go to Add Reference --> COM and select the Microsoft Office Document Imaging library.
3.) Below is the sample code to extract text from an image using MODI.



using System;
using MODI;

namespace TesseractConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the image into a MODI document.
            DocumentClass doc = new DocumentClass();
            doc.Create(@"C:\Documents and Settings\lak\Desktop\quotes_7a.jpg");

            // Run OCR in English, with auto-orientation and straightening enabled.
            doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Print the recognized text of every page image in the document.
            foreach (MODI.Image image in doc.Images)
            {
                Console.WriteLine(image.Layout.Text);
            }

            // Release the document without saving changes.
            doc.Close(false);
        }
    }
}


I got a 100% correct result on this image and found MODI better than Google's Tessnet2.


September 12, 2011

OCR in C# using Google's Tessnet2 (Fetch Text From image in C#)

I am experimenting with OCR (Optical Character Recognition), which means reading text from an image. I searched a lot over the web and found two solutions:


1.) Using Google's Tessnet2
2.) Using MODI (Microsoft Office Document Imaging library)


So far I couldn't try MODI, as this library comes with MS Office 2007 or XP, which I have not been able to get hold of yet.


I tried Google's Tessnet2 and it gave me about 98% correct results, but it only read letters; it couldn't read digits.


Below are the steps I used.


1.) Download the Tessnet2 binary from the link below:
http://www.pixel-technology.com/freeware/tessnet2/


2.) In your Visual Studio project, add a reference to tessnet2_32.dll (for a 32-bit OS) or tessnet2_64.dll (for a 64-bit OS).






3.) Download the language data definition file (tesseract-2.00.eng.tar.gz; I used English) from the link below:
http://code.google.com/p/tesseract-ocr/downloads/list


4.) Unzip the archive and keep all of its files in a directory named 'tessdata'.
    Place this directory in your application's bin\Debug folder.
    For example, in my case I put it here: "D:\TanviDoc\OCRApp\OCRApp\bin\Debug\tessdata"


5.) Below is the sample code to do OCR





using System;
using System.Collections.Generic;
using System.Drawing;

namespace TesseractConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            using (Bitmap bmp = new Bitmap(@"C:\Documents and Settings\lak\Desktop\quotes_7a.jpg"))
            {
                tessnet2.Tesseract ocr = new tessnet2.Tesseract();
                // ocr.SetVariable("tessedit_char_whitelist", "0123456789");

                // With a null path, the 'tessdata' folder from step 4 is used.
                ocr.Init(null, "eng", false);

                // Rectangle.Empty runs OCR on the whole image. To limit it to a region:
                // List<tessnet2.Word> words = ocr.DoOCR(bmp, new Rectangle(792, 247, 130, 54));
                List<tessnet2.Word> words = ocr.DoOCR(bmp, Rectangle.Empty);

                // Print the recognized text line by line.
                int lineCount = tessnet2.Tesseract.LineCount(words);
                for (int i = 0; i < lineCount; i++)
                {
                    Console.WriteLine("Line {0} = {1}", i, tessnet2.Tesseract.GetLineText(words, i));
                }

                // Print each recognized word with its confidence value.
                foreach (tessnet2.Word word in words)
                {
                    Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
                }
            }
        }
    }
}




6.) Run the program; you'll see the image converted into text.
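
If the program fails at startup, the setup from steps 2 and 4 is the first thing I would check. Below is a minimal sketch of such a check. OcrSetupCheck and its two methods are hypothetical helpers of mine, not part of Tessnet2, and the sketch assumes the DLLs are named tessnet2_32.dll and tessnet2_64.dll and that the language files start with the language code (for example "eng.").

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Tessnet2) for checking the setup steps.
static class OcrSetupCheck
{
    // Returns the Tessnet2 DLL name that matches the given pointer size:
    // 8 bytes means a 64-bit process, anything else is treated as 32-bit.
    public static string PickTessnetDll(int pointerSize)
    {
        return pointerSize == 8 ? "tessnet2_64.dll" : "tessnet2_32.dll";
    }

    // Returns true when <baseDir>\tessdata exists and contains at least one
    // file whose name starts with "<language>.", for example "eng.".
    public static bool HasLanguageData(string baseDir, string language)
    {
        string tessdata = Path.Combine(baseDir, "tessdata");
        return Directory.Exists(tessdata)
            && Directory.GetFiles(tessdata, language + ".*").Length > 0;
    }
}
```

From Main you could call OcrSetupCheck.PickTessnetDll(IntPtr.Size) and OcrSetupCheck.HasLanguageData(AppDomain.CurrentDomain.BaseDirectory, "eng") before creating the Tesseract object. Note that the choice of DLL depends on whether the process runs as 32-bit or 64-bit, so a project built for x86 needs the 32-bit DLL even on a 64-bit OS.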


I got all the data correct, except that I got 'In' in place of 'On'.


The above technique did not convert digits from the image to text, which I still have to figure out.
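
One thing I still want to try for digits is the character whitelist that appears commented out in the sample code. The sketch below is untested and only an assumption about what might help: it uses the same Tessnet2 calls as the sample, sets the "tessedit_char_whitelist" variable before Init, and takes the image path from the first command-line argument (my choice, for illustration). The DigitsOnlyOcr class name is also just a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

class DigitsOnlyOcr
{
    static void Main(string[] args)
    {
        // Assumption: the image path is passed as the first command-line argument.
        using (Bitmap bmp = new Bitmap(args[0]))
        {
            tessnet2.Tesseract ocr = new tessnet2.Tesseract();

            // Untested idea: limit recognition to digit characters before Init.
            ocr.SetVariable("tessedit_char_whitelist", "0123456789");
            ocr.Init(null, "eng", false);

            // Run OCR on the whole image and print each word with its confidence.
            List<tessnet2.Word> words = ocr.DoOCR(bmp, Rectangle.Empty);
            foreach (tessnet2.Word word in words)
            {
                Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
            }
        }
    }
}
```

With a digits-only whitelist the letters in the image will not be recognized as letters, so this would probably make most sense on a region that holds only numbers, passed as a Rectangle to DoOCR instead of Rectangle.Empty.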


I still have to experiment with MODI (Microsoft Office Document Imaging library) and figure out which one is the better one.