How to scan and OCR like a pro with free software

From LinuxReviews

OCR using free software

Several free software programs are available for optical character recognition (OCR).

All of them are command-line tools that take images as input and produce text as output.

Preparation

  • Get the technology. On Gentoo:

emerge app-text/tesseract app-text/gocr app-text/ocrad

  • Scan the images.
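
Scanning can also be done from the command line. The following is a minimal sketch, assuming the scanner is supported by SANE and that `scanimage` is installed. The function name `scan_page` is made up for illustration, and `--resolution` and `--mode` are backend options whose accepted values differ between scanners, so the values here are assumptions that may need adjusting.

```bash
#!/bin/bash
# Sketch: scan one page to a TIFF file with SANE's scanimage.
# --resolution 300 and --mode Gray are assumed values; check what your
# scanner backend accepts with: scanimage --help
scan_page() {
  local out=$1
  scanimage --format=tiff --resolution 300 --mode Gray > "$out"
}

# Usage (hypothetical file name):
#   scan_page 0001.tif
```

Greyscale at around 300 dpi is a common starting point for printed text, but the best settings depend on the source material.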

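The OCR tools below expect PNM input, so the scanned files must be converted to PNM first. The following script does this for every TIFF file in the current directory, using ImageMagick's convert.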

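#!/bin/bash
# Convert every TIFF scan in the current directory to PNM.
function prepareocr(){
  for f in *.tif; do
    [ -e "$f" ] || continue   # skip if no *.tif files exist
    convert "$f" "${f%.tif}.pnm"
  done
}
prepareocr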

Technology

Ocrad - The GNU OCR

Command-line tool for optical character recognition.

GNU Ocrad - Optical Character Recognition program.
Reads pnm file(s), or standard input, and sends text to standard output.

Usage: ocrad [options] [files]
Options:
  -h, --help               display this help and exit
  -V, --version            output version information and exit
  -a, --append             append text to output file
  -b, --block=<n>          process only the specified text block
  -c, --charset=<name>     try `--charset=help' for a list of names
  -e, --filter=<name>      try `--filter=help' for a list of names
  -f, --force              force overwrite of output file
  -F, --format=<fmt>       output format (byte, utf8)
  -i, --invert             invert image levels (white on black)
  -l, --layout=<n>         layout analysis, 0=none, 1=column, 2=full
  -o <file>                place the output into <file>
  -p, --crop=<l,t,r,b>     crop input image by given rectangle
  -s, --scale=[-]<n>       scale input image by [1/]<n>
  -t, --transform=<name>   try `--transform=help' for a list of names
  -T, --threshold=<n%>     threshold for binarization (0-100%)
  -v, --verbose            be verbose
  -x <file>                export OCR Results File to <file>

Report bugs to bug-ocrad@gnu.org
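
Going by the help text above, a typical call names a character set, an output format and an output file. A minimal sketch, assuming the option syntax of the Ocrad version shown here (other releases may differ); the function name and the file names are hypothetical.

```bash
#!/bin/bash
# Sketch: recognise one PNM page with ocrad, using options listed in the
# help text above. -f overwrites an existing output file.
ocr_with_ocrad() {
  local in=$1 out=$2
  ocrad --charset=iso-8859-15 --format=utf8 -f -o "$out" "$in"
}

# Usage (hypothetical file names):
#   ocr_with_ocrad page.pnm page.txt
```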

gocr

The story with gocr is that its library, libgocr, is dead[1].

$ gocr  -h
 Optical Character Recognition --- gocr 0.45 20071126
 Copyright (C) 2001-2007 Joerg Schulenburg
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h        - get this help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file  (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num      value of certainty (in percent, 0..100, default=95)
 examples:
        gocr -m 4 text1.pbm                   # do layout analyzis
        gocr -m 130 -p ./database/ text1.pbm  # extend database
        djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe

 webpage: http://jocr.sourceforge.net/

Tesseract-ocr

http://code.google.com/p/tesseract-ocr/
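
Tesseract is called as `tesseract imagename outputbase [-l lang]` and writes its result to `outputbase.txt`. A minimal sketch, assuming the language data for the chosen language is installed and that the installed Tesseract version can read the input image format (older releases only accept TIFF); the function name and file names are hypothetical.

```bash
#!/bin/bash
# Sketch: recognise one image with tesseract.
# The second argument is the language code; it defaults to English.
ocr_with_tesseract() {
  local in=$1 lang=${2:-eng}
  # Output goes to <name>.tesseract.txt next to the input file.
  tesseract "$in" "${in%.*}.tesseract" -l "$lang"
}

# Usage (hypothetical file name, Swedish text):
#   ocr_with_tesseract 0152.tif swe
```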

OCR files

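# Run ocrad and gocr on every PNM file in the current directory.
# Output is written to <name>.ocrad.txt and <name>.gocr.txt.
function makeocr(){
  for f in *.pnm; do
    base=${f%.pnm}
    ocrad --charset=iso-8859-15 -o "${base}.ocrad.txt" "$f"
    gocr -s 0 -l 0 -d -1 -f ISO8859_1 -o "${base}.gocr.txt" "$f"
    #gocr -i "$f" -f HTML -o "${base}.gocr.html"
  done
}
makeocr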
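
The commands above write text in ISO-8859 encodings. On a system that uses UTF-8, the results may need converting before non-ASCII letters such as å, ä and ö display correctly. A minimal sketch using `iconv`; the function name `to_utf8` and the `.utf8.txt` suffix are made up for illustration.

```bash
#!/bin/bash
# Sketch: convert one OCR text file from the given charset to UTF-8.
# Writes <name>.utf8.txt next to the original and leaves the input untouched.
to_utf8() {
  local charset=$1 file=$2
  iconv -f "$charset" -t UTF-8 "$file" > "${file%.txt}.utf8.txt"
}

# Usage:
#   for t in *ocrad.txt; do to_utf8 ISO-8859-15 "$t"; done
#   for t in *gocr.txt;  do to_utf8 ISO-8859-1  "$t"; done
```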

A page

Ocrad output

OSVNLIGA MAKTER
.
.
_LAND växterna, särski_t bland dem SOIR spira
upp om vgrgn, finoss _et hera, som allt efter
väderlekgné växlingar antaga ett helt o_ika växt-
sätt: vid kallt väder, när temper&turen närlna_ sig
nollpunkten, trncka de sig tätt in till marken_ som
ville de invid jorde_s moderssköte söka skydd mot
döl_en, vid ,varmarg väderlek räta de däremot upp
.sig och v_xa cakt i höjdcn mot ljuéet_ Exempel
_.pa sadana värmek_nsliga växter _r bland andra den
vanliga rödplistern, vars stjälkar pa varsidan allt
efter tempera_urens gang intaga et_ hor¡sonta_t,
;snett, e_ler vertikalt läge, och son_ därrör med fog
._kan betecknas som en levande termQmet_r.
vara van_iga varsippobT _ virsippán, gulsipp&n
_och blasippan _ utmä_ka sig ocksg för en vä) ut-
'vecklad värmekäns_ighet: vid kallt väder sluta sig
blombladen samman och blomskadtet böjer sig nedat _
mot jorden, rnen när solen _yser ach de varma

gocr output

References

  1. http://jocr.sourceforge.net/api/ (December 24, 2006): libgocr is dead