How to scan and OCR like a pro with free software

From LinuxReviews

OCR using free software

Several free software programs are available for optical character recognition (OCR): Ocrad, gocr and Tesseract.

All of them are command line tools that take images as input and output text.

Preparation

  • Install the OCR tools. On Gentoo:

emerge app-text/tesseract app-text/gocr app-text/ocrad

  • Scan the images.
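
The scanning step can be done from the command line as well. The following is a minimal sketch that assumes SANE's scanimage is installed and a scanner is already configured. The function name scanpage and the file name are hypothetical, and --resolution is a backend-specific option that may not be offered by every device.

```shell
#!/bin/bash
# Sketch: scan one page with SANE's scanimage and save it as a TIFF.
# Assumptions: scanimage is installed and a scanner is configured;
# --resolution is backend-specific and may be missing on some devices.
scanpage() {
  local out="$1"                 # output file, e.g. 0152.tif
  scanimage --format=tiff --resolution 300 > "$out"
}
# Usage: scanpage 0152.tif
```

The TIFF output matches the conversion step that follows. scanimage can also write PNM directly with --format=pnm, which would make the conversion unnecessary.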

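Ocrad and gocr read PNM images, so the scans must be converted to PNM first. The following script converts every TIFF file in the current directory with ImageMagick's convert.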

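#!/bin/bash
# Convert every TIFF scan in the current directory to PNM (needs ImageMagick).
prepareocr() {
  for f in *.tif; do
    [ -e "$f" ] || continue        # skip if no .tif file matched
    convert "$f" "${f%.tif}.pnm"   # 0152.tif -> 0152.pnm
  done
}
prepareocr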

Technology

Ocrad - The GNU OCR

Ocrad is the GNU command line tool for OCR. Its help output:

GNU Ocrad - Optical Character Recognition program.
Reads pnm file(s), or standard input, and sends text to standard output.

Usage: ocrad [options] [files]
Options:
  -h, --help               display this help and exit
  -V, --version            output version information and exit
  -a, --append             append text to output file
  -b, --block=<n>          process only the specified text block
  -c, --charset=<name>     try `--charset=help' for a list of names
  -e, --filter=<name>      try `--filter=help' for a list of names
  -f, --force              force overwrite of output file
  -F, --format=<fmt>       output format (byte, utf8)
  -i, --invert             invert image levels (white on black)
  -l, --layout=<n>         layout analysis, 0=none, 1=column, 2=full
  -o <file>                place the output into <file>
  -p, --crop=<l,t,r,b>     crop input image by given rectangle
  -s, --scale=[-]<n>       scale input image by [1/]<n>
  -t, --transform=<name>   try `--transform=help' for a list of names
  -T, --threshold=<n%>     threshold for binarization (0-100%)
  -v, --verbose            be verbose
  -x <file>                export OCR Results File to <file>

Report bugs to bug-ocrad@gnu.org
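
As an illustration of the options listed above, here is a minimal sketch that runs Ocrad on a single page. The function name ocrad_page and the file names are hypothetical. Only options shown in the help text are used.

```shell
# Sketch: OCR a single PNM page with Ocrad, using only options that appear
# in the help text. Function and file names are placeholders.
ocrad_page() {
  local img="$1"                      # input image, e.g. 0152.pnm
  local out="${img%.pnm}.ocrad.txt"   # 0152.pnm -> 0152.ocrad.txt
  # -f overwrites an existing output file, -F utf8 selects UTF-8 output,
  # --charset=iso-8859-15 selects the Latin-9 character set.
  ocrad -f -F utf8 --charset=iso-8859-15 -o "$out" "$img"
}
# Usage: ocrad_page 0152.pnm
```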

gocr

The gocr story is that its library, libgocr, is dead[1]. The gocr command line tool is what is used here.

$ gocr  -h
 Optical Character Recognition --- gocr 0.45 20071126
 Copyright (C) 2001-2007 Joerg Schulenburg
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h        - get this help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file  (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num      value of certainty (in percent, 0..100, default=95)
 examples:
        gocr -m 4 text1.pbm                   # do layout analyzis
        gocr -m 130 -p ./database/ text1.pbm  # extend database
        djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe

 webpage: http://jocr.sourceforge.net/
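
The pipe example in the help text can be adapted to other input formats. The sketch below assumes ImageMagick's convert is installed and uses it to feed a TIFF to gocr without writing an intermediate PNM file. The function name gocr_tif is hypothetical.

```shell
# Sketch: OCR a TIFF with gocr through a pipe.
# Assumption: ImageMagick's convert is installed. "pnm:-" makes convert
# write PNM data to standard output, and "-" makes gocr read standard input.
gocr_tif() {
  local img="$1"                      # input image, e.g. 0152.tif
  convert "$img" pnm:- | gocr -f UTF8 -
}
# Usage: gocr_tif 0152.tif > 0152.gocr.txt
```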

Tesseract-ocr

http://code.google.com/p/tesseract-ocr/
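
Tesseract is run with an image and an output base name, to which it appends ".txt" itself. The following is a minimal sketch, not a tested recipe. It assumes the Swedish language data ("swe", chosen because the sample page further down is Swedish) is installed and that the installed Tesseract can read the image format it is given. Old releases accepted only TIFF. The function name tesseract_page is hypothetical.

```shell
# Sketch: run Tesseract on one image.
# Assumptions: the "swe" language data is installed, and this Tesseract
# build can read the given image format (old releases accepted only TIFF).
tesseract_page() {
  local img="$1"                      # input image, e.g. 0152.tif
  # Tesseract appends ".txt", so this writes 0152.tesseract.txt
  tesseract "$img" "${img%.*}.tesseract" -l swe
}
# Usage: tesseract_page 0152.tif
```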

OCR files

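# OCR every PNM file in the current directory with both ocrad and gocr.
makeocr() {
  for f in *.pnm; do
    [ -e "$f" ] || continue          # skip if no .pnm file matched
    base="${f%.pnm}"                 # 0152.pnm -> 0152
    ocrad --charset=iso-8859-15 -o "$base.ocrad.txt" "$f"
    gocr -s 0 -l 0 -d -1 -f ISO8859_1 "$f" -o "$base.gocr.txt"
    #gocr -i "$f" -f HTML -o "$base.gocr.html"
  done
}
makeocr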

A page

Ocrad output

OSVNLIGA MAKTER
.
.
_LAND växterna, särski_t bland dem SOIR spira
upp om vgrgn, finoss _et hera, som allt efter
väderlekgné växlingar antaga ett helt o_ika växt-
sätt: vid kallt väder, när temper&turen närlna_ sig
nollpunkten, trncka de sig tätt in till marken_ som
ville de invid jorde_s moderssköte söka skydd mot
döl_en, vid ,varmarg väderlek räta de däremot upp
.sig och v_xa cakt i höjdcn mot ljuéet_ Exempel
_.pa sadana värmek_nsliga växter _r bland andra den
vanliga rödplistern, vars stjälkar pa varsidan allt
efter tempera_urens gang intaga et_ hor¡sonta_t,
;snett, e_ler vertikalt läge, och son_ därrör med fog
._kan betecknas som en levande termQmet_r.
vara van_iga varsippobT _ virsippán, gulsipp&n
_och blasippan _ utmä_ka sig ocksg för en vä) ut-
'vecklad värmekäns_ighet: vid kallt väder sluta sig
blombladen samman och blomskadtet böjer sig nedat _
mot jorden, rnen när solen _yser ach de varma

gocr output

References

  1. http://jocr.sourceforge.net/api/ (December 24, 2006): libgocr is dead