How to scan and OCR like a pro with free software
From LinuxReviews
OCR using free software
There are several free software OCR technologies available for your optical character recognition pleasure.
All of them are command line tools which take images as input and spit out text.
Preparation
Get the technology.
Gentoo
emerge app-text/tesseract app-text/gocr app-text/ocrad
- Scan the images.
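You must have the files in pnm format to use the OCR technology. The following script converts every tif file in the current directory with ImageMagick's convert.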
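#!/bin/bash
# Convert every tif file in the current directory to pnm
function prepareocr(){
  for f in *.tif; do
    # basename strips the trailing "tif", leaving "name." to which pnm is appended
    convert "$f" "$(basename "$f" tif)pnm"
  done
}
prepareocr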
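If the scanner is driven through SANE, the conversion step can be skipped, since scanimage can write pnm directly. The following is a minimal sketch rather than a tested recipe: the function names page_name and scan_page are made up for illustration, and --resolution and --mode are backend-specific options that your scanner driver may name differently or not offer at all.

```bash
#!/bin/bash
# Build a zero-padded file name such as 0152.pnm from a page number.
# (hypothetical helper, not part of any tool mentioned on this page)
page_name() {
    printf '%04d.pnm\n' "$1"
}

# Scan one page straight to a pnm file.
# Assumption: the scanner backend accepts --resolution and --mode Gray.
scan_page() {
    scanimage --format=pnm --resolution 300 --mode Gray > "$(page_name "$1")"
}
```

Calling scan_page 152 would then write the scan to 0152.pnm. Pass page numbers without leading zeros, since printf reads a leading 0 as octal.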
Technology
Ocrad - The GNU OCR
Command line tool for recognizing text in scanned images.
GNU Ocrad - Optical Character Recognition program.
Reads pnm file(s), or standard input, and sends text to standard output.

Usage: ocrad [options] [files]

Options:
  -h, --help               display this help and exit
  -V, --version            output version information and exit
  -a, --append             append text to output file
  -b, --block=<n>          process only the specified text block
  -c, --charset=<name>     try `--charset=help' for a list of names
  -e, --filter=<name>      try `--filter=help' for a list of names
  -f, --force              force overwrite of output file
  -F, --format=<fmt>       output format (byte, utf8)
  -i, --invert             invert image levels (white on black)
  -l, --layout=<n>         layout analysis, 0=none, 1=column, 2=full
  -o <file>                place the output into <file>
  -p, --crop=<l,t,r,b>     crop input image by given rectangle
  -s, --scale=[-]<n>       scale input image by [1/]<n>
  -t, --transform=<name>   try `--transform=help' for a list of names
  -T, --threshold=<n%>     threshold for binarization (0-100%)
  -v, --verbose            be verbose
  -x <file>                export OCR Results File to <file>

Report bugs to bug-ocrad@gnu.org
gocr
The story with gocr is that libgocr is dead[1].
$ gocr -h
Optical Character Recognition --- gocr 0.45 20071126
Copyright (C) 2001-2007 Joerg Schulenburg
released under the GNU General Public License

using: gocr [options] pnm_file_name  # use - for stdin
options (see gocr manual pages for more details):
  -h          - get this help
  -i name     - input image file (pnm,pgm,pbm,ppm,pcx,...)
  -o name     - output file (redirection of stdout)
  -e name     - logging file (redirection of stderr)
  -x name     - progress output to fifo (see manual)
  -p name     - database path including final slash (default is ./db/)
  -f fmt      - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
  -l num      - threshold grey level 0<160<=255 (0 = autodetect)
  -d num      - dust_size (remove small clusters, -1 = autodetect)
  -s num      - spacewidth/dots (0 = autodetect)
  -v num      - verbose (see manual page)
  -c string   - list of chars (debugging, see manual)
  -C string   - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
  -m num      - operation modes (bitpattern, see manual)
  -a num      - value of certainty (in percent, 0..100, default=95)

examples:
  gocr -m 4 text1.pbm                    # do layout analyzis
  gocr -m 130 -p ./database/ text1.pbm   # extend database
  djpeg -pnm -gray text.jpg | gocr -     # use jpeg-file via pipe

webpage: http://jocr.sourceforge.net/
Tesseract-ocr
http://code.google.com/p/tesseract-ocr/
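Tesseract is called as tesseract imagename outputbase and appends .txt to the output base itself. A minimal sketch in the style of the other loops on this page follows. The helper name ocr_outbase is made up here, the -l swe option assumes that Swedish language data is installed (the sample page further down is Swedish), and whether your build reads tif, pnm or both depends on its version.

```bash
#!/bin/bash
# Derive the output base name for tesseract: page.tif -> page.tesseract
# (hypothetical helper; tesseract itself adds the .txt suffix)
ocr_outbase() {
    printf '%s.tesseract\n' "${1%.*}"
}

# Run tesseract on every tif file in the current directory.
# Assumption: "swe" language data is installed; change or drop -l otherwise.
maketesseract() {
    local f
    for f in *.tif; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
        tesseract "$f" "$(ocr_outbase "$f")" -l swe
    done
}
```

Running maketesseract in the directory with the scans should leave one .tesseract.txt file per page next to the other engines' output.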
OCR files
# Run ocrad and gocr on every pnm file in the current directory
function makeocr(){
  for f in *.pnm; do
    ocrad "$f" --charset=iso-8859-15 -o "$(basename "$f")ocrad.txt"
    gocr -s 0 -l 0 -d -1 -f ISO8859_1 "$f" -o "$(basename "$f")gocr.txt"
    #gocr -i "$f" -f HTML -o "$(basename "$f")gocr.html"
  done
}
makeocr
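Once the engines have written their text files, a rough way to compare them is to count the underscores in each result. This is only a sketch under an assumption: that an underscore marks a character the engine could not recognize, as the Ocrad sample on this page suggests. Underscores that really occur in the scanned text would be counted too. The function name count_unrecognized is made up for illustration.

```bash
#!/bin/bash
# Count the underscores in an OCR result file.
# Assumption: "_" stands for a character the engine failed to recognize.
count_unrecognized() {
    # tr -cd deletes every byte except "_"; wc -c counts what is left;
    # the last tr strips the padding some wc implementations add
    tr -cd '_' < "$1" | wc -c | tr -d ' '
}
```

A loop such as for t in *ocrad.txt *gocr.txt; do echo "$t: $(count_unrecognized "$t")"; done prints one count per result file; a lower number hints at, but does not prove, a better recognition.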
A page
Ocrad output
OSVNLIGA MAKTER . . _LAND växterna, särski_t bland dem SOIR spira upp om vgrgn, finoss _et hera, som allt efter väderlekgné växlingar antaga ett helt o_ika växt- sätt: vid kallt väder, när temper&turen närlna_ sig nollpunkten, trncka de sig tätt in till marken_ som ville de invid jorde_s moderssköte söka skydd mot döl_en, vid ,varmarg väderlek räta de däremot upp .sig och v_xa cakt i höjdcn mot ljuéet_ Exempel _.pa sadana värmek_nsliga växter _r bland andra den vanliga rödplistern, vars stjälkar pa varsidan allt efter tempera_urens gang intaga et_ hor¡sonta_t, ;snett, e_ler vertikalt läge, och son_ därrör med fog ._kan betecknas som en levande termQmet_r. vara van_iga varsippobT _ virsippán, gulsipp&n _och blasippan _ utmä_ka sig ocksg för en vä) ut- 'vecklad värmekäns_ighet: vid kallt väder sluta sig blombladen samman och blomskadtet böjer sig nedat _ mot jorden, rnen när solen _yser ach de varma
gocr output
References
- ↑ http://jocr.sourceforge.net/api/ December 24, 2006: libgocr is dead