How to scan and OCR like a pro with free software
From LinuxReviews
OCR using free software
Several free software OCR (optical character recognition) tools are available.
All of them are command line tools that take images as input and output text.
Preparation
- Get the technology.
On Gentoo:
emerge app-text/tesseract app-text/gocr app-text/ocrad
- Scan the images.
Ocrad and gocr read PNM files, so convert the scans to PNM before running OCR on them.
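#!/bin/bash
# Convert every TIFF in the current directory to PNM (uses ImageMagick's convert).
function prepareocr(){
    for f in *.tif; do
        # basename strips the "tif" suffix but keeps the dot: scan.tif -> scan.pnm
        convert "$f" "$(basename "$f" tif)pnm"
    done
}
prepareocr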
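Since the OCR tools expect PNM input, it can be worth confirming that a converted file really is PNM before feeding it to them. The following is a minimal sketch, assuming the usual Netpbm magic numbers (P1 to P6 in the first two bytes of the file). The function name is_pnm is made up for this example.

```bash
#!/bin/bash
# Hypothetical helper (not part of any OCR package): succeed if the file
# starts with a Netpbm magic number P1..P6, fail otherwise.
is_pnm() {
    local magic
    # Read only the first two bytes; a missing or unreadable file fails here.
    magic=$(head -c 2 -- "$1" 2>/dev/null) || return 1
    case "$magic" in
        P[1-6]) return 0 ;;
        *)      return 1 ;;
    esac
}
```

It could be used as `for f in *.pnm; do is_pnm "$f" || echo "not PNM: $f"; done` to list files that need converting again.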
Technology
Ocrad - The GNU OCR
Command line tool for character recognition. Its help output:
GNU Ocrad - Optical Character Recognition program.
Reads pnm file(s), or standard input, and sends text to standard output.

Usage: ocrad [options] [files]

Options:
  -h, --help               display this help and exit
  -V, --version            output version information and exit
  -a, --append             append text to output file
  -b, --block=<n>          process only the specified text block
  -c, --charset=<name>     try `--charset=help' for a list of names
  -e, --filter=<name>      try `--filter=help' for a list of names
  -f, --force              force overwrite of output file
  -F, --format=<fmt>       output format (byte, utf8)
  -i, --invert             invert image levels (white on black)
  -l, --layout=<n>         layout analysis, 0=none, 1=column, 2=full
  -o <file>                place the output into <file>
  -p, --crop=<l,t,r,b>     crop input image by given rectangle
  -s, --scale=[-]<n>       scale input image by [1/]<n>
  -t, --transform=<name>   try `--transform=help' for a list of names
  -T, --threshold=<n%>     threshold for binarization (0-100%)
  -v, --verbose            be verbose
  -x <file>                export OCR Results File to <file>

Report bugs to bug-ocrad@gnu.org
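As an illustration of how some of these options combine, here is a small sketch of a wrapper around ocrad. It uses only flags listed in the help text above. The function name ocr_with_ocrad, the output naming scheme and the OCRAD override variable are assumptions made for this example, not part of Ocrad.

```bash
#!/bin/bash
# Hypothetical wrapper: run ocrad on one PNM file and write the text
# next to it, e.g. page.pnm -> page.ocrad.txt.
# OCRAD is a made-up override so a different binary can be substituted.
ocr_with_ocrad() {
    local in=$1
    local out=${in%.pnm}.ocrad.txt
    # --charset selects the character set to recognize, --format=utf8 the
    # output encoding, --force overwrites an existing output file.
    "${OCRAD:-ocrad}" --charset=iso-8859-15 --format=utf8 --force -o "$out" "$in"
}
```

Calling `ocr_with_ocrad 0152.pnm` would then be expected to leave the recognized text in 0152.ocrad.txt.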
gocr
The gocr story is that its library, libgocr, is dead[1]. The command line tool is still usable.
$ gocr -h
 Optical Character Recognition --- gocr 0.45 20071126
 Copyright (C) 2001-2007 Joerg Schulenburg
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h        - get this help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num    value of certainty (in percent, 0..100, default=95)
 examples:
    gocr -m 4 text1.pbm                   # do layout analyzis
    gocr -m 130 -p ./database/ text1.pbm  # extend database
    djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe
 webpage: http://jocr.sourceforge.net/
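The options above can be combined in a small wrapper. The sketch below uses only flags from the help text (autodetected threshold, dust size and space width, UTF8 output) and optionally passes a character filter with -C. The names ocr_with_gocr and GOCR, and the output naming, are assumptions for this example only.

```bash
#!/bin/bash
# Hypothetical wrapper: run gocr on one image and write page.gocr.txt.
# An optional second argument is passed as a char filter, e.g. "0-9".
# GOCR is a made-up override so a different binary can be substituted.
ocr_with_gocr() {
    local in=$1 filter=${2:-}
    local out=${in%.*}.gocr.txt
    # 0 / -1 / 0 ask gocr to autodetect threshold, dust size and space width.
    local args=(-f UTF8 -l 0 -d -1 -s 0 -i "$in" -o "$out")
    if [ -n "$filter" ]; then
        args+=(-C "$filter")
    fi
    "${GOCR:-gocr}" "${args[@]}"
}
```

For a page of figures, `ocr_with_gocr table.pnm "0-9"` would restrict recognition to digits, which may reduce misreadings when the content is known in advance.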
Tesseract-ocr
http://code.google.com/p/tesseract-ocr/
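For comparison with the other two tools, a Tesseract run might look like the sketch below. It assumes the usual command line form `tesseract <image> <outputbase> -l <lang>`, where Tesseract itself appends .txt to the output base. Which image formats are accepted depends on the installed version, and the language data (for example swe for Swedish) must be installed separately. The names ocr_with_tesseract and TESSERACT are made up for this example.

```bash
#!/bin/bash
# Hypothetical wrapper: run tesseract on one image.
# page.tif -> page.tesseract.txt (tesseract adds the .txt itself).
# TESSERACT is a made-up override so a different binary can be substituted.
ocr_with_tesseract() {
    local in=$1 lang=${2:-eng}
    local base=${in%.*}.tesseract
    "${TESSERACT:-tesseract}" "$in" "$base" -l "$lang"
}
```

`ocr_with_tesseract 0152.tif swe` would then be expected to write 0152.tesseract.txt, provided the Swedish language data is available.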
OCR files
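# Run Ocrad and gocr on every PNM file in the current directory.
function makeocr(){
    for f in *.pnm; do
        # Output names: page.pnm -> page.ocrad.txt and page.gocr.txt
        ocrad "$f" --charset=iso-8859-15 -o "$(basename "$f" pnm)ocrad.txt"
        gocr -s 0 -l 0 -d -1 -f ISO8859_1 "$f" -o "$(basename "$f" pnm)gocr.txt"
        #gocr -i "$f" -f HTML -o "$(basename "$f" pnm)gocr.html"
    done
}
makeocr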
A page
Ocrad output
OSVNLIGA MAKTER . . _LAND växterna, särski_t bland dem SOIR spira upp om vgrgn, finoss _et hera, som allt efter väderlekgné växlingar antaga ett helt o_ika växt- sätt: vid kallt väder, när temper&turen närlna_ sig nollpunkten, trncka de sig tätt in till marken_ som ville de invid jorde_s moderssköte söka skydd mot döl_en, vid ,varmarg väderlek räta de däremot upp .sig och v_xa cakt i höjdcn mot ljuéet_ Exempel _.pa sadana värmek_nsliga växter _r bland andra den vanliga rödplistern, vars stjälkar pa varsidan allt efter tempera_urens gang intaga et_ hor¡sonta_t, ;snett, e_ler vertikalt läge, och son_ därrör med fog ._kan betecknas som en levande termQmet_r. vara van_iga varsippobT _ virsippán, gulsipp&n _och blasippan _ utmä_ka sig ocksg för en vä) ut- 'vecklad värmekäns_ighet: vid kallt väder sluta sig blombladen samman och blomskadtet böjer sig nedat _ mot jorden, rnen när solen _yser ach de varma
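In the sample above, underscores appear where a letter was apparently not recognized. Assuming that reading is right, counting them gives a crude figure for comparing the output of different engines on the same page. The function name count_unrecognized is made up for this sketch, and the count says nothing about letters that were recognized wrongly.

```bash
#!/bin/bash
# Hypothetical helper: print the number of "_" characters in a text file,
# used here as a rough proxy for unrecognized characters.
count_unrecognized() {
    local n
    # Delete everything except underscores, then count the bytes left.
    n=$(tr -cd '_' < "$1" | wc -c)
    # Arithmetic expansion strips the padding some wc versions add.
    echo $((n))
}
```

A loop such as `for t in 0152.*.txt; do echo "$(count_unrecognized "$t") $t"; done` would list the count next to each output file.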
gocr output
References
- ↑ http://jocr.sourceforge.net/api/ (December 24, 2006): libgocr is dead