- Installed tesseract-ocr via Homebrew on the Mac mini attached to the book scanner.
- Took a book page image from the scanner (using scan.py, which still works), and ran it through tesseract to see what it would produce.
- We made three attempts:
1. Original uncropped page image (img00001.jpg)
|
Tesseract produced total gibberish (see out0.txt)
|
”H
3L7“ V U! “it“ ]\.13311\ XVIII JMHII ‘I’HLMH ‘~I14|!H'\ .)111 JA) \),)llU[ILIU}
-.1:ni J\[1 annlmny pm‘ \qnwl \llll ”WINK” '~JEHH[<V.) Jill 1mm
- ‘ulfl 116M"
'\U\‘EIHHPIV1)‘. \.[\Il1‘.\*\[l\§ ll“!,[ [[JV) ”:1“ |‘\ ‘ llt‘ltl .)\.>I[l
UN \[ll\!IllTB-J\) )lllfiw-‘HL'UH‘IHI l!'!ll\‘1lll V{1I\\ \lllllfl ) tlv'lll I'MHV PUP
SQUIV {‘(I 'JUERH H'IJ’I 1H \y.)‘IlI\I_ ) )\TI\ 1 'I'IHHXH’I ll‘l )111:113)UP111’
‘HHIHJJHI [«»1111~,».i,1mu \EH'xll‘awII‘ll’1H1|JI’)~I11’E«II*“.1[\I q 11’ [MW
EIIxUIL‘fiIH [I my |"]‘711U‘:)’1'(l\-’l up 3n mwitl [myn‘hn n‘.\ r; way! :1:
m1;»";‘:.' "'l‘C H {r w.’:[
(seyuown (up; uem a):
sawmoa 51 mqemnoaup
mm , 3L
'1}n'7/n;[ [mu ”filmy-'11“ “W" lxrx v‘. 1‘ “
”MN "'y‘g' .‘xn \' )(1 uliuval Hmpx fiqvlvmw.» ,xrx‘ 2'1me 11 ‘Illl Ml
ILHVHIIH! .HI I" H“ KNIT; “1'33 'I>?Il‘.\ '» H»!II!'T\IH -\'\IHIV‘\ J" ‘ ‘IHIIH‘
1“ :UI HI ;{|[ ZUI'IIIW.) In; """‘!“'I!E’ M‘WI'I V * Mt]
"Y”“IE‘W
J1“ 11} “up.” "’1 111:.va valr“;1v‘lv'1 ~ w H 1!“
JH \':)([ A711 VHM'H WWI“ \Unni ,IIZVllfli [w mu] qml
I/uw .1 m; 5‘
'Iruv'1.:{;11| H 5’ III (In .YIHN minim-II .llHlx
pm! ‘lvmttlnm! m: I]! {7'} l m, mulr- illvl‘fln» Hwy] “(I HW'M
l{ .1 III IVII ',[ ZI WWI" “‘"(I 14111.: 1,1 Jillllll h) J’fnl .1!“ I" VlHlmH
-‘l[1 HULL"! H1 vl'mmlm ,H'] ,{luu [HI .n[1 .hEl'l \lll‘IlI ‘mxunmi ”11"“
‘ ['[-/ r/nm‘l‘ 'IJIJHH'H)
ll! [Hill ’ .10” [)0] [WIJVH .11)“, [U [[lUHJI U I‘VXEV‘r—J? VII-Nil WHEIH‘JI <{l
SHUHJIIN 'I MN“ )'I I. (IN S'I\ INC-I I.\'[\' \‘1 1A\ )[YddV
1W
‘ u‘
“*2“ V . fl.
r-—--——~..
m‘
|
2. Cropped and rotated image
|
Produced mix of words and gibberish (out1.txt)
|
Al‘l’AILYI‘lfS, MA'I [CRIAIS AND 'I'IitllNlflAL METHODS
by lrguling llll'lll zxguimt u lt‘llfllll ()f glass or metal rm] f} 0r 7 mm in
(lléllllt‘H'l. (l‘iiglur‘ 1']. 1],.
\\'l1mi pmu'mq plan's misr [1“, lirl uuly llu‘ enough In permit the
lllrlllll] 0] [hr tulwur lmtllv tux-1mm l’mu' zilmut l2 13 ml in (-ach
plulc. Dry plan's slightly HIM'H Figynr [1.2 in an imulmtor. and
\tmr mmlium \i(lt' up in u rvlligriulm‘.
- N ‘
IUun".1/l;/I j
Ilg'm' Ii
Twill/lg (In/Hm: .lli «1m. ‘ltf/Hfun} n/ I’M/Mg" [3'01"
K ‘v. lmu hm ml; (‘ullmn- mwliu. Inulirulzuly mmlia likv D(l;\ 0r
\Vilwu uml lllgm, mu) Hui (nu wlll(’l'£ll)l§' (ind \lmultl lJf' erml in the :
lnllmxiu}: \uxyi
I’m-pun: Mllnl u-ulnld (lilluiuus. loi‘ (‘xzimplny 10" to 107 of
lulnum u]~ \‘guiuus mgzmisms \x'lm’h \\ill gm“ un 01‘ I)? lIll)ll)ltf’d
l))' [111- uuwliuuL l'ru' (*xumplw, \xlu-n Imtiug I)(I;\ ust- Sthmm',
filly/Mm Stu/Ilium)Mm. sm'x-i'ul ()[lH'I \leixmnvlluv and [Z‘M'IIAY/[II‘ L'se ,
10‘5 ,
2 COlOnleS
10'
No COlOnleS
10“
Uncountable 19 colonies
(more than 200 colonies)
l‘ig‘un’ Iii. illiin mill Alum (mm!
zit lam luui' \\rll-(lriwl plan's nl‘tlits {mt medium {m (‘fl('l1 nrganism
zuinl at l(‘£l\‘[ tun plan‘s (‘;1(‘ll(\lk;ll\'11\)\\'ll Sulixllu’tm‘y mumul medium,
and hill gmu‘ml mmlium, ugh .\Iun'(7w»11l\'(‘y or erH) agar. Do Klilt's
and .\li<i';1 (ll‘nl) (mums \\lIll lllt‘ sx‘rial (lilulinus nlilln' organix'ms on
Ilu'sc plzm-s w (11:11 (‘11(‘ll plutv ix ll\\‘(l fur wvrml (lilutiunx‘. ‘Figure
14.3} 'SH' p. llll.)
Count Ilu‘ (*nlnuim, mlml‘m' the rv‘xulm and rmupan' the per-
furmam‘cs‘ of. tho \m‘ium media. 'I'livsv may suggt‘st that in a new
llh
|
3. Cropped, rotated, desaturated and contrasted image.
|
Produced mostly correct text (out2.txt)
|
APPARATUS, MATERIALS AND TECHNICAL METHODS
by leaning them against a length of glass or metal rod 6 or 7 mm in
diameter. (Figure 14.1). -
When pouring plates raise the lid only far enough to permit the
mouth of the tube or bottle to enter. Pour about 12—15 ml in each
plate. Dry plates slightly open (Figure 14.2) in an incubator, and
store medium side up in a refrigerator.
$
Figure 14.2. Drying a plate
Testing Culture Media. ‘Efficienga of Plating’ (EOP)
New batches of culture media, particularly media like DCA or
Wilson and Blair, may vary considerably and should be tested in the
following way.
Prepare serial tenfold dilutions, for example, 10'2 to 10‘7 of
cultures of various organisms which will grow on or be inhibited
by the medium. For example, when testing DCA use Sh.:onnei,
Syphi, SJyphimurium, several other salmonellae and Esch.wli. Use
10-5
2 colonies
19 colonies
(more than 200 colonies)
Figure 14.3. Mile: and Mimi am:
at least four well-dried plates of the test medium for each organism
and at least two plates each of a known satisfactory control medium,
and of a general medium, e.g. MacConkey or Lemco agar. Do Miles
and Misra drop counts with the serial dilutions of the organisms on
these plates so that each plate is used for several dilutions. (Figure
14.3) (See p. 180.) .
Count the colonies, tabulate the results and compare the per-
formances of the various media. These may suggest that in a new
L116‘
|
- The command-line incantation used was:
Warrens-MBP:diybookscanner hdpe$ time tesseract ~/build/samples/img00001.jpg out0 -l eng
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
real 0m15.946s
user 0m14.255s
sys 0m0.340s
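For the pipeline, the same invocation could be wrapped in Python so it can be called from scan.py or a sibling script. This is only a sketch: the function names `tesseract_command` and `run_ocr` are made up for illustration, and it assumes `tesseract` is on the PATH. It uses only the arguments shown in the transcript above (input image, output base name, `-l eng`).

```python
import os
import subprocess
import time


def tesseract_command(image_path, out_base, lang="eng"):
    """Build the argument list equivalent to: tesseract IMAGE OUTBASE -l LANG."""
    # subprocess does not go through a shell, so "~" must be expanded by hand.
    return ["tesseract", os.path.expanduser(image_path), out_base, "-l", lang]


def run_ocr(image_path, out_base, lang="eng"):
    """Run tesseract and return (text file path, elapsed wall-clock seconds)."""
    cmd = tesseract_command(image_path, out_base, lang)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    elapsed = time.perf_counter() - start
    # tesseract appends ".txt" to the output base name.
    return out_base + ".txt", elapsed
```

Calling `run_ocr("~/build/samples/img00001.jpg", "out0")` would correspond to the timed command above, with the elapsed time playing the role of the `real` figure.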
- Lessons from this experiment:
- We'll need to automate image processing in the pipeline, converting page images to high contrast before OCR, since tesseract only functioned reasonably well on the desaturated, contrast-enhanced image.
- The text was shot at a slight angle. We're unsure whether this affected tesseract's accuracy.
- The remaining errors on the third image are partly due to math symbols, non-English terminology, binomial nomenclature, tables of figures, etc.
- Net result: no matter how clean and perfect the input images, tesseract will encounter things it just can't handle. Manual editing may be necessary.
- Trent has some app UI ideas about how to improve the correction workflow.
- I found an Angular 1.x web app project on GitHub which accepts image uploads, passes them through tesseract, and returns the processed text. The backend is a Node.js Express server which invokes tesseract through an npm wrapper library.
The working files and folders installed are under ~/build. Also installed nvm and node 7. The workstation was backed up before and after working on it.
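
As a starting point for the automated preprocessing step, the "desaturate and contrast" stage from the third attempt might look roughly like the sketch below. The notes do not record which tool or settings were used for the manual edit, so everything here is an assumption: the luma weights are the common ITU-R 601 ones, the contrast step is a plain linear stretch, and the image is represented as nested lists of pixels to keep the example dependency-free. A real pipeline would more likely use an imaging library, and cropping and rotation are not covered.

```python
def desaturate(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray levels."""
    # Weighted sum approximating perceived brightness (assumed weights).
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def stretch_contrast(gray_rows):
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    flat = [v for row in gray_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [row[:] for row in gray_rows]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray_rows]


# Tiny 2x2 stand-in for a dull, low-contrast page scan.
page = [[(100, 100, 100), (120, 120, 120)],
        [(200, 200, 200), (100, 100, 100)]]
cleaned = stretch_contrast(desaturate(page))
```

Whether a linear stretch is enough, or whether a hard black/white threshold works better for tesseract, would need to be checked against the sample pages.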