16 March 2017: Tested a Trial Copy of ABBYY FineReader
Experiments
- To better judge what's possible for OCR, we are sampling both proprietary and open-source software
- We installed ABBYY FineReader 12.1.x on the dorkroom Mac mini
- We asked it to convert the images from the previous experiment, which used Tesseract
- It produced a PDF containing only the first three images (File:Test001.pdf), a limit of the trial version, with the following results:
- Positive
- Pages were automatically oriented for English (left-to-right, top-to-bottom)
- Pages were automatically straightened
- It produced an indexed, searchable PDF
- It indexes scientific terminology, including words it may not recognize, which is a bonus for scientific literature
- It recognized images, tables and diagrams, and placed them to match the original layout in the resulting PDF
- Neutral
- The resulting PDF contains not only the text, diagrams and images but also the entire original scanner image
- Negative
- None of the pages was correctly auto-cropped, so the scanner platen occupies most of the image
- The one page that FineReader did crop (page 3) was cropped incorrectly, removing all text content
- FineReader is not fast.
- FineReader complained about image resolution: EXIF records it as 72 dpi, but FineReader wants at least 300 dpi. This could be handled by adjusting settings during capture, or by rewriting the EXIF data in post-processing.
- FineReader vs. Tesseract
- FineReader approaches 100% accuracy, though we haven't compared all words yet. Tesseract appears to be in the 70-80% range
- FineReader groks a page's structure and reproduces it in the resulting PDF. Tesseract doesn't.
- FineReader includes the background image, which in the previous test is fairly low contrast. It may be distracting for some. Contrast could be bumped up during capture.
- Tesseract goes straight to text and does not include a background image
- Tesseract has a library which can be invoked from a scripted scan-OCR solution
- ABBYY offers to license the FineReader library to developers, though the price is unknown at this time.
- FineReader is hands-off, for what it does. After it receives images, it works without prompting or interruption.
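Since we haven't compared all words yet, the accuracy figures above are estimates. A minimal sketch of how a word-level comparison could be scripted, assuming we have a hand-corrected transcript of a page and the plain text each OCR engine produced for it. The function name `word_accuracy` and the sample strings are hypothetical, and this is only one of several reasonable ways to score OCR output:

```python
import difflib


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words that also appear, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        # An empty reference is only "matched" by an empty OCR result.
        return 1.0 if not ocr_words else 0.0
    # autojunk=False keeps common words from being ignored on long pages.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the scanned pages.
    truth = "the quick brown fox"
    guess = "the quiek brown fox"
    print(f"{word_accuracy(truth, guess):.0%}")
```

Running the same function over the FineReader and Tesseract text for the same pages would give comparable numbers, though it counts a word with a single wrong character as a full miss.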
Summary
- FineReader looks to be a very good product, and is worthy of consideration
- The price for a single station starts at $120
- Getting the same results from open-source tools might take considerable developer effort
- Their OCR library might also be obtainable at a reasonable cost, given that Noisebridge is a non-profit
- Cropping could be handled in a script prior to FineReader ingestion
- The question of whether to use proprietary software or keep to DIY/open-source boils down to convenience: time to a usable product. With a primary objective of scanning a huge backlog of books, getting high-quality PDFs (or other digital formats) very soon is attractive.
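The cropping step mentioned above could be sketched roughly as follows. This assumes each scan has already been loaded as a grayscale grid (a list of rows of 0-255 values), that the page is brighter than the surrounding platen, and that a fixed threshold separates the two. None of this has been checked against our scans, and the names `find_page_box` and `crop` are hypothetical:

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def find_page_box(pixels: List[List[int]], threshold: int = 128) -> Optional[Box]:
    """Bounding box of every pixel brighter than threshold, or None if there are none."""
    rows = [y for y, row in enumerate(pixels) if any(v > threshold for v in row)]
    if not rows:
        return None
    width = len(pixels[0])  # assumes a rectangular grid
    cols = [x for x in range(width) if any(row[x] > threshold for row in pixels)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Return only the part of the grid that lies inside box."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


if __name__ == "__main__":
    # Tiny made-up "scan": dark platen (10) around a bright 2x2 page.
    scan = [
        [10, 10, 10, 10, 10],
        [10, 200, 220, 10, 10],
        [10, 210, 230, 10, 10],
        [10, 10, 10, 10, 10],
    ]
    box = find_page_box(scan)
    if box is not None:
        print(box, crop(scan, box))
```

A real script would read pixels with an imaging library and would likely need to tolerate dust, glare and skew, so a plain threshold is only a starting point.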