16 March 2017: Tested a Trial Copy of ABBYY FineReader

From Noisebridge

Experiments

  • To better judge what is possible with OCR, we are sampling both proprietary and open-source software
  • We installed ABBYY FineReader 12.1.x on the dorkroom Mac mini
  • Asked it to convert the same images used in the previous Tesseract experiment
  • It produced a PDF (File:Test001.pdf) containing only the first three images, a limit of the trial version. Observations:
    • Positive
      • Pages were automatically oriented for English LRTB
      • Pages were automatically straightened
      • It produced indexed, searchable PDF
      • It indexes scientific terminology, including possibly unrecognized words - a bonus for scientific literature
      • It recognized images, tables and diagrams, and paginated them to match original layout in the resulting PDF
    • Neutral
      • The resulting PDF contains not only text, diagrams and images but also the entire original scanner image
    • Negative
      • Pages were not automatically cropped to the page area, so the scanner platen occupies most of the image
      • The one page FineReader did crop (page 3) was cropped incorrectly, removing all text content
      • FineReader is not fast.
      • FineReader complained about image resolution: EXIF records 72 dpi, but it wanted at least 300 dpi. This could be handled by adjusting settings during capture, or by rewriting the EXIF data in post-processing.
  • FineReader vs. Tesseract
    • FineReader approaches 100% accuracy, though we haven't compared all words yet. Tesseract appears to be in the 70-80% range
    • FineReader groks a page's structure and reproduces it in the resulting PDF. Tesseract doesn't.
    • FineReader includes the background image, which in the previous test is fairly low contrast. It may be distracting for some. Contrast could be bumped up during capture.
    • Tesseract goes straight to text and does not include a background image
    • Tesseract has a library which can be invoked from a scripted scan-OCR solution
    • ABBYY offers to license the FineReader library to developers, though the price is unknown at this time.
    • FineReader is hands-off, for what it does. After it receives images, it works without prompting or interruption.
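
The accuracy figures above are estimates, since not all words have been compared yet. One way to put a number on them is a word-level comparison between a hand-typed reference transcription and each engine's text output. The following is a minimal sketch using only Python's standard library; the function names `words` and `word_accuracy` are hypothetical, the sample strings are invented for illustration, and this is not how the figures above were obtained.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that the OCR output reproduces, in order.

    Assumption: `reference` is a trusted, hand-checked transcription.
    Returns 0.0 for an empty reference.
    """
    ref = words(reference)
    hyp = words(ocr_output)
    if not ref:
        return 0.0
    # autojunk=False keeps frequent words like "the" from being ignored
    # when comparing long pages.
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


if __name__ == "__main__":
    # Invented sample text, not taken from the scanned pages.
    reference = "The quick brown fox jumps over the lazy dog"
    ocr_output = "The quick brovvn fox jumps ovcr the lazy dog"
    print(f"word accuracy: {word_accuracy(reference, ocr_output):.0%}")
```

Running the same reference page against both the FineReader and Tesseract outputs would give directly comparable percentages. A stricter measure would also penalise extra words the engine inserts, which this sketch ignores.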

Summary

  • FineReader looks to be a very good product, and is worthy of consideration
  • The price for a single station starts at $120
  • Reproducing FineReader's results with open-source tools might take substantial developer effort
  • Their OCR library might also be obtainable at a reasonable cost, given that Noisebridge is a non-profit
  • Cropping could be handled in a script prior to FineReader ingestion
  • Whether to use proprietary software or keep to DIY/open-source boils down to convenience: time to a usable product. With a primary objective of scanning a huge backlog of books, getting high-quality PDFs (or other digital formats) very soon is attractive.
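
The cropping step mentioned above could look roughly like the following. This is a sketch only, under stated assumptions: the platen is darker than the page, the image has already been decoded into rows of grayscale values (0 = black, 255 = white), and the threshold `background_max` and the `margin` are hypothetical values that would need tuning against real captures. The function names are also hypothetical. A real pipeline would load and save images with an imaging library rather than work on plain lists.

```python
def content_bounds(pixels, background_max=60, margin=2):
    """Return (left, top, right, bottom) around pixels brighter than the platen.

    `right` and `bottom` are exclusive. Returns None if nothing is brighter
    than `background_max`, so the caller can leave that image uncropped.
    """
    if not pixels or not pixels[0]:
        return None
    height, width = len(pixels), len(pixels[0])
    # Rows and columns that contain at least one "page" pixel.
    rows = [y for y, row in enumerate(pixels)
            if any(v > background_max for v in row)]
    if not rows:
        return None
    cols = [x for x in range(width)
            if any(row[x] > background_max for row in pixels)]
    # Pad by `margin` pixels, clamped to the image edges.
    left = max(cols[0] - margin, 0)
    top = max(rows[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, width)
    bottom = min(rows[-1] + 1 + margin, height)
    return left, top, right, bottom


def crop(pixels, box):
    """Cut the (left, top, right, bottom) box out of the pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


if __name__ == "__main__":
    # Tiny synthetic image: dark platen (10) with a bright page (200).
    img = [[10] * 8 for _ in range(6)]
    for y in (2, 3):
        for x in (3, 4, 5):
            img[y][x] = 200
    box = content_bounds(img, margin=0)
    print(box, crop(img, box))
```

Returning None instead of a guessed box is deliberate: given that FineReader's own crop removed all text from page 3, a script that skips cropping when unsure seems safer than one that always crops.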