16 March 2017: Tested a Trial Copy of ABBYY FineReader

Latest revision as of 22:49, 16 March 2017

Experiments

In order to better judge what's possible for OCR, we are sampling both proprietary and open-source software
We installed ABBYY FineReader 12.1.x on the dorkroom Mac mini
We asked it to convert the images from the previous experiment with Tesseract
It produced a PDF containing the first three images (File:Test001.pdf), which is the limit of the trial version, with the following results:
- Positive
  - Pages were automatically oriented for English LRTB
  - Pages were automatically straightened
  - It produced an indexed, searchable PDF
  - It indexes scientific terminology, including possibly unrecognized words - a bonus for scientific literature
  - It recognized images, tables and diagrams, and paginated them to match original layout in the resulting PDF
- Neutral
  - The resulting PDF contains not only text, diagrams and images but also the entire original scanner image
- Negative
  - Pages were not reliably auto-cropped; where no crop was applied, the scanner platen occupies most of the image
  - The one page FineReader did crop (page 3) was cropped incorrectly, removing all text content
  - FineReader is not fast.
  - FineReader complained about image resolution: the EXIF records 72 dpi, but it wanted at least 300 dpi. This could be handled by adjusting settings during capture, or by rewriting the EXIF in post-processing.
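Before rewriting the EXIF value, it may be worth checking whether the capture really carries 300 dpi worth of pixels, since changing the metadata does not add detail. The sketch below is only a sanity check under stated assumptions: the function names are made up for illustration, and the page width and pixel counts are hypothetical figures, not measurements from this session.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Pixels actually captured per inch of physical page."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def metadata_fix_is_enough(pixels: int, inches: float,
                           required_dpi: float = 300.0) -> bool:
    """True if the capture already has enough pixels, so only the
    recorded dpi value is wrong. False means the capture itself
    needs more resolution (hypothetical helper, not a FineReader rule)."""
    return effective_dpi(pixels, inches) >= required_dpi


# Hypothetical capture: an 8.5 in wide page spanning 2550 px of the frame.
print(effective_dpi(2550, 8.5))           # 300.0
print(metadata_fix_is_enough(2550, 8.5))  # True
print(metadata_fix_is_enough(1224, 8.5))  # False
```

If the check fails, rewriting the EXIF would only silence the warning; the fix would have to happen at capture time.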
FineReader vs. Tesseract
- FineReader approaches 100% accuracy, though we haven't compared all words yet. Tesseract appears to be in the 70-80% range
- FineReader groks a page's structure and reproduces it in the resulting PDF. Tesseract doesn't.
- FineReader includes the background image, which in the previous test is fairly low contrast. It may be distracting for some. Contrast could be bumped up during capture.
- Tesseract goes straight to text and does not include a background image
- Tesseract has a library which can be invoked from a scripted scan-OCR solution
- ABBYY offers to license the library for FineReader to developers, though the price is unknown at this time.
- FineReader is hands-off, for what it does. After it receives images, it works without prompt or interrupt.
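The accuracy figures above are estimates, since not all words have been compared. One way to put a number on them is sketched below, assuming a hand-corrected reference transcription exists for each page. The function name and the scoring rule (share of reference words that survive in order in the OCR output) are illustrative choices, not a standard either product uses.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words that appear, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        # Nothing to recognize: perfect only if the OCR also found nothing.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# One misread word out of four.
print(word_accuracy("the quick brown fox", "the quick brawn fox"))  # 0.75
```

Running the same reference text against both the FineReader and Tesseract output would give directly comparable scores.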

Summary-

FineReader looks to be a very good product, and is worthy of consideration
The price for a single station starts at $120
Reproducing FineReader's results with open-source tools might take substantial developer effort
Their OCR library might also be available at a reasonable cost, given that Noisebridge is a non-profit
Cropping could be handled in a script prior to FineReader ingestion
The question of whether to use proprietary software or keep to DIY/open-source boils down to convenience: time to a usable product. With a primary objective of scanning a huge backlog of books, getting high-quality PDFs (or other digital formats) very soon is attractive.
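The pre-ingestion cropping step mentioned above could look roughly like the following. This is a minimal sketch, assuming the page is noticeably brighter than the surrounding scanner platen and that the image has already been loaded as rows of grayscale values (0 to 255). The function names and the threshold of 128 are illustrative and would need tuning against real captures.

```python
def page_bounding_box(gray, threshold=128):
    """Return (left, top, right, bottom) around pixels brighter than
    threshold, or None if no pixel qualifies. Right/bottom are exclusive."""
    rows = [y for y, row in enumerate(gray) if any(v > threshold for v in row)]
    if not rows:
        return None
    width = len(gray[0])
    cols = [x for x in range(width) if any(row[x] > threshold for row in gray)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(gray, box):
    """Cut the bounding box out of the grayscale rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]


# Tiny synthetic frame: dark platen (10) around a bright page region.
frame = [
    [10, 10, 10, 10, 10, 10],
    [10, 10, 200, 220, 210, 10],
    [10, 10, 230, 240, 235, 10],
    [10, 10, 10, 10, 10, 10],
]
box = page_bounding_box(frame)
print(box)               # (2, 1, 5, 3)
print(crop(frame, box))  # [[200, 220, 210], [230, 240, 235]]
```

A single bright speck on the platen would stretch the box, so a real script would likely need some noise filtering before this step.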

