16 March 2017: Tested a Trial Copy of ABBYY FineReader
Latest revision as of 22:49, 16 March 2017
Experiments
- To better judge what's possible for OCR, we are sampling both proprietary and open-source software
- We installed ABBYY FineReader 12.1.x onto the dorkroom Mac mini
- We asked it to convert the images from the previous experiment with Tesseract
- It produced a PDF containing only the first three images (File:Test001.pdf), a limit of the trial version, with the following results:
  - Positive
    - Pages were automatically oriented for English (left-to-right, top-to-bottom)
    - Pages were automatically straightened
    - It produced an indexed, searchable PDF
    - It indexes scientific terminology, including words it possibly does not recognize, a bonus for scientific literature
    - It recognized images, tables and diagrams, and laid them out to match the original pages in the resulting PDF
  - Neutral
    - The resulting PDF contains not only the text, diagrams and images but also the entire original scanner image
  - Negative
    - Apart from page 3 (see the next item), none of the pages was automatically cropped, so the scanner platen occupies most of the image
    - FineReader did crop one page (page 3), but incorrectly, removing all text content
    - FineReader is not fast.
    - FineReader complained about image resolution, which is recorded in EXIF as 72dpi, while it wanted at least 300dpi. This could be handled by fine-tuning during capture, or by rewriting EXIF in post.
- FineReader vs. Tesseract
  - FineReader approaches 100% accuracy, though we haven't compared all words yet. Tesseract appears to be in the 70-80% range.
  - FineReader groks a page's structure and reproduces it in the resulting PDF. Tesseract doesn't.
  - FineReader includes the background image, which in the previous test is fairly low contrast and may be distracting for some. Contrast could be bumped up during capture.
  - Tesseract goes straight to text and does not include a background image.
  - Tesseract has a library which can be invoked from a scripted scan-OCR solution.
  - ABBYY offers to license the FineReader library to developers, though the price is unknown at this time.
  - FineReader is hands-off for what it does: once it receives images, it works without prompts or interruptions.
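Since the accuracy figures above are estimates and not all words have been compared yet, a small script could make the comparison repeatable. The following is only a sketch, assuming a hand-corrected reference transcription of a page is available as plain text; the function name `word_accuracy` and the sample strings are made up for illustration, and this is not how the numbers above were obtained.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words that the OCR output reproduces, in order.

    Words are split on whitespace and compared exactly, so punctuation and
    case differences count as errors (an assumption of this sketch).
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        # Nothing to recognise: perfect only if the OCR also produced nothing.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the scanned pages.
    truth = "the quick brown fox jumps"
    guess = "the quick brovvn fox jumps"
    print(f"{word_accuracy(truth, guess):.0%}")
```

Running the same function over the text exported by FineReader and by Tesseract for the same page would give two comparable percentages.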
Summary-
- FineReader looks to be a very good product, and is worthy of consideration
- The price for a single station starts at $120
- Getting the same results as FineReader from open-source tools might take a lot of developer effort
- Their OCR library might also be obtainable at a reasonable cost, given that Noisebridge is a non-profit
- Cropping could be handled in a script prior to FineReader ingestion
- The question of whether to use proprietary software or keep to DIY/open-source boils down to convenience: time to a usable product. With a primary objective of scanning a huge backlog of books, getting high-quality PDFs (or other digital formats) from books very soon is attractive.
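The pre-ingestion cropping step mentioned above could look roughly like the sketch below. It assumes the scan has already been decoded into rows of grayscale values (0-255) and that the page is brighter than the surrounding platen; both are assumptions, as are the names `page_bbox` and `crop` and the threshold of 128. A real script would load and save the images with an imaging library, which is left out here.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def page_bbox(gray: List[List[int]], threshold: int = 128) -> Optional[Box]:
    """Bounding box of all pixels brighter than `threshold`, or None if none."""
    rows = [y for y, row in enumerate(gray) if any(v > threshold for v in row)]
    if not rows:
        return None
    width = len(gray[0])
    cols = [x for x in range(width) if any(row[x] > threshold for row in gray)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(gray: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]


if __name__ == "__main__":
    # Tiny hypothetical scan: dark platen (0) around a bright 2x2 "page".
    scan = [
        [0, 0, 0, 0],
        [0, 255, 250, 0],
        [0, 240, 255, 0],
        [0, 0, 0, 0],
    ]
    box = page_bbox(scan)
    if box is not None:
        print(box, crop(scan, box))
```

A plain threshold is the simplest possible choice; uneven lighting or a light-coloured platen would need something more robust.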