16 March 2017: Tested a Trial Copy of ABBYY FineReader
=== Experiments ===
* To better judge what is possible for OCR, we are sampling both proprietary and open-source software
* We installed ABBYY FineReader 12.1.x on the dorkroom Mac mini
* We asked it to convert the images from the [[4_March_2017:_A_session_with_Tesseract|previous]] experiment with Tesseract
* It produced a PDF containing only the first three images ([[File:Test001.pdf|Test001.pdf]]), which is a limit of the trial version, with the following results:
** Positive
*** Pages were automatically oriented for English, left-to-right and top-to-bottom (LRTB)
*** Pages were automatically straightened
*** It produced an indexed, searchable PDF
*** It indexes scientific terminology, including words it possibly does not recognize, which is a bonus for scientific literature
*** It recognized images, tables and diagrams, and laid them out to match the original page in the resulting PDF
** Neutral
*** The resulting PDF contains not only the text, diagrams and images, but also the entire original scanner image
** Negative
*** Pages were not automatically cropped, so the scanner platen occupies most of the image
*** The one exception, page 3, was cropped by FineReader but incorrectly, removing all text content
*** FineReader is not fast
*** FineReader complained about image resolution, which is recorded in EXIF as 72dpi; it wanted at least 300dpi. This could be handled by fine-tuning during capture, or by rewriting the EXIF data in post-processing
* FineReader vs. Tesseract
** FineReader approaches 100% accuracy, though we haven't compared all words yet. Tesseract appears to be in the 70-80% range
** FineReader groks a page's structure and reproduces it in the resulting PDF. Tesseract doesn't
** FineReader includes the background image, which in the previous test is fairly low contrast and may be distracting for some readers. Contrast could be bumped up during capture
** Tesseract goes straight to text and does not include a background image
** Tesseract has a library which can be invoked from a scripted scan-OCR solution
** ABBYY offers to license the FineReader library to developers, though the price is unknown at this time
** FineReader is hands-off, for what it does. Once it receives the images, it works without prompting or interruption

=== Summary ===
* FineReader looks to be a very good product and is worthy of consideration
* The price for a single station starts at $120
* Getting the same results as FineReader with open-source tools might take a good deal of developer effort
* Their OCR library might also be available at a reasonable cost, given that Noisebridge is a non-profit
* Cropping could be handled in a script prior to FineReader ingestion
* The question of whether to use proprietary software or keep to DIY/open-source boils down to convenience: time to a usable product. With a primary objective of scanning a huge backlog of books, getting high-quality PDFs (or other digital formats) from books very soon is attractive
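The accuracy figures in the FineReader vs. Tesseract comparison are eyeball estimates, since not all words have been compared yet. One way to put a number on them is sketched below in Python, using only the standard library. It assumes a hand-corrected transcription of each page is available as plain text, which these notes do not have yet, and that the OCR output has been exported as plain text too. The names <code>word_accuracy</code> and <code>tokenize</code> are hypothetical, and the score (the fraction of reference words that the OCR output reproduces in order) is just one reasonable definition of accuracy, not the one either product reports.

```python
import difflib
import re


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation and layout."""
    return re.findall(r"\w+", text.lower())


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that appear, in order, in the OCR output.

    `reference` is assumed to be a hand-corrected transcription of the page;
    `ocr_output` is the plain text produced by the OCR engine under test.
    """
    ref_words = tokenize(reference)
    ocr_words = tokenize(ocr_output)
    if not ref_words:
        return 0.0
    # SequenceMatcher aligns the two word lists; autojunk is disabled so that
    # frequent words such as "the" are not discarded on long pages.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    truth = "The quick brown fox jumps over the lazy dog"
    ocr = "The quick brovvn fox jumps over the lazy dog"
    print(f"word accuracy: {word_accuracy(truth, ocr):.0%}")
```

Running the same function over the text from each engine for the same page would give directly comparable numbers. Because matching is done on whole words, a single misread character counts as a missed word, so the score is stricter than a character-level comparison.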