30 May 2017: Test a copy of PDFScanner
Experiments
- PDF resulting from this testing session is here: pdfscanner_test_MQ.pdf
- We installed PDFScanner onto the newly rebuilt BookScanner Mac Mini
- Asked it to convert the images from the previous experiment with Tesseract
- PDFScanner yielded a .pdf file containing the fourteen pages from that session
- Positive
- It produced an indexed, searchable PDF
- OCR accuracy was fairly good, though not 100%, recognizing most English words and correctly spelling many scientific terms
- It recognizes tables and non-standard indents
- Appears to safely ignore images and diagrams, preserving page layout around images
- Tables and such were paginated to match original layout in the resulting PDF, lining up exactly with original scanned images
- It indexes scientific terminology, and unrecognizable (non-dictionary) terms - within the limits of its OCR abilities
- Neutral
- The resulting PDF pages contain not just text, diagrams and images, but also the original scanner images - adds to file size, but retains the book's original look
- Negative
- Text is not accurately reproduced
- Spacing errors: extra spaces inserted between letters within a single word, and spaces lost between words and between words and punctuation
- Pages were not automatically oriented for English LRTB
- Pages were not automatically straightened
- None of the pages was automatically cropped (the reviewer cropped all images beforehand)
- PDFScanner is faster than ABBYY FineReader
- Positive
- PDFScanner vs FineReader
- FineReader approaches 100% accuracy
- PDFScanner appears to reach into the 80-90% accuracy range, but since page structure and background are retained, this would only impact copying text as Unicode/ASCII to another program, and indexing by search engines
- Both PDFScanner and FineReader grok a page's structure and reproduce it in the resulting PDF.
- Both PDFScanner and FineReader include the background image
- As with the FineReader test, the image is fairly low contrast, and again may be distracting for some. Contrast could be bumped up during capture.
- PDFScanner is about US$16 and is based on FOSS libraries, including Tesseract. ABBYY's library licensing terms are unknown
- PDFScanner requires manual manipulation of image files, but provides built-in tools for rotation and cropping.
- FineReader is hands-off, for what it does. After it receives images, it works without prompt or interrupt.
- PDFScanner doesn't provide contrast or saturation filters - these would have to be added to the post-capture pipeline to help increase OCR accuracy (a low contrast scan image adversely affects Tesseract's results)
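The contrast boost mentioned above could take the form of a simple linear stretch applied before the images reach PDFScanner. The following is a minimal sketch in plain Python, not something tested in this session: the function name `stretch_contrast` and the percentile defaults are assumptions made for illustration, and it operates on a flat list of 8-bit grayscale values rather than on an image file. In practice an image library or command-line tool would likely do this step.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values.

    The value at the low percentile maps to 0 and the value at the
    high percentile maps to 255; anything outside that range is clipped.
    (Hypothetical helper; the percentile defaults are illustrative only.)
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    lo = ordered[int(last * low_pct / 100)]
    hi = ordered[int(last * high_pct / 100)]
    if hi <= lo:
        # Flat image: there is no range to stretch, so return it unchanged.
        return list(pixels)
    scale = 255 / (hi - lo)
    return [min(255, max(0, round((p - lo) * scale))) for p in pixels]


if __name__ == "__main__":
    # A washed-out scan line: every value sits in a narrow mid-gray band.
    faded = [100, 110, 120, 130, 140, 150]
    print(stretch_contrast(faded, low_pct=0, high_pct=100))
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps a few stray dark or bright pixels from limiting the stretch.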
Summary
- PDFScanner is faster than FineReader. It looks to be a very good product: not as good as FineReader, but worthwhile considering the price and results
- The price for a single station is US$16
- Orientation/rotation and color management should be handled prior to PDFScanner
- A user should be able to run a script to capture images from cameras, have them be automatically oriented and their color processed, then output to a "ready-for-PDFScanner" folder
- If all goes well, that folder can be dragged and dropped onto the PDFScanner app icon and automatically OCRed
- The user must still press Cmd-S at the end to export the result as PDF
- This is a reasonably usable and useful product. It's a contender w/r/t ABBYY FineReader. Either would do well.
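The "ready-for-PDFScanner" folder step described in the summary could be sketched roughly as follows. This is only an outline under stated assumptions: the function names `stage_for_ocr` and `copy_unchanged`, the `page_0001` naming scheme, and the list of image suffixes are all hypothetical. The default `process` hook just copies each file; orientation and color processing would be plugged in there.

```python
import shutil
from pathlib import Path

# Assumed set of capture formats; adjust to whatever the cameras produce.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def copy_unchanged(src, dst):
    """Placeholder processing step: copy the image as-is.

    A real pipeline would rotate the page and adjust its contrast here.
    """
    shutil.copy2(src, dst)


def stage_for_ocr(capture_dir, ready_dir, process=copy_unchanged):
    """Process captured images into ready_dir, numbered in sorted order."""
    capture_dir, ready_dir = Path(capture_dir), Path(ready_dir)
    ready_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(
        p for p in capture_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    staged = []
    for page_no, src in enumerate(sources, start=1):
        # Zero-padded names keep the pages in order when the folder is opened.
        dst = ready_dir / f"page_{page_no:04d}{src.suffix.lower()}"
        process(src, dst)
        staged.append(dst)
    return staged
```

A call such as `stage_for_ocr("captures", "ready-for-PDFScanner")` would fill the output folder, which could then be dropped onto the PDFScanner icon as described. On macOS, `open -a PDFScanner ready-for-PDFScanner` is the usual command-line counterpart to a drag-and-drop, though whether PDFScanner handles a folder passed that way was not tested here.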