30 May 2017: Test a copy of PDFScanner

From Noisebridge

Experiments

  • PDF resulting from this testing session is here: pdfscanner_test_MQ.pdf
  • We installed PDFScanner onto the newly rebuilt BookScanner Mac Mini
  • Asked it to convert the images from the previous experiment with Tesseract
  • PDFScanner yielded a .pdf file containing the fourteen pages from that session
    • Positive
      • It produced an indexed, searchable PDF
      • OCR accuracy was fairly good, though not 100%, recognizing most English words and correctly spelling many scientific terms
      • It recognizes tables and non-standard indents
      • Appears to safely ignore images and diagrams, preserving page layout around images
      • Tables and such were paginated to match original layout in the resulting PDF, lining up exactly with original scanned images
      • It indexes scientific terminology, and unrecognizable (non-dictionary) terms - within the limits of its OCR abilities
    • Neutral
      • The resulting PDF pages contain not just text, diagrams and images, but also the original scanner images - adds to file size, but retains the book's original look
    • Negative
      • The extracted text is not an exact reproduction of the source
      • Spacing errors: spurious spaces inserted between letters within a single word, and spaces lost between words and between words and punctuation
      • Pages were not automatically oriented for English LRTB
      • Pages were not automatically straightened
      • None of the pages was automatically cropped (the reviewer cropped all images beforehand)
      • PDFScanner is faster than ABBYY FineReader
  • PDFScanner vs FineReader
    • FineReader approaches 100% accuracy
    • PDFScanner appears to reach into the 80-90% accuracy range, but since page structure and background are retained, this would only impact copying text as Unicode/ASCII to another program, and indexing by search engines
    • Both PDFScanner and FineReader grok a page's structure and reproduce it in the resulting PDF.
    • Both PDFScanner and FineReader include the background image
    • As with the FineReader test, the image is fairly low contrast, and again may be distracting for some. Contrast could be bumped up during capture.
    • PDFScanner is about US$16 and is based on FOSS libraries, including Tesseract. ABBYY's library licensing terms are unknown
    • PDFScanner requires manual manipulation of image files, but provides built-in tools for rotation and cropping.
    • FineReader is hands-off, for what it does. After it receives images, it works without prompts or interruptions.
    • PDFScanner doesn't provide contrast or saturation filters - these would have to be added to the post-capture pipeline to help increase OCR accuracy (a low contrast scan image adversely affects Tesseract's results)
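
The contrast boost mentioned above could be added as a small post-capture step. The following is a minimal sketch of a linear contrast stretch, assuming the scan has already been loaded as rows of 8-bit grayscale values; the function name `stretch_contrast` and its percentile parameters are hypothetical, and in practice an imaging library would handle file loading and saving. It has not been tested against PDFScanner or Tesseract.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly stretch 8-bit grayscale values.

    `pixels` is a list of rows, each a list of ints in 0..255.
    Values at or below the `low_pct` percentile become 0, values at or
    above the `high_pct` percentile become 255, and everything in
    between is scaled linearly. (Hypothetical helper, for illustration.)
    """
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return [list(row) for row in pixels]

    # Pick the cut-off values at the requested percentiles.
    lo = flat[int((len(flat) - 1) * low_pct / 100)]
    hi = flat[int((len(flat) - 1) * high_pct / 100)]
    if hi <= lo:
        # Uniform image: nothing to stretch, return an unchanged copy.
        return [list(row) for row in pixels]

    scale = 255.0 / (hi - lo)
    return [
        [min(255, max(0, round((v - lo) * scale))) for v in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # A low-contrast 2x2 "scan": all values bunched between 100 and 130.
    print(stretch_contrast([[100, 110], [120, 130]], low_pct=0, high_pct=100))
```

Clipping a small percentage at each end (the defaults here) keeps a few stray dark or bright pixels from limiting the stretch; whether this actually improves OCR results on these scans would need to be checked in a follow-up test.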

Summary

  • PDFScanner is faster than FineReader. It looks to be a very good product: not as accurate as FineReader, but worthwhile considering the price and results
  • The price for a single station is US$16
  • Orientation/rotation and color management should be handled prior to PDFScanner
  • A user should be able to run a script to capture images from cameras, have them be automatically oriented and their color processed, then output to a "ready-for-PDFScanner" folder
  • If all goes well, that folder can be dragged and dropped onto the PDFScanner app icon, and be automatically OCRed
  • The user will then be required to press Cmd-S to export the result as a PDF
  • This is a reasonably usable and useful product. It's a contender w/r/t ABBYY FineReader. Either would do well.
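
The "ready-for-PDFScanner" folder step described above might be sketched as follows. This is only an outline under stated assumptions: the function name `stage_for_pdfscanner`, the `page_0001` naming scheme, and the idea of passing orientation and color fixes as `steps` callables are all hypothetical, and it assumes captured filenames already sort in page order. Camera capture itself and the drag-and-drop onto PDFScanner are not covered.

```python
import shutil
from pathlib import Path

# File types treated as page images (an assumption; adjust to the cameras' output).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def stage_for_pdfscanner(capture_dir, ready_dir, steps=()):
    """Copy captured images into `ready_dir`, numbered in page order.

    Each callable in `steps` receives the Path of the staged copy and is
    expected to modify that file in place (for example a rotation or a
    contrast fix). The original captures are left untouched.
    Returns the list of staged paths.
    """
    capture_dir, ready_dir = Path(capture_dir), Path(ready_dir)
    ready_dir.mkdir(parents=True, exist_ok=True)

    # Assumes filenames sort in the order the pages were captured.
    images = sorted(
        p for p in capture_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

    staged = []
    for index, src in enumerate(images, start=1):
        dest = ready_dir / f"page_{index:04d}{src.suffix.lower()}"
        shutil.copy2(src, dest)
        for step in steps:
            step(dest)  # each step edits the staged copy in place
        staged.append(dest)
    return staged
```

A call such as `stage_for_pdfscanner("captures", "ready-for-PDFScanner", steps=[fix_orientation, boost_contrast])` would then leave a folder ready to drop onto the app, where `fix_orientation` and `boost_contrast` stand in for whatever image tools end up in the pipeline.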