
How to destructively scan well preserved books?

Asked in Ebooks on September 26, 2021

How do I best destructively scan well preserved books, with the goal of reading them on dedicated e-readers, tablets, computers, etc.?

One Answer

It depends on the number of books: If you have more than several thousand, you should have it done professionally; possibly you could even get a special deal on buying the books already OCR'd and fully processed. In that case, stop reading here.

If you absolutely need true-text ebooks where you can freely adjust the font size and the like, scanning yourself is also not a good idea: OCR is too much of a manual process. For most purposes, though, it is fine to scan the pages as images. Buy the largest e-reader you can so the text doesn't end up too small. 7.8" should be good enough, but some books will appear a little small unless you cut the pages up and rotate the screen so that the lines run parallel to its longer side. On computers, tablets, and the like, zoom functions work well enough that you should have no problems even on an average phone screen.

So if you are still in, there are a few steps:

  1. Separate the book into individual pages.
  2. Scan the pages with a scanner using the ADF (automatic document feeder).
  3. Merge the pages into one PDF per book.
  4. Import into the calibre e-book manager (optional).
  5. Cut the pages into two or three pieces in the whitespace between lines - this lets you use bigger font sizes on e-book readers whose screen aspect ratio doesn't match the book's. Apply other assorted image improvements.
  6. Instead of cutting the pages into a few pieces, you can do OCR, which enables the book to reflow exactly to the font and screen size of your choosing. This is optional and not recommended if the book is still in copyright and you are just doing this to read the book yourself. Otherwise, you should go help the people at Project Gutenberg and the like - they have processes to ensure the best possible quality.

Separate book into individual pages

If a book has a hard cover (one that can't go through the scanner), cut it off with a paper knife. You need two cuts, one at the front and one at the back.

The next step depends on the number of books:

  • Only a few: You can cut the pages apart with a paper knife.
  • More: First separate each book into stacks that fit into a paper guillotine.
    • The sad truth is that for most books, you can just rip them apart. This doesn't apply to bound books where the pages are sewn together with thread, but even hardcover books that might look like they have thread are often just glued. Open the book as wide as possible, more than 180°, to pre-break the spine. Then grip both halves at the top, holding both bundles firmly so you don't pull on individual pages, and slowly pull the halves apart. As long as you pull slowly, the glue tends to give way before the paper, and even if the paper sometimes tears a little, the tears very seldom reach into the printed text.
    • If you have a sewn book, you need to unbind it; don't try to rip it apart. I haven't encountered enough of those to give many tips.
    • Then feed the bundles into a paper guillotine. If you have hundreds of books (or fewer, if your time is valuable), get a paper guillotine with at least 5 mm cutting depth, which will set you back around US$250. That way you can cut apart most books in fewer than 10 steps. If you have several thousand books (or fewer, if your time is valuable), get a paper guillotine with at least 5 cm cutting depth, which will set you back around US$1000. That way you can cut apart most books in one go (just remove hard covers first; those are usually not good for the blade).

Paper guillotine recommendation: Ideal 3005 (for up to hundreds of books), or a higher-end one if it makes sense for you financially.

Scan

  • Make sure the title page or another distinctive page is the first page scanned. There is no harm in scanning such a page twice (once first, a second time where it belongs).
  • Make sure the scanner puts all pages that you scan at once into a single PDF file, so each book ends up as only a few files. Together with the distinctive first page, this makes it easy to see where one book starts and another ends when you look at the sequence of scanned files.
  • Choose the right color settings: Most scanners do a good job going from grayscale to black and white, so for normal black-and-white text, choose black and white. Informal studies have shown that grayscale is no better for OCR than black and white, and text display on e-readers is not improved by selecting grayscale, so there is really no need for it. If there are color or grayscale images (or for title pages and soft covers), you might want to choose color/grayscale for just those pages. If you scan in grayscale or color, the scans will be comparatively huge (as in, bigger than the movie version of the book). If the book is actually not that well preserved and the paper is not sufficiently white, you will need to scan it in color (and post-process it to black and white if you want to read it on e-ink displays; see the sketch at the end of this section).
  • Always use the highest available resolution.
  • Don't mix up bundles of pages. I like to have stacks of separated books on one side of the scanner, and the pages of the currently scanned book that I already removed from the output tray on the other side.
  • When a book is finished, lay it on the floor or a table with the cover up. Once you have it on your computer and have made at least one backup copy (and only then), move the book into an appropriate (recyclable) waste bag or other container.
  • Use a scanner that doesn't need complex interaction to start a scan and which numbers its output files so you get them in the right order.

Scanner recommendation: Brother ADS 2800W, scan to USB stick, set up shortcut buttons for the scanning modes you need.
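
If you do end up with color or grayscale scans of yellowed paper, here is a minimal post-processing sketch using ImageMagick, assuming the pages are already extracted as individual images; the filenames and the 60% threshold are illustrative and need tuning per book:

#!/bin/bash
# Sketch: turn color/grayscale scans into clean black and white.
# Assumes ImageMagick is installed; page-*.png are placeholder filenames.
for f in page-*.png; do
  convert "$f" -colorspace Gray -normalize -threshold 60% "bw-$f"
done

For unevenly lit pages, a local adaptive threshold (ImageMagick's -lat, e.g. -lat 25x25+10%) tends to hold up better than a global -threshold.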

Merge pages into one pdf per book

My usual workflow is:

  1. Go into preview/icon view mode in the file explorer, in the folder where the scans are.
  2. You should be able to read the author and title from the previews.
  3. Create a new folder (in another window) for each book, named something like "author title". There is no need to be complete; this is going to be used as a search term later.
  4. Copy the book's files into the new folder.
  5. Go into the folder containing the nicely named book folders, and run the following script:
#!/bin/bash
set -x

mkdir -p out
# Each subdirectory holds the scanned PDF parts of one book;
# concatenate them into a single PDF named after the directory.
for d in */; do
  if test "$d" = "out/"; then
      continue
  fi
  cd "$d" || continue
  # The glob picks up the parts in scan order; run each merge in the background.
  pdftk *.pdf cat output ../out/"$(basename "$d")".pdf &
  cd ..
done
echo -n "Waiting for pdftk..."
wait
echo " done."
  6. Now you have an "out" folder with a PDF for each book, named like the folders you made.
  7. Import the whole folder into calibre (a command-line sketch follows this list).
  8. Select the imported books, go to "edit metadata" and select "download metadata and covers".
  9. Clean up the metadata and covers.
  10. Back up your calibre database.
  11. Go through the physical books still lying around, verifying each against calibre (first few pages, last few pages, total number of pages), and then throw them away.
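
If you prefer the command line over the GUI import in step 7, calibre ships a calibredb tool that can add files in bulk. A minimal sketch; the library path is just an example:

# Add every merged PDF to the calibre library (library path is an example).
calibredb add --with-library "$HOME/Calibre Library" out/*.pdf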

Cut the pages into suitably sized pieces and other image conversions

I'm still working on this, but I have an automated script for denoising and compressing, and I also describe the individual steps below. A sketch of the page-cutting step follows.
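
As a minimal sketch of the cutting step: ImageMagick's -crop can split each page image into equal pieces. Cutting exactly in the whitespace between lines needs smarter logic, so the fixed halves here are just an illustration (filenames are placeholders):

#!/bin/bash
# Split each page image into a top and a bottom half.
# -crop 100%x50% tiles the image into two pieces; +repage resets the offsets.
for f in page-*.png; do
  convert "$f" -crop 100%x50% +repage "piece-%d-$f"
done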

Remove blank JPEG pages

Blank pages scanned in color (the unprinted insides of soft covers, for example) compress well, but still take up some space and are also just annoying. Remove them using pdftk input.pdf cat 1 3-r3 r1 output output.pdf (this example removes the second page and the second-to-last page), or even with a graphical PDF viewer using print-to-PDF.
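
For reference, pdftk counts ranges written as rN from the end of the document, so r1 is the last page: the command above keeps page 1, pages 3 through third-from-last, and the last page. A sketch for the common case of dropping a single known blank page (filenames are examples):

# Keep everything except page 2.
pdftk input.pdf cat 1 3-end output output.pdf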

Remove noise

For 600 dpi scans, -morphology Close Diamond:1 should remove noise while being pretty conservative about it, i.e. you don't risk removing too much. If you have a higher resolution, you can use Diamond:2 or higher; at a lower resolution, you possibly can't use this method at all. As a second step, -define connected-components:area-threshold=30 -define connected-components:mean-color=true -connected-components 4 removes all blobs of black smaller than 30 pixels, which is also pretty safe as long as you keep the threshold below the number of pixels in a full stop (which is the same for all scans at a given dpi; a value of 30 is good for 600 dpi). A combined command is sketched below.
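
Putting both steps together into one command, a sketch for a single 600 dpi page image (filenames are placeholders):

# Close pinholes, then drop black blobs under 30 pixels (values for 600 dpi).
convert page.png -morphology Close Diamond:1 \
  -define connected-components:area-threshold=30 \
  -define connected-components:mean-color=true \
  -connected-components 4 page-clean.png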

Compress using jbig2

This is the best compression algorithm for the purpose and achieves incredible compression ratios. Get the encoder from GitHub; sadly, you will need to compile it yourself.

It can handle files with noise and still achieve decent compression ratios, but then it takes forever. If you remove some noise first, the output will be much smaller, more pleasant on the eye, and generated in next to no time (compared to the denoising steps suggested above).
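
A sketch of building the encoder, assuming the agl/jbig2enc repository on GitHub and its autotools setup; it depends on leptonica, which must be installed first:

# Build jbig2enc from source (requires leptonica and autotools).
git clone https://github.com/agl/jbig2enc.git
cd jbig2enc
./autogen.sh
./configure && make

Once built, the encoding itself looks like the following (the relative paths match a working directory two levels below the jbig2enc checkout):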

# Encode a folder of page images as symbol-coded JBIG2, then wrap as PDF.
mkdir jbig2
cd jbig2
../../jbig2enc/src/jbig2 -s -p ../input-image-folder/*.png
../../jbig2enc/pdf.py output > out.pdf

Script for denoising and compression

Available on GitHub. It can take an hour per book, sorry. Use it like this, for example:

#!/bin/bash

# Queue compress.sh for every book PDF; sem (from GNU parallel)
# limits the number of concurrent jobs.
for a in <list of author directories>; do
  for f in "${a%/}/"*/*.pdf; do
    sem -j 3 nice -n 19 ../../incoming/compress.sh "$f" denoise
  done
done

This uses a parallelism of 3 and a low scheduling priority of 19, so your computer remains responsive.

Correct answer by Nobody on September 26, 2021
