TransWikia.com

Clean up converted ebook

Ebooks Asked by fuxia on August 13, 2021

I have some ebooks which are converted from other formats to ePub. Some are ill-formatted: there are hard line breaks in words and orphaned page numbers between paragraphs.

How can I repair these ebooks? Do I have to open and edit the source files, or is there a better way?

I use Calibre on Windows to organize the ebooks, if that matters, but I am not bound to that program, and I can use Linux too.

5 Answers

Calibre has a feature that allows you to unpack an ePub file into the component parts (usually chapters), which you can then edit. When you have finished your edits, Calibre will repackage them back into an ePub file.

From the main Calibre view, right-click the book listing. You should get a pop-up menu with an "Edit book" option. Selecting it opens a book editing window where you can edit the individual parts of the book.
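
For a quick look at those component parts outside Calibre: an ePub file is a ZIP archive, so Python's standard `zipfile` module can list its contents. This is only a minimal sketch. The function name `list_content_documents` and the file name in the usage comment are hypothetical, and filtering by file extension is a simplification, since the package manifest inside the ePub is what actually declares the parts.

```python
import zipfile


def list_content_documents(epub):
    """Return the names of the (X)HTML files stored inside an ePub.

    `epub` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as archive:
        return [
            name
            for name in archive.namelist()
            # Simplification: treat anything with an HTML-like extension as a chapter.
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]


# Hypothetical usage:
# for name in list_content_documents("converted-book.epub"):
#     print(name)
```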

Correct answer by Donald.McLean on August 13, 2021

Somehow I missed that … thanks to Donald.McLean’s answer, I found it.

In the ePub editor, I hit Ctrl+F, and a search & replace tool showed up at the bottom.

I have used the regex mode and the following patterns:

  • -</p>\n<p[^>]*> replaced with nothing to remove paragraph breaks within words. I hit Replace all here.

  • \s</p>\n<p[^>]*> replaced with a single space to remove paragraph breaks within sentences. Unfortunately, this had to be done match by match, because some paragraphs that end with a space are genuine paragraph breaks and should stay separate.

  • \n<p[^>]*>\d+</p> replaced with nothing to remove orphaned page numbers. Replace all again.
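
For a batch of files, the same substitutions could be scripted outside the editor. The following is a minimal Python sketch, assuming the patterns above are meant with the usual regex escapes (`\n`, `\s`, `\d`). The function name `clean_xhtml`, the `join_sentences` switch, and the sample markup are hypothetical. The sketch removes page numbers first so that a page number sitting between the two halves of a word does not get in the way; that ordering is a choice of this sketch, not something stated in the answer.

```python
import re


def clean_xhtml(text, join_sentences=False):
    """Apply the three search-and-replace patterns to one XHTML string."""
    # Orphaned page numbers: a paragraph that contains only digits.
    text = re.sub(r"\n<p[^>]*>\d+</p>", "", text)
    # Paragraph break inside a word: drop the hyphen and the break.
    text = re.sub(r"-</p>\n<p[^>]*>", "", text)
    if join_sentences:
        # Paragraph break inside a sentence. Off by default, because a
        # trailing space can also precede a genuine paragraph break.
        text = re.sub(r"\s</p>\n<p[^>]*>", " ", text)
    return text


sample = (
    '<p class="a">The conver-</p>\n'
    '<p class="a">sion went wrong.</p>\n'
    '<p class="a">12</p>\n'
    '<p class="a">Next paragraph.</p>'
)
print(clean_xhtml(sample))
```

Note that the hyphen pattern also removes hyphens that belong to a compound word split at that position, so the result is worth a manual check.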

I have also installed, but not yet tested, the plugin Modify ePub by Grant Drake, which offers some automated tasks.

Answered by fuxia on August 13, 2021

I have not found an elegant way to do this yet. However, the inelegant way works:

  1. Select the ePub you want to clean.
  2. Click on Convert.
  3. In the conversion settings, enable the clean-up options you want (especially in the heuristics section, which scans for and fixes such problems) and make sure that the output format is ePub.

Calibre will save the original messy file as "original_epub" and create a second ePub file.
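
The same kind of conversion can also be started from the command line with Calibre's `ebook-convert` tool, which has an `--enable-heuristics` option. Below is a minimal Python sketch, assuming `ebook-convert` is installed and on the PATH. The function names and file names are hypothetical, and the sketch writes to a separate output file instead of relying on any backup copy.

```python
import shutil
import subprocess


def heuristic_convert_command(src, dst):
    """Build the ebook-convert command line for an ePub-to-ePub clean-up run."""
    return ["ebook-convert", src, dst, "--enable-heuristics"]


def convert_with_heuristics(src, dst):
    """Run the conversion; raises if ebook-convert is missing or fails."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError("ebook-convert was not found on the PATH")
    subprocess.run(heuristic_convert_command(src, dst), check=True)


# Hypothetical usage:
# convert_with_heuristics("messy.epub", "cleaned.epub")
```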

Answered by Cthulhu's Son on August 13, 2021

In Calibre, you can experiment with the Heuristic Processing option while converting your files.

Answered by Valery Noname on August 13, 2021

I used "wondershare PDF converter" to convert my PDF to ePub, then I opened the ePub with Calibre and removed the line breaks with these regexes:

<span style="">([^.><]*)</span>\s*</p>\s*<p>

replaced by

<span style="">\1</span>

Then:

¬</span>\s*</p>\s*<p>

replaced by

</span>
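
As a rough illustration of what these two replacements do, here is a minimal Python sketch, assuming the patterns are meant with the usual regex escapes (`\s` for whitespace, `\1` for the captured group). The function name `join_span_paragraphs` and the sample markup are hypothetical.

```python
import re


def join_span_paragraphs(text):
    """Apply the two replacements from this answer, in the same order."""
    # 1. A span containing no '.', '<' or '>' that closes a paragraph:
    #    keep the span, drop the paragraph break after it.
    text = re.sub(
        r'<span style="">([^.><]*)</span>\s*</p>\s*<p>',
        r'<span style="">\1</span>',
        text,
    )
    # 2. A span ending in '¬' that closes a paragraph:
    #    drop the '¬' and the paragraph break.
    text = re.sub(r'¬</span>\s*</p>\s*<p>', '</span>', text)
    return text


sample = (
    '<p><span style="">a line without a full stop </span></p>\n'
    '<p><span style="">and its end.</span></p>'
)
print(join_span_paragraphs(sample))
```

Because the first pattern also matches spans that end in `¬` as long as they contain no full stop, the second pattern only takes effect on spans the first one skipped.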

Answered by Anthony on August 13, 2021
