Simple EPUB ebooks with Python

The EPUB standard is based on several other, well-established technologies. This makes it easy to generate simple ebooks in this format with the standard libraries of many modern languages. Here is one way to do it with Python. The zipfile module has been part of the standard library since Python 1.6, so the minimal set of features described here needs no external library.

Let’s assume we already have a bunch of HTML files that we want to bundle into an ebook. The EPUB file format is basically a ZIP file with some metadata in it. We have to create that ZIP container, write the metadata and put the HTML files in it. That’s (almost) all.

(It’s all we have to do, because luckily the EPUB payload itself is defined as a subset of HTML and CSS. “Subset” means, generally speaking, that elements for HTML forms like <input> are usually not supported.)

Here is some Python code that does just that (you might want to look up Python’s zipfile module):

import os.path
import zipfile

epub = zipfile.ZipFile('my_ebook.epub', 'w')

# The first file must be named "mimetype" and must be stored
# uncompressed (ZipFile's default, ZIP_STORED, takes care of that)
epub.writestr("mimetype", "application/epub+zip")

# The filenames of the HTML are listed in html_files
html_files = ['foo.html', 'bar.html']

# We need an index file that lists all other HTML files.
# This index file itself is referenced in the META-INF/container.xml
# file
epub.writestr("META-INF/container.xml", '''<container version="1.0"
           xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/Content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>''')

# The index file is another XML file, living by convention
# in OEBPS/Content.opf
index_tpl = '''<package version="2.0"
  xmlns="http://www.idpf.org/2007/opf">
  <metadata/>
  <manifest>
    %(manifest)s
  </manifest>
  <spine toc="ncx">
    %(spine)s
  </spine>
</package>'''

manifest = ""
spine = ""

# Write each HTML file to the ebook, collect information for the index
for i, html in enumerate(html_files):
    basename = os.path.basename(html)
    manifest += '<item id="file_%s" href="%s" media-type="application/xhtml+xml"/>' % (
                  i+1, basename)
    spine += '<itemref idref="file_%s" />' % (i+1)
    epub.write(html, 'OEBPS/'+basename)

# Finally, write the index and close the archive, which writes
# the ZIP central directory
epub.writestr('OEBPS/Content.opf', index_tpl % {
  'manifest': manifest,
  'spine': spine,
})
epub.close()
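
If you want to check the result without opening an ebook reader, a few lines are enough to look at the container rules mentioned above. This is only a minimal sketch: check_epub_container is a name I made up for illustration, and it tests just the mimetype entry and the presence of META-INF/container.xml, nothing a real EPUB validator would stop at.

```python
import io
import zipfile

def check_epub_container(fileobj):
    """Return a list of problems found in the ZIP container (empty if none)."""
    problems = []
    with zipfile.ZipFile(fileobj) as epub:
        infos = epub.infolist()
        if not infos or infos[0].filename != "mimetype":
            problems.append("first entry is not 'mimetype'")
        elif infos[0].compress_type != zipfile.ZIP_STORED:
            problems.append("'mimetype' is compressed")
        elif epub.read("mimetype") != b"application/epub+zip":
            problems.append("unexpected mimetype content")
        if "META-INF/container.xml" not in epub.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems

# Build a tiny container in memory and check it
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml", "<container/>")
buf.seek(0)
result = check_epub_container(buf)
```

The function also accepts a filename such as 'my_ebook.epub' instead of a file object, because zipfile.ZipFile takes both.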

You see, it’s quite simple to generate a working EPUB ebook. Of course, this basic script leaves out some elements that are usually good style to include, e.g. metadata about the author, date and so on, an extended index, or a cover file. It also makes some assumptions that do not hold in the general case:

It uses only the basename of each HTML file, so complications are to be expected if the files are nested in folders.
For the same reason, links between chapters might not work. Absolute links in the HTML files will not work inside the EPUB file either.
The same problem arises with embedded content such as images or stylesheets. Full URLs might work, depending on the reader, but absolute paths are most probably wrong.
We also assumed that the files contain only the subset of HTML that is valid in EPUB books.
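
One possible way around the first assumption is to keep each file's position relative to a common root folder instead of flattening everything to its basename. The following is a sketch only: archive_names and the root parameter are hypothetical names, and it assumes that all source files live below one folder.

```python
import os.path

def archive_names(paths, root):
    """Map each source path to an archive name below OEBPS/ that
    keeps the file's position relative to root."""
    names = {}
    for path in paths:
        rel = os.path.relpath(path, root)
        if rel == os.pardir or rel.startswith(os.pardir + os.sep):
            raise ValueError("%s is outside of %s" % (path, root))
        # Names inside a ZIP archive always use forward slashes
        names[path] = "OEBPS/" + rel.replace(os.sep, "/")
    return names

names = archive_names(["book/intro.html", "book/part1/ch1.html"], "book")
```

With names like these, relative links between chapters keep pointing at the right place, as long as every linked file ends up in the archive too.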

To handle some of these linking issues, we can, for example, use BeautifulSoup to manipulate the HTML before we store it:

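# Beautiful Soup 4; version 3 used "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

images = {}  # archive name -> path of the image on disk

# This loop replaces the one above; the index is still written afterwards
for i, html in enumerate(html_files):
    basename = os.path.basename(html)
    manifest += '<item id="file_%s" href="%s" media-type="application/xhtml+xml"/>' % (
                  i+1, basename)
    spine += '<itemref idref="file_%s" />' % (i+1)
    with open(html) as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    for img in soup.find_all('img'):
        src = img.get('src', '')
        if src and not src.startswith(('http://', 'https://')):
            # Same problem again: We flatten layers, so this won't work
            # properly in the wild
            name = os.path.basename(src)
            # Relative image paths are resolved against the HTML file's folder
            images[name] = os.path.join(os.path.dirname(html), src)
            img['src'] = name
    # likewise loop through links, objects, ..., then store the
    # manipulated HTML (archive name first, then the data)
    epub.writestr('OEBPS/'+basename, str(soup))

# Now we have to embed the found images as well:
for name, path in images.items():
    epub.write(path, 'OEBPS/'+name)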
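
Embedded images should also be listed in the manifest, each with its own media type. Here is a rough sketch of how those entries could be generated with the standard mimetypes module. The function name image_manifest_items and the img_ id prefix are my own inventions, and files with an unknown type are simply skipped.

```python
import mimetypes
from xml.sax.saxutils import quoteattr

def image_manifest_items(image_names):
    """Build one manifest <item> per image, guessing the media type
    from the file extension."""
    items = []
    for i, name in enumerate(image_names):
        media_type, _encoding = mimetypes.guess_type(name)
        if media_type is None:
            continue  # unknown extension: left out in this sketch
        # quoteattr() adds the surrounding quotes and escapes the value
        items.append('<item id="img_%d" href=%s media-type=%s/>' % (
            i + 1, quoteattr(name), quoteattr(media_type)))
    return items

items = image_manifest_items(["cover.png", "photo.jpg"])
```

The resulting strings can be appended to the manifest variable before the index file is written.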

That’s it so far. The code above is meant only to get you started. In the real world, the HTML files will have headers and footers that need to be removed. JavaScript does not work in most EPUB readers (iBooks being an exception). But I hope you got the basic idea of how simple EPUB creation in Python can be.