How to OCR and/or recreate lines of Egyptian Hieroglyphs in Unicode/HTML?

Question

I am wondering how to take these Hieroglyphs and make them into Unicode. I read through the Tesseract docs on how to create training data, but it seems largely tailored toward "traditional" writing, that goes in one direction and has only one character per vertical-space-per-line. Whereas in Egyptian hieroglyphs, it is sort of nested writing, you might have 3 characters stacked vertically, followed by 2 characters stacked vertically, etc.
So I'm wondering what the best option is for doing the full spectrum, of taking one of the hieroglyphic images in that book, like this:

And layout not completely horizontally like this:
? ? ? ????? ??? ??? ? ? ?? ?? ?? ???? ??? ?? ???? ?? ??? ?? ?? ??? ??? ???? ????? ??? ? ??? ??? ??? ?? ?? ? ???? ?? ??? ?? ????? ??? ?? ???? ??? ????? ?????? ? ???? ??? ? ???? ?????? ???? ??????
but instead layout in HTML exactly like they are laid out in the original textbook.
What I would imagine to do is, use OCR (or similar technology?) to identify the individual glyphs. I would manually draw squares around characters and map them to unicode characters. Then I would go and use that as training data somehow. Then it would go, page-by-page, through the document, and layout the text as it sees it in the document (with Egyptian Hieroglyphic quirks and all, everything in the proper place). I would imagine that there would be some tool that could use my manually trained squares and return { x: ..., y: ..., w: ..., h: ..., char: ... } for each glyph it finds. Then I can simply do my think with that data and convert it to HTML with CSS and absolute positioning.
The question is, how do I do that middle step? How do I do everything up to the HTML portion even? Should I be looking at Tesseract, or is there a more general tool like an "image recognition" tool that I could use to take my models (of some kind, x/y/w/h perhaps of boxes) and return my x/y/w/h/char array for each page. How can I do that, at a high level? I can figure out the specific CLI commands to run, or library API methods to call, I am just not sure what the state of the art is here and what type of tool would be best and easiest in this case.

How to OCR and/or recreate lines of Egyptian Hieroglyphs in Unicode/HTML?

Add your own answers!

Ask a Question