dteviot / webtoepub
A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.
License: Other
Currently the plugin downloads all content (chapters and images) before packing it into the EPUB. This works OK at the moment, but may cause problems later when processing items that are very large.
Should re-architect so that each piece of content is packed into the EPUB file on disk as it's downloaded.
Note, this may need to wait until Chrome 52, which implements streams.
From Firefox review of plugin.
Note that old versions of JSZip have known issues which might make extract zip files created with it impossible. Before requesting full review, please upgrade to the latest version.
Also note:
Yeah, I know what you guys are going to say... but I still need to report this one.
Basically I experience this:
Calibre bugtracker has this entry "https://bugs.launchpad.net/calibre/+bug/1293102", but it's tagged as "wontfix"; would it be possible to "fix" it on this side?
I don't know if the solution suggested in that post will impact the /div removing thing.
webtopdf is 0009 on chrome.
Strange thing is I remember it working flawlessly, and I didn't update calibre after that since I just updated it before.
The popup on firefox looks a little goofy compared to chrome. Suggest styling it a bit with some css hacks if need be.
I think it would be nice if the extension could include Series metadata in the generated epub file. It should be quite easy to parse series name from the URL (at least on Baka Tsuki) and the series metadata is pretty useful for ebook library management in Calibre and other such tools. Some e-book reading devices can even make use of the series metadata to automatically assign books into collections (at least the Kobo Auro H2O when managed with Calibre).
Surprisingly I have not been able to find information about what the series metadata should look like in the epub spec (well, I haven't really looked very deeply), but this is how the ebook-meta tool provided by Calibre does it:
ebook-meta foo.epub --series bar_series
It can also be used by just passing the name of an epub file to inspect epub metadata, including series information. In any case, checking what the epub file looks like after being modified by the ebook-meta tool might provide insight into how series metadata works.
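As a rough illustration of the request above: as far as I know, Calibre records series information in the OPF as `meta` elements named `calibre:series` and `calibre:series_index`. The sketch below builds those two elements as strings. The helper names `escapeXmlAttribute` and `makeSeriesMetadata` are hypothetical and are not part of WebToEpub.

```javascript
// Escape the characters that are special inside an XML attribute value.
function escapeXmlAttribute(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Hypothetical helper: build Calibre-style series <meta> elements
// that could be placed inside the <metadata> section of the OPF.
function makeSeriesMetadata(seriesName, seriesIndex) {
    return [
        `<meta name="calibre:series" content="${escapeXmlAttribute(seriesName)}"/>`,
        `<meta name="calibre:series_index" content="${escapeXmlAttribute(seriesIndex)}"/>`
    ];
}
```

Comparing the result against an epub that ebook-meta has modified would confirm whether this matches what Calibre actually writes.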
While I can't really help with extension development, I can certainly help with testing of this if needed. :)
Please add field to enter translator's name.
Split author's name input field into lastname-firstname parts and update the way it's recorded in the opf.
<dc:creator opf:file-as="last, first" opf:role="aut">first last</dc:creator>
As for the translator's name:
<dc:contributor opf:file-as="name" opf:role="trl">name</dc:contributor>
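A minimal sketch of how the two elements above might be generated from separate input fields. The function names and the first/last name split are assumptions for illustration; only the element shapes come from this issue.

```javascript
// Escape text so it is safe in XML element content and attribute values.
function escapeXml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Hypothetical helper: author element with a "last, first" sort key.
function makeCreatorElement(firstName, lastName) {
    const fileAs = escapeXml(`${lastName}, ${firstName}`);
    const display = escapeXml(`${firstName} ${lastName}`);
    return `<dc:creator opf:file-as="${fileAs}" opf:role="aut">${display}</dc:creator>`;
}

// Hypothetical helper: translator element using the "trl" role code.
function makeTranslatorElement(name) {
    const safeName = escapeXml(name);
    return `<dc:contributor opf:file-as="${safeName}" opf:role="trl">${safeName}</dc:contributor>`;
}
```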
Edit: Since there's no response from dteviot yet, I'll add one more metadata issue.
The meta cover tag has content before name.
Current: <meta content="image0000" name="cover"/>
Should be: <meta name="cover" content="image0000"/>
First thing thanks for this extension, for which I'd like to suggest some improvement:
I suppose that if the old code could do it should be possible to do it now too.
I hope I don't sound annoying, it's not my intention, I'm asking since the web thing was really handy and had good results quality wise.
@belldandu, @typhoon71, @toshiya44, @dreamer2908
I've added your nicknames to the credits list in the readme.
If you prefer I use your real names, please e-mail me, including your real names as I only know Belldandu's.
Maybe you already had this idea, but I was thinking: what about getting some well formatted epubs, even hand made, and check them to see how they're actually formatted?
That could help with getting solutions/possibilities on how to build them.
Too wild?
Observed on Firefox:
If target web page is in process of loading, the content script that is injected into the web page does not return.
I suspect it's waiting for whole page to load before it can obtain the pages content, which it returns.
IDs must start with a letter, but in the toc only 4-digit numbers are used.
Please replace them with something like toc_[4digitnumberhere] for maximum compatibility.
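The suggested prefix could be applied with something like the sketch below. `makeTocId` and `isValidXmlId` are hypothetical names, and the validity check is a deliberately conservative ASCII-only approximation of the XML name rules.

```javascript
// Hypothetical helper: turn a numeric index into an id such as "toc_0007".
function makeTocId(index) {
    return "toc_" + String(index).padStart(4, "0");
}

// Conservative ASCII-only check: an id must start with a letter or
// underscore and contain no colon. Real XML names allow more characters.
function isValidXmlId(id) {
    return /^[A-Za-z_][A-Za-z0-9_.-]*$/.test(id);
}
```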
In BakaTsukiParser.js it says,
// discard br tags as epubcheck says they are invalid in the places they are at in xhtml
util.removeElements(util.getElements(element, "br"));
br tags are not invalid, they only need to be closed properly in xhtml, like <br/>
In Baka-Tsuki pages br tags show up as <br>, that's why epubcheck didn't like it.
Can you please replace all <br> by <br/> instead of removing it?
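To illustrate the requested change, here is a string-based sketch that self-closes br tags instead of dropping them. The function name `closeBrTags` is hypothetical; in the extension itself a DOM-based serializer would probably be a sturdier place to do this than a regular expression.

```javascript
// Hypothetical helper: rewrite <br>, <BR>, or <br attr="..."> as a
// self-closed, lower-case XHTML tag while keeping any attributes.
function closeBrTags(html) {
    return html.replace(/<br(\s[^<>]*?)?\s*\/?>/gi, "<br$1/>");
}
```

Tags that merely start with "br" (a hypothetical `<break>`, for instance) are left alone because the pattern requires whitespace, a slash, or the closing bracket right after `br`.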
Uhm, I got this today, while trying out the 0.0.0.6 release.
I tried the Zashikiwarashi vol 9 before this one, and it did pack it.
Btw: I tried with 0.0.0.7 (sonako) manually installing it, the same happens both on firefox and chrome.
[google store has 0.0.0.6 avail right now, as for mozilla I can't even find it]
The metadata, manifest, spine and guide sections in the opf have xmlns="" inserted in them.
<metadata xmlns="" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<manifest xmlns="">
<spine xmlns="" toc="ncx">
<guide xmlns="">
Again, calibre can still open the book, but I guess it might be an issue for other lightweight apps.
(using Firefox 49.0a2)
I noticed stuff grabbed from RoyalRoad is missing chapter names.
Check "http://royalroadl.com/fiction/1233".
It happened after the removing of the "Try the beta reader" line (sonako 16/07/2016).
At the end of the chapter(s) there is an incomplete link to the next chapter on RoyalRoad web (arrow thing), maybe that should be removed too.
div tag is inside p tag.
u tag (underline) into suitable form for epub. <p>Normal <u>underline></u></p> should become <p>Normal <span style="text-decoration: underline;">underline></span></p>
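A sketch of the u-to-span conversion described above, done on the serialized markup. `replaceUnderlineTags` is a hypothetical name, and this version discards any attributes on the original u tag, which may or may not be acceptable.

```javascript
// Hypothetical helper: replace <u>...</u> with a styled <span>.
// The patterns require ">" or whitespace after "u", so <ul> is untouched.
function replaceUnderlineTags(html) {
    return html
        .replace(/<u(\s[^<>]*)?>/gi, '<span style="text-decoration: underline;">')
        .replace(/<\/u\s*>/gi, "</span>");
}
```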
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0001_Prologue.xhtml(4,85): element "u" not allowed anywhere; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")
span tag inside h* tag are not fixed, like <h3><span class="mw-headline" id="1st_time">1<sup>st</sup> time</span></h3>
ERROR: /home/yumi/Downloads/Utsuro_no_...koVolume_1.epub/OEBPS/Text/0002_1st_time.xhtml(1,497): value of attribute "id" is invalid; must be an XML name without colons
<h3 id="1st_time">, but it's still not fixed, and not useful here.
Well, some more, but I lost the samples.
center tag isn't allowed in epub, too. <center>text</center> should become <p style="text-align: center;">text</p>
align attribute in p/span/div should be converted into css style text-align:
BTE-GEN moves up headings if higher levels are missing, i.e. h2 to h1, h3 to h2 if there's no h1. Can this be considered?
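One possible reading of the heading promotion above, sketched on a plain list of heading levels. I don't know how BTE-GEN actually implements it; `promoteHeadingLevels` is a hypothetical helper that shifts every level up by the same amount so the highest level present becomes h1.

```javascript
// Hypothetical helper: given the heading levels found in a chapter
// (e.g. [2, 3, 3, 2] for h2, h3, h3, h2), shift them all so that the
// smallest level becomes 1. Relative nesting is preserved.
function promoteHeadingLevels(levels) {
    if (levels.length === 0) {
        return [];
    }
    const shift = Math.min(...levels) - 1;
    return levels.map(level => level - shift);
}
```

This does not close gaps between levels (h1 followed directly by h3 stays as is), which is a separate decision.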
In the list of references (translator's notes) in B-T web, the link to jump up to where the reference belongs only has a single ↑ symbol. The same in BTE-GEN's output. In WebToEpub's output, it becomes Jump up ↑. If you remove cite-accessibility-label (class), the Jump up text will stop popping up out of nowhere.
Full disclosure: I'm developing my own (not easy-to-use) Baka-Tsuki to epub converter, which is for freaks like me, and not for normal users at all.
This question is for everyone including the creator of this project.
Since we cannot get rid of the blank section in Novel Illustrations without causing issues, the next best thing is to fill it with some kind of content.
Should we fill it with copyright disclaimer information, or something else maybe?
You guys decide.
If it's possible and not too much work, I suggest to keep compatibility with non-WebExtension "enabled" Browser.
This includes Firefox 47 which is the last stable since Firefox 48 should be out in August(and support WebExtension), and other forks like Palemoon.
I won't ask for Other Chrome derivates because they should already be working (Iron does work).
I would like to have the betas of webtoepub available on the relevant Mozilla addon channel (beta); since they are not scrutinized by Mozilla as the release ones, it's possible to release them more frequently.
It should be useful for testing and stuff.
-If it's possible- I'd like you to add Wuxiaworld support, since there are some nice novels there.
The website has project pages like "www.wuxiaworld.com/wmw-index/", it should be possible to work with.
Thanks anyway.
This file has no namespace. Its namespace must be http://www.w3.org/1999/xhtml. Set the namespace by defining the xmlns attribute on the element, like this
This is the error calibre's editor tells me. It asks me to put <html xmlns="http://www.w3.org/1999/xhtml"> instead of writing just <html> in the xhtml or html files in the book.
Also, every single paragraph and heading tag has xmlns="http://www.w3.org/1999/xhtml" written inside it. Sample,
<p xmlns="http://www.w3.org/1999/xhtml">
<h2 xmlns="http://www.w3.org/1999/xhtml">
<h3 xmlns="http://www.w3.org/1999/xhtml">
Calibre can still open the book, but I guess it might be an issue for other lightweight apps.
(using Firefox 49.0a2)
(note for self)
Put "Advanced Options" button to right of progress bar.
Button is only visible if Parser advertises that it has advanced options.
If clicked, additional options are put on dialog.
e.g.
Also note
Reported by dreamer2908
Ugh, apparently Calibre has problems with three dots in image filenames if I enable --smarten-punctuation option. Maybe I should report it to them when I stop being lazy.
Maybe replace the three dots (which are supposed to indicate an ellipsis) with something else. Not sure what would be best. Some options are:
@toshiya44, @belldandu, @typhoon71, @dreamer2908
I'm now up to 1600+ downloads of the plug-in.
Even if only 1 in 10 people actually use it, if I post a version to the store that's broken, that's 160 angry people.
So, need to come up with a way to beta test software before releasing to general public.
Suggestions, anyone?
First great work with this :)
The popup closes every time we change tabs, which is a pain if we need to change multiple parameters and they are in another tab.
So is it possible to add a real window/tab instead of a popup? Or maybe both?
Another possibility would be to store already modified parameters in localStorage to avoid losing them every time the popup closes.
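A minimal sketch of the localStorage idea, assuming the popup has access to a Storage-like object with `getItem` and `setItem`. The key name and the setting names used in the usage note are hypothetical; the storage object is passed in so the same code could also be pointed at another store.

```javascript
// Hypothetical key under which the popup's settings are kept.
const SETTINGS_KEY = "webToEpubPopupSettings";

// Save the current settings object as JSON.
function saveSettings(storage, settings) {
    storage.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

// Load saved settings, falling back to the defaults when nothing is
// stored or when the stored value is not valid JSON.
function loadSettings(storage, defaults) {
    const raw = storage.getItem(SETTINGS_KEY);
    if (raw === null) {
        return Object.assign({}, defaults);
    }
    try {
        return Object.assign({}, defaults, JSON.parse(raw));
    } catch (error) {
        return Object.assign({}, defaults);
    }
}
```

In the popup this might be called as `loadSettings(window.localStorage, { coverUrl: "" })` when it opens and `saveSettings(window.localStorage, currentSettings)` whenever a field changes.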
Requested by a user on the baka-tsuki forums.
Title says it all. This is a marker for me.
(note for self)
https://developer.chrome.com/extensions/i18n
Not sure when but one of the readme updates broke the link on the main branch
(Request by "Guest")
On the topic of cover issues with the extension, could I make a suggestion, to allow the setting of covers using an image from any URL instead of just those available on the page?
For example, take this page
As visible on that page itself (and therefore available for packing by the extension), the closest one can get to a cover would be
https://www.baka-tsuki.org/project/images/3/31/Zashiki_v09_000.jpg
However, this is too wide as it includes the front cover, the spine, and the back cover as well.
If on the other hand one were to check the main series page,
https://www.baka-tsuki.org/project/index.php?title=The_Zashiki_Warashi_of_Intellectual_Village
there is a much better option available to act as a cover, not present on the volume's full text page.
https://www.baka-tsuki.org/project/images/3/30/Zashiki_Volume_9_Cover.jpg
As-is, the extension does not allow for setting this as the cover, and therefore the epub needs to be manually tweaked after the fact to replace the cover.
Since I'm uncertain if there is any practical easy one-size-fits-all fix to somehow magically detect the presence of a cover image on a page other than the one being viewed, then a solution could be to allow entering an image URL to fetch a specific image to act as cover.
In calibre, after the images at the start of a epub, there's a white page, just before the novel start.
Can you check if it's caused by the reader or not? It's not supposed to be there.
I can't try with sumatra because there's the link issue still around (so the page isn't empty and I don't know if it's on calibre side).
Since the wayback machine tries to preserve the integrity of each site there should be no real change to any parser stuff except for maybe small parts.
parserFactory.register("web.archive.org/web/*someregexthatallowsnumbersonly*/http://www.baka-tsuki.org", function() { return new BakaTsukiParser() });
This is just a suggestion and it would be neat to have implemented. I emailed you already about this so yeah.
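One way to approach the Wayback Machine suggestion is to strip the archive prefix before choosing a parser, rather than registering every archived host separately. The sketch below is an assumption about how that could look; `stripWaybackPrefix` is a hypothetical helper and it only handles the digits-only timestamp form shown in the suggestion.

```javascript
// Matches "http(s)://web.archive.org/web/<digits>/<original url>".
const WAYBACK_PREFIX = /^https?:\/\/web\.archive\.org\/web\/\d+\/(.+)$/;

// Hypothetical helper: return the original URL for an archived page,
// or the input unchanged when it is not a Wayback Machine URL.
function stripWaybackPrefix(url) {
    const match = WAYBACK_PREFIX.exec(url);
    return match === null ? url : match[1];
}
```

The stripped URL could then be used only for parser lookup, while the archived URL is still the one fetched.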
I noticed there was something along the lines of changing the icon, so... I suggest this one:
http://www.flaticon.com/free-icon/books_150360
It's the one I use on Firefox; since you already have to unzip, zip, change extension... one can just replace the icon .png. The strange thing is that the default icon is OK-ish on Chrome, but renders horribly on Firefox.
I think it would be a good idea to check for some tag that's always on the volume pages and if that tag doesn't exist then throw a message at the user saying "Hey this is not a book page" instead of throwing meaningless errors about hrefs and stuff being missing (because its obvious they are missing since the parent element is nowhere on the page).
Initial investigation shows problem is setting element is not working as expected when the test file is loaded asynchronously from local file.
Fix is not urgent, as value is set correctly when loading from network. (In this case, is only used to try and simulate loading from network for test.)
Note, also works when file is loaded synchronously.
requested by amit34521 over at Baka-Tsuki
Can someone add a similar generator for the mobile users or some other option for the mobile users who used the epub generator before???
Simply add an index.html file and call the required files and add more javascript if need be.
I can easily host it on my dedi when this is done :D
(Note for me.)
Sometimes the attempt to get the high resolution version of an image fails. (e.g. WayBackMachine did not preserve the file.)
In this case, the HTTP exception that is thrown aborts the "fetch all images" operation.
Need an option to say "fetch rest of images on image failure". In which case, use the thumbnail image on the original page and try fetching the rest of the images.
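The fallback described in this note could be sketched as below. The shape of the image records (`highResUrl`, `thumbnailUrl`) and the injected `fetchImage` function are assumptions for illustration, not the extension's actual data model.

```javascript
// Hypothetical sketch: try the high resolution image first and, if that
// fetch fails, fall back to the thumbnail so the remaining images are
// still fetched. fetchImage(url) is assumed to return a Promise.
async function fetchImagesWithFallback(images, fetchImage) {
    const results = [];
    for (const image of images) {
        try {
            const data = await fetchImage(image.highResUrl);
            results.push({ source: "highRes", data: data });
        } catch (error) {
            // If the thumbnail fetch also fails, the error propagates.
            const data = await fetchImage(image.thumbnailUrl);
            results.push({ source: "thumbnail", data: data });
        }
    }
    return results;
}
```

Fetching sequentially keeps the sketch simple; a real implementation might want to run the requests in parallel.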
Since the source is javascript this should be easy.
I have made this issue as a marker for myself and its low priority for me.
Leave a thumbs up on the first comment if you like this issue. Please do not spam text +1's though as this causes spammy emails.
I will be referring to this https://hacks.mozilla.org/2015/10/porting-chrome-extensions-to-firefox-with-webextensions/
This was requested by a guest
This is a self marker for me and is exactly what the title says.
Warning message from Calibre
The cover image has an id != "cover". Renaming to work around bug in Nook Color.
Refer #46
This is the last remaining error that i have so far been unable to fix
[kami@Index ~]$ epubcheck Toaru_Maju...dexVolume1.epub
Validating using EPUB version 2.0.1 rules.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,884): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,1080): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,2006): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,2191): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,2735): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
ERROR(RSC-005): Toaru_Maju...dexVolume1.epub/OEBPS/toc.ncx(1,2923): Error while parsing file 'different playOrder values for navPoint/navTarget/pageTarget that refer to same target'.
Check finished with errors
epubcheck completed
This is the toc.ncx
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
<head>
<meta content="https://web.archive.org/web/20140803022146/http://www.baka-tsuki.org/project/index.php?title=Toaru_Majutsu_no_Index:Volume1" name="dtb:uid" />
<meta content="2" name="dtb:depth" />
<meta content="0" name="dtb:totalPageCount" />
<meta content="0" name="dtb:maxPageNumber" />
</head>
<docTitle><text>Toaru Majutsu no Index:Volume1</text></docTitle>
<navMap>
<navPoint id="body0001" playOrder="1">
<navLabel><text>Novel Illustrations</text></navLabel>
<content src="Text/0000_Novel_Illustrations.xhtml" />
</navPoint>
<navPoint id="body0002" playOrder="2">
<navLabel><text>Prologue: The Tale of the Illusion Killer Boy. The_Imagine-Breaker.</text></navLabel>
<content src="Text/0001_Prologue_T...ne-Breaker.xhtml" />
</navPoint>
<navPoint id="body0003" playOrder="3">
<navLabel><text>Chapter 1: The Magician Lands on the Tower. FAIR,_Occasionally_GIRL.</text></navLabel>
<content src="Text/0002_Chapter_1_...nally_GIRL.xhtml" />
<navPoint id="body0004" playOrder="4">
<navLabel><text>Part 1</text></navLabel>
<content src="Text/0002_Chapter_1_...nally_GIRL.xhtml" />
</navPoint>
<navPoint id="body0005" playOrder="5">
<navLabel><text>Part 2</text></navLabel>
<content src="Text/0003_Part_2.xhtml" />
</navPoint>
<navPoint id="body0006" playOrder="6">
<navLabel><text>Part 3</text></navLabel>
<content src="Text/0004_Part_3.xhtml" />
</navPoint>
<navPoint id="body0007" playOrder="7">
<navLabel><text>Part 4</text></navLabel>
<content src="Text/0005_Part_4.xhtml" />
</navPoint>
<navPoint id="body0008" playOrder="8">
<navLabel><text>Part 5</text></navLabel>
<content src="Text/0006_Part_5.xhtml" />
</navPoint>
<navPoint id="body0009" playOrder="9">
<navLabel><text>Part 6</text></navLabel>
<content src="Text/0007_Part_6.xhtml" />
</navPoint>
<navPoint id="body0010" playOrder="10">
<navLabel><text>Part 7</text></navLabel>
<content src="Text/0008_Part_7.xhtml" />
</navPoint>
</navPoint>
<navPoint id="body0011" playOrder="11">
<navLabel><text>Chapter 2: The Illusionist Bestows Demise. The_7th-Egde.</text></navLabel>
<content src="Text/0009_Chapter_2_...e_7th-Egde.xhtml" />
<navPoint id="body0012" playOrder="12">
<navLabel><text>Part 1</text></navLabel>
<content src="Text/0009_Chapter_2_...e_7th-Egde.xhtml" />
</navPoint>
<navPoint id="body0013" playOrder="13">
<navLabel><text>Part 2</text></navLabel>
<content src="Text/0010_Part_2.xhtml" />
</navPoint>
<navPoint id="body0014" playOrder="14">
<navLabel><text>Part 3</text></navLabel>
<content src="Text/0011_Part_3.xhtml" />
</navPoint>
<navPoint id="body0015" playOrder="15">
<navLabel><text>Part 4</text></navLabel>
<content src="Text/0012_Part_4.xhtml" />
</navPoint>
</navPoint>
<navPoint id="body0016" playOrder="16">
<navLabel><text>Chapter 3: The Grimoire Peacefully Smiles. "Forget_me_not."</text></navLabel>
<content src="Text/0013_Chapter_3_...get_me_not.xhtml" />
<navPoint id="body0017" playOrder="17">
<navLabel><text>Part 1</text></navLabel>
<content src="Text/0013_Chapter_3_...get_me_not.xhtml" />
</navPoint>
<navPoint id="body0018" playOrder="18">
<navLabel><text>Part 2</text></navLabel>
<content src="Text/0014_Part_2.xhtml" />
</navPoint>
<navPoint id="body0019" playOrder="19">
<navLabel><text>Part 3</text></navLabel>
<content src="Text/0015_Part_3.xhtml" />
</navPoint>
<navPoint id="body0020" playOrder="20">
<navLabel><text>Part 4</text></navLabel>
<content src="Text/0016_Part_4.xhtml" />
</navPoint>
</navPoint>
<navPoint id="body0021" playOrder="21">
<navLabel><text>Chapter 4: The Exorcist Chooses the End. (N)Ever_Say_Good_bye.</text></navLabel>
<content src="Text/0017_Chapter_4_...y_Good_bye.xhtml" />
</navPoint>
<navPoint id="body0022" playOrder="22">
<navLabel><text>Epilogue: The Conclusion of the Index of Prohibited Books Girl. Index-Librorum-Prohibitorum.</text></navLabel>
<content src="Text/0018_Epilogue_T...ohibitorum.xhtml" />
</navPoint>
<navPoint id="body0023" playOrder="23">
<navLabel><text>Afterword</text></navLabel>
<content src="Text/0019_Afterword.xhtml" />
<navPoint id="body0024" playOrder="24">
<navLabel><text>Translator's Notes</text></navLabel>
<content src="Text/0020_Translators_Notes.xhtml" />
</navPoint>
<navPoint id="body0025" playOrder="25">
<navLabel><text>Alternate Translations</text></navLabel>
<content src="Text/0021_Alternate_...anslations.xhtml" />
</navPoint>
</navPoint>
</navMap>
</ncx>
This is probably the only thing so far i'm stumped by. If anyone wants to enlighten me as to whats wrong here as far as the playOrder is concerned i'm all ears (and eyes).
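For what it's worth, in the toc.ncx above each "Chapter N" navPoint and its first "Part 1" child share the same content src but carry different playOrder values (3 and 4, 11 and 12, 16 and 17), which lines up with the six errors reported. One way to satisfy the check, sketched under the assumption that navPoints with an identical src should share a playOrder, is shown below. `assignPlayOrder` is a hypothetical helper.

```javascript
// Hypothetical helper: given the content src of each navPoint in
// document order, return a playOrder for each one. NavPoints that point
// at the same target get the same number, and numbers have no gaps.
function assignPlayOrder(navPointTargets) {
    const orderByTarget = new Map();
    return navPointTargets.map(src => {
        if (!orderByTarget.has(src)) {
            orderByTarget.set(src, orderByTarget.size + 1);
        }
        return orderByTarget.get(src);
    });
}
```

An alternative would be to give the nested "Part 1" entry its own target, for example by pointing it at a fragment inside the chapter file.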
Well, it's like this:
Link: https://baka-tsuki.org/project/index.php?title=Ultimate_Antihero:Volume_4
Title: Ultimate Antihero:Volume 4 (this is what is shown in the tag)
Filename: Ultimate_A...roVolume_4.epub
I suppose the proper filename should be "Ultimate_Antihero-Volume 4.epub"; I noticed it changed from before, when the spaces were stripped instead of substituted with "", maybe that introduced this.
Many Baka-Tsuki web pages have an image gallery at the start of the web page.
Some of these images also appear in the story text.
Provide an option to have the ePUB generator remove any images in the gallery that also appear in the text.
Suggested implementation notes.
(Obviously, the above processing needs to be done BEFORE the images in the gallerybox are “flattened” to be outside the box.)
"Guest" wrote:
Using the "remove duplicate images" option leaves a bunch of empty
<div>
</div>
<div>
</div>
scattered around in the illustrations html.
I mean, I guess it doesn't really affect the user-facing result so it's hardly high priority, but it does seem a wee bit untidy.
As the title says this will NOT affect chapter names and image names, it will affect the main epub file name.
I'm thinking of adding this to #14 but i was looking for opinions on whether its a good idea or not.
If implemented, concatenation will be on by default and users that want it off can turn it off from advanced options.
In sumatrapdf there's a link after every image.
Happens to both the images at the start and those in the middle of chapters.
Doesn't happen in calibre (or sigil, but that's not a reader).
I checked in icecream too: the starting images are replaced by a bunch of links, except the cover which is fine; the images in the middle of chapters are fine too.