wattpad2epub's Introduction

Readme (THIS REPOSITORY IS DEPRECATED)

Deprecation warning

This repository is deprecated. No further changes will be made. You are free to fork it if you want to.

I had been fixing, rewriting and testing things for a new and improved version of Wattpad2Epub when, suddenly, everything started to break apart.

Investigating the new breakage, I found that Wattpad has changed its page generation and now adds most of the page content dynamically through JavaScript.

The simplest way to work around this would be the 'requests_html' Python module, but it works by installing and driving a browser (Chromium by default), and that is something I'm not interested in.

I had already lost most of my interest in wattpad to begin with, and this is the straw that breaks the camel's back, so there will be no more changes here.

What is Wattpad2Epub

Wattpad2Epub downloads and converts Wattpad books into Epub files you can use with your favorite ebook reader.

This is a command line program, so a basic understanding of the command line is expected, at least until a GUI is added (don't hold your breath on that). If that's a problem for you, this may not be the program you are looking for.

Why Wattpad2Epub?

Wattpad doesn't offer an option to download a book. This forces you to remain online while reading, and to use Wattpad's application to access the stories.

Having those stories in epub format lets you keep a backup, read offline, and self-publish.

Requirements

You will need Python 3. On macOS you can install it with Homebrew. The main script needs BeautifulSoup4 and ebooklib, which you can install with:

pip3 install BeautifulSoup4
pip3 install ebooklib

Running it

Run the script with:

python3 wattpad2epub.py your_url_argument

your_url_argument should be your story URL, for example: http://www.wattpad.com/story/53207033-the-arwain-chronicles
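
The tracebacks in the issue reports further down call `get_book(args.initial_url[0])`, which suggests the URL is read as a one-element positional argument. A minimal sketch of such a parser, assuming `argparse` is used (the `parse_args` helper and the help texts are illustrative, not taken from the script):

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the story URL is the only positional argument."""
    parser = argparse.ArgumentParser(
        description="Download a Wattpad story and convert it to EPUB.")
    # nargs=1 yields a one-element list, which matches args.initial_url[0].
    parser.add_argument("initial_url", nargs=1, metavar="your_url_argument",
                        help="URL of the story to download")
    return parser.parse_args(argv)


args = parse_args(["http://www.wattpad.com/story/53207033-the-arwain-chronicles"])
print(args.initial_url[0])
```

Passing `argv=None` makes `argparse` fall back to the real command line, so the same helper works both in the script and in a quick test.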

Output

After the script finishes, the epub file is in the root folder, named in the format Title - Author.epub, for example: The Arwain Chronicles Book I - IceheartPhoenix.epub
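
As an illustration of the naming format, here is a small sketch that builds such a file name. The `epub_filename` helper is hypothetical, and replacing characters that some file systems reject is an added assumption, not something the script is known to do:

```python
import re


def epub_filename(title, author):
    """Build 'Title - Author.epub', replacing characters unsafe in file names."""
    name = "{} - {}".format(title.strip(), author.strip())
    # Characters such as / \ : * ? " < > | are rejected by some file systems.
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return name + ".epub"


print(epub_filename("The Arwain Chronicles Book I", "IceheartPhoenix"))
# The Arwain Chronicles Book I - IceheartPhoenix.epub
```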

On Wattpad's API

As of Dec. 2018, the API has been split into two parts, a "public API" and a "private API".

Public API:

  • Has been in beta state since 2015, and they warn that it's subject to change, which makes it unreliable. (2018)
  • Does NOT provide a way to retrieve story or chapter text, making it unsuitable for Wattpad2Epub purposes. (2018)
  • Needs double authentication (application + user). (2015-2018)

Private API:

  • Has been moved behind a login for which the Wattpad user account doesn't work, and there is nothing to help you find out how to gain access or whether it's possible at all. (2018)
  • Isn't publicized. Found it through a comment on Stack Overflow, which probably means it's not meant for use by external applications. (2018)
  • When the documentation for the "private API" was accessible, it was incomplete, with some essential parts missing. (2015)
  • There were some server failures during my tests. (2015)
  • Couldn't find a reliable way to retrieve a full story text. (2015)
  • Needs double authentication (application + user). (2015-2018)

Based on all this, I've given up on using the API at all, and chosen to keep parsing the HTML because it allows us to:

  • Retrieve full story text (essential)
  • Not require the user to authenticate (important)
  • Not require application authentication (nice to have)

wattpad2epub's People

Contributors

gatoloko, julyj, mtrnord, sebastiangiro, silbaer

wattpad2epub's Issues

Italics are broken

I really like this script.
The only problem is that when italics are converted, the italic blocks are copied into the paragraph and then again into the blank space after it.
This breaks the flow of a book. Here is an example:

Text Text Text Text Italics Text Text Text Text.

Italics

Text Text Text Text Text Text Text Text.

Thanks, Mikandal

no module named bs4

Traceback (most recent call last):
File "wattpad2epub.py", line 33, in
import gsweb
File "C:\Users\User\Downloads\Wattpad2Epub-master\libs\gsweb.py", line 19, in
from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'

I've already tried the solutions I found on Google, but still no luck.
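
One common cause of this error is that pip installed the package for a different Python interpreter than the one running the script. The following sketch, which uses only the standard library, reports which interpreter is running and which of the two required modules it cannot find (the `missing_modules` helper is illustrative):

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that the running interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is provided by the BeautifulSoup4 package; ebooklib by ebooklib.
missing = missing_modules(["bs4", "ebooklib"])
print("Interpreter:", sys.executable)
print("Missing modules:", missing or "none")
```

If `bs4` is listed as missing, installing with the same interpreter, for example `python -m pip install BeautifulSoup4`, may help.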

Graphical user interface

Right now, to use the script you have to use a terminal, and most users are too lazy for that, so we should provide a GUI.

A simple GUI would have an editable field to input the story URL and a button to start the process.

A nicer GUI would also have a progress bar (advancing with each downloaded chapter) and a finalization message.

The README.md needs some love.

As I have been made aware, the readme is a mess.

  • The requirements section mentions how to install Python on OSX, but nothing about Windows or Linux.

  • The same requirements section explains one way to install BeautifulSoup and ebooklib. Does this work on Windows? Should we mention other ways?

  • Do we really want to explain how to install python and modules at all?

I'd really like to get some feedback about this.

Duplicated paragraphs

When I run the script with the story URL, it goes all the way to the last chapter and then starts over again from the first chapter to the last. The final result is an EPUB file with duplicated paragraphs, from beginning to end.

Is there another command?

Bad format because of <pre> tag

Every chapter is enclosed in <pre> tags. This breaks the display of the epub, because every paragraph is displayed on a single line with no line breaks.

Problem with italic word or italic paragraph

Thanks for the git!
But there is a problem with the results when it finds italic tags within words or paragraphs. Like the example below:
Disclaimer
Result:
Disclaimer Disclaimer

other example:
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
Result:
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
standard

Another example:
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
Result:
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.

How to resolve this ?

Stylesheet

Even though body.css includes basic styling for lots of tags, the chapters' XHTML files don't have a line that links to the stylesheet. It took me a while to notice this, so I kept checking the HTML and the CSS to see if I was coding something wrong, until I compared the code to another epub I had downloaded earlier.

It would be interesting if the script could write that line into the head of each individual XHTML file.
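
A minimal sketch of what this could look like, working on the chapter markup as a plain string. The `add_stylesheet_link` helper and the default `href` are assumptions; the correct relative path depends on where body.css sits inside the epub:

```python
def add_stylesheet_link(xhtml, href="body.css"):
    """Insert a stylesheet <link> before </head> unless it is already present."""
    link = '<link rel="stylesheet" type="text/css" href="{}"/>'.format(href)
    if link in xhtml or "</head>" not in xhtml:
        return xhtml  # nothing to do, or no head element to extend
    return xhtml.replace("</head>", "  " + link + "\n</head>", 1)


page = "<html><head><title>Chapter 1</title></head><body><p>Text</p></body></html>"
print(add_stylesheet_link(page))
```

Calling the helper twice on the same page leaves a single link, so it can be applied safely to files that were already fixed.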

Question

Is it right that I have to change initial_url to change the book? It would be nice to have an argument where I can pass it in.

No downloading URL with latin characters such as í á ó à, etc.

Hi, I am new to GitHub and programming in general. I managed to install and run this wonderful project, but only with URLs and books written in English. When I tried to download books which are in Portuguese, the message below appears. Thank you for this amazing git!

asus@asus-K46CA:~/wattpadscraper/Wattpad2Epub-master$ python3 wattpad2epub.py https://www.wattpad.com/903426169-percy-jackson-filho-de-voldemort-capítulo-1
Traceback (most recent call last):
File "wattpad2epub.py", line 211, in
get_book(args.initial_url[0])
File "wattpad2epub.py", line 115, in get_book
html = gsweb.get_soup(initial_url)
File "/home/asus/wattpadscraper/Wattpad2Epub-master/libs/gsweb.py", line 77, in get_soup
html = get_url(url)
File "/home/asus/wattpadscraper/Wattpad2Epub-master/libs/gsweb.py", line 51, in get_url
response = urllib.request.urlopen(request)
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 1368, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.6/urllib/request.py", line 1325, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.6/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1292, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.6/http/client.py", line 1140, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 51: ordinal not in range(128)
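
The traceback ends in `request.encode('ascii')`, so the request line cannot carry the raw `í` from the URL. One possible workaround, sketched here with the standard library only, is to percent-encode the path and query before calling `urlopen`. The `ascii_safe_url` helper is hypothetical, and non-ASCII host names are not handled:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_safe_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # Keep '/' and '%' so path separators and existing escapes are untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(ascii_safe_url(
    "https://www.wattpad.com/903426169-percy-jackson-filho-de-voldemort-capítulo-1"))
# https://www.wattpad.com/903426169-percy-jackson-filho-de-voldemort-cap%C3%ADtulo-1
```

URLs that are already plain ASCII pass through unchanged.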

403 Forbidden Error.

When attempting to download a book, a 403 Forbidden error is given. I'd assume Wattpad has introduced new backend features that block applications from scraping books.

FileNotFoundError: [Errno 2]

New to this kinda thing, googled my way through it all. python wattpad2epub.py [link] won't work, so I added the directory of the file. I got this reaction:

Traceback (most recent call last):
File "C:\Users-\Desktop\Wattpad2Epub-master\wattpad2epub.py", line 282, in
get_book(args.initial_url[0])
File "C:\Users-\Desktop\Wattpad2Epub-master\wattpad2epub.py", line 214, in get_book
content=open("CSS/nav.css").read())
FileNotFoundError: [Errno 2] No such file or directory: 'CSS/nav.css'

I got bs4, ebooklib and latest ver of python. anything wrong?
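
The path in `open("CSS/nav.css")` is relative, so Python looks for it in the current working directory rather than next to the script. Running the command from inside the Wattpad2Epub-master folder should avoid the error. Alternatively, the path could be resolved relative to the script file, as in this sketch (the `resource_path` helper is an assumption, not part of the script):

```python
from pathlib import Path


def resource_path(relative, base_file):
    """Resolve a path relative to the folder that contains base_file."""
    return Path(base_file).resolve().parent / relative


# Inside the script, base_file would be __file__; a fixed name is used here
# so the example runs anywhere.
print(resource_path("CSS/nav.css", "wattpad2epub.py"))
```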

Add a cover to the books

Most wattpad stories have a cover image.

While there is code to download it and add it as a cover to the epub file, it was commented out because something didn't work as expected.

Figure out what is wrong and fix it.

Update existing epub with new chapters

Right now, the only way to download new chapters is to download the whole story.

Figure out whether there is a way to add new chapters to existing files, and implement it.

Socket Timeout & incorrect error handling

The program has a timeout error on the SSL handshake and that error causes more errors because it is not caught correctly. See (partially redacted) logs below:

Entered command: D:\DOWNLO~1\WATTPA~1>WATTPA~1.PY https://www.wattpad.com/story/185706320-earthshine-the-raintree-chronicles-book-1

b"'Earthshine: The Raintree Chronicles Book 1' by grahambower"
https://a.wattpad.com/cover/185706320-352-k381040.jpg
b'Working on: 1. When the student is ready'
Current url: http://www.wattpad.com/725708244-earthshine-the-raintree-chronicles-book-1-1-when
Pages in this chapter: 3


... ... ...


b'Working on: 3. The party'
Current url: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the
Pages in this chapter: 6
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/1
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/2
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/3
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/4
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/5
Working on: http://www.wattpad.com/727354183-earthshine-the-raintree-chronicles-book-1-3-the/page/6
b'Working on: 4. The first incursion'
Current url: http://www.wattpad.com/731927539-earthshine-the-raintree-chronicles-book-1-4-the
Pages in this chapter: 5
Working on: http://www.wattpad.com/731927539-earthshine-the-raintree-chronicles-book-1-4-the/page/1
Working on: http://www.wattpad.com/731927539-earthshine-the-raintree-chronicles-book-1-4-the/page/2
Working on: http://www.wattpad.com/731927539-earthshine-the-raintree-chronicles-book-1-4-the/page/3
Traceback (most recent call last):
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 966, in send
    self.connect()
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1414, in connect
    server_hostname=server_hostname)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 423, in wrap_socket
    session=session
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 870, in _create
    self.do_handshake()
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1059: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\DOWNLO~1\WATTPA~1\wattpad2epub.py", line 285, in <module>
    get_book(args.initial_url[0])
  File "D:\DOWNLO~1\WATTPA~1\wattpad2epub.py", line 241, in get_book
    chapter = get_chapter("{}{}".format(base_url, item['href']))
  File "D:\DOWNLO~1\WATTPA~1\wattpad2epub.py", line 145, in get_chapter
    for j in get_page(page_url):
  File "D:\DOWNLO~1\WATTPA~1\wattpad2epub.py", line 128, in get_page
    text = get_html(text_url).select('pre')
  File "D:\DOWNLO~1\WATTPA~1\wattpad2epub.py", line 65, in get_html
    request = urllib.request.urlopen(req)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 543, in _open
    '_open', req)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\prime\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:1059: The handshake operation timed out>
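
One way to handle this kind of failure is to catch the timeout and retry the request a few times before giving up. The sketch below assumes a hypothetical `fetch_with_retry` helper in place of the bare `urllib.request.urlopen(req)` call; the attempt count, timeout and delay are arbitrary defaults:

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, timeout=30, delay=2.0,
                     opener=urllib.request.urlopen):
    """Return the response body, retrying on timeouts and URL errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(
        "giving up on {} after {} attempts".format(url, attempts)) from last_error
```

Because `HTTPError` is a subclass of `URLError`, HTTP errors are retried as well. The `opener` parameter exists only so the function can be exercised without network access.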

Better documentation

It would be nice if we had better documentation with:

  1. How to install
  2. How to run
  3. Where the file output will be

404 / Timeout Error

After the script had run for about 5 minutes, I got a 404 error:

Traceback (most recent call last):
  File "wattpad2epub.py", line 271, in <module>
    get_book(args.initial_url[0])
  File "wattpad2epub.py", line 228, in get_book
    chapter = get_chapter("{}{}".format(base_url, item['href']))
  File "wattpad2epub.py", line 135, in get_chapter
    for j in get_page(page_url):
  File "wattpad2epub.py", line 118, in get_page
    text = get_html(text_url).select('pre')
  File "wattpad2epub.py", line 63, in get_html
    request = urllib.request.urlopen(req)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 564, in error
    result = self._call_chain(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

What should happen?
Just keep trying until I cancel it, or at least don't lose all the things the script did in the process, so we can resume it. (This error happened like 4 times in 10 tries)
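
A rough sketch of the "don't lose the work" idea: save each chapter to a cache file as soon as it is fetched, so that a later run skips what is already on disk. The names `load_done` and `download_all`, the JSON cache format, and the `fetch` callback are all assumptions for illustration:

```python
import json
from pathlib import Path


def load_done(cache_file):
    """Return a dict of chapter URL -> HTML saved by an earlier run."""
    path = Path(cache_file)
    if path.exists():
        return json.loads(path.read_text(encoding="utf8"))
    return {}


def download_all(chapter_urls, fetch, cache_file):
    """Fetch each chapter once, saving progress after every chapter."""
    done = load_done(cache_file)
    for url in chapter_urls:
        if url in done:
            continue  # already fetched in a previous run
        done[url] = fetch(url)  # may raise; earlier chapters stay on disk
        Path(cache_file).write_text(json.dumps(done), encoding="utf8")
    return [done[url] for url in chapter_urls]
```

If `fetch` raises partway through, running `download_all` again with the same cache file continues from the first chapter that is still missing.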

AttributeError at the end of the process

Hi,
At the end of the script (I'm using Python 3.6 with Windows 7), I get an error: "AttributeError: 'bytes' object has no attribute 'encode'".
I also tried with your own example and got the same issue. I think it may be an error in how the ebooklib library is used.

(...)
Pages in this chapter: 3
Working on: http://www.wattpad.com/464522910-asylum-chapitre-40/page/1
Working on: http://www.wattpad.com/464522910-asylum-chapitre-40/page/2
Working on: http://www.wattpad.com/464522910-asylum-chapitre-40/page/3
Working on: Chapitre 41
Current url: http://www.wattpad.com/464524707-asylum-chapitre-41
Pages in this chapter: 2
Working on: http://www.wattpad.com/464524707-asylum-chapitre-41/page/1
Working on: http://www.wattpad.com/464524707-asylum-chapitre-41/page/2
Traceback (most recent call last):
File "D:\Téléchargements\Wattpad2Epub-master\wattpad2epub.py", line 261, in
get_book(args.initial_url[0])
File "D:\Téléchargements\Wattpad2Epub-master\wattpad2epub.py", line 239, in get_book
epub.write_epub(epubfile, book, {})
File "C:\Python36\lib\site-packages\ebooklib\epub.py", line 1534, in write_epub
epub.write()
File "C:\Python36\lib\site-packages\ebooklib\epub.py", line 1224, in write
self._write_items()
File "C:\Python36\lib\site-packages\ebooklib\epub.py", line 1213, in _write_items
self.out.writestr('%s/%s' % (self.book.FOLDER_NAME, item.file_name), item.get_content())
File "C:\Python36\lib\site-packages\ebooklib\epub.py", line 453, in get_content
self.content = self.book.get_template('cover')
File "D:\Téléchargements\Wattpad2Epub-master\wattpad2epub.py", line 106, in new_get_template
return original_get_template(*args, **kwargs).encode(encoding='utf8')
AttributeError: 'bytes' object has no attribute 'encode'
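
The last frame of the traceback calls `.encode()` on the template, and the error message says the value is already `bytes`. A defensive sketch that accepts either type could look like this (the `as_utf8_bytes` helper is illustrative, and whether every ebooklib version returns bytes here is not confirmed):

```python
def as_utf8_bytes(value):
    """Return value as UTF-8 bytes, whether it arrives as str or bytes."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf8")


print(as_utf8_bytes("cover"))   # b'cover'
print(as_utf8_bytes(b"cover"))  # b'cover'
```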

Resuming / Updating existing epubs

Resuming/updating would be very useful to handle stories where the author is still adding new chapters, and maybe other cases.

Some tests to "update" an existing epub have been done, but for some reason the result is a mess that doesn't comply with the epub specification and can't be handled by most readers.

This needs to be figured out and implemented properly.

Installation is confusing

I have a problem running the program. I can't work out where I should enter these commands:

pip3 install BeautifulSoup4
pip3 install ebooklib

And this command as well:
python3 wattpad2epub.py your_url_argument
