
Comments (25)

georgjaehnig commented on June 8, 2024

Take a look: a nice person who wants to stay anonymous generated an all-article EPUB (and MOBI):
https://drive.google.com/drive/folders/10q-YfudWvfs5vF3HCMW-dEiQjn4x2TlK?usp=sharing

Is it good? Then we could share it.

from webpages-to-ebook.

georgjaehnig commented on June 8, 2024

@eggsyntax I created the first EPUBs in 2017, and then again in mid-2019, when SSC was still up.

Now, because of #9, I've recreated 2019 and added 2020 – with links from archive.org, since now SSC is down.

So adding links from the 2013-2018 definitions won't help, since they're not accessible on SSC right now.

eggsyntax commented on June 8, 2024

Just did a bit of digging into the various APIs that web.archive.org supports. I was really hoping there'd be something like:

https://web.archive.org/web/latest/https://slatestarcodex.com/2016/08/25/devoodooifying-psychology/

to retrieve the most recently stored version, but if there is, I haven't been able to find it. There's a consolation prize, though:

Return a plaintext list of captures:
https://web.archive.org/web/timemap/http://slatestarcodex.com/2016/08/25/devoodooifying-psychology/

Results are in this format, with the capture date (i.e. the timestamp that can be substituted into the final URL) as the second space-separated field:
com,slatestarcodex)/2016/08/25/devoodooifying-psychology 20200624122713 https://slatestarcodex.com/2016/08/25/devoodooifying-psychology/ text/html 404 T3AQJDGWC4JGRJHRQHERX5JZGG76K5BO - - 764 3008132448 archiveteam_archivebot_go_20200624150002/old.reddit.com-inf-20200623-164549-7ljnn-00022.warc.gz

The last line (the one I pasted above) is the most recent capture, and the direct URL of that latest capture is
https://web.archive.org/web/20200624122713/http://slatestarcodex.com/2016/08/25/devoodooifying-psychology/

The trouble is that that date is from just a few days ago, after the site had been taken down, so it returns a 404 page :/ . This would have meant that my imagined https://web.archive.org/web/latest/... wouldn't have worked even if it did exist.

What would presumably work better would be to get the first version, at the cost of missing any later edits, or to get the last version before the takedown, at the cost of additional parsing & date comparison.
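
A minimal sketch of the "last version before the takedown" option, assuming the timemap lines keep the layout shown above (timestamp in field 2, original URL in field 3, HTTP status in field 5). The function name `last_good_capture` and the status-200 filter are illustrative additions, not something the archive provides:

```shell
# Read timemap lines on stdin and print the Wayback URL of the newest
# capture that is older than the cutoff and was stored with HTTP status 200.
# Usage: last_good_capture CUTOFF   (CUTOFF is a 14-digit YYYYMMDDhhmmss stamp)
last_good_capture() {
  awk -v cutoff="$1" '
    # $2 = capture timestamp, $3 = original URL, $5 = HTTP status.
    # Compare numerically, but keep the timestamp text (ts) for printing.
    $5 == "200" && ($2 + 0) < (cutoff + 0) && ($2 + 0) > best {
      best = $2 + 0; ts = $2; url = $3
    }
    END { if (ts != "") print "https://web.archive.org/web/" ts "/" url }
  '
}
```

Piping the output of `curl -s https://web.archive.org/web/timemap/<post URL>` into `last_good_capture 20200620000000` would print one archive URL, or nothing if no capture qualifies. Filtering on the status field skips captures that only recorded a 404 or a redirect.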

There's also an equivalent URL for grabbing the same info in JSON instead of plaintext.

My JS chops are pretty rusty, and I'm really short on time at the moment, but hopefully this info can get someone at least a bit closer.

georgjaehnig commented on June 8, 2024

Thinking outside the box: a different approach could be to simply take the generated by-year EPUBs and merge them into one big file. I am no EPUB expert, but if someone is, this should be a safe (and easy?) way.

benide commented on June 8, 2024

I just finished putting together a YAML file for that and was going to open an issue to see if @georgjaehnig wanted it, but I guess I don't need to open an issue now!

I took the html source from this link, named it ssc-source, then ran the following in bash:

grep -oP '(?<=")(https://web.archive.org/web/20200618075053/https://slatestarcodex.com/[\d]+/[\d]+/[\d]+/[^"]++)(?=")' ssc-source | grep -vP '(/open-thread)|(/ot)|(meetup)|(classified)' | sort | uniq | awk '{ print "- " $0 }' > ssc-archive.yml 

then edited from that to create this yaml file. I haven't actually tried running this yet.

Note: I removed open threads, classifieds, and meetups. Also, I'm sure there is a much prettier way to get the same result, but I'm working with limited command line knowledge :)

edit: I accidentally didn't do it with the latest version of the archive. That's fixed now.

georgjaehnig commented on June 8, 2024

Thanks, @benide! Got your file and running the script now... :)

georgjaehnig commented on June 8, 2024

@benide So the script ran but some URLs threw 404. Here's a list of them (not exhaustive, there are more from older years):

benide commented on June 8, 2024

Weird, I'll try to fix it up a bit more later today :-)

georgjaehnig commented on June 8, 2024

Yup, maybe run a wget on all your URLs first and check how many come back with 404.
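
A rough sketch of that check. It uses curl rather than wget (an arbitrary swap), and the names `http_status`, `check_urls`, and `CHECK_DELAY` are made up for illustration:

```shell
# Print the HTTP status code for one URL (-L follows redirects).
http_status() {
  curl -s -o /dev/null -L -w '%{http_code}' "$1"
}

# Read URLs on stdin, one per line. Print "STATUS URL" for every URL that
# does not come back as 200, then a summary count on stderr.
check_urls() {
  failed=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    status=$(http_status "$url")
    if [ "$status" != "200" ]; then
      printf '%s %s\n' "$status" "$url"
      failed=$((failed + 1))
    fi
    # Pause between requests; CHECK_DELAY is in seconds (default 1).
    sleep "${CHECK_DELAY:-1}"
  done
  printf '%s URL(s) failed\n' "$failed" >&2
}
```

Assuming the definition file has one `- URL` entry per line, something like `sed -n 's/^- //p' ssc-archive.yml | check_urls > failed.txt` would collect the failing links before running the full conversion.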

eggsyntax commented on June 8, 2024

Thanks to both of you for working on this!

I think the failure is probably because those all point to the same capture (20200618....) and the wayback machine presumably didn't capture all pages on SSC at the same time. That's a guess, but it seems likely.

Actually I'm not sure how this worked in the first place using the definitions in the repo -- the 2019 and 2020 definitions point to archive.org links, but the 2013-2018 definitions point directly to SSC. I noticed that because I was thinking it might be easier to combine all of @georgjaehnig 's year-by-year files rather than @benide 's approach of generating links from the archive page, on the assumption that the year-by-year files presumably all work.

Interestingly, I checked one of the ones that you listed as failing (the first one) and it does show up in the epub, so I'm not sure what's going on unless you originally captured the 2013-2018 pages before SSC went down?

eggsyntax commented on June 8, 2024

I'm definitely not especially knowledgeable about epubs.

That said, I think I may be able to find time tonight to throw together a script to use the list of URLs that Ben grabbed to pull the timemaps from the archive and build a valid archive.org URL for each. If I can find time to do it I'll just run it locally and hand over a list of the valid URLs.

georgjaehnig commented on June 8, 2024

Sounds great. Crossing fingers. :)

> and hand over a list of the valid URLs.

You can also try running this webpages-to-ebook script. It only needs a clone, npm i, and then

$ node index.js definitions/slatestarcodex.base.yml your-definition-file.yml

benide commented on June 8, 2024

Glad to see you all are getting somewhere! I'll let you all take that route, I'm going to write a script that grabs the url of working archive links using some of what was mentioned above :)

benide commented on June 8, 2024

Ok, so, here's what I've come up with that is hopefully reusable. It ain't pretty, but I've learned some random command line things along the way, so I'm happy with it haha:

echo "shortname: ssc_archive
metadata:
  dc:
    title: SSC Archive
tags:
  title: h2
content:" > ssc.yml

curl -s https://web.archive.org/web/20200618075053/https://slatestarcodex.com/archives/ | \
grep sya_container | \
grep -oP '(https://slatestarcodex.com/[\d]{4}/[\d]{2}/[\d]{2}/[^"]++)(?=")' | \
grep -vP '(/open-thread)|(/ot)|(meetup)|(classified)|(links)' | \
tac | uniq | \
xargs -n1 -I % sh -c "curl -s https://web.archive.org/web/timemap/% | awk -F' ' '{ print \$2 }' | tac | awk 'BEGIN {A=20200620000000} {if (\$0<A) {print \"https://web.archive.org/web/\" \$0 \"/%\"; A=0}}'" | \
awk -F '/' 'BEGIN {A=""} { if ($9!=A) {print "- raw: \"<h1>" $9 "</h1>\""; A=$9}; print "- " $0 }' >> ssc.yml

In theory, this will give the latest working links for everything, and the resulting file should be immediately usable with @georgjaehnig's script. It has to query archive.org for every single post, so it's taking a bit to generate the yml file...
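
The last awk stage is the one that inserts the year headers. Here it is on its own as a sketch (the function name `add_year_headers` is hypothetical), assuming every input line is an archive URL of the form `https://web.archive.org/web/<timestamp>/https://slatestarcodex.com/<year>/...`, so that splitting on `/` puts the year in field 9:

```shell
# Read archive URLs on stdin and emit definition entries, adding a raw
# <h1> header line each time the year in the URL changes.
add_year_headers() {
  awk -F '/' '
    BEGIN { year = "" }
    # Fields: $5 = capture timestamp, $8 = slatestarcodex.com, $9 = year
    $9 != year { print "- raw: \"<h1>" $9 "</h1>\""; year = $9 }
    { print "- " $0 }
  '
}
```

Because the header is only printed when the year differs from the previous line, the input needs to stay in chronological order for each year to appear once.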

benide commented on June 8, 2024

Updated, same gist as before. I'll give it a try myself and let you know.

eggsyntax commented on June 8, 2024

Ha! I took a much longer-winded approach and put together a fairly verbose Clojure script. My bash-fu isn't that good as yet.

Mine'll be running for a while, though; I kept getting rate-limited, so now I've slowed it way down.

benide commented on June 8, 2024

That happened to me too, found I could do 100 API calls very quickly and then it was unhappy... Eventually got it, 100 at a time haha.
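
One way to script that kind of pacing, as a sketch only. The helper name `in_batches` and its arguments are invented here, and the batch size of 100 is just the figure observed above, not a documented limit:

```shell
# Read lines on stdin and run the given command once per line (the line is
# appended as the last argument), sleeping PAUSE seconds after every BATCH
# lines. Usage: in_batches BATCH PAUSE command [args...]
in_batches() {
  batch=$1
  pause=$2
  shift 2
  n=0
  while IFS= read -r line; do
    # Redirect stdin so the command cannot swallow the remaining lines.
    "$@" "$line" < /dev/null
    n=$((n + 1))
    if [ $((n % batch)) -eq 0 ]; then
      sleep "$pause"
    fi
  done
}
```

Called as `in_batches 100 60 some_command < urls.txt`, it would run `some_command URL` for each line and sleep 60 seconds after every 100 lines. Both numbers are guesses to tune.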

edit: Wow, Clojure makes me feel very at home. I use Emacs, and this just looks like elisp. I guess I need to start playing with some different lisps.

eggsyntax commented on June 8, 2024

Looks like our results were nearly identical (there were some small differences, e.g. the first couple of lines were in a different order, and I didn't include the year headers in mine).

Clojure's an enormous pleasure, by far my favorite language and has been for about 5 years so far (out of 6-10 langs that I know fairly well). It combines all the inherent lispy goodness with a really sensible take on functional programming (it nudges you away from mutable state and toward pure functions wherever that's reasonable, but won't stand in your way when you're doing something small or one-off and just want to hack in place on a bunch of global vars). And it just has a tremendous amount of coherence in the design; you can tell the author spent years thinking really hard about this and implementing prototypes before settling on a design where everything clicks nicely with everything else. If you're curious about it but not necessarily committed to diving in whole-hog, one of the nicest places to start is with a couple of Rich Hickey's talks. I generally like talks less than I like reading, and less than I like learning by experimentation, but his talks are a dramatic exception; the ideas he expresses in them really transformed the way I think about programming. The two I'd recommend most would be Are We There Yet? and Simple Made Easy.

benide commented on June 8, 2024

In the latest bash I put above, I was more careful to not reorder posts. It only matters for things posted on the same day, and probably doesn't really matter at all. You have more posts than me because I've filtered a few things out, which is coming from this part:

grep -vP '(/open-thread)|(/ot)|(meetup)|(classified)|(links)'

There are still things that aren't really necessary left over (like posts asking people to take yearly surveys), but no easy way to remove them without doing it entirely by hand.

I'll check Clojure out for sure, thanks for the links!

@georgjaehnig: I tried running

$ node index.js definitions/slatestarcodex.base.yml ssc-archive.yml

Everything downloaded happily. A few things got skipped from not parsing correctly, not sure why. It eventually said "Done", but I can't find the epub anywhere. In particular, it isn't in output/epub.

georgjaehnig commented on June 8, 2024

@benide

> Everything downloaded happily. A few things got skipped from not parsing correctly, not sure why.

Oh, I guess this is because they didn't download correctly but got a 404 instead – which my script is not catching properly. I'll look at this now.

> It eventually said "Done", but I can't find the epub anywhere. In particular, it isn't in output/epub.

Oh, this is weird. Don't know what happened. Can you try to isolate the bug and e.g. run a dummy definition with only 2 (working) URLs or so?
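
A sketch of how such a dummy definition could be generated. The keys mirror the header used in the script earlier in this thread; the function name `write_dummy_definition` and the placeholder values `dummy` and `Dummy` are assumptions:

```shell
# Write a minimal definition file that contains only the URLs passed as
# arguments. Usage: write_dummy_definition FILE URL [URL...]
write_dummy_definition() {
  out=$1
  shift
  {
    printf 'shortname: dummy\n'
    printf 'metadata:\n  dc:\n    title: Dummy\n'
    printf 'tags:\n  title: h2\n'
    printf 'content:\n'
    for url in "$@"; do
      printf '%s\n' "- $url"
    done
  } > "$out"
}
```

After `write_dummy_definition dummy.yml URL1 URL2` with two known-good archive links, running `node index.js definitions/slatestarcodex.base.yml dummy.yml` should show whether the missing EPUB depends on the size or content of the full list.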

georgjaehnig commented on June 8, 2024

> Everything downloaded happily.

So I've fixed the download error reporting now: if wget exits with a non-zero status, the problem is reported to the console.

benide commented on June 8, 2024

I'll come back to this tomorrow 👍

benide commented on June 8, 2024

Seems to have images, good bookmarks, etc. Definitely worth sharing!

eggsyntax commented on June 8, 2024

We've been scooped! It looks terrific, they did a really nice job on it.

The epub (viewed in iBooks on a Mac, since that's the machine I'm on) tried to make web connections to SSC and to archive.org without me clicking on anything (unless it was a misclick, which is conceivable), but maybe it's just lazily fetching an image or two? No clue, I didn't realize they could even do that without a user action.

Woohoo, I can rest easy knowing I have it all at my fingertips in one big volume :)

Thanks, y'all!

georgjaehnig commented on June 8, 2024

Great, I've put it here: https://www.reddit.com/r/slatestarcodex/comments/hkbfj4/all_articles_20132020_in_one_ebook_epub_mobi_pdf/
