openzim / sotoki
StackExchange websites to ZIM scraper
Home Page: https://library.kiwix.org/?category=stack_exchange
License: GNU General Public License v3.0
It's not the favicon (ico format) of the web site, but a 48x48 PNG file with the logo of the web site.
Traceback (most recent call last):
File "sotoki.py", line 599, in <module>
copytree('static', os.path.join('work', 'output'))
File "/usr/lib/python2.7/shutil.py", line 177, in copytree
os.makedirs(dst)
File "/home/someone/dev/kiwix/sotoki/venv/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 17] File exists: 'work/output'
=> `cp -r static/* work/output/static/.` is a solution; we need to do the same thing in Python.
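A minimal Python sketch of the `cp -r` approach above, assuming Python 3.8 or later (where `shutil.copytree` accepts `dirs_exist_ok`). The helper name `copy_static` and its default paths are illustrative and not taken from sotoki.

```python
import shutil
from pathlib import Path


def copy_static(src="static", dst="work/output/static"):
    """Copy the src tree into dst, merging into dst if it already exists."""
    # dirs_exist_ok=True (Python 3.8+) avoids "OSError: [Errno 17] File exists"
    # when the destination directory was created by an earlier run.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return Path(dst)
```

Calling `copy_static()` twice in a row should succeed, with the second call overwriting the files copied by the first.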
All pages should attribute content to SE and the respective authors:
https://archive.org/details/stackexchange
I have created a dump at:
http://zimfarm.kiwix.org/sotoki/
There are CSS/JS files, but they do not seem to be applied to the content. For example, on this page:
http://zimfarm.kiwix.org/sotoki/tags.html
SO is much bigger than SU. The code assumes that the data fits into RAM both during the build and in the browser; for example, the index page displays every single tag, which might not work on SO.
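One way to avoid holding a whole dump file in memory during the build is to stream it. The sketch below assumes the dump files consist of `<row .../>` elements whose data lives in attributes (as in the Stack Exchange data dump); the function name `iter_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element, one row at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            # Drop the row's attributes once they have been handed over,
            # so processed rows do not keep their data in memory.
            elem.clear()
```

`source` can be a file name or a binary file object, for example `iter_rows("Tags.xml")`.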
Already 5453000 Users done !
Already 5454000 Users done !
Already 5455000 Users done !
Already 5456000 Users done !
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
What is behind that error?
Here is the HTML:
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&A for professional and enthusiast programmers" />
Or it could be hardcoded in the script somehow (less good).
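As a rough sketch of the non-hardcoded option, the title and description could be read from the `<meta>` tags shown above using only the standard library. The class and function names here are hypothetical, and the sketch assumes the page keeps exposing `og:title` and `og:description`.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the content of <meta> tags, keyed by their property attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property")
        if key and "content" in attributes:
            self.meta[key] = attributes["content"]


def site_title_and_description(html_text):
    """Return (title, description); either may be None if the tag is missing."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.meta.get("og:title"), parser.meta.get("og:description")
```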
HTML tags are not escaped in titles and comments.
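A minimal sketch of escaping such text before it is written into a page, using the standard library's `html.escape`. The helper name `safe_text` is illustrative; how sotoki renders its templates is not shown here.

```python
from html import escape


def safe_text(value):
    """Escape <, >, & and quotes so the text cannot be read as markup."""
    return escape(value, quote=True)


print(safe_text("How do I use <script> tags?"))
# prints: How do I use &lt;script&gt; tags?
```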
While the ZIM metadata uses a three-letter ISO language code (which is better), the filenames should use a two-letter language code where possible, because that is far more common.
The script continues, but I see the following errors:
Already 2059000 Users done !
Already 2060000 Users done !
Already 2061000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2062000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2063000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2064000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
sha1 -> sha512 ?
User badges should be shown in a better UI than a ul list.
We should also keep a count for each badge and not show it more than once if the user earned it twice or more.
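The counting part could look like the sketch below, which assumes a user's badges are available as a plain list of badge names; the function name `summarize_badges` is hypothetical.

```python
from collections import Counter


def summarize_badges(badge_names):
    """Return (name, count) pairs, most frequent first, then alphabetical."""
    counts = Counter(badge_names)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(summarize_badges(["Teacher", "Nice Answer", "Nice Answer"]))
# prints: [('Nice Answer', 2), ('Teacher', 1)]
```

Each badge then appears once in the output, with its count available for display next to the name.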
Currently, a ZIM file of a Stack Exchange website is around 10 times bigger.
So for Stack Overflow it will produce a ZIM of ~250 GB...
I don't know how ZIM compression works, but maybe we should check whether minifying the files helps.
I have launched a first dump and the result is available here:
http://zimfarm.kiwix.org/sotoki/
Unfortunately there is no (default) welcome page like:
http://zimfarm.kiwix.org/sotoki/index.html
This page might, for example, offer a tag cloud, but other ideas would certainly be doable.
$ sotoki stackoverflow.com Kiwix
found
--2017-04-15 13:01:27-- https://archive.org/download/stackexchange/stackoverflow.com-Badges.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia600500.us.archive.org/22/items/stackexchange/stackoverflow.com-Badges.7z [following]
--2017-04-15 13:01:28-- https://ia600500.us.archive.org/22/items/stackexchange/stackoverflow.com-Badges.7z
Resolving ia600500.us.archive.org (ia600500.us.archive.org)... 207.241.227.180
Connecting to ia600500.us.archive.org (ia600500.us.archive.org)|207.241.227.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 165889201 (158M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Badges.7z’
stackoverflow.com-Badges.7z 100%[================================================================================================================>] 158.20M 7.45MB/s in 22s
2017-04-15 13:01:52 (7.18 MB/s) - ‘stackoverflow.com-Badges.7z’ saved [165889201/165889201]
stackoverflow.com-Badges.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-Badges.7z
Extracting Badges.xml
Everything is Ok
Size: 2567271793
Compressed: 165889201
found
--2017-04-15 13:02:24-- https://archive.org/download/stackexchange/stackoverflow.com-Comments.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Comments.7z [following]
--2017-04-15 13:02:25-- https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Comments.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3260662817 (3.0G) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Comments.7z’
stackoverflow.com-Comments.7z 100%[================================================================================================================>] 3.04G 1.08MB/s in 32m 10s
2017-04-15 13:34:36 (1.61 MB/s) - ‘stackoverflow.com-Comments.7z’ saved [3260662817/3260662817]
stackoverflow.com-Comments.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-Comments.7z
Extracting Comments.xml
Everything is Ok
Size: 14688930453
Compressed: 3260662817
found
--2017-04-15 13:40:23-- https://archive.org/download/stackexchange/stackoverflow.com-PostLinks.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-PostLinks.7z [following]
--2017-04-15 13:40:24-- https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-PostLinks.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56595578 (54M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-PostLinks.7z’
stackoverflow.com-PostLinks.7z 100%[================================================================================================================>] 53.97M 1.38MB/s in 20s
2017-04-15 13:40:45 (2.67 MB/s) - ‘stackoverflow.com-PostLinks.7z’ saved [56595578/56595578]
stackoverflow.com-PostLinks.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-PostLinks.7z
Extracting PostLinks.xml
Everything is Ok
Size: 489020243
Compressed: 56595578
found
--2017-04-15 13:40:56-- https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Posts.7z [following]
--2017-04-15 13:40:57-- https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Posts.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10790576915 (10G) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Posts.7z’
stackoverflow.com-Posts.7z 100%[================================================================================================================>] 10.05G 1.47MB/s in 1h 45m
2017-04-15 15:26:53 (1.62 MB/s) - ‘stackoverflow.com-Posts.7z’ saved [10790576915/10790576915]
stackoverflow.com-Posts.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-Posts.7z
Extracting Posts.xml
Everything is Ok
Size: 53806453854
Compressed: 10790576915
found
--2017-04-15 15:45:14-- https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Tags.7z [following]
--2017-04-15 15:45:15-- https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Tags.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 689401 (673K) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Tags.7z’
stackoverflow.com-Tags.7z 100%[================================================================================================================>] 673.24K 1002KB/s in 0.7s
2017-04-15 15:45:17 (1002 KB/s) - ‘stackoverflow.com-Tags.7z’ saved [689401/689401]
stackoverflow.com-Tags.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-Tags.7z
Extracting Tags.xml
Everything is Ok
Size: 4313435
Compressed: 689401
found
--2017-04-15 15:45:19-- https://archive.org/download/stackexchange/stackoverflow.com-Users.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Users.7z [following]
--2017-04-15 15:45:20-- https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Users.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275877596 (263M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Users.7z’
stackoverflow.com-Users.7z 100%[================================================================================================================>] 263.10M 10.1MB/s in 2m 14s
2017-04-15 15:47:34 (1.97 MB/s) - ‘stackoverflow.com-Users.7z’ saved [275877596/275877596]
stackoverflow.com-Users.7z: OK
Ok we have get dump
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: stackoverflow.com-Users.7z
Extracting Users.xml
Everything is Ok
Size: 2084134903
Compressed: 275877596
/media/kelson/SOTOKI/work/stackoverflow_com /media/kelson/SOTOKI
Traceback (most recent call last):
File "/media/kelson/SOTOKI/venv/local/lib/python2.7/site-packages/sotoki/merge_links.py", line 18, in <module>
line_2=csvreader.next()
StopIteration
Unable to prepare xml :(
Getting this crash late in the process is not OK.
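The traceback shows `csvreader.next()` raising `StopIteration`, which is what a CSV reader does when it has no more rows. A hedged sketch of a check that could run early and fail with a clear message, written in Python 3 spelling (`next(reader, None)`); the function name `first_two_rows` is hypothetical and this is not the actual `merge_links.py` logic.

```python
import csv


def first_two_rows(csv_file):
    """Return the first two rows of a CSV file, using None for missing rows."""
    reader = csv.reader(csv_file)
    # next() with a default returns None instead of raising StopIteration
    # when the file is empty or has only one row.
    first = next(reader, None)
    second = next(reader, None)
    return first, second
```

A caller can then test for `None` and report an empty or truncated input before any long-running work starts.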
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Traceback (most recent call last):
File "sotoki.py", line 700, in <module>
os.makedirs(output)
File "/srv/sotoki/venv/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 17] File exists: 'work/output'
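A minimal sketch of creating the output directory without failing when it already exists. It assumes Python 3.2 or later, where `os.makedirs` accepts `exist_ok`; the traceback above comes from Python 2.7, which lacks that parameter. The helper name `ensure_output_dir` is illustrative.

```python
import os


def ensure_output_dir(output):
    """Create the output directory if needed; do nothing if it exists."""
    os.makedirs(output, exist_ok=True)
    return output
```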
There is a Badges.xml file in the dump; look at that, or try to fetch badges from the API.
Use wiredtiger as database backend and see what happens.
When a user has a ProfileImageUrl, we should download that image instead of generating a random one.
See stackoverflow.com_eng_all_2017-05/A/question/41651035.html
During testing on my home machine, offlining hangs for some reason; I don't know why. We cannot deliver half-baked ZIM files, so make sure to treat this error properly during the build.
Raise error or log error in case of problem.
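One possible way to keep a download from hanging forever and to log failures is a timeout around each request. This is only a sketch: the function name `fetch`, the logger name, and the 10-second default are assumptions, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import logging
import urllib.request

logger = logging.getLogger("sotoki.sketch")


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the download fails or times out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError and socket timeouts are both subclasses of OSError.
        logger.error("could not download %s: %s", url, exc)
        return None
```

The caller can then decide whether a `None` result should abort the build or only be counted as a missing file.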
It would be nice to link to the original question URL.
Make it easier to run the whole process.
External links (that is, links not pointing to a ZIM article) should be visually tagged. This is important so that users can tell the difference between available and unavailable content. This is already the case in Wikipedia ZIM files, where we put a small icon beside the link. I would recommend taking that icon and simply tagging external links with a CSS class (which adds the icon to the right of the link).
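Deciding which links get the class could be sketched as below. It assumes that links to content inside the ZIM are relative paths, so any link with a network location counts as external; the function names and the class name `external` are hypothetical.

```python
from urllib.parse import urlparse


def is_external(href):
    """Return True if href points to another host rather than a local path."""
    return bool(urlparse(href).netloc)


def link_class(href):
    """Return the CSS class to put on a link, or an empty string."""
    return "external" if is_external(href) else ""
```

For example, `link_class("https://example.com/page")` gives `"external"`, while `link_class("question/41651035.html")` gives an empty string.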
Now: each user has a randomly generated image.
To do: use the user's actual image.
This is pretty weird, and zimwriterfs cannot really deal with them:
create directory entries
collect articles
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/ebcb05d93418f58d5026ce59d21552d3566b7001e454b77353feb5dbdadfa024.html
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/953db56137fe269271cb97b9317fc7bc39e9f8622fc0fa8531ce13a4d3532b80.html
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/48b65904759a70301bd8451f0a491b605cef39500f65e1845f2bcf0fffbb46f9.html
Currently the script complains about pre-existing data.
A solid logging infrastructure is required to debug the load script in case of errors. It must be fault tolerant, in the sense that as many files/entries as possible must be loaded, so that we get a good overview of the data errors, if any.
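A rough sketch of that fault-tolerant behaviour with the standard `logging` module: every entry is attempted, failures are logged with their traceback, and a summary is reported at the end. The names `load_all`, `load_one` and the logger name are assumptions for illustration.

```python
import logging

logger = logging.getLogger("sotoki.load")


def load_all(entries, load_one):
    """Try to load every entry; return (loaded_count, failed_indexes)."""
    loaded, failed = 0, []
    for index, entry in enumerate(entries):
        try:
            load_one(entry)
            loaded += 1
        except Exception:
            # Log the full traceback and continue with the next entry.
            logger.exception("entry %d could not be loaded: %r", index, entry)
            failed.append(index)
    logger.info("loaded %d entries, %d failed", loaded, len(failed))
    return loaded, failed
```

The list of failed indexes gives the overview of data errors without stopping at the first bad entry.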
I think it would be better to have a search engine integrated into the static version of the website. The problem is that the index is too big to be built by the user, even for Super User (20 MB). lunr.js provides numerous entry points. Here are the solutions that remain to be explored:
I think it's better to contribute back to lunr.js, which is an established project in this area. I had a look at other solutions, but they are more complicated and we only require English support.
Maybe it's better to think about I18N and go with an FTS solution that can handle multiple languages. Again, I think it's better to have search integrated into the static part of the ZIM file than to have a search engine built into the Kiwix reader, which breaks UX.
Local URLs are not encoded.
Example: tag/g++/1.html should be tag/g%2B%2B/1.html
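The encoding in the example above matches what `urllib.parse.quote` produces when `/` is kept as a safe character. A minimal sketch, with the helper name `encode_local_url` being illustrative:

```python
from urllib.parse import quote


def encode_local_url(path):
    """Percent-encode a relative path, keeping "/" as the separator."""
    return quote(path, safe="/")


print(encode_local_url("tag/g++/1.html"))
# prints: tag/g%2B%2B/1.html
```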
I cannot decide how this should work, but having to re-download everything every time seems to be a problem.
For example:
Already 13233000 questions done!
[Errno 36] File name too long: 'work/stackoverflow_com/output/static/images/62b4b33ecb0f4756c04d0a1328a104360233d0874e5c6ff3a73759e1822959cc.latex?%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%20a1%20%26%20b1%20%26%20c1%20%26%20d1%5C%5C%20a2%20%26%20b2%20%26%20c2%20%26%20d2%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29*%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%20x%5C%5C%20y%5C%5C%20z%5C%5C%201%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29%3D%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%200%5C%5C%200%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29'
Already 13234000 questions done!
Already 13235000 questions done!
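In the log above, the whole query string of the image URL ends up in the local file name. One possible fix, sketched under the assumption that local names are a hash of the URL plus an extension, is to hash the full URL (query included) and take the extension from the path only. The function name `local_image_name` and the extension length limit are hypothetical.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def local_image_name(url, max_ext_len=10):
    """Return a bounded-length file name derived from an image URL."""
    # Take the extension from the URL path only, never from the query string.
    ext = posixpath.splitext(urlsplit(url).path)[1]
    if len(ext) > max_ext_len:
        ext = ""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + ext
```

Because the query string is part of the hashed input, two URLs that differ only in their query still map to different file names.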
Create a command that offlines the images embedded in questions.
Here it is:
https://en.wikipedia.org/wiki/ISO_639-3
... and without the "locale" part.
For now it seems to be ISO 639-2 with a locale/country part.
They are ordered by name, but we could order them by count (the number of topics with each tag).
Which is best?
- name: easy if you search for something specific
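Both orderings are cheap to produce, so the choice is mainly a UI decision. A small sketch, assuming tags are available as `(name, count)` pairs; the function name `order_tags` is hypothetical.

```python
def order_tags(tags, by="name"):
    """Sort (name, count) pairs alphabetically, or by descending count."""
    if by == "count":
        # Most used tags first; ties are broken alphabetically.
        return sorted(tags, key=lambda tag: (-tag[1], tag[0]))
    return sorted(tags, key=lambda tag: tag[0])


tags = [("python", 3), ("c", 5), ("g++", 3)]
print(order_tags(tags))
# prints: [('c', 5), ('g++', 3), ('python', 3)]
print(order_tags(tags, by="count"))
# prints: [('c', 5), ('g++', 3), ('python', 3)]
```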
which will call zimwriterfs with --withFulltextIndex argument
We also need to run a script to offline the images.
The sotoki command-line tool should allow specifying the directory path where the XML files are.
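A sketch of what such an option could look like with `argparse`. The two positional arguments mirror the `sotoki stackoverflow.com Kiwix` invocation seen earlier; the `--xml-dir` option name is purely hypothetical and is not an existing sotoki flag.

```python
import argparse


def build_parser():
    """Build a command-line parser with a hypothetical --xml-dir option."""
    parser = argparse.ArgumentParser(prog="sotoki")
    parser.add_argument("domain", help="site to process, e.g. stackoverflow.com")
    parser.add_argument("publisher", help="publisher name, e.g. Kiwix")
    parser.add_argument(
        "--xml-dir",
        default=None,
        help="directory that already contains the XML files (hypothetical option)",
    )
    return parser
```

When `--xml-dir` is omitted, `args.xml_dir` is `None`, which a caller could treat as "download the dump as before".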