
sotoki's People

Contributors

amirouche, benoit74, dan-niles, dattaz, fledgexu, haksoat, imaybeabitshy, kelson42, lgtm-migrator, okkebal, rgaudin, ritikjaiswal019, sam-masaki, satyamtg, thecrazyt

sotoki's Issues

error on copy dependencies JS/CSS

Traceback (most recent call last):
File "sotoki.py", line 599, in
copytree('static', os.path.join('work', 'output'))
File "/usr/lib/python2.7/shutil.py", line 177, in copytree
os.makedirs(dst)
File "/home/someone/dev/kiwix/sotoki/venv/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 17] File exists: 'work/output'

=> `cp -r static/* work/output/static/.` is a workaround; we need to do the equivalent in Python.
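A minimal sketch of the Python equivalent, assuming the goal is to merge `static` into a destination that may already exist. The helper name `copy_into` is hypothetical and not part of sotoki.

```python
import os
import shutil


def copy_into(src, dst):
    """Copy the contents of src into dst, creating dst if needed.

    Unlike shutil.copytree() on Python 2.7, this does not fail when
    dst already exists; files already present are overwritten.
    """
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = dst if rel == "." else os.path.join(dst, rel)
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        for name in files:
            shutil.copy2(os.path.join(root, name), os.path.join(target_dir, name))
```

It could be called as `copy_into('static', os.path.join('work', 'output', 'static'))`. On Python 3.8 and later, `shutil.copytree(src, dst, dirs_exist_ok=True)` gives the same behaviour.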

Respect SE attribution rules

All pages should attribute content to SE and the respective authors:
https://archive.org/details/stackexchange

  • Visually display or otherwise indicate the source of the content as coming from the Stack Exchange Network. This requirement is satisfied with a discreet text blurb, or some other unobtrusive but clear visual indication.
  • Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site on the Network (e.g., http://stackoverflow.com/questions/12345)
  • Visually display or otherwise clearly indicate the author names for every question and answer used
  • Ensure that any Internet use of the content includes a hyperlink for each author name directly back to his or her user profile page on the source site on the Network (e.g., http://stackoverflow.com/users/12345/username), directly to the Stack Exchange domain, in standard HTML (i.e. not through a Tinyurl or other such indirect hyperlink, form of obfuscation or redirection), without any “nofollow” command or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled.

Make the script scale for StackOverflow

SO is much bigger than SU. The script assumes that the data fits into RAM both during the build and in the browser, e.g. the index page displays every single tag; this might not work on SO.

  • Fix build
  • Fix website
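For the build side, one possible direction is to stream the dump instead of loading it whole. This is only a sketch, assuming the dump files consist of `<row .../>` elements under a single root element; `iter_rows` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # drop already-processed rows from memory
```

`source` may be a file name or a binary file object, so memory use stays roughly constant regardless of the size of the XML file.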

SSL issues

Already 5453000 Users done !
Already 5454000 Users done !
Already 5455000 Users done !
Already 5456000 Users done !
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>

What is behind that error?

Metadata title/description should come from the online website

Here is the HTML:

<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&amp;A for professional and enthusiast programmers" />

Or be somehow hardcoded in the script (less good).
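A small sketch of reading these two values from a downloaded home page, using only the standard library. It assumes the `og:title` and `og:description` properties are present as in the HTML above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect og:title and og:description from <meta> tags."""

    WANTED = {"og:title": "title", "og:description": "description"}

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = self.WANTED.get(attrs.get("property"))
        if key and attrs.get("content"):
            # Keep the first occurrence only.
            self.found.setdefault(key, attrs["content"])


def extract_metadata(html):
    """Return a dict with 'title' and/or 'description' when found."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.found
```

A hardcoded value could still be used as a fallback when a key is missing from the returned dict.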

Problem with generating Stack Overflow ZIM

The script continues, but I see the following errors:

Already 2059000 Users done !
Already 2060000 Users done !
Already 2061000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2062000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2063000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
Already 2064000 Users done !
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
[Errno 24] Too many open files
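Errno 24 means the process has reached its limit on open file descriptors. The log does not show where the descriptors are leaked, so the following is only a sketch of one common remedy: making sure every file is closed as soon as it is written. `write_page` is a hypothetical helper.

```python
import os


def write_page(path, content):
    """Write one output file and release its descriptor right away."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    # The with-block closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
```

On Unix, `ulimit -n` in the shell or `resource.getrlimit(resource.RLIMIT_NOFILE)` in Python shows the current limit.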
[Errno 24] Too many open files

Question page

  • use html meta keywords
  • create proper titles
  • check for closed question
  • fix tag links
  • fix comments order
  • fix answers order

Better UI for user badge

  • User badges should be shown in a better UI than a ul list.

  • Also, we should show a count per badge and not display the same badge more than once when a user earned it twice or more.
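The counting part could be sketched as follows, assuming the badges of one user are available as a plain list of names (the function name is hypothetical):

```python
from collections import Counter


def count_badges(badge_names):
    """Return (name, count) pairs, one per distinct badge, most frequent first."""
    return Counter(badge_names).most_common()
```

The template can then render each pair once, e.g. as "Teacher × 2", instead of repeating the badge.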

minify HTML file ?

Currently, a ZIM file of a Stack Exchange website is around 10 times bigger.
So for Stack Overflow it will produce a ZIM of ~250 GB...
I don't know how ZIM compression works, but maybe we should check whether minifying the files helps.
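A rough way to check this on a sample page before changing the build. This is a sketch under stated assumptions: LZMA from the standard library is used only as a stand-in compressor, and `naive_minify` is a deliberately simple, hypothetical minifier that is not safe for `<pre>` blocks or inline scripts.

```python
import lzma
import re


def naive_minify(html):
    """Collapse whitespace between tags (unsafe for <pre> and inline scripts)."""
    html = re.sub(r">\s+<", "><", html)
    return re.sub(r"[ \t]{2,}", " ", html).strip()


def compressed_sizes(html):
    """Return (original, minified) sizes in bytes after LZMA compression."""
    original = lzma.compress(html.encode("utf-8"))
    minified = lzma.compress(naive_minify(html).encode("utf-8"))
    return len(original), len(minified)
```

If the two sizes are close on real pages, minification is unlikely to shrink the ZIM much, since the compressor already removes most of the redundancy.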

Crash with stackoverflow

$ sotoki stackoverflow.com Kiwix 
found
--2017-04-15 13:01:27--  https://archive.org/download/stackexchange/stackoverflow.com-Badges.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia600500.us.archive.org/22/items/stackexchange/stackoverflow.com-Badges.7z [following]
--2017-04-15 13:01:28--  https://ia600500.us.archive.org/22/items/stackexchange/stackoverflow.com-Badges.7z
Resolving ia600500.us.archive.org (ia600500.us.archive.org)... 207.241.227.180
Connecting to ia600500.us.archive.org (ia600500.us.archive.org)|207.241.227.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 165889201 (158M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Badges.7z’

stackoverflow.com-Badges.7z                        100%[================================================================================================================>] 158.20M  7.45MB/s    in 22s     

2017-04-15 13:01:52 (7.18 MB/s) - ‘stackoverflow.com-Badges.7z’ saved [165889201/165889201]

stackoverflow.com-Badges.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-Badges.7z

Extracting  Badges.xml

Everything is Ok

Size:       2567271793
Compressed: 165889201
found
--2017-04-15 13:02:24--  https://archive.org/download/stackexchange/stackoverflow.com-Comments.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Comments.7z [following]
--2017-04-15 13:02:25--  https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Comments.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3260662817 (3.0G) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Comments.7z’

stackoverflow.com-Comments.7z                      100%[================================================================================================================>]   3.04G  1.08MB/s    in 32m 10s 

2017-04-15 13:34:36 (1.61 MB/s) - ‘stackoverflow.com-Comments.7z’ saved [3260662817/3260662817]

stackoverflow.com-Comments.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-Comments.7z

Extracting  Comments.xml

Everything is Ok

Size:       14688930453
Compressed: 3260662817
found
--2017-04-15 13:40:23--  https://archive.org/download/stackexchange/stackoverflow.com-PostLinks.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-PostLinks.7z [following]
--2017-04-15 13:40:24--  https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-PostLinks.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56595578 (54M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-PostLinks.7z’

stackoverflow.com-PostLinks.7z                     100%[================================================================================================================>]  53.97M  1.38MB/s    in 20s     

2017-04-15 13:40:45 (2.67 MB/s) - ‘stackoverflow.com-PostLinks.7z’ saved [56595578/56595578]

stackoverflow.com-PostLinks.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-PostLinks.7z

Extracting  PostLinks.xml

Everything is Ok

Size:       489020243
Compressed: 56595578
found
--2017-04-15 13:40:56--  https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Posts.7z [following]
--2017-04-15 13:40:57--  https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Posts.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10790576915 (10G) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Posts.7z’

stackoverflow.com-Posts.7z                         100%[================================================================================================================>]  10.05G  1.47MB/s    in 1h 45m  

2017-04-15 15:26:53 (1.62 MB/s) - ‘stackoverflow.com-Posts.7z’ saved [10790576915/10790576915]

stackoverflow.com-Posts.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-Posts.7z

Extracting  Posts.xml

Everything is Ok

Size:       53806453854
Compressed: 10790576915
found
--2017-04-15 15:45:14--  https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Tags.7z [following]
--2017-04-15 15:45:15--  https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Tags.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 689401 (673K) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Tags.7z’

stackoverflow.com-Tags.7z                          100%[================================================================================================================>] 673.24K  1002KB/s    in 0.7s    

2017-04-15 15:45:17 (1002 KB/s) - ‘stackoverflow.com-Tags.7z’ saved [689401/689401]

stackoverflow.com-Tags.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-Tags.7z

Extracting  Tags.xml

Everything is Ok

Size:       4313435
Compressed: 689401
found
--2017-04-15 15:45:19--  https://archive.org/download/stackexchange/stackoverflow.com-Users.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Users.7z [following]
--2017-04-15 15:45:20--  https://ia800500.us.archive.org/22/items/stackexchange/stackoverflow.com-Users.7z
Resolving ia800500.us.archive.org (ia800500.us.archive.org)... 207.241.230.50
Connecting to ia800500.us.archive.org (ia800500.us.archive.org)|207.241.230.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275877596 (263M) [application/x-7z-compressed]
Saving to: ‘stackoverflow.com-Users.7z’

stackoverflow.com-Users.7z                         100%[================================================================================================================>] 263.10M  10.1MB/s    in 2m 14s  

2017-04-15 15:47:34 (1.97 MB/s) - ‘stackoverflow.com-Users.7z’ saved [275877596/275877596]

stackoverflow.com-Users.7z: OK
Ok we have get dump

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: stackoverflow.com-Users.7z

Extracting  Users.xml

Everything is Ok

Size:       2084134903
Compressed: 275877596
/media/kelson/SOTOKI/work/stackoverflow_com /media/kelson/SOTOKI


Traceback (most recent call last):
  File "/media/kelson/SOTOKI/venv/local/lib/python2.7/site-packages/sotoki/merge_links.py", line 18, in <module>
    line_2=csvreader.next()
StopIteration
Unable to prepare xml :(
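The `StopIteration` is raised because the CSV reader has no more rows when `next()` is called. Why the input is empty at that point is not visible in the log; the sketch below only shows how the call can be made to fail gracefully (the function name is hypothetical):

```python
import csv


def read_first_row(handle):
    """Return the first CSV row, or None when the input has no rows."""
    reader = csv.reader(handle)
    # next() with a default does not raise StopIteration on empty input.
    return next(reader, None)
```

The caller can then test for `None` and report which file was empty instead of crashing.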

"work/output" should be cleaned/removed before starting

Getting this crash late in the process is not OK.

 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Traceback (most recent call last):
  File "sotoki.py", line 700, in <module>
    os.makedirs(output)
  File "/srv/sotoki/venv/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: 'work/output'
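A minimal sketch of cleaning the directory up front, so the failure cannot happen late in the run. `prepare_output` is a hypothetical helper, and it deletes whatever a previous run left behind.

```python
import os
import shutil


def prepare_output(path):
    """Remove any previous output directory, then create an empty one."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
```

Calling `prepare_output(os.path.join('work', 'output'))` before any download or processing step would surface permission problems immediately.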

Improve badges support

There is a Badges.xml file in the dump; look at that, or try to fetch badges from the API.

External links should be tagged

External links (i.e. those not pointing to a ZIM article) should be visually tagged. This is important so that users can tell the difference between available and unavailable content. This is already the case in Wikipedia ZIM files, where a small icon is placed beside the link. I would recommend reusing that icon and simply tagging each external link with a CSS class (which adds the icon to the right of the link).
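A sketch of how the tagging could be done on rendered HTML. It assumes that internal links are relative and external ones carry a host name; the class name `external` and both function names are hypothetical. A regular expression is used for brevity, so links that already have a `class` attribute are left untouched.

```python
import re
from urllib.parse import urlparse

A_TAG = re.compile(r'<a\s+([^>]*?)href="([^"]*)"([^>]*)>', re.IGNORECASE)


def is_external(href):
    """True for absolute or protocol-relative URLs, which leave the ZIM."""
    return bool(urlparse(href).netloc)


def tag_external_links(html, css_class="external"):
    """Add css_class to every <a> element whose href is external."""
    def repl(match):
        before, href, after = match.groups()
        if not is_external(href) or "class=" in before + after:
            return match.group(0)
        return '<a %shref="%s" class="%s"%s>' % (before, href, css_class, after)
    return A_TAG.sub(repl, html)
```

The icon itself would then be added by a CSS rule on that class.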

Non-deflated HTML files in the "images" directory

This is pretty weird, and zimwriterfs cannot really deal with them:

create directory entries
collect articles
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/ebcb05d93418f58d5026ce59d21552d3566b7001e454b77353feb5dbdadfa024.html
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/953db56137fe269271cb97b9317fc7bc39e9f8622fc0fa8531ce13a4d3532b80.html
Can not initialize inflate stream for: work/stackoverflow_com/output/static/images/48b65904759a70301bd8451f0a491b605cef39500f65e1845f2bcf0fffbb46f9.html

Solid logging required to debug the `load`

A solid logging infrastructure is required to debug the load script in case of error. It must be fault tolerant in the sense that as many files/entries as possible must be loaded, so that we get a good overview of the data errors, if any.
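A minimal sketch of such a loop with the standard `logging` module. The logger name and the `load_all`/`load_one` names are assumptions for illustration, not sotoki's actual code.

```python
import logging

logger = logging.getLogger("sotoki.load")  # hypothetical logger name


def load_all(entries, load_one):
    """Call load_one on every entry; log failures instead of stopping.

    Returns (loaded, failed) counts so the caller can summarise data errors.
    """
    loaded = failed = 0
    for index, entry in enumerate(entries):
        try:
            load_one(entry)
            loaded += 1
        except Exception:
            failed += 1
            # logger.exception records the traceback along with the message.
            logger.exception("entry %d could not be loaded: %r", index, entry)
    return loaded, failed
```

Because a failing entry is logged and skipped, a single run reports every bad entry rather than only the first one.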

Integrated search engine

I think it would be better to have a search engine integrated into the static version of the website. The problem is that the index is too big to be built by the user, even for Super User (20MB). lunr.js provides numerous entry points. Here are the solutions that remain to be explored:

  • Build the index using lunr and store it in the build directory. Load the index from the build directory.
  • If the above doesn't work because the index is too big (and it might be too big for slow machines), create a new index for lunr.js that fetches store items lazily, as required by the query engine. This will save a lot of memory.

I think it's better to contribute back to lunr.js, which is an established project in this area. I had a look at other solutions, but they are more complicated and we only require English support.

Maybe it's better to think I18N and go with an FTS solution that can handle multiple languages. Again, I think it's better to have search integrated into the static part of the ZIM file than to have a search engine built into the Kiwix reader, which breaks the UX.

Create a --resume option

I cannot decide how this should work, but having to redownload everything every time seems to be a problem.
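One possible behaviour, sketched under the assumption that the expected size of each dump file is known before downloading (for example from the HTTP `Content-Length` header): skip any file that is already complete on disk. `needs_download` is a hypothetical helper.

```python
import os


def needs_download(path, expected_size):
    """True unless a complete copy of the file is already on disk."""
    return not (os.path.isfile(path) and os.path.getsize(path) == expected_size)
```

A size check does not detect corruption, so a checksum comparison would be a stricter variant of the same idea.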

File name too long error

For example:

Already 13233000 questions done!
[Errno 36] File name too long: 'work/stackoverflow_com/output/static/images/62b4b33ecb0f4756c04d0a1328a104360233d0874e5c6ff3a73759e1822959cc.latex?%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%20a1%20%26%20b1%20%26%20c1%20%26%20d1%5C%5C%20a2%20%26%20b2%20%26%20c2%20%26%20d2%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29*%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%20x%5C%5C%20y%5C%5C%20z%5C%5C%201%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29%3D%5Cbigl%28%5Cbegin%7Bsmallmatrix%7D%200%5C%5C%200%20%5Cend%7Bsmallmatrix%7D%5Cbigr%29'
Already 13234000 questions done!
Already 13235000 questions done!
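In the failing name above, the URL's query string ends up in the file name after the hash. A sketch of deriving the name from the URL path only, so that its length stays bounded; the function name, the choice of SHA-256 and the extension length cap are assumptions.

```python
import hashlib
import os
from urllib.parse import urlparse


def image_filename(url, max_ext_len=10):
    """Build a short file name: sha256(url) plus a short extension, if any."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path only, so the query string is ignored.
    ext = os.path.splitext(urlparse(url).path)[1]
    if len(ext) > max_ext_len or not ext[1:].isalnum():
        ext = ""
    return digest + ext
```

The full URL is still hashed, so two different LaTeX formulas served from the same path get different file names.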

Tags on the index page are ordered by name

They are ordered by name, but we could order by count (number of topics with each tag).

What is best?

  • name: easy if you search for something specific

  • count: better for popularity
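Both orderings are cheap to produce, so the choice can even be left as an option. A sketch, assuming tags are available as `(name, count)` pairs (the function name is hypothetical):

```python
def order_tags(tag_counts, by="name"):
    """Sort (tag, count) pairs by name, or by descending count then name."""
    if by == "count":
        return sorted(tag_counts, key=lambda item: (-item[1], item[0]))
    return sorted(tag_counts, key=lambda item: item[0])
```

Sorting by count falls back to the name for ties, which keeps the output stable between builds.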
