spencermountain / dumpster-dive
roll a wikipedia dump into mongo
License: Other
Hi,
Thank you for your amazing work. I was wondering if your script could be used to load the dump called "page-meta-history", which is much more complete than the "pages-articles" dump you are referring to, into a MongoDB collection.
I just got a chance to try it, and I'm getting 0 pages for every batch....
Thanks a lot for sharing and maintaining this amazing project!
I would like to parse all articles from "enwiki-latest-pages-articles.xml" (02 Aug. 2018; with the --html flag) and noticed that the parser seems to get "stuck" on specific entries. Instead of skipping the article, parsing essentially comes to a halt. The script is still running, but does not proceed beyond the current entry:
...
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
...
There are no errors being thrown. In a previous attempt to parse a full dump from 2017, the same behavior occurred (for a different article).
Hi. I have been importing the English Wikipedia articles (53GB) into my mongodb database for more than a day. My computer got very slow, and I had to reboot it because it stopped responding to input.
But the import was not finished, and I want to complete it.
I was wondering if it's possible to add an option to continue importing while skipping all the articles already in the database: maybe just query mongodb for each entry's title and skip it if it already exists?
I have zero knowledge of node.js, and I would really appreciate it if someone could add this. Thanks!
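For anyone wanting to try this before an official option exists, here is a minimal sketch of the skip-if-exists idea (the names here are illustrative, not dumpster-dive's actual internals): load the already-stored titles once up front, then filter each batch before inserting, instead of querying mongo per article.

```javascript
// Illustrative sketch only: `existingTitles` would be loaded once up-front,
// e.g. via db.collection('pages').distinct('title'), rather than querying
// mongo for every single article.
function filterNewPages(batch, existingTitles) {
  // keep only pages whose title is not already stored
  return batch.filter((page) => !existingTitles.has(page.title));
}

const existingTitles = new Set(['Toronto', 'Ampere']);
const batch = [{ title: 'Toronto' }, { title: 'Algorithm' }];
const fresh = filterNewPages(batch, existingTitles);
console.log(fresh.map((p) => p.title)); // only the not-yet-stored page remains
```

Loading all titles into a Set trades memory for speed; for the full English Wikipedia that set is a few million strings, which is usually acceptable compared to millions of round-trip queries.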
Hi, thanks for making this tool. It's not really an issue but I couldn't find an answer anywhere online.
I started processing the 13 GB enwiki dump using the simple single-threaded approach, thinking that plus or minus a few hours wouldn't make much of a difference... 4 days later it's still going, at ~8.4 million records processed.
Does anyone have an estimate of how many items there are? I expected about 5 million, so I kept expecting it to be done any minute 😆
I am following your guide on speeding things up the redis way. Articles are loading into redis, but running node src/worker.js gives me lots of errors:
TypeError: cb is not a function
at Object.parse (/Users/snrb/Documents/wikipedia-to-mongodb-master/src/parse.js:65:12)
at /Users/snrb/Documents/wikipedia-to-mongodb-master/src/worker.js:26:16
at Worker. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/worker.js:230:7)
at Job. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/job.js:706:12)
at multi_callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:89:14)
at Command.callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:116:9)
at normal_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
TypeError: collection.insert is not a function
at Object.parse (/Users/snrb/Documents/wikipedia-to-mongodb-master/src/parse.js:74:14)
at /Users/snrb/Documents/wikipedia-to-mongodb-master/src/worker.js:26:16
at Worker. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/worker.js:230:7)
at Job. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/job.js:706:12)
at multi_callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:89:14)
at Command.callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:116:9)
at normal_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
On Windows x64, using the -w option I get errors from the file stream/BZip stream. Adding an error handler:
var stream = fs.createReadStream(file, {bufferSize: 64 * 1024}).pipe(bz2());
stream.on('error', function (err) {
  console.log('unbzip2-stream error: ' + err);
  process.exit(1);
});
var xml = new XmlStream(stream);
I get
RangeError: Array buffer allocation failed
Works fine (but much more slowly) using single-threaded.
Any ideas?
Every time I run wp2mongo (without the --worker flag) it hangs after extracting 777 articles from the (English) Wikipedia bz2. When I re-run it without dropping the database, I get a bunch of duplicate insertion errors. On examining them, it appears that it's trying to insert every article in the last batch except for the last one, on Ibn-al-Haytham, which means the problem must be occurring either during that insertion, or between the Algiers article and the Ibn-al-Haytham article.
There's nothing suspicious happening--no excessive CPU or RAM usage, and I'm not getting errors of any kind (other than the duplicate insertion errors mentioned above).
I've tried the obvious things, rebooting and reinstalling, but without any errors being thrown, I'm at a loss for what else to try.
When running with the "markdown" option set, adjacent paragraphs are collapsed into one paragraph (i.e. there is no "\n" character between them). (dumpster-dive version 3.4.1)
Great work fixing this in 5.3.0, mate!
Importing EN now, do you know of other feeds people use with it?
Have you ever thought about doing something like this with the CommonCrawl?
Hi, thanks so much for all the recent updates to the parser and dumpster-dive!
I just ran into the following errors on version 3.6.0:
...
#0 - Art
#0 - Agnostida
#0 - Abortion
#0 - Abstract (law)
#0 - American Revolutionary War
#0 - Ampere
#0 - Algorithm
#0 - Annual plant
====error!===
Error: key Dante]] ''L'Inferno'', Canto IV. 131–135 must not contain '.'
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:914:19)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
#0 - Anthophyta
#0 - Atlas (disambiguation)
#0 - Mouthwash
#0 - Alexander the Great
...
Hi
I am exploring this repo as part of wikipedia data analysis.
Check if this is useful for you:
08:59 $ npm install -g wikipedia-to-mongodb/
npm WARN deprecated [email protected]: Jade has been renamed to pug, please install the latest version of pug instead of jade
I think some documentation would go a long way to make this repo awesome :) Let me know if I can help. I don't know JS; I'm a Java guy.
Sri Harsha
Dumpster-dive version 3.1.0
sudo nohup dumpster enwiki-latest-pages-articles.xml --batch_size 100 &
sudo nohup dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 &
The run stopped at a page count of 186,441; the script finished with no errors.
Dumpster-dive version 3.6.1
dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 --verbose true --workers 20
The script finished and the page count is 1,274,403, with no errors.
The number of pages is supposed to be over 5 million.
Certain disambiguation pages do not contain all of the expected links.
For example,
https://en.wikipedia.org/wiki/Fly_(disambiguation)
When looking in MongoDB for
title: "Fly (disambiguation)"
the type of this document is "page", and it only contains the intro in the text property. None of the other paragraphs/links have been imported.
Glad to see such great progress !
I downloaded the latest english Wikipedia dump (enwiki-latest-pages-articles.xml.bz2), as documented in the readme. Extracted via the archive manager (OS: Ubuntu 16.04).
Now loading into mongodb via command line (dumpster ./enwiki-latest-pages-articles.xml)
As the script is running, an error message is repeatedly displayed: "Error: Cannot find module '../../infobox/infobox' ..."
Within mongo shell, the enwiki database is not yet visible.
I need to know whether I should terminate this run (e.g. ctrl-c) and restart with the infobox option enabled. Ideally, I would like to have the infobox pages with the other articles in mongo.
Greatly appreciate the continued progress with enabling this type of capability !
gyp: Call to 'node -e "require('nan')"' returned exit status 127 while in binding.gyp. while trying to load binding.gyp
gyp ERR! configure error
gyp ERR! stack Error: gyp
failed with exit code: 1
gyp ERR! stack at ChildProcess.onCpExit (/usr/share/node-gyp/lib/configure.js:354:16)
gyp ERR! stack at emitTwo (events.js:87:13)
gyp ERR! stack at ChildProcess.emit (events.js:172:7)
gyp ERR! stack at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
gyp ERR! System Linux 4.4.0-116-generic
gyp ERR! command "/usr/bin/nodejs" "/usr/bin/node-gyp" "rebuild"
gyp ERR! cwd /usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/iconv
gyp ERR! node -v v4.2.6
gyp ERR! node-gyp -v v3.0.3
gyp ERR! not ok
/usr/local/lib
└── (empty)
npm ERR! Linux 4.4.0-116-generic
npm ERR! argv "/usr/bin/nodejs" "/usr/bin/npm" "install" "-g" "wikipedia-to-mongodb"
npm ERR! node v4.2.6
npm ERR! npm v3.5.2
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the iconv package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs iconv
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls iconv
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /home/burf2000/npm-debug.log
npm ERR! code 1
When importing into mongodb, helper.js chokes if there are keys beginning with a dollar sign. This seems to happen with section titles. From a quick search, I think the error message comes from this section on the Subway page:
Error: key $5 footlongs must not start with '$'
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:751:19)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at BSON.serialize (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27)
at Query.toBin (/home/ubuntu/wikipedia-to-mongodb/node_modules/mongodb-core/lib/connection/commands.js:140:25)
at Pool.write (/home/ubuntu/wikipedia-to-mongodb/node_modules/mongodb-core/lib/connection/pool.js:984:23)
events.js:182
throw er; // Unhandled 'error' event
^
TypeError: Converting circular structure to JSON
at JSON.stringify (<anonymous>)
at Job.update (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:832:17)
at Job.reattempt (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:585:23)
at Job.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:616:14)
at Job.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:555:7)
at normal_reply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis-parser/lib/parser.js:574:12)
at Socket.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:274:27)
What's the best way to escape those keys?
An error occurred when importing a certain page into mongodb; maybe this is part of wtf_wikipedia's problem.
I am not familiar with node.js, and it would be great if you could help with this. Thanks!
Command Line error:
Bayern
Bavaria
Brandenburg
Federal Chancellor
Bundestag
Bundesrat
Bundesregierung
BMW
Blaue Reiter
Bisexual (disambiguation)
[somewhere]/js-wikipedia-to-mongodb/node_modules/wtf_wikipedia/index.js:534
if(!img.match(/^(image|file|fichier|Datei)/i)){
^
TypeError: Object 350 has no method 'match'
at Object.main [as parse] ([somewhere]/js-wikipedia-to-mongodb/node_modules/wtf_wikipedia/index.js:534:17)
at XmlStream.<anonymous> ([somewhere]/js-wikipedia-to-mongodb/index.js:28:26)
at XmlStream.emit (events.js:106:17)
at fn ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:132:14)
at FiniteAutomata.run ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/finite-automata.js:32:19)
at FiniteAutomata.leave ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/finite-automata.js:85:7)
at null.<anonymous> ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:434:8)
at emit (events.js:95:17)
at Parser.parse ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/node_modules/node-expat/lib/node-expat.js:23:22)
at parseChunk ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:513:14)
I have installed wikipedia-to-mongodb on an Ubuntu 16.04 EC2 machine (with Node.js version v4.2.6), and I get the following when trying to run the command:
ubuntu@ip-172-31-15-234:~$ wp2mongo afwiki-latest-pages-articles.xml.bz2
/usr/local/lib/node_modules/wikipedia-to-mongodb/bin/wp2mongo.js:2
let program = require('commander')
^^^
SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:374:25)
at Object.Module._extensions..js (module.js:417:10)
at Module.load (module.js:344:32)
at Function.Module._load (module.js:301:12)
at Function.Module.runMain (module.js:442:10)
at startup (node.js:136:18)
at node.js:966:3
Any ideas as to why this is failing would be greatly appreciated!
hi spencer, thanks for the amazing work, though I couldn't get it to work. Here's what I'm seeing:
wp2mongo ~/Downloads/afwiki-latest-pages-articles-multistream.xml.bz2
--- starting xml parsing --
=================done!=================
0 pages stored in db 'afwiki'
any ideas why?
Hello Spencer,
I am trying to load Wikipedia in Spanish and I got this error (twice). What can I do to finish the process?
$ node index.js eswiki-latest-pages-articles.xml
Andorra
Argentina
Geografía de Andorra
Demografía de Andorra
Comunicaciones de Andorra
Artes visuales
Agricultura
Astronomía galáctica
ASCII
Arquitectura
Anoeta
Ana María Matute
Agujero negro
Antropología
Anarquía
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6
Thanks in advance
Hi again. I noticed that the number of CPU workers is much lower than it is supposed to be, even when specifying --workers. The script seems to use an average of 5 CPU workers regardless of the size of the machine or the configuration available.
**top - 05:23:18 up 13:27, 4 users, load average: 6.98, 6.76, 5.55**
Tasks: 235 total, 6 running, 229 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34.7 us, 3.7 sy, 0.0 ni, 61.1 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 61851660 total, 570464 free, 13277764 used, 48003432 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 47567560 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17559 root 20 0 1531264 719272 25636 R 121.9 1.2 23:30.91 node
17565 root 20 0 981404 174652 25712 R 120.5 0.3 22:17.39 node
17567 root 20 0 1114476 313784 25476 R 119.5 0.5 23:01.75 node
17635 root 20 0 986044 179396 25524 R 119.5 0.3 22:20.90 node
Hi
I tested to download the latest en wiki pages.
(https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)
When I imported it only imported the id and title.
Tried to figure out why, and it seems like Wikipedia added an attribute to the <text> node.
It's now:
<text bytes="80055" xml:space="preserve">
Your code does not seem to handle the bytes attribute.
I tested adding it and it seems to work now.
Thought about sending a pull request, but was a little bit unsure about:
https://github.com/spencermountain/dumpster-dive/blob/master/src/worker/01-parsePage.js#L41
Maybe the <text> regexp should be even more "open" to any kind of attributes? Example:
<text[^>]*>([\s\S]*?)<\/text>
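A quick sanity check of the looser pattern proposed here, showing it tolerates the new bytes attribute (the test string below is invented for illustration):

```javascript
// The suggested attribute-agnostic pattern for the <text> element
const TEXT_RE = /<text[^>]*>([\s\S]*?)<\/text>/;

const xml = '<text bytes="80055" xml:space="preserve">hello [[world]]</text>';
const body = xml.match(TEXT_RE)[1];
console.log(body); // captures the wikitext body, whatever attributes are present
```

The same pattern still matches a bare <text> element, so it is backward-compatible with older dumps.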
Hey,
I've found a bug occurring when using multiple workers.
Take for example the tinywiki dataset.
When I run the following code:
const dumpster = require('dumpster-dive');
options = {
file: process.argv[2],
db: 'tinywiki',
skip_redirects: false,
skip_disambig: false,
batch_size: 1000,
workers: 4,
custom: function(doc) {
console.log(doc.title(), doc.text().length);
return {};
}
};
dumpster(options, () => console.log('Parsing is Done!'));
where I pass the script the path to the tinywiki XML file through argv[2], which is ./tests/tinywiki-latest-pages-articles.xml.
When I run it with 1 worker, I get the following print:
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 788
Redirect page 0
Disambiguation page 238
Bodmin 7921
In contrast to what I get when I run it with 4 workers (look at what happens to the Bodmin and Big Page text lengths):
Redirect page 0
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 0
Disambiguation page 238
Bodmin 0
I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, making their text unreadable by the parser?
Thanks!
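If the cause really is mid-page chunking, the fix direction would be to snap each worker's byte offset forward to a page boundary before it starts reading. A toy sketch of that idea (illustrative only, not dumpster-dive's actual splitter):

```javascript
// Snap an approximate split offset forward to the end of the enclosing
// </page>, so no worker starts reading in the middle of a page.
function snapToPageEnd(xml, approxOffset) {
  const idx = xml.indexOf('</page>', approxOffset);
  return idx === -1 ? xml.length : idx + '</page>'.length;
}

const dump = '<page>A</page><page>BB</page><page>CCC</page>';
const cut = snapToPageEnd(dump, 10); // an arbitrary mid-dump offset
console.log(dump.slice(0, cut)); // the chunk ends cleanly on a </page> boundary
```

A real implementation would scan a buffered window of the stream rather than the whole string, but the invariant is the same: every worker's chunk must start and end on page boundaries.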
Issue #19 concerned the number of articles in the English Wikipedia, and it was noted that a large proportion of them are redirects. I'd like to add a little extra insight, and make a suggestion.
As I mentioned in issue #21 , during the first 16 hours of the run (on my reasonably beefy PC portable workstation), the app got through nearly 7.4 million articles. In the two days since then, it's gotten through a bit over 2.3 million. In other words, the progress through the dump is not linear.
I assume the later articles don't require dramatically more parsing time than the earlier ones. Moreover, the Node.js and Mongo DB processes together are using over 3GB of RAM, but they're not taxing the machine (I can still use it normally for everything else I have to do). My conclusion, therefore, is that the big slowdown is coming from the increasing difficulty of inserting new records as the size of the collection and its indices grows.
Therefore, I suggest adding an option to skip redirects (and probably disambiguation pages as well): for some purposes, including mine, these aren't needed, and not having to index them would dramatically cut the run time.
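Redirect pages are cheap to detect before any heavy parsing, since their wikitext starts with a #REDIRECT directive, so such an option could short-circuit early. A minimal sketch of the check (English-language dumps; other wikis use localized keywords, and dumpster-dive/wtf_wikipedia may detect redirects differently):

```javascript
// English redirect pages begin with "#REDIRECT [[Target]]" (case-insensitive);
// detecting that up-front lets an importer skip them before parsing/indexing.
function isRedirect(wikitext) {
  return /^\s*#redirect/i.test(wikitext);
}

console.log(isRedirect('#REDIRECT [[Ampere]]')); // true
console.log(isRedirect("'''Ampere''' is the SI unit of current.")); // false
```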
After restoring a dump of 2018-11-20, I'm not able to find any page with "{{pov" in its content. I'm trying to find articles marked as having a Point Of View issue, which Wikipedia marks with a tag that starts with "{{pov".
It's strange that the 15GB dump restored only 750k articles, since the English Wikipedia currently has over 5 million articles, but among those 750k at least some should be marked with the tag.
Am I doing something wrong? Is there any param to import all the things? For instance, when importing to a MySQL database with another tool, I already found articles with the desired text within the first 6k records. It looks like all the special tags are being stripped out.
Many thanks again for building and sharing such a convenient (and fast) parser !
This afternoon, I decided to explore loading Wiktionary via dumpster-dive and was pleasantly surprised at how quickly the workers loaded the respective wiki pages (i.e. 13.8 minutes).
The following is the summary provided at the end of the run:
#1 +1,000 pages - 27ms - "lautioris"
#0 +898 pages - 29ms - "meilėmis"
💪 worker #0 has finished 💪
- 1 workers still running -
#1 +140 pages - 4ms - "irascebare"
💪 worker #1 has finished 💪
- 0 workers still running -
👍 closing down.
-- final count is 5,702,608 pages --
took 13.8 minutes
🎉
When I checked the Wiktionary statistics page (https://en.wiktionary.org/wiki/Wiktionary:Statistics), the following statistics were listed for Wiktionary:
Wiktionary:Statistics
Number of entries: 5,721,450
Number of total pages: 6,322,904
Number of encoded languages: 8052
Number of uploaded files: 29
Number of user accounts: 3,446,188
Number of administrators: 98
Approximately 19K (i.e. 5,721,450 - 5,702,608 = 18,842) entries seem not to have been parsed. For the current task this is not a pressing issue, but I realized that I should provide feedback.
Ideally, I would like to develop an ability to utilize MongoDB to easily extract various portions of Wiktionary pages (e.g. synonyms), but the parsed results appear to have a variety of structurally different results. For example, from a quick spot check, there does not seem to be a consistent mapping for the section titles. Thus, the initial thought is that the parsed output is of limited value until I can figure out how to build the desired types of queries.
Look forward to feedback/comments and suggestions for how I might be able to utilize the parsed content.
Thanks again !
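As a starting point for the synonym-extraction idea, one approach is to query against whatever section structure the stored documents actually have. This sketch assumes (and given the inconsistent mappings noted above, it is only an assumption) that each document carries a sections array whose items have title and text fields:

```javascript
// Pull out sections whose title matches, case-insensitively. Against a real
// collection this filtering could instead happen server-side, e.g. with an
// $elemMatch query on the sections array.
function findSections(doc, wanted) {
  return (doc.sections || []).filter(
    (s) => (s.title || '').toLowerCase() === wanted.toLowerCase()
  );
}

// A hypothetical Wiktionary-style entry
const entry = {
  title: 'happy',
  sections: [
    { title: 'Etymology', text: '...' },
    { title: 'Synonyms', text: 'cheerful, content' },
  ],
};
console.log(findSections(entry, 'synonyms')[0].text); // "cheerful, content"
```

Spot-checking a sample of real documents this way would also reveal how consistent (or not) the section titles actually are across entries.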
I seem to have wp2mongo running just fine (under Windows 10) with a bz dump I had already downloaded. However, when I try to do things with Redis, I run into a problem. A script called by worker.js is throwing the following error, apparently on every article (none of them get into the db):
C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\infobox\index.js:32
let inside = tmpl.match(/^{{nowrap|(.*?)}}$/)[1];
^TypeError: Cannot read property '1' of null
at C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\infobox\index.js:3
at Array.forEach ()
at Object.parse_recursive [as infobox] (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_w
\infobox\index.js:19:11)
at main (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\index.js:46:
at Object.parse (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\index.js:4
at Object.parse (C:\code\JavaScript\wikipedia-to-mongodb\src\doPage.js:50:18)
at C:\code\JavaScript\wikipedia-to-mongodb\src\worker.js:25:14
at Worker. (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\worker.j
at Job. (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\job.js:706:
at multi_callback (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\redis\lib\multi.js:89:14)
(The caret in the third line should point to the "[1]".)
Evidently this sort of error with regexes is not uncommon, but I can't figure out why I'm getting it here.
EDIT: I thought wp2mongo was working fine, but it's reliably hanging on article number 777. After it took literally days to get the app compiled (apparently due to a corrupt file, followed by the thing looking for a dependency in the wrong place), I'm starting to think this is the hardest "easy way" I've ever experienced.
Hey,
I'm running a JS script containing the following code:
const dumpster = require('dumpster-dive');
options = {
file: process.argv[2],
db: 'tinywiki',
skip_redirects: false,
skip_disambig: false,
batch_size: 1000,
custom: function(doc) {
console.log(doc.title());
return {};
}
};
dumpster(options, () => console.log('Parsing is Done!'));
where I pass the script the path to the tinywiki XML file through argv[2], which is ./tests/tinywiki-latest-pages-articles.xml.
The only titles being printed are Toronto, Royal Cinema, and Belleville, while all the others print undefined: 3 out of 8 pages (even the duplicate pages should print at this point). However, the DB is populated with the correct titles and pageIDs.
BUT! If I run the same code without a custom function, it populates the DB correctly, i.e. with the correct titles and everything (text, etc.), which leads me to suspect that something wrong is passed into the custom function via the doc object. The code in dumpster-dive/src/worker/02-parseWiki.js (line 43 in c11587c) just adds the title afterwards and pushes into the DB, making it look as if the title is accessible, when in fact it ISN'T accessible from within the custom function.
I have space limitations on my new computer: with all the other binaries installed, it's struggling to download and extract Wikipedia. Any ideas on how to set this up in a cloud environment like Amazon?
Does it modify the existing file, or output to a different file?
I followed the steps to download the English wiki dump and completed everything up to the extraction process, but while adding data into mongo it shows this error:
TypeError: data[k].text is not a function
at Object.keys.forEach (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/misc.js:126:25)
at Array.forEach ()
at Object.subject bar (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/misc.js:125:23)
at doTemplate (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:35:31)
at templates.top.forEach (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:69:12)
at Array.forEach ()
at Object.allTemplates [as templates] (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:68:17)
at doSection (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/section/index.js:21:16)
at Object.splitSections [as section] (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/section/index.js:52:15)
at main (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/index.js:47:22)
I also tried dumpster ./my/wikipedia-dump.xml --images false, and it gives the same "TypeError: data[k].text is not a function" stack trace.
I'm just getting this error message when I try to run dumpster ./enwiki-20180601-pages-articles.xml
TypeError: i.json is not a function
TypeError: i.json is not a function
at data.images.doc.images.map.i \dumpster-dive\node_modules\wtf_wikipedia\src\output\json\index.js:43:43
After reaching "Disruptor Beam 9192071" I get this message. I have 32GB RAM. Is it a special finishing-up process that failed, or just that article?
Should I do something, or have all the articles been moved to mongodb? :)
Thanks!
<--- Last few GCs --->
51669740 ms: Scavenge 1394.1 (1457.7) -> 1394.1 (1457.7) MB, 0.2 / 0 ms (+ 2.4 ms in 2 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
51670266 ms: Mark-sweep 1394.1 (1457.7) -> 1284.9 (1457.7) MB, 525.2 / 0 ms (+ 3.5 ms in 4 steps since start of marking, biggest step 2.2 ms) [last resort gc].
51670912 ms: Mark-sweep 1284.9 (1457.7) -> 1284.9 (1457.7) MB, 646.1 / 0 ms [last resort gc].<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x3872217b4629
1: RegExpExecNoTests [native regexp.js:~59] [pc=0x11185f437ec3] (this=0x3872217d8991 ,j=0x2f5829f089b9 ,k=0x2f5829f08991 <String[104]: Disruptor Beam is a developer of mobile and social game products based in [[Framingham, Massachusetts]].>,n=0)
2: match(aka match) [native string.js:~118] [pc=0x11185f437ce3] (this=0x2f5829f08991 <String[104]: Disruptor Beam ...FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
Is there a way to include the pageid as a field in the resulting Mongo documents? This is very useful information to have and its annoying to have to add it via postprocessing.
I noticed that my current run is progressing and output keeps appearing; however, the number of pages hasn't increased for about half an hour now, and it keeps giving duplicate warnings. So I selected the page title that is stuck and grepped for it in the nohup output; here are the results. Each line starts with a number that indicates the line number in nohup.out. It seems that the same page is processed by multiple cores: you will notice the page "American wigeon" is processed multiple times by different cores.
sudo grep -n 'American wigeon' nohup.out
15517: #8 +100 pages - 193ms - "American wigeon"
15519: #12 +0 pages - 172ms - "American wigeon"
15538: #11 +0 pages - 190ms - "American wigeon"
31954: #9 +0 pages - 210ms - "American wigeon"
31958: #5 +0 pages - 204ms - "American wigeon"
33109: #3 +0 pages - 157ms - "American wigeon"
33121: #1 +0 pages - 175ms - "American wigeon"
43075: #14 +0 pages - 324ms - "American wigeon"
43077: #4 +0 pages - 344ms - "American wigeon"
43093: #0 +0 pages - 169ms - "American wigeon"
43247: #6 +0 pages - 173ms - "American wigeon"
43253: #15 +0 pages - 154ms - "American wigeon"
43257: #10 +0 pages - 173ms - "American wigeon"
45677: #2 +0 pages - 206ms - "American wigeon"
45693: #13 +0 pages - 198ms - "American wigeon"
45719: #7 +0 pages - 212ms - "American wigeon"
Salute all contributors to this amazing work!
I just finished parsing enwiki-20180901-pages-articles.xml. It seems to have worked properly, with 5,081,528 records in MongoDB (which is as expected). However, when I query the DB (e.g. { "title": "London" }), it seems that some important records are missing, e.g. London, China, India, etc. They should be contained in the original dump (as the index file indicates), but are somehow missing from the parsed DB.
During the parsing, errors came up from time to time, as shown in the screenshot, maybe related to the issue:
Just wondering if anyone has had a similar experience or has any thoughts on possible causes?
Thanks! Once again, great work!
Hi all,
There's a problem with redirection pages.
As it stands, in the worker/index.js file, the wiki page is parsed using the parsePage function (dumpster-dive/src/worker/index.js, line 23 in e7c6b83). parsePage calls the shouldSkip function, which returns true if the page is a redirection page; in that case, parsePage returns null to the calling function (in index.js). Neither shouldSkip nor parsePage checks whether the skip_redirects option is true or false, so redirection pages are ignored no matter how skip_redirects is set. If I force the shouldSkip function's return value to false, in order not to skip redirects, the redirection pages are processed like regular pages. This behavior seems unintuitive to me. I think a redirection page should instead get a special "redirection" field pointing to the redirected-to page. This could be very helpful, since I'd like to treat the redirection page just like the redirected-to page in terms of the text of the page, etc., so I'd like to be able to get to the redirected-to page from the redirection page.

Node v11.14.0
NPM v6.9.0
Mongod v3.4.20
Ubuntu Linux 16.04
dumpster-dive v5.1.2
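A minimal sketch of what the suggested "redirection" field could be built from (`parseRedirect` is a hypothetical helper operating on raw wikitext, not part of dumpster-dive):

```javascript
// a redirect page's wikitext starts with "#REDIRECT [[Target]]" (case-insensitive);
// extract the target title, dropping any section anchor or display text
function parseRedirect(wikitext) {
  const m = wikitext.match(/^#REDIRECT\s*:?\s*\[\[([^\]|#]+)/i);
  return m ? m[1].trim() : null;
}

console.log(parseRedirect('#REDIRECT [[Abraham Lincoln]]')); // "Abraham Lincoln"
console.log(parseRedirect('plain article text'));            // null
```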
I'm getting the following error repeatedly:
====DB write error (worker 0)===
{ MongoError: BSONObj size: 16857622 (0x1013A16) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.<anonymous> (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:453:17)
at Socket.emit (events.js:193:13)
at addChunk (_stream_readable.js:295:12)
at readableAddChunk (_stream_readable.js:276:11)
at Socket.Readable.push (_stream_readable.js:231:10)
at TCP.onStreamRead (internal/stream_base_commons.js:150:17)
ok: 0,
errmsg:
'BSONObj size: 16857622 (0x1013A16) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"',
code: 10334,
codeName: 'Location10334',
name: 'MongoError',
[Symbol(mongoErrorContextSymbol)]: {} }
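MongoDB rejects any document (or insert command) larger than 16 MB, which is what this error reports. A rough guard, using JSON size as an approximation of BSON size (`fitsInMongo` is a hypothetical helper; a real fix would also shrink the insert batch size or truncate oversized pages):

```javascript
const MAX_BYTES = 16 * 1024 * 1024;

// approximate the document's BSON size via its JSON encoding
function fitsInMongo(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8') < MAX_BYTES;
}

console.log(fitsInMongo({ title: 'London', text: 'ok' }));               // true
console.log(fitsInMongo({ title: 'Big', text: 'x'.repeat(MAX_BYTES) })); // false
```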
The process runs but only saves a single page in Mongo throughout -- which it seems to write at the very beginning of the process.
Seeing a lot of messages like:
current: 1 pages - "true"
-- 500 duplicate pages --
#5 +0 pages - 1s - "true"
-- 500 duplicate pages --
#2 +0 pages - 900ms - "true"
-- 500 duplicate pages --
#1 +0 pages - 1s - "true"
-- 500 duplicate pages --
#4 +0 pages - 944ms - "true"
-- 500 duplicate pages --
#2 +0 pages - 1s - "true"
current: 1 pages - "true"
─── worker #0 ───:
+4,500 pages
-7,350 redirects
-206 disambig
0 ns
─── worker #2 ───:
+9,500 pages
-21,113 redirects
-521 disambig
0 ns
─── worker #3 ───:
+4,500 pages
-7,353 redirects
-206 disambig
0 ns
Hi there, I've been a big fan of this ever since it was called wp2mongo.
I am sure that --plaintext=true used to give me the plaintext of the page.
When I run it now (with an older wiki download, as the current format doesn't work, per other bugs),
I get _id, title, categories, sections, pageID. What have I missed?
Hi, thanks for making this script! It is extremely simple and works perfectly with a small xml dataset (e.g. the af wikipedia), but with the biggest one it is quite slow, and one error stops script execution.
I guess the bottleneck is wikipedia.parse, because it is run synchronously on one CPU and it is blocking. Once I commented out wikipedia.parse, it sped up by as much as 10x on my computer.
My solution would be to use a job queue for running wikipedia.parse and saving to mongodb on all CPUs.
I would like to make such an improvement and open a PR for your review, but please define a license for your script first.
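As a sketch of the idea (`assignJobs` is a hypothetical helper; the real work would be dispatching each bucket to a child process or worker thread that runs the parse and the mongo writes):

```javascript
// round-robin distribution of pages across N parallel workers
function assignJobs(pages, nWorkers) {
  const buckets = Array.from({ length: nWorkers }, () => []);
  pages.forEach((page, i) => buckets[i % nWorkers].push(page));
  return buckets;
}

console.log(assignJobs(['a', 'b', 'c', 'd', 'e'], 2));
// [ [ 'a', 'c', 'e' ], [ 'b', 'd' ] ]
```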
I have this issue while running the script on the fa wikipedia dump:
events.js:72
throw er; // Unhandled 'error' event
^
TypeError: Cannot read property '0' of undefined
at f (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bit_iterator.js:24:34)
at Object.bzip2.decompress (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bzip2.js:272:13)
at decompressBlock (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:29:28)
at decompressAndQueue (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:46:20)
at Stream.end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:82:23)
at _end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:65:9)
at Stream.stream.end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:74:5)
at ReadStream.onend (_stream_readable.js:485:10)
at ReadStream.g (events.js:180:16)
at ReadStream.emit (events.js:117:20)
Hey,
I noticed that the generated html documents seem to be missing a bunch of linked/referenced information.
For instance, here are some snippets of the dumpster-generated Wikipedia article for Abraham Lincoln and the original reference:
In case it's convenient, here's the relevant json and the converted html (both generated with 3.1.0).
Abraham_Lincoln_html.txt
Cheers
Every time I run git clone git@github.com:spencermountain/wikipedia-to-mongodb.git, I get a permission error. I've disabled selinux and turned off firewalls.
I'm not sure what I'm doing wrong.
Hi,
I have a problem importing the example wiki you gave. When I launch the command "node index.js af-wiki..." I get, after some time, the following message:
{ [MongoError: server localhost:27017 sockets closed]
name: 'MongoError',
message: 'server localhost:27017 sockets closed' }
Also, my mongo shell shows 9 simultaneous connections. Is that normal?
Thank you
Hi, I've tried to parse the current.xml dump from this wiki but while it seems to work, the actual Mongo Documents are very sparsely populated.
{
"_id" : "Doppelgängers",
"title" : "Doppelgängers",
"categories" : [ ],
"sections" : [ ],
"coordinates" : [ ],
"infoboxes" : [ ],
"images" : [ ],
"references" : [ ],
"pageID" : "3200"
}
I assume this has to do with the xml dump being non-standard, or at least not matching the Wikipedia format. I thought I'd raise an issue in case this is an easy fix in the code.
Hi,
Since I've been using the new version (4.0.1), I've been experiencing errors when keys (titles, I guess? unique Id?) contain periods, for example:
Error: key entertainment ed. must not contain '.'
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:914:19)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at BSON.serialize (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/bson.js:63:28)
This happens both with a custom function and by simply running dumpster.
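MongoDB field names may not contain '.', so one workaround is to sanitize keys before insertion; a minimal sketch (`sanitizeKeys` is a hypothetical helper, not part of dumpster-dive):

```javascript
// recursively replace '.' in object keys so BSON serialization accepts them
function sanitizeKeys(value) {
  if (Array.isArray(value)) return value.map(sanitizeKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k.replace(/\./g, '_'), sanitizeKeys(v)])
    );
  }
  return value;
}

console.log(sanitizeKeys({ 'entertainment ed.': { 'a.b': 1 } }));
// { 'entertainment ed_': { a_b: 1 } }
```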
Ran the process with the file and db on Windows. The process does not end; it hangs after showing 'worker 6 has finished'.
There are messages appearing like: current: 107,920 pages - "undefined"
When I try to end the process, the exception below occurs.
one sec, cleaning-up the workers...
--uncaught top-process error--
ProcessTerminatedError: cancel after 0 retries!
at ...\pool.js:111:39
at Array.forEach ()
at WorkerNodes.handleWorkerExit (...\pool.js:110:14)
at Worker. (...\pool.js:160:44)
at Worker.emit (events.js:203:13)
at WorkerProcess. (...\worker.js:39:18)
at Object.onceWrapper (events.js:291:20)
at WorkerProcess.emit (events.js:203:13)
at ChildProcess. (...\worker\process.js:42:41)
at Object.onceWrapper (events.js:291:20) {
name: 'ProcessTerminatedError',
message: 'cancel after 0 retries!'
Trying to parse the big 65.7 GB English wiki, enwiki-20190101-pages-articles-multistream.xml.
Parsing goes well, data gets saved to Mongo, but at some point, every line is the same.
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
about 200 times, and nothing else after that.
If I press CTRL + C to break the process, I suddenly get this error 8 times (one per worker, I guess):
one sec, cleaning-up the workers...
--uncaught top-process error--
{ ProcessTerminatedError: cancel after 0 retries!
at tasks.filter.forEach.task (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:111:39)
at Array.forEach ()
at WorkerNodes.handleWorkerExit (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:110:14)
at Worker.worker.on.exitCode (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:160:44)
at Worker.emit (events.js:182:13)
at WorkerProcess.Worker.process.once.code (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker.js:39:18)
at Object.onceWrapper (events.js:273:13)
at WorkerProcess.emit (events.js:182:13)
at ChildProcess.WorkerProcess.child.once.code (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker\process.js:42:41)
at Object.onceWrapper (events.js:273:13)
name: 'ProcessTerminatedError',
message: 'cancel after 0 retries!' }
I launched the parser 4 times; it skips what was done previously, then gets to this again:
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
same error, and stops here every time.