spencermountain / dumpster-dive
roll a wikipedia dump into mongo
License: Other
Hi,
Thank you for your amazing work. I was wondering if your script could be used to load the dump called "page-meta-history", which is much more complete than the "pages-articles" dump you are referring to, into a MongoDB collection.
I just got a chance to try it, and I'm getting 0 pages for every batch....
Thanks a lot for sharing and maintaining this amazing project!
I would like to parse all articles from "enwiki-latest-pages-articles.xml" (02 Aug. 2018; with the --html flag) and noticed that the parser seems to get "stuck" on specific entries. Instead of skipping the article, parsing essentially comes to a halt. The script is still running, but does not proceed beyond the current entry:
...
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
current: 3,877,976 pages - "Vapor–liquid separator"
...
There are no errors being thrown. In a previous attempt to parse a full dump from 2017, the same behavior occurred (for a different article).
Hi. I have been importing the English Wikipedia articles (53GB) into my mongodb database for more than a day. My computer got very slow, and I had to reboot it because it stopped responding to input.
But the import was not finished, and I want to complete it.
I was wondering if it's possible to add an option to continue importing while skipping all the articles already in the database: maybe just query mongodb for each entry's title and skip it if it already exists?
I have zero knowledge of node.js, and I would really appreciate it if someone could add this. Thanks!
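For anyone wanting to try this before an official option exists, here is a minimal sketch of the skip-if-exists idea (the names here are illustrative, not dumpster-dive's actual internals): load the already-stored titles once up front, then filter each batch before inserting, instead of querying mongo per article.

```javascript
// Illustrative sketch only: `existingTitles` would be loaded once up-front,
// e.g. via db.collection('pages').distinct('title'), rather than querying
// mongo for every single article.
function filterNewPages(batch, existingTitles) {
  // keep only pages whose title is not already stored
  return batch.filter((page) => !existingTitles.has(page.title));
}

const existingTitles = new Set(['Toronto', 'Ampere']);
const batch = [{ title: 'Toronto' }, { title: 'Algorithm' }];
const fresh = filterNewPages(batch, existingTitles);
console.log(fresh.map((p) => p.title)); // only the not-yet-stored page remains
```

Loading all titles into a Set trades memory for speed; for the full English Wikipedia that set is a few million strings, which is usually acceptable compared to millions of round-trip queries.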
Hi, thanks for making this tool. It's not really an issue but I couldn't find an answer anywhere online.
I started processing the 13 GB enwiki dump using the simple single-threaded approach, thinking that plus or minus a few hours wouldn't make much of a difference... 4 days later it's still going, at ~8.4 million records processed.
Does anyone have an estimate of how many items there are? I expected about 5 million, so I kept expecting it to be done any minute 😆
I am following your guide on speeding things up the redis way. Articles are loading into redis, but running node src/worker.js gives me lots of errors:
TypeError: cb is not a function
at Object.parse (/Users/snrb/Documents/wikipedia-to-mongodb-master/src/parse.js:65:12)
at /Users/snrb/Documents/wikipedia-to-mongodb-master/src/worker.js:26:16
at Worker. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/worker.js:230:7)
at Job. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/job.js:706:12)
at multi_callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:89:14)
at Command.callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:116:9)
at normal_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
TypeError: collection.insert is not a function
at Object.parse (/Users/snrb/Documents/wikipedia-to-mongodb-master/src/parse.js:74:14)
at /Users/snrb/Documents/wikipedia-to-mongodb-master/src/worker.js:26:16
at Worker. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/worker.js:230:7)
at Job. (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/kue/lib/queue/job.js:706:12)
at multi_callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:89:14)
at Command.callback (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/lib/multi.js:116:9)
at normal_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/Users/snrb/Documents/wikipedia-to-mongodb-master/node_modules/redis-parser/lib/parser.js:574:12)
On Windows x64, using the -w option I get errors from the file stream/BZip stream. Adding an error handler:
var stream = fs.createReadStream(file, {bufferSize: 64 * 1024}).pipe(bz2());
stream.on('error', function (err) {
  console.log('unbzip2-stream error: ' + err);
  process.exit(1);
});
var xml = new XmlStream(stream);
I get
RangeError: Array buffer allocation failed
Works fine (but much more slowly) using single-threaded.
Any ideas?
Every time I run wp2mongo (without the --worker flag) it hangs after extracting 777 articles from the (English) Wikipedia bz2. When I re-run it without dropping the database, I get a bunch of duplicate insertion errors. On examining them, it appears that it's trying to insert every article in the last batch except for the last one, on Ibn-al-Haytham, which means the problem must be occurring either during that insertion, or between the Algiers article and the Ibn-al-Haytham article.
There's nothing suspicious happening--no excessive CPU or RAM usage, and I'm not getting errors of any kind (other than the duplicate insertion errors mentioned above).
I've tried the obvious things, rebooting and reinstalling, but without any errors being thrown, I'm at a loss for what else to try.
When running with the "markdown" option set, adjacent paragraphs are collapsed into one paragraph (i.e. there is no "\n" character between them). (dumpster-dive version 3.4.1)
Great work fixing this in 5.3.0, mate!
Importing EN now, do you know of other feeds people use with it?
Have you ever thought about doing something like this with the CommonCrawl?
Hi, thanks so much for all the recent updates to the parser and dumpster-dive!
I just ran into the following errors on version 3.6.0:
...
#0 - Art
#0 - Agnostida
#0 - Abortion
#0 - Abstract (law)
#0 - American Revolutionary War
#0 - Ampere
#0 - Algorithm
#0 - Annual plant
====error!===
Error: key Dante]] ''L'Inferno'', Canto IV. 131–135 must not contain '.'
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:914:19)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/lfs/raiders10/hdd/jrausch/git/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
#0 - Anthophyta
#0 - Atlas (disambiguation)
#0 - Mouthwash
#0 - Alexander the Great
...
Hi
I am exploring this repo as part of wikipedia data analysis.
Check if this is useful for you:
08:59 $ npm install -g wikipedia-to-mongodb/
npm WARN deprecated [email protected]: Jade has been renamed to pug, please install the latest version of pug instead of jade
I think some documentation would go a long way to make this repo awesome :) Let me know if I can help. I don't know JS; I'm a Java guy.
Sri Harsha
Dumpster-dive version 3.1.0
sudo nohup dumpster enwiki-latest-pages-articles.xml --batch_size 100 &
sudo nohup dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 &
The run stopped at a page count of 186,441; the script finished with no errors.
Dumpster-dive version 3.6.1
dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 --verbose true --workers 20
The script finished and the page count is 1,274,403, with no errors.
The number of pages is supposed to be over 5 million.
Certain disambiguation pages do not contain all of the expected links.
For example,
https://en.wikipedia.org/wiki/Fly_(disambiguation)
When looking in MongoDB for
title: "Fly (disambiguation)"
the type of this document is "page", and it only contains the intro in the text property. None of the other paragraphs/links have been imported.
Glad to see such great progress !
I downloaded the latest english Wikipedia dump (enwiki-latest-pages-articles.xml.bz2), as documented in the readme. Extracted via the archive manager (OS: Ubuntu 16.04).
Now loading into mongodb via command line (dumpster ./enwiki-latest-pages-articles.xml)
As the script is running, an error message is repeatedly displayed: "Error: Cannot find module '../../infobox/infobox' ..."
Within mongo shell, the enwiki database is not yet visible.
I need to know whether I should terminate this run (e.g. ctrl-c) and restart with the infobox option enabled. Ideally, I would like to have the infobox pages with the other articles in mongo.
Greatly appreciate the continued progress with enabling this type of capability !
gyp: Call to 'node -e "require('nan')"' returned exit status 127 while in binding.gyp. while trying to load binding.gyp
gyp ERR! configure error
gyp ERR! stack Error: gyp
failed with exit code: 1
gyp ERR! stack at ChildProcess.onCpExit (/usr/share/node-gyp/lib/configure.js:354:16)
gyp ERR! stack at emitTwo (events.js:87:13)
gyp ERR! stack at ChildProcess.emit (events.js:172:7)
gyp ERR! stack at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
gyp ERR! System Linux 4.4.0-116-generic
gyp ERR! command "/usr/bin/nodejs" "/usr/bin/node-gyp" "rebuild"
gyp ERR! cwd /usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/iconv
gyp ERR! node -v v4.2.6
gyp ERR! node-gyp -v v3.0.3
gyp ERR! not ok
/usr/local/lib
└── (empty)
npm ERR! Linux 4.4.0-116-generic
npm ERR! argv "/usr/bin/nodejs" "/usr/bin/npm" "install" "-g" "wikipedia-to-mongodb"
npm ERR! node v4.2.6
npm ERR! npm v3.5.2
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: node-gyp rebuild
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the iconv package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs iconv
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls iconv
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /home/burf2000/npm-debug.log
npm ERR! code 1
When importing into mongodb, helper.js chokes if there are keys beginning with a dollar sign. This seems to happen with section titles. From a quick search, I think the error message comes from this section on the Subway page:
Error: key $5 footlongs must not start with '$'
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:751:19)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at BSON.serialize (/home/ubuntu/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27)
at Query.toBin (/home/ubuntu/wikipedia-to-mongodb/node_modules/mongodb-core/lib/connection/commands.js:140:25)
at Pool.write (/home/ubuntu/wikipedia-to-mongodb/node_modules/mongodb-core/lib/connection/pool.js:984:23)
events.js:182
throw er; // Unhandled 'error' event
^
TypeError: Converting circular structure to JSON
at JSON.stringify (<anonymous>)
at Job.update (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:832:17)
at Job.reattempt (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:585:23)
at Job.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:616:14)
at Job.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/kue/lib/queue/job.js:555:7)
at normal_reply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:721:21)
at RedisClient.return_reply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:819:9)
at JavascriptRedisParser.returnReply (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:192:18)
at JavascriptRedisParser.execute (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis-parser/lib/parser.js:574:12)
at Socket.<anonymous> (/home/ubuntu/wikipedia-to-mongodb/node_modules/redis/index.js:274:27)
What's the best way to escape those keys?
An error occurred when importing a certain page into mongodb; maybe this is part of wtf_wikipedia's problem.
I am not familiar with node.js, and it would be great if you could help with this. Thanks!
Command Line error:
Bayern
Bavaria
Brandenburg
Federal Chancellor
Bundestag
Bundesrat
Bundesregierung
BMW
Blaue Reiter
Bisexual (disambiguation)
[somewhere]/js-wikipedia-to-mongodb/node_modules/wtf_wikipedia/index.js:534
if(!img.match(/^(image|file|fichier|Datei)/i)){
^
TypeError: Object 350 has no method 'match'
at Object.main [as parse] ([somewhere]/js-wikipedia-to-mongodb/node_modules/wtf_wikipedia/index.js:534:17)
at XmlStream.<anonymous> ([somewhere]/js-wikipedia-to-mongodb/index.js:28:26)
at XmlStream.emit (events.js:106:17)
at fn ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:132:14)
at FiniteAutomata.run ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/finite-automata.js:32:19)
at FiniteAutomata.leave ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/finite-automata.js:85:7)
at null.<anonymous> ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:434:8)
at emit (events.js:95:17)
at Parser.parse ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/node_modules/node-expat/lib/node-expat.js:23:22)
at parseChunk ([somewhere]/js-wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:513:14)
I have installed wikipedia-to-mongodb on an Ubuntu 16.04 EC2 machine (with Node.js version v4.2.6), and I get the following when trying to run the command:
ubuntu@ip-172-31-15-234:~$ wp2mongo afwiki-latest-pages-articles.xml.bz2
/usr/local/lib/node_modules/wikipedia-to-mongodb/bin/wp2mongo.js:2
let program = require('commander')
^^^
SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:374:25)
at Object.Module._extensions..js (module.js:417:10)
at Module.load (module.js:344:32)
at Function.Module._load (module.js:301:12)
at Function.Module.runMain (module.js:442:10)
at startup (node.js:136:18)
at node.js:966:3
Any ideas as to why this is failing would be greatly appreciated!
hi spencer, thanks for the amazing work, though I couldn't get it to work. Here's what I'm seeing:
wp2mongo ~/Downloads/afwiki-latest-pages-articles-multistream.xml.bz2
--- starting xml parsing --
=================done!=================
0 pages stored in db 'afwiki'
any ideas why?
Hello Spencer,
I am trying to load Wikipedia in Spanish and I got this error (twice). What can I do to finish the process?
$ node index.js eswiki-latest-pages-articles.xml
Andorra
Argentina
Geografía de Andorra
Demografía de Andorra
Comunicaciones de Andorra
Artes visuales
Agricultura
Astronomía galáctica
ASCII
Arquitectura
Anoeta
Ana María Matute
Agujero negro
Antropología
Anarquía
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6
Thanks in advance
Hi again. I noticed that the number of CPU workers is much lower than it is supposed to be, even when specifying --workers. The script seems to use an average of 5 CPU workers regardless of the size of the machine or the configuration available.
**top - 05:23:18 up 13:27, 4 users, load average: 6.98, 6.76, 5.55**
Tasks: 235 total, 6 running, 229 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34.7 us, 3.7 sy, 0.0 ni, 61.1 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 61851660 total, 570464 free, 13277764 used, 48003432 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 47567560 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17559 root 20 0 1531264 719272 25636 R 121.9 1.2 23:30.91 node
17565 root 20 0 981404 174652 25712 R 120.5 0.3 22:17.39 node
17567 root 20 0 1114476 313784 25476 R 119.5 0.5 23:01.75 node
17635 root 20 0 986044 179396 25524 R 119.5 0.3 22:20.90 node
Hi
I tested to download the latest en wiki pages.
(https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)
When I imported it only imported the id and title.
Tried to figure out why, and it seems like Wikipedia added an attribute to the <text> node.
It's now:
<text bytes="80055" xml:space="preserve">
Your code does not seem to handle the bytes attribute.
I tested adding it and it seems to work now.
Thought about sending a pull request, but was a little bit unsure about:
https://github.com/spencermountain/dumpster-dive/blob/master/src/worker/01-parsePage.js#L41
Maybe the <text> regexp should be even more "open" to any kind of attributes? Example:
<text[^>]*>([\s\S]*?)<\/text>
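A quick sanity check of the looser pattern proposed here, showing it tolerates the new bytes attribute (the test string below is invented for illustration):

```javascript
// The suggested attribute-agnostic pattern for the <text> element
const TEXT_RE = /<text[^>]*>([\s\S]*?)<\/text>/;

const xml = '<text bytes="80055" xml:space="preserve">hello [[world]]</text>';
const body = xml.match(TEXT_RE)[1];
console.log(body); // captures the wikitext body, whatever attributes are present
```

The same pattern still matches a bare <text> element, so it is backward-compatible with older dumps.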
Hey,
I've found a bug occurring when using multiple workers.
Take for example the tinywiki dataset.
When I run the following code:
const dumpster = require('dumpster-dive');
options = {
file: process.argv[2],
db: 'tinywiki',
skip_redirects: false,
skip_disambig: false,
batch_size: 1000,
workers: 4,
custom: function(doc) {
console.log(doc.title(), doc.text().length);
return {};
}
};
dumpster(options, () => console.log('Parsing is Done!'));
where I pass the script the path to the tinywiki XML file through argv[2], which is ./tests/tinywiki-latest-pages-articles.xml.
When I run it with 1 worker, I get the following print:
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 788
Redirect page 0
Disambiguation page 238
Bodmin 7921
In contrast to what I get when I run it with 4 workers (look at what happens to the Bodmin and Big Page text lengths):
Redirect page 0
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 0
Disambiguation page 238
Bodmin 0
I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, making their text unreadable by the parser?
Thanks!
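If the cause really is mid-page chunking, the fix direction would be to snap each worker's byte offset forward to a page boundary before it starts reading. A toy sketch of that idea (illustrative only, not dumpster-dive's actual splitter):

```javascript
// Snap an approximate split offset forward to the end of the enclosing
// </page>, so no worker starts reading in the middle of a page.
function snapToPageEnd(xml, approxOffset) {
  const idx = xml.indexOf('</page>', approxOffset);
  return idx === -1 ? xml.length : idx + '</page>'.length;
}

const dump = '<page>A</page><page>BB</page><page>CCC</page>';
const cut = snapToPageEnd(dump, 10); // an arbitrary mid-dump offset
console.log(dump.slice(0, cut)); // the chunk ends cleanly on a </page> boundary
```

A real implementation would scan a buffered window of the stream rather than the whole string, but the invariant is the same: every worker's chunk must start and end on page boundaries.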
Issue #19 concerned the number of articles in the English Wikipedia, and it was noted that a large proportion of them are redirects. I'd like to add a little extra insight, and make a suggestion.
As I mentioned in issue #21 , during the first 16 hours of the run (on my reasonably beefy PC portable workstation), the app got through nearly 7.4 million articles. In the two days since then, it's gotten through a bit over 2.3 million. In other words, the progress through the dump is not linear.
I assume the later articles don't require dramatically more parsing time than the earlier ones. Moreover, the Node.js and Mongo DB processes together are using over 3GB of RAM, but they're not taxing the machine (I can still use it normally for everything else I have to do). My conclusion, therefore, is that the big slowdown is coming from the increasing difficulty of inserting new records as the size of the collection and its indices grows.
Therefore, I suggest adding an option to skip redirects (and probably disambiguation pages as well): for some purposes, including mine, these aren't needed, and not having to index them would dramatically cut the run time.
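Redirect pages are cheap to detect before any heavy parsing, since their wikitext starts with a #REDIRECT directive, so such an option could short-circuit early. A minimal sketch of the check (English-language dumps; other wikis use localized keywords, and dumpster-dive/wtf_wikipedia may detect redirects differently):

```javascript
// English redirect pages begin with "#REDIRECT [[Target]]" (case-insensitive);
// detecting that up-front lets an importer skip them before parsing/indexing.
function isRedirect(wikitext) {
  return /^\s*#redirect/i.test(wikitext);
}

console.log(isRedirect('#REDIRECT [[Ampere]]')); // true
console.log(isRedirect("'''Ampere''' is the SI unit of current.")); // false
```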
After restoring a dump of 2018-11-20, I'm not able to find any page with "{{pov" in its content. I'm trying to find articles marked as having a Point Of View issue, which Wikipedia marks with a tag that starts with "{{pov".
It's strange that the 15GB dump restored only 750k articles, since the English Wikipedia currently has over 5 million articles, but among those 750k at least some should be marked with the tag.
Am I doing something wrong? Is there any param to import all the things? For instance, when importing to a MySQL database with another tool, I already found articles with the desired text within the first 6k records. It looks like all the special tags are being stripped out.
Many thanks again for building and sharing such a convenient (and fast) parser !
This afternoon, I decided to explore loading Wiktionary via dumpster-dive and was pleasantly surprised at how quickly the workers loaded the respective wiki pages (i.e. 13.8 minutes).
The following is the summary provided at the end of the run:
#1 +1,000 pages - 27ms - "lautioris"
#0 +898 pages - 29ms - "meilėmis"
💪 worker #0 has finished 💪
- 1 workers still running -
#1 +140 pages - 4ms - "irascebare"
💪 worker #1 has finished 💪
- 0 workers still running -
👍 closing down.
-- final count is 5,702,608 pages --
took 13.8 minutes
🎉
When I checked the Wiktionary statistics page (https://en.wiktionary.org/wiki/Wiktionary:Statistics), the following statistics were listed for Wiktionary:
Wiktionary:Statistics
Number of entries: 5,721,450
Number of total pages: 6,322,904
Number of encoded languages: 8052
Number of uploaded files: 29
Number of user accounts: 3,446,188
Number of administrators: 98
Approximately 19K (i.e. 5,721,450 - 5,702,608 = 18,842) entries seem not to have been parsed. For the current task this is not a pressing issue, but I realized that I should provide feedback.
Ideally, I would like to develop an ability to utilize MongoDB to easily extract various portions of Wiktionary pages (e.g. synonyms), but the parsed results appear to have a variety of structurally different results. For example, from a quick spot check, there does not seem to be a consistent mapping for the section titles. Thus, the initial thought is that the parsed output is of limited value until I can figure out how to build the desired types of queries.
Look forward to feedback/comments and suggestions for how I might be able to utilize the parsed content.
Thanks again !
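As a starting point for the synonym-extraction idea, one approach is to query against whatever section structure the stored documents actually have. This sketch assumes (and given the inconsistent mappings noted above, it is only an assumption) that each document carries a sections array whose items have title and text fields:

```javascript
// Pull out sections whose title matches, case-insensitively. Against a real
// collection this filtering could instead happen server-side, e.g. with an
// $elemMatch query on the sections array.
function findSections(doc, wanted) {
  return (doc.sections || []).filter(
    (s) => (s.title || '').toLowerCase() === wanted.toLowerCase()
  );
}

// A hypothetical Wiktionary-style entry
const entry = {
  title: 'happy',
  sections: [
    { title: 'Etymology', text: '...' },
    { title: 'Synonyms', text: 'cheerful, content' },
  ],
};
console.log(findSections(entry, 'synonyms')[0].text); // "cheerful, content"
```

Spot-checking a sample of real documents this way would also reveal how consistent (or not) the section titles actually are across entries.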
I seem to have wp2mongo running just fine (under Windows 10) with a bz dump I had already downloaded. However, when I try to do things with Redis, I run into a problem. A script called by worker.js is throwing the following error, apparently on every article (none of them get into the db):
C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\infobox\index.js:32
let inside = tmpl.match(/^{{nowrap|(.*?)}}$/)[1];
^TypeError: Cannot read property '1' of null
at C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\infobox\index.js:3
at Array.forEach ()
at Object.parse_recursive [as infobox] (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_w
\infobox\index.js:19:11)
at main (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\parse\index.js:46:
at Object.parse (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\wtf_wikipedia\src\index.js:4
at Object.parse (C:\code\JavaScript\wikipedia-to-mongodb\src\doPage.js:50:18)
at C:\code\JavaScript\wikipedia-to-mongodb\src\worker.js:25:14
at Worker. (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\worker.j
at Job. (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\job.js:706:
at multi_callback (C:\code\JavaScript\wikipedia-to-mongodb\node_modules\redis\lib\multi.js:89:14)
(The caret in the third line should point to the "[1]".)
Evidently this sort of error with regexes is not uncommon, but I can't figure out why I'm getting it here.
EDIT: I thought wp2mongo was working fine, but it's reliably hanging on article number 777. After it took literally days to get the app compiled (apparently due to a corrupt file, followed by the thing looking for a dependency in the wrong place), I'm starting to think this is the hardest "easy way" I've ever experienced.
Hey,
I'm running a JS script containing the following code:
const dumpster = require('dumpster-dive');
options = {
file: process.argv[2],
db: 'tinywiki',
skip_redirects: false,
skip_disambig: false,
batch_size: 1000,
custom: function(doc) {
console.log(doc.title());
return {};
}
};
dumpster(options, () => console.log('Parsing is Done!'));
where I pass the script the path to the tinywiki XML file through argv[2], which is ./tests/tinywiki-latest-pages-articles.xml.
The only titles being printed are Toronto, Royal Cinema, and Belleville, while all the others print undefined: 3 out of 8 pages (even the duplicate pages should print at this point). However, the DB is populated with the correct titles and pageIDs.
BUT! If I run the same code without a custom function, it populates the DB correctly, i.e. with the correct titles and everything (text, etc.), which leads me to suspect that something wrong is passed into the custom function via the doc object. The code in dumpster-dive/src/worker/02-parseWiki.js (line 43 in c11587c) just adds the title afterwards and pushes into the DB, making it look as if the title is accessible, when in fact it ISN'T accessible from within the custom function.
I have space limitations on my new computer: with all the other binaries installed, it's struggling to download and extract Wikipedia. Any ideas on how to set this up in a cloud environment like Amazon?
Does it modify the existing file, or output to a different file?
I followed the steps to download the English wiki dump and completed everything up to the extraction process, but while adding data into mongo it shows this error:
TypeError: data[k].text is not a function
at Object.keys.forEach (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/misc.js:126:25)
at Array.forEach ()
at Object.subject bar (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/misc.js:125:23)
at doTemplate (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:35:31)
at templates.top.forEach (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:69:12)
at Array.forEach ()
at Object.allTemplates [as templates] (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/templates/index.js:68:17)
at doSection (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/section/index.js:21:16)
at Object.splitSections [as section] (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/section/index.js:52:15)
at main (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/index.js:47:22)
I also tried dumpster ./my/wikipedia-dump.xml --images false, and it gives the same "TypeError: data[k].text is not a function" stack trace.
I'm just getting this error message when I try to run dumpster ./enwiki-20180601-pages-articles.xml
TypeError: i.json is not a function
TypeError: i.json is not a function
at data.images.doc.images.map.i \dumpster-dive\node_modules\wtf_wikipedia\src\output\json\index.js:43:43
After reaching "Disruptor Beam 9192071" I get this message. I have 32GB RAM. Is it a special finishing-up process that failed, or just that article?
Should I do something, or have all the articles been moved to mongodb? :)
Thanks!
<--- Last few GCs --->
51669740 ms: Scavenge 1394.1 (1457.7) -> 1394.1 (1457.7) MB, 0.2 / 0 ms (+ 2.4 ms in 2 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
51670266 ms: Mark-sweep 1394.1 (1457.7) -> 1284.9 (1457.7) MB, 525.2 / 0 ms (+ 3.5 ms in 4 steps since start of marking, biggest step 2.2 ms) [last resort gc].
51670912 ms: Mark-sweep 1284.9 (1457.7) -> 1284.9 (1457.7) MB, 646.1 / 0 ms [last resort gc].<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x3872217b4629
1: RegExpExecNoTests [native regexp.js:~59] [pc=0x11185f437ec3] (this=0x3872217d8991 ,j=0x2f5829f089b9 ,k=0x2f5829f08991 <String[104]: Disruptor Beam is a developer of mobile and social game products based in [[Framingham, Massachusetts]].>,n=0)
2: match(aka match) [native string.js:~118] [pc=0x11185f437ce3] (this=0x2f5829f08991 <String[104]: Disruptor Beam ...FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
Is there a way to include the pageid as a field in the resulting Mongo documents? This is very useful information to have and its annoying to have to add it via postprocessing.
I noticed that my current run is progressing and output keeps appearing; however, the number of pages hasn't increased for about half an hour now, and it keeps giving duplicate warnings. So I selected the page title that is stuck and grepped for it in the nohup output; here are the results. Each line starts with a number that indicates the line number in nohup.out. It seems that the same page is processed by multiple cores: you will notice the page "American wigeon" is processed multiple times by different cores.
sudo grep -n 'American wigeon' nohup.out
15517: #8 +100 pages - 193ms - "American wigeon"
15519: #12 +0 pages - 172ms - "American wigeon"
15538: #11 +0 pages - 190ms - "American wigeon"
31954: #9 +0 pages - 210ms - "American wigeon"
31958: #5 +0 pages - 204ms - "American wigeon"
33109: #3 +0 pages - 157ms - "American wigeon"
33121: #1 +0 pages - 175ms - "American wigeon"
43075: #14 +0 pages - 324ms - "American wigeon"
43077: #4 +0 pages - 344ms - "American wigeon"
43093: #0 +0 pages - 169ms - "American wigeon"
43247: #6 +0 pages - 173ms - "American wigeon"
43253: #15 +0 pages - 154ms - "American wigeon"
43257: #10 +0 pages - 173ms - "American wigeon"
45677: #2 +0 pages - 206ms - "American wigeon"
45693: #13 +0 pages - 198ms - "American wigeon"
45719: #7 +0 pages - 212ms - "American wigeon"
Salute all contributors to this amazing work!
I just finished parsing enwiki-20180901-pages-articles.xml. It seems to have worked properly, with 5,081,528 records in MongoDB (which is as expected). However, when I query the DB (e.g. { "title": "London" }), it seems that some important records are missing, e.g. London, China, India, etc. They should be contained in the original dump (as the index file indicates), but are somehow missing from the parsed DB.
During the parsing, errors came up from time to time, as shown in the screenshot, maybe related to the issue:
Just wondering if anyone has had a similar experience or has any thoughts on possible causes?
Thanks! Once again, great work!
Hi all,
There's a problem with redirection pages.
As it stands, in the worker/index.js file, the wiki page is parsed using the parsePage function (dumpster-dive/src/worker/index.js, line 23 in e7c6b83). parsePage calls the shouldSkip function, which returns true if the page is a redirection page; in that case, parsePage returns null to the calling function (in index.js). Neither shouldSkip nor parsePage checks whether the skip_redirects option is true or false, so redirection pages are ignored no matter how skip_redirects is set. If I force the shouldSkip function's return value to false, in order not to skip redirects, the redirection pages are processed like regular pages. This behavior seems unintuitive to me. I think a redirection page should instead get a special "redirection" field pointing to the redirected-to page. This could be very helpful, since I'd like to treat the redirection page just like the redirected-to page in terms of the text of the page, etc., so I'd like to be able to get to the redirected-to page from the redirection page.

Node v11.14.0
NPM v6.9.0
Mongod v3.4.20
Ubuntu Linux 16.04
dumpster-dive v5.1.2
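A minimal sketch of what the suggested "redirection" field could be built from (`parseRedirect` is a hypothetical helper operating on raw wikitext, not part of dumpster-dive):

```javascript
// a redirect page's wikitext starts with "#REDIRECT [[Target]]" (case-insensitive);
// extract the target title, dropping any section anchor or display text
function parseRedirect(wikitext) {
  const m = wikitext.match(/^#REDIRECT\s*:?\s*\[\[([^\]|#]+)/i);
  return m ? m[1].trim() : null;
}

console.log(parseRedirect('#REDIRECT [[Abraham Lincoln]]')); // "Abraham Lincoln"
console.log(parseRedirect('plain article text'));            // null
```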
I'm getting the following error repeatedly:
====DB write error (worker 0)===
{ MongoError: BSONObj size: 16857622 (0x1013A16) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.<anonymous> (/home/ben/.nvm/versions/node/v11.14.0/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:453:17)
at Socket.emit (events.js:193:13)
at addChunk (_stream_readable.js:295:12)
at readableAddChunk (_stream_readable.js:276:11)
at Socket.Readable.push (_stream_readable.js:231:10)
at TCP.onStreamRead (internal/stream_base_commons.js:150:17)
ok: 0,
errmsg:
'BSONObj size: 16857622 (0x1013A16) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"',
code: 10334,
codeName: 'Location10334',
name: 'MongoError',
[Symbol(mongoErrorContextSymbol)]: {} }
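MongoDB rejects any document (or insert command) larger than 16 MB, which is what this error reports. A rough guard, using JSON size as an approximation of BSON size (`fitsInMongo` is a hypothetical helper; a real fix would also shrink the insert batch size or truncate oversized pages):

```javascript
const MAX_BYTES = 16 * 1024 * 1024;

// approximate the document's BSON size via its JSON encoding
function fitsInMongo(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8') < MAX_BYTES;
}

console.log(fitsInMongo({ title: 'London', text: 'ok' }));               // true
console.log(fitsInMongo({ title: 'Big', text: 'x'.repeat(MAX_BYTES) })); // false
```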
The process runs but only saves a single page in Mongo throughout -- which it seems to write at the very beginning of the process.
Seeing a lot of messages like:
current: 1 pages - "true"
-- 500 duplicate pages --
#5 +0 pages - 1s - "true"
-- 500 duplicate pages --
#2 +0 pages - 900ms - "true"
-- 500 duplicate pages --
#1 +0 pages - 1s - "true"
-- 500 duplicate pages --
#4 +0 pages - 944ms - "true"
-- 500 duplicate pages --
#2 +0 pages - 1s - "true"
current: 1 pages - "true"
─── worker #0 ───:
+4,500 pages
-7,350 redirects
-206 disambig
0 ns
─── worker #2 ───:
+9,500 pages
-21,113 redirects
-521 disambig
0 ns
─── worker #3 ───:
+4,500 pages
-7,353 redirects
-206 disambig
0 ns
Hi there, I've been a big fan of this ever since it was called wp2mongo.
I am sure that --plaintext=true used to give me the plaintext of the page.
When I run it now (with an older wiki download, as the current format doesn't work, per other bugs),
I get _id, title, categories, sections, pageID. What have I missed?
Hi, thanks for making this script! It is extremely simple and works perfectly with a small xml dataset (e.g. the af wikipedia), but with the biggest one it is quite slow, and one error stops script execution.
I guess the bottleneck is wikipedia.parse, because it is run synchronously on one CPU and it is blocking. Once I commented out wikipedia.parse, it sped up by as much as 10x on my computer.
My solution would be to use a job queue for running wikipedia.parse and saving to mongodb on all CPUs.
I would like to make such an improvement and open a PR for your review, but please define a license for your script first.
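As a sketch of the idea (`assignJobs` is a hypothetical helper; the real work would be dispatching each bucket to a child process or worker thread that runs the parse and the mongo writes):

```javascript
// round-robin distribution of pages across N parallel workers
function assignJobs(pages, nWorkers) {
  const buckets = Array.from({ length: nWorkers }, () => []);
  pages.forEach((page, i) => buckets[i % nWorkers].push(page));
  return buckets;
}

console.log(assignJobs(['a', 'b', 'c', 'd', 'e'], 2));
// [ [ 'a', 'c', 'e' ], [ 'b', 'd' ] ]
```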
I have this issue while running the script on the fa wikipedia dump:
events.js:72
throw er; // Unhandled 'error' event
^
TypeError: Cannot read property '0' of undefined
at f (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bit_iterator.js:24:34)
at Object.bzip2.decompress (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bzip2.js:272:13)
at decompressBlock (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:29:28)
at decompressAndQueue (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:46:20)
at Stream.end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:82:23)
at _end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:65:9)
at Stream.stream.end (/usr/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:74:5)
at ReadStream.onend (_stream_readable.js:485:10)
at ReadStream.g (events.js:180:16)
at ReadStream.emit (events.js:117:20)
Hey,
I noticed that the generated html documents seem to be missing a bunch of linked/referenced information.
For instance, here are some snippets of the dumpster-generated Wikipedia article for Abraham Lincoln and the original reference:
In case it's convenient, here's the relevant json and the converted html (both generated with 3.1.0).
Abraham_Lincoln_html.txt
Cheers
Every time I run git clone git@github.com:spencermountain/wikipedia-to-mongodb.git, I get a permission error. I've disabled selinux and turned off firewalls.
I'm not sure what I'm doing wrong.
Hi,
I have a problem importing the example wiki you gave. When I launch the command "node index.js af-wiki..." I get, after some time, the following message:
{ [MongoError: server localhost:27017 sockets closed]
name: 'MongoError',
message: 'server localhost:27017 sockets closed' }
Also, my mongo shell shows 9 simultaneous connections. Is that normal?
Thank you
Hi, I've tried to parse the current.xml dump from this wiki but while it seems to work, the actual Mongo Documents are very sparsely populated.
{
"_id" : "Doppelgängers",
"title" : "Doppelgängers",
"categories" : [ ],
"sections" : [ ],
"coordinates" : [ ],
"infoboxes" : [ ],
"images" : [ ],
"references" : [ ],
"pageID" : "3200"
}
I assume this has to do with the xml dump being non-standard, or at least not matching the Wikipedia format. I thought I'd raise an issue in case this is an easy fix in the code.
Hi,
Since I've been using the new version (4.0.1), I've been experiencing errors when keys (titles, I guess? unique Id?) contain periods, for example:
Error: key entertainment ed. must not contain '.'
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:914:19)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:728:17)
at serializeObject (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:348:18)
at serializeInto (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/parser/serializer.js:938:17)
at BSON.serialize (/usr/lib/node_modules/dumpster-dive/node_modules/bson/lib/bson/bson.js:63:28)
This happens both with a custom function and by simply running dumpster.
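MongoDB field names may not contain '.', so one workaround is to sanitize keys before insertion; a minimal sketch (`sanitizeKeys` is a hypothetical helper, not part of dumpster-dive):

```javascript
// recursively replace '.' in object keys so BSON serialization accepts them
function sanitizeKeys(value) {
  if (Array.isArray(value)) return value.map(sanitizeKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k.replace(/\./g, '_'), sanitizeKeys(v)])
    );
  }
  return value;
}

console.log(sanitizeKeys({ 'entertainment ed.': { 'a.b': 1 } }));
// { 'entertainment ed_': { a_b: 1 } }
```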
Ran the process with the file and db on Windows. The process does not end; it hangs after showing 'worker 6 has finished'.
There are messages appearing like: current: 107,920 pages - "undefined"
When I try to end the process, the exception below occurs.
one sec, cleaning-up the workers...
--uncaught top-process error--
ProcessTerminatedError: cancel after 0 retries!
at ...\pool.js:111:39
at Array.forEach ()
at WorkerNodes.handleWorkerExit (...\pool.js:110:14)
at Worker. (...\pool.js:160:44)
at Worker.emit (events.js:203:13)
at WorkerProcess. (...\worker.js:39:18)
at Object.onceWrapper (events.js:291:20)
at WorkerProcess.emit (events.js:203:13)
at ChildProcess. (...\worker\process.js:42:41)
at Object.onceWrapper (events.js:291:20) {
name: 'ProcessTerminatedError',
message: 'cancel after 0 retries!'
Trying to parse the big 65.7 GB English wiki, enwiki-20190101-pages-articles-multistream.xml.
Parsing goes well, data gets saved to Mongo, but at some point, every line is the same.
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
about 200 times, and nothing else after that.
If I press CTRL + C to break the process, I suddenly get this error 8 times (one per worker, I guess):
one sec, cleaning-up the workers...
--uncaught top-process error--
{ ProcessTerminatedError: cancel after 0 retries!
at tasks.filter.forEach.task (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:111:39)
at Array.forEach ()
at WorkerNodes.handleWorkerExit (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:110:14)
at Worker.worker.on.exitCode (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:160:44)
at Worker.emit (events.js:182:13)
at WorkerProcess.Worker.process.once.code (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker.js:39:18)
at Object.onceWrapper (events.js:273:13)
at WorkerProcess.emit (events.js:182:13)
at ChildProcess.WorkerProcess.child.once.code (C:\Users\Jeremy\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker\process.js:42:41)
at Object.onceWrapper (events.js:273:13)
name: 'ProcessTerminatedError',
message: 'cancel after 0 retries!' }
I launched the parser 4 times; it skips what was done previously, then gets to this again:
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
current: 471,877 pages - "Shangluo"
same error, and stops here every time.