bwbaugh / wikipedia-extractor

This is a mirror of the script by Giuseppe Attardi and contains history from before the official repo started: https://github.com/attardi/wikiextractor --- Extracts and cleans text from a Wikipedia database dump and stores the output in a number of similar-sized files in a given directory.

Home Page: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor


wikipedia-extractor's Introduction

This is a mirror repo for the script by Giuseppe Attardi, and contains history before the official repo started.

Please refer to the official repo if there are any issues: https://github.com/attardi/wikiextractor


Wikipedia Extractor

Introduction

The project uses the Italian Wikipedia as a source of documents for several purposes: as training data, and as a source of data to be annotated.

Each month, the Wikipedia maintainers provide an XML dump of all documents in the database: a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics and service lists.

Wikipedia dumps are available from the Wikipedia database download page.

The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.

Each document in the dump of the encyclopedia is represented as a single XML element, encoded as illustrated in the following example from the document titled Armonium:

 <page>
 <title>Armonium</title>
 <id>2</id>
 <timestamp>2008-06-22T21:48:55Z</timestamp>
 <username>Nemo bis</username>
 <comment>italiano</comment>
 <text xml:space="preserve">thumb|right|300 px

 L'armonium (in francese, harmonium) è uno
  strumento musicale azionato con una tastiera, detta
 manuale. Sono stati costruiti anche alcuni armonium con due manuali.

 ==Armonium occidentale==
 Come l'organo, l'armonium è utilizzato tipicamente in
 chiesa, per l'esecuzione di musica sacra, ed è
 fornito di pochi registri, quando addirittura in certi casi non ne possiede
 nemmeno uno: il suo timbro è molto meno ricco di quello
 organistico e così pure la sua estensione.

 ...

 ==Armonium indiano==
 Template:S sezione

 == Voci correlate ==
 *Musica
 *Generi musicali</text>

For this document the Wikipedia extractor produces the following plain text:

<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con
una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con
due manuali.

Armonium occidentale.
Come l'organo, l'armonium è utilizzato tipicamente in chiesa, per l'esecuzione
di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello
organistico e così pure la sua estensione.
...
</doc>
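
Each output file simply concatenates many such <doc ...> blocks, so downstream code needs only a lightweight reader. Below is a minimal sketch in Python, assuming only the <doc id="..." url="..."> ... </doc> framing shown above (attribute names and order may differ across versions of the script):

 import re

 # One <doc ...> element per match; re.DOTALL lets '.' cross newlines.
 DOC_RE = re.compile(
     r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>\n'
     r'(?P<text>.*?)\n</doc>', re.DOTALL)

 def iter_docs(path):
     # Reads the whole file at once; fine for a sketch, stream for big files.
     with open(path) as f:
         for m in DOC_RE.finditer(f.read()):
             yield m.group('id'), m.group('url'), m.group('text')

 for doc_id, url, text in iter_docs('text.xml'):
     title = text.split('\n', 1)[0]  # the first line carries the title
     print('%s\t%s' % (doc_id, title))

Here 'text.xml' refers to the combined output produced in the Example of Use below.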

The extraction tool is written in Python and requires no additional libraries. It aims to achieve high accuracy in the extraction task.

Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underline, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are often misused (unclosed tags, wrong attributes, etc.), so the extractor deploys several heuristics to circumvent such problems. Template expansion is a feature the extractor currently lacks.

Description

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the document format shown above.

Usage:

 WikiExtractor.py [options]

Options:

 -c, --compress        : compress output files using bzip2
 -b, --bytes= n[KM]    : put at most n bytes per output file (default 500K)
 -B, --base= URL       : base URL for the Wikipedia pages
 -o, --output= dir     : place output files in specified directory (default
                         current)
 -l, --link            : preserve links
 --help                : display this help and exit

Example of Use

The following commands illustrate how to apply the script to a Wikipedia dump:

> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> bzcat itwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted -

In order to combine the whole extracted text into a single file one can issue:

> find extracted -name '*bz2' -exec bunzip2 -c {} \; > text.xml
> rm -rf extracted

wikipedia-extractor's People

Contributors

bwbaugh, senorflor

wikipedia-extractor's Issues

Max Template Recursion exceeded!

Greetings,
I am parsing the Italian Wikipedia and I get the warning "Max template recursion exceeded!", and then the script stops.

I am wondering what can be done about this. I am not sure I really understand what is going on here. I was reading the script around line 565 and I still don't understand what is happening.
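
For context, the script caps how deeply templates may expand and emits that warning when a page's templates nest past the cap. The following is an illustrative sketch of the pattern only, not the actual code at line 565; the limit value, names, and toy template table here are assumptions:

 import logging
 import re

 maxTemplateRecursionLevels = 30   # assumed cap; the script's value may differ
 templates = {'S sezione': ''}     # toy template table for the sketch

 def expandTemplates(wikitext, depth=0):
     # Depth guard: give up rather than expand without bound when a
     # template's expansion keeps introducing further templates.
     if depth >= maxTemplateRecursionLevels:
         logging.warning('Max template recursion exceeded!')
         return wikitext
     expanded = re.sub(r'\{\{([^{}]*)\}\}',
                       lambda m: templates.get(m.group(1).strip(), ''),
                       wikitext)
     if expanded == wikitext:      # nothing left to expand
         return expanded
     return expandTemplates(expanded, depth + 1)

A page only trips the guard when each round of expansion keeps producing new templates, which is why the error appears on specific pages deep into the dump rather than immediately.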

keeping links and sections

It works fine without the -l option, but when trying to run:

python WikiExtractor.py -l wiki.xml

I get:

INFO: Preprocessing dump to collect template definitions: this may take some time.
INFO: Starting processing pages from wiki.xml.
INFO: Using 2 CPUs.
INFO: 1 Ajalugu
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "WikiExtractor.py", line 2335, in run
job.extract(self._splitter)
File "WikiExtractor.py", line 407, in extract
text = clean(self, text)
File "WikiExtractor.py", line 1877, in clean
text = replaceInternalLinks(text)
File "WikiExtractor.py", line 1486, in replaceInternalLinks
res += text[cur:s] + makeInternalLink(title, label) + trail
File "WikiExtractor.py", line 1764, in makeInternalLink
return '<a href="%s">%s</a>' % (urllib.quote(title.encode('utf-8')), anchor)
NameError: global name 'anchor' is not defined

Is it possible to keep the redirects?

Hello Baugh.

I am using your source code to filter the Wikipedia dump and it works really great. I just have a question concerning the redirect links. From what I understand, redirects are surrounded by [[ ]].
For example: [[Legal positivism]], where Legal positivism is a redirect link.

The double brackets are filtered out and removed in the final output. Which part of the Python code should I modify in order to keep the redirects?
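
Before modifying the code, it may be worth noting that the script already has a documented switch for this: the -l / --link option listed in the usage above preserves links. Something like the following, mirroring the earlier example, keeps the links in the output:

 > bzcat itwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -l -o extracted -

Whether redirect pages themselves survive extraction is a separate question and may depend on the script version.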

Error on English wikipedia dump

After two days of work on an English Wikipedia dump (enwiki-20150602-pages-articles.xml), I got the following stack trace:

File "wiki_extractor.py", line 677, in expandTemplate
instantiated = template.subst(params, self)
File "wiki_extractor.py", line 301, in subst
return ''.join([tpl.subst(params, extractor, depth) for tpl in self])
File "wiki_extractor.py", line 358, in subst
res = extractor.expandTemplates(defaultValue)
File "wiki_extractor.py", line 458, in expandTemplates
res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
File "wiki_extractor.py", line 677, in expandTemplate
instantiated = template.subst(params, self)
File "wiki_extractor.py", line 301, in subst
return ''.join([tpl.subst(params, extractor, depth) for tpl in self])
File "wiki_extractor.py", line 351, in subst
paramName = self.name.subst(params, extractor, depth+1)
File "wiki_extractor.py", line 294, in subst
logging.debug('subst tpl (%d, %d) %s', len(extractor.frame), depth, self)
File "/usr/lib/python2.7/logging/init.py", line 1622, in debug
root.debug(msg, _args, *_kwargs)
File "/usr/lib/python2.7/logging/init.py", line 1139, in debug
if self.isEnabledFor(DEBUG):
File "/usr/lib/python2.7/logging/init.py", line 1351, in isEnabledFor
return level >= self.getEffectiveLevel()
RuntimeError: maximum recursion depth exceeded

and the script stops working.
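
The RuntimeError here is Python's own interpreter recursion limit (1000 frames by default) being exhausted by deeply nested template expansion. One possible workaround, not a real fix, is to raise the limit near the top of the script; the value below is arbitrary:

 import sys

 # Default limit is 1000; deeply nested templates can exceed it.
 # Raising it trades the RuntimeError for more C-stack usage.
 sys.setrecursionlimit(10000)

Pathological templates can still blow past any fixed limit, so this only postpones the failure in the worst case.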

Readme does not work!

The example invocation from the README is invalid:

bzip2 -c -d "plwiki-latest-pages-articles.xml.bz2" | WikiExtractor.py -o -

It gives unclear information instead of parsing.

What does it mean?

$bzip2 -c -d "C:\Users\Crezary Wagner\Documents\GitHub\word2vec\plwiki-latest-pages-articles.xml.bz2" | WikiExtractor.py -o - -
Traceback (most recent call last):
File "C:\Users\Crezary Wagner\Documents\GitHub\wikipedia-extractor\WikiExtractor.py", line 2587, in
main()
File "C:\Users\Crezary Wagner\Documents\GitHub\wikipedia-extractor\WikiExtractor.py", line 2583, in main
args.compress, args.processes)
File "C:\Users\Crezary Wagner\Documents\GitHub\wikipedia-extractor\WikiExtractor.py", line 2297, in process_dump
raise ValueError("to use templates with stdin dump, must supply explicit template-file")
ValueError: to use templates with stdin dump, must supply explicit template-file

bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: No error
Input file = C:\Users\Crezary Wagner\Documents\GitHub\word2vec\plwiki-latest-pages-articles.xml.bz2, output file
= (stdout)

What templates???
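
The error message makes more sense read against the script's usage text (quoted in the "run error" issue below), which lists a --templates TEMPLATES option: with the dump arriving on stdin, the script cannot make its template-collecting preprocessing pass, so template definitions must come from a file. A plausible pair of invocations, assuming the option saves and reuses collected definitions as its name suggests (untested sketch):

 > WikiExtractor.py --templates templates.dat -o extracted plwiki-latest-pages-articles.xml.bz2
 > bzip2 -c -d plwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py --templates templates.dat -o extracted -

The first run reads the dump file directly and can collect templates itself; the second reuses the saved definitions so stdin works.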

Syntax Error

python WikiExtractor.py
File "WikiExtractor.py", line 877
afterPat = { o:re.compile(openPat+'|'+c, re.DOTALL) for o,c in izip(openDelim, closeDelim)}
^
SyntaxError: invalid syntax

Any idea what I am doing wrong here? I am using Python 2.6.1.
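
The caret points at a dict comprehension, syntax that Python only gained in 2.7; Python 2.6 rejects the file at compile time, which is also why the error appears no matter which arguments are given. If upgrading is not an option, that one line can be rewritten for 2.6 with dict() and a generator expression, which is behaviorally equivalent:

 # Python 2.6-compatible equivalent of the dict comprehension:
 afterPat = dict((o, re.compile(openPat + '|' + c, re.DOTALL))
                 for o, c in izip(openDelim, closeDelim))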

NOTE: This is a mirror repo.

This is a mirror repo (mainly for history before the author moved to github).

If you have an issue, please consider making an issue on the official repo: https://github.com/attardi/wikiextractor

It looks like the official repo sometimes has a newer version of the script than the wiki website that this repo is mirroring, so the problem might already be fixed there.


I will accept pull requests on the README file for this repo, since it is not automatically mirrored from the wiki site.

'maximum template recursion' error after a few hours

I am running the Python script on an English Wikipedia XML dump file to extract text for the Wikipedia articles. I am doing everything as instructed. It seems a fairly straightforward task.

The script runs fine for a while and extracts some articles. But after a few hours, it starts giving the error 'Maximum template recursion'. I don't understand why.

Invalid syntax error when running

python WikiExtractor.py -cb 250K -o extracted itwiki-20150316-pages-articles1.xml.bz2
File "WikiExtractor.py", line 860
afterPat = { o:re.compile(openPat+'|'+c, re.DOTALL) for o,c in izip(openDelim, closeDelim)}
^
SyntaxError: invalid syntax

On the latest version, running Python 2.6.6 on CentOS 6.
Of note, it does this when given no arguments, or -h, or anything else as well. (This looks like the same Python 2.6 dict-comprehension problem described under Syntax Error above.)

OSError 12: cannot allocate memory

When I use wikipedia-extractor to process the newest enwiki dump (about 11.7 GB) on Ubuntu 14.04 with 2 GB of memory and 700 GB of disk, I get the error after it has processed about 17 million articles.

Extraction bug

The article http://en.wikipedia.org/wiki/Alabama has the following text:
"At 1,300 miles (2,100 km), Alabama has one of the longest navigable inland waterways in the nation"
which is extracted as follows:
At , Alabama has one of the longest navigable inland waterways in the nation.
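
The dropped measurement is the signature of an unexpanded template. In wiki markup, such figures are normally produced by the {{convert}} template, so the source of that sentence plausibly reads (assumed reconstruction, not checked against the dump):

 At {{convert|1300|mi|km}}, Alabama has one of the longest navigable
 inland waterways in the nation

When template expansion is unsupported or fails (see the note in the introduction that template expansion is a missing feature), the whole {{...}} call is discarded, leaving the dangling "At ,".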

WikiExtractor crashes consistently during extraction in folder JK with no error in output or log

source file:
http://dumps.wikimedia.org/enwiki/20140502/enwiki-20140502-pages-articles-multistream.xml.bz2

command:
setsid nohup cat wiki.xml | ./WikiExtractor.py -c -l -o extracted > /dev/null 2>&1 &

I expect the extracted folder to list subfolders from AA to ZZ, but it always stops at JK.

  • size when stopped was 3.5 GB.
  • enwiki-20140502-pages-articles-multistream.xml.bz2 is more than 10 GB.

It is unclear why it stops.

Error on Farsi wikipedia dump: "NameError: global name 'templatePrefix' is not defined"

I encountered the problem after running WikiExtractor.py (with Python 2.7 on Windows 8.1 x64) on a Farsi wiki dump.
Can you explain why this error occurs?

python h:\wiki\WikiExtractor.py h:\wiki\fawiki-20150602-pages-articles.xml.bz2 -cb 5M -o h:\wiki\extracted
INFO: Preprocessing 'h:\wiki\fawiki-20150602-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Preprocessed 1800000 pages
INFO: Preprocessed 1900000 pages
INFO: Preprocessed 2000000 pages
INFO: Preprocessed 2100000 pages
INFO: Preprocessed 2200000 pages
INFO: Loaded 109314 templates in 685.3s
INFO: Starting page extraction from h:\wiki\fawiki-20150602-pages-articles.xml.bz2.
INFO: Using 1 extract processes.
Process Process-2:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "h:\wiki\WikiExtractor.py", line 2427, in extract_process
Extractor(*job[:3]).extract(out) # (id, title, page)
File "h:\wiki\WikiExtractor.py", line 423, in extract
text = clean(self, text)
File "h:\wiki\WikiExtractor.py", line 1896, in clean
text = extractor.expandTemplates(text)
File "h:\wiki\WikiExtractor.py", line 479, in expandTemplates
res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
File "h:\wiki\WikiExtractor.py", line 636, in expandTemplate
title = fullyQualifiedTemplateTitle(title)
File "h:\wiki\WikiExtractor.py", line 1121, in fullyQualifiedTemplateTitle
return templatePrefix + ucfirst(templateTitle)
NameError: global name 'templatePrefix' is not defined

run error

I run the Python program with the following command (on Linux):
bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -cb 250k -o extracted

And I get:
usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KM]] [-B BASE] [-c] [-l]
[-ns ns1,ns2] [-q] [-s] [-a] [--templates TEMPLATES]
[-v]
input
WikiExtractor.py: error: too few arguments
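
The usage text shows a required positional input argument. When the dump is piped on stdin, that argument must be "-", exactly as in the README example earlier on this page:

 > bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -cb 250k -o extracted -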

Version 2.52 (March 6, 2016) seems not to be working

I used version 2.52 to extract zhwiki and every doc raises an error.
The error occurred on both Windows and Ubuntu. The Python version on Windows is:
Python 2.7.8 (default, Jun 30 2014, 16:08:48) [MSC v.1500 64 bit (AMD64)] on win32

The error message was:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Wiki\WikiExtractor2.py", line 2565, in extract_process
    e.extract(out)
  File "C:\Wiki\WikiExtractor2.py", line 455, in extract
    text = self.clean()
  File "C:\Wiki\WikiExtractor2.py", line 570, in clean
    if escape_doc:
NameError: global name 'escape_doc' is not defined

The changes are so extensive that I think it's too hard for me to find the bugs. I rolled back to version 2.32, and that version works.
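
Regarding the escape_doc failure above: the traceback shows clean() reading a module-level global that was apparently never assigned in the worker process (on Windows, multiprocessing children re-import the module without re-running whatever setup assigns it). A hypothetical one-line workaround, with the default value assumed, is to give the global a module-level definition near the top of the script:

 escape_doc = False  # assumed default so clean() never hits the NameError

The proper fix belongs in the official repo, which may already have it.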

In version 2.39, I used 'python WikiExtractor239.py -o extracted239 zhwiki-20160305-pages-articles.xml'
and the output was NameError: global name 'templatePrefix' is not defined:

INFO: Preprocessing 'zhwiki-20160305-pages-articles.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Preprocessed 1800000 pages
INFO: Preprocessed 1900000 pages
INFO: Preprocessed 2000000 pages
INFO: Preprocessed 2100000 pages
INFO: Preprocessed 2200000 pages
INFO: Preprocessed 2300000 pages
INFO: Preprocessed 2400000 pages
INFO: Preprocessed 2500000 pages
INFO: Preprocessed 2600000 pages
INFO: Preprocessed 2700000 pages
INFO: Loaded 868759 templates in 378.3s
INFO: Starting page extraction from zhwiki-20160305-pages-articles.xml.
INFO: Using 7 extract processes.
(All seven extract worker processes crashed with the same traceback; their output was interleaved in the console and is deduplicated here:)

Process Process-2:
Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Wiki\WikiExtractor239.py", line 2427, in extract_process
    Extractor(*job[:3]).extract(out) # (id, title, page)
  File "C:\Wiki\WikiExtractor239.py", line 423, in extract
    text = clean(self, text)
  File "C:\Wiki\WikiExtractor239.py", line 1896, in clean
    text = extractor.expandTemplates(text)
  File "C:\Wiki\WikiExtractor239.py", line 479, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "C:\Wiki\WikiExtractor239.py", line 636, in expandTemplate
    title = fullyQualifiedTemplateTitle(title)
  File "C:\Wiki\WikiExtractor239.py", line 1121, in fullyQualifiedTemplateTitle
    return templatePrefix + ucfirst(templateTitle)
NameError: global name 'templatePrefix' is not defined

wiki extractor results directories end up in QN

I want to get the title and the content of every Wikipedia article. I found the wiki extractor very useful for this purpose, and I use it according to the instructions on GitHub. When running wiki extractor v2.8, I ran into the 'maximum template recursion' error after a few hours. I am getting wiki extractor from this GitHub page: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py

So I tried the previous commits/versions: v2.4, v2.5, and v2.6.

In wiki extractor v2.4, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QH.

In wiki extractor v2.5, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

In wiki extractor v2.6, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

But I am really confused, because I have no idea which version has the complete set of Wikipedia articles. In my understanding, none of them succeeded, because the resulting directory should contain AA to AZ, BA to BZ, ... QA to QZ, RA to RZ ... ZA to ZZ, but in v2.5 and v2.6 it stops at QN.

Could anyone who has run the wiki extractor successfully please shed some light on this?

Anchor is undefined

On line 1764:

return '<a href="%s">%s</a>' % (urllib.quote(title.encode('utf-8')), anchor)

should 'anchor' be changed to 'label'?
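
This matches the NameError traceback in the "keeping links and sections" issue above, where the caller invokes makeInternalLink(title, label) but the function body references an undefined anchor. A plausible patch is indeed to substitute label (untested; the official repo may fix it differently):

 # line 1764, with 'anchor' replaced by the 'label' parameter in scope:
 return '<a href="%s">%s</a>' % (urllib.quote(title.encode('utf-8')), label)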

Version 2.8 dead loops on "Reached max template recursion"

I tried version 2.8 on the latest full English Wikipedia dump. If "Reached max template recursion" is triggered by one page, the warning may propagate to the following pages. Most of the time, the extractor loops forever until it is killed by the OS.
