
attardi commented on August 26, 2024

On 4/10/2015 09:00, agoyaliitk wrote:

Can you explain why this error occurs?



Because there are template definitions that invoke themselves recursively.
In the case in question the template invocation

{{Multiple sclerosis}}

expands to a body

{{Navbox
| name = Demyelinating diseases of CNS
| title = [[Multiple sclerosis]] and other [[demyelinating disease]]s of [[Central nervous system|CNS]]([[ICD-10 Chapter VI: Diseases of the nervous system#%28G35–G37%29 Demyelinating diseases of the central nervous system|G35–G37]], [[List of ICD-9 codes 320–359: diseases of the nervous system#Other disorders of the central nervous system %28340–349%29|340–341]])
|bodyclass = hlist
|{{Multiple sclerosis|state=expanded}})
| titlestyle = background: Silver;
...

and the template expansion procedure would keep expanding forever.
Templates are to be considered as macros, in which recursion is not allowed.

I added a check on the depth of recursive expansion, similar to the one
used in the official code from MediaWiki, to handle these malformed
templates.
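
As an illustration of such a depth check, here is a minimal Python sketch. It is not the WikiExtractor code: the `TEMPLATES` table, the `Loop` and `Hello` templates, the `expand` function, and the simplified `{{name}}` syntax without parameters are all assumptions made for the example. The cap of 16 mirrors the value reported in the "Reached max template recursion: 16" warning.

```python
import re

MAX_DEPTH = 16  # assumed cap, matching the value shown in the warning message

# Hypothetical template table: name -> body. "Loop" invokes itself.
TEMPLATES = {
    "Loop": "[{{Loop}}]",
    "Hello": "hello",
}

# Matches {{name}} invocations that contain no nested braces or parameters.
INVOCATION = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text, depth=0):
    """Expand {{name}} invocations, giving up once depth exceeds MAX_DEPTH."""
    if depth > MAX_DEPTH:
        # Return nothing for this body instead of recursing forever.
        return ""

    def substitute(match):
        body = TEMPLATES.get(match.group(1).strip(), "")
        return expand(body, depth + 1)

    return INVOCATION.sub(substitute, text)


print(expand("{{Hello}}"))
print(len(expand("{{Loop}}")))
```

Without the check, `expand("{{Loop}}")` would recurse until Python's own recursion limit is hit. With it, the self-invoking template terminates after a bounded number of levels.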

-- Beppe

from wikiextractor.

agoyaliitk commented on August 26, 2024

What can I do now to get past this?
It's giving a memory error.

attardi commented on August 26, 2024

Please tell me the ID of the article that was printed before the traceback, and which Wikipedia dump you are using, so that I can investigate.

agoyaliitk commented on August 26, 2024

I guess the id you are asking for would be 66512.
Wikipedia dump https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

I have again attached the error with some more detail.
Thanks for your help:)

INFO:root:66495 Final Fantasy III
INFO:root:66496 Hippogriff
INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Reached max template recursion: 16
WARNING:root:Reached max template recursion: 16
Traceback (most recent call last):
  File "./WikiExtractor.py", line 1797, in <module>
    main()
  File "./WikiExtractor.py", line 1793, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1621, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 132, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1256, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 313, in expandTemplates
    res += text[cur:]
MemoryError

agoyaliitk commented on August 26, 2024

Wikipedia dump: enwiki-latest-pages-articles.xml.bz2, dated 06-Apr-2015 22:06, size 11820881800 bytes

sanja7s commented on August 26, 2024

I get a similar error (I edited the file a bit as I need only raw text output, no titles or urls, but that should not have changed anything in the core program):

  File "WikiExtractor_v27s.py", line 789, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "WikiExtractor_v27s.py", line 416, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "WikiExtractor_v27s.py", line 327, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "WikiExtractor_v27s.py", line 828, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "WikiExtractor_v27s.py", line 333, in expandTemplates
    res += text[cur:]
MemoryError

And in my case, it has reached 313280 articles before this error. The last article is:

945695 Canada at the 1904 Summer Olympics

The memory consumption during execution looked rather interesting, so I took a screenshot at one point:

[screenshot: mem_consumption_wikiextract]

and the Wikipedia dump I use is:
-- 2015-03-07 Recombine articles, templates, media/file descriptions, and primary meta-pages.
-- enwiki-20150304-pages-articles.xml.bz2 10.9 GB

attardi commented on August 26, 2024

I fixed a few issues and was able to process the latest Wikipedia dump.
Processing the dump requires about 3 GB of memory and runs for several hours.
I have added the option:
--no-templates
for extracting text without expanding templates, as in previous releases of WikiExtractor.
This reduces the memory needed to about 500 MB and significantly speeds up processing, but all templates will be replaced with blanks.
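
To give an idea of what "replaced with blanks" can mean in practice, here is a minimal Python sketch that drops every balanced {{...}} span in a single pass, without looking up or expanding anything. The function name `strip_templates` is hypothetical and this is not how WikiExtractor itself implements the option. The sketch also ignores triple-brace parameters such as {{{1}}}.

```python
def strip_templates(text):
    """Replace each balanced {{...}} span with one space, without expanding it."""
    out = []
    depth = 0  # current nesting level of open {{ pairs
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                out.append(" ")  # one blank stands in for the whole template
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])  # keep text that is outside any template
            i += 1
    return "".join(out)


print(strip_templates("Ferns {{Navbox|list={{Plants}}}} are vascular plants."))
```

Because nothing is expanded, the work and the memory used grow only with the length of the page text, however deeply the templates are nested.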

agoyaliitk commented on August 26, 2024

I will try it again.
Thanks

agoyaliitk commented on August 26, 2024

No luck.
Still giving the same error.

INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
  File "./WikiExtractor.py", line 1838, in <module>
    main()
  File "./WikiExtractor.py", line 1834, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1658, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 154, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1293, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "./WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "./WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 338, in expandTemplates
    res += text[cur:]
MemoryError

attardi commented on August 26, 2024

Processing that file on my machine required 5GB of memory.
So it is possible that on your machine the memory gets exhausted.

You can try reducing the maximum depth of recursion, by setting for example

maxTemplateRecursionLevels = 8

If that does not help, you will have to disable templates with option

--no-templates.

Let me know.

-- Beppe


agoyaliitk commented on August 26, 2024

I'll try.
thanks

cifkao commented on August 26, 2024

I have a similar problem with this article: INFO:root:1908699 Lepospondyli. It takes a lot more time than other articles and then I get this output:

INFO:root:1908699       Lepospondyli
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
  File "wikiextractor/WikiExtractor.py", line 1838, in <module>
    main()
  File "wikiextractor/WikiExtractor.py", line 1834, in main
    process_data(input_file, args.templates, output_splitter)
  File "wikiextractor/WikiExtractor.py", line 1658, in process_data
    extract(id, title, page, output)
  File "wikiextractor/WikiExtractor.py", line 154, in extract
    text = clean(text)
  File "wikiextractor/WikiExtractor.py", line 1293, in clean
    text = expandTemplates(text)
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "wikiextractor/WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "wikiextractor/WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "wikiextractor/WikiExtractor.py", line 338, in expandTemplates
    res += text[cur:]
MemoryError

I had 8 GB of memory reserved for the process.

agoyaliitk commented on August 26, 2024

Got the same error as cifkao.
Memory error after article no. 1908699 Lepospondyli

attardi commented on August 26, 2024

I have committed a new version that should fix the memory problems.
I completely revised the strategy of parameter evaluation.
For example, in article no. 3616279 Arthrodira, there was a very deep dendrogram whose expansion was exponential in its depth.
Now parameters are expanded before substitution and this solves the problem.
Please try it.
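
The following toy model in Python shows why the order of evaluation can matter so much. It is only a sketch under stated assumptions, not the WikiExtractor code: the `wrap` template, the `nest`, `expand_lazy`, `expand_eager`, and `count_expansions` names are hypothetical, and the template body is assumed to behave like `{{#if:{{{1}}}|{{{1}}}}}`, which mentions its argument twice. If the raw argument is pasted into the body first, each mention is expanded separately, so the work doubles at every nesting level. If the argument is expanded once before substitution, the work grows linearly.

```python
def nest(depth):
    """Build a chain of `depth` nested ("wrap", child) invocations around a leaf."""
    node = "leaf"
    for _ in range(depth):
        node = ("wrap", node)
    return node


def expand_lazy(node, counter):
    """Substitute the raw argument first, then expand each mention separately."""
    if isinstance(node, str):
        return node
    counter[0] += 1
    _, arg = node
    condition = expand_lazy(arg, counter)  # first mention of the argument
    return expand_lazy(arg, counter) if condition else ""  # second mention


def expand_eager(node, counter):
    """Expand the argument once, then reuse the result for every mention."""
    if isinstance(node, str):
        return node
    counter[0] += 1
    _, arg = node
    value = expand_eager(arg, counter)
    return value if value else ""


def count_expansions(strategy, depth):
    """Return (result, number of template expansions) for a nesting of `depth`."""
    counter = [0]
    result = strategy(nest(depth), counter)
    return result, counter[0]


for depth in (4, 8, 12):
    print(depth, count_expansions(expand_lazy, depth)[1],
          count_expansions(expand_eager, depth)[1])
```

Both strategies produce the same text. Only the number of expansions differs, and the model counts calls only, not the memory taken by the intermediate strings.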
