
wikitextparser's Introduction


An easy-to-use WikiText parsing library for MediaWiki.

The purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitext.

  • Python 3.8+ is required
  • pip install wikitextparser
>>> import wikitextparser as wtp

WikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.

You may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).

>>> parsed = wtp.parse("{{text|value1{{text|value2}}}}")
>>> parsed.templates
[Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]
>>> parsed.templates[0].arguments
[Argument("|value1{{text|value2}}")]
>>> parsed.templates[0].arguments[0].value = 'value3'
>>> print(parsed)
{{text|value3}}
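Under the hood, finding nested templates like these comes down to matching balanced pairs of double braces. The following is a minimal pure-Python sketch of that idea (an illustration only, not wikitextparser's actual implementation):

```python
def template_spans(text):
    """Return (start, end) spans of {{...}} templates, innermost first."""
    spans, stack = [], []
    i = 0
    while i < len(text) - 1:
        if text[i:i + 2] == '{{':
            stack.append(i)      # remember where this template opened
            i += 2
        elif text[i:i + 2] == '}}' and stack:
            spans.append((stack.pop(), i + 2))  # close the innermost open one
            i += 2
        else:
            i += 1
    return spans

print(template_spans('{{text|value1{{text|value2}}}}'))
# -> [(13, 28), (0, 30)]
```

Real wikitext is much messier (unbalanced braces, parameters, parser functions), which is where a proper parser earns its keep.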

The pformat method returns a pretty-print formatted string for templates:

>>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t1, t2 = parsed.templates
>>> print(t2.pformat())
{{t2
    | e = e
    | f = f
}}
>>> print(t1.pformat())
{{t1
    | b = b
    | c = c
    | d = {{t2
        | e = e
        | f = f
    }}
}}

The Template.rm_dup_args_safe and Template.rm_first_of_dup_args methods can be used to clean up pages that use duplicate arguments in template calls:

>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')
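The behavior of rm_first_of_dup_args is conceptually the same as building a dict from (name, value) pairs, where a later duplicate overwrites an earlier one. A sketch under that assumption (the pair representation here is hypothetical, not the library's internal one):

```python
def keep_last_of_dup_args(args):
    """Keep only the last occurrence of each named argument.

    `args` is a list of (name, value) pairs, a hypothetical
    representation of a template's arguments.
    """
    seen = {}
    for name, value in args:
        seen[name] = value  # a later duplicate overwrites the earlier one
    return list(seen.items())

print(keep_last_of_dup_args([('a', 'a'), ('a', 'b'), ('a', 'a')]))
# -> [('a', 'a')]
```

This matches the {{t|a=a}} result above: of the three a= arguments, only the last survives.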

Template parameters:

>>> param = wtp.parse('{{{a|b}}}').parameters[0]
>>> param.name
'a'
>>> param.default
'b'
>>> param.default = 'c'
>>> param
Parameter('{{{a|c}}}')
>>> param.append_default('d')
>>> param
Parameter('{{{a|{{{d|c}}}}}}')
>>> wl = wtp.parse('... [[title#fragment|text]] ...').wikilinks[0]
>>> wl.title = 'new_title'
>>> wl.fragment = 'new_fragment'
>>> wl.text = 'X'
>>> wl
WikiLink('[[new_title#new_fragment|X]]')
>>> del wl.text
>>> wl
WikiLink('[[new_title#new_fragment]]')

All WikiLink properties support get, set, and delete operations.

>>> parsed = wtp.parse("""
... == h2 ==
... t2
... === h3 ===
... t3
... === h3 ===
... t3
... == h22 ==
... t22
... {{text|value3}}
... [[Z|X]]
... """)
>>> parsed.sections
[Section('\n'),
 Section('== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('== h22 ==\nt22\n{{text|value3}}\n[[Z|X]]\n')]
>>> parsed.sections[1].title = 'newtitle'
>>> print(parsed)

==newtitle==
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
>>> del parsed.sections[1].title
>>> print(parsed)

t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
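Section detection essentially means locating == heading == lines, where the heading level equals the number of leading equals signs. A rough regex sketch of that idea (not the library's actual implementation):

```python
import re

# A heading line: 2-6 equals signs, the title, then the same number of
# equals signs again, possibly with surrounding spaces or tabs.
HEADING = re.compile(r'(?m)^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$')

def headings(text):
    """Return (level, title) for each heading line (rough sketch)."""
    return [(len(m[1]), m[2]) for m in HEADING.finditer(text)]

print(headings('== h2 ==\nt2\n=== h3 ===\nt3\n'))
# -> [(2, 'h2'), (3, 'h3')]
```

wikitextparser additionally groups the text under each heading into Section objects, including nested subsections, as shown above.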

Extracting cell values of a table:

>>> p = wtp.parse("""{|
... |  Orange    ||   Apple   ||   more
... |-
... |   Bread    ||   Pie     ||   more
... |-
... |   Butter   || Ice cream ||  and more
... |}""")
>>> p.tables[0].data()
[['Orange', 'Apple', 'more'],
 ['Bread', 'Pie', 'more'],
 ['Butter', 'Ice cream', 'and more']]
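For simple rows like these, extracting cell values amounts to splitting each | row on ||. A naive sketch (real tables can also carry cell attributes, templates, and nested tables, which data() handles for you):

```python
def row_cells(row_line):
    """Split a '|' table row into cell strings.

    Naive sketch: ignores cell attributes, templates and nested tables.
    """
    return [c.strip() for c in row_line.lstrip('|').split('||')]

print(row_cells('|  Orange    ||   Apple   ||   more'))
# -> ['Orange', 'Apple', 'more']
```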

By default, values are arranged according to colspan and rowspan attributes:

>>> t = wtp.Table("""{| class="wikitable sortable"
... |-
... ! a !! b !! c
... |-
... !colspan = "2" | d || e
... |-
... |}""")
>>> t.data()
[['a', 'b', 'c'], ['d', 'd', 'e']]
>>> t.data(span=False)
[['a', 'b', 'c'], ['d', 'e']]
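The span-aware arrangement can be pictured as duplicating each cell value across the columns its colspan covers. A simplified sketch (rowspan handling omitted; the (value, colspan) pair representation is hypothetical):

```python
def expand_colspans(rows):
    """Duplicate cell values across their colspan.

    Each cell is a hypothetical (value, colspan) pair; rowspan is omitted
    for brevity.
    """
    out = []
    for row in rows:
        expanded = []
        for value, colspan in row:
            expanded.extend([value] * colspan)  # repeat value over spanned columns
        out.append(expanded)
    return out

print(expand_colspans([[('a', 1), ('b', 1), ('c', 1)],
                       [('d', 2), ('e', 1)]]))
# -> [['a', 'b', 'c'], ['d', 'd', 'e']]
```

This mirrors the t.data() result above, where 'd' fills two columns.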

Calling the cells method of a Table returns table cells as Cell objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:

>>> cell = t.cells(row=1, column=1)
>>> cell.attrs
{'colspan': '2'}
>>> cell.set('colspan', '3')
>>> print(t)
{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "3" | d || e
|-
|}

HTML attributes of Table, Cell, and Tag objects are accessible via get_attr, set_attr, has_attr, and del_attr methods.

The get_lists method provides access to lists within the wikitext.

>>> parsed = wtp.parse(
...     'text\n'
...     '* list item a\n'
...     '* list item b\n'
...     '** sub-list of b\n'
...     '* list item c\n'
...     '** sub-list of b\n'
...     'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']

The sublists method can be used to get all sub-lists of the current list or just sub-lists of specific items:

>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']

The sublists method also accepts an optional pattern argument that works like the one of get_lists, except that the current list pattern is automatically added to it as a prefix:

>>> wikilist = wtp.WikiList('#a\n#b\n##ba\n#*bb\n#:bc\n#c', r'\#')
>>> wikilist.sublists()
[WikiList('##ba\n'), WikiList('#*bb\n'), WikiList('#:bc\n')]
>>> wikilist.sublists(pattern=r'\*')
[WikiList('#*bb\n')]

Convert one type of list to another using the convert method. Specifying the starting pattern of the desired lists makes them easier to find and improves performance:

>>> wl = wtp.WikiList(
...     ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2',
...     pattern=r':\*'
... )
>>> print(wl)
:*A1
:*#B1
:*#B2
:*:continuing A1
:*A2
>>> wl.convert('#')
>>> print(wl)
#A1
##B1
##B2
#:continuing A1
#A2
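Conceptually, convert rewrites the leading pattern of every item line. A regex-based sketch of that idea (not the library's actual implementation):

```python
import re

def convert_list(lines, old, new):
    """Replace the leading list pattern of each item line.

    `old` is a regex matching the current prefix (e.g. r':\*'),
    `new` is its literal replacement (e.g. '#').
    """
    return [re.sub('^' + old, new, line) for line in lines]

print(convert_list([':*A1', ':*#B1', ':*:continuing A1'], r':\*', '#'))
# -> ['#A1', '##B1', '#:continuing A1']
```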

Accessing HTML tags:

>>> p = wtp.parse('text<ref name="c">citation</ref>\n<references/>')
>>> ref, references = p.get_tags()
>>> ref.name = 'X'
>>> ref
Tag('<X name="c">citation</X>')
>>> references
Tag('<references/>')

WikiTextParser can handle common usages of HTML and extension tags. However, it is not a full-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on GitHub if you encounter bugs.

The parent and ancestors methods can be used to access a node's parent or ancestors, respectively:

>>> template_d = wtp.parse("{{a|{{b|{{c|{{d}}}}}}}}").templates[3]
>>> template_d.ancestors()
[Template('{{c|{{d}}}}'),
 Template('{{b|{{c|{{d}}}}}}'),
 Template('{{a|{{b|{{c|{{d}}}}}}}}')]
>>> template_d.parent()
Template('{{c|{{d}}}}')
>>> _.parent()
Template('{{b|{{c|{{d}}}}}}')
>>> _.parent()
Template('{{a|{{b|{{c|{{d}}}}}}}}')
>>> _.parent()  # Returns None

Use the optional type_ argument if looking for ancestors of a specific type:

>>> parsed = wtp.parse('{{a|{{#if:{{b{{c<!---->}}}}}}}}')
>>> comment = parsed.comments[0]
>>> comment.ancestors(type_='ParserFunction')
[ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]

To delete/remove any object from its parent, use del object[:] or del object.string.

The remove_markup function or plain_text method can be used to remove wiki markup:

>>> from wikitextparser import remove_markup, parse
>>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
>>> remove_markup(s)
'a c d'
>>> parse(s).plain_text()
'a c d'
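For a rough intuition of what such stripping involves, here is a deliberately naive regex-based sketch; it only handles a few constructs and is no substitute for plain_text:

```python
import re

def naive_plain_text(s):
    """Very rough markup stripper, for illustration only."""
    s = re.sub(r'<!--.*?-->', '', s, flags=re.S)        # HTML comments
    s = re.sub(r"'''(.*?)'''", r'\1', s)                # bold markup
    s = re.sub(r'\[\[[^|\]]*\|([^\]]*)\]\]', r'\1', s)  # piped wikilinks -> label
    s = re.sub(r'\[\[([^\]]*)\]\]', r'\1', s)           # plain wikilinks -> target
    return s

print(naive_plain_text("'''a'''<!--comment--> [[b|c]] [[d]]"))
# -> a c d
```

Nesting, templates, tables, and tags are exactly the cases where regexes like these break down and a real parser is needed.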

mwparserfromhell is a mature and widely used library with nearly the same purpose as wikitextparser. The main reason I created wikitextparser was that mwparserfromhell could not parse wikitext in certain situations I needed it for; see mwparserfromhell's issues 40, 42, 88, and other related issues. In many of those situations wikitextparser may give you more acceptable results.

Also note that wikitextparser still uses a 0.x.y version, meaning that the API is not stable and may change in future versions.

The tokenizer in mwparserfromhell is written in C. Tokenization in wikitextparser is mostly done using the regex library, which is also implemented in C. I have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. In my limited experience, wikitextparser performs decently in realistic cases, should be able to compete, and may even have slight performance benefits in some situations.

If you have had a chance to compare these libraries in terms of performance or capabilities, please share your experience by opening an issue on GitHub.

Some of the unique features of wikitextparser are: access to individual cells of each table, pretty-printing of templates, a WikiList class with rudimentary methods for working with lists, and a few other functions.

  • The contents of templates/parameters are not known to offline parsers. For example, an offline parser cannot know whether the markup [[{{z|a}}]] should be treated as a wikilink; that depends on the inner workings of the {{z}} template. In these situations wikitextparser makes a best guess: [[{{z|a}}]] is treated as a wikilink (why else would anyone call a template inside wikilink markup? And even if it is not a wikilink, usually no harm is done).
  • Localized namespace names are unknown, so for example [[File:...]] links are treated as normal wikilinks. mwparserfromhell has a similar issue; see #87 and #136. As a workaround, Pywikibot can be used to determine the namespace.
  • Linktrails are language-dependent and are not supported (nor are they supported by mwparserfromhell). However, given the trail pattern and knowing that wikilink.span[1] is the ending position of a wikilink, it is possible to compute a WikiLink's linktrail.
  • Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'
  • The list of valid extension tags depends on the extensions installed on the wiki. The tags method currently only supports the ones on the English Wikipedia. A configuration option might be added in the future to address this issue.
  • wikitextparser currently does not provide an ast.walk-like method yielding all descendant nodes.
  • Parser functions and magic words are not evaluated.
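Building on the linktrail point above, a linktrail could be computed like this (a sketch; the trail pattern and helper name are assumptions, and wikilink.span[1] would supply link_end in practice):

```python
import re

def linktrail(page_text, link_end, trail_pattern=r'[a-z]+'):
    """Return the linktrail characters that directly follow a wikilink.

    `link_end` is the end position of the wikilink in `page_text`
    (wikilink.span[1] in wikitextparser); `trail_pattern` is the
    language-specific trail regex, e.g. [a-z]+ for English.
    """
    m = re.match(trail_pattern, page_text[link_end:])
    return m.group(0) if m else ''

print(linktrail('[[apple]]s are red', 9))  # -> s
```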

wikitextparser's People

Contributors

5j9, bencantcode, dskrypa, iced0368, kennychenbasis, truebrain, winstontsai, winterheart



wikitextparser's Issues

Parsing page fragment in wikilink

Now that I can find all the wikilinks and their targets, it would also be good to get the page title and the fragment separately.

Example:

p = wtp.parse("[[Foo (Bar)#frag|Bar Foo]]")
p = p.wikilinks[0]

p.target_title
# "Foo (Bar)"
p.target_fragment
# "frag"
p.label
# Bar Foo

I also took the liberty of renaming text to label in the example above; that nomenclature seemed more coherent to me.

Hello

Would it be possible for me to get your help with improving the Lilak project?
I look forward to hearing from you.

Parsing tables generated from templates

I know that one of the limitations of offline parsers is that they cannot detect syntax elements produced by template transclusion.

So this is more of a feasibility question: are you aware of any possible way to parse the wikicode of tables generated with templates? For instance, the football squad templates (https://en.wikipedia.org/wiki/Template:Football_squad_start). Given the following wikicode,

{{Football squad start}}
{{football squad player | no=1  | nat=Spain      | pos=GK | name=[[Iker Casillas]]}}
{{football squad player | no=3  | nat=ESP        | pos=DF | name=[[Gerard Pique]]}}
{{football squad player | no=9  | nat=Singapore  | pos=FW | name=[[Aleksandar Đurić]] | other=team captain}}
{{football squad player | no=10 | nat=NED| pos=FW | name=[[Robin Van Persie]]}}
{{football squad mid}}
{{football squad player | no=13 | nat=South Korea| pos=MF | name=[[Park Ji-sung]]}}
{{football squad player | no=25 | nat=ENG  | pos=GK | name=[[Joe Hart]]}}
{{football squad player | no=1  | nat=Spain  | natvar=1931 | pos=GK | name=[[Ricardo Zamora]]}}
{{football squad player | no=   | nat=HUN  | natvar=1949 | pos=FW | name=[[Ferenc Puskas]]}}
{{football squad end}}

it produces a table like the following:

[image: the rendered football squad table]

Although this template uses other templates, such as Template:Flagicon, I don't care much about that part for now; I'm more interested in retrieving the correct table structure. (Though if you know about that as well, any pointers would be welcome.)

Thanks in advance!

The `pre` tag causes an error when parsed and accessed

test = '<pre>test</pre>'
wtp.parse(test).get_tags()

results in

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-7416011d748c> in <module>
      1 test = '<pre>test</pre>'
----> 2 wtp.parse(test).get_tags()

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_tags(self, name)
   1219             tags_append(Tag(lststr, type_to_spans, span, 'Tag'))
   1220         spans.sort()
-> 1221         tags.sort(key=attrgetter('_span_data'))
   1222         return tags
   1223 

TypeError: '<' not supported between instances of 'NoneType' and '_regex.Match'

Cheers,
Nico

Function `plain_text()` fails if tags are not properly lowercased

Actual example from https://en.wikipedia.org/w/index.php?title=American_Civil_War&action=edit&oldid=744681152

import wikitextparser as wtp
test = '<Ref>Professor James Downs. "Sick from Freedom: African-American Illness and Suffering during the Civil War and Reconstruction". January 1, 2012.</ref>'
wtp.parse(test).plain_text()

throws

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-70-efc3216e3b5b> in <module>
      1 test = '<Ref>Professor James Downs. "Sick from Freedom: African-American Illness and Suffering during the Civil War and Reconstruction". January 1, 2012.</ref>'
----> 2 wtp.parse(test).plain_text()

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    597         # because removing tags and wikilinks creates invalid spans.
    598         if replace_bolds:
--> 599             for b in parsed.get_bolds():
    600                 b[:] = b.text
    601         if replace_italics:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    888         ):
    889             for e in getattr(self, t):
--> 890                 bolds += e.get_bolds(True)
    891         return bolds
    892 

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    869         shadow = self._shadow
    870         for match in BOLDS_FINDITER(
--> 871             shadow, endpos=self._relative_contents_end
    872         ):
    873             ms = match.start(1)

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_tag.py in _relative_contents_end(self)
    215     @property
    216     def _relative_contents_end(self) -> int:
--> 217         return self._match.end('contents')

AttributeError: 'NoneType' object has no attribute 'end'

Cheers,
Nico

Doesn't parse templates correctly if they contain {{!}} and style="background:transparent"

Here is an example of a template that wikitextparser parses incorrectly. It comes from the Infobox officeholder template on the Chinese Wikipedia page for Vladimir Putin: https://zh.wikipedia.org/wiki/%E5%BC%97%E6%8B%89%E5%9F%BA%E7%B1%B3%E5%B0%94%C2%B7%E6%99%AE%E4%BA%AC

putin_officeholder = """{{Infobox officeholder
| name           = 弗拉基米尔·弗拉基米罗维奇·普京
| native_name    = Владимир Владимирович Путин
| image          = President Vladimir Putin.jpg
| caption        = 於2018年《俄羅斯是個機會的國家》活動
| office         = [[File:Standard of the President of the Russian Federation.svg|22px|border]] 第3、4、6、7屆[[俄罗斯总统|俄羅斯聯邦总统]]<!--「屆」與「任」有所不同,屆次以人作量衡而任期則以時間作量度-->
| predecessor    = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| primeminister  = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]<br/>[[米哈伊尔·米舒斯京]]
| successor      =
| term_start     = 2012年5月7日
| term_end       =
| predecessor1   = [[鲍里斯·尼古拉耶维奇·叶利钦|鲍里斯·叶利钦]]
| primeminister1 = [[米哈伊尔·米哈伊洛维奇·卡西亚诺夫|米哈伊尔·卡西亚诺夫]]<br/>[[米哈伊尔·叶菲莫维奇·弗拉德科夫|米哈伊尔·弗拉德科夫]]<br/>[[维克托·祖布科夫]]
| successor1     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| term_start1    = 2000年5月7日
| term_end1      = 2008年5月7日<br/>{{small|代理总统:1999年12月31日–2000年5月7日}}


| office2        = [[File:Standard of the President of the Russian Federation.svg|22px|border]] [[俄羅斯聯邦安全會議]]主席
| predecessor2   = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| vicechairman2  = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]](自2020年)
| successor2     =
| term_start2    = 2012年5月7日
| term_end2      =
| predecessor3   = [[鲍里斯·尼古拉耶维奇·叶利钦|鲍里斯·叶利钦]]
| successor3     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| term_start3    = 2000年5月7日
| term_end3      = 2008年5月7日<br/>{{small|代理主席:1999年12月31日–2000年5月7日}}

| office4        = [[File:Flag of Russia.svg|22px|border]] [[俄罗斯总理|俄罗斯联邦总理]]
| deputy4        = 伊戈尔·舒瓦洛夫
| predecessor4   = [[维克托·祖布科夫]]
| president4     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| successor4     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| term_start4    = 2008年5月8日
| term_end4      = 2012年5月7日
| president5     = [[鲍里斯·尼古拉耶维奇·叶利钦|鲍里斯·叶利钦]]
| deputy5        = [[维克托·鲍里索维奇·赫里斯坚科|维克托·赫里斯坚科]]<br/>[[米哈伊尔·米哈伊洛维奇·卡西亚诺夫|米哈伊尔·卡西亚诺夫]]
| term_start5    = 1999年8月16日
| term_end5      = 2000年5月7日<br/>{{small|代理总理:1999年8月9日–1999年8月16日}}
| predecessor5   = [[谢尔盖·瓦季莫维奇·斯捷帕申|谢尔盖·斯捷帕申]]
| successor5     = [[米哈伊尔·米哈伊洛维奇·卡西亚诺夫|米哈伊尔·卡西亚诺夫]]
| office6        = [[File:Flag of the Union State.svg|22px|border]] [[俄白联盟|俄白联盟部长会议主席]]
| term_start6    = 2008年5月27日
| term_end6      = 2012年7月18日
| predecessor6   = [[维克托·祖布科夫]]
| successor6     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]  
| office7        = <!-- 注释出:[[File:Логотип партии -Единая Россия-.svg|22px|border]] --> [[统一俄罗斯|统一俄罗斯主席]]
| term_start7    = 2008年5月7日
| term_end7      = 2012年5月26日
| predecessor7   = [[鲍里斯·维亚切斯拉沃维奇·格雷兹洛夫|鲍里斯·格雷兹洛夫]]
| successor7     = [[德米特里·阿纳托利耶维奇·梅德韦杰夫|德米特里·梅德韦杰夫]]
| office8        = [[File:Emblem Security Council of Russia.svg|22px|border]] [[俄罗斯联邦安全会议|俄罗斯联邦安全会议秘书]]
| president8     = [[鲍里斯·尼古拉耶维奇·叶利钦|鲍里斯·叶利钦]]
| term_start8    = 1999年3月9日
| term_end8      = 1999年8月9日
| predecessor8   = [[尼古拉·博尔久扎]]
| successor8     = [[谢尔盖·鲍里索维奇·伊万诺夫|谢尔盖·伊万诺夫]]
| office9        = [[File:Flag of FSB.svg|22px|border]] [[俄罗斯联邦安全局|俄罗斯联邦安全局局长]]
| president9     = [[鲍里斯·尼古拉耶维奇·叶利钦|鲍里斯·叶利钦]]
| term_start9    = 1998年7月25日
| term_end9      = 1999年3月29日
| predecessor9   = 尼古拉·科瓦廖夫
| successor9     = [[尼古拉·帕特鲁舍夫]]
| party          = {{CPSU}}{{small|(1975年–1991年)}}<br/>{{le|我们的家园-俄罗斯|Our Home – Russia}}{{small|(1995年–1999年)}}<br/>[[团结 (俄罗斯)|团结]]{{small|(1999年–2001年)}}<br/>[[统一俄罗斯]]{{small|(2008年–2012年)<ref>{{cite web|url=http://www.telegraph.co.uk/news/worldnews/vladimir-putin/9223621/Vladimir-Putin-quits-as-head-of-Russias-ruling-party.html|title=Vladimir Putin quits as head of Russia's ruling party|date=24 April 2012|publisher=|via=www.telegraph.co.uk}}</ref>}}<br/>{{le|全俄人民阵线|All-Russia People's Front}}{{small|(2011年至今)}}<br/>[[無黨籍|独立人士]]{{small|(1991年–1995年;2001年–2008年;2012年至今)}}
| birth_date     = {{birth date and age|1952|10|7|df=y}}
| birth_place    = {{flag|USSR|1936}}[[列宁格勒]]
| nationality    = {{URS}}(1952年-1991年)<br>{{RUS}}(1991年至今)
| spouse         = {{marriage|[[柳德米拉·普京娜|柳德米拉·什克列布涅娃]]|1983-7-28|2014|reason=div}}
| children       = 2
| residence      = [[俄罗斯]][[莫斯科]]新奥加略沃
| alma_mater     = [[圣彼得堡国立大学|列宁格勒国立大学]](现:圣彼得堡国立大学)
|education       = [[圣彼得堡国立大学|列寧格勒國立大學]]國際法學系法學士<br/>'''学位'''<br/>{{flagicon|CHN}}[[清華大學]]名誉博士
| religion       = [[俄罗斯正教会|俄罗斯正教]]
| signature      = Putin signature.svg
| website        = {{Official website|http://eng.putin.kremlin.ru/}}
| allegiance     = {{Flag|Soviet Union|size=23px}}/{{Flag|DDR|size=23px}}
| branch         = [[File:Emblema KGB.svg|25px]] [[克格勃|国家安全委員會]]/[[File:Emblem Stasi.svg|25px]] [[国家安全部]]
| serviceyears   = 1975年–1991年
| rank           = [[File:CCCP air-force Rank podpolkovnik infobox.svg|25px]] 克格勃[[中校]]
| awards         = {{{!}} style="background:transparent"
{{!}}{{荣誉勋章 (俄罗斯联邦)}}{{!!}}{{苏联荣誉勋章}}
{{!}}}
{{{!}} style="background:transparent"
{{!}}{{法国荣誉军团勋章|type=1}}{{!!}}{{何塞·马蒂勋章}}{{!!}}{{Ho Chi Minh Order}}
{{!}}}
| module         = {{Infobox name module
 | putonghua = 普京
 | taiwan = 普欽<ref>[https://www.boca.gov.tw/sp-foof-countrycp-03-43-b55c4-04-1.html 中華民國外交部領事事務局]</ref>、蒲亭<ref>[http://www.cna.com.tw/search/hysearchws.aspx?q=Vladimir+Putin **通訊社]</ref>、普廷、普亭、普丁
 | cantonese = 普京<!-- 若不填寫澳門一項,則標籤顯示為港澳 -->
 }}
}}"""

import wikitextparser as wtp
print(wtp.parse(putin_officeholder).templates)

If we remove the following lines:

| awards         = {{{!}} style="background:transparent"
{{!}}{{荣誉勋章 (俄罗斯联邦)}}{{!!}}{{苏联荣誉勋章}}
{{!}}}
{{{!}} style="background:transparent"
{{!}}{{法国荣誉军团勋章|type=1}}{{!!}}{{何塞·马蒂勋章}}{{!!}}{{Ho Chi Minh Order}}
{{!}}}

then the template is parsed correctly.

Tag parsing bug when multiple tags with the same name are on one line

Using version 0.29.1, I encountered a bug when a single line contains multiple tags with the same name. The parser appears to pair an opening tag with the last matching closing tag on the line rather than the first. Additionally, the resulting Tag object breaks when attempting to access its attributes.

Example:

>>> from wikitextparser import WikiText

>>> wt_obj = WikiText('March 20, 2018 <small>(Part 1)</small> <br/> March 27, 2018 <small>(Part 2)</small>')

>>> tag = wt_obj.tags()[0]

>>> tag
Tag('<small>(Part 1)</small> <br/> March 27, 2018 <small>(Part 2)</small>')

>>> tag.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\venv\lib\site-packages\wikitextparser\_tag.py", line 209, in name
    return self._match['name'].decode()
TypeError: 'NoneType' object is not subscriptable

>>> tag.contents
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\venv\lib\site-packages\wikitextparser\_tag.py", line 235, in contents
    s, e = self._match.span('contents')
AttributeError: 'NoneType' object has no attribute 'span'

>>> tag.attrs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\venv\lib\site-packages\wikitextparser\_tag.py", line 119, in attrs
    spans = self._attrs_match.spans
AttributeError: 'NoneType' object has no attribute 'spans'

If I replace the last tag with a different one, it works as intended:

>>> wt_obj = WikiText('March 20, 2018 <small>(Part 1)</small> <br/> March 27, 2018 <a href="example">(Part 2)</a>')
>>> tag = wt_obj.tags()[0]
>>> tag
Tag('<small>(Part 1)</small>')
>>> tag.name
'small'
>>> tag.contents
'(Part 1)'
>>> tag.attrs
{}

The example is truncated; the source contained 4 sets of <small> tags on one line, and it grouped all 4 together.

List corrupted when a note is inserted

Let's take this French page as an example (no need to understand the printed content, though). The list contains 12 items; the third item has a note, and the rest of the list after it is not fetched.

Here is a simple reproduction case (I copied the interesting data so that a simple regression test can be added, if you are interested in fixing it):

# file: repro.py
import pprint
import wikitextparser as wtp


with open("demo.txt") as f:
    data = f.read()

sections = wtp.parse(data).get_sections(include_subsections=False)
lists = sections[1].get_lists()[0].items
pprint.pprint(lists)

I attached the file demo.txt. Just run the script, it will output:

▶ python repro.py
[' Utilisé après certains verbes.',
 ' Permet de préciser une matière.',
 ' Indique le [[lieu]].']

Anyway, thanks a lot for your work 💪 (I use it for https://github.com/BoboTiG/ebook-reader-dict [WIP]).

Parsable tag extensions with invalid attributes cause an error

test = '<span "16/32">test</span>'
wtp.parse(test).plain_text()

returns

<span "16/32">test</span>

which is fine. But

test = '<ref "16/32">test</ref>'
wtp.parse(test).plain_text()

throws

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-58-78b1821beafc> in <module>
      1 test = '<ref "16/32">test</ref>'
----> 2 wtp.parse(test).plain_text()

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    598         # get_bolds() will try to look into wikilinks for bold parts.
    599         if replace_bolds:
--> 600             for b in parsed.get_bolds():
    601                 b[:] = b.text
    602         if replace_italics:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    889         ):
    890             for e in getattr(self, t):
--> 891                 bolds += e.get_bolds(False)
    892         return bolds
    893 

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    870         shadow = self._shadow
    871         for match in BOLDS_FINDITER(
--> 872             shadow, endpos=self._relative_contents_end
    873         ):
    874             ms = match.start(1)

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_tag.py in _relative_contents_end(self)
    214     @property
    215     def _relative_contents_end(self) -> int:
--> 216         return self._match.end('contents')

AttributeError: 'NoneType' object has no attribute 'end'

Combination of external link, unparsable tag extension, and square brackets cause parsing error

Real example taken from: https://en.wikipedia.org/wiki/Active_Directory?oldid=745267164

test = """
[http://msdn.microsoft.com/en-us/library/cc223122.aspx <nowiki>[MS-ADTS]: Active Directory Technical Specification</nowiki>] (part of the [[Microsoft Open Specification Promise]])
"""
wtp.parse(test).plain_text()

results in a

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-e52ac320fb29> in <module>
      2 [http://msdn.microsoft.com/en-us/library/cc223122.aspx <nowiki>[MS-ADTS]: Active Directory Technical Specification</nowiki>] (part of the [[Microsoft Open Specification Promise]])
      3 """
----> 4 wtp.parse(test).plain_text()

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    598         # get_bolds() will try to look into wikilinks for bold parts.
    599         if replace_bolds:
--> 600             for b in parsed.get_bolds():
    601                 b[:] = b.text
    602         if replace_italics:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    889         ):
    890             for e in getattr(self, t):
--> 891                 bolds += e.get_bolds(False)
    892         return bolds
    893 

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    870         shadow = self._shadow
    871         for match in BOLDS_FINDITER(
--> 872             shadow, endpos=self._relative_contents_end
    873         ):
    874             ms = match.start(1)

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_tag.py in _relative_contents_end(self)
    207     @property
    208     def _relative_contents_end(self) -> int:
--> 209         return self._match.end('contents')

AttributeError: 'NoneType' object has no attribute 'end'

Cheers,
Nico

'typing' module not available before Python 3.5

According to the readme, wikitextparser supports Python 3.3+. However, the included file parameter.py imports the type 'Optional' from the 'typing' module. AFAICT, this module does not exist before Python 3.5 (see the documentation at https://docs.python.org/3/library/typing.html). And unfortunately, we're running 3.4 in our shop, so I get an
ImportError: No module named 'typing'

I guess either the documentation for wikitextparser should be changed to say that it requires 3.5, or the dependency on the 'typing' module should be removed. Obviously the former would be easier, although not to my liking...

Function `plain_text()` fails for HTML tags having multiple attributes without values

Hi,

test = '<ref firstattribute>Test</ref>'
wtp.parse(test).plain_text()

works fine, but

test = '<ref firstattribute secondattribute>Test</ref>'
wtp.parse(test).plain_text()

fails due to the second attribute without a value. Such attributes can occur, for example, in the form of data-xxx attributes.

I encountered that problem in https://en.wikipedia.org/wiki/Axon?oldid=739655195 with the following (invalid) markup:

<ref Axon amplifies somatic incomplete spikes to uniform>{{cite journal | last1 = Chen | first1 = Na | last2 = Yu | first2 = Jiandong | last3 = Qian | first3 = Hao | last4 = Jin-Hui | year = 2010 | doi=10.1371/journal.pone.0011868 | url = | journal = PLOS ONE | volume = 5 | issue = 7| page = e11868 | title=Axons Amplify Somatic Incomplete Spikes into Uniform Amplitudes in Mouse Cortical Pyramidal Neurons}}</ref>

Cheers,
Nico

Hangs on a large Wikipedia template

Parsing the following Wikipedia template (https://en.wikipedia.org/w/index.php?title=Template:Switcher&action=edit, 2020-06-11) seems to lead to an endless loop:

<div class="switcher-container"><!--
-->{{#if:{{{2|}}}|<div>{{{1|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|1|data-switcher-default=""}}>{{trim|{{{2|}}}}}</span></div>}}<!--
-->{{#if:{{{4|}}}|<div>{{{3|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|2|data-switcher-default=""}}>{{trim|{{{4|}}}}}</span></div>}}<!--
-->{{#if:{{{6|}}}|<div>{{{5|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|3|data-switcher-default=""}}>{{trim|{{{6|}}}}}</span></div>}}<!--
-->{{#if:{{{8|}}}|<div>{{{7|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|4|data-switcher-default=""}}>{{trim|{{{8|}}}}}</span></div>}}<!--
-->{{#if:{{{10|}}}|<div>{{{9|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|5|data-switcher-default=""}}>{{trim|{{{10|}}}}}</span></div>}}<!--
-->{{#if:{{{12|}}}|<div>{{{11|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|6|data-switcher-default=""}}>{{trim|{{{12|}}}}}</span></div>}}<!--
-->{{#if:{{{14|}}}|<div>{{{13|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|7|data-switcher-default=""}}>{{trim|{{{14|}}}}}</span></div>}}<!--
-->{{#if:{{{16|}}}|<div>{{{15|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|8|data-switcher-default=""}}>{{trim|{{{16|}}}}}</span></div>}}<!--
-->{{#if:{{{18|}}}|<div>{{{17|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|9|data-switcher-default=""}}>{{trim|{{{18|}}}}}</span></div>}}<!--
-->{{#if:{{{20|}}}|<div>{{{19|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|10|data-switcher-default=""}}>{{trim|{{{20|}}}}}</span></div>}}<!--
-->{{#if:{{{22|}}}|<div>{{{21|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|11|data-switcher-default=""}}>{{trim|{{{22|}}}}}</span></div>}}<!--
-->{{#if:{{{24|}}}|<div>{{{23|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|12|data-switcher-default=""}}>{{trim|{{{24|}}}}}</span></div>}}<!--
-->{{#if:{{{26|}}}|<div>{{{25|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|13|data-switcher-default=""}}>{{trim|{{{26|}}}}}</span></div>}}<!--
-->{{#if:{{{28|}}}|<div>{{{27|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|14|data-switcher-default=""}}>{{trim|{{{28|}}}}}</span></div>}}<!--
-->{{#if:{{{30|}}}|<div>{{{29|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|15|data-switcher-default=""}}>{{trim|{{{30|}}}}}</span></div>}}<!--
-->{{#if:{{{32|}}}|<div>{{{31|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|16|data-switcher-default=""}}>{{trim|{{{32|}}}}}</span></div>}}<!--
-->{{#if:{{{34|}}}|<div>{{{33|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|17|data-switcher-default=""}}>{{trim|{{{34|}}}}}</span></div>}}<!--
-->{{#if:{{{36|}}}|<div>{{{35|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|18|data-switcher-default=""}}>{{trim|{{{36|}}}}}</span></div>}}<!--
-->{{#if:{{{38|}}}|<div>{{{37|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|19|data-switcher-default=""}}>{{trim|{{{38|}}}}}</span></div>}}<!--
-->{{#if:{{{40|}}}|<div>{{{39|}}}<span class="switcher-label" style="display:none" {{#ifeq:{{{default|}}}|20|data-switcher-default=""}}>{{trim|{{{40|}}}}}</span></div>}}<!--
--></div><noinclude>
{{Documentation}}
</noinclude>

Probably related to #24

Tested with versions 0.29.0, 0.35.0, and 0.37.0.

Thanks for the otherwise great parser btw! I really appreciate its speed when parsing the whole English Wikipedia.

Parse item lists

This is not a bug, this is a feature request.

As far as I can see, currently, there is no direct support for parsing item lists.

That means for something like this:

The North American Numbering Plan Area includes:
*[[North American Numbering Plan|+1]] {{flag|Canada}}
*[[North American Numbering Plan|+1]] {{flag|United States}}, including United States territories:
**[[Area code 340|+1 340]] {{flag|United States Virgin Islands}}
**[[Area code 670|+1 670]] {{flag|Northern Mariana Islands}}

The API could be similar to what is available via sections or tables.

For example:

import wikitextparser as wtp
p = wtp.parse(text)
p.lists[0].items[0]
p.lists[0].level  # i.e. hierarchy level as indicated by the stars

Motivation: sometimes it is useful to be able to iterate over such item lists when extracting data.

My current workaround: use wikitextparser to traverse some high-level structure such as sections, call splitlines() on their contents, filter the lines for asterisks, and then re-parse those lines.
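That workaround can be sketched as a small helper (plain Python, not part of the wikitextparser API):

```python
def iter_list_items(wikitext):
    """Yield (level, item_text) for unordered-list lines (lines starting with '*')."""
    for line in wikitext.splitlines():
        stripped = line.lstrip()
        if stripped.startswith('*'):
            # The nesting level is the number of leading asterisks.
            level = len(stripped) - len(stripped.lstrip('*'))
            yield level, stripped.lstrip('*').strip()
```

Each yielded item can then be re-parsed with wtp.parse() to extract its wikilinks and templates.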

External links are not parsed correctly when they contain a wikilink in their text

Hey there,

it seems that external links are not parsed correctly if they contain a wikilink. This even leads to errors when trying to convert it to plain text.

import wikitextparser as wtp
test = '[http://www.example.com [[example wikilink]] some other text]'
print(wtp.parse(test).external_links)

produces an incorrectly parsed external link:

> [ExternalLink('[http://www.example.com [[example wikilink]')]

When calling

print(wtp.parse(test).plain_text())

we even get an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-dd08df7ca68f> in <module>
      3 """
      4 print(wtp.parse(test).external_links)
----> 5 print(wtp.parse(test).plain_text())

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    597         # because removing tags and wikilinks creates invalid spans.
    598         if replace_bolds:
--> 599             for b in parsed.get_bolds():
    600                 b[:] = b.text
    601         if replace_italics:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    888         ):
    889             for e in getattr(self, t):
--> 890                 bolds += e.get_bolds(True)
    891         return bolds
    892 

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    869         shadow = self._shadow
    870         for match in BOLDS_FINDITER(
--> 871             shadow, endpos=self._relative_contents_end
    872         ):
    873             ms = match.start(1)

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikilink.py in _relative_contents_end(self)
    145     @property
    146     def _relative_contents_end(self) -> int:
--> 147         return self._match.end(4)

AttributeError: 'NoneType' object has no attribute 'end'

Cheers,
Nico

Parsing [[http://example.com foo bar]]

>>> import wikitextparser as wtp
>>> p = wtp.parse('[[http://example.com foo bar]]')
>>> p.wikilinks
[WikiLink('[[http://example.com foo bar]]')]
>>> p.templates
[]
>>> p.external_links
[ExternalLink('[http://example.com foo bar]')]
>>> 

p.wikilinks should be empty.
Mediawiki parses [[http://example.com foo bar]] as an external link enclosed in textual brackets.
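Until this is fixed in the tokenizer, one workaround is to filter out "wikilinks" whose target is actually a URL. This is a rough heuristic (plain Python, not library API):

```python
def is_bracketed_external(target):
    # MediaWiki renders [[http://... ...]] as an external link wrapped in
    # literal brackets, so a wikilink target that looks like a URL is spurious.
    target = target.strip()
    return target.partition('://')[1] == '://' or target.startswith('//')
```

Filtering p.wikilinks with this predicate (applied to each link's target) should leave only genuine wikilinks.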

Only the first template is removed when using `plain_text()`

Hi there,
first of all - thanks for the great parser! It already saved me a lot of work. And I really like that you are actively adding new functionality.

I wanted to try the new plain-text feature and noticed the following problem:

wtp.parse('{{ eN : tEmPlAtE : <!-- c --> t_1 # b | a }} hello world').plain_text()

returns ' hello world' as expected.

wtp.parse('{{ eN : tEmPlAtE : <!-- c --> t_1 # b | a }} hello world {{ eN : tEmPlAtE : <!-- c --> t_1 # b | a }}').plain_text()

returns ' hello world {{ eN : tEmPlAtE : <!-- c --> t_1 # b | a }}', i.e. the second template is not removed.

There seems to be a similar problem for HTML tags as well, but I'm still investigating.

Cheers,
Nico

Function `plain_text()` fails if a wikilink contains another wikilink in its text

Hi,

as the title already says:

test = '[[A Wikilink|Text of a Wikilink [[Another Wikilink]]]]'
print([e.plain_text() for e in wtp.parse(test).wikilinks])

returns

> ['', 'Another Wikilink']

More problematic,

test = """
[[A Wikilink|Text of a Wikilink [[Another Wikilink|''Italic Text'']]]]
"""
print([e.plain_text() for e in wtp.parse(test).wikilinks])

throws the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-cc1298f4052b> in <module>
      2 [[A Wikilink|Text of a Wikilink [[Another Wikilink|''Italic Text'']]]]
      3 """
----> 4 print([e.plain_text() for e in wtp.parse(test).wikilinks])

<ipython-input-39-cc1298f4052b> in <listcomp>(.0)
      2 [[A Wikilink|Text of a Wikilink [[Another Wikilink|''Italic Text'']]]]
      3 """
----> 4 print([e.plain_text() for e in wtp.parse(test).wikilinks])

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    601         if replace_italics:
    602             for i in parsed.get_italics():
--> 603                 i[:] = i.text
    604         if replace_parameters:
    605             for p in parsed.parameters:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_comment_bold_italic.py in text(self)
     38         """Return text value of self (without triple quotes)."""
     39         # noinspection PyUnresolvedReferences
---> 40         return self._match[1]
     41 
     42     @text.setter

TypeError: 'NoneType' object is not subscriptable

Cheers,
Nico

Compare wikitextparser against mwparserfromhell

The README contains a link to mwparserfromhell. A short comparison to it would be useful for readers new to wikitextparser.

Otherwise one might ask: What should I use? Why another parser if there is already mwparserfromhell? What are the benefits of using wikitextparser over mwparserfromhell? Where are the fundamental differences in the approaches of the two?

For example, the wikitextparser API explicitly supports table parsing, where mwparserfromhell does not.

How to walk through the parsed result?

Hi!

Maybe I am missing something obvious, but I can't really find a way to get the parsed result in a serializable form: all subelements are available via their own properties (templates, parser_functions, etc.), but is it possible to get them as a stream, in the same order as they appear in the text?

Something as simple as .children or similar, where each child has a type and can in turn be asked about its own children? Or maybe a tree-walker. The etree library is a good example of how this is done for XML.

If it's supported at all, it should, in my opinion, be illustrated in the readme right away. If it's not, maybe it can be stated that this is not the purpose of the library.

Link parsing bug when text ends with ]

Using both the latest packaged version and the current master version (0.30.0.dev0), I encountered this issue with a particular link that I found in the wild. Example:

>>> link = WikiText('[[example (test)|example [test]]]').wikilinks[0]

>>> link.text
'example [test'

>>> link.string
'[[example (test)|example [test]]'

The ] at the end of the text is treated as part of the closer for the link.

If a space is inserted at the end of the text, it works as intended:

>>> link = WikiText('[[example (test)|example [test] ]]').wikilinks[0]

>>> link.text
'example [test] '

>>> link.string
'[[example (test)|example [test] ]]'

Getting list directly under section

I'm wondering if there's a way to get only the list directly under the current section, but not the lists in this section's subsections.

For example, the word free has a list of definitions under the Adjective section heading, but if I call section.lists() on the Adjective section, it gives a list of all lists found in every subsection of Adjective.

Is there a way to just get the list of definitions under Adjective and not the lists from the other subsections?

Infinite or near-infinite parsing on certain input

Hello! I'm parsing a bunch of wikis on 0.28.1 and came across this monstrous template on zhwiki (https://zh.wikipedia.org/w/index.php?title=Template:IfdLinkNext&action=edit) that spins my CPU forever. Any ideas what is going on?

Repro:

import wikitextparser as wtp
test = """
{{{{{subst|}}}#ifexpr:{{{{{subst|}}}#time:Ymd}}<={{{{{subst|}}}#time:Ymd|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>}}
|
{{{{{subst|}}}#if:{{void}}|{{<includeonly>IfdLinkNext</includeonly>|subst=}}|<span style="color:#808080" title="下次記錄的時間尚未到">→</span> }}
|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +1 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +1 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +1 day}}→]]|
<span style="color:#C0C0C0;">…</span> - {{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +2 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +2 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +2 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +3 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +3 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +3 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +4 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +4 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +4 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +5 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +5 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +5 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +6 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +6 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +6 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +7 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +7 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +7 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +8 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +8 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +8 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +9 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +9 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>  +9 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +10 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +10 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +10 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +11 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +11 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +11 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +12 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +12 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +12 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +13 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +13 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +13 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +14 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +14 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +14 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +15 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +15 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +15 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +16 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +16 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +16 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +17 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +17 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +17 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +18 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +18 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +18 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +19 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +19 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +19 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +20 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +20 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +20 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +21 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +21 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +21 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +22 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +22 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +22 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +23 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +23 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +23 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +24 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +24 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +24 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +25 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +25 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +25 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +26 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +26 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +26 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +27 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +27 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +27 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +28 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +28 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +28 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +29 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +29 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +29 day}}→]]|
{{{{{subst|}}}#ifexist:Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +30 day}}|[[Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y/m/d|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +30 day}}|{{{{{subst|}}}#time:n月j日|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +30 day}}→]]|

{{{{{subst|}}}#if:{{void}}
|
{{<includeonly>IfdLinkNext</includeonly>|subst=}}
|
{{{{{subst|}}}#ifexpr:{{{{{subst|}}}#time:Ymd|-30 day}}>{{{{{subst|}}}#time:Ymd|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly>}}
|
[[Special:PrefixIndex/Wikipedia:檔案存廢討論/記錄/{{{{{subst|}}}#time:Y|<includeonly>{{{{{subst|}}}#titleparts:{{{{{subst|}}}FULLPAGENAME}}||-3}}</includeonly> +30 day}}|<small title="下次記錄在30天後,系統不能偵測超過30天範圍的記錄,請按此手動尋找,並把本連接取代成正確日期">(未能偵測)</small>→]]
|
<span style="color:#808080" title="下次的記錄尚未建立">→</span>
}}
}}

}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
}}<noinclude>
----
{{pp-protected}}
{{esoteric}}
本模板被用於[[Wikipedia:檔案存廢討論]]
</noinclude>
"""
wtp.parse(test)

Function `plain_text()` not working correctly for external links in extension tags

Hey!
I've found another small one:

wtp.parse('<ref>[http://www.example.com example text], [[Example Entity]], 2 March 2000</ref>').plain_text()

returns '[http://www.example.com example text], Example Entity, 2 March 2000'
while

wtp.parse('<div>[http://www.example.com example text], [[Example Entity]], 2 March 2000</div>').plain_text()

returns 'example text, Example Entity, 2 March 2000'

Cheers,
Nico

MemoryError with unclosed comment tags

FYI there are two articles in the current English file with unclosed comment tags that cause a MemoryError

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spam/LinkReports/fastcompany.com
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spam/LinkReports/egovmonitor.com

Example stack trace:

2019-01-30 17:20:28 ERROR Executor:91 - Exception in task 1517.2 in stage 1.0 (TID 1573)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/wikipedia-spark.py", line 34, in parse_wt
  File "/usr/local/lib/python3.7/site-packages/wikitextparser/_wikitext.py", line 121, in __init__
    type_to_spans = self._type_to_spans = parse_to_spans(byte_array)
  File "/usr/local/lib/python3.7/site-packages/wikitextparser/_spans.py", line 183, in parse_to_spans
    for match in COMMENT_FINDITER(byte_array):
MemoryError

Tag extensions are parsed incorrectly when they have an invalid attribute value

Real example from: https://en.wikipedia.org/wiki/Alkene?oldid=741328573

test = '<ref name="Wade"309>{{cite book | last = Wade |  pages = 309 }}</ref>'
wtp.parse(test).plain_text()

throws the error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-d2a558242134> in <module>
      1 test = '<score name="Wade"309>{{cite book | last = Wade |  pages = 309 }}</score>'
----> 2 wtp.parse(test).plain_text()

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in plain_text(self, replace_templates, replace_parser_functions, replace_parameters, replace_tags, replace_external_links, replace_wikilinks, unescape_html_entities, replace_bolds, replace_italics, _mutate)
    598         # get_bolds() will try to look into wikilinks for bold parts.
    599         if replace_bolds:
--> 600             for b in parsed.get_bolds():
    601                 b[:] = b.text
    602         if replace_italics:

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    889         ):
    890             for e in getattr(self, t):
--> 891                 bolds += e.get_bolds(False)
    892         return bolds
    893 

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_wikitext.py in get_bolds(self, recursive)
    870         shadow = self._shadow
    871         for match in BOLDS_FINDITER(
--> 872             shadow, endpos=self._relative_contents_end
    873         ):
    874             ms = match.start(1)

~/anaconda3/lib/python3.7/site-packages/wikitextparser/_tag.py in _relative_contents_end(self)
    207     @property
    208     def _relative_contents_end(self) -> int:
--> 209         return self._match.end('contents')

AttributeError: 'NoneType' object has no attribute 'end'

It seems to fail for all parsable and unparsable tag extensions. For other tags (e.g. span) it works fine.

Cheers,
Nico

data lists are not parsed as expected

I have used your suggestion to get most of the way there. I am coming across another problem that could be added to this issue.

If you look at the entry for noun phrase you'll see the following markup:

===Noun===
{{en-noun}}

# {{lb|en|grammar}} A [[phrase]] that can serve as the [[subject]] or the [[object]] of a [[verb]]; it is usually headed by a [[noun]], (including [[pronoun]]s), with any associated [[dependent]]s such as [[determiner]]s or [[modifier]]s.

; Examples
* The term “noun phrase” itself
* “Fred” in “Fred fell asleep at the keyboard.”
* “The day Fred . . . keyboard” in “The day Fred fell asleep at the keyboard was very hot, and he had had too much to drink at lunchtime.”

; Additional examples:
* banana (a noun)
* big bananas (an [[adjective]] 'big', and a plural [[noun]])
* a big banana (an [[article]] 'a', an adjective and a singular noun)
* this big banana (a [[determiner]] 'this', an adjective and a singular noun)
* a very big banana (an article, an [[adverb]] 'very', defining an adjective, and a singular noun)
* a very big banana that tastes great (an article, an adverb defining an adjective, and a singular noun; followed by a [[relative clause]] made up of a [[relative pronoun]] 'that', a [[verb]] 'tastes', and an adjective 'great')

====Translations====
...

The words "Examples" and "Additional examples" are subheadings under Noun, but when running sections.lists() you get something like this data structure (ignoring translations):

[
	WikiList('# {{lb|en|grammar}} A [[phrase]] that can serve as the [[subject]] or the [[object]] of a [[verb]]; ...'), 
	WikiList('* The term “noun phrase” itself\n* “Fred” in “Fred fell asleep at the keyboard.”\n* “The day Fred ...'), 
	WikiList("* banana (a noun)\n* big bananas (an [[adjective]] 'big', and a plural [[noun]])\n* a big banana ..."), 
	WikiList('; Examples\n'), 
	WikiList('; Additional examples:\n')
]

Where the Examples headings are showing up as their own lists and their actual list elements show up above them.

Originally posted by @BrendanMartin in #23 (comment)

Header attribute of Cell is not resolved correctly

Hi,

import wikitextparser as wtp

test = wtp.parse("""
{|class=wikitable style=font-size:97%
!Edition
!Stage
!Date
!Location
!Against
!Surface
!Partner
!Opponents
!W/L
!Score
|-style="background:#ccf;"
| rowspan="2" | [[2013 Fed Cup]] <br/> Europe/Africa Zone Group I
| rowspan="2" | [[2013 Fed Cup Europe/Africa Zone Group I – Pool B|R/R]]
| 7 February 2013
| rowspan="2" | [[Eilat]], Israel
| {{flagicon|HUN}} [[Hungary Fed Cup team|Hungary]]
| rowspan="2" | Hard
| {{flagicon|POR}} [[Margarida Moura]]
| {{flagicon|HUN}} [[Réka-Luca Jani]] <br/> {{flagicon|HUN}} [[Katalin Marosi]]
| style="text-align:center; background:#ffa07a;"|L
| 4–6, 2–6
|-style="background:#ccf;"
| 8 February 2013
| {{flagicon|GBR}} [[Great Britain Fed Cup team|Great Britain]]
| {{flagicon|POR}} [[Michelle Larcher de Brito]]
| {{flagicon|GBR}} [[Laura Robson]] <br/> {{flagicon|GBR}} [[Heather Watson]]
| style="text-align:center; background:#ffa07a;"|L
| 2–6, 1–6
|}
""")

test.get_tables()[0].cells(row=0)[0]._header

returns False, probably because the string value of the cell is \n!Edition while the header flag is derived from header = match_row[0]['sep'] == '!'.
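As a stopgap until the separator detection is fixed, the header flag can be re-derived from the cell's own source string. This is a minimal sketch; looks_like_header is a hypothetical helper, not part of the library:

```python
def looks_like_header(cell_string):
    """Guess whether a raw table-cell string is a header cell.

    Leading whitespace and newlines are stripped before checking the
    separator character, so '\\n!Edition' is still recognised as a header.
    """
    return cell_string.lstrip().startswith('!')

print(looks_like_header('\n!Edition'))    # header cell
print(looks_like_header('| 4-6, 2-6'))    # ordinary data cell
```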

Cheers,
Nico

Template parsing appears to break with comments

When I parse the wikitext from this page https://en.wikipedia.org/w/index.php?title=Panama_City,_Florida&action=edit
it doesn't appear to extract the infobox as a template. I suspect this is because the infobox template begins as:

{{Infobox settlement
<!-- Basic info ---------------->| name = Panama City, Florida
| official_name = City of Panama City
| other_name =

and it is having trouble parsing it with the comment there.

This may be a similar parsing issue to #51 where a new parsing algorithm was introduced to fix bold&italic parsing
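Until the parser handles comments in this position, a blunt workaround is to strip HTML comments before parsing. This is a sketch using only the standard library; note that it discards every comment, including ones you might want to keep:

```python
import re

# HTML comments may span multiple lines, hence DOTALL; the pattern is
# non-greedy so each comment is matched individually.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_comments(wikitext):
    """Remove all <!-- ... --> comments from a wikitext string."""
    return COMMENT_RE.sub('', wikitext)

src = ('{{Infobox settlement\n'
       '<!-- Basic info ---------------->| name = Panama City, Florida\n'
       '| official_name = City of Panama City\n}}')
print(strip_comments(src))
```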

First list item not parsed if it is in the same line

Input

==Details==
{{Quest details
|requirements = *{{Skill clickpic|Mining|40}} [[Mining]] (boostable)
*The ability to defeat a level 170 [[Demons|demon]] (can be [[safespot]]ted)
}}

parsed = wtp.parse(text)
for t in parsed.templates:
    for arg in t.arguments:
        for lst in arg.lists():
            for item in lst.fullitems:
                print(f">> {item.strip()}")

I'd expect to see both lines, but I only see one.

If I apply replace("= *", "=\n*") to the input text (moving inline items onto their own line), the problem is fixed.
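That workaround can be generalized a little: push any list item that starts on the same line as a parameter value onto its own line before parsing. This is a sketch with the standard library; the pattern is an assumption about what the inputs look like, so tune it for your data:

```python
import re

# '|param = *item' -> '|param =\n*item', so the first list item starts
# a fresh line; the lookahead limits the rewrite to list markers.
INLINE_ITEM_RE = re.compile(r'=[ \t]*(?=[*#;:])')

def normalize_inline_items(wikitext):
    return INLINE_ITEM_RE.sub('=\n', wikitext)

print(normalize_inline_items('|requirements = *{{Skill clickpic|Mining|40}}'))
```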

Parsing table from wikimedia dump

Is there any way we can parse a table/infobox from a Wikimedia dump?
Suppose, from the MediaWiki text, we have a template call like:

{{Succession table monarch
|name1 = '''''[[Grand Duke Michael Alexandrovich of Russia|Michael II]]'''''<br>''Michael Aleksandrovich''
|nickname1 = Михаил II Александрович
|life1 = 4 December 1878<br>–<br>13 June 1918
|reignstart1= 15 March 1917
|reignend1 = 16 March 1917
|notes1 = Son of Alexander III<br>Abdicated after a nominal reign of only 18 hours,<br>ending dynastic rule in Russia<ref>Montefiore, Simon S. (2016) ''The Romanovs, 1613–1918'' London: Weidenfeld & Nicolson, pp. 619–621</ref><br>He is not usually recognised as a tsar, as Russian law did not allow Nicholas II to disinherit his son<ref>{{cite web|url=https://www.russianlegitimist.org/the-abdication-of-nicholas-ii-100-years-later/|website=The Russian Legitimist|title= The Abdication of Nicholas II: 100 Years Later|accessdate=30 January 2018}}</ref>
|family1 = [[House of Romanov#House of Holstein-Gottorp-Romanov|Holstein-Gottorp-Romanov]]
|image1 = Mihail II.jpg
|name2 = '''''[[Grand Duke Nikolai Nikolaevich of Russia (1856–1929)|Nikolai Nikolaevich]]'''''
|nickname2 = Николай Николаевич
|native2 =
|life2 = 6 November 1856<br>–<br>5 January 1929
|reignstart2= 8 August 1922
|reignend2 = 25 October 1922
|notes2 = Grandson of [[Nicholas I of Russia|Nicholas I]]<br>Proclaimed Emperor of Russia by the [[Zemsky Sobor]] of the [[Provisional Priamurye Government]]<br>His nominal rule came to an end when the areas controlled by the Provisional Priamurye Government were overrun by the communists
|family2 = [[House of Romanov#House of Holstein-Gottorp-Romanov|Holstein-Gottorp-Romanov]]
|image2 = Николай_Николаевич_Младший%2C_до_1914.jpg
|alt2 =
|name3 = '''''[[Grand Duke Kirill Vladimirovich of Russia|Kirill Vladimirovich]]'''''
|nickname3 = Кири́лл Влади́мирович Рома́нов
|life3 = 30 September 1876<br>–<br>12 October 1938
|reignstart3= 31 August 1924
|reignend3 = 12 October 1938
|notes3 = Grandson of Alexander II<br>Claimed the title Emperor of All the Russias while in exile<ref name=adg>{{cite book|year=1998|publisher=Almanach de Gotha|edition=182nd|title=Almanach de Gotha|page=214}}</ref><br>Recognised by a congress of legitimists delegates in Paris in 1926<ref>Shain, Yossi ''The Frontier of Loyalty: Political Exiles in the Age of the Nation-State'' University of Michigan Press (2005) p.69.</ref>
|family3 = [[House of Romanov#House of Holstein-Gottorp-Romanov|Holstein-Gottorp-Romanov]]
|image3 = Grand Duke Kirill Vladimirovich Romanov.JPG
|alt3 =
}}

TIA
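For reference, extracting the parameters is what the library's Template.arguments is for (see the overview at the top of this page). Purely as an illustration of the underlying idea, here is a naive standard-library sketch that splits a template body on pipes at nesting depth zero; it ignores tables, comments, and positional parameters, so do not rely on it for real dumps:

```python
def template_args(wikitext):
    """Split '{{name|a=1|b=[[x|y]]}}' into (name, {arg: value}).

    Tracks {{ }} and [[ ]] nesting so pipes inside nested templates
    and wikilinks do not split the outer argument list.
    """
    body = wikitext.strip()
    if not (body.startswith('{{') and body.endswith('}}')):
        raise ValueError('not a template call')
    body = body[2:-2]
    parts, depth, current, i = [], 0, [], 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ('{{', '[['):
            depth += 1
            current.append(two)
            i += 2
        elif two in ('}}', ']]'):
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == '|' and depth == 0:
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append(''.join(current))
    name, args = parts[0].strip(), {}
    for part in parts[1:]:
        key, _, value = part.partition('=')
        args[key.strip()] = value.strip()
    return name, args

name, args = template_args(
    '{{Succession table monarch|name1 = [[A|B]]|life1 = 1878}}')
```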

Exposing other row and cell attributes of tables

Currently the parser for table data (Table.getdata) only returns the cells' content, but ignores all row attributes except rowspan and all cell attributes except colspan.

Both NEWLINE_CELL_REGEX and INLINE_HAEDER_CELL_REGEX already contain named groups for all attributes. Would it be possible to expose these in getdata (or possibly another method)? I was thinking the format could be tuples [(row_attributes, row_data), ...], where row_data is [(cell_attributes, cell_data), ...] (instead of the current [[cell_data, ...], ...]).
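To make the proposal concrete, here is the shape such a return value could take, with purely illustrative literal values (this is not an existing API):

```python
# Proposed attribute-aware shape:
#   [(row_attributes, [(cell_attributes, cell_data), ...]), ...]
table = [
    ({'style': 'background:#ccf;'}, [
        ({'rowspan': '2'}, '[[2013 Fed Cup]]'),
        ({}, '7 February 2013'),
    ]),
    ({}, [
        ({}, '8 February 2013'),
    ]),
]

# Unpacking a row gives its attributes and its cells.
row_attrs, cells = table[0]
print(row_attrs['style'], cells[0][1])
```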

Parsing of Tag's Contents

Consider the following,

t = """
== QWER ==
{{abcdef}}
<abc>124</abc>
"""

p = wtp.parse(t)
p.tags()[0].parsed_contents
# Outputs "SubWikiText('WER ==\n{{ab')"

I don't think it is parsing the contents of the tag <abc>. Is this working as expected?

Strip markup function?

Hi, thank you for this library!
mwparserfromhell.wikicode has a function strip_code, from its docs:

Return a rendered string without unprintable code such as templates.

Basically, it removes all MediaWiki markup and returns plain text.
I didn't find anything like this in this library. Can I somehow get plain text from the WikiText class?
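Other reports on this page do call a plain_text() method, so newer releases may provide exactly this. As a very crude standard-library approximation in the meantime (it handles only a few common constructs, strips only innermost templates, and is no substitute for a real parser):

```python
import re

def crude_plain_text(wikitext):
    """Roughly strip comments, innermost templates, wikilinks, and
    bold/italic quote runs from a wikitext string."""
    text = re.sub(r'<!--.*?-->', '', wikitext, flags=re.DOTALL)       # comments
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)                        # innermost templates
    text = re.sub(r'\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]', r'\1', text)  # wikilinks -> label
    text = re.sub(r"'{2,5}", '', text)                                # bold/italic quotes
    return text

print(crude_plain_text("'''Bold''' [[target|label]] {{cite web}}"))
```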

Sections title with a trailing space

Hi,

First of all, thanks for this awesome library!

I'm using it to edit thousands of pages every day on the French Wiktionary. In some rare cases I saw that when I used the contents setter on a Section object, the text was appended directly next to the title, on the same line, rather than on the line below (for example: https://fr.wiktionary.org/w/index.php?diff=25894594).

I found that it happens when there is a trailing white space character at the end of the section's title, e.g. "==this is a title== ".

As a workaround, I'm currently removing all those trailing white spaces before parsing the wikicode:
re.sub(r"== +\n", "==\n", wikicode)

Thanks.
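The same workaround, generalized to any heading level and any amount of trailing whitespace, could look like this (a sketch; the MULTILINE flag anchors $ at each line end):

```python
import re

# Strip spaces/tabs between the closing '=' run of a heading and the
# end of its line, so '==title== ' becomes '==title=='.
TRAILING_WS_RE = re.compile(r'^(=+[^=\n]+=+)[ \t]+$', re.MULTILINE)

def clean_heading_whitespace(wikicode):
    return TRAILING_WS_RE.sub(r'\1', wikicode)

print(clean_heading_whitespace('==this is a title== \ntext'))
```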

Trouble parsing rowspan in table

I get the following error when trying to parse tables from the example.wiki file below:

$ ./example.py 
Traceback (most recent call last):
  File "./example.py", line 6, in <module>
    data = p.tables[0].getdata()
  File "/home/dan/work/wikitextparser/wikitextparser/_table.py", line 111, in getdata
    return self.data(span)
  File "/home/dan/work/wikitextparser/wikitextparser/_table.py", line 164, in data
    table_data = _apply_attr_spans(table_attrs, table_data)
  File "/home/dan/work/wikitextparser/wikitextparser/_table.py", line 411, in _apply_attr_spans
    rowspan = int(attrs_get(b'rowspan', 1))
ValueError: invalid literal for int() with base 10: b''

There is something slightly off about the regexes that parse attributes: an empty rowspan is being parsed from 'rowspan="2"'. I tried to debug for a while without success.

Below is example.py to reproduce the error:

import wikitextparser as wtp
p = wtp.parse(open("example.wiki").read())
data = p.tables[0].getdata()

The example.wiki file is in this pastebin.

It is from https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_1996.

bold&italic parsing doesn't work

Test case:
Parse the wikitext contents of https://en.wikipedia.org/wiki/Supreme_Clientele
Then call .plain_text() on it.

Resulting error:

  File "_comment_bold_italic.py", line 41, in text
    return self._match[1]
TypeError: 'NoneType' object is not subscriptable

Probable cause:
Seems to occur when the parser attempts to parse italic within bold text (i.e. 5 quote marks '''''like this''''')

`data()` function of table fails with non-integer rowspan/colspan

Calling

import wikitextparser as wtp
x = """
{| BORDER="0" CELLPADDING="3" CELLSPACING="0"
|- ALIGN="center" bgcolor="#e0e0e0"
! colspan="1" bgcolor="#ffffff" | &nbsp;
! rowspan="99" bgcolor="#ffffff" | &nbsp;
! rowspan="99" bgcolor="#ffffff" | &nbsp;
|- ALIGN="center" bgcolor="#e0e0e0"
! Season !! GP !!G !! A !! Pts !! PPG !!
|- ALIGN="center" bgcolor="#f0f0f0"
|    1998    ||    15    ||    1    ||    4    ||    5      || -- ||
|- ALIGN="center"
|    1999    ||    13    ||    11    ||    24    ||    35      || -- ||
|- ALIGN="center" bgcolor="#f0f0f0"
|    2000    ||    15      ||    12    ||    27    ||    39     || -- ||    
|- ALIGN="center"
|    2001    ||    15      ||    19    ||    19    ||    38      || -- ||    
|- ALIGN="center"  bgcolor="#e0e0e0"
! colspan="1.5" |Totals    !!    58    !!    43    !!    74    !! 117 !! -- !!
|}
"""
wtp.parse(x).tables[0].data()

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nheist/anaconda3/lib/python3.8/site-packages/wikitextparser/_table.py", line 180, in data
    table_data = _apply_attr_spans(table_attrs, table_data)
  File "/Users/nheist/anaconda3/lib/python3.8/site-packages/wikitextparser/_table.py", line 376, in _apply_attr_spans
    colspan = int(attrs_get(b'colspan', 1))
ValueError: invalid literal for int() with base 10: b'1.5'

Data is taken from https://en.wikipedia.org/wiki/Matt_Striebel?oldid=739100373

Cheers,
Nico
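Browsers coerce such values leniently (HTML parses the leading digits and treats invalid spans as 1), so a defensive reimplementation could fall back to a default instead of raising. A sketch of the idea; parse_span is hypothetical, not the library's actual code:

```python
def parse_span(raw, default=1):
    """Parse a rowspan/colspan attribute value defensively.

    Accepts bytes or str; empty or non-numeric values (e.g. b'' or
    b'1.5' truncated to 1 < default? no: truncated to its integer
    part) fall back to the default instead of raising ValueError.
    """
    if isinstance(raw, bytes):
        raw = raw.decode('ascii', 'replace')
    try:
        value = int(float(raw))
    except (TypeError, ValueError):
        return default
    return value if value >= 1 else default

print(parse_span(b'2'), parse_span(b'1.5'), parse_span(b''), parse_span(None))
```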

Function `plain_text()` not working correctly for sections

Hi there,

the plain_text() method shows weird behavior when applied to individual sections of a document. Some text of the later sections is cropped off, so it seems that the original WikiText object is somehow mutated even though the _mutate flag is False.

The following toy example shows the behavior:

import wikitextparser as wtp
test = """
Hello world.

==Section 1==
Text of Section 1 <ref>a tag</ref>

==Section 2==
Some text that is not displayed when using the plain_text method
"""

print([(s.title, s.plain_text()) for s in wtp.parse(test).get_sections()])
> [(None, '\nHello world.\n\n'), ('Section 1', 'ext of Section 1 a tag\n\n'), ('Section 2', 'n_text method\n')]

Cheers,
Nico

Parser does not terminate on table access

Steps to reproduce

Download the input:

curl -L 'http://en.wikipedia.org/w/index.php?title=Mobile_country_code&action=raw' > mcc.wtext

Copy and paste into a python3 session:

import pycountry as pyc
import wikitextparser as wtp
s = open('mcc.wtext').read()
wt = wtp.parse(s)
wt.sections[4].tables[0].getdata()

Expected result:

A print of the actual (small) table, after a few milliseconds.

Actual result:

Call doesn't terminate.

After a Ctrl+C:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/juser/.local/lib/python3.4/site-packages/wikitextparser/wikitextparser.py", line 394, in     tables
    for m in TABLE_REGEX.finditer(shadow):
KeyboardInterrupt

I've installed the package today via pip3 (i.e. 0.7.5).
