tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

License: Other

Lua 12.64% Shell 0.06% Python 87.24% Makefile 0.06%
wikitext scribuntu wikipedia wiktionary mediawiki

wikitextprocessor's Introduction

wikitextprocessor

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:

  • Parsing dump files, including built-in support for processing pages in parallel
  • Wikitext syntax parser that converts the whole page into a parse tree
  • Extracting template definitions and Scribunto Lua module definitions from dump files
  • Expanding selected templates or all templates, and heuristically identifying templates that need to be expanded before parsing is reasonably possible (e.g., templates that emit table start and end tags)
  • Processing and expanding wikitext parser functions
  • Processing, executing, and expanding Scribunto Lua modules (these are used very widely in Wiktionary, for example to generate IPA strings for many languages)
  • Controlled expansion of parts of pages, for applications that parse the overall page structure first and then expand templates only in certain sections of the page
  • Capturing information from template arguments while expanding them, as template arguments often contain useful information not available in the expanded content.

This module is primarily intended as a building block for other packages that process Wiktionary or Wikipedia data, particularly for data extraction. You will need to write code to use this.

For pre-existing extraction modules that use this package, please see:

  • Wiktextract for extracting rich machine-readable dictionaries from Wiktionary. You can also find pre-extracted machine-readable Wiktionary data in JSON format at kaikki.org.

Getting started

Installing

Install from source:

git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Running tests

This package includes tests written using the unittest framework. The test dependencies can be installed with the command python -m pip install -e .[dev].

To run the tests, use the following command in the top-level directory:

make test

To run a specific test, use the following syntax:

python -m unittest tests.test_[module].[Module]Tests.test_[name]
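For example, to run a single test (the module, class, and test names here are hypothetical; substitute names from the tests/ directory):

python -m unittest tests.test_parser.ParserTests.test_italic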

Python's unittest framework help and options can be accessed through:

python -m unittest -h

Obtaining WikiMedia dump files

This package is primarily intended for processing Wiktionary and Wikipedia dump files (though you can also use it for processing individual pages or other files that are in wikitext format). To download WikiMedia dump files, go to the dump download page. We recommend using the <name>-<date>-pages-articles.xml.bz2 files.
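For example, the English Wiktionary dump can typically be fetched with a command along these lines (the exact file name depends on the dump date):

wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2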

API documentation

Usage example:

from functools import partial
from typing import Any

from wikitextprocessor import Wtp, WikiNode, NodeKind, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    wtp.start_page(page.title)
    # process parse tree
    tree = wtp.parse(page.body)
    # or get expanded plain text
    text = wtp.expand(page.body)

wtp = Wtp(
    db_path="en_20230801.db", lang_code="en", project="wiktionary"
)

# extract dump file then save pages to SQLite file
process_dump(
    wtp,
    "enwiktionary-20230801-pages-articles.xml.bz2",
    {0, 10, 110, 828},  # namespace id, can be found at the start of dump file
)

for _ in map(
    partial(page_handler, wtp), wtp.get_all_pages([0])
):
    pass

The basic operation is as follows:

  • Extract templates, modules, and other pages from the dump file and save them in a SQLite file
  • Heuristically analyze which templates need to be pre-expanded before parsing to make sense of the page structure (this cannot detect templates that call Lua code that outputs wikitext that affects parsed structure). These first steps together are called the "first phase".
  • Process the pages again, calling a page handler function for each page. The page handler can extract, parse, and otherwise process the page, and has full access to templates and Lua macros defined in the dump. This may call the page handler in multiple processes in parallel. Return values from the page handler calls are returned to the caller. This is called the second phase.

Most of the functionality is hidden behind the Wtp object. WikiNode objects are used for representing the parse tree that is returned by the Wtp.parse() function. NodeKind is an enumeration type used to encode the type of a WikiNode.

class Wtp

def __init__(
    self,
    db_path: Optional[Union[str, Path]] = None,
    lang_code="en",
    template_override_funcs: Dict[str, Callable[[Sequence[str]], str]] = {},
    project: str = "wiktionary",
):

The initializer can usually be called without arguments, but recognizes the following arguments:

  • db_path can be None, in which case a temporary database file will be created under /tmp, or a path to the database file that contains page texts and other data from the dump file. There are two reasons why you might want to set this: 1) you don't have enough space on /tmp (about 3.4 GB for the English dump file), or 2) testing. If you specify the path and the database file already exists, that file will be used, eliminating the time needed for phase 1 (this is very important for testing, as it allows single pages to be processed reasonably fast; see the sketch after this list). In that case, you should not call Wtp.process() but instead use Wtp.reprocess(), or just call Wtp.expand() or Wtp.parse() on wikitext that you have obtained otherwise (e.g., from a file). If the file doesn't exist, you will need to call Wtp.process() to parse a dump file, which will initialize the database file during the first phase. If you wish to re-create the database, remove the old file first.
  • lang_code - the language code of the dump file.
  • template_override_funcs - Python functions for overriding expanded template text.
  • project - "wiktionary" or "wikipedia".
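As a minimal sketch of the db_path reuse mentioned above (assuming en_20230801.db was created by an earlier run against the English Wiktionary dump):

from wikitextprocessor import Wtp

# Reusing an existing database skips phase 1 entirely.
wtp = Wtp(db_path="en_20230801.db", lang_code="en", project="wiktionary")
wtp.start_page("example")
tree = wtp.parse("{{en-noun}}", pre_expand=True)   # parse with pre-expansion
text = wtp.expand("{{PAGENAME}}")                  # or expand wikitext directly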
def read_by_title(
    self, title: str, namespace_id: Optional[int] = None
) -> Optional[str]:

Reads the contents of the page with the specified title from the cache file. There is usually no need to call this function explicitly, as Wtp.process() and Wtp.reprocess() normally load the page automatically. This function does not automatically call Wtp.start_page().

Arguments are:

  • title - the title of the page to read
  • namespace_id - namespace id number; this argument is required if the title doesn't have a namespace prefix such as Template:.

This returns the page contents as a string, or None if the page does not exist.
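For instance (namespace id 10 is the standard MediaWiki Template namespace):

body = wtp.read_by_title("Template:en-noun")
# or, without a namespace prefix in the title:
body = wtp.read_by_title("en-noun", namespace_id=10)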

def parse(
    self,
    text: str,
    pre_expand=False,
    expand_all=False,
    additional_expand=None,
    do_not_pre_expand=None,
    template_fn=None,
    post_template_fn=None,
) -> WikiNode:

Parses wikitext into a parse tree (WikiNode), optionally expanding some or all of the templates and Lua macros in the wikitext (using the definitions for the templates and macros in the cache files, as added by Wtp.process() or calls to Wtp.add_page()).

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates, Lua macros, and error messages). The Wtp.process() and Wtp.reprocess() functions will call it automatically.

This accepts the following arguments:

  • text (str) - the wikitext to be parsed
  • pre_expand (boolean) - if set to True, the templates that were heuristically detected as affecting parsing (e.g., expanding to table start or end tags or list items) will be automatically expanded before parsing. Any Lua macros those templates use may also be called.
  • expand_all - if set to True, expands all templates and Lua macros in the wikitext before parsing.
  • additional_expand (set or None) - if this argument is provided, it should be a set of template names that should be expanded in addition to those specified by the other options (i.e., in addition to the heuristically detected templates if pre_expand is True, or only these templates if it is False; this option is meaningless if expand_all is set to True).

This returns the parse tree. See below for documentation of the WikiNode class used for representing the parse tree.
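A small sketch of parsing with selective expansion (the template name passed in additional_expand is a hypothetical example, not part of the API):

wtp.start_page("example")
tree = wtp.parse(
    "==English==\nSome text with {{my-table-template}}.",
    pre_expand=True,
    additional_expand={"my-table-template"},
)
print(tree)  # prints a human-readable dump of the parse tree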

def node_to_wikitext(self, node)

Converts a part of a parse tree back to wikitext.

  • node (WikiNode, str, list/tuple of these) - This is the part of the parse tree that is to be converted back to wikitext. We also allow strings and lists, so that node.children can be used directly as the argument.
def expand(self, text, template_fn=None, post_template_fn=None,
           pre_expand=False, templates_to_expand=None,
           expand_parserfns=True, expand_invoke=True)

Expands the selected templates, parser functions and Lua macros in the given Wikitext. This can selectively expand some or all templates. This can also capture the arguments and/or the expansion of any template as well as substitute custom expansions instead of the default expansions.

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates and Lua macros). The Wtp.process() and Wtp.reprocess() will call it automatically. The page title is also used in error messages.

The arguments are as follows:

  • text (str) - the wikitext to be expanded
  • template_fn (function) - if set, this will be called as template_fn(name, args), where name (str) is the name of the template and args is a dictionary containing arguments to the template. Positional arguments (and named arguments with numeric names) will have integer keys in the dictionary, whereas other named arguments will have their names as keys. All values corresponding to arguments are strings (after they have been expanded). This function can return None to cause the template to be expanded in the normal way, or a string that will be used instead of the expansion of the template. This can return "" (empty string) to expand the template to nothing. This can also capture the template name and its arguments.
  • post_template_fn (function) - if set, this will be called as post_template_fn(name, ht, expansion) after the template has been expanded in the normal way. This can return None to use the default expansion, or a string to use that string as the expansion instead. This can also be used to capture the template, its arguments, and/or its expansion (see the sketch after this list).
  • pre_expand (boolean) - if set to True, all templates that were heuristically determined as needing to be expanded before parsing will be expanded.
  • templates_to_expand (None or set or dictionary) - if this is set, these templates will be expanded in addition to any other templates that have been specified to be expanded. If a dictionary is provided, its keys will be taken as the names of the templates to be expanded. If this has not been set or is None, all templates will be expanded.
  • expand_parserfns (boolean) - Normally, wikitext parser functions will be expanded. This can be set to False to prevent parser function expansion.
  • expand_invoke (boolean) - Normally, the #invoke parser function (which calls a Lua module) will be expanded along with other parser functions. This can be set to False to prevent expansion of the #invoke parser function.
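A sketch of capturing and overriding template expansions with these hooks; the template names are illustrative only, and page_text is assumed to hold the wikitext of the current page:

captured = []

def template_fn(name, args):
    if name == "audio":
        return ""    # expand this template to nothing
    return None      # None means: expand normally

def post_template_fn(name, args, expansion):
    if name == "IPA":
        captured.append((dict(args), expansion))
    return None      # keep the default expansion

wtp.start_page("example")
text = wtp.expand(page_text, template_fn=template_fn, post_template_fn=post_template_fn)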
def start_page(self, title)

This function should be called before starting the processing of a new page or file. This saves the page title (which is frequently accessed by templates, parser functions, and Lua macros). The page title is also used in error messages.

The Wtp.process() and Wtp.reprocess() functions will automatically call this before calling the page handler for each page. This needs to be called manually when processing wikitext obtained from other sources.

The arguments are as follows:

  • title (str) - The page title. For normal pages, there is usually no prefix. Templates typically have Template: prefix and Lua modules Module: prefix, and other prefixes are also used (e.g., Thesaurus:). This does not care about the form of the name, but some parser functions do.
def start_section(self, title)

Sets the title of the current section on the page. This is automatically reset to None by Wtp.start_page(). The section title is only used in error, warning, and debug messages.

The arguments are:

  • title (str) - the title of the section, or None to clear it.
def start_subsection(self, title)

Sets the title of the current subsection of the current section on the page. This is automatically reset to None by Wtp.start_page() and Wtp.start_section(). The subsection title is only used in error, warning, and debug messages.

The arguments are:

  • title (str) - the title of the subsection, or None to clear it.
def add_page(self, title: str, namespace_id: int, body: Optional[str] = None,
             redirect_to: Optional[str] = None, need_pre_expand: bool = False,
             model: str = "wikitext") -> None:

This function is used to add pages, templates, and modules for processing. There is usually no need to use this if Wtp.process() is used; however, this can be used to add templates and pages for testing or other special processing needs.

The arguments are:

  • title - the title of the page to be added (normal pages typically have no prefix in the title, templates begin with Template:, and Lua modules begin with Module:)
  • namespace_id - namespace id
  • body - the content of the page, template, or module
  • redirect_to - title of redirect page
  • need_pre_expand - set to True if the page is a template that needs to be expanded before parsing.
  • model - the model value for the page (usually wikitext for normal pages and templates and Scribunto for Lua modules)

The Wtp.analyze_templates() function needs to be called after calling Wtp.add_page() before pages can be expanded or parsed (it should preferably only be called once after adding all pages and templates).
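A minimal sketch of using Wtp.add_page() and Wtp.analyze_templates() for testing, following the signature above:

wtp = Wtp()  # temporary database under /tmp
wtp.add_page("Template:hello", 10, body="Hello {{{1|world}}}!")
wtp.analyze_templates()
wtp.start_page("Test page")
print(wtp.expand("{{hello|there}}"))  # -> Hello there!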

def analyze_templates(self)

Analyzes the template definitions in the cache file and determines which of them should be pre-expanded before parsing because they affect the document structure significantly. Some templates in, e.g., Wiktionary expand to table start tags, table end tags, or list items, and parsing results are generally much better if they are expanded before parsing. The actual expansion only happens if pre_expand or some other argument to Wtp.expand() or Wtp.parse() tells them to do so.

The analysis is heuristic and is not guaranteed to find every such template. In particular, it cannot detect templates that call Lua modules that output Wikitext control structures (there are several templates in Wiktionary that call Lua code that outputs list items, for example). Such templates may need to be identified manually and specified as additional templates to expand. Luckily, there seem to be relatively few such templates, at least in Wiktionary.

This function is automatically called by Wtp.process() at the end of phase 1. An explicit call is only necessary if Wtp.add_page() has been used by the application.

Error handling

Various functions in this module, including Wtp.parse() and Wtp.expand(), may generate errors and warnings. Those will be displayed on stdout as well as collected in Wtp.errors, Wtp.warnings, and Wtp.debugs. These fields will contain lists of dictionaries, where each dictionary describes an error/warning/debug message. The dictionary can have the following keys (not all of them are always present):

  • msg (str) - the error message
  • trace (str or None) - optional stacktrace where the error occurred
  • title (str) - the page title on which the error occurred
  • section (str or None) - the section where the error occurred
  • subsection (str or None) - the subsection where the error occurred
  • path (tuple of str) - a path of title, template names, parser function names, or Lua module/function names, giving information about where the error occurred during expansion or parsing.

The fields containing the error messages will be cleared by every call to Wtp.start_page() (including the implicit calls during Wtp.process() and Wtp.reprocess()). Thus, the page_handler function often returns these lists together with any information extracted from the page, and they can be collected together from the values returned by the iterators returned by these functions. The Wtp.to_return() function may be useful for this.

The following functions can be used for reporting errors. These can also be called by application code from within the page_handler function as well as template_fn and post_template_fn functions to report errors, warnings, and debug messages in a uniform way.

def error(self, msg, trace=None)

Reports an error message. The error will be added to Wtp.errors list and printed to stdout. The arguments are:

  • msg (str) - the error message (need not include page title or section)
  • trace (str or None) - an optional stack trace giving more information about where the error occurred
def warning(self, msg, trace=None)

Reports a warning message. The warning will be added to Wtp.warnings list and printed to stdout. The arguments are the same as for Wtp.error().

def debug(self, msg, trace=None)

Reports a debug message. The message will be added to Wtp.debugs list and printed to stdout. The arguments are the same as for Wtp.error().

def to_return(self)

Produces a dictionary containing the error, warning, and debug messages from Wtp. This would typically be called at the end of a page_handler function and the value returned along with whatever data was extracted from that page. The error lists are reset by Wtp.start_page() (including the implicit calls from Wtp.process() and Wtp.reprocess()), so they should be saved (e.g., by this call) for each page. (Given the parallelism in the processing of the pages, they cannot just be accumulated in the subprocesses.)

The returned dictionary contains the following keys:

  • errors - a list of dictionaries describing any error messages
  • warnings - a list of dictionaries describing any warning messages
  • debugs - a list of dictionaries describing any debug messages.
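For example, a page_handler might attach these messages to whatever it extracts (a sketch, not taken from an existing extractor):

def page_handler(wtp: Wtp, page: Page) -> dict:
    wtp.start_page(page.title)
    tree = wtp.parse(page.body, pre_expand=True)
    data = {"title": page.title}   # plus whatever the application extracts from tree
    data.update(wtp.to_return())   # adds the "errors", "warnings", "debugs" keys
    return data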

class WikiNode

The WikiNode class represents a parse tree node and is returned by Wtp.parse(). This object can be printed or converted to a string and will display a human-readable format that is suitable for debugging purposes (at least for small parse trees).

The WikiNode objects have the following fields:

  • kind (NodeKind, see below) - The type of the node. This determines how to interpret the other fields.
  • children (list) - Contents of the node. This is generally used when the node has arbitrary size content, such as subsections, list items/sublists, other HTML tags, etc.
  • args (list or str, depending on kind) - Direct arguments to the node. This is used, for example, for templates, template arguments, parser function arguments, and link arguments, in which case this is a list. For some node types (e.g., list, list item, and HTML tag), this is directly a string.
  • attrs - A dictionary containing HTML attributes or a definition list definition (under the def key).

class NodeKind(enum.Enum)

The NodeKind type is an enumerated value for parse tree (WikiNode) node types. Currently the following values are used (typically these need to be prefixed by NodeKind., e.g., NodeKind.LEVEL2); a short traversal sketch follows the list:

  • ROOT - The root node of the parse tree.
  • LEVEL2 - Level 2 subtitle (==). The args field contains the title and children field contains any contents that are within this section
  • LEVEL3 - Level 3 subtitle (===)
  • LEVEL4 - Level 4 subtitle (====)
  • LEVEL5 - Level 5 subtitle (=====)
  • LEVEL6 - Level 6 subtitle (======)
  • ITALIC - Italic, content is in children
  • BOLD - Bold, content is in children
  • HLINE - A horizontal line (no arguments or children)
  • LIST - Indicates a list. Each list and sublist will start with this kind of node. args will contain the prefix used to open the list (e.g., "##" - note this is stored directly as a string in args). List items will be stored in children.
  • LIST_ITEM - A list item in the children of a LIST node. args is the prefix used to open the list item (same as for the LIST node). The contents of the list item (including any possible sublists) are in children. If the list is a definition list (i.e., the prefix ends in ";"), then children contains the item label to be defined and definition contains the definition.
  • PREFORMATTED - Preformatted text where markup is interpreted. Content is in children. This is used for lines starting with a space in wikitext.
  • PRE - Preformatted text where markup is not interpreted. Content is in children. This is indicated in wikitext by <pre>...</pre>.
  • LINK - An internal wikimedia link ([[...]] in wikitext). The link arguments are in args. This tag is also used for media inclusion. Links with a trailing word end immediately after the link have the trailing part in children.
  • TEMPLATE - A template call (transclusion). Template name is in the first argument and template arguments in subsequent arguments in args. The children field is not used. In wikitext templates are marked up as {{name|arg1|arg2|...}}.
  • TEMPLATE_ARG - A template argument. The argument name is in the first item in args, followed by any subsequent arguments (normally at most two items, but I've seen arguments with more - probably an error in those template definitions). The children field is not used. In wikitext template arguments are marked up as {{{name|defval}}}.
  • PARSER_FN - A parser function invocation. This is also used for built-in variables such as {{PAGENAME}}. The parser function name is in the first element of args and parser function arguments in subsequent elements.
  • URL - An external URL. The first argument is the URL. The second optional argument (in args) is the display text. The children field is not used.
  • TABLE - A table. Content is in children. In wikitext, a table is encoded as {| ... |}.
  • TABLE_CAPTION - A table caption. This can only occur under TABLE. The content is in children. The attrs field contains a dictionary of any HTML attributes given to the table.
  • TABLE_ROW - A table row. This can only occur under TABLE. The content is in children (normally the content would be TABLE_CELL or TABLE_HEADER_CELL nodes). The attrs field contains a dictionary of any HTML attributes given to the table row.
  • TABLE_HEADER_CELL - A table header cell. This can only occur under TABLE_ROW. Content is in children. The attrs field contains a dictionary of any HTML attributes given to the cell.
  • TABLE_CELL - A table cell. This can only occur under TABLE_ROW. Content is in children. The attrs field contains a dictionary of any HTML attributes given to the cell.
  • MAGIC_WORD - A MediaWiki magic word. The magic word is assigned directly to args as a string (i.e., not in a list). children is not used. An example of a magic word would be __NOTOC__.
  • HTML - An HTML tag (or a matched pair of HTML tags). args is the name of the HTML tag directly (not in a list and always without a slash). attrs is set to a dictionary of any HTML attributes from the tag. The contents of the HTML tag are in children.
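As mentioned above, here is a short traversal sketch that collects level-2 section titles from a parse tree (assuming tree was returned by Wtp.parse()):

from wikitextprocessor import NodeKind, WikiNode

def collect_level2_titles(node, found):
    # Recurse through WikiNode children; strings and lists may also appear.
    if isinstance(node, WikiNode):
        if node.kind == NodeKind.LEVEL2:
            found.append(node.args)   # args holds the heading title (see above)
        for child in node.children:
            collect_level2_titles(child, found)
    elif isinstance(node, (list, tuple)):
        for child in node:
            collect_level2_titles(child, found)

titles = []
collect_level2_titles(tree, titles)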

Expected performance

This can generally process a few Wiktionary pages per second per processor core, including expanding all templates and Lua macros, parsing the full page, and analyzing the parse. On a multi-core machine, this can generally process a few dozen to a few hundred pages per second, depending on the speed and number of cores.

Most of the processing effort goes into expanding Lua macros. You can elect not to expand Lua macros, but they are used extensively in Wiktionary and carry important information. Expanding templates and Lua macros allows much more robust and complete data extraction, but it does not come cheap.

Contributing and bug reports

Please create an issue on GitHub to report bugs or to contribute!

wikitextprocessor's People

Contributors

brendanedwardgavin · dependabot[bot] · empiriker · fpw · garfieldnate · jmviz · kristian-clausal · lemoussel · tatuylonen · wyrun · xxyzz · yacoder · yoskari


wikitextprocessor's Issues

{{trans-top}}{{multitrans}} cause misparsing when there is no newline between them

Because of the way we handle expanding templates and the order in which things are parsed (which, in obvious retrospect, is different from how MediaWiki processes wikitext), when trans-top (the weird starting tag of a translation section) and multitrans (the optimized translation section template that accepts lists as arguments) are on the same line:

{{trans-top|bird}}{{multitrans|data=
* Blahian translation
* Barian translation
* Foonese translation
}}
{{trans-bottom}}

like this, the first line (Blahian translation) is broken up and used as the sense of the translation section, because it does not get parsed as part of the list below it. Why? I am pretty sure it's because {{multitrans}} is pre-expanded (or maybe even if it wasn't) and then you have:

{{trans-top|bird}}* Blahian translation
*Barian translation...

and the first line is not correctly formatted as a list item because {{trans-top}} is ahead of it on the same line. However, trans-top generates a newline or linebreak down the line in some form as the translation box is generated, and because the wikimedia parser does the final parsing of the page's tree/DOM-thingie at the end (unlike us, who do partial parsing in between without expanding all templates), it doesn't even notice that the list was "broken" for a while. But wikitextprocessor catches the broken list when it's vulnerable (with the unexpanded {{trans-top}} still in the text, either as text or as a magical character).

Making this an issue as a memo to myself. This is going to be either spectacularly simple or super hard.

HTML code in English word senses

Some HTML code got into the English dump.

Here's an example from wall (sense: The butterfly Lasiommata megera.)

{
   "code": "sw",
   "lang": "Swahili",
   "sense": "butterfly Lasiommata megera\n class=\"translations\" role=\"presentation\" style=\"width:100%;\" data-gloss=\"butterfly Lasiommata megera\"",
   "word": "kuta"
}

More examples:

  • love (A climbing plant, Clematis vitalba)
  • rose (A plant or species in the rose family. (Rosaceae))
  • they
  • read (Used after a euphemism to introduce the intended, more blunt meaning of a term)

Typo in README

The README says:

This can generally process a few Wiktionary pages second per processor core

Should this be "pages PER second"?

Module 'ne-conj' not found Lua error

Page: https://en.wiktionary.org/wiki/वाक्नु
Error: https://kaikki.org/dictionary/errors/details-Traceback--most-recent-call-last-----F-Plr1Wuzg.html

वाक्नु (Nepali verb) LUA error in #invoke('ne-conj', 'show') parent ('Template:ne-conj', {'i': 'y'})

Traceback (most recent call last):
  File "/home/ubuntu/temp-wiktionary/venv/lib/python3.10/site-packages/wikitextprocessor/luaexec.py", line 745, in call_lua_sandbox
    ret: tuple[bool, str] = ctx.lua_invoke(
  File "lupa/lua51.pyx", line 896, in lupa.lua51._LuaObject.__call__
  File "lupa/lua51.pyx", line 1795, in lupa.lua51.call_lua
  File "lupa/lua51.pyx", line 1821, in lupa.lua51.execute_lua_call
  File "lupa/lua51.pyx", line 1703, in lupa.lua51.raise_lua_error
lupa.lua51.LuaError: [string "_sandbox_phase2"]:179: Could not find module ne-conj: module 'ne-conj' not found
stack traceback:
	[string "_sandbox_phase2"]:179: in function <[string "_sandbox_phase2"]:121>
	[C]: in function 'error'

Unreproducible error, have no idea.

`Can not match` Lua errors in the "ja-usex" Module

Some Japanese pages have this error, all of them use bold wikitext in a link. Example page: https://en.wiktionary.org/wiki/ちゃんねる, error link.

#: {{ja-usex|[[w:ja:2ちゃんねる|2'''ちゃんねる''']]|^に-'''ちゃんねる'''|{{w|2channel}}}}

What puzzled me is that when saving the link arguments, the argument contains the bold wikitext ('''), but then when expanding the template, the argument no longer has the ''' around "ちゃんねる". I think this is because this link is somehow parsed twice, and the wrong magic number is used to retrieve the incorrect arguments, but I don't know where and when this happens in the code.

How to get text from from templates?

Hi, thanks for the project!

I'm trying to extract text from wiki dumps. Page example: https://en.wikipedia.org/wiki/Free_neutron_decay
I downloaded the page and its templates via https://en.wikipedia.org/wiki/Special:Export
The page contains the following template - {{val|879.6|0.8|u=[[second|s]]}} which I'd like to be converted to 879.6±0.8 s text.

Code (the latest in repo):

    def test_simple_page(self):

        def page_handler(page: Page, wtp: Wtp | None = None) -> Any:
            wtp.start_page(page.title)
            node = wtp.parse(page.body, pre_expand=True)
            value = wtp.node_to_wikitext(node)
            print(value)

        wtp = Wtp(db_path=Path('../db/mydb'))
        process_dump(
            wtp,
            "../Wikipedia-20230825082710.xml.bz2",
            {0, 10, 110, 828},  # namespace id
            save_pages_path=Path('../pages')
        )

        print("Num pages:", wtp.saved_page_nums())

        for _ in map(
                partial(page_handler, wtp=wtp), wtp.get_all_pages([0])
        ):
            pass

Output:

Num pages: 86
.... <strong class="error">Template:val</strong>...

If I set pre_expand to False, then:

Num pages: 86
.... {{val|879.6|0.8|u=[[second|s]]}}...

It's probably something simple, but I can't find a solution. Can you please help?

GitHub doesn't allow uploading xml/bz2 files, so I uploaded my xml on Dropbox: link

zh edition dump missing many entries

I noticed that extraction from the zh edition dump misses many entries; in particular, the ZH wiktionary lists 100K pages under the "English" category, but when I run extraction like so I only get 30K lines in the resulting jsonl file:

./wiktwords --dump-file-language-code zh --all --language en --cache /tmp/wikt-cache --pages-dir pages --out data_en_zh1.jsonl zhwiktionary-20221120-pages-articles.xml.bz2| tee log_en_zh1.txt

Some examples of missing pages are chad, acrid, acidic, zigzag, and argon; those first 4 are missing the POS headers, but argon has the correct structure.

I noticed that these pages all use traditional characters in the headers, which might have been a pattern, but then I noticed that "sulfate" is not missing, and it uses traditional headers.

@xxyzz

Parser chokes on valid HTML attribute that looks like Wiki text

I ran into a problem parsing data from fiwiktionary-20220101-pages-articles.xml. Some of the expanded translation templates have <span> tags with attributes containing wiki text.

Here is a simple example of how the wikitextprocessor parser fails:

#!/usr/bin/env python3

from wikitextprocessor import Wtp, WikiNode, NodeKind

text = '''
<span data-x="''test''">test</span>
'''

ctx = Wtp()
ctx.start_page('test')
print(ctx.parse(text))

This outputs:

test: DEBUG: no corresponding start tag found for </span> at ['test'] parsing 
<ROOT(['test']){} '\n<span data-x="', <ITALIC(){} 'test'>, '">test</span>\n'>

The input is valid but the parser thinks there's an ITALIC block in the HTML attribute and parsing of the whole <span> tag fails.

In my actual example the input text looks like this:

<span class="Zzzz linkki" data-kuvaus-param="{&quot;1&quot;: &quot;Englanti&quot;, &quot;5&quot;: &quot;|critic&lt;nowiki&gt;|&lt;/nowiki&gt;'''ise'''&quot;}" data-kuvaus="käännös/*/$2/yleinen" lang="en">[[criticise#Englanti|critic<nowiki>|</nowiki>'''ise''']]</span>

I'm not sure how to fix this but the parser needs to avoid looking for Wiki text inside HTML attributes.

<ref> elements (and probably other html-like tags) inside list items can seemingly contain newlines

This is annoying, because of the structure of our parser.

If we have the source (from comprise/English):

# {{...}} To [[compose]]; to [[constitute]].<ref group="usage">Traditionally, the whole comprised its parts, ... an increasingly frequent and accepted usage.</ref><ref group="usage">In the passive voice, ... in this sense always requires {{m|en|of}}).

</ref> {{defdate|from the late 18th c.}}
#: {{ux|en|The whole is '''comprised''' of the parts.}}

the "from the late 18th c." template is still part of the same line as the preceding list item. If we do this:

# {{...}} To [[compose]]; to [[constitute]].<ref group="usage">Traditionally, the whole comprised its parts, ... an increasingly frequent and accepted usage.</ref><ref group="usage">In the passive voice, ... in this sense always requires {{m|en|of}}).</ref>
{{defdate|from the late 18th c.}}
#: {{ux|en|The whole is '''comprised''' of the parts.}}

with a newline before the defdate template, it behaves as expected and the "from the late 18th c." text is on a new line and breaks the table into two new tables.

The problem is, as usual, that the way we have to parse wikitext needs to be different from what happens when wikitext is processed normally. We need to keep some data that is discarded during processing, like the fact that templates exist at all, while normal processing just (probably) involves a ton of template expansion passes, with the final product parsed at the end. We do parsing in between, which breaks a ton of HTML-tag related stuff, because tags can appear from templates as they wish to generate new structure.

This needs to be hacked together to get lists working again; I have a vague idea, and hopefully it's as simple as it is vague.

Major issues with Lua memory stuff not resetting properly

Memo to me as a reminder about these issues.

There's a major issue with the Lua engine's memory not resetting properly as it should be. Some examples:

  • Before Module:xnt-decl was corrected on Wiktionary, it would overwrite the name table in global namespace, which would basically break every subsequent use of table (which is super-common) by the Lua engine
  • The article "A 1" (because it's super early in alphabetical order, I suspect) overwrites something, like #PAGENAME or something similar, which causes a ton of words to have {{head}} templates return "A 1". This weirdness is later caught by the "suspicious unhandled suffix" error, because "A 1" looks wrong.
  • There was a bug (which may or may not reappear) where a/some lambda function(s) in call_lua_sandbox would "receive too many parameters"; this has to be a bug related to these memory issues, because it popped up in extract.log only about 66% of the way through the logs, which is a good indicator that there was a "trigger" function like Module:xnt-decl (which also showed the same behavior of errors appearing only N% through the log, and which was how I was able to find the module in question), but now that I've replaced the lambdas with named functions the errors have disappeared for today...
  • Module:ja-translit can't access package properly, for some reason, or a specific field of package that I don't remember right now. EDIT: probably not anything to do with this issue.

The fault is most likely somewhere in luaexec.py, _sandbox_phase1.lua or _sandbox_phase2.lua.

This may be a pre-existing issue that we didn't notice with the older versions of Lupa we were using (because there were so many issues), or it might be an issue with how we converted to using Lua 5.1, or it might be an issue with some change in Lupa or how Lupa handles Lua 5.1 stuff.

Set a higher `chunksize` value to decrease the process time

The Python documentation says setting a large chunksize would improve speed. I tried to test this parameter on GitHub Actions but all jobs exceeded the 6 hour limit. I also tried to run the command on my machine, but it took 10 hours to finish and I only ran the command once. I'm not sure whether this parameter would make a huge difference; maybe someone with a more powerful PC could test this.

You can find my test workflow and code at here: https://github.com/xxyzz/wiktextract/actions/runs/3202850224/workflow

The word senses for benzylidene are also not being parsed correctly.

The word senses for benzylidene (https://kaikki.org/dictionary/English/meaning/b/be/benzylidene.html) are also not being parsed correctly.
"translations": [
    {
      "code": "ca",
      "lang": "Catalan",
      "sense": "=\">\nC6H5-CH=",
      "tags": [
        "masculine"
      ],
      "word": "benzilidè"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "=\">\nC6H5-CH=",
      "word": "bentsylideeni"
    }
  ],

Originally posted by @lggruspe in #33 (comment)

ERROR: LUA error in #invoke('Biblio', 'lienWeb')

On Wikipedia Yémen page, on some Web Link, I have this error

Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': "Yémen : 3 agences de l'ONU appellent à la levée immédiate du blocus humanitaire", 'url': 'https://news.un.org/fr/audio/2017/11/1004021', 'site': 'ONU Info', 'date': '2017-11-16', 'consulté le': '2019-03-23'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'en', 'titre': 'Yemen: the Search for a Modern State', 'date': '', 'url': 'https://www.google.fr/books/edition/Yemen_the_Search_for_a_Modern_State/vFyaCwAAQBAJ?hl=fr&gbpv=1&dq=Ibrahim+al-Hamdi&printsec=frontcover', 'site': 'Google Books', 'consulté le': '17 août 2023'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Marib Dam: An Engineering Wonder of the Ancient World', 'url': 'https://www.saba.ye/en/news519330.htm', 'site': 'SabaNet - Yemen News Agency SABA', 'date': '2018-12-18', 'consulté le': '2021-11-27'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Quand le drapeau rouge flottait sur Aden', 'url': 'https://orientxxi.info/spip.php?action=ia_nojs&retour=%2Fmagazine%2Fquand-le-drapeau-rouge-flottait-sur-aden%2C2152', 'site': 'Orientxxi.info', 'date': '30 novembre 2017'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'auteur1': 'Reuters', 'titre': "Enquête du HCR sur l'impact des raids saoudiens au Yémen", 'url': 'http://www.euroinvestor.fr/news/story.aspx?id=10721627', 'site': 'http://www.euroinvestor.fr', 'année': '6 novembre 2009', 'consulté le': '7 novembre 2009'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'auteur1': 'Paul Handley', 'titre': 'Ryad veut “neutraliser” les rebelles chiites yéménites à sa frontière', 'url': 'http://www.google.com/hostednews/afp/article/ALeqM5hmrKVWhwcWfWz7FqnTsyzogO1f8Q', 'éditeur': '[[Agence France-Presse|AFP]]', 'année': '6 novembre 2009', 'consulté le': '7 novembre 2009'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Yémen : Des milliers de manifestants dans la rue', 'url': 'http://www.radio-canada.ca/nouvelles/International/2011/01/27/002-yemen-manifestations-sanaa.shtml', 'éditeur': 'Radio-Canada.ca avecAgence France Presse, Reuters et Associated Press', 'date': '27 janvier 2011', 'consulté le': '27 janvier 2011'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': "Attentats de Sanaa : quelles sont les forces qui s'affrontent au Yémen ?", 'url': 'https://www.lemonde.fr/proche-orient/article/2015/03/20/qui-affronte-qui-au-yemen_4598291_3218.html', 'site': '[[Le Monde#Le Monde.fr|lemonde.fr]]', 'date': '20 mars 2015'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yémen : l’ex-président Saleh est mort, tué par ses anciens alliés houthistes', 'url': 'https://www.lemonde.fr/yemen/article/2017/12/04/yemen-l-ex-president-saleh-est-mort-tue-par-des-rebelles-houthistes_5224391_1667193.html', 'site': '[[Le Monde#Le Monde.fr|lemonde.fr]]', 'date': '4 décembre 2017', 'consulté le': '13 décembre 2018'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yémen: 8 millions de personnes au bord de la famine (ONU)', 'url': 'http://www.lefigaro.fr/flash-actu/2017/12/11/97001-20171211FILWWW00258-yemen-8-millions-de-personnes-au-bord-de-la-famine-onu.php', 'site': '[[Le Figaro|lefigaro.fr]]', 'date': '11 décembre 2017'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yemen: 1 million de cas de choléra (CICR)', 'url': 'http://www.lefigaro.fr/flash-actu/2017/12/21/97001-20171221FILWWW00093-yemen-1-million-de-cas-de-cholera-cicr.php', 'site': '[[Le Figaro|lefigaro.fr]]', 'date': '21 décembre 2017', 'consulté le': '22 décembre 2017'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'en', 'titre': 'New report reveals: Arms exports by Spanish company Airbus to Saudi Arabia and UAE may have contributed to war crimes in Yemen', 'url': 'https://www.ecchr.eu/en/press-release/new-report-reveals-arms-exports-by-spanish-company-airbus/', 'site': 'ECCHR', 'consulté le': '12 mai 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'en', 'titre': 'A new report launched by Centre Delàs, Amnesty International and ECCHR, concludes that arms transfers by Spanish company Airbus to Saudi Arabia and UAE may have contributed to war crimes in Yemen', 'url': 'http://centredelas.org/actualitat/un-nuevo-informe-del-centre-delas-amnistia-internacional-y-el-ecchr-concluye-que-las-transferencias-de-armas-de-la-empresa-espanola-airbus-a-arabia-saudi-y-emiratos-arabes-unidos-pueden-haber-contr/?lang=en', 'site': 'Centre Delàs for Peace Studies', 'consulté le': '12 mai 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'en', 'titre': 'Spain: Report claims arms exports by Spanish company Airbus to Saudi Arabia & UAE may have contributed to war crimes in Yemen', 'url': 'https://www.business-humanrights.org/en/latest-news/spain-report-claims-that-arms-exports-by-spanish-company-airbus-to-saudi-arabia-and-uae-may-have-contributed-to-war-crimes-in-yemen/', 'site': 'Buisness  & Human Rights Resource Centre', 'consulté le': '16 mai 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'en', 'titre': 'SPANISH ARMS EXPORTS AND ALLEGED WAR CRIMES IN YEMEN', 'url': 'https://media.business-humanrights.org/media/documents/Spanish_Arms_Exports_and_Alleged_War_Crimes_in_Yemen.pdf', 'site': 'PDF', 'consulté le': '16 mai 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yémen : trois entreprises d’armement françaises soupçonnées de complicité de crimes de guerre', 'url': 'https://www.amnesty.fr/conflits-armes-et-populations/actualites/yemen-trois-entreprises-darmement-francaises-soupconnees', 'site': 'Amnesty international', 'consulté le': '1 juin 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yémen : trois entreprises françaises d’armement visées par une plainte pour complicité de crimes de guerre', 'url': 'https://www.la-croix.com/Monde/Yemen-trois-entreprises-francaises-darmement-visees-plainte-complicite-crimes-guerre-2022-06-02-1201218133', 'site': 'Le Croix', 'consulté le': '2 juin 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
Yémen: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'langue': 'fr', 'titre': 'Yémen : trois entreprises françaises visées par une plainte pour « complicité de crimes de guerre »', 'url': 'https://www.lemonde.fr/international/article/2022/06/02/yemen-trois-entreprises-francaises-visees-par-une-plainte-pour-complicite-de-crimes-de-guerre_6128654_3210.html', 'site': 'Le Monde', 'consulté le': '2 juin 2022'}) at ['Yémen', 'Lien web', '#invoke']
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)

Test code:

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig


class TestWebLink(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp = Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig()
        )


    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_weblink_1(self):
        self.wxr.wtp.start_page("Test")
        tree = self.wxr.wtp.parse(text="<ref>{{Lien web|langue=en|titre=Yemen: the Search for a Modern State|date=|url=https://www.google.fr/books/edition/Yemen_the_Search_for_a_Modern_State/vFyaCwAAQBAJ?hl=fr&gbpv=1&dq=Ibrahim+al-Hamdi&printsec=frontcover|site=Google Books|consulté le=17 août 2023}}.</ref>", expand_all=True)
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(text, '')

    def test_weblink_2(self):
        self.wxr.wtp.start_page("Test")
        tree = self.wxr.wtp.parse(text="<ref>{{Lien web |langue=fr |titre=Yémen : 3 agences de l'ONU appellent à la levée immédiate du blocus humanitaire |url=https://news.un.org/fr/audio/2017/11/1004021 |site=ONU Info |date=2017-11-16 |consulté le=2019-03-23}}.</ref>", expand_all=True)
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(text, '')

Is offline conversion into HTML a feature of the package?

The readme mentions that the package can be used for converting wikitext to HTML, but if I understand correctly, the intended usage of this package is to parse dump files and convert them to parse tree objects (WikiNode), which can later be used for tasks such as wikitext-to-HTML conversion, right?
Thank you

Filenames cause errors on exFAT

exFAT partitions disallow file names with the characters /\:*?\"<>| and unicode 0x00-0x1F; some wiktionary titles violate this rule, causing the following error when extracting the wiktionary raw dump:

2023-06-20 13:26:17,038 INFO:   ... 8420000 raw pages collected
Getting all pages for query: 'SELECT title, namespace_id, redirect_to, need_pre_expand, body, model FROM pages ORDER BY title ASC'
Traceback (most recent call last):
  File "/home/brendan/miniconda3/envs/wiktextract/bin/wiktwords", line 8, in <module>
    sys.exit(main())
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/site-packages/wiktextract/wiktwords.py", line 331, in main
    parse_wiktionary(wxr, args.path, word_cb,
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/site-packages/wiktextract/wiktionary.py", line 94, in parse_wiktionary
    for _ in wxr.wtp.process(
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/site-packages/wikitextprocessor/core.py", line 1513, in process
    process_dump(
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/site-packages/wikitextprocessor/dumpparser.py", line 120, in process_dump
    save_pages_to_file(ctx, save_pages_path)
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/site-packages/wikitextprocessor/dumpparser.py", line 201, in save_pages_to_file
    file_path.parent.mkdir(parents=True, exist_ok=True)
  File "/home/brendan/miniconda3/envs/wiktextract/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 22] Invalid argument: '../wiktextract-dump-2023-06-20/Words/!?'

`<nowiki />` tag breaks parsing nodes

Find this error in tatuylonen/wiktextract#453

Example: IPA link in page https://fr.wiktionary.org/wiki/Conjugaison:français/abattre

[[Annexe:Prononciation/français|<span>\\kə <nowiki />nu.z‿ɛ.jɔ̃.z‿a.ba.ty\\</span>]]

is parsed as plain text because of this code:

def repl_link(m: re.Match) -> CookieChar:
    """Replacement function for links [[...]]."""
    nowiki = MAGIC_NOWIKI_CHAR in m.group(0)

which can't tell that the nowiki tag is inside an HTML span tag.

nowiki should only change the link node to plain text if it's a direct child of the link node; I'm not sure how to fix this bug.

Template `{{date-|....}}` is misinterpreted?

Under the same conditions as #198, it seems that the template {{date-|....}} is misinterpreted.

For example:
Le {{date-|6 décembre 2015}}, gives Le <>6 décembre 2015,

The <> should not be present.
The result should be: Le 6 décembre 2015,

Parser fails on HTML tag spanning multiple lines

When parsing Wiktionary entries some of the entries, after template expansion, contain HTML tags that look like this:

<div


>
some text
</div>

The wikitextprocessor does not recognize the opening tag because it spans multiple lines.

Cannot import 'Page'

Hi,

Thanks for this project! I started looking into parsing Wiktionary XML dumps and quickly decided using your package is the better part of valor.

Sorry, this is probably just something dumb, but after I install (via pip) and attempt:

from wikitextprocessor import Wtp, WikiNode, NodeKind, Page

I get a "cannot import name 'Page'" error (but no problem with the other three). This is on a Mac Workstation (OSX 10.14.6) with python 3.8.

I also tried to install from source but I get an "SSL certificate problem: certificate has expired" when pip tries to clone luajit. (So that fails.)

Any idea what is going on here?

Module not found because of inconsistent naming in lua modules

This lua module requires "Module:language/data" https://uk.wiktionary.org/wiki/%D0%9C%D0%BE%D0%B4%D1%83%D0%BB%D1%8C:language
But I get "module not found" error because there is no such module on uk.wiktionary, but "Модуль:language/data"

Which is saved as "Модуль:language/data" and thus cannot be found in mw.loadData("Module:languages/data")

The simplest solution would be to use .replace("Module:", "Модуль:") or other namespace name for other languages.
Same goes for "Template". Already tried and it seems to fix most of the errors.

It should be replaced inside Lua Modules code or just everywhere while parsing the dump

Maybe in read_by_title(): rawdata.decode("utf-8").replace(...)
Or maybe in dumpparser.py raplace_namespaces_to_local_namespaces(....decode("utf-8"))
Or maybe in Lua Modules code before executing them.

wikitextprocessor is not suitable to parse non-english Wikipedia

Currently, wikitextprocessor detects templates by checking if a page's title starts with "Template:". This strategy works with the English Wikipedia; however, it doesn't with other wikis. Indeed, frwiki's template pages start with "Modèle:", hence templates from frwiki are not properly detected.

A solution (maybe THE solution) would be to rely on the ns tag in pages. For instance, pages with ns = 10 are template pages. Another solution would be to get the prefix of namespace 10, which can be found in the namespaces tag at the beginning of an XML dump.

Doesn't parse sections with italics in heading title

The German Wiktionary can have italics in heading titles such as:

=== {{Wortart|Substantiv|Latein}}, {{m}}, ''nachklassisch'' ===
from page Christus

This currently gets parsed to
<ROOT(['']){} '=== ', <TEMPLATE(['Wortart'], ['Substantiv'], ['Latein']){} >, ', ', <TEMPLATE(['m']){} >, ', ', <ITALIC(){} 'nachklassisch'>, ' ==='>
completely missing the H3 heading, which is crucial for correctly parsing the page.

I have investigated the issue and believe it's caused by the use of italics in the heading. Basically, I would expect the parser to pass this test:

def test_hdr_italics(self):
    tree = self.parse("test", "=== ''nachklassisch'' ===")
    print(tree) # <ROOT(['test']){} '=== ', <ITALIC(){} 'nachklassisch'>, ' ==='>
    self.assertEqual(len(tree.children), 1) # Failed
    self.assertEqual(tree.children[0].kind, NodeKind.LEVEL3)
    self.assertEqual(len(tree.children[0].largs), 1)
    self.assertEqual(tree.children[0].largs[0][0].kind, NodeKind.ITALIC)

However, this test fails. See comments in code example.

The issue seems to be in token_iter() which deals with splitting single quotes at the very beginning

parts_re = re.compile(r"('{2,})")
for line in lines:
    parts = re.split(parts_re, line)

before it looks for section headings at:

for m in re.finditer(token_re, part):

I don't quite understand the reasoning for dealing with single quotes first. This is why I do not feel comfortable attempting to solve this myself.

Could someone take a look?

How to run tests? module 'ustring:ustring' not found

In connection to #107, I wanted to run the test suite to verify that I didn't break anything.

However, following the instructions in the readme, I get the following error for multiple tests: lupa.lua51.LuaError: [string "<python>"]:80: module 'ustring:ustring' not found.

Apparently the lua modules are not correctly set up in the sandbox but I am not quite sure how to fix this. I expected the installation procedure to set everything up correctly.

Here is the result of running make test, with one exemplary error stack trace:

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/till/VSCode/wikitextprocessor/tests/test_wikiprocess.py", line 4113, in test_title_1_colon_e
    self.scribunto("1:e", "return mw.title.new('1:e').text")
  File "/home/till/VSCode/wikitextprocessor/tests/test_wikiprocess.py", line 41, in scribunto
    ret = self.ctx.expand("{{#invoke:testmod|testfn}}", timeout=timeout)
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/core.py", line 1623, in expand
    expanded = expand_recurse(encoded, parent, not pre_expand)
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/core.py", line 1425, in expand_recurse
    ret = expand_parserfn(
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/core.py", line 1353, in expand_parserfn
    ret = invoke_fn(args, expander, parent)
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/core.py", line 1223, in invoke_fn
    ret = call_lua_sandbox(self, invoke_args, expander, parent, timeout)
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/luaexec.py", line 435, in call_lua_sandbox
    initialize_lua(ctx)  # This sets ctx.lua
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/luaexec.py", line 401, in initialize_lua
    call_set_functions(ctx, set_functions)
  File "/home/till/VSCode/wikitextprocessor/wikitextprocessor/luaexec.py", line 337, in call_set_functions
    set_functions(
  File "lupa/lua51.pyx", line 858, in lupa.lua51._LuaObject.__call__
  File "lupa/lua51.pyx", line 1750, in lupa.lua51.call_lua
  File "lupa/lua51.pyx", line 1776, in lupa.lua51.execute_lua_call
  File "lupa/lua51.pyx", line 1665, in lupa.lua51.raise_lua_error
lupa.lua51.LuaError: [string "<python>"]:80: module 'ustring:ustring' not found
stack traceback:
	[string "_sandbox_phase2"]:237: in function <[string "_sandbox_phase2"]:215>
	[string "mw"]:56: in function <[string "mw"]:51>
	[string "<python>"]:80: in function 'require'
	[C]: in function 'assert'

-------------------- >> begin captured stdout << ---------------------

--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 842 tests in 30.436s

FAILED (errors=231)

make: *** [Makefile:6: test] Error 1

Update README.md API section to mention the need to iterate over the Wtp.process() return value.

EDIT: this is a documentation issue.

I followed the instructions in the README.md and tried to run this program:

from wikitextprocessor import Wtp

def page_handler(model, title, text):
  print(model, title)

ctx = Wtp(lang_code = 'fi')
ctx.process('fiwiktionary-20220101-pages-articles.xml', page_handler)

I get the following output:

  ... 10000 raw pages collected
  ... 20000 raw pages collected
  ... 30000 raw pages collected
  ... 40000 raw pages collected
  ... 50000 raw pages collected
  ... 60000 raw pages collected
  ... 70000 raw pages collected
  ... 80000 raw pages collected
  ... 90000 raw pages collected
  ... 100000 raw pages collected
  ... 110000 raw pages collected
  ... 120000 raw pages collected
  ... 130000 raw pages collected
  ... 140000 raw pages collected
  ... 150000 raw pages collected
  ... 160000 raw pages collected
  ... 170000 raw pages collected
  ... 180000 raw pages collected
  ... 190000 raw pages collected
  ... 200000 raw pages collected
  ... 210000 raw pages collected
  ... 220000 raw pages collected
  ... 230000 raw pages collected
  ... 240000 raw pages collected
  ... 250000 raw pages collected
  ... 260000 raw pages collected
  ... 270000 raw pages collected
  ... 280000 raw pages collected
  ... 290000 raw pages collected
  ... 300000 raw pages collected
  ... 310000 raw pages collected
  ... 320000 raw pages collected
  ... 330000 raw pages collected
  ... 340000 raw pages collected
  ... 350000 raw pages collected
  ... 360000 raw pages collected
  ... 370000 raw pages collected
  ... 380000 raw pages collected
  ... 390000 raw pages collected
  ... 400000 raw pages collected
  ... 410000 raw pages collected
  ... 420000 raw pages collected
  ... 430000 raw pages collected
  ... 440000 raw pages collected
  ... 450000 raw pages collected
  ... 460000 raw pages collected
  ... 470000 raw pages collected
  ... 480000 raw pages collected
Analyzing which templates should be expanded before parsing

Then it just ends. No calls are made to page_handler(). What am I doing wrong?
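
For reference, a minimal sketch of the fix implied by the title: Wtp.process() (at least in the version this report refers to) returns a generator of page_handler results, so the caller has to iterate over it for any pages to actually be processed.

from wikitextprocessor import Wtp

def page_handler(model, title, text):
    print(model, title)

ctx = Wtp(lang_code="fi")
# Nothing is processed until the returned generator is consumed.
for _result in ctx.process(
    "fiwiktionary-20220101-pages-articles.xml", page_handler
):
    pass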

Strip newline character at the end of unnamed template arguments

Error page: https://ru.wiktionary.org/wiki/adygejski
Error message: https://kaikki.org/ruwiktionary/errors/details-No-declension-pattern-matches--adygejs-jTm2fWw4.html

Wikitext:

{{прил pl
|слоги={{по-слогам|a|dy|gej|ski}}|adygejski
}}

Lua error:

[adygejski (Польский морфологические и синтаксические свойства)](https://kaikki.org/ruwiktionary/All%20languages%20combined/meaning/a/ad/adygejski.html) LUA error in #invoke('inflection adj pl', 'template_decl_auto', 'adygejski\n') parent ('Шаблон:прил pl', {'слоги': 'a<span class="hyph" style="color:lightgreen;">-</span>dy<span class="hyph" style="color:lightgreen;">-</span>gej<span class="hyph" style="color:lightgreen;">-</span>ski', 1: 'adygejski\n'})

[string "inflection adj pl"]:310: No declension pattern matches 'adygejski
'

Whitespace characters around unnamed template arguments are not removed, except for the newline character at the end. In this case, "adygejski", not "adygejski\n", should be passed to the "прил pl" template and the "inflection_adj_pl" Lua module.
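
A minimal sketch of the proposed normalization, assuming it runs at the point where a template call's arguments have been split; the helper name is illustrative.

def strip_trailing_newline(unnamed_args: list[str]) -> list[str]:
    """Remove a single newline from the end of the last unnamed argument,
    mirroring the MediaWiki behaviour described above; all other whitespace
    around unnamed arguments is left untouched."""
    if unnamed_args and unnamed_args[-1].endswith("\n"):
        unnamed_args = unnamed_args[:-1] + [unnamed_args[-1][:-1]]
    return unnamed_args

With the wikitext above, the argument value "adygejski\n" would become "adygejski".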

Duplicate glosses due to example templates

https://en.wiktionary.org/wiki/Buddha#Noun

The syntax here, which I'll summarize

# Foo
#* {{RQ:Conrad Heart of Darkness|page=196|passage=FooQuote}}
#: {{ux|en|FooExample}}
# Bar
#: {{ux|en|BarEx}}

This creates a duplicate gloss entry in the data: two "Foo"s, basically.

This is due to a recent change that switched around how example templates are handled. Previously, they were ignored when parsing for glosses; now they are not.

This causes a mismatch due to a peculiarity with the way Wikitext (and Wikitextprocessor) handles this syntax:

# List Item A1
## List Item B1
#: Continuation of List Item A1

The colon is used to basically just take the contents of Continuation of List Item A1 and shunt it straight into the contents of List Item A1. Now we have:

< List A
    < List A1: ["List Item A1", "Continuation of List Item A1"]>
    < List B1
        <List Item B1>
    >
>

But our parser relies on identifying examples by identifying "...#:" lists! That is, this syntax:

# A1
#: A1 Example

results in this tree:

< List A
    < A1 >
   <List Example>
       <List A1 Example>
>

The parser relies on the example sublist to exist in order to filter out examples properly. Now that we don't ignore example templates inside gloss content anymore, this causes weird duplication.

Tatu took this behavior from here:
https://www.mediawiki.org/wiki/Help:Lists#Continuing_a_list_item_after_a_sub-item

There are a few ancient tests that test for this, too, based straight on examples in the docs.

Why is this relevant? Because the "continuing colon" is used on Wiktionary to create numberless lines inside numbered lists. Basically:

# Gloss
#* A quotation of some sort.
#: This text example does not have a number, but is indented to show it's part of this gloss

The colon has become semantically linked to "example".

The "colons continue the list item" behavior is "correct", insofar as the HTML generate is like that, but visually the only thing the colon does is remove numbers at the start of a line. It doesn't really matter that the underlying HTML is logically structured like this for our purposes.

The thing is, we already do not do straight-up continuation when we create a new colon sublist, as in the last example I typed up above; continuation only happens when the list is interrupted by something else.

This is an annoyingly thorny issue. At first I thought the recent change to example template stuff exposed a new bug, but no, it's just weird Wikitext stuff.

At the moment, I've kind of convinced myself that it's best to create the WikiNode data structures in wikitextprocessor based on the visual hierarchy, not the underlying "HTML" hierarchy. This can be easily done by changing one "return" to a "break" in wikitextprocessor/parser... However, it is technically incorrect, because it breaks examples like:

# A list that continues
## An interruption from a sublist!!
## Oh no!!
#: later on.

which aren't relevant to Wiktionary parsing, but might be relevant in some other context.

Because currently the example templates that should be filtered out are shunted into the contents of a gloss item, we don't have any way of distinguishing between templates that are genuinely "example templates that shouldn't be example templates because something used {{ux}} in the wrong place" and "example templates that appear on an indented line afterwards".

There's a couple of ways to fix this:

  1. Break with correctness and always interpret ...#: as a list item. This is the simplest, and only needs a single change in code and removing some tests.
  2. Add some kind of kludgy new WikiNode, like "ListItemContinuation" that tells the parser the stuff it contains is from this kind of continuation. This would be a major pain.
  3. Do heuristics in the recently changed code to guess whether an example template is a gloss or not.

mediawiki_languagecodes.get_all_names should return all possible language codes as keys, even if they would have an empty string value?

There seems to be some issue, maybe with our implementation of mw.languages.fetchLanguageNames...

https://fr.wikipedia.org/w/index.php?title=Module:Langue/Data&action=edit

for k, v in pairs( mwLangFr ) do
	if not p[ k ] then
		p[ k ] = { code = k, nom = v }
		table.insert( p.langueMediaWikiManquantes, k )
	end

	-- mwLangOriginal et mwLangFr ont les mêmes keys, du coup on peut traiter les deux dans cette itération

	local nomOriginal = ustringLower( mwLangOriginal[ k ] )
	if not p[ nomOriginal ] then
		p[ nomOriginal ] = p[ k ]
	end

	local nomFr = ustringLower( v )
	if not p[ nomFr ] then
		p[ nomFr ] = p[ k ]
	end
end

mwLangFr and mwLangOriginal should have the same keys, but mwLangOriginal is returning nil (the key is missing).

local mwLangOriginal = mw.language.fetchLanguageNames()
local mwLangFr = mw.language.fetchLanguageNames( 'fr' )

fetchLanguageNames is our own implementation, and afaict there's nothing wrong with it. The first one returns a dict/table of language codes mapped to their original language names, the second one a table with language codes mapped to their French names... If we're iterating over mwLangFr, then it would follow that there are keys present in mwLangFr that are missing from mwLangOriginal, and indeed, it seems that this is true both ways.

Yeah, the issue seems to be that in our implementation, mwLangFr and mwLangOriginal do NOT have "the same keys" (as the module's comment assumes), which breaks when ustringLower() receives a nil value.

I tried to just do a naive change to our fetchLanguageNames implementation, but it didn't work: I thought that if I got the language codes from what is effectively mwLangOriginal and added it to mwLangTargetedLanguage (mwLangFr) it would work out, but I forgot that I'd already noticed that mwLangOriginal was also missing stuff from mwLangFr.

The implementation relies on xxyzz's mediawiki_languagecodes that uses a baked-in sqlite database to query for these things, and the table construction is a bit too intermediate level for me to touch (at a beginner SQL level), so I'll leave this to @xxyzz, next week.

TODO:

  • mediawiki_languagecodes.get_all_names should return all possible language codes as keys, even if they would have an empty string value?

Originally posted by @kristian-clausal in #194 (comment)
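
A minimal sketch of the key-alignment idea behind the TODO, working on whatever dicts the fetchLanguageNames implementation builds before handing them to Lua; the helper name is illustrative, and how the full set of codes is obtained from mediawiki_languagecodes is left open.

def align_language_name_tables(
    original: dict[str, str], localized: dict[str, str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Give both fetchLanguageNames() results the same key set, filling gaps
    with empty strings so Lua code that indexes one table with the other's
    keys never receives nil."""
    all_codes = set(original) | set(localized)
    return (
        {code: original.get(code, "") for code in all_codes},
        {code: localized.get(code, "") for code in all_codes},
    )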

Template regex in `Wtp._encode()` can't match templates that have a `-{}-` argument

The Chinese Wiktionary template Ja-romanization of uses -{}- as a placeholder for the first unnamed argument of the module form of/templates; -{}- is replaced with an empty string by MediaWiki. But the template regex here

# Replace template invocation
text = re.sub(
    r"(?si)\{" + MAGIC_NOWIKI_CHAR + r"?\{(("
    r"\{\|[^{}]*?\|\}|"
    r"\}[^{}]|"
    r"[^{}](\{[^{}|])?"
    r")+?)\}" + MAGIC_NOWIKI_CHAR + r"?\}",
    repl_templ,
    text,
)

can't match the #invoke parser function, and the whole invoke function gets expanded as plain text.

ja-romanization of template wikitext: {{#invoke:form of/templates|form_of_t|-{}-|withcap=1|lang=ja|noprimaryentrycat=}} 的[[罗马字]]转写

I tried to replace -{}- with a whitespace character before calling _encode, but I guess some Lua code removes the empty-string argument and throws a "parameter 1 is required" error. Maybe a more complex regex could solve this bug.

Example of affected page: https://zh.wiktionary.org/wiki/manga#日語
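
A minimal sketch of one possible workaround: protect -{}- with a private-use placeholder before Wtp._encode() runs, so the template regex can match and the argument stays non-empty, then strip the placeholder from the final expanded text the way MediaWiki drops -{}-. The placeholder character and the hook points are assumptions.

# U+F8FF is in the Unicode private-use area, so it should not occur in dump text.
EMPTY_MARKUP_PLACEHOLDER = "\uf8ff"

def protect_empty_markup(text: str) -> str:
    """Replace -{}- with a single placeholder character so the template regex
    in Wtp._encode() can match, and so the unnamed argument is not empty
    (avoiding "parameter 1 is required" errors from Lua)."""
    return text.replace("-{}-", EMPTY_MARKUP_PLACEHOLDER)

def drop_empty_markup_placeholder(expanded: str) -> str:
    """Remove the placeholder from the final expanded text, mimicking how
    MediaWiki renders -{}- as an empty string."""
    return expanded.replace(EMPTY_MARKUP_PLACEHOLDER, "")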

Template {{refnec|....}} is misinterpreted?

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig


class TestRefnec(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp = Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig()
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_refnec(self):
        self.wxr.wtp.start_page("Test refnec")
        tree = self.wxr.wtp.parse(text="{{refnec|Une ligne de {{nobr|945 km}}}}.", expand_all=True)
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(text, 'Une ligne de 945 km.')

The test seems to fail (KO).
The result is: Une ligne de 945 km^([réf. nécessaire]). Should the result be Une ligne de 945 km. instead?

Performance improvement ideas

  • Replace the single-threaded bzcat with multi-threaded lbzcat (https://github.com/kjn/lbzip2) here:
    if path.endswith(".bz2"):
        # cmd = "bzcat {} | buffer -m 16M".format(path)
        cmd = "bzcat {}".format(path)
        subp = subprocess.Popen(["/bin/sh", "-c", cmd], stdout=subprocess.PIPE,
                                bufsize=256*1024)
        wikt_f = subp.stdout
  • Use lxml's event-driven parsing feature (iterparse()) to parse the dump file, also in the process_input() function linked above, where we could ignore some namespaces (see the sketch after this list).
  • Save page text in a SQLite db file. This would reduce memory usage, simplify code (Wtp.add_page() and Wtp.read_by_title()) and enable multiprocessing in the process_dump() function (phase 1). Table schema:
CREATE TABLE pages (
    title TEXT,
    namespace_id INTEGER,
    redirect_to TEXT,
    need_pre_expand INTEGER,
    body TEXT,
    model TEXT,
    PRIMARY KEY (title, namespace_id)
);
  • Since the cache data is moved to SQLite, hopefully multiprocessing will no longer rely on the fork start method, so macOS and Windows don't have to use a single process.
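
A minimal sketch combining the first two ideas (referenced from the list above): decompress with multi-threaded lbzcat and stream-parse the XML with lxml's iterparse(), skipping unwanted namespaces early. The command name comes from lbzip2; the set of ignored namespace ids is illustrative.

import subprocess
from lxml import etree

IGNORED_NAMESPACE_IDS = {2, 3}  # e.g. User and User talk; illustrative

def iter_dump_pages(path: str):
    """Yield (title, ns, text) tuples from a .xml.bz2 dump without holding
    the whole file in memory."""
    subp = subprocess.Popen(["lbzcat", path], stdout=subprocess.PIPE,
                            bufsize=256 * 1024)
    for _event, page in etree.iterparse(subp.stdout, tag="{*}page"):
        ns = int(page.findtext("{*}ns") or 0)
        if ns not in IGNORED_NAMESPACE_IDS:
            yield (page.findtext("{*}title"), ns,
                   page.findtext("{*}revision/{*}text"))
        page.clear()  # release memory for elements already handled
    subp.stdout.close()
    subp.wait()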

Any suggestions and comments are welcome.

Differences between Lua versions: Invalid escape sequence near '".+\/'

I am trying to adapt wiktword to work on the French Wiktionary. While doing that, I ran into an issue inside wikitextprocessor which seems to come from the different Lua versions used:

I already saw quite a lot of substitutions made in luaexec.py; is it a case of adding another? I added [r"\\/", r"/"] to the list and the error disappeared, but I am not sure whether that is the correct way of solving it, and I don't know Lua, so I don't know what changed between 5.1 and 5.4.
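
For reference, a minimal sketch of the substitution described above, applied to Lua module source before it is handed to the newer interpreter; whether it belongs in the existing substitution list in luaexec.py is a judgment call for the maintainers.

def fix_slash_escapes(lua_source: str) -> str:
    """Rewrite the '\\/' escape (tolerated by Lua 5.1, rejected as an invalid
    escape sequence by later Lua versions) to a plain '/'. Applied textually,
    so it also touches text outside string literals, which should be harmless."""
    return lua_source.replace("\\/", "/")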

Template `[[Fichier:Lycium shawii.jpg|vignette|...]]` is misinterpreted?

I use the French Wikipedia dump file frwiki-latest-pages-articles.xml.bz2 and the SQLite db file fr-wiki-latest.db generated from it. It seems that [[Fichier:Lycium shawii.jpg|vignette|...]] is misinterpreted.

Code:

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)

wxr = WiktextractContext(wtp, WiktionaryConfig())

wiki_page_body = """[[Fichier:Lycium shawii.jpg|vignette|''[[Lycium]] shawii'' appelé Gharqad qui a donné son nom au cimetière d’[[al Baqi]] à [[Médine]].]]"""
wiki_page_title = "Test"

wxr.wtp.start_page(wiki_page_title)
wiki_data = wxr.wtp.parse(
    text=wiki_page_body,
    expand_all=True,
)

print_tree(wiki_data, 2)

text = clean_node(
    wxr=wxr,
    sense_data=None,
    wikinode=wiki_data
)

print(text)

Output:

  ROOT [['Arabie saoudite']]
    LINK [['Fichier:Lycium shawii.jpg'], ['vignette'], [<ITALIC(){} <LINK(['Lycium']){} >, ' shawii'>, ' appelé Gharqad qui a donné son nom au cimetière d’', <LINK(['al Baqi']){} >, ' à ', <LINK(['Médine']){} >, '.']]
|Lycium shawii appelé Gharqad qui a donné son nom au cimetière d’al Baqi à Médine.

The | at the beginning of the sentence should not be present.
The result should be: Lycium shawii appelé Gharqad qui a donné son nom au cimetière d’al Baqi à Médine.

Parser can't parse templates that have `{}` in an argument

Similar to #59, example page: https://zh.wiktionary.org/wiki/%, wikitext:

#: {{zh-x|約 有 6 '''%'''{pā} 的 臺灣人 血型 是 A{ēi}B{bī}型。|}}

The text inside the curly brackets is Pinyin for the preceding character; this data is ignored by MediaWiki and not displayed in the expanded HTML. I haven't found the Wikitext documentation for this syntax; maybe it's already obsolete. This time a temporary fix with a magic character would be hard to implement.

Lua parsing broken with (...) and `arg`

Just as a heads up that new things might break when I get around to fixing this tomorrow.

After a few hours of hunting down a bug with Ingrian conjugation Lua modules, it turns out it's just our parser.

Because we're still using a version of Lupa with a newer Lua version while Wikitext uses Lua 5.1, we have to do manual substitutions and string manipulations to make the code syntactically compatible, which doesn't always work. In this case, Lua 5.1 has the implicit arg name that is used to pack the arguments of a vararg function (of the form local function foo(bar, baz, ...)), but apparently that has changed in the newer version we use, so arg returns nil and crashes the module.

The fix is either trivial (some string substitution like r"\barg\b" -> "..."), or really difficult and probably super error-prone.
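
For concreteness, this is roughly what the "trivial" variant above would look like as Python run over the Lua source; as noted, it is error-prone, since it also rewrites ordinary variables named arg and produces invalid code for uses like arg[1] or arg.n.

import re

def substitute_vararg_name(lua_source: str) -> str:
    """Blindly rewrite Lua 5.1's implicit vararg table name `arg` to the `...`
    expression. A stopgap only: it cannot distinguish the implicit vararg table
    from regular variables or fields that happen to be called `arg`."""
    return re.sub(r"\barg\b", "...", lua_source)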

ERROR: unimplemented parserfn #coordinates

On the Wikipedia Yémen page, there are coordinates 15° 48′ 48″ N, 47° 47′ 26″ E.
The Wikitext template is: {{coord|15.8134|47.7905|type:country_region:YE_dim:1150000_source:dewiki|format=dms|display=title}}

Test Code:

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig


class TestCoord(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp = Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig()
        )


    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_coord(self):
        self.wxr.wtp.start_page("Test")
        tree = self.wxr.wtp.parse(text="{{coord|15.8134|47.7905|type:country_region:YE_dim:1150000_source:dewiki|format=dms|display=title}}", expand_all=True)
        text = clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(text, '')

Output:

Test: ERROR: unimplemented parserfn #coordinates at ['Test', 'coord', '#invoke', 'Lua:Coordinates:coord()']
Test: ERROR: LUA error in #invoke('Coordinates', 'coord', '15.8134', '47.7905', 'type:country_region:YE_dim:1150000_source:dewiki', '', '', '', '', '', '', ' format = dms ', ' name =  ', ' display = title ') parent ('Modèle:Coord', {1: '15.8134', 2: '47.7905', 3: 'type:country_region:YE_dim:1150000_source:dewiki', 'format': 'dms', 'display': 'title'}) at ['Test', 'coord', '#invoke']
'float' object has no attribute 'replace'

Square brackets around a quotation block breaks parsing

Oh, for crying out loud...

Certain articles like https://en.wiktionary.org/wiki/spuriosity have a special syntax for "mention" quotations (which I couldn't find documented anywhere in a quick search; I just got reverted and told to "fix your parser" when I removed it).

# {{lb|en|rare}} [[spuriousness|Spuriousness]].
#* {{quote-book|en|author=w:Alexander Pope|...|passage=Ye are next to.... riting'', {{...}}}}
#* ['''1862''' August – '''....., [https://books.google.com/books?id=rKJkAAAAcAAJ&pg=PA168 page 168]:
#*: So she made Sir John..., spiritualism, '''spuriosity''', &c.]

The whole bit between the [...] gets parsed as a URL, so yeah, this needs fixing if it isn't parsed that way on the MediaWiki side.

Pages folder missing Modules and Templates, and issues with colons in word titles

Currently pages/ is missing Module/ and Template/ used for debugging, and possibly other folders I haven't used myself but could be useful. It's handy for grepping and for copying temp override files.

Additionally, two words Swe:gë’ and Ohi:yoʼ get misinterpreted as a Namespace:word pair, most probably due to the capitalization caused by their... proper nounedness. They have their own top-level Ohi and Swe folders.

New version failed on SQLite

When I try the example code (in the README) using the latest version, pip-installed locally after a git clone, I get:

Traceback (most recent call last):
  File "/path/to/template.py", line 13, in <module>
    process_dump(
  File "/path/to/env/lib/python3.10/site-packages/wikitextprocessor/dumpparser.py", line 132, in process_dump
    analyze_and_overwrite_pages(
  File "/path/to/env/lib/python3.10/site-packages/wikitextprocessor/dumpparser.py", line 158, in analyze_and_overwrite_pages
    ctx.analyze_templates()
  File "/path/to/env/lib/python3.10/site-packages/wikitextprocessor/core.py", line 1043, in analyze_templates
    self.db_conn.execute(query_str)
sqlite3.OperationalError: near "FROM": syntax error

How can I solve it?

Expanding a template that contains itself

Page: https://ru.wiktionary.org/wiki/footer
Template: https://ru.wiktionary.org/wiki/Шаблон:длина_слова

The "длина_слова" template calls itself if it is used for substitution, or use "main other" template. I think it's "{{{|safesubst:}}}" in the template delays the expansion of the arguments, but our code doesn't do that and expand the "длина_слова" template recursively.

wikitext docs:

--categories-file is currently broken

wiktwords --all-languages --all --db-path wikt-db --pages-dir pages --categories-file categories-test.json dumps/enwiktionary-20230420-pages-articles.xml.bz2

Testing out creating a database file and pages directory, resulting in:

Emitting thesaurus main entry for तडित्/Sanskrit/noun (not in main)
Emitting thesaurus main entry for linguist/English/noun (not in main)
Emitting thesaurus main entry for combining form/English/noun (not in main)
2023-05-17 09:13:29,441 INFO: Reprocessing wiktionary complete
Extracting category tree
Traceback (most recent call last):
  File "/home/kristian/.local/bin/wiktwords", line 8, in <module>
    sys.exit(main())
  File "/home/kristian/Repos/wiktextract/wiktextract/wiktwords.py", line 360, in main
    tree = extract_categories(ctx, config)
  File "/home/kristian/Repos/wiktextract/wiktextract/categories.py", line 75, in extract_categories
    ctx.add_page(f"{module_ns_local_name}:wiktextract cat tree",
  File "/home/kristian/Repos/wikitextprocessor/wikitextprocessor/core.py", line 532, in add_page
    self.db_conn.execute("""INSERT INTO pages (title, namespace_id, body,
sqlite3.ProgrammingError: Cannot operate on a closed database.

Looking at what wiktwords is actually doing there, it's the --categories-file parameter that was left over when I ctrl-R'ed for this command in my history. ctx.add_page() needs to be checked to see whether it is being called on a closed database, like here, but the full run on the kaikki regeneration machine seems to be running fine, so hopefully kaikki will regenerate well.
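
A minimal sketch of the kind of guard suggested above; the check itself is plain sqlite3, but where exactly it would be called from in core.py is left open.

import sqlite3

def db_conn_is_open(conn: sqlite3.Connection) -> bool:
    """Return True if the connection can still execute statements,
    False if it has already been closed."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        return False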

Shouldn't the `<gallery ...>` tag be filtered by clean_value?

In the French Wikipedia dump of the article about "Arabie saoudite", there is

<gallery mode="packed-hover">
Fichier:Flickr - omar chatriwala - Dawn breaks.jpg|[[Médine (province)|Médine]].
Fichier:Buraidah.jpg|[[Al Qasim]].
Fichier:Hai'l city.jpg|[[Haïl (province)|Haïl]].
Fichier:جبل حرفة في بني عمرو.jpg|[[Asir (province)|Asir]].
Fichier:بحر حقل (15037998121).jpg|[[Tabuk (province)|Tabuk]].
Fichier:Origineel huis Najran.JPG|[[Najran (province)|Najran]].
Fichier:Makkah-Panorama-2011.jpg|[[La Mecque (province)|La Mecque]].
Fichier:Faifa city.jpg|[[Jizan (province)|Jizan]].
</gallery>

produces this output:

Fichier:Flickr - omar chatriwala - Dawn breaks.jpg|Médine.
Fichier:Buraidah.jpg|Al Qasim.
Fichier:Hai'l city.jpg|Haïl.
Fichier:جبل حرفة في بني عمرو.jpg|Asir.
Fichier:بحر حقل (15037998121).jpg|Tabuk.
Fichier:Origineel huis Najran.JPG|Najran.
Fichier:Makkah-Panorama-2011.jpg|La Mecque.
Fichier:Faifa city.jpg|Jizan.

Shouldn't the <gallery ...> tag be filtered by clean_value?

Detecting whether a page contains an entry for a specific language

Reminder to me. Not a high priority, but could be nice for testing and for people who only work with specific languages.

Now that there is a database, it would be helpful if we could add some metadata for each page that is added to it, before the page is parsed in a costly way. There are some things that should be possible to detect with simple string searches and patterns; the trouble is just being exhaustive and correct.

Detecting whether a page contains a section for a specific language should not be impossible. At first glance it would just be a search for headers that contain the language name or a variant of that name.

The page id and language id could then be added as a new record in a new table that contains these combinations ('chat' id -> 'English' id; 'chat' id -> 'French' id, etc.), and using that table we can filter which pages to use when processing specific languages.

The stuff with the table records is pretty simple, but the language detection is the kind of thing that can be annoying. Are there any edge cases where the simple look for '$=*(Language|Name|Variations)'-style headers isn't enough? (See the sketch below.)
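
A minimal sketch of the header search as a cheap pre-filter, assuming English Wiktionary's level-2 ==Language== headers; the function name is illustrative, and the edge cases asked about above (templated or otherwise unusual headers) would slip through it.

import re

def page_has_language_section(page_text: str, language_name: str) -> bool:
    """Cheap string-level check: does the raw wikitext contain a level-2
    header for this language? Intended only as a pre-filter before the
    costly parse, with (page id, language id) pairs recorded in a table."""
    pattern = re.compile(
        r"^==\s*" + re.escape(language_name) + r"\s*==\s*$", re.MULTILINE
    )
    return bool(pattern.search(page_text))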

Lua: differences in __tostring() between 5.1 and later?

Originally found in otherwise unrelated pull request: tatuylonen/wiktextract#230

Reconstruction:Latin/sufferio/Latin/verb: ERROR: LUA error in #invoke ('VL-verb', 'show', 'conj=4th') parent ('Template:VL-conj-4th', {1: 'suffer', 'pastpart': 'suffertum'}) at ['Reconstruction:Latin/sufferio', 'VL-conj-4th', '#invoke']
[string "_sandbox_phase2"]:347: '__tostring' must return a string
stack traceback:
        [C]: in function 'tostring'
        [string "_sandbox_phase2"]:347: in global 'tostring'
        [string "Module:VL-verb"]:125: in field '?'
        [string "Module:VL-verb"]:359: in upvalue 'make_table'
        [string "Module:VL-verb"]:381: in function 'Module:VL-verb.show'
        [C]: in function 'xpcall'
        [string "_sandbox_phase2"]:219: in function <[string "_sandbox_phase2"]:140>

https://en.wiktionary.org/wiki/Module:VL-verb

forms.ind_imperf = {
	["1st"] = {"ābam", "ābās", "ābat", "ābāmus", "ābātis", "ābānt"},
	["2nd"] = {"ēbam", "ēbās", "ēbat", "ēbāmus", "ēbātis", "ēbānt"},
	["3rd"] = {"ēbam", "ēbās", "ēbat", "ēbāmus", "ēbātis", "ēbānt"},
	["4th"] = {"ībam", "ībās", "ībat", "ībāmus", "ībātis", "ībānt"},
}
setmetatable(forms.ind_imperf, {__call = loop_over, __tostring = function() return {"imperfect", "ind_imperf"} end})

The error seems to be some kind of incompatibility between 5.1 and the version we have to use for now.

(LuaJIT, if you didn't know, fell through because it broke the timeout functionality in Scribunto, because it was being done with the debug module and any debug stuff is optimized away inside JITed code; a sufficiently simple infinite loop will make everything hang. Not optimal.)

But I can't figure this out from googling about it. Does anyone know enough about Lua's development history to comment? AFAICT, 5.1 must have somehow allowed __tostring metamethods to return not just strings but also tables, like in the above module, which must have been forbidden in later versions.

Template class="error" when expanding.

I'm trying to extract text from French wiki dumps. Page example: https://fr.wikipedia.org/wiki/Arabie_saoudite
The page contains the following template:

L{{'}}{{arabe|'''Arabie saoudite'''<ref group="note">Aussi plus rarement « Arabie séoudite ».</ref>|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|'''royaume d'Arabie saoudite'''|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

I got <strong class="error">Template:&#x27;</strong><strong class="error"> when expanding.

Code:

    from wikitextprocessor import Wtp

    wikitext = """
    L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
    """
    wtp = Wtp()
    wtp.start_page('Test')
    wiki_data = wtp.parse(
        text=wikitext,
        expand_all=True
    )
    value = wtp.node_to_wikitext(wiki_data)
    print(value)

Output:
L<strong class="error">Template:&#x27;</strong><strong class="error">Template:arabe</strong>, en forme longue le <strong class="error">Template:arabe</strong>, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
