
Comments (34)

kristian-clausal commented on June 13, 2024

Install it from the repo, preferably.

from wikitextprocessor.

LeMoussel commented on June 13, 2024

OK, but I installed it with python3 -m pip install -e .
python3 -m pip install -e --use-pep517 . produces an error.

Let's go!
wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" --skip-extraction ../frwiktionary-latest-pages-articles.xml.bz2

2024-02-05 13:23:09,096 INFO: Capturing words for: fr, mul
2024-02-05 13:23:09,130 INFO: First phase - extracting templates, macros, and pages
2024-02-05 13:23:09,130 INFO: skip_extract_dump: False, save_pages_path: None
2024-02-05 13:23:09,130 INFO: dump file path: ../frwiktionary-latest-pages-articles.xml.bz2
2024-02-05 13:23:11,004 INFO:   ... 10000 raw pages collected
2024-02-05 13:23:12,031 INFO:   ... 20000 raw pages collected
2024-02-05 13:23:12,986 INFO:   ... 30000 raw pages collected
2024-02-05 13:23:13,806 INFO:   ... 40000 raw pages collected
2024-02-05 13:23:14,596 INFO:   ... 50000 raw pages collected
2024-02-05 13:23:15,286 INFO:   ... 60000 raw pages collected
2024-02-05 13:23:15,981 INFO:   ... 70000 raw pages collected
2024-02-05 13:23:16,693 INFO:   ... 80000 raw pages collected
2024-02-05 13:23:17,384 INFO:   ... 90000 raw pages collected
.....

Bingo! It's OK.

2024-02-05 13:28:24,382 INFO:   ... 5290000 raw pages collected
2024-02-05 13:28:25,035 INFO:   ... 5300000 raw pages collected
2024-02-05 13:28:25,467 DEBUG: Starting new HTTPS connection (1): fr.wiktionary.org:443
2024-02-05 13:28:25,987 DEBUG: https://fr.wiktionary.org:443 "GET /w/api.php?action=query&meta=siteinfo&siprop=interwikimap&format=json&formatversion=2 HTTP/1.1" 200 None
(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$

fr-wikt.db is 2.5 GB in size.

kristian-clausal commented on June 13, 2024

I am getting

L[[:Template:']][[:Template:arabe]], en forme longue le [[:Template:arabe]], est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

with the same code.

What version of wikitextprocessor are you using? The pip one will probably always be out of date; just clone the wikitextprocessor repo and use that instead.

kristian-clausal commented on June 13, 2024

Basically, to get the right output I needed to do this:

from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree

wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="/home/kristian/Data/htmlgen/fr/fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
wiki_data = wtp.parse(text=wikitext, expand_all=True)

print_tree(wiki_data, 2)

value = wtp.node_to_wikitext(wiki_data)
print(value)

which resulted in:

L’<bdi dir="rtl" class="script-Arab" style="font-family%3A%27Noto+Sans+Arabic+UI%27%2C%27Noto+Sans+Arabic%27%2CAndalus%2C%27Noto+Naskh+Arabic+UI%27%2C%27Noto+Naskh+Arabic%27%2C%27Traditional+Arabic%27%2CAmiri%2C%27Noto+Kufi+Arabic%27%2C%27Microsoft+Uighur%27%2C%27Tahoma%27%2C%27DejaVu+Sans%27%2Csans-serif%3Bfont-size-adjust%3A70%25%3B">'''Arabie saoudite'''</bdi>, en forme longue le <bdi dir="rtl" class="script-Arab" style="font-family%3A%27Noto+Sans+Arabic+UI%27%2C%27Noto+Sans+Arabic%27%2CAndalus%2C%27Noto+Naskh+Arabic+UI%27%2C%27Noto+Naskh+Arabic%27%2C%27Traditional+Arabic%27%2CAmiri%2C%27Noto+Kufi+Arabic%27%2C%27Microsoft+Uighur%27%2C%27Tahoma%27%2C%27DejaVu+Sans%27%2Csans-serif%3Bfont-size-adjust%3A70%25%3B">'''royaume d'Arabie saoudite'''</bdi>, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

It seems that if the parser can't load the template (from a database, in this case, I guess) it'll default to creating link nodes. Which is weird; I'd expect it to recreate the template with all its arguments. I'll ask Tatu about it.

kristian-clausal commented on June 13, 2024
from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree

wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="/home/kristian/Data/htmlgen/fr/fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
wiki_data = wtp.parse(text=wikitext, expand_all=False)

print_tree(wiki_data, 2)

value = wtp.node_to_wikitext(wiki_data)
print(value)

results in

L{{'}}{{arabe|'''Arabie saoudite'''|العربية السعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|'''royaume d'Arabie saoudite'''|المملكة العربية السعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

Which I bet is going to break if you tried to use it because it lacks all the escape characters! Or maybe not.

The issue was with "expand_all". The parser tries to, well, expand all the templates, and fails at it because it doesn't have access to the templates (through a database file in this case). The result is a link.

xxyzz commented on June 13, 2024

I think the issue author's code doesn't work because he didn't set the correct language code. And why use node_to_wikitext? Shouldn't we use wtp.expand() to get plain text? Also, ' doesn't need to be escaped inside """.

kristian-clausal commented on June 13, 2024

There is a difference between the output I got and what the original poster got, so there's something there other than what was mentioned.

I assumed the original poster wanted the wikitext for other reasons, in which case expand() might not have been appropriate.

EDIT: Oh duh, the title says "expanding". Yeah. In that case, of course expand is appropriate.

LeMoussel commented on June 13, 2024

What version of wikitextprocessor are you using?
0.4.96
Surprising, because this tag does not exist in this repository; the latest tag is 0.4.95.
So I installed from this repo: pip install git+https://github.com/tatuylonen/wikitextprocessor.git

Successfully installed lupa-2.1 lxml-5.1.0 mediawiki-langcodes-0.1.2 psutil-5.9.8 wikitextprocessor-0.4.96

and using wtp.expand(), I got this:

L[[:Modèle:']][[:Modèle:arabe]], en forme longue le [[:Modèle:arabe]], est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

Which does not exactly match the result displayed on the Wikipedia page Arabie saoudite.

If I do value = wtp.node_to_text(wikitext), I get:
"L[[:Modèle:']][[:Modèle:arabe]], en forme longue le [[:Modèle:arabe]], est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par [[Abdelaziz ibn Saoud]]."

Is it possible to get a result like this?
L'Arabie saoudite, en forme longue le royaume d'Arabie saoudite, est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par Abdelaziz ibn Saoud.

Python code:

    from wikitextprocessor import Wtp
    from wikitextprocessor.parser import print_tree

    wikitext = """
    L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
    """
    wtp = Wtp(
        db_path="fr-wikt.db",
        lang_code="fr",
        project="wikipedia",
    )
    wtp.start_page("Test")
    text = wtp.expand(text=wikitext)
    print(text)

kristian-clausal commented on June 13, 2024

Just to make sure, have you created fr-wikt.db? This is the big database file created by wiktwords when you run it on the dump file from Wiktionary (with the --db-path parameter, the same one used when accessing the db later). If you haven't, then the expansion can't work, because it doesn't have access to the templates; there is no "Template:arabe" page that it can expand into full text.

LeMoussel commented on June 13, 2024

have you created fr-wikt.db?
No, I didn't understand that it was necessary to create the fr-wikt.db database.
How can I create this database?

kristian-clausal commented on June 13, 2024

It's a long process, unfortunately, especially on a home machine.

Honestly, I wonder if we should offer downloads for the .db database files we create... I'll ask Tatu.

To create the .db file, all you need to do is run wiktwords with the --db-path parameter (the .db file's path and name) on the appropriate Wiktionary dump file.

lang=fr; wget https://dumps.wikimedia.org/${lang}wiktionary/latest/${lang}wiktionary-latest-pages-articles.xml.bz2 downloads the appropriate file on Linux; this was easier to copy-paste than hunting down the link (lol).

Then run wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" frwiktionary-latest-pages-articles.xml.bz2... I think that's the minimum needed.

Technically you don't need the database file, if you want to extract every page out of the dump .xml.bz2 file each time. The database file acts as a cache that saves all the pages from the dump so that everything runs as quickly as possible, so creating it is basically mandatory anyhow.

The --pages-dir parameter creates a directory with all the pages in the dump file as text files. There are going to be many files in there, but it's useful to have for debugging, and for checking out the source of the pages if you don't want to look online (or want to be sure).

LeMoussel commented on June 13, 2024

Honestly, I wonder if we should have downloads to the .db database files we create...
It's a good idea. This would simplify things for us.

To run wiktwords, I have to install wiktextract via pip install wiktextract.
Is that correct?

kristian-clausal commented on June 13, 2024

We were previously missing a parameter for wiktwords that would allow the creation of a .db without doing the extraction process. Use --skip-extraction when creating the db-file to do just that, should speed up things considerably.

LeMoussel commented on June 13, 2024

Hmmm... I got this with --skip-extraction:

(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" --skip-extraction ../frwiktionary-latest-pages-articles.xml.bz2
usage: wiktwords [-h] [--out OUT] [--errors ERRORS] [--dump-file-language-code DUMP_FILE_LANGUAGE_CODE] [--language-code LANGUAGE_CODE] [--language-name LANGUAGE_NAME] [--all-languages]
                 [--pages-dir PAGES_DIR] [--all] [--translations] [--pronunciations] [--linkages] [--compounds] [--redirects] [--examples] [--etymologies] [--inflections] [--descendants]
                 [--page PAGE] [--db-path DB_PATH] [--num-processes NUM_PROCESSES] [--verbose] [--human-readable] [--override OVERRIDE] [--use-thesaurus] [--profile]
                 [--categories-file CATEGORIES_FILE] [--modules-file MODULES_FILE] [--templates-file TEMPLATES_FILE] [--redirects-file REDIRECTS_FILE]
                 [--inflection-tables-file INFLECTION_TABLES_FILE] [--debug-cell-text DEBUG_CELL_TEXT] [--quiet] [--search-pattern SEARCH_PATTERN]
                 [path]
wiktwords: error: unrecognized arguments: --skip-extraction
(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$

I did this for installation:

git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install --use-pep517 .

kristian-clausal commented on June 13, 2024

Yeah, I just merged it twenty minutes ago (should have specified, lol); pulling the repo again and reinstalling should do the trick.

EDIT: if you install it into the venv using the -e ("editable") pip install flag, you don't need to reinstall after pulling because the install will point straight to the git directory. Useful if you want to edit the code or update it with just a git pull.

LeMoussel commented on June 13, 2024

Now, using the created fr-wikt.db database, I get this result:

L’<bdi  dir="rtl" class="script-Arab" style="font-family:'Noto Sans Arabic UI','Noto Sans Arabic',Andalus,'Noto Naskh Arabic UI','Noto Naskh Arabic','Traditional Arabic',Amiri,'Noto Kufi Arabic','Microsoft Uighur','Tahoma','DejaVu Sans',sans-serif;font-size-adjust:70%;">'''Arabie saoudite'''</bdi>, en forme longue le <bdi  dir="rtl" class="script-Arab" style="font-family:'Noto Sans Arabic UI','Noto Sans Arabic',Andalus,'Noto Naskh Arabic UI','Noto Naskh Arabic','Traditional Arabic',Amiri,'Noto Kufi Arabic','Microsoft Uighur','Tahoma','DejaVu Sans',sans-serif;font-size-adjust:70%;">'''royaume d'Arabie saoudite'''</bdi>, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].

Python code:

    from wikitextprocessor import Wtp

    wikitext = """
    L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
    """
    wtp = Wtp(
        db_path="fr-wikt.db",
        lang_code="fr",
        project="wikipedia",
    )
    wtp.start_page("Test")
    text = wtp.expand(text=wikitext)
    print(text)

Is it possible to get only the text, without the <bdi> tags?

kristian-clausal commented on June 13, 2024

LeMoussel commented on June 13, 2024

Yep! I will survive alone until then :)
Thank you very much for your help.

xxyzz commented on June 13, 2024

You don't need the wiktwords command from the wiktextract project to create a SQLite db file; the db file can be created with the process_dump function. This function runs in a single process, so it can run on any home PC; the speed depends on single-core performance and the number of extracted pages.

clean_code could be used to convert wikitext and HTML tags to plain text.

xxyzz commented on June 13, 2024

And you used the wrong dump file: you should use the French Wikipedia dump file, not the French Wiktionary dump file. Their template pages are different.

kristian-clausal commented on June 13, 2024

That was my fault, completely forgot this was for Wikipedia.

LeMoussel commented on June 13, 2024

OK, to summarize:

  • Download the French Wikipedia dump file: frwiki-latest-pages-articles.xml.bz2
  • Create the SQLite db file using the process_dump function.
  • Use clean_node() to convert wikitext and HTML tags to plain text.

For process_dump, I need to do this:

from functools import partial
from typing import Any

from wikitextprocessor import Page, Wtp
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    pass
wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)

process_dump(
    wtp,
    "frwiki-latest-pages-articles.xml.bz2"
)

for _ in map(partial(page_handler, wtp), wtp.get_all_pages([0])):
    pass

I couldn't find any documentation on clean_node(). Should we proceed in this way?

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext

wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wxr = WiktextractContext(wtp, WiktionaryConfig())
wxr.wtp.start_page("Test")
tree_node = wxr.wtp.parse(text=wikitext, expand_all=True)
clean_node(wxr, None, tree_node)

kristian-clausal commented on June 13, 2024

clean_node returns a string that you need to print, but otherwise it looks fine, as far as I can tell. Creating the db-file with wiktextract is probably simpler if that doesn't work out.

LeMoussel commented on June 13, 2024

I don't understand what you mean by "Creating the db-file with wiktextract is probably simpler if that doesn't work out."

kristian-clausal commented on June 13, 2024

It is probably simpler to create the database file with wiktwords*, I meant to say. wiktwords is the command bundled with wiktextract, so I was talking about the things mentioned earlier in this thread. Scripting everything from scratch with just wikitextprocessor seems excessive.

LeMoussel commented on June 13, 2024

OK, so I'm going to run this
wiktwords --db-path="fr-wiki-latest.db" --dump-file-language-code "fr" --skip-extraction ../frwiki-latest-pages-articles.xml.bz2

LeMoussel commented on June 13, 2024

OK, SQLite database fr-wiki-latest.db created ⇒ 20.9 GB.
I'm going to run the tests again.

LeMoussel commented on June 13, 2024

I'm on the right track... :)

    from wikitextprocessor import Wtp
    from wiktextract.wxr_context import WiktextractContext
    from wiktextract.config import WiktionaryConfig
    from wiktextract.page import clean_node

    wtp = Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    )

    wikitext = """
    L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
    """
    wxr = WiktextractContext(wtp, WiktionaryConfig())
    wxr.wtp.start_page("ExtractText")
    tree_node = wxr.wtp.parse(text=wikitext, expand_all=True)
    text = clean_node(
        wxr=wxr,
        sense_data=None,
        wikinode=tree_node,
    )
    print(text)

Produces this output:

2024-02-09 13:58:07 INFO     Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-09 13:58:07 INFO     NumExpr defaulting to 8 threads.
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar', 2: 'العربيّة السّعودية'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/Arabe', 'Langue', '#invoke']
	Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar-Latn-alalc97', 2: 'al-ʿarabiyya as-saʿūdiyya'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/ALA-LC', 'Langue', '#invoke']
	Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar', 2: 'المملكة العربيّة السّعودية'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/Arabe', 'Langue', '#invoke']
	Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar-Latn-alalc97', 2: 'al-mamlaka al-ʿarabiyya as-saʿūdiyya'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/ALA-LC', 'Langue', '#invoke']
	Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: DEBUG: unmatched <nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: no corresponding start tag found for </nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: unmatched <nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: no corresponding start tag found for </nowiki> at ['ExtractText'] parsing ExtractText

L'Arabie saoudite (en arabe : , ), en forme longue le royaume d'Arabie saoudite (en arabe : , ), est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par Abdelaziz ibn Saoud.

Any idea why I get LUA error in #invoke('Langue', 'langue')? Is this error the reason for the empty (en arabe : , )?

And is it possible to disable printing of debug messages to standard output?

kristian-clausal commented on June 13, 2024

You are correct that the debug output should probably go to standard error, or be possible to disable. We're just so used to outputting the data into JSON files and using standard output as a logging system... We'll look into it.

===

There seems to be some issue, maybe with our implementation of mw.language.fetchLanguageNames...

https://fr.wikipedia.org/w/index.php?title=Module:Langue/Data&action=edit

for k, v in pairs( mwLangFr ) do
	if not p[ k ] then
		p[ k ] = { code = k, nom = v }
		table.insert( p.langueMediaWikiManquantes, k )
	end

	-- mwLangOriginal et mwLangFr ont les mêmes keys, du coup on peut traiter les deux dans cette itération

	local nomOriginal = ustringLower( mwLangOriginal[ k ] )
	if not p[ nomOriginal ] then
		p[ nomOriginal ] = p[ k ]
	end

	local nomFr = ustringLower( v )
	if not p[ nomFr ] then
		p[ nomFr ] = p[ k ]
	end
end

mwLangFr and mwLangOriginal should have the same keys, but mwLangOriginal is returning nil (the key is missing).

local mwLangOriginal = mw.language.fetchLanguageNames()
local mwLangFr = mw.language.fetchLanguageNames( 'fr' )

fetchLanguageNames is our own implementation, and as far as I can tell there's nothing wrong with it. The first call returns a dict/table of language codes mapped to their original language names, the second a table of language codes mapped to their French names... If we're iterating over mwLangFr, then it would follow that there are keys present in mwLangFr that are missing from mwLangOriginal, and indeed, it seems that this is true both ways.

Yeah, the issue seems to be that in our implementation, mwLangFr and mwLangOriginal do NOT "ont les mêmes keys" (have the same keys), which breaks when lower() receives a nil value.
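
The failure mode can be sketched in plain Python (the tables here are made-up stand-ins for the real language-name data):

```python
# Made-up stand-ins for the two tables: the French table has a key that the
# original-names table lacks, so the lookup yields None (Lua's nil), and a
# bare lower() on it would fail, as in Module:Langue/Data.
mw_lang_original = {"en": "English", "fr": "français"}
mw_lang_fr = {"en": "anglais", "fr": "français", "xx": "exemple"}

for code, nom_fr in mw_lang_fr.items():
    nom_original = mw_lang_original.get(code)  # None for "xx"
    # Defensive fix: fall back to the French name instead of crashing.
    print(code, (nom_original or nom_fr).lower())
```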

I tried to just do a naive change to our fetchLanguageNames implementation, but it didn't work: I thought that if I took the language codes from what is effectively mwLangOriginal and added them to mwLangTargetedLanguage (mwLangFr) it would work out, but I forgot that I'd already noticed that mwLangOriginal was also missing entries from mwLangFr.

The implementation relies on xxyzz's mediawiki_langcodes, which uses a baked-in SQLite database to query for these things, and the table construction is a bit too advanced for me to touch (I'm at a beginner SQL level), so I'll leave this to @xxyzz next week.

TODO:

  • mediawiki_langcodes.get_all_names should return all possible language codes as keys, even if they would have an empty string value?

kristian-clausal commented on June 13, 2024

xxyzz's update to mediawiki_langcodes and to wikitextprocessor has fixed the issue with Lang. You need to update mediawiki_langcodes to 0.20 (python -m pip install -U mediawiki_langcodes) and pull wikitextprocessor again.

LeMoussel commented on June 13, 2024

What do you mean by "pull wikitextprocessor again"?
Should I do git pull https://github.com/tatuylonen/wikitextprocessor.git?
(Excuse me, I'm a newbie with Git.)

kristian-clausal commented on June 13, 2024

If you've git cloned wikitextprocessor locally (which you have, I'm pretty sure, otherwise none of this would work... I think), and you've installed it with pip install -e [...] (where the ... is just the other stuff), you can 'update' wikitextprocessor by running git pull anywhere in the wikitextprocessor git folder. It will download everything that's been 'pushed' to the repo here. If you didn't install with pip install -e, then you also need to reinstall it using pip install again (and you could at this point switch to installing the editable version by just adding the -e flag).

LeMoussel commented on June 13, 2024

It's OK.

dev@dev-B550M-DS3H:~/Python/WikiExtractor$ cd wikitextprocessor
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git pull
Updating fdd30e1..6a7890c
Fast-forward
 README.md                        |  2 ++
 pyproject.toml                   |  2 ++
 src/wikitextprocessor/common.py  | 30 ++++++++++++++++++++++++++++++
 src/wikitextprocessor/core.py    | 30 +++++++++++++++++++++++++++---
 src/wikitextprocessor/luaexec.py |  2 +-
 src/wikitextprocessor/parser.py  | 37 -------------------------------------
 tests/test_parser.py             | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 155 insertions(+), 41 deletions(-)

And then ran

python3 -m pip install -U mediawiki_langcodes

LeMoussel commented on June 13, 2024

Well done!
With @xxyzz's fix to mediawiki_langcodes and yours to wikitextprocessor, there are no more errors.
