Install it from the repo, preferably.
from wikitextprocessor.
OK, but I installed with python3 -m pip install -e .
python3 -m pip install -e --use-pep517 .
produces an error.
Let's go!
wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" --skip-extraction ../frwiktionary-latest-pages-articles.xml.bz2
2024-02-05 13:23:09,096 INFO: Capturing words for: fr, mul
2024-02-05 13:23:09,130 INFO: First phase - extracting templates, macros, and pages
2024-02-05 13:23:09,130 INFO: skip_extract_dump: False, save_pages_path: None
2024-02-05 13:23:09,130 INFO: dump file path: ../frwiktionary-latest-pages-articles.xml.bz2
2024-02-05 13:23:11,004 INFO: ... 10000 raw pages collected
2024-02-05 13:23:12,031 INFO: ... 20000 raw pages collected
2024-02-05 13:23:12,986 INFO: ... 30000 raw pages collected
2024-02-05 13:23:13,806 INFO: ... 40000 raw pages collected
2024-02-05 13:23:14,596 INFO: ... 50000 raw pages collected
2024-02-05 13:23:15,286 INFO: ... 60000 raw pages collected
2024-02-05 13:23:15,981 INFO: ... 70000 raw pages collected
2024-02-05 13:23:16,693 INFO: ... 80000 raw pages collected
2024-02-05 13:23:17,384 INFO: ... 90000 raw pages collected
.....
Bingo! It's OK.
2024-02-05 13:28:24,382 INFO: ... 5290000 raw pages collected
2024-02-05 13:28:25,035 INFO: ... 5300000 raw pages collected
2024-02-05 13:28:25,467 DEBUG: Starting new HTTPS connection (1): fr.wiktionary.org:443
2024-02-05 13:28:25,987 DEBUG: https://fr.wiktionary.org:443 "GET /w/api.php?action=query&meta=siteinfo&siprop=interwikimap&format=json&formatversion=2 HTTP/1.1" 200 None
(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$
fr-wikt.db is 2.5 GB in size.
I am getting:
L[[:Template:']][[:Template:arabe]], en forme longue le [[:Template:arabe]], est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
with the same code.
What version of wikitextprocessor are you using? The pip one will probably always be out of date, just clone the wikitextprocessor repo and use that instead.
Basically, to get the right output I needed to do this:
from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree
wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="/home/kristian/Data/htmlgen/fr/fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
wiki_data = wtp.parse(text=wikitext, expand_all=True)
print_tree(wiki_data, 2)
value = wtp.node_to_wikitext(wiki_data)
print(value)
which resulted in:
L’<bdi dir="rtl" class="script-Arab" style="font-family%3A%27Noto+Sans+Arabic+UI%27%2C%27Noto+Sans+Arabic%27%2CAndalus%2C%27Noto+Naskh+Arabic+UI%27%2C%27Noto+Naskh+Arabic%27%2C%27Traditional+Arabic%27%2CAmiri%2C%27Noto+Kufi+Arabic%27%2C%27Microsoft+Uighur%27%2C%27Tahoma%27%2C%27DejaVu+Sans%27%2Csans-serif%3Bfont-size-adjust%3A70%25%3B">'''Arabie saoudite'''</bdi>, en forme longue le <bdi dir="rtl" class="script-Arab" style="font-family%3A%27Noto+Sans+Arabic+UI%27%2C%27Noto+Sans+Arabic%27%2CAndalus%2C%27Noto+Naskh+Arabic+UI%27%2C%27Noto+Naskh+Arabic%27%2C%27Traditional+Arabic%27%2CAmiri%2C%27Noto+Kufi+Arabic%27%2C%27Microsoft+Uighur%27%2C%27Tahoma%27%2C%27DejaVu+Sans%27%2Csans-serif%3Bfont-size-adjust%3A70%25%3B">'''royaume d'Arabie saoudite'''</bdi>, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
It seems that if the parser can't load a template (from a database, in this case, I guess), it'll default to creating link nodes. Which is weird; I'd expect it to recreate the template with all its arguments. I'll ask Tatu about it.
from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree
wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="/home/kristian/Data/htmlgen/fr/fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
wiki_data = wtp.parse(text=wikitext, expand_all=False)
print_tree(wiki_data, 2)
value = wtp.node_to_wikitext(wiki_data)
print(value)
results in:
L{{'}}{{arabe|'''Arabie saoudite'''|العربية السعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|'''royaume d'Arabie saoudite'''|المملكة العربية السعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
Which I bet is going to break if you try to use it, because it lacks all the escape characters! Or maybe not.
The issue was with "expand_all". The parser tries to, well, expand all the templates, and fails at it because it doesn't have access to the templates (through a database file in this case). The result is a link.
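This fallback can be illustrated with a toy expander (purely illustrative, not the real wikitextprocessor code; note the French wikis use the "Modèle:" namespace rather than "Template:"): when a template body isn't available, the call degrades into a link to the template page.

```python
import re

def toy_expand(text: str, templates: dict) -> str:
    """Expand {{name|args}} calls; unknown templates degrade into template-page links."""
    def repl(match):
        name = match.group(1).split("|")[0].strip()
        body = templates.get(name)
        if body is None:
            # No template page available (e.g. no db file): fall back to a link.
            return f"[[:Template:{name}]]"
        return body
    return re.sub(r"\{\{([^{}]*)\}\}", repl, text)

print(toy_expand("le {{arabe|'''Arabie saoudite'''}}", {}))
```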
I think the reason the issue author's code doesn't work is that he didn't set the correct language code. And why use node_to_wikitext? Shouldn't we use wtp.expand() to get plain text? Also, ' doesn't need to be escaped inside """.
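A quick check confirms the escaping point: Python treats \' as just ' in any string literal, so the backslashes in the snippets above are harmless but unnecessary.

```python
# Both literals are the same string: \' is just ' everywhere in Python,
# and inside triple quotes the apostrophe needs no escaping at all.
escaped = """L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'}}"""
plain = """L{{'}}{{arabe|'''Arabie saoudite'''}}"""
print(escaped == plain)
```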
There is a difference between the output I got and what the original poster got, so something else is going on beyond what was mentioned.
I assumed the original poster wanted the wikitext for other reasons, in which case expand() might not have been appropriate.
EDIT: Oh duh, the title says "expanding". Yeah. In that case, of course expand is appropriate.
What version of wikitextprocessor are you using?
0.4.96
Surprising, because this tag does not exist in this repository; the latest tag is 0.4.95.
So I installed from this repo: pip install git+https://github.com/tatuylonen/wikitextprocessor.git
Successfully installed lupa-2.1 lxml-5.1.0 mediawiki-langcodes-0.1.2 psutil-5.9.8 wikitextprocessor-0.4.96
and using wtp.expand(), I got this:
L[[:Modèle:']][[:Modèle:arabe]], en forme longue le [[:Modèle:arabe]], est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
Which is not exactly the result displayed on the Wikipedia page Arabie saoudite.
If I do value = wtp.node_to_text(wikitext), I get:
"L[[:Modèle:']][[:Modèle:arabe]], en forme longue le [[:Modèle:arabe]], est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par [[Abdelaziz ibn Saoud]]."
Is it possible to get a result like this?
L'Arabie saoudite, en forme longue le royaume d'Arabie saoudite, est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par Abdelaziz ibn Saoud.
Python code:
from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree
wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
text = wtp.expand(text=wikitext)
print(text)
Just to make sure, have you created fr-wikt.db? This is the big database file created by wiktwords when you run it on the dump file from Wiktionary (with the --db-path parameter, the same one used when accessing the db later). If you haven't, then the expansion can't work because it doesn't have access to the templates; there is no "Template:arabe" page that it can expand into full text.
have you created fr-wikt.db?
No, I didn't understand that it was necessary to create the fr-wikt.db database.
How can I create this database?
It's a long process, unfortunately, especially on a home machine.
Honestly, I wonder if we should offer downloads of the .db database files we create... I'll ask Tatu.
To create the .db file, all you need to do is run wiktwords with the --db-path parameter (the .db file's path and name) on the appropriate Wiktionary dump file.
lang=fr; wget https://dumps.wikimedia.org/${lang}wiktionary/latest/${lang}wiktionary-latest-pages-articles.xml.bz2
will download the appropriate file on Linux; this was easier to copy-paste than hunting down the link (lol).
Then run: wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" frwiktionary-latest-pages-articles.xml.bz2
... I think that's the minimum needed.
Technically you don't need the database file... if you want to extract every page out of the dump.xml.bz2 file each time. The database file acts as a cache that saves all the pages from the dump so that you can do everything as quickly as possible, so creating the database file is basically mandatory anyhow.
The --pages-dir parameter creates a directory with all the pages in the dump file as text files. There are going to be many files in there, but it's useful to have for debugging and just checking out the source of the pages if you don't want to look online (or want to be sure).
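The cache idea can be sketched with the stdlib sqlite3 module. This is a simplified stand-in; the real .db schema used by wikitextprocessor is different and more involved.

```python
import sqlite3
from typing import Optional

# A minimal page cache: title -> wikitext, so the dump only has to be read once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)")

def cache_page(title: str, body: str) -> None:
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (title, body))

def get_page(title: str) -> Optional[str]:
    row = conn.execute("SELECT body FROM pages WHERE title = ?", (title,)).fetchone()
    return row[0] if row else None

cache_page("Modèle:arabe", "{{{1}}}")
print(get_page("Modèle:arabe"))
```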
Honestly, I wonder if we should have downloads to the .db database files we create...
It's a good idea. This would simplify the usage.
To run wiktwords, I have to install wiktextract via pip install wiktextract. Is that correct?
We were previously missing a parameter for wiktwords that would allow creating a .db without doing the extraction process. Use --skip-extraction when creating the db file to do just that; it should speed things up considerably.
Hmmm... I got this with --skip-extraction:
(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ wiktwords --db-path="fr-wikt.db" --dump-file-language-code "fr" --skip-extraction ../frwiktionary-latest-pages-articles.xml.bz2
usage: wiktwords [-h] [--out OUT] [--errors ERRORS] [--dump-file-language-code DUMP_FILE_LANGUAGE_CODE] [--language-code LANGUAGE_CODE] [--language-name LANGUAGE_NAME] [--all-languages]
[--pages-dir PAGES_DIR] [--all] [--translations] [--pronunciations] [--linkages] [--compounds] [--redirects] [--examples] [--etymologies] [--inflections] [--descendants]
[--page PAGE] [--db-path DB_PATH] [--num-processes NUM_PROCESSES] [--verbose] [--human-readable] [--override OVERRIDE] [--use-thesaurus] [--profile]
[--categories-file CATEGORIES_FILE] [--modules-file MODULES_FILE] [--templates-file TEMPLATES_FILE] [--redirects-file REDIRECTS_FILE]
[--inflection-tables-file INFLECTION_TABLES_FILE] [--debug-cell-text DEBUG_CELL_TEXT] [--quiet] [--search-pattern SEARCH_PATTERN]
[path]
wiktwords: error: unrecognized arguments: --skip-extraction
(.venv) dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$
I did this for installation:
git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install --use-pep517 .
Yeah, I just merged it twenty minutes ago (should have specified, lol); pulling the repo again and reinstalling should do the trick.
EDIT: if you install it into the venv using the -e ("editable") pip install flag, you don't need to reinstall after pulling because the install will point straight to the git directory. Useful if you want to edit the code or update it with just a git pull.
Now, using the created DB fr-wikt.db, I have this result:
L’<bdi dir="rtl" class="script-Arab" style="font-family:'Noto Sans Arabic UI','Noto Sans Arabic',Andalus,'Noto Naskh Arabic UI','Noto Naskh Arabic','Traditional Arabic',Amiri,'Noto Kufi Arabic','Microsoft Uighur','Tahoma','DejaVu Sans',sans-serif;font-size-adjust:70%;">'''Arabie saoudite'''</bdi>, en forme longue le <bdi dir="rtl" class="script-Arab" style="font-family:'Noto Sans Arabic UI','Noto Sans Arabic',Andalus,'Noto Naskh Arabic UI','Noto Naskh Arabic','Traditional Arabic',Amiri,'Noto Kufi Arabic','Microsoft Uighur','Tahoma','DejaVu Sans',sans-serif;font-size-adjust:70%;">'''royaume d'Arabie saoudite'''</bdi>, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
Python code:
from wikitextprocessor import Wtp
wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wtp = Wtp(
    db_path="fr-wikt.db",
    lang_code="fr",
    project="wikipedia",
)
wtp.start_page("Test")
text = wtp.expand(text=wikitext)
print(text)
Is it possible to get only the text, without the <bdi> tags?
Yep! I will survive on my own until then :)
Thank you very much for your help.
You don't need the wiktwords command from the wiktextract project to create a SQLite db file; the db file can be created by using the process_dump function. This function runs in a single process, so it can run on any home PC; the speed depends on single-core performance and the number of extracted pages.
clean_node could be used to convert wikitext and HTML tags to plain text.
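As a rough illustration of what such a cleanup does (a naive regex sketch, not the actual wiktextract implementation, which handles far more cases):

```python
import re

def naive_clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", "", text)               # drop HTML tags such as <bdi ...>
    text = text.replace("'''", "").replace("''", "")  # bold/italic apostrophe markers
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    return text

sample = ("<bdi dir=\"rtl\">'''Arabie saoudite'''</bdi> est une "
          "[[Absolutisme|monarchie absolue]]")
print(naive_clean(sample))
```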
And you're using the wrong dump file: you should use the French Wikipedia dump file, not the French Wiktionary dump file. Their template pages are different.
That was my fault, completely forgot this was for Wikipedia.
OK. To summarize:
- Download the French Wikipedia dump file: frwiki-latest-pages-articles.xml.bz2
- Create the SQLite db file using the process_dump function.
- Use clean_node() to convert wikitext and HTML tags to plain text.
For process_dump, I need to do this:
from functools import partial
from typing import Any
from wikitextprocessor.dumpparser import process_dump
from wikitextprocessor import Page, Wtp

def page_handler(wtp: Wtp, page: Page) -> Any:
    pass

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
process_dump(
    wtp,
    "frwiki-latest-pages-articles.xml.bz2",
)
for _ in map(partial(page_handler, wtp), wtp.get_all_pages([0])):
    pass
I couldn't find any documentation on clean_node(). Should we proceed this way?
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wikitextprocessor import Wtp

wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wxr = WiktextractContext(wtp, WiktionaryConfig())
wxr.wtp.start_page("Test")
tree_node = wxr.wtp.parse(text=wikitext, expand_all=True)
clean_node(wxr, None, tree_node)
clean_node returns a string that you need to print, but afaict fine. Creating the db-file with wiktextract is probably simpler if that doesn't work out.
I don't understand what you mean by "Creating the db-file with wiktextract is probably simpler if that doesn't work out."
It is probably simpler to create the database file with wiktwords*, I meant to say. wiktwords is the command bundled with wiktextract, so I was talking about the things mentioned earlier in this thread. Scripting everything from scratch with just wikitextprocessor seems excessive.
OK, so I'm going to run this
wiktwords --db-path="fr-wiki-latest.db" --dump-file-language-code "fr" --skip-extraction ../frwiki-latest-pages-articles.xml.bz2
OK, SQLite database fr-wiki-latest.db created ⇒ 20.9 GB.
I'm going to run the tests again.
I'm on the right track... :)
from wikitextprocessor import Wtp
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wikitext = """
L{{\'}}{{arabe|\'\'\'Arabie saoudite\'\'\'|العربيّة السّعودية|al-ʿarabiyya as-saʿūdiyya}}, en forme longue le {{arabe|\'\'\'royaume d\'Arabie saoudite\'\'\'|المملكة العربيّة السّعودية|al-mamlaka al-ʿarabiyya as-saʿūdiyya}}, est une [[Absolutisme|monarchie absolue]] [[État islamique|islamique]] dirigée par la [[dynastie saoudienne|dynastie des Saoud]], depuis sa création en 1932 par [[Abdelaziz ibn Saoud]].
"""
wxr = WiktextractContext(wtp, WiktionaryConfig())
wxr.wtp.start_page("ExtractText")
tree_node = wxr.wtp.parse(text=wikitext, expand_all=True)
sense_data = {}
text = clean_node(
    wxr=wxr,
    sense_data=None,
    wikinode=tree_node,
)
print(text)
Produces this output:
2024-02-09 13:58:07 INFO Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-09 13:58:07 INFO NumExpr defaulting to 8 threads.
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar', 2: 'العربيّة السّعودية'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/Arabe', 'Langue', '#invoke']
Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar-Latn-alalc97', 2: 'al-ʿarabiyya as-saʿūdiyya'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/ALA-LC', 'Langue', '#invoke']
Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar', 2: 'المملكة العربيّة السّعودية'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/Arabe', 'Langue', '#invoke']
Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: ERROR: LUA error in #invoke('Langue', 'langue') parent ('Modèle:Langue', {1: 'ar-Latn-alalc97', 2: 'al-mamlaka al-ʿarabiyya as-saʿūdiyya'}) at ['ExtractText', 'arabe', '#if', '#if', 'Arabe/ALA-LC', 'Langue', '#invoke']
Loading module failed in #invoke: Langue
[string "Module:Langue/Data"]:1644: bad argument #1 to 'lower' (string expected, got nil)
ExtractText: DEBUG: unmatched <nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: no corresponding start tag found for </nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: unmatched <nowiki> at ['ExtractText'] parsing ExtractText
ExtractText: DEBUG: no corresponding start tag found for </nowiki> at ['ExtractText'] parsing ExtractText
L'Arabie saoudite (en arabe : , ), en forme longue le royaume d'Arabie saoudite (en arabe : , ), est une monarchie absolue islamique dirigée par la dynastie des Saoud, depuis sa création en 1932 par Abdelaziz ibn Saoud.
Any idea why I have LUA error in #invoke('Langue', 'langue')?
Is this error the reason there is (en arabe : , ) in the output?
And is it possible to disable printing of debug messages to standard output?
You are correct that the debug output should probably go to standard error or be disableable. We're just so used to outputting the data into JSON files and using standard output as a logging system... We'll look into it.
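In the meantime, a generic Python workaround may help: raise the logging threshold so DEBUG records are dropped. Whether this catches everything depends on how wikitextprocessor emits its messages, so treat it as a sketch, not a guarantee.

```python
import logging

# Raise the threshold so DEBUG (and INFO) records are dropped.
logging.basicConfig(level=logging.ERROR)
# The "Starting new HTTPS connection" chatter comes from urllib3's logger.
logging.getLogger("urllib3").setLevel(logging.ERROR)

logging.debug("this is suppressed")
logging.error("this still shows")
```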
===
There seems to be some issue, maybe with our implementation of mw.language.fetchLanguageNames...
https://fr.wikipedia.org/w/index.php?title=Module:Langue/Data&action=edit
for k, v in pairs( mwLangFr ) do
    if not p[ k ] then
        p[ k ] = { code = k, nom = v }
        table.insert( p.langueMediaWikiManquantes, k )
    end
    -- mwLangOriginal et mwLangFr ont les mêmes keys, du coup on peut traiter les deux dans cette itération
    local nomOriginal = ustringLower( mwLangOriginal[ k ] )
    if not p[ nomOriginal ] then
        p[ nomOriginal ] = p[ k ]
    end
    local nomFr = ustringLower( v )
    if not p[ nomFr ] then
        p[ nomFr ] = p[ k ]
    end
end
mwLangFr and mwLangOriginal should have the same keys, but mwLangOriginal is returning nil (the key is missing).
local mwLangOriginal = mw.language.fetchLanguageNames()
local mwLangFr = mw.language.fetchLanguageNames( 'fr' )
fetchLanguageNames is our own implementation, and afaict there's nothing wrong with it. The first one returns a dict/table of language codes mapped to their original language names, the second one a table with language codes mapped to their French names... If we're iterating over mwLangFr, then it would follow that there are keys present in mwLangFr that are missing from mwLangOriginal, and indeed, it seems that this is true both ways.
Yeah, the issue seems to be that in our implementation, mwLangFr and mwLangOriginal do NOT 'ont les mêmes keys', which breaks when lower() receives a nil value.
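The failure mode is easy to reproduce in miniature: iterate over one mapping and look keys up in another that lacks some of them. The names and data here are illustrative stand-ins for the tables returned by mw.language.fetchLanguageNames().

```python
# Illustrative stand-ins: French names vs. autonyms, with a key missing on one side.
mw_lang_fr = {"fr": "français", "xx": "langue-fantôme"}  # code -> French name
mw_lang_original = {"fr": "français"}                    # code -> autonym, "xx" missing

def lower_or_fail(value):
    # Mirrors Lua's ustring.lower(nil): "string expected, got nil"
    if value is None:
        raise TypeError("string expected, got nil")
    return value.lower()

for code in mw_lang_fr:
    original = mw_lang_original.get(code)  # None for "xx", like the nil in Lua
    try:
        lower_or_fail(original)
    except TypeError as err:
        print(f"{code}: {err}")
```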
I tried to just do a naive change to our fetchLanguageNames implementation, but it didn't work: I thought that if I got the language codes from what is effectively mwLangOriginal and added it to mwLangTargetedLanguage (mwLangFr) it would work out, but I forgot that I'd already noticed that mwLangOriginal was also missing stuff from mwLangFr.
The implementation relies on xxyzz's mediawiki_languagecodes that uses a baked-in sqlite database to query for these things, and the table construction is a bit too intermediate level for me to touch (at a beginner SQL level), so I'll leave this to @xxyzz, next week.
TODO:
- mediawiki_languagecodes.get_all_names should return all possible language codes as keys, even if they would have an empty string value?
xxyzz's update to mediawiki_langcodes and to wikitextprocessor has fixed the issue with Lang. You need to update mediawiki_langcodes to 0.20 (python -m pip install -U mediawiki_langcodes) and pull wikitextprocessor again.
What do you mean by "pull wikitextprocessor again"?
Should I do git pull https://github.com/tatuylonen/wikitextprocessor.git?
(Excuse me, I'm a newbie with Git.)
If you've git cloned wikitextprocessor locally (which you have, I'm pretty sure, otherwise none of this would work... I think), and you've installed it with pip install -e [...] (where the ... is just the other stuff), you can 'update' wikitextprocessor by running git pull anywhere in the wikitextprocessor git folder. It will download everything that's been 'pushed' to the repo here. If you didn't install with pip install -e, then you also need to reinstall it with pip install again (and you could at this point switch to installing the editable version by just adding the -e flag).
It's OK.
dev@dev-B550M-DS3H:~/Python/WikiExtractor$ cd wikitextprocessor
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git pull
Updating fdd30e1..6a7890c
Fast-forward
README.md | 2 ++
pyproject.toml | 2 ++
src/wikitextprocessor/common.py | 30 ++++++++++++++++++++++++++++++
src/wikitextprocessor/core.py | 30 +++++++++++++++++++++++++++---
src/wikitextprocessor/luaexec.py | 2 +-
src/wikitextprocessor/parser.py | 37 -------------------------------------
tests/test_parser.py | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 155 insertions(+), 41 deletions(-)
And then run: python3 -m pip install -U mediawiki_langcodes
Well done!
With the fix from @xxyzz to mediawiki_langcodes and yours to wikitextprocessor, there are no more errors.