Repeating "downloading" line in debugger logging when downloading data

spacy-lefff : Custom French POS and lemmatizer based on Lefff for spacy

spacy v2.0 extension and pipeline component for adding a French POS and lemmatizer based on Lefff.

On version v2.0.17, spaCy updated French lemmatization

As of version 0.4.0 and above, spacy-lefff only supports python3.6+ and spacy v3

As of version 0.5.0 and above, spacy-lefff only supports python3.8+ and spacy v3

Description

This package brings Lefff lemmatization and part-of-speech tagging to a spaCy custom pipeline. Combining POS tagging and lemmatization in one pipeline improves French text preprocessing compared to the built-in spaCy French processing. It is still a work in progress (WIP), so the matching might not be perfect, but when the package finds nothing you can still fall back on spaCy's default results.

Installation

spacy-lefff requires spacy >= v3.0.0.

pip install spacy-lefff

Usage

Import and initialize your spaCy nlp object, then add the custom component after the document has been parsed so that it can use the POS tags. Make sure your input text is UTF-8.

If both the POS tagger and the lemmatizer are in the pipeline, tell the lemmatizer to use the MElt mapping by setting after_melt; otherwise it uses the spaCy part-of-speech mapping.

The default option returns the word itself when no lemma is found.

The current mapping from spaCy to Lefff tags is:

{
    "ADJ": "adj",
    "ADP": "det",
    "ADV": "adv",
    "DET": "det",
    "PRON": "cln",
    "PROPN": "np",
    "NOUN": "nc",
    "VERB": "v",
    "PUNCT": "poncts"
}
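
To illustrate how a mapping like this can be applied, here is a minimal sketch of a dictionary lookup. The helper name to_lefff_tag is hypothetical and is not part of spacy-lefff. Returning None for tags missing from the table is an assumption of this sketch, since this README does not show how the package handles them.

```python
# Mapping copied from the table above.
SPACY_TO_LEFFF = {
    "ADJ": "adj",
    "ADP": "det",
    "ADV": "adv",
    "DET": "det",
    "PRON": "cln",
    "PROPN": "np",
    "NOUN": "nc",
    "VERB": "v",
    "PUNCT": "poncts",
}


def to_lefff_tag(spacy_pos):
    """Hypothetical helper: translate a spaCy POS tag to a Lefff tag.

    Returns None when the tag is not in the mapping (an assumption made
    for this sketch, not documented package behaviour).
    """
    return SPACY_TO_LEFFF.get(spacy_pos)


print(to_lefff_tag("NOUN"))  # nc
print(to_lefff_tag("NUM"))   # None
```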

MElt Tagset

MElt Tag table:

ADJ 	   adjective
ADJWH	   interrogative adjective
ADV	   adverb
ADVWH	   interrogative adverb
CC	   coordination conjunction
CLO	   object clitic pronoun
CLR	   reflexive clitic pronoun
CLS	   subject clitic pronoun
CS	   subordination conjunction
DET	   determiner
DETWH	   interrogative determiner
ET	   foreign word
I	   interjection
NC	   common noun
NPP	   proper noun
P	   preposition
P+D	   preposition+determiner amalgam
P+PRO	   preposition+pronoun amalgam
PONCT	   punctuation mark
PREF	   prefix
PRO	   full pronoun
PROREL	   relative pronoun
PROWH	   interrogative pronoun
V	   indicative or conditional verb form
VIMP	   imperative verb form
VINF	   infinitive verb form
VPP	   past participle
VPR	   present participle
VS	   subjunctive verb form

Code snippet

You first need to install the French spaCy model: python -m spacy download fr_core_news_sm.

An example using the LefffLemmatizer without the POSTagger:

import spacy
from spacy_lefff import LefffLemmatizer
from spacy.language import Language

@Language.factory('french_lemmatizer')
def create_french_lemmatizer(nlp, name):
    return LefffLemmatizer()

nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('french_lemmatizer', name='lefff')
doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
for d in doc:
    print(d.text, d.pos_, d._.lefff_lemma, d.tag_, d.lemma_)

Text	spaCy POS	Lefff Lemma	spaCy tag	spaCy Lemma
Apple	ADJ	None	ADJ__Number=Sing	Apple
cherche	NOUN	cherche	NOUN__Number=Sing	chercher
a	AUX	None	AUX__Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin	avoir
acheter	VERB	acheter	VERB__VerbForm=Inf	acheter
une	DET	un	DET__Definite=Ind Gender=Fem Number=Sing PronType=Art	un
startup	ADJ	None	ADJ__Number=Sing	startup
anglaise	NOUN	anglaise	NOUN__Gender=Fem Number=Sing	anglais
pour	ADP	None	ADP___	pour
1	NUM	None	NUM__NumType=Card	1
milliard	NOUN	milliard	NOUN__Gender=Masc Number=Sing NumType=Card	milliard
de	ADP	un	ADP___	de
dollard	NOUN	None	NOUN__Gender=Masc Number=Sing	dollard

An example using the POSTagger:

import spacy
from spacy_lefff import LefffLemmatizer, POSTagger
from spacy.language import Language

@Language.factory('french_lemmatizer')
def create_french_lemmatizer(nlp, name):
    return LefffLemmatizer(after_melt=True, default=True)

@Language.factory('melt_tagger')  
def create_melt_tagger(nlp, name):
    return POSTagger()
 
nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('melt_tagger', after='parser')
nlp.add_pipe('french_lemmatizer', after='melt_tagger')
doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
for d in doc:
    print(d.text, d.pos_, d._.melt_tagger, d._.lefff_lemma, d.tag_, d.lemma_)

Text	spaCy POS	MElt Tag	Lefff Lemma	spaCy tag	spaCy Lemma
Apple	ADJ	NPP	apple	ADJ__Number=Sing	Apple
cherche	NOUN	V	chercher	NOUN__Number=Sing	chercher
a	AUX	V	avoir	AUX__Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin	avoir
acheter	VERB	VINF	acheter	VERB__VerbForm=Inf	acheter
une	DET	DET	un	DET__Definite=Ind Gender=Fem Number=Sing PronType=Art	un
startup	ADJ	NC	startup	ADJ__Number=Sing	startup
anglaise	NOUN	ADJ	anglais	NOUN__Gender=Fem Number=Sing	anglais
pour	ADP	P	pour	ADP___	pour
1	NUM	DET	1	NUM__NumType=Card	1
milliard	NOUN	NC	milliard	NOUN__Gender=Masc Number=Sing NumType=Card	milliard
de	ADP	P	de	ADP___	de
dollard	NOUN	NC	dollard	NOUN__Gender=Masc Number=Sing	dollard

We can see that both cherche and startup were not tagged correctly by the default POS tagger. spaCy classified them as a NOUN and an ADJ, while MElt classified them as a V and an NC.

Credits

Sagot, B. (2010). The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In 7th international conference on Language Resources and Evaluation (LREC 2010).

Benoît Sagot Webpage about LEFFF
http://alpage.inria.fr/~sagot/lefff-en.html

First work of Claude Coulombe to support Lefff with Python : https://github.com/ClaudeCoulombe

done = int(50 * dl / total_length)
if done % 5 == 0:
    LOGGER.debug("\r[%s%s] : downloading...",
                 '*' * done, ' ' * (50 - done))
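
In the snippet above, the condition done % 5 == 0 stays true for every chunk received while done sits at the same multiple of 5, which would explain the repeated "downloading" lines named in the title. One possible way to avoid this, sketched below, is to remember the last value logged and skip the log call until the bar advances. The function log_progress and its last_done parameter are hypothetical names introduced for this sketch; they are not part of spacy-lefff.

```python
import logging

LOGGER = logging.getLogger(__name__)


def log_progress(dl, total_length, last_done):
    """Log the progress bar only when it reaches a new multiple of 5.

    dl is the number of bytes downloaded so far, total_length the expected
    size. Returns the value to pass as last_done on the next call.
    Hypothetical helper, not the package's actual code.
    """
    done = int(50 * dl / total_length)
    if done % 5 == 0 and done != last_done:
        LOGGER.debug("\r[%s%s] : downloading...",
                     '*' * done, ' ' * (50 - done))
        return done
    return last_done
```

A caller would start with last_done = -1 and feed the returned value back in after each chunk, so each step of the bar is logged once however many chunks arrive in between.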
