
spacy-lefff's Introduction


spacy-lefff: Custom French POS tagger and lemmatizer based on Lefff for spaCy

A spaCy v2.0 extension and pipeline component for adding a French POS tagger and lemmatizer based on Lefff.

In version v2.0.17, spaCy updated its French lemmatization.

As of version 0.4.0, spacy-lefff only supports Python 3.6+ and spaCy v3.

As of version 0.5.0, spacy-lefff only supports Python 3.8+ and spaCy v3.

Description

This package brings Lefff lemmatization and part-of-speech tagging to a custom spaCy pipeline. Combining POS tagging and lemmatization inside a pipeline improves text preprocessing for French compared to the built-in spaCy French processing. It is still a work in progress (WIP), so the matching might not be perfect, but when the package finds nothing, the default spaCy results remain available.

Installation

spacy-lefff requires spacy >= v3.0.0.

pip install spacy-lefff

Usage

Import and initialize your spaCy nlp object, then add the custom component after the document has been parsed so that it can benefit from the POS tags. Make sure to work with UTF-8 text.

If both the POS tagger and the lemmatizer are bundled, you need to tell the lemmatizer to use the MElt mapping by setting after_melt; otherwise it will use the spaCy part-of-speech mapping.

The default option returns the word itself when no lemma is found.
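
The fallback behaviour of the default option can be pictured with a small sketch. It is not the package's implementation: the TOY_LEXICON dictionary and the lookup_lemma helper are hypothetical names introduced only for illustration, and the real Lefff data is far larger.

```python
# Hypothetical toy lexicon keyed by (word, Lefff tag); not the real Lefff data.
TOY_LEXICON = {
    ("cherche", "v"): "chercher",
    ("une", "det"): "un",
}


def lookup_lemma(word, tag, default=False):
    """Return the lemma for (word, tag), or fall back when nothing matches."""
    lemma = TOY_LEXICON.get((word.lower(), tag))
    if lemma is None and default:
        # With default enabled, the word itself is returned instead of None.
        return word
    return lemma


print(lookup_lemma("cherche", "v"))
print(lookup_lemma("dollard", "nc"))
print(lookup_lemma("dollard", "nc", default=True))
```

Without the fallback, an unknown word yields None, which matches the None entries visible in the first README example further down.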

The current mapping from spaCy to Lefff is:

{
    "ADJ": "adj",
    "ADP": "det",
    "ADV": "adv",
    "DET": "det",
    "PRON": "cln",
    "PROPN": "np",
    "NOUN": "nc",
    "VERB": "v",
    "PUNCT": "poncts"
}
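
As a rough illustration of how such a mapping might be consulted, the sketch below copies the table above into a dictionary and looks up a spaCy tag. The spacy_pos_to_lefff helper is a hypothetical name, not a function exported by the package.

```python
# Mapping copied from the table above (spaCy coarse POS -> Lefff category).
SPACY_TO_LEFFF = {
    "ADJ": "adj",
    "ADP": "det",
    "ADV": "adv",
    "DET": "det",
    "PRON": "cln",
    "PROPN": "np",
    "NOUN": "nc",
    "VERB": "v",
    "PUNCT": "poncts",
}


def spacy_pos_to_lefff(pos):
    """Translate a spaCy POS tag into a Lefff category, or None if unmapped."""
    return SPACY_TO_LEFFF.get(pos)


print(spacy_pos_to_lefff("NOUN"))
print(spacy_pos_to_lefff("NUM"))
```

Tags absent from the table, such as NUM or AUX, have no Lefff counterpart in this sketch, so the lookup returns None for them.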

MElt Tagset

MElt Tag table:

ADJ 	   adjective
ADJWH	   interrogative adjective
ADV	   adverb
ADVWH	   interrogative adverb
CC	   coordination conjunction
CLO	   object clitic pronoun
CLR	   reflexive clitic pronoun
CLS	   subject clitic pronoun
CS	   subordination conjunction
DET	   determiner
DETWH	   interrogative determiner
ET	   foreign word
I	   interjection
NC	   common noun
NPP	   proper noun
P	   preposition
P+D	   preposition+determiner amalgam
P+PRO	   preposition+pronoun amalgam
PONCT	   punctuation mark
PREF	   prefix
PRO	   full pronoun
PROREL	   relative pronoun
PROWH	   interrogative pronoun
V	   indicative or conditional verb form
VIMP	   imperative verb form
VINF	   infinitive verb form
VPP	   past participle
VPR	   present participle
VS	   subjunctive verb form

Code snippet

You need to install the French spaCy model first: python -m spacy download fr_core_news_sm.

  • An example using the LefffLemmatizer without the POSTagger:
import spacy
from spacy_lefff import LefffLemmatizer
from spacy.language import Language

@Language.factory('french_lemmatizer')
def create_french_lemmatizer(nlp, name):
    return LefffLemmatizer()

nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('french_lemmatizer', name='lefff')
doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
for d in doc:
    print(d.text, d.pos_, d._.lefff_lemma, d.tag_, d.lemma_)

Text      spaCy POS  Lefff Lemma  spaCy tag                                                   spaCy Lemma
Apple     ADJ        None         ADJ__Number=Sing                                            Apple
cherche   NOUN       cherche      NOUN__Number=Sing                                           chercher
a         AUX        None         AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  avoir
acheter   VERB       acheter      VERB__VerbForm=Inf                                          acheter
une       DET        un           DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art       un
startup   ADJ        None         ADJ__Number=Sing                                            startup
anglaise  NOUN       anglaise     NOUN__Gender=Fem|Number=Sing                                anglais
pour      ADP        None         ADP___                                                      pour
1         NUM        None         NUM__NumType=Card                                           1
milliard  NOUN       milliard     NOUN__Gender=Masc|Number=Sing|NumType=Card                  milliard
de        ADP        un           ADP___                                                      de
dollard   NOUN       None         NOUN__Gender=Masc|Number=Sing                               dollard

  • An example using the POSTagger :
import spacy
from spacy_lefff import LefffLemmatizer, POSTagger
from spacy.language import Language

@Language.factory('french_lemmatizer')
def create_french_lemmatizer(nlp, name):
    return LefffLemmatizer(after_melt=True, default=True)

@Language.factory('melt_tagger')  
def create_melt_tagger(nlp, name):
    return POSTagger()
 
nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('melt_tagger', after='parser')
nlp.add_pipe('french_lemmatizer', after='melt_tagger')
doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
for d in doc:
    print(d.text, d.pos_, d._.melt_tagger, d._.lefff_lemma, d.tag_, d.lemma_)

Text      spaCy POS  MElt Tag  Lefff Lemma  spaCy tag                                                   spaCy Lemma
Apple     ADJ        NPP       apple        ADJ__Number=Sing                                            Apple
cherche   NOUN       V         chercher     NOUN__Number=Sing                                           chercher
a         AUX        V         avoir        AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  avoir
acheter   VERB       VINF      acheter      VERB__VerbForm=Inf                                          acheter
une       DET        DET       un           DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art       un
startup   ADJ        NC        startup      ADJ__Number=Sing                                            startup
anglaise  NOUN       ADJ       anglais      NOUN__Gender=Fem|Number=Sing                                anglais
pour      ADP        P         pour         ADP___                                                      pour
1         NUM        DET       1            NUM__NumType=Card                                           1
milliard  NOUN       NC        milliard     NOUN__Gender=Masc|Number=Sing|NumType=Card                  milliard
de        ADP        P         de           ADP___                                                      de
dollard   NOUN       NC        dollard      NOUN__Gender=Masc|Number=Sing                               dollard

We can see that both cherche and startup were not tagged correctly by the default POS tagger: spaCy classified them as NOUN and ADJ, while MElt classified them as V and NC.

Credits

Sagot, B. (2010). The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In 7th international conference on Language Resources and Evaluation (LREC 2010).

Benoît Sagot Webpage about LEFFF
http://alpage.inria.fr/~sagot/lefff-en.html

First work of Claude Coulombe to support Lefff with Python : https://github.com/ClaudeCoulombe

spacy-lefff's People

Contributors

alexandrerozier, alexis-tonnoir, sammous

spacy-lefff's Issues

Use .melt_tagger several times

Hello,

I use melt_tagger in a class. The sentence to analyze is sent over socketio.

import spacy
import sys
import random
from spacy_lefff import LefffLemmatizer, POSTagger
import socketio

class SomeClass():
    def __init__(self):
        self.nlp = spacy.load('fr')
        self.pos = POSTagger()  # comments in console
        self.french_lemmatizer = LefffLemmatizer(
            after_melt=True, default=False)
        self.nlp.add_pipe(self.pos, name='pos', after='parser')
        self.nlp.add_pipe(self.french_lemmatizer, name='lefff', after='pos')

    def analyze(self, param1):
        self.doc = self.nlp(param1)
        for d in self.doc:
            print(d.text, d.pos_, d._.melt_tagger)

# --- Socket

sio = socketio.Client()

@sio.on('connect', namespace='/test')
def on_connect():
    print('--> connection established')


@sio.on('disconnect', namespace='/test')
def on_disconnect():
    print('--> disconnected from server')


@sio.on('myevent', namespace='/test')
def on_message(data):
    print(' Received message : ',  data['spoken'])
    obj1.analyze(data['spoken'])

sio.connect('http://localhost:7000', namespaces=['/test'])

obj1 = SomeClass()

sio.wait()

On the first try, with data['spoken'] equal to 'les carottes et les radis sont des légumes',
the console prints:

 Received message : {'spoken': ' les carottes et les radis sont des légumes'}
2019-09-04 12:43:21,766 - spacy_lefff.melt_tagger - INFO -   TAGGER: POS Tagging...
les DET DET
carottes NOUN NC
et CCONJ CC
les DET DET
radis NOUN NC
sont AUX V
des DET DET
légumes NOUN NC

Subsequent tries print the following in the console:

Received message : {'spoken': ' les carottes et les radis sont des légumes'}
2019-09-04 12:43:27,007 - spacy_lefff.melt_tagger - INFO -   TAGGER: POS Tagging...
les DET None
carottes NOUN None
et CCONJ None
les DET None
radis NOUN None
sont AUX None
des DET None
légumes NOUN None

d.text and d.pos_ work on every request; melt_tagger doesn't.

Exception when the melt_tagger property of a token is missing

When the token property token._.melt_tagger is missing, lefff throws an exception at line 64 of lefff.py.

This can be fixed by changing the line to:
t = token._.melt_tagger.lower() if self.after_melt and hasattr(token._, 'melt_tagger') else token.pos_
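
A slightly more defensive variant of the same idea is sketched below. It is only an illustration under assumptions: the pick_tag helper is a hypothetical name, and the tokens are SimpleNamespace stand-ins rather than real spaCy tokens. Unlike the one-line fix, it also falls back to the spaCy tag when the attribute exists but holds None.

```python
from types import SimpleNamespace


def pick_tag(token, after_melt):
    """Prefer the MElt tag when requested and usable, else the spaCy POS."""
    melt_tag = getattr(token._, "melt_tagger", None) if after_melt else None
    if melt_tag:
        return melt_tag.lower()
    return token.pos_


# Stand-ins for spaCy tokens (hypothetical; only the fields used above).
with_melt = SimpleNamespace(pos_="NOUN", _=SimpleNamespace(melt_tagger="V"))
without_melt = SimpleNamespace(pos_="NOUN", _=SimpleNamespace())
none_melt = SimpleNamespace(pos_="NOUN", _=SimpleNamespace(melt_tagger=None))

print(pick_tag(with_melt, after_melt=True))
print(pick_tag(without_melt, after_melt=True))
print(pick_tag(none_melt, after_melt=True))
```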

Python 3 Support

Thanks for this. I managed to get it to work on Python 3 with spacy > 2.0.9 (2.0.11), or at least I think it works.

  1. 2to3 all files
  2. in melt_tagger.py, change win /= 2 to win = win // 2
  3. in melt_tagger.py, add , encoding='latin1' as an argument to np.load

I am not sure whether it still works on Python 2 with these modifications.

The package should be marked as Python 2 only somewhere until Python 3 support is added.
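
The second step in the list above matters because / always yields a float in Python 3. The short sketch below shows the difference; it assumes that win is later used as an integer size or index, which the issue does not state explicitly.

```python
win = 5

true_div = win / 2    # Python 3: always a float
floor_div = win // 2  # floor division keeps an int

window = list(range(10))
print(window[:floor_div])  # slicing accepts the int result

try:
    window[:true_div]  # a float slice index is rejected
except TypeError as exc:
    print("float index rejected:", exc)
```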

Package compatibility not possible

Hi,

I have a compatibility problem between spacy, fr-core-news-sm and spacy-lefff:

spacy-transformers 1.0.6 requires spacy<4.0.0,>=3.1.0, but you have spacy 3.0.4 which is incompatible.
fr-dep-news-trf 3.1.0 requires spacy<3.2.0,>=3.1.0, but you have spacy 3.0.4 which is incompatible.

spacy-lefff 0.4.0 requires spacy<3.0.5,>=3.0.0, but you have spacy 3.2.1 which is incompatible.

Would it be possible to update the requirements.txt of spacy-lefff to at least spacy 3.1?
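
To see why no single spacy version satisfies all three constraints quoted above, here is a simplified check. It is only a sketch: parse_version and satisfies are hypothetical helpers that handle plain dotted versions with an inclusive lower bound and an exclusive upper bound, not the full rules pip applies.

```python
def parse_version(text):
    """Turn '3.0.4' into (3, 0, 4); plain dotted integers only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, lower, upper):
    """True when lower <= installed < upper."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


# spacy-lefff 0.4.0 requires spacy>=3.0.0,<3.0.5
print(satisfies("3.0.4", "3.0.0", "3.0.5"))
print(satisfies("3.2.1", "3.0.0", "3.0.5"))
# spacy-transformers 1.0.6 requires spacy>=3.1.0,<4.0.0
print(satisfies("3.0.4", "3.1.0", "4.0.0"))
```

The ranges required by spacy-lefff 0.4.0 and by spacy-transformers 1.0.6 do not overlap, which is why the resolver reports a conflict whichever version is installed.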

Hide debugging printing comments

Hello,

I'm using spacy-lefff on Mac 10.11.6 running Python 3.7.

I can't find a way to avoid printing this information:

2019-08-29 13:33:03,196 - spacy_lefff.downloader - INFO - data already set up
2019-08-29 13:33:03,197 - spacy_lefff.melt_tagger - INFO -   TAGGER: Loading lexicon...
2019-08-29 13:33:04,339 - spacy_lefff.melt_tagger - INFO -   TAGGER: Loading tags...
2019-08-29 13:33:04,398 - spacy_lefff.melt_tagger - INFO -   TAGGER: Loading model from /usr/local/lib/python3.7/site-packages/spacy_lefff/data/tagger/models/fr...
2019-08-29 13:33:05,311 - spacy_lefff.melt_tagger - INFO -   TAGGER: Loading model from /usr/local/lib/python3.7/site-packages/spacy_lefff/data/tagger/models/fr: done
2019-08-29 13:33:05,311 - spacy_lefff.lefff - INFO - New LefffLemmatizer instantiated.
2019-08-29 13:33:05,312 - spacy_lefff.lefff - INFO - Reading lefff data...
2019-08-29 13:33:06,414 - spacy_lefff.lefff - INFO - Successfully loaded lefff lemmatizer
2019-08-29 13:33:06,439 - spacy_lefff.melt_tagger - INFO -   TAGGER: POS Tagging...

Is there a way to do this?
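
One approach that may help, using only the standard logging module, is to raise the level of the package's top-level logger. The logger names are taken from the log lines above; whether this is enough depends on how the package configures its own loggers, which is an assumption here. If a level is set explicitly on a child logger, that child would need the same treatment.

```python
import logging

# Raise the threshold on the parent logger; children such as
# "spacy_lefff.melt_tagger" inherit it unless they set their own level.
logging.getLogger("spacy_lefff").setLevel(logging.WARNING)

child = logging.getLogger("spacy_lefff.melt_tagger")
print(child.isEnabledFor(logging.INFO))
print(child.isEnabledFor(logging.WARNING))
```

This would need to run before the components are instantiated for the loading messages to be suppressed.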

Review the mapping between Lefff and MElt

Lefff and MElt use different tag names for part-of-speech tagging.
For instance, one says prep and the other p to denote a preposition.
However, the lemmatization uses the part-of-speech tags of MElt when MElt is added to a spaCy pipe.

So in the example in the README, we can see that the lemmas for de and dollard are not found, yet they do exist in Lefff (unlike Apple).

Code

  • update the file mappings.py
  • make sure that when MElt was used, it uses the correct mapping to get the lemma, at line .
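
A dedicated MElt-to-Lefff table could look like the sketch below. The entries are illustrative assumptions, not the contents of mappings.py: only the prep example comes from this issue, and the other pairs are guessed by analogy with the spaCy mapping in the README. Unknown tags fall back to the lowercased MElt tag.

```python
# Illustrative entries only (assumptions); the real table belongs in mappings.py.
MELT_TO_LEFFF = {
    "P": "prep",
    "NC": "nc",
    "NPP": "np",
    "ADJ": "adj",
    "ADV": "adv",
    "DET": "det",
    "V": "v",
}


def melt_to_lefff(melt_tag):
    """Map a MElt tag to a Lefff category, defaulting to the lowercased tag."""
    return MELT_TO_LEFFF.get(melt_tag, melt_tag.lower())


print(melt_to_lefff("P"))
print(melt_to_lefff("VINF"))
```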

Missing File

Hello,

I wanted to use your library for an NLP project, but when I try to use the POSTagger, I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: '*/.user_conda/envs/conda_env/lib/python3.7/site-packages/spacy_lefff/data/tagger/models/fr/lexicon.json'

The line that causes the error is this one:

pos = POSTagger()

Are there more files to download, or am I missing something?

Thank you and have a great day.

Wrong analysis of adjectives after a verb

With spaCy version 2.1.3 on Windows 10, Python 3.7.2,
and the fr_core_news_md language model,
there is a bad analysis of constructions such as this one (a verb followed by an adjective):

"Une personne fait une déclaration qu'elle sait fausse ou trompeuse."

The verb "savoir" is analyzed as an auxiliary, which is wrong because it is a full verb. Likewise, the token "fausse" is wrongly analyzed as a verb whereas it is an adjective (at least trompeuse is correctly analyzed as an adjective).

This analysis has not been tested with spacy-lefff (I am attempting to install it, but installation problems and warnings have prevented me from doing so until now, and I am still trying to resolve the issue), but this improvement should definitely be implemented in a future version of spacy-lefff.

Thanks!

MElt tagger does not use its data_dir param

Hello,

Thanks for the great work. I just ran into an issue trying to customize the directory where the model is downloaded:

melt_tagger.py __init__:

    def __init__(
            self,
            data_dir=DATA_DIR,
            lexicon_file_name=LEXICON_FILE,
            tag_file_name=TAG_DICT,
            print_probas=False):
        super(
            POSTagger,
            self).__init__(
            PACKAGE,
            url=URL_MODEL,
            download_dir=DATA_DIR)

data_dir=DATA_DIR is a parameter that is never used, because the call to super passes download_dir=DATA_DIR.
Shouldn't it be download_dir=data_dir?

I created PR #28.

Regards.
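
The change proposed in this issue can be sketched with stand-in classes. Everything below is hypothetical: SketchDownloader and SketchTagger only mimic the shape of the constructor quoted above, and the DATA_DIR value is a placeholder.

```python
DATA_DIR = "/tmp/default_data"  # placeholder default, not the package's value


class SketchDownloader:
    """Stand-in base class that only records where data would be stored."""

    def __init__(self, package, url=None, download_dir=None):
        self.package = package
        self.url = url
        self.download_dir = download_dir


class SketchTagger(SketchDownloader):
    def __init__(self, data_dir=DATA_DIR):
        # Forward the caller's data_dir rather than the module-level constant.
        super().__init__("tagger", url=None, download_dir=data_dir)


print(SketchTagger().download_dir)
print(SketchTagger(data_dir="/srv/models").download_dir)
```

With the constant passed instead of the parameter, the second call would still report the default directory, which is the behaviour the issue describes.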

Can't find model 'fr_core_news_sm'

Hello! I am just starting to explore spacy-lefff. I have been trying to run the examples given here and on spacy.io,
but I keep stumbling on the same error. See below.
I assume it's a simple one, but I did not find help for this in the available documentation (I may not be good at looking in the right place, though).

I intend to use spacy-lefff to replace the MElt Perl utilities for our Bamanan-French parallel corpus. Bamanan, or Bambara, is a language spoken in West Africa. See http://cormand.huma-num.fr/

Thanks for help with this...
Jean-Jacques

python3 testleff2.py
Traceback (most recent call last):
  File "testleff2.py", line 9, in <module>
    nlp = spacy.load('fr_core_news_sm')
  File "/usr/local/lib/python3.8/dist-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/usr/local/lib/python3.8/dist-packages/spacy/util.py", line 329, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'fr_core_news_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
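
Error E050 usually means the model package is not installed in the interpreter that runs the script. A quick check with the standard library is sketched below; the model_installed helper is a hypothetical name, and the download command is the standard spaCy one for this model.

```python
import importlib.util


def model_installed(name):
    """True when a package with this name is importable in this interpreter."""
    return importlib.util.find_spec(name) is not None


if not model_installed("fr_core_news_sm"):
    print("Model missing; try: python -m spacy download fr_core_news_sm")
```

Running the download with the same python3 executable that runs the script avoids installing the model into a different environment.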

POS tags in Lefff are missing some cases

Problem

Sometimes a word can be both an adj and an nc at the same time, and one of the two cases is missing.
An example is costaricain, which can be both, but this is not taken into account in Lefff.
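
One way to picture a lexicon that keeps both readings is a mapping from each word to all of its tags. The sketch below is purely illustrative: TOY_ENTRIES and its content are assumptions, not data extracted from Lefff.

```python
# Hypothetical entries: one word, several (tag -> lemma) readings.
TOY_ENTRIES = {
    "costaricain": {"adj": "costaricain", "nc": "costaricain"},
}


def lemma_for(word, tag):
    """Return the lemma recorded for this word under this tag, if any."""
    return TOY_ENTRIES.get(word.lower(), {}).get(tag)


print(lemma_for("costaricain", "adj"))
print(lemma_for("costaricain", "nc"))
print(lemma_for("costaricain", "v"))
```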

Building a better mapping between Lefff and Spacy POS

Problem

The current problem is to take advantage of Lefff's exhaustiveness by having a good French POS tagger in the spaCy pipeline before the lemmatization step.

Solution

A possible solution is to use MElt, the POS tagger developed by the same people as Lefff, available here

weird error

It seems the word "DEV" triggers an error.

import spacy
from spacy_lefff import LefffLemmatizer, POSTagger
nlp = spacy.load('fr_core_news_md')
pos = POSTagger()
french_lemmatizer = LefffLemmatizer(after_melt=True)
nlp.add_pipe(pos, name='pos', after='parser')
nlp.add_pipe(french_lemmatizer, name='lefff', after='pos')
doc = nlp(u"Paris est une ville très DEV ce jour.")
for d in doc:
    print(d.text, d.pos_, d._.melt_tagger, d._.lefff_lemma, d.tag_, d.lemma_)

Traceback (most recent call last):
  File "test-lemm.py", line 11, in <module>
    doc = nlp(u"Paris est une ville très DEV ce jour.")
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 402, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/local/lib/python3.6/dist-packages/spacy_lefff/melt_tagger.py", line 242, in __call__
    beam_size=beam_size)
  File "/usr/local/lib/python3.6/dist-packages/spacy_lefff/melt_tagger.py", line 194, in tag_token_sequence
    best_sequence = sequences[-1][0]
IndexError: list index out of range
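
The traceback shows that sequences[-1][0] is evaluated on an empty list. Why the list ends up empty for this input is not established here. The sketch below only illustrates a defensive guard around that expression; best_sequence_or is a hypothetical helper, not code from melt_tagger.py.

```python
def best_sequence_or(sequences, fallback=None):
    """Return sequences[-1][0], or fallback when no candidate survived."""
    if not sequences or not sequences[-1]:
        return fallback
    return sequences[-1][0]


print(best_sequence_or([[("DET", "NC")], [("DET", "NC", "V")]]))
print(best_sequence_or([]))
print(best_sequence_or([[("DET",)], []], fallback=()))
```

A guard like this would turn the crash into a missing tag for the sentence, which still leaves the underlying cause to be investigated.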
