skweak's People

Contributors

aliaksah, jerbarnes, lidiexy-palinode, marioangst, oroszgy, plison, ruanchaves, udayankumar


skweak's Issues

skweak.utils.display_entities behavior in Jupyter notebooks

Windows 10
Python 3.8.3

Hi,

It would seem that the skweak.utils.display_entities() function with the "hmm" parameter has some undesired behavior in Jupyter notebooks. Rather than displaying the entities like it would with the "spacy" parameter, which looks like this:

[Screenshot: rendering with the "spacy" parameter]

Here, lines are wrapped to fit the screen and entities appear in a box with their label. The "hmm" parameter, by contrast, gives this result:

[Screenshot: rendering with the "hmm" parameter]

Here, lines are displayed awkwardly and entity labels mask the actual word that was tagged (the COMPANY tag shows, but the entity "Reuters" is not visible).

These pictures are taken from an unaltered quick_start.ipynb (from the examples directory) run top to bottom.

Love the package by the way!

Transforming corpus to Spacy docbin format

Hi,

I am currently conducting research on weak supervision for NER in Dutch, and would like to use the model developed in your 2020 paper as a baseline. Since I'll be working with CoNLL-2002 rather than CoNLL-2003 (for its Dutch subset), I was wondering if you have any method or tips for converting the CoNLL IOB files to spaCy DocBin format, as you seem to have already done this yourself.

Thanks in advance!
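
A rough sketch of one way to do the conversion (the file name and column layout are assumptions about the CoNLL-2002 Dutch files; the IOB helpers come from spacy.training, and skweak.utils.docbin_writer is the same helper used in other issues on this page):

import spacy
import skweak
from spacy.tokens import Doc
from spacy.training import iob_to_biluo, biluo_tags_to_spans

nlp = spacy.blank("nl")

def conll_to_docs(path):
    """Read a whitespace-separated CoNLL file (token ... IOB tag) into spaCy Docs."""
    docs, words, tags = [], [], []

    def flush():
        if words:
            doc = Doc(nlp.vocab, words=list(words))
            doc.ents = biluo_tags_to_spans(doc, iob_to_biluo(tags))
            docs.append(doc)
            words.clear()
            tags.clear()

    for line in open(path, encoding="utf-8"):
        parts = line.strip().split()
        if not parts:              # blank line = sentence boundary
            flush()
        else:
            words.append(parts[0])
            tags.append(parts[-1])  # the NER tag is assumed to be the last column
    flush()
    return docs

docs = conll_to_docs("ned.train")                    # hypothetical file name
skweak.utils.docbin_writer(docs, "ned_train.spacy")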

TypeError: unhashable type: 'list'

Upon applying a config file in order to train a textcat model using the following code:

!spacy init config - --lang en --pipeline ner --optimize accuracy | \
  spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
  --initialize.vectors en_core_web_md --output train

I receive the following error message:

[i] Saving to output directory: train
[i] Using CPU

=========================== Initializing pipeline ===========================
2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
[2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-03-27 15:50:05,395] [INFO] Created vocabulary
[2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
[2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\spacy.exe_main
.py", line 7, in
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli_util.py", line 71, in setup_cli
command(prog_name=COMMAND)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in call
return self.main(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
nlp = init_nlp(config, use_gpu=use_gpu)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
validate_get_examples(get_examples, "Tok2Vec.initialize")
File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in call
for real_eg in examples:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
for reference in reference_docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
for doc in docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_serialize.py", line 150, in get_docs
doc.spans.from_bytes(self.span_groups[i])
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_dict_proxies.py", line 54, in from_bytes
group = SpanGroup(doc).from_bytes(value_bytes)
File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly_msgpack_api.py", line 27, in msgpack_loads
msg = msgpack.loads(data, raw=False, use_list=use_list)
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack_init
.py", line 79, in unpackb
return _unpackb(packed, **kwargs)
File "srsly\msgpack_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'

Seems like a dependency issue. What is the reason for it? And is there a way to fix it?

Also: is the following error message a problem?
"[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside."
Or can it be avoided by simply applying the following?
for document in train_data:
    try:
        document.ents = document.spans["hmm"]
        skweak.utils.docbin_writer(train_data, "train.spacy")
    except Exception as e:
        print(e)

heuristics.VicinityAnnotator | Money Detector

Hi There!

On this page, under the section "heuristics.VicinityAnnotator", the example given below talks about cue words next to proper names, but in the code you pass "money_detector2" as the name of the VicinityAnnotator. Could you explain why, since a money detector has nothing to do with proper names?

# Typically, entities next to words like say/tell/listen etc. will be PERSON
cue_words = ["say", "indicate", "reply", "claim", "declare", "tell", "answer", "listen"]
VicinityAnnotator("money_detector2", {word:"PERSON" for word in cue_words}, "nnp_detector", max_window=2)

Super thanks in advance!

Error: Cannot apply along axis 0, when applying hmm.fit

I'm getting this error when running fit and aggregate. It seems to be related to documents without any annotations by the labelling functions.
When trying it with the documents in my data set that have annotations, it works fine. However, when running it on specific ones where the labelling functions did not detect anything, it throws this error.
Is this a known issue and is there any way to fix this?
Thanks!
docs = list(LF1.pipe(train_data))
docs = list(LF2.pipe(docs))
docs = list(LF3.pipe(docs))
docs = list(LF4.pipe(docs))
docs = list(LF5.pipe(train))
ner_model = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner_model.pipe(docs))

hmm = skweak.aggregation.HMM("hmm", ["A", "NOT_A"], sequence_labelling=False)
hmm.fit_and_aggregate(docs)

[Screenshot of the resulting traceback]

ValueError: Cannot apply_along_axis when any iteration dimensions are 0

The sentiment analysis code is not working due to the above-mentioned error.

ValueError Traceback (most recent call last)
<ipython-input> in <module>()
4 #docs = [d for d in train_pred if any([v for (k,v) in d.spans.items()])]
5
----> 6 unified_model.fit(train_pred)
7
8 unified_model.annotate_docbin("train_pred.docbin", "train_pred.docbin")

4 frames
/usr/local/lib/python3.7/dist-packages/skweak/aggregation.py in fit(self, docs, **kwargs)
85
86 obs_generator = (self.get_observation_df(doc) for doc in docs)
---> 87 self._fit(obs_generator, **kwargs)
88
89

/usr/local/lib/python3.7/dist-packages/skweak/generative.py in _fit(self, all_obs, cutoff, n_iter, tol)
95
96 # And add the counts from majority voter
---> 97 self._add_mv_counts(all_obs)
98
99 # Finally, we postprocess the counts and get probabilities

/usr/local/lib/python3.7/dist-packages/skweak/generative.py in _add_mv_counts(self, all_obs)
417
418 # And aggregate the results
--> 419 agg_array = mv.aggregate(obs).values
420
421 if len(agg_array)==0:

/usr/local/lib/python3.7/dist-packages/skweak/voting.py in aggregate(self, obs)
51 return np.bincount(x[x>=0], weights=weights[x>=0],
52 minlength=len(self.observed_labels))
---> 53 label_votes = np.apply_along_axis(count_fun, 1, obs.values).astype(np.float32)
54
55 # For token-level sequence labelling, we need to normalise the number

<__array_function__ internals> in apply_along_axis(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
376 raise ValueError(
377 'Cannot apply_along_axis when any iteration dimensions are 0'
--> 378 ) from None
379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
380

ValueError: Cannot apply_along_axis when any iteration dimensions are 0

Text classification on domain data

Hi @plison, I have a quick question.
To find the sentiment of sentences, we have many libraries available, like TextBlob, NLTK, Transformers, Flair, etc.
So when we don't have labeled data to train a sentiment analysis model, we can get labels from those libraries and then train our own model.
But how can I get labels for my domain data?
For example: I have text like "my printer is not working" and I want to label it as "Hardware problem".
How can I achieve this with skweak? Is there any demo code for that kind of labeling?

And how can I get results like "Donald Trump → PERSON" and "$700 → MONEY" for the example given?

The minimal example is not working correctly

Hello, when I am running the minimal example:

import spacy, re
from skweak import heuristics, gazetteers, generative, utils

# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)

# LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match("(19|20)\d{2}$", tok.text), "DATE")

# LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})

# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))

# and aggregate them
hmm = generative.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate([doc])

# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")

I am expecting three entities to get tagged: "Donald Trump", "$750", and "2016". However, only the first two are getting tagged:
[Screenshot: only "Donald Trump" and "$750" are tagged in the "hmm" layer]

even though the year is tagged correctly when I display only the "years" layer:
[Screenshot: the "years" layer correctly tags "2016"]

Can anyone please help me identify why the year is not getting tagged? I am running skweak 0.3.1.

Thanks!

HMM crashes if Doc does not contain annotations

I've encountered two cases where aggregation.HMM crashes if a document contains no annotated spans. Both occur at the same point in the code:

self.start_counts += agg_array[0, :]

There are two issues. First, if a document contains no annotated spans, agg_array will be empty but the code attempts to access row 0.

The second is related: if you have a source that generates a label but you tell the HMM to ignore this label, the same exception is thrown.

A quick hack is to check whether len(agg_array) == 0 and, if so, skip the rest of the loop - I'm not sure what consequences that has though.

A better option might be to filter the document list so it only contains labelled examples; this requires knowing which label sources generate which labels (e.g. in your example, presidents applies PERSON). A minimal sketch of such filtering is shown below.
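
A minimal sketch of that filtering, assuming docs is a list of Docs already run through the labelling functions (the same idea as the workaround mentioned in the "Error on documents without any spans" issue further down):

# keep only documents where at least one labelling function produced a span
labelled_docs = [doc for doc in docs
                 if any(len(group) > 0 for group in doc.spans.values())]
hmm.fit_and_aggregate(labelled_docs)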

Aggregation on docs with empty source does not set any `doc.spans` key

Currently, running an aggregator on a doc when the sources are empty lists does not set doc.spans[aggregator_name]. This is clear here: https://github.com/NorskRegnesentral/skweak/blob/main/skweak/aggregation.py#L54

This caused me some difficult-to-diagnose errors. Intuitively, I would expect the aggregator to set an empty list on doc.spans, like the annotator functions do. If this is the intended behaviour, I would strongly recommend mentioning it in the wiki.

As a side note, I think it is a bit confusing that the voter both modifies the given doc and returns it. I would either make it clear that the operation happens in place or return a modified copy of the doc, but not both. Either way, the wiki doesn't make it very clear.

Skweak for text classification

Do you have any examples on how to apply skweak for text classification? All the examples I see are for NER.

Thank you.
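
Not from the official examples, but a minimal sketch of how document-level classification can be set up: the labelling function yields a span covering the whole document carrying the class label, and the HMM is created with sequence_labelling=False, as in other issues on this page (the labelling function and labels here are purely illustrative):

from skweak import heuristics, aggregation

# illustrative document-level labelling function: the span covers the whole document
def positive_keywords(doc):
    if any(tok.lower_ in {"good", "great", "excellent"} for tok in doc):
        yield 0, len(doc), "POSITIVE"

lf = heuristics.FunctionAnnotator("positive_keywords", positive_keywords)
docs = [lf(doc) for doc in docs]          # docs: a list of spaCy Doc objects

# aggregate with sequence_labelling=False so labels apply to whole documents
hmm = aggregation.HMM("hmm", ["POSITIVE", "NEGATIVE"], sequence_labelling=False)
hmm.fit_and_aggregate(docs)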

Exception in display_entities

display_entities throws a KeyError exception on the line below if doc.spans does not contain the requested layer:

spans = doc.spans[layer]

The other code paths use get_spans(doc, layer) rather than doc.spans[layer] to avoid this case.

Note: I noticed this since I was using the PyPI version, which doesn't contain the fix for #25

Consider using custom attribute extension instead of doc.spans

I love the idea of this project and I'm definitely planning to utilize it in future work. One point of discussion: it's a little confusing for a spaCy user to have doc.spans be a dict with the spans for each labeling function.

I think it might be a little easier if these were contained in a custom extension, so you could access these annotations as something like doc._.skweak_spans and the doc.spans attribute would only be set post-aggregation (sketched below).
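
A rough sketch of what that could look like with a standard spaCy custom extension (the attribute name and layer are illustrative, not existing skweak behaviour):

import spacy
from spacy.tokens import Doc

# register a custom extension once; labelling functions would write here
# instead of into doc.spans, which would stay reserved for the aggregated output
if not Doc.has_extension("skweak_spans"):
    Doc.set_extension("skweak_spans", default=None)

nlp = spacy.blank("en")
doc = nlp("Acme paid $750 in 2016")
doc._.skweak_spans = {"money_detector": [doc[2:4]]}   # hypothetical layer holding the "$750" span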

If there's a rationale I'm missing, no big deal.

Thanks!

Regression-based outcome

Hello, thank you for sharing this repo. Do you have plans for providing capability for a regression-based outcome? Something along the lines of fine-grained sentiment on a scale from 1-5?

use Flair with skweak

Hello,
Is there anyone who has tried to implement another model/framework other than spaCy (NER) as a labeling function?
I tried to work with Flair but it didn't work.
Can anyone help me? Thanks in advance.

Simple example of full document classification and questions

First, thanks for this great tool. I'm trying to learn skweak for full document classification. I found your sentiment example ("weak_supervision_sentiment.py") a bit too complicated and slow (because of BERT), hence I wrote the test code below using IMDB sentiment data. This code applies 4 classic BOW-type classifiers and a sentiment detector to simulate "weak learners". A few questions and comments:

  1. Is my code the correct way how to use skweak for classification of whole documents?
  2. There are "fit" and "fit_and_aggregate" functions in skweak, but no "predict" [trying to use "predict" gives error related to missing "nb_components"]. How should one do predictions for test data? In my example, I used "pipe" which adds HMM labels, but maybe not correct way.
  3. In my example code, the order of weak classifiers matters a lot. It seems that skweak always picks the predictions from the last classifier. So if I shuffle the weak predictor order, my skweak outcome also changes.
  4. In my example, peak score of skweak is at best the same as the best individual model, but often less. I suppose this is expected since skweak was not designed for supervised tasks and hence is not a direct replacement for methods, e.g., in sklearn.ensemble class.
import spacy
from skweak import heuristics, aggregation
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from afinn import Afinn
import random
from tensorflow.keras.datasets import imdb

nlp = spacy.load("en_core_web_sm")
afinn = Afinn(language='en')

# get IMDB sentiment data
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
ind2text = {x:k for k,x in imdb.get_word_index().items()}
# convert to text
n_train = 500 # train samples
n_test = 200 # test samples
get_text = lambda data: [" ".join([ind2text[x] for x in d]) for d in data]
X_train,Y_train = get_text(training_data[0:n_train]),list(training_targets[0:n_train])
X_test,Y_test = get_text(testing_data[0:n_test]),list(testing_targets[0:n_test])

# create some whole-text classical classifiers
my_classifiers=[]
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegressionCV(penalty="l1",cv=5,solver='liblinear')),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', NuSVC()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', PassiveAggressiveClassifier()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer(use_idf=False)),
     ('clf', RandomForestClassifier()),
]))

my_label_funs=[] # labeling functions

# add afinn annotator (span over whole document)
def labeling_fun_afinn(x):
    yield 0, len(x),('0' if afinn.score(x.text)<=0 else '1')
my_label_funs.append(heuristics.FunctionAnnotator("afinn", lambda x: labeling_fun_afinn(x)))

# apply predictor (span over whole document)
def labeling_fun_clf(x,model):
    yield 0, len(x),str(model.predict([x.text])[0])
for i,clf in enumerate(my_classifiers):
    clf.fit(X_train,Y_train)
    my_label_funs.append(heuristics.FunctionAnnotator("classifier_%i" % i, lambda x: labeling_fun_clf(x,clf)))

random.seed(4)
random.shuffle(my_label_funs) # order matters!

# get annotated Spacy docs
def process_docs(X):
    docs_list = []
    for i,doc in enumerate(X):
        x = nlp(doc)
        for lf in my_label_funs:
            x = lf(x)
        docs_list.append(x)
    return docs_list

# obtain processed docs
train_docs = process_docs(X_train)
test_docs = process_docs(X_test)

# create HMM aggregator
hmm = aggregation.HMM("hmm", ['0','1'],sequence_labelling=False)

# fit and annotate train data
hmm.fit_and_aggregate(train_docs)
# apply model to test data (works as a "predict" function?)
test_docs = list(hmm.pipe(test_docs))

# get predicted classes
hmm_preds = []
afinn_preds = []
for doc in test_docs:
    y = doc.spans['hmm'][0].label_
    hmm_preds.append(int(y))
    afinn_preds.append(0 if afinn.score(doc.text) < 0 else 1)

print("\nResults")
print(" skweak HMM: F1=%f, accuracy=%f" % (f1_score(Y_test,hmm_preds),accuracy_score(Y_test,hmm_preds)))
print(" afinn: F1=%f, accuracy=%f" % (f1_score(Y_test,afinn_preds), accuracy_score(Y_test,afinn_preds)))
for i,clf in enumerate(my_classifiers):
    y = clf.predict(X_test)
    print(" classifier %i: F1=%f, accuracy=%f" % (i+1,f1_score(Y_test,y),accuracy_score(Y_test,y)))

Gazetteer is not working with single tokens

Hello.

Can't figure out why the gazetteer doesn't match the single name 'Barack'.

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack"), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

Any ideas?

Thanks for a remarkable lib!

HMM and Scores

Hi,
Thank you for this wonderful library.
I want to know what the role of the HMM is here and why we are using it. After running, I am getting some scores like this:
1 -51.1920 +nan
2 -51.3015 -0.1095
May I know what this represents?
Also, in a research paper (https://arxiv.org/pdf/2104.09683.pdf), there are some scores like this:
HMM-aggregated labels:

  • only heuristics 0.62 0.53
  • only gazetteers 0.46 0.39
  • only NER models 0.60 0.52
  • all but doc-level 0.83 0.74
  • all labelling functions 0.83 0.75
What do all these scores represent and how can I get them from code?
Can anyone please help me with this ASAP? I appreciate the help.
Thank you

Runtime error in display_entities

I am using the latest version of skweak: 0.2.17. I tried running the example (quick-start.ipynb) in the repo. When I try to execute

skweak.utils.display_entities(docs[28], "other_org_detector")

I get this error:

[Screenshot of the error]

Hard Coded gap tokens

Ideally the gap token (self.gap_tokens = {"-"}  # Hyphens shouldn't stop a span) wouldn't be hard-coded; it should be a constructor argument. I have text where the hyphen is used as a delimiter rather than a hyphen (I created a custom Hugging Face tokenizer and, to use it with skweak, a spaCy wrapper).

I'd do something like:

def __init__(self, name: str, constraint: Callable[[Token], bool],
             label: str, min_characters=3, gap_tokens: Optional[Set[str]] = None):
    ...
    self.gap_tokens = gap_tokens if gap_tokens is not None else {"-"}

Annotators (sources) issuing annotations after first 1,000 docs not detected

Hi again,

I found a problem with this method:

def _extract_sources(self, docs: Iterable[Doc], max_number=1000):
        """Extract the names of all labelling sources mentioned in the documents
        (and not included in the list of sources to avoid)"""
        sources = set()
        for i, doc in enumerate(docs):
            for source in doc.spans:
                if (len(doc.spans[source]) > 0 and
                        "aggregated" not in doc.spans[source].attrs):
                    sources.add(source)
            if i > max_number:
                break

        return sources

It is only called once (from fit):

sources = self._extract_sources(docs)

This means that sources will always be collected from the first 1,000 documents in the dataset. I started getting KeyErrors for some of my annotators as they were triggered only after the first 1,000 documents. So I modified this function to scan the full dataset instead.

Can I suggest making max_number a class parameter or maybe even removing it altogether?
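
A minimal sketch of the suggested change, keeping the same logic but making the cap optional:

def _extract_sources(self, docs, max_number=None):
    """Extract the names of all labelling sources mentioned in the documents
    (and not included in the list of sources to avoid). If max_number is None,
    the full collection of documents is scanned."""
    sources = set()
    for i, doc in enumerate(docs):
        for source in doc.spans:
            if (len(doc.spans[source]) > 0 and
                    "aggregated" not in doc.spans[source].attrs):
                sources.add(source)
        if max_number is not None and i > max_number:
            break
    return sources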

Thanks a lot in advance!

Alfredo

No attribute spans or no attribute ents

I am working on a project where I am trying to resolve conflicts in named entities; one of my steps involves using skweak.

I am experiencing the following problem.

# List of spaCy Doc objects; each Doc represents a sentence
docs = prepare_spacy_extensions(sents, labels, headers)
# Applying skweak on each iteration
for doc in docs:
    piped_doc = list(first_name_detector.pipe(doc))
    skweak.utils.display_entities(piped_doc)

This results in the following error

File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 116, in <module>
   perform_labeling_functions(tokens, labels, headers)
 File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 109, in perform_labeling_functions
   piped_doc = list(first_name_detector.pipe(doc))
 File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 37, in pipe
   yield self(doc)
 File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 89, in __call__
   doc.spans[self.name] = []
AttributeError: 'spacy.tokens.token.Token' object has no attribute 'spans'

When I try to perform the same action with the entire doc object, it returns another error:

line 110, in perform_labeling_functions
    skweak.utils.display_entities(piped_doc)
  File "/Users/egelm1/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/utils.py", line 734, in display_entities
    spans = doc.ents
AttributeError: 'list' object has no attribute 'ents'

Presumably you want the larger doc object, and sentence-level docs are not supported? With my current system, though, I am unable to aggregate everything into one large doc object. Are there any solutions to this problem?
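
If it helps, a minimal sketch of piping a list of Docs rather than a single Doc (assuming first_name_detector was created with the name "first_name_detector" and docs is the list returned by prepare_spacy_extensions):

# pipe expects an iterable of Doc objects; passing a single Doc makes it iterate over tokens
piped_docs = list(first_name_detector.pipe(docs))
for doc in piped_docs:
    skweak.utils.display_entities(doc, "first_name_detector")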

Heuristic function that specifically says a span is NOT an entity?

Hello,

Is there a way to write a heuristic function that specifically indicates that a span should not be interpreted as an entity, i.e. that it should be treated with an "O" label? Basically, I'd like to have a function that actively votes that a span is NOT an entity, rather than abstaining on it.

HMM.__repr__ crash

This is a slightly odd issue and it's not major, but if you construct an HMM instance and then type the variable name you get a stack dump. The issue is that HMM is an instance of the sklearn BaseEstimator (via _BaseHMM), and its __repr__ expects every class constructor argument to be available as a member variable.

i.e. in Jupyter, but this should also work in a console:

hmm = aggregation.HMM("hmm", ["DEVICE", "EQUIP", "LOCATION", "POINT"], sequence_labelling=False)
hmm

AttributeError: 'HMM' object has no attribute 'prefixes'

sklearn expects that the following should work:

hmm._get_param_names()
hmm.get_params()

But you don't save prefixes and a number of other arguments; the constructor contains the following:

        if sequence_labelling:
            if prefixes not in {"IO", "BIO", "BILUO", "BILOU"}:
                raise RuntimeError(
                    "Tagging scheme must be 'IO', 'BIO', 'BILUO' or ''")
            self.out_labels = ["O"]
            for label in labels:
                for prefix in prefixes.replace("O", ""):
                    self.out_labels.append("%s-%s" % (prefix, label))
        else:
            self.out_labels = labels

This is to support hyper-parameter optimisation etc. If you simply add:

self.sequence_labelling = sequence_labelling
self.prefixes = prefixes
self.labels = labels

It should fix the issue (although to do it "properly" you then need to make self.out_labels a property getter method and move the logic above into it, as technically sklearn can update the params dictionary and you need anything dependent on them to dynamically update)

Tokens with no possible state

I very often get the error from this line ("problem found for token X"), causing HMM training to be aborted after only a couple of documents in the very first iteration.

I found out that this is due to framelogprob having all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that this document had only one labeling function firing and X[source] in this line was all False for the first token (or state?).

This means that this token/state is also all masked with -np.inf in logsum in this line.

Now, I am unsure how to fix that. This clearly does not look like the desired behavior but I suppose "testing for tokens with no possible states" is there for a reason.
Can I simply replace -np.inf in self._compute_log_likelihood with -100000? Then, of course, the test will not fail and training will not be aborted, but there will be a token with only very improbable states. Is that OK?

Or is that the wrong approach? Should tokens without observed labels from the labeling functions rather get a default label (e.g., O)? So why is that not done here? Is it a bug? I am not sure where I should look for a bug, if there is one. Can someone with a better knowledge of the code base give some advice on this?

Need some help

Hi team,
First of all, thank you for providing such an awesome library.
I am a student currently learning about weak supervision. Can you please guide me, or give me a hint, on how to use this model? For example, say I have a binary text classification task with only 500 labelled data points but 10k unlabelled ones. How can I use the 500 labelled points to predict labels for the unlabelled dataset?

Thank you

Sentence tagging

I would like to apply weak supervision in the problem of text classification, i.e. I would like to "automatically" tag sentences, like so:

"I like apples", "apples"
"I like oranges", "oranges"
"Peaches are my favourite fruit", "peaches"
"Can you give me an apple?", "apples"

Is it possible to do whole sentence tagging with skweak instead of entity tagging?
I found tutorials for token tagging (spaCy tokens), but how about whole sentences?

[Question] Underspecified Labels w/ out Fine-Grained Label

Context

  • I'm training an NER model using the HMM aggregator.
  • I have 2 label classes [A, B] and an under-specified label [C] which is a super-class of A and B within my ontology.
  • I have 3 sets of gazetteer label functions - one set for A, one set for B, and one set for C.

Issue

  • When training the HMM, I have tokens which are annotated by label functions for C (superclass) but are not annotated by label functions for A and B (e.g., the term "Apple" is being labeled as an ENT but is not being captured by the LFs for PER or PROD).
  • Currently I'm calling the HMM function as follows:
hmm = aggregation.HMM("hmm", [A, B], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)
  • This triggers an error from the below aggregation code, since all probability mass is being placed on a label that was not included in the HMM (i.e., the under-specified label C).
    if (np.isnan(framelogprob).any() or  # type: ignore
            framelogprob.max(axis=1).min() < -100000):
        pos = framelogprob.max(axis=1).argmin()
        print("problem found for token", doc[pos], "in", self.name)
        return

Question(s)

  • Should I be including the under-specified label as a possible label option in the HMM?
hmm = aggregation.HMM("hmm", [A, B, C], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)
  • How are underspecified labels "learned" or trained differently vs. the "specified labels" (e.g., A, B in the example)?

Thanks in advance!

Missing files for the step-by-step NER tutorial

Several files that are used in the step-by-step NER tutorial are missing from the data folder (this folder on the master branch), so it's currently not possible to execute all steps in the tutorial.

Some examples:

The tutorial uses a spaCy ConLL 2003 annotator, but the folder ../../data/conll2003/ does not exist in this repository.
annotator = skweak.spacy.ModelAnnotator("conll2003", "../../data/conll2003/")

Similarly, the paths ../../data/wikidata_tokenised.json, ../../data/crunchbase.json are referenced in the tutorial but they also do not exist in this repository.

The file conll2003_ner.py, which is imported in the tutorial, also makes reference to missing files. Some examples:

FORM_FREQUENCIES = os.path.dirname(__file__) + "/../../data/form_frequencies.json"
self.add_annotator(ModelAnnotator("BTC", os.path.dirname(__file__) + "/../../data/btc"))

None of these paths exist in this repository.

Label Function Analysis

First of all, thanks for open sourcing such an awesome project!

Our team has been playing around with skweak for a sequential labeling task, and we were wondering if there are any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.

Snorkel for example, provides a LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.

Are there any plans to add such functionality down the line as a feature enhancement?
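
In the meantime, a rough sketch of computing per-function coverage directly from doc.spans, since skweak keeps each labelling function's output under its own key (the function and variable names here are illustrative):

from collections import Counter

def lf_coverage(docs):
    """Fraction of documents in which each labelling function produced at least one span."""
    counts = Counter()
    for doc in docs:
        for source, spans in doc.spans.items():
            if len(spans) > 0:
                counts[source] += 1
    return {source: n / len(docs) for source, n in counts.items()}

for source, cov in lf_coverage(annotated_docs).items():   # annotated_docs: your labelled corpus
    print(f"{source}: {cov:.1%} of documents covered")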

Speeding up Document Annotation

Hello, how would you recommend speeding up the time it takes to apply all the annotators to the corpus, so that it can scale to larger corpora (i.e., >10,000 documents)? I'm following the convention for defining a combined annotator shown in your conll2003_ner example, but it has proven slow even with relatively fast labeling functions. I've attempted a first pass at parallelising the combined annotator but did not have any luck. Any suggestions for parallel processing of the annotation, or other methods for scaling to a larger corpus, would be appreciated!

Thank you so much for this awesome library!

TypeError when nothing is found in a document

Hi!
I'm getting an exception from fit_and_aggregate.
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'.
The exception is from line 227 in aggregation.py, np.apply_along_axis(...)

This seems to happen when all of my labeling functions return empty on one of the docs so the DataFrame is empty.

KeyError: 'hmm'

I tried to use skweak.utils.display_entities(docs[12], "hmm", add_tooltip=False) using the example you provided here but it returns KeyError: 'hmm'.
What can be the source of this problem?
My documents are in Norwegian and I use "nb_core_news_lg" spacy model.

Workaround for tokens with impossible states? ("problem found for token X" error)

Hi,

First of all I would like to congratulate you on launching skweak. It's a much needed tool for us who need weak supervision for structured prediction problems like NER.

While trying out skweak, I hit the "problem found for token X" error. I reviewed the fit method, and this happens when a token ends up with an impossible state (i.e. a logprob of -inf):

# Make sure there is no token with no possible states
if (np.isnan(framelogprob).any() or  # type: ignore
        framelogprob.max(axis=1).min() < -100000):
    pos = framelogprob.max(axis=1).argmin()
    print("problem found for token", doc[pos], "in", self.name)
    return

Not sure if there should be some sort of handling for this token rather than just returning from fit without any value?

As a workaround I modified this line in _compute_log_likelihood from this:

logsum = np.where(X_all_obs, logsum, -np.inf)

to this:

logsum = np.where(X_all_obs, logsum, -100000.0)

I.e. I'm changing an impossible probability to an "almost" impossible probability. The HMM model happily finished training and I was able to get meaningful predictions from it. However, I'm not sure if my workaround is doing more harm than good. Ideally, I think there should be some sort of handling for such a token with an impossible state.

Thanks a lot in advance. And I would like to thank you again for putting together this very much needed and easy to use weak supervision library.

Thanks,
Alfredo

Pattern for combining Gazetteers and Heuristics for the same label

In my problem space, I want to build a labeling function that combines a list (gazetteer) and a heuristic annotator for the same label. Currently, if I apply them both as independent annotators, the last annotator clears all the previous labels because of this code line.

Is there a suggested pattern that I can utilize to combine multiple annotators for the same label that keeps the previous labels?
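
One pattern that should avoid the clearing issue (a sketch only; the names, label, trie and detector function are illustrative): give the gazetteer and the heuristic distinct annotator names, so their spans live in separate doc.spans layers, and let the aggregation step merge them.

from skweak import gazetteers, heuristics, aggregation

# two sources for the same PRODUCT label, kept under different layer names
lf_list = gazetteers.GazetteerAnnotator("product_gazetteer", {"PRODUCT": product_trie})
lf_rule = heuristics.FunctionAnnotator("product_heuristic", product_detector)

for lf in (lf_list, lf_rule):
    doc = lf(doc)          # each annotator only clears/overwrites its own layer

hmm = aggregation.HMM("hmm", ["PRODUCT"])
hmm.fit_and_aggregate([doc])   # the aggregated "hmm" layer combines both sources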

Wiki documentation error for `VicinityAnnotator`

VicinityAnnotator expects a {word: label} dict for its cue_words argument, and no longer requires the label as an argument. The wiki does not agree:

# Typically, entities next to words like say/tell/listen etc. will be PERSON
cue_words = ["say", "indicate", "reply", "claim", "declare", "tell", "answer", "listen"]
VicinityAnnotator("money_detector2", cue_words, "nnp_detector", "PERSON", max_window=2)

Would be nice to fix, since following the wiki throws up unexpected and obscure errors.
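
For reference, a call matching the current signature (taken from the working example elsewhere on this page) would be:

VicinityAnnotator("money_detector2", {word: "PERSON" for word in cue_words}, "nnp_detector", max_window=2)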

gazetteers.GazetteerAnnotator | Can't find model 'en_core_web_md'

Hi there, I am trying to follow this guide, and when I run the following code

tries = skweak.gazetteers.extract_json_data(f"{path_gazetteers}/json/gazetteers.json")
annotator = skweak.gazetteers.GazetteerAnnotator("pre_annoted_trips", tries)

annotator(doc)
skweak.utils.display_entities(doc, "pre_annoted_trips")

I get the following error:

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a Python package or a valid path to a data directory.

I am using pt_core_news_lg and don't understand why I am getting this error... Can I only use extract_json_data if I have the en_core_web_md model installed?

Thanks in advance

Error on documents without any spans

Version: 0.2.9
Platform: Linux-4.4.0-176-generic-x86_64-with-debian-9.6
Python version: 3.6.7
Pipelines: en_core_web_md (3.0.0), en_core_web_sm (3.0.0)

If I have any document in my set that does not have any span detected (source), I get an error during the HMM model creation:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-8af531598451> in <module>
      7 #docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
      8 # And run the estimation
----> 9 docs = model.fit_and_aggregate(docs)

/backend/notebook/skweak/skweak/aggregation.py in fit_and_aggregate(self, docs, n_iter)
    300         labelling functions."""
    301 
--> 302         self.fit(list(docs))
    303         return list(self.pipe(docs))
    304 

/backend/notebook/skweak/skweak/aggregation.py in fit(self, docs, cutoff, n_iter, tol)
    353 
    354         # And add the counts from majority voter
--> 355         self._add_mv_counts(docs)
    356 
    357         # Finally, we postprocess the counts and get probabilities

/backend/notebook/skweak/skweak/aggregation.py in _add_mv_counts(self, docs)
    530 
    531             # And aggregate the results
--> 532             agg_array = mv._aggregate(obs).values
    533 
    534             # Update the start probabilities

/backend/notebook/skweak/skweak/aggregation.py in _aggregate(self, obs, coefficient)
    229             return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
    230 
--> 231         label_votes = np.apply_along_axis(count_function, 1, obs.values)
    232 
    233         # For token-level segmentation (with a special O label), the number of "O" predictions

<__array_function__ internals> in apply_along_axis(*args, **kwargs)

/usr/local/lib/python3.6/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    377     except StopIteration:
    378         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    380 
    381     # build a buffer for storing evaluations of func1d.

/backend/notebook/skweak/skweak/aggregation.py in count_function(x)
    227             ar = x[x>=min_val]-min_val
    228 
--> 229             return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
    230 
    231         label_votes = np.apply_along_axis(count_function, 1, obs.values)

<__array_function__ internals> in bincount(*args, **kwargs)

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Current workaround is to filter these documents like this:

docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]

Thanks for this promising library. I worked on getting Snorkel ready for spaCy NER data labeling, but this one looks like a really good fit right away.

`VicinityAnnotator` returns multiple spans for the same token

The current implementation of VicinityAnnotator reads:

            for tok in doc[left_bound:right_bound]:
                for tok_form in [tok.text, tok.lower_, tok.lemma_]:
                    if tok_form in self.cue_words:
                        yield span.start, span.end, self.cue_words[tok_form]


which leads to it returning multiple identical spans if several of [tok.text, tok.lower_, tok.lemma_] match the same cue word. This does not feel intended?
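
A minimal sketch of one possible fix, stopping after the first matching token form so the same span is only yielded once per cue token:

            for tok in doc[left_bound:right_bound]:
                for tok_form in [tok.text, tok.lower_, tok.lemma_]:
                    if tok_form in self.cue_words:
                        yield span.start, span.end, self.cue_words[tok_form]
                        break  # avoid duplicate spans when text, lower_ and lemma_ all match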

Missed Annotations

The base annotator filters each annotation based on _is_allowed_span

for start, end, label in self.find_spans(doc):

However, implementations such as TokenConstraintAnnotator perform additional filtering: they only yield the longest span. This means that in cases where the longest span violates _is_allowed_span but a shorter, valid (overlapping) span exists, the shorter span is never considered.

I think the logic should really be to return the longest valid spans, which means the _is_allowed_span needs to be called in the find_spans method and not __call__ of the base class.

A workaround seems to be to add the name of the annotator itself to the incompatible_sources, and then yield the candidate spans in order of length descending. That way it will return spans that satisfy both constraints.

Trying to run the quick_start code and getting a problem.

Hi there, I am trying to run the code in the quick_start.ipynb file and getting the following error: OSError: [E053] Could not read config file from /usr/local/lib/python3.7/dist-packages/en_core_web_sm/en_core_web_sm-2.2.5/config.cfg

The solution online recommends downgrading spaCy from 3.2.3 to 2.3.5 as a workaround for the above error; however, when I try it, the spans functionality is not available in spaCy versions below 3.0.

I would appreciate any insight in resolving this error.

Thank you!

MUC-6 dataset

Hello, I really appreciate your work on weak supervision.
I have noticed that in your preprint, you show the results of skweak on the MUC-6 corpus.
https://arxiv.org/abs/2104.09683

I am testing different generative models and I would like to compare them on the same data sets, however, I cannot find MUC-6.
Could you, please provide me a source from which you downloaded the data set?
Is it somewhere behind a paywall?

Support for relation extraction

Right now, skweak supports two main types of NLP tasks: (token-level) sequence labelling and text classification. Both rest on the idea that labelling functions associate labels to text spans, and the role of the aggregation model is then to merge the outputs of those labelling functions so as to get unified predictions.

However, some NLP tasks cannot be easily associated to text spans. For instance, relation extraction necessitates a prediction on pairs of spans.

The question is then how to provide support for such type of tasks, for instance by implementing a RelationAnnotator that could be used to associate pairs of spans to a label.

Technically speaking, we could still encode the annotations internally as SpanGroup objects. One solution would be to only add one span of the pair to the SpanGroup, but then specify that this span is connected to a second span (SpanGroup objects allow the inclusion of JSON-serialised attributes). The method get_observation_df in the BaseAggregator class could then be extended to detect whether a span is a normal one or is connected to a second span. If that is the case, the aggregation would then be done on pairs of spans instead of single spans.
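
A rough illustration of that encoding idea (purely hypothetical, not an existing skweak API): keep only the first span of each pair in the SpanGroup, and record the link to the second span as JSON-serialisable attributes on the group.

import spacy

nlp = spacy.blank("en")
doc = nlp("Alice works for Acme Corp in Oslo")

# the labelling function stores only the "head" span of each relation...
doc.spans["relation_lf"] = [doc[0:1]]
# ...and records the paired span and relation label as group-level attributes
doc.spans["relation_lf"].attrs["relations"] = [
    {"head": [0, 1], "tail": [3, 5], "label": "WORKS_FOR"}
]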

Do get in touch if this functionality is something you need, so that we know whether we should prioritise this in our next release :-)

spacy not imported by default in quick_start.ipynb

Hi,

Just a little thing: spacy isn't imported by default in quick_start.ipynb, so running the cells top to bottom gives an error.

Also, FileNotFoundError: [Errno 2] No such file or directory: 'data/crunchbase_companies.json' occurs in the gazetteers cell, as crunchbase_companies.json is not actually present in the data directory yet.
