norskregnesentral / skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
License: MIT License
Windows 10
Python 3.8.3
Hi,
It would seem that the skweak.utils.display_entities() function with the "hmm" parameter has some undesired behavior in Jupyter notebooks. Rather than displaying the entities like it would with the "spacy" parameter, which looks like this:
Where lines are printed to fit the screen, and entities are in a box with their label, the "hmm" parameter gives this result:
With lines being displayed awkwardly, and entity labels masking the actual word that was tagged (the COMPANY tag shows, but the entity "Reuters" is not visible).
These pictures are taken from a top-to-bottom run of the unaltered quick_start.ipynb found in the examples directory.
Love the package by the way!
Hi,
I am currently conducting research on weak supervision for NER for the Dutch language, and would like to use the model from your 2020 paper as a baseline. Since I'll be working with CoNLL-2002 rather than CoNLL-2003 for its Dutch subset, I was wondering whether you have any method or tips for converting the CoNLL IOB files to spaCy DocBin format, as you have seemingly already done so yourselves.
Thanks in advance!
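For readers with the same question, here is a minimal sketch of the first half of such a conversion: reading CoNLL-style IOB lines into tokens plus (start, end, label) spans. It assumes whitespace-separated columns with the token first and the NER tag last, and the function names are made up for illustration. The resulting spans could then be turned into spaCy Span objects and written to a DocBin; spaCy's own `spacy convert` command may also handle such files directly.

```python
def tags_to_spans(tags):
    """Turn a list of IOB tags into (start, end, label) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes any open span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label and start is not None
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # "B-" always opens a span; a stray "I-" opens one too (IOB1 style)
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, name
    return spans


def read_conll_iob(lines):
    """Yield (tokens, spans) for each sentence in a CoNLL-style file."""
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                yield tokens, tags_to_spans(tags)
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])   # assumption: token is the first column
        tags.append(cols[-1])    # assumption: NER tag is the last column
    if tokens:
        yield tokens, tags_to_spans(tags)


# Made-up sample in the three-column layout (token, POS, tag)
sample = [
    "-DOCSTART- -DOCSTART- O",
    "",
    "Jan N B-PER",
    "Jansen N I-PER",
    "woont V O",
    "in Prep O",
    "Utrecht N B-LOC",
    "",
]
sentences = list(read_conll_iob(sample))
```

Each sentence comes back as a token list and a span list, which keeps the parsing step independent of any particular spaCy version.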
Upon applying a config file in order to train a textcat model using the following code:
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
--initialize.vectors en_core_web_md --output train
I receive the following error message:
[i] Saving to output directory: train
[i] Using CPU
=========================== Initializing pipeline ===========================
2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
[2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-03-27 15:50:05,395] [INFO] Created vocabulary
[2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
[2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\spacy.exe_main.py", line 7, in
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli_util.py", line 71, in setup_cli
command(prog_name=COMMAND)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in call
return self.main(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
nlp = init_nlp(config, use_gpu=use_gpu)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
validate_get_examples(get_examples, "Tok2Vec.initialize")
File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in call
for real_eg in examples:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
for reference in reference_docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
for doc in docs:
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_serialize.py", line 150, in get_docs
doc.spans.from_bytes(self.span_groups[i])
File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_dict_proxies.py", line 54, in from_bytes
group = SpanGroup(doc).from_bytes(value_bytes)
File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly_msgpack_api.py", line 27, in msgpack_loads
msg = msgpack.loads(data, raw=False, use_list=use_list)
File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack_init.py", line 79, in unpackb
return _unpackb(packed, **kwargs)
File "srsly\msgpack_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'
Seems like a dependency issue. What is the reason for it? And is there a way to fix it?
Also: is the following error message a problem?
"[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside."
Or can it be avoided by simply applying the following?
for document in train_data:
    try:
        document.ents = document.spans["hmm"]
        skweak.utils.docbin_writer(train_data, "train.spacy")
    except Exception as e:
        print(e)
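The E1010 message indicates that some spans overlap, which doc.ents does not allow. A minimal sketch of one way to resolve this before assigning entities is shown below, working on plain (start, end, label) tuples and keeping the longest span in each overlap. The function name is hypothetical; spaCy ships a comparable helper, `spacy.util.filter_spans`, which operates on Span objects.

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span overlapping a kept one."""
    kept, taken = [], set()
    # Longest first; ties broken by earliest start
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


# Made-up spans where token 1 and token 10 each belong to two spans
spans = [(0, 2, "PERSON"), (1, 2, "ORG"), (9, 11, "MONEY"), (10, 11, "CARDINAL")]
result = filter_overlaps(spans)
```

Catching the exception, as in the snippet above, only hides the problem for the affected documents; filtering first keeps them usable for training.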
Hi There!
On this page, under the section "heuristics.VicinityAnnotator", the example below talks about cue words appearing next to proper names, but the code passes "money_detector2" as an argument of the VicinityAnnotator. Could you explain why, as money_detector2 has nothing to do with proper names?
# Typically, entities next to words like say/tell/listen etc. will be PERSON
cue_words = ["say", "indicate", "reply", "claim", "declare", "tell", "answer", "listen"]
VicinityAnnotator("money_detector2", {word:"PERSON" for word in cue_words}, "nnp_detector", max_window=2)
Super thanks in advance!
I'm getting this error when running fit and aggregate. It seems to be related to documents without any annotations by the labelling functions.
When trying it with the documents in my data set that have annotations, it works fine. However, when running it on specific ones where the labelling functions did not detect anything, it throws this error.
Is this a known issue and is there any way to fix this?
Thanks!
docs = list(LF1.pipe(train_data))
docs = list(LF2.pipe(docs))
docs = list(LF3.pipe(docs))
docs = list(LF4.pipe(docs))
docs = list(LF5.pipe(train))
ner_model = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner_model.pipe(docs))
hmm = skweak.aggregation.HMM("hmm",["A", "NOT_A"], sequence_labelling=False)
hmm.fit_and_aggregate(docs)
ValueError Traceback (most recent call last)
in ()
4 #docs = [d for d in train_pred if any([v for (k,v) in d.spans.items()])]
5
----> 6 unified_model.fit(train_pred)
7
8 unified_model.annotate_docbin("train_pred.docbin", "train_pred.docbin")
4 frames
/usr/local/lib/python3.7/dist-packages/skweak/aggregation.py in fit(self, docs, **kwargs)
85
86 obs_generator = (self.get_observation_df(doc) for doc in docs)
---> 87 self._fit(obs_generator, **kwargs)
88
89
/usr/local/lib/python3.7/dist-packages/skweak/generative.py in _fit(self, all_obs, cutoff, n_iter, tol)
95
96 # And add the counts from majority voter
---> 97 self._add_mv_counts(all_obs)
98
99 # Finally, we postprocess the counts and get probabilities
/usr/local/lib/python3.7/dist-packages/skweak/generative.py in _add_mv_counts(self, all_obs)
417
418 # And aggregate the results
--> 419 agg_array = mv.aggregate(obs).values
420
421 if len(agg_array)==0:
/usr/local/lib/python3.7/dist-packages/skweak/voting.py in aggregate(self, obs)
51 return np.bincount(x[x>=0], weights=weights[x>=0],
52 minlength=len(self.observed_labels))
---> 53 label_votes = np.apply_along_axis(count_fun, 1, obs.values).astype(np.float32)
54
55 # For token-level sequence labelling, we need to normalise the number
<array_function internals> in apply_along_axis(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
376 raise ValueError(
377 'Cannot apply_along_axis when any iteration dimensions are 0'
--> 378 ) from None
379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
380
ValueError: Cannot apply_along_axis when any iteration dimensions are 0
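A possible workaround, in the spirit of the commented-out line in the traceback above, is to drop documents for which no labelling function produced a span before fitting. The sketch below uses simple stand-in objects with a `spans` dictionary instead of real spaCy Doc objects, and `has_annotations` is a hypothetical helper, not part of skweak.

```python
from types import SimpleNamespace


def has_annotations(doc, ignore=()):
    """True if at least one span layer (outside `ignore`) is non-empty."""
    return any(len(spans) > 0
               for name, spans in doc.spans.items()
               if name not in ignore)


# Stand-ins for spaCy docs: only the `spans` mapping matters here
docs = [
    SimpleNamespace(text="doc with a match", spans={"LF1": [(0, 1, "A")], "LF2": []}),
    SimpleNamespace(text="doc without matches", spans={"LF1": [], "LF2": []}),
    SimpleNamespace(text="doc never annotated", spans={}),
]

annotated = [d for d in docs if has_annotations(d)]
```

Only the filtered list would then be passed to the aggregation model; the unannotated documents carry no evidence for the model anyway.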
Hi, @plison I have a quick question,
To find the sentiment of sentences, we have many libraries available, such as TextBlob, NLTK, Transformers, Flair, etc.
So when we don't have labelled data to train a sentiment analysis model, we can get the labels from those libraries and then train our model.
But how can I get labels for my own domain data?
For example, I have text like "my printer is not working" and I want to label it as "Hardware problem".
How can I achieve this with skweak? Is there any demo code for that kind of labelling?
And how can I get results like "Donald Trump: PERSON" and "$700: MONEY" for the example given?
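As a rough illustration of domain-specific labelling, the sketch below shows a keyword-based labelling function that emits one span covering the whole text. The keyword lists and label names are invented for the example, and the function works on a plain token list. With skweak, a function of the same shape taking a spaCy Doc could presumably be wrapped in heuristics.FunctionAnnotator and aggregated with sequence_labelling=False, as done in other snippets in this thread.

```python
# Hypothetical cue words per label; a real project would curate these
KEYWORDS = {
    "Hardware problem": {"printer", "keyboard", "monitor"},
    "Software problem": {"install", "update", "crash"},
}


def keyword_labeller(tokens):
    """Yield (start, end, label) spans covering the whole text."""
    lowered = {tok.lower() for tok in tokens}
    for label, cues in KEYWORDS.items():
        if lowered & cues:
            yield 0, len(tokens), label


hardware = list(keyword_labeller("my printer is not working".split()))
nothing = list(keyword_labeller("thanks for the quick reply".split()))
```

A function that yields nothing simply abstains, which is what lets several such weak signals be combined later.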
Hello, when I am running the minimal example:
import spacy, re
from skweak import heuristics, gazetteers, generative, utils
# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)
# LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match(r"(19|20)\d{2}$",
                                          tok.text), "DATE")
# LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})
# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")
# apply the labelling functions
doc = lf3(lf2(lf1(doc)))
# and aggregate them
hmm = generative.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate([doc])
# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")
I am expecting three entities to be tagged: "Donald Trump", "$750", and "2016". However, only the first two are getting tagged, even though the year is tagged correctly if I display only the years layer:
Can anyone please help me identify why the years are not getting tagged? I am running skweak version 0.3.1.
Thanks!
I've encountered two cases where aggregation.HMM crashes if a document contains no annotated spans. Both occur at the same point in the code:
Line 552 in fba1037
There are two issues. First, if a document contains no annotated spans, agg_array will be empty, but the code attempts to access row 0.
The second is related: if you have a source that generates a label but you tell HMM to ignore this label, the same exception is thrown.
A quick hack is to check if len(agg_array)==0 and, if so, skip the rest of the loop. I'm not sure what consequence that has, though.
A better option might be to filter the document list so it only contains labelled examples. This requires knowing which label sources generate which labels (i.e. in your example, presidents applies PERSON).
Currently, running an aggregator on a doc when the sources are empty lists does not set doc.spans[aggregator_name]. This is clear here: https://github.com/NorskRegnesentral/skweak/blob/main/skweak/aggregation.py#L54
This caused me some difficult-to-diagnose errors. Intuitively, I would expect the aggregator to set an empty list on the doc.spans, like the annotator functions do. If this is the intended behaviour, I would strongly recommend mentioning this in the wiki.
As a sidenote, I think it is a bit confusing that the voter both modifies the given doc and returns it. I would either make it clear the operation happens in-place or return a modified copy of the doc, but not both. Either way the wiki doesn't make it very clear.
Do you have any examples on how to apply skweak for text classification? All the examples I see are for NER.
Thank you.
display_entities throws a KeyError exception on the line below if doc.spans does not contain layer:
Line 740 in fba1037
The other code paths use get_spans(doc, layer) rather than doc.spans[layer] to avoid this case.
Note: I noticed this since I was using the PyPI version, which doesn't contain the fix for #25
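Until a fix is available in the installed version, a defensive lookup on the caller's side avoids the KeyError. The sketch below uses a plain dictionary as a stand-in for doc.spans, and `get_layer` is a hypothetical helper rather than skweak's own get_spans.

```python
def get_layer(spans_by_layer, layer):
    """Return the spans stored under `layer`, or an empty list if absent."""
    return spans_by_layer.get(layer, [])


# Stand-in for doc.spans after only the gazetteer has run
doc_spans = {"presidents": [(0, 2, "PERSON")]}

present = get_layer(doc_spans, "presidents")
missing = get_layer(doc_spans, "hmm")  # no aggregation has run yet
```

Checking `layer in doc.spans` before calling the display function would achieve the same effect.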
I love the idea of this project and I'm definitely planning to utilize it in future work. One point of discussion is that it's a little confusing for a spaCy user to have doc.spans be a dict with the spans for each labeling function.
I think it might be a little easier if these were contained in a custom extension so you could access these annotations as something like doc._.skweak_spans and the doc.spans attribute was only set post aggregation.
If there's a rationale I'm missing, no big deal.
Thanks!
Hello, thank you for sharing this repo. Do you have plans for providing capability for a regression-based outcome? Something along the lines of fine-grained sentiment on a scale from 1-5?
Hello,
Has anyone here tried to use a model/framework other than spaCy (NER) as a labeling function?
I tried to work with Flair but it didn't work.
Can anyone help me? Thanks in advance.
First, thanks for this great tool. I'm trying to learn skweak for full document classification. I found your sentiment example ("weak_supervision_sentiment.py") a bit too complicated and slow (because of BERT), hence I wrote the test code below using IMDB sentiment data. This code applies 4 classic BOW-type classifiers and a sentiment detector to simulate "weak learners". A few questions and comments:
import spacy
from skweak import heuristics, aggregation
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from afinn import Afinn
import random
from tensorflow.keras.datasets import imdb
nlp = spacy.load("en_core_web_sm")
afinn = Afinn(language='en')
# get IMDB sentiment data
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
ind2text = {x:k for k,x in imdb.get_word_index().items()}
# convert to text
n_train = 500 # train samples
n_test = 200 # test samples
get_text = lambda data: [" ".join([ind2text[x] for x in d]) for d in data]
X_train,Y_train = get_text(training_data[0:n_train]),list(training_targets[0:n_train])
X_test,Y_test = get_text(testing_data[0:n_test]),list(testing_targets[0:n_test])
# create some whole-text classical classifiers
my_classifiers=[]
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegressionCV(penalty="l1",cv=5,solver='liblinear')),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', NuSVC()),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', PassiveAggressiveClassifier()),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', RandomForestClassifier()),
]))
my_label_funs=[] # labeling functions
# add afinn annotator (span over whole document)
def labeling_fun_afinn(x):
    yield 0, len(x),('0' if afinn.score(x.text)<=0 else '1')
my_label_funs.append(heuristics.FunctionAnnotator("afinn", lambda x: labeling_fun_afinn(x)))
# apply predictor (span over whole document)
def labeling_fun_clf(x,model):
    yield 0, len(x),str(model.predict([x.text])[0])
for i,clf in enumerate(my_classifiers):
    clf.fit(X_train,Y_train)
    # bind clf as a default argument so each annotator keeps its own classifier
    my_label_funs.append(heuristics.FunctionAnnotator("classifier_%i" % i, lambda x, clf=clf: labeling_fun_clf(x,clf)))
random.seed(4)
random.shuffle(my_label_funs) # order matters!
# get annotated Spacy docs
def process_docs(X):
    docs_list = []
    for i,doc in enumerate(X):
        x = nlp(doc)
        for lf in my_label_funs:
            x = lf(x)
        docs_list.append(x)
    return docs_list
# obtain processed docs
train_docs = process_docs(X_train)
test_docs = process_docs(X_test)
# create HMM aggregator
hmm = aggregation.HMM("hmm", ['0','1'],sequence_labelling=False)
# fit and annotate train data
hmm.fit_and_aggregate(train_docs)
# apply model to test data (works as a "predict" function?)
test_docs = list(hmm.pipe(test_docs))
# get predicted classes
hmm_preds = []
afinn_preds = []
for doc in test_docs:
    y = doc.spans['hmm'][0].label_
    hmm_preds.append(int(y))
    afinn_preds.append(0 if afinn.score(doc.text) < 0 else 1)
print("\nResults")
print(" skweak HMM: F1=%f, accuracy=%f" % (f1_score(Y_test,hmm_preds),accuracy_score(Y_test,hmm_preds)))
print(" afinn: F1=%f, accuracy=%f" % (f1_score(Y_test,afinn_preds), accuracy_score(Y_test,afinn_preds)))
for i,clf in enumerate(my_classifiers):
    y = clf.predict(X_test)
    print(" classifier %i: F1=%f, accuracy=%f" % (i+1,f1_score(Y_test,y),accuracy_score(Y_test,y)))
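One thing worth checking in scripts that register one annotator per classifier inside a loop: a lambda created in a for loop looks up the loop variable when it is called, not when it is defined. Unless the variable is bound explicitly (for instance as a default argument), every lambda ends up using the last classifier. The toy sketch below, with strings standing in for classifiers, shows the difference.

```python
def make_late_bound(names):
    """Each lambda reads `name` at call time, i.e. after the loop finished."""
    funcs = []
    for name in names:
        funcs.append(lambda: name)
    return funcs


def make_early_bound(names):
    """A default argument freezes the current value of `name` per lambda."""
    funcs = []
    for name in names:
        funcs.append(lambda name=name: name)
    return funcs


names = ["clf_0", "clf_1", "clf_2"]
late = [f() for f in make_late_bound(names)]
early = [f() for f in make_early_bound(names)]
```

With late binding, the "weak learners" would all be copies of the same model, which can make the aggregated result look suspiciously close to a single classifier.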
Hello.
I can't work out why the gazetteer doesn't match the single name 'Barack'.
import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack"), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)
{'presidents': [Donald Trump]}
Any ideas?
Thanks for a remarkable lib!
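A plausible explanation, which is an assumption rather than a confirmed diagnosis, lies in Python syntax: ("Barack") is just a parenthesised string, not a one-element tuple, so a structure expecting token sequences would see it as a sequence of single characters. A trailing comma makes it a tuple, as the sketch below shows.

```python
NAMES = [("Barack"), ("Donald", "Trump")]

# The first entry is a plain string, the second a tuple
kinds = [type(entry).__name__ for entry in NAMES]

# Iterating over the string yields characters, not one token
first_as_tokens = list(NAMES[0])

# With a trailing comma the single name becomes a one-token sequence
FIXED_NAMES = [("Barack",), ("Donald", "Trump")]
fixed_first_as_tokens = list(FIXED_NAMES[0])
```

Passing FIXED_NAMES to the Trie would then be the variant to try.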
Hi,
Thank you for this wonderful library.
I would like to know what the role of the HMM is here and why we are using it. After running it, I am getting some scores like this:
1 -51.1920 +nan
2 -51.3015 -0.1095
May I know what these represent?
Also, in a research paper (https://arxiv.org/pdf/2104.09683.pdf), there are some scores like this:
HMM-aggregated labels:
Ideally the gap token (
Line 41 in fba1037
I'd do something like:
def __init__(self, name: str, constraint: Callable[[Token], bool],
             label: str, min_characters=3, gap_tokens: Optional[Set] = None):
    self.gap_tokens = gap_tokens if gap_tokens is not None else {"-"}
Hi again,
I found a problem with this method:
def _extract_sources(self, docs: Iterable[Doc], max_number=1000):
    """Extract the names of all labelling sources mentioned in the documents
    (and not included in the list of sources to avoid)"""
    sources = set()
    for i, doc in enumerate(docs):
        for source in doc.spans:
            if (len(doc.spans[source]) > 0 and
                    "aggregated" not in doc.spans[source].attrs):
                sources.add(source)
        if i > max_number:
            break
    return sources
It is only called once (from fit):
sources = self._extract_sources(docs)
This means that sources will always be collected from the first 1,000 documents in the dataset. I started getting KeyErrors for some of my annotators as they were triggered only after the first 1,000 documents. So I modified this function to scan the full dataset instead.
Can I suggest making max_number a class parameter or maybe even removing it altogether?
Thanks a lot in advance!
Alfredo
I am working on a project where I am trying to resolve conflicts in named entities, one of my steps involves using skweak.
I am experiencing the following problem.
# List of spacy doc objects; each doc object represents a sentence
docs = prepare_spacy_extensions(sents, labels, headers)
# Applying skweak on each iteration
for doc in docs:
    piped_doc = list(first_name_detector.pipe(doc))
    skweak.utils.display_entities(piped_doc)
This results in the following error
File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 116, in <module>
perform_labeling_functions(tokens, labels, headers)
File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 109, in perform_labeling_functions
piped_doc = list(first_name_detector.pipe(doc))
File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 37, in pipe
yield self(doc)
File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 89, in __call__
doc.spans[self.name] = []
AttributeError: 'spacy.tokens.token.Token' object has no attribute 'spans'
When I try to perform the same action with the entire doc object, it will return another error,
line 110, in perform_labeling_functions
skweak.utils.display_entities(piped_doc)
File "/Users/egelm1/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/utils.py", line 734, in display_entities
spans = doc.ents
AttributeError: 'list' object has no attribute 'ents'
Presumably you want the larger doc object and sentence level docs are not supported? Though with my current system I am unable to aggregate everything into one large doc object. Are there any solutions to this problem?
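Judging from the two tracebacks, sentence-level docs are probably not the problem. The first error looks like what happens when pipe receives a single doc and iterates over its tokens, and the second like display_entities receiving a list instead of one doc. The sketch below reproduces the first effect with small stand-in classes (all names are hypothetical and only mimic the behaviour visible in the traceback).

```python
class StubToken:
    """Stand-in for a spaCy Token: it has no `spans` attribute."""
    def __init__(self, text):
        self.text = text


class StubDoc:
    """Stand-in for a spaCy Doc: iterating over it yields tokens."""
    def __init__(self, words):
        self.tokens = [StubToken(w) for w in words]
        self.spans = {}

    def __iter__(self):
        return iter(self.tokens)


class StubAnnotator:
    """Mimics the call/pipe structure shown in the traceback."""
    def __init__(self, name):
        self.name = name

    def __call__(self, doc):
        doc.spans[self.name] = []
        return doc

    def pipe(self, docs):
        for doc in docs:
            yield self(doc)


annotator = StubAnnotator("first_names")
docs = [StubDoc(["Anna", "works", "here"]), StubDoc(["Hello", "Bob"])]

# Passing ONE doc to pipe iterates over its tokens and fails
try:
    list(annotator.pipe(docs[0]))
    error = None
except AttributeError as exc:
    error = str(exc)

# Passing the whole list, or calling the annotator on one doc, works
piped = list(annotator.pipe(docs))
single = annotator(docs[1])
```

In the original loop this would mean either calling the annotator directly on each doc, or calling pipe once on the whole list and then displaying the resulting docs one at a time.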
Hello,
Is there a way to write a heuristic function that specifically indicates that a span should not be interpreted as an entity? I.e. that it should be treated with an "O" label? Basically, I'd like to have a function that actively votes that span is NOT an entity, rather than abstaining on it.
This is a slightly odd issue, and it's not major, but if you construct an HMM instance and then type the variable name you get a stack dump. The issue is that HMM is an instance of the sklearn BaseEstimator via _BaseHMM, and its __repr__ expects every class constructor argument to be available as a member variable.
For example, in Jupyter (though this should also reproduce in a console):
hmm = aggregation.HMM("hmm", ["DEVICE", "EQUIP", "LOCATION", "POINT"], sequence_labelling=False)
hmm
AttributeError: 'HMM' object has no attribute 'prefixes'
SKLearn expects that the following should work:
hmm._get_param_names()
hmm.get_params()
But you don't save prefixes and a number of other arguments. The BaseEstimator contains the following:
if sequence_labelling:
    if prefixes not in {"IO", "BIO", "BILUO", "BILOU"}:
        raise RuntimeError(
            "Tagging scheme must be 'IO', 'BIO', 'BILUO' or ''")
    self.out_labels = ["O"]
    for label in labels:
        for prefix in prefixes.replace("O", ""):
            self.out_labels.append("%s-%s" % (prefix, label))
else:
    self.out_labels = labels
This is to support hyper-parameter optimisation etc. If you simply add:
self.sequence_labelling = sequence_labelling
self.prefixes = prefixes
self.labels = labels
It should fix the issue (although to do it "properly" you then need to make self.out_labels a property getter method and move the logic above into it, as technically sklearn can update the params dictionary and you need anything dependent on them to dynamically update)
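The "proper" variant described above could look roughly like the following sketch. LabelScheme is a hypothetical standalone class, not skweak's HMM: it stores every constructor argument under its own name and derives out_labels on demand, so the derived labels follow any later change to the parameters.

```python
class LabelScheme:
    """Hypothetical example class illustrating the stored-params pattern."""

    def __init__(self, labels, sequence_labelling=True, prefixes="BIO"):
        # Every constructor argument is kept as an attribute of the same name
        self.labels = labels
        self.sequence_labelling = sequence_labelling
        self.prefixes = prefixes

    @property
    def out_labels(self):
        """Derived from the stored parameters each time it is read."""
        if not self.sequence_labelling:
            return list(self.labels)
        if self.prefixes not in {"IO", "BIO", "BILUO", "BILOU"}:
            raise RuntimeError("Unsupported tagging scheme: %s" % self.prefixes)
        out = ["O"]
        for label in self.labels:
            for prefix in self.prefixes.replace("O", ""):
                out.append("%s-%s" % (prefix, label))
        return out


scheme = LabelScheme(["DEVICE", "LOCATION"])
bio_labels = scheme.out_labels

# Changing a parameter afterwards is reflected automatically
scheme.prefixes = "IO"
io_labels = scheme.out_labels
```

The trade-off is that out_labels is recomputed on each access, which is negligible for label lists of this size.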
Hi,
could you please provide some insights regarding the range of number of epochs and redundancy_factor for parameter tuning?
Thanks!
I very often get the error of this line that there is a "problem with token X", causing HMM training to be aborted after only a couple of documents in the very first iteration.
I found out that this is due to framelogprob having all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that this document had only one labeling function firing and X[source] in this line was all False for the first token (or state?).
This means that this token/state is also all masked with -np.inf in logsum in this line.
Now, I am unsure how to fix that. This clearly does not look like the desired behavior but I suppose "testing for tokens with no possible states" is there for a reason.
Can I simply replace -np.inf in self._compute_log_likelihood with -100000? Then, of course, the test will not fail and not abort training but there will be a token with only very improbable states. Is that ok?
Or is that the wrong approach? Should tokens without observed labels from the labeling functions rather get a default label (e.g., O)? So why is that not done here? Is it a bug? I am not sure where I should look for a bug, if there is one. Can someone with a better knowledge of the code base give some advice on this?
Hi team,
First of all, thank you for providing such an awesome library.
I am a student currently learning about weak supervision. Could you please guide me on how to use this model? For example, say I have a binary classification task on textual data, with only about 500 labelled data points but 10k unlabelled ones. How can I use the 500 labelled data points to predict labels for the unlabelled dataset?
Thank you
I would like to apply weak supervision in the problem of text classification, i.e. I would like to "automatically" tag sentences, like so:
"I like apples", "apples"
"I like oranges", "oranges"
"Peaches are my favourite fruit", "peaches"
"Can you give me an apple?", "apples"
Is it possible to do whole sentence tagging with skweak instead of entity tagging?
I found tutorials for token tagging (spaCy tokens), but how about whole sentences?
Context
Issue
hmm = aggregation.HMM("hmm", [A, B], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)
Lines 397 to 401 in 0613f20
Question(s)
hmm = aggregation.HMM("hmm", [A, B, C], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)
Thanks in advance!
Several files that are used in the step-by-step NER tutorial are missing from the data folder (this folder on the master branch), so it's currently not possible to execute all steps in the tutorial.
Some examples:
The tutorial uses a spaCy CoNLL 2003 annotator, but the folder ../../data/conll2003/ does not exist in this repository.
annotator = skweak.spacy.ModelAnnotator("conll2003", "../../data/conll2003/")
Similarly, the paths ../../data/wikidata_tokenised.json and ../../data/crunchbase.json are referenced in the tutorial but they also do not exist in this repository.
The file conll2003_ner.py, which is imported in the tutorial, also makes reference to missing files. Some examples:
FORM_FREQUENCIES = os.path.dirname(__file__) + "/../../data/form_frequencies.json"
self.add_annotator(ModelAnnotator("BTC", os.path.dirname(__file__) + "/../../data/btc"))
None of these paths exist in this repository.
First of all, thanks for open sourcing such an awesome project!
Our team has been playing around skweak for a sequential labeling task, and we were wondering if there were any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.
Snorkel for example, provides a LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.
Are there any plans to add such functionality down the line as a feature enhancement?
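In the meantime, basic statistics of this kind can be computed outside the library. The sketch below is not a skweak feature; it assumes the output of each labelling function has already been flattened into one label per token (None where the function abstains) and reports token-level coverage, overlap and conflict per source. The function and source names are made up.

```python
def lf_summary(votes):
    """votes maps a source name to a list of per-token labels (None = abstain)."""
    n = len(next(iter(votes.values())))
    summary = {}
    for name, labels in votes.items():
        others = [v for other, v in votes.items() if other != name]
        # Tokens this source labels at all
        covered = [i for i in range(n) if labels[i] is not None]
        # Covered tokens that some other source labels too
        overlap = [i for i in covered
                   if any(o[i] is not None for o in others)]
        # Covered tokens where some other source gives a different label
        conflict = [i for i in covered
                    if any(o[i] is not None and o[i] != labels[i] for o in others)]
        summary[name] = {
            "coverage": len(covered) / n,
            "overlap": len(overlap) / n,
            "conflict": len(conflict) / n,
        }
    return summary


# Made-up votes of two sources over four tokens
votes = {
    "gazetteer": ["PERSON", "PERSON", None, None],
    "heuristic": ["PERSON", "ORG", None, "DATE"],
}
stats = lf_summary(votes)
```

Restricting the counts to tokens carrying a given label would give the per-class view asked about above.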
Hello, how would you recommend speeding up the time it takes to apply all the annotators to the corpus so that it can scale to larger corpora (i.e., >10,000 documents)? I'm following the convention for defining a combined annotator as shown in your conll2003_ner example but it's proven to be slow even with relatively fast labeling functions. I've attempted a first pass at parallel processing the use of the combined annotator but I did not have any luck. Any suggestions for implementing parallel processing in annotating the corpus or any other methods for scaling the annotation to a larger corpus of documents would be appreciated!
Thank you so much for this awesome library!
_do_forward_pass() seems to have been deprecated in newer versions of hmmlearn. After downgrading hmmlearn to version 0.2.6, the error is resolved.
Hi!
I'm getting an exception from fit_and_aggregate.
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'.
The exception is from line 227 in aggregation.py, np.apply_along_axis(...)
This seems to happen when all of my labeling functions return empty on one of the docs so the DataFrame is empty.
I tried to use skweak.utils.display_entities(docs[12], "hmm", add_tooltip=False) using the example you provided here but it returns KeyError: 'hmm'.
What can be the source of this problem?
My documents are in Norwegian and I use "nb_core_news_lg" spacy model.
Hi,
First of all I would like to congratulate you on launching skweak. It's a much needed tool for us who need weak supervision for structured prediction problems like NER.
While trying out skweak, I hit the "problem found for token X" error. I had a review of the fit method and this happens when a token ends up with an impossible state (i.e. a logprob of -inf):
# Make sure there is no token with no possible states
if (np.isnan(framelogprob).any() or # type: ignore
        framelogprob.max(axis=1).min() < -100000):
    pos = framelogprob.max(axis=1).argmin()
    print("problem found for token", doc[pos], "in", self.name)
    return
Not sure if there should be some sort of handling for this token rather than just returning from fit without any value?
As a workaround I modified this line in _compute_log_likelihood from this:
logsum = np.where(X_all_obs, logsum, -np.inf)
to this:
logsum = np.where(X_all_obs, logsum, -100000.0)
I.e. I'm changing an impossible probability to an "almost" impossible probability. The HMM model happily finished training and I was able to get meaningful predictions from it. However, I'm not sure if my workaround is doing more harm than good. Ideally, I think there should be some sort of handling for such a token with an impossible state.
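To make the effect of the workaround concrete, the sketch below re-creates the quoted check in plain Python on nested lists (rows are tokens, columns are states). It is a toy version for illustration only: the function name is hypothetical and the real code works on NumPy arrays with further terms added to the log-probabilities.

```python
import math


def has_impossible_token(framelogprob, threshold=-100000):
    """Plain-Python version of the quoted check on a token-by-state table."""
    if any(math.isnan(v) for row in framelogprob for v in row):
        return True
    # Best state per token, then the worst of those best values
    return min(max(row) for row in framelogprob) < threshold


# Second token: every state masked with -inf (the original behaviour)
masked_inf = [[-1.2, -0.7], [-math.inf, -math.inf]]
# Second token: every state masked with -100000.0 (the workaround)
masked_big = [[-1.2, -0.7], [-100000.0, -100000.0]]

aborts_with_inf = has_impossible_token(masked_inf)
aborts_with_big = has_impossible_token(masked_big)
```

The workaround passes the check because -100000.0 is not strictly below the threshold. In this toy setting all states of the affected token end up equally (un)likely, so its label would be decided by the neighbouring tokens rather than by the labelling functions.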
Thanks a lot in advance. And I would like to thank you again for putting together this very much needed and easy to use weak supervision library.
Thanks,
Alfredo
skweak/aggregation.py", line 405, in fit
logprob, fwdlattice = self._do_forward_pass(framelogprob)
AttributeError: 'HMM' object has no attribute '_do_forward_pass'
In my problem space, I want to build a labeling function that combines a gazetteer (list) annotator and a heuristic annotator for the same label. Currently, if I apply them both as independent annotators, the last annotator clears all the previous labels because of this code line.
Is there a suggested pattern that I can use to combine multiple annotators for the same label while keeping the previous labels?
VicinityAnnotator expects a {word: label} dict for its cue_words argument, and no longer requires the label as an argument. The wiki does not agree:
# Typically, entities next to words like say/tell/listen etc. will be PERSON
cue_words = ["say", "indicate", "reply", "claim", "declare", "tell", "answer", "listen"]
VicinityAnnotator("money_detector2", cue_words, "nnp_detector", "PERSON", max_window=2)
Would be nice to fix, since following the wiki throws up unexpected and obscure errors.
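Until the wiki is updated, the list from the wiki example can be turned into the {word: label} mapping described above with a dict comprehension. This is only a sketch of the data conversion; the commented-out constructor call at the end is an assumed shape inferred from this issue's description and has not been checked against the library.

```python
# Cue words from the wiki example, all pointing to the same label.
cue_word_list = ["say", "indicate", "reply", "claim",
                 "declare", "tell", "answer", "listen"]

# Build the {word: label} mapping that the issue says is now expected.
cue_words = {word: "PERSON" for word in cue_word_list}

# Assumed call shape (label argument dropped, as described in the issue):
# VicinityAnnotator("money_detector2", cue_words, "nnp_detector", max_window=2)
```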
Hi there, I am trying to follow this guide, and when I run the following code
tries = skweak.gazetteers.extract_json_data(f"{path_gazetteers}/json/gazetteers.json")
annotator = skweak.gazetteers.GazetteerAnnotator("pre_annoted_trips", tries)
annotator(doc)
skweak.utils.display_entities(doc, "pre_annoted_trips")
I get the following error:
OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a Python package or a valid path to a data directory.
I am using pt_core_news_lg, so I don't understand why I am getting this error... Can I only use extract_json_data if I have the en_core_web_md model installed?
Thanks in advance
Hi, can the aggregation step use some validation data like Snorkel?
Version: 0.2.9
Platform: Linux-4.4.0-176-generic-x86_64-with-debian-9.6
Python version: 3.6.7
Pipelines: en_core_web_md (3.0.0), en_core_web_sm (3.0.0)
If I have any document in my set that does not have any span detected (source), I get an error during the HMM model creation:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-8af531598451> in <module>
7 #docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
8 # And run the estimation
----> 9 docs = model.fit_and_aggregate(docs)
/backend/notebook/skweak/skweak/aggregation.py in fit_and_aggregate(self, docs, n_iter)
300 labelling functions."""
301
--> 302 self.fit(list(docs))
303 return list(self.pipe(docs))
304
/backend/notebook/skweak/skweak/aggregation.py in fit(self, docs, cutoff, n_iter, tol)
353
354 # And add the counts from majority voter
--> 355 self._add_mv_counts(docs)
356
357 # Finally, we postprocess the counts and get probabilities
/backend/notebook/skweak/skweak/aggregation.py in _add_mv_counts(self, docs)
530
531 # And aggregate the results
--> 532 agg_array = mv._aggregate(obs).values
533
534 # Update the start probabilities
/backend/notebook/skweak/skweak/aggregation.py in _aggregate(self, obs, coefficient)
229 return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
230
--> 231 label_votes = np.apply_along_axis(count_function, 1, obs.values)
232
233 # For token-level segmentation (with a special O label), the number of "O" predictions
<__array_function__ internals> in apply_along_axis(*args, **kwargs)
/usr/local/lib/python3.6/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
377 except StopIteration:
378 raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
380
381 # build a buffer for storing evaluations of func1d.
/backend/notebook/skweak/skweak/aggregation.py in count_function(x)
227 ar = x[x>=min_val]-min_val
228
--> 229 return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
230
231 label_votes = np.apply_along_axis(count_function, 1, obs.values)
<__array_function__ internals> in bincount(*args, **kwargs)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Current workaround is to filter these documents like this:
docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
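A slightly more explicit variant of this filter keeps the unannotated documents in a separate list instead of dropping them, so they can be inspected or added back after aggregation. The sketch below is an assumption-laden illustration: split_by_annotation is a hypothetical helper, and SimpleNamespace objects with a spans dict stand in for spaCy Doc objects.

```python
from types import SimpleNamespace


def split_by_annotation(docs):
    """Separate docs that have at least one non-empty span group from the rest."""
    annotated, empty = [], []
    for doc in docs:
        has_spans = any(len(group) > 0 for group in doc.spans.values())
        (annotated if has_spans else empty).append(doc)
    return annotated, empty


# Stand-ins for spaCy docs: only the `spans` mapping matters here.
example_docs = [
    SimpleNamespace(spans={"lf1": [(0, 1, "ORG")]}),  # one span detected
    SimpleNamespace(spans={"lf1": []}),               # source ran, found nothing
    SimpleNamespace(spans={}),                        # no source output at all
]

annotated_docs, empty_docs = split_by_annotation(example_docs)
```

Only annotated_docs would then be passed to fit_and_aggregate, as in the one-line workaround above.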
Thanks for this promising library. I worked on getting Snorkel ready for spaCy NER data labeling, but this one looks like a really good fit right away.
From digging through the code, it is easy to see that out_labels is initialized in BaseAggregator (a superclass of HMM). I am just curious where observed_labels is initialized. Thanks.
The current implementation of VicinityAnnotator reads:
for tok in doc[left_bound:right_bound]:
    for tok_form in [tok.text, tok.lower_, tok.lemma_]:
        if tok_form in self.cue_words:
            yield span.start, span.end, self.cue_words[tok_form]
which leads to it yielding the same span multiple times when tok.text, tok.lower_ and tok.lemma_ are the same. This does not feel intended?
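One way to avoid the repeated yields is to deduplicate the token forms and remember which labels have already been emitted for the current span. The sketch below is not a patch to skweak: Token is a hypothetical stand-in for a spaCy token, cue_labels is an illustrative name, and it assumes that emitting each label at most once per span is the desired behaviour.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Minimal stand-in for a spaCy token with the three forms used above."""
    text: str
    lower_: str
    lemma_: str


def cue_labels(window, cue_words):
    """Yield each label triggered by a cue word in `window` at most once."""
    seen = set()
    for tok in window:
        # dict.fromkeys drops duplicate forms while keeping their order.
        for tok_form in dict.fromkeys((tok.text, tok.lower_, tok.lemma_)):
            label = cue_words.get(tok_form)
            if label is not None and label not in seen:
                seen.add(label)
                yield label


cue_words = {"say": "PERSON", "tell": "PERSON"}
window = [Token("say", "say", "say"), Token("told", "told", "tell")]
labels = list(cue_labels(window, cue_words))
```

With the original triple loop, the same window would produce four identical matches (three for "say" and one for "tell"); here the span would be emitted once per distinct label.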
The base annotator filters each annotation based on _is_allowed_span (line 88 in fba1037). However, implementations such as TokenConstraintAnnotator perform additional filtering: they only yield the longest span. This means that when the longest span violates _is_allowed_span but a shorter, overlapping span is valid, the shorter span is never considered.
I think the logic should really be to return the longest valid spans, which means _is_allowed_span needs to be called in the find_spans method and not in __call__ of the base class.
A workaround seems to be to add the name of the annotator itself to incompatible_sources, and then yield the candidate spans in descending order of length. That way it will return spans that satisfy both constraints.
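The "longest valid spans" logic can be sketched independently of skweak as a greedy selection: visit candidates from longest to shortest, skip the ones that fail the validity check, and skip the ones that overlap a span already kept. The function name longest_valid_spans and the is_allowed callback are hypothetical; is_allowed plays the role of _is_allowed_span in this illustration.

```python
def longest_valid_spans(candidates, is_allowed):
    """Greedily keep the longest non-overlapping spans that pass `is_allowed`.

    `candidates` is an iterable of (start, end) pairs with `end` exclusive.
    """
    chosen = []
    # Longest first; ties are broken by start position.
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end in ordered:
        if not is_allowed(start, end):
            continue
        overlaps = any(start < c_end and c_start < end
                       for c_start, c_end in chosen)
        if not overlaps:
            chosen.append((start, end))
    return sorted(chosen)


candidates = [(0, 4), (0, 2), (2, 4), (5, 6)]
# Pretend the longest candidate violates the constraint.
result = longest_valid_spans(candidates, lambda s, e: (s, e) != (0, 4))
```

Checking validity before the overlap test is what lets the shorter spans (0, 2) and (2, 4) survive when the longer span (0, 4) is rejected.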
Hi there, I am trying to run the code in the quick_start.ipynb file and getting the following error: OSError: [E053] Could not read config file from /usr/local/lib/python3.7/dist-packages/en_core_web_sm/en_core_web_sm-2.2.5/config.cfg
The solution I found online recommends downgrading spaCy from 3.2.3 to 2.3.5 as a workaround for the above error. However, when I try that, spans is not available, since it does not exist in spaCy versions < 3.0.
I would appreciate any insight in resolving this error.
Thank you!
Hello, I really appreciate your work on weak supervision.
I have noticed that in your preprint, you show the results of skweak on the MUC-6 corpus.
https://arxiv.org/abs/2104.09683
I am testing different generative models and I would like to compare them on the same data sets, however, I cannot find MUC-6.
Could you, please provide me a source from which you downloaded the data set?
Is it somewhere behind a paywall?
Right now, skweak supports two main types of NLP tasks: (token-level) sequence labelling and text classification. Both rest on the idea that labelling functions associate labels with text spans, and the role of the aggregation model is then to merge the outputs of those labelling functions so as to get unified predictions.
However, some NLP tasks cannot be easily associated with text spans. For instance, relation extraction requires a prediction on pairs of spans.
The question is then how to provide support for this type of task, for instance by implementing a RelationAnnotator that could be used to associate pairs of spans with a label.
Technically speaking, we could still encode the annotations internally as SpanGroup objects. One solution would be to add only one span of the pair to the SpanGroup, but then specify that this span is connected to a second span (SpanGroup objects allow the inclusion of JSON-serialised attributes). The method get_observation_df in the BaseAggregator class could then be extended to detect whether a span is a normal one or is connected to a second span. If that is the case, the aggregation would then be done on pairs of spans instead of single spans.
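To make the proposal more concrete, here is a rough pure-Python sketch of the encoding idea. None of it is existing skweak or spaCy API: EncodedSpan, encode_relation, observation_key and the "linked_span" attribute name are all hypothetical, and a small dataclass with a JSON string stands in for a span stored in a SpanGroup with JSON-serialised attributes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedSpan:
    """Stand-in for one stored span plus its JSON-serialised attributes."""
    start: int
    end: int
    label: str
    attrs_json: str = "{}"


def encode_relation(head, tail, label):
    """Store only the head span, and record the tail span as an attribute."""
    attrs = json.dumps({"linked_span": list(tail)})
    return EncodedSpan(head[0], head[1], label, attrs)


def observation_key(span):
    """Return a single-span key for normal spans, or a pair key for linked ones."""
    linked = json.loads(span.attrs_json).get("linked_span")
    if linked is None:
        return (span.start, span.end)
    return ((span.start, span.end), tuple(linked))


relation = encode_relation((0, 2), (5, 7), "WORKS_FOR")
plain = EncodedSpan(3, 4, "PERSON")
relation_key = observation_key(relation)
plain_key = observation_key(plain)
```

An extended get_observation_df could use a key of this kind to group labelling-function outputs either by single span or by span pair before aggregation.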
Do get in touch if this functionality is something you need, so that we know whether we should prioritise this in our next release :-)
Hi,
Just a little thing, spacy isn't imported by default in quick_start.ipynb, so running the cells top to bottom gives an error.
Also, FileNotFoundError: [Errno 2] No such file or directory: 'data/crunchbase_companies.json' occurs in the gazetteers cell, as crunchbase_companies.json is not actually present in the data directory yet.