
flyingsquid's Introduction

More Interactive Weak Supervision with FlyingSquid

UPDATE 06/17/20: Code refactored, with two new features:

  • Compute label model parameters by looking at all possible triplets and taking the mean or median; we find this to be more stable than just looking at a single triplet (use label_model.fit(..., solve_method='triplet_mean')). By default, the code now uses triplet_mean.
  • Get the estimated accuracies of each labeling function, P(lambda_i == Y), with label_model.estimated_accuracies() (see the sketch after this list).
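A minimal sketch of the two new calls, assuming the median variant is exposed as solve_method='triplet_median' (the mean variant is spelled solve_method='triplet_mean'):

from flyingsquid.label_model import LabelModel
import numpy as np

L_train = np.load('...')  # (n examples) x (m labeling functions)

label_model = LabelModel(L_train.shape[1])

# triplet_mean is the default; pass triplet_median for the median aggregate.
label_model.fit(L_train, solve_method='triplet_median')

# Estimated accuracy P(lambda_i == Y) for each labeling function.
print(label_model.estimated_accuracies())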

FlyingSquid is a new framework for automatically building models from multiple noisy label sources. Users write functions that generate noisy labels for data, and FlyingSquid uses the agreements and disagreements between them to learn a label model of how accurate the labeling functions are. The label model can be used directly for downstream applications, or it can be used to train a powerful end model.

FlyingSquid can be used to build models for all sorts of tasks, including text applications, video analysis, and online learning. Check out our blog post and paper on arXiv for more details!

Getting Started

  • Quickly install FlyingSquid
  • Check out the examples folder for tutorials and some simple code examples

Sample Usage

from flyingsquid.label_model import LabelModel
import numpy as np

# L_train: an (n examples) x (m labeling functions) matrix of noisy votes.
L_train = np.load('...')

m = L_train.shape[1]
label_model = LabelModel(m)
label_model.fit(L_train)

# Predicted labels for each example.
preds = label_model.predict(L_train)
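Continuing the snippet above, the label model's probabilistic outputs can also feed an end model. A minimal sketch, assuming predict_proba_marginalized returns P(Y = 1) per example and using scikit-learn's LogisticRegression as a stand-in end model (X_train is a hypothetical feature matrix aligned with L_train):

from sklearn.linear_model import LogisticRegression

# Probabilistic labels from the label model: P(Y = 1) for each example.
probs = label_model.predict_proba_marginalized(L_train)

# Round to hard labels in {1, -1}, and weight each example by the label
# model's confidence so that uncertain examples count for less.
hard_y = np.where(probs >= 0.5, 1, -1)
confidence = np.abs(probs - 0.5) * 2

# X_train: hypothetical feature matrix, one row per example in L_train.
end_model = LogisticRegression()
end_model.fit(X_train, hard_y, sample_weight=confidence)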

Installation

We recommend using conda to install FlyingSquid:

git clone https://github.com/HazyResearch/flyingsquid.git
cd flyingsquid
conda env create -f environment.yml
conda activate flyingsquid

Alternatively, you can install the dependencies yourself:

  • pgmpy
  • PyTorch (only necessary for the PyTorch integration)

And then install the actual package:

pip install flyingsquid

To install from source:

git clone https://github.com/HazyResearch/flyingsquid.git
cd flyingsquid
conda env create -f environment.yml
conda activate flyingsquid
pip install -e .

Citation

If you use our work or find it useful, please cite our ICML 2020 paper:

@inproceedings{fu2020fast,
  author = {Daniel Y. Fu and Mayee F. Chen and Frederic Sala and Sarah M. Hooper and Kayvon Fatahalian and Christopher R\'e},
  title = {Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning (ICML 2020)},
  year = {2020},
}

flyingsquid's People

Contributors

danfu09


flyingsquid's Issues

Question on the binary Ising model

In the paper, a binary Ising model is constructed to handle abstains. I read through the code, and it seems to me that the Ising model is never actually constructed. I wonder where the Ising model comes into play in the code.

For example, the paper says P(λ_i = 0, Y_dep(i) = 1) is factorizable due to the construction of G (the Ising model), while in the code it is simply computed as P(λ_i = 0, Y_dep(i) = 1) = P(λ_i = 0) * P(Y_dep(i) = 1):

r_vals[r_val] = pos_prob * zero_probs

I don't understand what's going on here.

I appreciate any explanations. Thanks!

Support for abstaining

Hello!

I stumbled upon this new weak labeling framework while looking for updates to the Snorkel code to support conditional dependencies. Very interesting work, but after quickly glancing through the code it is unclear whether abstains are fully supported here. As I understand it, the labeling semantics supported at the moment are 1 for positive, -1 for negative, and 0 for abstain. However, the docstrings for several functions state that abstains are not supported, while other functions clearly do consider abstains.

Could you clarify (1) if the code supports abstains in its current form, and (2) if the semantics are indeed as described above (1, -1, and 0)?

Thanks!

Speeding up training time on large datasets with label dependencies

Hello,

I tried training on 100K records with 9 weak labels: training takes 0.02 seconds without lambda_edges, but 7s with 1 lambda edge, 18s with 2, and 21s with 3. Is this expected behavior? Are there ways to speed it up or parallelize?
(I have multiple datasets with 47M rows, so assuming linear scaling in records, it'd take almost 3h for training on each...)
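One workaround for the multiple-dataset case might be to fit the label models for different datasets in parallel processes. A minimal sketch, assuming lambda_edges is accepted by the LabelModel constructor, that fitted models pickle cleanly across processes, and a hypothetical list datasets of (n_i, 9) label matrices:

from concurrent.futures import ProcessPoolExecutor
from flyingsquid.label_model import LabelModel

def fit_one(L):
    # The edges here are placeholders for whatever dependencies you declared.
    lm = LabelModel(L.shape[1], lambda_edges=[(0, 1)])
    lm.fit(L)
    return lm

# datasets: hypothetical list of label matrices, one per dataset.
with ProcessPoolExecutor() as pool:
    models = list(pool.map(fit_one, datasets))

This does not speed up a single 47M-row fit, but it overlaps the per-dataset training runs.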

Thank you!

_triplet_method_preprocess seems to be missing break statements

In this loop (https://github.com/HazyResearch/flyingsquid/blob/master/flyingsquid/label_model.py#L283-L288), both the inner and outer loops are escaped once a triplet is found.

Here (https://github.com/HazyResearch/flyingsquid/blob/master/flyingsquid/label_model.py#L305-L308) and in a mirrored loop below, there is no escape for the outer loop. That seems to be a bug, as it means the entire inner loop will keep being evaluated even after a triplet has been discovered.
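For reference, a self-contained sketch of the flag-based double escape that the first loop uses and the second appears to be missing (forms_triplet and the candidate names are stand-ins, not the real code):

def forms_triplet(a, b, c):
    # Stand-in for the real conditional-independence check.
    return b != c

expectation = 'lambda_0'
candidates = ['lambda_1', 'lambda_2', 'lambda_3']

found = False
triplet = None
for first_node in candidates:
    for second_node in candidates:
        if forms_triplet(expectation, first_node, second_node):
            triplet = [expectation, first_node, second_node]
            found = True
            break  # escapes the inner loop only...
    if found:
        break      # ...so the outer loop needs its own escape

print(triplet)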

Tutorial examples for multiclass and multilabel labeling?

Hi, thanks for making this public. I enjoyed the paper, and my team and I are excited to try this out in our work.

We have a multilabel problem and are struggling a little with how to apply FlyingSquid to this setup. Section C.2 of the appendix mentions that in the multiclass case, a one-versus-all scheme can be applied repeatedly. This makes sense in principle, but I was wondering whether an example of this using LabelModel could be provided? We went through the example notebooks, but each demonstrates a binary classification problem.

In the multilabel case, I was thinking you could follow a one-vs-all approach without the voting scheme that is typically applied, such that each instance in the dataset can end up with multiple labels. Is this something that can be done with FlyingSquid's LabelModel?
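For reference, a minimal one-vs-all sketch under the binary semantics above (1 positive, -1 negative, 0 abstain); the binarization step and the class encoding are assumptions, not part of the library:

import numpy as np
from flyingsquid.label_model import LabelModel

def one_vs_all_probs(L_multi, classes):
    # L_multi: (n, m) matrix of class ids, with 0 meaning abstain.
    # Returns an (n, num_classes) matrix of estimated P(Y = c).
    n, m = L_multi.shape
    probs = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        # Vote for class c -> 1, vote for any other class -> -1,
        # abstain stays 0.
        L_bin = np.where(L_multi == c, 1, np.where(L_multi == 0, 0, -1))
        lm = LabelModel(m)
        lm.fit(L_bin)
        probs[:, j] = lm.predict_proba_marginalized(L_bin)
    return probs

For multiclass you would argmax over the columns; for multilabel you could instead threshold each column independently, so an instance can end up with multiple labels.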

Prediction Time

Hello,

This is a question rather than an issue. I tried applying FlyingSquid to a dataset with 65M instances and 3 weak labels, using the model structures from the tutorials. The single-node model took ~2.5s to train, and the sequential (3-node) model took ~8s. However, when I tried to get predictions with preds = label_model.predict(L_train), it ran for a long time (~20 minutes) without completing.
Does this behavior make sense? What could cause it?
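One thing worth trying in the meantime: if the model is single-node, so rows are scored independently, predictions can be computed in chunks and concatenated. A sketch, assuming predict accepts any (n, m) slice of the label matrix:

import numpy as np

# Score 1M rows at a time instead of one 65M-row call.
n_chunks = max(1, len(L_train) // 1_000_000)
chunks = np.array_split(L_train, n_chunks)
preds = np.concatenate([label_model.predict(chunk) for chunk in chunks])

This at least bounds memory use and shows whether the cost is linear in the number of rows.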

Thanks!

_triplet_method_preprocess could produce a triplet with only two entries

In my experimentation, I ran into some cases where the _triplet_method_preprocess function adds triplets containing only two items.

This seems to happen when some dependencies among LFs are included via lambda_edges.

In my local code, I've changed this (https://github.com/HazyResearch/flyingsquid/blob/master/flyingsquid/label_model.py#L346-L349)
to the following:
if found and len(triplet) > 2:
to prevent fit() from failing.

Questions about the PyTorch integration

Hey, I didn't fully understand how the PyTorch integration fits into the online learning setting.
I was wondering if you could make a simple tutorial. The other two are fantastic!
Thanks!

Inconsistency in temporal model output shapes

Hello,

I am playing with temporal models, as in the Video.ipynb tutorial. One thing I noticed is that for those models, with v > 1, the return shapes differ between label_model.predict and label_model.predict_proba_marginalized. The former returns a value for each of the v elements in a frame, while the latter returns a single flat array. So if v = 3 and there are 1000 frames, the first returns (1000, 3) while the second returns (3000,). I think it would be convenient to have the same shapes in both cases.
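Until the shapes are unified, one of the outputs can be reshaped to match the other. A sketch assuming v = 3, 1000 frames, and that the flat array is ordered frame-major (the three per-frame values contiguous), which is an assumption worth verifying:

preds = label_model.predict(L_train)                     # shape (1000, 3)
probs = label_model.predict_proba_marginalized(L_train)  # shape (3000,)

# Line the flat probabilities up with the per-frame predictions.
probs_by_frame = probs.reshape(-1, 3)                    # shape (1000, 3)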

Thanks!

KeyError when fitting a model

Hello,

I've run into the following error:

~/flyingsquid/flyingsquid/label_model.py in fit(self, L_train, class_balance, Y_dev, flip_negative, clamp, solve_method, sign_recovery, verbose)
    604                 elif num_Ys(equals_one_tup) != 0 and num_lambdas(equals_one_tup) != 0:
    605                     # If this contains lambdas and Y's, we can't observe it
--> 606                     r_vals[r_val] = probability_values[r_val]
    607                 elif num_Ys(equals_one_tup) != 0:
    608                     # We need to cache this moment
KeyError: (('lambda_1', 'lambda_2', 'lambda_3', 'Y_0'), ('0',))

The label model I am creating has m = 4 weak labels, and the lambda_edges are [(1, 2), (1, 3), (2, 3)].
The data summary (one column per weak label) is:

       LF 1          LF 2          LF 3          LF 4
count  2.676571e+06  2.676571e+06  2.676571e+06  2.676571e+06
mean   1.793339e-05  7.472247e-07  7.472247e-07  9.493490e-04
std    4.234747e-03  8.644214e-04  8.644214e-04  3.079688e-02
min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
25%    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
50%    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
75%    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
max    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00

It seems that this is due to the very rare weak labels?

Returning NaN for probabilities

Hello,

I've found that in a number of situations predict_proba_marginalized returns 'nan'. I didn't see this behavior in the tutorials or the documentation, and wasn't sure how to interpret it. Here is one example:

import numpy as np
from flyingsquid.label_model import LabelModel

def squid_run(a):
    m = a.shape[1]
    lm = LabelModel(m)
    lm.fit(a)
    print(lm.estimated_accuracies())
    out = lm.predict_proba_marginalized(a)
    unique, counts = np.unique(out, return_counts=True)
    print(unique, counts)


n = 100
a = np.ones(n)
b = -1 * np.ones(n)
z = np.concatenate([a, b, b, a, a, b]).reshape((3, 2 * n)).transpose()
print(z.shape)
squid_run(z)

n = 100
a = np.ones(n)
b = np.zeros(n)
z = np.concatenate([a, b, b, a, a, b]).reshape((3, 2 * n)).transpose()
print(z.shape)
squid_run(z)

The first run returns 'nan' for all instance probabilities (and 1s for the estimated accuracies). The second run prints:

(200, 3)
[0.25, 0.25, 0.25]
[0.5] [200]

The same runs with only two weak labels, i.e.

z = np.concatenate([a, b, b, a]).reshape((2, 2 * n)).transpose()

result in 'nan' in both cases.
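A possible explanation, based on the triplet identity from the paper: for conditionally independent lambdas, E[lambda_i Y]^2 = E[lambda_i lambda_j] * E[lambda_i lambda_k] / E[lambda_j lambda_k]. In the first example above, the three columns are perfectly (anti-)correlated, so the recovered accuracies hit exactly 1 and the downstream conditional probabilities degenerate to 0/0. A sketch of the moment computation:

import numpy as np

n = 100
a, b = np.ones(n), -np.ones(n)
z = np.concatenate([a, b, b, a, a, b]).reshape((3, 2 * n)).transpose()

# Pairwise second moments E[lambda_i * lambda_j].
M = z.T @ z / len(z)
# [[ 1. -1.  1.]
#  [-1.  1. -1.]
#  [ 1. -1.  1.]]

# Triplet estimate of E[lambda_0 * Y] (up to sign): exactly 1.0 here.
e0 = np.sqrt(M[0, 1] * M[0, 2] / M[1, 2])
print(M, e0)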

Flying Squid for NER

I'm wondering how you'd go about specifying a dependency structure graph for a multi-label NER type problem. For example, the case where one label function tags "George Washington" as ["I-per", "I-per"] and another as ["ABS", "I-loc"].

At first glance, it seems like you'd define each task as assigning a single label to a single token, i.e.

Y1=George, Y2=Washington

I'm trying to figure out specifically how this would map to Figure 2 in your paper. Is it possible to specify the dependency graph such that the overlapping labels are correctly resolved (i.e. we don't end up with ["I-per", "I-loc"])?

Some small fixes

Hello!

It is fascinating work, and thank you for sharing the code.
I have been playing with the code for a week, and would like to bring some small bugs to your attention.

  1. When a maximal clique contains only observed lambdas and no latent variable Y, there are three typos, on lines 688, 814, and 984:
# line 688
           easy_marginals[marginal] = JointProbabilityDistribution(
# line 814
                lf_vecs = lambda_marginal_vecs[marginal]
# line 984
                if indices not in lambda_marginals:
  2. After line 309, a condition to break out of the outer for-loop when a triplet is found is missing (this produces triplets of two):
# starting line 307 --
                        triplet = [expectation, first_node, second_node]
                        found = True
                        break
                    
                    # add these two lines
                    if found:
                        break

                if not found:
  3. In the tutorials, the accuracy computed for majority voting is incorrect, as the predictions are in {True, False} and are not converted to {+1, -1} to match the ground truth:
# In [9]
majority_vote_accuracy = np.sum(majority_vote_preds*2-1 == Y_dev) / Y_dev.shape[0]

Hope it helps and thanks again!
