
PClean


PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. (pdf)

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
ground_truth = CSV.File(filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

As a DSL embedded into Julia, our implementation of the PClean language has some differences, in terms of surface syntax, from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed, and tell PClean that the model variable C.z represents a clean version of x, whose observed (dirty) version is modeled as C.y. This is used when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.
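Putting the points above together, a minimal model in the embedded syntax might look like the following. This is a hypothetical sketch, not taken from the repository: the class and variable names, the `ProbParameter` type and its arguments, and the distributions' argument lists are all illustrative, and `city_names` is assumed to be a Julia vector of candidate strings in scope.

```julia
using PClean

PClean.@model DemoModel begin
    @class City begin
        # (3): parameters are declared with @learned; the type and its
        # arguments here are purely illustrative.
        @learned typo_prob::ProbParameter{1.0, 1000.0}
        # (4): preferred values are passed as a trailing argument, i.e.
        # the paper's `name ~ string_prior(2, 40) preferring city_names`.
        name ~ StringPrior(2, 40, city_names)
    end

    @class Record begin
        # (2): an ordinary begin...end block serves as a subproblem hint.
        begin
            city ~ City
            observed_city ~ AddTypos(city.name)
        end
    end
end
```

A query could then align CSV columns with these variables, e.g. `@query DemoModel.Record [City city.name observed_city]`.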

pclean's People

Contributors

alex-lew, marcoct

pclean's Issues

Runtime error: LoadError: MethodError: no method matching logdensity(::AddTypos, ::String3, ::String3)

Hello, I am new to both PClean and Julia. When I tried to run the PClean experiments (julia --project=. experiments/hospital/run.jl), I got the following error:

ERROR: LoadError: MethodError: no method matching logdensity(::AddTypos, ::String3, ::String3)
Closest candidates are:
  logdensity(::FormatName, ::Any, ::Any) at ~/lab/PClean/src/distributions/format_name.jl:33
  logdensity(::FormatName, ::Any, ::Any, ::Any, ::Any) at ~/lab/PClean/src/distributions/format_name.jl:14
  logdensity(::ExpandOnShortVersion, ::Any, ::Any, ::Any) at ~/lab/PClean/src/distributions/expand_on_short_version.jl:30
  ...
Stacktrace:
 [1] var"##663"(state#321::PClean.ProposalRowState)
   @ PClean ~/lab/PClean/src/inference/proposal_compiler.jl:101
 [2] #invokelatest#2
   @ ./essentials.jl:716 [inlined]
 [3] invokelatest
   @ ./essentials.jl:714 [inlined]
 [4] make_block_proposal!(state::PClean.ProposalRowState, block_index::Int64, config::PClean.InferenceConfig)
   @ PClean ~/lab/PClean/src/inference/block_proposal.jl:175
 [5] extend_particle!(particle::PClean.SMCParticle, config::PClean.InferenceConfig)
   @ PClean ~/lab/PClean/src/inference/row_inference.jl:70
 [6] run_smc!(trace::PClean.PCleanTrace, class::Symbol, key::Int64, config::PClean.InferenceConfig)
   @ PClean ~/lab/PClean/src/inference/row_inference.jl:146
 [7] initialize_trace(observations::Vector{ObservedDataset}, config::PClean.InferenceConfig)
   @ PClean ~/lab/PClean/src/inference/inference.jl:37
 [8] macro expansion
   @ ~/lab/PClean/experiments/hospital/run.jl:79 [inlined]
 [9] top-level scope
   @ ./timing.jl:220
in expression starting at /Users/sniu/lab/PClean/experiments/hospital/run.jl:78

My Julia version is 1.7.2. Searching online, I found a similar issue (https://discourse.julialang.org/t/error-loaderror-methoderror-no-method-matching-setindex-shape-check-int64-int64-int64/18249) suggesting that newer Julia versions require explicit broadcasting, but from the error messages I haven't figured out where to adjust this, or whether I'm headed in the wrong direction. Any insights are appreciated. Thank you!

Sufeng

How to handle when a column is an array of struct

Hello,

I wonder if anyone has insights on handling a table column that is an array of structs, which is very common in real-world databases. For example, consider a profile with the following schema:

  {
    "type": "array",
    "items": [
      {
        "type": "record",
        "name": "schools",
        "fields": [
          {
            "name": "school_name",
            "type": [
              "string"
            ]
          },
          {
            "name": "degree",
            "type": [
              "string"
            ]
          }
        ]
      }
    ]
  },

one example is:

    Name | schools
    A    | [[MIT, Phd], [MIT, master], [MIT, bachrlo]]
    B    | [[Boston U, master], [Boston U, bachelor]]
    C    | [[CMU, bachelor]]

The most naive way I can think of is to break each school array into single elements, like the following:

    Name | schools
    A    | [MIT, Phd]
    A    | [MIT, master]
    A    | [MIT, bachrlo]

But I think this method would lose the relational information for one person. For example, if I know A studied for a master's at MIT, then it is likely that the earlier degree at MIT is "bachelor", and "bachrlo" is a typo.

So my question is:

  1. How can we properly handle a column that is an array of structs?
  2. If columns have a hierarchical structure, I just want to confirm whether my approach is correct: I should simply build the hierarchy into the PClean program (like a Bayesian network) and then align the hierarchical column names, right?

Thank you!
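One way to keep the relational link when flattening is to normalize the array-of-struct column into a child table with a foreign key back to the person, rather than discarding the grouping. A hedged sketch in plain Julia (in practice one would build a `DataFrame` the same way):

```julia
# Parent table: one row per person.
people = ["A", "B", "C"]

# The nested column: for each person, a list of (school, degree) entries.
schools = [
    [("MIT", "PhD"), ("MIT", "master"), ("MIT", "bachrlo")],
    [("Boston U", "master"), ("Boston U", "bachelor")],
    [("CMU", "bachelor")],
]

# Child table: one row per (person, school entry), keyed by person_id,
# so rows belonging to the same person remain linked.
records = [(person_id = pid, school_name = s, degree = d)
           for (pid, entries) in enumerate(schools)
           for (s, d) in entries]
```

In a PClean model, the child table could then be the observed class, with a reference slot pointing to a latent person class, so that entries for the same person share latent state (for example, making it more likely that "bachrlo" is explained as a typo of a degree seen elsewhere for the same person).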

Implement `collapse_particles=False` option for `InferenceConfig`

In the terminology of the paper, the steps of PClean's SMC initialization algorithm each correspond to a subproblem. Several subproblems together form an increment of the database. Currently, the number of particles is shrunk to 1 after every increment, before cloning back to the original number of particles for the following increment. This enables all particles to share memory for the "distant past" (on which they are forced to agree), and only store "diffs" to that history recording the values on which they may disagree. Without this, each particle would need to store its own copy of the latent database, which is not practical for very large datasets. However, this aspect of the algorithm should probably be configurable, via a collapse_particles=False option in InferenceConfig, which would have each particle store the entire latent database in memory. Future work could explore more memory-efficient representations (I'm sure there are interesting data structures for this sort of thing).

Are you accepting PRs / collaborators, and is there a Python version?

I've recently found this repo, and while Julia is on my list of languages to learn, I've never used it.

  1. It looks like this repo hasn't had any updates in two years, so I was curious whether you are accepting PRs or collaborators.

  2. Is there an example of this algorithm in Python that I could look at, to get a deep understanding of how it works?

Update parameter declaration syntax to match paper

This could be done superficially, by changing the parser only, or by reconsidering the Parameter mechanism in the current code. (A more ambitious rewrite would extend the current prototype's machinery for conjugate continuous parameters to also work for conjugate continuous latents.)

Support linear combinations of MeanParameters

The goal is to allow parameters (possibly from different classes) to be transformed before they are used as arguments to distributions. For example, linear combinations of normally-distributed parameters can still be used as the mean of a Gaussian observation. This may be useful in cases where we want to model multiple causes; for example, the rent of an apartment may be normally distributed around a linear combination of parameters representing the effects of (1) the apartment's location, (2) the apartment's size, (3) the apartment's landlord, etc.

There are (at least) two general strategies we could take for supporting this:

  1. We could maintain as global state during inference a mean vector and covariance matrix for the multivariate Gaussian posterior over all MeanParameters in a program, updating it as necessary when values are observed or unobserved. Then, we could resample all MeanParameters jointly in a single blocked Gibbs update. This is probably the best approach from an inference perspective, as it fully takes into account all posterior correlations between the variables. I haven't yet worked out the math for what the update rules would be, or how expensive they'd be. A useful reference would be Dario Stein's recent paper, which describes the implementation of a language with Gaussian variables and affine transformations that supports exact conditioning, and uses the "maintain a covariance matrix and mean vector" approach: https://arxiv.org/pdf/2101.11351.pdf

  2. We could perform individual Gibbs updates separately on each MeanParameter. Then when observing that N(x1+x2, sigma) == y, we think of it as an observation of x1 as y-x2 when updating x1, and of x2 as y-x1 when updating x2. This requires fewer changes to the current architecture, at the cost of possibly worse inference (more Gibbs samples are needed to converge to the same local posterior that the blocked update would have sampled from directly).
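As a sketch of the single-site update in strategy (2): assume a Gaussian prior $x_1 \sim \mathcal{N}(\mu_1, \tau_1^2)$ and the observation $y \sim \mathcal{N}(x_1 + x_2, \sigma^2)$. Conditioning on $x_2$, the pseudo-observation of $x_1$ is $y - x_2$ with noise variance $\sigma^2$, and the standard conjugate Gaussian update gives

```latex
x_1 \mid x_2, y \;\sim\; \mathcal{N}\!\left(
  \frac{\mu_1/\tau_1^2 + (y - x_2)/\sigma^2}{1/\tau_1^2 + 1/\sigma^2},
  \;\left(\frac{1}{\tau_1^2} + \frac{1}{\sigma^2}\right)^{\!-1}
\right)
```

with the symmetric update for $x_2$ using $y - x_1$. The blocked update of strategy (1) would instead sample $(x_1, x_2)$ jointly from the bivariate Gaussian posterior.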

Accurate name prior

A good prior distribution on names (first names, last names, etc., but also many other types of names, including place names) seems important whenever it is useful to model the possibility of typos occurring in names. If we model an observed name field using a typo model without an accurate name prior, it is easy for the model to infer that a correctly spelled name is actually a version of another name with typos introduced. I encountered this when writing a simple model of first names. Here is a minimal example:

PClean.@model CustomerModel begin

    @class FirstNames begin
        name ~ StringPrior(1, 60, all_given_names)
    end

    @class Person begin
        given_name ~ FirstNames
    end;

    @class Obs begin
        begin
            person ~ Person
            given_name ~ AddTypos(person.given_name.name)
        end
    end;

end;

query = @query CustomerModel.Obs [
    given_name person.given_name.name given_name
];

observations = [ObservedDataset(query, df)]
config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
@time begin 
    tr = initialize_trace(observations, config);
    run_inference!(tr, config)
end

Coming up with a good name prior seems like a very nontrivial task. Intuitively, a human performing this task would rely on prior experience with names, including common spellings, translations/transliterations, and knowledge of the variety of closely related names with common phonetic origins. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely across populations and sub-populations. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".

Intermediate steps could be to

  • Train a more accurate n-gram text model that is trained on a data set of names.

  • Train or find an existing deep generative model for names.
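As a sketch of the first bullet, a character-bigram model of names could be trained by counting with add-one smoothing. This is a minimal illustration (assuming ASCII names and a hypothetical training corpus), not a proposal for the actual prior:

```julia
# Count character bigrams over a corpus of names, padding each name with
# boundary markers '^' (start) and '$' (end). Assumes ASCII input.
function train_bigram(names::Vector{String})
    pair = Dict{Tuple{Char,Char},Int}()   # counts of (prev, next)
    ctx  = Dict{Char,Int}()               # counts of prev alone
    for name in names
        padded = "^" * lowercase(name) * "\$"
        for i in 1:length(padded)-1
            p, n = padded[i], padded[i+1]
            pair[(p, n)] = get(pair, (p, n), 0) + 1
            ctx[p] = get(ctx, p, 0) + 1
        end
    end
    pair, ctx
end

# Log-probability of a name under the bigram model, with add-one
# smoothing over an alphabet of size V (26 letters plus two markers).
function logprob(pair, ctx, name::String; V::Int = 28)
    padded = "^" * lowercase(name) * "\$"
    lp = 0.0
    for i in 1:length(padded)-1
        p, n = padded[i], padded[i+1]
        lp += log((get(pair, (p, n), 0) + 1) / (get(ctx, p, 0) + V))
    end
    lp
end
```

Trained on a list of first names, `logprob` would assign higher scores to name-like strings than to random ones; using such a model inside PClean would additionally require wrapping it in the distribution interface.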

Other steps that don't involve coming up with a name prior, but might mitigate the issue mentioned above might be:

  • Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue).

  • Use a large data set of names as a directly-observed table in the model. This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with that approach is that if a name is not observed at least once within this data set, it might be likely to be corrected to a name that is.)

  • Change the Pitman-Yor parameters for the underlying name table to better match statistics of real names, and more generally admit more rare names.

Also, a review of the potential consequences of a biased name prior, and of approaches to reduce bias in the name prior and/or mitigate its downstream consequences, could be valuable.

Explore adding split/merge moves during inference

Here are a bunch of not particularly organized notes I had lying around about this...

Existing approaches to Split Merge in the literature:

A split-merge algorithm is made up of several components:

  • [[Choice of component(s) to split or merge in MCMC]]
  • [[Allocation strategy for datapoints after a split in split-merge MCMC]]
  • [[Strategy for setting new component parameters when proposing a split or merge in MCMC]]

Some examples of existing approaches:

  • Jain and Neal: Split-Merge for DPs (Paper)
    • Choose components by uniformly selecting two chaperones.
    • Allocate datapoints via [[Intermediate scans for allocation of datapoints in split-merge split proposals]].
    • Conjugacy is assumed, so new component parameters are marginalized out. (A version exists for conditionally conjugate priors, which handles parameters during the intermediate scans.)
  • The Chaperones Algorithm
    • Choose components by selecting two chaperones, perhaps based on a similarity measure.
    • Allocate datapoints via sequential allocation—including of the two chaperones. (They may wind up in the same or different clusters, no matter how they started.)
    • Conjugacy assumed (Dirichlet-Categorical, and Dirichlet parameters are marginalized out).
  • Richardson and Green: Reversible-Jump Split-Merge
    • Choose components by flipping a coin to decide between split and merge, then uniformly choosing 1 or 2 distinct components.
    • Because parameters are explicit, allocation decisions are mutually independent, conditioned on parameters. When proposing a split, allocation is performed using weighted Bernoulli draws with probabilities computed according to the parameters of each component.
    • Parameters are proposed by altering existing components' parameters: e.g., averaging during a merge, and adding or subtracting a random quantity to or from each component during a split.

Choice of component(s) to split or merge in MCMC

[[Existing approaches to split-merge in the literature]] typically use one of the following strategies for deciding whether to split or merge, and if so, which components to target:

  • Choose two members ("anchors" or "chaperones") uniformly at random. If the two members are in the same component, we propose a split. If the two members are in different components, we propose a merge.
  • Choose two members ("anchors" or "chaperones") from a custom distribution based on similarity. This distribution must not depend on the current partition of members into components; it may also not depend on component parameters (except perhaps via summary statistics that the split/merge proposal is guaranteed to conserve).
  • Choose one or two components, based on the result of a coin flip. This is the typical reversible-jump split-merge algorithm. One may employ smart, state-dependent strategies for deciding which component(s) to select, but it may be necessary to use a [[Smart-Dumb Dumb-Smart strategy]] to maintain a reasonable acceptance rate.

Schemes based on "chaperones" or "anchors" must contend with the question, in their allocation strategy for datapoints after a split, of whether to require that the two chaperones be proposed as belonging to distinct components.
Problem: We want to choose anchor/chaperone objects that are "close but not that close"; these are the ones that the existing algorithm likely has trouble with. If there were a measure of distance that depended only on the non-submodel nodes, this would be valid (but would cost O(n^2) to evaluate all the distances). User-defined criteria for "closer inspection" could be an interesting way around this.

Allocation strategy for datapoints after a split in split-merge MCMC

There are at least two strategies in the literature:

Sequential allocation of datapoints after a split in split-merge MCMC
Intermediate scans for allocation of datapoints in split-merge split proposals

When proposing a split in a split-merge MCMC algorithm, it may be necessary to propose a data association: for each observation previously associated with the latent entity being split, which new entity should it be associated with? One strategy was described by this paper: Dirichlet-Process Split-Merge.

In this paper, a split is proposed when two randomly chosen observations lie in the same component. The basic "randomized split" algorithm then forces the two observations into two distinct components in the proposal, before assigning all the other elements. This algorithm has the obvious downside that good splits will rarely be proposed.

Fixed launch state argument. Jain and Neal begin by describing a strategy in which, after i and j are put in distinct components, the rest of the observations in S are allocated to the two components using some predetermined (fixed) strategy, yielding c^{launch}. Then a Restricted Gibbs scan is performed. Interestingly, this restricted Gibbs scan has its probabilities computed as part of the proposal distribution. This suggests that it needn't really be a Gibbs scan, but can just be a scan of “relatively good proposals." (In PClean, I think all these Gibbs proposals would happen based on 'surrogate' enumeration; this would all be part of the "smart" proposal.) For reasons I don't understand, the Gibbs sampling cannot modify the assignments of data points i and j. This is discussed as a weakness in Section 5, but I don't see why it's necessary. Perhaps it’s in order to “break the symmetry”: a particular split should only be possible to arrive at in one way. We use "the component that i is in" or "the component that j is in" as the 'names' of the two new components—useful in constructing a reverse move.

Random launch state. A "uniform" random launch state is permissible because it can be thought of as part of a state-independent "move selection." The paper argues that it is OK to replace the distribution of the launch state with several scans of restricted Gibbs—it claims this is also a "random launch state." (They claim that in computing the reverse move for a merge, it is necessary to do these n-1 scans -- i.e., actually sample them -- in order to generate a launch state.)

  • Note for PClean applicability: In a simple DPMM, the restricted Gibbs scan depends not at all on the rest of the model state. But in PClean, we do presumably depend on some of the rest of the state. But maybe it doesn't matter because that part of the state does not change at all during the running of the algorithm.
  • Can this be understood in terms of involutive MCMC? Probably.

The algorithm proceeds via random initialization, followed by restricted Gibbs scans. Then a final restricted Gibbs scan is done as "part of the proposal."

In a PClean context, re-marginalizing for each proposed combination will be a necessity for this to work. It would be useful to think through whether this is possible to do efficiently. Can the results of enumeration be incrementally updated?

There does appear to be a non-conjugate version, but it is actually only for conditionally conjugate models.

Selecting "anchor" or "chaperone" reference slots

  • One thing we can do is choose at random from among the keys that have the same submodel observations, because with different observations it will never be possible to merge. This is OK because the observations can't change via a split/merge move.
  • Another possibility is to develop a smart split / dumb merge kernel. (Good merges will never be accepted, but…) The smart split could be to choose objects with particularly low average ExternalLikelihoods. The dumb merge could be random.

Implementation questions

Can we measure how well average ExternalLikelihood detects good splits?

Can we invent a generic similarity metric, based on reference slot distribution?

  • “How well does the single best object (from enumeration) explain these two cells, on average?”
    • Note that we want "similar but not too similar," which might be hard to do, i.e., how to tune? We could fit a 3-component log-normal mixture... and choose from the middle component?

Refactor "preferred values" implementation

Currently some distributions support "preferred values" as an extra argument, and the distribution interface allows the distribution to decide its own enumerative proposal. This could probably be refactored to avoid having so much repeated code across distributions.

Delay proposals of values with no likelihoods in the current subproblem

Consider the program

class NameWithNickname begin
  true_name ~ string_prior(1, 30) preferring all_first_names
  nickname ~ string_prior(1, 30) preferring all_first_names
end
class Person begin
  fname ~ NameWithNickname
  lname ~ string_prior(1, 30) preferring all_last_names
end
class Record begin
  person ~ Person
  name ~ uniform([person.fname.true_name, person.fname.nickname])
end

The problem here is that when you first process a record, you are (by design) assumed to be observing either the person’s true first name, or their nickname. But PClean will try to initialize both latent fields. Suppose you see a person’s first name, and PClean gets it right that it’s a full first name. Then later you see their nickname in another record. You won’t be able to assign the new record to the same “person” object, because the “person” object will already have some (other, generated-from-the-prior) nickname.

If we can delay the proposal of the "other" latent until we have evidence for it, we could circumvent this issue, and do accurate inference in models like this.

This is also very relevant for data integration across multiple sources, where different sources may report different attributes.

Consider changing default subproblem blocking

In the paper, it is implied that by default, all attributes and reference slots of a class belong to the same subproblem. However, the current implementation uses the opposite convention: that by default, each attribute or reference slot is in its own subproblem, requiring manual blocking in order to define bigger subproblems. We should consider changing this to match the paper.
