py-why / dodiscover
[Experimental] Global causal discovery algorithms
Home Page: https://www.pywhy.org/dodiscover/
License: MIT License
Yep! It is any Callable with the BaseConditionalIndependenceTest API. However, I will end up refactoring all those to just "functions", as suggested by @bloebp and @petergtz. In a downstream PR this will look like the following: replace BaseConditionalIndependenceTest with Callable[[np.ndarray, np.ndarray, np.ndarray], float] for type hints in the dodiscover/ci/ submodule.
Originally posted by @adam2392 in #30 (comment)
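A minimal sketch of what such a function-style CI test could look like (the Fisher-z form here is just an illustration of the proposed `Callable[[np.ndarray, np.ndarray, np.ndarray], float]` signature, not existing dodiscover code):

```python
import numpy as np
from scipy import stats

def fisherz_ci_test(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> float:
    """Illustrative CI test matching Callable[[ndarray, ndarray, ndarray], float]."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    z = np.asarray(z, dtype=float)
    if z.size == 0:
        z = np.empty((len(x), 0))
    elif z.ndim == 1:
        z = z[:, None]
    if z.shape[1] > 0:
        # partial out the conditioning set via least-squares regression
        zc = np.column_stack([np.ones(len(x)), z])
        x = x - zc @ np.linalg.lstsq(zc, x, rcond=None)[0]
        y = y - zc @ np.linalg.lstsq(zc, y, rcond=None)[0]
    r = np.clip(np.corrcoef(x, y)[0, 1], -0.9999, 0.9999)
    # Fisher z-transform is approximately standard normal under the null
    stat = np.sqrt(len(x) - z.shape[1] - 3) * np.arctanh(r)
    return float(2 * stats.norm.sf(abs(stat)))
```

Any function with this shape could then be passed wherever the refactored API expects a CI test.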
Is your feature request related to a problem? Please describe.
We need a tutorial that demonstrates the utility of constraints in causal discovery.
Describe the solution you'd like
Update the notebook in this PR to work with constraints. This needs to be resolved first.
Describe alternatives you've considered
I think we could actually have multiple notebooks that address this.
Additional context
See links above.
Constraints are not working with the PC algorithm. See code here to reproduce.
This is probably an overlap with Issue 46 but I include it as a bug since we have code to reproduce.
Reproduced on the main branch.
No traceback, no error thrown, just got unexpected results.
OS: OS X
Python version: 3.10.8
pip freeze:
anyio==3.6.2
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.8
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cffi==1.15.1
contourpy==1.0.5
cycler==0.11.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
-e git+https://github.com/py-why/dodiscover.git@c89670d2fcd1cc3333fea5e6ea00b9a095bdb6d1#egg=dodiscover
dowhy==0.8
entrypoints==0.4
executing==1.1.1
fastjsonschema==2.16.2
fonttools==4.38.0
graphviz==0.20.1
idna==3.4
ipykernel==6.16.2
ipython==8.5.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.16.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-server==1.21.0
jupyter_client==7.4.4
jupyter_core==4.11.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
kiwisolver==1.4.4
MarkupSafe==2.1.1
matplotlib==3.6.1
matplotlib-inline==0.1.6
mistune==2.0.4
mpmath==1.2.1
nbclassic==0.4.5
nbclient==0.7.0
nbconvert==7.2.2
nbformat==5.7.0
nest-asyncio==1.5.6
networkx @ git+https://github.com/adam2392/networkx.git@100c93d213b1f00ac6dff67e5c02bb098060fd9e
notebook==6.5.1
notebook_shim==0.2.0
numpy==1.23.4
packaging==21.3
pandas==1.5.1
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
prometheus-client==0.15.0
prompt-toolkit==3.0.31
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
pydot==1.4.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2022.5
-e git+https://github.com/py-why/pywhy-graphs.git@a5b74ac31d0f02b8b7a35ef0ca01ea10d8436646#egg=pywhy_graphs
pyzmq==24.0.1
qtconsole==5.3.2
QtPy==2.2.1
scikit-learn==1.1.3
scipy==1.9.3
Send2Trash==1.8.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.5.1
statsmodels==0.13.2
sympy==1.11.1
terminado==0.17.0
threadpoolctl==3.1.0
tinycss2==1.2.1
tornado==6.2
tqdm==4.64.1
traitlets==5.5.0
typing_extensions==4.4.0
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
widgetsnbextension==4.0.3
Go through the notebook in [this PR](https://github.com//pull/67), and then run this code (imports added here for completeness; the exact module paths are best guesses):

```python
import networkx as nx
from dodiscover import make_context
from dodiscover.ci import GSquareCITest
from dodiscover.constraint import PC

# `data` is the discrete DataFrame loaded in the linked notebook
included_edges = nx.DiGraph([('S', 'L'), ('S', 'B')])
context = make_context().variables(data=data).edges(include=included_edges).build()

ci_estimator = GSquareCITest(data_type="discrete")
pc = PC(ci_estimator=ci_estimator)

def convert_to_int(df):
    # note: the original snippet mutated the global `data`; fixed to use `df`
    for var in df.columns:
        df[var] = [1 if x == "yes" else 0 for x in df[var]]
    return df

data_mod = convert_to_int(data)
pc.fit(data_mod, context)

graph = pc.graph_
draw(graph)  # plotting helper from the notebook
```
Is your feature request related to a problem? Please describe.
We need ways to evaluate the performance of causal discovery on some ground truth
Describe the solution you'd like
J. Peters and P. Bühlmann: "Structural intervention distance (SID) for evaluating causal graphs", Neural Computation 27, pages 771-799, 2015.
Describe alternatives you've considered
In causal terms, this is preferable to structural Hamming distance, which is also proposed in a separate issue.
Additional context
Is your feature request related to a problem? Please describe.
Implement GIN as a native algorithm.
Describe the solution you'd like
We have a wrapper around causal-learn for the GIN algorithm. The GIN algorithm uses kernel-based conditional independence tests. So do we. At the very least, we can leverage our KCI test and other CI tests. In my view, GIN is a constraint-based algorithm (that learns latents) and should live with the other constraint-based algorithms. @adam2392
A native implementation would help us apply constraints as well.
Describe alternatives you've considered
Could stay in a separate directory for causal rep related algorithms.
Is your feature request related to a problem? Please describe.
No causal discovery library worth its salt supports only observational causal discovery.
Describe the solution you'd like
Hard interventions: for data in a data frame, I imagined a list of lists, where the jth element of the outer list is a list of column indices that were intervened on in the jth row. Although, this already feels like there is a missing abstraction that needs definition.
Soft interventions: a list, where the jth element corresponds to the value the soft intervention had in the jth row of the data.
Describe alternatives you've considered
Perhaps an intervention column in the data itself.
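To make the list-of-lists idea concrete, here is a rough sketch (the helper name and representation are hypothetical, not settled API) that turns per-row intervention targets into a boolean mask aligned with the data:

```python
import pandas as pd

def intervention_mask(df, interventions):
    """Hypothetical helper: interventions[j] lists the column indices
    that were hard-intervened on in row j of df."""
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    for row, cols in enumerate(interventions):
        if cols:
            mask.iloc[row, cols] = True
    return mask

df = pd.DataFrame({"x": [0.1, 0.5, 0.9], "y": [1.0, 0.2, 0.4]})
# row 0: observational; row 1: do(x); row 2: do(x) and do(y)
mask = intervention_mask(df, [[], [0], [0, 1]])
```

A mask like this could also be derived from an intervention column stored in the data itself, so the two alternatives are interchangeable downstream.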
Right now, I believe the PC algorithm requires discrete variables to be integers instead of characters.
I tried running PC on this data:
| A | S | T | L | B | E | X | D |
|---|---|---|---|---|---|---|---|
| no | yes | no | no | yes | no | no | yes |
| no | yes | no | no | no | no | no | no |
| no | no | yes | no | no | yes | yes | yes |
| no | no | no | no | yes | no | no | yes |
| no | no | no | no | no | no | no | yes |
But it threw an error. To get it to work, I had to convert the values to ints:

```python
def convert_to_int(df):
    # note: fixed to operate on `df` rather than the global `data`
    for var in df.columns:
        df[var] = [1 if x == "yes" else 0 for x in df[var]]
    return df

data_mod = convert_to_int(data)
pc.fit(data_mod, context)
```

Calling this a bug. pc.fit(data, context) should work.
LGTM. My only issues are with some of the type hints.
@adam2392 can you comment here on why some flavors are left out? For example:
Originally posted by @robertness in #16 (review)
Is your feature request related to a problem? Please describe.
Implement: https://www.jmlr.org/papers/volume10/li09a/li09a.pdf
Describe the solution you'd like
Basically, perform an FDR procedure on the sets of p-values over the entire graph.
Note that the storage cost will go up, since we need to keep track of these values.
Additional context
Came up in #18
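As a minimal sketch of the underlying step, the Benjamini-Hochberg step-up procedure over a flat array of edge p-values could look like this (integration with the graph data structures is left open):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up BH procedure; returns a boolean rejection mask per hypothesis."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # rank-scaled thresholds: alpha * i / m for rank i = 1..m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject
```

In the proposed feature, the p-values would be those recorded per candidate edge during skeleton learning, and the mask would decide which adjacencies survive.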
The general issue here is that many structure learning algorithms are sort of "state dependent" and thus it is difficult to parse what is occurring before the final result (i.e. graph) is returned. Logging is a way for developers and users to debug their pipeline, but it should also be easy to turn on/off various levels of messaging.
I think we can possibly emulate something like MNE-Python's solution, or some other package's: https://mne.tools/stable/logging.html
"There should be an option to turn off logging information. Maybe make the logger an optional parameter that can be passed or add flag to turn it off. (This applies to all logging/print information)"
Originally posted by @bloebp in #20 (comment)
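A sketch of the MNE-style approach, using a package-level logger (the logger name and helper function are assumptions, not existing dodiscover API):

```python
import logging

logger = logging.getLogger("dodiscover")  # hypothetical package-level logger
logger.addHandler(logging.NullHandler())  # silent unless the user opts in

def set_log_level(verbose=True):
    """Toggle verbosity globally, similar in spirit to mne.set_log_level."""
    if isinstance(verbose, bool):
        level = logging.INFO if verbose else logging.WARNING
    else:
        # accept string levels such as "debug" or "error"
        level = getattr(logging, str(verbose).upper())
    logger.setLevel(level)
```

Algorithms would then call `logger.info(...)`/`logger.debug(...)` instead of `print`, and a per-estimator `verbose` keyword could temporarily override the global level.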
Is your feature request related to a problem? Please describe.
We need notebooks for our algorithms.
Describe the solution you'd like
A notebook describing the FCI algorithm on a public benchmark dataset
Is your feature request related to a problem? Please describe.
Add causal constraints to Causica.
Most likely we will be moving the CI tests from dodiscover/dowhy to a new CI testing repo. Then this thread talks about some of the design choices of type of input to expect.
Originally posted by @adam2392 in #25 (comment)
https://app.codecov.io/gh/py-why/dodiscover/tree/main/dodiscover shows the coverage from our unit testing. A robust software package should aim for at least 90% test coverage.
This is feasible for us given that many of the APIs and algorithms have expected usage. The main components that we DEFINITELY should improve upon are:
- Context is a central ingredient of our API, so we should have full test coverage of its expected errors and edge cases.

The first two should be relatively straightforward upon inspection of the codecov files. However, adding additional tests to the causal discovery algos may require some more thinking, since these need to test different parametrizations of the algorithms.
We are interested in a new user contribution of an example where they demonstrate how to use the ClassifierCITest
with a scikit-learn classifier and a neural-net based classifier.
This would enable CI testing over high-dimensional objects, such as images and pixels for example. The example does not necessarily need to operate over images/pixels.
Ideally the PyTorch model is already trained and fitted, and is just pulled from something like Hugging Face. We can introduce a test-time dependency on PyTorch and Hugging Face. The model should be relatively small, so it can run on the continuous-integration platform.
"Could we expand this method to work with a neural-net based classifier using Keras or PyTorch? It likely feels like overkill, but it would be a first step to telling a story to the broader community that this repo is using cutting-edge "deep" methods. We could also workshop an example with multidimensional variables later (e.g. pixels), though not in this PR.
Originally posted by @robertness in #28 (review)"
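The core classifier-based trick can be sketched independently of the dodiscover API (function name and details here are illustrative, not the ClassifierCITest signature): train a classifier to distinguish real samples from samples where Y has been permuted; test accuracy near chance suggests X and Y are independent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def classifier_independence_score(x, y, clf=None, seed=0):
    """Accuracy of real-vs-permuted classification; ~0.5 when X is independent of Y."""
    rng = np.random.default_rng(seed)
    real = np.column_stack([x, y])
    fake = np.column_stack([x, rng.permutation(y)])  # break any X-Y dependence
    X = np.vstack([real, fake])
    labels = np.r_[np.ones(len(real)), np.zeros(len(fake))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=seed)
    clf = clf if clf is not None else RandomForestClassifier(
        n_estimators=100, random_state=seed
    )
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```

Any scikit-learn-compatible classifier (including a pretrained neural network wrapped to expose fit/score) can be dropped in, which is the substitution the requested example would demonstrate.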
LGTM. One comment. The CMI value is calculated as follows:

```python
val = hxyz - (hxz + hyz - hz).mean()
```

Doesn't this have an asymptotic chi-squared distribution under the null hypothesis? If so, should there be an option to calculate the p-value that way?
Originally posted by @robertness in #85 (review)
Hi @adam2392,
just as a heads-up: by using type hints in
dodiscover/dodiscover/_protocol.py
Line 10 in 3c4f8d2
it could be that type checkers will report a type violation when passing in a networkx Graph for this protocol, because networkx does not use type hints. We tried that originally and then removed them again, because IntelliJ would show an error, and I'd assume other type checkers will do the same.
Feel free to close this issue after reading, just wanted to make you aware of it.
One may easily extend the code in dodiscover/metrics.py for the confusion_matrix_networks function to allow comparisons based on edge directionality. This is an ALL-OR-NONE type of comparison, where we either test for the presence of an adjacency, or we test for absolute correctness of the adjacency and the directionality of the edge.
Right now, the function only implements testing for the presence of an adjacency. See Robert's comments for more details.
I would say either you have an option for strict matching of edge type, or just presence of an edge. So if the true edge is A <- B and the detected edge is A -> B or A - B, then this would be a false negative in the former case and a true positive in the latter. I think you're just doing the latter case. I think that is the preferred case, and it's fine if we just have that.
Originally posted by @robertness in #48 (comment)
Reposting from @bloebp and Jaron
https://github.com/cmu-phil/example-causal-datasets for example datasets
Below are real/synthetic datasets used in dowhy.
"You can generate new ones (see the code at the bottom). Otherwise, the notebook uses this:
Latencies:
https://github.com/py-why/dowhy/blob/main/docs/source/example_notebooks/rca_microservice_architecture_latencies.csv
Supply chain:
https://github.com/py-why/dowhy/blob/main/docs/source/example_notebooks/supply_chain_week_over_week.csv"
Based on the larger discussion today, we would like toy examples in the examples/ directory for communicating core concepts, and tutorials/ for larger, more involved code and documentation that works with more complex datasets, such as the above.
What would be nice is a tutorial for the key structure learning algorithms that leverages the above two datasets and demonstrates the difference in output.
In addition, it would be nice to have a short, sweet example for each core algo.
This Issue is just keeping track of that discussion.
Is your feature request related to a problem? Please describe.
We need more ways to compare a discovered graph to a ground truth graph.
Describe the solution you'd like
Overlay two graphs using graph viz, and use different edge colors to indicate:
Describe alternatives you've considered
We've proposed structural Hamming distance and confusion matrices. However, graph visualization is more intuitive, and it shows you exactly where in the graph you made your errors, which can be much more informative than a single statistic.
Additional context
This exists in bnlearn though it is not clear to the user which graph is the graph being evaluated (the discovered graph) and which graph is the reference graph (the ground truth).
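The classification bookkeeping behind such an overlay can be sketched like this (color choices and function name are placeholders; actual rendering would pass these colors to graphviz):

```python
import networkx as nx

def edge_colors(truth, learned):
    """Label each edge for a colored overlay of a learned graph vs. ground truth."""
    colors = {}
    for edge in set(truth.edges) | set(learned.edges):
        in_truth, in_learned = edge in truth.edges, edge in learned.edges
        if in_truth and in_learned:
            colors[edge] = "black"  # true positive
        elif in_learned:
            colors[edge] = "red"    # false positive (extra edge)
        else:
            colors[edge] = "blue"   # false negative (missed edge)
    return colors

truth = nx.DiGraph([("A", "B"), ("B", "C")])
learned = nx.DiGraph([("A", "B"), ("C", "D")])
colors = edge_colors(truth, learned)
```

Making the truth/learned argument order explicit in the signature also addresses the bnlearn ambiguity noted above about which graph is the reference.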
The team has proposed an API for causal discovery in: https://github.com/py-why/dowhy/wiki/API-for-Global-Causal-Discovery
Porting over a relevant example algorithm such as the PC algorithm will require:
Ask @adam2392 for details. Involves refactoring how the algorithm deals with colliders. Requires only one method to be rewritten.
Is your feature request related to a problem? Please describe.
Would like the graphviz.compare feature from bnlearn.
graphviz.compare() plots one or more figures and returns invisibly a list containing the graph objects generated from the networks that were passed as arguments (in the same order). They can be further modified using the graph and Rgraphviz packages.
graphviz.compare(groundtruth, learned_graph)
Describe the solution you'd like
Basically, you provide a ground truth graph and a discovered graph. Edge color and type is used to
You get a false positive in the case of comparing a directed edge against an undirected edge, e.g.:

```r
net <- model2network("[B][A|B]")
graphviz.compare(cpdag(net), net)
```
Describe alternatives you've considered
We have some other graph comparison
Is your feature request related to a problem? Please describe.
Need to apply constraints, if possible, to the GIN algorithm. If this is not possible with the causal-learn interface, we need to surface that to the user.
Describe the solution you'd like
Constraints in the context variable should be applied to GIN learning.
Describe alternatives you've considered
Could just have a warning message.
Additional context
Initial GIN wrapper PR
Hi @adam2392,
I recommend renaming
dodiscover/dodiscover/_protocol.py
Line 6 in 3c4f8d2
to just Graph. In Java it became kind of an anti-pattern to name an entity ISomeEntity for interfaces, and I would apply the same here. The reason is that it gives you less flexibility in your design to change from a concrete class to a protocol or back, because you'd have to change the name too, everywhere it's used.
Is your feature request related to a problem? Please describe.
Currently, the FCI algorithm implementation does not implement rules R5-R8, which account for learning graphs with selection bias.
It would be a straightforward issue for someone to add these to the existing implementation.
Describe the solution you'd like
Implement R5-8 of the FCI algorithm. See publication.
- Implement _apply_rule5, _apply_rule6 and _apply_rule7
- Hook them into _apply_rules_1to10
- Add unit tests in tests/unit_tests/constraint/test_fcialg.py

Probably the most involved part would be implementing the unit tests. The current graph API is a work in progress.
Required and excluded edges are not taken into account at the orientation phase (i.e. setting edge directions) of constraint-based structure learning algos.
There are a few challenges though to tackle:
"Required edges and excluded edges are built into the LearnSkeleton class right now, meaning they do not account for directionality. However, these can also be taken into account at the orientation phase...
There currently is a tradeoff that theory has yet to answer:
^ the nuanced challenge here is that the resulting object is NOT a CPDAG/PAG, and thus would not be expected to work with the downstream ID/estimation algorithms that assume an input CPDAG/PAG. I'm leaning towards attaching a warning string to the graph object that, when printed out, warns users not to use it in ID/estimation if they set prior edges that might break the conditional-independence statements.
Originally posted by @adam2392 in #30 (comment)"
Is your feature request related to a problem? Please describe.
The Psi-FCI algorithm is an extension of the FCI algorithm that is complete for Psi-PAGs. The algorithm can learn a Psi-PAG given a combination of observational and soft-interventional datasets. This will be a good introduction of a more complex multi-dataset algorithm that we can start taking advantage of to flesh out the Context/data API. Moreover, it will help us test the limits of the existing graph and dowhy.gcm API.
We should implement the algorithm as outlined in https://causalai.net/r67.pdf.
Describe the solution you'd like
The Psi-FCI should subclass the existing FCI algorithm. In terms of graphical structure, there is additional graph API to consider, the main part being the encoding of what are known as "F-nodes". Another issue to sort out is how to test for conditional distribution differences (i.e. is P_1(y|x) = P_2(y|x)?), which is a central problem within Psi-FCI and other multi-dataset structure learning algos.
- testing P_1(y|x) = P_2(y|x)
- the fit(data, context) / learn_graph(data, context) API

Additional context
cc: @robertness @darthtrevino
Is your feature request related to a problem? Please describe.
Before implementing a native version, a first step could be to demonstrate the usage of GIN by, for example, replicating one of the simulation results in the original paper. One should note that the algorithm only has guarantees when the observed variables have no direct causal relationships with each other; this should be noted in the tutorial.
Afterwards, a native implementation (#87) can take place that can use the unit tests and the tutorial to check for correctness.
Describe the solution you'd like
Add a notebook to docs/tutorials, or add a short Python script example to examples/. I think since this is "explaining an algorithm", it would be more "in-depth" and should therefore be a tutorial.
Additional context
cc: @robertness wanna add this, since you're more familiar?
In constraint-based causal discovery, there are various types of "conditional testing". In traditional algos, like PC/FCI, one typically has variables of the same type (i.e. binary, discrete, continuous) and one is interested in the conditional independence test:
$H_0: X \perp Y | Z$
On the other hand, modern algorithms that leverage multiple datasets, such as the Psi-FCI algorithm, inherently want to perform a conditional k-sample test:
$H_0: P_t(Y|X) = P_{t'}(Y|X)$
i.e., testing for differences in the conditional distributions of Y|X.
Right now the API is slightly different, but they could be consolidated.
The test(..) API could probably have the following:
- df: the dataset
- x_vars: X variable(s)
- y_vars: Y variable(s)
- z_vars: Z variables in CI testing, or the group indicator in CD testing

"That directory handles CI testing, which tests $H_0: P(y | x, z) = P(y | z)$, where typically x, y, z have to be of the same variable type (i.e. binary, discrete, continuous), while conditional discrepancy (CD) tests $H_0: P(y | x) = P'(y | x)$, or formulated as $H_0: P(y|x,t) = P(y|x)$, where t is discrete and y, x can be continuous.
I guess they could... be formulated as the same arguments in the API. Perhaps I can refactor the code, if there is a lot of similarity between the ci and cd tests, into a common conditional_testing submodule."
Originally posted by @adam2392 in #82 (comment)
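One way the consolidated signature could be expressed (a sketch using typing.Protocol; the names follow the proposal above but are not settled API):

```python
from typing import Optional, Protocol, Set, Tuple, runtime_checkable

import pandas as pd

@runtime_checkable
class ConditionalTest(Protocol):
    """Shared surface for CI tests and conditional-discrepancy (CD) tests."""

    def test(
        self,
        df: pd.DataFrame,
        x_vars: Set[str],
        y_vars: Set[str],
        z_vars: Optional[Set[str]] = None,  # conditioning set, or group indicator in CD
    ) -> Tuple[float, float]:  # (test statistic, pvalue)
        ...
```

Both the existing ci and cd implementations could then satisfy this one protocol, which is what would let them live in a common conditional_testing submodule.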
Is your feature request related to a problem? Please describe.
We need more ways to compare a discovered graph to a ground truth graph.
Describe the solution you'd like
Structural Hamming distance (SHD) is one solution. In simple terms, it is the number of edge insertions, deletions, or flips needed to transform one graph into another. Tsamardinos I, Brown LE, Aliferis CF (2006). "The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm". Machine Learning, 65(1):31–78.
Describe alternatives you've considered
Treating edges as labels and providing a confusion matrix is already a work in progress.
Created an issue for comparing graphs through visualization
Additional context
Implementation in CDT
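A minimal sketch of SHD over two networkx DiGraphs, counting a reversed edge once (conventions vary across implementations; CDT's is one reference point):

```python
import networkx as nx

def structural_hamming_distance(true_graph, learned_graph):
    """Edge insertions + deletions + flips to turn learned_graph into true_graph."""
    true_edges = set(true_graph.edges)
    learned_edges = set(learned_graph.edges)
    # missing edges, with flipped edges counted once here
    dist = len(true_edges - learned_edges)
    # extra edges that are not merely flips of a true edge
    dist += sum(1 for u, v in learned_edges - true_edges if (v, u) not in true_edges)
    return dist
```

A CPDAG-aware variant would additionally treat undirected edges specially, which is where the confusion-matrix discussion above overlaps.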
Tracking what it will take to do an initial release for v0.1. We should do a simultaneous release of pywhy-graphs v0.1, so that the two stay in sync.
Some left-over PRs I want to get merged:
- pin pywhy-graphs to v0.1 once we release there
- Context and the constraint-based causal discovery

Other possible improvements we can stack in are #29, but this really also requires someone to champion it.
The classifier-based CI tests are nice because they are generally non-parametric, but slow, because for each test you need to re-fit a classifier. For modular components, we should optionally cache predictions.
For example:
Another issue is that we should ensure the classification model is refit from scratch each time ci_estimator.test(...) is called. Downstream classification models in scikit-learn might support warm starts, and this may be the subject
Create a nested dictionary
For CCMI:
For CCIT:
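The caching idea could be sketched like this (class and callback names are hypothetical): key the expensive fit-and-predict step on the variable sets, so repeated tests over the same variables skip the refit.

```python
class PredictionCache:
    """Hypothetical memoization layer for classifier-based CI tests."""

    def __init__(self, fit_predict):
        # fit_predict(df, x_vars, y_vars, z_vars) -> predictions (the expensive step)
        self._fit_predict = fit_predict
        self._cache = {}
        self.n_fits = 0

    def predictions(self, df, x_vars, y_vars, z_vars=()):
        # frozensets make the key order-independent
        key = (frozenset(x_vars), frozenset(y_vars), frozenset(z_vars))
        if key not in self._cache:
            self.n_fits += 1
            self._cache[key] = self._fit_predict(df, x_vars, y_vars, z_vars)
        return self._cache[key]
```

The cache must be cleared whenever the data changes, which is one reason refitting from scratch on every test(...) call is the safe default.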
Is your feature request related to a problem? Please describe.
We need an initial score-based algorithm.
Describe the solution you'd like
Use causal-learn as a dependency to build GES
Describe alternatives you've considered
We could build it on our own, but causal-learn has a solid implementation and there is no need to reinvent the wheel
Is your feature request related to a problem? Please describe.
Add a basic wrapper to the DECI algorithm in the causica library.
Describe the solution you'd like
At a first pass, implement with no constraints and no intervention model. Make this an optional dependency.
Describe alternatives you've considered
Causica is a good example of a SCM-based causal discovery algorithm that utilizes deep learning.
The results of causal discovery algorithms should be comparable to one another, such that a user can see how well an algorithm performed without having to understand the details of the algorithm.
Currently, we do copying of Context within the ContextBuilder. I think this is fine for simplifying the API of Context, but adding new properties to Context then necessarily means we must update this one area of copying, making it possible for bugs to be introduced.
```python
def make_context(context: Optional[Context] = None) -> ContextBuilder:
    result = ContextBuilder()
    if context is not None:
        # this part needs to be updated manually every time
        result.graph(deepcopy(context.init_graph))
        result.edges(deepcopy(context.included_edges), deepcopy(context.excluded_edges))
        result.variables(copy(context.observed_variables), copy(context.latent_variables))
        result.state_variables(deepcopy(context.state_variables))
    return result
```
Instead, I propose to dynamically get all possible parameter names, similar to how scikit-learn does it. We would add _get_params() and _set_params() to the Context class, and then change the above LOC to:
```python
def make_context(context: Optional[Context] = None) -> ContextBuilder:
    result = ContextBuilder()
    if context is not None:
        context_params = context.get_params()
        result.set_params(context_params)
    return result
```
We can make get_params/set_params private to discourage users from manipulating their context classes themselves.
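A sketch of the scikit-learn-style introspection (using a toy stand-in class; it assumes Context's constructor parameters mirror its attributes, which is how sklearn's BaseEstimator.get_params works):

```python
import inspect
from copy import deepcopy

class Context:
    """Toy stand-in illustrating the proposed introspection-based _get_params."""

    def __init__(self, init_graph=None, included_edges=None, observed_variables=None):
        self.init_graph = init_graph
        self.included_edges = included_edges
        self.observed_variables = observed_variables

    def _get_params(self):
        # discover parameter names from the constructor signature, like sklearn
        names = [
            p for p in inspect.signature(type(self).__init__).parameters if p != "self"
        ]
        return {name: deepcopy(getattr(self, name)) for name in names}

ctx = Context(observed_variables={"x", "y"})
params = ctx._get_params()
```

With this, make_context never needs updating when a new Context property is added, as long as the new property appears in __init__.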
Implement the GIN causal discovery algorithm. GIN learns latent causes that cluster observed variables. It also learns causal orderings between those latent causes.
References:
This epic targets a variety of baseline causal discovery algorithms in dodiscover. The goal is to have a MVP as a discovery library.
Implement an algorithm from the class of causal discovery algorithms that learn a hierarchical tree over an observed set of variables.
The primary reference is: Xie, Huang, Chen, He, Geng, Zhang, "Estimation of Linear Non-Gaussian Latent Hierarchical Structure," arXiv 2022.
Related algorithms: