mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.
Home Page: http://causalnex.readthedocs.io/
License: Other
If a node has only one parent (e.g. A -> B), this node is always assigned a flat distribution when we fit the probabilities.
I dug in and found out that the problem comes from pgmpy. I will raise the same issue there too, but I am not sure how we want to handle it in CausalNex in the meantime.
import numpy as np
import pandas as pd
from causalnex.structure import StructureModel
from causalnex.network import BayesianNetwork
sm = StructureModel()
sm.add_edge('A','B')
np.random.seed(11)
vals = [1,2,3]
A = np.random.choice(vals,size=3000,p=[.1,.3,.6])
B = [np.random.choice(vals,p=[.9,.05,.05]) if a==1 else # you can put any values here, the result will be the same
np.random.choice(vals,p=[.1,.2,.7]) if a==2 else
np.random.choice(vals,p=[.85,.1,.05]) if a==3 else
np.random.choice(vals) for a in A]
df = pd.DataFrame([A,B],index=['A','B']).T
#####
bn = BayesianNetwork(sm)
bn = bn.fit_node_states(df)
bn = bn.fit_cpds(df, method="MaximumLikelihoodEstimator")
print(bn.cpds['B'].round(decimals=2))
Expected output (matching the distributions used to generate the data):
A       1     2     3
B
1    0.92  0.11  0.85
2    0.02  0.21  0.10
3    0.06  0.68  0.05

Actual output (flat distribution):
A       1     2     3
B
1    0.33  0.33  0.33
2    0.33  0.33  0.33
3    0.33  0.33  0.33
* CausalNex version used (`pip show causalnex`):
Name: causalnex
Version: 0.5.0
Summary: Toolkit for causal reasoning (Bayesian Networks / Inference)
Home-page: https://github.com/quantumblacklabs/causalnex
Author: QuantumBlack Labs
Author-email: [email protected]
License: Apache Software License (Apache 2.0)
Location: /opt/anaconda3/envs/rehoww/lib/python3.6/site-packages
Requires: scikit-learn, pandas, prettytable, wrapt, pgmpy, scipy, numpy, networkx
Required-by:
* Python version used (`python -V`):
Python 3.6.10 :: Anaconda, Inc.
* Operating system and version: macOS
* pandas version: 0.24
## CAUSE:
This comes from pgmpy, specifically the file `pgmpy/estimators/base.py`, around line 127.
parents_states = [self.state_names[parent] for parent in parents]
state_count_data = data.groupby([variable] + parents).size().unstack(parents)
row_index = self.state_names[variable]
column_index = pd.MultiIndex.from_product(parents_states, names=parents)
state_counts = state_count_data.reindex(index=row_index, columns=column_index).fillna(0) # <----Where the error is
If the node has more than one parent, the `state_count_data` columns will be a `MultiIndex` from the start, so `state_count_data.reindex(..., columns=column_index)` causes no problem.
If the node has one single parent, however, the `state_count_data` columns will not be a `MultiIndex`, but a "normal" index. In that case, `state_count_data.reindex(..., columns=column_index)` produces a dataframe full of NaNs.
## Dirty solution:
Convert `state_count_data.columns` to a `MultiIndex` before reindexing:
parents_states = [self.state_names[parent] for parent in parents]
state_count_data = data.groupby([variable] + parents).size().unstack(parents)
row_index = self.state_names[variable]
if len(parents) == 1:  ## ADD THIS IF CONDITION
    state_count_data.columns = pd.MultiIndex.from_product([state_count_data.columns], names=parents)
column_index = pd.MultiIndex.from_product(parents_states, names=parents)
state_counts = state_count_data.reindex(index=row_index, columns=column_index).fillna(0)
A ValueError is raised when trying to access `bn.cpds` after probabilities have been fit. This occurs when there are fewer states in the data given to `fit_cpds` than to `fit_node_states`. For example, when `fit_node_states` is called with the full dataset and `fit_cpds` with only a training portion.
It prevents me from inspecting the CPDs fit using training data.
from causalnex.structure import StructureModel
from causalnex.network import BayesianNetwork
import pandas as pd
sm = StructureModel([("a", "b"), ("c", "b")])
bn = BayesianNetwork(sm)
train = pd.DataFrame(data=[[0, 0, 1], [1, 0, 1], [1, 1, 1]], columns=["a", "b", "c"])
test = pd.DataFrame(data=[[0, 0, 1], [1, 0, 1], [1, 1, 2]], columns=["a", "b", "c"])
data = pd.concat([train, test])
bn.fit_node_states(data)
bn.fit_cpds(train)
bn.cpds
The CPDs should be available, but instead a ValueError is received:
ValueError: cannot reshape array of size 4 into shape (2,4)
Include as many relevant details about the environment in which you experienced the bug:
pip show causalnex
Name: causalnex
Version: 0.4.3
Summary: Toolkit for causal reasoning (Bayesian Networks / Inference)
Home-page: https://github.com/quantumblacklabs/causalnex
Author: QuantumBlack Labs
Author-email: [email protected]
License: Apache Software License (Apache 2.0)
Location: /Users/ben_horsburgh/opt/anaconda3/lib/python3.7/site-packages
Requires: networkx, scikit-learn, pgmpy, matplotlib, pandas, numpy, wrapt, prettytable, scipy
Required-by:
Python 3.7.4
MacOS 10.15.2
Hello, is there a way to define the CPTs (Conditional Probability Tables) manually? I couldn't find anything related to this in the documentation.
If there isn't, it would be interesting to have this feature.
Thank you all for the great library.
make lint fails if black is not installed. It fails with the message "permission denied".
This is not a huge issue (we can install it on our own), but I believe the expected behaviour is for make lint to install black if Python >= 3.6 when cloning the repo and contributing.
I tested initialising environments with Python 3.5, 3.6 and 3.7, installing the requirements and then running make lint:
3.5: all passed
3.6 and 3.7: the following error message
Traceback (most recent call last):
File "/opt/anaconda3/envs/test_37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/anaconda3/envs/test_37/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/gabriel_azevedo_ferreira/Documents/Projects/CAUSALNEX_RND/causalnex/tools/min_version.py", line 46, in <module>
subprocess.run(run_cmd, check=True)
File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 488, in run
with Popen(*popenargs, **kwargs) as process:
File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/opt/anaconda3/envs/test_37/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'black'
Include as many relevant details about the environment in which you experienced the bug:
* Python version used (`python -V`):

CausalNex should give the option to treat missing values as a category for each variable of interest.
NaN or missing values are quite common in many studies and can reveal a lot of information in themselves.
Following the tutorial, the line below raises an error:
bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")
It would be great to have a fully working jupyter notebook as an example.
/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in fit_cpds(self, data, method, bayes_prior, equivalent_sample_size)
344
345 transformed_data = data.copy(deep=True) # type: pd.DataFrame
--> 346 transformed_data = self._state_to_index(transformed_data[self.nodes])
347
348 if method == "MaximumLikelihoodEstimator":
/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in _state_to_index(self, df, nodes)
307 cols = nodes if nodes else df.columns
308 for col in cols:
--> 309 df[col] = df[col].map(self._node_states[col])
310 df.is_copy = True
311 return df
TypeError: 'NoneType' object is not subscriptable
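Based on the traceback, `self._node_states` is None, which suggests the node states were never fitted before fitting the CPDs. A hedged sketch of the likely fix, following the tutorial's API (the variable name `discretised_data` is an assumption for the full discretised dataset):

# Possible fix (assumption: _node_states is None because fit_node_states was never called):
# fit the node states on the full dataset before fitting the CPDs.
bn = bn.fit_node_states(discretised_data)
bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")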
* CausalNex version used (`pip show causalnex`): causalnex==0.4.3
* Python version used (`python -V`): 3.7

When we call `Discretiser(method='uniform', num_buckets=5)` and `Discretiser(method='quantile', num_buckets=5)`, it seems we always get the same result. Concretely, both appear to apply the quantile discretisation method. It would be better for the uniform discretiser to produce an actual uniform (equal-width) discretisation.
I wanted to use uniform discretisation as a discretiser, but I could not achieve it with the implemented Discretiser.
from causalnex.discretiser import Discretiser
import numpy as np
data = np.random.normal(0, 1, 10000)
Discretiser(method='uniform', num_buckets=5).fit(data).numeric_split_points
Discretiser(method='quantile', num_buckets=5).fit(data).numeric_split_points
Expected: the uniform discretiser produces numeric split points with a constant delta between them (cf. sklearn's KBinsDiscretizer).
Actual: the uniform discretiser produces numeric split points that give each bucket the same number of data points, which is the same as what the quantile discretiser does.
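For illustration, a small numpy-only sketch (not using the causalnex API) of the difference between equal-width and quantile split points; variable names here are just for this example:

import numpy as np

data = np.random.normal(0, 1, 10000)
num_buckets = 5

# Equal-width split points: constant delta between consecutive points
uniform_splits = np.linspace(data.min(), data.max(), num_buckets + 1)[1:-1]

# Equal-frequency (quantile) split points: equal number of samples per bucket
quantile_splits = np.quantile(data, np.linspace(0, 1, num_buckets + 1)[1:-1])

print(uniform_splits)   # evenly spaced values
print(quantile_splits)  # denser near the centre of the normal distribution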
causalnex==0.7.0
python 3.7.7
ProductName: Mac OS X
ProductVersion: 10.15.5
First, thanks for the great package!
Using the InferenceEngine I can "do" and obtain negative probabilities.
I was trying something like this from the example in the tutorial:
ie.do_intervention("higher", {'yes': x, 'no': 1-x})
print("updated marginal G1", ie.query()["G1"])
Here, I can set values for x outside the range [0, 1], which I think is not ideal.
A more relevant question occurred to me from observing that the output of ie.query()["G1"] after the intervention is a perfectly linear function of x, extending into the negative. Are the CPDs in your model linear functions by design? Shouldn't these functions be bounded?
Firstly, congrats on open sourcing this project, awesome work everyone! 🙂 Just a quick issue: it seems like there are some broken links in the README for the tutorial section.
Loads correct tutorial page
404
Do you think it is appropriate to use CausalNex with time series data (~55 years of annual records, pct change applied)? I know the website recommends using CausalNex with at least 1000 instances, but I keep the number of nodes at 5 or 6, so I think maybe it can work. But I'm not really sure about that.
What are the edge weights representing? How important are their quantities? And what should we understand if they are positive or negative?
How should we decide on the threshold?
I would be really glad if you can help. Thanks.
Dear Team,
I have a continuous dataset with ~850 samples and 13 variables (including 1 target variable). I would like to identify the causes of the effect (target variable). The tutorial section has nicely explained steps to work with Causalnex. I am using causalnex 0.4.3.
Following this, the Structure Model has been created manually by including possible causes and the relationships among them. All other steps given in the tutorial have been followed, and the model works well up to the Predictions step (i.e. without any errors). However, the final evaluation metrics are not very good. For instance, the accuracy of Predictions vs Train_set and Predictions vs Test_set for minority/less-majority classes is always somewhere around 60-70%.
I have experimented with several different combinations of modifications to the model.
I am struggling to figure out ways to optimise the model, and I could not find any supporting material on the Internet for this.
Are there any guidelines or best practices for data pre-processing techniques and dataset requirements that particularly work for this library/model? Could you please suggest any troubleshooting methods to identify the cause of the low accuracy/metrics?
Any ideas would be of great help! Thanks in Advance.
Hi there,
Just wondering if there are any plans for causalnex to support Python 3.8? I've had a look at the dependencies in the requirements.txt file and they all seem to support Python 3.8 on PyPI.
Thanks, and great documentation by the way!
Hi, I have been working with the library, especially with NOTEARS, and while running experiments I observed that the structure learned from the raw data can be different from the structure learned from normalised data (MinMaxScaler). Is there any reason for the method to be affected by scale? From reading the paper I couldn't figure out a reason. Or is there some mistake in my application?
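A minimal sketch of the comparison being described (assuming the standard `from_pandas` entry point in causalnex.structure.notears; the file name "my_data.csv" is hypothetical):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from causalnex.structure.notears import from_pandas

df = pd.read_csv("my_data.csv")  # hypothetical dataset

# structure learned from raw data
sm_raw = from_pandas(df)

# structure learned from min-max scaled data
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
sm_scaled = from_pandas(scaled)

# Edge sets (and weights) can differ, since NOTEARS penalises edge weights and
# the learned weights depend on the scale of the variables.
print(set(sm_raw.edges) - set(sm_scaled.edges))
print(set(sm_scaled.edges) - set(sm_raw.edges))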
I'm trying to obtain an interventional data distribution, i.e., I want to intervene on a specific node and see how that affects the conditional probabilities in the entire graph. Currently, it's possible to intervene on a specific node and query the marginals of every other node. Let's say I have a graph with 3 nodes (A, B, C), one of them (A) is confounding the causal relation between the other two (B --> C). I can intervene on the confounding node A, and I want to obtain the probabilities P(B | A) and P(C | B, A). I suppose the former is directly inferred from the marginal P(B) (since we know the interventional value of A). But how can I obtain P(C | B, A)?
Database repair problems, where you'd want to remove unwanted causal effects from the distribution
Possibility to query conditionals along with marginals, or to sample from the distributions resulting from an intervention?
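One possible approach (assuming InferenceEngine.query accepts an observation dict, as shown in the tutorial); node and state names below are hypothetical:

from causalnex.inference import InferenceEngine

ie = InferenceEngine(bn)                          # bn: a fitted BayesianNetwork
ie.do_intervention("A", {"a0": 0.0, "a1": 1.0})   # intervene on the confounder A

p_b = ie.query()["B"]                      # marginal P(B | do(A))
p_c_given_b = ie.query({"B": "b1"})["C"]   # P(C | B=b1, do(A)), if query with observations is supported after an intervention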
I am currently running the Bayesian Network tutorial. I am getting the following syntax error while running the code.
How has this bug affected you? What were you trying to accomplish?
Building a Bayesian Network. Unable to build it because of the syntax error.
from causalnex.structure import StructureModel
sm_manual = StructureModel()
sm_manual.add_edges_from(
[
("b", "a", origin="expert"),
]
)
Run it in jupyter
It should form a Bayesian graph.
Instead, I am getting a syntax error.
-- If you received an error, place it here.
File "<ipython-input-5-127b20ebb195>", line 11
("b", "a", origin="expert"),
^
SyntaxError: invalid syntax
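For reference, the SyntaxError comes from Python itself: keyword arguments cannot appear inside a tuple literal. A hedged sketch of calls that avoid the problem (assuming add_edge, and add_edges_from with an origin keyword applied to all edges, as described in the causalnex API docs):

# Keyword arguments are not valid inside a tuple literal; pass the attribute to
# the call instead (assumption: add_edges_from applies origin to every edge).
from causalnex.structure import StructureModel

sm_manual = StructureModel()
sm_manual.add_edge("b", "a", origin="expert")                         # single edge
sm_manual.add_edges_from([("b", "a"), ("c", "a")], origin="expert")   # several edges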
-- Separate them if you have more than one.
Include as many relevant details about the environment in which you experienced the bug:
* CausalNex version used (`pip show causalnex`): 0.4.2
* Python version used (`python -V`): 3.7.4

There is an attribute error encountered after running the statement below from the tutorial.
bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")
Pandas version 0.25.1 is used, not version 1.
When I use `from_pandas` to learn a causal graph with NOTEARS, I run `watch -n 1 free -m` and it shows only 3/16 GB of memory used. I am running 370 thousand rows of data but it only uses about 3 GB of memory. How can I improve efficiency?
Every 1.0s: free -m Tue Jun 30 16:18:36 2020
total used free shared buff/cache available
Mem: 16384 2799 12213 0 1371 13584
Swap: 0 0 0
Hi QB––
I am running do-calculus on a small dataset (116x32) with 2 to 4 discretised buckets.
The BN fits the CPDs in 2 seconds, so relatively good performance.
However, a simple do-intervention takes forever and never finishes running; I waited several hours and then interrupted the kernel.
from causalnex.inference import InferenceEngine
ie = InferenceEngine(bn)
ie.do_intervention("cD_TropCycl", {1: 0.2, 2: 0.8})
print("distribution after do", ie.query()["cD_TropCycl"])
Shouldn't it be running just a few seconds given the low number of buckets?
How long does it normally take?
no results returned after hours running a simple query.
Include as many relevant details about the environment in which you experienced the bug:
* CausalNex version used (`pip show causalnex`): 0.5.0
* Python version used (`python -V`): Python 3.7.6

Thank you very much!!
I want to know what the input data and result for dynotears should look like.
I tried to use dynotears.from_pandas with DREAM4 challenge data, but I get an empty graph.
I constructed a list containing 10 dataframes, as below.
For each dataframe, the columns are nodes and the rows are timepoints, such as:
g1 g2
1 1 2
2 4 2
3 3 1
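For what it's worth, a sketch of how I would expect the call to look. The entry point and parameter names here are assumptions and may differ between causalnex versions (in some releases the function is `from_pandas_dynamic` in `causalnex.structure.dynotears`):

# Sketch only; function name, lag parameter and threshold are assumptions.
import pandas as pd
from causalnex.structure.dynotears import from_pandas_dynamic

# one DataFrame per realisation: columns are nodes, rows are timepoints
frames = [pd.DataFrame({"g1": [1, 4, 3], "g2": [2, 2, 1]}) for _ in range(10)]

sm = from_pandas_dynamic(frames, p=2)   # p: number of time lags considered
sm.remove_edges_below_threshold(0.01)   # an empty graph may just mean the
print(sm.edges(data="weight"))          # threshold/regularisation is too strict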
[Update] DAGs with NO TEARS
https://github.com/xunzheng/notears
Zheng, X., Dan, C., Aragam, B., Ravikumar, P., & Xing, E. P. (2020). Learning sparse nonparametric DAGs (AISTATS 2020, to appear).
https://arxiv.org/pdf/1909.13189.pdf
Please consider evaluating 1909.13189.
Thanks!
Hi,
Thanks for this awesome package!
I wanted to get some more info on the Sklearn Interface.
How does the Sklearn Interface relate to structural models and Bayesian networks that can be used through other parts of the API?
Are the DAGRegressors just non-probabilistic structural causal models, whereas Bayesian Networks are probabilistic? Is this the key difference?
Cheers,
Nick
This is just an "FYI" - not a bug report.
I am using causalnex with Apache RAPIDS with a view to using GPUs for the networkx activity.
There are a couple of version incompatibilities because RAPIDS uses more recent Python libraries than causalnex. These are:
with networkx 2.4 the weakly_connected_component_subgraphs has been removed (it was announced as deprecated in 2.1; it's now gone).
with pandas 0.25.0 I get the following attribute error (maybe it's actually a Python 3.7 -> Python 3.6 issue as RAPIDS is using 3.6?):
----> 2 classification_report(bn, test, "G1")
/opt/conda/envs/rapids/lib/python3.6/site-packages/causalnex/evaluation/evaluation.py in classification_report(bn, data, node)
205 )
206
--> 207 return pd.DataFrame.from_dict(report, orient="index")
/opt/conda/envs/rapids/lib/python3.6/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
1172 # TODO speed up Series case
1173 if isinstance(list(data.values())[0], (Series, dict)):
-> 1174 data = _from_nested_dict(data)
1175 else:
1176 data, index = list(data.values()), list(data.keys())
/opt/conda/envs/rapids/lib/python3.6/site-packages/pandas/core/frame.py in _from_nested_dict(data)
8473 new_data = OrderedDict()
8474 for index, s in data.items():
-> 8475 for col, v in s.items():
8476 new_data[col] = new_data.get(col, OrderedDict())
8477 new_data[col][index] = v
AttributeError: 'float' object has no attribute 'items'
Is the current structure learning algorithm valid when there are confounder variables?
If not, can you add an alternative that takes them into account?
Hello,
Thank you very much for CausalNex! I am new to Bayesian Networks and I am trying to understand them. I have one question. If we set the distribution of one value of a feature to 1, as in the tutorial (100% of students wanted to do higher education), we obtain a certain rate. Should the obtained rate be the same if we instead set the distribution of the other possible value of the feature to 1? And why not? Moreover, is setting only one value for all the instances in the dataset equivalent to removing that feature?
Hi,
After learning the BN model, I use `sm.edges(data=True)` and find some negative weight edges in my model.
What is the meaning of those negative weight edges?
The "edit on GitHub" link points to the non-existent file https://github.com/quantumblacklabs/causalnex/blob/master/docs/index.rst.
Instead it should probably point to the repo.
There is a bug associated with the `fit_cpds` and `fit_node_states_and_cpds` functions, related to pandas.
The functions work fine, but the warnings below are raised.
A colab jupyter notebook can be accessed from here:
https://colab.research.google.com/drive/1uY4b_gXSwRvUYe774pm6cX7Whigexzk0?usp=sharing
Or
# Create a Bayesian Network with a manually defined DAG
import pandas as pd
from causalnex.structure.structuremodel import StructureModel
from causalnex.network import BayesianNetwork
from causalnex.inference import InferenceEngine
sm = StructureModel()
sm.add_edges_from([
('rush_hour', 'traffic'),
('weather', 'traffic')
])
data = pd.DataFrame({
'rush_hour': [True, False, False, False, True, False, True],
'weather': ['Terrible', 'Good', 'Bad', 'Good', 'Bad', 'Bad', 'Good'],
'traffic': ['heavy', 'light', 'heavy', 'light', 'heavy', 'heavy', 'heavy']
})
bn = BayesianNetwork(sm)
# Inference can only be performed on the `BayesianNetwork` with learned nodes states and CPDs
bn = bn.fit_node_states_and_cpds(data)
-- If you received an error, place it here.
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:5191: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
object.__getattribute__(self, name)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:5192: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
return object.__setattr__(self, name, value)
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/base.py:54: FutureWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
states = sorted(list(self.data.ix[:, variable].dropna().unique()))
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/base.py:111: FutureWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
state_count_data = data.ix[:, variable].value_counts()
/usr/local/lib/python3.6/dist-packages/pgmpy/estimators/MLE.py:128: FutureWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
state_counts.ix[:, (state_counts == 0).all()] = 1
{'heavy': 0.7142857142857142, 'light': 0.2857142857142857}
-- Separate them if you have more than one.
## Your Environment
Include as many relevant details about the environment in which you experienced the bug:
* CausalNex version used (`pip show causalnex`): 0.8.1
* Python version used (`python -V`): 3.6
* Pandas vesion used: 0.25.3
* Operating system and version: colab session
The Discretiser class is great. It would be quite helpful to have a log_uniform method just like uniform.
Often, attribute distributions are zero-inflated or otherwise non-uniform / log-uniform.
The same method as uniform could be used, but first taking the logarithm of the data and then converting the split points back to the original scale.
I want to find causal relationships from some columns to a label I defined, using from_pandas (NOTEARS), but I got causal relationships going from the label to the other columns.
First, my dataset is 300K rows * 150 columns. I defined the label (result) for which I want to find the contributing columns (causes); the label distribution is more than 280,000 zeros and more than 10,000 ones.
Second, I use `from_pandas()` to learn the structure.
Lastly, I use `sm.remove_edges_below_threshold(0.01)`, but all the causal relationships involving the label go from the label to the other columns.
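One thing that may help (assuming `from_pandas` supports `tabu_parent_nodes`, as in recent causalnex versions): forbid the label from being a parent, so any learned edge involving it can only point towards it. The column name "label" and the DataFrame `df` are placeholders.

# Assumption: from_pandas accepts tabu_parent_nodes; "label" stands for the
# target column and df for the 300K x 150 DataFrame.
from causalnex.structure.notears import from_pandas

sm = from_pandas(df, tabu_parent_nodes=["label"])  # label cannot cause other nodes
sm.remove_edges_below_threshold(0.01)
print([edge for edge in sm.edges if "label" in edge])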
I found a typo in the tutorial (https://causalnex.readthedocs.io/en/latest/03_tutorial/03_tutorial.html):
It should be 'us' instead of 'as'.
Using the discrete parameter learning functionality on a standard BN structure (20-30 nodes with an intuitive discretisation for each) requires huge amounts of memory. Parameter learning is faster and more memory-efficient when carried out on continuous data in other packages (Gaussian BNs).
When suitable normality assumptions are met, Gaussian BNs perform well, they don't require any loss of information through discretisation, and the memory requirements of parameter learning are greatly reduced. I've also seen in your docs that this is already on your roadmap; an ETA would be awesome!
Trying to run the model on Google Colab, pandas raises issues about importing some packages.
The first one that turned out to be fatal is related to 'OrderedDict':
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/html.py in ()
8 from textwrap import dedent
9
---> 10 from pandas.compat import OrderedDict, lzip, map, range, u, unichr, zip
11
12 from pandas.core.dtypes.generic import ABCMultiIndex
ImportError: cannot import name 'OrderedDict'
The second fatal one is related to importing lmap:
/usr/local/lib/python3.6/dist-packages/pandas/core/config.py in ()
55
56 import pandas.compat as compat
---> 57 from pandas.compat import lmap, map, u
58
59 DeprecatedOption = namedtuple('DeprecatedOption', 'key msg rkey removal_ver')
ImportError: cannot import name 'lmap'
With standalone pandas I did not get these issues; it seems to me Google Colab is missing something.
Any suggestions, please?
How to assess the effect of an intervention? How to compute see and do probabilities?
I want to perform Level 2 of Causal Inference - Do-Calculus.
Considering the student performance example on the CausalNex website.
I am using the "higher" node as the intervention and performing do_intervention on it. I get a probability distribution for G1 based on the intervention. When I try to compare the probability distribution of G1 before the intervention, it is the same.
ie.query({"higher": "yes"}) gives:
'higher': {'no': 0.0, 'yes': 1.0}
'G1': {'Fail': 0.206829529425519, 'Pass': 0.793170470574481}
ie.do_intervention("higher",{"yes":1.0,"no":0.0})
ie.query()
'higher': {'no': 0.0, 'yes': 0.9999999999999998}
'G1': {'Fail': 0.20682952942551894, 'Pass': 0.7931704705744809}
I want to understand how do_intervention differs? How do I compute the see probabilities (before intervention) and do probabilities?
'K2' as bayes_prior is working, but 'BDeu' throws an error.
bn = BayesianNetwork(graph_largest_sub)
bn = bn.fit_node_states(train)
bn = bn.fit_cpds(train, method='BayesianEstimator', bayes_prior='BDeu')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-ec000222de66> in <module>
1 bn = BayesianNetwork(graph_largest_sub)
2 bn = bn.fit_node_states(train)
----> 3 bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior='BDeu')
/usr/local/lib/python3.7/site-packages/causalnex/network/network.py in fit_cpds(self, data, method, bayes_prior, equivalent_sample_size)
368 prior_type=bayes_prior,
369 equivalent_sample_size=equivalent_sample_size,
--> 370 state_names=state_names,
371 )
372 else:
/usr/local/lib/python3.7/site-packages/pgmpy/models/BayesianModel.py in fit(self, data, estimator, state_names, complete_samples_only, **kwargs)
695 _estimator = estimator(self, data, state_names=state_names,
696 complete_samples_only=complete_samples_only)
--> 697 cpds_list = _estimator.get_parameters(**kwargs)
698 self.add_cpds(*cpds_list)
699
/usr/local/lib/python3.7/site-packages/pgmpy/estimators/BayesianEstimator.py in get_parameters(self, prior_type, equivalent_sample_size, pseudo_counts)
71 prior_type=prior_type,
72 equivalent_sample_size=_equivalent_sample_size,
---> 73 pseudo_counts=_pseudo_counts)
74 parameters.append(cpd)
75
/usr/local/lib/python3.7/site-packages/pgmpy/estimators/BayesianEstimator.py in estimate_cpd(self, node, prior_type, pseudo_counts, equivalent_sample_size)
131 pseudo_counts = [1] * node_cardinality
132 elif prior_type == 'BDeu':
--> 133 alpha = float(equivalent_sample_size) / (node_cardinality * np.prod(parents_cardinalities))
134 pseudo_counts = [alpha] * node_cardinality
135 elif prior_type == 'dirichlet':
TypeError: float() argument must be a string or a number, not 'NoneType'
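Based on the traceback (float() applied to a None equivalent_sample_size), one possible workaround is to pass an explicit equivalent_sample_size, which fit_cpds forwards to pgmpy's BayesianEstimator:

# Workaround sketch: BDeu needs an equivalent sample size; the value 10 here is
# an arbitrary example, not a recommendation.
bn = bn.fit_cpds(
    train,
    method="BayesianEstimator",
    bayes_prior="BDeu",
    equivalent_sample_size=10,
)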
Include as many relevant details about the environment in which you experienced the bug:
* CausalNex version used (`pip show causalnex`): 0.5.0
* Python version used (`python -V`): 3.7

In the lib repo you write:
A Python library that helps data scientists to infer causation rather than observing correlation
In docs:
In this package we are mostly interested in the case where BNs are causal. Hence, the edge between nodes should be seen as cause -> effect relationship.
Also:
Bayesian Network consists of a DAG, a causal graph where nodes represents random variables
A Bayesian network can equivalently encode dependency between variables as a -> b -> c and a <- b <- c (https://en.wikipedia.org/wiki/Bayesian_network#Causal_networks). What helps causalnex find causal relationships, and does it really do that? Thanks
plot_structure does not plot anything since the update. I am just trying to replicate the tutorial in a jupyter notebook.
from causalnex.structure import StructureModel
from causalnex.plots import plot_structure
sm = StructureModel()
sm.add_edges_from([
('health', 'absences'),
('health', 'G1')
])
_, _, _ = plot_structure(sm)
The dependencies specified in setup.py come directly from requirements.txt, where they are specified as pinned requirements (e.g. pandas == 0.24.0). This is not good practice for a package except in some edge cases (e.g. a known incompatibility or a known bug, but even in that case it is preferable to declare a "less-than" requirement rather than an equality).
Due to these pinned versions, it is not easy to use causalnex in an existing environment (where, e.g., a version other than pandas 0.24.0 is needed).
Could I suggest you declare more permissive constraints? It is quite common in Python to only define a lower bound (e.g. pandas>=0.24.0), but this is a very optimistic way of specifying a dependency constraint, since it is unlikely for a package to remain fully compatible with all future updates of a dependency. A better option would be to use ~= (i.e. "all updates supposedly compatible with the given version").
Btw, for the specific case of pandas, I suggest adopting updates from the 1.0.0 branch as well.
Most developers use Python with Docker nowadays, and I see the instructions only cover conda and PyCharm. I suggest adding a Docker setup as well.
I still find it difficult to set up causalnex locally; when I did it with Docker it was much simpler and reproducible. I would therefore like to add this feature, and if needed I can do that and contribute it as well.
We could use a base Python image, or a standard image loaded with other packages and extend it (if needed I can share how I did it).
Alternatives are already available
Never having used pygraphviz, most of the Google and Stack Overflow searches about it concern errors during installation, on both Mac and PC.
This has been frustrating when trying to run CausalNex in Jupyter. It works in the terminal but is cumbersome. Why is seaborn or Matplotlib not used instead? (Yes, graphviz and pygraphviz are apparently powerful, but they are rarely used compared to other Python visualisation aids.)
Hi,
I don't know the meaning of the parameter 'w_threshold' in from_pandas, because I can also get a BN model when I use 'hill_climb' from pgmpy. The number of edges in the model varies with the value of w_threshold, so I don't know which one is correct.
This problem does not exist with 'hill_climb'.
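My understanding of w_threshold (hedged, based on the NOTEARS output being a weighted adjacency matrix): it simply drops learned edges whose absolute weight falls below the threshold, which is why different values give different edge counts. A small sketch, where `df` is a hypothetical DataFrame:

from causalnex.structure.notears import from_pandas

sm_all = from_pandas(df)                   # keep all weighted edges
sm_cut = from_pandas(df, w_threshold=0.8)  # drop edges with |weight| < 0.8 up front

# equivalent post-hoc pruning of the unthresholded model
sm_all.remove_edges_below_threshold(0.8)
print(sm_all.edges(data="weight"))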
Hi, Thank you for the package, it makes lots of causal inference jobs much easier. And the tutorial helps with getting started quickly.
I was just wondering if there is any way of plotting the DAGs inside causalnex without pygraphviz. It's a bit of a pain to get that running on Windows.
Thanks
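One pygraphviz-free workaround (not part of the causalnex plotting API): since StructureModel is a networkx DiGraph subclass, it can be drawn directly with networkx and matplotlib.

import matplotlib.pyplot as plt
import networkx as nx

# sm: a StructureModel (networkx DiGraph subclass)
pos = nx.spring_layout(sm, seed=7)
nx.draw_networkx(sm, pos=pos, node_color="lightblue", arrows=True)
plt.axis("off")
plt.show()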
When I use DecisionTreeClassifier, like below:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(train, label, test_size=0.2, random_state=33)
dtc = DecisionTreeClassifier(max_depth=12)
dtc.fit(X_train, y_train)
How can I define the label of the dataset in the CausalNex training process?
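In case it helps, a hedged sketch of how the label is handled on the CausalNex side (following the tutorial's API; the names "label", df, df_train and df_test are placeholders): there is no separate y array, the label is simply one node of the network and is predicted from the other columns.

from causalnex.network import BayesianNetwork

bn = BayesianNetwork(sm)     # sm: a StructureModel that includes the "label" node
bn = bn.fit_node_states(df)  # df: full dataset containing the label column
bn = bn.fit_cpds(df_train, method="BayesianEstimator", bayes_prior="K2")

predictions = bn.predict(df_test, "label")  # predict the label node for new rows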
Since the latest update, plot_structure requires pygraphviz, which is missing from requirements.txt.
Hi! I have one problem with the classification_report method. It throws an AttributeError: 'float' object has no attribute 'items'. The roc_auc method works properly when called with the same parameters as classification_report.
File "C:\Users......\anaconda3\lib\site-packages\causalnex\evaluation\evaluation.py", line 207, in classification_report
return pd.DataFrame.from_dict(report, orient="index")
File "C:\Users......\anaconda3\lib\site-packages\pandas\core\frame.py", line 1179, in from_dict
data = _from_nested_dict(data)
File "C:\Users......\anaconda3\lib\site-packages\pandas\core\frame.py", line 8486, in _from_nested_dict
for col, v in s.items():
AttributeError: 'float' object has no attribute 'items'
Include as many relevant details about the environment in which you experienced the bug:
* CausalNex version used (`pip show causalnex`): 0.5.0
* Python version used (`python -V`): 3.7

The unit test for independence of variables in the data generator sometimes fails on Python 3.5 (maybe on other versions as well, but I haven't observed it).
The test is unavoidably stochastic; I am not sure why it is Python 3.5 specific (numpy dependencies / RNG?).
test_mixed_type_independence
This FR addresses the existing structure learning ability in causalnex.
It provides an alternative to the NOTEARS algorithm, for time-series data, which performs comparatively better.
Details: https://www.groundai.com/project/dynotears-structure-learning-from-time-series-data/1