dkesada / dbnr

Gaussian dynamic Bayesian networks structure learning and inference based on the bnlearn package

License: GNU General Public License v3.0

R 69.08% C++ 25.59% Jupyter Notebook 5.33%

forecasting bayesian-networks dynamic-bayesian-networks inference time-series

dbnr's Introduction

dbnR

Introduction

This package offers an implementation of Gaussian dynamic Bayesian networks (GDBN) structure learning and inference based partially on Marco Scutari’s package bnlearn (https://www.bnlearn.com/). It also allows the construction of higher-order DBNs. Three structure learning algorithms are implemented:

A variation on Ghada Trabelsi’s dynamic max-min hill climbing (https://theses.hal.science/tel-00996061/document).
A particle swarm optimization algorithm for higher-order DBNs (https://doi.org/10.1109/BRC.2014.6880957)
A scalable, order invariant particle swarm optimization algorithm for higher-order DBNs (https://link.springer.com/chapter/10.1007/978-3-030-86271-8_14)

Inference is performed either via the particle filtering offered by bnlearn or by doing exact inference over the multivariate Gaussian equivalent of a network implemented in this package. A visualization tool is also implemented for GDBNs and bnlearn’s BNs via the visNetwork package (https://github.com/datastorm-open/visNetwork).
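The exact inference mentioned above relies on the standard conditioning formulas of a multivariate Gaussian. The following base R sketch shows only that formula; `condition_mvn` and the toy `mu` and `sigma` values are hypothetical illustrations and not part of the package's API.

```r
# Condition a multivariate Gaussian N(mu, sigma) on observed values.
# `condition_mvn` is a hypothetical helper, not a dbnR function.
condition_mvn <- function(mu, sigma, obs) {
  ev <- names(obs)                 # evidence (observed) variables
  un <- setdiff(names(mu), ev)     # unobserved variables
  s_ue <- sigma[un, ev, drop = FALSE]
  s_ee <- sigma[ev, ev, drop = FALSE]
  gain <- s_ue %*% solve(s_ee)
  list(
    mu = drop(mu[un] + gain %*% (obs - mu[ev])),
    sigma = sigma[un, un, drop = FALSE] - gain %*% t(s_ue)
  )
}

# Toy two-variable example (values chosen only for illustration)
mu <- c(a = 0, b = 1)
sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2,
                dimnames = list(names(mu), names(mu)))

post <- condition_mvn(mu, sigma, obs = c(a = 2))
post$mu     # posterior mean of b given a = 2
post$sigma  # posterior variance of b given a = 2
```

Observing `a` shifts the mean of `b` in proportion to their covariance and shrinks its variance; in this formula the amount of shrinkage does not depend on the observed value.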

Current development

The main functionality of the package is running and working. In order of importance, the next objectives are:

To refactor the DMMHC algorithm into R6 for consistency with the PSOHO algorithm and with any new structure learning algorithms.
To add an automatically generated shiny interface of the net. This makes interacting with the network easier and allows for simulation prototypes.

For now, the dbn.fit object as an extension of bnlearn’s bn.fit object will stay the same except for the “mu” and “sigma” attributes added to it. This way, it remains easy to call bnlearn’s methods on the dbn.fit object and I can store the MVN transformation inside the same object. Not an elegant solution, but its simplicity is enough. What should be addressed is having to perform the folding of a dataset outside the predict function.
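As a rough idea of what an MVN transformation of a linear Gaussian network involves, the sketch below builds the joint mean vector and covariance matrix from per-node regressions processed in topological order. The `nodes` list layout and the `lg_to_mvn` helper are assumptions made for this example; they do not reflect how the package stores or computes its "mu" and "sigma" attributes.

```r
# Each node: intercept b0, named coefficients over its parents, residual sd.
# Nodes must be listed in topological order (parents before children).
# This layout is hypothetical and chosen only for the example.
nodes <- list(
  a = list(b0 = 1,   coefs = numeric(0),       sd = 1),
  b = list(b0 = 0.5, coefs = c(a = 2),         sd = 0.5),
  c = list(b0 = 0,   coefs = c(a = 1, b = -1), sd = 1)
)

lg_to_mvn <- function(nodes) {
  n <- names(nodes)
  mu <- setNames(numeric(length(n)), n)
  sigma <- matrix(0, length(n), length(n), dimnames = list(n, n))
  for (i in seq_along(n)) {
    nd <- nodes[[i]]
    pa <- names(nd$coefs)            # parents of the current node
    prev <- n[seq_len(i - 1)]        # nodes already processed
    mu[i] <- nd$b0 + sum(nd$coefs * mu[pa])
    sigma[i, i] <- nd$sd^2
    if (length(pa) > 0) {
      # Covariance with earlier nodes is a linear combination of the
      # parents' covariances with those nodes.
      cov_prev <- drop(nd$coefs %*% sigma[pa, prev, drop = FALSE])
      sigma[i, prev] <- cov_prev
      sigma[prev, i] <- cov_prev
      sigma[i, i] <- nd$sd^2 +
        drop(nd$coefs %*% sigma[pa, pa, drop = FALSE] %*% nd$coefs)
    }
  }
  list(mu = mu, sigma = sigma)
}

mvn <- lg_to_mvn(nodes)
mvn$mu
mvn$sigma
```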

Getting Started

Prerequisites

This package requires R ≥ 3.6.1 to work properly. It also works with R ≥ 3.5.0; the only difference is the color palette of the DBN visualization tool.

The bnlearn and data.table packages, among others, are required for this package to work. They will be installed automatically when installing this package. They can also be installed manually via CRAN with the command

install.packages(c("bnlearn", "data.table"))

The packages visNetwork, magrittr and grDevices are optional for the visualization tool. They will only be required if you want to use it.

Installing

As of today, the easiest way of installing dbnR is via CRAN. To install it, simply run

install.packages('dbnR')

You can also install the latest version from GitHub with the install_github function in the devtools package. The commands you need to run are

library(devtools)
devtools::install_github("dkesada/dbnR")

This will install the required dependencies if they are not available. After this, you will be ready to use the package.

Basic examples

To get the structure of a GDBN from a dataset, you need to use the function learn_dbn_struc

library(dbnR)
#> Loading required package: bnlearn
#> 
#> Attaching package: 'dbnR'
#> The following objects are masked from 'package:bnlearn':
#> 
#>     degree, nodes, nodes<-, score

data(motor)

size <- 3
dt_train <- motor[200:2500]
dt_val <- motor[2501:3000]
net <- learn_dbn_struc(dt_train, size)

The dt argument has to be either a data.frame or a data.table of numeric columns; in the example we use the sample dataset included in the package. The size argument determines the number of time slices that your net is going to have, which fixes the Markovian order of the net: in our function, Markovian order = size - 1. A Markovian order of 1 means that your data in the present is independent of the past given the previous time slice. If your case doesn’t meet this criterion, the size of the net can be increased to take more past time slices into account in the inference. The function returns a ‘dbn’ object that inherits from the ‘bn’ class in bnlearn, so that its auxiliary functions like ‘arcs’ also work on DBN structures.
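To make the relation between size and the columns of a folded dataset concrete, here is a minimal base R sketch. It assumes that the `_t_0` suffix marks the most recent slice and `_t_1`, `_t_2`, ... mark progressively older ones; `fold_sketch` is a hypothetical helper written for this example and is not the package's folding implementation.

```r
# Illustrative base R folding; `fold_sketch` is a hypothetical helper.
fold_sketch <- function(df, size) {
  n <- nrow(df)
  stopifnot(size >= 1, n >= size)
  slices <- lapply(seq_len(size) - 1, function(lag) {
    rows <- (size - lag):(n - lag)   # shift the series back by `lag` instants
    out <- df[rows, , drop = FALSE]
    names(out) <- paste0(names(df), "_t_", lag)
    rownames(out) <- NULL
    out
  })
  do.call(cbind, slices)
}

df <- data.frame(a = 1:6, b = 11:16)
folded <- fold_sketch(df, size = 3)

dim(folded)    # rows: n - size + 1, columns: ncol(df) * size
names(folded)
```

With 2 variables and size = 3, each folded row holds 6 columns: the present values plus two past slices, which corresponds to a Markovian order of 2.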

Once the structure is learnt, it can be plotted and used to learn the parameters

plot_dynamic_network(net)

f_dt_train <- fold_dt(dt_train, size)
fit <- fit_dbn_params(net, f_dt_train, method = "mle-g")

After learning the net, two different types of inference can be performed: point-wise inference over a dataset and forecasting to some horizon. Point-wise inference uses the folded dt to try and predict the objective variables in each row. Forecasting to some horizon, on the other hand, tries to predict the behaviour in the future M instants given some initial evidence of the variables.
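The difference between the two kinds of inference can be illustrated with a toy autoregressive model in base R. This is only an analogy under simplifying assumptions (a single variable and a linear model fitted with `lm`); it does not call the package's own inference functions, and all names below are made up for the example.

```r
set.seed(42)

# Toy AR(1) series standing in for a single variable of a DBN
n <- 200
x <- numeric(n)
for (t in 2:n) x[t] <- 0.8 * x[t - 1] + rnorm(1, sd = 0.5)

# "Folded" training data: present value and previous value
train <- data.frame(x_t_0 = x[2:150], x_t_1 = x[1:149])
model <- lm(x_t_0 ~ x_t_1, data = train)

# Point-wise inference: every prediction uses the observed previous value
test <- data.frame(x_t_1 = x[150:199])
pointwise <- predict(model, newdata = test)

# Forecasting: only the first instant is observed; each prediction is
# fed back as evidence for the next step
horizon <- 50
forecast <- numeric(horizon)
prev <- x[150]
for (h in seq_len(horizon)) {
  prev <- predict(model, newdata = data.frame(x_t_1 = prev))
  forecast[h] <- prev
}

truth <- x[151:200]
c(pointwise = mean(abs(pointwise - truth)),
  forecast  = mean(abs(forecast - truth)))
```

Point-wise predictions are corrected by fresh evidence at every row, whereas a forecast accumulates its own errors over the horizon, so the two can look very different on the same data.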

There is an extensive example of how to use the package in the markdowns folder, which covers more advanced concepts of structure learning and inference.

License

This project is licensed under the GPL-3 License, following on bnlearn’s GPL(≥ 2) license.

References

The bnlearn package (https://www.bnlearn.com/).
The visNetwork package (https://datastorm-open.github.io/visNetwork/)
Kaggle’s dataset repository, where the sample dataset is from (https://kaggle.com/datasets/wkirgsn/electric-motor-temperature)
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

Applications of dbnR

Quesada, D., Valverde, G., Larrañaga, P., & Bielza, C. (2021). Long-term forecasting of multivariate time series in industrial furnaces with dynamic Gaussian Bayesian networks. Engineering Applications of Artificial Intelligence, 103, 104301.
Quesada, D., Bielza, C., & Larrañaga, P. (2021, September). Structure Learning of High-Order Dynamic Bayesian Networks via Particle Swarm Optimization with Order Invariant Encoding. In International Conference on Hybrid Artificial Intelligence Systems (pp. 158-171). Springer, Cham.
Quesada, D., Bielza, C., Fontán, P., & Larrañaga, P. (2022). Piecewise forecasting of nonlinear time series with model tree dynamic Bayesian networks. International Journal of Intelligent Systems, 37, 9108-9137.

dbnr's Issues

Error for dbnR::plot_network() using R 4.0.2

Hi,
I recently upgraded R to version 4.0.2. Since then I have been getting the following error when trying to plot a network (as in the example of the vignette):

Error in sprintf("Package %s needed for this function to work.") :
too less arguments

Would be nice if that could be fixed.
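The message in this report is what base R produces when a format string contains a `%s` placeholder but no value is supplied for it. A minimal reproduction follows; the package name stored in `pkg` is an assumption made for illustration, since the report does not say which package was missing.

```r
# Assumed package name, used only to fill the placeholder
pkg <- "visNetwork"

# Reproduces the reported failure: the format has one %s but no value
broken <- tryCatch(
  sprintf("Package %s needed for this function to work."),
  error = function(e) conditionMessage(e)
)
broken

# Passing the package name as an extra argument resolves it
fixed <- sprintf("Package %s needed for this function to work.", pkg)
fixed
```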

log likelihood for partial observed dataset

Hi dear author, I encountered a problem with DBN evaluation. I learn a DBN from a training set where all the values are observable, while for testing some of the variables are masked out. For example, we have "a_t_1,b_t_1,c_t_1,a_t_0,b_t_0,c_t_0", and the values of "a_t_1,c_t_0" are masked. If I want to evaluate my DBN with this test set, for example by computing the log-likelihood of the test set, should I use the learned DBN to infer the values of the masked nodes and then compute the whole likelihood? Or could you please suggest another method?

Error in sprintf("Package %s needed for this function to work.") : too few arguments

Keep getting this error when trying the example from the readme. Is there anything I missed?

prediction in case unknown target node

Hi, I have a question about prediction.

After building a model we want to predict the data.

In the general case we can input data without the target node, like this:

library(bnlearn)
dtraining.set = learning.test[1:4000, ]
dvalidation.set = learning.test[4001:5000, ]
dvalidation.set_R = learning.test[4001:5000, -5] #remove target node "E"

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, dtraining.set)

pred = predict(dfitted, node = "E", 
               data = dvalidation.set,
               method = "parents")

pred_R = predict(dfitted, node = "E", 
               data = dvalidation.set_R,
               method = "parents")


table(dvalidation.set$E, pred)
table(dvalidation.set$E, pred_R)
table(pred, pred_R) #same result

But in this case it does not work when the target node is removed:

library(dbnR)
size = 3
data(motor)
str(motor)
dt_train <- motor[200:900]
dt_val <- motor[901:1000]
dt_val_R <- motor[901:1000,-8] #remove target node "PM"

# With a DBN
obj <- c("pm_t_0")
net <- learn_dbn_struc(dt_train, size)

f_dt_train <- fold_dt(dt_train, size)
f_dt_val <- fold_dt(dt_val, size)
f_dt_val_R <- fold_dt(dt_val_R, size)

fit <- fit_dbn_params(net, f_dt_train, method = "mle-g")

fit$pm_t_2

res <- suppressWarnings(predict_dt(fit, f_dt_val, obj_nodes = obj))
res_R <- suppressWarnings(predict_dt(fit, f_dt_val_R, obj_nodes = obj)) #does not work

Am I misunderstanding something and using it incorrectly?

DBN structure

Hello!
Can I define the network structure by myself, instead of acquiring the network through structure learning?
I want to define an edge that goes from one time slice to another. I found whitelist, blacklist and blacklist_tr, but there is no whitelist_tr?

Data preprocessing

Hello, is it necessary to standardize and normalize the data before dynamic Bayesian network prediction?

format of dataset

Dear author,
Thanks so much for providing this useful package. However, as a beginner in R and DBNs, I might have a stupid question.
Assume that we have 3 variables a, b and c, and that the Markovian order is equal to 1.
Then I suppose that the dataset should contain these columns:
a_t_1,b_t_1,c_t_1,a_t_0,b_t_0,c_t_0
and every row of the dataset is sampled i.i.d. from a pre-defined DBN.
But in your example, the motor dataset, there are 11 variables and the Markovian order is 2.
Then I guess there should be 11*3=33 columns in the dataset.
This confused me a lot. Could you please give me some guidance?
Best regards.

Standard format of f_dt

I formatted my dataframe with the following names :

> names(phenoData_wide)
  [1] "age"                "sex"                "height"            
  [4] "weight"             "diabete"            "hyperten"          
  [7] "myoinfarc"          "cardiofailure"      "cerebrovasc"       
 [10] "dementia"           "copd"               "paralysis"         
 [13] "renafailure"        "mort_flg"           "MV_days"           
 [16] "CRRT_days"          "VASO_days"          "Hospital_days"     
 [19] "Batch"              "SampleLocation"     "hrmax_t_1"         
 [22] "hrmax_t_3"          "hrmax_t_5"          "hrmin_t_1"         
 [25] "hrmin_t_3"          "hrmin_t_5"          "mapmax_t_1"        
 [28] "mapmax_t_3"         "mapmax_t_5"         "mapmin_t_1"        
 [31] "mapmin_t_3"         "mapmin_t_5"         "sapmax_t_1"        
 [34] "sapmax_t_3"         "sapmax_t_5"         "sapmin_t_1"        
 [37] "sapmin_t_3"         "sapmin_t_5"         "rrmax_t_1"         
 [40] "rrmax_t_3"          "rrmax_t_5"          "rrmin_t_1"         
 [43] "rrmin_t_3"          "rrmin_t_5"          "tmax_t_1"          
 [46] "tmax_t_3"           "tmax_t_5"           "tmin_t_1"          
 [49] "tmin_t_3"           "tmin_t_5"           "mv_t_1"            
 [52] "mv_t_3"             "mv_t_5"             "crrt_t_1"          
 [55] "crrt_t_3"           "crrt_t_5"           "gcs_t_1"           
 [58] "gcs_t_3"            "gcs_t_5"            "lac_t_1"           
 [61] "lac_t_3"            "lac_t_5"            "k_t_1"             
 [64] "k_t_3"              "k_t_5"              "na_t_1"            
 [67] "na_t_3"             "na_t_5"             "cl_t_1"            
 [70] "cl_t_3"             "cl_t_5"             "ca_t_1"            
 [73] "ca_t_3"             "ca_t_5"             "pha_t_1"           
 [76] "pha_t_3"            "pha_t_5"            "paco_t_1"          
 [79] "paco_t_3"           "paco_t_5"           "pao_t_1"           
 [82] "pao_t_3"            "pao_t_5"            "abe_t_1"           
 [85] "abe_t_3"            "abe_t_5"            "fio_t_1"           
 [88] "fio_t_3"            "fio_t_5"            "SaO2_t_1"          
 [91] "SaO2_t_3"           "SaO2_t_5"           "procal_t_1"        
 [94] "procal_t_3"         "procal_t_5"         "phcv_t_1"          
 [97] "phcv_t_3"           "phcv_t_5"           "pcvco_t_1"         
[100] "pcvco_t_3"          "pcvco_t_5"          "pcvo_t_1"          
[103] "pcvo_t_3"           "pcvo_t_5"           "scvo_t_1"          
[106] "scvo_t_3"           "scvo_t_5"           "bun_t_1"           
[109] "bun_t_3"            "bun_t_5"            "alb_t_1"           
[112] "alb_t_3"            "alb_t_5"            "cr_t_1"            
[115] "cr_t_3"             "cr_t_5"             "bilirubin_t_1"     
[118] "bilirubin_t_3"      "bilirubin_t_5"      "crp_t_1"           
[121] "crp_t_3"            "crp_t_5"            "wbc_t_1"           
[124] "wbc_t_3"            "wbc_t_5"            "hct_t_1"           
[127] "hct_t_3"            "hct_t_5"            "plt_t_1"           
[130] "plt_t_3"            "plt_t_5"            "inr_t_1"           
[133] "inr_t_3"            "inr_t_5"            "aptt_t_1"          
[136] "aptt_t_3"           "aptt_t_5"           "tt_t_1"            
[139] "tt_t_3"             "tt_t_5"             "ddimer_t_1"        
[142] "ddimer_t_3"         "ddimer_t_5"         "urine_t_1"         
[145] "urine_t_3"          "urine_t_5"          "sofa_pf_t_1"       
[148] "sofa_pf_t_3"        "sofa_pf_t_5"        "sofa_plat_t_1"     
[151] "sofa_plat_t_3"      "sofa_plat_t_5"      "sofa_GCS_t_1"      
[154] "sofa_GCS_t_3"       "sofa_GCS_t_5"       "sofa_bilirubin_t_1"
[157] "sofa_bilirubin_t_3" "sofa_bilirubin_t_5" "sofa_vaso_t_1"     
[160] "sofa_vaso_t_3"      "sofa_vaso_t_5"      "sofa_cr_t_1"       
[163] "sofa_cr_t_3"        "sofa_cr_t_5"        "sofa_uo_t_1"       
[166] "sofa_uo_t_3"        "sofa_uo_t_5"        "SOFA_t_1"          
[169] "SOFA_t_3"           "SOFA_t_5"           "UTI_flg_t_1"       
[172] "UTI_flg_t_3"        "UTI_flg_t_5"        "UTI_dose_t_1"      
[175] "UTI_dose_t_3"       "UTI_dose_t_5"       "fluidin_t_1"       
[178] "fluidin_t_3"        "fluidin_t_5"        "fluidout_t_1"      
[181] "fluidout_t_3"       "fluidout_t_5"       "pf_t_1"            
[184] "pf_t_3"             "pf_t_5"

I think this is not the standard f_dt format, and the function reported an error:

> net <- dbnR::learn_dbn_struc(
+   phenoData_wide, 
+   f_dt = phenoData_wide, method = "dmmhc", 
+   #blacklist = blacklist,
+   #blacklist_tr = blacklist_tr, 
+   restrict = "mmpc", maximize = "hc",
+   restrict.args = list(test = "cor"),
+   maximize.args = list(score = "bic-g", maxp = 10))
Error in initial_folded_dt_check(f_dt) : 
  the data.frame is not properly time formatted.

How can I model a dataset that contains both time-varying and time-fixed variables?

A comprehensive tutorial for the package

Dear authors,
Thank you for your excellent package. However, I cannot find a comprehensive tutorial. Are you planning to post one in the future? Many users will need one to apply the package in their own domain.

How to define time.

Hi.

I just found this package and I was wondering how I can define the time sequence (a column for time) of the observations.

Some questions about smoothing

After smoothing, why is the obtained state of the past time so different from the real data?

Markovian order=1 but size=10

Hi, dear author, I have a problem when implementing DBN learning.
The Markovian order is 1, but I have a dataset with 10 or more time slices. How should I deal with my dataset if I want my DBN's order to be equal to 1? For example, if I have ['a_t_9','a_t_8',...'a_t_1','a_t_0'], should I shift the dataset to construct ['a_t_1','a_t_0'] with more observations?
Another interesting point I found is that if I increase the size of the DBN, for example with more variables and a higher order, the time cost of structure learning increases quadratically.
If my DBN has (20 variables, 2 time slices), dmmhc only takes a few ms. But if I increase it to (20 variables, 10 time slices), it takes a few minutes.

Best regards

Does the predict_dt function use the current time information?

Let's say our variables are
a_t_0, a_t_1 and b_t_0, b_t_1. If I call predict_dt with obj_var = a_t_0, will it use b_t_0 in that prediction (introducing lookahead bias)?

Error Forecasting with Approx Mode

Forecasting with the approximate mode produces the following error:
Error in FUN(X[[i]], ...) : object 'test' not found.

Is there a way around it, please?

DMMHC vs PSOHO vs NATPSOHO and conditional independence measures

Hello,
I have a doubt regarding the measures used in dbnR to calculate the conditional independence given the data. I am not sure if I fully understand this step in dbnR.
In the case of DMMHC, as you explained in the paper "Long-term forecasting of multivariate time series in industrial furnaces with dynamic Gaussian Bayesian networks" you use the exact 𝑡 test for Pearson’s correlation coefficient. We can also see this with the result of the net$learning:
$test
[1] "bic-g"
$ntests
[1] 29027
$algo
[1] "rsmax2"
$args
$args$k
[1] 2.038769
$args$alpha
[1] 0.05 ---> SIGNIFICANCE LEVEL ??
$optimized
[1] TRUE
$restrict
[1] "si.hiton.pc"
$rstest
[1] "cor" ---> PEARSON ??
$maximize
[1] "hc"
$maxscore
[1] "bic-g"
$illegal
NULL
First question: if I want to get the conditional probabilities, I just need to get the Pearson correlations and their p-values, right? Is that what you do internally in dbnR?

In the case of the PSOHO and NATPSOHO algorithms there is no filter: dbnR does not calculate the Pearson correlation and its t-test, or any other correlation measure and test, to calculate the conditional probability and its significance, right? net$learning is:
$whitelist
NULL
$blacklist
NULL
$test
[1] "none"
$ntests
[1] 0
$algo
[1] "empty"
$args
list()

This explains why the density of the network with DMMHC is much lower than in PSOHO and NATPSOHO.
Could you explain why you did not apply a similar approach in the case of PSOHO or NATPSOHO?
Thanks

Getting error while installing

Hi,

I am getting the following error while installing from GitHub.

Please help me out as soon as possible.

Thanks

Forecasting plot does not show predictions

When performing forecasting with the forecast_ts() function, if the predictions are outside the range of the original values they will not be seen in the plot. This is due to the order of the plot + line calls: the plot function is called first on the original time series, and afterwards the predictions are added on top of that plot with the lines function. It is common to find this scenario when the original time series does not have large variations, and so the forecasting can be outside this range. This behaviour can be misleading, because the forecasting is made, but in the plot it looks like the model didn't do anything.

To fix this, instead of plotting the original time series without taking into account the predictions, the frame should be built based on the range of values of both the original time series and the forecast.
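A minimal base R sketch of the proposed fix: compute the y-axis limits from both series before calling plot, then add the predictions with lines. The data and variable names are made up for the example, and this is not the code of the package's plotting routine.

```r
set.seed(1)
orig <- 10 + rnorm(50, sd = 0.05)     # nearly flat original series
pred <- seq(10, 12, length.out = 50)  # forecast drifting out of that range

# Build the frame from both series so that neither is clipped
y_lim <- range(orig, pred)

pdf(NULL)  # draw off-screen; remove this line for interactive use
plot(orig, type = "l", ylim = y_lim, xlab = "Time", ylab = "Value")
lines(pred, col = "red")
invisible(dev.off())
```

Without the `ylim` argument, plot would size the frame from `orig` alone and most of `pred` would fall outside the visible area.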

A couple of plots illustrating the issue:

Current plot:
Desirable plot:

Interpretation of predictive images

I have two small questions:
1. There are two colored lines in the prediction plot. Which one represents the true values and which one represents the prediction?
2. When I make a prediction I get results, but I also get a warning:
In value[3L] :
The sigma matrix is computationally singular. Using the pseudo-inverse instead.
Does that matter?
Thank you!
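For context on the warning quoted in this question, a covariance matrix becomes singular when some variables are exact linear combinations of others, and a pseudo-inverse can stand in for the ordinary inverse in that case. The base R sketch below illustrates the idea with an SVD-based Moore-Penrose pseudo-inverse; `pseudo_inverse` is a hypothetical helper and not necessarily what the package does internally.

```r
# Covariance of two perfectly collinear variables: singular by construction
x <- c(1, 2, 3, 4, 5)
sigma <- cov(cbind(a = x, b = 2 * x))

# The ordinary inverse is expected to fail on this matrix
inverse <- tryCatch(solve(sigma), error = function(e) NULL)

# Moore-Penrose pseudo-inverse via SVD, dropping near-zero singular values
pseudo_inverse <- function(m, tol = sqrt(.Machine$double.eps)) {
  s <- svd(m)
  keep <- s$d > tol * max(s$d)
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

p_inv <- pseudo_inverse(sigma)

# Defining property: sigma %*% p_inv %*% sigma reproduces sigma
max(abs(sigma %*% p_inv %*% sigma - sigma))
```

Results are still produced in this situation, but a singular matrix usually signals redundant or constant variables in the data, which may be worth checking.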

The naming convention for time

Hi @dkesada
Sorry for disturbing you again. In my work with the package, I found that the naming convention for the time series data is _t_0 and so on. It does not matter much, but are there any tricks or hints on how to change the convention to a custom style in the plot produced by plot_dynamic_network? In medical research, the earliest time is usually named time 0.
Another question: can I plot marginal distributions for the dynamic Bayesian model, as is done in the bnlearn package: https://www.bnlearn.com/examples/graphviz-chart/

Weird results from predict_dt

Hey,

I'm trying to use predict_dt with a DBN I trained and as a result I get an almost constant value. I looked at the result data and for some reason it has constant values even for the features I'm not trying to predict (which I'm assuming is the problem). Did you ever come across something like that?

I tried it for a large number of samples for training and testing but it also happens for the following script that only uses one sample:
library(data.table)
library(dbnR)

# Forward slashes avoid invalid escape sequences such as "\T" in R strings
df <- read.csv("C:/Temp Data/1.csv")
sample_data_dt <- as.data.table(df)
dt_train <- sample_data_dt[1:300]
dt_val <- sample_data_dt[301:454]

# Note: `size` is not defined in the posted script; it must be set before this point
obj <- c("glideslope_error_deg_t_0")
net1 <- learn_dbn_struc(dt_train, size)
f_dt_train <- fold_dt(dt_train, size)
f_dt_val <- fold_dt(dt_val, size)
fit1 <- fit_dbn_params(net1, f_dt_train, method = "mle-g")
res <- suppressWarnings(predict_dt(fit1, f_dt_val, obj_nodes = obj, verbose = TRUE))

A small part of "res":

Questions about Integration of dbnR with Python, Hidden variables and filtered_fold_dt

Hello, sorry to bother you, but I have some doubts that I need to resolve urgently.
1. About the integration of dbnR with Python
Why can't I run the code in "python_integration.ipynb" successfully? The error is: AttributeError: 'DataFrame' object has no attribute 'iloc'. at In[3] line 22.

2. About hidden variables in DBNs
Does dbnR support dynamic Bayesian networks with hidden variables? Or can dbnR fit the parameters using Expectation Maximization?

3. About filtered_fold_dt
I'm confused about how to deal with a dataset consisting of multiple different time series. You mentioned that the function filtered_fold_dt can be used. I tried it, but it seems to have no effect: the fitting result is still the same.

Dependencies between time slices

Sorry to bother you, I have a little question to ask. Is our algorithm based on unstable conditions, and what improvements have been made compared with the traditional DBN? (1. The structure in each time slice of the network built by the traditional DBN is the same. 2. The dependency between time slices is also the same.) Are we only improving condition 2.

arrow point from t_2 to t_1 and t_0

Hi
I have a question after following the tutorial. In the plot drawn by dbnR::plot_dynamic_network(fit),
the arrows go from t_2 to t_1 and then to t_0. This confuses me, because the causal direction can only go from t_0 to t_1 and t_2: events that happened earlier can affect later events, but not the other way around.

Factorial Data Treatment

Hi,
Thanks for that package; I really appreciate that work!
Is an extension for dealing with non-numeric data planned as well? I very often find myself having to cope with mixed attribute types.
Otherwise, do you have a recommendation for a good way to overcome this?

How to get simple regressions instead of multiple regressions in parameter learning

Hello,

I would like to know how I could get the regression equations between each node and its parents separately, to later perform a meta-regression study, if possible. I already have the DBN structure. Is there any way to get them from the dbnR parameter learning step using mle-g?

Thank you in advance.

Best regards,
Irene

Does it support missing data in the dataframe?

I have a dataset with a high number of missing values, and the function threw an error. I was wondering if there is any documentation on how to handle missing data.

reference

Dear author,
I used your package in my research work. How should I cite it?
Best

Data set division

Hi,
I found a problem when building the network structure: different dataset divisions lead to different network structures. Does this have any effect, and what is the best ratio?
Thank you very much!

Prediction problem

obj_var <- c("Y_t_0")
res <- dbnR::predict_dt(fit, f_dt_val, obj_var)
res_fore <- suppressWarnings(dbnR::forecast_ts(f_dt_val, fit, obj_vars = c("Y_t_0"), ini = 1, len = 87))
res_fore

In this example, there are 87 rows of data after folding.
The prediction results are shown in the figure below (comparison between the real value and the predicted value). Why are the prediction results of the above two methods so different? One is very good, and the other is very bad. What is the difference between the two methods, and which method is more appropriate for real prediction?

What to do when we have data for t0,t1,t2...tn and want to predict t(n+1)

Hi,

Please let me know if I can use this package when we have the data for t0, t1, t2, ..., tn and want to predict tn+1.
If yes, how?

Thanks a lot!!!

Psoho and natPsoho, problem in R

Hi. I have a problem with the psoho and natPsoho algorithms. Once I use them, RStudio says "R Session Aborted: R encountered a fatal error. The session was terminated". It happens even with the simplest case:
data(motor)
size <- 3
dt_train <- motor[200:2500]
dt_val <- motor[2501:3000]
net <- learn_dbn_struc(dt_train, size, method = "psoho")

It is not a memory problem, and when I use method = "dmmhc" the function works well.

A week ago it worked fine, but now it happens on two different computers, with the motor data and with my own data. I have updated R to version 4.2.2 and RStudio to the latest version to try to solve it, but it is still failing.
I do not know if this error is reproducible or if it is just happening to me.