The model and datasets are described in Sanna Passino, F. and Heard, N. A., "Modelling dynamic network evolution as a Pitman-Yor process", Foundations of Data Science, 2019, 1(3):293-306.
This repository contains Python code used to perform network-wide anomaly detection in a computer network using the two-parameter Poisson-Dirichlet process, also known as the Pitman-Yor process (Pitman and Yor, 1997).
This code builds on the Hadoop-MapReduce procedure described in Heard and Rubin-Delanchy (2016). The Dirichlet process used in that paper is extended with an extra parameter, which allows more flexibility when modelling data that exhibit power-law behaviour.
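A computer network can be interpreted as a directed graph $G=(V,E)$, where $V$ is the node set of computers and $E \subseteq V \times V$ is the edge set of observed unique connections.
Let us assume that $x_1,\dots,x_n$ is a sequence of source computers that have connected to a destination computer $y$. For the given destination computer $y$, we assume that the exchangeable sequence of source computers has the following hierarchical distribution:

$$x_i \mid F \sim F, \quad i=1,\dots,n, \qquad F \sim \mathrm{PY}(\alpha, d, F_0),$$

where $\alpha$ is the strength parameter, $d$ is the discount parameter and $F_0$ is a base distribution on $V$.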
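The posterior predictive probability function (PPPF) implied by the Pitman-Yor process is:

$$P(x_{n+1}=x \mid x_1,\dots,x_n) = \frac{\alpha + dK_n}{\alpha+n}\,F_0(x) + \sum_{j=1}^{K_n}\frac{N_{jn}-d}{\alpha+n}\,\delta_{x_j^\ast}(x),$$

where $K_n$ is the number of unique source computers that have connected to $y$ up to time $n$, and $N_{jn}$ is the number of times the source computer $x_j^\ast$ has connected to $y$ in the first $n$ observed connections to the destination computer.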
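As an illustration of the predictive rule above, the following minimal Python sketch evaluates the Pitman-Yor predictive probability of a candidate source computer given the connections observed so far. The function name `py_predictive` and its arguments are hypothetical and are not taken from the repository; `alpha` is the strength parameter, `d` the discount parameter, and `base_prob` the probability assigned to the candidate by the base distribution (assumed uniform over the node set in the usage example).

```python
from collections import Counter


def py_predictive(history, candidate, alpha, d, base_prob):
    """Pitman-Yor predictive probability that the next source is `candidate`.

    history   : list of source computers already seen for one destination
    alpha, d  : strength and discount parameters (0 <= d < 1, alpha > -d)
    base_prob : base-distribution probability of `candidate` (an assumption
                of this sketch; e.g. 1/|V| for a uniform base distribution)
    """
    n = len(history)
    counts = Counter(history)   # number of connections per observed source
    k = len(counts)             # number of unique sources observed so far
    # Mass routed through the base distribution.
    new_mass = (alpha + d * k) / (alpha + n) * base_prob
    # Extra mass for a source that has already been observed.
    seen_mass = (counts[candidate] - d) / (alpha + n) if candidate in counts else 0.0
    return new_mass + seen_mass


# Usage: four possible sources, uniform base distribution (illustrative values).
nodes = ["C1", "C2", "C3", "C4"]
history = ["C1", "C1", "C2"]
probs = {x: py_predictive(history, x, alpha=1.0, d=0.5, base_prob=1 / len(nodes))
         for x in nodes}
```

With a base distribution that sums to one over the node set, the resulting probabilities also sum to one, which is a quick sanity check on any implementation.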
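Therefore, the $p$-value for the $(n+1)$-th observation is:

$$p_{n+1} = \sum_{x \in V:\, \beta(x) \le \beta(x_{n+1})} \beta(x),$$

where $\beta(x) = P(x_{n+1}=x \mid x_1,\dots,x_n)$. The code also uses mid-$p$-values $q_{n+1}$, where:

$$q_{n+1} = \frac{1}{2}\left[\sum_{x \in V:\, \beta(x) \le \beta(x_{n+1})} \beta(x) + \sum_{x \in V:\, \beta(x) < \beta(x_{n+1})} \beta(x)\right].$$

The mid-$p$-values might be preferable since the distribution of the source nodes is discrete.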
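The two quantities can be sketched in a few lines of Python. This is an illustrative example rather than the repository's implementation: the function `p_and_mid_p` is hypothetical, and it assumes the predictive probabilities of all candidate sources are available as a dictionary. The p-value sums the probabilities of all sources that are at most as likely as the observed one; the mid-p-value averages that sum with the same sum taken over strictly less likely sources.

```python
def p_and_mid_p(probs, observed):
    """Return (p-value, mid-p-value) for the observed source.

    probs    : dict mapping each candidate source to its predictive probability
    observed : the source computer that was actually observed
    """
    b = probs[observed]
    # Probability of an outcome at most as likely as the observed one.
    at_most = sum(p for p in probs.values() if p <= b)
    # Probability of an outcome strictly less likely than the observed one.
    strictly_less = sum(p for p in probs.values() if p < b)
    # Note: ties are detected by exact float comparison in this sketch.
    return at_most, 0.5 * (at_most + strictly_less)


# Usage with illustrative predictive probabilities.
example_probs = {"C1": 0.5, "C2": 0.25, "C3": 0.125, "C4": 0.125}
p_value, mid_p_value = p_and_mid_p(example_probs, "C3")
```

Because the mid-p-value counts ties with weight one half, it is always strictly smaller than the corresponding p-value.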
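A sequence of $p$-values $p_1,\dots,p_n$ can be combined in this code using 6 different methods, described in Heard and Rubin-Delanchy (2018):
- Edgington's method: $S_E = \sum_{i=1}^n p_i$, which has an Irwin-Hall distribution with parameter $n$, approximately $N(n/2, n/12)$ for large $n$
- Fisher's method: $S_F = -2\sum_{i=1}^n \log(p_i) \sim \chi^2_{2n}$
- Pearson's method: $S_P = -2\sum_{i=1}^n \log(1-p_i) \sim \chi^2_{2n}$
- George's method: $S_G = \sum_{i=1}^n \log\{p_i/(1-p_i)\}$, distributed as a sum of $n$ independent standard logistic random variables
- Stouffer's method: $S_S = \sum_{i=1}^n \Phi^{-1}(p_i) \sim N(0, n)$
- Tippett's method: $S_T = \min\{p_1,\dots,p_n\} \sim \mathrm{Beta}(1, n)$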
Note that the distributional results are only valid under normal behaviour of the network.
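As a rough illustration, the sketch below computes the statistics for Edgington's, Pearson's and George's methods using only the standard library, together with combined p-values for Edgington's and Pearson's methods. The function names are hypothetical and the code is not the repository's implementation. It assumes that the p-values are independent and uniform under normal behaviour, that they lie strictly between 0 and 1, and that small p-values indicate anomalies, so lower-tail probabilities are returned. It requires Python 3.8 or later for `math.comb`.

```python
import math


def edgington_statistic(pvals):
    """Sum of the p-values."""
    return sum(pvals)


def pearson_statistic(pvals):
    """-2 * sum(log(1 - p)); chi-square with 2n degrees of freedom under the null."""
    return -2.0 * sum(math.log1p(-p) for p in pvals)


def george_statistic(pvals):
    """Sum of the logits log(p / (1 - p))."""
    return sum(math.log(p) - math.log1p(-p) for p in pvals)


def edgington_combined(pvals):
    """Lower-tail probability from the exact Irwin-Hall CDF (small n only)."""
    n = len(pvals)
    s = edgington_statistic(pvals)
    total = sum((-1) ** k * math.comb(n, k) * (s - k) ** n
                for k in range(int(math.floor(s)) + 1))
    return total / math.factorial(n)


def pearson_combined(pvals):
    """Lower-tail probability of a chi-square with 2n degrees of freedom.

    Uses the closed-form survival function available for even degrees of freedom.
    """
    n = len(pvals)
    half = pearson_statistic(pvals) / 2.0
    upper = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(n))
    return 1.0 - upper
```

The alternating sum in `edgington_combined` becomes numerically unstable as the number of p-values grows, so the normal approximation is the safer choice for long sequences.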
In the code, the $p$-values and mid-$p$-values are combined in two different stages. Suppose that for a given destination computer $y$, the $p$-values (and mid-$p$-values) corresponding to each observed connection are computed using the PY posterior predictive probability.
- For all the connections on a given edge $(x,y)$, it is possible to combine the $p$-values and obtain a grouped $p$-value for each edge.
- Given the $p$-values for each edge, it is possible to combine the $p$-values for a given source computer $x$, in order to obtain a $p$-value for each source computer.

The $p$-values computed at the second stage give an anomaly score for the source node $x$, which can be used for anomaly detection.
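The two stages can be sketched in memory as follows. This is a simplified, single-machine illustration and not the Hadoop MapReduce implementation in this repository: the function `two_stage_scores` and the record layout `(source, destination, p_value)` are assumptions. The `combine` argument stands for any function that maps a list of p-values to a single p-value; the usage example passes `min` purely as an uncalibrated placeholder.

```python
from collections import defaultdict


def two_stage_scores(records, combine):
    """Combine per-connection p-values by edge, then by source computer.

    records : iterable of (source, destination, p_value) tuples
    combine : function mapping a list of p-values to one p-value
    """
    # Stage 1: group the per-connection p-values by edge and combine them.
    by_edge = defaultdict(list)
    for src, dst, p in records:
        by_edge[(src, dst)].append(p)
    edge_p = {edge: combine(ps) for edge, ps in by_edge.items()}

    # Stage 2: group the edge p-values by source computer and combine them.
    by_source = defaultdict(list)
    for (src, _dst), p in edge_p.items():
        by_source[src].append(p)
    node_p = {src: combine(ps) for src, ps in by_source.items()}
    return edge_p, node_p


# Usage with illustrative p-values; `min` is a placeholder combiner only.
records = [("C1", "C9", 0.2), ("C1", "C9", 0.6), ("C1", "C8", 0.5), ("C2", "C9", 0.9)]
edge_p, node_p = two_stage_scores(records, combine=min)
```

Source computers with the smallest second-stage values are the ones to inspect first.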
An example of a data line in the LANL authentication dataset (Kent, 2016) is:
1,C608@DOM1,C608@DOM1,C608,C467,Kerberos,Network,LogOn,Success
The following command returns the edge list `lanl_graph.txt`, with tab-separated source and destination, and weights given by the number of observed connections on each edge:
hadoop fs -text MY_FOLDER/auth.txt.gz | ./get_auth_graph.py > lanl_graph.txt
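For readers without a Hadoop installation, the kind of processing involved can be sketched in plain Python. This is not the code of `get_auth_graph.py`, whose exact filtering rules are not described here; the functions `edge_counts` and `format_edges` are hypothetical. The sketch assumes the nine-field LANL authentication format shown above, with the source computer in the fourth field and the destination computer in the fifth, and it assumes a three-column tab-separated output.

```python
from collections import Counter


def edge_counts(lines):
    """Count the observed connections on each (source, destination) edge."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 9:
            continue  # skip malformed lines (an assumption of this sketch)
        src, dst = fields[3], fields[4]
        counts[(src, dst)] += 1
    return counts


def format_edges(counts):
    """Format the weighted edge list as tab-separated source, destination, weight."""
    return ["{}\t{}\t{}".format(src, dst, w)
            for (src, dst), w in sorted(counts.items())]


# Usage on the example data line, repeated twice.
sample = ["1,C608@DOM1,C608@DOM1,C608,C467,Kerberos,Network,LogOn,Success"] * 2
edges = format_edges(edge_counts(sample))
```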
Given the LANL edge list, it is possible to obtain method of moments estimates of the hyperparameters $\alpha$ and $d$ using the code in `py_parameters.py`:
cat lanl_graph.txt | ./py_parameters.py
The first of the three Hadoop MapReduce procedures can be most simply run using the command:
./py_anon.sh &
where the anonymised file `py_anon.sh` in `1 - pvals (all)` is appropriately modified to give the correct `-input` and `-output`. Similar procedures can be carried out for the two remaining MapReduce procedures, using the `.sh` files in the folders `2 - pvals (edges)` and `3 - pvals (nodes)`.
- Sanna Passino, F. and Heard, N.A. (2019), "Modelling dynamic network evolution as a Pitman-Yor process", Foundations of Data Science 1(3), 293-306.
- Heard, N.A. and Rubin-Delanchy, P. (2016), "Network-wide anomaly detection via the Dirichlet process", Proceedings of IEEE workshop on Big Data Analytics for Cyber-Security Computing.
- Heard, N.A. and Rubin-Delanchy, P. (2018), "Choosing between methods of combining p-values", Biometrika 105(1), 239-246.
- Kent, A.D. (2016), "Cybersecurity data sources for dynamic network research", in Dynamic Networks and Cyber-Security, World Scientific.
- Pitman, J. and Yor, M. (1997), "The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator", Annals of Probability 25, 855-900.