Modelling dynamic network evolution as a Pitman-Yor process

The model and datasets are described in Sanna Passino, F. and Heard, N. A., "Modelling dynamic network evolution as a Pitman-Yor process", Foundations of Data Science, 2019, 1(3):293-306.

This repository contains Python code used to perform network-wide anomaly detection in a computer network using the two-parameter Poisson-Dirichlet or Pitman-Yor process (Pitman and Yor, 1997).

This code builds on the Hadoop-MapReduce procedure described in Heard and Rubin-Delanchy (2016). The Dirichlet process used in that paper is extended with an extra parameter, which allows for more flexibility when modelling data exhibiting power-law behaviour.

Methodology

The Pitman-Yor process

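A computer network can be interpreted as a directed graph $\mathbb{G}=(V,E)$, where $V$ is the node set of computers and $E\subseteq V\times V$ is the edge set of observed unique connections.

Let us assume that $x_1,\dots,x_N$ is a sequence of source computers that have connected to a destination computer $y$. For the given destination computer $y$, we assume that the exchangeable sequence $x_1,\dots,x_N$ of source computers has the following hierarchical distribution:

$$x_i \mid F \overset{\text{iid}}{\sim} F, \quad i=1,\dots,N,$$

$$F \sim \mathrm{PY}(\alpha, d, F_0),$$

where $\alpha$ is the strength parameter, $d$ is the discount parameter and $F_0$ is a base distribution on $V$.

The PPPF implied by the Pitman-Yor process is:

$$\mathbb{P}(x_{n+1}=x \mid x_1,\dots,x_n) = \frac{\alpha+dK_n}{\alpha+n}\,F_0(x) + \sum_{j=1}^{K_n}\frac{N_{jn}-d}{\alpha+n}\,\delta_{x_j^\ast}(x),$$

where $K_n$ is the number of unique source computers that have connected to $y$ up to time $n$, and $N_{jn}$ is the number of times the source computer $x_j^\ast$ has connected to $y$ in the first $n$ observed connections to the destination computer.

Therefore, the $p$-value for the $(n+1)$-th observation is:

$$p_{n+1} = \sum_{x\in V:\ \beta(x)\leq\beta(x_{n+1})}\beta(x),$$

where:

$$\beta(x) = \mathbb{P}(x_{n+1}=x \mid x_1,\dots,x_n).$$

The code also uses mid-$p$-values $q_{n+1}$, where:

$$q_{n+1} = \frac{1}{2}\left[\sum_{x\in V:\ \beta(x)\leq\beta(x_{n+1})}\beta(x) + \sum_{x\in V:\ \beta(x)<\beta(x_{n+1})}\beta(x)\right].$$

The mid-$p$-values might be preferable since the distribution of the source nodes is discrete.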
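As a rough illustration of the quantities above, the following sketch computes the Pitman-Yor predictive probabilities, the p-value and the mid-p-value for a new connection to one destination computer. It is not the repository's code: the function names `py_predictive` and `p_and_mid_p` are hypothetical, and the base distribution is assumed to be uniform over the node set, which the text does not state.

```python
from collections import Counter


def py_predictive(counts, n_nodes, alpha, d):
    """Return beta(x), the predictive probability that the next source is x.

    counts  : Counter with the number of past connections from each source.
    n_nodes : size of the node set; the base distribution is ASSUMED uniform.
    alpha, d: strength and discount parameters of the Pitman-Yor process.
    """
    n = sum(counts.values())                  # connections observed so far
    k = len(counts)                           # unique sources observed so far
    new_mass = (alpha + d * k) / (alpha + n)  # mass given to the base distribution

    def beta(x):
        seen = (counts[x] - d) / (alpha + n) if x in counts else 0.0
        return new_mass / n_nodes + seen

    return beta


def p_and_mid_p(x_new, counts, nodes, alpha, d):
    """p-value and mid-p-value of observing x_new as the next source."""
    beta = py_predictive(counts, len(nodes), alpha, d)
    b_new = beta(x_new)
    probs = [beta(x) for x in nodes]
    lower = sum(b for b in probs if b < b_new)   # strictly less likely nodes
    ties = sum(b for b in probs if b == b_new)   # equally likely nodes
    return lower + ties, lower + 0.5 * ties


if __name__ == "__main__":
    nodes = ["a", "b", "c", "d"]
    history = Counter({"a": 3, "b": 1})
    print(p_and_mid_p("c", history, nodes, alpha=1.0, d=0.5))
```

A source that has never connected to the destination receives a small p-value when most of the predictive mass sits on previously seen sources, which is the behaviour the anomaly score relies on.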

Combining p-values

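A sequence of $p$-values $p_1,\dots,p_m$ can be combined in this code using 6 different methods, described in Heard and Rubin-Delanchy (2018):

  • Edgington's method: $S_E = \sum_{i=1}^m p_i$, which is approximately $\mathrm{N}(m/2,\ m/12)$ for large $m$.

  • Fisher's method - let $S_F = -2\sum_{i=1}^m \log p_i$, then: $S_F \sim \chi^2_{2m}$.

  • Pearson's method: $S_P = -2\sum_{i=1}^m \log(1-p_i) \sim \chi^2_{2m}$.

  • George's method: $S_G = \sum_{i=1}^m \log\{p_i/(1-p_i)\}$, a sum of $m$ independent standard logistic variables, with mean $0$ and variance $m\pi^2/3$.

  • Stouffer's method - let $\Phi^{-1}$ denote the inverse of the CDF of a standard normal distribution, then: $S_S = \sum_{i=1}^m \Phi^{-1}(p_i) \sim \mathrm{N}(0, m)$.

  • Tippett's method (or minimum $p$-value method): $S_T = \min\{p_1,\dots,p_m\} \sim \mathrm{Beta}(1, m)$.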

Note that the distributional results are only valid under normal behaviour of the network.
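A minimal sketch of four of these combiners (Fisher, Pearson, Stouffer and Tippett), using only the Python standard library, is given below. The function names are hypothetical and this is not the repository's implementation; it assumes independent p-values lying strictly between 0 and 1, and it returns a combined p-value that is small when the individual p-values are small.

```python
import math
from statistics import NormalDist


def chi2_sf_even(x, m):
    """P(X > x) for X chi-squared with 2m degrees of freedom (closed form)."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def fisher(pvals):
    # -2 * sum(log p) is chi-squared with 2m degrees of freedom; upper tail.
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even(stat, len(pvals))


def pearson(pvals):
    # -2 * sum(log(1 - p)) is chi-squared with 2m degrees of freedom; lower tail.
    stat = -2.0 * sum(math.log(1.0 - p) for p in pvals)
    return 1.0 - chi2_sf_even(stat, len(pvals))


def stouffer(pvals):
    # The sum of normal quantiles is N(0, m); lower tail.
    nd = NormalDist()
    z = sum(nd.inv_cdf(p) for p in pvals) / math.sqrt(len(pvals))
    return nd.cdf(z)


def tippett(pvals):
    # The minimum of m uniforms is Beta(1, m), with CDF 1 - (1 - t)^m.
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


if __name__ == "__main__":
    pvals = [0.01, 0.20, 0.65]
    for combine in (fisher, pearson, stouffer, tippett):
        print(combine.__name__, combine(pvals))
```

With a single p-value, each of the four functions returns that p-value unchanged, which is a convenient sanity check.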

Anomaly detection

In the code, the $p$-values and mid-$p$-values are combined in two different stages. Suppose that for a given destination computer $y$, the $p$-values (and mid-$p$-values) corresponding to each observed connection are computed using the PY posterior predictive probability. Then:

  • for all the connections on a given edge $(x,y)$, it is possible to combine the $p$-values and obtain a grouped $p$-value for each edge,

  • given the $p$-values for each edge, it is possible to combine the $p$-values for a given source computer $x$, in order to obtain a $p$-value for each source computer.

The $p$-values computed at the second stage give an anomaly score for the source node $x$, which can be used for anomaly detection.
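The two stages can be sketched in memory as follows. This is an illustration only: the repository performs these steps as Hadoop MapReduce jobs, the function name `two_stage_scores` is hypothetical, and Fisher's method is used at both stages purely as an example choice of combiner. Input p-values are assumed to be strictly positive.

```python
import math
from collections import defaultdict


def fisher(pvals):
    """Fisher's combined p-value, via the closed-form chi-squared upper tail."""
    half = -sum(math.log(p) for p in pvals)   # half of the Fisher statistic
    term, total = 1.0, 1.0
    for k in range(1, len(pvals)):
        term *= half / k
        total += term
    return math.exp(-half) * total


def two_stage_scores(records, combine=fisher):
    """records: iterable of (source, destination, p_value) tuples.

    Stage 1 combines the p-values on each edge (source, destination).
    Stage 2 combines the edge-level p-values of each source computer.
    Returns a dict mapping each source to its anomaly score.
    """
    by_edge = defaultdict(list)
    for src, dst, p in records:
        by_edge[(src, dst)].append(p)

    by_source = defaultdict(list)
    for (src, _dst), pvals in by_edge.items():
        by_source[src].append(combine(pvals))

    return {src: combine(pvals) for src, pvals in by_source.items()}


if __name__ == "__main__":
    toy = [("C1", "C9", 0.04), ("C1", "C9", 0.30), ("C1", "C7", 0.02),
           ("C2", "C9", 0.80)]
    for source, score in sorted(two_stage_scores(toy).items()):
        print(source, score)
```

Sources with the smallest second-stage values would be the first candidates for investigation.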

Usage

Data preprocessing

An example of a data line in the LANL authentication dataset (Kent, 2016) is:

1,C608@DOM1,C608@DOM1,C608,C467,Kerberos,Network,LogOn,Success

The following command returns the edge list lanl_graph.txt, with tab-separated source and destination, and weights given by the number of observed connections on each edge:

hadoop fs -text MY_FOLDER/auth.txt.gz | ./get_auth_graph.py > lanl_graph.txt
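The kind of processing this step performs can be sketched as below. This is not the contents of get_auth_graph.py: the function names are hypothetical, the sketch assumes that the source and destination computers are the fourth and fifth comma-separated fields (as in the example line above), that no events are filtered out, and that the weight is written as a third tab-separated column.

```python
import sys
from collections import Counter


def edge_counts(lines):
    """Count the observed connections on each (source, destination) edge."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 5:
            continue                       # skip malformed or empty lines
        src, dst = fields[3], fields[4]    # source and destination computers
        counts[(src, dst)] += 1
    return counts


def format_edges(counts):
    """One 'source<TAB>destination<TAB>weight' string per edge, sorted."""
    return ["%s\t%s\t%d" % (src, dst, w)
            for (src, dst), w in sorted(counts.items())]


if __name__ == "__main__":
    # Reads authentication events from standard input, as in the pipe above.
    for row in format_edges(edge_counts(sys.stdin)):
        print(row)
```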

Given the LANL edge list, it is possible to obtain method of moments estimates of the hyperparameters $\alpha$ and $d$ using the code in py_parameters.py:

cat lanl_graph.txt | ./py_parameters.py

Hadoop procedures

The first of the three Hadoop MapReduce procedures can be most simply run using the command:

./py_anon.sh &

where the anonymised file py_anon.sh in 1 - pvals (all) is appropriately modified to give the correct -input and -output. Similar procedures can be carried out for the two remaining MapReduce procedures, using the .sh files in the folders 2 - pvals (edges) and 3 - pvals (nodes).

References

  • Sanna Passino, F. and Heard, N.A. (2019), "Modelling dynamic network evolution as a Pitman-Yor process", Foundations of Data Science 1(3), 293-306.

  • Heard, N.A. and Rubin-Delanchy, P. (2016), "Network-wide anomaly detection via the Dirichlet process", Proceedings of IEEE workshop on Big Data Analytics for Cyber-Security Computing.

  • Heard, N.A. and Rubin-Delanchy, P. (2018), "Choosing between methods of combining p-values", Biometrika 105(1), 239–246.

  • Kent, A.D. (2016), "Cybersecurity data sources for dynamic network research", in Dynamic Networks and Cyber-Security, World Scientific.

  • Pitman, J. and Yor, M. (1997), "The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator", Annals of Probability 25, 855-900.
