
Overview

In this repository you will find a Python implementation of Kitsune: an online network intrusion detection system based on an ensemble of autoencoders. From:

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai, "Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection", Network and Distributed System Security Symposium 2018 (NDSS'18)

What is Kitsune?

Neural networks have become an increasingly popular solution for network intrusion detection systems (NIDS). Their capability to learn complex patterns and behaviors makes them a suitable solution for differentiating between normal traffic and network attacks. However, a drawback of neural networks is the amount of resources needed to train them. Many network gateways and routers, which could potentially host an NIDS, simply do not have the memory or processing power to train, and sometimes even to execute, such models. More importantly, existing neural network solutions are trained in a supervised manner, meaning that an expert must label the network traffic and update the model manually from time to time.

Kitsune is a novel ANN-based NIDS which is online, unsupervised, and efficient. A Kitsune, in Japanese folklore, is a mythical fox-like creature that has a number of tails, can mimic different forms, and whose strength increases with experience. Similarly, Kitsune has an ensemble of small neural networks (autoencoders), which are trained to mimic (reconstruct) network traffic patterns, and whose performance incrementally improves over time.

The architecture of Kitsune is illustrated in the figure below:

  • First, a feature extraction framework called AfterImage efficiently tracks the patterns of every network channel using damped incremental statistics, and extracts a feature vector for each packet. The vector captures the temporal context of the packet's channel and sender.
  • Next, the features are mapped to the visible neurons of an ensemble of autoencoders (KitNET https://github.com/ymirsky/KitNET-py).
  • Then, each autoencoder attempts to reconstruct the instance's features, and computes the reconstruction error in terms of root mean squared error (RMSE).
  • Finally, the RMSEs are forwarded to an output autoencoder, which acts as a non-linear voting mechanism for the ensemble.

We note that while training Kitsune, no more than one instance is stored in memory at a time. Kitsune has one main parameter: the maximum number of inputs for any given autoencoder in the ensemble. This parameter is used to increase the algorithm's speed, with a modest trade-off in detection performance.

An illustration of Kitsune's architecture
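The damped incremental statistics that AfterImage maintains per channel can be sketched in a few lines of Python. This is a simplified illustration with our own class name and update scheme, not the repository's AfterImage code: older packets are exponentially down-weighted, so the mean and variance track recent behavior without storing past packets.

```python
import math

class DampedIncStat:
    """Sketch of a damped incremental statistic: older observations are
    exponentially down-weighted by a decay factor 2^(-lambda * dt)."""
    def __init__(self, lam):
        self.lam = lam      # decay rate lambda
        self.w = 0.0        # decayed weight (effective count)
        self.s1 = 0.0       # decayed linear sum
        self.s2 = 0.0       # decayed squared sum
        self.last_t = None

    def insert(self, v, t):
        if self.last_t is not None:
            d = 2 ** (-self.lam * (t - self.last_t))  # decay since last update
            self.w, self.s1, self.s2 = self.w * d, self.s1 * d, self.s2 * d
        self.last_t = t
        self.w += 1.0
        self.s1 += v
        self.s2 += v * v

    def mean(self):
        return self.s1 / self.w

    def var(self):
        return max(self.s2 / self.w - self.mean() ** 2, 0.0)

s = DampedIncStat(lam=0.1)
for t, v in [(0, 10.0), (1, 12.0), (2, 11.0)]:
    s.insert(v, t)
print(s.mean(), s.var())
```

Because only `w`, `s1`, `s2`, and the last timestamp are kept, the memory cost per statistic is constant regardless of traffic volume, which is what allows the single-instance memory footprint noted above.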

Some points about KitNET:

  • It is completely plug-and-play.
  • It is based on an unsupervised machine learning algorithm (it does not need labels; just train it on normal data!).
  • Its efficiency can be scaled with its input parameter m: the maximal size of any autoencoder in the ensemble layer (smaller autoencoders are exponentially cheaper to train and execute).
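The core operation each ensemble member performs — reconstruct the input and score it by RMSE — can be illustrated with a toy autoencoder. This is a sketch for intuition only, not KitNET's dA implementation; the weights below are random and untrained:

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(x, x_hat):
    """Root mean squared reconstruction error for one instance."""
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

# Toy autoencoder: m visible units compressed to h hidden units
m, h = 6, 3
W_enc = rng.normal(scale=0.1, size=(h, m))
W_dec = rng.normal(scale=0.1, size=(m, h))

def reconstruct(x):
    # encode, squash, decode
    return W_dec @ np.tanh(W_enc @ x)

x = rng.normal(size=m)           # one feature vector
score = rmse(x, reconstruct(x))  # high score = poorly reconstructed = anomalous
print(score)
```

After training on benign traffic, an autoencoder reconstructs normal feature vectors well (low RMSE) and anomalous ones poorly (high RMSE), which is what makes the ensemble's scores usable for detection.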

Implementation Notes:

  • This Python implementation of Kitsune is not optimal in terms of speed. To make Kitsune run as fast as described in the paper, the entire project must be Cythonized or implemented in C++.
  • For an experimental AfterImage version, change the import line in netStat.py to use AfterImage_extrapolate.py, and change line 5 of FeatureExtractor.py to True (uses Cython). This version uses Lagrange-based polynomial extrapolation to assist in computing the correlation-based features.
  • The scapy library is required for packet parsing when tshark (Wireshark), the default parser, is not available.
  • The source code has been tested with Anaconda 3.6.3 on a Windows 10 64-bit machine.

To install scapy, run in the terminal:

pip install scapy

Using The Code

Here is a simple example of how to make a Kitsune object:

from Kitsune import *
import numpy as np  # needed for np.Inf below


# KitNET params:
maxAE = 10 #maximum size for any autoencoder in the ensemble layer
FMgrace = 5000 #the number of instances taken to learn the feature mapping (the ensemble's architecture)
ADgrace = 50000 #the number of instances used to train the anomaly detector (ensemble itself)
packet_limit = np.Inf #the number of packets from the input file to process
path = "../../captured.pcap" #the pcap, pcapng, or tsv file which you wish to process.

# Build Kitsune
K = Kitsune(path,packet_limit,maxAE,FMgrace,ADgrace)

You can also configure the learning rate and the hidden layer's neuron ratio via Kitsune's constructor.

The input file can be any pcap network capture. When the object is created, the code checks whether you have tshark (Wireshark) installed. If you do, tshark is used to parse the pcap into a tsv file, which is saved to disk locally and later used when running KitNET. You can also load this tsv file instead of the original pcap to save time. Note that we currently only look for tshark in the Windows directory "C:\Program Files\Wireshark\tshark.exe".

If tshark is not found, the scapy packet parsing library is used instead. Scapy is significantly slower than the tshark/tsv approach.

To use the Kitsune object, simply tell Kitsune to process the next packet. After processing a packet, Kitsune returns the RMSE value of the packet (zero during the feature-mapping (FM) and anomaly-detection (AD) grace periods).

Here is an example usage of the Kitsune object:

while True: 
    rmse = K.proc_next_packet() #will train during the grace periods, then execute on all the rest.
    if rmse == -1:
        break
    print(rmse)
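Once the RMSE stream is collected, anomalies are typically flagged by fitting a distribution to the scores seen during the benign grace period and thresholding on its tail (the repo's example.py takes a similar log-normal approach). The synthetic scores and the 3-sigma cutoff below are our own illustrative assumptions:

```python
import math

# Hypothetical RMSE stream: mostly benign scores with two spikes.
rmses = [0.1, 0.12, 0.11, 0.09, 0.13, 0.1, 2.5, 0.11, 3.0]
benign = rmses[:6]  # scores observed during the grace period

# Fit a normal distribution to log(RMSE) of the benign scores
logs = [math.log(r) for r in benign]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))

# Flag packets whose log-score is far in the tail (3 sigma is our choice)
anomalies = [i for i, r in enumerate(rmses)
             if (math.log(r) - mu) / sigma > 3]
print(anomalies)  # → [6, 8], the indices of the spike packets
```

The cutoff is a tunable trade-off between false positives and missed detections; in practice it should be chosen against the benign score distribution of your own traffic.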

Demo Code

As a quick start, a demo script is provided in example.py. In the demo, we run Kitsune on a network capture of the Mirai malware. You can either run it directly or enter the following into your Python console:

import example

The code was written and tested with the Anaconda Python environment: https://anaconda.org/anaconda/python. For the significant speedups shown in our paper, you must implement Kitsune in C++ or entirely in Cython.

Full Datasets

The full datasets used in our NDSS paper can be found by following this Google Drive link: https://goo.gl/iShM7E

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citations

If you use the source code, the datasets, or implement KitNET, please cite the following paper:

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai, "Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection", Network and Distributed System Security Symposium 2018 (NDSS'18)

Yisroel Mirsky [email protected]

Contributors

willnewton, ymirsky

kitsune-py's Issues

The get_last() method of class Queue in AfterImage.pyx

In AfterImage.pyx, class Queue:

    cdef insert(self,double v):
        self.q[self.indx] = v
        self.indx = (self.indx + 1) % 3
        self.n += 1

    cdef get_last(self):
        return self.q[self.indx]

When n is 1 or 2, calling get_last will always return 0.
When n is 3, calling get_last returns the first element in the queue, not the last.
Should it instead be written as: return self.q[(self.indx-1)%3]?
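The off-by-one is easy to reproduce by re-creating the Cython class in plain Python (an illustration only, with the proposed fix added as a second method):

```python
# Plain-Python re-creation of the Queue logic from AfterImage.pyx
class Queue:
    def __init__(self):
        self.q = [0.0, 0.0, 0.0]  # fixed-size ring buffer of 3 slots
        self.indx = 0
        self.n = 0

    def insert(self, v):
        self.q[self.indx] = v
        self.indx = (self.indx + 1) % 3  # indx now points at the NEXT write slot
        self.n += 1

    def get_last(self):
        return self.q[self.indx]  # original code: reads the next write slot

    def get_last_fixed(self):
        return self.q[(self.indx - 1) % 3]  # proposed fix: reads the last write

q = Queue()
q.insert(1.0)
print(q.get_last(), q.get_last_fixed())  # → 0.0 1.0
```

After one insert, `get_last` returns the still-empty slot (0.0) while the proposed `(self.indx - 1) % 3` index correctly returns the most recent value.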

Continuously training the model

Does Kitsune continuously train the model while it executes anomaly detection, or does it just execute once the training period is done?

If not, why?

Overflow issues when executing network

Hello,
I am getting overflow warnings in the sigmoid function in the utils file. This gives me very large results for the RMSEs. The error only starts after 'training' is done and after the first error appears it keeps happening more and more often. I have changed the code a little so that instead of getting packets from a file, the FE receives them from a live capture. Before the error appears, the training seems to be going ok. Any ideas as to what could be going wrong? Could this be an issue with the way I parse the live capture packets?

Edit: Could this be happening because I am not reducing the value of my max int by a factor of 10? I just noticed this in the prep() method in the FE.

Thank you,
Miguel
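For reference, overflow warnings in a sigmoid usually come from evaluating np.exp(-x) for a large negative x. A numerically stable sigmoid never exponentiates a positive number; the version below is a general technique shown as a sketch, not a drop-in patch for the repository's utils.py:

```python
import numpy as np

def sigmoid_stable(x):
    """Sigmoid that avoids overflow: exp() only receives non-positive values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # -x <= 0 here: exp cannot overflow
    ex = np.exp(x[~pos])                      # x < 0 here: exp underflows to 0 at worst
    out[~pos] = ex / (1.0 + ex)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))  # ≈ [0, 0.5, 1], no warnings
```

Underflow to zero is harmless and silent in NumPy, so both branches stay warning-free even for extreme inputs.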

When I run example.py, I can't generate the right image of the detection result.

I ran the example from this project on Ubuntu 18 with Python 3.6.5 and got the following result. However, this result graph is not consistent with the result graph you shared. This result graph does not reflect the attack, and the data under attack appears to be drawn in the wrong place. I would appreciate it if you could tell me why and how to get the right result!

Repeated update_cov

I do not understand why incStatDB.update_get_1D_Stats updates all incStat.covs, and then incStatDB.update_get_2D_Stats updates one of them again in AfterImage.py.

How to use pytorch with detector

I need to use PyTorch for backpropagation to calculate adversarial samples, but I found that the torch framework is not used in this code. How can it be easily and quickly changed to torch?

Clean network traffic are not the 1st million packets

Greetings YisroelMirsky,
I wish to use your datasets as input to my models. However, upon looking into the I/O graphs of the captured pcap files, I found that there are no spikes of attack packets after the 1st million packets in the following datasets (I downloaded all 9 pcaps from the Google Drive link in your GitHub Kitsune project):

In the SSL renegotiation pcap:

As can be seen, after the first million packets, there is no significant rise in SSL filter line.

In the SSDP flood pcap:

Also, there is no abnormal behavior in the UDP filter line. I presume that in an SSDP flood attack, UDP packets are the attack vector. (The abnormal behavior of the UDP packets doesn't appear until the very end, after around 2,621,185 packets.)

Do I understand your statement of "clean network traffic was captured for the first 1 million packets " correctly? Or am I missing something?

Thanks,
Hieu

Not able to run example


Hi, I have the following packages in the virtualenv:

    Package    Version
    Cython     0.29.6
    numpy      1.16.2
    pip        19.0.3
    scapy      2.4.2
    setuptools 40.9.0
    wheel      0.33.1

When I run it, it gives an error such as:

    Importing AfterImage Cython Library
    Importing Scapy Library
    Traceback (most recent call last):
      File "example.py", line 1, in <module>
        from Kitsune import Kitsune
      File "/home/pi/vir1/Kitsune-py/Kitsune.py", line 2, in <module>
        from KitNET.KitNET import KitNET
      File "/home/pi/vir1/Kitsune-py/KitNET/KitNET.py", line 2, in <module>
        import KitNET.dA as AE
    ImportError: No module named dA

Please advise, thanks.

How do I tune the hyperparameters

I have used the code to detect anomalies in one of my datasets. When I used the active wiretap dataset, I managed to obtain the anomaly when FMgrace was 100K and ADgrace was around 800K. It was an almost horizontal plot with two anomaly spikes.

https://ibb.co/0CjJK1L

But when I use it on my dataset, the predicted RMSE values show more of a linear graph. I'm guessing it's because of overfitting.

https://ibb.co/ngpZv6K

I would like to know how and which parameters need to be changed to obtain a smoother result.

Thank you!

how to use Kitsune online

Hi, I have read your paper, which is wonderful work.
I am trying to use Kitsune as an online IDS on a Jetson Nano.
However, I am new to this field.
I want to know how to do online detection.
Do I need to capture packets with Wireshark first, generate a file.pcap, and then pass the file.pcap to Kitsune?
Or is there a way to pass the features to Kitsune directly?
Thank you so much for publishing such a wonderful paper and code!
Thanks in advance for any reply!

getHeaders_2D getting wrong index

In both AfterImage files, in the function getHeaders_2D(), we see this line:

    hdrs = incStat_cov(incStat(Lambda,IDs[0]),incStat(Lambda,IDs[0]),Lambda).getHeaders(ver,suffix=False)

I think the second index should be 1, like this:

    hdrs = incStat_cov(incStat(Lambda,IDs[0]),incStat(Lambda,IDs[1]),Lambda).getHeaders(ver,suffix=False)

RMSEs is always 0

Hey and thanks in advance for any help. My error description is probably not very helpful and rather broad but maybe anyone has some pointer as to how I could proceed in fixing it.

I am using a pcap file of IEC 104 communication (TCP/IP packets with a custom payload, which shouldn't matter since Kitsune only uses flow data, right?), but using example.py with my own pcap gives me an array of [0.0, ..., 0.0] as a result.

Is there any obvious thing I could be missing or does anyone have an idea on where I should start looking for the issue in order to debug this?

Best Regards

Some metrics are calculated differently from paper

Hi

I have noticed that some metrics are calculated differently from what is described in the paper, in particular:

  • Radius. The paper defines the radius as sqrt(var1^2+var2^2); however, in AfterImage.py, line 88, it is calculated as sqrt(var1+var2).

  • Covariance. The paper defines the covariance as SR_ij/(w_i+w_j), but AfterImage.py, line 203, defines a new weight w3 and divides SR_ij by w3.

Could you please clarify why these two metrics are calculated differently?
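For comparison, the two radius formulas give quite different values. The variances below are made up purely for illustration; var1 and var2 denote the two streams' variances:

```python
import math

var1, var2 = 4.0, 9.0

radius_paper = math.sqrt(var1 ** 2 + var2 ** 2)  # paper: sqrt(var1^2 + var2^2)
radius_code = math.sqrt(var1 + var2)             # AfterImage.py: sqrt(var1 + var2)

print(radius_paper, radius_code)  # → ~9.849 vs ~3.606
```

The two quantities scale differently as the variances grow, so they are not interchangeable features.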
