FlowPrint

This repository contains the code for FlowPrint by the authors of the NDSS FlowPrint [1] paper [PDF]. Please cite FlowPrint when using it in academic publications. This master branch provides FlowPrint as an out-of-the-box tool. For the original experiments from the paper, please check out the NDSS branch.

Introduction

FlowPrint introduces a semi-supervised approach for fingerprinting mobile apps from (encrypted) network traffic. We automatically find temporal correlations among destination-related features of network traffic and use these correlations to generate app fingerprints. These fingerprints can later be reused to recognize known apps or to detect previously unseen apps. The main contribution of this work is to create network fingerprints without prior knowledge of the apps running in the network.

Documentation

We provide extensive documentation, including installation instructions and an API reference, at flowprint.readthedocs.io.

References

[1] van Ede, T., Bortolameotti, R., Continella, A., Ren, J., Dubois, D. J., Lindorfer, M., Choffnes, D., van Steen, M. & Peter, A. (2020, February). FlowPrint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic. In 2020 NDSS. The Internet Society.

Bibtex

@inproceedings{vanede2020flowprint,
  title={{FlowPrint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic}},
  author={van Ede, Thijs and Bortolameotti, Riccardo and Continella, Andrea and Ren, Jingjing and Dubois, Daniel J. and Lindorfer, Martina and Choffnes, David and van Steen, Maarten and Peter, Andreas},
  booktitle={NDSS},
  year={2020},
  organization={The Internet Society}
}


flowprint's Issues

Trouble preprocessing pcaps and generating fingerprints

As per the README file of the repository, I'm trying to preprocess pcaps from the Andrubis dataset into flows that FlowPrint can interpret, but no matter which pcaps I use as input, the output fingerprints seem to be empty. Is there a specific input format required for the pcaps?

python -m flowprint --pcaps c_dataset/ --write flows.p

Reading c_dataset/...
/home/kumailraza/FlowPrint/flowprint/reader.py:64: UserWarning: tshark exception: '[Errno 2] No such file or directory: 'tshark': 'tshark'', defaulting to pyshark
.format(ex))

Output fingerprints:

flows.p:

�^Ccnumpy.core.multiarray
_reconstruct
q^@cnumpy
ndarray
q^AK^@�q^BC^Abq^C�q^DRq^E(K^AK^@�q^Fcnumpy
dtype
q^GX^B^@^@^@o8q^HK^@k^A�q Rq
(K^CX^A^@^@^@|q^KNNNJ����J����K?tq^Lb�]q^Mtq^Nbh^@h^AK^@�q^Oh^C�q^PRq^Q(K$
�]q^Stq^Tb�q^U.
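The bytes above are not corruption: a file written with --write is a Python pickle, and the visible headers name numpy's ndarray machinery. A quick way to check whether the file actually holds any flows is to unpickle it and look at the array lengths. This is a sketch with a stand-in file; the tuple-of-arrays layout is an assumption based on the dump, not confirmed against flowprint's save format:

```python
import pickle
import numpy as np

# Stand-in for flows.p: the dump suggests it holds pickled numpy arrays.
X = np.zeros(0, dtype=object)  # flows (empty, as the issue describes)
y = np.zeros(0, dtype=object)  # labels
with open("flows_demo.p", "wb") as f:
    pickle.dump((X, y), f)

# Inspect the file: zero-length arrays mean no flows were extracted
# from the input pcaps (e.g. because tshark was missing or the capture
# could not be parsed).
with open("flows_demo.p", "rb") as f:
    X_loaded, y_loaded = pickle.load(f)
print(len(X_loaded), len(y_loaded))
```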

TypeError: 'Flow' object is not iterable

Describe the bug
Hello, I'm trying to run FlowPrint on a particular dataset but I get an error in flowprint.py (line 206):
matches = list(set().union(*[lookup.get(x, set()) for x in fingerprint]))

The error says "TypeError: 'Flow' object is not iterable"
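For context, the failing line unions the lookup sets for every element of the fingerprint, which only works if the fingerprint is iterable. A stand-alone reproduction with toy data (the Flow class below is a hypothetical stand-in, not the real one):

```python
# What the line in flowprint.py (206) computes, with toy data:
lookup = {"dst_a": {"app1"}, "dst_b": {"app1", "app2"}}
fingerprint = ["dst_a", "dst_b", "dst_c"]  # iterable of destinations: works
matches = list(set().union(*[lookup.get(x, set()) for x in fingerprint]))
print(sorted(matches))  # ['app1', 'app2']

# Passing a non-iterable object instead reproduces the reported error.
class Flow:  # hypothetical stand-in for the real Flow class
    pass

try:
    list(set().union(*[lookup.get(x, set()) for x in Flow()]))
except TypeError as err:
    print(err)  # 'Flow' object is not iterable
```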

To Reproduce
I just copied the sample code that was made available and ran it on my own dataset:

preprocessor = Preprocessor(verbose=True)
X, y = preprocessor.process(files = pcaps,
                            labels = labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

flowprint = FlowPrint(
    batch       = 300,
    window      = 30,
    correlation = 0.1,
    similarity  = 0.9
)

flowprint.fit(X_train, y_train)
y_recognize = flowprint.recognize(X_test)

Expected behavior
I believe that the error should not happen

Screenshots
A screenshot (not reproduced here) showed the X_test being passed and the resulting error.

Additional context
I installed flowprint from the command line with pip install flowprint. I'm using macOS 10.15.5 to run the code, and my dataset consists of data from 5 applications in .pcap format. I'm using Python 3.7.9 and TShark 3.0.3.

Thank you for your attention! :D

Request for Raw PCAP Files of ReCon and ReCon Extended Datasets

Hello,

Firstly, I'd like to commend the team for the exceptional work on this project. The insights and results presented in the paper are truly commendable.

I'm reaching out to inquire if there's a repository or location where I can access the raw PCAP files for both the ReCon and ReCon Extended datasets mentioned in your paper. Having access to these files would greatly assist in further research and analysis.

Thank you for your time and consideration. I look forward to your response.

A question about FingerprintGenerator._fit_single_batch_

Describe the question
Every batch creates a new Cluster object, so when I test my data I have to create a new cluster to recompute the fingerprints. Also, adding samples to the cluster with the concatenate() function seems useless, because self.samples is always zeros(0) in a newly created cluster object.

    # A new Cluster is created on every call to _fit_single_batch_(),
    # so self.samples starts out as zeros(0)
    cluster = Cluster()
    # Cluster flows into network destinations
    cluster.fit(X, y)
    # Add X to samples; since the cluster is new, concatenating with an
    # empty array is equivalent to assignment. Why not self.samples = X?
    self.samples = np.concatenate((self.samples, X))
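The observation about concatenate can be verified directly: concatenating a fresh zeros(0) buffer with X yields exactly X's contents, so for a brand-new object the two variants are equivalent, and the difference would only matter if fit were ever called on an object that already holds samples:

```python
import numpy as np

samples = np.zeros(0)           # state of a freshly created object
X = np.array([1.0, 2.0, 3.0])   # toy batch of data

combined = np.concatenate((samples, X))
print(np.array_equal(combined, X))  # True: a no-op prepend for a new object
```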

What I think
I think the cluster should be created as a member variable of the FingerprintGenerator and initialized in __init__(). When I train on my data, I need to save both the fingerprints and the cluster; when I predict, I need to load the cluster too.

Am I right, or is my understanding of the method wrong? Thanks for replying!

Problem about accuracy

# trainSet: the path of trainSet (dir of train pcap files)
# trainLabel: the package name of traffic
# testSet: the path of testSet (dir of test pcap files)
def flowprintTest(trainSet, trainLabel, testSet):
    preprocessor = Preprocessor(verbose=True)
    if(os.path.exists("./flows.p")):
        X_train, y_train = preprocessor.load("./flows.p")
    else:
        X_train, y_train = preprocessor.process(files=trainSet,
                                labels=trainLabel)
        preprocessor.save('flows.p', X_train, y_train)

    flowprint = FlowPrint(
        batch       = 300,
        window      = 30,
        correlation = 0.1,
        similarity  = 0.9
    )
    flowprint.fit(X_train, y_train)

    fingerprints = flowprint.fingerprints
    with open("./fingerprints","w") as fp:
        for fingerprint in fingerprints:
            fp.write(fingerprints[fingerprint] + " " + json.dumps(fingerprint.to_dict())+"\n")

    totalNum = 0.0
    trueNum = 0.0
    for testData in testSet:
        fileName = testData.split("\\")[-1]
        fileName = fileName.split(".pcap")[0]
        # Get X_test fingerprints
        X_test = np.array(list(preprocessor.extract(testData).values()))
        # Is this the right function?
        testPrints = flowprint.fingerprinter.fit_predict(X_test)
        y_recognize = flowprint.recognize(testPrints)
        if(fileName in y_recognize.tolist()):
            trueNum += 1
        else:
            with open("./result.txt","a") as fp:
                fp.write(fileName+"\n")
                if(np.size(y_recognize)==0):
                    fp.write("[]"+"\n")
                else:
                    fp.write(y_recognize[0]+"\n")
                for testPrint in testPrints.tolist():
                    fp.write(json.dumps(testPrint.to_dict())+"\n")
                fp.write("############\n")
        totalNum += 1

    if(totalNum!=0):
        Trate = trueNum/totalNum
        print("Success rate is " + str(Trate))
    return

if __name__ == "__main__":
    fileNames = os.listdir(trainPath)
    fileNames.sort()
    labels = []
    inputFlows = []
    testFlows = []
    for fileName in fileNames:
        labels.append(fileName.split(".pcap")[0])
        inputFlows.append(trainPath+fileName)

    fileNames = os.listdir(testPath)
    for fileName in fileNames:
        testFlows.append(testPath+fileName)

    flowprintTest(inputFlows,labels,testFlows)

I'm using this function to reproduce the experiment, but I can only get 50% accuracy.
The dataset I am using is the China (Android) set from https://recon.meddle.mobi/cross-market.html.
I split the dataset in half: one half as the training set, the other half as the test set.
Am I using it wrong?

Some exception in reader.py

Describe the bug
In function read_pyshark

def read_pyshark(self, path):
     pcap = iter(pcap_obj)
.....
     pcap.close()

The iterator pcap does not have a close() function, so this raises an exception, RuntimeError: Event loop is closed:


Exception ignored in: <bound method Capture.__del__ of <FileCapture android\appinventor.ai_reflectiveapps.rbxlook.pcap>>
Traceback (most recent call last):
  File "xxx\lib\site-packages\pyshark\capture\capture.py", line 435, in __del__
    self.close()
  File "xxx\site-packages\pyshark\capture\capture.py", line 426, in close
    self.eventloop.run_until_complete(self.close_async())
  File "xxx\lib\asyncio\base_events.py", line 443, in run_until_complete
    self._check_closed()
  File "xxx\lib\asyncio\base_events.py", line 357, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
\fingerprint.py:159: RuntimeWarning: coroutine 'Capture.close_async' was never awaited
  return hash(frozenset([x for x in self]))

So I changed it like this:

def read_pyshark(self, path):
     pcap_obj = pyshark.FileCapture(path)
     pcap = iter(pcap_obj)
.....
     pcap_obj.close()

and it worked!
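A more defensive variant of the same fix is contextlib.closing, which guarantees close() is called on the capture object even if iteration raises. This is a sketch with a stand-in class (FakeCapture is hypothetical; the real object would be pyshark.FileCapture), illustrating the key point that close() belongs to the capture object, not to the iterator derived from it:

```python
from contextlib import closing

class FakeCapture:
    """Stand-in for pyshark.FileCapture: iterable, with close() defined
    on the object itself rather than on the iterator derived from it."""
    def __init__(self, packets):
        self._packets = packets
        self.closed = False
    def __iter__(self):
        return iter(self._packets)
    def close(self):
        self.closed = True

# close() is always called on pcap_obj, mirroring the fix above.
with closing(FakeCapture(["pkt1", "pkt2"])) as pcap_obj:
    packets = [p for p in iter(pcap_obj)]
print(packets, pcap_obj.closed)  # ['pkt1', 'pkt2'] True
```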

ModuleNotFoundError: No module named 'flow_generator'

Describe the bug
I want to preprocess pcap files but I got this error

code :

from flowprint.preprocessor import Preprocessor

if __name__ == '__main__':

    pcaps = ['Whatsapp Messenger_D_1_Final.pcap','Whatsapp Messenger_C_4_Final.pcap']
    labels =  ['WhatsappD1', 'WhatsappC2']

    # Load data
    preprocessor = Preprocessor(verbose=True)
    X, y = preprocessor.process(files=pcaps, labels=labels)

error:
Traceback (most recent call last):
  File "C:\Python\python39\lib\site-packages\flowprint\preprocessor.py", line 7, in <module>
    from .reader import Reader
  File "C:\Python\python39\lib\site-packages\flowprint\reader.py", line 1, in <module>
    from cryptography import x509
ModuleNotFoundError: No module named 'cryptography'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python\python39\lib\site-packages\flowprint\preprocessor.py", line 10, in <module>
    from flow_generator import FlowGenerator
ModuleNotFoundError: No module named 'flow_generator'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Flowprint\test\mycode\main.py", line 2, in <module>
    from flowprint.preprocessor import Preprocessor
  File "C:\Python\python39\lib\site-packages\flowprint\preprocessor.py", line 13, in <module>
    raise ValueError(e)
ValueError: No module named 'flow_generator'
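The chained traceback is the clue: the first failure is the real one. preprocessor.py appears to fall back to a second import path when the first fails, so the final message names flow_generator while the actually missing dependency is cryptography (installable with pip install cryptography). The masking pattern can be reproduced with purely hypothetical module names:

```python
def load_with_fallback():
    # Mirrors the pattern visible in the traceback: a failed primary import
    # triggers a fallback import, whose own failure is what gets reported.
    try:
        import missing_primary_dep  # stands in for 'cryptography'
    except ImportError:
        try:
            import missing_fallback_dep  # stands in for 'flow_generator'
        except ImportError as e:
            raise ValueError(e)

try:
    load_with_fallback()
except ValueError as e:
    message = str(e)
print(message)  # names the fallback module, not the truly missing primary
```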

A question about per-device fingerprinting

I have a question regarding the implementation of the fingerprinting technique in the code you shared.
In the article, you mentioned that the fingerprint should be generated separately for each device because identifying apps on a per-device basis assists in limiting the amount of dynamic behavior. However, when I reviewed the code you kindly shared, I noticed that the code does not seem to follow this approach.

Could you please provide some insights to help me understand this discrepancy? Is there something I may have missed or misunderstood?

How to use the ReCon dataset

I would greatly appreciate it if you could provide a brief explanation of how to use the ReCon dataset in its current format.

'tshark' Error in running the preprocessing of pcaps

python -m flowprint --pcaps ../exec_clustering/pcaps_frompcaps_dataset --write flow.p

Reading ../exec_clustering/pcaps_frompcaps_dataset...
/home/kumailraza/FlowPrint/flowprint/reader.py:64: UserWarning: tshark exception: '[Errno 2] No such file or directory: 'tshark': 'tshark'', defaulting to pyshark
.format(ex))

Output fingerprints:

It does create the output file flow.p but then the fingerprinting step gives an error.

Problem storing the fingerprints in one .json file

As shown on flowprint.readthedocs.io, I try to use the command-line tool with this command (I have already transformed one .pcap file into flows and stored them in d:\flows.p):

python -m flowprint --read d:\flows.p --fingerprint d:\fingerprints.json

I got an error :

Traceback (most recent call last):
  File "C:\Users\clairel\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\clairel\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\FlowPrint-master\FlowPrint-master\flowprint\__main__.py", line 233, in <module>
    fingerprint(flowprint, args)
  File "D:\FlowPrint-master\FlowPrint-master\flowprint\__main__.py", line 92, in fingerprint
flowprint.save(outfile)
File "D:\FlowPrint-master\FlowPrint-master\flowprint\flowprint.py", line 319, in save
for fp in fingerprints or self.fingerprints]
File "D:\FlowPrint-master\FlowPrint-master\flowprint\flowprint.py", line 319, in <listcomp>
for fp in fingerprints or self.fingerprints]
File "D:\FlowPrint-master\FlowPrint-master\flowprint\fingerprint.py", line 114, in to_dict
'certificates': self.certificates,
File "D:\FlowPrint-master\FlowPrint-master\flowprint\fingerprint.py", line 99, in certificates
return sorted([x for x in self if not isinstance(x, tuple)])
TypeError: '<' not supported between instances of 'NoneType' and 'int'

The version of python is Python 3.7.6.

Is something wrong, or did I make a mistake?
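The traceback bottoms out in sorted(): Python 3 cannot order None against int, so at least one certificate value extracted from the flows is None. A minimal reproduction, plus one common workaround (filtering is only an illustration, not necessarily the fix applied in flowprint):

```python
certificates = [3, None, 1]  # one extracted certificate value is None

try:
    sorted(certificates)
except TypeError as err:
    print(err)  # '<' not supported between instances of 'NoneType' and 'int'

# One workaround: drop the None entries before sorting.
print(sorted(x for x in certificates if x is not None))  # [1, 3]
```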

Why identify a Flow using these features in flow_generator.py?

Hello, recently I've been trying to reproduce the results in FlowPrint, and I was confused about why flow_generator.py uses these features:

for packet in packets:
     key = (packet[0], packet[1], packet[2])
     # Add packet to flow
     result[key] = result.get(key, Flow()).add(packet)

from the comment we can see that:
packet[0] is "Filename of capture",
packet[1] is "Protocol",
packet[2] is "TCP/UDP stream identifier".

From the article, a Flow is identified as "a group of packets within a burst that have the same (ip source, ip destination, sport, dport, protocol)-tuple". Therefore, I reckon that we should use:

flow_key = (packet[5], packet[7], packet[6], packet[9], packet[2])

Or do I misunderstand the work?
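For comparison, grouping by the paper's 5-tuple looks like the sketch below (toy packet tuples; the field positions here are this sketch's own, not flow_generator.py's actual layout). Note that, as I understand it, tshark's TCP/UDP stream identifier is itself assigned per connection, i.e. derived from the same 5-tuple, so within a single capture the two groupings can coincide:

```python
# Toy packets: (src_ip, dst_ip, sport, dport, protocol, payload)
packets = [
    ("10.0.0.1", "93.184.216.34", 51000, 443, "TCP", b"a"),
    ("10.0.0.1", "93.184.216.34", 51000, 443, "TCP", b"b"),
    ("10.0.0.1", "8.8.8.8",       53210,  53, "UDP", b"c"),
]

# Group packets into flows by the (src, dst, sport, dport, proto) tuple
# described in the paper.
flows = {}
for src, dst, sport, dport, proto, payload in packets:
    key = (src, dst, sport, dport, proto)
    flows.setdefault(key, []).append(payload)

print(len(flows))  # 2 flows: one TCP connection, one UDP exchange
```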
