
lans's Introduction

About Graph-Simulation

This project generates a simulated graph from a given input graph. The script depends strongly on the column names, the column types, and the format of the input file.
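For reference, the seed inputs are binetflow files (NetFlow exports from the CTU datasets discussed in the issues below). As an illustration of the expected format, a typical header line looks like the following; the exact columns are an assumption based on the standard CTU-13 binetflow layout, since the required columns are defined by the script itself:

    StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label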

This script requires the following software:

  1. Spark 2.1.0, prebuilt for Hadoop 2.7 and later

  2. R 3.2.1 to generate graph properties.

  3. Python 2.7.8, along with the following packages: openmpi-1.10 (used via mpi4py), numpy 1.10.4, pandas 0.23.3, sklearn 0.18.1 (cluster, KMeans, kneighbors_graph, scale), and the standard-library modules random, csv, gc (garbage collector), sys, and subprocess.

  4. The code must be run with at least 4 processors. In this version, each processor is responsible for one role, so the minimum number of processors must equal the number of roles; a minimal rank-count check is sketched after this list. (This restriction will be removed in future versions.)
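Since each MPI rank handles one role, a guard like the following can fail fast when too few processors are supplied. This is a sketch using mpi4py (which LANS depends on); the role count of 4 is taken from the minimum stated above.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    NUM_ROLES = 4  # one processor per role in this version

    if comm.Get_size() < NUM_ROLES:
        if comm.Get_rank() == 0:
            print('Need at least %d processors (one per role), got %d'
                  % (NUM_ROLES, comm.Get_size()))
        comm.Abort(1)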

How to run Graph-Simulation:

  1. Download "spark-2.1.0-bin-hadoop2.7.tgz" this package from http://spark.apache.org/downloads.html , unzip and save it in your cluster.
  2. Make all the files in the "/Spark/..../bin" directory executable (e.g. using the command chmod a+x bin/*).
  3. Download our project and unzip it.
  4. Copy all the jar files from the "Required_Packages" directory of our project to "/Spark/..../jars" directory of Spark.
  5. Copy our project (except the "Required_Packages" directory) into the working directory of your cluster. Compile src/GraphProperties.scala, src/KCore.scala, and src/Properties.scala to /Spark/.../jars/Properties.jar.
  6. Keep all the input files (e.g. 11.binetflow, 5.binetflow, and so on) in the "input_files" directory of our project; these are used as seed graphs when generating the large-scale simulated graph. Any seed graph placed outside "input_files" will not be read as input.
  7. Configure the Python package, the R package, and the Spark home directory in the cluster's default configuration file.
  8. Submit the job "runProject.pbs" to the cluster.
  9. When the job finishes, the simulated graph is written to the "SimulatedGraph" directory.

Expected Results

  1. When Graph-Simulation-4 is run, it creates a new folder named SimulatedGraph. The contents of this folder should be a series of small graphs beginning with "localgen_0.csv" and ending with "localgen_N.csv", where N is the number of local graphs generated minus 1. The folder will also contain a file named upperlevelGraph.csv, which holds the connections between local graphs and unites them into a single larger graph.
  2. To verify that the simulation completed correctly, compare the number of files in the SimulatedGraph folder with the number of processors specified in the PBS script used to run the simulation; a quick check is sketched after this list. If the simulation is complete, all localgen_*.csv files and upperlevelGraph.csv should be present and non-empty.
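A minimal completeness check; it assumes one local graph per processor, and the processor count of 16 is a placeholder for whatever the PBS script requests:

    import glob
    import os

    num_processors = 16  # placeholder: match the PBS script

    local_files = glob.glob(os.path.join('SimulatedGraph', 'localgen_*.csv'))
    # Assumption: one localgen_*.csv per processor, plus the upper-level graph.
    assert len(local_files) == num_processors, 'missing local graph files'

    for path in local_files + [os.path.join('SimulatedGraph', 'upperlevelGraph.csv')]:
        assert os.path.getsize(path) > 0, path + ' is empty'
    print('Simulation output looks complete.')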

lans's People

Contributors

col11, hmedal, jonathandavis552, sarahharun, zhangfangyan


lans's Issues

Issue from Sponsor (IndexError: list index out of range)

From Mandy Sack:

Issue:
When I run with only 1 data set (5.binetflow) I receive the following errors:


('Number of Processors: ', 16)


Traceback (most recent call last):
File "Enterprise_Connection_With_Graph_Simulation.py", line 103, in
main()
File "Enterprise_Connection_With_Graph_Simulation.py", line 50, in main
graphList.append(choice(org_graphList[1:len(org_graphList)]))
File "/opt/gd/lang/python-2.7.11/lib/python2.7/random.py", line 275, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
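The failing line takes a random choice from org_graphList[1:], which is empty when only one input file is supplied, and random.choice raises IndexError on an empty sequence. A minimal sketch of a guard follows; the variable names are taken from the traceback, and the fallback behaviour is an assumption:

    from random import choice

    org_graphList = ['5.binetflow']  # example: a single seed graph
    graphList = []

    candidates = org_graphList[1:]   # empty when there is only one input file
    if candidates:
        graphList.append(choice(candidates))
    else:
        # Fall back to the lone seed graph instead of letting choice() raise.
        graphList.append(org_graphList[0])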

Get LANS running on Cray machine

Lastly, they are attempting to build/run the code on a Cray XC40 machine, and the version of mpi4py that LANS requires needs to be rebuilt for that machine. I am wondering, if your group now has access to the Cray systems, whether you can try to provide a release that works/installs simply on an XC40 machine (only if you have access to one from Cray).

LANS-V4: Error in graph property calculation for scenarios 1 and 3

Hadoop's local filesystem raises ParentNotDirectoryException when asked to create a directory whose parent path is a regular file; here the Spark job is apparently trying to write its output under the input file path itself:

Exception in thread "main" org.apache.hadoop.fs.ParentNotDirectoryException: Parent path is not a directory: file:/work/sharun/LargeScenarios/input_files/1.binetflow
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:523)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:531)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:531)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:694)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:313)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:118)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:124)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at Properties$.main(Properties.scala:71)
at Properties.main(Properties.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Read 4 items
Error in if (substr(files[k], 1, 4) == "part") { :
missing value where TRUE/FALSE needed
Execution halted

(The downstream R step then halts because files[k] is NA: presumably fewer output files than expected exist after the failed Spark write.)

LANS-V5: 3D histogram code failed to handle multiple input files

/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/indexing.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Traceback (most recent call last):
File "create_3D_edge_attribute_histograms.py", line 90, in
merged_df = pd.read_csv(temp_folder + 'merged_dataframe
' + ctu_files[w].split('.', 1)[0] + '.csv')
File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 250, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 566, in init
self._make_engine(self.engine)
File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 705, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1072, in init
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 350, in pandas.parser.TextReader.cinit (pandas/parser.c:3187)
File "pandas/parser.pyx", line 594, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5930)
IOError: File /work/sharun/1000_bin/temp/merged_dataframe_6.csv does not exist

version 5, error in get_histograms

Error at the line

    str = each[1].split(",", 2)

in the function get_histograms, caused by an incorrect version of create_attribute_histograms.py or by incorrect attribute files being used as inputs.

create 3d histograms throwing error file not found

Creating the 3D histograms threw an error: the merged dataframe would be created, and then the code could not find the completed dataframe. This turned out to be caused by hardcoding the lookup to a .csv file, when the actual result could be either .csv or .binetflow.
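A minimal sketch of the kind of fix, trying both extensions instead of hardcoding one; the helper below is illustrative, not the project's actual code:

    import os

    ALLOWED_EXTENSIONS = ('.csv', '.binetflow')

    def find_merged_file(temp_folder, base_name):
        # Return the first merged-dataframe file for base_name, whichever
        # extension it was written with.
        for ext in ALLOWED_EXTENSIONS:
            candidate = os.path.join(temp_folder, 'merged_dataframe_' + base_name + ext)
            if os.path.exists(candidate):
                return candidate
        raise IOError('no merged dataframe found for ' + base_name)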

Fix pandas issue

This issue needs to be fixed in versions 5 and 6.1.

The first issue was noticed with pandas (the same issues occur in both versions of LANS). The version of pandas available on the system was 0.21.1; none of these issues are seen when using pandas version 0.19.1.
In the file role_mining.py, an error occurs at the lines:
    feature_data = pd.read_csv(feature_file,delimiter=',',usecols=[0,1,2,3,4,5,6])
    features = feature_data[[1,2,3,4,5,6]].as_matrix()
 
What I did to work around it before rolling back to pandas version 0.19.1 was to specify those columns explicitly, which made it past that error.
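One plausible cause (an assumption, since the exact error text is not given) is pandas' label- versus position-based column selection: feature_data[[1, 2, ...]] selects by label, which breaks when the columns are not actually named 1 through 6. Selecting by position with .iloc, and using .values in place of the since-removed as_matrix(), sidesteps the ambiguity. A sketch, not the project's code:

    import pandas as pd

    feature_file = 'features.csv'  # hypothetical stand-in for the real path

    feature_data = pd.read_csv(feature_file, delimiter=',',
                               usecols=[0, 1, 2, 3, 4, 5, 6])

    # Columns 1..6 selected by position rather than by label; .values
    # replaces the deprecated as_matrix().
    features = feature_data.iloc[:, 1:7].values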

Fix networkx issue (only in LANS v6)

However, I then ran into another error in Enterprise_Connection_With_Graph_Simulation.py at the line:
create_graph(temp_folder,graphList[rank], seed = seedlist[rank], startpoint = startIndex[rank])
I was not able to get past that error quickly, and the version of pandas was rolled back to 0.19.1.
 
The second issue I ran into was with networkx, and only in LANS version 6. networkx version 2 was available in the sponsor's environment. The following changes to Property.py make it compatible with both versions 1 and 2 of networkx (dict() is a no-op on the plain dict returned by networkx 1.x and converts the DegreeView returned by 2.x):
 
Lines 62-66:
    def getInDegree(self):
        return sorted(dict(self.G.in_degree()).values())
 
    def getOutDegree(self):
        return sorted(dict(self.G.out_degree()).values())
 
I found this site helpful for the migration: https://networkx.github.io/documentation/stable/release/migration_guide_from_1.x_to_2.0.html

indegrees in simulated graph do not fit the original graph or the simulated nodes

In the generate_edge function, the globals "innodes" and "outnodes" seem to reset between function calls. These variables are assigned as empty global lists inside the create_graph function (they were previously declared outside of all functions, to ensure they were always initialized at import time, but changing their location has had no noticeable effect). The nodeCreation function removes from innodes, immediately after initializing it, every node with an indegree of less than 1, and whenever a node's indegree is decremented, a check removes the node from the list if its new indegree is less than 1. Even so, the code still chooses nodes that should have been removed from the innodes list as destination roles.
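One common Python pitfall consistent with globals that appear to reset (an assumption about the cause, since the project code is not shown here): assignment inside a function creates a local name unless the function declares it global, so a list re-initialized in create_graph without a global statement leaves the module-level list untouched.

    innodes = []  # module-level list; hypothetical stand-in for the real one

    def create_graph_buggy():
        innodes = ['a', 'b']  # rebinds a *local* name; the global stays []

    def create_graph_fixed():
        global innodes        # assignment now targets the module-level name
        innodes = ['a', 'b']

    create_graph_buggy()
    print(innodes)            # [] -- looks like the list "reset"
    create_graph_fixed()
    print(innodes)            # ['a', 'b']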

Fix issues with running code for CTU datasets

Hi Hugh, Mandy is traveling, so I'll do my best to describe the issues; she can correct me when she gets back.
 
The main issue was that if either the sport or dport field was empty, it made things very unhappy. The way Mandy worked around it was to replace the empty field with a dummy value of 0.
 
She also removed the spaces in the hex values for those fields and converted them to decimal.
 
All the CTU data set manipulation only needed to be done once per dataset, so it was easy to forget they had been tweaked.
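A minimal sketch of that preprocessing with pandas; the column names Sport/Dport and the hex formatting are assumptions based on the CTU binetflow layout:

    import pandas as pd

    df = pd.read_csv('5.binetflow')  # hypothetical input file

    def clean_port(value):
        # Replace empty ports with a dummy 0, and convert hex values
        # (with any spaces stripped) to decimal.
        if pd.isnull(value) or str(value).strip() == '':
            return 0
        text = str(value).replace(' ', '')
        if text.lower().startswith('0x'):
            return int(text, 16)
        return int(text)

    for col in ('Sport', 'Dport'):
        df[col] = df[col].apply(clean_port)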
