acmfi / libradar

ACMFI version of libRadar

License: MIT License

Rust 100.00%

libradar's Introduction

libradar

ACMFI version of libRadar

libradar's Issues

Call graph generation

In order to calculate the score we need to get the Call Graph of the Dex file. This component should be added to the disass module, taking as input a method and returning the call (sub-)graph.

There is no need to actually hold the call graph in memory; an Iterator that returns the edges of the graph should be more than enough.

The package tree will collect the invoked methods for each method in each class, so it needs the names of the invoked methods in whatever format the package database uses, in order to match them against the api set.
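A minimal sketch of the iterator idea, assuming a simplified `Method` type that already knows the names of the methods it invokes. `Method`, `Edge` and `call_edges` are hypothetical names for illustration; the real disass module would have to derive the invoked names from the Dex bytecode.

```rust
/// Hypothetical, simplified view of a method: its name plus the names of
/// the methods it invokes (assumed to be already extracted from the Dex).
pub struct Method {
    pub name: String,
    pub invoked: Vec<String>,
}

/// A call-graph edge: `caller` invokes `callee`.
#[derive(Debug, PartialEq, Eq)]
pub struct Edge<'a> {
    pub caller: &'a str,
    pub callee: &'a str,
}

impl Method {
    /// Lazily yields the outgoing edges of this single method.
    pub fn edges(&self) -> impl Iterator<Item = Edge<'_>> + '_ {
        self.invoked.iter().map(move |callee| Edge {
            caller: self.name.as_str(),
            callee: callee.as_str(),
        })
    }
}

/// Lazily yields the edges of every method in `methods`.
/// The graph itself is never materialized in memory.
pub fn call_edges<'a>(methods: &'a [Method]) -> impl Iterator<Item = Edge<'a>> + 'a {
    methods.iter().flat_map(|m| m.edges())
}
```

Because the edges are produced on demand, a consumer such as the package tree can filter them against the api set while iterating and drop everything else.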

Define actions and entry point of the program

For a usable program we obviously need an entry point. The first and easiest option is a CLI program that does the training and the detection. The problem is that, depending on how the CLI is implemented, we may have coupling issues. For example, if the whole detection algorithm is written inside the detection CLI entry point, do we need to reimplement it if we later decide to add an HTTP entry point?

To future-proof ourselves, I believe the best approach here is to apply the Command pattern. Implementing each action of the system as a separate command gives us more flexibility. The CLI programs, HTTP clients or whatever else will just be lightweight facades over the actual command.

Implementation-wise it could go as follows:

~~An abstract interface that defines what the general behavior of the command is:~~

trait Action {
  fn run(&mut self);
}

~~Then, each action/command would be represented with a struct that hold all the parameters it needs.~~

Drawbacks: if the code of the action is to be independent of the entry point, we can't make assumptions like writing to stdout. This means we need a common interface for the output of the actions, which potentially has many different formats or even no output at all.

Something like this may be much simpler to do and can still fit into a Command in an upper layer of abstraction:

// A struct that holds the configuration for the application
// DB, libradar parameters, etc
struct Context {
  db: crate::db::DexDB,
  ...
}

impl Context {
  fn create_analysis() -> impl Analyzer {} // Returns some type of analysis (fuzzy or exact)
}

trait Analyzer {
  fn analyze_apk() -> AnalysisResult;
  fn analyze_dex() -> AnalysisResult;
}

struct ExampleAnalysis;

impl Analyzer for ExampleAnalysis {
  ...
}

With something like this the responsibilities are split into three components:

Context knows the state of the program and which parameters are selected.
DBManager handles the connections to databases and returns instances of the DexDB trait.
Analyzer is implemented by the different analyses, and Context chooses among them based on the parameters.
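The split above can be sketched in a compilable form as follows. This is only an illustration under assumptions: the `DexDB` trait here has a single hypothetical method, `MemoryDB` and `ExactAnalysis` are invented for the example, and the input of `analyze_dex` is reduced to a list of package hashes because the real Dex types are not defined in this issue.

```rust
use std::collections::HashMap;

/// Stand-in for the database interface (only the method this sketch needs).
pub trait DexDB {
    fn library_by_hash(&self, hash: &str) -> Option<&str>;
}

/// Hypothetical in-memory database: hash -> library name.
pub struct MemoryDB {
    pub libs: HashMap<String, String>,
}

impl DexDB for MemoryDB {
    fn library_by_hash(&self, hash: &str) -> Option<&str> {
        self.libs.get(hash).map(String::as_str)
    }
}

/// Placeholder result: the names of the libraries that were detected.
#[derive(Debug, PartialEq)]
pub struct AnalysisResult {
    pub libraries: Vec<String>,
}

pub trait Analyzer {
    /// `package_hashes` stands in for whatever is extracted from a Dex.
    fn analyze_dex(&self, package_hashes: &[String]) -> AnalysisResult;
}

/// Exact-match analysis that borrows the database from the context.
pub struct ExactAnalysis<'a> {
    db: &'a dyn DexDB,
}

impl Analyzer for ExactAnalysis<'_> {
    fn analyze_dex(&self, package_hashes: &[String]) -> AnalysisResult {
        let libraries = package_hashes
            .iter()
            .filter_map(|hash| self.db.library_by_hash(hash))
            .map(str::to_string)
            .collect();
        AnalysisResult { libraries }
    }
}

/// Holds the application configuration and chooses the analysis.
pub struct Context {
    pub db: Box<dyn DexDB>,
}

impl Context {
    /// A fuller version would pick fuzzy or exact based on parameters.
    pub fn create_analysis(&self) -> Box<dyn Analyzer + '_> {
        Box::new(ExactAnalysis { db: self.db.as_ref() })
    }
}
```

An entry point (CLI, HTTP or otherwise) would only build a `Context`, call `create_analysis`, and format the returned `AnalysisResult` however it needs, so no analysis code lives in the entry point itself.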

Package tree data structure

libRadar uses a tree structure to hold the classes according to the package they belong to. This tree also generates the hash and the score of the library recursively.

In my mind (and inspired by the reference implementation) the module should have a public struct that acts as a public API and the actual tree is defined recursively using a private struct.

We have to generate a tree for each dex that we are handling, so it must receive a Dex as input parameter along with the api set that is used to calculate the score of each node in the tree.

The leaf nodes should contain classes, and the internal (non-leaf) nodes represent the Java packages.
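A rough sketch of that layout, with a public `PackageTree` wrapping a private recursive `Node`. Several things here are assumptions made for illustration: classes are inserted by path with a pre-extracted list of calls instead of being read from a Dex, the score is simply the number of calls found in the api set, and `DefaultHasher` is a stand-in for whatever hash the reference implementation uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashSet};
use std::hash::{Hash, Hasher};

/// Private recursive node: a package with sub-packages and classes.
#[derive(Default)]
struct Node {
    children: BTreeMap<String, Node>,
    /// Class name -> calls made by that class (hypothetical input format).
    classes: BTreeMap<String, Vec<String>>,
}

impl Node {
    /// Returns (hash, score) for this subtree, computed recursively.
    /// Only calls that belong to the api set contribute to either value.
    fn digest(&self, api_set: &HashSet<String>) -> (u64, usize) {
        let mut hasher = DefaultHasher::new();
        let mut score = 0;
        for calls in self.classes.values() {
            for call in calls.iter().filter(|c| api_set.contains(*c)) {
                call.hash(&mut hasher);
                score += 1;
            }
        }
        for child in self.children.values() {
            let (child_hash, child_score) = child.digest(api_set);
            child_hash.hash(&mut hasher);
            score += child_score;
        }
        (hasher.finish(), score)
    }
}

/// Public API of the module; the tree itself stays private.
#[derive(Default)]
pub struct PackageTree {
    root: Node,
}

impl PackageTree {
    /// Inserts a class given as "com/example/Foo" with the calls it makes.
    pub fn insert(&mut self, class_path: &str, calls: Vec<String>) {
        let mut parts: Vec<&str> = class_path.split('/').collect();
        let class_name = parts.pop().unwrap_or_default();
        let mut node = &mut self.root;
        for package in parts {
            node = node.children.entry(package.to_string()).or_default();
        }
        node.classes.insert(class_name.to_string(), calls);
    }

    /// Hash and score of the whole tree.
    pub fn digest(&self, api_set: &HashSet<String>) -> (u64, usize) {
        self.root.digest(api_set)
    }
}
```

Using `BTreeMap` keeps the traversal order independent of insertion order, so two trees built from the same classes produce the same hash.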

Training validation dataset

To ensure that our code works as intended we need to build a testing dataset and some form of validation.

To collect apps for the testing dataset we can use F-Droid. For each app we need the compiled APK (the same one users would install on their devices) and the list of third-party libraries the app contains. For common Android projects the third-party libraries are declared in the Gradle build scripts, so a quick manual inspection can reveal the list of libraries the app uses.

The problem now is how to do the validation. We could split the dataset into training and validation datasets if the dataset is large enough. The problem with this and any similar approach is that a library that is not in the training dataset cannot be detected in the validation dataset. This is a fundamental problem of signature databases, which is libRadar's approach.
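One way to keep that limitation visible in the results is to count misses caused by unseen libraries separately from real misses. The sketch below assumes libraries are identified by plain names and that three sets are available per app; `Validation` and `validate` are hypothetical names, not part of the project.

```rust
use std::collections::HashSet;

/// Outcome of validating one app against its known library list.
pub struct Validation {
    pub true_positives: usize,
    pub false_positives: usize,
    /// Missed although the library is in the signature database.
    pub missed_known: usize,
    /// Missed because the library was never seen during training.
    pub missed_unseen: usize,
}

/// `expected`: libraries declared by the app (ground truth).
/// `detected`: libraries reported by the detection.
/// `trained`:  libraries present in the training dataset.
pub fn validate(
    expected: &HashSet<String>,
    detected: &HashSet<String>,
    trained: &HashSet<String>,
) -> Validation {
    let true_positives = expected.intersection(detected).count();
    let false_positives = detected.difference(expected).count();
    let missed: Vec<&String> = expected.difference(detected).collect();
    let missed_known = missed.iter().filter(|lib| trained.contains(**lib)).count();
    Validation {
        true_positives,
        false_positives,
        missed_known,
        missed_unseen: missed.len() - missed_known,
    }
}
```

Only `missed_known` points at a defect in the detection itself; `missed_unseen` measures how much of the validation dataset the signature database could never have covered.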

CLI entrypoint

For the basic implementation we are going for a CLI application. However, not all CLIs are equal.

This Medium post talks about what makes a good CLI.

We should strive to meet as many of its points as possible. The main problem with our CLI is that our application can do several things (training, detection, database management, etc.). Two approaches can be used here:

One single executable that has several subcommands (e.g. git)
Several executables that share a common codebase (e.g. radare2)

Both approaches can work with Rust thanks to how Cargo builds binaries. The choice between them will then be influenced by which CLI frameworks are available.
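To make the single-executable option concrete, here is a small sketch that parses subcommands with the standard library only, so it does not commit to any CLI framework. The subcommand names `train` and `detect` and their single path argument are assumptions for illustration.

```rust
/// Hypothetical actions of the application; names and arguments are
/// assumptions, not a decided interface.
#[derive(Debug, PartialEq)]
pub enum Command {
    Train { dataset: String },
    Detect { apk: String },
}

/// Parses the arguments that follow the program name.
pub fn parse_args<I>(args: I) -> Result<Command, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();
    let subcommand = args
        .next()
        .ok_or("missing subcommand (train | detect)")?;
    let target = args
        .next()
        .ok_or_else(|| format!("`{subcommand}` needs a path argument"))?;
    match subcommand.as_str() {
        "train" => Ok(Command::Train { dataset: target }),
        "detect" => Ok(Command::Detect { apk: target }),
        other => Err(format!("unknown subcommand `{other}`")),
    }
}
```

A binary would call `parse_args(std::env::args().skip(1))` and dispatch on the returned `Command`. With the several-executables option, each binary under `src/bin/` would instead construct its own variant directly and share the rest of the code as a library crate.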

The Awesome Rust list can be a good starting point for trying out CLI frameworks, to see which one fits our needs best and complies as much as possible with the 12-factor philosophy.

Packages database

To do the library detection we need a database of known libraries. This database has at least two parts:

The api set, a list of methods used to identify the libraries; when a method uses one of them, that usage affects its score.
The libs set, the set of known libraries in the database, each with its name and its hash.

Since in the future we want to support multiple database backends we need to define a common interface.

For an initial implementation we can use a simple in-memory database that is loaded at startup.
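A possible shape for the common interface and the in-memory backend is sketched below. The method names on `DexDB`, the `InMemoryDB` type and especially the line-based text format parsed by `load` are assumptions made for this example; the format of the reference implementation's database is not described here.

```rust
use std::collections::{HashMap, HashSet};

/// Common interface that every database backend would implement.
pub trait DexDB {
    /// Whether `method` belongs to the api set.
    fn is_api(&self, method: &str) -> bool;
    /// Name of the library whose hash is `hash`, if it is known.
    fn library_by_hash(&self, hash: &str) -> Option<&str>;
}

/// Simple backend that keeps everything in memory.
#[derive(Default)]
pub struct InMemoryDB {
    api_set: HashSet<String>,
    /// Library hash -> library name.
    libs: HashMap<String, String>,
}

impl InMemoryDB {
    /// Loads the database from text with one entry per line, either
    /// `api <method>` or `lib <hash> <name>` (hypothetical format).
    pub fn load(text: &str) -> Self {
        let mut db = Self::default();
        for line in text.lines() {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next(), fields.next()) {
                (Some("api"), Some(method), None) => {
                    db.api_set.insert(method.to_string());
                }
                (Some("lib"), Some(hash), Some(name)) => {
                    db.libs.insert(hash.to_string(), name.to_string());
                }
                _ => {} // blank or malformed lines are ignored in this sketch
            }
        }
        db
    }
}

impl DexDB for InMemoryDB {
    fn is_api(&self, method: &str) -> bool {
        self.api_set.contains(method)
    }

    fn library_by_hash(&self, hash: &str) -> Option<&str> {
        self.libs.get(hash).map(String::as_str)
    }
}
```

Code that performs detection would depend only on `DexDB`, so a later backend (for example one backed by a file or a server) could replace `InMemoryDB` without touching the analysis.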

Since we don't have training implemented yet we can use the database included in the reference implementation for testing.
