salento's Introduction

Salento

Salento is a statistical bug-detection framework based on the machine learning model used by Bayou. For technical details about Salento, refer to the paper "Bayesian Specification Learning for Finding API Usage Errors" (FSE'17).

Requirements

  • Python 3 (tested with 3.5.1)
  • TensorFlow (tested with 1.4)

Training

To train a Salento model on a data file, say DATA.json:

  1. Set up the environment:
export PYTHONPATH=$PYTHONPATH:/path/to/salento/src/main/python
  2. Ensure that the data is in the right JSON format by checking it against the schema file doc/json_schemas/salento_input_schema.json.

  3. (Optional.) Extract evidences from the data:

python3 src/main/python/scripts/evidence_extractor.py DATA.json DATA-training.json

This will create DATA-training.json after extracting evidences from each package in DATA.json. Run with --help for more options that you can use to filter the sequences selected for training.

  4. Go to the model folder and start training with a model configuration:
cd src/main/python/salento/models/low_level_evidences
python3 train.py /path/to/DATA-training.json --config config.json

Run with --help to see a description of the model configuration options. Edit config.json as needed.
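
Before training, a quick structural check of the data file can catch malformed input early. The sketch below is only an approximation based on the shape of the example input shown in the issues further down this page (packages, then data, then sequence, then events with call, states and location). The function name check_salento_input is hypothetical and not part of Salento, and the schema file in doc/json_schemas remains the authoritative definition.

```python
import json


def check_salento_input(doc):
    """Return a list of problems found in a parsed Salento-style input.

    Hypothetical helper: the expected shape is inferred from the example
    input in this page, not from the official JSON schema.
    """
    packages = doc.get("packages")
    if not isinstance(packages, list):
        return ["top-level 'packages' must be a list"]

    problems = []
    for p_idx, pack in enumerate(packages):
        for d_idx, unit in enumerate(pack.get("data", [])):
            for e_idx, event in enumerate(unit.get("sequence", [])):
                where = "packages[%d].data[%d].sequence[%d]" % (p_idx, d_idx, e_idx)
                if not isinstance(event.get("call"), str):
                    problems.append(where + ": missing or non-string 'call'")
                if not isinstance(event.get("states", []), list):
                    problems.append(where + ": 'states' must be a list")
    return problems


if __name__ == "__main__":
    example = json.loads(
        '{"packages": [{"data": [{"sequence": ['
        '{"call": "pthread_mutex_lock", "states": [], "location": "foo.c:2"}'
        ']}], "name": "foo.c"}]}'
    )
    print(check_salento_input(example))
```

An empty list means that no problems were found by this check; it does not guarantee that the file conforms to the schema.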

Inference

To test a trained model on some test data:

1-3. Follow steps 1-3 above to produce a file DATA-testing.json with evidences.

  4. Go to the aggregators folder and run one of the aggregators on the test data:
cd src/main/python/salento/aggregators
python3 sequence_aggregator.py --data_file /path/to/DATA-testing.json --model_dir /path/to/model/directory

The model directory should contain the trained model's files, such as checkpoint, config.json, etc.

salento's People

Contributors

asingh-gt, cogumbreiro, vineethk, vm4422

salento's Issues

Port Salento to the TensorFlow 1.0 API

Salento is currently stuck on TensorFlow 0.12. One important maintenance milestone is to bring the API usage up to date with the latest version, 1.4 as of now.

I am currently working on this.

Salento can't handle unknown vocabs

The problem appears to be that Salento's internals do not expect unknown vocabs. I am wondering if we should just filter out unknown vocabs when iterating over, say, Aggregator.events.

@vijay-murali, thoughts?

I'm getting this error when running the sequence aggregator:

Package 1----
Traceback (most recent call last):
  File "/home/tgc/salento/src/main/python/salento/aggregators/sequence_aggregator.py", line 52, in <module>
    aggregator.run()
  File "/home/tgc/salento/src/main/python/salento/aggregators/sequence_aggregator.py", line 38, in run
    llh += math.log(self.distribution_next_call(spec, events[:i], call=self.call(event)))
  File "/home/tgc/salento/src/main/python/salento/aggregators/base.py", line 73, in distribution_next_call
    return dist if call is None else dist[call]
KeyError: 'cogl_pipeline_set_layer_filters'
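
Two possible ways to avoid this KeyError are sketched below: dropping events whose call is not in the vocabulary before scoring, or looking up the probability with a small floor instead of indexing the distribution directly. This is an illustration only. The names drop_unknown and log_likelihood, the next_call_distribution callback, and the floor value are assumptions made for the example and do not reflect Salento's actual internals.

```python
import math


def drop_unknown(events, vocab):
    """Keep only the events whose call appears in the known vocabulary."""
    return [event for event in events if event["call"] in vocab]


def log_likelihood(events, next_call_distribution, floor=1e-12):
    """Sum log-probabilities of each call given the events before it.

    next_call_distribution(prefix) is a hypothetical stand-in for the model:
    it returns a dict mapping call names to probabilities. Calls missing from
    that dict get probability `floor` rather than raising KeyError.
    """
    llh = 0.0
    for i, event in enumerate(events):
        dist = next_call_distribution(events[:i])
        prob = dist.get(event["call"], 0.0)
        llh += math.log(max(prob, floor))
    return llh
```

The two options behave differently: filtering ignores unknown calls altogether, while the floor makes any sequence containing an unknown call score as very unlikely. Which is appropriate depends on whether an unseen call should count as evidence of an anomaly.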

Moving the android driver to its own repository

As far as I understand, the same code extractors can be used for multiple tools (salento and bayou).

Maybe it makes more sense to move code extractors to their own repository?

This would simplify repository maintenance and packaging.

Change the architecture of Salento to one that is amenable to streaming

Salento expects a sequence of packages as input.
The problem is that the file containing the sequence of packages is a single JSON object, which means that all packages must fit into memory to be read. We currently have some use cases where the datasets do not fit in memory, so this architecture is a bottleneck for scalability.

We need to:

  1. change the file format to something amenable to streaming packages
  2. change the internals (say, train.py) so that data is loaded lazily, using generators as much as possible (versus creating lists upfront)
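
One candidate format for the two steps above is JSON Lines, with one package per line, read through a generator so that only one package is held in memory at a time. The sketch below is a suggestion, not a decided design; write_packages and read_packages are hypothetical names.

```python
import io
import json


def write_packages(packages, fp):
    """Write each package as one JSON document per line (JSON Lines)."""
    for pack in packages:
        fp.write(json.dumps(pack) + "\n")


def read_packages(fp):
    """Lazily yield packages from a JSON Lines stream, one line at a time."""
    for line in fp:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


if __name__ == "__main__":
    buf = io.StringIO()
    write_packages([{"name": "a.c", "data": []}, {"name": "b.c", "data": []}], buf)
    buf.seek(0)
    for pack in read_packages(buf):
        print(pack["name"])
```

Because read_packages is a generator, consumers such as a training loop can iterate over it directly without building a list of all packages first.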

Salento crashing with `states` defined

Traceback (most recent call last):
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 94, in <module>
    aggregator.run()
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 81, in run
    kld_score = self.compute_kld(spec, seqs_l)
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 59, in compute_kld
    log_q = self.log_likelihood(spec, sequence)
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 44, in log_likelihood
    llh += math.log(self.distribution_next_state(spec, events[:i] + [partial_event], state=state))
  File "/home/tgc/salento/src/main/python/salento/aggregators/base.py", line 91, in distribution_next_state
    return dist[state]
KeyError: '4#5'

Exception running `kld.py`

Hi, @vijay-murali,

I am trying to debug the error below and for that I was looking at the implementation of kld.py.

Error

Started at 2017-11-27 16:42:05.398390
2017-11-27 16:42:05.398588: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
### foo.c
Traceback (most recent call last):
  File "../salento/statistical/kld.py", line 150, in <module>
    main()
  File "../salento/statistical/kld.py", line 65, in main
    klds = [(l, kld.compute(l, pack)) for l in locations]
  File "../salento/statistical/kld.py", line 65, in <listcomp>
    klds = [(l, kld.compute(l, pack)) for l in locations]
  File "../salento/statistical/kld.py", line 131, in compute
    samples = [sample(seqs_l, nsamples=1) for i in range(self.args.num_iters)]
  File "../salento/statistical/kld.py", line 131, in <listcomp>
    samples = [sample(seqs_l, nsamples=1) for i in range(self.args.num_iters)]
  File "/home/tiago/Work/salento/statistical/utils.py", line 20, in sample
    samples = [random.choice(s) for i in range(nsamples)] if nsamples > 1 else random.choice(s)
  File "/usr/lib/python3.6/random.py", line 257, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence

Input

{"packages": [
    {"data": [
        {"sequence": [
            {
              "call": "pthread_mutex_lock",
              "states": [],
              "location": "foo.c:2"
            },
            {
              "call": "pthread_mutex_unlock",
              "states": [],
              "location": "foo.c:1"
            }
         ]}
    ],
    "name": "foo.c"
    }
]}

Walk through

In function main() we find the following code:

        for pack in parser.packages:
            locations = parser.locations(pack)
            # ...
            klds = [(l, kld.compute(l, pack)) for l in locations]

For this input we get that there is only one package, where locations = ['foo.c:1', 'foo.c:2'].

Then we have a call to compute(self, l, pack), where in the first line we can find:

        seqs_l = self.parser.sequences(pack, l)

According to the documentation of sequences:

"If location is given, then get all sequences in package that end at location."

Hence, for foo.c:1 we get the only sequence in the input and for foo.c:2 we get seqs_l = [] which then triggers the error.
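
The walk-through can be reproduced with a small stand-alone sketch. The helpers below are hypothetical reimplementations written from the description above (locations collects every location in a package; sequences_ending_at keeps the sequences whose last event is at the given location); they are not the code in kld.py or the parser. The last helper shows one possible guard against sampling from an empty list.

```python
import random


def locations(package):
    """All distinct event locations in a package, sorted."""
    return sorted({event["location"]
                   for unit in package["data"]
                   for event in unit["sequence"]})


def sequences_ending_at(package, location):
    """Sequences in the package whose last event is at `location`."""
    return [unit["sequence"] for unit in package["data"]
            if unit["sequence"] and unit["sequence"][-1]["location"] == location]


def sample_or_none(seqs, rng=random):
    """Pick one sequence at random, or None if there is nothing to pick."""
    return rng.choice(seqs) if seqs else None


# The package from the input above.
package = {
    "name": "foo.c",
    "data": [{"sequence": [
        {"call": "pthread_mutex_lock", "states": [], "location": "foo.c:2"},
        {"call": "pthread_mutex_unlock", "states": [], "location": "foo.c:1"},
    ]}],
}

if __name__ == "__main__":
    for loc in locations(package):
        seqs = sequences_ending_at(package, loc)
        print(loc, len(seqs), sample_or_none(seqs) is not None)
```

With this input, foo.c:2 appears as a location but no sequence ends there, so calling random.choice directly on the empty result raises the IndexError shown in the traceback. Skipping locations with no sequences is one way to avoid it.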

How to interpret Salento's output?

Any help trying to interpret Salento's output? How do I know what's unlikely?

For instance, I've plotted the output of sequence_aggregator.py as a scatter graph:

[Image: plot of likelihood]

Any idea what to make of this?
