trishullab / salento

Statistical bug-finding framework for API-using code

License: Apache License 2.0

Python 100.00%

salento's Introduction

Salento

Salento is a statistical bug-detection framework based on the machine learning model used by Bayou. For technical details, refer to the paper "Bayesian Specification Learning for Finding API Usage Errors" (FSE'17).

Requirements

Python 3 (tested with 3.5.1)
TensorFlow (tested with 1.4)

Training

To train a Salento model on a data file, say DATA.json:

1. Set up the environment:

export PYTHONPATH=$PYTHONPATH:/path/to/salento/src/main/python

2. Ensure that the data is in the right JSON format, as defined by the schema file doc/json_schemas/salento_input_schema.json.
3. (Optional) Extract evidences from the data:

python3 src/main/python/scripts/evidence_extractor.py DATA.json DATA-training.json

This creates DATA-training.json by extracting evidences from each package in DATA.json. Run with --help for more options that filter the sequences selected for training.

4. Go to the model folder and start training with a model configuration:

cd src/main/python/salento/models/low_level_evidences
python3 train.py /path/to/DATA-training.json --config config.json

Run with --help to see a description of the model configuration options. Edit config.json as needed.
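
The format check in the second step can be approximated with the standard library before reaching for a schema validator. The following is a minimal sketch, assuming the layout of the sample input quoted in the kld.py issue on this page (a top-level `packages` list, each package holding a `data` list of objects with a `sequence` list of events that carry a `call` string). The helper name `check_salento_input` is hypothetical, and the authoritative definition remains doc/json_schemas/salento_input_schema.json.

```python
import json


def check_salento_input(doc):
    """Return a list of problems; an empty list means the assumed shape holds."""
    problems = []
    packages = doc.get("packages") if isinstance(doc, dict) else None
    if not isinstance(packages, list):
        return ["top level must be an object with a 'packages' list"]
    for p, package in enumerate(packages):
        data = package.get("data") if isinstance(package, dict) else None
        if not isinstance(data, list):
            problems.append("packages[%d]: missing 'data' list" % p)
            continue
        for d, item in enumerate(data):
            sequence = item.get("sequence") if isinstance(item, dict) else None
            if not isinstance(sequence, list):
                problems.append(
                    "packages[%d].data[%d]: missing 'sequence' list" % (p, d))
                continue
            for e, event in enumerate(sequence):
                # Assumption: every event needs at least a string 'call'.
                if not isinstance(event, dict) or not isinstance(event.get("call"), str):
                    problems.append(
                        "packages[%d].data[%d].sequence[%d]: 'call' must be a string"
                        % (p, d, e))
    return problems


sample = json.loads(
    '{"packages": [{"name": "foo.c", "data": [{"sequence": ['
    '{"call": "pthread_mutex_lock", "states": [], "location": "foo.c:2"}]}]}]}'
)
print(check_salento_input(sample))
```

A check like this only catches gross structural mistakes; it does not replace validation against the schema file.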

Inference

To test a trained model on some test data:

1-3. Follow steps 1-3 above to produce a file DATA-testing.json with evidences.

4. Go to the aggregators folder and run one of the aggregators on the test data:

cd src/main/python/salento/aggregators
python3 sequence_aggregator.py --data_file /path/to/DATA-testing.json --model_dir /path/to/model/directory

The model directory should contain the trained model's files, such as checkpoint, config.json, etc.

salento's Issues

Port Salento to the TensorFlow 1.0 API

Salento is currently stuck on TensorFlow 0.12. One important maintenance milestone is to bring the code up to date with the latest version of the API, 1.4 as of now.

I am currently working on this.

Salento can't handle unknown vocabs

The problem appears to be that Salento's internals do not expect unknown vocabs. I am wondering if we should just filter out unknown vocabs when iterating over, say, Aggregator.events.

@vijay-murali, thoughts?

I'm getting this error when running the sequence aggregator:

Package 1----
Traceback (most recent call last):
  File "/home/tgc/salento/src/main/python/salento/aggregators/sequence_aggregator.py", line 52, in <module>
    aggregator.run()
  File "/home/tgc/salento/src/main/python/salento/aggregators/sequence_aggregator.py", line 38, in run
    llh += math.log(self.distribution_next_call(spec, events[:i], call=self.call(event)))
  File "/home/tgc/salento/src/main/python/salento/aggregators/base.py", line 73, in distribution_next_call
    return dist if call is None else dist[call]
KeyError: 'cogl_pipeline_set_layer_filters'
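
The filtering idea proposed above could look roughly like the sketch below. It is not Salento's code: the function name `known_events` and the way the vocabulary is represented (a plain set of call names) are assumptions made for illustration, and events are reduced to dictionaries with a `call` key.

```python
def known_events(events, vocab):
    """Keep only the events whose call appears in the training vocabulary."""
    return [event for event in events if event["call"] in vocab]


# Hypothetical vocabulary; a real one would come from the trained model.
vocab = {"pthread_mutex_lock", "pthread_mutex_unlock"}
events = [
    {"call": "pthread_mutex_lock"},
    {"call": "cogl_pipeline_set_layer_filters"},  # unseen during training
    {"call": "pthread_mutex_unlock"},
]

filtered = known_events(events, vocab)
print([event["call"] for event in filtered])
```

One caveat with this design: dropping events changes the sequence the model scores, so the resulting likelihood describes the filtered sequence rather than the original one.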

Moving the android driver to its own repository

As far as I understand, the same code extractors can be used for multiple tools (salento and bayou).

Maybe it makes more sense to move code extractors to their own repository?

This would simplify repository maintenance and packaging.

Change the architecture of Salento to one that supports streaming

Salento expects as input a sequence of packages.
The problem is that the file containing the sequence of packages is a single JSON object, which means that all packages must fit into memory to be read. We currently have some use cases where the datasets do not fit in memory, so this architecture is a bottleneck for scalability.

We need to:

change the file format to something amenable to streaming packages
change the internals (say, train.py) so that data is loaded lazily, using generators as much as possible (versus creating lists upfront)
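
One possible shape for both changes is sketched below. It assumes a line-delimited layout (one JSON package per line), which is a suggestion and not a format Salento currently reads; the function names `write_packages` and `read_packages` are hypothetical.

```python
import io
import json


def write_packages(packages, fp):
    """Write each package as one JSON document per line."""
    for package in packages:
        fp.write(json.dumps(package) + "\n")


def read_packages(fp):
    """Yield packages one at a time, so only one is held in memory."""
    for line in fp:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_packages([{"name": "a.c", "data": []}, {"name": "b.c", "data": []}], buffer)
buffer.seek(0)

names = [package["name"] for package in read_packages(buffer)]
print(names)
```

Because `read_packages` is a generator, a consumer such as a training loop can iterate over it directly without first building a list of all packages.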

Salento crashing with `states` defined

Traceback (most recent call last):
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 94, in <module>
    aggregator.run()
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 81, in run
    kld_score = self.compute_kld(spec, seqs_l)
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 59, in compute_kld
    log_q = self.log_likelihood(spec, sequence)
  File "/home/tgc/salento/src/main/python/salento/aggregators/kld_aggregator.py", line 44, in log_likelihood
    llh += math.log(self.distribution_next_state(spec, events[:i] + [partial_event], state=state))
  File "/home/tgc/salento/src/main/python/salento/aggregators/base.py", line 91, in distribution_next_state
    return dist[state]
KeyError: '4#5'

Exception running `kld.py`

Hi, @vijay-murali,

I am trying to debug the error below and for that I was looking at the implementation of kld.py.

Error

Started at 2017-11-27 16:42:05.398390
2017-11-27 16:42:05.398588: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
### foo.c
Traceback (most recent call last):
  File "../salento/statistical/kld.py", line 150, in <module>
    main()
  File "../salento/statistical/kld.py", line 65, in main
    klds = [(l, kld.compute(l, pack)) for l in locations]
  File "../salento/statistical/kld.py", line 65, in <listcomp>
    klds = [(l, kld.compute(l, pack)) for l in locations]
  File "../salento/statistical/kld.py", line 131, in compute
    samples = [sample(seqs_l, nsamples=1) for i in range(self.args.num_iters)]
  File "../salento/statistical/kld.py", line 131, in <listcomp>
    samples = [sample(seqs_l, nsamples=1) for i in range(self.args.num_iters)]
  File "/home/tiago/Work/salento/statistical/utils.py", line 20, in sample
    samples = [random.choice(s) for i in range(nsamples)] if nsamples > 1 else random.choice(s)
  File "/usr/lib/python3.6/random.py", line 257, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence

Input

{"packages": [
    {"data": [
        {"sequence": [
            {
              "call": "pthread_mutex_lock",
              "states": [],
              "location": "foo.c:2"
            },
            {
              "call": "pthread_mutex_unlock",
              "states": [],
              "location": "foo.c:1"
            }
         ]}
    ],
    "name": "foo.c"
    }
]}

Walk through

In function main() we find the following code:

        for pack in parser.packages:
            locations = parser.locations(pack)
            # ...
            klds = [(l, kld.compute(l, pack)) for l in locations]

This input has only one package, for which locations = ['foo.c:1', 'foo.c:2'].

Then we have a call to compute(self, l, pack), where in the first line we can find:

        seqs_l = self.parser.sequences(pack, l)

According to the documentation of sequences:

If location is given, then get all sequences in package that end at location.

Hence, for foo.c:1 we get the only sequence in the input and for foo.c:2 we get seqs_l = [] which then triggers the error.
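
The walk-through can be reproduced in isolation. The sketch below re-implements the documented behaviour of sequences under the name `sequences_ending_at` (a hypothetical stand-in, not the parser's actual code) and adds one possible guard that skips locations with no matching sequence before sampling.

```python
import random


def sequences_ending_at(package, location):
    """Return the sequences in `package` whose last event is at `location`."""
    return [
        item["sequence"]
        for item in package["data"]
        if item["sequence"] and item["sequence"][-1]["location"] == location
    ]


package = {
    "name": "foo.c",
    "data": [{"sequence": [
        {"call": "pthread_mutex_lock", "states": [], "location": "foo.c:2"},
        {"call": "pthread_mutex_unlock", "states": [], "location": "foo.c:1"},
    ]}],
}

for location in ["foo.c:1", "foo.c:2"]:
    seqs_l = sequences_ending_at(package, location)
    if not seqs_l:
        # random.choice([]) raises IndexError, so skip this location.
        print(location, "has no sequence ending here; skipping")
        continue
    print(location, "sampled a sequence of", len(random.choice(seqs_l)), "events")
```

Whether skipping is the right behaviour, or whether locations should only be collected from the final event of each sequence, is a design question for the maintainers.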

How to interpret Salento's output?

Any help trying to interpret Salento's output? How do I know what's unlikely?

For instance, I've plotted the output of sequence_aggregator.py as a scatter graph:

Any idea what to make of this?