logintelligence / logadempirical
Log-based Anomaly Detection with Deep Learning: How Far Are We? (ICSE 2022, Technical Track)
License: MIT License
When I run this program following your README.md, this error occurred:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/bgl/BGL.log_structured.csv'
And I do not know how to turn BGL.log into BGL.log_structured.csv.
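Structured CSVs like BGL.log_structured.csv are typically produced by running a log parser such as Drain (from the logpai/logparser project) on the raw log. As a rough illustration only, the sketch below splits raw BGL lines into the usual structured columns with the standard library; the field names follow the logpai BGL format, and the EventId/EventTemplate columns are left blank because filling them requires an actual template miner like Drain:

```python
import csv

# Raw BGL lines have 9 space-separated header fields followed by free-text
# content, e.g. (hypothetical example):
# "- 1117838570 2005.06.03 R02-... 2005-06-03-15.42.50.363779 R02-... RAS KERNEL INFO instruction cache parity error corrected"
FIELDS = ["Label", "Timestamp", "Date", "Node", "Time",
          "NodeRepeat", "Type", "Component", "Level", "Content"]

def structure_bgl(in_path, out_path):
    with open(in_path) as raw, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["LineId"] + FIELDS + ["EventId", "EventTemplate"])
        for line_id, line in enumerate(raw, start=1):
            # split off the 9 header fields; the remainder is the Content
            parts = line.rstrip("\n").split(" ", len(FIELDS) - 1)
            if len(parts) < len(FIELDS):
                continue  # skip malformed lines
            # EventId/EventTemplate are left empty; a parser like Drain fills them
            writer.writerow([line_id] + parts + ["", ""])
```

This only reproduces the column layout the loader expects; to get meaningful EventId/EventTemplate values you still need to run Drain (or another parser) over the log.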
Could you please provide a benchmark (details of argument values for each experiment) for accurate reproduction of results?
Perhaps, similar to the Loglizer repository which has a benchmarks folder.
Hello, I have some questions about the 'seq' field in the data.
I tried to find where 'seq' is generated in the code. It appears in the data generated by the session window, but the training data saved by the sliding window does not contain this field. My guess is that it is a list of log indices; does that make sense?
Is there any mistake in my understanding? Could you please help with this?
I have some questions about loading the datasets, especially HDFS.
RQ1: What is the parameter "history_size" used for, and what value should it take?
The explanation of "history_size" in the code is that it splits sequences for DeepLog, LogAnomaly & LogRobust.
I found that "history_size" is used in the sliding_window() function, as shown in the picture: the function splits each sequence with a fixed window whose size is "history_size".
My question is: why do you use "history_size" to cut the data into fixed-length sequences, even for the session-windowed HDFS data?
As a result, every sequence in the final training dataset has length "history_size".
Is there any mistake in my understanding? Can you help me answer this?
RQ2: Is there any other way to load the datasets? Maybe a different loading path would resolve RQ1?
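For context on RQ1, this is how a fixed "history_size" window is commonly used to turn one session's log-key sequence into (history, next-key) training pairs for DeepLog-style next-event prediction. A minimal sketch with illustrative names, not the repo's exact code:

```python
# Split one log-key sequence into fixed-length (history, target) pairs:
# the model sees `history_size` keys and predicts the next one.
def split_by_history(seq, history_size):
    pairs = []
    for i in range(len(seq) - history_size):
        history = seq[i:i + history_size]   # fixed-length input window
        target = seq[i + history_size]      # the log key to predict next
        pairs.append((history, target))
    return pairs

# A session window (e.g. one HDFS block) can be longer than history_size,
# so it is further cut into fixed-length sub-sequences:
session = [5, 5, 7, 9, 11, 9, 26, 26]
pairs = split_by_history(session, history_size=4)
# yields 4 pairs, each with a history of exactly 4 keys
```

This would explain why even session-windowed HDFS data ends up as sequences of length "history_size": the forecasting models need fixed-length inputs.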
In logadempirical/logdeep/dataset/sample.py line 161, the value of sequential_pattern is set to an empty list:
sequential_pattern = []
This is the main input of the DeepLog model, and it causes an error when running DeepLog because the input is empty.
However, according to the logdeep repository, this feature is supposed to be initialized as:
sequential_pattern = line[i:i + window_size]
Please update the code.
This is great work! I just found that there is no license information provided in the repository, which makes it impossible for others to reuse it officially. Can you please add license information? Thank you!
How can I get the raw Spirit dataset (172 million log messages)?
In logadempirical/logdeep/models/lstm.py line 109, the inputs are set as:
input0, input1 = features[2], features[1]
where features[2] is the semantic pattern according to logadempirical/logdeep/dataset/sample.py. However, in the original code of the logdeep repository, we have:
input0, input1 = features[0], features[1]
where features[0] is the sequential pattern.
Hence, the current LogAnomaly model in this repository is not working properly.
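To make the index fix concrete, here is a hypothetical sketch of the feature ordering described in this issue (sequential, quantitative, semantic). The names and the `loganomaly_inputs` helper are illustrative, not the repo's actual code:

```python
from collections import namedtuple

# Feature tuple ordering as reported in the issue:
# features[0] = sequential pattern, features[1] = quantitative pattern,
# features[2] = semantic pattern.
Features = namedtuple("Features", ["sequential", "quantitative", "semantic"])

def loganomaly_inputs(features):
    # LogAnomaly consumes the sequential and quantitative patterns,
    # i.e. features[0] and features[1] -- not features[2] and features[1].
    return features.sequential, features.quantitative

f = Features(sequential=[3, 1, 4], quantitative=[0, 1, 2], semantic=[[0.1] * 300])
input0, input1 = loganomaly_inputs(f)
```

Using named fields instead of bare indices would make mix-ups like this easier to catch.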
Guys, this code seems to be missing something. Following the demo script, it either runs for a very long time or fails to run.
Could someone fix this? Thank you very much!!!
You can add me on WeChat: wlf0104, QQ: 1780130585,
or email me at [email protected].
Could you please update the requirements for the project? I suspect they are outdated.
Do you use the same HDFS log dataset as in the DeepLog paper? Could you please provide the log dataset, or point me to somewhere I can view the logs?
Hi, I was trying to run the LogAnomaly model; on the command line I simply changed the model option to 'loganomaly'. It shows this error:
File "/home/LogADEmpirical/logadempirical/logdeep/dataset/vocab.py", line 46, in find_similar
if sim > 0.90:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Could you please help with this?
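A likely cause, assuming `sim` in vocab.py comes from a vectorized cosine similarity (e.g. sklearn's `cosine_similarity`, which returns a 2-D array rather than a scalar): `sim > 0.90` is then an element-wise comparison, and NumPy refuses to use the resulting boolean array in an `if`. A sketch of a possible fix, reducing `sim` to a scalar before comparing:

```python
import numpy as np

def is_similar(sim, threshold=0.90):
    # Collapse array-valued similarities (e.g. shape (1, N)) to a single
    # scalar before the threshold test; plain floats pass through unchanged.
    score = float(np.max(np.asarray(sim)))
    return score > threshold
```

Whether `max` is the right reduction depends on what `find_similar` is meant to do; if the array is shape (1, 1), indexing with `sim[0][0]` would be equivalent.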
I followed the README.md demo shell script, and I got this error.
Please tell me how to change the parameters to avoid this problem.
I am trying to reproduce the results of your paper.
In your paper, you wrote "We leverage 10 million continuous log lines..." for the Thunderbird dataset. Could you tell me which part you actually used?
(I tried using the first 10 million lines but the results seemed different.)
I'm trying to reproduce your results (like another poster here)...
Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting the error below. Can you advise how/where to get your "embeddings.json" file?
python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset
Traceback (most recent call last):
File "main_run.py", line 213, in <module>
main()
File "main_run.py", line 195, in main
run_deeplog(options)
File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
Trainer(options).start_train()
File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
train_logs, train_labels = sliding_window(data,
File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
event2semantic_vec = read_json(os.path.join(data_dir, e_name))
File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
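From the traceback, `sliding_window()` reads "embeddings.json" from the dataset folder, which appears to map each event (template) ID to a semantic vector. A minimal stand-in just to unblock the pipeline, assuming a `{event_id: vector}` layout and a 300-dimensional vector size (both assumptions); the real file would encode template semantics from pretrained word embeddings, which this placeholder does not:

```python
import json
import os

def write_placeholder_embeddings(event_ids, out_dir, dim=300):
    # Zero vectors only make the loader run; they carry no semantic
    # information, so semantic-based models will not perform meaningfully.
    vecs = {str(eid): [0.0] * dim for eid in event_ids}
    path = os.path.join(out_dir, "embeddings.json")
    with open(path, "w") as f:
        json.dump(vecs, f)
    return path
```

For real results you would need the embeddings file the authors used (or regenerate semantic vectors from the Drain templates yourself).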
In logadempirical/PLELog/data/DataLoader.py line 324, there is this line:
train = train[:100000]
which restricts the training dataset to the first 100000 logs. Why has this been applied?
In fact, it makes the training size different from what is reported in the paper.
Similarly, on line 330:
val = val[:20000]
Hi, I was trying out DeepLog on the HDFS_1 dataset (I used only the first 1M lines, parsed by Drain).
I ran it with the following parameter settings:
python main_run.py --folder=hdfs_1m/ --log_file=HDFS_1m.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
This is the result:
Precision: 86.964%, Recall: 53.931%, F1-measure: 66.576%, Specificity: 0.996
(I have tried different parameter settings; there's not much of a difference.)
Could you please give me some advice? Thanks in advance!
Hello, I was trying to run DeepLog on the HDFS dataset but ended up with the following error.
The command I used:
!python main_run.py --folder=bgl/ --log_file=HDFS.log --dataset_name=hdfs --model_name=deeplog --window_type=sliding --sample=sliding_window --is_logkey --train_size=0.8 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=150 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=60 --step_size=60 --output_dir=experimental_results/demo/random/ --is_process
Are we supposed to run other scripts first to generate such files (for example data_loader.py or synthesize.py)?
Can we re-run the code with other formats of the HDFS dataset that are publicly available?
Thanks,