logintelligence / logadempirical
Log-based Anomaly Detection with Deep Learning: How Far Are We? (ICSE 2022, Technical Track)
License: MIT License
When I run this program following your README.md, this error occurred:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/bgl/BGL.log_structured.csv'
And I do not know how to turn BGL.log into BGL.log_structured.csv.
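Structured CSVs like BGL.log_structured.csv are typically produced by running a log parser such as Drain (from the logpai/logparser project) on the raw log. As a rough illustration only, the sketch below splits raw BGL lines into the usual structured columns with the standard library; the field names follow the logpai BGL format, and the EventId/EventTemplate columns are left blank because filling them requires an actual template miner like Drain:

```python
import csv

# Raw BGL lines have 9 space-separated header fields followed by free-text
# content, e.g. (hypothetical example):
# "- 1117838570 2005.06.03 R02-... 2005-06-03-15.42.50.363779 R02-... RAS KERNEL INFO instruction cache parity error corrected"
FIELDS = ["Label", "Timestamp", "Date", "Node", "Time",
          "NodeRepeat", "Type", "Component", "Level", "Content"]

def structure_bgl(in_path, out_path):
    with open(in_path) as raw, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["LineId"] + FIELDS + ["EventId", "EventTemplate"])
        for line_id, line in enumerate(raw, start=1):
            # split off the 9 header fields; the remainder is the Content
            parts = line.rstrip("\n").split(" ", len(FIELDS) - 1)
            if len(parts) < len(FIELDS):
                continue  # skip malformed lines
            # EventId/EventTemplate are left empty; a parser like Drain fills them
            writer.writerow([line_id] + parts + ["", ""])
```

This only reproduces the column layout the loader expects; to get meaningful EventId/EventTemplate values you still need to run Drain (or another parser) over the log.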
Could you please provide a benchmark (details of argument values for each experiment) for accurate reproduction of results?
Perhaps, similar to the Loglizer repository which has a benchmarks folder.
Hello, I have some questions about the 'seq' field in the data.
I tried to find where 'seq' is generated in the code. It appears in the data generated by the session window, but the training data saved by the sliding window does not contain this field. My guess is that it is a list of log indices; does that make sense?
Is there any mistake in my understanding? Could you please help with this?
I have some questions about loading the datasets, especially HDFS.
RQ1: What is the parameter "history_size" used for, and what value should it take?
The explanation of "history_size" in the code is that it splits sequences for DeepLog, LogAnomaly & LogRobust.
I found that "history_size" is used in the sliding_window() function, as shown in the picture: the function splits each sequence with a fixed window whose size is "history_size".
My question is: why do you use "history_size" to cut the data into fixed-length sequences, even for the session-windowed HDFS data?
As a result, every sequence in the final training dataset has length "history_size".
Is there any mistake in my understanding? Can you help me answer this?
RQ2: Is there any other way to load the datasets? Maybe a different loading path would resolve RQ1?
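For context on RQ1, this is how a fixed "history_size" window is commonly used to turn one session's log-key sequence into (history, next-key) training pairs for DeepLog-style next-event prediction. A minimal sketch with illustrative names, not the repo's exact code:

```python
# Split one log-key sequence into fixed-length (history, target) pairs:
# the model sees `history_size` keys and predicts the next one.
def split_by_history(seq, history_size):
    pairs = []
    for i in range(len(seq) - history_size):
        history = seq[i:i + history_size]   # fixed-length input window
        target = seq[i + history_size]      # the log key to predict next
        pairs.append((history, target))
    return pairs

# A session window (e.g. one HDFS block) can be longer than history_size,
# so it is further cut into fixed-length sub-sequences:
session = [5, 5, 7, 9, 11, 9, 26, 26]
pairs = split_by_history(session, history_size=4)
# yields 4 pairs, each with a history of exactly 4 keys
```

This would explain why even session-windowed HDFS data ends up as sequences of length "history_size": the forecasting models need fixed-length inputs.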
In logadempirical/logdeep/dataset/sample.py line 161, the value of sequential_pattern is set to an empty list:
sequential_pattern = []
This is the main input of the DeepLog model, and it causes an error when running DeepLog because the input is empty.
However, according to the logdeep repository, this feature is supposed to be initialized as:
sequential_pattern = line[i:i + window_size]
Please update the code.
This is great work! I just found that there is no license information provided in the repository, which makes it impossible for others to reuse it officially. Can you please add license information? Thank you!
How can I get the raw Spirit dataset (172 million log messages)?
In logadempirical/logdeep/models/lstm.py line 109, the inputs are set as:
input0, input1 = features[2], features[1]
where features[2] is the semantic pattern according to logadempirical/logdeep/dataset/sample.py. However, in the original code of the logdeep repository, we have:
input0, input1 = features[0], features[1]
where features[0] is the sequential pattern.
Hence, the current LogAnomaly model in this repository is not working properly.
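To make the index fix concrete, here is a hypothetical sketch of the feature ordering described in this issue (sequential, quantitative, semantic). The names and the `loganomaly_inputs` helper are illustrative, not the repo's actual code:

```python
from collections import namedtuple

# Feature tuple ordering as reported in the issue:
# features[0] = sequential pattern, features[1] = quantitative pattern,
# features[2] = semantic pattern.
Features = namedtuple("Features", ["sequential", "quantitative", "semantic"])

def loganomaly_inputs(features):
    # LogAnomaly consumes the sequential and quantitative patterns,
    # i.e. features[0] and features[1] -- not features[2] and features[1].
    return features.sequential, features.quantitative

f = Features(sequential=[3, 1, 4], quantitative=[0, 1, 2], semantic=[[0.1] * 300])
input0, input1 = loganomaly_inputs(f)
```

Using named fields instead of bare indices would make mix-ups like this easier to catch.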
Guys, this code seems to be missing something. Following the demo script, it either runs for a very long time or fails to run.
Could someone fix this? Thank you very much!!!
You can add me on WeChat: wlf0104, QQ: 1780130585,
or email me at [email protected].
Could you please update the requirements for the project? I suspect they are outdated.
Do you use the same HDFS log dataset as in the DeepLog paper? Could you please provide the log dataset, or point me to somewhere I can view the logs?
Hi, I was trying to run the LogAnomaly model; on the command line I simply changed the model option to 'loganomaly'. It shows this error:
File "/home/LogADEmpirical/logadempirical/logdeep/dataset/vocab.py", line 46, in find_similar
if sim > 0.90:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Could you please help with this?
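A likely cause, assuming `sim` in vocab.py comes from a vectorized cosine similarity (e.g. sklearn's `cosine_similarity`, which returns a 2-D array rather than a scalar): `sim > 0.90` is then an element-wise comparison, and NumPy refuses to use the resulting boolean array in an `if`. A sketch of a possible fix, reducing `sim` to a scalar before comparing:

```python
import numpy as np

def is_similar(sim, threshold=0.90):
    # Collapse array-valued similarities (e.g. shape (1, N)) to a single
    # scalar before the threshold test; plain floats pass through unchanged.
    score = float(np.max(np.asarray(sim)))
    return score > threshold
```

Whether `max` is the right reduction depends on what `find_similar` is meant to do; if the array is shape (1, 1), indexing with `sim[0][0]` would be equivalent.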
I followed the README.md demo shell script, and I got this error.
Please tell me how to change the parameters to avoid this problem.
I am trying to reproduce the results of your paper.
In your paper, you wrote "We leverage 10 million continuous log lines..." for the Thunderbird dataset. Could you tell me which part you actually used?
(I tried using the first 10 million lines but the results seemed different.)
I'm trying to reproduce your results (like another poster here)...
Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting the error below. Can you advise how/where to get your "embeddings.json" file?
python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset
Traceback (most recent call last):
File "main_run.py", line 213, in <module>
main()
File "main_run.py", line 195, in main
run_deeplog(options)
File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
Trainer(options).start_train()
File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
train_logs, train_labels = sliding_window(data,
File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
event2semantic_vec = read_json(os.path.join(data_dir, e_name))
File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
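From the traceback, `sliding_window()` reads "embeddings.json" from the dataset folder, which appears to map each event (template) ID to a semantic vector. A minimal stand-in just to unblock the pipeline, assuming a `{event_id: vector}` layout and a 300-dimensional vector size (both assumptions); the real file would encode template semantics from pretrained word embeddings, which this placeholder does not:

```python
import json
import os

def write_placeholder_embeddings(event_ids, out_dir, dim=300):
    # Zero vectors only make the loader run; they carry no semantic
    # information, so semantic-based models will not perform meaningfully.
    vecs = {str(eid): [0.0] * dim for eid in event_ids}
    path = os.path.join(out_dir, "embeddings.json")
    with open(path, "w") as f:
        json.dump(vecs, f)
    return path
```

For real results you would need the embeddings file the authors used (or regenerate semantic vectors from the Drain templates yourself).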
In logadempirical/PLELog/data/DataLoader.py line 324, there is this line:
train = train[:100000]
which restricts the training dataset to the first 100000 logs. Why has this been applied?
In fact, it makes the training size different from what is reported in the paper.
Similarly, on line 330:
val = val[:20000]
Hi, I was trying out DeepLog on the HDFS_1 dataset (I used only the first 1M lines, parsed by Drain).
I ran it with the following parameter settings:
python main_run.py --folder=hdfs_1m/ --log_file=HDFS_1m.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
This is the result:
Precision: 86.964%, Recall: 53.931%, F1-measure: 66.576%, Specificity: 0.996
(I have tried different parameter settings; there's not much of a difference.)
Could you please give me some advice? Thanks in advance!
Hello, I was trying to run DeepLog on the HDFS dataset but ended up with the following error.
The command I used:
!python main_run.py --folder=bgl/ --log_file=HDFS.log --dataset_name=hdfs --model_name=deeplog --window_type=sliding --sample=sliding_window --is_logkey --train_size=0.8 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=150 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=60 --step_size=60 --output_dir=experimental_results/demo/random/ --is_process
Are we supposed to run other scripts first to generate such files (for example data_loader.py or synthesize.py)?
Can we re-run the code with other formats of the HDFS dataset that are publicly available?
Thanks,