
logadempirical's People

Contributors

vanhoanglepsa

logadempirical's Issues

How can I find this BGL.log_structured.csv?

When I run this program following your README.md, this error occurred.

FileNotFoundError: [Errno 2] No such file or directory: './dataset/bgl/BGL.log_structured.csv'

And I do not know how to convert BGL.log into BGL.log_structured.csv.
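For context, the *_structured.csv files are normally produced by running the raw log through a log parser such as Drain before this project is run. Below is a minimal sketch using the LogPAI logparser toolkit; the BGL format string, regexes, and thresholds follow the logparser benchmark settings and are assumptions you may need to adjust.

    from logparser import Drain  # LogPAI logparser toolkit (github.com/logpai/logparser), assumed installed

    input_dir  = './dataset/bgl/'   # directory containing BGL.log
    output_dir = './dataset/bgl/'   # BGL.log_structured.csv will be written here
    log_file   = 'BGL.log'

    # BGL header layout as used in the logparser benchmark (assumption)
    log_format = '<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> <Type> <Component> <Level> <Content>'
    regex = [r'core\.\d+']          # optional preprocessing regexes
    st, depth = 0.5, 4              # similarity threshold and parse-tree depth

    parser = Drain.LogParser(log_format, indir=input_dir, outdir=output_dir,
                             depth=depth, st=st, rex=regex)
    parser.parse(log_file)          # writes BGL.log_structured.csv and BGL.log_templates.csv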

Benchmark is missing

Could you please provide a benchmark (details of argument values for each experiment) for accurate reproduction of results?

Perhaps something similar to the Loglizer repository, which has a benchmarks folder.

Missing data['Seq']: what is 'Seq'?

Hello, I have some questions about the 'Seq' field in the data:
[screenshot]
I tried to find where 'Seq' is generated in the code. It is created for the session-window data, but the training data saved by the sliding window does not contain this field. I think it is a list of log-line indices; does that make sense?
Is there any mistake in my understanding? Could you please help with this?
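For what it's worth, here is a hedged sketch of that interpretation, i.e. 'Seq' as the list of log-line indices belonging to each session. The column names, the block-id grouping, and the output format are assumptions, not confirmed by the maintainers.

    import pandas as pd

    # Assumed input: Drain's structured output, one row per log line.
    df = pd.read_csv('./dataset/hdfs/HDFS.log_structured.csv')
    df['LineIdx'] = df.index
    # HDFS sessions are usually grouped by the block id found in the message (assumption).
    df['BlockId'] = df['Content'].str.extract(r'(blk_-?\d+)', expand=False)

    # Hypothetical reconstruction of 'Seq': the indices of the log lines in each session.
    sessions = (df.groupby('BlockId')['LineIdx']
                  .apply(list)
                  .reset_index()
                  .rename(columns={'LineIdx': 'Seq'}))
    print(sessions.head())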

Some questions about loading datasets, especially HDFS

I have some questions about loading the datasets, especially HDFS.

RQ1: What is the parameter "history_size" used for, and what value should it take?

The comment on "history_size" in the code says it is used to split sequences for DeepLog, LogAnomaly, and LogRobust.
I found that "history_size" is used in the sliding_window() function, as shown in the screenshot below. In this function, sequences are split with a fixed window whose size is "history_size".

[screenshot]

My question is: why do you use "history_size" to cut sequences into fixed-length windows, even for the session-windowed HDFS data?
As a result, every sequence in the final training dataset has length "history_size".

[screenshots]

Is there any mistake in my understanding? Can you help me answer it?
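For illustration, here is a minimal sketch of the fixed-window splitting described above; the function and variable names are illustrative, not the repository's actual code.

    def split_fixed_windows(session, history_size=10, padding=0):
        """Cut one session (a list of log-key ids) into fixed-length windows.

        Each window of history_size keys is paired with the next key as the
        prediction target, as in DeepLog-style training. Short sessions are
        padded so at least one window exists (assumption).
        """
        if len(session) < history_size + 1:
            session = session + [padding] * (history_size + 1 - len(session))
        windows, targets = [], []
        for i in range(len(session) - history_size):
            windows.append(session[i:i + history_size])
            targets.append(session[i + history_size])
        return windows, targets

    # Even a session-windowed HDFS sequence ends up as history_size-length inputs:
    w, t = split_fixed_windows([5, 5, 7, 9, 11, 9, 26, 26], history_size=5)
    print(w[0], '->', t[0])   # [5, 5, 7, 9, 11] -> 9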

RQ2: Is there another way to load the datasets? Maybe a different loading method would resolve RQ1?
[screenshot]

Deeplog input is empty

In logadempirical/logdeep/dataset/sample.py line 161, the value of sequential_pattern is set to an empty list:

sequential_pattern = []

which is the main input of the DeepLog model. This causes an error when running DeepLog, saying that the input is empty.

However, according to the logdeep repository, this feature is supposed to be initialized as:
sequential_pattern = line[i:i + window_size]

Please update the code.
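For context, here is a hedged sketch of how the three logdeep-style feature types are typically built from one window, with the reported fix in place; the shapes and embedding dimensionality are assumptions and may differ from this repository.

    from collections import Counter

    def build_features(window, num_keys, event2vec, emb_dim=300):
        """Build sequential, quantitative, and semantic features for one window (sketch)."""
        sequential_pattern = list(window)              # the raw key window (the reported fix)
        quantitative_pattern = [0] * num_keys          # key counts over the window
        for key, cnt in Counter(window).items():
            quantitative_pattern[key] = cnt
        semantic_pattern = [event2vec.get(key, [0.0] * emb_dim) for key in window]
        return sequential_pattern, quantitative_pattern, semantic_pattern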

License?

This is great work! I just noticed that there is no license information in the repository, which makes it impossible for others to officially reuse it. Could you please add a license? Thank you!

LogAnomaly model design fault

In logadempirical/logdeep/models/lstm.py line 109, the inputs are set as:

input0, input1 = features[2], features[1]

where features[2] is the semantic pattern according to logadempirical/logdeep/dataset/sample.py. However, in the original code of the logdeep repository, we have:

input0, input1 = features[0], features[1]

where features[0] is sequential_pattern.

Hence, the current LogAnomaly model in this repository does not work properly.

The code did not run successfully

Guys, this code is missing something. Following the demo script, it either runs for a very long time or fails to run.
[screenshots]
Is there anyone who can fix this? Thank you very much!
You can add me on WeChat: wlf0104, QQ: 1780130585
Or email me at [email protected]

Update requirements

Could you please update the requirements for the project? I suppose they are outdated.

HDFS log dataset

Do you use the same HDFS log dataset as in the DeepLog paper? Could you please provide the log dataset, or point to somewhere I can view the logs?

ValueError while running LogAnomaly

Hi, I was trying to run the LogAnomaly model; on the command line I simply changed the model option to 'loganomaly'. It shows this error:

File "/home/LogADEmpirical/logadempirical/logdeep/dataset/vocab.py", line 46, in find_similar
    if sim > 0.90:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Could you please help with this?
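For reference, this error usually means sim is a NumPy array rather than a scalar, so the comparison needs an explicit reduction. Below is a possible fix, sketched under the assumption that any candidate scoring above the threshold should count as similar; the actual intent of find_similar may differ.

    import numpy as np

    def is_similar(sim, threshold=0.90):
        """True if the similarity score(s) exceed the threshold.
        Works for a scalar or an array of candidate similarities (assumption)."""
        return bool(np.max(np.asarray(sim)) > threshold)   # or .any()/.all(), depending on the intent

    print(is_similar(0.95))              # True
    print(is_similar([0.2, 0.93, 0.4]))  # True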

What part of the Thunderbird dataset did you use?

I am trying to reproduce the results of your paper.
In your paper, you wrote "We leverage 10 million continuous log lines..." for the Thunderbird dataset; could you tell me which part you actually used?
(I tried using the first 10 million lines, but it seemed different.)

embeddings.json

I'm trying to reproduce your results (like another poster here)...

Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting the error below. Can you advise how/where to get your "embeddings.json" file? (One possible way to generate it is sketched after the traceback below.)

python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset

Traceback (most recent call last):
  File "main_run.py", line 213, in <module>
    main()
  File "main_run.py", line 195, in main
    run_deeplog(options)
  File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
    Trainer(options).start_train()
  File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
    train_logs, train_labels = sliding_window(data,
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
    event2semantic_vec = read_json(os.path.join(data_dir, e_name))
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
    with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
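From the traceback, embeddings.json is read into event2semantic_vec, i.e. a JSON mapping from event/template IDs to semantic vectors. Below is a hedged sketch of one way to generate such a file from Drain's HDFS.log_templates.csv using sentence-transformers; the key format, vector dimensionality, and model choice that the repository actually expects are assumptions.

    import json
    import pandas as pd
    from sentence_transformers import SentenceTransformer  # assumed available

    templates = pd.read_csv('./dataset/hdfs/HDFS.log_templates.csv')  # columns: EventId, EventTemplate, ...
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Map each event template id to an embedding of its template text (hypothetical format).
    event2vec = {str(row.EventId): model.encode(row.EventTemplate).tolist()
                 for row in templates.itertuples()}

    with open('./dataset/hdfs/embeddings.json', 'w') as f:
        json.dump(event2vec, f)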

Confusion about log parsing

I am confused about how to set the parameters for parsing. I tried to parse HDFS with the settings in the screenshot below, but there was obviously a problem. What is the right way?
There are four data sources. How should we set the parsing parameters for each of them?
[screenshot]
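For reference, here is a hedged sketch of the Drain settings commonly used for these datasets in the LogPAI logparser benchmark; whether they match what the paper used is an assumption, and Thunderbird/Spirit are omitted.

    # Format strings and preprocessing regexes from the logparser benchmark (assumption).
    log_formats = {
        'HDFS': '<Date> <Time> <Pid> <Level> <Component>: <Content>',
        'BGL':  '<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> '
                '<Type> <Component> <Level> <Content>',
    }
    regexes = {
        'HDFS': [r'blk_(|-)[0-9]+', r'(\d+\.){3}\d+(:\d+)?'],   # block ids, IP(:port)
        'BGL':  [r'core\.\d+'],
    }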

why train = train[:100000] in PLELog?

In logadempirical/PLELog/data/DataLoader.py line 324, there is this line:

train = train[:100000]

which restricts the training dataset to the first 100,000 logs. Why has this been applied?

In fact, it makes the training size different than what is reported in the paper.

Similarly on line 330:
val = val[:20000]

The result is not reproducible

Hi, I was trying out DeepLog on the HDFS1 dataset (using only the first 1M lines, parsed by Drain).

I ran it with the following parameter settings:
python main_run.py --folder=hdfs_1m/ --log_file=HDFS_1m.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process

This is the result:
Precision: 86.964%, Recall: 53.931%, F1-measure: 66.576%, Specificity: 0.996
(I have tried different parameter settings; there is not much of a difference.)

Could you please give me some advice? Thanks in advance!

Missing HDFS.log_structured.csv file

Hello, I was trying to run DeepLog on the HDFS dataset but ended up with the following error.

[screenshot]

command I used:
!python main_run.py --folder=bgl/ --log_file=HDFS.log --dataset_name=hdfs --model_name=deeplog --window_type=sliding --sample=sliding_window --is_logkey --train_size=0.8 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=150 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=60 --step_size=60 --output_dir=experimental_results/demo/random/ --is_process

Are we supposed to run other scripts first to generate such files (for example data_loader.py or synthesize.py)?
Can we re-run the code with other formats of the HDFS dataset that are publicly available?
Thanks,
