
logdeep's Introduction

logdeep

Introduction

LogDeep is an open-source, deep-learning-based log analysis toolkit for automated anomaly detection.

Framework of logdeep

Note: This repo does not include log parsing. If you need it, please check logparser

Major features

  • Modular Design

  • Supports multiple log event features out of the box

  • State-of-the-art results (including results from DeepLog, LogAnomaly, RobustLog, ...)

Models

Model       Paper reference
DeepLog     [CCS'17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
LogAnomaly  [IJCAI'19] LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs
RobustLog   [FSE'19] Robust Log-Based Anomaly Detection on Unstable Log Data

Requirement

  • python>=3.6
  • pytorch >= 1.1.0

Quick start

git clone https://github.com/donglee-afar/logdeep.git
cd logdeep

For an example of building your own log dataset, see SAMPLING_EXAMPLE.md

Train & Test DeepLog example

cd demo
# Train
python deeplog.py train
# Test
python deeplog.py test

The output results, key parameters, and training logs will be saved under the result/ path.

DIY your own pipeline

Here is an example of the key parameters of the loganomaly model, which is in demo/loganomaly.py.
Try modifying these parameters to build a new model!

# Sample
options['sample'] = "sliding_window"  # sampling strategy: "sliding_window" or "session_window"
options['window_size'] = 10           # number of log keys per window

# Features
options['sequentials'] = True    # sequential feature: the raw window of log-key ids
options['quantitatives'] = True  # quantitative feature: log-key count vector per window
options['semantics'] = False     # semantic feature: per-key semantic vectors (event2semantic_vec.json)

Model = loganomaly(input_size=options['input_size'],
                    hidden_size=options['hidden_size'],
                    num_layers=options['num_layers'],
                    num_keys=options['num_classes'])

Benchmark results

HDFS
Model                      Feature   Precision  Recall  F1
DeepLog (unsupervised)     seq       0.9583     0.9330  0.9454
LogAnomaly (unsupervised)  seq+quan  0.9690     0.9825  0.9757
RobustLog (supervised)     semantic  0.9216     0.9586  0.9397

logdeep's People

Contributors

d0ng1ee

logdeep's Issues

A data processing problem

Hi @donglee-afar:
I read the answers in the issues area, but I still don't understand the data processing workflow. I don't know how to convert "sequece_hdfs.csv" to "hdfs_train" in the logdeep project.

Could you please help me with it? 😀
Can you provide a reference code and a workflow?

Thank you very much.

Problems About RobustLog

Hi, thank you for this awesome toolkit!

What confuses me is that you don't seem to include an attention layer in the RobustLog model, although it is mentioned in the paper. Do you mind explaining the reason? I'm an ML/DL newbie. Thanks in advance!

How to generate data sequence

Thank you for the great project.

I am trying to use this implementation on a research project of my own using my own data. I used Drain to parse the data into two files: a structured file and a template file. But I am not sure how to proceed and convert these files into a sequence file (numbers only) like the hdfs_train file.
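For anyone stuck at the same step, here is a hedged sketch (not code from this repo) of turning a Drain/logparser structured CSV into hdfs_train-style integer sequences; the file name, the column names (EventId, Content), and the block-id regex are assumptions that may need adjusting for other datasets.

    # Hypothetical sketch: map the EventId column of a Drain "*_structured.csv"
    # to integer indices and group lines into per-session sequences.
    import re
    from collections import defaultdict

    import pandas as pd

    df = pd.read_csv('HDFS.log_structured.csv')

    # Assign each template a small integer index (1..N), like the numbers in hdfs_train
    event2idx = {eid: i + 1 for i, eid in enumerate(df['EventId'].unique())}

    # Group log lines into sessions by the block id found in the message content
    sessions = defaultdict(list)
    for _, row in df.iterrows():
        match = re.search(r'(blk_-?\d+)', str(row['Content']))
        if match:
            sessions[match.group(1)].append(event2idx[row['EventId']])

    # Write one space-separated event-index sequence per line, like hdfs_train
    with open('my_train', 'w') as f:
        for block_id, seq in sessions.items():
            f.write(' '.join(map(str, seq)) + '\n')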

Question about LogAnomaly

Thank you for your work. It was very helpful.

I have two questions about LogAnomaly.

First,
When I read the LogAnomaly paper, there was a Template2Vec section,
but I can't find that part in your code.
There is a count vector part, but the sequence part does not seem to have Template2Vec applied.

Second,
Attention is implemented in logdeep/models/lstm.py, but it is not used.

Again, Thanks for your work. :)

question about fastText and TF-IDF

Hi! After studying this amazing project, I have two questions.
1. How do you get the event2semantic.json?
2. After I get the log templates, how do I map so many log templates to 28 numbers like in your file?

Looking forward to your reply. (I will be very grateful if you could upload your code :) )

About the attention_net function

Thanks for your code. I have some questions: is the attention_net() function in the file lstm.py complete? And how can I apply attention_net in the loganomaly process? Thanks again!

Possible implementation errors for session_windows

Thanks for your work. Deeplog performs very well when I use the "sliding_window" option on HDFS. However, it performs very poorly when I use the "session_window" option on HDFS (Precision: 2.953%, Recall: 99.994%, F1-measure: 5.736%). Could you please double-check whether your implementation of "session_window" is correct? (Please let me know if anyone else has also had this problem.) Thanks

In particular, you have used the following function to truncate or pad each session for "session_window". From my side, I think it largely impacts the accuracy (Precision and F1-score).

def trp(l, n):
    """ Truncate or pad a list """
    r = l[:n]
    if len(r) < n:
        r.extend(list([0]) * (n - len(r)))
    return r
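For illustration, a quick check of what this helper does (a hedged example of its behavior, not from the repo):

    # trp pads short sessions with the padding key 0 and truncates long ones,
    # so every session_window sample has exactly `window_size` events.
    print(trp([5, 11, 9], 5))            # -> [5, 11, 9, 0, 0]
    print(trp([5, 11, 9, 26, 2, 7], 5))  # -> [5, 11, 9, 26, 2]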

Error of LogAnomaly implementation.

Hi,

Thank you so much for this amazing project!
I am recently playing with different methods here in this project, however, I do find something odd about the implementation of LogAnomaly. Here in this project, LogAnomaly is actually using log event ids within a window to predict the next event id, just the same as DeepLog, which seems wrong.

According to the LogAnomaly paper, it seems to use "semantic vectors" for the prediction (the prediction being an event id or a semantic vector, either way is fine).

So, as far as I can see, should we change the inputs of LogAnomaly from Sequentials and Quantitatives to Semantics and Quantitatives?

Please correct me if I misunderstood something. Thanks again for sharing the project!

Lin, Yang

BGL Data

Hi,

May I know how to get the BGL data ?

An error occurs when the terminal command line runs

(python38) houjingwen@MacBook-Pro-2 demo % python3 loganomaly.py train
Traceback (most recent call last):
File "loganomaly.py", line 11, in
from logdeep.models.lstm import loganomaly,deeplog,robustlog
File "/Users/houjingwen/Desktop/logdeep-master/logdeep/models/lstm.py", line 1, in
import torch
ModuleNotFoundError: No module named 'torch'

Question about feature extraction on bgl dataset

Hi, it's me again. :)

I'm trying to run the Deeplog model on the bgl dataset. So far, I was able to understand the logic and generate the event sequences from the structured bgl log dataset using the sample_bgl.py that you provided (many thanks!!).

It basically slides a 30-min window with a 12-min step size over the structured bgl log. In this case, we end up with event sequences that contain either a huge number of events (e.g. I found an event sequence with 12514 events in it...) or only one or no event (since no event happened in that time period of the sliding window).

After generating the event sequences, I deleted the event sequences with no event and ended up with a file of 65 non-empty event sequences. I randomly picked 60 event sequences as my training sequences, and the remaining 5 will be validation data.

And this is when my questions kick in.

  1. When generating the sequential feature for the training dataset, should I do the same thing as for the hdfs dataset, i.e. slide a window of 10 (or some other size) over the event sequence, with the next event after the current window used as that window's label? But in this case, how should I deal with an event sequence with only 1 event? (See the sketch at the end of this question.)

  2. I also remember you mentioned in another post that the bgl dataset can directly use the event sequence as the sequential vector, since it is already generated using a sliding window. In my understanding, each event sequence (except for its last event) then directly becomes a sequential vector, and the label for this vector is the last event of that sequence? Then what about an event sequence with only 1 event?

Looking forward to your valuable feedback!! And thank you for answering all of my questions!!!
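Regarding question 1, a hypothetical sketch (not code from this repo) of how sliding-window sampling turns one event sequence into (window, label) pairs; the helper name make_pairs is made up for illustration:

    # Sequences shorter than window_size + 1 yield no pairs, which is why 1-event
    # sequences are problematic unless they are padded or merged.
    def make_pairs(event_seq, window_size=10):
        pairs = []
        for i in range(len(event_seq) - window_size):
            window = event_seq[i:i + window_size]   # the last `window_size` events
            label = event_seq[i + window_size]      # the next event to predict
            pairs.append((window, label))
        return pairs

    print(make_pairs([3, 5, 5, 7, 9, 3, 5], window_size=3))
    # -> [([3, 5, 5], 7), ([5, 5, 7], 9), ([5, 7, 9], 3), ([7, 9, 3], 5)]
    print(make_pairs([4], window_size=3))  # -> []  (too short to form a window)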

One-hot encoding?

I see that your input to the deeplog model is just numerical token sequences. Why isn't a one-hot encoding transformation used as the input? The numbers in the sequences represent operations, so they are nominal, not ordinal.
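For reference, a minimal sketch (an illustration of the transformation being asked about, not code from this repo) of what a one-hot encoding of such a token window would look like in PyTorch, assuming 28 log keys:

    import torch
    import torch.nn.functional as F

    window = torch.tensor([5, 11, 9, 26, 2])        # integer log keys, nominal labels
    one_hot = F.one_hot(window, num_classes=28).float()
    print(one_hot.shape)  # torch.Size([5, 28]); each event becomes a 28-dim indicator vector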

question about bgl dataset

Hi, thank you for making this amazing project.

I have some questions about using the BGL dataset to train and test. In logdeep/dataset/sample.py you use 'hdfs/event2semantic_vec.json', and I have no idea what the function of this file is. When I use the BGL dataset, how do I generate this file? Or do I not need it? If it is not needed, how should I modify sample.py?

Looking forward to your reply.

Display original log with results?

Do you have anything that displays the original log records, their ground-truth status as normal or abnormal, and the results from logdeep predictions?

About Sampling (or Feature Extraction)

Hi!

I think section 3B of this paper (Chinese edition here) may help people understand those sampling methods.

B. Feature Extraction

The main purpose of this step is to extract valuable features from log events that could be fed into anomaly detection models. The input of feature extraction is log events generated in the log parsing step, and the output is an event count matrix. In order to extract features, we firstly need to separate log data into various groups, where each group represents a log sequence. To do so, windowing is applied to divide a log dataset into finite chunks [5]. As illustrated in Figure 1, we use three different types of windows: fixed windows, sliding windows, and session windows.

Fixed window: Both fixed windows and sliding windows are based on timestamp, which records the occurrence time of each log. Each fixed window has its size, which means the time span or time duration. As shown in Figure 1, the window size is Δt, which is a constant value, such as one hour or one day. Thus, the number of fixed windows depends on the predefined window size. Logs that happened in the same window are regarded as a log sequence.

Sliding window: Different from fixed windows, sliding windows consist of two attributes: window size and step size, e.g., hourly windows sliding every five minutes. In general, step size is smaller than window size, therefore causing the overlap of different windows. Figure 1 shows that the window size is ΔT , while the step size is the forwarding distance. The number of sliding windows, which is often larger than fixed windows, mainly depends on both window size and step size. Logs that occurred in the same sliding window are also grouped as a log sequence, though logs may duplicate in multiple sliding windows due to the overlap.

Session window: Compared with the above two windowing types, session windows are based on identifiers instead of the timestamp. Identifiers are utilized to mark different execution paths in some log data. For instance, HDFS logs with block_id record the allocation, writing, replication, deletion of certain block. Thus, we can group logs according to the identifiers, where each session window has a unique identifier.

After constructing the log sequences with windowing techniques, an event count matrix X is generated. In each log sequence, we count the occurrence number of each log event to form the event count vector. For example, if the event count vector is [0, 0, 2, 3, 0, 1, 0], it means that event 3 occurred twice and event 4 occurred three times in this log sequence. Finally, plenty of event count vectors are constructed to be an event count matrix X, where entry Xi,j records how many times the event j occurred in the i-th log sequence.
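As a concrete illustration of the event count matrix described above (a hedged sketch, not code from the repo or the paper):

    from collections import Counter

    import numpy as np

    # Three log sequences produced by some windowing strategy; entries are event ids 1..7
    sequences = [
        [3, 3, 4, 4, 4, 6],
        [1, 2, 2, 5],
        [7, 7, 1],
    ]

    num_events = 7
    X = np.zeros((len(sequences), num_events), dtype=int)
    for i, seq in enumerate(sequences):
        for event_id, count in Counter(seq).items():
            X[i, event_id - 1] = count   # X[i, j] = occurrences of event j+1 in sequence i

    print(X[0])  # -> [0 0 2 3 0 1 0], i.e. event 3 twice, event 4 three times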


A question for the author: how is the template.txt file used in data_read('template.txt') obtained? Also, I can't find the deepLog_hdfs_train.txt file used in the second script under the data folder.

Sorry for the late reply,
These are the three code snippets I wrote before; run them in order. I hope they will be useful to you!
@huhui ,@arunbaruah ,@nagsubhadeep, @Magical66
1.

# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 10:54:57 2019

@author: lidongxu1
"""
import re
import spacy
import json

def data_read(filepath):
    fp = open(filepath, "r")
    datas = []  # store the processed lines
    lines = fp.readlines()  # read the whole file
    i = 0  # line counter
    for line in lines:
        row = line.strip('\n')  # strip the newline characters at both ends
        datas.append(row)
        i = i + 1
    fp.close()
    return datas

def camel_to_snake(name):
    """
    # To handle more advanced cases specially (this is not reversible anymore):
    # Ref: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case  
    """
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()


def replace_all_blank(value):
    """
    Remove all non-letter content from value, including punctuation, spaces, newlines, underscores, etc.
    :param value: the content to process
    :return: the processed content
    # https://juejin.im/post/5d50c132f265da03de3af40b
    # \W matches anything that is not a digit, letter or underscore
    """
    result = re.sub('\W+', ' ', value).replace("_", ' ')
    result = re.sub('\d', ' ', result)
    return result
# https://github.com/explosion/spaCy
# https://github.com/hamelsmu/Seq2Seq_Tutorial/issues/1
nlp = spacy.load('en_core_web_sm')
def lemmatize_stop(text):
    """
    https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy
    """
#    nlp = spacy.load('en_core_web_sm')
    document = nlp(text)
    # lemmas = [token.lemma_ for token in document if not token.is_stop]
    lemmas = [token.text for token in document if not token.is_stop]
    return lemmas

def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

data = data_read('template.txt')
result = {}
for i in range(len(data)):
    temp = data[i]
    temp = camel_to_snake(temp)
    temp = replace_all_blank(temp)
    temp = " ".join(temp.split())
    temp = lemmatize_stop(temp)
    result[i] = temp
print(result)
dump_2_json(result, 'eventid2template.json')





# Save only the fastText word vectors that are actually needed by the templates
template_set = set()
for key in result.keys():
    for word in result[key]:
        template_set.add(word)

import io
from tqdm import tqdm

# https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in tqdm(fin):
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

fasttext = load_vectors('cc.en.300.vec')

template_fasttext_map = {}

for word in template_set:
    template_fasttext_map[word] = list(fasttext[word])
    

dump_2_json(template_fasttext_map,'fasttext_map.json')

import os
import json
import numpy as np
import pandas as pd
from collections import Counter
import math

def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict

eventid2template = read_json('eventid2template.json')
fasttext_map = read_json('fasttext_map.json')
print(eventid2template)
dataset = list()
with open('data/'+'deepLog_hdfs_train.txt', 'r') as f:
    for line in f.readlines():
        line = tuple(map(lambda n: n - 1, map(int, line.strip().split())))
        dataset.append(line)
print(len(dataset))
idf_matrix = list()
for seq in dataset:
    for event in seq:
        idf_matrix.append(eventid2template[str(event)])
print(len(idf_matrix))
idf_matrix = np.array(idf_matrix)
X_counts = []
for i in range(idf_matrix.shape[0]):
    word_counts = Counter(idf_matrix[i])
    X_counts.append(word_counts)
print(X_counts[1000])
X_df = pd.DataFrame(X_counts)
X_df = X_df.fillna(0)
print(len(X_df))
print(X_df.head())
events = X_df.columns
print(events)
X = X_df.values
num_instance, num_event = X.shape

print('tf-idf here')
df_vec = np.sum(X > 0, axis=0)
print(df_vec)
print('*'*20)
print(num_instance)
# smooth idf like sklearn
idf_vec = np.log((num_instance + 1)  / (df_vec + 1)) + 1
print(idf_vec)
idf_matrix = X * np.tile(idf_vec, (num_instance, 1))
X_new = idf_matrix
print(X_new.shape)
print(X_new[1000])

word2idf = dict()
for i,j in zip(events,idf_vec):
    word2idf[i]=j
    # smooth idf when oov
    word2idf['oov'] = (math.log((num_instance + 1)  / (29+1)) + 1)

print(word2idf)
def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

dump_2_json(word2idf,'word2idf.json')
import json
import numpy as np
from collections import Counter

def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict

event2template = read_json('eventid2template.json')
fasttext = read_json('fasttext_map.json')
word2idf = read_json('word2idf.json')


event2semantic_vec = dict()
# TODO: compute the TF of each template's words, then build the TF-IDF-weighted sentence (semantic) vector
for event in event2template.keys():
    template = event2template[event]
    tem_len = len(template)
    count = dict(Counter(template))
    for word in count.keys():
        # TF
        TF = count[word]/tem_len
        # IDF
        IDF = word2idf.get(word,word2idf['oov'])
        # print(word)
        # print(TF)
        # print(IDF)
        # print('-'*20)
        count[word] = TF*IDF
    # print(count)
    # print(sum(count.values()))
    value_sum = sum(count.values())
    for word in count.keys():
        count[word] = count[word]/value_sum
    semantic_vec = np.zeros(300)
    for word in count.keys():
        fasttext_weight = np.array(fasttext[word])
        semantic_vec += count[word]*fasttext_weight
    event2semantic_vec[event] = list(semantic_vec)
def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

dump_2_json(event2semantic_vec,'event2semantic_vec_sameoov.json')      

    

Originally posted by @donglee-afar in #3 (comment)

A question about TP, FP, TN, FN!

According to the definitions of TP, FP, TN, and FN in your code, the number of anomalous samples should equal TP + FP. In your code's results, this does not seem to match up. Could there be a problem with the final result judgment in the predict_supervised(self) function in /tools/predict.py?

re.error: missing ), unterminated subpattern at position 21

Traceback (most recent call last):
File "structure_bgl.py", line 66, in
eventmap = match(BGL)
File "structure_bgl.py", line 41, in match
if re.match(r''+item,log_event) and re.match(r''+item,log_event).span()[1] == len(log_event):
File "/home/lepton00/opt/miniconda/lib/python3.7/re.py", line 173, in match
return _compile(pattern, flags).match(string)
File "/home/lepton00/opt/miniconda/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/lepton00/opt/miniconda/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/home/lepton00/opt/miniconda/lib/python3.7/sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/lepton00/opt/miniconda/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/home/lepton00/opt/miniconda/lib/python3.7/sre_parse.py", line 819, in _parse
source.tell() - start)
re.error: missing ), unterminated subpattern at position 21

Hi, does anyone else have this problem? Some answers I found on Stack Overflow suggest adding r before the regex string, which the owner has already done.

Edit: I used Drain in logpai as the parser.
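For anyone hitting this: the error usually means the matched template contains an unescaped regex metacharacter such as "(". A hedged workaround (not the repo's code) is to escape the template before matching; the template string below is just a BGL-style example:

    import re

    # Example template with an unbalanced "(" that would break re.compile
    template = "ciod: failed to read message prefix on control stream (CioStream socket to <*>"
    log_event = "ciod: failed to read message prefix on control stream (CioStream socket to 172.16.96.116:33569"

    # Escape regex metacharacters in the template, then restore Drain's <*> wildcards
    pattern = re.escape(template).replace(re.escape('<*>'), '.*')
    print(bool(re.fullmatch(pattern, log_event)))  # True, and no re.error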

BGL dataset

Would you like to show how you parsed the bgl dataset? Because LogPai does not provide scripts for parsing BGL.

Bi-LSTM robustlog

Nice implementation!
But why is the model of robustlog the same as deeplog's?
In the original paper of robustlog, they use a Bi-LSTM in addition to attention.

Question about hdfs_train, hdfs_test_normal, and hdfs_test_abnormal

Thanks for your awesome work! @donglee-afar

I have two questions about hdfs_train, hdfs_test_normal, and hdfs_test_abnormal:

  1. How do we get them from the whole dataset? I mean, how do we divide the whole dataset into train and test after we already have an event id sequence for each BlockId in the HDFS log? (A hedged sketch follows this question.)

  2. I learned from data/hdfs/gen_train_data.py that hdfs_train contains only normal data. I wonder if I'm right.

Looking forward to your reply!
Thank you!
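Regarding question 1, here is a hedged sketch (not the repo's gen_train_data.py) of one common way to split per-block event-id sequences, assuming the anomaly labels per BlockId come from the HDFS anomaly_label.csv; the helper name split_sessions is made up:

    import random

    def split_sessions(sessions, labels, train_ratio=0.5):
        """sessions: dict block_id -> [event ids]; labels: dict block_id -> 'Normal' or 'Anomaly'."""
        normal = [blk for blk in sessions if labels[blk] == 'Normal']
        abnormal = [blk for blk in sessions if labels[blk] == 'Anomaly']
        random.shuffle(normal)
        n_train = int(len(normal) * train_ratio)
        # train (normal sessions only), test_normal, test_abnormal
        return normal[:n_train], normal[n_train:], abnormal

    sessions = {'blk_1': [5, 5, 7], 'blk_2': [5, 22], 'blk_3': [5, 11, 9]}
    labels = {'blk_1': 'Normal', 'blk_2': 'Anomaly', 'blk_3': 'Normal'}
    print(split_sessions(sessions, labels, train_ratio=0.5))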

Use the deeplog model on streaming log?

Hi, thanks for this awesome toolkit!

I took a look at the BGL dataset and found that anomaly logs with the same label share the same error message. For example, anomaly logs of type KERNDTLB share the error message RAS KERNEL FATAL data TLB error interrupt.

KERNDTLB 1118552678 2005.06.11 R30-M0-N9-C:J16-U01 2005-06-11-22.04.38.300588 R30-M0-N9-C:J16-U01 RAS KERNEL FATAL data TLB error interrupt

So it seems to me that if an alert trigger is built on the real-time streaming log data, a set of regular-expression-based rules would be enough to detect this kind of anomaly. So I'm wondering: is there any advantage to using the deeplog model on streaming logs for anomaly detection compared to regular-expression-based rules?

Any thoughts are welcome!

DeepLog hdfs original unpased data

Hello,
I started to use this repo and it is great!
But I need to see the original data of the train and test sets (normal and abnormal).
I understood that the data is generated from the csv 'HDFS_100k.log_structured', but how was it generated?
And what format exactly are the date and time in this file?

thanks!

Anomaly log file type detection and predict future log error

Dear @donglee-afar

I am working on anomaly log analysis methods, such as the Drain approach, to turn log data into structured data. We have lots of log files and we have no idea which software a specific log file belongs to, like Android, HDFS, etc.

How do we recognize the log file type?
How can we predict future log errors based on the current log data?

In HDFS, is the template count 28?

Thanks for your excellent project, but I am a little confused. I use Drain as the log parser, but the template count is 47, so I want to know what log parsing method you used to get the templates.

Question about obtaining the benchmark result

Thank you for all the amazing work you've done!

I successfully ran through the training and predicting process of deeplog model using the same HDFS data file that you are using (from loghub).

I'm using Drain as my parsing tool to get the structured log data and ended up with 48 unique event IDs in the template. I'm using around 5000 sessions for training, and the training loss and validation loss converged to 0.2 (starting from 0.8) after 300+ epochs. I didn't change the default parameter settings in the deeplog.py file except for the number of classes (48 in my case).

The result that I got from prediction does not look as promising as the benchmark.

I'm not sure why; is it because of the parsing tool?

Any ideas or suggestions for improving the model results are welcome!!

hdfs parsing

Hi
Which log parser do you use to parse the HDFS dataset?

An example to use it for any log file.

Hi, Thanks for making such an amazing project.

I have been trying to use it for my log files. I could parse the log files into their equivalent csv files using Logparser by LogPAI, but I have no idea how to convert the logs into sequences of numbers as you have in your ~/data/hdfs/ directory. Also, how can it then be used for inference on a real-time log file?

Could you please help me with it?

Question about deeplog in logs Apache

How would I go about using deeplog for Apache logs?

192.168.0.14 - - [15/Sep/2021:07:28:39 -0400] "GET /media/plg_system_popup/js/jquery.js HTTP/1.1" 200 293755 "https://192.168.0.52/" "Mozilla /5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
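A hedged sketch (not part of logdeep) of one possible first step: parse Apache combined-format access lines with a regex, then treat a normalized "method + path-pattern + status" string as the log key before mapping keys to integers for deeplog. The regex and the choice of key are assumptions, not a recommendation from the author.

    import re

    # Apache combined log format: host ident user [time] "request" status size "referer" "user-agent"
    LINE_RE = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
    )

    line = ('192.168.0.14 - - [15/Sep/2021:07:28:39 -0400] '
            '"GET /media/plg_system_popup/js/jquery.js HTTP/1.1" 200 293755 '
            '"https://192.168.0.52/" "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"')

    m = LINE_RE.match(line)
    if m:
        # Replace variable parts (file names, ids) to get a coarse template / log key
        path_pattern = re.sub(r'/[^/]+\.(js|css|png|jpg)$', '/<static>', m.group('path'))
        log_key = f"{m.group('method')} {path_pattern} {m.group('status')}"
        print(log_key)  # -> "GET /media/plg_system_popup/js/<static> 200"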
