
Comments (25)

d0ng1ee avatar d0ng1ee commented on May 29, 2024 7

I will update the code and documentation on how to generate the sequence of numbers in the next few days. It will include two methods, depending on your own logs: a time sliding window and a hard-disk ID sequence window.


kartikeyporwal avatar kartikeyporwal commented on May 29, 2024 4

Thanks a lot for sharing your expertise.

I've gone through your code, and here is how I understand your idea (I am including minute details so that it may be helpful to someone in the future):

  1. First, gather all the logs obtained from normal execution of the application, i.e., logs without errors.
  2. Combine these logs and convert them to _structured.csv and _template.csv files using Drain from logpai.
  3. Train the model using the _structured.csv obtained in step 2.
  4. After training and saving the model, test its accuracy using an abnormal log file (a log file with anomalies) and a normal log file, followed by running inference on new log files.
  5. To implement step 4: since new log files will differ in their sequence of events, obtaining _structured.csv and _template.csv with Drain again would not make sense, because the randomly generated event_id for an event would be completely different from the event_id generated for the same event in the training logs. So you proposed structure_bgl.py, with which I can assign event_ids to completely new logs based on the event_ids of the training logs, using the generated event templates. Then sample_bgl.py converts the structured log into a sequence of event_ids, each of which can be replaced by its equivalent integer, and testing can be performed.
  6. Further, to run inference with the model, a new log line (or the log lines in a particular time window) can be matched against the training file's event templates to obtain event_ids (see the sketch below).
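
To make steps 5 and 6 concrete, here is a minimal sketch of how I picture the template matching; the file and column names are hypothetical, not your actual structure_bgl.py:

import re
import pandas as pd

# hypothetical: the templates.csv produced by Drain during training (EventId, EventTemplate columns)
templates = pd.read_csv('templates.csv')

def template_to_regex(template):
    # Drain marks variable fields with <*>; escape the fixed parts and turn <*> into a wildcard
    parts = [re.escape(p) for p in template.split('<*>')]
    return re.compile('^' + '.*?'.join(parts) + '$')

patterns = [(row.EventId, template_to_regex(row.EventTemplate)) for row in templates.itertuples()]

def map_line_to_event_id(content):
    # return the training-time EventId of the first template that matches this new log message
    for event_id, pattern in patterns:
        if pattern.match(content):
            return event_id
    return None  # event not seen during training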

Did I figure it out correctly?

Please feel free to correct me if I failed to describe your approach.

Thanks for your time.


d0ng1ee avatar d0ng1ee commented on May 29, 2024 3

Sorry for the late reply,
These are the three code snippets I wrote before; run them in order. I hope they will be useful to you!
@huhui, @arunbaruah, @nagsubhadeep, @Magical66
1.

# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 10:54:57 2019

@author: lidongxu1
"""
import re
import spacy
import json
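
# Snippet 1: normalize each log template into a list of tokens (camel case split, punctuation
# and digits removed, stop words dropped) and save it as eventid2template.json; then keep only
# the fastText vectors for the words that actually appear in the templates (fasttext_map.json).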

def data_read(filepath):
    fp = open(filepath, "r")
    datas = []  # store the processed lines
    lines = fp.readlines()  # read the whole file at once
    for line in lines:
        row = line.strip('\n')  # strip the trailing newline
        datas.append(row)
    fp.close()
    return datas

def camel_to_snake(name):
    """
    # To handle more advanced cases specially (this is not reversible anymore):
    # Ref: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case  
    """
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()


def replace_all_blank(value):
    """
    Remove everything that is not a letter from value: punctuation, whitespace, underscores, digits, etc.
    :param value: the string to process
    :return: the processed string
    # https://juejin.im/post/5d50c132f265da03de3af40b
    # \W matches any character that is not a letter, digit or underscore
    """
    result = re.sub(r'\W+', ' ', value).replace("_", ' ')
    result = re.sub(r'\d', ' ', result)
    return result
# https://github.com/explosion/spaCy
# https://github.com/hamelsmu/Seq2Seq_Tutorial/issues/1
nlp = spacy.load('en_core_web_sm')
def lemmatize_stop(text):
    """
    https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy
    """
#    nlp = spacy.load('en_core_web_sm')
    document = nlp(text)
    # lemmas = [token.lemma_ for token in document if not token.is_stop]
    lemmas = [token.text for token in document if not token.is_stop]
    return lemmas

def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

data = data_read('template.txt')
result = {}
for i in range(len(data)):
    temp = data[i]
    temp = camel_to_snake(temp)
    temp = replace_all_blank(temp)
    temp = " ".join(temp.split())
    temp = lemmatize_stop(temp)
    result[i] = temp
print(result)
dump_2_json(result, 'eventid2template.json')





# save separately only the fastText word vectors that the templates actually use
template_set = set()
for key in result.keys():
    for word in result[key]:
        template_set.add(word)

import io
from tqdm import tqdm

# https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in tqdm(fin):
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))  # store a list, not a one-shot map object
    return data

fasttext = load_vectors('cc.en.300.vec')

template_fasttext_map = {}

for word in template_set:
    template_fasttext_map[word] = list(fasttext[word])
    

dump_2_json(template_fasttext_map,'fasttext_map.json')
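
# Snippet 2: compute smoothed IDF weights for the template words that occur in the training
# sequences (reads data/deepLog_hdfs_train.txt and the JSON files above, writes word2idf.json).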

import os
import json
import numpy as np
import pandas as pd
from collections import Counter
import math

def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict

eventid2template = read_json('eventid2template.json')
fasttext_map = read_json('fasttext_map.json')
print(eventid2template)
dataset = list()
with open('data/'+'deepLog_hdfs_train.txt', 'r') as f:
    for line in f.readlines():
        line = tuple(map(lambda n: n - 1, map(int, line.strip().split())))
        dataset.append(line)
print(len(dataset))
idf_matrix = list()
for seq in dataset:
    for event in seq:
        idf_matrix.append(eventid2template[str(event)])
print(len(idf_matrix))
idf_matrix = np.array(idf_matrix, dtype=object)  # templates have different lengths, so keep an object array
X_counts = []
for i in range(idf_matrix.shape[0]):
    word_counts = Counter(idf_matrix[i])
    X_counts.append(word_counts)
print(X_counts[1000])
X_df = pd.DataFrame(X_counts)
X_df = X_df.fillna(0)
print(len(X_df))
print(X_df.head())
events = X_df.columns
print(events)
X = X_df.values
num_instance, num_event = X.shape

print('tf-idf here')
df_vec = np.sum(X > 0, axis=0)
print(df_vec)
print('*'*20)
print(num_instance)
# smooth idf like sklearn
idf_vec = np.log((num_instance + 1)  / (df_vec + 1)) + 1
print(idf_vec)
idf_matrix = X * np.tile(idf_vec, (num_instance, 1))
X_new = idf_matrix
print(X_new.shape)
print(X_new[1000])

word2idf = dict()
for i, j in zip(events, idf_vec):
    word2idf[i] = j
# smoothed idf for out-of-vocabulary words
word2idf['oov'] = math.log((num_instance + 1) / (29 + 1)) + 1

print(word2idf)
def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

dump_2_json(word2idf,'word2idf.json')
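
# Snippet 3: combine eventid2template.json, fasttext_map.json and word2idf.json into one
# TF-IDF weighted fastText sentence vector per event id (the semantic-vector JSON).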
import json
import numpy as np
from collections import Counter

def read_json(filename):
    with open(filename, 'r') as load_f:
        file_dict = json.load(load_f)
    return file_dict

event2template = read_json('eventid2template.json')
fasttext = read_json('fasttext_map.json')
word2idf = read_json('word2idf.json')


event2semantic_vec = dict()
# compute the TF of each word in a template, then build the TF-IDF weighted sentence vector
for event in event2template.keys():
    template = event2template[event]
    tem_len = len(template)
    count = dict(Counter(template))
    for word in count.keys():
        # TF
        TF = count[word]/tem_len
        # IDF
        IDF = word2idf.get(word,word2idf['oov'])
        # print(word)
        # print(TF)
        # print(IDF)
        # print('-'*20)
        count[word] = TF*IDF
    # print(count)
    # print(sum(count.values()))
    value_sum = sum(count.values())
    for word in count.keys():
        count[word] = count[word]/value_sum
    semantic_vec = np.zeros(300)
    for word in count.keys():
        fasttext_weight = np.array(fasttext[word])
        semantic_vec += count[word]*fasttext_weight
    event2semantic_vec[event] = list(semantic_vec)
def dump_2_json(dump_dict, target_path):
    '''
    :param dump_dict: submits dict
    :param target_path: json dst save path
    :return:
    '''
    class MyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, bytes):
                return str(obj, encoding='utf-8')
            return json.JSONEncoder.default(self, obj)

    file = open(target_path, 'w', encoding='utf-8')
    file.write(json.dumps(dump_dict, cls=MyEncoder, indent=4))
    file.close()

dump_2_json(event2semantic_vec,'event2semantic_vec_sameoov.json')      

    


kartikeyporwal avatar kartikeyporwal commented on May 29, 2024

Thanks for your response.

Also, thank you in advance for sharing your methods to generate the sequence of numbers from any log file. Almost every log anomaly detection repo uses HDFS logs with pre-computed sequences of numbers from the log messages, and I could not figure out how to apply this to arbitrary log files.


d0ng1ee avatar d0ng1ee commented on May 29, 2024

Hope it will help you :)
Example of how to sample your own log


d0ng1ee avatar d0ng1ee commented on May 29, 2024

What I do in structure_bgl.py is just to use the templates extracted by Drain to map the log file to event_ids, and to extract the time and other information to be used in the next part.
You could say this part is really just data cleaning.
The other parts you mentioned look fine to me.


kartikeyporwal avatar kartikeyporwal commented on May 29, 2024

TBH, this is exactly what I meant: using structure_bgl.py, one can take the templates extracted by Drain during training and map new log files to the event_ids of those templates.


cherishwsx avatar cherishwsx commented on May 29, 2024

Hi! Thank you for posting this amazing project, and thank you @kartikeyporwal for opening this issue! I also have a couple of questions about using the model on my own datasets. Here is my understanding of the workflow:

  1. Taking the raw HDFS log dataset as an example, I first need to transform it into a structured log dataset using LogParser, ending up with the two files _structured.csv and _template.csv.
  2. To get training and test data that look like hdfs_train, hdfs_test_normal and hdfs_test_abnormal from the structured log dataset in step 1, I first need to do the sampling to generate the sequences of numbers, as you stated in the example of how to sample your own log. After having the event sequences, I need to do the train/test split manually to get the three datasets listed above (see the sketch after this list).
  3. With the datasets in place, we can train a model (e.g. deeplog.py) on the hdfs_train file; it uses the sliding-window sampling method to generate the sequence vector, count vector and semantic vector used to train the deep learning model, and we can choose our own combination of feature vectors.
  4. Lastly, use the saved model to do inference on the test dataset.
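
For the manual split in step 2, this is roughly what I have in mind; a minimal sketch with made-up file names, assuming every sampled sequence already carries a normal/abnormal label:

import random

# hypothetical input: one event-id sequence per line plus a 0/1 label, e.g. "5 5 22 11 9,0"
with open('sampled_sequences.csv') as f:
    rows = [line.rstrip('\n').rsplit(',', 1) for line in f if line.strip()]

normal = [seq for seq, label in rows if label == '0']
abnormal = [seq for seq, label in rows if label == '1']

random.seed(42)
random.shuffle(normal)
split = int(len(normal) * 0.8)  # e.g. keep 80% of the normal sequences for training

for name, seqs in [('hdfs_train', normal[:split]),
                   ('hdfs_test_normal', normal[split:]),
                   ('hdfs_test_abnormal', abnormal)]:
    with open(name, 'w') as f:
        f.write('\n'.join(seqs) + '\n')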

The questions I have, based on the workflow I described above, are:

  1. When generating the semantic vector, the code uses an event2semantic_vec.json file that maps the event numbers 0-28 to different vectors. I guess this file is generated specifically for the HDFS dataset to correspond to each eventID in HDFS, right? How can we generate such a JSON file if we are using our own log data?
  2. I'm also a little confused about the two sampling parts in steps 2 and 3. My understanding is that the sampling in step 2 generates the sequences of events we see in the hdfs_train file, and I believe it also depends on which window type you choose, right? Based on your sampling example for HDFS, I think that is session-window sampling, which is the same method used in the loglizer dataloader.py. And the sampling in sample.py is mainly for generating the feature vectors, right?

Feel free to correct me if anything is wrong! Looking forward to any feedback! :)


d0ng1ee avatar d0ng1ee commented on May 29, 2024

hi @cherishwsx
1.
I just use the Facebook open-source fastText pre-trained word vectors to get a word vector for each token, and the TF-IDF method to generate a sentence vector for each log template (corresponding to each eventID).
If you are interested I can upload my code as a reference.
2.
The input sequence length for LSTM training needs to be fixed.

You are right: first sample the original HDFS dataset by session window, then use a sliding window to generate the feature vectors (count vector and sequence vector).
If you use the BGL dataset, just sample the original data by sliding window and use that directly to generate the features.

In robustlog (supervised learning), I just use a fixed sequence length (crop and pad) to train the LSTM in the code...
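
Roughly what I mean by session-window sampling for HDFS, just a sketch (it assumes the logparser _structured.csv has Content and EventId columns; adjust to your own files):

import re
from collections import defaultdict
import pandas as pd

structured = pd.read_csv('HDFS.log_structured.csv')

# group events into sessions by the block id that appears in each log message
sessions = defaultdict(list)
for content, event_id in zip(structured['Content'], structured['EventId']):
    for block_id in set(re.findall(r'blk_-?\d+', str(content))):
        sessions[block_id].append(event_id)

# each session becomes one line of event ids, like the sequences in hdfs_train
with open('hdfs_sequences.txt', 'w') as f:
    for block_id, events in sessions.items():
        f.write(' '.join(map(str, events)) + '\n')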


cherishwsx avatar cherishwsx commented on May 29, 2024

Thank you for the reply!!

  1. It would be great if you can upload your code! I really appreciate that!
  2. Do you mean the sequence length (28 in your case, since this number will differ depending on the parser tool) that you use to initialize some of the feature vector lengths?


d0ng1ee avatar d0ng1ee commented on May 29, 2024

I mean that the length of the sequence [5 5 5 22 11 9 11 9 11 9] is fixed at 10 in deeplog and loganomaly.
Example:
sequence [5 5 5 22 11 9 11 9 11 9]

sequence_vector=[5,5,5,22,11,9,11,9,11,9] 
count_vector=[0]*28
count_vector[5] = 3
count_vector[9] = 3
count_vector[11] = 3
count_vector[22] = 1

28 is just the number of templates in the HDFS dataset (ground truth, not parsed by myself).
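
In code, the sliding window looks roughly like this; a sketch only (assuming 28 templates and window_size=10), not the exact implementation in sample.py:

def make_windows(sequence, num_templates=28, window_size=10):
    samples = []
    for i in range(len(sequence) - window_size):
        window = sequence[i:i + window_size]   # sequence vector
        count_vector = [0] * num_templates     # count vector
        for event in window:
            count_vector[event] += 1
        label = sequence[i + window_size]      # the next event the model should predict
        samples.append((window, count_vector, label))
    return samples

print(make_windows([5, 5, 5, 22, 11, 9, 11, 9, 11, 9, 26]))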


cherishwsx avatar cherishwsx commented on May 29, 2024

In this case, I think you are referring to the window_size parameter (default is 10), correct? For example, if window_size is set to 20, then the length of the sequence will be 20, and the sequence_vector and count_vector (like the code chunk you showed above) will be created from the length-20 sequence.


d0ng1ee avatar d0ng1ee commented on May 29, 2024

You are right! @cherishwsx


Magical66 avatar Magical66 commented on May 29, 2024

hi @cherishwsx
1.
I just use the Facebook open-source fastText pre-trained word vectors to get a word vector for each token, and the TF-IDF method to generate a sentence vector for each log template (corresponding to each eventID).
If you are interested I can upload my code as a reference.
@donglee-afar I'm very interested in how to get the event2semantic_vec.json file. Could you please upload the code? Thank you very much!


nagsubhadeep avatar nagsubhadeep commented on May 29, 2024

hi @cherishwsx
1.
I just use the Facebook open-source fastText pre-trained word vectors to get a word vector for each token, and the TF-IDF method to generate a sentence vector for each log template (corresponding to each eventID).
If you are interested I can upload my code as a reference.
2.
The input sequence length for LSTM training needs to be fixed.

You are right: first sample the original HDFS dataset by session window, then use a sliding window to generate the feature vectors (count vector and sequence vector).
If you use the BGL dataset, just sample the original data by sliding window and use that directly to generate the features.

In robustlog (supervised learning), I just use a fixed sequence length (crop and pad) to train the LSTM in the code...

Hi @donglee-afar ,

Can you please upload the code that is used to generate the event2semantic_vec.json file?

Thanks,
Deep


arunbaruah avatar arunbaruah commented on May 29, 2024

Hi @donglee-afar,

Please give me a hint for creating the event2semantic_vec.json file. Thank you


huhui avatar huhui commented on May 29, 2024

Hi @donglee-afar,

How do I use fastText to generate the event2semantic_vec.json file? Looking forward to any feedback! Thank you!


zeinabfarhoudi avatar zeinabfarhoudi commented on May 29, 2024

Hi @donglee-afar

What is the format of the "template.txt" file? Is it the same as the "templates.csv" file in the repository, which includes EventId and EventTemplate, or as the hdfs.log file?

Thanks


ZanisAli avatar ZanisAli commented on May 29, 2024

Hi @donglee-afar

What is the format of the "template.txt" file? Is it the same as the "templates.csv" file in the repository, which includes EventId and EventTemplate, or as the hdfs.log file?

Thanks

Hi,
These are the templates: you can dump the templates from the templates.csv file into this file as .txt. I am not sure, but I think you could even use the templates.csv file itself, as long as only the templates column is kept and the header row is removed.
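
For example, something like this should do it (assuming the usual logparser templates.csv with an EventTemplate column):

import pandas as pd

templates = pd.read_csv('HDFS.log_templates.csv')
# keep only the template text: no header, one template per line
templates['EventTemplate'].to_csv('template.txt', index=False, header=False)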


zeinabfarhoudi avatar zeinabfarhoudi commented on May 29, 2024

@ZanisAli Thank you for your reply

I have another question: in the testing phase, to predict "test_normal" and "test_abnormal", does the "template.txt" file get updated relative to the training phase or not? In other words, is the "template.txt" file used in training the same one used in the test step?


ZanisAli avatar ZanisAli commented on May 29, 2024

@Farhodi For the training and testing part, the template.txt file will not be used at all; instead, the sequences generated from the structured file (created by the template identification technique) are used. The template.txt file here is only used by robustlog to generate the event2vector_semantics.json file. Apart from that, this file is not used.


zeinabfarhoudi avatar zeinabfarhoudi commented on May 29, 2024

@ZanisAli, thanks for your help.

Is it possible to edit the "LogAnomaly" demo code to have semantic information?


gavine avatar gavine commented on May 29, 2024

Thanks @donglee-afar for this fantastic project and all the good work, and thanks @cherishwsx for the good summary.

" And to get the training data and test data that look like the hdfs_train, hdfs_test_normal and hdfs_test_abnormal from the structured log dataset that I got from step 1, I will need to first do the sampling to generate the sequence of number as you stated in the example of how to sample your own log. Then after having the event sequences, I will need to do the train test split mannually to get the three datsasets that listed above."

Could you please elaborate on the above a bit more with regard to how to generate these three files: hdfs_train, hdfs_test_normal and hdfs_test_abnormal?

What I am trying to achieve here is to apply robustlog to Linux logs (e.g. syslogs). I first parse the syslogs with Drain to get the templates, then label the original syslogs by placing "-" (or another marker for abnormal) in the first field, and after that apply structure_bgl.py and sample_bgl.py to structure and sample the logs respectively. I am stuck at the next step, training and validating the model and making predictions with it, which is where the above question comes from.

Would you please help here? Thanks a lot!


michhar avatar michhar commented on May 29, 2024

Sorry for the late reply. These are the three code snippets I wrote before; run them in order. I hope they will be useful to you! @huhui, @arunbaruah, @nagsubhadeep, @Magical66 1.

# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 10:54:57 2019

@author: lidongxu1
"""
import re
import spacy
import json

<code snipped for brevity, see above in thread for full code>

dump_2_json(event2semantic_vec,'event2semantic_vec_sameoov.json')

Thank you so much, @donglee-afar, for the excellent project and the code snippets for preprocessing. I have used your example code snippets to create a gist for preprocessing (in my case it was Ubuntu system logs, and I used a parser project based on SPELL as well as my own text normalization method). This was mainly to create the event2semantic_vec.json semantics file for use with the LogAnomaly method you've implemented.

Here is the gist in case it can help anyone (pls give feedback as you wish): https://gist.github.com/michhar/388d037439da6114d67aa8f793293870

Best regards.


Nightmare2334 avatar Nightmare2334 commented on May 29, 2024

Thanks @donglee-afar for this fantastic project and all the good work, and thanks @cherishwsx for the good summary.

"To get training and test data that look like hdfs_train, hdfs_test_normal and hdfs_test_abnormal from the structured log dataset in step 1, I first need to do the sampling to generate the sequences of numbers, as you stated in the example of how to sample your own log. After having the event sequences, I need to do the train/test split manually to get the three datasets listed above."

Could you please elaborate on the above a bit more with regard to how to generate these three files: hdfs_train, hdfs_test_normal and hdfs_test_abnormal?

What I am trying to achieve here is to apply robustlog to Linux logs (e.g. syslogs): first parse the syslogs with Drain to get the templates, label the original syslogs, then apply structure_bgl.py and sample_bgl.py to structure and sample the logs respectively; the next step of training, validating and predicting with the model is where the above question comes from.

Would you please help here? Thanks a lot!

@gavine Hello buddy, I have the same problem as you. Can you help me? I hope you can reply to me when you see it. This is very important to me. Thank you!


