
korcen-ml's Introduction

Korcen


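korcen-ml is a project that uses deep learning to improve accuracy beyond the original keyword-based korcen, whose main weakness is that it is easy to bypass.

Only the KOGPT2 model is currently public, and its model file can be found here.

If you would like to download more model files or the training data, please get in touch.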
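To illustrate why plain keyword matching is easy to evade, here is a minimal sketch of a naive keyword filter. It is not korcen's actual implementation: `BLOCKLIST`, its placeholder entry, and `keyword_filter` are hypothetical names chosen for this example.

```python
# Hypothetical blocklist; a placeholder word stands in for real profanity.
BLOCKLIST = {"badword"}


def keyword_filter(text: str) -> bool:
    """Return True if any blocklisted keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


print(keyword_filter("this is a badword"))     # True
print(keyword_filter("this is a b a d word"))  # False: extra spacing defeats the match
print(keyword_filter("this is a b@dword"))     # False: character substitution defeats it
```

A learned classifier scores the whole sentence instead of looking for exact substrings, which is the motivation the project gives for moving beyond keywords.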

Model                       Data sentences
VDCNN(23.4.30)              200,000
VDCNN_KOGPT2(23.5.28)       2,000,000
VDCNN_LLAMA2(23.9.30)       5,000,000
VDCNN_LLAMA2_V2(24.1.29)    10,000,000

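Original keyword-based library: py version, ts version

Support Discord server

Model validation

Each dataset defines profanity differently, so please keep in mind that the figures below include some error.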

Model | korean-malicious-comments-dataset | Curse-detection-data | kmhas_korean_hate_speech | Korean Extremist Website Womad Hate Speech Data | LGBT-targeted HateSpeech Comments Dataset (Korean)
korcen 0.7121 0.8415 0.6800 0.6305 0.4479
VDCNN(23.4.30) 0.6900 0.4885 0.4885
VDCNN_KOGPT2(23.6.15) 0.7545 0.7824 0.7055 0.6875
VDCNN_LLAMA2(23.9.30) 0.7762 0.8104 0.7296
VDCNN_LLAMA2_V2(24.1.29) 0.8322 0.8410 0.7837 0.7120 0.7477
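
The source does not state which metric the table reports or how scores were thresholded. As one plausible reading, the sketch below computes plain accuracy by thresholding model scores at 0.5 and comparing them with binary labels. `accuracy_at_threshold` and the sample numbers are illustrative assumptions, not the project's evaluation script.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of samples where (score >= threshold) matches the 0/1 label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    correct = sum(
        int(score >= threshold) == int(label)
        for score, label in zip(scores, labels)
    )
    return correct / len(scores)


# Made-up scores and labels, purely for illustration.
scores = [0.91, 0.12, 0.55, 0.40]
labels = [1, 0, 0, 1]
print(accuracy_at_threshold(scores, labels))  # 0.5
```

Because each dataset labels profanity by its own criteria, the same scores can yield different accuracy depending on which label set they are compared against.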

example

# Tested with Python 3.10 and TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000

model_path = 'vdcnn_model.h5'
tokenizer_path = "tokenizer.pickle"

model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

def preprocess_text(text):
    # Lowercase the input before tokenizing
    text = text.lower()
    return text

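def predict_text(text):
    # Return the model's score for one sentence (compared against 0.5 below)
    sentence = preprocess_text(text)
    # Tokenize to input IDs, padded or truncated to exactly maxlen tokens
    encoded_sentence = tokenizer.encode_plus(sentence,
                                             max_length=maxlen,
                                             padding="max_length",
                                             truncation=True)['input_ids']
    # The IDs are already maxlen long, so this mainly builds the (1, maxlen) batch array
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    prediction = model.predict(sentence_seq)[0][0]
    return prediction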
    
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
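
The interactive loop above classifies one sentence at a time. A minimal sketch for labelling several sentences with a configurable threshold follows. `classify_batch` is a hypothetical helper, not part of the repository; it takes the scoring function as a parameter so that `predict_text` from the example could be passed in. A stub scorer (`fake_score`, also hypothetical) is used here so the snippet runs without the model files.

```python
from typing import Callable, Iterable, List, Tuple


def classify_batch(
    sentences: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Score each sentence and flag it as abusive when score >= threshold."""
    results = []
    for sentence in sentences:
        score = float(score_fn(sentence))
        results.append((sentence, score, score >= threshold))
    return results


# Stub scorer standing in for predict_text, so this runs without the model.
def fake_score(sentence: str) -> float:
    return 0.9 if "badword" in sentence.lower() else 0.1


for sentence, score, abusive in classify_batch(["hello", "you badword"], fake_score):
    print(f"{score:.2f}  {'abusive' if abusive else 'normal'}  {sentence}")
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives without touching the model.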

Maker

Tanat

github:   Tanat05
discord:  Tanat05
email:    [email protected]

Reference

@misc {l._junbum_2023,
    author       = { {L. Junbum} },
    title        = { llama-2-ko-70b },
    year         = 2023,
    url          = { https://huggingface.co/beomi/llama-2-ko-70b },
    doi          = { 10.57967/hf/1130 },
    publisher    = { Hugging Face }
}

License

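All of korcen is released under the Apache-2.0 license. If you use the models or code, please comply with the license terms.

  • License and copyright notices are required (displayed somewhere the general public can access)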

Copyright© All rights reserved.


korcen-ml's Issues

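There is an error

There is an error at example/main.py:38: both branches print the same message.

[Original]

    if result >= 0.5:
        print("욕설입니다")  # "This is profanity"
    else:
        print("욕설입니다")  # "This is profanity"

[Fixed]

    if result >= 0.5:
        print("욕설입니다")  # "This is profanity"
    else:
        print("욕설이 아닙니다")  # "This is not profanity"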
