This project forked from m1-llie/tumcc

A corpus for identifying Chinese dark jargon in Telegram underground markets: the Telegram Underground Market Chinese Corpus. Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP&M, 2022).

tumcc's Introduction

TUMCC (Telegram Underground Market Chinese Corpus)

TUMCC is the first Chinese corpus in the field of jargon identification.

In building TUMCC, we collected a total of 28,749 sentences (804,971 characters) from 19,821 Telegram users across 12 Telegram groups.

We finished data screening and word segmentation before releasing this corpus, so it should be convenient to use as is.

After cleaning, TUMCC contains 3,863 sentences (a total of 100,000 characters) from 3,139 Telegram users.

Files

TUMCC-clean.txt contains the corpus after our cleaning. You can use it directly in your research.

TUMCC-raw.7z contains the raw data we collected from Telegram. You can clean the text yourself to obtain more valid data and useful information.
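
As a starting point for working with the cleaned file, here is a minimal Python sketch. It assumes TUMCC-clean.txt is UTF-8 text with one sentence per line and segmented words separated by whitespace; that layout is a guess, not something this README specifies, so check the file before relying on it. The function names `load_sentences` and `corpus_stats` are hypothetical helpers, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def load_sentences(path, encoding="utf-8"):
    """Read a segmented corpus into a list of token lists.

    ASSUMPTION: one sentence per line, tokens separated by whitespace.
    This is a guess about TUMCC-clean.txt, not a documented format.
    """
    sentences = []
    with Path(path).open(encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


def corpus_stats(sentences):
    """Return (sentence count, character count without spaces, token frequencies)."""
    freq = Counter(token for sent in sentences for token in sent)
    n_chars = sum(len(token) for sent in sentences for token in sent)
    return len(sentences), n_chars, freq


if __name__ == "__main__":
    sents = load_sentences("TUMCC-clean.txt")
    n_sents, n_chars, freq = corpus_stats(sents)
    print(n_sents, n_chars)
    print(freq.most_common(10))
```

If the counts printed differ noticeably from the figures reported above, the file most likely uses a different layout (for example extra metadata columns), and the splitting logic would need adjusting.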

For more details about the Telegram groups used as data sources, please refer to the paper Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (Information Processing and Management, 2022).
