TUMCC (Telegram Underground Market Chinese Corpus)

TUMCC is the first Chinese corpus for jargon identification.

While building TUMCC, we collected 28,749 sentences (804,971 characters) from 19,821 Telegram users across 12 Telegram groups.

We completed data screening and word segmentation before releasing the corpus, so it should be convenient to use as is.

After cleaning, TUMCC contains 3,863 sentences (a total of 100,000 characters) from 3,139 Telegram users.

Files

TUMCC-clean.txt contains the corpus after our cleaning. You can use it directly in your research.

TUMCC-raw.7z contains the raw information we collected from Telegram. You can clean the text yourself to recover more valid data.
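As a starting point for working with TUMCC-clean.txt, the following is a minimal sketch of loading a segmented corpus and computing basic statistics. It assumes the file is UTF-8 encoded, holds one sentence per line, and separates segmented words with whitespace; none of this is confirmed here, so check the actual file layout first. The function names `load_sentences` and `corpus_stats` are hypothetical helpers, not part of the dataset.

```python
from collections import Counter
from pathlib import Path


def load_sentences(path, encoding="utf-8"):
    """Read a segmented corpus file into a list of token lists.

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = []
    with Path(path).open(encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


def corpus_stats(sentences):
    """Return (sentence count, character count, token frequencies).

    The character count excludes the whitespace between tokens.
    """
    char_count = sum(len(token) for tokens in sentences for token in tokens)
    frequencies = Counter(token for tokens in sentences for token in tokens)
    return len(sentences), char_count, frequencies
```

Calling `corpus_stats(load_sentences("TUMCC-clean.txt"))` would then give counts you can compare against the figures quoted above, and `frequencies.most_common(20)` lists the most frequent tokens. If the file uses a different layout, only `load_sentences` needs to change.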

For more details about the Telegram groups targeted for data extraction, please refer to the paper "Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features" (Information Processing and Management, 2022).

All Rights Reserved. Please cite us if you use the dataset for research purposes.

Contributors

m1-llie
