Clean-Discord

clean-discord is a fast, efficient, and robust script for cleaning large quantities of messages from Discord data generated by DiscordChatExporter. It processes roughly 300k messages per 50 seconds on average (including detoxifying) while consuming only about 1 GB of memory.

Usage

With DiscordChatExporter:

This script uses data from DiscordChatExporter with a few important alterations (details in their wiki):

  • Set the timestamp format to yyyy-mm-dd
  • (Recommended but optional) Split the files into partitions of N messages.

You can alter and copy this command for ease of use with the CLI version:

dotnet DiscordChatExporter.Cli.dll export \
  -t [token] \
  -f json -p 300000 \
  --dateformat yyyy-mm-dd \
  -o [output dir] \
  -c [channel id to export]

With custom data:

You may also run this script on custom data, provided it is formatted properly. Your input JSON files should be formatted as follows:

{
  "messages": [
    {
      "id": "12345678910111213",
      "type": "Default",
      "timestamp": "YYYY-MM-DDTHH:MM:SS+00:00",
      "content": "this is where the text goes",
      "author": {
        "id": "31211101987654321",
        "name": "Jake",
        "isBot": false
      }
    }
  ]
}
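As a rough illustration of the layout above, the following sketch builds one message and checks a loaded file for the expected keys. The helper names (make_message, check_export) are hypothetical and not part of this repo, and treating every key shown above as required is an assumption; the script itself may tolerate missing fields.

```python
import json

# Keys taken from the example layout above; treating all of them as
# required is an assumption, not something the script documents.
REQUIRED_MESSAGE_KEYS = {"id", "type", "timestamp", "content", "author"}
REQUIRED_AUTHOR_KEYS = {"id", "name", "isBot"}


def make_message(msg_id, timestamp, content, author_id, author_name, is_bot=False):
    """Build one message dict in the layout shown above (hypothetical helper)."""
    return {
        "id": str(msg_id),
        "type": "Default",
        "timestamp": timestamp,
        "content": content,
        "author": {"id": str(author_id), "name": author_name, "isBot": is_bot},
    }


def check_export(data):
    """Return a list of problems; an empty list means the layout matches."""
    messages = data.get("messages")
    if not isinstance(messages, list):
        return ["top-level 'messages' must be a list"]
    problems = []
    for i, msg in enumerate(messages):
        missing = REQUIRED_MESSAGE_KEYS - set(msg)
        if missing:
            problems.append(f"message {i}: missing {sorted(missing)}")
            continue
        missing_author = REQUIRED_AUTHOR_KEYS - set(msg["author"])
        if missing_author:
            problems.append(f"message {i}: author missing {sorted(missing_author)}")
    return problems


export = {
    "messages": [
        make_message(
            12345678910111213,
            "2021-01-01T00:00:00+00:00",
            "this is where the text goes",
            31211101987654321,
            "Jake",
        )
    ]
}
# json.dumps never emits trailing commas, so the result is valid JSON.
text = json.dumps(export, indent=2)
```

Writing text to a .json file in the input directory should then give a file in the same shape as the example.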

Benchmarking the script

Regex, parsing, and classification performance can vary from device to device. See the README in src for results measured on a 3.0 GHz CPU (the test is single-threaded).

If you would like to run the benchmark yourself, run the following command from the base directory:

python3 src/workers.py

Cleaning files

All cleaning functionality of this repo can be accessed with clean.py.

My very large Discord dataset, with over 300 million messages, was cleaned in approximately 6 hours on a 4-core machine with 4 GB of memory using the following command:

python3 clean.py -detox -workers 8 -dir ../data -out ../cleaned

Creating a compressed dataset and splits

Using the dataset as-is is completely feasible, but it is recommended to create proper splits and also generate all possible turns in a conversation. Take this conversation for example:

Hi    How are you?    Im doing well    This is a conversation?    Yes.    Huh    Its also a test

You can make the most out of the turns in this conversation by creating all windows of this conversation (input turns are separated by /b):

Hi    How are you?
Hi/bHow are you?    Im doing well
Hi/bHow are you?/bIm doing well    This is a conversation?
Hi/bHow are you?/bIm doing well/bThis is a conversation?    Yes.
Hi/bHow are you?/bIm doing well/bThis is a conversation?/bYes.    Huh
Hi/bHow are you?/bIm doing well/bThis is a conversation?/bYes./bHuh    Its also a test
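The windowing idea can be sketched in a few lines: each reply is paired with the turns that precede it. This is a minimal illustration, not the code in split.py; the function name context_windows, the optional max_turns cap, and the tab between context and reply are assumptions.

```python
def context_windows(turns, max_turns=None, sep="/b"):
    """Yield (context, reply) pairs, pairing each reply with the turns before it.

    max_turns optionally caps how many preceding turns go into the context.
    """
    for i in range(1, len(turns)):
        start = 0 if max_turns is None else max(0, i - max_turns)
        yield sep.join(turns[start:i]), turns[i]


conversation = [
    "Hi",
    "How are you?",
    "Im doing well",
    "This is a conversation?",
    "Yes.",
    "Huh",
    "Its also a test",
]

# One training line per window; the tab separator is an assumption.
lines = [f"{context}\t{reply}" for context, reply in context_windows(conversation)]
```

A conversation of n turns yields n - 1 windows, which is why the expanded dataset grows so quickly.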

You can also cap the number of context turns in each window. Because expanding the examples this way increases the size of the dataset dramatically, compressing the dataset is recommended. TensorFlow Datasets supports compressed files for easy streaming and shuffling during training.
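For a sense of what compressing the expanded dataset looks like, here is a small sketch using Python's standard gzip module. The function names and the file name are hypothetical; split.py may write its output differently.

```python
import gzip
import os
import tempfile


def write_compressed(lines, path, compression_level=9):
    """Write one example per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8", compresslevel=compression_level) as handle:
        for line in lines:
            handle.write(line + "\n")


def stream_lines(path):
    """Read the file back lazily, one line at a time, without unpacking it on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


examples = ["Hi\tHow are you?", "Hi/bHow are you?\tIm doing well"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo-train.txt.gz")  # hypothetical file name
write_compressed(examples, demo_path)
restored = list(stream_lines(demo_path))
```

On the TensorFlow side, a file like this can usually be streamed with tf.data.TextLineDataset by passing compression_type="GZIP".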

To generate splits, run:

python3 split.py -compression_level 9 -workers 8 -dir ../data -out ../context

Note that the -out parameter is a prefix: -train.txt/-train.txt.gz and -val.txt/-val.txt.gz will be appended to it for the corresponding split. The script creates a directory named temp containing the partitioned files, individually named in the format [SPLIT]-ID.txt/[SPLIT]-ID.txt.gz. These are merged automatically, or can be merged later by running:

python3 split.py -step merge
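One way a merge step like this can work is plain byte concatenation, since a sequence of gzip members is itself a valid gzip file. The sketch below assumes the [SPLIT]-ID.txt.gz naming described above; merge_partitions is a hypothetical helper, and whether split.py merges this way is an assumption.

```python
import glob
import gzip
import os
import shutil


def merge_partitions(temp_dir, split, out_path):
    """Concatenate temp_dir/<split>-*.txt.gz into out_path and return the parts used.

    Parts are taken in lexicographic order, so "train-10" sorts before "train-2";
    zero-pad the IDs if the order of examples matters.
    """
    parts = sorted(glob.glob(os.path.join(temp_dir, f"{split}-*.txt.gz")))
    with open(out_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as handle:
                # Raw copy: no decompression or recompression is needed.
                shutil.copyfileobj(handle, merged)
    return parts
```

Reading the merged file with gzip.open returns the lines of every partition back to back.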

How is it done?

The cleaning process handles the following issues commonly found in Discord chat logs. Please open a pull request or an issue if you can think of any other cases this script should cover!

  • Filtering common prefixes for popular bots on discord
  • Filtering common system messages, such as pins, server joins, etc.
  • Translating "special" unicode-based characters into the english alphabet (text like 𝔾𝕣𝕒𝕟𝕕𝕞𝕒'𝕀 🅂🄲🄰🅁🅈 ɖօɒ to Grandma's SCARY DOG (a real username btw))
  • Converting excessive spaces and unicode spaces to traditional spaces (text like hi , you! to hi, you)
  • Replacing users who left the server(s) without being properly cached (they show up as Deleted User) with a random name that is attached to their id (names like @Deleted User to @Jake)
  • Fixing excessive punctuation or spelling with certain limits (to keep ellipsis, for example) (text like REEEEEEEEEEEEEEE.......... to REEE...)
  • Filtering non-ascii characters and commonly used characters for ascii/unicode art (but keeping enough to make the vast majority of messages look ok) (text like 🖊️i like to write <🖊️> to i like to write)
  • Converting supported emojis to their shorthand form (😂 to :joy:)
  • Removing multiline code blocks while only removing the ticks around the single line code blocks (removing text like ```text```)
  • Replacing newlines with an escaped newline (hi\nhow are you? to hi\\nhow are you?)
  • Merging multi-message, single-author continuous messages into a single message merged with escaped newlines
  • Removing URLs (like https://jadeai.ml)
  • Removing emails (like [email protected])
  • Removing phone numbers (like +1 (123) 456-7890)
  • Removing custom emojis (like :pogchamp:)
  • Removing toxic messages (like f*** you) *note: example is censored
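A few of the steps listed above can be approximated with regular expressions. The sketch below is an illustration only: the patterns, their order, and the names clean_text and merge_consecutive are assumptions, and the real implementation in src covers many more cases.

```python
import re
from itertools import groupby

CODE_BLOCK_RE = re.compile(r"```.*?```", re.DOTALL)  # triple-tick blocks, possibly multiline
INLINE_CODE_RE = re.compile(r"`([^`\n]+)`")          # single-line code: keep the text, drop the ticks
URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"([^\d\s])\1{3,}")           # 4+ repeats of a non-digit character
SPACE_RE = re.compile(r"[^\S\n]+")                   # any whitespace (incl. unicode spaces) except newline
SPACE_BEFORE_PUNCT_RE = re.compile(r" +([,.!?])")


def clean_text(text):
    """Apply a handful of the cleaning steps to one message."""
    text = CODE_BLOCK_RE.sub("", text)
    text = INLINE_CODE_RE.sub(r"\1", text)
    text = URL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1\1", text)      # cap runs at three so ellipses survive
    text = SPACE_RE.sub(" ", text).strip()
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    return text.replace("\n", "\\n")           # escape newlines last


def merge_consecutive(messages):
    """Merge runs of (author_id, content) pairs from the same author.

    The merged contents are joined with an escaped newline.
    """
    merged = []
    for author, group in groupby(messages, key=lambda pair: pair[0]):
        merged.append((author, "\\n".join(content for _, content in group)))
    return merged
```

For example, clean_text("REEEEEEEEEEEEEEE..........") gives REEE..., and two back-to-back messages from one author come out of merge_consecutive as a single entry.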

Doing this for millions of messages is extremely difficult, so this repo employs a lot of optimizations. If you're a developer and would like to benchmark the scripts, run src/workers.py, which contains benchmarking tools. See the README in src for my benchmark results.

Contributors

jef1056, codemicro, malamasn
