discard's People

Contributors

jesseweinstein, sanqui

discard's Issues

zstd compression

<Sanqui> I've been compressing jsonl with gzip
<Sanqui> should I look into zstd?
<b> I'd say you should
<b> Yeah, zstd is bloody magic
<b> I got like 3-4x better compression out of zstd
<c> Zstd is magic we use it at work for VM backups/snapshots

File partitioning

Once a log file goes over a threshold, say 100 MB, I want to start partitioning it.
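A minimal sketch of one way partitioning could work, assuming the log is written as JSON lines and that numbered files are acceptable. The class name `PartitionedWriter`, the `stem` prefix, and the file naming scheme are hypothetical and not part of Discard; only the 100 MB figure comes from the text above.

```python
import json
import os


class PartitionedWriter:
    """Write JSON lines to numbered files, starting a new file near a size limit."""

    def __init__(self, directory, stem="log", max_bytes=100 * 1000 * 1000):
        self.directory = directory
        self.stem = stem            # hypothetical file name prefix
        self.max_bytes = max_bytes  # defaults to the 100 MB example threshold
        self.index = 0
        self.written = 0
        self.file = None

    def _path(self, index):
        return os.path.join(self.directory, f"{self.stem}.{index:04d}.jsonl")

    def _open_next(self):
        # Close the current partition (if any) and open the next numbered one.
        if self.file is not None:
            self.file.close()
            self.index += 1
        self.file = open(self._path(self.index), "wb")
        self.written = 0

    def write(self, record):
        line = (json.dumps(record) + "\n").encode("utf-8")
        # Rotate before writing so a partition only exceeds the limit
        # when a single record is itself larger than the limit.
        too_big = self.written > 0 and self.written + len(line) > self.max_bytes
        if self.file is None or too_big:
            self._open_next()
        self.file.write(line)
        self.written += len(line)

    def close(self):
        if self.file is not None:
            self.file.close()
            self.file = None
```

Rotating on record boundaries keeps every partition a valid JSON lines file on its own, so each one can be compressed or read independently.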

Archiving avatars and files

Discard currently does not support downloading files. Nonetheless, it is clear that a backup of a server is not complete without fetching the attachments. Still, given the possible scope of archival operations, I'm concerned about download time and disk space, and wouldn't want to enable file downloads by default.

Now, here are some thoughts on how I want to implement downloading files (including avatars, custom emoji, etc.).

First, it's important to note that all Discord files are served from a CDN that does not perform any authorization. I can just get the link to an attachment from any server and post it somewhere else and anybody can download it. This means that we don't actually need a specialized tool like Discard to fetch them. It also means they can benefit from being included in the Wayback Machine.
My first step towards supporting files will be similar to the reader: parse a completed run and output a list of URLs for files to download by another tool (e.g. wget or ArchiveBot).
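As a rough sketch of that first step, the following walks every JSON line of a run and collects strings that are URLs on a CDN host. It assumes the run is stored as JSON lines and that file links appear as full URL strings. The host names in `CDN_HOSTS` and the function names are assumptions for illustration, not Discard's actual output format or API.

```python
import json
from urllib.parse import urlparse

# Assumption: files are served from these hosts; adjust to match real data.
CDN_HOSTS = {"cdn.discordapp.com", "media.discordapp.net"}


def iter_strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def is_cdn_url(text):
    """Return True if the whole string parses as a URL on a known CDN host."""
    try:
        return urlparse(text).hostname in CDN_HOSTS
    except ValueError:
        # urlparse rejects some malformed inputs; treat them as non-URLs.
        return False


def extract_file_urls(lines):
    """Return unique CDN URLs from an iterable of JSON lines, in first-seen order."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for text in iter_strings(json.loads(line)):
            if is_cdn_url(text):
                seen.setdefault(text, None)
    return list(seen)
```

Writing the result one URL per line gives a file that can be passed to `wget -i`. Links embedded in the middle of message text are not picked up by this sketch, and avatars or emoji that are stored as IDs or hashes rather than full URLs would need separate handling.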

My main question for potential users of Discard is whether a list of file URLs is enough for you, or whether you would prefer Discard to download files as part of a run. If the latter, would you want just the files, or WARCs?

Your input is appreciated!

Consider WARC support

Arguments for:

  • Standard for HTTP content
  • Existing ecosystem
  • Indexing support

Arguments against:

  • No standardized WebSocket support yet. Most data we're getting is over HTTP, but some critical data is over WS and I expect Discord to use WS more in the future. In particular, realtime will rely on it heavily.
  • Discord API is already entirely JSON
  • WARC is primarily intended for websites, and this is not a website
  • The content we're accessing is authentication-walled, so headers would have to be scrubbed everywhere
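To illustrate the last point, here is a minimal sketch of header scrubbing applied to a request's headers before they are recorded. The set of sensitive header names and the placeholder text are assumptions; a real implementation would need to cover whatever credentials the requests actually carry.

```python
# Assumption: these are the headers that can carry credentials.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def scrub_headers(headers, placeholder="[REDACTED]"):
    """Return a copy of headers with sensitive values replaced by a placeholder."""
    return {
        name: (placeholder if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

Header names are compared case-insensitively because HTTP header names are not case-sensitive. The header is kept with a placeholder rather than dropped, so the record still shows that the request was authenticated.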

Requirements

Functional requirements

  • Archival of selected Discord servers through command-line interface
  • Operation using a bot account or a user account
  • Output to JSON files
  • Basic summary in output (i.e. number of messages and users recorded)
  • Halting on error

Non-functional requirements

  • Streaming output (not buffered)
  • Optional gzip archival of request .json files
  • Resilience to network errors
  • Scalability to multiple accounts
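The streaming and gzip requirements could be combined roughly as follows. This is a sketch under the assumption that records are written as JSON lines; the function names `open_output` and `write_record` are hypothetical.

```python
import gzip
import json


def open_output(path, compress=False):
    """Open a text file for appending, optionally gzip-compressed."""
    if compress:
        return gzip.open(path, "at", encoding="utf-8")
    return open(path, "a", encoding="utf-8")


def write_record(out, record):
    """Write one record as a JSON line and flush it instead of leaving it buffered."""
    out.write(json.dumps(record, ensure_ascii=False) + "\n")
    # flush() hands the data to the operating system; it does not force it to disk.
    out.flush()
```

Flushing after every record means an interrupted run loses at most the record being written, at the cost of a somewhat worse compression ratio when gzip is enabled.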

Archiving without an account

Technically, Discord doesn't require a user account to join a Discord server and read history. With a public invite and permissive moderation levels on the server, you can join with just a nickname; no password is needed. While spamming open servers with daily puppet accounts is clearly not advisable, archiving through these "light" accounts might reduce load on verified user accounts, which could then be reserved for more restricted servers.
