gazelle's Introduction

🧪 BioGazelle

This software is twice removed from the original What.cd Gazelle. It's based on the security hardened PHP7 fork Oppaitime Gazelle. It shares several features with Orpheus Gazelle and incorporates certain innovations by AnimeBytes. The goal is to organize a functional database with pleasant interfaces, and render insightful views using data from robust external sources.

Changelog: Bio ← OT

Please find a running list of major software improvements below. This list is by no means exhaustive; it's a best hits compilation. The points are presented in no particular order.

Built to scale, micro or macro

BioGazelle is pretty fast out of the box, on a single budget VPS. If you want to scale horizontally, the software supports both Redis clusters and database server replication. Please note that Redis clusters expect at least three nodes. This lower limit is inherent to Redis' cluster implementation.

Universal database IDs

BioGazelle is in the process of migrating to UUID v7 primary keys to enable useful content-agnostic operations such as tagging and AI integration. This will consolidate the database and allow for powerful cross-object association. The UUIDs are stored as binary strings for index speed and to minimize disk usage. By the way, all binary data is transparently converted by the database wrapper.
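
A minimal sketch of the storage format described above, written in Python rather than the project's PHP so it stays self-contained. The helper names `uuid7_bytes` and `as_text` are hypothetical and this is not BioGazelle's actual wrapper. It only shows how a UUID v7 packs a millisecond timestamp, a version nibble, and random bits into the 16 bytes a binary column would hold.

```python
import os
import time
import uuid


def uuid7_bytes() -> bytes:
    """Build a UUID v7 as the 16 raw bytes a binary(16) column would store."""
    unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used

    value = (unix_ms & ((1 << 48) - 1)) << 80  # 48-bit millisecond timestamp
    value |= 0x7 << 76                          # version nibble = 7
    value |= ((rand >> 62) & 0xFFF) << 64       # rand_a, 12 bits
    value |= 0b10 << 62                         # RFC variant bits
    value |= rand & ((1 << 62) - 1)             # rand_b, 62 bits
    return value.to_bytes(16, "big")


def as_text(raw: bytes) -> str:
    """Convert the stored bytes back to the familiar hyphenated form."""
    return str(uuid.UUID(bytes=raw))


raw = uuid7_bytes()
print(len(raw), as_text(raw))
```

Because the timestamp occupies the leading bytes, keys created later sort after earlier ones at millisecond resolution, which is what keeps the index compact.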

Full stack search engine rewrite

Data indexing is important, so BioGazelle has upgraded to Manticore Search, the successor to Sphinx. This upgrade also involved a rewrite of the search configuration from scratch, based on AnimeBytes' example. The Gazelle frontend itself uses a rewritten browse.php controller and a brand new Twig template. Oh yeah, the PHP backend class is also completely rewritten, replacing at least four legacy classes.

Secure authentication system

The user handling, including registration, logins, etc., has been rewritten into a unified system in the Auth class. The system acts as an oracle that takes inputs and returns messages. Passphrase hashing is all done with PASSWORD_DEFAULT, ready for Argon2id.

I tested this extensively and determined that prehashing passphrases was no good. Not only is it impossible to upgrade the algorithm later, e.g., from sha256 to sha3-512, but prehashing also lowers the total entropy of long strings, even if binary is used throughout. Test it yourself with 72 bytes of random binary data (the bcrypt max) and an entropy calculator.
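
The entropy point can be checked with a simple counting argument. The sketch below is not the author's test, only an upper-bound calculation in Python: a uniformly random input can carry at most eight bits per byte, and a digest can never carry more than its own length. The helper name `max_entropy_bits` is made up for illustration.

```python
import hashlib
import os

BCRYPT_MAX_BYTES = 72  # bcrypt ignores input beyond this length


def max_entropy_bits(data: bytes) -> int:
    """Upper bound on entropy: eight bits per byte of uniformly random input."""
    return len(data) * 8


passphrase = os.urandom(BCRYPT_MAX_BYTES)
direct = max_entropy_bits(passphrase)
via_sha256 = max_entropy_bits(hashlib.sha256(passphrase).digest())
via_sha3_512 = max_entropy_bits(hashlib.sha3_512(passphrase).digest())

print(direct, via_sha256, via_sha3_512)  # 576 256 512
```

Even the widest digest here falls short of what the full 72 bytes could hold, so a prehash can only narrow the input.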

BioGazelle enforces a 15-character minimum passphrase length and imposes no other limitations. This is consistent with the list of OWASP best practices. In fact, the whole class is informed by this document.

Bearer token authorization

Read the API documentation. API tokens can be generated in the user security settings and used with the JSON API. Internal API calls for Ajax and such use a special token that can safely be exposed to the frontend. It's based on hashing a rotating server secret concatenated with a secure session cookie.
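
A rough sketch of the internal token idea, in Python for brevity. The hash function (SHA3-512), the function names, and the constant-time comparison are all assumptions made for illustration; the source only says that a rotating server secret is concatenated with the session cookie and hashed.

```python
import hashlib
import hmac
import secrets


def rotate_secret() -> bytes:
    """Hypothetical rotation step: replace the server secret with fresh bytes."""
    return secrets.token_bytes(32)


def internal_token(server_secret: bytes, session_cookie: str) -> str:
    """Hash the server secret concatenated with the session cookie value."""
    material = server_secret + session_cookie.encode("utf-8")
    return hashlib.sha3_512(material).hexdigest()


def token_is_valid(candidate: str, server_secret: bytes, session_cookie: str) -> bool:
    """Recompute the token and compare in constant time."""
    expected = internal_token(server_secret, session_cookie)
    return hmac.compare_digest(candidate, expected)


secret = rotate_secret()
token = internal_token(secret, "example-session-cookie")
print(token_is_valid(token, secret, "example-session-cookie"))           # True
print(token_is_valid(token, rotate_secret(), "example-session-cookie"))  # False
```

Since the token is derived from the secret rather than containing it, exposing the token to the frontend does not reveal the secret, and rotating the secret invalidates old tokens.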

The session cookies themselves are tight, btw. No JavaScript access, scoped to the same site, long length, etc. This kind of stuff is in the low level Http class.
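
For readers unfamiliar with those cookie attributes, here is a small Python sketch using the standard library. It is not the project's Http class. The cookie name, the token length, and the choice of SameSite=Strict over Lax are assumptions.

```python
import secrets
from http.cookies import SimpleCookie


def build_session_cookie(name: str = "session") -> str:
    """Return the value of a Set-Cookie header for a hardened session cookie."""
    cookie = SimpleCookie()
    cookie[name] = secrets.token_urlsafe(64)  # long random identifier
    cookie[name]["httponly"] = True           # no JavaScript access
    cookie[name]["secure"] = True             # HTTPS only
    cookie[name]["samesite"] = "Strict"       # scoped to the same site
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()


print(build_session_cookie())
```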

WebAuthn security tokens

BioGazelle has always supported hardware keys thanks to Oppaitime. But we took it up a notch by upgrading this system to use the modern WebAuthn standard instead of the deprecated FIDO U2F standard. This specification is well supported in all major browsers, and it doesn't require a $50 dongle: use a hardware key, a smartphone fingerprint or QR code reader, or just generate a key in the browser. The underlying library is the canonical web-auth/webauthn-lib.

OpenAI integration

One of BioGazelle's goals is to place data in context using OpenAI's completions API to generate tl;dr summaries and tags from content descriptions. Just paste your abstract into the torrent group description and get a succinct natural language summary with tags. It's possible to disable AI content display in the user settings.

Twig template system

BioGazelle's Twig interface takes cues from OPS's extended filters and functions. Twig provides a security benefit by escaping rendered output, and a secondary benefit of clarifying the PHP running the site sections. Everything you could need is a globally available template variable.

A quick note about template inheritance. Everything extends a clean HTML5 base template. Torrent, collections, requests, etc., and their respective sidebars are implemented as semantic HTML5 in easily digestible chunks of content. No more mixed PHP code and HTML markup!

Markdown and BBcode support

BioGazelle uses the SimpleMDE markdown editor with a reasonably extended custom editor interface. All the Markdown Extra features supported by Parsedown Extra are documented and the useful ones are exposed in the editor. The default recursive regex BBcode parser (yuck) is replaced by Vanilla NBBC. Parsed texts are cached for speed, using both Redis and the Twig disk cache.

Good typography

BioGazelle supports an array of unobtrusive fonts with the appropriate glyphs for bold, italic, and monospace. These options are available to every theme. Font Awesome 5 is also universally available, as is the entire Material Design color palette. Download the fonts to get started. Also, there are two simple color modes, calm mode and dark mode, that I like to think are pleasing to the eye.

Active data minimization

BioGazelle has real lawyer-vetted policies. In the process of matching the tech to the legal word, I dropped support for a number of compromising features:

  • Bitcoin, PayPal, and currency exchange API and system calls;
  • Bitcoin addresses, user donation history, and similar metadata; and
  • IP address and geolocation, email address, passphrase, and passkey history.

Besides that, BioGazelle has several passive developments in progress:

  • prepare all queries with parameterized statements;
  • declare strict mode at the top of every PHP and JS file;
  • check strict equality and strong typing, including function arguments;
  • run all files through generic formatters such as PHP-CS-Fixer; and
  • move all external libraries to uncomplicated package management.
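
The first item in the list above, parameterized statements, can be illustrated with a short Python sketch. It uses an in-memory SQLite database as a stand-in for the project's PDO wrapper. The table layout and the `find_creators` helper are made up for the example.

```python
import sqlite3


def find_creators(conn: sqlite3.Connection, name: str) -> list:
    # The ? placeholder sends the value separately from the SQL text,
    # so quotes inside `name` are never parsed as SQL.
    return conn.execute(
        "select id, name from creators where name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table creators (id integer primary key, name text)")
conn.execute("insert into creators (name) values (?)", ("example creator",))

print(find_creators(conn, "example creator"))  # [(1, 'example creator')]
print(find_creators(conn, "x' or '1'='1"))     # []
```

The second call shows a classic injection string being treated as an ordinary value that matches nothing.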

Proper application layout

BioGazelle takes cues from the best-of-breed PHP framework Laravel. The source code is reorganized along Laravel's lines while maintaining the comfy familiarity of OT/WCD Gazelle. The app logic, config, and Git repo lie outside the web root for enhanced security.

BioGazelle uses the Flight router to define app routes. Features include clean URIs and centralized middleware. An ongoing project involves modernizing the app based on Laravel's excellent tools, with help from other personally-vetted libraries that may be lighter.

App singleton

The main site configuration uses extensible ArrayObjects managed by the special ENV class. Also, the whole app is always instantly available: the config, database, cache, current user, Twig engine, etc., are accessible with a simple call to Gazelle\App::go(). All such objects use the same quick and easy go → factory → thing API, just in case you need to extend some core object without headaches.
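
The go → factory → thing pattern might look roughly like this. It is a Python sketch under assumptions, not the actual Gazelle\App code: the `Singleton` base class, the `_factory` hook, and the `env` option are hypothetical names.

```python
class Singleton:
    """go() returns the one shared instance, building it via _factory() on first use."""

    _instances = {}

    @classmethod
    def go(cls, **options):
        if cls not in Singleton._instances:
            Singleton._instances[cls] = cls._factory(**options)
        return Singleton._instances[cls]

    @classmethod
    def _factory(cls, **options):
        # Subclasses can override this to customize construction.
        return cls(**options)


class App(Singleton):
    def __init__(self, env: str = "production"):
        self.env = env
        self.config = {"env": env}


class DebugApp(App):
    """Extending a core object only needs a new factory."""

    @classmethod
    def _factory(cls, **options):
        return cls(env="development")


print(App.go() is App.go())  # True
print(DebugApp.go().env)     # development
```

Keying the instance table by class means a subclass gets its own instance instead of inheriting the parent's.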

Decent debugging

BioGazelle seeks to be easy and fun to develop. I collected the old debug class monstrosity into a nice little bar. There's also no more DEBUG_MODE or random permissions. There's just a development mode that spits everything out, and a production mode that doesn't.

The entire app is also available on the command line for cron jobs, development, and fun. Good for BioGazelle, good for America! Just run php shell from the repository root to get up and running. This is based on Laravel Tinker and in fact uses the same REPL under the hood.

Minor changes

  • database crypto bumped up to AES-256
  • good subresource integrity support
  • configurable HTTP status code errors
  • integrated diceware passphrase generator
  • semantic HTML5 templates and layouts (WIP)
  • dead simple PDO database wrapper, fully parameterized
  • polite copy; the site says "please" and "thank you"
  • the codebase runs on PHP8 with minimal warnings
  • rewritten database queries are usually simpler than the originals
  • no need to think about cache collisions across environments
  • a small number of Eloquent models for core schema objects
  • authenticated email over STARTTLS with external server support
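
As a brief illustration of the subresource integrity item in the list above, the sketch below computes an integrity value in Python. The helper names are hypothetical, and SHA-384 is chosen here as a commonly used digest, not because the source specifies it.

```python
import base64
import hashlib


def sri_hash(content: bytes) -> str:
    """Return an integrity value of the form sha384-<base64 digest>."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def script_tag(src: str, content: bytes) -> str:
    """Render a script tag whose integrity attribute pins the file contents."""
    return (
        f'<script src="{src}" integrity="{sri_hash(content)}" '
        'crossorigin="anonymous"></script>'
    )


print(script_tag("/js/app.js", b"console.log('hello');"))
```

A browser that supports the attribute refuses to run the file if its contents no longer match the digest.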

Features inherited from Oppaitime

Gracie Gazelle

Gracie is a veteran pirate of the digital ocean. On land, predators form companies to hunt down prey. But in the lawless water, the prey attacks the predators' transports. Gracie steals resources from the rich and shares them with the poor and isolated people. Her great eyesight sees through the darkest corners of the internet for her next target. Her charisma attracts countless salty goats to join her fleet. She proudly puts the forbidden share symbols on her hat and belt, and is now one of the most wanted women in the world.

Tyson Tan

Character design and bio by Tyson Tan, who offers mascot design services for free and open source software, free of charge, under a free license. Download the high resolution version.

tysontan.com / [email protected] / @TysonTanX

gazelle's People

Contributors

asutekku, battleprogrammershirase, dependabot[bot], olet-toporkov, pjc09h, thisis-myname, tricidious, xoru

gazelle's Issues

Buy and test a U2F hardware key

I recently received a YubiKey 5C NFC essentially for free, so I can now develop and test the FIDO2 authentication feature. It will be implemented using the native browser WebAuthn specification, of course, and not rely on the couple of deprecated libraries that OT Gazelle used.

Refactor the JavaScript: IIFEs and event listeners

Currently, a lot of the site JavaScript uses raw functions dumped into a file. This causes problems with Google Closure Compiler, which aggressively rewrites function names in its advanced optimization mode. This prevents me from using anything more than simple optimizations. The solution is to encapsulate all JavaScript in self-executing arrow functions (() => { /* ... */ })(); whose contents listen for certain events, such as clicking a widget.

Move all social features to a Discourse backend

Replace huge swaths of garbage homebrew code for nonessential features with a Discourse API backend running in a Docker container. There are several steps to this migration:

  • set up the Discourse Connect SSO with automatic forum login
  • finish mocking up the forums, wiki, torrent comments, news/blog, user profile, and private message interfaces
  • support the full CRUD operations of the relevant Discourse features
  • lock all social stuff behind an authentication challenge ("you must be logged in to view the forums")
  • migrate the existing data to Discourse
  • proxy the Discourse API through the BioTorrents.de one

This will free up a lot of time to focus on the actual torrent features. There are some longstanding forum bugs, as well as huge SQLi potential, that I don't want to fix (e.g., locking and moving threads has never worked right).

This should be reasonably feature complete compared to what currently exists, except with a better frontend and overall cleaner backend logic.

Use models for at least some core objects

A set of basic models that extend a subset of Laravel Eloquent's features, e.g., find(), save(), delete() (soft), etc., would go a huge way toward cleaning up the code. Each major artifact such as a torrent, group, collection, request, etc., should be its own "thing" that can be loaded, displayed, and manipulated. Also has implications for API CRUD support: load the model, change some stuff, and save it.
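
To make the idea concrete, here is a small in-memory sketch in Python. It is not Eloquent and not the project's code. The `Torrent` class, its fields, and the `restore()` and `with_deleted` names are assumptions chosen to show find(), save(), and a soft delete().

```python
from datetime import datetime, timezone


class Torrent:
    """Hypothetical in-memory model; a real one would sit on the database wrapper."""

    _rows = {}    # stands in for a database table
    _next_id = 1

    def __init__(self, name):
        self.id = None
        self.name = name
        self.deleted_at = None

    @classmethod
    def find(cls, id, with_deleted=False):
        row = cls._rows.get(id)
        if row is None or (row.deleted_at is not None and not with_deleted):
            return None
        return row

    def save(self):
        if self.id is None:
            self.id = Torrent._next_id
            Torrent._next_id += 1
        Torrent._rows[self.id] = self
        return self

    def delete(self):
        # Soft delete: mark the row instead of removing it.
        self.deleted_at = datetime.now(timezone.utc)
        return self.save()

    def restore(self):
        self.deleted_at = None
        return self.save()


torrent = Torrent("example torrent").save()
torrent.delete()
print(Torrent.find(torrent.id))                     # None
print(Torrent.find(torrent.id, with_deleted=True))  # the soft-deleted object
```

An API update then reduces to loading the model, changing an attribute, and calling save().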

Full API CRUD support

Currently, the API only supports GET requests. It should use controllers for all the major objects of the site with simple methods like create(), read(), update(), and delete().

All integer primary keys should be a bigint and most database tables should have a UUID v7 unique key

Branching off the work in creatorObjects to position the database for scale. I've been meaning to implement some kind of basic sharding and replication since the beginning, which relies on not having key collisions. UUID v7 stored as binary(16) as a unique key, while maintaining the standard auto-increment id bigint columns, seems to be the way to go.

The database class is already set up to transparently handle UUID binary to string conversion so, e.g., select uuid, name from creators order by created desc limit 10 will return UUIDs in the form of 01877b4a-b27c-70db-9522-149e9a40ef59.
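
The conversion itself is small. The Python sketch below, which is not the project's database class, round-trips the example UUID above through its 16-byte form and reads back the millisecond timestamp that UUID v7 stores in its first six bytes. The function names are made up.

```python
import uuid
from datetime import datetime, timezone


def uuid_to_binary(text: str) -> bytes:
    """Hyphenated string to the 16 bytes stored in a binary(16) column."""
    return uuid.UUID(text).bytes


def binary_to_uuid(raw: bytes) -> str:
    """Stored bytes back to the hyphenated string returned to callers."""
    return str(uuid.UUID(bytes=raw))


def uuid7_created_at(raw: bytes) -> datetime:
    """UUID v7 keeps a Unix timestamp in milliseconds in its first 48 bits."""
    unix_ms = int.from_bytes(raw[:6], "big")
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


raw = uuid_to_binary("01877b4a-b27c-70db-9522-149e9a40ef59")
print(len(raw))                      # 16
print(binary_to_uuid(raw))           # 01877b4a-b27c-70db-9522-149e9a40ef59
print(uuid7_created_at(raw).date())  # 2023-04-13
```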

UUID documentation:
https://uuid.ramsey.dev/en/stable/rfc4122/version7.html
https://uuid.ramsey.dev/en/stable/database.html

Sharding documentation:
https://aws.amazon.com/what-is/database-sharding/
https://www.linode.com/docs/guides/sharded-database/

Misc documentation:
https://emmer.dev/blog/why-you-should-use-uuids-for-your-primary-keys/
https://itnext.io/laravel-the-mysterious-ordered-uuid-29e7500b4f8
https://stackoverflow.com/questions/52414414/best-practices-on-primary-key-auto-increment-and-uuid-in-sql-databases
https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
https://vladmihalcea.com/uuid-database-primary-key/
https://www.mysqltutorial.org/mysql-uuid/
https://www.percona.com/blog/store-uuid-optimized-way/

Rewrite the torrent search backend

The frontend is mostly done, screenshot attached. The backend has always been a mess. I don't like messy logic in the core of the app (the whole point is to efficiently index and serve data), so I'm gonna rewrite the Sphinx backend with something simpler and cleaner. I've had a look at the library source code and it seems to be okay. With time, we can probably index collections, requests, and maybe even Top 10 history.

Move everything over to clean routes

I'm done with /foo.php?bar=baz in the web interface and API. Flight support is coming along. The best part is, it breaks all the leetcode and enforces strict standards.

Namespace the damn app already

First class collision occurred with OpenAI. Will start with purely static classes (most of them) and work my way toward the other classes. PSR-4 support has been a thing in composer.json for a while, probably deleted because JSON doesn't support comments. No use statements because I like to know what's going on.

Rethink bonus points and user classes to act more like a hostile bank account

The way that user classes and bonus points currently work: you rank up by having upload activity (or buying upload with BP) and you get a large amount of BP for your seed size. This should be reversed, where user ranks depend on a minimum average seed size and BP are negatively compounded. I know "hostile bank account" is a tautology.

Bearer token scopes

Can be pretty simple, sliced by section or HTTP method. Whatever works and is easiest.
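
One possible shape for this, sketched in Python. The "section:METHOD" scope format and the wildcard rule are invented for illustration, since the issue leaves the design open.

```python
def token_allows(scopes, section: str, method: str) -> bool:
    """Check a request against scopes written as 'section:METHOD', with * as wildcard."""
    method = method.upper()
    candidates = (
        f"{section}:{method}",  # exact match
        f"{section}:*",         # any method within one section
        f"*:{method}",          # one method across all sections
        "*:*",                  # unrestricted token
    )
    return any(scope in scopes for scope in candidates)


read_only = {"*:GET"}
print(token_allows(read_only, "torrents", "get"))   # True
print(token_allows(read_only, "torrents", "POST"))  # False
```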

Soft deletes for torrents

Would be pretty useful: DMCA request comes in, we soft delete it, turns out the request is abusive, nothing is lost.

Implement database replication in the Gazelle codebase

The new database class should transparently pull data from a replica if the methods single, row, column, or multi are called, and write data to the source if do is called. Both scenarios should support an array of database instances, but realistically, there's only one of each and that's way overkill.
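
The routing rule can be sketched as follows, in Python with in-memory SQLite connections standing in for real servers. The method names single, row, column, multi, and do come from the issue. Everything else is an assumption, including the constructor signature, the single-source simplification, and the random choice of replica.

```python
import random
import sqlite3


class Database:
    """Send reads to a replica and writes to the source."""

    def __init__(self, source, replicas=None):
        self.source = source
        self.replicas = list(replicas or [])

    def _reader(self):
        # Fall back to the source when no replica is configured.
        return random.choice(self.replicas) if self.replicas else self.source

    def do(self, query, params=()):
        cursor = self.source.execute(query, params)
        self.source.commit()
        return cursor.rowcount

    def single(self, query, params=()):
        row = self._reader().execute(query, params).fetchone()
        return row[0] if row else None

    def row(self, query, params=()):
        return self._reader().execute(query, params).fetchone()

    def column(self, query, params=()):
        return [r[0] for r in self._reader().execute(query, params).fetchall()]

    def multi(self, query, params=()):
        return self._reader().execute(query, params).fetchall()


source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for connection in (source, replica):
    connection.execute("create table creators (name text)")

database = Database(source, [replica])
database.do("insert into creators values (?)", ("example creator",))
print(database.multi("select name from creators"))  # [] until the replica catches up
```

The empty result in the last line is the point of the demo: the two connections are unrelated here, so the write lands only on the source, which mirrors replication lag in a real deployment.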

Login is broken on dev

Can't log into the dev instance (whoops). Good time to just rewrite the crazy system to use a secure library with sensible paths.

plz be my ai gf

https://github.com/biotorrents/gazelle/blob/openai/app/OpenAI.php

OpenAI API integration for tl;dr torrent group summaries and keywords. Need to get as much production database coverage as possible before my free trial credits expire in April 2023 or so. This is largely done and will be merged into the authentication branch soon.

Pardon the delay! It turns out that rewriting the whole authentication, template, database, and a lot of other stuff became essentially a full application rewrite. Once everything is tested, I'll just merge it, even if it means the forums and wiki might go away for a while.

Turn authors (creators) into first-class objects

Currently, the artist tables in the database are all linking tables. Torrent creators / study authors / artists / etc. should be their own object in the logical schema similar to a torrent group, that can be independently indexed and searched.
