
Introduction

The Scrambler turns your website into gibberish to confuse humans and annoy scrapers.

I made the Scrambler as a creative response to rampant scraping by AI companies, who for years have collected our data with neither consent nor payment to train their models -- often for purposes that directly harm us. Think of it as a less polite alternative to robots.txt. I am not the first to think of this idea, but I like to think having multiple people's takes on it can only make the world a better place.

Installation

The Scrambler is a CGI script written in Python 3. It uses no modules outside the standard library. Place scrambler.py in your server's cgi-bin directory and make it executable. Congratulations, you're now ready for visitors.

To scramble a webpage, pass its URL through the query string, like so: ?url=https%3A//www.example.com (%3A is the escape code for the ':' character). If you do not specify a URL, the Scrambler defaults to the root page of your domain.
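
If you generate these links programmatically, Python's standard library handles the escaping for you. Here's a minimal sketch; the target URL is a placeholder, so substitute a page on your own site:

import urllib.parse

# Placeholder target; substitute a page on your own site.
target = "https://www.example.com"

# quote() percent-encodes reserved characters such as ':' (which
# becomes %3A) while leaving '/' alone by default, matching the
# example query string above.
link = "/cgi-bin/scrambler.py?url=" + urllib.parse.quote(target)
print(link)  # /cgi-bin/scrambler.py?url=https%3A//www.example.com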

To prevent abuse, the Scrambler is restricted by default to browsing its host domain. You can allow access to additional sites by setting the SCRAMBLER_ALLOWLIST environment variable to a comma-separated list of domains. Note that the requested domain must exactly match an entry in your allowlist -- example.com and www.example.com are considered separate sites. Other precautions the Scrambler implements are detailed below under "Security".
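
To illustrate the exact-match rule, the check behaves roughly like the sketch below. This is an illustration only, not the script's actual code, and the names are made up:

import os
from urllib.parse import urlsplit

# Illustration only: exact hostname comparison, so example.com and
# www.example.com are treated as different sites.
allowlist = {
    domain.strip().lower()
    for domain in os.environ.get("SCRAMBLER_ALLOWLIST", "").split(",")
    if domain.strip()
}

def is_allowed(url, host_domain):
    hostname = (urlsplit(url).hostname or "").lower()
    return hostname == host_domain or hostname in allowlist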

Scrambling Scrapers

Before we start, let's get one thing out of the way:

The Scrambler is not intended to provide serious protection from scraping. While I hope it is effective, the real point is to amuse humans rather than to frustrate bots, because all technical measures to prevent scraping can be circumvented. The proper way to address this misbehavior is through regulation. AI companies know this, which is why every time it comes up they change the subject to bad science fiction. Regulation, of course, is a complicated subject, and I'm not going to get into the details here.

Now then. To properly annoy scrapers, you'll need to somehow redirect their requests through the Scrambler. If you use Apache httpd, an easy way to do that is with mod_rewrite. Here's an example:

RewriteCond %{HTTP_USER_AGENT} GPTBot|Wget
RewriteCond %{REQUEST_URI} \.(html|php)$
RewriteCond %{REQUEST_URI} !^/cgi-bin/scrambler.py$
RewriteRule ^(.*) /cgi-bin/scrambler.py?honeypot=1&url=%{REQUEST_SCHEME}\%3A//%{HTTP_HOST}%{REQUEST_URI} [L]

The three lines starting with RewriteCond specify whose requests for what get scrambled:

  1. Identify whose requests to scramble.
    • An easy way to do this is through user agent detection, though this relies on the scrapers being honest about what they are.
    • In this example, I'm scrambling requests from GPTBot (OpenAI's crawler) and Wget (an open-source download tool). I'm just picking on Wget to show how you can catch multiple programs with one line.
  2. Identify what content to scramble.
    • How exactly you do this depends on what you used to build your website.
    • My example site is made up of static HTML files and PHP scripts, so I can filter requests by file extension. If you're running a complex web application, your RewriteCond may be more complicated.
  3. Exempt the Scrambler itself from scrambling.
    • Otherwise, a bot that knows about the Scrambler can create an infinite loop by endlessly redirecting it to itself. Web servers (and hosting companies!) tend not to like those.

Beware this isn't Stack Overflow. Understand what those lines mean and customize them for your own site.

The RewriteRule on the last line is what sends these naughty requests through the Scrambler. This one is usually safe to use as-is, assuming you put the Scrambler under /cgi-bin. Note the honeypot=1 in the query string, which activates some additional restrictions (see "Security" below for details).

To avoid confusing more helpful bots, like the ones that index sites for search engines, you should probably block them from accessing the Scrambler directly. The following lines in your robots.txt should do it:

User-agent: *
Disallow: /cgi-bin/scrambler.py

Legitimate scrapers that obey robots.txt now know they're safe from scrambling. Naughty ones won't be hitting it through that URL -- to them it will look like they're accessing your website normally -- so it doesn't matter if they check robots.txt or not.

Security

The Scrambler implements a few basic precautions to prevent abuse:

  • It is restricted to your own site by default, and can only access other sites if you explicitly allow them (see "Installation" above).
  • It only allows accessing sites through HTTP and HTTPS on their respective well-known ports. This is because non-standard ports are typically used for non-public internal purposes (see the sketch after this list).
  • It blocks JavaScript to prevent undesirable behaviors, both intended (like tracking) and unintended (like weird side effects from scrambling).
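
As a rough illustration of the scheme and port restriction, a check along these lines would do it (again, an illustration rather than the script's actual code):

from urllib.parse import urlsplit

# Only plain HTTP and HTTPS on their well-known ports are accepted;
# a URL with no explicit port implicitly uses the well-known one.
WELL_KNOWN_PORTS = {"http": 80, "https": 443}

def uses_standard_port(url):
    parts = urlsplit(url)
    default = WELL_KNOWN_PORTS.get(parts.scheme)
    if default is None:
        return False  # not HTTP or HTTPS at all
    return parts.port is None or parts.port == default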

Adding honeypot=1 to the query string (see "Scrambling Scrapers") further restricts access for your unwelcome visitors:

  • Access to other sites is blocked completely, even if they're on your allowlist. This is so scrapers don't suck up all your bandwidth if you've linked to and allowlisted a huge site like Wikipedia.
  • Access to linked content the Scrambler can't scramble, like PDF files, is also blocked; see the sketch after this list. (This does not apply to embedded content like images.)
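
For a sense of what counts as content the Scrambler can't scramble, a content-type check along these lines would catch it. This is purely illustrative; the script itself may decide differently:

from urllib.request import urlopen

# Illustration only: the Scrambler works on HTML, so anything that
# isn't HTML (PDFs, archives, and so on) can't be scrambled.
SCRAMBLEABLE_TYPES = ("text/html", "application/xhtml+xml")

def looks_scrambleable(url):
    with urlopen(url) as response:
        return response.headers.get_content_type() in SCRAMBLEABLE_TYPES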

While I believe the Scrambler is reasonably safe, beware that it is probably not bulletproof. As always when using random code from the Internet, caveat emptor.

