ai-robots-txt / ai.robots.txt
A list of AI agents and robots to block.
Home Page: https://coryd.dev/posts/2024/go-ahead-and-block-ai-web-crawlers/
License: MIT License
We could seed an FAQ with questions from the Hacker News discussion. The main one is along the lines of "Why would an AI web crawler respect robots.txt?"
If we wanted to be brave, we could enable a wiki for this repo!
img2dataset is software that spiders sites for AI training. It's not run by a specific company so hits can come from anywhere.
It claims to honor robots.txt with the "img2dataset" user agent token, and X-Robots-Tag or HTML <meta> directives "noai" and "noimageai".
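As a rough illustration of the two opt-out signals mentioned above, the following Python sketch checks a robots.txt body against the "img2dataset" token using the standard library's `urllib.robotparser`, and scans an X-Robots-Tag value for "noai" or "noimageai". The robots.txt content, the example URL, and the helper names `is_allowed` and `header_opts_out` are assumptions made for this example; the header check is deliberately simplified and ignores user-agent-scoped directives. It shows what a compliant crawler could do, not what img2dataset actually does internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: block only the img2dataset token.
ROBOTS_TXT = """\
User-agent: img2dataset
Disallow: /

User-agent: *
Allow: /
"""

# Directives named in the issue text above.
BLOCKING_DIRECTIVES = {"noai", "noimageai"}


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def header_opts_out(x_robots_tag: str) -> bool:
    """Return True if a comma-separated X-Robots-Tag value contains
    "noai" or "noimageai" (simplified: no user-agent scoping)."""
    directives = {part.strip().lower() for part in x_robots_tag.split(",")}
    return bool(directives & BLOCKING_DIRECTIVES)


if __name__ == "__main__":
    url = "https://example.com/images/cat.png"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "img2dataset", url))
    print(is_allowed(ROBOTS_TXT, "SomeOtherAgent", url))
    print(header_opts_out("noai, noimageai"))
```

Both checks only describe what a well-behaved client would see; a crawler that ignores robots.txt or response headers is unaffected by either.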
tl;dr: an independent body suggests adding a new file to the site root directory that asks organizations using AI for text and data mining (TDM) NOT to use site data for training purposes.
From their FAQ:
An ai.txt file is a simple text file placed in the root directory (or .well-known/) of your website that communicates with data miners. It provides instructions on whether the text and media files hosted on your domain can be used to train commercial AI models.
An example file I created:
# Spawning AI
# Prevent datasets from using the following file types
User-Agent: *
Disallow: /
Disallow: *
PS: Thanks for this project!
Resources:
The idea is to scrape the content of Dark Visitors using a bot and generate PRs for this project, a bit like Dependabot.
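The generation half of that idea could look something like the Python sketch below. It assumes the scraping step has already produced a plain list of user-agent names; how those names are fetched from Dark Visitors and how the PR is opened are not shown. The function name `render_robots_txt`, the sample agent names, and the output layout (one `User-agent` line per bot followed by a single `Disallow: /`) are illustrative assumptions, not necessarily the format this project uses.

```python
from typing import Iterable


def render_robots_txt(agent_names: Iterable[str]) -> str:
    """Render a robots.txt group that disallows every listed agent.

    Names are trimmed, de-duplicated case-insensitively (first spelling
    wins), and sorted so that regenerated files produce small diffs.
    """
    seen: dict[str, str] = {}
    for name in agent_names:
        cleaned = name.strip()
        if cleaned:
            seen.setdefault(cleaned.lower(), cleaned)

    agents = [seen[key] for key in sorted(seen)]
    if not agents:
        return ""

    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example input only; a real run would use the scraped list.
    print(render_robots_txt(["GPTBot", "CCBot", "gptbot", "  "]), end="")
```

Sorting and de-duplicating before rendering keeps the output deterministic, so an automated PR would only show a diff when the set of agents actually changes.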
Help populate the table-of-bot-metrics.md to clarify bot activity.