AO3Scraper

A simple Python scraper for the Archive of Our Own, built in collaboration with @ssterman. Now with HASTAC 2017 presentation slides!

Features:

  • Given a fandom URL and the number of fics you want, returns a list of fic IDs. (ao3_work_ids.py)
  • Given a fic ID (or a list of them), saves a CSV of all the fic metadata and content. (ao3_get_fanfics.py)
  • Given the CSV of fic metadata and content created by ao3_get_fanfics.py, saves a new CSV of only the metadata. (extract_metadata.py)
  • Given the CSV of fic metadata and content created by ao3_get_fanfics.py, creates a folder of individual text files containing the body of each fic. (csv_to_txts.py)
  • Given the CSV of fic metadata and content created by ao3_get_fanfics.py, uses an AO3 tag URL to count the number of works using that tag or its wrangled synonyms. (get_tag_counts.py)

Dependencies

  • pip install bs4
  • pip install requests
  • pip install unidecode

Example Usage

Let's say you want to collect data on the first 100 completed English-language fics, ordered by kudos, in the Sherlock (TV) fandom. The first thing to do is use AO3's nice search feature on their website.

We get this URL as a result: http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29
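Search results like this are paginated, so ao3_work_ids.py has to walk through pages of the same URL. A minimal stdlib-only sketch of that step (the page_url helper is an illustrative name, not a function from the repo):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def page_url(search_url, page):
    """Return search_url with its `page` query parameter set to `page`."""
    parts = urlparse(search_url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["page"] = [str(page)]  # add or overwrite the page number
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

url = "http://archiveofourown.org/works?tag_id=Sherlock+%28TV%29"
print(page_url(url, 2))
```

Using parse_qs with keep_blank_values=True matters here, because AO3 search URLs carry empty parameters (like work_search[query]=) that should survive the round trip.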

Run python ao3_work_ids.py <url>. You can optionally add some flags:

  • --out_csv output.csv (the name of the output csv file, default work_ids.csv)
  • --num_to_retrieve 10 (how many work ids you want, defaults to all)
  • --multichapter_only 1 (restricts output to works with more than one chapter, defaults to false)
  • --tag_csv name_of_csv.csv (a CSV of tags; only works carrying at least one of the listed tags are retrieved, off by default)
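The --tag_csv filter boils down to a set-intersection test per work. A stdlib-only sketch of the idea, assuming the CSV holds one tag per row (load_tags and has_wanted_tag are illustrative helpers, not the script's actual function names):

```python
import csv
import io

def load_tags(csv_text):
    """Read a one-tag-per-row CSV into a set of lowercased tags."""
    return {row[0].strip().lower() for row in csv.reader(io.StringIO(csv_text)) if row}

def has_wanted_tag(work_tags, wanted):
    """Keep a work only if at least one of its tags is in the wanted set."""
    return any(tag.strip().lower() in wanted for tag in work_tags)

wanted = load_tags("Fluff\nAngst\n")
print(has_wanted_tag(["Angst", "Slow Burn"], wanted))  # True
```

Lowercasing both sides makes the match forgiving of capitalization differences between the tag CSV and AO3's tags.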

The only required input is the search URL.

For our example, we might say:

python ao3_work_ids.py "http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29" --num_to_retrieve 100 --out_csv sherlock
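Under the hood, ao3_work_ids.py pulls the work IDs out of each results page. On AO3 search pages, each work blurb is a list item whose id attribute looks like work_12345 (an assumption about AO3's current markup); a stdlib-only sketch of that extraction:

```python
import re

# Matches the id attribute of a work blurb, e.g. <li id="work_5937274" ...>
WORK_ID_RE = re.compile(r'<li[^>]*\bid="work_(\d+)"')

def extract_work_ids(html):
    """Return the numeric work IDs found on a search-results page."""
    return WORK_ID_RE.findall(html)

sample = ('<ol><li id="work_5937274" class="work blurb">...</li>'
          '<li id="work_7170752" class="work blurb">...</li></ol>')
print(extract_work_ids(sample))  # ['5937274', '7170752']
```

The real script uses BeautifulSoup (the bs4 dependency above) rather than a regex, which is more robust if AO3's HTML shifts around.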

Now, to actually get the fics, run python ao3_get_fanfics.py sherlock.csv. You can optionally add some flags:

  • --csv output.csv (the name of the output csv file, default fanfic.csv)
  • --header 'Chrome/52 (Macintosh; Intel Mac OS X 10_10_5); Jingyi Li/UC Berkeley/[email protected]' (an optional http header for ethical scraping)

Instead of a .csv file name, you can also query a single fic ID, python ao3_get_fanfics.py 5937274, or an arbitrarily long list of them, python ao3_get_fanfics.py 5937274 7170752.

If you stop a scrape from a CSV partway through (or it crashes), you can restart from the last uncollected work_id using the flag --restart 012345 (the work_id). The scraper skips all IDs before that point in the CSV, then begins again from the given ID.
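The restart behavior can be pictured as a skip-until scan over the work_id column. A stdlib-only sketch (ids_from is an illustrative stand-in, not the script's actual function):

```python
import csv
import io

def ids_from(csv_text, restart_id=None):
    """Yield work_ids from a CSV, skipping everything before restart_id."""
    skipping = restart_id is not None
    for row in csv.reader(io.StringIO(csv_text)):
        work_id = row[0]
        if skipping:
            if work_id == restart_id:
                skipping = False  # found the restart point: resume here
            else:
                continue          # still before the restart point: skip
        yield work_id

print(list(ids_from("111\n222\n333\n", restart_id="222")))  # ['222', '333']
```

Note the restart ID itself is re-scraped, matching the "begin again from the given id" behavior described above.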

By default, we save all chapters of multi-chapter fics. Use --firstchap 1 to retrieve only the first chapter of multi-chapter fics.

We cannot scrape fics that are locked (for registered users only), but submit a pull request if you want to build authentication!

Note that the 5-second delays before each request to AO3's servers are in compliance with the AO3 terms of service. Please do not remove these delays.
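That politeness policy is easiest to keep in one place. A minimal sketch of a rate-limited fetch wrapper (DELAY_SECONDS and polite_get are illustrative names, not from the repo; a real run would pass requests.get as the fetcher):

```python
import time

DELAY_SECONDS = 5  # per the AO3 terms of service; do not lower this

def polite_get(url, fetch, delay=DELAY_SECONDS):
    """Wait `delay` seconds, then fetch the URL with the given callable."""
    time.sleep(delay)
    return fetch(url)

# Example with a stub fetcher and a tiny delay, just to show the shape:
pages = [polite_get(u, fetch=lambda u: "<html>%s</html>" % u, delay=0.01)
         for u in ["http://example.org/a", "http://example.org/b"]]
print(pages)
```

Sleeping before every request (rather than after) guarantees the delay also applies to the very first retry after a crash or restart.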

Happy scraping!

Improvements

We love pull requests!

FF.net

Want to scrape fanfiction.net? Check out my friend @smilli's ff.net scraper!
