
atk_scraper's Introduction

America's Test Kitchen Recipe Scraper

Scrapes America's Test Kitchen website for recipes and saves PNG screenshots and/or JSON for import to a recipe manager (e.g. https://mealie.io).

Pre-requisites

  • Chrome v111. If you have a different version of Chrome, replace chromedriver with the driver that corresponds to your version.

  • Python 3.6 with an environment built from requirements.txt.

  • America's Test Kitchen/Cook's Country/Cook's Illustrated web subscription (or trial).

Apps

get_recipes.py: Grab individual recipes from a list

  • -h, --help : show this help message and exit
  • -e EMAIL, --email EMAIL : ATK email for login.
  • -p PASSWORD, --password PASSWORD : Single quoted password for login. For example 'my_password!*'
  • -r RECIPES, --recipes RECIPES : Text file containing a list of individual recipes to grab.
  • -j JSON, --json JSON : Get recipes as json for mealie (default True)
  • -i IMAGE, --image IMAGE : Get recipes as images (default False)
  • -o OUT_PATH, --out_path OUT_PATH : Location to save images/json (default './recipes/')
  • --driver DRIVER : Path to the chromedriver. (default './chromedriver')
  • --verbose : verbose output
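
The option list above maps naturally onto an argparse parser. The following is a minimal sketch reconstructed from the help text only, not the project's actual source: the `str2bool` helper and the `build_parser` function name are hypothetical, and it assumes `-j` and `-i` take an explicit true/false value, as the `JSON` and `IMAGE` metavars suggest.

```python
import argparse


def str2bool(value: str) -> bool:
    """Interpret common true/false spellings passed on the command line."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented get_recipes.py options."""
    parser = argparse.ArgumentParser(
        description="Grab individual recipes from a list"
    )
    parser.add_argument("-e", "--email", help="ATK email for login.")
    parser.add_argument(
        "-p", "--password", help="Single quoted password for login."
    )
    parser.add_argument(
        "-r", "--recipes",
        help="Text file containing a list of individual recipes to grab.",
    )
    # Assumption: -j and -i accept a value such as True or False.
    parser.add_argument(
        "-j", "--json", type=str2bool, default=True,
        help="Get recipes as json for mealie (default True)",
    )
    parser.add_argument(
        "-i", "--image", type=str2bool, default=False,
        help="Get recipes as images (default False)",
    )
    parser.add_argument(
        "-o", "--out_path", default="./recipes/",
        help="Location to save images/json (default './recipes/')",
    )
    parser.add_argument(
        "--driver", default="./chromedriver",
        help="Path to the chromedriver. (default './chromedriver')",
    )
    parser.add_argument("--verbose", action="store_true", help="verbose output")
    return parser
```

Under these assumptions, `build_parser().parse_args(["-e", "cook@example.com", "-p", "my_password!*", "-r", "recipes.txt", "-i", "True"])` would request both JSON and screenshots while keeping the default output path and driver location.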

get_searches.py: Traverse search results and grab all recipes within

  • -h, --help : show this help message and exit
  • -e EMAIL, --email EMAIL : ATK email for login.
  • -p PASSWORD, --password PASSWORD : Single quoted ATK password for login. For example 'my_password!*'
  • -r RECIPES, --recipes RECIPES : Text file containing a list of search result pages to recursively descend and grab all recipes from. See recipes.txt for an example. The "All Recipes" page will not work: despite its name, the site stops loading recipes once 900 are reached, so you need to split the search into smaller result sets.
  • -j JSON, --json JSON : Get recipes as json for mealie (default True)
  • -i IMAGE, --image IMAGE : Get recipes as images (default False)
  • -o OUT_PATH, --out_path OUT_PATH : Location to save images/json (default ./recipes/)
  • --driver DRIVER : Path to the chromedriver. (default ./chromedriver)
  • --verbose : verbose output

Process

  1. Selenium opens Chrome driver in headless mode.
  2. Logs into ATK using credentials provided.
  3. Iterates through the list of pages, whether individual recipes or full search pages.
  4. Each page source is passed to BeautifulSoup, which extracts all recipe links.
  5. Each recipe link is loaded with Selenium. Page dimensions are determined using page divs.
  6. If -i is specified, the Chrome window is resized to fit these dimensions and a screenshot is saved to the specified path. Screenshots are then cleaned using Pillow and saved as <image>.trimmed.png.
  7. If -j is specified, recipe information is scraped and loaded into JSON for later import to a recipe manager (e.g. mealie). The highlight image is also saved as a .jp2 image (this is the format used by ATK).
  8. The program will load the next page and repeat.
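
Step 4 (pulling recipe links out of a page's source) can be illustrated in isolation. The sketch below deliberately swaps BeautifulSoup for Python's standard-library `html.parser` so that it runs without extra dependencies; the project itself uses BeautifulSoup. The `RecipeLinkParser` class, the `extract_recipe_links` function, and the assumption that recipe URLs contain `/recipes/` are all hypothetical and not taken from the project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RecipeLinkParser(HTMLParser):
    """Collect unique absolute URLs of anchors that look like recipe links."""

    def __init__(self, base_url: str, marker: str = "/recipes/"):
        super().__init__()
        self.base_url = base_url
        # Assumption: recipe pages can be recognised by this path fragment.
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            # Preserve first-seen order while skipping duplicates.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_recipe_links(page_source: str, base_url: str) -> list:
    """Return recipe links found in the given HTML page source."""
    parser = RecipeLinkParser(base_url)
    parser.feed(page_source)
    return parser.links
```

In the real pipeline the `page_source` argument would be the page source that Selenium returns for each search page, and each returned link would then be loaded in turn (step 5).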

Disclaimer

This project is for educational, read-only purposes.

The use of this project is done at your own discretion and risk.

You are solely responsible for liability and consequences.


atk_scraper's Issues

refactoring

Needs to be cleaned up: better error handling, better documentation, and handling pages differently (not hardcoded). Is there an easier way to generate the list of recipes?
