Web Scraping 101

Getting started

This repository covers basic web scraping techniques. In this lesson, we will scrape the FDIC's failed banks list.

Notebook version - scrape_fdic.ipynb

Python script version - ./scripts/fdic.py

To get started, clone this repository:

git clone git@github.com:biglocalnews/nyt-web-scraping-tutorial.git

Change your directory to the repository:

cd nyt-web-scraping-tutorial

Install the requirements:

pip install -r requirements.txt

Start up Jupyter Lab to run the notebook:

jupyter lab

Overview

Web scraping can be more or less difficult depending on the nature of a website. A simple site with no dynamic content and predictable URL patterns could be a quick job, compared to one that uses web forms, randomized URLs, sessions/cookies, dynamically generated content, password-based logins, etc.

Below are some key strategies, concepts and tools that will help with web scraping.

Dissecting a web site

Before you can scrape a site, you have to understand how it works. Here are some questions to ask when assessing a site:

  • Is the target information located on a single web page?

  • Is there a landing page with a list of items that link to child pages with additional details?

  • Does the site use pagination to present a long list of data, files, downstream pages, etc.?

  • Do target pages/files have predictable URLs?

  • Do you have to fill out a search form before seeing target results?

  • Does the site require a user to log in?

  • Is the site using sessions to manage client connections?

  • Is the target data in the source HTML, or is it dynamically generated by Javascript after the page has loaded in the browser?

To answer these questions, you must go beyond simply clicking around a website. You must use your web browser to view its source code and use Developer Tools to look under the hood.
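
One quick check on that last question is sketched below with the requests library (the URL and search text are placeholders): fetch the raw HTML and see whether the data you see in the browser actually appears in it.

    import requests

    # Placeholder URL and target text -- substitute the page and data you're after
    url = "https://example.com/some-page"
    response = requests.get(url)

    # If the text is visible in the browser but missing from the raw HTML,
    # it's likely generated by Javascript after the page loads
    if "Washington Federal" in response.text:
        print("Data is in the source HTML -- a basic scrape should work")
    else:
        print("Data may be dynamically generated -- consider Selenium")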

Here are some additional resources to level up on core skills for dissecting websites:

  • HTML - A markup language that tells your browser how to structure a page.

  • HTTP - The verbs of the Web. This is how clients and servers communicate. If nothing else, you should understand how to use GET and POST requests (the latter are commonly used with web forms).

  • CSS - A style language that tells a browser how to display page elements. CSS selectors are invaluable for extracting data from elements in a page.

  • Inspecting the Web with Chrome's Dev Tools - A gentle intro to core web technologies through the lens of Chrome Dev Tools.

  • Javascript and the Document Object Model (DOM) - You don't need to be a master of Javascript to scrape, but it's critical to understand how Javascript can manipulate the DOM to dynamically generate/alter content after a page has been loaded.

Scraping strategies

Below are some high-level strategies for a few common scraping scenarios. Keep in mind that you may run into sites that require you to combine these strategies -- e.g. basic scraping techniques with more advanced stateful web scraping.

Basic scrapes

Let's define a basic scrape as one where the target information is located in the source HTML itself, and any child pages are easily accessible (perhaps because they use predictable URLs that can be harvested from a landing or index page). The site doesn't use forms or require logins, doesn't dynamically generate content, and doesn't use sessions/cookies. In such cases, you can likely get away with simply using the requests library to grab HTML pages and the BeautifulSoup library to parse and extract data from each page's HTML.
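
Here's a minimal sketch of that workflow using requests and BeautifulSoup. The URL points at this lesson's FDIC failed banks list (the exact index-page URL and the "table tr" selector are assumptions; inspect the live page for its real structure):

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.fdic.gov/bank/individual/failed/banklist.html"  # assumed URL
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # CSS selectors pull elements out of the parsed HTML;
    # "table tr" is an assumed selector -- adjust to the page's actual markup
    for row in soup.select("table tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:
            print(cells)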

Predictable URLs and Query Strings

If a site contains a list of records that in turn lead to child pages with more detail, you can often scrape the so-called "index" page to harvest links to the child pages. Or perhaps there's a piece of metadata in the records on the index page (e.g. a company ID) that will let you dynamically generate the links to child pages.

The FDIC Failed Banks site and Data.gov are good examples:

https://www.fdic.gov/bank/individual/failed/wafedbank.html

https://catalog.data.gov/dataset/national-student-loan-data-system

Some sites use so-called query strings, which are extra search parameters added to a URL as one or more key=value pairs (following a question mark). Here are two examples:

https://www.whitehouse.gov/?s=coronavirus

https://www.governmentjobs.com/careers/santaclara?department%5B0%5D=County%20Counsel&department%5B1%5D=County%20Executive&sort=Salary%7CDescending

The requests library can help you construct such URLs.
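
For example, requests builds the query string for you from a dictionary passed as the params argument. The paging loop below assumes a hypothetical site that paginates with a predictable page parameter:

    import requests

    # requests encodes the params dict into a query string, e.g. ?s=coronavirus
    response = requests.get("https://www.whitehouse.gov/", params={"s": "coronavirus"})
    print(response.url)  # the full URL requests constructed

    # Hypothetical paginated site with a predictable "page" parameter
    for page in range(1, 4):
        response = requests.get("https://example.com/results", params={"page": page})
        # ... parse each page's response.text with BeautifulSoup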

Web Forms

Often, you will have to fill out a search form to locate target data. Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.

Many web forms use POST requests, where the form information is sent as part of the body of the web request (as opposed to embedded in the URL). In such cases, you can use a tool such as requests.post or Selenium to fill out and submit the form.
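
A minimal sketch with requests.post, assuming a hypothetical search form; the field names come from the name attributes of the form's inputs, which you can find by inspecting its HTML:

    import requests

    # Hypothetical form fields -- use the name attributes from the real form's HTML
    form_data = {
        "search_term": "failed banks",
        "state": "CA",
    }

    # The form data travels in the body of the POST request, not in the URL
    response = requests.post("https://example.com/search", data=form_data)
    # Parse response.text with BeautifulSoup as usual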

Logging in

Sites that require logins can often be handled by simply passing in your login credentials as part of a web form (see Web Forms above). The requests library provides several ways to authenticate, or you can use a stateful web browser such as Selenium to fill out a login form. Login-based sites will also often use sessions/cookies to manage your interactions. See Stateful Web Scraping below for more details on how to handle this.
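
Here's a sketch of a form-based login using a requests Session (the URL and field names are hypothetical); the Session holds on to whatever login cookie the server sets, so later requests stay authenticated:

    import requests

    session = requests.Session()

    # Hypothetical login URL and field names -- inspect the real login form
    session.post(
        "https://example.com/login",
        data={"username": "me", "password": "secret"},
    )

    # The session automatically sends the login cookie with later requests
    response = session.get("https://example.com/members-only/data")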

Dynamic content

Many sites use Javascript to dynamically add or transform page content after you've loaded the page. This means that what you see in the source HTML using View Source will not match what you see in the browser (or the Elements tab of Chrome Developer Tools).

Scraping such a page requires using a library such as Selenium, which uses the "web driver" technology behind browsers such as Firefox to automate browser interactions.

Selenium gives you access to the Document Object Model (DOM) -- the content as seen by a real web browser. The DOM is the internal representation of a page that reflects both the static HTML and elements/styles dynamically added or manipulated by Javascript.

Selenium allows you to automate interactions with the DOM -- the same as a human using a browser -- such as scrolling down a page or stepping through a paginated list of results, in order to generate the content that you're targeting.

Further, the Python Selenium library provides convenient methods for accessing DOM elements, for instance using CSS selectors. This is similar in concept (and often in syntax) to how BeautifulSoup helps you parse and extract data from HTML.
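
A minimal Selenium sketch, assuming Firefox and its geckodriver are installed (the URL and CSS selector are placeholders):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # launches a real, automatable browser
    driver.get("https://example.com/dynamic-page")  # placeholder URL

    # find_elements queries the live DOM -- including content added by Javascript --
    # using a CSS selector ("div.result" is a placeholder)
    for element in driver.find_elements(By.CSS_SELECTOR, "div.result"):
        print(element.text)

    driver.quit()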

Stateful Web Scraping

Some websites will use sessions to uniquely identify a visitor and maintain a record of the visitor's interaction -- or state -- with the site. Web servers often manage sessions by sending a unique key in a cookie to your browser. This key is passed back to the server with each new request a visitor makes (e.g. when submitting a search form). Scraping a session-based site requires you to manage the session in your code. The requests library has support for managing sessions.

Alternatively, you can use a library such as Selenium to mimic a browser and get session management for "free" (although note that session management in Selenium can also get tricky depending on the nature of the site).
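
A minimal sketch of session handling with requests, against a hypothetical session-based site: the Session object stores whatever cookies the server sets and sends them back automatically with each new request.

    import requests

    session = requests.Session()

    # The first request typically receives a session key in a cookie
    session.get("https://example.com/search")  # hypothetical site
    print(session.cookies)  # the cookies the server set

    # Later requests in the same Session send those cookies back,
    # so the server sees one continuous visit
    response = session.post("https://example.com/search", data={"q": "banks"})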
