dsc-html-css-scraping-recap-houston-ds-021720's Introduction

HTML, CSS and Web Scraping - Recap

Introduction

In this section, you learned a lot about web pages and how to exploit their structure for your own web scraping purposes. Take this opportunity to briefly review some of the key takeaways from the section.

HTML

To start this section, you investigated the basic structure of an HTML page, seeing both its nested structure and some of the basic tags that you later leveraged for web scraping. These included <a> tags for links and <div> tags, which act as general containers for other blocks of HTML.
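As a reminder of what that nesting looks like in practice, here is a minimal sketch. It uses Python's built-in html.parser module rather than Beautiful Soup, purely to make the tag hierarchy visible. The sample page and the LinkCollector class are invented for illustration and are not from the lessons.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div id="nav">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <div class="content">
      <p>Welcome! See the <a href="/docs">docs</a>.</p>
    </div>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Record each link's href together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open
        self.links = []      # (href, enclosing tag path) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            self.links.append((href, " > ".join(self.open_tags)))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (the sample HTML is well formed).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for href, path in collector.links:
    print(f"{href:8} inside {path}")
```

Each link is reported along with the chain of container tags around it, which is the same nesting you rely on when navigating a parsed page.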

CSS

After taking an initial look at HTML, you saw the role CSS plays in styling a web page. You learned that HTML deals with content while CSS deals with style. There is certainly more to learn about CSS, but an important takeaway is that CSS selectors can also be used while web scraping. For example, you can select a tag with an id selector or a class selector.
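A short sketch of id and class selectors in a scraping context, assuming the bs4 package is installed. The markup below is hypothetical; select and select_one are the Beautiful Soup methods that accept CSS selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
SAMPLE_HTML = """
<div id="main">
  <p class="price">$10.00</p>
  <p class="price sale">$7.50</p>
  <p class="note">Free shipping</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "#main" matches the element whose id is "main".
main = soup.select_one("#main")

# ".price" matches every element whose class list includes "price".
prices = [p.get_text(strip=True) for p in soup.select(".price")]

# Selectors combine: <p> tags with class "sale" nested inside #main.
sale = [p.get_text(strip=True) for p in soup.select("#main p.sale")]

print(main.name, prices, sale)
```

The same selector strings you would write in a stylesheet work here, so inspecting a page's CSS is often a quick way to find a usable selector.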

Beautiful Soup

After an initial exploration of web development, you returned to Python and used the requests and Beautiful Soup packages to extract data from the web. This was also a great chance to practice your data wrangling skills, as you will often have to navigate nested data structures and clean messy data by removing spaces, using regular expressions, and converting data types.
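The cleaning step might look something like the sketch below, which picks up after the text has already been extracted from the page. The raw rows, field names, and the clean_row helper are all hypothetical, chosen only to show whitespace removal, regular expressions, and type conversion together.

```python
import re

# Hypothetical raw strings, as they might come back from tag.get_text().
raw_rows = [
    {"title": "  Example Book \n One ", "price": "$51.77",
     "stock": "In stock (22 available)"},
    {"title": "\tExample Book Two", "price": "$53.74",
     "stock": "Out of stock"},
]


def clean_row(row):
    """Strip whitespace, pull numbers out with regexes, and convert types."""
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    title = " ".join(row["title"].split())
    # Keep only the numeric part of the price and convert it to a float.
    price = float(re.search(r"\d+(?:\.\d+)?", row["price"]).group())
    # The stock count is optional, so fall back to 0 when it is absent.
    stock_match = re.search(r"\((\d+) available\)", row["stock"])
    stock = int(stock_match.group(1)) if stock_match else 0
    return {"title": title, "price": price, "stock": stock}


cleaned = [clean_row(row) for row in raw_rows]
for row in cleaned:
    print(row)
```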

Precautions

Remember to practice caution when scraping websites. Surfing the web at superhuman speeds will get you banned from many domains and may violate the terms and conditions of many websites that require you to log in. As such, there are a few considerations to keep in mind along the way.

  • Are there terms and conditions for using the website?
  • Test your scraping bot on small samples to debug it before scaling to hundreds, thousands, or millions of requests.
  • Start thinking about your IP address: getting blacklisted from a website is no fun. Consider using a VPN.
  • Slow your bot down! Add delays along the way with the time package. Specifically, time.sleep(seconds) adds wait time in a program.
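The last two points, testing on a small sample and adding delays, can be sketched together as follows. The fetch_politely function, its parameters, and the example.com URLs are assumptions made for illustration. In real use, the fetch argument would be something like a thin wrapper around requests.get.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sample_size=None):
    """Call fetch(url) for each URL, pausing between requests.

    fetch is any callable that takes a URL and returns a result.
    sample_size limits the run to the first few URLs for a test batch.
    """
    if sample_size is not None:
        urls = urls[:sample_size]
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(url):
    return f"<html>page for {url}</html>"


urls = [f"https://example.com/page/{n}" for n in range(1, 101)]

# Debug on 3 pages first; raise sample_size once the parsing works.
pages = fetch_politely(urls, fake_fetch, delay_seconds=0.01, sample_size=3)
print(len(pages))
```

Passing the fetch function in as an argument keeps the delay logic separate from the request logic, which makes it easy to test the loop before pointing it at a real site.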

Other Scraping Tools

While Beautiful Soup is a powerful go-to tool for scraping the web, other tools such as Selenium and Scrapy have their own advantages and disadvantages. Beautiful Soup is apt to be your primary scraping tool for now, but these other options are worth knowing about should you need additional capabilities, such as interacting with dynamic JavaScript-based websites.

Summary

This was an exciting section delving into the world of web scraping! There's always a plethora of information to be mined from the web so go out there and get scraping!
