MUSA-620-Week-7 - Web Scraping, Part 2

Links

Selenium setup:

  1. Install the Java JDK, following these instructions for Windows or these instructions for OS X. You can download the latest Java JDK version here.
  2. Download the latest version of the Selenium Standalone Server.
  3. Download the latest version of Chromedriver. And if you do not already have it, install Chrome.
  4. Take a deep breath. You're halfway there.
  5. Open a command prompt (in Windows press Win+R then type "cmd", in OS X search for "terminal" in Spotlight) and view your PATH environment variable. Windows: echo %PATH% / Mac OS: echo $PATH.
  6. Unpack the Chromedriver executable file to one of the folders listed in your PATH.
  7. (Next time, you can start at this step.) To start the server, open the command line, go to the directory where the Selenium Standalone Server file is downloaded, and run java -jar selenium-server-standalone-3.9.1.jar.
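Once the server is running, you can connect to it from R. A minimal sketch, assuming the RSelenium package is installed (the scraper-template.r script may set this up differently):

```r
# Connect to the locally running Selenium Standalone Server.
# Assumes RSelenium is installed and Chromedriver is on your PATH.
library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444L,           # default port for the standalone server
  browserName = "chrome"
)
remDr$open()
remDr$navigate("https://www.phila.gov")
remDr$close()
```

If `open()` fails, the usual culprits are the server not running or Chromedriver not being on your PATH.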

If you are unable to set up the standalone server, not to worry. You can use Sauce Labs instead. You can sign up for a free Sauce Labs trial account here.

Sauce Labs setup:

  1. After signing up and logging in, click your name in the upper right corner and go to User Settings.
  2. Scroll down and copy the Access Key.
  3. Paste it into the webdriver.r script along with your Sauce Labs username.
  4. While running your scraper, you can watch the browser live from the Sauce Labs Dashboard page.
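Connecting from R looks much the same as for a local server, just pointed at the Sauce Labs endpoint. A hedged sketch, assuming RSelenium; the placeholder credentials should be replaced with your own, as webdriver.r does:

```r
# Connect to a remote browser hosted by Sauce Labs.
# SAUCE_USERNAME and SAUCE_ACCESS_KEY are placeholders -- substitute
# the username and Access Key from your Sauce Labs User Settings page.
library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "ondemand.saucelabs.com",
  port = 80,
  browserName = "chrome",
  extraCapabilities = list(
    username = "SAUCE_USERNAME",
    accessKey = "SAUCE_ACCESS_KEY"
  )
)
remDr$open()
# ... scrape as usual; the live session appears on your Sauce Labs Dashboard
remDr$close()
```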

Assignment

Calculate the price per square foot of condominiums overlooking Rittenhouse Square by scraping the Philadelphia Property Database. Then present this information visually on a map.

This assignment is required. Please turn it in by email to me (galkamaxd at gmail) and Evan (ecernea at sas dot upenn dot edu).

Due: Wednesday, 14-March by 9am

Description

To calculate the average price per square foot for these homes, you will need to scrape the Philadelphia Property Database.

  • The list of condos (address and unit #) is in the condos-rittenhouse.csv file. So as not to overload the Property Database server, please do not run the full dataset until you have confirmed your scraper is working and are ready for the final run. Instead, please use condo-test-data.csv (which contains only four records) for building and testing your scraper.
  • See the scraper-template.r script for some code to get you started.
  • For the purposes of this assignment, please use the most recent Market Value as the price of the condo, as shown in the "VALUATION HISTORY" table. For the area of the condo, please use the field labeled "IMPROVEMENT AREA (SQFT)".
  • Once you have collected the property values, you should calculate the average price per square foot for each of the buildings.
  • Geocode the buildings with ggmap (geocode only the building addresses, not each individual condo). Note: the ggmap geocode has had some stability issues recently. If it fails for some of the addresses, the easiest solution is to record the geocoded addresses and rerun the ones that failed.
  • Present the average price-per-square-foot figures visually on a map using ggplot2.
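The steps above can be sketched end to end. This is only an outline under stated assumptions: get_condo_record() is a hypothetical stand-in for your Selenium scraper (see scraper-template.r), and the column names (address, unit, value, sqft) are assumptions about the CSV files:

```r
# Outline of the assignment pipeline: scrape, aggregate, geocode, map.
# get_condo_record() is a hypothetical stand-in for your scraper;
# the column names used here are assumptions, not the files' real headers.
library(ggmap)
library(ggplot2)

condos <- read.csv("condo-test-data.csv", stringsAsFactors = FALSE)

results <- data.frame()
for (i in seq_len(nrow(condos))) {
  rec <- get_condo_record(condos$address[i], condos$unit[i])  # your scraper
  results <- rbind(results, rec)
  Sys.sleep(3)  # pause between page loads, out of courtesy to the server
}

# Average price per square foot, by building
results$ppsf <- results$value / results$sqft
by_building <- aggregate(ppsf ~ address, data = results, FUN = mean)

# Geocode the building addresses (not each individual condo)
coords <- geocode(by_building$address)
by_building <- cbind(by_building, coords)

# Map the buildings, colored by average price per square foot
ggplot(by_building, aes(x = lon, y = lat, color = ppsf)) +
  geom_point(size = 4) +
  labs(color = "Avg price / sqft")
```

If geocode() fails for some addresses, save the ones that succeeded and rerun only the failures, as noted above.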

If you are having problems getting the Selenium Standalone Server working, you can find more information here. We can also go through it in office hours. However, this project is not meant to be an exercise in setting up servers, so if you are finding it overwhelming to set up the standalone server, you should just use Sauce Labs, as we did in class. You can sign up for a free Sauce Labs trial account here.

Deliverable

  • a map showing each building's average price per square foot
  • the output data from your scraper (the value and square footage of each condo)
  • all R scripts used in scraping, analyzing, and visualizing the data
  • a written explanation of the steps you took, any challenges you encountered along the way, and the reasons for your design choices.

Additional Comments

  • Out of courtesy for the maintainers of the Philly Property Database and its other users, please do your best to avoid overloading their server. When you are ready for your scraper's final run, you should do so outside of normal working hours. And please remember to include at least a few seconds of pause between page loads, using the Sys.sleep() command.
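For example, a randomized pause between requests (the durations are just a suggestion, and the jitter is optional):

```r
# Pause 3-6 seconds between page loads to avoid hammering the server.
Sys.sleep(3 + runif(1, min = 0, max = 3))
```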

  • An unfortunate reality of web scraping is that it is often messy. This assignment is no exception. If/when you run into problems, please do your best to improvise.

  • If your scraper fails midway through, do not start again from scratch. You should store the results that you already have and pick up where you left off, skipping the problematic address if need be.
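One way to make a run resumable is to write each record to disk as soon as it is scraped, then skip completed addresses on restart. A sketch under stated assumptions: "results.csv", the address/unit column names, and get_condo_record() are all illustrative:

```r
# Resume a partially completed run: skip addresses already scraped and
# append each new result to disk immediately, so a crash loses nothing.
# "results.csv", the column names, and get_condo_record() are assumptions.
condos <- read.csv("condos-rittenhouse.csv", stringsAsFactors = FALSE)

done <- character(0)
if (file.exists("results.csv")) {
  done <- read.csv("results.csv", stringsAsFactors = FALSE)$address
}

for (i in seq_len(nrow(condos))) {
  if (condos$address[i] %in% done) next  # already scraped
  rec <- tryCatch(
    get_condo_record(condos$address[i], condos$unit[i]),  # your scraper
    error = function(e) NULL  # skip the problematic address if need be
  )
  if (!is.null(rec)) {
    write.table(rec, "results.csv", sep = ",", row.names = FALSE,
                col.names = !file.exists("results.csv"), append = TRUE)
  }
  Sys.sleep(3)
}
```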

  • If the scraper is regularly stopping on valid addresses, you may be running up against rate limits (too many requests in too short a time). You can wait a bit before resuming, slow your scraper down using Sys.sleep(), or both.

  • If you run into problems with specific addresses, please use your best judgment to come up with a solution and explain it in your writeup.
