andrewguenthner / scraping_mars_site Goto Github PK
View Code? Open in Web Editor NEWsite and back end that scrapes and aggregates Mars data from public domain sources
site and back end that scrapes and aggregates Mars data from public domain sources
Files for a Flask website / app that scrapes public domain Mars data and displays it on a web page, with storage via a Mongo DB. Intended for use as a web scraping demo. Requirements: -- Mongo DB server -- Flask server (development version) -- flask and flask_pymongo libraries for server -- requests, bs4 (Beautiful Soup), and splinter (0.1) for scraping -- pandas (0.23) for table scraping and handling The system is configured to use ChromeDriver on Windows 10 for scraping Files: -- app.py contains the Flask routines -- scrape_mars.py is imported by app.py and contains the scraping routines -- mission_to_mars.jpynb contains the code snippets and tests for scraping -- chromedriver.exe is provided for scraping. Scraping is not configured in headless mode, so that the user can see what's happening. The folder "templates" contains the default route page "index.html" The folder "static" contains "style.css" for overriding the included Bootstrap 4.0 styling in places The folder "op_images" contains screenshots of the server and web site in action for validation purposes Notes: The data is scraped from public domain sites of the United States government with the exception of space-facts.com, which itself gets it data from the United States government. The space-facts data table ("Mars facts") is collected from https://space-facts.com/mars/, and was originally put together by the site owner, Chris Jones. This is a deliberate choice as data from government agencies is cleared for public release without copyright. The images used in the site are also United States government images cleared for public release. Because most Executive Branch agencies of the United States government have been directed to make their publicly releaseable data widely available, I see no issue whatsoever with scraping their site contents via GET type requests and aggregating it. As an 18-year veteran of the Federal civil service and a Code for America volunteer, I can assure you that in the vast majority of cases, these agencies would like to do a better job of getting their data widely used but are hindered by bureuacracy, poor contracting practices (the lowest bidder probably didn't put too much thought into making the data widely accessable), and other pressing concerns (e.g. cyberattacks). By learning to scrape their sites and repackage their data in a nicer form, you are doing them a favor, so I encourage you to learn from this example and give it a try! Just please be considerate and don't overload their servers or cause problems. As for space-facts.com, if you're going to do anything other than occasionally copy and re-use the government data on the site, then you should check with the site owner. The site uses Bootstrap 4.0 for styling, and includes some scripts, so you should be able to add any other Bootstrap classes without needing to edit the header or footer. Both splinter and requests.get are used for scraping. Splinter can be made to run headless, but, as configured, you will see a new instance of Chrome appear on your desktop, and Chrome should warn you that it is being driven remotely. Even if scraping is unsuccessful, the browser should close itself after a minute. The database dictionaries contain success flags and timestamps for each scraping activity. These are not displayed in the site but are available as pagedata.mars_[x].success and pagedata.mars-[x].when_visited attributes (where [x] denotes the scraped item (e.g. "featured_image", "weather_tweet", and so on. The test items scraped are the latest Mars news blurb from JPL, the latest featured image from JPL, the latest weather tweet (these days from Mars Insight), the Mars facts table (from space-facts.com), and enhanced Mars hemisphere images from the US Geological Service. The server code is designed to fail gracefully, so that if the scraping fails then the site will still display, but you will see placeholder content. This makes it much easier to locate the problem. Note that some sites can have temporary downtime, so you may on occasion get placeholder content even if the code is working perfectly. Also note that web sites change over time, so at some point this code will become out-dated. There are plenty of other scraping demos out there, so I offer no long-term committment to maintain the functionality of the demo, and I may archive or delete this repository at any time, so please clone it if you want it to remain available. The site is configured to run on localhost:5000. It has been tested with Chrome v74 on Windows 10 only.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.