hannesdatta / course-odcm

This repository hosts the course website of Tilburg University's open education class on "Online Data Collection and Management" (oDCM) - learn how to collect web data for your empirical research projects!

Home Page: https://odcm.hannesdatta.com

JavaScript 0.04% HTML 86.69% SCSS 0.18% Shell 0.01% Jupyter Notebook 12.95% Python 0.13%

course-odcm's Issues

intertwine project with course

As I mentioned earlier, I think it would be good practice to structure the team projects a bit more. Here's an example of what I mean: a site followed by a list of potential research ideas. With regard to this course, we may want to ask students to submit all their business/academic research ideas before the lecture in week 2 -> group them by category and site -> curate and validate the most interesting ideas -> publish those as a source of inspiration for students.

This also makes students think more carefully about how to structure, preprocess, and store their data (regardless of whether they actually conduct the analysis or not).

Originally posted by @RoyKlaasseBos in #7 (comment)

mention that scraping is a two-way thing: get AND post (e.g., business)

decide whether extra "business cases" need to be incorporated

Learning goal of podcasts:
(1) How is web scraping used by companies?

business cases: Skyscanner for flight prices, inprijsverhoogd for market research
how do they set up their infrastructure: how often do they scrape, how do they store their data
ever had legal concerns or asked for permission?
what's the future of scraping, in your opinion?
(2) How are APIs used by firms?

disclose data, functionality, or algorithms (contacts w/ google cloud); chartmetric API
what does your business do, exactly?
is API your core activity, or just a side activity? (powering the business, vs. being part of the actual business model)
how many customers do you have? and what type of customers?
what's the payment model you're using? why have you opted for that, versus another one? (e.g., free tiers, free developer plan, etc).
how do you decide on API retrieval thresholds?
what does your backend look like - all self-coded, versus some kind of provider?
OTHER QUESTIONS?
@RoyKlaasseBos: in the spirit of Hilke's comments: maybe such podcasts or video discussions exist already? Can you go look for them?

was originally posted as #14

Create Tilburg `book` theme

@andreantonacci; after the revision from @agemarks, can you please

clone the original hugo book theme, and
replace their book design with our updated tilburg design
keep the content of the basic book template as is
paste a raw course structure (on the basis of our sections: course, syllabus, etc.) in the EXAMPLE section of the page
update the readme of the book theme, and
release it under @tilburgsciencehub\open-education-template?

No rush on this, but let's get this done eventually so others can join soon.

Is end of Nov. realistic?

Thanks.

Collection of scraping and API examples

The example section in dev will feature links to, and short descriptions of, publicly available scrapers and APIs.

Please collect and organize these links. The main goal is to offer students a starting point to get inspired for their own projects.

Possible content headers are:

Code/repositories
- Web scraping projects with code
- API data retrieval projects with code
Tools used for web scraping
Other tutorials on the web
...

Think broadly about useful categories, and then please browse the web for inspiration. Issue #2 has a couple of links already.

Content suggestions

way more detail about XPATH/CSS
difference between href_text and href_attribute when locating elements
think about an "XPATH CHALLENGE": users put in their locators and class needs to try; first one gets a prize
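
A minimal sketch of the difference between a link's text and its `href` attribute, using Python's built-in `xml.etree.ElementTree`, which supports a limited XPath subset. The HTML snippet, the class name `book`, and the URLs are made up for illustration. Real pages are rarely well-formed XML, so the tutorial itself would more likely rely on a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page
html = """
<div>
  <a class="book" href="/catalogue/book-1.html">A Light in the Attic</a>
  <a class="book" href="/catalogue/book-2.html">Tipping the Velvet</a>
  <a class="nav" href="/next.html">next</a>
</div>
""".strip()

root = ET.fromstring(html)

# XPath-style locator: every <a> element whose class attribute equals "book"
links = root.findall(".//a[@class='book']")

link_texts = [a.text for a in links]            # the text shown on the page
link_targets = [a.get("href") for a in links]   # the URL the link points to

print(link_texts)
print(link_targets)
```

The same locator returns the elements once; whether you then read the text or the attribute depends on what you want to store.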

Develop legal assessment content

Legality of Web Scraping

Video that answers the question of whether web scraping is legal. They share your view and illustrate this with recent court cases:

Have to make a clear distinction between types of data:
- Publicly available data (e.g., public LinkedIn profile)
  - User has made the data public
  - No account required for access
  - Not blocked by robots.txt
hiQ Labs case (scraped public LinkedIn profiles for workplace analytics)
Craigslist case (start-ups use their location data)
Why scraping publicly available information online isn't a crime
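
The "not blocked by robots.txt" criterion above can be checked programmatically. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `odcm-course-bot`, and the example.com URLs are assumptions for illustration. The check only covers this one criterion, not the other two.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally this is fetched from the site itself
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def is_allowed(url, user_agent="odcm-course-bot"):
    """Return True if robots.txt does not block this URL for the given agent."""
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/profiles/jane"))
print(is_allowed("https://example.com/private/data"))
```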

Originally posted by @RoyKlaasseBos in #14 (comment)

add chatbot?

grading rubric for team project

think about objective performance criteria
cast those in a rubric

Here are a few example assignments / rubrics:

Tutorial dev: API 101 (FEEDBACK GIULI)

@hannesdatta post material from Twitter API here.

add to web data advanced

Web data advanced currently shows how to scroll through the entire page.

Yet, it may be super useful for students to learn how to only scroll "once", or "twice", or a little bit.

Please add a little section to the tutorial where this is done (inspiration can be found in students' project submissions).
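
One possible starting point for such a section is sketched below. The helper name `scroll_n_times` and its defaults (800 pixels, 1 second pause) are assumptions, not taken from the existing tutorial. It relies on Selenium's `execute_script` method and scrolls a fixed number of times instead of all the way to the bottom.

```python
import time

def scroll_n_times(driver, n=1, pixels=800, pause=1.0):
    """Scroll down `n` times by `pixels`, pausing so new content can load."""
    for _ in range(n):
        # window.scrollBy moves relative to the current scroll position
        driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)
        time.sleep(pause)

# Usage with a real browser (requires selenium and a browser driver; not run here):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com")
# scroll_n_times(driver, n=2)   # scroll "twice"
```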

Inspiration

@RoyKlaasseBos, which firms/contact persons for podcast?

finalize video editing and make available clips

Tutorial dev: Web Scraping Advanced (FEEDBACK)

verify colleague feedback has been integrated

discuss w/ Hendrik
ask Suzan

Tutorial dev: Web scraping 101 (FEEDBACK GIULI)

e.g., kickstart with this one

https://github.com/hannesdatta/scraping_workshop

Video setup

@andreantonacci, can you please make a suggestion for the recording of clips? This is a good synergy between TSH and this project.

Think about two setups:

at home recording (with OBS); extra equipment needed?
studio recording in Tilburg (they also use OBS I think...)
We need visuals for TSH, but also for this course. Set it up in a way so that we can re-render the course videos in TSH style after this.
Which equipment do we need?
How to have our slide deck if we wanna show things? Or use post-processing?

This one looks cool:

https://www.youtube.com/watch?v=V_KcGS4whJc (check this one out in terms of overlays, etc.)

This one is less ideal, way too traditional:

I want low effort on my side... I'm not a video editor but a prof ;).

Design revamp

Please coordinate with a designer on Fiverr, to make the rendered website mimic the look and feel of tilburguniversity.edu. Potentially a good input for more open education projects at TilburgU ("just clone our course and put in your content").

find a way to notify students about updates

role of GitHub

decide on role of GitHub in this course

ask giuli for feedback

development of interactive livestream #2

define requirements for API proxy (pref in Python)

Please write fiverr job description for an API "tunnel"

throttling/retrieval limits
students get own API key (generated by me)
"protect" my own API key
logging of requests per API Key
python script
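
To make the requirements above concrete for the job description, here is a rough sketch of the gatekeeping logic only. The class name `ApiKeyGate`, the limits, and the log format are hypothetical. Forwarding the request to the real API (where the instructor's own key would be attached server-side, so students never see it) is left out.

```python
import time
import secrets
from collections import defaultdict, deque

class ApiKeyGate:
    """Tracks student keys, enforces a per-key request limit, and logs usage."""

    def __init__(self, max_requests=5, window_seconds=60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.valid_keys = set()
        self.recent = defaultdict(deque)  # student key -> timestamps in the window
        self.log = []                     # (timestamp, student key, endpoint, allowed)

    def issue_key(self):
        """Generate a new student key (handed out by the instructor)."""
        key = secrets.token_hex(8)
        self.valid_keys.add(key)
        return key

    def allow(self, student_key, endpoint):
        """Return True if the request may be forwarded to the real API."""
        now = self.clock()
        allowed = False
        if student_key in self.valid_keys:
            stamps = self.recent[student_key]
            # Drop timestamps that have fallen out of the sliding window
            while stamps and now - stamps[0] >= self.window_seconds:
                stamps.popleft()
            if len(stamps) < self.max_requests:
                stamps.append(now)
                allowed = True
        # Every attempt is logged, including rejected ones
        self.log.append((now, student_key, endpoint, allowed))
        return allowed
```

A web framework would call `allow()` before each forwarded request and return an error response when it yields `False`.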

@RoyKlaasseBos, please finalize so we can ship to Fiverr.

add questions ("food for thought") to workflow

Content of team project

Define broadly the content of the team project.

add software installation to tutorial page

--> http://127.0.0.1:1313/docs/tutorials/software/

Add content from slide deck / link to Tilburg Science Hub to page.

Goal: page should be viewable soon.

revise building blocks, add where needed

finalize team project requirements

Tutorial dev: testing

Giuli has offered to test our notebooks and tutorials. 🤩

Conceptual framework

@RoyKlaasseBos, as discussed, would you be able to think critically about the framework, please?
new_framework.pptx

Maybe there are ways to improve it. Will become the backbone of course (and paper).

A few ideas to improve are enough for a start. Let's see what you can come up with.

Tutorial dev: Software installation

Please complete / extend learning goals where you see fit.
Let us draft a script ("draaiboek") for a video to record, for both Mac users as well as Windows users [no Linux, right?]
Let's wait for recording until we know the final set of packages we will be using in the class

remove complicated iterators from web data 101

e.g., next(book for book in books), etc.
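
For reference, a sketch of what a simpler replacement could look like. The `books` list is invented for illustration and is not the data used in the tutorial.

```python
# Hypothetical list of scraped books
books = [
    {"title": "A Light in the Attic", "price": 51.77},
    {"title": "Tipping the Velvet", "price": 53.74},
]

# Compact, but hard to read for beginners:
first_compact = next(book for book in books)

# Simpler alternative 1: indexing
first_by_index = books[0]

# Simpler alternative 2: an explicit loop that stops after the first element
first_by_loop = None
for book in books:
    first_by_loop = book
    break

print(first_by_loop["title"])
```

All three variables point to the same first element; only the readability differs.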

Slide template design

Define how slides are going to be made
Code-based system

find a setup so I can cut down recording editing

decide how students track progress on learning goals

e.g., "pulse"

open edu logo?

Hi Andrea,

could you please think about a quick way to energise this logo a bit? The notion needs to be that this is TilburgU, but open! Like Tilburg University (with a small text): open education? Or something like this? Buzzy, marketing-ish. Would be cool if you could think about it for a minute or so ;).

decide which self- and peer-assessment tool to use

options:

flask (self-programmed)
excel / google forms
internal tiu?

adapt learning goals

Identify web data sources and evaluate their value in the context of a specific research question or business problem
Evaluate the appropriateness of a data source (and find relevant data collection method & assess feasibility)
Collect data via web scraping and Application Programming Interfaces (APIs) by mixing, extending and repurposing code snippets
Solve encountered problems and provide solutions
Assess the terms and conditions for collecting, storing, and sharing data
Transform semi-structured JSON data to structured data sets for statistical analysis (“parsing”)
Draft, execute, monitor and audit online data collections locally (and remotely)
Document and archive collected data, and make it available for public (re)use
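
As an illustration of the "parsing" goal, the sketch below flattens nested JSON records into rows and writes them as CSV, using only the standard library. The JSON structure and field names are invented and are not taken from a specific API.

```python
import csv
import io
import json

# Hypothetical API response: nested, with a missing field in the second record
raw = """
[
  {"id": 1, "user": {"name": "alice", "followers": 120}, "tags": ["music", "pop"]},
  {"id": 2, "user": {"name": "bob"}, "tags": []}
]
"""

def flatten(record):
    """Turn one nested record into a flat row with fixed column names."""
    user = record.get("user", {})
    return {
        "id": record["id"],
        "user_name": user.get("name"),
        "user_followers": user.get("followers"),  # None when the field is missing
        "n_tags": len(record.get("tags", [])),
    }

rows = [flatten(r) for r in json.loads(raw)]

# Write to an in-memory buffer here; a real script would open a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```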

Tutorial dev: Web data for dummies

Content to kickstart:

https://hannesdatta.github.io/course-jads2020/sessions/template_api_mixer.html

https://hannesdatta.github.io/course-jads2020/sessions/webscraper_socialblade.html

https://hannesdatta.github.io/course-jads2020/sessions/workshop.html

Development of content hierarchy for building blocks

update 101 terminology with web scraping paper terminology

sharpen Motivation of Project

Many students seem to have missed that the project is not about the RQ per se, but about generating the data

test new Hugo template

Andrea made some changes to the Hugo template and made it available at github.com/tilburgsciencehub/hugo-tiu.

Please use this template in this repository and verify the site still builds correctly.
Please give feedback to @andreantonacci on whether everything ran smoothly.
Please also roll out to github.com/hannesdatta/course-dprep.

@h. Datta I had to change some things from the version you’re currently using in oDCM – maybe you could use this new theme and let me know if it still works?

Cheers,
Andrea

Include tips/lessons learnt from other instructors in own content

Podcasts

Here are the highlights of a podcast about web scraping by a bootcamp instructor that touches on a variety of topics related to oDCM (navigating the DOM, cleaning text data, timers, selenium vs other tools)

Kimberly Fessel (PhD) - Metis
Requests and BeautifulSoup
Two strategies:
1. Look for unique attributes (ids / classes)
2. Navigate the Document Object Model - DOM (children, siblings) ~ tree-like structure
Good practice websites:
- Boxofficemojo
  - Well-structured
  - Movie revenue, publish date, actors
- Sports Reference
  - Player level stats (baseball, basketball, football, hockey)
- Wikipedia
  - But keep in mind the maximum scraping limit (1 page per second)
  - Tutorial
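
A limit such as one page per second can be respected with a small wrapper around whatever function does the actual request. The sketch below is one way to do it; the name `fetch_politely` and its parameters are hypothetical, and `fetch` stands for any function that takes a URL and returns its content.

```python
import time

def fetch_politely(urls, fetch, min_interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call `fetch(url)` for each URL, keeping calls at least `min_interval` seconds apart."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet passed
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(url))
    return results

# Possible usage with the requests library (not run here):
# import requests
# pages = fetch_politely(urls, fetch=lambda u: requests.get(u).text)
```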

Importance of including pauses in your request (to avoid getting blocked)
- It could be blocked for an hour, a day, but also indefinitely
Importance of saving the data you have collected (site structure may change over time)
- Write as a csv or store as pickle files
Start out in Jupyter Notebooks to make sure you have the right syntax to get the data → convert to a Python script → set-up scheduling
Stripping out characters ("$", ",", non-printing characters)
- Importance of regular expressions (start out with replace() initially)
- Convert dates and times to pandas timeseries
Limitations of BeautifulSoup and requests
- Does not work for sites that are dynamically loading content (hitting a database and pulling in information).
- Mostly JavaScript websites (YouTube, Open Table)
- Selenium is the solution; launches a Google Chrome driver; sometimes it is as simple as launching the site with Selenium and then processing the data with requests and BeautifulSoup.
  - Other advantages: clicking on things and filling out fields
- Scrapy - cloud deployment and building a "spider" (a scraper that keeps going and looks for new links)
- Importance of visualising your results dynamically/interactively (D3, Plotly, Tableau)
- Data widgets getting more mainstream (e.g., NYT) - people getting more data literate
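
The cleaning steps mentioned in the notes (first `replace()`, then regular expressions, then date conversion) could look roughly like this. The raw values are invented, and the standard library's `datetime` is used here as a stand-in; in pandas, `pd.to_datetime` would play the same role for whole columns.

```python
import re
from datetime import datetime

raw_revenue = "$1,234,567\u200b"  # hypothetical value ending in a non-printing character
raw_date = "Nov 3, 2020"          # hypothetical scraped release date

# Step 1: plain replace() calls handle the visible characters...
simple = raw_revenue.replace("$", "").replace(",", "")
# ...but the zero-width space at the end is still there.

# Step 2: a regular expression keeps only digits and the decimal point
revenue = float(re.sub(r"[^0-9.]", "", raw_revenue))

# Step 3: parse the date string (month abbreviations assume an English locale)
release_date = datetime.strptime(raw_date, "%b %d, %Y")

print(revenue, release_date.date())
```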

Legality of Web Scraping

Video that answers the question of whether web scraping is legal. They share your view and illustrate this with recent court cases:

Have to make a clear distinction between types of data:
- Publicly available data (e.g., public LinkedIn profile)
  - User has made the data public
  - No account required for access
  - Not blocked by robots.txt
hiQ Labs case (scraped public LinkedIn profiles for workplace analytics)
Craigslist case (start-ups use their location data)
Why scraping publicly available information online isn't a crime

Originally posted by @RoyKlaasseBos in #14 (comment)

Define learning goals for all tutorials

I have pushed updates to the dev branch with three specific tutorials:

software setup
python bootcamp
web scraping 101 (example, WITH apis - take out), in \src

Please work on these three tutorials, by producing text-based (bullet point style) versions of the tutorial. For python, please only gather relevant links from Datacamp.

Let's not try to be perfect in the first iteration, but rather see the first step towards a more final version.

Also recall our discussion:

Mandatory Notebook that we designed
Mandatory Datacamp courses (decide about order)
Optional courses.
