hannesdatta / course-odcm

This repository hosts the course website of Tilburg University's open education class on "Online Data Collection and Management" (oDCM) - learn how to collect web data for your empirical research projects!

Home Page: https://odcm.hannesdatta.com

JavaScript 0.04% HTML 86.69% SCSS 0.18% Shell 0.01% Jupyter Notebook 12.95% Python 0.13%

course-odcm's People

Contributors

agemarks, akram-coding, ana-bianca-luca, andreantonacci, bodr101, gknox79, hannesdatta, juliehabets, marjoleineee, nazlialagoz, ralphgit21, royklaassebos, vscanturk

course-odcm's Issues

intertwine project with course

As I mentioned earlier, I think it would be good practice to structure the team projects a bit more. Here's an example of what I mean: a site followed by a list of potential research ideas. With regard to this course, we may want to ask students to submit all their business/academic research ideas before the lecture in week 2 -> group them by category and site -> curate and validate the most interesting ideas -> publish those as a source of inspiration for students.

This also makes students think more carefully about how to structure, preprocess, and store their data (regardless of whether they actually conduct the analysis or not).

Originally posted by @RoyKlaasseBos in #7 (comment)

decide whether extra "business cases" need to be incorporated

Learning goal of podcasts:
(1) How is web scraping used by companies?

business cases: Skyscanner for flight prices, inprijsverhoogd for market research
how do they set up their infrastructure: how often do they scrape, how do they store their data?
ever had legal concerns or asked for permission?
what's the future of scraping, in your opinion?
(2) How are APIs used by firms?

disclose data, functionality, or algorithms (contacts w/ google cloud); chartmetric API
what does your business do, exactly?
is API your core activity, or just a side activity? (powering the business, vs. being part of the actual business model)
how many customers do you have? and what type of customers?
what's the payment model you're using? why have you opted for that, versus another one? (e.g., free tiers, free developer plan, etc).
how do you decide on API retrieval thresholds?
what does your backend look like - all self-coded, versus some kind of provider?
OTHER QUESTIONS?
@RoyKlaasseBos: in the spirit of Hilke's comments: maybe such podcasts or video discussions exist already? Can you go look for them?

was originally posted as #14

Create Tilburg `book` theme

@andreantonacci; after the revision from @agemarks, can you please

  • clone the original hugo book theme, and
  • replace their book design with our updated tilburg design
  • keep the content of the basic book template as is
  • paste a raw course structure (on the basis of our sections: course, syllabus, etc.) in the EXAMPLE section of the page
  • update the readme of the book theme, and
  • release it under @tilburgsciencehub\open-education-template?

No rush on this, but let's get this done eventually so others can join soon.

Is end of Nov. realistic?

Thanks.

Collection of scraping and API examples

The example section in dev will feature links and short descriptions to publicly available scrapers and APIs.

Please collect and organize these links. The main goal is to offer students a starting point to get inspired for their own projects.

Possible content headers are:

  • Code/repositories
    • Web scraping projects with code
    • API data retrieval projects with code
  • Tools used for web scraping
  • Other tutorials on the web
    ...

Think broadly about useful categories, and then please browse the web for inspiration. Issue #2 has a couple of links already.

Content suggestions

  • way more detail about XPATH/CSS
  • difference between href_text and href_attribute when locating elements
  • think about an "XPATH CHALLENGE": users put in their locators and the class needs to try them; the first one gets a prize
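
To make the text-versus-attribute distinction concrete, here is a minimal sketch. It assumes that "href_text" means the visible text of a link and "href_attribute" means the value of its href attribute; the markup is made up for illustration, and the standard library's ElementTree (which supports only a subset of XPath) stands in for whichever scraping library the tutorial ends up using.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
SNIPPET = """
<div>
  <a class="next" href="/page/2">Next page</a>
  <a class="prev" href="/page/0">Previous page</a>
</div>
"""

root = ET.fromstring(SNIPPET)

# XPath-style locator: any <a> element whose class attribute equals "next".
link = root.find(".//a[@class='next']")

link_text = link.text            # what the user sees on the page
link_target = link.get("href")   # where the link points to

print(link_text)
print(link_target)
```

The same locator idea carries over to CSS selectors (for example `a.next`) in libraries that support them.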

Develop legal assessment content

Legality of Web Scraping

Video that answers the question of whether web scraping is legal. They share your view and illustrate this with recent court cases:

  • Have to make a clear distinction between types of data:
    • Publicly available data (e.g., public LinkedIn profile)
      • User has made the data public
      • No account required for access
      • Not blocked by robots.txt
  • hiQ Labs case (scraped public LinkedIn profiles for workplace analytics)
  • Craigslist case (start-ups use their location data)
  • Why scraping publicly available information online isn't a crime
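
For the robots.txt criterion in the list above, a check along these lines could be shown to students. This is only a sketch: the robots.txt content, the URLs, and the user agent name are hypothetical, and passing this check says nothing about the other criteria or about legality as such.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="odcm-course-scraper"):
    """Return True if robots.txt does not block this user agent from the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/profiles/jane"))
print(is_allowed("https://example.com/private/data"))
```

Against a live site, `set_url()` followed by `read()` would load the real file instead of the hard-coded string.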

Originally posted by @RoyKlaasseBos in #14 (comment)

add to web data advanced

Web data advanced currently shows how to scroll through the entire page.

Yet, it may be super useful for students to learn how to only scroll "once", or "twice", or a little bit.

Please add a little section to the tutorial where this is done (inspiration can be found in students' project submissions).
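
A possible starting point for that section is sketched below. It assumes the tutorial uses a Selenium WebDriver, whose `execute_script()` method runs JavaScript in the page; the helper name `scroll_steps` and its defaults are made up for illustration.

```python
import time


def scroll_steps(driver, steps=1, pause=1.0):
    """Scroll down by one viewport height, `steps` times.

    `driver` is expected to be a Selenium WebDriver (or anything else that
    offers an execute_script() method). `pause` gives the page time to load
    new content between scrolls.
    """
    for _ in range(steps):
        # scrollBy moves relative to the current position, so each call
        # scrolls "a little bit" rather than to the bottom of the page.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        time.sleep(pause)
```

With a running driver, `scroll_steps(driver, steps=1)` scrolls once and `scroll_steps(driver, steps=2)` scrolls twice; replacing `window.innerHeight` with a fixed pixel value would give finer control.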

Video setup

@andreantonacci, can you please make a suggestion for the recording of clips? This is a good synergy between TSH and this project.

Think about two setups:

  • at home recording (with OBS); extra equipment needed?

  • studio recording in Tilburg (they also use OBS I think...)

  • We need visuals for TSH, but also for this course. Set it up in a way so that we can re-render the course videos in TSH style after this.

  • Which equipment do we need?

  • How to have our slide deck if we wanna show things? Or use post-processing?

This one looks cool:

https://www.youtube.com/watch?v=V_KcGS4whJc (check this one out in terms of overlays, etc.)

image

This one is less ideal, way too traditional:

image

I want low effort on my side... I'm not a video editor but a prof ;).

Design revamp

Please coordinate with a designer on Fiverr, to make the rendered website mimic the look and feel of tilburguniversity.edu. Potentially a good input for more open education projects at TilburgU ("just clone our course and put in your content").

Conceptual framework

@RoyKlaasseBos, as discussed, would you be able to think critically about the framework, please?
new_framework.pptx

Maybe there are ways to improve it. It will become the backbone of the course (and paper).

A few ideas to improve are enough for a start. Let's see what you can come up with.

Tutorial dev: Software installation

  • Please complete / extend learning goals where you see fit.
  • Let us draft a "draaiboek" for a video to record, for both Mac users as well as Windows users [no Linux, right?]
  • Let's wait for recording until we know the final set of packages we will be using in the class

open edu logo?

Hi Andrea,

could you please think about a quick way to energise this logo a bit? The notion needs to be that this is TilburgU, but open! Like Tilburg University (with a small text): open education? Or something like this? Buzzy, marketing-ish. Would be cool if you could think about it for a minute or so ;).

image

adapt learning goals

  • Identify web data sources and evaluate their value in the context of a specific research question or business problem
  • Evaluate the appropriateness of a data source (and find relevant data collection method & assess feasibility)
  • Collect data via web scraping and Application Programming Interfaces (APIs) by mixing, extending and repurposing code snippets
  • Solve encountered problems and provide solutions
  • Assess the terms and conditions for collecting, storing, and sharing data
  • Transform semi-structured JSON data to structured data sets for statistical analysis (“parsing”)
  • Draft, execute, monitor and audit online data collections locally (and remotely)
  • Document and archive collected data, and make it available for public (re)use
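
As an illustration of the "parsing" goal in the list above, the sketch below flattens nested JSON records into rows and writes them as CSV. The sample records, the column naming scheme, and the choice to join list values with ";" are all assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical API response, invented for this example.
RAW = """
[
  {"id": 1, "user": {"name": "anna", "followers": 10}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "ben"}, "tags": []}
]
"""


def flatten(record, parent_key="", sep="_"):
    """Turn a nested dict into a flat dict with keys like 'user_name'."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # Simplification: store list items in one cell, separated by ";".
            flat[new_key] = ";".join(str(item) for item in value)
        else:
            flat[new_key] = value
    return flat


rows = [flatten(record) for record in json.loads(RAW)]

# Records may have different keys, so collect the union as column names.
columns = sorted({key for row in rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Missing fields (here, the follower count of the second record) end up as empty cells, which is one of several reasonable conventions.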

test new Hugo template

Andrea made some changes to the hugo template and made it available at github.com/tilburgsciencehub/hugo-tiu.

  • Please use this template in this repository and verify the site still builds correctly.
  • Please give feedback to @andreantonacci on whether everything ran smoothly.
  • Please also roll out to github.com/hannesdatta/course-dprep.

@h. Datta I had to change some things from the version you’re currently using in oDCM – maybe you could use this new theme and let me know if it still works?

Cheers,
Andrea

Include tips/lessons learnt from other instructors in own content

Podcasts

Here are the highlights of a podcast about web scraping by a bootcamp instructor that touches on a variety of topics related to oDCM (navigating the DOM, cleaning text data, timers, selenium vs other tools)

  • Kimberly Fessel (PhD) - Metis
  • Requests and BeautifulSoup
  • Two strategies:
    1. Look for unique attributes (ids / classes)
    2. Navigate the Document Object Model - DOM (children, sibling) ~ tree like structure
  • Good practice websites:
    • Boxofficemojo
      • Well-structured
      • Movie revenue, publish date, actors
    • Sports Reference
      • Player level stats (baseball, basketball, football, hockey)
    • Wikipedia
      • But keep in mind the maximum scraping limit (1 page per second)
      • Tutorial
  • Importance of including pauses in your request (to avoid getting blocked)
    • It could be blocked for an hour, a day, but also indefinitely
  • Importance of saving the data you have collected (site structure may change over time)
    • Write as a csv or store as pickle files
  • Start out in Jupyter Notebooks to make sure you have the right syntax to get the data → convert to a Python script → set-up scheduling
  • Stripping out characters ("$", ",", non-printing characters)
    • Importance of regular expressions (start out with replace())
    • Convert dates and times to pandas time series
  • Limitations of BeautifulSoup and Requests
    • Do not work for sites that load content dynamically (hitting a database and pulling in information).
    • Mostly JavaScript websites (YouTube, OpenTable)
    • Selenium is the solution; it launches a Google Chrome driver; sometimes it is as simple as launching the site with Selenium and then processing the data with Requests and BeautifulSoup.
      • Other advantages: clicking on things and filling out fields
    • Scrapy - cloud deployment and building a "spider" (a scraper that keeps going and looks for new links)
    • Importance of visualising your results dynamically/interactively (D3, Plotly, Tableau)
    • Data widgets getting more mainstream (e.g., NYT) - people getting more data literate
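
Two of the tips above, pausing between requests and stripping characters from scraped values, can be sketched as follows. The helper names, the sample strings, and the date format are hypothetical, and the example uses the standard library's `datetime` instead of pandas to stay self-contained.

```python
import re
import time
from datetime import datetime


def clean_amount(text):
    """Turn a string such as '$1,234,567' into an integer.

    Everything that is not a digit ("$", ",", non-printing characters)
    is removed, so this only suits whole-number amounts.
    """
    return int(re.sub(r"[^\d]", "", text))


def parse_date(text, fmt="%b %d, %Y"):
    """Parse a date such as 'Dec 18, 2009' (the format is an assumption)."""
    return datetime.strptime(text, fmt)


def fetch_all(urls, fetch, pause=1.0):
    """Call fetch(url) for each URL, waiting `pause` seconds in between.

    `fetch` is any function that takes a URL and returns the page content,
    for example a thin wrapper around an HTTP library.
    """
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(pause)  # pause to reduce the risk of getting blocked
        results[url] = fetch(url)
    return results


print(clean_amount("$1,234,567"))
print(parse_date("Dec 18, 2009").date())
```

Passing the fetch function as an argument keeps the pausing logic independent of the HTTP library, so the same loop works in a notebook and in a scheduled script.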

Define learning goals for all tutorials

I have pushed updates to the dev branch with three specific tutorials:

  • software setup
  • python bootcamp
  • web scraping 101 (example, WITH apis - take out), in \src

Please work on these three tutorials, by producing text-based (bullet point style) versions of the tutorial. For python, please only gather relevant links from Datacamp.

Let's not try to be perfect in the first iteration, but rather see the first step towards a more final version.

Tutorial dev: Python bootcamp

Please develop a Jupyter Notebook that teaches students the learning goals laid out in the Tutorial page on Python Bootcamp.

Note: you do not have to develop material from scratch, but can also make use of available material. If that material is available OO, we can integrate it in our own notebooks. You could also think about linking to datacamp courses first.

Also recall our discussion:

  • Mandatory Notebook that we designed
  • Mandatory Datacamp courses (decide about order)
  • Optional courses.
