Git Product home page Git Product logo

courses-intermediate-sql-queries's Introduction

Intermediate SQL Queries

Description

The role of a data scientist is to turn raw data into actionable insights. Much of the world's raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches syntax in SQL shared by many types of databases, such as PostgreSQL, MySQL, SQL Server, and Oracle. You will learn everything you need to know to begin working with databases today!

Learning objectives

  • Learn the basic language and structure of relational databases
  • Select columns of interest from database tables
  • Filter for records matching specific criteria
  • Compute helpful quantities like counts and averages, both overall and within groups
  • Arrange data based on the values of certain columns

Prerequisites

  • None

courses-intermediate-sql-queries's People

Contributors

adelnehme avatar amy-datacamp avatar beccarobins avatar chapmanj-2810 avatar colinricardo avatar datacamp-content-creator avatar filipsch avatar gor181 avatar isserp avatar jasminludolf avatar jng985 avatar jsaddlemyer avatar justmytwospence avatar klmedeiros avatar machow avatar mcls avatar mona-kay avatar ncarchedi avatar nicksolomon avatar ramnathv avatar richierocks avatar ruanne avatar saciestudio avatar shoninouye avatar tuurdutoit avatar vvnkr avatar yashasroy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

courses-intermediate-sql-queries's Issues

improve SCTs involving ordering

For example, WHERE clauses with multiple ANDs and ORs and stuff. This isn't an issue when the solution is correct, but when the solution is incorrect we need to provide feedback relevant to the position of the operands.

For example,

solution:

SELECT * 
FROM films
WHERE language = 'Spanish'
AND release_year = 2000;

student code:

SELECT * 
FROM films
WHERE release_year = 2000
AND language = 'French';

Currently, SCTs will point to the first part of the WHERE clause being incorrect (because the code doesn't match the student code), when in reality it's the second part which is incorrect.

How can we make these kinds of clauses robust to ordering, or is it even possible? Should we just say something like "Check your WHERE clause?", even if the problem is more granular? Perhaps we should use test_student_typed in conjunction with other SCT checks?

Is this exercise incomplete? Instructions, SCTs, etc.

--- type:TabExercise lang:sql xp:100 key:f7dcb9e122

All together now

Time to practice using ORDER BY, GROUP BY and HAVING together!

*** =pre_exercise_code

connect('postgresql', 'films')

*** =sample_code

SELECT release_year, ___(___(budget)) AS avg_budget, ROUND(AVG(___)) AS avg_box_office
FROM ___
___ release_year > 1990
GROUP BY ___
___ (budget) > 60000000
___ ___ release_year ___;

*** =type1: NormalExercise
*** =key1: a141cd669f

*** =instructions1
Get the rounded average budget and average box office earnings for films since 1990, but only if the average budget was greater than $60M in that year. Order by release year, in descending order.
*** =solution1

SELECT release_year, ROUND(AVG(budget)) AS avg_budget, ROUND(AVG(gross)) AS avg_box_office
FROM films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000
ORDER BY release_year DESC;

*** =sct1

sel = check_node('SelectStmt')

*** =key2: 1ed0ce7d61

*** =type2: NormalExercise
*** =instructions2
Get the name, average budget, and average box office take of countries that have made more than 10 films. Order the result by name.
*** =solution2

SELECT country, ROUND(AVG(budget)) AS avg_budget, ROUND(AVG(gross)) AS avg_box_office
FROM films
GROUP BY country
HAVING COUNT(title) > 10
ORDER BY country
LIMIT 5;

*** =sct2

sel = check_node('SelectStmt')

Feedback on films database for use in exercises

@filipsch Would greatly appreciate your feedback on the films database, which we plan to use throughout chapters 1-3 of the intro SQL course. There are 4 tables, currently in CSVs here.

The tables are:

  • films: films recorded in San Francisco
  • actors: actors featured in these films
  • locations: info on the locations in SF where the films were recorded
  • casts: a join table connecting actors to films (these are many-to-many)

Two things to keep in mind:

  1. There won't be any videos in this course, so all instruction will take place on the films data
  2. We're still planning to do the NYC Citi Bike case study in chapter 4

cc @colinedwardricardo

Add an on-boarding exercise

User Story

As a datacamp user, I want to understand the user interface better, since it is different from what I am used to for R and Python.

This exercise should familiarize the user with the following concepts:

  1. The tables displayed at the bottom of the interface are tables from the database.
  2. The results from running a query are displayed as Query Result.
  3. Syntax errors from the SQL engine are shown below, while feedback from SCTs are displayed in the sidebar.

cc @ncarchedi @colinricardo

Pre-launch TODOs

  • Add hints
    • ch1
    • ch2
    • ch3
    • ch4
  • Add SCTs
    • ch1
    • ch2
    • ch3
    • ch4
  • Improve ch4 narrative
  • Clean up ch4

@colinricardo a more efficient way of tracking TODOs

Instructor positioning in Welcome video

Summary

Almost a third of the instructor is offscreen in the Welcome to the Course video. It looks bad and can be easily fixed by adding the following code block to Slide Editing and modifying the values until the instructor fits between the right side of the hexagon and end of the screen:

transformations:
    translateX: 70
    translateY: -6
    scale: 1.45

Images

Intro to SQL:
introsqlstaging

Data Types for Data Science (positioning looks better here):
datatypesstaging

database lingo

@ncarchedi I want your thoughts on this. I don't think we should be expressly using 'record' and 'field' everywhere - I think these terms confuse people, and the majority of SQL tutorials online just use the terms 'rows' and 'columns'. Since this course is aimed at beginners, I think we should refer to their technical aliases (lol) once, but not again. Thoughts?

Indentation of code examples

Currently, all the examples queries are left aligned. For example,

SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;

Once queries get more complicated, they are clearer to read if indentation is used. For example, I'd naturally write the previous example as

SELECT title
  FROM films
  WHERE release_year > 1994
    AND release_year < 2000;

Feedback on code

Chapter 1

  • Since this is the films database all about films, it may make sense to kick things off with queries on the films table.
  • In count exercises, it seems you want to drive home the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column) a bit more by using the same column/table for all three. Ideally a column with both duplicate values and missing data (e.g. date of birth). Let's discuss if not clear why
  • Getting the total budget (SUM(budget)) doesn't make sense because of the different currencies. Maybe we can hint at this in the exercise as a teaser for filtering for only US films later in the course. (Or maybe this becomes clear in best and worst performers queries?)
  • Why is average duration v2? What's v1?

Get total number of unique dates.

  • This seems pretty contrived. Can we come up with a better example or just leave this one out?

Get the percentage of dead people.

  • Lol, maybe something slightly less morbid?

Chapter 2

Get films released in 1990 or released in 2000.

  • A more interesting example would be e.g. "Get films in English and Spanish."
  • I'm confused why we're using ORDER BY in chapter 2. It's not introduced until chapter 3.
  • Seems like we're doing a lot with years. Are there other interesting aspects of the data we can explore? For example, you're not filtering by any strings until the end of the chapter
  • Also, I'd like to see fewer contrived queries and more questions that people will be intrinsically motivated to answer, e.g. "Get all black and white movies released since 2010."

Chapter 3

Get films filmed in 2000 or 2015, in order of release, and alphabetically.

  • Feels contrived. How about "Get PG and PG-13 movies, ordered by release year and title"
  • Same for next one

Get count of films made in each year.

  • I think it's customary to do SELECT release_year, COUNT(release_year)...instead of the reverse so that reading the query result right to left, you see the groups first, then the counts within the groups

A PROPER GROUP BY ON MULTIPLE COLUMNS SHOULD GO HERE:

  • What are thinking here?

Get count of films, group by release year then order by release year.

  • Seems like this one should just be a slight variation on the first code chunk in this section
  • Some of your queries aren't valid (e.g. you ORDER BY count in one, but that column doesn't exist)

Get lowest box office earnings per year.

  • I want to know what movie this is!! Can bring in a subquery for this, since it's not as simple as adding the title column to the SELECT
  • Same for highest box office (next query)

Get lowest and highest box office earnings per year. Note: with this one, NULL release year still has some entries for min and max.

  • Can't you add a WHERE clause to leave out NULL years?

Chapter 4

  • I think narrative is key for this chapter. Can you dream up a storyline to put our queries into context? E.g. "you've been hired by the City of New York to understand bike share usage..."

Make repo public

It's a free course, so make it public? Just like intro-to-r and intro-to-python.

Too many tasks in 'GROUP BY practice' exercise

We shouldn't have more than 4-5 IMO, especially when there's already a vertical scroll. I'd suggest either moving some tasks to the next exercise or just getting rid of them entirely.

Student feedback

"Hi I have been completing your courses as part of my uni course to learn sql and coding with Jupyter notebook. I was given an assignment using csv and sqlite data files and I had to create a data visualization. Thebfirst data set was NSW crime at the following link: https://data.nsw.gov.au/data/dataset/c18d2459-47c7-46b8-b9db-91d9365b1850 and the second was soccer statistics from the Barcelona Olympics. The problem I faced was that there is no tutorial or any product that shows how to create a graph from start to finish. Are you able to create an example with my dataset explaining the step by step process from start to finish. Then with an sqlite file, then comparing the two files. Your tutorials are great, but lack a complete example with this type of data. The Tips data is too basic and in a format that is easy to use. Plus there is no examples showing how to graph an sqlite file that has multiple tabs. Your help would be greatly appreciated. Regards Bradley Halicek [email protected]"

Cf [email protected] #50568

Providing worked examples for new topics

@colinricardo I think in exercises like this where students are practicing something new for the first time, it makes sense to gradually ease them into writing more and more of the code themselves. Here, you could even imagine giving them all of the code for the first query and just asking them to run it, then adding 1-2 blanks in the next query, then 1-2 more, then no scaffolding, etc. Thoughts?

cc @ramnathv

Feedback on data

films

  • Put the director_id column before actor columns so when the student is previewing on a small screen, they will hopefully see the director and at least the first actor
  • Looks like there are still several dates > 2017 in the directors table. You should be checking your output more carefully. As a general rule, assume you fucked something up, then convince yourself otherwise.
  • Otherwise, I think this dataset is looking pretty good. The only other thing we might want to work on is digging up missing dob's and dod's, but that's not urgent

olympics

  • I think this dataset is looking good. Maybe the only other interesting thing we could add is another column for the country table (e.g. population)

Switching tables mid-exercise

I noticed we often switch tables mid-exercise, which makes it difficult to use bullet exercises (since they are best when we're making sequential and small adjustments to a block of code across tasks). Plus it adds unnecessary cognitive overhead for the student when we really want them to be focusing on the technique, not the data.

For example, this exercise: https://campus.datacamp.com/courses/1946/5261?ex=3

SQLBolt implements this strategy beautifully here: https://sqlbolt.com/lesson/select_queries_with_constraints

Each task is a slight variation the last, which provides smooth repetition for the student and keeps them in the flow.

course image not building

@filipsch I think the use of $GITHUB_TOKEN in requirements.sh is causing the course to fail. Currently, it's used to clone this repo, in order to build the database. Alternatives might be to

  1. move the code in data/films to a separate repo that the new dummy user has access to
  2. give the new dummy user read access to this repo

Feedback: SF films data

General:

  • Tables should be lowercase plural nouns (directors, films, locations)
  • Let's use lowercase and underscores for column names (e.g. release_year)

directors:

  • Can you sort by Name alphabetically`, then make the IDs sequential?
  • Anywhere there's multiple directors (e.g. row 66, 211), let's keep only the first and double-check the DOB
  • Row 24 is empty
  • Maybe missing values should be empty strings instead of none?

films:

  • Same comment as above, make IDs sequential (sorted alphabetically by title)
  • Table should be called films, not filming
  • Main Actor --> lead_actor

locations:

  • Same comment on IDs (sorted alphabetically by location name)
  • Row 10 is empty
  • I thought we had addresses, no?

Feedback on new films database

films:

  • This is subjective, but I'd order the columns by theme: id, title, release_year, country, duration, color, language, certification, gross, budget, review_id
  • Then order by release_year, title
  • Then assign new ids

people:

  • Looks good. I assume it's just date_of_birth and date_of_death to be added here?

reviews:

  • Also looks good. Just curious how ids were assigned? Not necessarily a problem, but ordering seems pretty random.

roles:

  • In an optimal database, I guess we'd have separate lookup tables for categoricals like role, country, certification, etc. but we don't want to overcomplicate things here. Since we're stopping one step short of this "ideal", let's just spell out the roles as "author" and "director" so it's self-explanatory.
  • I would order by film_id, person_id, role (in that order) then assign ids.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.