datacamp / courses-intermediate-sql-queries Goto Github PK

Intermediate SQL Queries by Nick Carchedi

R 31.11% Python 66.36% Shell 2.53%

courses-intermediate-sql-queries's Introduction

Intermediate SQL Queries

Teach: https://www.datacamp.com/teach/repositories/641
Campus: https://www.datacamp.com/courses/intermediate-sql-queries
Docs: https://authoring.datacamp.com

Description

The role of a data scientist is to turn raw data into actionable insights. Much of the world's raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches syntax in SQL shared by many types of databases, such as PostgreSQL, MySQL, SQL Server, and Oracle. You will learn everything you need to know to begin working with databases today!

Learning objectives

Learn the basic language and structure of relational databases
Select columns of interest from database tables
Filter for records matching specific criteria
Compute helpful quantities like counts and averages, both overall and within groups
Arrange data based on the values of certain columns

Prerequisites

None

courses-intermediate-sql-queries's People

Contributors

Stargazers

Watchers

Forkers

biroc quisqualis bgordts dandenney snowdj optionalg deng-jhparra debasish2256 jordanwade it-gro rbeninche dhivyab90 datacamp-teach-tester olugbengajj rushi152 vungocthien liuyangnorway gmpitsakakis habibmrad chihyulai john2912 anuradhajha99 lau87fl ambika0203 gor181 datacamp-content-public jalayrupera sli129 jennytian7 claudiaarf starstruckk camerinrencken sambacisse tufanmaitydev marcusrb sabrahashemi hannansatopay spshreena karagul wasimtalib masquerade-y anakinnn monikasivilli norayrmeliksetyan mpofukelvintafadzwa scotinastats fagan2888 brunocs86 arturo-2493 jrogers721 yuechen0830 abdulmunimawan kenshi98 abdillahi-a plthiyagu sureshmelvinsigera mishackdev benitadiop sbarjak umw0lverine vns1025 sivareddym1 ani-68shende yujiay nishanthkumarpathi theakinman vrutabaldha08 rjayhub data-analisis farrelnvli manhng83 mesumraza zachary1754 varundeshmukh007 andreipazelsky shasre ruialv danieljjjjjj upinfo ikhwan1366 diegolagre slavikonnikov marshanette studentwhj303 halilergul1 jadamoureen cauwerda zainabasal khairuldzulqarnain rawidaniel wesenu gkuo06 bakr-al-tamimi

courses-intermediate-sql-queries's Issues

make sure division scts are working correctly

waiting for this: datacamp/sqlwhat#94

improve SCTs involving ordering

For example, WHERE clauses with multiple ANDs and ORs and stuff. This isn't an issue when the solution is correct, but when the solution is incorrect we need to provide feedback relevant to the position of the operands.

For example,

solution:

SELECT * 
FROM films
WHERE language = 'Spanish'
AND release_year = 2000;

student code:

SELECT * 
FROM films
WHERE release_year = 2000
AND language = 'French';

Currently, SCTs will point to the first part of the WHERE clause being incorrect (because the code doesn't match the student code), when in reality it's the second part which is incorrect.

How can we make these kinds of clauses robust to ordering, or is it even possible? Should we just say something like "Check your WHERE clause?", even if the problem is more granular? Perhaps we should use test_student_typed in conjunction with other SCT checks?

make sure aliases are clearly instructed

add SCTs chapter 2

add hints chapter 3

Is this exercise incomplete? Instructions, SCTs, etc.

--- type:TabExercise lang:sql xp:100 key:f7dcb9e122

All together now

Time to practice using ORDER BY, GROUP BY and HAVING together!

*** =pre_exercise_code

connect('postgresql', 'films')

*** =sample_code

SELECT release_year, ___(___(budget)) AS avg_budget, ROUND(AVG(___)) AS avg_box_office
FROM ___
___ release_year > 1990
GROUP BY ___
___ (budget) > 60000000
___ ___ release_year ___;

*** =type1: NormalExercise
*** =key1: a141cd669f

*** =instructions1
Get the rounded average budget and average box office earnings for films since 1990, but only if the average budget was greater than $60M in that year. Order by release year, in descending order.
*** =solution1

SELECT release_year, ROUND(AVG(budget)) AS avg_budget, ROUND(AVG(gross)) AS avg_box_office
FROM films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000
ORDER BY release_year DESC;

*** =sct1

sel = check_node('SelectStmt')

*** =key2: 1ed0ce7d61

*** =type2: NormalExercise
*** =instructions2
Get the name, average budget, and average box office take of countries that have made more than 10 films. Order the result by name.
*** =solution2

SELECT country, ROUND(AVG(budget)) AS avg_budget, ROUND(AVG(gross)) AS avg_box_office
FROM films
GROUP BY country
HAVING COUNT(title) > 10
ORDER BY country
LIMIT 5;

*** =sct2

sel = check_node('SelectStmt')

averaging order

avg([1, 2, 3]) / 2 != avg([1, 2, 3] / 2)

Update keyword colors | Chapter 2

In NULL and IS NULL, and LIKE and NOT LIKE, all SQL keywords should probably appear in blue. This happens when using the NOT keyword as well.

These are the exercises where these keywords are introduced, but it seems to me the color issue is arises on all SQL exercises where these keywords are used, like Sorting single columns (2) where LIKE and IN remain black.

Screenshots

Cf [email protected], #65775
Cc @bakera81

Feedback on films database for use in exercises

@filipsch Would greatly appreciate your feedback on the films database, which we plan to use throughout chapters 1-3 of the intro SQL course. There are 4 tables, currently in CSVs here.

The tables are:

films: films recorded in San Francisco
actors: actors featured in these films
locations: info on the locations in SF where the films were recorded
casts: a join table connecting actors to films (these are many-to-many)

Two things to keep in mind:

There won't be any videos in this course, so all instruction will take place on the films data
We're still planning to do the NYC Citi Bike case study in chapter 4

cc @colinedwardricardo

add hints chapter 1

make sure DISTINCT scts working correctly

see here: datacamp/sqlwhat#99

Add an on-boarding exercise

User Story

As a datacamp user, I want to understand the user interface better, since it is different from what I am used to for R and Python.

This exercise should familiarize the user with the following concepts:

The tables displayed at the bottom of the interface are tables from the database.
The results from running a query are displayed as Query Result.
Syntax errors from the SQL engine are shown below, while feedback from SCTs are displayed in the sidebar.

cc @ncarchedi @colinricardo

course review

Review of entire course prior to beta testing.

Switch all tab exercises to bullet exercises

@colinricardo discussed this morning with mgmt and we decided to go all bullet exercises until the distinct use cases for tab vs. bullet are more clear (and to avoid unnecessary confusion among students). Can you do this?

Clarify AVG() is mean in 'Aggregate functions'

In Ch2 'Aggregate functions', the AVG() function is described as calculating the average value.

Since our target audience is data scientists, we should clarify that it is the mean.

Adding instructions in MCQ tabs

If I have a TabExercise, with child MCQs, is there a way to add instructions to them? See screenshot below.

add hints chapter 2

add SCTs chapter 3

Improve SCT for Exercise 6 in Sorting and Grouping

This is the exercise in question. User feedback reads:

Hint suggested my answer was wrong because I used “count(*)” instead of “count(title)”. Given there are no NULL titles, these two queries are equivalent. (actual issue with my query was that I hadn’t added the GROUP BY yet)

@colinricardo cc @ncarchedi

Unexpected feedback/code highlighting

@colinricardo is this the expected behavior when I select an incorrect column or do we need to make a change to sqlwhat? This is the 3rd task of SELECTing Multiple Columns.

add hints chapter 4

get chapter 1 ready for beta testing

Make MC answer less obvious!

This is the exercise in question. User feedback reads

Answer to question is literally on the page. No thinking required.

Can we make the answer to this question less obvious so that the user has to do some work 😄

@colinricardo cc @ncarchedi

Typo in 'Even more aliasing'

'is a decimal' should read 'has a decimal'

Pre-launch TODOs

@colinricardo a more efficient way of tracking TODOs

Instructor positioning in Welcome video

Summary

Almost a third of the instructor is offscreen in the Welcome to the Course video. It looks bad and can be easily fixed by adding the following code block to Slide Editing and modifying the values until the instructor fits between the right side of the hexagon and end of the screen:

transformations:
    translateX: 70
    translateY: -6
    scale: 1.45

Images

Intro to SQL:

Data Types for Data Science (positioning looks better here):

Link to Joining Data in PostgreSQL to close course

Might be a great way to get non-subscribers to subscribe and take the follow-up course. Suggested via community slack in DataCamp USA

database lingo

@ncarchedi I want your thoughts on this. I don't think we should be expressly using 'record' and 'field' everywhere - I think these terms confuse people, and the majority of SQL tutorials online just use the terms 'rows' and 'columns'. Since this course is aimed at beginners, I think we should refer to their technical aliases (lol) once, but not again. Thoughts?

Should SELECT * pass for SELECT XXX, YYY?

For example, in the second instruction of this exercise: https://campus.datacamp.com/courses/intro-to-sql-for-data-science/selecting-and-summarizing-columns?ex=3

Do we want this to pass? This is a real question.

SELECT *
FROM films;

Bad error feedback in WHERE IN task 2

I selected the wrong column, but the feedback pointed to the IN clause.

The SCTs look OK to me, but I'm not properly familiar with them yet.

See lines 1277 to 1296 of https://github.com/datacamp/courses-intro-to-sql/blob/master/chapter2.md

Please double check that the SCTs are correct. If not, fix them, if so, please migrate this issue to pm-learn.

Indentation of code examples

Currently, all the examples queries are left aligned. For example,

SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;

Once queries get more complicated, they are clearer to read if indentation is used. For example, I'd naturally write the previous example as

SELECT title
  FROM films
  WHERE release_year > 1994
    AND release_year < 2000;

Feedback on code

Chapter 1

Since this is the films database all about films, it may make sense to kick things off with queries on the films table.
In count exercises, it seems you want to drive home the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column) a bit more by using the same column/table for all three. Ideally a column with both duplicate values and missing data (e.g. date of birth). Let's discuss if not clear why
Getting the total budget (SUM(budget)) doesn't make sense because of the different currencies. Maybe we can hint at this in the exercise as a teaser for filtering for only US films later in the course. (Or maybe this becomes clear in best and worst performers queries?)
Why is average duration v2? What's v1?

Get total number of unique dates.

This seems pretty contrived. Can we come up with a better example or just leave this one out?

Get the percentage of dead people.

Lol, maybe something slightly less morbid?

Chapter 2

Get films released in 1990 or released in 2000.

A more interesting example would be e.g. "Get films in English and Spanish."
I'm confused why we're using ORDER BY in chapter 2. It's not introduced until chapter 3.
Seems like we're doing a lot with years. Are there other interesting aspects of the data we can explore? For example, you're not filtering by any strings until the end of the chapter
Also, I'd like to see fewer contrived queries and more questions that people will be intrinsically motivated to answer, e.g. "Get all black and white movies released since 2010."

Chapter 3

Get films filmed in 2000 or 2015, in order of release, and alphabetically.

Feels contrived. How about "Get PG and PG-13 movies, ordered by release year and title"
Same for next one

Get count of films made in each year.

I think it's customary to do SELECT release_year, COUNT(release_year)...instead of the reverse so that reading the query result right to left, you see the groups first, then the counts within the groups

A PROPER GROUP BY ON MULTIPLE COLUMNS SHOULD GO HERE:

What are thinking here?

Get count of films, group by release year then order by release year.

Seems like this one should just be a slight variation on the first code chunk in this section
Some of your queries aren't valid (e.g. you ORDER BY count in one, but that column doesn't exist)

Get lowest box office earnings per year.

I want to know what movie this is!! Can bring in a subquery for this, since it's not as simple as adding the title column to the SELECT
Same for highest box office (next query)

Get lowest and highest box office earnings per year. Note: with this one, NULL release year still has some entries for min and max.

Can't you add a WHERE clause to leave out NULL years?

Chapter 4

I think narrative is key for this chapter. Can you dream up a storyline to put our queries into context? E.g. "you've been hired by the City of New York to understand bike share usage..."

Make repo public

It's a free course, so make it public? Just like intro-to-r and intro-to-python.

Too many tasks in 'GROUP BY practice' exercise

We shouldn't have more than 4-5 IMO, especially when there's already a vertical scroll. I'd suggest either moving some tasks to the next exercise or just getting rid of them entirely.

Clean up chapter 4

ORDER BY not tested in 'GROUP BY practice 2' task 4

Instructions say 'Sort your results by release year and country'. Exercise passes without the sorting.

Add success messages

At the moment they are all just "Correct" or "Great work".

Add code examples to all assignment texts

Student feedback

"Hi I have been completing your courses as part of my uni course to learn sql and coding with Jupyter notebook. I was given an assignment using csv and sqlite data files and I had to create a data visualization. Thebfirst data set was NSW crime at the following link: https://data.nsw.gov.au/data/dataset/c18d2459-47c7-46b8-b9db-91d9365b1850 and the second was soccer statistics from the Barcelona Olympics. The problem I faced was that there is no tutorial or any product that shows how to create a graph from start to finish. Are you able to create an example with my dataset explaining the step by step process from start to finish. Then with an sqlite file, then comparing the two files. Your tutorials are great, but lack a complete example with this type of data. The Tips data is too basic and in a format that is easy to use. Plus there is no examples showing how to graph an sqlite file that has multiple tabs. Your help would be greatly appreciated. Regards Bradley Halicek [email protected]"

Cf [email protected] #50568

Improve narrative for chapter 4

Providing worked examples for new topics

@colinricardo I think in exercises like this where students are practicing something new for the first time, it makes sense to gradually ease them into writing more and more of the code themselves. Here, you could even imagine giving them all of the code for the first query and just asking them to run it, then adding 1-2 blanks in the next query, then 1-2 more, then no scaffolding, etc. Thoughts?

cc @ramnathv

Add extra exercises

Feedback on data

films

Put the director_id column before actor columns so when the student is previewing on a small screen, they will hopefully see the director and at least the first actor
Looks like there are still several dates > 2017 in the directors table. You should be checking your output more carefully. As a general rule, assume you fucked something up, then convince yourself otherwise.
Otherwise, I think this dataset is looking pretty good. The only other thing we might want to work on is digging up missing dob's and dod's, but that's not urgent

olympics

I think this dataset is looking good. Maybe the only other interesting thing we could add is another column for the country table (e.g. population)

add SCTs chapter 4

Switching tables mid-exercise

I noticed we often switch tables mid-exercise, which makes it difficult to use bullet exercises (since they are best when we're making sequential and small adjustments to a block of code across tasks). Plus it adds unnecessary cognitive overhead for the student when we really want them to be focusing on the technique, not the data.

For example, this exercise: https://campus.datacamp.com/courses/1946/5261?ex=3

SQLBolt implements this strategy beautifully here: https://sqlbolt.com/lesson/select_queries_with_constraints

Each task is a slight variation the last, which provides smooth repetition for the student and keeps them in the flow.

course image not building

@filipsch I think the use of $GITHUB_TOKEN in requirements.sh is causing the course to fail. Currently, it's used to clone this repo, in order to build the database. Alternatives might be to

move the code in data/films to a separate repo that the new dummy user has access to
give the new dummy user read access to this repo

Feedback: SF films data

General:

Tables should be lowercase plural nouns (directors, films, locations)
Let's use lowercase and underscores for column names (e.g. release_year)

directors:

Can you sort by Name alphabetically`, then make the IDs sequential?
Anywhere there's multiple directors (e.g. row 66, 211), let's keep only the first and double-check the DOB
Row 24 is empty
Maybe missing values should be empty strings instead of none?

films:

Same comment as above, make IDs sequential (sorted alphabetically by title)
Table should be called films, not filming
Main Actor --> lead_actor

locations:

Same comment on IDs (sorted alphabetically by location name)
Row 10 is empty
I thought we had addresses, no?

Feedback on new films database