datacarpentry / sql-ecology-lesson Goto Github PK

Data Management with SQL for Ecologists

Home Page: http://datacarpentry.github.io/sql-ecology-lesson

License: Other

sql data-carpentry ecology carpentries lesson data-wrangling database english stable

sql-ecology-lesson's Introduction

Data Carpentry SQL Lesson for Ecologists

This repository contains the Data Carpentry SQL material based on ecological data. Please see our contribution guidelines for information on how to contribute updates, bug fixes, or other corrections.

Maintainers

Current maintainers

James Foster (@jd-foster)
Adam Mansur (@adamancer)

Past maintainers

Katy Felkner (@katyfelkner)
Christina Koch (@ChristinaLK)
Paula Andrea Martinez (@orchid00)
Timothée Poisot (@tpoisot)
Rémi Rampin (@remram44)
Ethan White (@ethanwhite)

sql-ecology-lesson's People

Contributors

Stargazers

Watchers

Forkers

ethanwhite jrherr fmichonneau emhart lwasser mjmichaelson xuf12 bjalali aorimi waveform80 tracykteal lgatto idigbio nickynicolson c-martinez metajinomics johnrmoreau zormit kevin-vilbig tpoisot bernease sschmeier chdeb adamobeng lurch rizkaz mickley christinalk fanie-riekert caseyyoungflesh jhollist heereman statkclee chrismedrela brainwane mshelley1 kariljordan msuicat stephlabou dmcguckin ashander sarahsupp computecanada bsmith89 asgeorge jacquim almightybeeij alfawolf140 uw-madison-aci henrykironde fuzziebrain senzo-mpungose tonietuk zareestakhra ahetrick ph03n1x007 christophermalone amyehodge justinpringle lampakii remram44 tacnayn juanfrh pmarchand1 gaurav kyclark zoidy neausda hseifert passacagliata tobiasdressel jkreklow julianhocker bockwinkel horst-baumann lvlzay bathmer emgerstner kburkhard ldeiuliis schlauch tatumhawkins helixcn sedaarat jensmith3 nano-jag cjcenter uri-ai-lab djb5 arpankbasak fergarciafer anneglaser hshimon florisvdh nsoranzo amdevine swc2019 zhejyang buehlerd jd-foster

sql-ecology-lesson's Issues

Encourage data download on landing page

It's encouraged, and the files needed are descrilbed in 00-sql-introduction but not on the landing page.

Mentioning the 4 files required on the landing page, rather than just providing the doi, might encourage a few more students to at least download the 16mb zip?

Potentially add a module on 'Using SQL with your own data'

This module would be the last one and would discuss how to use SQL with your own data. It could

Emphasize that SQLite is fine for their own work
Give an overview of what they would do, i.e. import your own CSV files like we did here
List the potential limitations of SQLite
List other database types and some of their strengths/weaknesses (not overwhelmingly technical, more just like "handles databases up to 10GB well", "is good for multiple users", "need access to a Linux server to use and install", "command line interface only")
Discuss that once a database is created, you can connect to it through other languages like R or Python (and that we'll demonstrate that in the capstone)
Have some links to materials or info on good database design

Contributing guidelines need fixes reflecting split-out lessons

Repo to submit pull requests to is no longer the datacarpentry repo.
Lessons are no longer in subdirectories:

Every lesson has a sub-directory of its own, while individual topics are files in that directory. For example, the lessons/shell directory holding our introduction to the shell contains the files 00-intro.md, 01-filedir.md and so on.

Improvements to the JOINS section

I think it would be a good idea to start the "Joins and aliases" section with an example of what happens if we do a JOIN but do not use an ON, as this is not a necessity. Basically explaining that this will lead to a joining of all rows of one table to all rows of the other table. Then explaining why we don't want this, e.g. big tables=lots of calculations = lots of time, finally explaining that we want to join tables based on common entries in two or more tables using ON. Here we also might want to mention briefly, that this relates back to database design and table structure, as we can only join tables that have common table/column entries. So we need to think about how to design the DB upfront.

We should mention that
"ON surveys.species_id = species.species_id" and "USING (species_id)"
will not get the same result, as the "species_id" column in the later will only appear once.

Seb

Add info on ability to change data type within SQLite after data is imported

At the beginning we have people import the data from csv as an exercise. There's a step on import where you have to select the data type for the field. Some people select the wrong type or nothing at all and wonder what they can do to change it.

If you right click or Control click on the column name in the Structure tab, then there's an Edit Column option and you can type in a new data type. When you go to save it, it does warn you that this is dangerous behavior, so people are being warned, but it might be nice to show this option and also to say why it's not generally a good idea to change data types.

kramdown rendering

Is it safe to just remove markdown from the _config.yml in anticipation for the May 1st deprecation of redcarpet in profit of kramdown?

Pinging @datacarpentry/coordinators and @datacarpentry/maintainers too, maybe this is being discussed in issues on another repo.

Fix CONTRIBUTING in new template

remove/merge info from CONTRIBUTING_orig.md into CONTRIBUTING.md

Current online lesson pages are broken

The current lesson page for this repo linked from the main Data Carpentry site are displaying like this:

When I build the pages locally using Jekyll they work, but are filled with FIXME's.

Looks like this has been down for at least 3 days. Cc'ing @gvwilson since he was involved in the most recent changes.

Add comment to instructor notes referring back to messy spreadsheet

To create the link between the importance of structured data and being able to use SQL, add to the instructors notes that this link should be pointed out.

e.g. "Remember that messy spreadsheet? What would have happened if we tried to load that in to SQL?"

Linked to datacarpentry/spreadsheet-ecology-lesson#6

Remove reference to 00-supplement-database-design from guide.md

There are no other references to this, except one line here:

_extras/guide.md:65:### 00-supplement-database-design.md

Is it needed?

Relates to fixing up instructor guide issue #124

Question about teaching SQL lite

Hey Ethan, et al. I am working through the install instructions for SQL lite.

The lesson, as written instructs students to install SQL as a plugin for firefox. I went through that and it is fairly straight forward. However the SWC setup instructions suggest that Sql comes installed with OSX natively but there is a windows installer.

http://ouinformatics.github.io/2015-04-15-ncar/

How is SQLlite normally taught? Should we ask students to go the firefox route? Or is there a gui version of the database that runs independently of the firefox browser.

Thank you for any feedback.
Leah

learning objectives stated twice -- harmonize or merge

Harmonize learning objectives / overview

00-intro
01-basic
02-aggregation
03-joins

Choice is to harmonize them to be the same or merge them to only be in Overview.

As mentioned in datacarpentry/spreadsheet-ecology-lesson#78 this is issue across all DC. Discussion should take place at linked issue, and decision reached there should be implemented here too.

On data import data emphasize types for the fields

Feedback from a workshop. When learners import data, we do have in step 10 the data types that are supposed to be used, but it's easy to miss, so maybe we could restructure that a little.

Also, for the import exercise, we don't tell students what the data types should be.

Format Key Points

These are not currently formatted in any of the episodes. If there are none, delete the box

00-sql-introduction.md
01-sql-basic-queries.md
02-sql-aggregation.md
03-sql-joins-aliases.md

Species not a variable - use species_id

Received via email:

FYI: In the section below, there is a typo. In the CHALLENGE, you are asked to query species. There is no species in the surveys table. You must use species_id.

Sorting
We can also sort the results of our queries by using ORDER BY. For simplicity, let’s go back to the species table and alphabetize it by taxa.

SELECT *
FROM species
ORDER BY taxa ASC;
The keyword ASC tells us to order it in Ascending order. We could alternately use DESC to get descending order.

SELECT *
FROM species
ORDER BY taxa DESC;
ASC is the default.

We can also sort on several fields at once. To truly be alphabetical, we might want to order by genus then species.

SELECT *
FROM species
ORDER BY genus ASC, species ASC;

Challenge
Write a query that returns year, species, and weight in kg from the surveys table, sorted with the largest weights at the top.

Update the links to the new release of Portal project Teaching Database

Update the links to the new release of Portal project Teaching Database
[1] http://figshare.com/articles/Portal_Project_Teaching_Database/1314459
[2] https://github.com/weecology/portal-teachingdb

SQL aggregation "Null values" section: code blocks not formatted as code

In the "null values" section of the SQL Aggregation episode/page, we see:

But if we try to be extra clever, and find the average ourselves, we might get tripped up:

SELECT SUM(weight), COUNT(), SUM(weight)/COUNT() FROM summer_2000 WHERE species_id == ‘PE’

and the SELECT statement isn't formatted as a code block. This is because that SELECT statement -- like others on the same page -- isn't indented enough in the Markdown file. I'm fixing this with a pull request.

Concept map for "what is a relational database"

This issue springs from discussion of pull request #99 in which @heereman wrote a definition of a relational database, and @orchid00 and I thought it might be a good idea to add a concept map to the instructor's notes for this lesson.

My question:

It seems to me like the concepts and linkages @heereman mentioned in their definition are most of the concepts and linkages that ought to go into the concept map -- here's my list, and I'd love feedback on what else should go into the map for this lesson, or what I could leave out:

users/researchers
queries
SQL
data/values
storage
retrieval
tables
rows
columns
records
record IDs
atomic
relationships
joining
functions
nulls
aliases

I'd love to know others' feedback on what I could leave out of this! I'm thinking maybe "atomic"?

Add notes on comments

Why comments are important
What the comment character is in SQL (--)

add diagrams for SQL statements

It could be useful to add diagrams for WHERE and JOIN statements.

For WHERE we can have boxes that show the selection of subsets of the data.

For JOIN there are good figures available with Venn diagrams, although they include more information, such as with INNER and OUTER joins, e.g.

http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

software we use to teach SQL

We currently have people install the Firefox plugin to use SQL. It's nice because it tends to work under all OSes, but there are some challenges because people can't save commands, they don't quite understand how they'll use it after a workshop, and because you can't zoom, when you teach the font is tiny and hard to read on the screen.

An alternative when we're teaching R already is to instead use the RSQLite package and teach SQL through R. This solves a lot of those problems and also works well with a learners workflow. @lwasser already has a lesson that includes it https://github.com/datacarpentry/R-ecology/blob/gh-pages/05-r-and-sql.Rmd and when she taught it, it went well. @karthik has also taught this way and suggested this approach in conversation.

Lesson Release checklist

Freeze Date: 2017-03-15
Scheduled Release Date: 2017-04-21

Checklist of tasks to complete before release:

Add mention of different JOIN types

In the lesson on joins, a brief mention of the existence of multiple types of join might be useful. If students try to apply JOIN to their own data and aren't aware that non-matching rows will be dropped, it could be frustrating. I would be happy to do a PR adding this caution to the end of "Joins" if desired.

committing directly to gh-pages, oops

In preparation for a workshop at UNC, I made several commits directly to gh-pages (checked with @ethanwhite who said "go ahead!' and then, "oops! I am no longer the maintainer). Note that none of them affect the lesson directly, only adding / editing instructor notes and also the lesson documentation (specifically, updating the maintainers)

Let me know if you are ok with these changes, or if you want me to revert and PR. @tpoisot @orchid00

Markdown headers

Is there a rationale for the markdown headers? It seems to be a mix of # and -, and I'm just a little bit obsessive about these things.

Should we decide on a standard, or is there one in place that is convoluted to the point of eluding me?

Add discussion of HAVING

I always end up discussing this because someone always asked about it. It should be included as part of the notes, probably at the end of 02.

data for this lesson

I got lost at "Let’s all open the database we downloaded in SQLite Manager by clicking on the open file icon." in http://www.datacarpentry.org/sql-ecology-lesson/00-sql-introduction.html

There's a reference in http://www.datacarpentry.org/sql-ecology-lesson/ to data files on figshare, I had a look but couldn't see anything.

In the Import section the figshare link to 3 csv files gives a 404.

Add more introduction to SQLite manager

Add some introduction to the buttons, tabs, and navigation in SQLite manager early in the lesson. This is something that has proved confusing for students in the past.

Harmonize learning objectives in _extras/guide to the lessons

Add a list of functions

In 01-..., the "Calculated values" header mentions functions. Maybe a table would be nice.

Switch to using data from Portal Project Teaching Database

We've recently cleaned up the data for this lesson further and posted it on figshare. The lesson should be converted to using the figshare data, which involves:

Changing links
Changing the column names that have been modified
Changing any reference to species code NA to refer to NL since NA as replaced with NL

Update CONTRIBUTING to point to gh-pages branch

In the CONTRIBUTING.md (https://github.com/datacarpentry/sql-ecology/blame/gh-pages/CONTRIBUTING.md#L31) step 5 says to issue PR to the master branch. However when I look at the repo I see only gh-pages. Should the CONTRIBUTING file be updated?

Which learning objectives should we assess?

Greetings everyone,

I'd appreciate your feedback on my current project. We've successfully revised the learning objectives for all of the Ecology lessons to reflect what we are teaching, and now we are in the process of developing surveys to assess our learners (before and after the workshop) on their skills and self-efficacy for the tools they were taught.

Your feedback is extremely valuable. In this document I've added the learning objectives for all of the Ecology lessons. I would appreciate if you'd open this document and add a +1 to the learning objectives you think are most important to assess our learners.

Additionally, I'm scheduling a virtual meeting to review the objectives, discuss your +1's, and come to a consensus. Please provide your availability here.

Our goal isn't to change what we're teaching, but to better understand what we're teaching, and ensure our learning objectives reflect that, so that we can assess our learners.

Thank you for your feedback and your time.
Kari

P.S. Thank you so much to all the maintainers who helped revise the learning objectives!

@ethanwhite @tpoisot @tracykteal @apawlik @fmichonneau @ChristinaLK @c-martinez @kcranston @heereman @waveform80 @nickynicolson @caseyyoungflesh @ErinBecker @jrherr @metajinomics @hlapp @aurielfournier @cbahlai @karthik

SQLite plugin install

sqlSetup.html should include instructions for installing the SQLite plugin for Firefox.

Add 'instructors-notes' to this repo

Add instructors notes to this lesson to describe how to use the lesson and issues that will potentially come up, like #37

Do you want gh-pages for this repo?

Hi Ethan / Tracy,
I'm working on the next DC workshop and have created a gh-pages for the python lessons.

http://lwasser.github.io/python-ecology/

I plan to PR against the main DC repo when we are done. Do you want a gh-pages for SQL and shell as well so you can link to lessons like SWC does? i'm going through these now for our workshop and can edit as needed as i go.
Leah

Handling NULL values in SQL

The dataset we use has a lot of empty cells, so the issue of how to handle missing data often comes up. Can we add a section on handling missing data, like with IS NULL and IS NOT NULL?

Please add a section of aliases to the SQL lessons

This is just a reminder to do so.
Cheers!

check that the learning objectives reflect the content of the lessons [Lesson Release checklist]

in: 00-sql-introduction.md

The learning objectives listed at the top of this lesson are not all covered in this .md file and they are not in the same order as encountered in the lesson.

They should either match the 2 objectives given in the header, or be trimmed to the 3 that are covered:

Describe why relational databases are useful.

```
Define SQLite data types.
```

Create and populate a database from a text file.

Learning Objectives

Create and populate a database from a text file.
Describe why relational databases are useful.
Define SQLite data types.
Select, group, add to, and analyze subsets of data.
Combine data across multiple tables.

potentially change structure of JOIN statement

Our current JOIN statement is

SELECT *
FROM surveys
JOIN species ON surveys.species_id = species.species_id

When I restructured it as

SELECT *
FROM surveys JOIN species
ON surveys.species_id = species.species_id

participants suggested it made more sense to see on the same line the two tables that were being joined. I don't know if this conforms to standard formatting or if it would help beyond this one case, but just a thought.

Add discussion of only including grouped or aggregated fields in SELECT

In queries with GROUP BY it is important to only include grouped or aggregated fields in the select statements. Otherwise results are typically not what the user expects. This should be discussed in 02.

From legacy Access to SQLite

In #45, @tpoisot suggests the idea of a "From Excel to SQL" capstone module. A related idea might be best practices for going "From legacy Access to SQLite". Is this beyond the scope of this lesson or does this sort of thing already exist elsewhere?

More detail on section about Views

@ethanwhite mentioned that Views are the standard way to save queries. This, along with how to export csv files of the data from a query are already mentioned in the current lesson.

Could we add some more detail to the Views section though, to explain what a view is, how to create it and how to use it again later? Something like 'the anatomy of a view'? I think both instructors and learners would benefit for some more detail in that section.

A few points I've run across in particular:

you can select Create View and it pops up a window. Do you copy and paste a query in there, or is it possible to create a View from a query you just ran?
Export View brings up an Export Wizard. You can export the CSV which makes sense that you're exporting the data that's the result of the current query. There's also the option to export the SQL and a couple buttons. Would you ever want to export the SQL? If you do, what buttons do you select?
Once you create a View, I found that I could see what query I ran by clicking on the 'Structure' tab, but that wasn't intuitive. I would have expected it to appear in the 'Execute SQL' tab.

Split material into lessons?

Currently all material is on one file. Most modules are split into smaller chunks - each one is a separate lesson. It may be good to do the same with SQL (?). This should also be combined with adding more examples and some more material / revising wording to expand the lessons.

Improve challenge?

@fmichonneau -- you are the last one to edit this challenge

https://github.com/datacarpentry/sql-ecology-lesson/blame/gh-pages/02-sql-aggregation.md#L35

What is the solution supposed to be?

check for clarity and flow of narrative [Lesson release checklist]

in: 00-sql-introduction.md

The definition below could specify that records (rows) and fields (columns). This is not necessarily obvious to a complete [novice.]

Definition: Relational database

A relational database is a digital database where all data is stored in relations (or tables) containing rows and columns. All tables contain records and all records are identified by a field contiaing a unique value. Every table shares at least one field with another table in one-to-one, one-to-many or many-to-many relationships. This allows the user to access the data in many ways {: .callout}

Improve one clause per line code block and discussion

There are currently two versions of the same SQL statement. The idea is to demonstrate that a better way of writing longer queries is one clause per line, but this isn't discussed in the notes. The code block should be split into two separate code blocks and language about the one clause per line approach added.

Setup link broken in 00-sql-intro

https://github.com/datacarpentry/sql-ecology-lesson/blame/c3138e892504e041ea05021b6ddd851bb8bbd0c7/_episodes/00-sql-introduction.md#L24

Numeric `IN` queries

I was working through the Basic queries lesson and I got stuck for a moment when trying to use numeric arguments to the IN keyword. Up until this point all the numeric queries are unquoted. However, I believe they need to be quoted when used with IN. Submitting a PR shortly with an example query.