datacarpentry / organization-genomics

Project Organization and Management for Genomics

Home Page: https://datacarpentry.org/organization-genomics

License: Other

carpentries data-carpentry lesson spreadsheet spreadsheets data-management metadata english genomics stable

organization-genomics's Introduction

organization-genomics

Lesson on data organization and project setup for genomics.

organization-genomics's Issues

topic for the repository

Please consider adding a lesson topic to the repository. To do so, you can follow the help page on how to add topics to a repository. Check the topics that the Genomics R intro lesson has, and add any others that may be relevant to this lesson.

This will help people to know which repositories are lessons and also could be used to automate analysis of the repositories.

01-introduction and 05-ncbi-sra paper reference links broken

The links for the Blount et al. 2012 paper and its supplementary material are broken in both 01-introduction.md and 05-ncbi-sra.md.

should we add a picture from the Tenaillon et al. paper demonstrating how to find SRA accessions?

@jessicamizzi and I thought it would be nice to show a png from the Tenaillon et al. paper (where the data for the wrangling-genomics lessons was first sequenced) where the SRA accession or project number can be found. This would demonstrate one way to interact with data: read a paper, find the accession number in the paper, then look it up on the SRA or ENA. We were unsure whether reusing an image from the original paper would infringe its copyright, though.

Thoughts?

SraRunTable.txt downloads as CSV, not TSV

In the updated version of the SRA Run Selector Page, the downloaded SraRunTable.txt is actually now a comma-delimited file rather than a tab-delimited file as stated on the current version of the "Examining Data on the NCBI SRA Database" page.

Students will need to specify that their spreadsheet program interpret it as comma-delimited, so I suggest the following language: "Using your choice of spreadsheet program, open the SraRunTable.txt file. You may need to tell the program that this is a comma-delimited file in order to have the data separated properly into columns."
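A minimal sketch of how a learner could check the delimiter before opening the file, assuming a local copy of SraRunTable.txt. The helper names `guess_delimiter` and `read_run_table` and the two column names in the sample are hypothetical; the idea is only to count commas versus tabs in the header line and parse accordingly.

```python
import csv
import io


def guess_delimiter(header_line):
    """Return a comma or a tab, whichever occurs more often in the header line."""
    if header_line.count(",") >= header_line.count("\t"):
        return ","
    return "\t"


def read_run_table(text):
    """Parse the table text into one dict per row, using the guessed delimiter."""
    header_line = text.splitlines()[0]
    delimiter = guess_delimiter(header_line)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# Hypothetical two-column sample; the real file has many more columns.
sample = "Run,LibraryName\nSRR2591054,REL4541B\n"
rows = read_run_table(sample)
print(rows[0]["LibraryName"])
```

To try this on a real download, the file contents could be read with `open("SraRunTable.txt", newline="").read()` and passed to `read_run_table`; an empty file would need separate handling.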

Out of place?

"Data about the experiment is usually collected in spreadsheets, like Excel."

This should be moved below "Metadata standards" and be at the start of the actual region discussing the spreadsheets.

It might also be appropriate to introduce other spreadsheet programs for those who do not know that other spreadsheet software exists. For example: "Data about the experiment is usually collected in spreadsheets, such as Microsoft Excel, LibreOffice Calc, or Gnumeric."

link broken to navigate to Tenaillon et al. data on SRA

The link encoded here does not lead to the expected place:

1. Access the Tenaillon dataset from the provided link: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605.

Motivation on organization in lesson 02-organization

Adding more motivation for why you want to setup the project directory this way and what each of the folders will hold.

Add Good/Better/Best for Sample Names

Ep1 Data Tidiness Metadata Example

Hi!

The screenshot example data under "Structuring data in spreadsheets" is a great example of the good habits described above it. However, I think it could be misleading for the first two rows to be global headers (versus column names), because that could interfere with reading the table directly into R. I know there are parameters to skip some rows, but a single header row may be easier for beginners. Another guideline that could be added is that the first column should hold unique sample identifiers and the first row should hold unique column names describing ... the descriptions!

I would also suggest moving the definition of metadata to a more obvious area, since it could easily be missed under the discussion box and it is the dominant concept of the page.

Looking forward to using the genomics lesson!

Cheers,
-Frances

Add Good/Better/Best for Column names

Missing Glossary

Glossary section of reference page is "FIXME"

Metadata standards call-out

Metadata standards should be out on its own, not in a sub-box.

Missing content in Introduction episode.

The opening page for this repo has a lot of placeholders for:

intro text
what they will learn
why this is important
key questions

Sorry if I missed an issue that covers this.

parallel-fastq-dump

@mgalland opened issue #52 in the genomics-workshop repo, which is about curricular content included in this lesson. I'm moving this issue over to this repo, as the Maintainers here will be more qualified to understand and act on this suggestion.

Typo in section heading

EMBL-EBI is misspelt in the section heading of 03-ncbi-sra.

Metadata example in messy spreadsheet

In 01_tidiness_datasheet_example_messy.png, there are "description" columns. These columns have spaces because the contents must NOT be critical for bcl2fastq. So these are metadata columns and later in the cleaned spreadsheet, they continue to have spaces. (Note: our Illumina sequencing instrument submission sheets do not allow metadata that I am aware of.) I propose we change the column descriptions from "Study_Description" to "Study_Metadata" and "Biosample_Description" to "BioSample_Metadata". The "Sample_Owner" must also be a metadata column, and could be called "Owner_Metadata". Without these changes learners would likely put underlines for all spaces in these columns as well. This is an opportunity to make it clear that metadata can exist in a spreadsheet that also contains data, and so should be labeled clearly.

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions.

Link to stylesheet for reference page broken

See datacarpentry/shell-genomics#169

Create backups flowchart

as in http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf

Add notes on ssh

I think it would be nice to add some setup notes on ssh.

Lang cleanup in introduction

Lesson 02-organization needs some cleanup of language.

'You should approach your sequencing project in a very similar way to how you do a biological experiment, and ideally, begins with experimental design.', this sentence is confusing right now, I'd suggest 'You should approach your sequencing project similarly to how you do a biological experiment and this ideally begins with experimental design.'

'Genomics projects can quickly accumulates hundreds of files across tens of folders.', accumulates --> accumulate

'Similarly, you probably won’t remember whether your best alignment results were in Analysis1, AnalysisRedone, or AnalysisRedone2; or which quality cutoff you used.' I suggest more options e.g. best alignment results, quality cutoff, version of software, settings for the software you used, etc. But this isn't really a necessary change.

Also typo here:
‘^X’, needs to have the '' removed.

Typo in the common problems section

The line "...help with this lesson and tell how people to do things in the other OS." should read "...help with this lesson and tell people how to do things in the other OS."

Documenting your commands using "history"

Next to the section that describes the command history:

What if we mention both redirection operators, ">" and ">>", instead of only the latter? The command "$ history > dc_workshop_log_xxxx_xx_xx.sh" creates a new file named dc_workshop_log_xxxx_xx_xx.sh, but running it again overwrites that file instead of appending to it as we move on to use more commands.

Combine (and reduce) the SRA Lesson

Arizona BugBBQ - The SRA lesson is too much and the subject matter is too deep to cover well. We suggest showing an SRA submission spreadsheet in the tidiness section. Learners could browse this in a short exercise and be made aware that this is probably metadata they will need to collect.

_episodes/03-ncbi-sra.md NCBI SRA download workflow error

Hello!

I noticed that the workflow for students here: https://github.com/datacarpentry/organization-genomics/blob/gh-pages/_episodes/03-ncbi-sra.md

for the "Download the Lenski SRA data from the SRA Run Selector Table" portion of the lesson is now outdated due to the recent NCBI upgrade.

see:

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605&o=acc_s%3Aa

-Dave

Combine Data Tidiness and Planning for NGS

Arizona BBQ Team: There may be some opportunities for combining these two sections. The spreadsheet with the current data set is good, and should be used so we don't need a second spreadsheet exercise (sample submission) - maybe some of those discussion questions can be moved over.

update example spreadsheet in 01-tidiness

In the 01-tidiness lesson, we have an example of a spreadsheet and ask learners to find some things that are wrong with it. The example spreadsheet is field data. It would be better to have some metadata that is more like what people would be using in a genomics experiment. So, we could create a more relevant messy spreadsheet for this exercise.

Files from _config.yml not rendering properly

The following pages are rendering improperly:
Setup
Reference
Code of conduct

@ErinBecker fixed a similar issue by editing the _config.yml file in the wrangling-genomics lesson.

Prerequisite text conflicts with "getting started" text on workshop homepage

Workshop Overview says:

This lesson assumes no prior experience with the tools covered in the workshop. However, learners are expected to have some familiarity with biological concepts, including nucleotide abbreviations and the concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.

Here says:

Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to insure the proper setup of tools for an efficient workflow.
These lessons assume no prior knowledge of the skills or tools.
Prerequisites
This lesson requires a spreadsheet program, such as Excel or OpenOffice, and a web browser.
To most effectively use these materials, please make sure to install everything before working through this lesson.

recommendations on project organization

This issue is for things people have learned on this lesson

Comments

"PIs should require not just end results, but the whole path and parameters to it"

"I wish I had taken this workshop earlier. Deciding how to save and manage data is one big lesson learned the hard way!"

clarification

In the episode "Planning for NGS project", the sub-heading "Retrieving samples from the facility" may be factually confusing. For clarity, I suggest "Retrieving sample sequencing data from the facility". The change covers both the sequence files and their metadata; as is, the heading implies that we are getting the sample itself back.

file download missing in 01-data-tidiness

The link to download the file in the last exercise of 01-data-tidiness is missing. The file is located at https://github.com/datacarpentry/organization-genomics/blob/gh-pages/files/Ecoli_metadata_composite_messy.xlsx

Quick jump into shell commands

In 02-organization we start using shell commands without opening or introducing the terminal, and then the commands are just used and not explained. It is a bit unclear to me whether we expect learners to know the shell commands or not: the text says we will introduce them to these commands, but then it seems we expect them to have done the shell lesson before.

I suggest:

Adding in opening terminal/gitbash
A bit of introduction on the power of working in the shell
Link to SWC shell lesson for more info
Introduction to mkdir before using it
Short explanation of ls before using it
Short explanation of nano before using it
Adding more motivation for why you want to setup the project directory this way and what each of the folders will hold.

Writing in more expectations of knowing the commands instead of saying we are introducing them to these commands

Edit/Comment the introduction

Setup instructions should point back to workshop homepage setup

To keep from confusing learners with multiple pages of set-up instructions, it would be ideal to have only one "point of truth" for setup instructions for the whole workshop. That page is the setup page in the workshop overview repo.

We can include text like:

This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All the software and data used in the workshop are
hosted on an Amazon Machine Image (AMI). For information about how to
use the workshop materials, see the
setup instructions on the main workshop page.

The information about installing LibreOffice should first be added to the main setup page.

Re-configuring this lesson

In this issue I'm proposing a reorganization of this module and some changes in the lessons.

General idea for Genomics Organization Introduction

Organizing a project that involves sequencing involves many components. There's the start of the experiment, with the records of the experimental setup and conditions, as well as the sequencing information and the records of the bioinformatics analyses. It's an extension of your lab notebook and freezer samples to digital data and analyses. In this lesson, we'll go through the project organization and documentation that will make your current life more organized and easier for future you to understand what was done.

In this lesson you will learn:

how to structure your metadata, tabular data and information about the experiment. The metadata is the information about the experiment and the samples you're sequencing.
how to prepare for, understand and organize and store the sequencing data that comes back from the sequencing center
how to access and download publicly available data that may need to be used in your bioinformatics analysis
the concepts of organizing the files and documenting the workflow of your bioinformatics analysis

With this structure, I'm proposing to re-order and expand some of the existing lessons

Move 04-data-tidiness to the first lesson and expand the discussion of metadata
Move 03-project-planning to the second lesson and add a discussion of data storage and the importance of keeping raw data raw
Move 05-ncbi-sra to the third lesson and add a general discussion of publicly available data
Move 02-organization to the fourth lesson and, rather than doing the command line discussion and exercises (since we're not on the cloud yet to have access to the command line), focus on the concepts of documenting and managing files in a bioinformatics workflow, particularly the branching workflow and exploring parameter space

Before working on this re-configuration, I wanted to get thoughts on this idea from other maintainers and genomics folks. Thanks!

@ErinBecker @mkuzak @Roselynlemusinmegen

project organization guidelines and examples

In this lesson we make some recommendations around project and data organization. We will likely want to stick with recommendations, because every project is different, but maybe we could have more of a list of guidelines, or some examples of how projects are organized.

We get comments that people are "still trying to get their head around how to organize data"

CSS not displaying on the Setup page

The CSS isn't working at the following link:

https://datacarpentry.org/organization-genomics/setup/

If I check with a CSS validation tool (http://jigsaw.w3.org/css-validator/validator?uri=https%3A%2F%2Fdatacarpentry.org%2Forganization-genomics%2Fsetup%2F&profile=css3svg&usermedium=all&warning=1&vextwarning=&lang=en), I see the following errors:

File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/bootstrap.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/bootstrap-theme.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/lesson.css: Not Found
File not found: https://datacarpentry.org/organization-genomics/setup/assets/css/syntax.css: Not Found
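
One possible reading of these errors, sketched with Python's standard `urllib.parse.urljoin`: if the page references its stylesheets with a relative path such as `assets/css/lesson.css`, that path is resolved against the page's own `/setup/` directory, which produces exactly the URLs reported above. Both `href` values below are assumptions for illustration, not taken from the page source, and whether the files really live one level up is also an assumption.

```python
from urllib.parse import urljoin

PAGE_URL = "https://datacarpentry.org/organization-genomics/setup/"

# Assumed reference as it might appear in the page's stylesheet link.
relative_href = "assets/css/lesson.css"
# Hypothetical alternative that climbs one directory before descending.
parent_href = "../assets/css/lesson.css"

# Resolve each reference the way a browser would, relative to the page URL.
resolved_from_setup = urljoin(PAGE_URL, relative_href)
resolved_from_parent = urljoin(PAGE_URL, parent_href)

print(resolved_from_setup)
print(resolved_from_parent)
```

The first result matches the "File not found" URL for lesson.css; the second points at the lesson root, which is where a fix in the site configuration would presumably need to send the request.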

Broken link to cloud lesson

Episode 2 links to the cloud lesson, but the link 404s:

http://www.datacarpentry.org/cloud-genomics/05-which-cloud/

Locate the Run Selector Table for the Lenski Dataset on the SRA does not work with IE

@bvreede noticed, when helping one of the participants, that downloading the run selector table does not work with Internet Explorer.

Messy spreadsheet

In http://www.datacarpentry.org/organization-genomics/01-tidiness/ the messy spreadsheet example is missing.

Lesson 3 REL4541B instructions no longer work as-written

In Lesson 3, the instructions to get to REL4541B are no longer valid. I just looked, and it appears that pull request #120 also contains a proposed solution to this issue. My proposal is slightly different and might still reliably point the students to the REL4541B (SRR2591054) run that the lesson currently intends them to examine, which might be a bonus if that particular run is meaningful to the lesson.

Where the current lesson instructs

Click on the Run Number of the first entry (REL4541B). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.

Modified instructions that should work (at least until the next visual redesign by NCBI) might read like this:

Scroll down to the list of Runs in this SRA Project. Let's try to find a particular run in this large project. Look for a run with Library Name "REL4541B." Try searching for the Library Name in the search box with the orange tag on it, and click Run SRR2591054 from the two results returned.

Duplicated content

03-project-planning and 04-tidiness appear to be largely duplicated content.

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

SWC legacy labels	New 'The Carpentries' labels
bug	type:bug
discussion	type:discussion
enhancement	type:enhancement
help-wanted	help wanted
newcomer-friendly	good first issue
template-and-tools	type:template and tools
work-in-progress	status:in progress

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.