organization-genomics's Introduction

organization-genomics

Lesson on data organization and project setup for genomics.

organization-genomics's People

Contributors

acharbonneau, bebatut, binxiepeterson, brooksph, debpaul, dpshelio, erinbecker, ewallace, fmichonneau, froggleston, hidyverse, hoytpr, jasonjwilliamsny, jcszamosi, kweitemier, laninsky, maneesha, metalichen, mfoos, orchid00, ousodaniel, raynamharris, sarahbeecroft, svigneau, taylorreiter, tobyhodges, tracykteal, williamsmicrobegenome, willpitchers, zkamvar

organization-genomics's Issues

should we add a picture from the Tenaillon et al. paper demonstrating how to find SRA accessions?

@jessicamizzi and I thought it would be nice to show a png from the Tenaillon et al. paper (where the data for the wrangling-genomics lessons was first sequenced) where the SRA accession or project number can be found. This would demonstrate one way to interact with data: read a paper, find accession number in paper, look paper up on the SRA or ENA. We were unsure of copyright infringement on the original paper though.

Thoughts?

SraRunTable.txt downloads as CSV, not TSV

In the updated version of the SRA Run Selector Page, the downloaded SraRunTable.txt is actually now a comma-delimited file rather than a tab-delimited file as stated on the current version of the "Examining Data on the NCBI SRA Database" page.

Students will need to specify that their spreadsheet program interpret it as comma-delimited, so I suggest the following language: "Using your choice of spreadsheet program, open the SraRunTable.txt file. You may need to tell the program that this is a comma-delimited file in order to have the data separated properly into columns."
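For learners who would rather check the delimiter programmatically, here is a minimal Python sketch. It assumes the table has a header line; the column names (`Run`, `LibraryName`, `Bases`) and the sample rows are made up for illustration and are not taken from a real SraRunTable.txt.

```python
import csv
import io


def guess_delimiter(header_line: str) -> str:
    """Return ',' or a tab, whichever occurs more often in the header line."""
    return "," if header_line.count(",") >= header_line.count("\t") else "\t"


def read_run_table(text: str) -> list[dict[str, str]]:
    """Parse table text into a list of row dictionaries keyed by column name."""
    first_line = text.splitlines()[0]
    reader = csv.DictReader(io.StringIO(text), delimiter=guess_delimiter(first_line))
    return list(reader)


# Made-up sample content standing in for a downloaded SraRunTable.txt.
sample = "Run,LibraryName,Bases\nRUN0001,libA,100\nRUN0002,libB,200\n"
rows = read_run_table(sample)
print(rows[0]["LibraryName"])
```

The same function handles an older tab-delimited download, because the delimiter is chosen from the header line rather than hard-coded.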

Out of place?

"Data about the experiment is usually collected in spreadsheets, like Excel."

This should be moved below "Metadata standards" and be at the start of the actual region discussing the spreadsheets.

It might also be appropriate to introduce other spreadsheet programs to those who do not know that they exist. For example: "Data about the experiment is usually collected in spreadsheets, such as Microsoft Excel, LibreOffice Calc, or Gnumeric."

Ep1 Data Tidyness Metadata Example

Hi!

The screenshot of example data under "Structuring data in spreadsheets" is a great example of the good habits described above it. However, I think it could be misleading for the first two rows to be global headers (rather than column names), since this could interfere with reading the table directly into R. I know there are parameters to skip some rows, but avoiding them may be easier for beginners. Another guideline that could be added is that the first column should hold unique sample identifiers and the first row should hold unique column names describing ... the descriptions!
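To illustrate the extra step that global header rows force on a reader, here is a small Python sketch (the issue mentions R; Python is used here only as an illustration). The sheet contents, the column names, and the helper `read_table_skipping` are all hypothetical.

```python
import csv


def read_table_skipping(text: str, n_skip: int) -> list[dict[str, str]]:
    """Drop the first n_skip lines, then treat the next line as column names."""
    lines = text.splitlines()[n_skip:]
    return list(csv.DictReader(lines))


# Made-up sheet: two global header rows, then the real column names.
sample = (
    "Sequencing submission sheet,,\n"
    "Facility: example,,\n"
    "sample_id,strain,volume_ul\n"
    "S1,strainA,10\n"
    "S2,strainB,12\n"
)

rows = read_table_skipping(sample, 2)
sample_ids = [row["sample_id"] for row in rows]

# The first column should uniquely identify each sample.
ids_are_unique = len(set(sample_ids)) == len(sample_ids)
print(sample_ids, ids_are_unique)
```

If the sheet started directly with the column names, the `n_skip` argument would not be needed at all, which is the simplification the suggestion is after.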

I would also suggest moving the definition of metadata to a more obvious place, since it could easily be missed under the discussion box and is the dominant concept of the page.

Looking forward to using the genomics lesson!

Cheers,
-Frances

parallel-fastq-dump

@mgalland opened issue #52 in the genomics-workshop repo, which is about curricular content included in this lesson. I'm moving this issue over to this repo, as the Maintainers here will be more qualified to understand and act on this suggestion.

Metadata example in messy spreadsheet

In 01_tidiness_datasheet_example_messy.png, there are "description" columns. These columns have spaces because the contents must NOT be critical for bcl2fastq. So these are metadata columns, and later in the cleaned spreadsheet they continue to have spaces. (Note: as far as I am aware, our Illumina sequencing instrument submission sheets do not allow metadata.) I propose we change the column names from "Study_Description" to "Study_Metadata" and from "Biosample_Description" to "BioSample_Metadata". The "Sample_Owner" column must also be a metadata column, and could be called "Owner_Metadata". Without these changes, learners would likely replace all spaces in these columns with underscores as well. This is an opportunity to make it clear that metadata can exist in a spreadsheet that also contains data, and so should be labeled clearly.

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

  • Example code chunks run as expected
  • Challenges / exercises run as expected
  • Challenge / exercise solutions are correct
  • Call out boxes (exercises, discussions, tips, etc) render correctly
  • A schedule appears on the lesson homepage (e.g. not “00:00”)
  • Each episode includes learning objectives
  • Each episode includes questions
  • Each episode includes key points
  • Setup instructions are up-to-date, correct, clear, and complete
  • File structure is clean (e.g. delete deprecated files, ensure filenames are consistent)
  • Some Instructor notes are provided
  • Lesson links work as expected

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).

Lang cleanup in introduction

Lesson 02-organization needs some cleanup of language.

'You should approach your sequencing project in a very similar way to how you do a biological experiment, and ideally, begins with experimental design.', this sentence is confusing right now, I'd suggest 'You should approach your sequencing project similarly to how you do a biological experiment and this ideally begins with experimental design.'

'Genomics projects can quickly accumulates hundreds of files across tens of folders.', accumulates --> accumulate

'Similarly, you probably won’t remember whether your best alignment results were in Analysis1, AnalysisRedone, or AnalysisRedone2; or which quality cutoff you used.' I suggest more options e.g. best alignment results, quality cutoff, version of software, settings for the software you used, etc. But this isn't really a necessary change.

Also typo here:
‘^X’, needs to have the '' removed.

Typo in the common problems section

The line "...help with this lesson and tell how people to do things in the other OS." Should read "...help with this lesson and tell people how to do things in the other OS."

Documenting your commands using "history"

Next to the section that describes the command history:

What if we mention both redirection operators, ">" and ">>", instead of only the latter? The command "$ history > dc_workshop_log_xxxx_xx_xx.sh" creates a new file named dc_workshop_log_xxxx_xx_xx.sh (or overwrites it if it already exists), rather than appending to it as we move on to use more commands.
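The overwrite-versus-append distinction between ">" and ">>" can be sketched outside the shell as well. The Python example below is only an analogy, not the lesson's method: file mode "w" behaves like ">" (create or truncate) and mode "a" behaves like ">>" (append). The file name and the logged commands are hypothetical.

```python
import tempfile
from pathlib import Path


def write_log(path: Path, text: str, append: bool) -> None:
    """Write text to path: mode 'a' mirrors '>>', mode 'w' mirrors '>'."""
    mode = "a" if append else "w"
    with open(path, mode) as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "workshop_log.sh"  # hypothetical file name

    write_log(log, "mkdir dc_workshop\n", append=False)  # like '>': creates the file
    write_log(log, "ls\n", append=True)                  # like '>>': keeps earlier lines
    after_append = log.read_text()

    write_log(log, "nano README\n", append=False)        # like '>': earlier lines are lost
    after_overwrite = log.read_text()

print(after_append)
print(after_overwrite)
```

After the append the log holds both commands; after the second overwrite only the last one remains, which is why the choice of operator matters when documenting a session.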

Combine (and reduce) the SRA Lesson

Arizona BugBBQ - The SRA lesson is too much and the subject matter is too deep to cover well. We suggest showing an SRA submission spreadsheet in the tidiness section. Learners could browse this in a short exercise and be made aware that this is probably metadata they will need to collect.

Combine Data Tidiness and Planning for NGS

Arizona BBQ Team: There may be some opportunities for combining these two sections. The spreadsheet with the current data set is good, and should be used so we don't need a second spreadsheet exercise (sample submission) - maybe some of those discussion questions can be moved over.

update example spreadsheet in 01-tidiness

In the 01-tidiness lesson, we have an example of a spreadsheet and ask learners to find some things that are wrong with it. The example spreadsheet is field data. It would be better to have some metadata that is more like what people would be using in a genomics experiment. So, we could create a more relevant messy spreadsheet for this exercise.

Prerequisite text conflicts with "getting started" text on workshop homepage

Workshop Overview says:

This lesson assumes no prior experience with the tools covered in the workshop. However, learners are expected to have some familiarity with biological concepts, including nucleotide abbreviations and the concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.

Here says:

Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to insure the proper setup of tools for an efficient workflow.
These lessons assume no prior knowledge of the skills or tools.
Prerequisites
This lesson requires a spreadsheet program, such as Excel or OpenOffice, and a web browser.
To most effectively use these materials, please make sure to install everything before working through this lesson.

recommendations on project organization

This issue is for things people have learned on this lesson

Comments

"PIs should require not just end results, but the whole path and parameters to it"

"I wish I had taken this workshop earlier. Deciding how to save and manage dat is one big lesson learned the hard way!"

clarification

In the episode "Planning for NGS project", the sub-heading "Retrieving samples from the facility" may be factually confusing. For clarity, I suggest "Retrieving sample sequencing data from the facility". The change would cover both the sequence files and their metadata; as it stands, the heading implies that we are getting the samples themselves back.


Quick jump into shell commands

In 02-organization we start using shell commands without opening or introducing the terminal, and then the commands are just used and not explained. It is a bit unclear to me whether we expect learners to know the shell commands or not, because it says we will introduce them to these commands and then it seems like we expect them to have done the shell lesson before.

I suggest:

  • Adding in opening terminal/gitbash
  • A bit of introduction on the power of working in the shell
  • Link to SWC shell lesson for more info
  • Introduction to mkdir before using it
  • Short explanation of ls before using it
  • Short explanation of nano before using it
  • Adding more motivation for why you want to setup the project directory this way and what each of the folders will hold.

OR

  • Writing in more expectations of knowing the commands instead of saying we are introducing them to these commands
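As a rough illustration of the project setup that the suggestions above would introduce step by step, here is a Python sketch that creates a project directory with a few subfolders and then lists them, loosely mirroring `mkdir` followed by `ls`. The project name `dc_workshop` and the subfolder names `docs`, `data`, and `results` are assumptions made for this example, as is the helper `make_project`.

```python
import tempfile
from pathlib import Path


def make_project(root: Path, subdirs=("docs", "data", "results")) -> list[Path]:
    """Create root and each subfolder; existing folders are left untouched."""
    created = []
    for name in subdirs:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "dc_workshop"  # hypothetical project name
    folders = make_project(project)
    # Equivalent in spirit to running 'ls' inside the project directory.
    listing = sorted(p.name for p in project.iterdir())

print(listing)
```

Keeping documentation, raw data, and results in separate folders is one common convention; the lesson may motivate a different split.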

Setup instructions should point back to workshop homepage setup

To keep from confusing learners with multiple pages of set-up instructions, it would be ideal to have only one "point of truth" for setup instructions for the whole workshop. That page is the setup page in the workshop overview repo.

We can include text like:

This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All the software and data used in the workshop are
hosted on an Amazon Machine Image (AMI). For information about how to
use the workshop materials, see the
setup instructions on the main workshop page.

The information about installing LibreOffice should first be added to the main setup page.

Re-configuring this lesson

In this issue I'm proposing a reorganization of this module and some changes in the lessons.

General idea for Genomics Organization Introduction

Organizing a project that involves sequencing involves many components. There's the start of the experiment, with the records of the experimental setup and conditions, as well as the sequencing information and the records of the bioinformatics analyses. It's an extension of your lab notebook and freezer samples to digital data and analyses. In this lesson, we'll go through the project organization and documentation that will make your current work more organized and make it easier for future you to understand what was done.

In this lesson you will learn:

  1. how to structure your metadata, tabular data and information about the experiment. The metadata is the information about the experiment and the samples you're sequencing.
  2. how to prepare for, understand and organize and store the sequencing data that comes back from the sequencing center
  3. how to access and download publicly available data that may need to be used in your bioinformatics analysis
  4. the concepts of organizing the files and documenting the workflow of your bioinformatics analysis

With this structure, I'm proposing to re-order and expand some of the existing lessons

  • Move 04-data-tidiness to the first lesson and expand the discussion of metadata
  • Move 03-project-planning to the second lesson and add a discussion of data storage and the importance of keeping raw data raw
  • Move 05-ncbi-sra to the third lesson and add a general discussion of publicly available data
  • Move 02-organization to the fourth lesson and, rather than doing the command line discussion and exercises (since we're not on the cloud yet to have access to the command line), focus on the concepts of documenting and managing files in a bioinformatics workflow, particularly the branching workflow and exploring parameter space

Before working on this re-configuration, I wanted to get thoughts on this idea from other maintainers and genomics folks. Thanks!

@ErinBecker @mkuzak @Roselynlemusinmegen

project organization guidelines and examples

In this lesson we make some recommendations around project and data organization. We will likely want to stick with recommendations, because every project is different, but maybe we could have more of a list of guidelines, or some examples of how projects are organized.

We get comments that people are "still trying to get their head around how to organize data"

CSS not displaying on the Setup page

Lesson 3 REL4541B instructions no longer work as-written

In Lesson 3, the instructions to get to REL4541B are no longer valid. I just looked and it appears that pull request #120 also contains a proposed solution to this issue. My proposal is slightly different, and might reliably still point the students to the REL4541B (SRR2591054) run that the lesson currently intends them to examine, which might be a bonus if that particular run is meaningful to the lesson.

Where the current lesson instructs

Click on the Run Number of the first entry (REL4541B). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.

Modified instructions that should work (at least until the next visual redesign by NCBI) might read like this:

Scroll down to the list of Runs in this SRA Project. Let's try to find a particular run in this large project. Look for a run with Library Name "REL4541B." Try searching for the Library Name in the search box with the orange tag on it, and click Run SRR2591054 from the two results returned.
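The same lookup can be sketched against a downloaded run table instead of the web page. This Python example is an illustration under stated assumptions: the column names `Run` and `LibraryName` and the helper `runs_for_library` are hypothetical, and the second row is made up. Only the pairing of REL4541B with SRR2591054 comes from the issue text.

```python
import csv
import io


def runs_for_library(table_text: str, library_name: str,
                     run_col: str = "Run",
                     library_col: str = "LibraryName") -> list[str]:
    """Return the run accessions whose library name matches exactly."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [row[run_col] for row in reader if row[library_col] == library_name]


# Made-up two-row table; only the first row reflects the issue text.
sample = (
    "Run,LibraryName\n"
    "SRR2591054,REL4541B\n"
    "EXAMPLE0001,OTHERLIB\n"
)

matches = runs_for_library(sample, "REL4541B")
print(matches)
```

Searching by library name rather than by position in the list keeps the instructions valid even if the order of runs on the page changes.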

Duplicated content

03-project-planning and 04-tidiness appear to be largely duplicated content.

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

SWC legacy labels     New 'The Carpentries' labels
-----------------     ----------------------------
bug                   type:bug
discussion            type:discussion
enhancement           type:enhancement
help-wanted           help wanted
newcomer-friendly     good first issue
template-and-tools    type:template and tools
work-in-progress      status:in progress

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.
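The conversion rules above can be expressed as a small lookup. The Python sketch below is an illustration only, not the script the committee used: the function `convert_labels` is hypothetical, and it simplifies the rule for the `question` label by always leaving it in place.

```python
# Mapping copied from the conversion rules listed in this issue.
LEGACY_TO_NEW = {
    "bug": "type:bug",
    "discussion": "type:discussion",
    "enhancement": "type:enhancement",
    "help-wanted": "help wanted",
    "newcomer-friendly": "good first issue",
    "template-and-tools": "type:template and tools",
    "work-in-progress": "status:in progress",
}

# Labels that were removed rather than renamed.
REMOVED = {"instructor-training"}


def convert_labels(labels: list[str]) -> list[str]:
    """Rename legacy labels, drop removed ones, keep everything else as is."""
    converted = []
    for label in labels:
        if label in REMOVED:
            continue
        converted.append(LEGACY_TO_NEW.get(label, label))
    return converted


print(convert_labels(["bug", "instructor-training", "question", "my-custom-label"]))
```

Custom labels fall through the lookup unchanged, matching the statement that they were left as is.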

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.
