phenotype_from_genotype's Issues

Chapter 4 (Snowflake) draft

Chapter 4: Snowflake

Research to write up:

  • Data (decide where these graphs/examples are coming from and write up):
    • ALSPAC data set write up.
      • Background: what is the data set
      • Raw data (genotype + phenotype), ethics, sequencing type
      • Creating the inputs

Writing to-do

  • A once-over of the structure and quality of the written sections.
    • Introduction
    • Algorithm
      • Rename section to Algorithm (from Snowflake method to Snowflake Algorithm).
      • Name non-background cohort (input cohort?) properly and change references to this.
      • 23andMe data pipeline (optional: release #43)
      • Move Background data (1000G section) to here and include:
        • Limitations
      • Describe only very basic format of input, i.e. what file types are needed to run Snowflake. Where appropriate, perhaps how (generally) to create them.
    • Discussion/conclusion

Saved for next draft

  • Properly describe format of outputs: what types of files and why.
  • Linkage disequilibrium
  • (Optional) Phenotypes where haplotype is not how things are clustering versus where they are
    • (EDA) Number of missense variants after filtering from Consequence file
      • EDA:
        • Distribution of number of SNPs per phenotype (ALSPAC)
        • Distribution of SNP scores within phenotypes (violin plot with some examples).
  • (Optional) Athletes
  • (Optional) CAGI data set write up.
  • Clustering SNPs by phenotype
    • Creating the input scores
      • DcGO "Phenotypes" with weird combinations of phenotypes
      • DcGO prediction, where the SNP is in a gene which is not expressed in the tissue.
      • Effect of number of SNPs per phenotype on the sensitivity of the final score to the FATHMM score.
        • Choose a phenotype with many SNPs, randomly sample various numbers of them, and see how sensitive the results are.
    • Sensitivity of clustering score to background cohort
    • Dimensionality reduction (When is dimensionality reduction appropriate?)
      • Correlation between SNPs' FATHMM scores
      • Too many SNPs for a phenotype.
  • Results
    • EDA Predictions
      • Number of predictions per phenotype, for:
        • ALSPAC (histogram)
        • (Optional) CAGI (histogram)
        • (Optional) Genetrainer (will just be one number since one phenotype of interest).
    • Validation
      • Bootstrapping graph and ROC curve (showing that it doesn’t work overall)
        • For ALSPAC
        • (Optional) For CAGI
        • (Optional) For Genetrainer
    • Examples of predictions (ALSPAC), e.g.
      • re-finding known things
      • Show that single-SNP phenotypes get the "correct" result for people (SNPs).
      • Predictions that are made using information from non-human experiments
      • Predictions where you need a combination of SNPs for a trait.
      • Predictions that find new SNPs in a known gene
    • ALSPAC Missing data specifics

Checklist

  • Citations done
  • Cross-references to other parts of the book
  • Figures, captions and references in text done
  • TODOs for other sections based on this chapter (e.g. in intro/abstract/discussion) written.
  • Headings sorted (right amount of ## everywhere)
  • Index contains: contributions and publications
  • Index DOES NOT contain any subheadings

Not doing here

It's okay not to have finished:

  • some signposting text/cross-references to later chapters (since it will be easier to do this after later chapters are done).
  • nice-to-haves like epigraphs, fun asides, hand-illustrations
  • complete polish (i.e. of word choice, things like active/passive, decisions on how to refer to certain things, acronyms explained, etc)
  • Binder links: pretty sure I won't be able to make this work public.
  • Every single TODO

[FEATURE] (Automatic) PDF conversion

The translation to pdf with jupyter-book SORT OF works, but is gross and messy.

At the moment, my plan is to try to get it working at the very end.
My options for that are:

  • write a tidying script for the html, then convert to pdf from there (a rough sketch of this option follows the list).
  • convert to latex, then edit/write a tidying script, then convert to pdf.
  • hope and pray it all sort of works by the time I want to do it due to the magic of other people working hard on it.
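
A rough sketch of the first option in Python, assuming BeautifulSoup is available; the selectors below are placeholders, not the real Jupyter Book page structure:

  # Hypothetical tidying pass over the built HTML before handing it to a PDF converter.
  from pathlib import Path
  from bs4 import BeautifulSoup

  def tidy_page(path: Path) -> None:
      soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
      # Drop web-only furniture that makes no sense in a PDF (placeholder selectors).
      for selector in ["nav", ".bd-sidebar", ".binder-button"]:
          for tag in soup.select(selector):
              tag.decompose()
      path.write_text(str(soup), encoding="utf-8")

  if __name__ == "__main__":
      for page in Path("_build/html").rglob("*.html"):
          tidy_page(page)

The LaTeX route would look similar, but with string edits over the generated .tex files instead of HTML parsing.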

Chapter 2: Biology

  • Check order
    • dna
    • rna
    • protein
    • phenotype

If relevant: describe transposable elements.


Decided not to do:

  • Move to a section of C3

[MILESTONE CHECKLIST] 4. Finished draft

At the end of this checkpoint the thesis should be basically completely done and ready to submit, except for any outstanding frills.

  • Put this repo online and make sure book renders online properly
    • Add GH actions to make the repo automatically update to the gh-pages branch
    • Added .nojekyll and redirect index.html to gh-pages branch.

[MILESTONE CHECKLIST] 2. First pass polish of chapter 8

Chapter 8: Combining gene expression data sets

This is a milestone rather than just an issue since this will be the first chapter that is properly finished and that I will commit to not refactoring, changing, or adding to... and that's going to be a big deal: deep breath.

Preparations

  • Read through and plan (#38)

Chapter to-do

  • A once-over of the structure and quality of the written sections.
    • Introduction
    • Background
    • Ontolopy (#33, )
    • Future work
  • All research for this chapter finished/in ipynb/reproducible:
    • Data acquisition (#39, )
    • Data wrangling (#34)
    • Results
    • Moved to appendix: Test data simulation (add as separate R ipynb?)
    • Moved to future work: Batch Correction

Checklist

  • Citations done
  • Cross-references to other parts of the book
  • Figures, captions and references in text done
  • TODOs for other sections based on this chapter (e.g. in intro/abstract/discussion) written.
  • Headings sorted (right amount of ## everywhere)

Not doing here

It's okay not to have finished:

  • some signposting text/cross-references to later chapters (since it will be easier to do this after later chapters are done).
  • nice-to-haves like epigraphs, fun asides, hand-illustrations
  • complete polish (i.e. of word choice, things like active/passive, decisions on how to refer to certain things, acronyms explained, etc)
  • Binder links: since I should try an easier (non-R) chapter first.
  • Interactive figure captions
  • IRkernel figure captions
  • Every single TODO

[MIGRATE] Migrate Filter Chapter

Sections:

  • Toc
  • index
  • introduction
  • data
  • methods
  • results
  • discussion
  • future work

Things to check:

  • figures
  • citations
  • cross-refs
  • asides
  • TODOs
  • title

[MIGRATE] MAPS section

  • Add section 5.5 from the Google drive to the book as a jupyter notebook (written in jupytext?)
  • Interactivity/binder
  • Citations
  • Book structure (config)
  • Figures/captions

[MIGRATE] Combining data sets chapter

Sections migrated:

  • index/motivation/results
  • bg
  • input data
  • data wrangling
  • uberon_py
  • batch correction
  • results

Final Checklist (all sections):

  • TOC
  • figures (not really because most need to be in ipynb, etc)
  • citations
  • cross-refs
  • title

Combining: 2-data.ipynb (in R)

  • Update the install.R (for GH actions)
  • Explain how data sets were chosen and add code
  • Explain how data sets were downloaded and add code
  • Hide code cells by default
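
A minimal sketch of the last item, assuming the hiding is done with Jupyter Book's "hide-input" cell tag; processing the notebook with nbformat like this is one possible approach, not necessarily how it was actually done:

  # Tag every code cell so Jupyter Book collapses the input by default.
  import nbformat

  nb = nbformat.read("2-data.ipynb", as_version=4)
  for cell in nb.cells:
      if cell.cell_type == "code":
          tags = cell.metadata.setdefault("tags", [])
          if "hide-input" not in tags:
              tags.append("hide-input")
  nbformat.write(nb, "2-data.ipynb")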

[BUG] GH Pages Action broken

GH Action not updating website :(

The gh-pages peaceiris action is the bit that's failing:

  [INFO] ForceOrphan: false
  /usr/bin/git clone --depth=1 --single-branch --branch gh-pages ***github.com/NatalieThurlby/phenotype_from_genotype.git /home/runner/actions_github_pages_1639098542290
  Cloning into '/home/runner/actions_github_pages_1639098542290'...
  /usr/bin/git rm -r --ignore-unmatch *
  rm '.buildinfo'
  rm '.nojekyll'
  rm '_downloads/0f6968c734eb0e3592fbb93afaaafdcf/make_dcgo_input.py'
  rm '_downloads/1d5aed2e2ac99c61648969ced4c4bf8c/create-base-simulated-counts.R'
  rm '_downloads/21b06bbc87849cd431043497ef629dae/fasta_to_uniprot.py'
  rm '_downloads/413ee71f781407eaac54a462330a3b01/download-combined.R'
  rm '_images/alignment.png'
  rm '_images/amino_acid_ribosome.png'
  rm '_images/blank.png'
  rm '_images/bristol-alt-crest-red.png'
  rm '_images/bristol-crest.png'
  rm '_images/classifies-people.png'
  rm '_images/clustering_comparison.png'
  rm '_images/clustering_snps.png'
  rm '_images/curse-of-dimensionality.png'
  rm '_images/de_novo_assembly.png'
  rm '_images/difficult-to-understand.png'
  rm '_images/dna-both.png'
  rm '_images/filip.png'
  rm '_images/filip_bootstrap.png'
  rm '_images/general-hazard.png'
  rm '_images/go_rilla.png'
  rm '_images/interesting-scores.png'
  rm '_images/lacks-community.png'
  rm '_images/lego.png'
  rm '_images/linear_metric.png'
  rm '_images/linneaus_ehret.png'
  rm '_images/mendel.png'
  rm '_images/misuse.png'
  rm '_images/mycobacterium_tuberculosis_Yyz.png'
  rm '_images/myobacterium_tuberculosis_rKb.png'
  rm '_images/nonlinear_metric.png'
  rm '_images/ontolopy_logo.png'
  rm '_images/p_hacking.png'
  rm '_images/pca_real.png'
  rm '_images/privacy.png'
  rm '_images/reinforce-bias.png'
  rm '_images/revigo_filip_wrong_cafa2.png'
  rm '_images/shaded_score.png'
  rm '_images/snowflake-overview-new.png'
  rm '_images/tissues_HPA.png'
  rm '_images/transcription.png'
  rm '_images/venn_brain.png'
  rm '_panels_static/panels-main.c949a650a448cc0ae9fd3441c0e17fb0.css'
  rm '_panels_static/panels-variables.06eb56fa6e07937060861dad626602ad.css'
  rm '_sources/c0-front-matter/01-title-page.md'
  rm '_sources/c0-front-matter/02-declaration.md'
  rm '_sources/c0-front-matter/03-abstract.md'
  rm '_sources/c0-front-matter/04-acknowledgements.md'
  rm '_sources/c0-front-matter/05-full-table-of-contents.md'
  rm '_sources/c01-introduction/intro.md'
  rm '_sources/c02-biology-background/0-index.md'
  rm '_sources/c02-biology-background/1-big-questions.md'
  rm '_sources/c02-biology-background/2-biological-molecules.md'
  rm '_sources/c02-biology-background/3-more-dna.md'
  rm '_sources/c02-biology-background/4-more-proteins.md'
  rm '_sources/c02-biology-background/5-phenotype.md'
  rm '_sources/c02-biology-background/6-summary.md'
  rm '_sources/c03-compbio-background/0-index.md'
  rm '_sources/c03-compbio-background/1-sequencing-technology.md'
  rm '_sources/c03-compbio-background/2-measuring-genotype-phenotype.md'
  rm '_sources/c03-compbio-background/3-ontologies.md'
  rm '_sources/c03-compbio-background/4-comp-bio-methods.md'
  rm '_sources/c03-compbio-background/5-bias.md'
  rm '_sources/c03-compbio-background/6-pqi.md'
  rm '_sources/c03-compbio-background/7-summary.md'
  rm '_sources/c04-snowflake/0-index.md'
  rm '_sources/c04-snowflake/1-introduction.md'
  rm '_sources/c04-snowflake/2-snowflake-algorithm.md'
  rm '_sources/c04-snowflake/3-creating-inputs.md'
  rm '_sources/c04-snowflake/4-preprocessing.md'
  rm '_sources/c04-snowflake/5-clustering-snps.md'
  rm '_sources/c04-snowflake/7-discussion.md'
  rm '_sources/c05-alspac/0-index.md'
  rm '_sources/c05-alspac/1-introduction.md'
  rm '_sources/c05-alspac/5-discussion.md'
  rm '_sources/c06-filter/0-index.md'
  rm '_sources/c06-filter/1-introduction.md'
  rm '_sources/c06-filter/2-algorithm.md'
  rm '_sources/c06-filter/3-data.md'
  rm '_sources/c06-filter/5-methods.md'
  rm '_sources/c06-filter/6-results.md'
  rm '_sources/c06-filter/7-discussion.md'
  rm '_sources/c07-ontolopy/0-index.md'
  rm '_sources/c07-ontolopy/1-introduction.md'
  rm '_sources/c07-ontolopy/2-functionality.md'
  rm '_sources/c07-ontolopy/3-how-it-works.md'
  rm '_sources/c07-ontolopy/4-misc-examples.md'
  rm '_sources/c07-ontolopy/5-mapping-example.md'
  rm '_sources/c07-ontolopy/6-discussion.md'
  rm '_sources/c07-ontolopy/7-future-work.md'
  rm '_sources/c08-combining/0-index.md'
  rm '_sources/c08-combining/1-background.md'
  rm '_sources/c08-combining/2-data.md'
  rm '_sources/c08-combining/3-data-wrangling.md'
  rm '_sources/c08-combining/7-discussion.md'
  rm '_sources/c09-conclusion/0-conclusion.md'
  rm '_sources/cz-end-matter/0-appendix.md'
  rm '_sources/cz-end-matter/reference.md'
  rm '_sources/jupyter_book_intro.md'
  rm '_static/__init__.py'
  rm '_static/__pycache__/__init__.cpython-37.pyc'
  rm '_static/basic.css'
  rm '_static/check-solid.svg'
  rm '_static/clipboard.min.js'
  rm '_static/combining_funnel_interactive.html'
  rm '_static/copy-button.svg'
  rm '_static/copybutton.css'
  rm '_static/copybutton.js'
  rm '_static/copybutton_funcs.js'
  rm '_static/css/index.c5995385ac14fb8791e8eb36b4908be2.css'
  rm '_static/css/theme.css'
  rm '_static/custom.css'
  rm '_static/doctools.js'
  rm '_static/documentation_options.js'
  rm '_static/favicon.png'
  rm '_static/file.png'
  rm '_static/images/logo_binder.svg'
  rm '_static/images/logo_colab.png'
  rm '_static/images/logo_jupyterhub.svg'
  rm '_static/jquery-3.5.1.js'
  rm '_static/jquery.js'
  rm '_static/js/index.1c5a1a01449ed65a7b51.js'
  rm '_static/language_data.js'
  rm '_static/minus.png'
  rm '_static/mystnb.css'
  rm '_static/panels-main.c949a650a448cc0ae9fd3441c0e17fb0.css'
  rm '_static/panels-variables.06eb56fa6e07937060861dad626602ad.css'
  rm '_static/plus.png'
  rm '_static/pygments.css'
  rm '_static/searchtools.js'
  rm '_static/sphinx-book-theme.12a9622fbb08dcb3a2a40b2c02b83a57.js'
  rm '_static/sphinx-book-theme.css'
  rm '_static/sphinx-book-theme.e8f53015daec13862f6db5e763c41738.css'
  rm '_static/sphinx-thebe.css'
  rm '_static/sphinx-thebe.js'
  rm '_static/thesis.pdf'
  rm '_static/thesis_logo_with_text.png'
  rm '_static/togglebutton.css'
  rm '_static/togglebutton.js'
  rm '_static/underscore-1.12.0.js'
  rm '_static/underscore.js'
  rm '_static/vendor/fontawesome/5.13.0/LICENSE.txt'
  rm '_static/vendor/fontawesome/5.13.0/css/all.min.css'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.eot'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.svg'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.ttf'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-regular-400.eot'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-regular-400.svg'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-regular-400.ttf'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-regular-400.woff'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-regular-400.woff2'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.eot'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.svg'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.ttf'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff'
  rm '_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2'
  rm '_static/webpack-macros.html'
  rm 'c0-front-matter/01-title-page.html'
  rm 'c0-front-matter/02-declaration.html'
  rm 'c0-front-matter/03-abstract.html'
  rm 'c0-front-matter/04-acknowledgements.html'
  rm 'c0-front-matter/05-full-table-of-contents.html'
  rm 'c01-introduction/intro.html'
  rm 'c02-biology-background/0-index.html'
  rm 'c02-biology-background/1-big-questions.html'
  rm 'c02-biology-background/2-biological-molecules.html'
  rm 'c02-biology-background/3-more-dna.html'
  rm 'c02-biology-background/4-more-proteins.html'
  rm 'c02-biology-background/5-phenotype.html'
  rm 'c02-biology-background/6-summary.html'
  rm 'c03-compbio-background/0-index.html'
  rm 'c03-compbio-background/1-sequencing-technology.html'
  rm 'c03-compbio-background/2-measuring-genotype-phenotype.html'
  rm 'c03-compbio-background/3-ontologies.html'
  rm 'c03-compbio-background/4-comp-bio-methods.html'
  rm 'c03-compbio-background/5-bias.html'
  rm 'c03-compbio-background/6-pqi.html'
  rm 'c03-compbio-background/7-summary.html'
  rm 'c04-snowflake/0-index.html'
  rm 'c04-snowflake/1-introduction.html'
  rm 'c04-snowflake/2-snowflake-algorithm.html'
  rm 'c04-snowflake/3-creating-inputs.html'
  rm 'c04-snowflake/4-preprocessing.html'
  rm 'c04-snowflake/5-clustering-snps.html'
  rm 'c04-snowflake/7-discussion.html'
  rm 'c05-alspac/0-index.html'
  rm 'c05-alspac/1-introduction.html'
  rm 'c05-alspac/5-discussion.html'
  rm 'c06-filter/0-index.html'
  rm 'c06-filter/1-introduction.html'
  rm 'c06-filter/2-algorithm.html'
  rm 'c06-filter/3-data.html'
  rm 'c06-filter/5-methods.html'
  rm 'c06-filter/6-results.html'
  rm 'c06-filter/7-discussion.html'
  rm 'c07-ontolopy/0-index.html'
  rm 'c07-ontolopy/1-introduction.html'
  rm 'c07-ontolopy/2-functionality.html'
  rm 'c07-ontolopy/3-how-it-works.html'
  rm 'c07-ontolopy/4-misc-examples.html'
  rm 'c07-ontolopy/5-mapping-example.html'
  rm 'c07-ontolopy/6-discussion.html'
  rm 'c07-ontolopy/7-future-work.html'
  rm 'c08-combining/0-index.html'
  rm 'c08-combining/1-background.html'
  rm 'c08-combining/2-data.html'
  rm 'c08-combining/3-data-wrangling.html'
  rm 'c08-combining/7-discussion.html'
  rm 'c09-conclusion/0-conclusion.html'
  rm 'cz-end-matter/0-appendix.html'
  rm 'cz-end-matter/reference.html'
  rm 'genindex.html'
  rm 'index.html'
  rm 'jupyter_book_intro.html'
  rm 'objects.inv'
  rm 'search.html'
  rm 'searchindex.js'
  [INFO] first deployment, create new branch gh-pages
  ENOENT: no such file or directory, scandir '/home/runner/work/phenotype_from_genotype/phenotype_from_genotype/_build/html'
  /usr/bin/git init
  Reinitialized existing Git repository in /home/runner/actions_github_pages_1639098542290/.git/
  /usr/bin/git checkout --orphan gh-pages
  fatal: A branch named 'gh-pages' already exists.
  Error: Action failed with "The process '/usr/bin/git' failed with exit code 128"

Fix broken link(s) and add test

On the intro page, the link to the website is broken because of the nataliethurlby -> nataliezelenka issue (thanks to Susana for letting me know)

I think that JupyterBook has a test to check if links resolve - I should add that to the CI!

  • fix known broken links
  • run the Jupyter Book link checker
  • fix new broken links
  • add Jupyter Book link checking to CI
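
A minimal sketch of the CI step, assuming the linkcheck builder exits non-zero when a link fails to resolve; the book path is a placeholder:

  # CI gate: build with the linkcheck builder and fail the job on broken links.
  import subprocess
  import sys

  result = subprocess.run(["jupyter-book", "build", ".", "--builder", "linkcheck"])
  sys.exit(result.returncode)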

[MILESTONE CHECKLIST] 1. Html book locally built

To-do:

  • Organise GitHub
    • Plan milestones (at least up to the one after this)
    • Change branch name to main
    • Feature branches for chapters

  • Intro page (#2, #10)
  • Front Matter (#6, #9)
  • Chapters
    • Chapter 1: Introduction (#11 ,#12)
    • Chapter 2: Biological background (#3, #13 + #20)
    • Chapter 3: Computational biology background (#15, #21)
    • Chapter 4: Phenotype predictor (#16, #31)
    • Chapter 5: Combined dataset (#17, #28)
    • Chapter 6: Filter (#18, #29)
    • Chapter 7: Conclusions (#19, #29)
  • References (, #12)

This milestone does not:

  • include improvements to any chapter's content.
  • interactive content

[MILESTONE CHECKLIST] 6. Before viva

I could continue to:

  • Work on tools websites/documentation/preprints/papers.
  • Add illustrations.
  • Write preprints/submit to journals.
  • Submit to CAGI6
  • Move following suggestions to Ontolopy repo:
    • Add Continuous Integration, with testing on different platforms.
    • Update docstrings so that they have:
      • Return and Rtype formatted nicely
      • Nice examples baked in (look at Pandas examples)
  • Decide if there's anything else (e.g. appendix stuff/illustrations/preprints/CAGI6/filip release that I'd like to do before viva)

[FEATURE] Full table of contents

Full table of contents:

  • html
    • not interfering with the sidebar.
  • pdf

In #6, I've noted a couple of different things I've tried so far to get this to work, and plans for how to get it to work.

Complexity science ideas

Complexity science related ideas:

  • Information content (Shannon entropy) of tissue-specific gene expression (per gene/per pathway/overall) (filter chapter); a toy sketch follows this list.
  • PPI networks per tissue type? Same underlying connections (presumably, but different weightings to up/down regulation). For a subset of genes(?) i.e. associated with a pathway. E.g. a pathway with high information content or an interesting gene/protein (predicted by dcGO/FATHMM to be interesting in a structural way, but not expressed in the tissue of interest).
  • Simulate PPI networks for 2 tissue types for that pathway (a relevant one and an irrelevant one, e.g. endocrine gland and skin if hormone related)
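
A toy sketch of the first idea (per-gene Shannon entropy over tissues); the expression matrix and tissue names below are made up for illustration:

  # Per-gene Shannon entropy of expression across tissues:
  # low entropy = tissue-specific, high entropy = broadly expressed.
  import numpy as np
  import pandas as pd

  expression = pd.DataFrame(
      {"liver": [10.0, 0.0, 5.0], "brain": [10.0, 90.0, 5.0], "skin": [10.0, 10.0, 0.0]},
      index=["gene_a", "gene_b", "gene_c"],
  )

  # Normalise each gene to a distribution over tissues, then H = -sum(p * log2(p)),
  # treating zero-expression tissues as contributing 0.
  p = expression.div(expression.sum(axis=1), axis=0)
  entropy = -(p * np.log2(p.where(p > 0))).sum(axis=1)
  print(entropy)

The same calculation could be taken over a pathway's genes, or over the whole matrix, for the per-pathway/overall versions.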

Main checklist

Current to-dos:

  • Get Ontolopy chapter rendering on web
    • Try new Ontolopy release (this worked) #51
  • Check for any VIP todo comments
  • Pull in the big PR
  • Check the license for each section, and be explicit about what it covers, e.g. code/text especially Snowflake
  • pin requirements to exact versions using pigar

Filip: release

Functionality:

  • Options for GO annotations (latest, or some timestamp'd releases)
  • Options for RNA-seq datasets (FANTOM/individual/Combined)
  • Options for types of terms to run (e.g. part of GO tree).

Infrastructure:

  • Tests
  • Roadmap
  • Docs
  • License
  • Deployed website
  • Pypi packaging

Writing

  • Chapter

(Bear in mind writing:)

  • JOSS submission

Detailed plan for Snowflake Chapter

Come up with a plan for the Snowflake Chapter: what should be in there and which of these are necessary versus nice. These can be ticked when they have a TODO in the document. The less important TODOs (that don't get done) can be moved to Future Work when finishing up the draft:

  • Data (decide where these graphs/examples are coming from and write up):
    • ALSPAC data set write up.
      • Background: what is the data set
      • Raw data (genotype + phenotype), ethics, sequencing type
      • Creating the inputs
      • Missing data
      • EDA:
        • Distribution of number of SNPs per phenotype (ALSPAC)
        • Distribution of SNP scores within phenotypes (violin plot with some examples).
    • 2500 Genomes set write up
    • (Optional) 23andMe data set write up/athletes
    • (Optional) CAGI data set write up.
  • Clustering SNPs by phenotype
    • Creating the input scores
      • DcGO "Phenotypes" with weird combinations of phenotypes
      • DcGO prediction, where the SNP is in a gene which is not expressed in the tissue.
      • Effect of number of SNPs per phenotype on the sensitivity of the final score to the FATHMM score.
        • Choose a phenotype with many SNPs, randomly sample various numbers of them, and see how sensitive the results are.
    • Sensitivity of clustering score to background cohort
    • Dimensionality reduction (When is dimensionality reduction appropriate?)
      • Correlation between SNPs' FATHMM scores
      • Too many SNPs for a phenotype.
  • Results
    • EDA Predictions
      • Number of predictions per phenotype, for:
        • ALSPAC (histogram)
        • (Optional) CAGI (histogram)
        • (Optional) Genetrainer (will just be one number since one phenotype of interest).
    • Validation
      • Bootstrapping graph and ROC curve (showing that it doesn’t work overall)
        • For ALSPAC
        • (Optional) For CAGI
        • (Optional) For Genetrainer
    • Examples of predictions (ALSPAC), e.g.
      • re-finding known things
      • Show that single-SNP phenotypes get the "correct" result for people (SNPs).
      • Predictions that are made using information from non-human experiments
      • Predictions where you need a combination of SNPs for a trait.
      • Predictions that find new SNPs in a known gene
  • Discussion:
    • Linkage disequilibrium
    • (Optional) Phenotypes where haplotype is not how things are clustering versus where they are

Also any setup/admin:

  • Set up the ipynb with jupytext myst md paired (a rough sketch follows this list).
  • Repository:
    • Check if I have an existing repo or not. (I DO NOT)
    • Create repository: BE MINIMAL. This will be private anyway. No license (yet).
      • Very basic README.
      • Directory structure
  • Check what Jan discussed to add things to plan
  • Update issue #41 with all the to-dos
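
For the jupytext pairing item above, a minimal sketch assuming the jupytext CLI is on the path; the notebook name is a placeholder:

  # Pair the notebook with a MyST markdown file; `jupytext --sync` keeps them in step.
  import subprocess

  subprocess.run(
      ["jupytext", "--set-formats", "ipynb,md:myst", "analysis.ipynb"],
      check=True,
  )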

Combining data: data wrangling

  • Move from Rmd (and existing notebooks) to ipynb and link with jupytext to myst md
    • combine-design
    • process-fantom
    • basic-combining
  • Data cleaning pipeline figure
  • Funnel plot

Finish Ontolopy

Ontolopy polished into a proper package, properly released on PyPI with a Sphinx website.

  • Basic documentation (sphinx)
  • Roadmap
  • Tests (minimal)
  • Website (sphinx)

Also:

  • Re-write up the chapter section, linking to the website.

[FEATURE] Publications section

  • Migrate existing publications section to the end matter section (Markdown below)
  • Write preprints for computational stuff and submit to bioRxiv/JOSS (and use Zenodo/GitHub for citing stuff).
  • Add preprint section

Publications

Publications which I have contributed to during this PhD are appended here. This includes the following three peer-reviewed papers. The first two represent collaborative efforts within the Gough Group:

  • The SUPERFAMILY 1.75 database in 2014: a doubling of data
  • A Proteome Quality Index

The third represents an international collaborative effort to predict protein function from genotype (CAFA), in which I participated by developing a protein function predictor and applying it to the challenge data:

  • The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

In addition, I attach the phenotype predictor patent, to which I contributed development and implementation of the methodology, as well as writing:

  • Determining Phenotype from Genotype

And any preprints?

Corrections: list

  • Abstract
    • Simplify line “In our cells, proteins are constantly being created and are degrading, and are accumulating or interacting to produce the phenotypes that we see at a larger scale: height, levels of enzymes in blood, diseases”
    • Clarify “The versions of the proteins that it is possible for an organism to produce are determined primarily by its protein coding DNA, while the selection of possible proteins that are actively produced in each cell are determined by the environment of the particular cell at each time”
  • Chapter 2:
    • Move Section “2.1.2. The future computational biologists want” to later part of the chapter.
      • Moved to 2.5 "Phenotype"
    • In the fourth para below fig 2.3 “post‑transcriptional modificationssuch as splicing” please make the spacing between “post‑transcriptional modifications” and “such” prominent. Also add RNA editing and RNAi as modes of post-transcriptional regulations.
    • In the section “2.2.1.3. “RNA makes Proteins”, a.k.a. Translation” it may be worth mentioning landmark work of Christian Anfinsen about how sequence of amino acid strings of protein acts as a “code” to precisely determine three-dimensional structure of the protein.
    • In 2.2.1.4 “a different amino acid in a hormone protein could cause the protein to be expressed differently” could change it to “a different amino acid in a hormone protein could cause the protein to be expressed or function differently”
    • In 2.3.2 “(protein‑coding nucleotides)” to “(protein‑coding part of DNA)”; there is repeating word: “in in”, please modify it.
    • In 2.3.3.1. “A gene for X” what does “the same gene can make multiple different proteins” mean is not clear, does it mean different isoforms of the protein?
    • In 2.3.6.2 “However, synonymous SNVs could still have an effect on highlevel traits, since different nucleotides are translated at different speeds.” Here translation at different speed can have effect on both folding and abundance of protein (Kimchi-Sarfaty et al, 2007, Science) Please add few sentences along these lines.
    • It is not clear at many place what is different proteins being referred that is encoded by the same gene? It means protein isoforms or something else?
    • make the motivation for your work stronger
      • Did this more in the Chapter 2 and Chapter 3 summaries
    • in general, needs further references to relevant works
  • Chapter 3:
    • There seems to be an active and passive voice mix-up (e.g. 3.2.1); please follow one mode, at least within a chapter.
    • In 3.2.1.1 “Whole genomes for different organisms can be compared to one another to give us insight about the organisms, or within an organism, individuals can be compared to understand the importance of sections of DNA for that organism.” Not clear what the term plural organisms mean here?
    • In this section there should be a clear demarcation between whole genome assembly, gene annotation and then variants calling; there appears to be some mix up for me that impedes the smooth flow of information content.
      • Went over the signposting information at the start of each section to clearly describe where this information is, but still wanted to keep the symmetry between Chapter 2 and 3.
    • Possibly mention about OMIM and HGMD databases
    • 3.2.2.2 “Measures of mRNA abundance (i.e. gene expression data) are generally considered the best measures of translation (compared to protein abundance for example), and therefore the best data to tell us how DNA’s blueprints are being used in different scenarios” Appears contrary to the general belief that protein abundance is in general a better measure. In situations where protein abundances are not easily measurable or trackable, mRNA expression can be used as a good proxy for protein abundance. In fact your later aside “Gene Expression and Protein Abundance data” clearly reflects this.
    • 3.2.4. Phenotypes Please provide smooth link between this section to the next “connecting genotypes and phenotypes”.
    • What is the link between 3.3 and 3.4 is it clear?
      • the computational methods (3.4) often use the ontologies and databases described in the previous sections (3.3 and previous). I've added a sentence to signpost this better at the start of 3.4.
    • Section 3.5 can be moved towards the end of the chapter before 3.7 summary
      • I didn't do this because 3.5 (which mentions the different sources of bias in computational biology) motivates 3.6 (which introduces a project that I worked on - PQI - which aims to combat this), but I made this clearer to the reader.
    • Also in the summary it is foremost importance to highlight the core of the chapter genotypes and phenotypes, and linking them and related data sources. The description about bias, potential statistical pitfalls can be mentioned later.
      • Reinforced mention of core information and related bias to the work done later.
    • be more specific about your contribution
      • My contributions are listed in a yellow box at the start of the PQI section and at the start of the Chapter.
    • the superfamily section needs to be expanded
      • expanded substantially
  • Chapter 4:
    • #63
      • Yes, this is described in section 4.6.1.3. But I now signpost it earlier (4.1 "Introduction" and 4.2.1 "Approach") by describing that Snowflake is not meant to perfectly predict all phenotypes, but to uncover mechanisms for some phenotypes.
    • 4.2.2.2. Restricted phylogeny could have been better for deleterious variant predictions?
    • 4.2.2.3 and 4.2.2.4 Schematic illustration of detailed steps would have been very helpful.
    • Did you try different clustering methods and check of consistency?
      • Describe: yes. The clustering methods did give quite different results. But this would be expected.
    • Did you use different distance measure and figure best performing one? Or the Euclidean measure was the only choice?
      • Describe: yes. Tried euclidean as well as what I actually did, which was not euclidean. But can only motivate choice theoretically because there isn't the data to trial lots of different things.
    • Why did you not use UK biobank data instead of 1000 genomes data?
      • Describe: wasn't available.
    • If you explain in more detail how your contribution has significantly enhanced the snowflake it would have been excellent.
    • What is typical range of phenotype score?
    • What is max and min in your application across datasets? Although would depend on the dataset, can you provide a flavor for a typical range?
    • insert pseudocode (4.2.2)
    • compare with new version of snowflake (if it’s available)
      • Was not available
    • expand the metrics/clustering part to better justify choice of dataset
    • the actual results section is very short, need to be expanded and more detailed
  • Chapter 5
    • Application of snowflake to ALSPAC
    • What is current state of application of snowflake to ALSPAC
    • In section “5.2.1. Selection of phenotypes” limitation of snowflake could be due to relatively limited data? Or mutations in regions beyond domains? Did you check?
    • “In selecting phenotypes, I considered only (1) whether Snowflake considered these to be phenotypes where it could make a confident prediction and (2) whether the phenotypes in ALSPAC could be used to validate this prediction. I did not consider additional information that might indicate whether these were phenotypes we might expect to be able to predict, for example, whether these phenotypes were heritable, or consider whether they are desirable to predict. Since I chose these purely by looking at the distribution of scores for Snowflake, our lack of promising results could be an indication that the phenotype‑score (finding interesting distributions of phenotypes) is unsuccessful.” This para is not clear to me. Can you explain?
    • In terms of snowflake application to ALSPAC. Did you consider randomizing the data or generate hypothetical random data and apply snowflake and compare the phenotype score with the ones you got for the original application of snowflake to ALSAPAC?
    • extremely short, consider merging with Chapter 4
  • Chapter 6:
    • Integration of gene expression (tissue-specific) did improve genotype-phenotype prediction, but to what extent?
    • Why were GTEx datasets not considered?
    • How does isoform expression factor into this equation?
    • Why was proteomics data not considered?
    • How did you deal with expression data supported by multiple different studies (biological replicates), as against that supported by a limited number of studies or samples?
    • I have an issue with placing complete belief in Uberon, as gene expression is far more prone to rapid rewiring/reprogramming compared to protein-coding regions.
    • 6.5.1. CAFA 2 Fmax appears quite low (extrapolating from machine learning studies)
    • it might be worth discussing how much the results depend on the chosen dataset, and/or on the use of DcGO.
  • 7. Ontolopy
    • Very interesting package that can be used to glean data from OBO files and manipulate them in a customized form. Can Ontolopy be used to build knowledge graphs? Or can it be enhanced in future to do so? (Yes, very similar)
    • Like gene ontologies, are there one-to-many mappings? In that case, is it possible to glean the most relevant mapping in a context-specific manner?
    • How reliable are uberon to sample mappings in general? Are there some examples to clearly demonstrate this?
  • 8. Combining RNA‑seq datasets
    • Are there specific demonstrable benefits of combining RNA‑seq datasets at the data level rather than at the primary-results level? These data might have been acquired in distinct conditions from slightly distinct samples. Are the data being treated either as biological or technical replicates?
    • Does combining gene expression improve correlation with protein abundance?
  • Misc
    • Add page numbers

[MILESTONE CHECKLIST] 3. Migrate and first-pass polish all interactive content

Migrate all interactive content (jupyter notebooks) for the following chapters (in this order):

  • Chapter 6: Filter
    • CAFA3 version
    • Discussion
    • Future work
  • Chapter 3: Compbio background + reproducibility
  • Chapter 4: Snowflake
    • Plan chapter in detail (#40, )
    • Draft chapter (#41,)

This includes:

  • Migrating the actual code (non-code sections should be finished as part of milestone 1.)
  • Finishing the research for each of these chapters. Each notebook should be self-contained and finished.
  • Finalising a sensible jupyter book sidenav (config) layout, and the layout/story for each chapter.
  • Finalising the text in these chapters (for example, rewriting the compbio background chapter, which has a lot of non-interactive content)
  • Citations/Figures/Cross-refs
  • TODOs for discussion/intro/abstract

This milestone does not include:

  • Adding tests or other infrastructure for the interactive content, except for any that absolutely MUST be written to migrate/understand it.
  • Adding additional research (extra notebooks can be added later if they are absolutely needed).
  • Migrating chapters 1, 2, and 7 (they don't include any code), or chapter 5 (should be done before this milestone (#25))
  • Nice-to-haves like hand-drawn illustrations, asides, epigraphs, etc.

Decided not to do (for now)

  • filip
    • Release and writeup as a tool #35
    • Include combined dataset in filip
  • Modelling study of genes
    • Distribution of publications over genes (#22, ) & How does this affect Filip?
    • How complete is the gene ontology? (#23, ) & How does this affect Filip?
  • MAPS (#24, ) & how does this affect Computational Biology? I.e. do a multiverse of a comp bio question?

[MILESTONE CHECKLIST] 5. Ready to submit

Finishing touches

  • Built PDF and checked, any necessary frills added:
    • TOC (#7, )
    • Make PDF downloadable on GitHub/Jupyter Book and make sure that examiners know where to go for latest version in submitted PDF.
  • Once overs (read-through and make minor fix notes).
    • Intro to jupyter-book (how it works/why I did it)
    • Front matter
      • Update word count
      • Update date
      • Do something with signature declaration?
    • Chapter 1
    • Chapter 2
    • Chapter 3
    • Chapter 4
    • Chapter 5
    • Chapter 6
    • Chapter 7
  • Consistency checks.
    • Check names of software are written properly (e.g. Jupyter Book) and cited.
    • Check how I refer to chapters - what case: do I say "Chapter 1", "this Chapter", or "this chapter"?
    • Make sure acronyms have been defined at first use and added to a glossary if I have one
    • Check external links resolve: jupyter-book build mybookname/ --builder linkcheck
  • Credit
    • Permission to reproduce images
    • All software cited
      • IRKernel
      • JupyterBook
      • Jupyter Notebook
      • Binder/MyBinder

Unnecessary details

(Deciding not to do anything in this section is as much of a success as doing it and I get to tick it)

  • Illustrations
    • Alternative bristol logo with aside about knobheads.
  • Glossary
  • Interactivity
  • Bibtex faffing
    • Add Patent type
    • Wayback machine backup web links
  • Asides about knobhead eugenicists, e.g. fischer

Organisation

  • Organised viva
  • Submitted that mofo
  • Organised thank-yous for Kate/Patty, Oliver, Julian, etc.

Chapter 3: Computational Biology

  • Split up into three markdown documents for the three sections
  • Move the sequencing bit from Chapter 2 over to Chapter 3
  • Make sure that everything is in the right order and is signposted (from genotype -> phenotype), with environment interspersed where relevant (e.g. interaction of DNA with environment, RNA with environment, etc).
    • dna
    • rna
      • variant
    • protein
      • protein sequence
      • protein abundance
      • protein structure
      • protein classification
    • phenotype
  • ontology
    • why are ontologies useful?
    • gene ontology annotations
  • bias + error (explanation)
  • PQI
  • summary

Don't include yet:

  • final polish write summary (do that as part of final draft)
