2017-ievobio / organization

Logistical details, Suggestions for discussion topics, Agenda

organization's People

dwinter k8hertweck wrightaprilm klausvigo

organization's Issues

Software Bazaar: TaxonWorks

Authors: Matt Yoder, Species File Group, Collaborators
Source code: TaxonWorks
License: University of Illinois/NCSA Open Source License
Abstract: Biodiversity scientists deal with "broad" data, i.e. they record, analyze, archive and refine information on many different subjects. Integrating, unifying, and maintaining these data over long periods of time is a major concern. TaxonWorks is an open-source, web-based workbench that seeks in part to address these challenges. Backed by an endowment, we seek to surround the project with a community of developers and like-minded scientists to help it evolve over the long term. Various aspects of the software are specifically designed to encourage and facilitate contributions from new members, including unit tests, virtualization, and code generators. Configurable interfaces, cloud-based deployment, JSON-serving APIs, and a webpack-based JS app pipeline serve the needs of more advanced developers. TaxonWorks has core data concepts (e.g. specimens, images, matrices, sequences, references) that can be richly annotated and that are semantically extensible. Under development for around 2 years, the project is now on the cusp of being more broadly exposed, with the goal of rapidly growing the pool of contributors, whether they be developers, bug reporters, or users requesting and influencing new features.

Locus Tree Inference

Authors: Michał Ciach, Anna Muszewska, Paweł Górecki
Abstract: A common task in evolutionary biology is to explain incongruences between gene and species trees in terms of evolutionary events, like gene duplications, gene losses, and horizontal gene transfers. One of the most developed tools for this task is tree reconciliation, which "embeds" a gene tree into the species tree. One of its disadvantages is that reconciliation requires the user to specify the costs of evolutionary events, which are usually not known. An alternative is a manual analysis of trees; this is, however, very cumbersome when the trees are complex. We investigate the Locus Tree Inference problem, which highlights the origins of incongruences between the trees. This allows for a non-parametric solution, which facilitates manual analyses. The software is available here.
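
To make the idea of "embedding" a gene tree into a species tree concrete, here is a minimal sketch of the classic LCA mapping, which flags a gene-tree node as a duplication when it maps to the same species-tree node as one of its children. This is not the Locus Tree Inference method from the abstract. The nested-tuple tree encoding, the function names, and the convention that a gene named "A_1" belongs to species "A" are all assumptions made for illustration.

```python
def all_clades(tree, out):
    """Append the leaf set of every node in `tree` to `out`; return the root's leaf set."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(all_clades(child, out) for child in tree))
    out.append(clade)
    return clade


def lca_clade(species_set, species_clades):
    """Smallest species-tree clade that contains every species in `species_set`."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree,
                       species_of=lambda gene: gene.split("_")[0]):
    """Count gene-tree nodes whose LCA mapping equals that of one of their children."""
    species_clades = []
    all_clades(species_tree, species_clades)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca_clade(frozenset([species_of(node)]), species_clades)
        child_maps = [visit(child) for child in node]
        here = lca_clade(frozenset().union(*child_maps), species_clades)
        if here in child_maps:  # node and a child map to the same species node
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


# Hypothetical three-species example.
species_tree = (("A", "B"), "C")
congruent = (("A_1", "B_1"), "C_1")
incongruent = (("A_1", "C_1"), "B_1")
print(count_duplications(congruent, species_tree))
print(count_duplications(incongruent, species_tree))
```

The LCA mapping only counts duplications; a full reconciliation would also score losses and transfers, and that is where the user-supplied event costs mentioned above come in.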

BoF: Discovery and selection of code

How do we find software?
How do we select software?
How do you build software that is discoverable and reusable?
How do we judge quality?

Lightning talk: Progress towards more interoperable tree file formats

Authors: Daisie Huang, Arlin Stoltzfus, Michael Lynch Alfaro, Jaime Huerta Cepas, Jeet Sukumaran, Liam J. Revell, Lucas Czech, Damien de Vienne, Tim Vaughan, Jim Allman, David Maddison, Karen Cranston, Guangchuang Yu

Abstract: In May, a group of tree-visualization software developers and other interested parties met for a workshop to discuss interoperability in tree viewers and how we might move towards common file formats that preserve metadata and style in a consistent fashion. While we did not end up agreeing on a single tree file format, we did come up with some principles for how interoperable, interchangeable file formats should be structured and agreed to continue collaborating on developing such formats. I will present some of these ideas and discuss how existing file formats fit within this new conceptual framework.

Software Bazaar: phangorn

Authors: Klaus Schliep, Emmanuel Paradis, Leonardo de Oliveira Martins, Alastair Potts, Tim W. White, Cyrill Stachniss, Michelle Kendall
Source code: CRAN, github
License: GPL (>= 2)
Abstract: phangorn is an all-purpose phylogenetic R package for performing phylogenetic analyses such as maximum likelihood, maximum parsimony or distance-based methods. phangorn contains some low-level functions extending the ape package for tree manipulation or simulation of sequences. phangorn is designed to make it easy to exchange data with many other common R packages on CRAN or Bioconductor. A strength of the package is its tools for exploring phylogenetic incongruence using different tree metrics (e.g. Kuhner-Felsenstein, (weighted) Robinson-Foulds, SPR distance) and for visualizing conflicting signals in the form of splits networks. Five vignettes describing common workflows, along with more information and examples, are available at http://www.phangorn.org/.
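
As an illustration of one of the tree metrics named above, the following is a minimal sketch of the unweighted Robinson-Foulds distance: the number of non-trivial splits (bipartitions of the leaf set) found in one tree but not the other. It is plain Python, not phangorn's R implementation. The nested-tuple tree encoding and the function names are assumptions made for this example, and both trees are assumed to share the same leaf labels.

```python
def leaf_sets(tree, out):
    """Append the leaf set of every internal node to `out`; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial splits of the tree, each stored as the side without the anchor leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed reference leaf so each split has one canonical form
    result = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on both sides.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon trees that disagree about the placement of B and C.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_1, tree_2))
```

Because splits ignore the position of the root, two rooted encodings of the same unrooted topology give a distance of zero.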

What's in the README?

Given we'll have a number of folks who are new to GitHub, I'm thinking the first bit of the README should include some of the language David's developed about "why GitHub?"

I'm also tempted to move the schedule to its own file?

BoF: student education (EXAMPLE)

Discussion topic: How do we best meet the needs of student training for bioinformatics? How does this education change depending on whether we are talking about graduate vs undergraduate, or academic vs industry? What tools and concepts are most important?

Software Bazaar: The NSB marine microfossil database system: a resource for paleobiology and paleoceanography

Software Bazaar: NSB system

Authors: Johan Renaudie•, David Lazarus•, Patrick Diver+

• Museum für Naturkunde, Berlin, Germany; + Divdat Consulting, Wesley, AR, USA
[email protected]

The one unique difference between fossils and living organisms is their location in geologic time. Despite the many imperfections of the fossil record, this unique attribute is used in paleobiologic studies of evolutionary patterns and processes in ways not possible with living material. The main resource for this for the last decade has been the Paleobiology Database (PBDB), which covers most groups of organisms over the Phanerozoic (0-600 Ma [million years]). PBDB provides data mostly at genus level, correlatable between locations at about 10 my resolution. Much higher resolution data (species level, <1 my age resolution), better suited to studies of evolutionary processes, are offered by the Cenozoic (0-60 Ma) marine microfossil record of planktonic and benthic protists, but these data have as yet been little used for paleobiologic studies. We describe here NSB (Neptune Sandbox Berlin: www.nsb-mfn-berlin.de), a newly expanded, improved database system that, complementary to PBDB, provides access to the published marine microfossil record for paleobiologic and paleoceanographic research. NSB consists of ca. 1 million data records of species occurrences, evolutionary events, and continuous geochronologic models of deep-sea sediment sections stored in a Postgres database; a Python-Django website for non-technical users running pre-defined queries and analyses; and supporting programs and data standards, particularly the ADP (Age Depth Plot) program (also Python) for developing geochronologic age models for sections; the SOD-OFF (stratigraphic occurrence data - open file format) definition for user-friendly recording of primary data and metadata on microfossil occurrences; and an R package linking the fossil occurrence data to paleoceanographic data in external archives, e.g. PANGAEA. The system has been developed using (very) intermittent funding over 25 years by different groups in several countries and is currently based at the Natural History Museum in Berlin.
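
As a rough illustration of what a continuous age model for a sediment section provides, the sketch below converts a sample depth into an age by linear interpolation between dated control points. It is not the ADP program or any NSB code. The function name, the data layout, and the control-point values are hypothetical.

```python
from bisect import bisect_right


def age_at_depth(depth_m, control_points):
    """Interpolate an age (Ma) for `depth_m` from (depth_m, age_ma) pairs sorted by depth."""
    depths = [depth for depth, _ in control_points]
    if not depths[0] <= depth_m <= depths[-1]:
        raise ValueError("depth lies outside the dated interval")
    # Index of the control point just below the sample, clamped to the last segment.
    i = min(bisect_right(depths, depth_m), len(depths) - 1)
    (d0, a0), (d1, a1) = control_points[i - 1], control_points[i]
    return a0 + (a1 - a0) * (depth_m - d0) / (d1 - d0)


# Hypothetical section: depth in metres below sea floor, age in Ma.
model = [(0.0, 0.0), (10.0, 1.0), (30.0, 5.0)]
print(age_at_depth(20.0, model))
```

With a model like this, every species occurrence recorded at a depth in the section can be assigned an age, which is what allows occurrences from different sections to be compared on one time axis.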

Lightning talk: "Nemo 3: a powerful tool for eco-evolutionary and population genetics modelling"

Authors: Frédéric Guillaume, Olivier Cotto, Max Schmid, Jobran Chebib

Abstract: Modelling eco-evolutionary dynamics is challenging yet necessary to predict the demographic, phenotypic and genetic responses of populations and species subject to environmental changes. Individual-based simulations are a tool of choice in this case because they allow modelling interacting stochastic processes at different scales (spatial, temporal) implicitly. They are thus powerful tools for modelling populations that adapt to spatial and temporal shifts in their local habitat conditions, including species with complex life cycles, e.g. with overlapping generations. Nemo has all the ingredients necessary to efficiently and rapidly simulate eco-evolutionary dynamics at large spatial and temporal scales, in age-structured populations, with varying degrees of genetic detail for the (quantitative) traits under selection. I will present the latest developments brought to Nemo, an individual-based, forward-in-time, stochastic, genetically and spatially explicit simulator that has been around for quite a while now (Guillaume & Rougemont, 2006). I will highlight new features that allow simulating large landscapes, complex genotype-phenotype maps, and phenotypic plasticity of quantitative traits, among others.

BoF: Software longevity

Discussion topic: How to promote it

Lightning talk: "FastNet: Fast and accurate inference of phylogenetic networks using large-scale genomic sequence data"

Authors: Hussein A. Hejase, Natalie VandePol, Gregory A. Bonito, Kevin J. Liu
Abstract:
Advances in next-generation sequencing technologies and phylogenomics have reshaped our understanding of evolutionary biology. One primary outcome is the emerging discovery that interspecific gene flow has played a major role in the evolution of many different organisms across the Tree of Life. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow?

The answer to this fundamental question depends not only upon densely sampled and divergent genomic sequence data, but also upon computational methods that are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in a recent performance study, we demonstrated that the state of the art falls well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa?

In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. Using synthetic and empirical data spanning a range of evolutionary scenarios, we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy. We predict an imminent need for new computational methodologies that can cope with dataset scale at the next order of magnitude, involving thousands of genomes or more. We consider FastNet to be a next step in this direction. We conclude with thoughts on the way forward through future algorithmic enhancements.

BoF: Diversity at iEvoBio

Discussion topic: The iEvoBio organizers have noted a rather dramatic gender bias in attendees at our conference over the last few years, especially with regard to participants in the Software Bazaar. This BoF session will discuss possible causes and consequences of this disparity, including possible ways to minimize this bias. Further information on ways to approach this issue is outlined in this F1000Research article about hackseq.

Lightning talk: SimBit

Authors: @RemiMattheyDoret
Abstract: SimBit is a high-performance, forward-in-time population genetics simulation platform coded in C++. With SimBit, you can simulate complex demography, plasticity, epistasis, QTLs and other things. SimBit's main secrets are its use of bitwise operators and the fact that all memory is allocated at the beginning of the simulation. SimBit has very good error handling to ensure that users understand what was unclear about their input.
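
To show why bitwise operators suit this kind of simulation, here is a small sketch that stores a haplotype of biallelic loci as the bits of a single integer, so that comparing or recombining many loci takes a handful of machine operations. It is written in Python for brevity and is not SimBit's C++ code. The function names and the bit layout (locus i in bit i) are assumptions made for illustration.

```python
def pack(alleles):
    """Pack a list of 0/1 alleles into one integer, with locus i stored in bit i."""
    bits = 0
    for i, allele in enumerate(alleles):
        bits |= allele << i
    return bits


def n_differences(hap_a, hap_b):
    """Number of loci at which two haplotypes differ (XOR, then count set bits)."""
    return bin(hap_a ^ hap_b).count("1")


def recombine(hap_a, hap_b, crossover, n_loci):
    """Child haplotype with loci [0, crossover) from hap_a and the rest from hap_b."""
    low_mask = (1 << crossover) - 1
    high_mask = ((1 << n_loci) - 1) ^ low_mask
    return (hap_a & low_mask) | (hap_b & high_mask)


# Hypothetical 8-locus haplotypes.
a = pack([1, 0, 1, 1, 0, 0, 0, 1])
b = pack([0, 0, 1, 0, 0, 1, 0, 1])
print(n_differences(a, b))
print(recombine(a, b, crossover=4, n_loci=8) == pack([1, 0, 1, 1, 0, 1, 0, 1]))
```

In a compiled language the same masks operate on fixed-width words, so 64 loci are compared or recombined per instruction.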

BoF: Bioinformatics education

Discussion topic: Many of us attending iEvoBio are involved in computational biology/bioinformatics education. Here are some considerations for the iEvoBio hive mind's perusal:

How do we best meet the needs of student training for bioinformatics?
How does this education change depending on whether we are talking about graduate vs undergraduate, or academic vs industry?
What tools and concepts are most important?

(a version of this was initially presented as an example, but has been "officially" submitted for discussion here)

Software Bazaar: RevBayes

Authors: Michael J. Landis, Jeremy M. Brown, Lyndon M. Coghill, William A. Freyman, Tracy A. Heath, Walker C. Pett, April M. Wright, Sebastian Hoehna
Source code: RevBayes Website
License: GPL license v3
Abstract: Use and estimation of phylogenetic trees remains a critical challenge in evolutionary biology. Many software packages implement only a subset of available phylogenetic models, or can work with a subset of data types, such as only being able to read in molecular data. The demands of growing datasets and new types of analyses have meant workers are increasingly forced to use multiple programs and pipelines for analyses.

RevBayes was developed as a flexible toolkit to perform standard phylogenetic analyses, divergence time estimation, and macroevolutionary modeling. Models are implemented and accessed using the Rev computing language, which is similar to R or BUGS. Rev implements a graphical model framework, allowing users to flexibly assemble models and priors to perform complex and custom Bayesian analyses. RevBayes reads most standard character types, and also accessory data, such as biogeographic data or characters for ancestral state reconstruction. Most common Bayesian phylogenetic tasks can be performed in RevBayes, as can joint or serial macroevolutionary analyses. The software is available via GitHub. Documentation, a collection of tutorials, and the user forum can be found at http://revbayes.github.io/.

Software Bazaar: Open Tree of Life

Authors: Jim Allman, Joseph Brown, Karen Cranston, Cody Hinchliff, Mark Holder, Jonathan Leto, Emily McTavish, Peter Midford, Ben Redelings, Richard Ree, Jonathan Rees, Stephen Smith
Source code: https://github.com/OpenTreeOfLife
License: GPL, BSD
Abstract: Open Tree of Life is a phylogeny of all species synthesized from taxonomy and published phylogenies. Version 9.1 has 2,637,204 tip taxa, with 55,226 from phylogenies. The web application includes an interface for users to upload and curate phylogenies for incorporation into the tree. The interface includes tree upload, tree and study metadata, taxonomic name resolution of tip labels, and conflict analysis between input trees and the Open Tree of Life and Open Tree Taxonomy.

BoF: What can software makers do to help biologists do reproducible science?

Discussion topic: I realize a big portion is education - one needs to know about a way of working or a particular tool (e.g., git) before one can use it. But besides education, what can software makers who are not in the education space do to encourage more reproducible ways of working? Is it about better documentation of how tool X can be used in a reproducible way? Does it all come down to education (i.e., the Carpentries), so that tools are just tools and the Carpentries will help people learn how to put them together in a reproducible way? Can the software tools we make do a better job of doing X, Y, or Z?

Software Bazaar: phylip (EXAMPLE)

phylip: a phylogenetic inference package for computers

Authors: J. Felsenstein et al.
Source code: PHYLIP home page
License: GPL

Abstract: It has recently become possible to generate molecular phylogenetic data from
almost any species on earth. The onslaught of new data and the development of
maximum likelihood methods mean many phylogenetic analyses can no longer be
performed with pen and paper alone. I will present my new software package
"phylip". Phylip is a set of C executables that allow users to calculate distance
matrices, estimate trees using likelihood, distance and parsimony methods and
visualize the resulting trees as image files. I welcome comments from users and
would-be users at the meeting.

Lightning talk: Ten simple rules for collaborative lesson development

Authors: Gabriel A. Devenyi, Rémi Emonet, Rayna M. Harris, Kate L. Hertweck, Damien Irving, Ian Milligan, Greg Wilson

Abstract: The model of collaborative code development presents an
alternative model to traditional scientific lesson
development. By leveraging a community approach, educational
resources can be more sustainable, robust, and responsive.
These ten simple rules outline best practices for this model of
collaborative resource development:

Clarify your audience
Build community around lessons
Build modular lessons that can be re-purposed
Teach best practices for lesson development
Encourage and empower contributors
Publish periodically and recognize contributions
Evaluate lessons at several scales
Reduce, re-use, recycle
Link to other resources
You can't please everyone

These rules are described in more detail in a manuscript currently in preparation.

Lightning talk: "ape: Analyses of Phylogenetics and Evolution"

Authors: Emmanuel Paradis, Simon Blomberg, Ben Bolker, Julien Claude, Hoa Sien Cuong, Richard Desper, Gilles Didier, Benoit Durand, Julien Dutheil, RJ Ewing, Olivier Gascuel, Christoph Heibl, Anthony Ives, Bradley Jones, Franz Krah, Daniel Lawson, Vincent Lefort, Pierre Legendre, Jim Lemon, Rosemary McCloskey, Johan Nylander, Rainer Opgen-Rhein, Andrei-Alin Popescu, Manuela Royer-Carenzi, Klaus Schliep, Korbinian Strimmer, Damien de Vienne
Abstract: ape has now been on CRAN for almost 15 years, and 192 packages on CRAN alone depend on it. In the development version we recently introduced the possibility of reading in trees with nodes of degree 2 and made the necessary changes to handle these trees (tree traversal, plotting, etc.). We will briefly show some of these new features, which as a byproduct will also make lots of code substantially faster. We want to reach out to users and developers to ask for feedback and for trees that previously could not be read into R, and to give some advice on how to adjust your code for these changes.
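
For readers unfamiliar with the term, a node of degree 2 is an internal node with exactly one child (one edge up, one edge down). The sketch below shows, on a toy nested-tuple encoding, how such nodes can be detected and suppressed. It is plain Python and does not reproduce ape's data structures or functions. The encoding and function names are assumptions, and branch lengths, which a real implementation would add together along the collapsed edges, are ignored.

```python
def has_singles(tree):
    """True if any internal node of the tree has exactly one child."""
    if isinstance(tree, str):
        return False
    return len(tree) == 1 or any(has_singles(child) for child in tree)


def collapse_singles(tree):
    """Return a copy of the tree with every single-child node replaced by its child."""
    if isinstance(tree, str):
        return tree
    children = tuple(collapse_singles(child) for child in tree)
    return children[0] if len(children) == 1 else children


# Hypothetical tree in which "A" and "C" each sit below a single-child node.
tree = ((("A",), "B"), ("C",))
print(has_singles(tree))
print(collapse_singles(tree))
```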

Template for issues and PRs?

It would be great to have a template that included required info for talk/BoF/software bazaar submissions.

Lightning talk: `rotl` : an R package to interact with the Open Tree of Life data

Authors: François Michonneau, Joseph Brown, David Winter

Abstract: While phylogenies have been getting easier to build, it has been difficult to reuse, combine and synthesize the information they provide because published trees are often only available as image files, and taxonomic information is not standardized across studies.
The Open Tree of Life (OTL) project addresses these issues by providing a digital tree that encompasses all organisms, built by combining taxonomic information and published phylogenies. The project also provides tools and services to query and download parts of this synthetic tree, as well as the source data used to build it. Here, we present rotl, an R package to search and download data from the Open Tree of Life directly in R.
rotl uses common data structures allowing researchers to take advantage of the rich set of tools and methods that are available in R to manipulate, analyse and visualize phylogenies.

License: rotl is free, open source and released under a Simplified BSD licence

Lightning Talk: "Using Genome-Scale Data to Resolve Cryptic Species Problems"

Authors: Fiona C. Wood, Yonas I. Tekle
Abstract: The advent of molecular phylogenetics has led to numerous cases where species determination differs based on molecular or morphological data. In some cases, different populations of the same morphospecies are genetically very distinct, while in other cases, morphologically different populations are genetically identical. This is particularly problematic when looking at microbial species, as there are limited morphological characters that can be used reliably for taxonomic studies. In microbial eukaryotes in general and amoeboids in particular, discerning isolates can be challenging when there is a discordance between the commonly used genes (single or a few) and the limited morphological data. One approach to resolve such a problem is to analyze more genes, as in a genome-scale comparison, which can provide more useful data for thorough taxonomic delineation based on individual and overall genetic distances. Here, we describe an automated pipeline for matching and comparing genes from large-scale sequencing analyses. Our pipeline can effectively identify and distinguish paralogs that would otherwise impede comparisons of genome-scale data. We demonstrate the effectiveness of our pipeline with a test case using transcriptome data from the Amoebozoan genus Cochliopodium. We used our pipeline to match almost 4,000 genes across two isolates of Cochliopodium and calculate intrastrain and interstrain genetic distances. This test case shows that our pipeline is useful in quickly comparing large-scale sequencing data and providing better insight on the true degree of divergence between populations. In the future, we plan to test the pipeline with more cryptic species cases to assess and improve its general applicability.

Software Bazaar: BuddySuite

BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

Authors: Stephen Bond, Karl Keat, Sofia Barreira, and Andy Baxevanis
Source code: GitHub
License: Public Domain (Work of US government)

Abstract: The ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools; it is written in the popular programming language Python and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki.

Software Bazaar: Nemo 3 (and nemosub)

Authors: Frédéric Guillaume, Olivier Cotto, Max Schmid, Jobran Chebib
Source code: WEBSITE
License: GPL
Abstract: Nemo is an individual-based, forward-time, genetically explicit, and stochastic simulation software designed for the study of the evolution of life history and quantitative traits, and genetic markers under various types of selection, in a spatially explicit, metapopulation framework. Nemo is designed to run large simulation campaigns and has neat new features to help spawn large batches of simulations on high-performance computing infrastructures with /nemosub/. Nemo can also efficiently handle simulations of genomic data (SNP) in large populations using genetic maps or along known pedigrees. All in all, Nemo is an ideal tool for the study of eco-evolutionary dynamics on real landscapes, the exploration of theoretical evolutionary scenarios, or for applications in conservation biology.

BoF: Likelihood testsuite

Discussion topic: The phylogenetic likelihood function is central to programs that estimate phylogenies or infer evolutionary rates in a Bayesian or maximum likelihood paradigm. However, different programs sometimes compute different likelihood values for the same data sets. This can be because of bugs or numerical inaccuracies (such as underflow) in the implementation of the Felsenstein pruning algorithm and matrix exponentials. It can also be because of incorrect specification of the model. For example, models such as the branch-site model of positive selection are quite complex, and might easily be mis-specified in software attempting a new implementation of the model.

A likelihood testsuite can be helpful in checking that likelihoods are being computed correctly and identically in the wide variety of likelihood-based programs, as well as being a good source of unit tests for such programs. Furthermore, a likelihood testsuite could be invaluable in supporting new implementations of complex models by checking that they have specified models correctly. A prototype testsuite has found significant differences in a few cases between two Bayesian phylogeny programs.
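
One way a testsuite entry might look is a tiny case whose likelihood can be checked against a closed-form value. The sketch below implements the Felsenstein pruning algorithm under the JC69 model with equal base frequencies and compares a two-taxon tree against the analytical result. It is an illustration only, not the prototype testsuite mentioned above. The tree encoding, the function names, and the example branch lengths are assumptions.

```python
import math

STATES = "ACGT"


def jc69(t):
    """4x4 JC69 transition probabilities for a branch of length t (substitutions/site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial(node, site):
    """Conditional likelihoods P(data below node | state x at node) for one site.

    A leaf is a sequence string; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if STATES[x] == node[site] else 0.0 for x in range(4)]
    result = [1.0] * 4
    for child, length in node:
        p = jc69(length)
        below = partial(child, site)
        for x in range(4):
            result[x] *= sum(p[x][y] * below[y] for y in range(4))
    return result


def log_likelihood(tree, n_sites):
    """Sum of per-site log-likelihoods with equal (0.25) root frequencies."""
    total = 0.0
    for site in range(n_sites):
        total += math.log(sum(0.25 * value for value in partial(tree, site)))
    return total


# Two-taxon check: the likelihood must equal 0.25 * P(A -> C over the total path length).
tree = (("A", 0.1), ("C", 0.2))
expected = math.log(0.25 * jc69(0.3)[0][1])
print(abs(log_likelihood(tree, 1) - expected) < 1e-12)
```

Summing per-site logarithms avoids underflow across sites, but the per-site product can still underflow on large trees, which is one of the numerical issues such a testsuite would need to probe.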

Questions: How can we establish a likelihood testsuite that can be used to test a wide variety of phylogeny programs, such as RAxML, PhyML, BEAST, RevBayes, etc.? If two programs disagree on the likelihood, how would we establish which one is correct? How can we establish "true" values of likelihoods when existing programs all suffer from round-off errors? Can we convince the authors of programs designed to MAXIMIZE or AVERAGE over parameters to allow computing likelihoods for specific fixed parameter values?

Quick Question on Software Bazaar

Hi all-

I had a question on the software bazaar. If we want to enter a piece of software into it, do all people who will be present at the stand/listed on the submission need to be registered for iEvoBio?

Thanks!

Software Bazaar: SLiM 2

SLiM 2: scriptable, interactive, fast evolutionary modeling

Authors: Benjamin C. Haller, Philipp W. Messer
Source code: SLiM home page
License: GPL

Abstract: Individual-based, genetically explicit modeling is an essential tool for connecting empirical population genetic patterns to hypotheses about evolutionary process. For example, such models can help answer questions about past demographic history, the forces of spatial and temporal selection acting on a population, and the genetic architecture underlying traits. SLiM 2 is a software package that makes constructing and running these models easy. SLiM models are extremely scriptable, allowing a huge diversity of models to be implemented, and they can run on both Linux and macOS, including on computing clusters for maximal performance. On macOS an interactive modeling environment, SLiMgui, is provided for rapid model development and graphical debugging; this can also make a great teaching tool for interactive labs in population genetics. At the Software Bazaar I can give you a demo of SLiM, or even do some quick modeling of your system with you if you like.
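
For readers new to this style of modeling, the following is a minimal sketch of what "individual-based, genetically explicit, forward-in-time" means: every diploid individual is represented explicitly, and each generation is built by drawing parents and passing on one allele from each. It is a neutral single-locus toy in Python, not SLiM code. The function names and parameter values are assumptions made for illustration.

```python
import random


def next_generation(population, rng):
    """Build a new population of the same size by random mating (no selection)."""
    offspring = []
    for _ in range(len(population)):
        mother, father = rng.choice(population), rng.choice(population)
        # Each parent transmits one of its two alleles at random.
        offspring.append((rng.choice(mother), rng.choice(father)))
    return offspring


def allele_frequency(population):
    """Frequency of allele 1 among all 2N gene copies."""
    return sum(a + b for a, b in population) / (2 * len(population))


def simulate(n_individuals, n_generations, start_freq, seed=0):
    """Return the allele-frequency trajectory, including the starting generation."""
    rng = random.Random(seed)
    population = [
        (int(rng.random() < start_freq), int(rng.random() < start_freq))
        for _ in range(n_individuals)
    ]
    trajectory = [allele_frequency(population)]
    for _ in range(n_generations):
        population = next_generation(population, rng)
        trajectory.append(allele_frequency(population))
    return trajectory


# Hypothetical run: 50 diploid individuals followed for 20 generations.
print(simulate(n_individuals=50, n_generations=20, start_freq=0.5)[-1])
```

Selection, spatial structure, or multiple loci would be added by changing how parents are drawn and how genomes are represented, which is the kind of extension the scripting layer described above is meant to make easy.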

Software Bazaar redo?

There seem to be a few folks who are interested in setting up again for the software bazaar at tonight's poster session (and maybe even tomorrow, too?). Drop a note here if you're interested, and we'll get organized in about the same place we were last night (although please note we won't have power strips available).

@mciach @KlausVigo @bredelings @RemiMattheyDoret @fredgui @wrightaprilm @bomeara @jhill1 @kevinjliu @bhaller @mjy @biologyguy

Schedule for lightning talk

Hi,

I'm looking forward to attending iEvoBio for the first time! The schedule here indicates that lightning talks are on 6/23, while the Presenter Information page on Wordpress says that it's on the last day of Evolution. Which is it?

Thank you,
YeeMey Seah

Software Bazaar: Phylotastic

Software Bazaar: Phylotastic suite of tools

Authors: Arlin Stoltzfus, Enrico Pontelli, Brian O'Meara, Dima Mozzherin, Dail Laughinghouse, Thanh Nguyen, Van Nguyen, Abu Saleh
Source code: https://github.com/phylotastic
License: GPL
Abstract: Until recently, expert knowledge on phylogenetic relationships-- the Tree of Life-- was scattered in thousands of inaccessible publications. Then the OpenTree project processed thousands of publications to generate a "synthetic tree" covering millions of species, a resource with enormous potential both for educational purposes and to guide research. Yet, there is still a very large accessibility gap, making it unlikely that this trove of knowledge will flow into the hands of scientists, educators, and the public. The Phylotastic project aims to lower the barriers to accessing the tree of life: we want to make getting a species tree just as easy-- and as fast-- as getting online driving directions. We solve problems involved in using scientific names, extracting relationships, integrating time-scales, and finding images and other resources about species. We have tools for going from photos of museum labels to a phylogeny on an iPhone, a web portal for returning the tree of life, an R package for pulling in trees, and tools for getting chronograms.

BoF: Why write new software instead of collaborating?

Why do people write new software?

Kate: so many people would rather write, say, their own new aligner that has no dependencies. People seem to think that they trust their own code most? Kate says that she tends to look for review articles, but often they're older, and it's not clear if the most-used package is cited because it's the easiest to understand.

Review articles are a pretty blunt instrument: only one person's opinion, not updated.

Too hard to read/understand someone else's code.

Maybe should email other developers? But sometimes they don't respond.

Maybe should be able to hire someone to modularize the code for extensibility.

We need to figure out how to plan for the amount of person-hours that it takes to do a large analysis or other major extension of the project.

Some kinds of software are more modularizable than others.

Interoperable files: if software can ingest/egest same file formats, that is a form of modularization.

A library might be hard to create because it's hard to chunk out different parts of the functionality.

Would SLiM/Nemo/SimBit ever have the time/will to organize and unify their packages?

Maybe hackathons?

Another obstacle is that fundamentally, people have different goals. Worried that people might not be okay with others rewriting their entire code base.

It'd be nice if more packages had nice APIs that would wrap any sort of code and then could be made available in other environments like R.

Maybe that's the next thing we need in an intermediate carpentry: modularization.

One of the enemies of modularization is optimization. Can't always play nice and wait for other code.
But maybe should look for optimizations in other areas instead of breaking modularity?

Why not?
Ben: it's hard to merge 3 versions of code that already exist. Hard to invest time/effort in merging without obvious benefit. Code review would be great, but who has the time?

Lightning talk: "Making toast: Use of analogies for bioinformatics education" (EXAMPLE)

Abstract: Contemporary biology is moving towards heavy reliance on computational methods to manage, find patterns, and derive meaning from large-scale data, such as genomic sequences. Biology teachers are increasingly compelled to prepare students with skills to meet these challenges. However, introducing biology students to more abstract concepts associated with computational thinking remains a major challenge. Analogies have long been used in science classrooms to help students comprehend complex concepts by relating them to familiar processes. Here I present a multi-step procedure for introducing students to large-scale data analysis (bioinformatics workflows) by asking them to describe a common daily task: making toast. First, students describe the main steps associated with this procedure. Next, students are presented with alternative scenarios for materials and equipment and are asked to extend the analogy to accommodate them. Finally, students are led through examples of how the analogy breaks down, or fails to accurately represent, a bioinformatics analysis. This structured approach to student exploration of analogies related to computational biology capitalizes on diverse student experiences to both clarify concepts and ameliorate possible misconceptions. Similar methods can be used to introduce many abstract concepts in both biology and computer science. Published article available here.

Lightning Talk: aTRAM 2.0 faster assembly of loci from next-gen data

Authors: Julie Allen, Raphael LaFrance, Robert Guralnick
Abstract: ABSTRACT HERE

Software Bazaar: PhyloNet-HMM

Authors: Kevin J. Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H. Kohn, Luay Nakhleh
Source code: https://bioinfocs.rice.edu/software/phmm
License: GNU GPLv3
Abstract:
One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on PhyloNet-HMM—a new comparative genomic framework for detecting introgression in genomes. PhyloNet-HMM combines phylogenetic networks with hidden Markov models (HMMs) to simultaneously capture the (potentially reticulate) evolutionary history of the genomes and dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci. Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detected a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgressed genomic regions. Based on our analysis, it is estimated that about 9% of all sites within chromosome 7 are of introgressive origin (these cover about 13 Mbp of chromosome 7, and over 300 genes). Further, our model detected no introgression in a negative control data set. We also found that our model accurately detected introgression and other evolutionary processes from synthetic data sets simulated under the coalescent model with recombination, isolation, and migration. Our work provides a powerful framework for systematic analysis of introgression while simultaneously accounting for dependence across sites, point mutations, recombination, and ancestral polymorphism.

Lightning Talk: Stomata Counter: A tool for plant image analysis

Authors: @KarlFetter
Abstract:

Software: SimBit! A flexible, high performance forward-in-time population genetics simulation platform

Software Bazaar: SimBit

Authors: Remi Matthey-Doret, Michael Whitlock
Source code: https://github.com/RemiMattheyDoret/SimBit
License: MIT License
Abstract: SimBit is a high-performance forward-in-time population genetics simulation platform coded in C++. With SimBit, you can simulate complex demography, plasticity, epistasis, QTLs and more. SimBit's performance relies mainly on the use of bitwise operators and on performing all memory allocation at the beginning of the simulation. SimBit has thorough error handling to help users understand what was unclear about their input.
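The two ideas named in the abstract (bitwise operators and up-front memory allocation) can be illustrated with a minimal Python sketch. This is not SimBit's code, which is written in C++; the class `BitHaplotype` and the function `count_differences` are hypothetical names introduced only for illustration, and the sketch assumes biallelic loci.

```python
class BitHaplotype:
    """Hypothetical haplotype of biallelic loci, packed 8 loci per byte."""

    def __init__(self, n_loci):
        self.n_loci = n_loci
        # Allocate all storage once, up front; later edits never reallocate.
        self.bits = bytearray((n_loci + 7) // 8)

    def set_allele(self, locus, mutant):
        byte, offset = divmod(locus, 8)
        if mutant:
            self.bits[byte] |= 1 << offset            # switch the bit on
        else:
            self.bits[byte] &= ~(1 << offset) & 0xFF  # switch the bit off

    def get_allele(self, locus):
        byte, offset = divmod(locus, 8)
        return (self.bits[byte] >> offset) & 1

    def count_mutations(self):
        # Population count: number of bits set across the whole haplotype.
        return sum(bin(b).count("1") for b in self.bits)


def count_differences(hap_a, hap_b):
    """Number of loci at which two haplotypes differ (XOR, then popcount)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(hap_a.bits, hap_b.bits))


hap1 = BitHaplotype(20)
for locus in (0, 9, 19):
    hap1.set_allele(locus, True)
hap2 = BitHaplotype(20)
hap2.set_allele(9, True)
print(hap1.count_mutations(), count_differences(hap1, hap2))
```

Packing loci into bits lets a single XOR compare eight loci at once, and fixing the buffer size at construction keeps allocation out of the simulation loop.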

Software Bazaar: Supertree Toolkit

Authors: Jon Hill, Katie E. Davis, Jamie Tovar, Matthew A. Wills
Source code: GitHub
License: GPL v3

Abstract: Constructing large supertrees is an onerous task, involving the collection, storage, and processing of hundreds to thousands of individual phylogenies composed of tens to tens of thousands of taxa. The Supertree Toolkit is designed to curate, process and store these data and the associated metadata to aid robust supertree construction. It features a user-friendly graphical user interface (GUI) with both context-sensitive documentation and graphic prompts to ensure that the minimum required data entry is completed. These data can then be manipulated according to a well-defined but flexible processing pipeline using either the GUI or a command-line tool. Processing steps include standardising names, deleting or replacing taxa, ensuring adequate taxonomic overlap, ensuring data independence, and creating a matrix to use in TNT or other phylogenetic software. These core processing steps are augmented by the ability to construct a taxonomy from online resources, which can be added to a supertree in post-processing, and to create subsets of data to enable novel analyses. This software has been successfully used to store and process data consisting of over 1000 trees. It will make large supertree creation a much easier task and increase the robustness of such phylogenies.
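As an illustration of one processing step, "ensuring adequate taxonomic overlap", here is a minimal Python sketch. It is not the Supertree Toolkit's implementation: it assumes that "adequate" means every source tree is linked to the others through pairs of trees sharing at least `min_shared` taxa, and the function `overlap_components`, the tree names and the taxon labels are all hypothetical.

```python
def overlap_components(tree_taxa, min_shared=2):
    """Group source trees into clusters linked by >= min_shared shared taxa.

    tree_taxa maps a tree name to the set of taxa it contains. A single
    cluster means the whole data set is connected by taxonomic overlap.
    """
    unvisited = set(tree_taxa)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            # Trees not yet assigned that overlap sufficiently with this one.
            linked = {
                other for other in unvisited
                if len(tree_taxa[current] & tree_taxa[other]) >= min_shared
            }
            unvisited -= linked
            component |= linked
            stack.extend(linked)
        components.append(component)
    return components


# Hypothetical source trees, reduced to their taxon sets.
trees = {
    "tree1": {"taxon_a", "taxon_b", "taxon_c"},
    "tree2": {"taxon_b", "taxon_c", "taxon_d"},
    "tree3": {"taxon_c", "taxon_d", "taxon_e"},
    "tree4": {"taxon_x", "taxon_y", "taxon_z"},
}
for cluster in overlap_components(trees):
    print(sorted(cluster))
```

Any tree that ends up in its own cluster (here the hypothetical `tree4`) shares too few taxa with the rest and would need to be dropped or supplemented before building a matrix.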

Software Bazaar: Locus Tree Inference

Name: Locus Tree Inference
Authors: Michał Ciach, Anna Muszewska, Paweł Górecki
Source code: GitHub
License: GPL-3.0
Abstract: Locus Tree Inference allows for decomposition of a gene tree into a set of subtrees which are consistent with a given species tree. It can be used to detect evolutionary events such as a horizontal gene transfer and to score the level of incongruence between the trees. As opposed to tree reconciliation, it does not require specification of any parameters, runs in approximately linear time, and is suitable for large trees (with up to thousands of leaves).

Lightning talk: "SLiM 2: scriptable, interactive, fast evolutionary modeling"

Authors: Benjamin C. Haller, Philipp W. Messer

Abstract: Individual-based, genetically-explicit modeling is an essential tool for connecting empirical population genetic patterns to hypotheses about evolutionary process. For example, such models can help answer questions about past demographic history, the forces of spatial and temporal selection acting on a population, and the genetic architecture underlying traits. SLiM 2 is a software package that makes constructing and running these models easy. SLiM models are extremely scriptable, allowing a huge diversity of models to be implemented, and they can run on both Linux and macOS, including on computing clusters for maximal performance. On macOS an interactive modeling environment, SLiMgui, is provided for rapid model development and graphical debugging; this can also make a great teaching tool for interactive labs in population genetics. Since I have presented SLiM in the past, this year I will focus on features that are new in SLiM 2.3: continuous-space models, spatial interactions, and landscape maps.

BoF: Software publication

Goal: How to publish journal articles describing software products, and how to gain citations for them.

Journal of Open Source Software (JOSS)

new journal, 1.5 yr
can be 1 paragraph long
just point to github
peer review is about software not text
no IF yet

Journals with programs have high impact.

Why do we need to publish?

proof, validation, test
citation

Method or result first?

result first (MS example)
when will a program be “good enough” for publication?
github is a good outlet.

Good outlets:

PLoS ONE, PeerJ, BMC Bioinfo…

We should start citing GitHub repositories

But they are not peer-reviewed.

Sometimes it’s easier to write my own program than to use other people’s.

Software products are evolving, not frozen.

Software shouldn’t be published at every update.

Publication is a skill that needs to be taught.

In Poland students are supposed to code independently.

It’s important to get help.

Pipelining.

They are worth publishing.
It is hard to keep pipelines consistent.
GFF is treated differently

Documentation.

Poor docs lead to rejection.
Programmers are usually reluctant to write docs.

Test datasets.

Good to have, hard to make.

Send code to people from another field for review.

Lightning talk: Towards an automated phylogeny generator: integrating taxonomic information with supertree construction

Authors: Jon Hill, Katie E. Davis

Abstract:
Constructing accurate, large phylogenies is time-consuming and onerous. In particular, the data need processing and careful cross-checking to remove synonyms, misspellings, outgroup taxa, etc. Manually checking the operational taxonomic units (OTUs) is especially error-prone. However, Application Programming Interfaces (APIs) exist for a number of online taxonomic databases, including the Encyclopedia of Life, the World Register of Marine Species and the Integrated Taxonomic Information System. It is possible to automatically download taxonomic information from these and use it to automatically check the OTUs during phylogenetic construction. Here, we will show how these data can be used to automatically process data when constructing phylogenies via supertree methods. We evaluate the performance of these methods on a diverse range of taxonomic groups, including oscine songbirds, bumblebees, achelatan lobsters and cockroaches/termites. We show that different taxonomic databases can produce similar phylogenies, but they differ in detail. The methods described show that it will be possible to automatically construct large phylogenies with minimal human input.
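The OTU check described in the abstract might look roughly like the following Python sketch. It is not the authors' code and makes no network calls: `ACCEPTED_NAMES` and `SYNONYMS` are hypothetical stand-ins for records that would be downloaded from a taxonomic database, the taxon names are invented placeholders, and the fuzzy-match cutoff of 0.85 is an arbitrary assumption.

```python
import difflib

# Hypothetical stand-ins for records fetched from a taxonomic database.
ACCEPTED_NAMES = {"Exampleus alpha", "Exampleus beta", "Samplea gamma"}
SYNONYMS = {"Oldgenus alpha": "Exampleus alpha"}


def normalise(name):
    """Replace underscores, collapse whitespace, and fix capitalisation."""
    return " ".join(name.replace("_", " ").split()).capitalize()


def check_otu(name, cutoff=0.85):
    """Return (resolved_name, status) for one OTU label from a source tree."""
    cleaned = normalise(name)
    if cleaned in ACCEPTED_NAMES:
        return cleaned, "accepted"
    if cleaned in SYNONYMS:
        return SYNONYMS[cleaned], "synonym"
    # Fall back to fuzzy matching to catch likely misspellings.
    close = difflib.get_close_matches(
        cleaned, sorted(ACCEPTED_NAMES), n=1, cutoff=cutoff
    )
    if close:
        return close[0], "misspelling"
    return None, "unresolved"


for label in ["Exampleus_alpha", "Oldgenus alpha", "Exampleus alpa", "Unknownus zeta"]:
    print(label, "->", check_otu(label))
```

Labels that come back as "unresolved" are the ones that would still need human attention, for example possible outgroup taxa or names missing from the database.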

Lightning talk: "phASE-Stitcher: A tool for phasing genome wide haplotype in F1 hybrids using Phase Informative Reads"

Authors: Bishwa K. Giri, Dr. David L. Remington
Abstract:
Next-generation sequencing data have provided a breakthrough in mining genetic variants in many organisms, both model and non-model. The next problem, however, is tying these variants into a set of haplotypes, which could provide strong data for population genetics, genomics and other studies. While many methods have been proposed to identify the most probable haplotype states using a large dataset, there is still a paucity of tools that can provide accurate haplotypes in hybrid individuals. Additionally, hybrids and mixed populations are highly heterozygous, so inaccurate configuration of the phase states of the alleles in hybrids or mixed parental populations can lead to misrepresentation of the actual biology occurring in the organism. Here we present a method and tool that takes the readback-phased state of the alleles from next-generation sequence data and segregates them into probable population haplotypes using the frequency of the alleles in the parental populations. We also introduce a first-order Markov model to improve the accuracy of the phase states. This method/tool can be used to prepare genome-wide haplotypes of hybrid individuals, population-level reference genome panels, and more accurate diploid genomes of individuals. The discussed tool can be found here.
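The idea of assigning the two haplotypes of a phased block to parental populations using allele frequencies can be sketched in Python as follows. This is not phASE-Stitcher's algorithm: it treats sites as independent (so it omits the first-order Markov model mentioned above), the function names and the frequency floor are hypothetical, and the allele frequencies are invented for illustration.

```python
import math


def log_likelihood(haplotype, allele_freqs, floor=1e-6):
    """Log probability of a haplotype in one population, sites independent.

    allele_freqs holds one dict per site mapping allele -> frequency.
    The floor avoids log(0) for alleles unseen in that population.
    """
    return sum(
        math.log(max(freqs.get(allele, 0.0), floor))
        for allele, freqs in zip(haplotype, allele_freqs)
    )


def assign_parents(hap_a, hap_b, freqs_pop1, freqs_pop2):
    """Decide which haplotype more plausibly came from which population.

    Returns ((origin of hap_a, origin of hap_b), log odds of that
    configuration versus the swapped one, as computed for pop1/pop2).
    """
    direct = log_likelihood(hap_a, freqs_pop1) + log_likelihood(hap_b, freqs_pop2)
    swapped = log_likelihood(hap_a, freqs_pop2) + log_likelihood(hap_b, freqs_pop1)
    log_odds = direct - swapped
    origins = ("pop1", "pop2") if log_odds >= 0 else ("pop2", "pop1")
    return origins, log_odds


# Invented allele frequencies for a three-site phased block.
freqs_pop1 = [{"A": 0.9, "C": 0.1}, {"G": 0.8, "T": 0.2}, {"T": 0.7, "C": 0.3}]
freqs_pop2 = [{"A": 0.2, "C": 0.8}, {"G": 0.1, "T": 0.9}, {"T": 0.4, "C": 0.6}]
print(assign_parents("AGT", "CTC", freqs_pop1, freqs_pop2))
```

The magnitude of the log odds gives a rough measure of how strongly the block supports one configuration; blocks with values near zero are ambiguous under this simplified model.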