Git Product home page Git Product logo

linearham's Introduction

linearham

     __      _.._
  .-'__`-._.'.--.'.__.,
 /--'  '-._.'    '-._./
/__.--._.--._.'``-.__/
'._.-'-._.-._.-''-..'

Docker Repository on Quay  

References

  1. Dhar A, Ralph DK, Minin VN, Matsen FA IV (2020) A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis. PLoS Comput Biol 16(8): e1008030 (https://doi.org/10.1371/journal.pcbi.1008030).

Installation

Linearham's dependencies are described in the Dockerfile. We recommend that you run linearham inside a Docker container, since this will make installation much easier (if you're new to Docker, read this). However, you can also install the dependencies by hand, in which case you should clone the repository and run each command in the Dockerfile that's on a line starting with RUN (treat WORKDIR as cd). The more similar your system is to that described by the Dockerfile's FROM line (at the moment, debian), the easier this will be.

Using Docker

It's best to start by running an interactive session in the container:

docker run -it quay.io/matsengrp/linearham /bin/bash.

You can also create a shell script with your linearham commands for docker to run, and put it in place of /bin/bash (use test.sh as an example). If you want your run to be reproducible, choose a tag of the form v<stuff> from quay.io, then specify it like so: quay.io/matsengrp/linearham:v<stuff>.

To access your own data from within the container, or to persist your output beyond the container, you must use volumes by specifying -v. We recommend using this convention (other paths may not work):

  1. Choose a directory outside of Docker that contains both your input data directory and your desired output directory
  2. Mount it as a volume to /linearham/work inside the container: -v /your/local/path:/linearham/work
  3. All commands inside the container referencing paths inside that directory should do so via /linearham/work, e.g. setting --outdir=/linearham/work/output inside the container will persist your output outside the container in /your/local/path/output.

Note that because Docker must run as root, this means that you will be writing to the directory on your host machine as root, so a) be very careful and b) don't choose anything anywhere near / (something like /home/user/several/sub/dirs is good).

Running linearham

All linearham actions are run using scons in the main linearham directory. Available actions are --run-partis, --run-linearham, and --build-partis-linearham. Note that because of the way scons parses arguments, you must always use an = sign in all args: --arg=val. For the same reason, you also have to spell args exactly right, e.g. writing --arg-nam instead of --arg-name will silently ignore it.

The input for Linearham is a partis output file. If you've already run partis to create this file, you only need to run --run-linearham; if not, you can have linearham run partis for you with --run-partis.

--run-partis

This runs partis on an input sequence file:

scons --run-partis --fasta-path=<file> --locus={igh|igk|igl} --outdir=<dir> [--parameter-dir=<dir>]

--fasta-path can be any file type that partis --infname handles (see partis help). Partis uses a directory with fitted sample-specific parameters (--parameter-dir), which if not already present will be automatically inferred based on the sequences in the fasta file. These parameters are more accurate if inferred on the entire repertoire of many clonal families; thus because linearham runs on only a single family at a time it is better if you can use parameters cached in a previous partis run on the entire repertoire, and then pass them to linearham with --parameter-dir. However, if you don't, the automatically-inferred parameters will still work fine, they'll just be somewhat less accurate (since they'll only be based on the one family).

Other partis-related arguments:

option description
--all-clonal-seqs If set, attempts to force all sequences in the fasta file into the same clonal family; otherwise it runs partis partition to infer the clonal families. "Attempts" means everything will end up together that doesn't have, say, different cdr3 lengths or wildly different naive sequences.
--locus Which immunoglobulin locus (defaults to igh)?
--outdir The output directory (defaults to output).

--run-linearham

Once you have a partis output file, whether you made it separately or with linearham, you can run linearham itself:

scons --run-linearham --outdir=<dir> [--partis-yaml-file=<file>] [--parameter-dir=<dir>]

If there is one clonal family (i.e. cluster) in the partis output file, linearham will run on that. If there is more than one, you'll have to select a cluster to run on using several options. In such a case you'll likely first want to run linearham with no cluster selection, in which case it'll print a list of the available clusters and exit (see also parse_cluster.py below). Partis performs clustering hierarchically, so its output stores a list of partitions, where each partition divides the sequences in the repertoire into clonal families (clusters). By default, linearham looks in the best (most likely) of these partitions, but you can specify the (zero-based) index of a different one with --partition-index. Within a partition, you can specify a cluster either by (zero-based) index with --cluster-index, or with the unique id of a particular sequence in the cluster with --cluster-seed-unique-id (see partis --seed-unique-id for more info). Options to specify the cluster on which to run:

option description
--partition-index zero-based index of partition from which to select the cluster on which to run (defaults to most likely partition)
--cluster-index zero-based index of cluster on which to run in the selected partition
--cluster-seed-unique-id choose the cluster in which the sequence with this name is found
--lineage-unique-ids same as --cluster-seed-unique-id, but also goes on to perform detailed lineage/mutation analysis. (see also below)

Other options:

option description
--partis-yaml-file Path to the partis output file that is linearham's input. Defaults to the location in --outdir to which the linearham --run-partis action will have written it (if it ran)
--outdir The output directory (defaults to output).
--parameter-dir Directory from which linearham reads partis hmm files. If not set, it defaults to the location in --outdir used by --run-partis. As for --run-partis (above), parameters will be much more accurate if you cache them with partis beforehand on the entire repertoire, but if this isn't possible they'll be inferred automatically on the one family on which you're running linearham, which should be fine.

If you don't have a clonal family/cluster of interest, or are not sure how to identify it using these options, you can run scripts/parse_cluster.py to work it out.

For example, running:

./scripts/parse_cluster.py lib/partis/test/reference-results/partition-new-simu.yaml --fasta-output-file parsed_cluster.fa --yaml-output-file parsed_cluster.yaml | less -RS

will print a table of available clusters in the best partition similar to this:

 available clusters in partition at index 29 (best):
index   size    unique_ids
0       71      [...]
1       11      [...]
2       262     [...]
3       4       [...]

Using the indices from this table, you can specify the corresponding clusters to Linearham. Running on the cluster with 262 sequences from the above table would look like:

scons --run-linearham --cluster-index=2 <args.. >

You can also figure out which sequences are in which clusters with the partis view-output action piped to less -RS.

Other linearham-related arguments:

option list? description
--template-path no The RevBayes template path (defaults to templates/revbayes_template.rev).
--mcmc-iter yes How many RevBayes MCMC iterations should we use (defaults to 10000)?
--mcmc-thin yes What RevBayes MCMC thinning frequency should we use (defaults to 10)?
--tune-iter yes How many RevBayes tuning iterations should we use (defaults to 5000)?
--tune-thin yes What RevBayes tuning thinning frequency should we use (defaults to 100)?
--num-rates yes The number of gamma rate categories (defaults to 4).
--burnin-frac yes What fraction of MCMC burnin should we use (defaults to 0.1)?
--subsamp-frac yes What bootstrap sampling fraction should we use (defaults to 0.05)?
--rng-seed yes The random number generator (RNG) seed (defaults to 0).
--asr-pfilters no The ancestral sequence posterior probability thresholds (defaults to 0.1).
--no-nestly-subdirs no if set, all output files are written directly to --outdir, rather than to a nested series of subdirs. Useful if you'd rather handle directory structure with the code that's calling linearham, and/or you don't plan to run many different combinations of mcmc parameters.

For the arguments that can be specified as a (,-separated) list (see middle column), linearham will run revbayes separately, writing to separate nested output directories, for all combinations of all such parameters. For more information on these arguments, run scons --help.

--build-partis-linearham

This compiles linearham, partis, and other dependencies. You'll only need to run this if you've either modified some source code or you're installing without docker.

Run steps

Running linearham consists of a series of steps, whose precedence and running is handled by scons. See also below for more detail on the various inputs and outputs of each step.

step command description
get linearham info lib/partis/bin/partis get-linearham-info reformat the information in all annotations in the partis output file for use by subsequent linearham steps, writes to partis_run.yaml
select single cluster scripts/parse_cluster.py pull annotation for single specified cluster out of partis_run.yaml, and write it to cluster.yaml and its sequences to cluster_seqs.fasta
make revbayes input scripts/generate_revbayes_rev_file.py use seqs in cluster_seqs.fasta and template revbayes config templates/revbayes_template.rev to write revbayes config for this run to revbayes_run.rev
run revbayes lib/revbayes/projects/cmake/rb run revbayes with config file revbayes_run.rev, writing output to revbayes_run.stdout.log. This step is usually by far the slowest; you can adjust e.g. the mcmc options above to trade off speed for confidence/accuracy.
run phylo hmm _build/linearham/linearham --pipeline run actual linearham phylo hmm, using cluster.yaml, <--parameter-dir>, and revbayes_run.trees to write lh_revbayes_run.trees
collect run statistics scripts/run_bootstrap_asr_ess.R collects info from lh_revbayes_run.trees and cluster_seqs.fasta to write three output files: linearham_run.{trees,log,ess}
calculate naive seq stats scripts/tabulate_naive_probs.py collect info from linearham_run.trees to write aa_naive_seqs.{png,fasta,dnamap}
calculate lineage info scripts/tabulate_lineage_probs.py collect info from linearham_run.trees and aa_naive_seqs.fasta to write lineage summary info to aa_lineage_seqs.{pfilter0.1.dot,fasta,dnamap,pfilter0.1.png} (only run if --lineage-unique-ids is set)
write git version info git rev-parse write commit/tag info to enable reproducibility
write final annotations scripts/write_lh_annotations.py Use original partis annotation in cluster.yaml and linearham stats in linearham_run.log to write final linearham annotations to linearham_annotations_{best,all}.yaml

Output files

Most of the output files you're likely to need are by default in the mcmc*/burninfrac*/ subdir of --outdir: e.g. burninfrac0.1_subsampfrac0.05/

file format description
linearham_run.log tsv posterior samples of annotations/naive sequences and parameters for the phylogenetic substitution and rate variation models
linearham_run.trees newick posterior tree samples with ancestral sequence annotations (formatted for use by Dendropy)
linearham_run.ess tsv approximate effective sample sizes for each field in linearham_run.log
aa_naive_seqs.fasta fasta each sampled naive amino acid sequence and its associated posterior probability
aa_naive_seqs.dnamap fasta (ish) map from each sampled naive amino acid sequence to its corresponding set of nucleotide naive sequences and posterior probabilities
aa_naive_seqs.png png logo plot of naive amino acid sequence posterior probability using WebLogo to visualize per-site uncertainties
linearham_annotations_all.yaml yaml annotations corresponding to the posterior tree samples (collapsing unique annotations, and with posterior probabilites set in 'logprob' key)
linearham_annotations_best.yaml yaml most likely annotation, i.e. the one that corresponded to the largest number of posterior tree samples

If --no-nestly-subdirs is set, instead of the cluster-*/mcmc*/burninfrac*/lineage_* subdirs, all files are written to the top-level dir (i.e. the calling program must specify a different dir in order to run with different parameters).

Every posterior tree sample corresponds to one sampled annotation; however before writing to linearham_annotations_all.yaml, duplicate annotations are collapsed. Each resulting unique annotation is assigned a probability proportional to the number of times it was sampled. These unique annotations are sorted by the resulting new 'logprob' key (in descending order) and written to linearham_annotations_all.yaml. Every sampled tree that contributed to each unique annotation is also added to that annotation (as a list in annotation['tree-info']['linearham']['trees']).

If --lineage-unique-ids is specified, there will also be additional lineage-specific output files, by default in subdirectories like lineage_<uid>/:

file format description
aa_lineage_seqs.fasta fasta for each intermediate ancestor in the lineage of the sequence with the specified id, the sampled amino acid sequence and its associated posterior probability
aa_lineage_seqs.dnamap fasta(ish) for each intermediate ancestor of the lineage of the sequence with the specified id, map from sampled amino acid sequence to its corresponding set of nucleotide sequences and posterior probabilities
aa_lineage_seqs.pfilterX.svg svg posterior probability lineage graphic made with Graphviz, where X is the posterior probability cutoff for the sampled sequences (see details below).

The posterior probability lineage plot (aa_lineage_seqs.pfilterX.svg) summarizes all the inferred ancestral sequences (and transitions among them) that were observed in the sampled trees. Each node represents an amino acid sequence: either inferred ("naive" if it was a naive sequence in any tree, otherwise "inter") or the seed sequence. The node's percent label (which also determines the color weight) is the fraction of trees in which it was observed, whether as naive or intermediate (so note that this can be larger than the probability in a naive sequence's name in the fasta file above, since in this plot we include also instances where it was an intermediate). Edges are labelled with the mutations separating their two nodes, and with the percent of transitions (in all trees) from the parent node that connect to the child node (which also determines the color weight). Edges with probability below are not plotted, so if you want to see more detail you should decrease this (note that this means plotted numbers generally don't add exactly to 100).

Most of the rest of the files in --outdir are just used to pass information among various linearham steps. By default, in the top-level dir are:

  • the partis output file that was used as input for linearham (e.g. partis_run.yaml), which contains partis-inferred clonal families (clusters) and annotations (including inferred naive sequence)
  • sub dirs for each cluster on which linearham was run, with name of form cluster-N/ for the cluster index N

Within each cluster's subdir cluster-N/ are:

  • a fasta file cluster-N/cluster_seqs.fasta with each of the cluster's input sequences, as well as its partis-inferred naive sequence
    • Note that the input sequences have SHM indels "reversed" (reverted to their state in the naive rearrangement), and non-variable regions (V/J framework) trimmed off.
    • This is equivalent to assuming that all shm indels occured at the tips of the tree, which is often not a good assumption, but standard phylogenetic approaches do not handle indels, so if you care about indels you'll need to handle them separately/by hand.
  • a partis yaml output file cluster.yaml resulting from pulling just this cluster out of the original partis output file that was used as input
  • the linearham output dir cluster-N/mcmciter<stuff> where <stuff> records the exact options of the revbayes run e.g. mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/

Within the linearham output dir mcmciter<stuff>

e.g. mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/:

file format description
revbayes_run.trees tsv results from tree sampling iterations, including phylogenetic substitution model and rate variation parameters and Newick trees
revbayes_run.log tsv results from tree sampling iterations but branch lengths are listed in tabular form rather than in a Newick tree
revbayes_run.rev Rev generated RevBayes script for tree sampling
revbayes_run.stdout.log txt stdout log of RevBayes tree sampling run
lh_revbayes_run.trees tsv results from tree sampling iterations plus V(D)J recombination information and contribution of each tree to posterior estimation
burninfrac<stuff> dir subdirectory for results from the run_bootstrap_asr_ess.R step onwards, i.e. final results

Naive sequence comparisons

One way to visualize the various output naive sequences and their probabilities is with lib/partis/bin/cf-linearham.py, which takes as input a linearham output dir and a partis output file (the latter preferably created with the --calculate-alternative-annotations option set). It then prints an ascii-art comparison of the amino acid and nucleotide naive sequences, as well as (for partis) a rundown of the alternative gene calls and their probabilities (the most likely of which was presumably input to linearham). More info here.

linearham's People

Contributors

dunleavy005 avatar eharkins avatar jgallowa07 avatar matsen avatar naveenjasti avatar psathyrella avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

linearham's Issues

Disable copy constructors?

One can disable copy constructors by making them private and not defining them. Should we do this for our big objects that we don't accidentally want getting passed by value?

SmooshableStack

A SmooshableStack will be a collection of smooshables that have the same left and right flexes. This is an abstraction for there being various alternate linear segments.

When two SmooshableStacks get smooshed together, the resulting SmooshableStack must be indexed by the product of the input stacks.

Can we use a common interface for Smooshables and SmooshableStacks so that they can both be part of a SmooshableChain?

B-W or stochastic EM?

How quickly can we do B-W or stochastic EM using linearham?

screenshot 2016-10-24 at 1 40 05 pm
screenshot 2016-10-24 at 1 40 35 pm

From

Lam, T. Y., & Meyer, I. M. (2010). Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training. Algorithms for Molecular Biology: AMB, 5, 38. https://doi.org/10.1186/1748-7188-5-38

Do we really want to use scons?

I'm still orienting myself with how linearham works so I could be way off base, but I don't see how this is really something that scons is designed for. It seems like the vast majority of users will run linearham once on each sample, and won't ever do some complicated rerun of parts of a build, which is I think what scons is for. I'm sure we will fix the not-running problem that chaim is having, and i'm sure I'm doing something wrong in my command line to cause scons to tell me there's nothing to do when it hasn't done anything, but it just seems like most users' experience of linearham is going to be a lot of scons messages clogging up std out, and a lot of trying to figure out why scons isn't doing anything.

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
git rev-parse --verify HEAD >> _output/git.log && git describe --dirty >> _output/git.log
scons: done building targets.

i of course realize it may not be feasible to switch off of scons soon or even ever, but still.

linearham processing failure -- need help interpreting/solving error message

I'm trying to run linearham against the output of a partis partition. I started one job specifying cluster 0 and realized that it will take a very long time with the default parameters, so I picked a smaller cluster (sort of at random), reduced the tuning iterations and tried that analysis. My command line is:

Singularity linearham_latest.sif:~/v5x3838/linearham/linearham> scons --run-linearham --cluster-ind=10201 --partis-yaml-file=TSPN-m6-dedup2.part.yaml --parameter-dir=_output/._data_TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2/ --template-path=templates/revbayes_template.rev --num-cores=4 --tune-iter=500 &

I get the following output (note that I also had a parallel job with the same command line running on cluster 15, so the outputs could be interleaved and there is definitely a message from the cluster 15 job -- apologies for the confusion; the same failure ultimately occurred for the cluster 15 job but I'm just showing the output for the cluster 10201 job).

I don't know how to address this error; I also note that the sequences listed in the error do not belong to cluster 10201, but to cluster 0 -- there are 1749, vs. the expected 13. The cluster 10201 analysis output folder does seem to have the correct set of sequences.

-bash-4.2$ pwd
/home/v5x3838/v5x3838/linearham/linearham/output/cluster10201
-bash-4.2$ ls
cluster_seqs.fasta mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0
-bash-4.2$ fgrep -c Seq cluster_seqs.fasta
13

/home/v5x3838/v5x3838/linearham/linearham/output/cluster0
-bash-4.2$ fgrep -c Seq cluster_seqs.fasta
1749
-bash-4.2$ head cluster_seqs.fasta

naive
NNNNNNNNNNNN... [edited]
Seq_101635
NNNNNNNNNNNNC[...]
Seq_77199
NNNNNNNNNNNNC...
Seq_31507
NNNNNNNNNNNNG...

Note that the sequence numbers from cluster 0 are indeed the numbers appearing in the cluster 10201 output.

I'll go try to review the sequence lengths in a later addition to this issue.

Run log here

scons: done reading SConscript files.
scons: Building targets ...
_build/linearham/linearham --pipeline --yaml-path output/partis_run.yaml --cluster-ind 10201 --hmm-param-dir _output/._data_TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2/hmm/hmms --seed 0 --num-rates 4 --input-path output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/revbayes_run.trees --output-path output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/lh_revbayes_run.trees
scripts/generate_revbayes_rev_file.py templates/revbayes_template.rev --fasta-path output/cluster15/cluster_seqs.fasta --mcmc-iter 10000 --mcmc-thin 10 --tune-iter 500 --tune-thin 100 --num-rates 4 --seed 0 --output-path output/cluster15/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/revbayes_run.rev
Rscript --slave --vanilla scripts/run_bootstrap_asr_ess.R output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/lh_revbayes_run.trees output/cluster10201/cluster_seqs.fasta 0.1 0.05 4 0 output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_run.trees output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_run.log output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_run.ess
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
lib/revbayes/projects/cmake/rb output/cluster15/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/revbayes_run.rev > output/cluster15/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/revbayes_run.stdout.log
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
scripts/tabulate_naive_probs.py output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_run.trees --output-base output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/aa_naive_seqs
scripts/write_lh_annotations.py output/partis_run.yaml output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_run.log --output-base output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_annotations
Traceback (most recent call last):
  File "scripts/write_lh_annotations.py", line 72, in <module>
    update_partis_line_with_lh_annotation(partis_annotation_copy, lh_line, glfo)
  File "scripts/write_lh_annotations.py", line 56, in update_partis_line_with_lh_annotation
    utils.add_implicit_info(glfo, partis_line, check_line_keys=debug)
  File "/lrlhps/users/v5x3838/linearham/linearham/lib/partis/python/utils.py", line 1677, in add_implicit_info
    raise Exception('naive and mature sequences different lengths %d %d for %s:\n    %s\n    %s' % (len(line['naive_seq']), len(line['seqs'][iseq]), ' '.join(line['unique_ids']), line['naive_seq'], line['seqs'][iseq]))
Exception: naive and mature sequences different lengths 364 361 for Seq_101635 Seq_77199 Seq_31507 Seq_1228224 Seq_1406223 Seq_65283 Seq_140241 Seq_137393 Seq_266012 Seq_804603 Seq_37208 Seq_2354 Seq_280719 Seq_63290 Seq_346860 Seq_415247 Seq_893043 Seq_1276274 Seq_200797 Seq_220727 Seq_227062 Seq_1286156 Seq_1490066 Seq_305508 Seq_39626 Seq_836086 Seq_49250 Seq_578085 Seq_971382 Seq_313096 Seq_1484974 Seq_1521083 Seq_1639416 Seq_1742156 Seq_1485409 Seq_1218381 Seq_209536 Seq_299040 Seq_1586146 Seq_208873 Seq_1936063 Seq_328045 Seq_136842 Seq_385567 Seq_1274969 Seq_40662 Seq_250535 Seq_762960 Seq_275533 Seq_663926 Seq_21919 Seq_368872 Seq_173484 Seq_393546 Seq_5098 Seq_247192 Seq_471714 Seq_787082 Seq_836676 Seq_830684 Seq_805583 Seq_253507 Seq_822933 Seq_588810 Seq_810440 Seq_773318 Seq_291126 Seq_244975 Seq_410193 Seq_369511 Seq_369519 Seq_1158328 Seq_29907 Seq_1070999 Seq_750403 Seq_176910 Seq_32000 Seq_40034 Seq_619823 Seq_40038 Seq_405547 Seq_633393 Seq_497598 Seq_181125 Seq_1509967 Seq_1073567 Seq_259735 Seq_74482 Seq_333768 Seq_1435981 Seq_218753 Seq_671933 Seq_398128 Seq_323777 Seq_559894 Seq_2240508 Seq_372914 Seq_308551 Seq_979705 Seq_199678 Seq_2052 Seq_464400 Seq_230492 Seq_751490 Seq_703701 Seq_1010751 Seq_70270 Seq_213438 Seq_1807742 Seq_2075595 Seq_775827 Seq_512978 Seq_37268 Seq_129566 Seq_35188 Seq_2258448 Seq_834400 Seq_330531 Seq_277780 Seq_440865 Seq_1090143 Seq_105176 Seq_317777 Seq_278782 Seq_1406853 Seq_539448 Seq_883640 Seq_287568 Seq_571864 Seq_158915 Seq_578102 Seq_215443 Seq_48575 Seq_48570 Seq_342582 Seq_286761 Seq_236908 Seq_402045 Seq_436699 Seq_126096 Seq_313749 Seq_1937693 Seq_457519 Seq_1300307 Seq_382592 Seq_92639 Seq_17482 Seq_550284 Seq_252964 Seq_77262 Seq_168891 Seq_501974 Seq_805265 Seq_682683 Seq_356948 Seq_1638665 Seq_1510450 Seq_124421 Seq_1663954 Seq_73531 Seq_1725104 Seq_424088 Seq_917016 Seq_1677473 Seq_427953 Seq_1369 Seq_1344668 Seq_1448646 Seq_88824 Seq_126906 Seq_31676 Seq_213642 Seq_177843 Seq_669762 Seq_137576 Seq_347835 Seq_719787 Seq_84698 Seq_749715 Seq_788463 Seq_1110227 Seq_78369 Seq_385443 Seq_855378 Seq_1463568 Seq_399783 Seq_218382 Seq_103077 Seq_148715 Seq_506834 Seq_139954 Seq_221613 Seq_13769 Seq_458765 Seq_20034 Seq_224575 Seq_1348392 Seq_124961 Seq_848043 Seq_350257 Seq_1055466 Seq_904768 Seq_1299186 Seq_38111 Seq_75983 Seq_186387 Seq_457762 Seq_771446 Seq_197589 Seq_320635 Seq_176559 Seq_55874 Seq_893192 Seq_26622 Seq_203931 Seq_250634 Seq_1002083 Seq_272792 Seq_1449732 Seq_4838 Seq_198818 Seq_522536 Seq_1151688 Seq_185365 Seq_870483 Seq_290124 Seq_1225114 Seq_802736 Seq_174500 Seq_244393 Seq_1096106 Seq_421169 Seq_318312 Seq_183306 Seq_678981 Seq_1314287 Seq_425716 Seq_1066486 Seq_1997553 Seq_1888464 Seq_168654 Seq_152338 Seq_275573 Seq_1956792 Seq_1307374 Seq_1104045 Seq_413243 Seq_233932 Seq_1415907 Seq_635043 Seq_1670612 Seq_382895 Seq_759382 Seq_1754852 Seq_35514 Seq_404418 Seq_768572 Seq_466536 Seq_274195 Seq_185386 Seq_130794 Seq_13267 Seq_1679675 Seq_175979 Seq_907529 Seq_1321087 Seq_407484 Seq_762387 Seq_324953 Seq_114720 Seq_805213 Seq_891037 Seq_1212640 Seq_46712 Seq_19901 Seq_273930 Seq_702624 Seq_61796 Seq_620845 Seq_850923 Seq_435352 Seq_65023 Seq_151983 Seq_980281 Seq_1035752 Seq_429163 Seq_534500 Seq_676142 Seq_104320 Seq_68056 Seq_16442 Seq_56549 Seq_1312006 Seq_861821 Seq_360525 Seq_580680 Seq_1016485 Seq_172587 Seq_13665 Seq_566897 Seq_29451 Seq_26339 Seq_5207 Seq_55111 Seq_1177799 Seq_1139450 Seq_741876 Seq_536984 Seq_482503 Seq_804751 Seq_190298 Seq_66240 Seq_742517 Seq_100738 Seq_594470 Seq_2270326 Seq_346999 Seq_29793 Seq_75962 Seq_167253 Seq_201177 Seq_439417 Seq_1405592 Seq_1325199 Seq_760802 Seq_421735 Seq_334415 Seq_74958 Seq_256376 Seq_256372 Seq_102313 Seq_179346 Seq_389714 Seq_17551 Seq_217623 Seq_967096 Seq_640764 Seq_15540 Seq_545333 Seq_526733 Seq_470869 Seq_743398 Seq_1579117 Seq_447193 Seq_907947 Seq_579749 Seq_258015 Seq_95948 Seq_99306 Seq_589619 Seq_95162 Seq_31895 Seq_1797906 Seq_61612 Seq_1117457 Seq_666575 Seq_339140 Seq_37652 Seq_1042520 Seq_156316 Seq_301816 Seq_625408 Seq_800723 Seq_300940 Seq_261875 Seq_245118 Seq_1092195 Seq_794869 Seq_106899 Seq_404232 Seq_52222 Seq_771323 Seq_1224285 Seq_190484 Seq_688026 Seq_64019 Seq_415831 Seq_9604 Seq_796839 Seq_1189828 Seq_958780 Seq_519837 Seq_1073676 Seq_1265265 Seq_527767 Seq_45939 Seq_1104197 Seq_189300 Seq_206926 Seq_2286 Seq_1283812 Seq_686794 Seq_1266498 Seq_860411 Seq_483880 Seq_1742512 Seq_315646 Seq_450943 Seq_11513 Seq_46747 Seq_455285 Seq_62809 Seq_270602 Seq_69457 Seq_244051 Seq_137899 Seq_742862 Seq_113117 Seq_142089 Seq_513349 Seq_1634907 Seq_1175258 Seq_847543 Seq_343115 Seq_286488 Seq_1851623 Seq_316319 Seq_1503188 Seq_899669 Seq_1302884 Seq_2075590 Seq_1118616 Seq_101550 Seq_247152 Seq_311334 Seq_1448252 Seq_141944 Seq_550479 Seq_344096 Seq_716239 Seq_160414 Seq_256110 Seq_359605 Seq_87618 Seq_660711 Seq_500638 Seq_232133 Seq_56303 Seq_252446 Seq_410021 Seq_508349 Seq_109908 Seq_44672 Seq_887259 Seq_497551 Seq_1026329 Seq_59962 Seq_161387 Seq_251295 Seq_101833 Seq_495273 Seq_1053138 Seq_86894 Seq_735177 Seq_1250408 Seq_343068 Seq_585784 Seq_852170 Seq_1249821 Seq_292033 Seq_101701 Seq_1271545 Seq_1673956 Seq_762283 Seq_134577 Seq_1218901 Seq_449549 Seq_18797 Seq_339774 Seq_165132 Seq_1030658 Seq_159608 Seq_955144 Seq_1695326 Seq_1241204 Seq_251445 Seq_112581 Seq_601275 Seq_1647948 Seq_128335 Seq_994157 Seq_391088 Seq_710594 Seq_123184 Seq_248288 Seq_913275 Seq_272883 Seq_837141 Seq_177667 Seq_733923 Seq_36034 Seq_30475 Seq_185977 Seq_425916 Seq_862339 Seq_1140712 Seq_656012 Seq_64516 Seq_552804 Seq_568039 Seq_446336 Seq_520734 Seq_574917 Seq_974150 Seq_680253 Seq_640366 Seq_1837933 Seq_1020 Seq_228326 Seq_17109 Seq_306993 Seq_860988 Seq_1969436 Seq_1099762 Seq_97583 Seq_1254762 Seq_201626 Seq_47003 Seq_253939 Seq_1235861 Seq_219388 Seq_73435 Seq_396468 Seq_187467 Seq_89272 Seq_1584992 Seq_142338 Seq_710160 Seq_710165 Seq_1308819 Seq_203251 Seq_991126 Seq_1532315 Seq_837489 Seq_94674 Seq_1623773 Seq_927315 Seq_425594 Seq_866323 Seq_118515 Seq_35885 Seq_1000522 Seq_1261258 Seq_380040 Seq_137354 Seq_342136 Seq_237636 Seq_32492 Seq_286555 Seq_528817 Seq_129790 Seq_1534919 Seq_390124 Seq_546754 Seq_477808 Seq_942719 Seq_424071 Seq_49797 Seq_942559 Seq_17073 Seq_1626107 Seq_272542 Seq_18742 Seq_97473 Seq_624981 Seq_596468 Seq_541735 Seq_544812 Seq_85564 Seq_1434028 Seq_160305 Seq_819239 Seq_204656 Seq_956371 Seq_326823 Seq_321379 Seq_241353 Seq_182053 Seq_222895 Seq_222891 Seq_193861 Seq_43769 Seq_12120 Seq_428219 Seq_852889 Seq_205895 Seq_337999 Seq_208354 Seq_571442 Seq_1155778 Seq_27389 Seq_1116574 Seq_1013138 Seq_536283 Seq_709639 Seq_266894 Seq_410821 Seq_550826 Seq_1247632 Seq_219590 Seq_320420 Seq_136264 Seq_968518 Seq_569463 Seq_503063 Seq_324642 Seq_8693 Seq_707570 Seq_89955 Seq_1198329 Seq_779389 Seq_298563 Seq_1024329 Seq_1065775 Seq_34725 Seq_71445 Seq_71444 Seq_208540 Seq_1223354 Seq_1562521 Seq_1365570 Seq_346172 Seq_865416 Seq_1019193 Seq_13293 Seq_1119013 Seq_450615 Seq_9506 Seq_867341 Seq_1240991 Seq_55508 Seq_253452 Seq_177422 Seq_640779 Seq_1634890 Seq_921042 Seq_854148 Seq_342358 Seq_626958 Seq_186223 Seq_255530 Seq_283646 Seq_296366 Seq_113717 Seq_1076196 Seq_1580620 Seq_81770 Seq_594362 Seq_355164 Seq_358137 Seq_83094 Seq_456550 Seq_1306803 Seq_1636192 Seq_105246 Seq_1612598 Seq_64668 Seq_70319 Seq_6949 Seq_633660 Seq_498877 Seq_928151 Seq_140348 Seq_703077 Seq_24860 Seq_361682 Seq_848694 Seq_179089 Seq_125811 Seq_1270086 Seq_908184 Seq_438351 Seq_103568 Seq_141181 Seq_841641 Seq_74597 Seq_227238 Seq_1897264 Seq_558017 Seq_122159 Seq_861953 Seq_266246 Seq_1152839 Seq_211006 Seq_1895154 Seq_129358 Seq_778006 Seq_1247538 Seq_154285 Seq_300235 Seq_1277735 Seq_114313 Seq_1093380 Seq_1190469 Seq_237144 Seq_398679 Seq_95032 Seq_613559 Seq_630543 Seq_415425 Seq_1720333 Seq_847277 Seq_59471 Seq_1344563 Seq_890653 Seq_295272 Seq_1085479 Seq_608208 Seq_1233287 Seq_40438 Seq_564095 Seq_124754 Seq_1037538 Seq_114411 Seq_2023432 Seq_77698 Seq_291770 Seq_111835 Seq_102544 Seq_117679 Seq_47663 Seq_375885 Seq_955043 Seq_950134 Seq_33769 Seq_20096 Seq_85542 Seq_298530 Seq_1426794 Seq_275017 Seq_543247 Seq_310806 Seq_1144537 Seq_20532 Seq_376237 Seq_399122 Seq_10439 Seq_166089 Seq_577031 Seq_789830 Seq_1752701 Seq_88025 Seq_29757 Seq_135989 Seq_1564240 Seq_329946 Seq_512300 Seq_606730 Seq_159008 Seq_337713 Seq_294939 Seq_456663 Seq_450625 Seq_304398 Seq_1015257 Seq_1189939 Seq_277279 Seq_476314 Seq_236225 Seq_1178257 Seq_874066 Seq_232788 Seq_1529661 Seq_1327967 Seq_1570031 Seq_171953 Seq_18473 Seq_844039 Seq_153518 Seq_14157 Seq_283316 Seq_502458 Seq_521473 Seq_274327 Seq_100926 Seq_1570546 Seq_1082043 Seq_675135 Seq_212914 Seq_499010 Seq_591714 Seq_63909 Seq_414513 Seq_1521937 Seq_1590730 Seq_55060 Seq_434307 Seq_106858 Seq_2041763 Seq_127176 Seq_494526 Seq_256617 Seq_15854 Seq_97370 Seq_561964 Seq_502802 Seq_605528 Seq_1301456 Seq_1542968 Seq_1920461 Seq_70735 Seq_329444 Seq_2561 Seq_42138 Seq_741034 Seq_460944 Seq_1350577 Seq_1031265 Seq_1294788 Seq_23588 Seq_139330 Seq_185330 Seq_1036682 Seq_106537 Seq_807311 Seq_4086 Seq_474559 Seq_135232 Seq_151493 Seq_324812 Seq_259879 Seq_21975 Seq_923314 Seq_152577 Seq_360 Seq_722358 Seq_1241949 Seq_732817 Seq_37106 Seq_360706 Seq_220285 Seq_736245 Seq_46421 Seq_552096 Seq_45190 Seq_739012 Seq_87230 Seq_75041 Seq_1655251 Seq_113202 Seq_767201 Seq_1947916 Seq_188850 Seq_39591 Seq_2061054 Seq_663237 Seq_928925 Seq_244832 Seq_869483 Seq_223531 Seq_533068 Seq_78908 Seq_884128 Seq_1947892 Seq_209670 Seq_321291 Seq_2086514 Seq_697005 Seq_222569 Seq_1192945 Seq_2027592 Seq_841633 Seq_246175 Seq_1050825 Seq_11150 Seq_11154 Seq_59873 Seq_16273 Seq_817454 Seq_1537829 Seq_254026 Seq_388511 Seq_22304 Seq_13824 Seq_401378 Seq_63488 Seq_53368 Seq_236251 Seq_99957 Seq_358708 Seq_1418152 Seq_95093 Seq_100435 Seq_182385 Seq_313679 Seq_720544 Seq_11586 Seq_266104 Seq_381735 Seq_129148 Seq_35748 Seq_1284200 Seq_83105 Seq_618314 Seq_737127 Seq_1027062 Seq_1195343 Seq_517100 Seq_524635 Seq_835869 Seq_6072 Seq_181882 Seq_511697 Seq_2748 Seq_380178 Seq_693834 Seq_108513 Seq_58736 Seq_866909 Seq_324653 Seq_56560 Seq_334929 Seq_722656 Seq_1800588 Seq_48432 Seq_209053 Seq_25036 Seq_25039 Seq_1138460 Seq_373555 Seq_51487 Seq_709744 Seq_310794 Seq_522368 Seq_162900 Seq_998656 Seq_785838 Seq_1156068 Seq_910733 Seq_182637 Seq_95384 Seq_1576627 Seq_43116 Seq_347369 Seq_1056321 Seq_145788 Seq_509855 Seq_377570 Seq_244383 Seq_265 Seq_714696 Seq_171019 Seq_659419 Seq_1632450 Seq_548098 Seq_75534 Seq_998448 Seq_332514 Seq_370521 Seq_81769 Seq_43757 Seq_217103 Seq_75094 Seq_57548 Seq_253857 Seq_364730 Seq_1715896 Seq_177010 Seq_1333241 Seq_29853 Seq_1924197 Seq_22417 Seq_752406 Seq_531141 Seq_283287 Seq_269681 Seq_1240198 Seq_646098 Seq_1105279 Seq_108237 Seq_13600 Seq_16384 Seq_589896 Seq_56369 Seq_77887 Seq_412728 Seq_194837 Seq_545185 Seq_36850 Seq_199206 Seq_338383 Seq_67849 Seq_71389 Seq_563712 Seq_771566 Seq_138354 Seq_457904 Seq_375542 Seq_542791 Seq_715348 Seq_107005 Seq_571183 Seq_977075 Seq_125370 Seq_134491 Seq_148534 Seq_1282866 Seq_254877 Seq_867058 Seq_337851 Seq_18532 Seq_353594 Seq_1966128 Seq_1309443 Seq_1988610 Seq_24794 Seq_698071 Seq_430296 Seq_144835 Seq_67046 Seq_72761 Seq_818072 Seq_372467 Seq_764310 Seq_114930 Seq_462391 Seq_1052295 Seq_198591 Seq_160188 Seq_882014 Seq_160182 Seq_227942 Seq_298528 Seq_911609 Seq_159260 Seq_25063 Seq_431368 Seq_388274 Seq_419274 Seq_605683 Seq_38327 Seq_268824 Seq_189131 Seq_756024 Seq_311436 Seq_983229 Seq_875768 Seq_131341 Seq_841030 Seq_130693 Seq_593118 Seq_159546 Seq_751912 Seq_49750 Seq_237022 Seq_406234 Seq_1239879 Seq_241320 Seq_117554 Seq_313716 Seq_695990 Seq_1307852 Seq_145158 Seq_593093 Seq_264741 Seq_1436103 Seq_1549728 Seq_459167 Seq_1007694 Seq_830150 Seq_709981 Seq_112179 Seq_426376 Seq_18597 Seq_785157 Seq_739302 Seq_23943 Seq_1167649 Seq_361456 Seq_406654 Seq_1080162 Seq_157100 Seq_303154 Seq_375705 Seq_1170545 Seq_218528 Seq_370259 Seq_719165 Seq_1052938 Seq_200074 Seq_564374 Seq_1173541 Seq_538247 Seq_1755345 Seq_226269 Seq_237813 Seq_825070 Seq_1127932 Seq_6253 Seq_109799 Seq_1033863 Seq_729925 Seq_16976 Seq_606575 Seq_92052 Seq_237768 Seq_219008 Seq_219003 Seq_880776 Seq_74132 Seq_167119 Seq_155274 Seq_15921 Seq_1172478 Seq_78058 Seq_1584584 Seq_1015390 Seq_61912 Seq_361035 Seq_602141 Seq_495252 Seq_579184 Seq_1145081 Seq_2003508 Seq_541097 Seq_1504920 Seq_1254788 Seq_413326 Seq_7014 Seq_374244 Seq_576342 Seq_75414 Seq_200531 Seq_987042 Seq_372233 Seq_105621 Seq_18295 Seq_994734 Seq_697801 Seq_416965 Seq_1012181 Seq_709873 Seq_1138191 Seq_716744 Seq_756077 Seq_432738 Seq_27508 Seq_285340 Seq_661007 Seq_22060 Seq_1371661 Seq_1447046 Seq_831860 Seq_322912 Seq_395630 Seq_503951 Seq_515637 Seq_125048 Seq_118571 Seq_372434 Seq_282616 Seq_2098672 Seq_1543349 Seq_849876 Seq_106994 Seq_5612 Seq_3616 Seq_15496 Seq_462563 Seq_4863 Seq_34212 Seq_159361 Seq_21935 Seq_397683 Seq_30895 Seq_183331 Seq_192331 Seq_1424169 Seq_97952 Seq_380382 Seq_888361 Seq_1342683 Seq_1204593 Seq_1152112 Seq_282222 Seq_277006 Seq_170611 Seq_359635 Seq_93981 Seq_448308 Seq_65721 Seq_840331 Seq_1171193 Seq_1102498 Seq_320057 Seq_71557 Seq_202805 Seq_702198 Seq_217326 Seq_483147 Seq_1311695 Seq_46662 Seq_88675 Seq_76211 Seq_143936 Seq_60763 Seq_60760 Seq_170856 Seq_1188627 Seq_126508 Seq_37266 Seq_61406 Seq_900343 Seq_1413072 Seq_1919833 Seq_168772 Seq_405181 Seq_178073 Seq_1078311 Seq_991722 Seq_1096224 Seq_116758 Seq_322596 Seq_296640 Seq_54334 Seq_45294 Seq_26018 Seq_572620 Seq_2083235 Seq_47944 Seq_828536 Seq_261283 Seq_299478 Seq_566067 Seq_1026729 Seq_1283482 Seq_2895 Seq_503985 Seq_13573 Seq_1383934 Seq_183242 Seq_143342 Seq_1276948 Seq_66680 Seq_330586 Seq_61620 Seq_641510 Seq_7328 Seq_10583 Seq_241032 Seq_647352 Seq_1098986 Seq_773267 Seq_111209 Seq_39615 Seq_572547 Seq_36047 Seq_679198 Seq_576897 Seq_478190 Seq_759960 Seq_730084 Seq_739305 Seq_29834 Seq_366198 Seq_2057449 Seq_1025487 Seq_6067 Seq_2066909 Seq_1707671 Seq_789134 Seq_73468 Seq_1541175 Seq_696102 Seq_522146 Seq_92559 Seq_5872 Seq_3438 Seq_48744 Seq_166521 Seq_218661 Seq_58903 Seq_455503 Seq_97792 Seq_1371851 Seq_922242 Seq_500842 Seq_751249 Seq_33041 Seq_119079 Seq_42580 Seq_311401 Seq_1334927 Seq_2236217 Seq_687746 Seq_24809 Seq_352602 Seq_316052 Seq_222223 Seq_11094 Seq_415140 Seq_459341 Seq_1816025 Seq_940450 Seq_373597 Seq_366298 Seq_979649 Seq_222524 Seq_216312 Seq_1451609 Seq_598890 Seq_1979387 Seq_152024 Seq_44568 Seq_167067 Seq_201896 Seq_67365 Seq_208641 Seq_782786 Seq_283375 Seq_113984 Seq_1191076 Seq_16742 Seq_1369481 Seq_859576 Seq_1408182 Seq_1667273 Seq_346010 Seq_459188 Seq_451539 Seq_915362 Seq_321493 Seq_241266 Seq_158149 Seq_1690814 Seq_80116 Seq_103019 Seq_8719 Seq_53696 Seq_139757 Seq_111139 Seq_406295 Seq_1587910 Seq_223175 Seq_140523 Seq_40569 Seq_49951 Seq_22118 Seq_334265 Seq_113045 Seq_1531956 Seq_548486 Seq_361922 Seq_127892 Seq_737080 Seq_288025 Seq_1635985 Seq_93076 Seq_9065 Seq_108558 Seq_641225 Seq_279132 Seq_689920 Seq_887237 Seq_594049 Seq_250454 Seq_96281 Seq_406479 Seq_19439 Seq_348222 Seq_792896 Seq_68172 Seq_660039 Seq_79480 Seq_286559 Seq_1701854 Seq_1121848 Seq_448942 Seq_948698 Seq_678631 Seq_615672 Seq_60665 Seq_627715 Seq_982874 Seq_1760817 Seq_1299477 Seq_936406 Seq_133855 Seq_216414 Seq_309654 Seq_851993 Seq_500670 Seq_176614 Seq_885440 Seq_86796 Seq_47404 Seq_866847 Seq_380285 Seq_253401 Seq_201011 Seq_994799 Seq_80392 Seq_662681 Seq_175858 Seq_870816 Seq_173040 Seq_117267 Seq_1448766 Seq_176799 Seq_282974 Seq_375289 Seq_332812 Seq_113563 Seq_188627 Seq_371549 Seq_9796 Seq_59594 Seq_1406982 Seq_253216 Seq_1860336 Seq_353273 Seq_63407 Seq_289564 Seq_131905 Seq_1733914 Seq_1112612 Seq_255490 Seq_640219 Seq_526883 Seq_103863 Seq_841100 Seq_767118 Seq_161782 Seq_653009 Seq_1431960 Seq_320144 Seq_419450 Seq_129075 Seq_660507 Seq_52987 Seq_279683 Seq_129323 Seq_254331 Seq_1181027 Seq_446951 Seq_645136 Seq_98782 Seq_1395408 Seq_322906 Seq_27512 Seq_1630127 Seq_579132 Seq_98464 Seq_169498 Seq_1414829 Seq_1572114 Seq_2012309 Seq_522122 Seq_607299 Seq_544072 Seq_66255 Seq_1634436 Seq_143839 Seq_178845 Seq_520079 Seq_315520 Seq_55833 Seq_193277 Seq_205188 Seq_81773 Seq_884977 Seq_227887 Seq_3912 Seq_118177 Seq_89073 Seq_502273 Seq_873357 Seq_1084235 Seq_454947 Seq_681728 Seq_29225 Seq_2069561 Seq_248252 Seq_526219 Seq_1202089 Seq_87578 Seq_154663 Seq_275490 Seq_225118 Seq_924982 Seq_97964 Seq_579530 Seq_696247 Seq_341938 Seq_16319 Seq_44467 Seq_304016 Seq_773180 Seq_67429 Seq_1471668 Seq_275404 Seq_592202 Seq_78683 Seq_22610 Seq_1342738 Seq_618636 Seq_1028055 Seq_748609 Seq_402058 Seq_140923 Seq_728474 Seq_527535 Seq_515944 Seq_1231281 Seq_525110 Seq_132757 Seq_1650611 Seq_92628 Seq_621653 Seq_5308 Seq_441246 Seq_1572775 Seq_1619896 Seq_90696 Seq_797867 Seq_964941 Seq_421640 Seq_252839 Seq_435552 Seq_1559637 Seq_261296 Seq_553237 Seq_1346639 Seq_171712 Seq_3011 Seq_3046 Seq_637205 Seq_441890 Seq_355084 Seq_1394955 Seq_763610 Seq_1136501 Seq_118088 Seq_451087 Seq_185022 Seq_120824 Seq_720991 Seq_243111 Seq_1632132 Seq_589255 Seq_1410122 Seq_355514 Seq_636869 Seq_781754 Seq_1053605 Seq_1339982 Seq_1825645 Seq_45024 Seq_791637 Seq_1229246 Seq_254135 Seq_413462 Seq_259675 Seq_312112 Seq_553642 Seq_595213 Seq_464489 Seq_1159402 Seq_687626 Seq_368503 Seq_899399 Seq_1164987 Seq_512489 Seq_136100 Seq_805514 Seq_33567 Seq_8688 Seq_1129536 Seq_616440 Seq_923984 Seq_2360417 Seq_2148738 Seq_166694 Seq_531474 Seq_1378925 Seq_968343 Seq_1208940 Seq_798590 Seq_90370 Seq_648535 Seq_231226 Seq_92733 Seq_363290 Seq_810360 Seq_815466 Seq_1146028 Seq_315513 Seq_290511 Seq_390136 Seq_2001233 Seq_441679 Seq_128759 Seq_1034622 Seq_196921 Seq_619552 Seq_83548 Seq_128158 Seq_1018520 Seq_1950654 Seq_171699 Seq_19789 Seq_20291 Seq_223665 Seq_958377 Seq_907592 Seq_764201 Seq_1460115 Seq_130544 Seq_56832 Seq_64395 Seq_146403 Seq_99182 Seq_4137 Seq_149071 Seq_270574 Seq_359677 Seq_925333 Seq_34121 Seq_20397 Seq_1023739 Seq_230956 Seq_1967323 Seq_141178 Seq_1025104 Seq_1541028 Seq_644686 Seq_371735 Seq_1003790 Seq_807996 Seq_340423 Seq_16734 Seq_413355 Seq_1524662 Seq_328782 Seq_281551 Seq_84148 Seq_73277 Seq_1430613 Seq_405637 Seq_266087 Seq_219251 Seq_996196 Seq_208573 Seq_191570 Seq_531392 Seq_662784 Seq_1069234 Seq_102255 Seq_315556 Seq_492447 Seq_180078 Seq_169412 Seq_1496395 Seq_5725 Seq_381359 Seq_272997 Seq_322163 Seq_201485 Seq_154122 Seq_112237 Seq_255600 Seq_88940 Seq_1698235 Seq_100489 Seq_524583 Seq_46988 Seq_47658 Seq_30069 Seq_9079 Seq_920926 Seq_1223788 Seq_543389 Seq_163285 Seq_314089 Seq_78370 Seq_156448 Seq_576040 Seq_1001908 Seq_93393 Seq_171996 Seq_306005 Seq_748098 Seq_246377 Seq_1293747 Seq_243441 Seq_195347 Seq_439156 Seq_54479 Seq_1252834 Seq_461406 Seq_219117 Seq_1447690 Seq_82676 Seq_115933 Seq_1134061 Seq_133863 Seq_298383 Seq_299036 Seq_588702 Seq_785511 Seq_476322 Seq_781362 Seq_360240 Seq_424665 Seq_32624 Seq_68449 Seq_1424811 Seq_1053172 Seq_1272977 Seq_597317 Seq_335033 Seq_104428 Seq_1582084 Seq_31588 Seq_1144004 Seq_1602878 Seq_672528 Seq_182266 Seq_1517873 Seq_199421 Seq_2667 Seq_174257 Seq_566909 Seq_750159 Seq_278415 Seq_897467 Seq_798901 Seq_188344 Seq_10311 Seq_369900 Seq_122459 Seq_285296 Seq_476651 Seq_554265 Seq_568107 Seq_86613 Seq_559449 Seq_534056 Seq_635493 Seq_586772 Seq_1059957 Seq_850917 Seq_379241 Seq_876955 Seq_402619 Seq_775140 Seq_964162 Seq_4040 Seq_169662:
    NNNNNNNNNNNNCAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGAGGGAGTTATCTGGGGCTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
    NNNNNNNNNNNNCAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTATTATTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCATTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGGCCCTGGGGATTGATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG
scons: *** [output/cluster10201/mcmciter10000_mcmcthin10_tuneiter500_tunethin100_numrates4_seed0/burninfrac0.1_subsampfrac0.05/linearham_annotations_best.yaml] Error 1
scons: building terminated because of errors.

make default tag on quay.io

I'm not sure if that's exactly what I mean that we should do, but I'm working on the readme and I think what we currently say -- go follow this link to quay.io and guess what tag you should use -- isn't a good long term choice.

The simplest comparison to ham

The simplest test comparison to ham would be an HMM of this shape:

--------
   |||||
   --------

two linear segments, with a collection of transitions between them in a range.

Can't seem to get linearham to run -- probably an issue understanding inputs

Hi,

I successfully ran a partis partition job as follows:

/partis/bin/partis partition --infname ./data/TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2.fasta --outfname ./TSPN-m6-dedup2.part.yaml --locus igh --initial-germline-dir ./data/germlines/aliva --sanitize-input-germlines --leave-default-germline --n-procs=8

(I had to use a partis Docker instance -- for some reason the partis code inside the linearham Docker image doesn't have your recent fixes, nor mafft.)

Then, using the linearham Docker image I try this:

scons --run-linearham --cluster-ind=0 --partis-yaml-file=../TSPN-m6-dedup2.part.yaml --outdir=../_output/ --parameter-dir=../_output/._data_TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2/
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
scons: done building targets.

(I've mapped in the linearham directory with all executables to be just below the directory where I created the partis outputs:
ls ..
TSPN-m6-dedup2.part.yaml data lham.recipe node plots run1.view_output.log
_output c hat3.seed linearham output pre
).

I don't really know scons and can't seem to get it to print out its steps. I tried deducing the command line to run outside of scons, but can't seem to get the parameters straight there.

My aim is to get the linearham analysis, hopefully in a graphical form as in your paper, for cluster 0 of the partis analysis. I'd really appreciate any help.

Thanks,
Bill

Test creation of [VDJ]Smooshables

  VGermline vgerm_obj(V_root);
  DGermline dgerm_obj(D_root);
  JGermline jgerm_obj(J_root);

  Smooshable VSmoosh = VSmooshable(vgerm_obj, std::make_pair(0,2),
                                   std::make_pair(5,7), emission_indices, 2);

  Smooshable DXSmoosh, DNSmoosh;
  std::tie(DXSmoosh, DNSmoosh) = DSmooshables(dgerm_obj, std::make_pair(4,5),
                                              std::make_pair(6,8), std::make_pair(10,11),
                                              emission_indices, 5);

  Smooshable JXSmoosh, JNSmoosh;
  std::tie(JXSmoosh, JNSmoosh) = JSmooshables(jgerm_obj, std::make_pair(5,6),
                                              std::make_pair(8, 9), std::make_pair(12, 13),
                                              emission_indices, 8);

Can we test the creation of these Smooshables?

What do we do with N bases in the middle of a read?

Partis' solution is to give a 1/4 emission probability to an N base, as described in ham's EmissionLogprob.

It seems clear to us that the emission probability should be 1 for an N, because N represents any base. We are not emitting a random variable, we are emitting a set.

Practically speaking in the code, our current solution is to map N's to (alphabet.size+1) and an extra row of the germline emission matrix with 1's. We can fill this row with 0.25 for the time being to match partis.

Cause of huge number of low-probability nodes

@dunleavy005

@mmshipley and I are wondering if it's expected that (or what would cause this) reducing the size of a family e.g. by tightening clustering definitions would result in a dramatically larger number of very low probability nodes in linearham's output? I think the numbers are going from roughly 50-100 seqs to 10-20 seqs, and the 10-20 are going to all be in the same sublineage that we care about, which is why i wouldn't expect results to change all that much?

This is an example of what things look like with the smaller cluster (whereas they look fine with the larger cluster):

image

(This is only with one of the filtering settings, but none of the other settings really improve things -- the tighter ones give nodes that aren't connected to the rest of the asr).

Docker image missing mafft

Trying to using your image at Dockerhub but I reach a point where mafft is called and it isn't part of the build.

(I'd also note that your build file doesn't seem to pull in the revbayes package as part of the build, but perhaps you address that manually since it's so big.)

Get partis SW results into linearham

Just to have everything in one place, here's the notation we've been using:

screenshot 2016-10-16 at 4 32 31 pm

For every gene we would like the query-level coordinates of every germline gene alignment, which is denoted m_G here for a germline gene G. We would also like the "flexes", which are the ranges of uncertainty in the beginning and end points for each of the germline gene matches.

AFACT everything we need is in the qinfo data structure.

  • qrbounds are the bounds of the S-W match indexing sites of the query sequence
  • glbounds are the bounds of the S-W match indexing sites of the GL sequence
{'first_match_qrbounds': (6, 299),
 'glbounds': {'IGHD2-15*01': (18, 31),
              'IGHD2-2*03': (15, 22),
              'IGHD2-21*02': (12, 21),
              'IGHD2-8*01': (9, 18),
              'IGHD3-10*01': (9, 19),
              'IGHD3-16*02': (22, 30),
              'IGHD3-22*01': (18, 24),
              'IGHD6-19*01': (9, 21),
              'IGHD6-6*01': (5, 18),
              'IGHJ1*01': (0, 52),
              'IGHJ2*01': (0, 53),
              'IGHJ4*02': (2, 48),
              'IGHJ5*02': (5, 51),
              'IGHJ6*02': (1, 62),
              'IGHJ6*03': (1, 62),
              'IGHV4-61*08': (6, 299)},
 'matches': {'d': [(34, 'IGHD3-16*02'),
                   (30, 'IGHD6-19*01'),
                   (29, 'IGHD6-6*01'),
                   (29, 'IGHD2-2*03'),
                   (29, 'IGHD2-15*01'),
                   (27, 'IGHD2-21*02'),
                   (26, 'IGHD3-10*01'),
                   (24, 'IGHD3-22*01'),
                   (21, 'IGHD2-8*01')],
             'j': [(175, 'IGHJ2*01'),
                   (158, 'IGHJ1*01'),
                   (152, 'IGHJ5*02'),
                   (146, 'IGHJ4*02'),
                   (143, 'IGHJ6*03'),
                   (143, 'IGHJ6*02')],
             'v': [(505, 'IGHV4-61*08')]},
 'name': '-2126437327330379358',
 'new_indel': False,
 'qrbounds': {'IGHD2-15*01': (299, 312),
              'IGHD2-2*03': (303, 310),
              'IGHD2-21*02': (299, 308),
              'IGHD2-8*01': (299, 308),
              'IGHD3-10*01': (299, 309),
              'IGHD3-16*02': (300, 308),
              'IGHD3-22*01': (299, 305),
              'IGHD6-19*01': (300, 312),
              'IGHD6-6*01': (299, 312),
              'IGHJ1*01': (312, 364),
              'IGHJ2*01': (311, 364),
              'IGHJ4*02': (318, 364),
              'IGHJ5*02': (318, 364),
              'IGHJ6*02': (302, 363),
              'IGHJ6*03': (302, 363),
              'IGHV4-61*08': (6, 299)},
 'seq': 'TCACCTCATCTACTAACTCCCTGCCTTAGGTTGGTGTTGTATGTGTGGAAAAATCATCTGGTTTTCTGCCTGTCTCGCGGTCGTTAGACCATTCGTTCTTGTTGCTGCAACTGCTTGCTTTTGCCCCTAGGCACGCGGGCAGGTAGGCTTCAAGACCTTTATTCCAAT
CCTAGCTCGGACTTTATCTCTTTCTACTCCCTAGTGCATGTCACGTCAGGTGTGTCGTTTCCGGGATAACCCCGACTGATTCTCCTCTCTGTTTGATCTTTTGACTAGGGGCTCTGTACCTGTGCGATAGATAGTTACCGGCCCGCATGGTGCTTCCGCATGTGGGGACGGGGGGCC
CTGGTCGCTGTCTCCCCAG'}

Therefore it seems like

  • m_G is (left hand qrbound) - (left hand glbound) for G
  • the flexes are essentially the (min, max) of each of the entries in qrbounds, stratified by gene type

SCons not recognizing arguments to linearham?

root@d4094c82cf56:/linearham# scons --run-linearham --partis-yaml-file /linearham/data/linearham/partis_run.yaml --parameter-dir /linearham/data/partis_params/ --outdir /linearham/data/linearham/
scons: Reading SConscript files ...
usage: scons [OPTION] [TARGET] ...

SCons Error: --partis-yaml-file option requires an argument

root@d4094c82cf56:/linearham# ls -l /linearham/data/linearham/partis_run.yaml
-rw-r--r-- 1 root root 43392 Mar  6 16:23 /linearham/data/linearham/partis_run.yaml

root@d4094c82cf56:/linearham# scons --run-linearham --parameter-dir /linearham/data/partis_params/ --outdir /linearham/data/linearham/
scons: Reading SConscript files ...
usage: scons [OPTION] [TARGET] ...

SCons Error: --parameter-dir option requires an argument

root@d4094c82cf56:/linearham# cp -R data/partis_params/ data/linearham/paramter_dir

root@d4094c82cf56:/linearham# scons --run-linearham --outdir /linearham/data/linearham/
scons: Reading SConscript files ...
usage: scons [OPTION] [TARGET] ...

SCons Error: --outdir option requires an argument

Designing a Phylo object

I'd like to start getting to the nitty-gritty of how the objects will look for our phyloHMM implementation.

First, let's assume that there is a "Joe" class (named in honor of Felsenstein) that wraps all of the libpll data structures.

I propose a "Phylo" object:

  • Contains a Joe object with the current tree and the complete MSA (with all the germline bases that could align to the query sequences)
  • Contains information about the germline genes under consideration and their alignments to the MSA
  • Have methods such that we can ask for per-site likelihoods (and their derivatives?) for a specific gene

The idea then is that to calculate a marginal likelihood, we set up the corresponding Smooshables with the per-site likelihoods calculated by the Phylo object, then run the usual marginal likelihood algorithm. We can do whatever we want to the Joe object (modify tree and branch lengths, etc) and ask for such a re-calculation. This way the germline-encoded Smooshables only need to know what gene and range of that gene they represent; NTI Smooshables know what range of the MSA they represent.

I also propose that all Smooshishs have a dirty flag. In this way the calculation could proceed:

  1. Modify tree somehow
  2. Tell the Joe to update itself
  3. Mark every Smooshish in the Pile (recurring up through Chains, etc) as dirty
  4. Recalculate. If a Smooshable is dirty, it asks for its per-site likelihoods from the Phylo object, and if a Chain object is dirty it just recalculates based on its prev and curr (asking them to recalculate as well, which they do if they are dirty).

I haven't carefully thought out what to do with indels.

Remove matrix copy construction in Smooshable et al constructors?

@bcclaywell, should we just not worry about the overhead of the copy constructor used in the Smooshable constructor?

Seems like we are always building a matrix for the explicit purpose of putting in a Smooshable. Is there a way we could just move it into place?

We discussed this somewhere else, but putting it in an issue will remind me of the decision.

git describe crashes build

with error message fatal: No names found, cannot describe anything.

It appears this happens when there's no tags in the local repo, which i think is a problem since if memory serves the default git pull/clone do not pull any tags.

Docker runs example and quits

This is probably a dumb "I don't know how to use Docker" question, but I when try to enter the Docker container, I can't get an interactive session. What am I doing wrong?

schrammca@cogsworth $> docker run -it -v /home/schrammca/2016201_Rv217/20170427/:/data quay.io/matsengrp/linearham

scons: Reading SConscript files ...
Mkdir("output")
scons: done reading SConscript files.
scons: Building targets ...
git rev-parse --verify HEAD >> output/git.log && git describe --dirty >> output/git.log
lib/partis/bin/partis annotate --all-seqs-simultaneous --infname data/liao_dataset.fasta --parameter-dir output/parameter_dir --locus igh --extra-annotation-columns linearham-info --outfname output/partis_run.yaml > output/partis_run.stdout.log
scons: done building targets.
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scripts/parse_cluster.py output/partis_run.yaml --indel-reversed-seqs --fasta-output-file output/cluster-0/cluster_seqs.fasta --yaml-output-file output/cluster-0/cluster.yaml
 only one annotation in partis output file. Using it.
  writing 313 sequences to output/cluster-0/cluster_seqs.fasta
scripts/generate_revbayes_rev_file.py templates/revbayes_template.rev --fasta-path output/cluster-0/cluster_seqs.fasta --mcmc-iter 25 --mcmc-thin 1 --tune-iter 0 --tune-thin 100 --num-rates 4 --seed 0 --output-path output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/revbayes_run.rev
lib/revbayes/projects/cmake/rb output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/revbayes_run.rev > output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/revbayes_run.stdout.log
_build/linearham/linearham --pipeline --yaml-path output/cluster-0/cluster.yaml --cluster-ind 0 --hmm-param-dir output/parameter_dir/hmm/hmms --seed 0 --num-rates 4 --input-path output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/revbayes_run.trees --output-path output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/lh_revbayes_run.trees
Rscript --slave --vanilla scripts/run_bootstrap_asr_ess.R output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/lh_revbayes_run.trees output/cluster-0/cluster_seqs.fasta 0.1 0.05 1 0 output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.trees output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.log output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.ess
scripts/tabulate_naive_probs.py output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.trees --output-base output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/aa_naive_seqs
scripts/tabulate_lineage_probs.py output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.trees output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/aa_naive_seqs.fasta --seed-seq KC576081.1 --pfilters 0.1 --output-base output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/lineage_KC576081.1/aa_lineage_seqs
scripts/write_lh_annotations.py output/cluster-0/cluster.yaml output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_run.log --output-base output/cluster-0/mcmciter25_mcmcthin1_tuneiter0_tunethin100_numrates4_rngseed0/burninfrac0.1_subsampfrac0.05/linearham_annotations
git rev-parse --verify HEAD >> output/git.log && git describe --dirty >> output/git.log
scons: done building targets.

schrammca@cogsworth $> 

NRegion and SmooshableNRegion

Given emission probabilities for the various states, and a vector of N-region length prior probabilities, we need an object equivalent to a Germline that contains that information.

Then given an observed sequence (vector of indices) and an NRegion we can have a SmooshableNRegion analogous to the SmooshableGermline.

For purposes of comparison to ham, having the prior probabilities be geometric for first go would seem sage.

linearham crashed

root@d4094c82cf56:/linearham# scons --run-linearham                                                     
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scripts/generate_revbayes_rev_file.py templates/revbayes_asr_template.rev --fasta-path output/cluster-0/cluster_seqs.fasta --mcmc-iter 10000 --mcmc-thin 10 --tune-iter 5000 --tune-thin 100 --num-rates 4 --seed 0 --output-path output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/revbayes_run.rev
lib/revbayes/projects/cmake/rb output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/revbayes_run.rev > output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/revbayes_run.stdout.log
_build/linearham/linearham --pipeline --yaml-path output/cluster-0/cluster.yaml --cluster-ind 0 --hmm-param-dir output/parameter_dir/hmm/hmms --seed 0 --num-rates 4 --input-path output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/revbayes_run.trees --output-path output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/lh_revbayes_run.trees
linearham: _build/linearham/VDJGermline.cpp:51: std::unordered_map<std::__cxx11::basic_string<char>, linearham::GermlineGene> linearham::CreateGermlineGeneMap(std::__cxx11::string): Assertion `dir != nullptr' failed.
Aborted (core dumped)
scons: *** [output/cluster-0/mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/lh_revbayes_run.trees] Error 134
scons: building terminated because of errors.

Smooshables need likelihood scalers

Right now everything is stored in terms of probabilities, which is lovely and fast, though we need to avoid underflow.

I propose that they have a member variable that just stores a log likelihood, which scales the overall likelihood of each entry of each of the matrices.

This would also be handy for defining a smooshable from an alignment of a read to a germline gene, in which the part not incorporated in the smooshable could have its likelihood plopped into a simple variable.

(Reminder: both of the marginal and the Viterbi probabilities need to be scaled.)

biological validation data sets

Following Duncan's idea in lab meeting, let's talk about biological validation data. Simulation data is straightforward.

  • Chaim I guess has some data that would be appropriate due to depth of sampling?
  • There are experiments in which a mouse is engineered with a given naive sequence.
  • There are situations for HIV bnabs in which there are some samples that are close to the root, and some farther away. The samples close to the root "reveal" the naive sequence. Do we agree?

NodeList object where it expected a string

(linearham) [loraxis] linearham/ > scons --run-linearham --parameter-dir=~/work/partis/test/reference-results/test/parameters/simu --outdir=_output --partis-yaml-file=partis-out.yaml
scons: Reading SConscript files ...
AttributeError: 'NodeList' object has no attribute 'rfind':
  File "/home/dralph/work/linearham/SConstruct", line 332:
    @nest.add_target()
  File "/home/dralph/.local/lib/python2.7/site-packages/nestly/scons.py", line 152:
    self.nest.add(key, nestfunc, create_dir=False)
  File "/home/dralph/.local/lib/python2.7/site-packages/nestly/core.py", line 174:
    for r in nestable(control):
  File "/home/dralph/.local/lib/python2.7/site-packages/nestly/scons.py", line 150:
    return [func(destdir, control)]
  File "/home/dralph/work/linearham/SConstruct", line 345:
    + " --yaml-output-file ${TARGETS[1]}")
  File "/usr/lib/python2.7/posixpath.py", line 98:
    return genericpath._splitext(p, sep, altsep, extsep)
  File "/usr/lib/python2.7/genericpath.py", line 99:
    sepIndex = p.rfind(sep)

linearham 2.0: the Awakening

As we discussed last week, a linearham re-structuring is required if we want to have objects that can be easily used for HMM hidden state sampling.

Remember for a standard HMM with x_1, ..., x_n being the hidden states and y_1, ..., y_n being the observed states (Scott, 2002):

gif latex

where the "alphas" are the cached HMM forward probabilities. This suggests a forward pass through the model to cache the "alphas", and then a backward pass to sample the hidden states. To make this easier to implement, we need to have data structures that store the quantities shown in the formulas above.

Currently, we've set things up (using Smooshables) that allows for easy computation of the HMM likelihood, but impossible hidden state sampling. One way to see this: we "smoosh" V, D, J marginal probability matrices left-to-right and then add these "Chains" up in our "Pile" top-to-bottom, when we really need to marginalize probabilities top-to-bottom at each site position as we go left-to-right (i.e. order of operations is swapped).

I propose the following:

Partition the HMM into distinct regions, the "germline" regions and the "junction" regions. The "germline" regions are guaranteed to contain germline states (known via the input S-W flexbounds) and thus have collapsed state spaces. The V-D and D-J "junction" regions are trickier to handle because there are many possible state configurations.

Our test example:
img_20180720_083850

In the test example above, the V "germline" region would extend from sites 0-3 and the state space would consist of the different V germline genes (i.e. IGHV1-2*56, ....) with a cached forward probability per state. This would be computed by doing a "Smooshable-like" calculation with no exit probability (P(pad) P(emit 1) P(1 -> 2) P(emit 2) P(2 -> 3) P(emit 3) for our example V gene). We also have a transition probability matrix from the V "germline" region to the V-D "junction" region at sites 4-7. The state space here is ("V,3", "V,4", "D, N_a", "D, N_c", "D, N_g", "D, N_t", "D, 0", "D, 1", "D, 2"), thus the t.p.m. is (1 x 9) in our example. The forward probability matrix for this "junction" region would have rows indexed by the different states and columns by the different sites (i.e. 4, 5, 6, 7). There would then be a transition probability matrix from the V-D "junction" region to the D "germline" region, which would be treated similarly to the V "germline" region. The remaining regions have a similar structure.

Computationally, we'd store all the ^ matrices in some class (Data?). Of course, we want to compute the minimal number of libpll likelihoods for emissions so utilizing the work in #43 will be important. One big difference is that I think our initial thoughts of using a {germline base, germline rate, MSA pos.} -> xMSA pos. map should be adjusted to {germline base, MSA pos.} -> xMSA pos. as libpll doesn't seem to be adding per-site rates in the near future (let's stick to 4-rate gamma?). Extracting emission likelihoods is simple as each HMM state encodes a germline base and we iterate across the MSA positions. To make our lives easier, we could even store a vector<string> holding the states in a particular region and a vector<int> holding the corresponding germline bases. Right now, we have two maps germ_xmsa_indices_ and nti_xmsa_indices_ that stores the xMSA indices used to extract emission probs. in germline/NTI regions, respectively; we could do something analogous for the "germline" and "junction" regions in this proposal.

Given all of the storage requirements described ^, we'd have to use the parsed partis HMM parameter information (from the YAML files) to "fill in" these matrices/vectors/maps. Once we do that, we do our forward pass to fill in the forward probability entries and then can sample hidden states according to the backward recursion.

It does feel a bit like re-inventing the wheel WRT to partis's HMM state structure, but there seem to be a few benefits of having linearham:

  1. reduced state space in the "germline" regions, leading to faster HMM likelihood computation (otherwise we're doing "quadraticham" and would be slow when doing weighted SRS for tons of trees),
  2. hidden state sampling allows us to characterize naive sequence uncertainty (unlike partis viterbi),
  3. linearham can work on single sequence reads (star-tree) or phylogenetic MSAs (phylo-tree).

Thoughts, @matsen ??

Compare to bcrham

Time to cook up some tests comparing to bcrham.

Here are the sequences for which we have HMM parameters already in the linearham repo.

>IGHV1-2*04
CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
>IGHD7-27*01
CTAACTGGGGA
>IGHJ4*01
ACTACTTTGACTACTGGGGCCAAGGAACCCTGGTCACCGTCTCCTCAG

Let's use those to make some fake sequences that can be run through bcrham. The first one could be just concatenating the sequences together, then adding some insertions and deletions, etc.

Consistent naming convention

Right now sometimes matrices have matrix in the title, sometimes not. Same for vectors.

Make this consistent.

Add more info to manual

placeholder

├── cluster0
│   ├── cluster_seqs.fasta
│   └── mcmciter100000_mcmcthin100_tuneiter5000_tunethin100_numrates4_seed0
│       ├── burninfrac0.0001_subsampfrac0.05: everything in this dir is an output file
│       │   ├── aa_naive_seqs.dnamap: complements the .fasta file, has dna sequences
│       │   ├── aa_naive_seqs.fasta: aa	naive sequences	with probabilities (should match up with numbers from linearham_run.log)
│       │   ├── aa_naive_seqs.png: logo	plot of of per-site aa naive sequence uncertainties corresponding to .fasta
│       │   ├── linearham_run.ess: unimportant
│       │   ├── linearham_run.log: each line is one sample. Ignore log likelihoods, treat each line as equal probability. (the log probs are for the distributions that were sampled from to *get* the lines in these files, i.e. ignore them)
│       │   ├── linearham_run.trees: each line is one sample.  Ignore log likelihoods, treat each line as equal probability.
│       │   └── seedseqQA013.2-igh: for	each seed sequence we have a aa_lineage_seqs file, which is analagous to aa_naive_seqs in parent dir
│       │       ├── aa_lineage_seqs.dnamap: includes same naive	seq stuff as in	parent dir, as well as similar stuff for intermediate ancestors	(except ignore probs for intermediates, since there isn\'t an easy way to compare them)
│       │       ├── aa_lineage_seqs.fasta: corresponds to dnamap
│       │       ├── aa_lineage_seqs.pfilter0.01.dot: each is a naive to seed lineage plot with the indicated probability threshold (threshold means we ignore potential ancestor chains/lineages less than that prob)
│       │       ├── aa_lineage_seqs.pfilter0.01.png
│       │       ├── aa_lineage_seqs.pfilter0.02.dot
│       │       ├── aa_lineage_seqs.pfilter0.02.png
│       │       ├── aa_lineage_seqs.pfilter0.04.dot
│       │       ├── aa_lineage_seqs.pfilter0.04.png
│       │       ├── aa_lineage_seqs.pfilter0.06.dot
│       │       ├── aa_lineage_seqs.pfilter0.06.png
│       │       ├── aa_lineage_seqs.pfilter0.08.dot
│       │       └── aa_lineage_seqs.pfilter0.08.png
│       ├── lh_revbayes_run.trees: ignore this file (the annotations seemed to not be self-consistent)
│       ├── revbayes_run.log
│       ├── revbayes_run.rev
│       ├── revbayes_run.stdout.log
│       └── revbayes_run.trees
├── partis_run.stdout.log
├── partis_run.yaml
└── run_qa013_synth_v19_1final.sh

Make matrices supporting Viterbi calculation optional?

Unless I'm mistaken, the extra matrices and data structures required to support Viterbi calculation incur a significant overhead.

Here's one proposal:

  • add a flag that turns Viterbi calculation on or off
  • raise an error if we ask for a traceback if it's off

A more elegant solution would be to have the current Smooshable class become something like a ViterbiSmooshable, and then have Smooshable just support the marginal functionality.

We should discuss this more before implementing.

figure out why output dir has to be under `/linearham`

Next problem:

root@e22536b936fb:/linearham# scons --run-partis --fasta-path=../data/VRC43H-min5.fa --all-clonal-seqs --parameter-dir=../data/partis_params --outdir=../data/linearham
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
scons: done building targets.
root@e22536b936fb:/linearham# ls ../data/linearham/
root@e22536b936fb:/linearham# 

Where do I look for an error message?

Viterbi calculations in the NTI region are incorrect

Currently, we calculate viterbi probabilities and paths through the NTI region by marginalizing over the hidden NTI states and just recording the best NTI exit point. We should, instead, be recording the viterbi probability at each NTI base, and store backpointers to the optimal hidden NTI bases (i.e. we shouldn't be calculating the NTI likelihood, we should be doing a standard HMM viterbi procedure in this region).

(Side note: We also should be doing proper viterbi analysis, instead of just getting viterbi paths for each VDJ smooshed chain, and then max'ing over all VDJ chains, but this is up for discussion).

change actions to positional argument

As far as i can tell there's a variety of boolean commands to specify which action to run, --run-partis, --run-linearham, --build-partis-linearham. This is pretty confusing.

Use new partis fcn add_seqs_to_line()

It's hard to believe when we talked about this before I didn't realize how easy it would be, but I wrote a function to add new sequences (i.e. inferred intermediates from a phylo program) to an existing partis annotation without running partis, it just uses a mafft step if there's some constant region cruft. I'm pretty sure there's at least one place in linearham that we want to use this rather than running partis from scratch?
https://github.com/matsengrp/linearham/blob/master/SConstruct#L285
https://github.com/psathyrella/partis/blob/dev/python/utils.py#L512

0-indexing for flexes

... so that a flex of 0 corresponds with no uncertainty for the transition point. It is consistent with 0-indexing other places.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.