caporaso-lab / sourcetracker2
SourceTracker2
License: BSD 3-Clause "New" or "Revised" License
From @lkursell on March 21, 2016 22:24
A more descriptive term for how this works would be handy, e.g. 'sequence_density', such that sourcetracker2 seq_density -i table.biom -m mf.txt -o out/ is callable.
Copied from original issue: biota/sourcetracker2_internal#27
@gregcaporaso suggested that validate_gibbs_input should have more informative errors on the input dataframes, specifically saying why they fail (e.g. sources had a nan at loc X).
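A minimal sketch of the kind of check that could produce such messages (the function name and wording are illustrative, not the current implementation):

import numpy as np

def _check_no_nulls(df, name):
    # Report the exact row/column locations of any null values so the user
    # knows why validation failed (e.g. "sources had a nan at loc X").
    if df.isnull().values.any():
        rows, cols = np.where(df.isnull().values)
        locs = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
        raise ValueError('%s contains null values at: %s' % (name, locs[:10]))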
Cross validated alpha tuning and fit evaluation
It would be good to have a parse_sample_metadata function, and then use that instead of calls like sample_metadata = pd.read_table(mapping_fp, index_col=0). This call will be problematic in some cases that we've run into (here's how it should be done), and we want to be able to fix that in one place rather than have the file be potentially parsed differently in different parts of the code base.
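A minimal sketch of what such a function might look like (the parsing choices here are assumptions about the desired behavior, not an agreed-upon spec):

import pandas as pd

def parse_sample_metadata(mapping_fp):
    # Single place to control how mapping files are parsed: tab-delimited,
    # sample IDs (first column) as the index, all values kept as strings so
    # numeric-looking IDs and categories are not silently coerced.
    return pd.read_csv(mapping_fp, sep='\t', index_col=0, dtype=str)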
Multiple issues here:
A QIIME forum user reported an error installing because conda couldn't find scikitbio=0.4.3. Conda auto-suggested scikit-bio instead, which I think should work, but we need to investigate this failure.
@gregcaporaso - is this a problem with our install documentation, or does that user just need to add e.g. the bioconda channel?
Users will want to know the Jensen-Shannon divergence of their sources. The manuscript will allow users to relate the JSD between the sources and the accuracy/precision of the mixing proportions they get back. This is not a novel idea, based on the work in Dan's original paper.
@gregcaporaso @justin212k @lkursell @johnchase @nscott1 @mjk1000
Based on working with the code this weekend to try and get @johnchase and @lkursell a simple API for launching jobs I have come to the conclusion that the code has accumulated some cruft. Cruft = technical debt that will come due at some time, so I think we need to consider refactoring. I am not advocating something major, but a set of suggestions that @lkursell and @johnchase have already essentially asked for.
Below are my thoughts, please let me know what you think
The things I think would help and that I plan on doing once we modify/come to agreement:
- Refactor gibbs_sampler to use pd.DataFrames. A significant amount of code is spent in sourcetracker/sourcetracker.py as well as sourcetracker/cli/gibbs.py to ensure that the arrays representing source data, sink data, and the mapping file data are in the proper order, etc. DataFrames can help remove much of this code (@lkursell and @johnchase pointed this out). This could include converting input biom tables to a DataFrame, as it would make some operations slightly easier. Not a necessity, but a possibility for a little extra gain.
- Refactor ConditionalProbability to be clearer and more easily extensible. In our efforts to eke out speed improvements I wrote this class so that it would precompute everything that could be precomputed. The speed gains from this are minor based on my current understanding, and it makes understanding what's happening much less straightforward. In addition, as we move to add other layers of probability (e.g. what @lkursell discussed in his last email), we will likely want to add them in this class. Without refactoring, it's going to be hell trying to do this.

What I need help with:
- The functools.partial nonsense that we are using to enable passing jobs to an ipyparallel client. I don't think we should need to do this, and removing it would get rid of a fair amount of code, but I don't understand why we need it in the first place. If you try to pass a function to the engines that is not wrapped by partial, you get errors indicating the local namespace doesn't contain the variables you are trying to pass.

Things I'd like to get a consensus on:
From @lkursell on February 12, 2016 22:39
Displaying ST2's progress, like the original does, is very helpful for making sure commands are executing correctly, especially given the time required for jobs to finish; progress indicators are also a very convenient way of determining how far along a run is.
Copied from original issue: biota/sourcetracker2_internal#14
Plot is too general - we'll want to support other types of plotting in the future.
hat-tip to @johnchase for pointing this out.
If the input dataframes for _gibbs have different numbers of features (different columns) the function will fail with an IndexError like the following:
IndexError: index 11751 is out of bounds for axis 0 with size 11290
_gibbs should check that there are the same number of features in both sources_df and sinks_df. An additional check would be that the identities of those features are the same.
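A minimal sketch of such a check (the function name and messages are illustrative):

def _validate_features(sources_df, sinks_df):
    # Fail early with a clear message instead of a cryptic IndexError deep
    # inside the sampler when source and sink feature sets don't line up.
    if sources_df.shape[1] != sinks_df.shape[1]:
        raise ValueError('sources and sinks have different numbers of '
                         'features: %d vs %d.'
                         % (sources_df.shape[1], sinks_df.shape[1]))
    if not sources_df.columns.equals(sinks_df.columns):
        raise ValueError('sources and sinks contain different feature IDs.')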
If the delay parameter in the gibbs_sampler function is set to 1, the function will never take draws (x - y mod 1 always equals 0).
This is a legacy from the original SourceTracker code. We need to establish whether there is any reason that a delay of 1 is unacceptable, and rewrite the check if not.
From @wdwvt1 on January 30, 2016 0:15
Unsure if this is a bug - but if the number of unknown sequences reaches 0 and alpha2 is set to 0, then lines 370-372 of sourcetracker.py might raise a zero-division error.
We have not verified this in tests (we can't get the unknown count that low), but it deserves investigation.
Copied from original issue: biota/sourcetracker2_internal#9
by identifying 'contamination' sources in the mapping file and then subtracting these sequence counts from the sources of interest for each sink sample
Address comments in #38 before building this release (that was accidentally merged before the comments were addressed, but the comments were minor).
_gibbs fails with:
IndexError: index 6283 is out of bounds for axis 0 with size 6283
when there are floats in the sink_df. The error occurs when trying to create the order used to walk through the sequences.
We probably want to add a check so that a less cryptic error message is given to the user.
@gregcaporaso and I pair reviewed and decided there are a couple of key places where test coverage should be improved.
import numpy as np
from scipy.stats import ttest_1samp

# Mixing proportions reported by ST1 for the same sources/sinks.
st1_ms = np.array([[.3, .4, .3], [.2, .3, .5]])
# Mixing proportions from ST2 (gibbs is the public ST2 function under test).
st2_ms = gibbs(source, sinks)
# One-sample t-test of the ST2 proportions against the ST1 means; require
# that no source environment differs significantly.
obs_t, obs_p = ttest_1samp(st2_ms, st1_ms.mean(0))
p_threshold = .05
self.assertTrue((obs_p > p_threshold).all())
- gibbs and gibbs_loo (these are the largest sources of untested code in the entire package). We'll have to make a hand-rolled example that we run through from start to finish.
- gibbs and gibbs_loo - these will just test that when a code change happens, a PRNG-seeded example didn't change. These do not test accuracy, just consistency.

For simulations to determine convergence rates of gibbs_sampler, I've been setting burnins=0, delay=1, draws_per_restart=X, where X is the number of passes I want to make. I think this should be the default way we call gibbs_sampler, by just giving it a max number of passes. So, it'd look like:
def gibbs_sampler(cp, max_passes):
Then, an API user can take the returned data and use any parameters they like for the delay, burnins, etc. This removes draws/restart as a parameter, but that could be replicated by just repeating the call.
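A rough sketch of how an API user might apply burn-in and delay after the fact (the return shape and variable names below are assumptions, not the current interface):

import numpy as np

# Run the sampler for a fixed number of passes; assume it returns one
# environment-count vector per pass (shape: max_passes x num_sources).
envcounts_per_pass = gibbs_sampler(cp, max_passes=500)

# The caller then decides which passes count as draws, e.g. discard the
# first 100 passes as burn-in and keep every 10th pass thereafter.
burnin, delay = 100, 10
draws = envcounts_per_pass[burnin::delay]

# Mixing proportions averaged over the retained draws.
mixing_proportions = (draws / draws.sum(axis=1, keepdims=True)).mean(axis=0)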
From @lkursell on March 24, 2016 17:47
Currently, ST2 adds together sample OTU counts for samples that belong to the same source. This can cause overestimation of contributions from Source environments that have more samples.
One solution would be to be able to pass --collapse_sources and --collapse_sinks flags with a --collapse_method such as mean or sum. Currently this can be done with the collapse_samples.py script, although that necessitates changes in mapping files and sample names.
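A minimal sketch of what source/sink collapsing could look like internally, assuming a feature table with samples as rows and a pandas Series mapping sample IDs to source environments (names here are illustrative):

import pandas as pd

def collapse_source_samples(table, sample_to_env, method='sum'):
    # Group sample rows by their source environment and collapse them with
    # the requested method: 'sum' reproduces the current behavior, while
    # 'mean' avoids over-weighting environments that contribute more samples.
    grouped = table.groupby(sample_to_env)
    return grouped.sum() if method == 'sum' else grouped.mean()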
Copied from original issue: biota/sourcetracker2_internal#28
From @lkursell on January 29, 2016 22:23
Want a quick function, or --flag, to pass so that the results can be visualized. Bar chart is likely most helpful. Might have some upper limits on number of sinks/sources for which the figures make sense.
Copied from original issue: biota/sourcetracker2_internal#8
There is a memory leak that has become apparent in long-running simulations. The memory usage of Python when running _gibbs steadily increases despite there being no clear reason for doing so (the memory is preallocated for results storage in gibbs_sampler).
I believe the error has to do with one of the following:
1. numpy array copies that occur and are not garbage collected (link1, link2).
2. pandas dataframes not being correctly garbage collected (might be the same bug as 1) (link1, link2, link3).
3. How ipyparallel is running (link1).
Based on the threads I have read (those linked above), I am guessing that either a bunch of array copies are occurring that are not getting collected, or there is some interaction between the cluster and multiple sinks.
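One hedged way to narrow this down would be to snapshot allocations between sinks with the standard-library tracemalloc module (a diagnostic sketch, not part of the package):

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run _gibbs on one or more sinks here ...

after = tracemalloc.take_snapshot()
# Show the code locations responsible for the largest growth in allocations.
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)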
The private API call _gibbs has unexpected behavior when null values are present in the source table. For example:
import numpy as np
import pandas as pd

sink = pd.DataFrame([[1, 2], [1, 0]], index=['x', 'y'])
source = pd.DataFrame([[1, 2], [np.nan, np.nan]], index=['v', 'z'])
mix, std = _gibbs(source, sink)
mix
v z Unknown
x 1.0 0.0 0.0
y 1.0 0.0 0.0
The simplest and probably easiest solution is to return an error if null values are present in the source or sink tables.
This won't do much, but will confirm that the public API hasn't changed. This will address the coverage decrease in #71.
When I created a python 3 environment with anaconda and tried to import from sourcetracker.sourcetracker, I got:
**RuntimeError**: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.
This SO post had a solution: adding backend: TkAgg to ~/.matplotlib/matplotlibrc.
Curious if others run into this problem. If so, we should update the documentation.
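If we do document this, an alternative worth mentioning is selecting the backend programmatically before pyplot is imported (a sketch; it assumes the TkAgg or Agg backend is available in the environment):

import matplotlib
# Must run before any `import matplotlib.pyplot` anywhere in the code.
matplotlib.use('TkAgg')  # or 'Agg' for headless use
import matplotlib.pyplot as plt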
From @gregcaporaso on March 1, 2016 22:32
Existing directories aren't allowed as parameters to -o, but the error message doesn't tell the user that the directory already exists, just that it's a directory.
$ sourcetracker2 gibbs -o sourcetracker
Usage: sourcetracker2 gibbs [OPTIONS]
Error: Invalid value for "-o" / "--output_dir": Path "sourcetracker" is a directory.
@wdwvt1 asked me to help out with this as he hasn't had success sorting it out.
Copied from original issue: biota/sourcetracker2_internal#25
Generate the pie charts and distributions a la R version
The innermost loop of the gibbs function must access different slices of an array millions of times during a single run. These slices are columns, their order of selection is random (each pass of the loop selects a different column), and they may only be selected one at a time.
If we could increase the speed at which numpy returns the slices, we might be able to save some time. As a test of this idea I used the following code to simulate drawing from a table with 100 sinks, 2000 features, and a number of inner-loop passes (slices taken) between 2**5 and 2**20 (32 to ~1,000,000).
The three schemes I tried are: an input array in row-major format (C-order), an input array in column-major format (F-order), and a dictionary with keys as column indices and values as a 1D array of the row values.
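Roughly, the comparison looked like the following (a reconstruction of the idea rather than the exact benchmark script; sizes match the description above):

import timeit
import numpy as np

n_rows, n_cols = 100, 2000                       # "sinks" x "features"
c_order = np.random.rand(n_rows, n_cols)         # row-major
f_order = np.asfortranarray(c_order)             # column-major
as_dict = {j: np.ascontiguousarray(c_order[:, j]) for j in range(n_cols)}

def take_slices(getter, n_passes):
    # Pull single columns in a random order, as the inner gibbs loop does.
    for j in np.random.randint(0, n_cols, n_passes):
        getter(j)

for n_passes in (2**5, 2**10, 2**15):
    for name, getter in (('C-order', lambda j: c_order[:, j]),
                         ('F-order', lambda j: f_order[:, j]),
                         ('dict', lambda j: as_dict[j])):
        t = timeit.timeit(lambda: take_slices(getter, n_passes), number=3)
        print(name, n_passes, '%.4f s' % t)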
The results show that the dictionary indexing method takes approximately 90% of the time of the F-order method.
_dict[:, 0] / forder[:, 0]
array([ 0.84039466, 1.01139224, 0.92655346, 0.88919508, 0.83832869,
0.86242302, 0.87799022, 0.92113187, 0.84468985, 0.89854044,
0.90343308, 0.89311444, 0.89205119, 0.89685498, 0.9020272 ,
0.90483933])
This approach seems promising to me. Pending new profiling results we should see if rewriting the internals to save 10% on indexing is worth it.
Include the full output of the average count of each bug from each source in the sink sample
From @lkursell on February 22, 2016 22:50
Would be nice to put in the default values for all the params; none are listed currently. Also, update the doc / install instructions to be specific to use with the QIIME AWS AMI.
Install Anaconda:
Run:
$ bash ~/Anaconda3-2.5.0-MacOSX-x86_64.sh
wherever you downloaded the Anaconda .sh file to
Create the Python 3 env
$ conda create -n py3 python=3 numpy scipy h5py hdf5
Get SourceTracker2
$ git clone https://github.com/biota/sourcetracker2.git
Make sure to put the SourceTracker2 directory in a user-specific directory, rather than a root directory, or it'll cause problems!
Install SourceTracker2
$ python setup.py install
from inside sourcetracker2 directory
Installed successfully!
Copied from original issue: biota/sourcetracker2_internal#21
Currently gibbs_sampler calculates the mixing proportions from the full draw (aka contingency table, aka full results). This could be moved out into another function, making the code more modular and easier to test.
Since it's the only one, would it be more intuitive if we just called sourcetracker2 rather than sourcetracker2 gibbs? Or are we planning to add others? If we do keep it, for users, the command gibbs probably isn't the most useful. Maybe track (or something along those lines) would be more intuitive (i.e., you don't need to know implementation details of the methods to understand the command name)?
One of the best changes Dan made in later ST1 versions is that it could tell you which specific OTUs were specific to a particular environment. It would be very useful if we could add this functionality, or add it to the instructions if it is easy to do.
It would be really handy to have a SourceTracker2 api, something that could be run in a notebook or easily integrated into a larger analysis pipeline. I'm thinking it would look something like the following:
def sourcetracker(source_ids, sink_ids, table, std=False, cluster=None, loo=False,
                  alpha1=0.001, alpha2=0.0, beta1=0.0, draws=0.0,
                  burnin=0.0, delay=0.0):
    '''Doc string.

    Parameters
    ----------
    source_ids : list
        Sample IDs that should be used as the sources; must correspond to IDs in the OTU table.
    sink_ids : list
        Sample IDs that should be used as the sinks.
    table : pd.DataFrame
        Pandas DataFrame object where columns are observations and rows are samples.
    cluster : ipyparallel.Client
        User-created parallel client object.

    The rest of the parameters should correspond to the CLI call.

    Returns
    -------
    mixing_proportions : pd.DataFrame
        A pandas DataFrame of the mixing proportions
    '''
    # a bunch of code...
    return mixing_proportions
The overall workflow could look something like this:
$ ipcluster start -n 4
from sourcetracker2 import sourcetracker
import ipyparallel as ipp
c = ipp.Client()
mix_proportions_df = sourcetracker(source_ids, sink_ids, otu_table, cluster=c)
The Travis build status badge is currently broken. @gregcaporaso, can you take a look?
Do we need to add install instructions to the readme?
The upcoming SourceTracker 2 comparisons are going to require a function (e.g., compare_sourcetracker_results) that takes an observed_composition and expected_composition of a sample, and returns a single value indicating their similarity. This function should also take a metric parameter which parameterizes how similarity is computed. This is similar in nature to the scikit-bio beta_diversity driver function.
Some of the metrics that we'd want to be able to use are:
I envision observed_composition and expected_composition being dicts mapping source environment type to proportion.
This function would create a pairwise similarity, so we'd also want to be able to summarize many pairwise similarities (e.g., for multiple different samples). We could start with np.median for this, and get more fancy when we need to.
@wdwvt1, you had some other ideas about metrics that would be useful for this. Could you add those here?
I could add this pretty quickly if this sounds like the right plan. It's kind of holding me up on an internal biota project right now. We could start with this being a private internal API and explore making it public another time.
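A minimal sketch of what this could look like, assuming dict inputs keyed by source environment and a couple of placeholder metrics (none of the names or metric choices here are settled):

import numpy as np

def compare_sourcetracker_results(observed_composition, expected_composition,
                                  metric='euclidean'):
    # Align the two compositions on the union of their source environments,
    # treating a missing environment as a proportion of 0.0.
    envs = sorted(set(observed_composition) | set(expected_composition))
    obs = np.array([observed_composition.get(e, 0.0) for e in envs])
    exp = np.array([expected_composition.get(e, 0.0) for e in envs])

    if metric == 'euclidean':
        return np.sqrt(((obs - exp) ** 2).sum())
    elif metric == 'absolute':
        # Sum of absolute differences in mixing proportions.
        return np.abs(obs - exp).sum()
    else:
        raise ValueError('Unknown metric: %r' % metric)

# Summarizing many pairwise comparisons could then start with, e.g.:
# np.median([compare_sourcetracker_results(o, e) for o, e in sample_pairs])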
@wdwvt1, @lkursell mentioned that you have some code that is in sourcetracker2_internal which is not in sourcetracker2. Can you issue a pull request that moves that code to sourcetracker2? We'll then be getting rid of the sourcetracker2_internal repository (email to follow shortly on the new strategy for internal development).
In #36 I added the README commands to the travis tests by hard-coding them in .travis.yml. We should come up with a better approach for that, possibly using click's testing framework. It is important that these are tested though, as that's what users will interact with first.
ST2 is slower than ST1 when fewer than ~2500 features are present in the table. Part of this comes from the unreasonably slow np.random.choice. Fully 55% of the run time is spent rescaling the joint probability values and choosing a new index (line 569), as revealed by this line profiler run.
Function: gibbs_sampler at line 436
Line # Hits Time Per Hit % Time Line Contents
==============================================================
# omitted
481 """
482 # Basic bookkeeping information we will use throughout the function.
483 6 7 1.2 0.0 num_sources = cp.V
484 6 3 0.5 0.0 num_features = cp.tau
485 6 24 4.0 0.0 source_indices = np.arange(num_sources)
486 6 22 3.7 0.0 sink = sink.astype(np.int32)
487 6 43 7.2 0.0 sink_sum = sink.sum()
488
489 # Calculate the number of passes that need to be conducted.
490 6 6 1.0 0.0 total_draws = restarts * draws_per_restart
491 6 6 1.0 0.0 total_passes = burnin + (draws_per_restart - 1) * delay + 1
492
493 # Results containers.
494 6 25 4.2 0.0 final_envcounts = np.zeros((total_draws, num_sources), dtype=np.int32)
495 6 492 82.0 0.0 final_env_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
496 6 83 13.8 0.0 final_taxon_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
497
498 # Sequences from the sink will be randomly assigned a source environment
499 # and then reassigned based on an increasingly accurate set of
500 # probabilities. The order in which the sequences are selected for
501 # reassignment must be random to avoid a systematic bias where the
502 # sequences occuring later in the taxon_sequence book-keeping vector
503 # receive more accurate reassignments by virtue of more updates to the
504 # probability model. 'order' will be shuffled each pass, but can be
505 # instantiated here to avoid unnecessary duplication.
506 6 202 33.7 0.0 order = np.arange(sink_sum, dtype=np.int32)
507
508 # Create a bookkeeping vector that keeps track of each sequence in the
509 # sink. Each one will be randomly assigned an environment, and then
510 # reassigned based on the increasinly accurate distribution. sink[i] i's
511 # will be placed in the `taxon_sequence` vector to allow each individual
512 # count to be removed and reassigned.
513 6 887 147.8 0.0 taxon_sequence = np.repeat(np.arange(num_features), sink).astype(np.int32)
514
515 # Update the conditional probability class now that we have the sink sum.
516 6 21 3.5 0.0 cp.set_n(sink_sum)
517 6 211 35.2 0.0 cp.precalculate()
518
519 # Several bookkeeping variables that are used within the for loops.
520 6 3 0.5 0.0 drawcount = 0
521 6 6 1.0 0.0 unknown_idx = num_sources - 1
522
523 36 47 1.3 0.0 for restart in range(restarts):
524 # Generate random source assignments for each sequence in the sink
525 # using a uniform distribution.
526 seq_env_assignments, envcounts = \
527 30 7735 257.8 0.0 generate_environment_assignments(sink_sum, num_sources)
528
529 # Initially, the count of each taxon in the 'unknown' source should be
530 # 0.
531 30 160 5.3 0.0 unknown_vector = np.zeros(num_features, dtype=np.int32)
532 30 20 0.7 0.0 unknown_sum = 0
533
534 # If a sequence's random environmental assignment is to the 'unknown'
535 # environment we alter the training data to include those sequences
536 # in the 'unknown' source.
537 797765 657500 0.8 0.1 for e, t in zip(seq_env_assignments, taxon_sequence):
538 797735 629758 0.8 0.1 if e == unknown_idx:
539 199601 514084 2.6 0.1 unknown_vector[t] += 1
540 199601 149597 0.7 0.0 unknown_sum += 1
541
542 660 762 1.2 0.0 for rep in range(1, total_passes + 1):
543 # Iterate through sequences in a random order so that no
544 # systematic bias is introduced based on position in the taxon
545 # vector (i.e. taxa appearing at the end of the vector getting
546 # better estimates of the probability).
547 630 507358 805.3 0.1 np.random.shuffle(order)
548
549 16753065 14481942 0.9 1.7 for seq_index in order:
550 16752435 15050110 0.9 1.8 e = seq_env_assignments[seq_index]
551 16752435 14324415 0.9 1.7 t = taxon_sequence[seq_index]
552
553 # Remove the ith sequence and update the probability
554 # associated with that environment.
555 16752435 20047543 1.2 2.4 envcounts[e] -= 1
556 16752435 13817773 0.8 1.6 if e == unknown_idx:
557 6568698 20939355 3.2 2.5 unknown_vector[t] -= 1
558 6568698 5282662 0.8 0.6 unknown_sum -= 1
559
560 # Calculate the new joint probability vector based on the
561 # removal of the ith sequence. Scale this probability vector
562 # for use by np.random.choice.
563 16752435 16877702 1.0 2.0 jp = cp.calculate_cp_slice(t, unknown_vector[t], unknown_sum,
564 16752435 162371578 9.7 19.4 envcounts)
565
566 # Reassign the sequence to a new source environment and
567 # update counts for each environment and the unknown source
568 # if necessary.
569 16752435 466547414 27.8 55.6 new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
570 #new_e_idx = np.random.multinomial(n=1, pvals=jp/jp.sum()).argmax()
571
572
573 16752435 19320519 1.2 2.3 seq_env_assignments[seq_index] = new_e_idx
574 16752435 23089569 1.4 2.8 envcounts[new_e_idx] += 1
575
576 16752435 14819348 0.9 1.8 if new_e_idx == unknown_idx:
577 6706687 23665932 3.5 2.8 unknown_vector[t] += 1
578 6706687 5595877 0.8 0.7 unknown_sum += 1
579
580 630 560 0.9 0.0 if rep > burnin and ((rep - (burnin + 1)) % delay) == 0:
581 # Update envcounts array with the assigned envs.
582 30 75 2.5 0.0 final_envcounts[drawcount] = envcounts
583
584 # Assign vectors necessary for feature table reconstruction.
585 30 2088 69.6 0.0 final_env_assignments[drawcount] = seq_env_assignments
586 30 1653 55.1 0.0 final_taxon_assignments[drawcount] = taxon_sequence
587
588 # We've made a draw, update this index so that the next
589 # iteration will be placed in the correct index of results.
590 30 28 0.9 0.0 drawcount += 1
591
592 6 5 0.8 0.0 return (final_envcounts, final_env_assignments, final_taxon_assignments)
If I use the commented-out multinomial code (line 570) as suggested here, I reduce the runtime by 42% (487s vs 838s). That's a substantial improvement from a single line of code. The only thing stopping this from going in is that np.random.choice makes a different number of calls to the PRNG compared to np.random.multinomial, so the seeded tests all fail.
The question is: if we are going to remove the seeded tests and recalculate them, is it worth it to also implement a faster algorithm (e.g. the alias method that @wasade suggested)? There is a call to implement the alias method in numpy here, but no plan.
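For context, the gap between the two draw strategies is easy to reproduce in isolation with a micro-benchmark like the following (sizes are illustrative; this is not the package's benchmarking code):

import timeit
import numpy as np

# A probability vector roughly the size of a typical number of source
# environments, and the index array np.random.choice draws from.
p = np.random.dirichlet(np.ones(10))
idx = np.arange(10)

t_choice = timeit.timeit(lambda: np.random.choice(idx, p=p), number=100000)
t_multi = timeit.timeit(lambda: np.random.multinomial(1, p).argmax(),
                        number=100000)
print('np.random.choice:               %.2f s' % t_choice)
print('np.random.multinomial().argmax: %.2f s' % t_multi)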
In my use case, I'm dealing with metabolite counts on the order of 10^9 counts for a single molecule in a single sample. When I try running sourcetracker2, I get an out-of-memory error, as follows.
Traceback (most recent call last):
File "/home/mortonjt/miniconda/envs/st2/bin/sourcetracker2", line 11, in <module>
sys.exit(cli())
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_cli/gibbs.py", line 182, in gibbs
sink_rarefaction_depth)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_sourcetracker.py", line 184, in subsample_sources_sinks
replace=False)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/skbio/stats/_subsample.py", line 255, in subsample_counts
counts_sum)
File "skbio/stats/__subsample.pyx", line 22, in skbio.stats.__subsample._subsample_counts_without_replacement (skbio/stats/__subsample.c:1394)
MemoryError
Now when there are this many counts, it is probably good enough to sample with replacement rather than without replacement. The easiest fix would be to add an argument to allow replace=True here and here.
It's also worthwhile re-evaluating the rarefaction benchmarks for sourcetracker. It's possible that removing subsampling completely could actually improve the accuracy.
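A hedged sketch of what with-replacement subsampling could look like for very deep samples, using a plain multinomial draw rather than expanding every observation (the function name is illustrative):

import numpy as np

def subsample_with_replacement(counts, depth):
    # Draw `depth` observations with replacement from the empirical
    # distribution defined by `counts`; memory use stays proportional to the
    # number of features, not the total number of counts.
    counts = np.asarray(counts, dtype=float)
    return np.random.multinomial(depth, counts / counts.sum())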
Create a plugin for Qiime2. Try and use old q2d2 metadata filtering to select sinks and sources.
This is one of the biggest user frustrations with ST1.
Accumulate documentation errors here until enough to justify a PR.
_cli/gibbs.py: line 109 is missing a closing parenthesis.
From @wdwvt1 on February 27, 2016 7:40
If a user deletes the output folder that has been created before all the jobs launched by IPcluster have finished, the cluster will fail to shut itself down. This could cause serious resource leaks.
Copied from original issue: biota/sourcetracker2_internal#23
From @wdwvt1 on January 5, 2016 16:59
@justin212k had an excellent suggestion to use memoization for storing calls in some of the inner loops. Based on discussion with @justin212k and @gregcaporaso, we think that memoization should occur on the function ConditionalProbability.calculate_cp_slice, since this function is deterministic and merely computes the probabilities that np.random.choice draws from.
We can also use the improved lru_cache function in Python 3 to inspect the number of hits to the cache, which will give us a good estimate of what cache size to use, how much we can save with the cache, etc.
Things to note when we implement this:
- The cache size should probably be a power of two (2^N) - I imagine to avoid resizing or memory-allocation problems.
The general cost of the memoization approach for N iterations of the innermost loop will be (this is my guess, please correct me if I am wrong):
- N lookups at O(1), plus growth of the dictionary up to its maximum size
- Assignment to the dictionary N - x times, where x is the number of times the result is already in the cache
- Overhead associated with keeping the cache
The savings will be:
- Calculation of the full joint probability vector x times.
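A rough sketch of how this could be wired up with functools.lru_cache (illustrative only; calculate_cp_slice's real arguments include numpy arrays, which are unhashable and are therefore reduced to tuples here to form a cache key):

from functools import lru_cache
import numpy as np

def make_cached_cp_slice(cp, maxsize=2**16):
    # Wrap cp.calculate_cp_slice with an LRU cache keyed on hashable copies
    # of its arguments.
    @lru_cache(maxsize=maxsize)
    def cached(t, unknown_count, unknown_sum, envcounts_key):
        return cp.calculate_cp_slice(t, unknown_count, unknown_sum,
                                     np.asarray(envcounts_key))
    return cached

# Inside gibbs_sampler one might then call:
#   cp_slice = make_cached_cp_slice(cp)
#   jp = cp_slice(t, unknown_vector[t], unknown_sum, tuple(envcounts))
# and afterwards inspect cp_slice.cache_info() to estimate the hit rate.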
Copied from original issue: biota/sourcetracker2_internal#2
We have three sets of default parameters floating around. The CLI defaults for ST2, the CLI defaults for ST1, and another set that are introduced in the API documentation here (the "API doc defaults").
The "API doc defaults" are orders of magnitude faster. In our upcoming parameter evaluation, we should decide on which of these we want to use under which circumstances (e.g., is there a fast setting and a more accurate setting?), and more clearly document that. We should also review the findings presented in Henry et al as that is relevant here.