caporaso-lab / sourcetracker2
SourceTracker2
License: BSD 3-Clause "New" or "Revised" License
From @lkursell on March 21, 2016 22:24
A more descriptive term for how this works would be handy, e.g. 'sequence_density', such that sourcetracker2 seq_density -i table.biom -m mf.txt -o out/ is callable.
Copied from original issue: biota/sourcetracker2_internal#27
@gregcaporaso suggested that validate_gibbs_input should have more informative errors on the input dataframes, specifically saying why they fail (e.g. sources had a nan at loc X).
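A minimal sketch of the kind of check that could produce such messages (the function name and wording are illustrative, not the current implementation):

import numpy as np

def _check_no_nulls(df, name):
    # Report the exact row/column locations of any null values so the user
    # knows why validation failed (e.g. "sources had a nan at loc X").
    if df.isnull().values.any():
        rows, cols = np.where(df.isnull().values)
        locs = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
        raise ValueError('%s contains null values at: %s' % (name, locs[:10]))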
Cross validated alpha tuning and fit evaluation
It would be good to have a parse_sample_metadata function, and then use that instead of calls like sample_metadata = pd.read_table(mapping_fp, index_col=0). This call will be problematic in some cases that we've run into (here's how it should be done), and we want to be able to fix that in one place rather than have the file be potentially parsed differently in different parts of the code base.
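A minimal sketch of what such a function might look like (the parsing choices here are assumptions about the desired behavior, not an agreed-upon spec):

import pandas as pd

def parse_sample_metadata(mapping_fp):
    # Single place to control how mapping files are parsed: tab-delimited,
    # sample IDs (first column) as the index, all values kept as strings so
    # numeric-looking IDs and categories are not silently coerced.
    return pd.read_csv(mapping_fp, sep='\t', index_col=0, dtype=str)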
Multiple issues here:
A QIIME forum user reported an error installing because conda couldn't find scikitbio=0.4.3. Conda auto-suggested scikit-bio instead, which I think should work, but we need to investigate this failure.
@gregcaporaso - is this a problem with our install documentation, or does that user just need to add e.g. the bioconda channel?
Users will want to know the Jensen-Shannon divergence of their sources. The manuscript will allow users to relate the JSD between the sources and the accuracy/precision of the mixing proportions they get back. This is not a novel idea, based on the work in Dan's original paper.
@gregcaporaso @justin212k @lkursell @johnchase @nscott1 @mjk1000
Based on working with the code this weekend to try and get @johnchase and @lkursell a simple API for launching jobs I have come to the conclusion that the code has accumulated some cruft. Cruft = technical debt that will come due at some time, so I think we need to consider refactoring. I am not advocating something major, but a set of suggestions that @lkursell and @johnchase have already essentially asked for.
Below are my thoughts, please let me know what you think
The things I think would help and that I plan on doing once we modify/come to agreement:
- Refactor gibbs_sampler to use pd.DataFrames. A significant amount of code is spent in sourcetracker/sourcetracker.py as well as sourcetracker/cli/gibbs.py to ensure that the arrays representing source data, sink data, and the mapping file data are in the proper order, etc. DataFrames can help remove much of this code (@lkursell and @johnchase pointed this out). This could include converting input biom tables to a DataFrame, as it would make some operations slightly easier. Not a necessity, but a possibility for a little extra gain.
- Refactor ConditionalProbability to be clearer and more easily extensible. In our efforts to eke out speed improvements I wrote this class so that it would precompute everything that could be precomputed. The speed gains from this are minor based on my current understanding, and it makes understanding what's happening much less straightforward. In addition, as we move to add other layers of probability (e.g. what @lkursell discussed in his last email), we will likely want to add them in this class. Without refactoring, it's going to be hell trying to do this.

What I need help with:
- The functools.partial nonsense that we are using to enable passing jobs to an ipyparallel client. I don't think we should need to do this, and removing it would get rid of a fair amount of code, but I don't understand why we need it in the first place. If you try to pass a function to the engines that is not wrapped by partial, you get errors indicating the local namespace doesn't contain the variables you are trying to pass.

Things I'd like to get a consensus on:
From @lkursell on February 12, 2016 22:39
Displaying ST2's progress, like the original does, is very helpful for making sure commands are executing correctly, especially given the time required for jobs to finish; progress indicators are also a very convenient way of determining how far along a run is.
Copied from original issue: biota/sourcetracker2_internal#14
Plot is too general - we'll want to support other types of plotting in the future.
hat-tip to @johnchase for pointing this out.
If the input dataframes for _gibbs have different numbers of features (different columns) the function will fail with an IndexError like the following:
IndexError: index 11751 is out of bounds for axis 0 with size 11290
_gibbs should check that there are the same number of features in both sources_df and sinks_df. An additional check would be that the identities of those features are the same.
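A minimal sketch of such a check (the function name and messages are illustrative):

def _validate_features(sources_df, sinks_df):
    # Fail early with a clear message instead of a cryptic IndexError deep
    # inside the sampler when source and sink feature sets don't line up.
    if sources_df.shape[1] != sinks_df.shape[1]:
        raise ValueError('sources and sinks have different numbers of '
                         'features: %d vs %d.'
                         % (sources_df.shape[1], sinks_df.shape[1]))
    if not sources_df.columns.equals(sinks_df.columns):
        raise ValueError('sources and sinks contain different feature IDs.')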
If the delay parameter in the gibbs_sampler function is set to 1, the function will never take draws (x - y mod 1 always equals 0).
This is a legacy from the original SourceTracker code. We need to establish whether there is any reason that a delay of 1 is unacceptable, and rewrite the check if not.
From @wdwvt1 on January 30, 2016 0:15
Unsure if this is a bug - but if the number of unknown sequences reaches 0 and alpha2 is set to 0, then lines 370-372 of sourcetracker.py might raise a zero-division error.
We have not verified this in tests (we can't get the unknown count that low), but it deserves investigation.
Copied from original issue: biota/sourcetracker2_internal#9
by identifying 'contamination' sources in the mapping file and then subtracting these sequence counts from the sources of interest for each sink sample
Address comments in #38 before building this release (that was accidentally merged before the comments were addressed, but the comments were minor).
_gibbs fails with:
IndexError: index 6283 is out of bounds for axis 0 with size 6283
when there are floats in the sink_df. The error occurs when trying to create the order used to walk through the sequences.
We probably want to add a check so that a less cryptic error message is given to the user.
@gregcaporaso and I pair reviewed and decided there are a couple of key places where test coverage should be improved.
import numpy as np
from scipy.stats import ttest_1samp

# Mixing proportions reported by ST1 for the same sources/sinks.
st1_ms = np.array([[.3, .4, .3], [.2, .3, .5]])
# Mixing proportions from ST2 (gibbs is the public ST2 function under test).
st2_ms = gibbs(source, sinks)
# One-sample t-test of the ST2 proportions against the ST1 means; require
# that no source environment differs significantly.
obs_t, obs_p = ttest_1samp(st2_ms, st1_ms.mean(0))
p_threshold = .05
self.assertTrue((obs_p > p_threshold).all())
- gibbs and gibbs_loo (these are the largest sources of untested code in the entire package). We'll have to make a hand-rolled example that we run through from start to finish.
- gibbs and gibbs_loo - these will just test that when a code change happens, a PRNG-seeded example didn't change. These do not test accuracy, just consistency.

For simulations to determine convergence rates of gibbs_sampler, I've been setting burnins=0, delay=1, draws_per_restart=X, where X is the number of passes I want to make. I think this should be the default way we call gibbs_sampler, by just giving it a max number of passes. So, it'd look like:
def gibbs_sampler(cp, max_passes):
Then, an API user can take the returned data and use any parameters they like for the delay, burnins, etc. This removes draws/restart as a parameter, but that could be replicated by just repeating the call.
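A rough sketch of how an API user might apply burn-in and delay after the fact (the return shape and variable names below are assumptions, not the current interface):

import numpy as np

# Run the sampler for a fixed number of passes; assume it returns one
# environment-count vector per pass (shape: max_passes x num_sources).
envcounts_per_pass = gibbs_sampler(cp, max_passes=500)

# The caller then decides which passes count as draws, e.g. discard the
# first 100 passes as burn-in and keep every 10th pass thereafter.
burnin, delay = 100, 10
draws = envcounts_per_pass[burnin::delay]

# Mixing proportions averaged over the retained draws.
mixing_proportions = (draws / draws.sum(axis=1, keepdims=True)).mean(axis=0)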
From @lkursell on March 24, 2016 17:47
Currently, ST2 adds together sample OTU counts for samples that belong to the same source. This can cause overestimation of contributions from Source environments that have more samples.
One solution would be to be able to pass --collapse_sources and --collapse_sinks flags with a --collapse_method such as mean or sum. Currently this can be done with the collapse_samples.py script, although that necessitates changes in mapping files and sample names.
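A minimal sketch of what source/sink collapsing could look like internally, assuming a feature table with samples as rows and a pandas Series mapping sample IDs to source environments (names here are illustrative):

import pandas as pd

def collapse_source_samples(table, sample_to_env, method='sum'):
    # Group sample rows by their source environment and collapse them with
    # the requested method: 'sum' reproduces the current behavior, while
    # 'mean' avoids over-weighting environments that contribute more samples.
    grouped = table.groupby(sample_to_env)
    return grouped.sum() if method == 'sum' else grouped.mean()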
Copied from original issue: biota/sourcetracker2_internal#28
From @lkursell on January 29, 2016 22:23
Want a quick function, or --flag, to pass so that the results can be visualized. Bar chart is likely most helpful. Might have some upper limits on number of sinks/sources for which the figures make sense.
Copied from original issue: biota/sourcetracker2_internal#8
There is a memory leak that has become apparent in long-running simulations. The memory usage of Python when running _gibbs steadily increases despite there being no clear reason for doing so (the memory is preallocated for results storage in gibbs_sampler).
I believe the error has to do with one of the following:
1. numpy array copies that occur and are not garbage collected (link1, link2).
2. pandas dataframes not being correctly garbage collected (might be the same bug as 1) (link1, link2, link3).
3. How ipyparallel is running (link1).
Based on the threads I have read (those linked above), I am guessing that either a bunch of array copies are occurring that are not getting collected, or there is some interaction between the cluster and multiple sinks.
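One hedged way to narrow this down would be to snapshot allocations between sinks with the standard-library tracemalloc module (a diagnostic sketch, not part of the package):

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run _gibbs on one or more sinks here ...

after = tracemalloc.take_snapshot()
# Show the code locations responsible for the largest growth in allocations.
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)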
The private API call _gibbs has unexpected behavior when null values are present in the source table. For example:
import numpy as np
import pandas as pd

sink = pd.DataFrame([[1, 2], [1, 0]], index=['x', 'y'])
source = pd.DataFrame([[1, 2], [np.nan, np.nan]], index=['v', 'z'])
mix, std = _gibbs(source, sink)
mix
v z Unknown
x 1.0 0.0 0.0
y 1.0 0.0 0.0
The simplest and probably easiest solution is to return an error if null values are present in the source or sink tables.
This won't do much, but will confirm that the public API hasn't changed. This will address the coverage decrease in #71.
When I created a python 3 environment with anaconda and tried to import from sourcetracker.sourcetracker, I got:
**RuntimeError**: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.
This SO post had a solution: adding backend: TkAgg to ~/.matplotlib/matplotlibrc.
Curious if others run into this problem. If so, we should update the documentation.
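If we do document this, an alternative worth mentioning is selecting the backend programmatically before pyplot is imported (a sketch; it assumes the TkAgg or Agg backend is available in the environment):

import matplotlib
# Must run before any `import matplotlib.pyplot` anywhere in the code.
matplotlib.use('TkAgg')  # or 'Agg' for headless use
import matplotlib.pyplot as plt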
From @gregcaporaso on March 1, 2016 22:32
Existing directories aren't allowed as parameters to -o, but the error message doesn't tell the user that the directory already exists, just that it's a directory.
$ sourcetracker2 gibbs -o sourcetracker
Usage: sourcetracker2 gibbs [OPTIONS]
Error: Invalid value for "-o" / "--output_dir": Path "sourcetracker" is a directory.
@wdwvt1 asked me to help out with this as he hasn't had success sorting it out.
Copied from original issue: biota/sourcetracker2_internal#25
Generate the pie charts and distributions a la R version
The innermost loop of the gibbs function must access different slices of an array millions of times during a single run. These slices are columns, their order of selection is random (each pass of the loop selects a different column), and they may only be selected one at a time.
If we could increase the speed at which numpy returns the slices, we might be able to save some time. As a test of this idea I used the following code to simulate drawing from a table with 100 sinks, 2000 features, and a number of inner-loop passes (slices taken) between 2**5 and 2**20 (32 to ~1,000,000).
The three schemes I tried are: an input array in row-major format (C-order), an input array in column-major format (F-order), and a dictionary with keys as column indices and values as a 1D array of the row values.
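Roughly, the comparison looked like the following (a reconstruction of the idea rather than the exact benchmark script; sizes match the description above):

import timeit
import numpy as np

n_rows, n_cols = 100, 2000                       # "sinks" x "features"
c_order = np.random.rand(n_rows, n_cols)         # row-major
f_order = np.asfortranarray(c_order)             # column-major
as_dict = {j: np.ascontiguousarray(c_order[:, j]) for j in range(n_cols)}

def take_slices(getter, n_passes):
    # Pull single columns in a random order, as the inner gibbs loop does.
    for j in np.random.randint(0, n_cols, n_passes):
        getter(j)

for n_passes in (2**5, 2**10, 2**15):
    for name, getter in (('C-order', lambda j: c_order[:, j]),
                         ('F-order', lambda j: f_order[:, j]),
                         ('dict', lambda j: as_dict[j])):
        t = timeit.timeit(lambda: take_slices(getter, n_passes), number=3)
        print(name, n_passes, '%.4f s' % t)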
The results show that the dictionary indexing method takes approximately 90% of the time of the F-order method.
_dict[:, 0] / forder[:, 0]
array([ 0.84039466, 1.01139224, 0.92655346, 0.88919508, 0.83832869,
0.86242302, 0.87799022, 0.92113187, 0.84468985, 0.89854044,
0.90343308, 0.89311444, 0.89205119, 0.89685498, 0.9020272 ,
0.90483933])
This approach seems promising to me. Pending new profiling results we should see if rewriting the internals to save 10% on indexing is worth it.
Include the full output of the average count of each bug from each source in the sink sample
From @lkursell on February 22, 2016 22:50
Would be nice to put in the default values for all the params; none are listed currently. Also, update the doc / install instructions to be specific to use with the QIIME AWS AMI.
Install Anaconda:
Run:
$ bash ~/Anaconda3-2.5.0-MacOSX-x86_64.sh
wherever you downloaded the Anaconda .sh file to
Create the Python 3 env
$ conda create -n py3 python=3 numpy scipy h5py hdf5
Get SourceTracker2
$ git clone https://github.com/biota/sourcetracker2.git
Make sure to put the SourceTracker2 directory in a user-specific directory, rather than a root directory, or it'll cause problems!
Install SourceTracker2
$ python setup.py install
from inside sourcetracker2 directory
Installed successfully!
Copied from original issue: biota/sourcetracker2_internal#21
Currently gibbs_sampler calculates the mixing proportions from the full draw (aka contingency table, aka full results). This could be moved out into another function, making the code more modular and easier to test.
Since it's the only one, would it be more intuitive if we just called sourcetracker2 rather than sourcetracker2 gibbs? Or are we planning to add others? If we do keep it, for users, the command gibbs probably isn't the most useful. Maybe track (or something along those lines) would be more intuitive (i.e., you don't need to know implementation details of the methods to understand the command name)?
One of the best changes Dan made in later ST1 versions is that it could tell you which specific OTUs were specific to a particular environment. It would be very useful if we could add this functionality, or add it to the instructions if it is easy to do.
It would be really handy to have a SourceTracker2 api, something that could be run in a notebook or easily integrated into a larger analysis pipeline. I'm thinking it would look something like the following:
def sourcetracker(source_ids, sink_ids, table, std=False, cluster=None, loo=False,
                  alpha1=0.001, alpha2=0.0, beta1=0.0, draws=0.0,
                  burnin=0.0, delay=0.0):
    '''Doc string.

    Parameters
    ----------
    source_ids : list
        Sample IDs that should be used as the sources; must correspond to IDs in the OTU table.
    sink_ids : list
        Sample IDs that should be used as the sinks.
    table : pd.DataFrame
        Pandas DataFrame object where columns are observations and rows are samples.
    cluster : ipyparallel.Client
        User-created parallel client object.

    The rest of the parameters should correspond to the CLI call.

    Returns
    -------
    mixing_proportions : pd.DataFrame
        A pandas DataFrame of the mixing proportions
    '''
    # a bunch of code...
    return mixing_proportions
The overall workflow could look something like this:
$ ipcluster start -n 4
from sourcetracker2 import sourcetracker
import ipyparallel as ipp
c = ipp.Client()
mix_proportions_df = sourcetracker(source_ids, sink_ids, otu_table, cluster=c)
The Travis build status badge is currently broken. @gregcaporaso, can you take a look?
Do we need to add install instructions to the readme?
The upcoming SourceTracker 2 comparisons are going to require a function (e.g., compare_sourcetracker_results) that takes an observed_composition and expected_composition of a sample, and returns a single value indicating their similarity. This function should also take a metric parameter which parameterizes how similarity is computed. This is similar in nature to the scikit-bio beta_diversity driver function.
Some of the metrics that we'd want to be able to use are:
I envision observed_composition and expected_composition being dicts mapping source environment type to proportion.
This function would create a pairwise similarity, so we'd also want to be able to summarize many pairwise similarities (e.g., for multiple different samples). We could start with np.median for this, and get more fancy when we need to.
@wdwvt1, you had some other ideas about metrics that would be useful for this. Could you add those here?
I could add this pretty quickly if this sounds like the right plan. It's kind of holding me up on an internal biota project right now. We could start with this being a private internal API and explore making it public another time.
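A minimal sketch of what this could look like, assuming dict inputs keyed by source environment and a couple of placeholder metrics (none of the names or metric choices here are settled):

import numpy as np

def compare_sourcetracker_results(observed_composition, expected_composition,
                                  metric='euclidean'):
    # Align the two compositions on the union of their source environments,
    # treating a missing environment as a proportion of 0.0.
    envs = sorted(set(observed_composition) | set(expected_composition))
    obs = np.array([observed_composition.get(e, 0.0) for e in envs])
    exp = np.array([expected_composition.get(e, 0.0) for e in envs])

    if metric == 'euclidean':
        return np.sqrt(((obs - exp) ** 2).sum())
    elif metric == 'absolute':
        # Sum of absolute differences in mixing proportions.
        return np.abs(obs - exp).sum()
    else:
        raise ValueError('Unknown metric: %r' % metric)

# Summarizing many pairwise comparisons could then start with, e.g.:
# np.median([compare_sourcetracker_results(o, e) for o, e in sample_pairs])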
@wdwvt1, @lkursell mentioned that you have some code that is in sourcetracker2_internal which is not in sourcetracker2. Can you issue a pull request that moves that code to sourcetracker2? We'll then be getting rid of the sourcetracker2_internal repository (email to follow shortly on the new strategy for internal development).
In #36 I added the README commands to the travis tests by hard-coding them in .travis.yml. We should come up with a better approach for that, possibly using click's testing framework. It is important that these are tested though, as that's what users will interact with first.
ST2 is slower than ST1 when fewer than ~2500 features are present in the table. Part of this comes from the unreasonably slow np.random.choice. Fully 55% of the run time is spent rescaling the joint probability values and choosing a new index (line 569), as revealed by this line profiler run.
Function: gibbs_sampler at line 436
Line # Hits Time Per Hit % Time Line Contents
==============================================================
# omitted
481 """
482 # Basic bookkeeping information we will use throughout the function.
483 6 7 1.2 0.0 num_sources = cp.V
484 6 3 0.5 0.0 num_features = cp.tau
485 6 24 4.0 0.0 source_indices = np.arange(num_sources)
486 6 22 3.7 0.0 sink = sink.astype(np.int32)
487 6 43 7.2 0.0 sink_sum = sink.sum()
488
489 # Calculate the number of passes that need to be conducted.
490 6 6 1.0 0.0 total_draws = restarts * draws_per_restart
491 6 6 1.0 0.0 total_passes = burnin + (draws_per_restart - 1) * delay + 1
492
493 # Results containers.
494 6 25 4.2 0.0 final_envcounts = np.zeros((total_draws, num_sources), dtype=np.int32)
495 6 492 82.0 0.0 final_env_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
496 6 83 13.8 0.0 final_taxon_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
497
498 # Sequences from the sink will be randomly assigned a source environment
499 # and then reassigned based on an increasingly accurate set of
500 # probabilities. The order in which the sequences are selected for
501 # reassignment must be random to avoid a systematic bias where the
502 # sequences occuring later in the taxon_sequence book-keeping vector
503 # receive more accurate reassignments by virtue of more updates to the
504 # probability model. 'order' will be shuffled each pass, but can be
505 # instantiated here to avoid unnecessary duplication.
506 6 202 33.7 0.0 order = np.arange(sink_sum, dtype=np.int32)
507
508 # Create a bookkeeping vector that keeps track of each sequence in the
509 # sink. Each one will be randomly assigned an environment, and then
510 # reassigned based on the increasinly accurate distribution. sink[i] i's
511 # will be placed in the `taxon_sequence` vector to allow each individual
512 # count to be removed and reassigned.
513 6 887 147.8 0.0 taxon_sequence = np.repeat(np.arange(num_features), sink).astype(np.int32)
514
515 # Update the conditional probability class now that we have the sink sum.
516 6 21 3.5 0.0 cp.set_n(sink_sum)
517 6 211 35.2 0.0 cp.precalculate()
518
519 # Several bookkeeping variables that are used within the for loops.
520 6 3 0.5 0.0 drawcount = 0
521 6 6 1.0 0.0 unknown_idx = num_sources - 1
522
523 36 47 1.3 0.0 for restart in range(restarts):
524 # Generate random source assignments for each sequence in the sink
525 # using a uniform distribution.
526 seq_env_assignments, envcounts = \
527 30 7735 257.8 0.0 generate_environment_assignments(sink_sum, num_sources)
528
529 # Initially, the count of each taxon in the 'unknown' source should be
530 # 0.
531 30 160 5.3 0.0 unknown_vector = np.zeros(num_features, dtype=np.int32)
532 30 20 0.7 0.0 unknown_sum = 0
533
534 # If a sequence's random environmental assignment is to the 'unknown'
535 # environment we alter the training data to include those sequences
536 # in the 'unknown' source.
537 797765 657500 0.8 0.1 for e, t in zip(seq_env_assignments, taxon_sequence):
538 797735 629758 0.8 0.1 if e == unknown_idx:
539 199601 514084 2.6 0.1 unknown_vector[t] += 1
540 199601 149597 0.7 0.0 unknown_sum += 1
541
542 660 762 1.2 0.0 for rep in range(1, total_passes + 1):
543 # Iterate through sequences in a random order so that no
544 # systematic bias is introduced based on position in the taxon
545 # vector (i.e. taxa appearing at the end of the vector getting
546 # better estimates of the probability).
547 630 507358 805.3 0.1 np.random.shuffle(order)
548
549 16753065 14481942 0.9 1.7 for seq_index in order:
550 16752435 15050110 0.9 1.8 e = seq_env_assignments[seq_index]
551 16752435 14324415 0.9 1.7 t = taxon_sequence[seq_index]
552
553 # Remove the ith sequence and update the probability
554 # associated with that environment.
555 16752435 20047543 1.2 2.4 envcounts[e] -= 1
556 16752435 13817773 0.8 1.6 if e == unknown_idx:
557 6568698 20939355 3.2 2.5 unknown_vector[t] -= 1
558 6568698 5282662 0.8 0.6 unknown_sum -= 1
559
560 # Calculate the new joint probability vector based on the
561 # removal of the ith sequence. Scale this probability vector
562 # for use by np.random.choice.
563 16752435 16877702 1.0 2.0 jp = cp.calculate_cp_slice(t, unknown_vector[t], unknown_sum,
564 16752435 162371578 9.7 19.4 envcounts)
565
566 # Reassign the sequence to a new source environment and
567 # update counts for each environment and the unknown source
568 # if necessary.
569 16752435 466547414 27.8 55.6 new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
570 #new_e_idx = np.random.multinomial(n=1, pvals=jp/jp.sum()).argmax()
571
572
573 16752435 19320519 1.2 2.3 seq_env_assignments[seq_index] = new_e_idx
574 16752435 23089569 1.4 2.8 envcounts[new_e_idx] += 1
575
576 16752435 14819348 0.9 1.8 if new_e_idx == unknown_idx:
577 6706687 23665932 3.5 2.8 unknown_vector[t] += 1
578 6706687 5595877 0.8 0.7 unknown_sum += 1
579
580 630 560 0.9 0.0 if rep > burnin and ((rep - (burnin + 1)) % delay) == 0:
581 # Update envcounts array with the assigned envs.
582 30 75 2.5 0.0 final_envcounts[drawcount] = envcounts
583
584 # Assign vectors necessary for feature table reconstruction.
585 30 2088 69.6 0.0 final_env_assignments[drawcount] = seq_env_assignments
586 30 1653 55.1 0.0 final_taxon_assignments[drawcount] = taxon_sequence
587
588 # We've made a draw, update this index so that the next
589 # iteration will be placed in the correct index of results.
590 30 28 0.9 0.0 drawcount += 1
591
592 6 5 0.8 0.0 return (final_envcounts, final_env_assignments, final_taxon_assignments)
If I use the commented-out multinomial code (line 570) as suggested here, I reduce the runtime by 42% (487s vs 838s). That's a substantial improvement from a single line of code. The only thing stopping this from going in is that np.random.choice makes a different number of calls to the PRNG compared to np.random.multinomial, so the seeded tests all fail.
The question is: if we are going to remove the seeded tests and recalculate them, is it worth it to also implement a faster algorithm (e.g. the alias method that @wasade suggested)? There is a call to implement the alias method in numpy here, but no plan.
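For context, the gap between the two draw strategies is easy to reproduce in isolation with a micro-benchmark like the following (sizes are illustrative; this is not the package's benchmarking code):

import timeit
import numpy as np

# A probability vector roughly the size of a typical number of source
# environments, and the index array np.random.choice draws from.
p = np.random.dirichlet(np.ones(10))
idx = np.arange(10)

t_choice = timeit.timeit(lambda: np.random.choice(idx, p=p), number=100000)
t_multi = timeit.timeit(lambda: np.random.multinomial(1, p).argmax(),
                        number=100000)
print('np.random.choice:               %.2f s' % t_choice)
print('np.random.multinomial().argmax: %.2f s' % t_multi)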
In my use case, I'm dealing with metabolite counts on the order of 10^9 counts for a single molecule in a single sample. When I try running sourcetracker2, I get an out-of-memory error, as follows.
Traceback (most recent call last):
File "/home/mortonjt/miniconda/envs/st2/bin/sourcetracker2", line 11, in <module>
sys.exit(cli())
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_cli/gibbs.py", line 182, in gibbs
sink_rarefaction_depth)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_sourcetracker.py", line 184, in subsample_sources_sinks
replace=False)
File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/skbio/stats/_subsample.py", line 255, in subsample_counts
counts_sum)
File "skbio/stats/__subsample.pyx", line 22, in skbio.stats.__subsample._subsample_counts_without_replacement (skbio/stats/__subsample.c:1394)
MemoryError
Now when there are this many counts, it is probably good enough to sample with replacement rather than without replacement. The easiest fix would be to add an argument to allow replace=True here and here.
It's also worthwhile re-evaluating the rarefaction benchmarks for sourcetracker. It's possible that removing subsampling completely could actually improve the accuracy.
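A hedged sketch of what with-replacement subsampling could look like for very deep samples, using a plain multinomial draw rather than expanding every observation (the function name is illustrative):

import numpy as np

def subsample_with_replacement(counts, depth):
    # Draw `depth` observations with replacement from the empirical
    # distribution defined by `counts`; memory use stays proportional to the
    # number of features, not the total number of counts.
    counts = np.asarray(counts, dtype=float)
    return np.random.multinomial(depth, counts / counts.sum())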
Create a plugin for Qiime2. Try and use old q2d2 metadata filtering to select sinks and sources.
This is one of the biggest user frustrations with ST1.
Accumulate documentation errors here until enough to justify a PR.
_cli/gibbs.py: line 109 is missing a closing parenthesis.
From @wdwvt1 on February 27, 2016 7:40
If a user deletes the output folder that has been created before all the jobs launched by IPcluster have finished, the cluster will fail to shut itself down. This could cause serious resource leaks.
Copied from original issue: biota/sourcetracker2_internal#23
From @wdwvt1 on January 5, 2016 16:59
@justin212k had an excellent suggestion to use memoization for storing calls in some of the inner loops. Based on discussion with @justin212k and @gregcaporaso, we think that memoization should occur on the function ConditionalProbability.calculate_cp_slice, since this function is deterministic and merely computes the probabilities that np.random.choice draws from.
We can also use the improved lru_cache function in Python 3 to inspect the number of hits to the cache, which will give us a good estimate of what cache size to use, how much we can save with the cache, etc.
Things to note when we implement this:
- The cache size should probably be a power of two (2^N) - I imagine to avoid resizing or memory-allocation problems.
The general cost of the memoization approach for N iterations of the innermost loop will be (this is my guess, please correct me if I am wrong):
- N lookups at O(1), plus growth of the dictionary up to its maximum size
- Assignment to the dictionary N - x times, where x is the number of times the result is already in the cache
- Overhead associated with keeping the cache
The savings will be:
- Calculation of the full joint probability vector x times.
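A rough sketch of how this could be wired up with functools.lru_cache (illustrative only; calculate_cp_slice's real arguments include numpy arrays, which are unhashable and are therefore reduced to tuples here to form a cache key):

from functools import lru_cache
import numpy as np

def make_cached_cp_slice(cp, maxsize=2**16):
    # Wrap cp.calculate_cp_slice with an LRU cache keyed on hashable copies
    # of its arguments.
    @lru_cache(maxsize=maxsize)
    def cached(t, unknown_count, unknown_sum, envcounts_key):
        return cp.calculate_cp_slice(t, unknown_count, unknown_sum,
                                     np.asarray(envcounts_key))
    return cached

# Inside gibbs_sampler one might then call:
#   cp_slice = make_cached_cp_slice(cp)
#   jp = cp_slice(t, unknown_vector[t], unknown_sum, tuple(envcounts))
# and afterwards inspect cp_slice.cache_info() to estimate the hit rate.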
Copied from original issue: biota/sourcetracker2_internal#2
We have three sets of default parameters floating around. The CLI defaults for ST2, the CLI defaults for ST1, and another set that are introduced in the API documentation here (the "API doc defaults").
The "API doc defaults" are orders of magnitude faster. In our upcoming parameter evaluation, we should decide on which of these we want to use under which circumstances (e.g., is there a fast setting and a more accurate setting?), and more clearly document that. We should also review the findings presented in Henry et al as that is relevant here.