szhan / tsimpute
Genome-wide genotype imputation using tree sequences.
License: MIT License
Calling add_individual_to_tree_sequence and adding one individual at a time is far too slow. It is much faster to prepare all the edge data first and then call append_columns once.
The logic of that function seems good now. It could be refactored to use fewer lines of code.
It is useful to keep the names of the samples/individuals in the original VCF.
When the number of data points gets too large, bokeh gets slow, so it would be useful to have arguments that bound the parent node ID and/or the genomic position/site index when calling create_sample_path_vis_app.
I think the following arguments should suffice:
range_nodes=(min, max)
range_position=(min, max)
The defaults should be the full ranges.
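A minimal sketch of how such bounds could be applied before any plotting, assuming the data points are available as parallel sequences of positions and parent node IDs. The helper name subset_points and the inclusive bounds are assumptions for illustration, not part of the existing code.

```python
from typing import Optional, Sequence, Tuple

Range = Optional[Tuple[float, float]]


def subset_points(
    positions: Sequence[float],
    nodes: Sequence[int],
    range_nodes: Range = None,
    range_position: Range = None,
):
    """Keep only (position, node) pairs inside both ranges (bounds inclusive)."""
    unbounded = (float("-inf"), float("inf"))
    # None means "no bound", so the full range is kept by default.
    node_lo, node_hi = range_nodes if range_nodes is not None else unbounded
    pos_lo, pos_hi = range_position if range_position is not None else unbounded
    kept = [
        (pos, node)
        for pos, node in zip(positions, nodes)
        if node_lo <= node <= node_hi and pos_lo <= pos <= pos_hi
    ]
    kept_positions = [pos for pos, _ in kept]
    kept_nodes = [node for _, node in kept]
    return kept_positions, kept_nodes


positions = [100, 250, 400, 900]
nodes = [3, 7, 12, 5]
print(subset_points(positions, nodes, range_nodes=(4, 10), range_position=(0, 500)))
```

Filtering before the data reach bokeh keeps the number of glyphs small, which is where the slowdown is reported to come from.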
In the initial implementation, is_valid is a class function. Turn is_valid into a boolean attribute that is assigned right after initialisation.
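One way to do this is sketched below with a dataclass whose __post_init__ sets the attribute. The class name SamplePath, its fields, and the validity rule are hypothetical stand-ins, since the issue does not show the actual class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SamplePath:  # hypothetical stand-in for the class discussed above
    nodes: List[int]
    site_positions: List[float]
    # Not an __init__ argument; it is filled in by __post_init__.
    is_valid: bool = field(init=False)

    def __post_init__(self):
        # Computed once, right after __init__ has set the other fields.
        self.is_valid = self._check()

    def _check(self) -> bool:
        # Assumed validity rule, for illustration only: one node per site,
        # and site positions strictly increasing.
        same_length = len(self.nodes) == len(self.site_positions)
        increasing = all(
            a < b for a, b in zip(self.site_positions, self.site_positions[1:])
        )
        return same_length and increasing


print(SamplePath(nodes=[4, 4, 9], site_positions=[10.0, 25.0, 40.0]).is_valid)
```

The attribute is a snapshot taken at construction, so it can go stale if the fields are mutated afterwards.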
numba doesn't like empty lists with non-precise data types, so numba needs to be turned off before running pytest on get_matching_indices. See phylokit for inspiration.
It is a major bottleneck when calling make_compatible_genotypes on moderately sized datasets, e.g., with ~1 million sites.
Add some plotting routines to help diagnose potential issues with sample paths. This is the working version.
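import matplotlib.pyplot as plt
import numpy as np


def plot_sample_path(path, site_pos, tracks=None, window=None):
    # Plot the index of the copied sample at each site position.
    fig, ax = plt.subplots(1, 1, figsize=(20, 5))
    ax.plot(
        site_pos,
        path,
    )
    # Add tracks: each track is a (positions, color) pair drawn as a row of
    # tick marks below the path.
    if tracks is not None:
        for i in np.arange(len(tracks)):
            ax.plot(
                tracks[i][0],
                np.repeat(-(i + 1) * 1_000, len(tracks[i][0])),
                marker="|",
                color=tracks[i][1],
                linestyle="",
            )
    # Optionally restrict the x-axis to a (start, end) window.
    if window is not None:
        assert len(window) == 2
        ax.set_xlim(window[0], window[1])
    ax.set_ylabel("Index of sample")
    ax.set_xlabel("Genomic position")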
The function add_individuals_to_tree_sequence adds only new edges to an existing tree sequence, not new mutations. A separate function needs to be implemented to add new mutations that occur above the newly added sample nodes of the individuals.
Presently, make_compatible_genotypes has the option acgt_alleles. When acgt_alleles is set to False (default), it remaps all the genotypes in ds2 so that the remapped genotypes have the same allele lists as ds1. When that option is set to True, it remaps all the genotypes in both ds2 and ds1 to ACGT (so remap_genotypes is run twice, which can be wasteful if different ds2 are remapped to the same ds1, e.g., lshmm-imputed and BEAGLE-imputed genotypes to reference panel genotypes). This function should be refactored into two separate functions, so that remap_genotypes is run once.
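A rough sketch of what the split might look like, using plain Python lists in place of the real datasets. The function names remap_site, remap_to_acgt, and remap_to_reference, the per-site list layout, and the use of -1 for missing genotypes are all assumptions for illustration and do not reflect the existing signatures.

```python
ACGT = ("A", "C", "G", "T")


def remap_site(genotypes, old_alleles, new_alleles):
    """Re-express the allele indices of one site against a new allele list."""
    # list.index raises ValueError if an old allele is absent from new_alleles.
    index_map = {i: new_alleles.index(a) for i, a in enumerate(old_alleles)}
    # Negative values are treated as missing and passed through as -1.
    return [-1 if g < 0 else index_map[g] for g in genotypes]


def remap_to_acgt(genotypes_by_site, alleles_by_site):
    """Remap one dataset to the fixed ACGT allele list."""
    return [
        remap_site(g, a, ACGT) for g, a in zip(genotypes_by_site, alleles_by_site)
    ]


def remap_to_reference(genotypes_by_site, alleles_by_site, ref_alleles_by_site):
    """Remap one dataset to the allele lists of a reference dataset."""
    return [
        remap_site(g, a, r)
        for g, a, r in zip(genotypes_by_site, alleles_by_site, ref_alleles_by_site)
    ]


ref_alleles = [("A", "G"), ("C", "T")]
ref_genotypes = [[0, 1, 1], [0, 0, 1]]
query_alleles = [("G", "A"), ("T", "C")]
query_genotypes = [[0, 0, 1], [1, -1, 0]]

# The reference panel is remapped once and can be reused for every query set.
ref_acgt = remap_to_acgt(ref_genotypes, ref_alleles)
query_acgt = remap_to_acgt(query_genotypes, query_alleles)
query_on_ref = remap_to_reference(query_genotypes, query_alleles, ref_alleles)
print(ref_acgt, query_acgt, query_on_ref)
```

With this split, the caller decides how often each dataset is remapped, so comparing several imputed datasets against one reference panel need not remap the panel repeatedly.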
I've been working mostly on FinnGen's Google Cloud account. I'm repackaging some code that I've been using often there.
This can be done using 'concurrent.futures'.
This function has not been well tested (besides a few manual tests). I'm using it to add new sample paths from the LS HMM to an existing tree sequence, so it would probably be a good idea to develop a few more tests.
Some documentation is needed, and a major refactor needs to be done. Also, the function should be renamed to something else, for example, create_sample_path_vis_app.
These tests should check that the genotypes from newly added individuals inherit mutations affecting their parents in a tree sequence and that VCFs written from the tree sequence contain those genotypes for downstream analyses.
In the future, support for the use of list will be deprecated in numba. See Deprecation of reflection for List and Set types. get_matching_indices and remap_genotypes should be updated accordingly.
At the moment, sample paths are added one at a time when calling add_sample_to_tree_sequence. The proper way of adding a new diploid individual to an existing tree sequence is to add two sets of sample edges to the tree sequence and then to link the two new sample nodes to the individual.
This function should help compare genotypes stored in two VCF files, e.g., one from BEAGLE containing imputed genotypes and the other containing ground-truth genotypes. The plan is to first develop a version of this function that works when only biallelic or monoallelic sites are compared, and then to generalise it so that it can be incorporated into sgkit.
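As a starting point for the biallelic case, the comparison could be as simple as the sketch below. It assumes the genotypes have already been read from the two VCFs into per-site lists of allele calls coded 0/1, with -1 for missing, and that both inputs cover the same sites and samples in the same order. The function name and the choice of concordance as the metric are assumptions; the issue does not name a metric.

```python
def genotype_concordance(imputed, truth):
    """Fraction of non-missing allele calls that agree, pooled over all sites."""
    if len(imputed) != len(truth):
        raise ValueError("Both inputs must contain the same number of sites.")
    matches = 0
    compared = 0
    for imp_site, true_site in zip(imputed, truth):
        if len(imp_site) != len(true_site):
            raise ValueError("Each site must have the same number of calls.")
        for g_imp, g_true in zip(imp_site, true_site):
            # Skip calls that are missing in either dataset.
            if g_imp < 0 or g_true < 0:
                continue
            compared += 1
            matches += int(g_imp == g_true)
    # Return NaN when there is nothing to compare.
    return matches / compared if compared else float("nan")


imputed = [[0, 1, 1, 0], [1, 1, 0, -1]]
truth = [[0, 1, 0, 0], [1, 0, 0, 1]]
print(genotype_concordance(imputed, truth))
```

Here seven calls are compared (one is missing in the imputed data) and five of them agree.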
The main function to rework is remap_state_space(ts, sd, samples=None). Subsequent steps do not take SampleData as input.
When there are lots of sites, bokeh takes so long to render that it becomes unwieldy. There should be an argument that lets users subset the sites to visualise, for a smoother UI experience.
See #148
I should understand the nitty-gritty details of the LS HMM implementation of BEAGLE 4.1 a bit better. The source code (Java) of BEAGLE 4.1 was not available at the download link when I checked, so I've emailed Brian Browning for a copy.
So as to protect attributes from being modified accidentally.
Right now, this one notebook does the following:
tskit.lshmm.
It is easier to divide them up into the following stages, one per notebook:
sgkit.