szhan / tsimpute Goto Github PK

Genome-wide genotype imputation using tree sequences.

License: MIT License

Python 11.45% Jupyter Notebook 88.47% Shell 0.07%

genomics bioinformatics tree-sequences

tsimpute's Issues

Append a bunch of sample edges at once to an existing tree sequence

Calling add_individual_to_tree_sequence and adding one individual at a time is far too slow. It is much faster and more efficient to prepare all the edge data and then call append_columns.

Refactor `add_individual_to_tree_sequence`

The logic of that function seems sound now, but it could be refactored to use newer code.

Keep VCF header when making compatible genotypes

It is useful to keep the names of the samples/individuals in the original VCF.

Add arguments to set bounds when visualising sample paths

When the number of data points gets too large, bokeh becomes slow, so it would be useful to have arguments that bound the parent node id and/or the genomic position/site index when calling create_sample_path_vis_app.

I think the following arguments should suffice:

range_nodes=(min, max)
range_position=(min, max)

Default should be the full ranges.
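
A minimal sketch of how the two proposed arguments might be applied before any points are handed to bokeh. The helper name `filter_path_points` and its inputs are hypothetical; only `range_nodes` and `range_position` come from the proposal above. Bounds are assumed to be inclusive, and `None` stands for the full range.

```python
def filter_path_points(positions, parent_ids, range_nodes=None, range_position=None):
    """Keep only the (position, parent id) pairs inside the requested bounds.

    Hypothetical helper: each bound is an inclusive (min, max) tuple, and
    None means "use the full range", matching the proposed default.
    """
    if len(positions) != len(parent_ids):
        raise ValueError("positions and parent_ids must have the same length")

    unbounded = (float("-inf"), float("inf"))
    node_min, node_max = range_nodes if range_nodes is not None else unbounded
    pos_min, pos_max = range_position if range_position is not None else unbounded

    return [
        (pos, node)
        for pos, node in zip(positions, parent_ids)
        if pos_min <= pos <= pos_max and node_min <= node <= node_max
    ]
```

Filtering before plotting keeps the number of glyphs small, which is the part bokeh struggles with.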

Make `is_valid` in `SamplePath` a class attribute

In the initial implementation, 'is_valid' is a method of the class. Turn is_valid into a boolean attribute that is assigned right after initialisation.
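
A minimal sketch of the attribute-based design, using a dataclass. The field names and the validity rule are assumptions made for illustration; the real SamplePath may store different data and check different conditions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SamplePath:
    # Illustrative fields only, not the project's actual ones.
    site_positions: List[int]
    nodes: List[int]
    # Computed once in __post_init__ instead of being a method.
    is_valid: bool = field(init=False)

    def __post_init__(self):
        # Assumed rule: one node per site, and positions strictly increasing.
        same_length = len(self.site_positions) == len(self.nodes)
        increasing = all(
            a < b for a, b in zip(self.site_positions, self.site_positions[1:])
        )
        self.is_valid = same_length and increasing
```

With this layout callers read `path.is_valid` directly, and the check runs only once per object.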

Turn off `numba` when running tests on `get_matching_indices`

numba doesn't like empty lists with non-precise data types, so it needs to be turned off before running pytest on get_matching_indices. See phylokit for inspiration.
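
One way to do this is sketched below, assuming the switch lives in a `conftest.py` at the root of the test suite (the file location is an assumption, and phylokit may do it differently). numba reads the `NUMBA_DISABLE_JIT` environment variable when it is imported, so setting it beforehand makes jitted functions run as plain Python.

```python
# conftest.py (hypothetical location)
import os

# Must be set before numba, or any module that uses @jit, is imported.
os.environ["NUMBA_DISABLE_JIT"] = "1"
```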

Scale up `get_matching_indices`

It is a major bottleneck when calling make_compatible_genotypes on moderately sized datasets, e.g., with ~1 million sites.
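
The issue does not say what get_matching_indices computes. Assuming its job is to find the site positions shared by two datasets, and that both position lists are sorted and free of duplicates, a linear-time two-pointer merge is one option. The function name and signature below are hypothetical.

```python
def matching_indices_sorted(positions_a, positions_b):
    """Return index pairs (i, j) such that positions_a[i] == positions_b[j].

    Assumes both inputs are sorted in strictly ascending order. Runs in
    O(len(positions_a) + len(positions_b)) time.
    """
    i = j = 0
    matches = []
    while i < len(positions_a) and j < len(positions_b):
        if positions_a[i] == positions_b[j]:
            matches.append((i, j))
            i += 1
            j += 1
        elif positions_a[i] < positions_b[j]:
            i += 1  # positions_a is behind, so advance it
        else:
            j += 1  # positions_b is behind, so advance it
    return matches
```

Each position is visited at most once, so the cost grows linearly with the number of sites rather than with the product of the two lengths.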

Visualise LS HMM copying paths

Add some plotting routines to help diagnose potential issues with sample paths. This is the working version.

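import matplotlib.pyplot as plt
import numpy as np


def plot_sample_path(path, site_pos, tracks=None, window=None):
    """Plot a copying path (sample index per site) against genomic position.

    tracks: optional list of (positions, colour) pairs drawn as tick marks.
    window: optional (start, end) pair bounding the x-axis.
    """
    fig, ax = plt.subplots(1, 1, figsize=(20, 5))
    ax.plot(site_pos, path)
    # Draw each track as a row of tick marks, stacked below y = 0.
    if tracks is not None:
        for i, track in enumerate(tracks):
            ax.plot(
                track[0],
                np.repeat(-(i + 1) * 1_000, len(track[0])),
                marker="|",
                color=track[1],
                linestyle="",
            )
    # Optionally zoom in on a genomic window.
    if window is not None:
        assert len(window) == 2
        ax.set_xlim(window[0], window[1])
    ax.set_ylabel("Index of sample")
    ax.set_xlabel("Genomic position")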

Example output.

Add mutations after adding new individuals to a tree sequence

The function add_individuals_to_tree_sequence only adds new edges to an existing tree sequence but not new mutations. A separate function needs to be implemented to add new mutations that occur above newly added sample nodes of the individuals.

Add `contig_length` when making compatible genotypes

Refactor `make_compatible_genotypes`

Presently, make_compatible_genotypes has the option acgt_alleles. When acgt_alleles is False (the default), it remaps the genotypes in ds2 so that they use the same allele lists as ds1. When it is True, it remaps the genotypes in both ds2 and ds1 to ACGT, so remap_genotypes runs twice. This is wasteful when several ds2 datasets are remapped against the same ds1, e.g., lshmm-imputed and BEAGLE-imputed genotypes against reference panel genotypes. The function should be refactored into two separate functions so that remap_genotypes runs only once.
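
A rough sketch of what the split could look like for a single site. It uses plain Python lists in place of the xarray datasets, and every name below is hypothetical. It is not the project's remap_genotypes. Missing calls are assumed to be encoded as negative integers.

```python
ACGT = ("A", "C", "G", "T")


def build_allele_map(alleles_from, alleles_to):
    """Map each allele index under alleles_from to its index under alleles_to.

    Alleles absent from alleles_to map to -1.
    """
    index_in_target = {allele: k for k, allele in enumerate(alleles_to)}
    return [index_in_target.get(allele, -1) for allele in alleles_from]


def remap_calls(calls, allele_map):
    """Apply an allele map to one site's calls; missing calls pass through."""
    return [allele_map[c] if c >= 0 else c for c in calls]


def remap_to_reference(calls, alleles, reference_alleles):
    """Re-express one site's calls using the reference panel's allele list."""
    return remap_calls(calls, build_allele_map(alleles, reference_alleles))


def remap_to_acgt(calls, alleles):
    """Re-express one site's calls using the fixed ACGT allele list."""
    return remap_calls(calls, build_allele_map(alleles, ACGT))
```

Keeping the two entry points separate means a reference panel remapped to ACGT once can be reused for every target dataset compared against it.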

Package routines

I've been working mostly on FinnGen's Google Cloud account. I'm repackaging some code that I've been using often there.

Parallelise `get_traceback_path`

This can be done using 'concurrent.futures'.
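
A small sketch of the pattern. `traceback_one` is a stand-in written for this example and is not the project's get_traceback_path. It assumes a backpointer matrix where `backpointers[t][s]` is the best state at site t - 1 given state s at site t.

```python
from concurrent.futures import ThreadPoolExecutor


def traceback_one(backpointers, last_state):
    """Follow backpointers from the last site back to the first (stand-in)."""
    path = [last_state]
    # Row 0 has no predecessor, so only rows 1..T-1 are followed.
    for row in reversed(backpointers[1:]):
        path.append(row[path[-1]])
    path.reverse()
    return path


def traceback_many(jobs, max_workers=4):
    """Run traceback_one over (backpointers, last_state) jobs concurrently.

    Executor.map returns results in the same order as the input jobs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: traceback_one(*job), jobs))
```

Threads give little speed-up for pure-Python, CPU-bound work because of the GIL. `ProcessPoolExecutor` offers the same `map` interface, but the mapped callable must then be a picklable top-level function rather than a lambda.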

Add tests for `add_individual_to_tree_sequence`

This function has not been well tested (besides a few manual tests). I'm using it to add new sample paths from LS HMM to an existing tree sequence, so it would probably be a good idea to develop a few more tests.

Refactor code to visualise sample paths

Some documentation is needed, and a major refactor needs to be done. Also, the function should be renamed, for example to create_sample_path_vis_app.

Add tests for `write_vcf` from tree sequences augmented by `add_individual_to_tree_sequence`

These tests should check that the genotypes from newly added individuals inherit mutations affecting their parents in a tree sequence and that VCFs written from the tree sequence contain those genotypes for downstream analyses.

Pass `Typed.List` to `get_matching_indices`

numba is deprecating support for reflected Python lists passed as arguments to jitted functions. See Deprecation of reflection for List and Set types.

get_matching_indices and remap_genotypes should be updated accordingly.

Allow for adding sample paths of an individual to an existing tree sequence

At the moment, sample paths are added one at a time when calling add_sample_to_tree_sequence. The proper way of adding a new diploid individual to an existing tree sequence should be to add two sets of sample edges to the tree sequence and then to link the new sample nodes to the individual.

Function to compare genotypes stored in `xarray.dataset` using `sgkit`

This function should help compare genotypes stored in two VCF files, e.g., one from BEAGLE containing imputed genotypes and the other containing ground-truth genotypes. The plan is to develop a version of this function to work for cases where only biallelic or monoallelic sites are compared before generalising it to be incorporated into sgkit.
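
As a starting point, the comparison could be as simple as an allele-level concordance. The sketch below uses nested Python lists in place of the xarray datasets, assumes missing calls are encoded as negative integers, and assumes both inputs already share the same sites, samples, and allele encoding. The function name is hypothetical and nothing here is sgkit API.

```python
def genotype_concordance(truth, imputed):
    """Fraction of non-missing allele calls on which the two inputs agree.

    truth, imputed: one list per site, each holding allele indices for the
    same haplotypes in the same order. Negative values mark missing calls.
    """
    if len(truth) != len(imputed):
        raise ValueError("inputs must cover the same number of sites")

    compared = matched = 0
    for truth_site, imputed_site in zip(truth, imputed):
        if len(truth_site) != len(imputed_site):
            raise ValueError("sites must have the same number of calls")
        for t, i in zip(truth_site, imputed_site):
            if t < 0 or i < 0:
                continue  # skip calls missing in either input
            compared += 1
            matched += t == i

    if compared == 0:
        raise ValueError("no comparable calls")
    return matched / compared
```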

1. Download unified genealogies.
2. Simplify the trees down to only high-coverage individuals.
3. Split the individuals into reference panel and target cohort (one set of trees per group).
4. Prepare data objects and files (VCFs and samples) for imputation.
5. Impute using BEAGLE.
6. Impute using tskit.lshmm.

It is easier to divide them up into the following stages, one per notebook:

Steps 1 to 3.
Step 4. This involves writing to VCF and making samples compatible, but it should soon be accelerated using sgkit.
Step 5.
Step 6.
