
tsimpute's People

Contributors

szhan


tsimpute's Issues

Add arguments to set bounds when visualising sample paths

When the number of data points gets too large, bokeh gets slow, so it would be useful to have arguments to bound the parent node id and/or the genomic position/site index when calling create_sample_path_vis_app.

I think the following arguments should suffice:

  • range_nodes=(min, max)
  • range_position=(min, max)

Default should be the full ranges.
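
A minimal sketch of how such bounds could be applied before the data reach bokeh. The helper name filter_points, its inputs, and the inclusive treatment of (min, max) are assumptions for illustration, not part of the existing code.

```python
def filter_points(parent_ids, positions, range_nodes=None, range_position=None):
    """Keep only (parent id, position) pairs that fall inside the given bounds.

    A bound of None means "no restriction", i.e. the full range.
    Bounds are (min, max) tuples and are treated as inclusive (an assumption).
    """
    if len(parent_ids) != len(positions):
        raise ValueError("parent_ids and positions must have the same length")
    kept_ids, kept_pos = [], []
    for node, pos in zip(parent_ids, positions):
        if range_nodes is not None and not (range_nodes[0] <= node <= range_nodes[1]):
            continue  # parent node id outside the requested range
        if range_position is not None and not (
            range_position[0] <= pos <= range_position[1]
        ):
            continue  # genomic position outside the requested range
        kept_ids.append(node)
        kept_pos.append(pos)
    return kept_ids, kept_pos
```

Filtering up front keeps the number of glyphs passed to the plot small, which is where the slowdown is reported to come from.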

Scale up `get_matching_indices`

It is a major bottleneck when calling make_compatible_genotypes on moderately sized datasets, e.g., with ~1 million sites.
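
The issue does not say what get_matching_indices does internally. Assuming it pairs up the indices of sites that share a position in two datasets, one way to avoid a quadratic scan is a single pass over a hash lookup. The function name and inputs below are hypothetical.

```python
def get_matching_indices_sketch(pos1, pos2):
    """Return (i, j) pairs such that pos1[i] == pos2[j].

    Assumes each input holds unique site positions.
    Runs in time proportional to len(pos1) + len(pos2).
    """
    # Map each position in pos2 to its index once, up front.
    lookup = {p: j for j, p in enumerate(pos2)}
    # One pass over pos1, with a constant-time lookup per site.
    return [(i, lookup[p]) for i, p in enumerate(pos1) if p in lookup]
```

For NumPy arrays of unique positions, numpy.intersect1d with return_indices=True returns the same kind of index pairs in vectorised form and may be a better fit at ~1 million sites.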

Visualise LS HMM copying paths

Add some plotting routines to help diagnose potential issues with sample paths. This is the working version.

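import matplotlib.pyplot as plt
import numpy as np


def plot_sample_path(path, site_pos, tracks=None, window=None):
    """Plot a copying path (index of the copied sample) against genomic position."""
    fig, ax = plt.subplots(1, 1, figsize=(20, 5))
    ax.plot(site_pos, path)
    # Draw each track as a row of tick marks below the path.
    # Each track is indexed as (positions, color).
    if tracks is not None:
        for i, track in enumerate(tracks):
            track_pos, track_color = track[0], track[1]
            ax.plot(
                track_pos,
                np.repeat(-(i + 1) * 1_000, len(track_pos)),
                marker="|",
                color=track_color,
                linestyle="",
            )
    # Optionally zoom in on a genomic window given as (start, end).
    if window is not None:
        assert len(window) == 2
        ax.set_xlim(window[0], window[1])
    ax.set_ylabel("Index of sample")
    ax.set_xlabel("Genomic position")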

Example output.
[Screenshot: example output of plot_sample_path]

Add mutations after adding new individuals to a tree sequence

The function add_individuals_to_tree_sequence only adds new edges to an existing tree sequence but not new mutations. A separate function needs to be implemented to add new mutations that occur above newly added sample nodes of the individuals.
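
A rough sketch of how the sites needing new mutations might be identified. It assumes that a mutation is required above a new sample node wherever the sample's allele differs from the allele of the parent it copies from at that site. The function name and data layout are hypothetical, and missing data are not handled.

```python
def find_new_mutation_sites(path, sample_alleles, node_alleles):
    """List the sites where a new sample differs from its copying parent.

    path[k]            : id of the parent node copied at site k
    sample_alleles[k]  : allele carried by the new sample at site k
    node_alleles[n][k] : allele carried by node n at site k

    Returns a list of (site_index, derived_state) tuples.
    """
    mutations = []
    for k, parent in enumerate(path):
        parent_allele = node_alleles[parent][k]
        if sample_alleles[k] != parent_allele:
            # The sample cannot inherit this allele from its parent,
            # so a mutation above the sample node is needed here.
            mutations.append((k, sample_alleles[k]))
    return mutations
```

Each returned pair could then be written to the mutation table of the tree sequence, with the new sample node as the mutation's node.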

Refactor `make_compatible_genotypes`

Presently, make_compatible_genotypes has the option acgt_alleles. When acgt_alleles is False (the default), the function remaps the genotypes in ds2 so that they use the same allele lists as ds1. When it is True, the function remaps the genotypes in both ds1 and ds2 to ACGT, so remap_genotypes is run twice. This is wasteful when several ds2 datasets are remapped against the same ds1, e.g., lshmm-imputed and BEAGLE-imputed genotypes against the same reference panel genotypes. The function should be refactored into two separate functions so that remap_genotypes is run only once.
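
One possible shape for the split, sketched on plain Python lists rather than sgkit datasets. All function names and signatures below are hypothetical. Negative genotype values are assumed to mean missing calls and are passed through unchanged.

```python
ACGT = ("A", "C", "G", "T")


def remap_site(genotypes, old_alleles, new_alleles):
    """Re-index one site's allele calls from old_alleles to new_alleles.

    Raises ValueError if an old allele is absent from new_alleles.
    """
    index_map = {i: new_alleles.index(a) for i, a in enumerate(old_alleles)}
    # Negative values are treated as missing and left as-is (an assumption).
    return [g if g < 0 else index_map[g] for g in genotypes]


def remap_to_reference(ds2_genotypes, ds2_alleles, ds1_alleles):
    """Remap ds2 so its allele indices follow ds1's allele lists. ds1 is untouched."""
    return [
        remap_site(g, a2, a1)
        for g, a2, a1 in zip(ds2_genotypes, ds2_alleles, ds1_alleles)
    ]


def remap_to_acgt(genotypes, alleles):
    """Remap a single dataset to the fixed ACGT allele order."""
    return [remap_site(g, a, ACGT) for g, a in zip(genotypes, alleles)]
```

With this layout, a reference panel can be remapped to ACGT once and reused, while each imputed dataset is remapped separately.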

Package routines

I've been working mostly on FinnGen's Google Cloud account. I'm repackaging some code that I've been using often there.

Add tests for `add_individual_to_tree_sequence`

This function has not been well tested (besides a few manual tests). I'm using it to add new sample paths from LS HMM to an existing tree sequence, so it would probably be a good idea to develop a few more tests.

Refactor code to visualise sample paths

Some documentation is needed, and a major refactor needs to be done. Also, the function should be renamed to something else, for example, create_sample_path_vis_app.

Allow for adding sample paths of an individual to an existing tree sequence

At the moment, sample paths are added one at a time when calling add_sample_to_tree_sequence. The proper way of adding a new diploid individual to an existing tree sequence should be to add two sets of sample edges to the tree sequence and then to link the two new sample nodes to the individual.
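
To illustrate what a "set of sample edges" could look like, here is a sketch that run-length encodes one copying path into (left, right, parent, child) edges. The function name is hypothetical. Placing each breakpoint at the position of the first site after a switch, and spanning the first and last edges to 0 and the sequence length, are assumptions.

```python
def path_to_edges(path, site_pos, child, sequence_length):
    """Convert a copying path into a list of (left, right, parent, child) edges.

    path[k] is the parent node copied at site k; site_pos[k] is that site's position.
    """
    if len(path) == 0 or len(path) != len(site_pos):
        raise ValueError("path and site_pos must be non-empty and of equal length")
    edges = []
    left = 0
    for k in range(1, len(path)):
        if path[k] != path[k - 1]:
            # The parent switches at site k: close the current edge there.
            edges.append((left, site_pos[k], path[k - 1], child))
            left = site_pos[k]
    # The final edge runs to the end of the sequence.
    edges.append((left, sequence_length, path[-1], child))
    return edges
```

For a diploid individual, this would be called once per haplotype with two different child node ids, and both nodes would then refer to the same individual. In tskit, the individual, node, and edge tables each provide an add_row method for this, and a node row accepts an individual id.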

Function to compare genotypes stored in `xarray.Dataset` using `sgkit`

This function should help compare genotypes stored in two VCF files, e.g., one from BEAGLE containing imputed genotypes and the other containing ground-truth genotypes. The plan is to develop a version of this function to work for cases where only biallelic or monoallelic sites are compared before generalising it to be incorporated into sgkit.
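
As a starting point, the comparison could be as simple as the fraction of calls that agree. The sketch below works on flat Python lists rather than on datasets. The function name is hypothetical, and negative values are assumed to encode missing calls.

```python
def genotype_concordance(truth, imputed):
    """Fraction of non-missing allele calls that agree between two call sets.

    truth, imputed: flat sequences of allele indices in the same order.
    Returns NaN if no calls can be compared.
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must have the same length")
    compared = 0
    matched = 0
    for t, i in zip(truth, imputed):
        if t < 0 or i < 0:
            continue  # skip calls that are missing in either set
        compared += 1
        if t == i:
            matched += 1
    return matched / compared if compared else float("nan")
```

This only makes sense once both call sets use the same allele indexing at every site, which is what the remapping step is for.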

Visualise imputation performance

Some plotting routines should help to examine imputation results. A common approach is to plot imputation quality against minor allele frequency (MAF).

For example,
[Screenshot: example plot of imputation quality against MAF]
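
Before plotting, per-site quality values are typically averaged within MAF bins. A minimal sketch of that binning step follows. The function name and the choice of bin edges are assumptions, and the quality metric itself (e.g., concordance) is left to the caller.

```python
def mean_quality_by_maf_bin(maf, quality, bin_edges):
    """Average a per-site quality metric within MAF bins.

    Bin b covers bin_edges[b] <= MAF < bin_edges[b + 1].
    Returns one mean per bin, or None for bins with no sites.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for f, q in zip(maf, quality):
        for b in range(n_bins):
            if bin_edges[b] <= f < bin_edges[b + 1]:
                sums[b] += q
                counts[b] += 1
                break  # each site belongs to at most one bin
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Because the right edge of each bin is exclusive, the last edge should sit slightly above 0.5 so that sites with a MAF of exactly 0.5 are included.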

Look into source code of BEAGLE 4.1

I should get to know the nitty-gritty details of the LS HMM implementation in BEAGLE 4.1 a bit better. The Java source code of BEAGLE 4.1 was not available at the download link when I checked, so I've emailed Brian Browning for a copy.

Split `prepare_dataset.ipynb` into separate notebooks

Right now, this one notebook does the following:

  1. Download unified genealogies.
  2. Simplify the trees down to only high-coverage individuals.
  3. Split the individuals into reference panel and target cohort (one set of trees per group).
  4. Prepare data objects and files (VCFs and samples) for imputation.
  5. Impute using BEAGLE.
  6. Impute using tskit.lshmm.

It is easier to divide them up into the following stages, one per notebook:

  1. Steps 1 to 3.
  2. Step 4. This involves writing to VCF and making samples compatible, but it should soon be accelerated using sgkit.
  3. Step 5.
  4. Step 6.
