Introduction

pikun is a Python package for the analysis and visualization of species delimitation models in an information theoretic framework that provides a true distance or metric space for these models, based on the variation of information criterion of Meila (2007). The name pikun is from a Kumeyaay (Ipai) word for "sparrowhawk", in homage to the indigenous people of Southern California, on whose land I live and work and which has become my home.

The species delimitation models being analyzed may be generated by any inference package, such as BP&P, SNAPP, or DELINEATE, or constructed from taxonomies or classifications based on conceptual descriptions in the literature, geography, folk taxonomies, etc. Regardless of source or basis, each species delimitation model can be considered a partition of taxa or lineages, and thus can be represented in a dedicated and widely supported data exchange format, "SPART-XML", which pikun takes as one of its input formats, in addition to DELINEATE results and a simple JSON list format.

For every collection of species delimitation models, pikun generates a set of partition profiles, partition comparison tables, and a suite of graphical plots visualizing the data in these tables. The partition profiles report unitary information theoretic and other statistics for each species delimitation partition, including the probability and entropy of each partition following [@meila-2007-comparing-clusterings].
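
As an illustration of the entropy statistic mentioned above, here is a minimal sketch of the entropy of a single partition, taking the probability of each subset to be its share of the elements (the usual definition in the clustering-comparison literature). The function name partition_entropy is hypothetical and is not part of pikun's API; pikun's own computation may differ in details such as the logarithm base.

```python
import math


def partition_entropy(partition):
    """Shannon entropy (in nats) of a partition given as a list of subsets.

    Each subset k has probability n_k / n, where n_k is its size and n is
    the total number of elements across all subsets.
    """
    n = sum(len(subset) for subset in partition)
    # log(n / n_k) is used instead of -log(n_k / n) so a single-subset
    # partition yields 0.0 rather than -0.0.
    return sum((len(s) / n) * math.log(n / len(s)) for s in partition if s)


one_species = [["pop1", "pop2", "pop3", "pop4"]]
two_species = [["pop1", "pop2"], ["pop3", "pop4"]]
four_species = [["pop1"], ["pop2"], ["pop3"], ["pop4"]]

for ptn in (one_species, two_species, four_species):
    print(len(ptn), partition_entropy(ptn))
```

Lumping everything into one species gives zero entropy, while splitting every population into its own species gives the maximum, log(4) for four populations.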

The partition comparison tables, on the other hand, provide a range of bivariate statistics for every distinct pair of partitions, including the mutual information, joint entropy, etc., as well as information theoretic distance statistics that are true metrics on the space of species delimitation models: the variation of information [@meila-2007-comparing-clusterings] and the normalized joint variation of information distance [@vinh-2010-information-theoretic].
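
The bivariate statistics described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not pikun's implementation: the function compare_partitions and its dictionary keys are hypothetical, natural logarithms are assumed, and the normalized distance is taken to be the variation of information divided by the joint entropy, which is one common reading of the normalized form.

```python
import math
from collections import Counter


def _labels(partition):
    """Map each element to the index of the subset that contains it."""
    return {elem: i for i, subset in enumerate(partition) for elem in subset}


def _entropy(counts, n):
    """Shannon entropy (in nats) of a distribution given as counts out of n."""
    return sum((c / n) * math.log(n / c) for c in counts if c > 0)


def compare_partitions(ptn1, ptn2):
    """Return entropies, mutual information, and VI distances for two partitions."""
    lab1, lab2 = _labels(ptn1), _labels(ptn2)
    if lab1.keys() != lab2.keys():
        raise ValueError("partitions must cover the same elements")
    n = len(lab1)
    h1 = _entropy(Counter(lab1.values()).values(), n)
    h2 = _entropy(Counter(lab2.values()).values(), n)
    # Joint distribution: how many elements fall in each pair of subsets.
    joint = Counter((lab1[e], lab2[e]) for e in lab1)
    h12 = _entropy(joint.values(), n)
    mi = h1 + h2 - h12          # mutual information
    vi = h12 - mi               # variation of information = h1 + h2 - 2 * mi
    vi_norm = vi / h12 if h12 > 0 else 0.0  # assumed normalization by joint entropy
    return {"h1": h1, "h2": h2, "joint_entropy": h12,
            "mutual_information": mi, "vi": vi, "vi_normalized": vi_norm}


a = [["pop1", "pop2"], ["pop3", "pop4"]]
b = [["pop1", "pop3"], ["pop2", "pop4"]]
print(compare_partitions(a, a)["vi"])
print(compare_partitions(a, b))
```

Identical partitions are at distance zero, and the distance is symmetric in its two arguments, as expected of a metric.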

Installation

Installing from the GitHub Repositories

We recommend that you install directly from the main GitHub repository using pip (which works with an Anaconda environment as well):

$ python3 -m pip install --user --upgrade git+https://github.com/jeetsukumaran/pikun.git

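or, over SSH (this requires an SSH key registered with GitHub; GitHub no longer serves the unauthenticated git:// protocol):

$ python3 -m pip install --user --upgrade git+ssh://git@github.com/jeetsukumaran/pikun.git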

Applications

Analysis

pikun-analyze is a command-line program that analyzes a collection of partition definitions.

Input Formats

pikun-analyze takes as its input a collection of partitions specified in one of the following data formats:

  • A simple list of lists in JSON format: each partition is a list of subsets, and each subset is a list of population labels. For example, given four populations, pop1, pop2, pop3, and pop4:

    [
        [["pop1", "pop2", "pop3", "pop4"]],
        [["pop1"], ["pop2", "pop3", "pop4"]],
        [["pop1", "pop2"], ["pop3", "pop4"]],
        [["pop2"], ["pop1", "pop3", "pop4"]],
        [["pop1"], ["pop2"], ["pop3", "pop4"]],
        [["pop1", "pop2", "pop3"], ["pop4"]],
        [["pop2", "pop3"], ["pop1", "pop4"]],
        [["pop1"], ["pop2", "pop3"], ["pop4"]],
        [["pop1", "pop3"], ["pop2", "pop4"]],
        [["pop3"], ["pop1", "pop2", "pop4"]],
        [["pop1"], ["pop3"], ["pop2", "pop4"]],
        [["pop1", "pop2"], ["pop3"], ["pop4"]],
        [["pop2"], ["pop1", "pop3"], ["pop4"]],
        [["pop2"], ["pop3"], ["pop1", "pop4"]],
        [["pop1"], ["pop2"], ["pop3"], ["pop4"]]
    ]

    This can be explicitly specified by passing the argument "json-list" to the -f or --format option:

    $ pikun-analyze -f json-list partitions.json
    $ pikun-analyze --format json-list partitions.json
    
  • DELINEATE

    $ pikun-analyze -f delineate delineate-results.json
    $ pikun-analyze --format delineate delineate-results.json
    
  • SPART-XML

    $ pikun-analyze -f spart-xml data.xml
    $ pikun-analyze --format spart-xml data.xml
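
A file in the "json-list" format shown above can be produced by hand or by a short script. The sketch below enumerates every possible partition of a set of population labels and serializes them as JSON; the helper set_partitions is hypothetical (it is not part of pikun), and the partitions come out in a different order from the listing above.

```python
import json


def set_partitions(items):
    """Yield every partition of `items` as a list of subsets (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Place `first` into each existing subset in turn ...
        for i, subset in enumerate(smaller):
            yield smaller[:i] + [[first] + subset] + smaller[i + 1:]
        # ... or into a new subset of its own.
        yield [[first]] + smaller


populations = ["pop1", "pop2", "pop3", "pop4"]
partitions = list(set_partitions(populations))
text = json.dumps(partitions, indent=4)
print(len(partitions))

# To save the result for use as input:
# with open("partitions.json", "w") as dest:
#     dest.write(text)
```

For four populations this yields 15 partitions, the same count as in the example listing.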
    

Analysis Options

  • The output file names and paths can be specified using the -o/--output-title and -O/--output-directory options:

    $ pikun-analyze \
        -f delineate \
        -o project42 \
        -O analysis_dir \
        delineate-results.json
    $ pikun-analyze \
        --format delineate \
        --output-title project42 \
        --output-directory analysis_dir \
        delineate-results.json
    
  • The number of partitions read from the input set can be restricted to the first n partitions using the --limit-partitions option:

    $ pikun-analyze \
        --format delineate \
        --output-title project42 \
        --output-directory analysis_dir \
        --limit-partitions 10 \
        delineate-results.json
    

    This option is particularly useful when the number of partitions in the input is large and/or most of the partitions in the input set are not of interest. For example, a typical DELINEATE analysis may generate hundreds if not thousands of partitions, most of them low-probability ones of little practical interest. Using the --limit-partitions option restricts the analysis to just the subset of interest, which reduces computation time and resources.
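
To give a rough sense of why limiting the input helps, the sketch below counts the pairwise comparisons for a given number of partitions. It assumes one comparison per distinct unordered pair of partitions; pikun's actual row count may differ (for example, if self-comparisons or both orderings are included), and the function name is hypothetical.

```python
import math


def n_pairwise_comparisons(n_partitions):
    """Number of distinct unordered pairs among n_partitions partitions."""
    return math.comb(n_partitions, 2)


for n in (10, 100, 1000):
    print(n, n_pairwise_comparisons(n))
```

The count grows quadratically: 10 partitions give 45 pairs, while 1000 partitions give 499500.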

Output

pikun-analyze will generate two tab-delimited (.tsv) files (named and located based on the -o/--output-title and -O/--output-directory options):

  • output-directory/output-title-profiles.tsv
  • output-directory/output-title-comparisons.tsv

These files provide univariate and a mix of univariate and bivariate statistics, respectively, for the partitions.

Both of these files can be loaded directly as pandas DataFrames for more detailed analysis:

>>> import pandas as pd
>>> df1 = pd.read_csv(
...     "output-directory/output-title-comparisons.tsv",
...     delimiter="\t"
... )

The -comparisons file includes the variation of information distance statistics: vi_distance and vi_normalized_kraskov.
