hypertools's Introduction

Hi πŸ‘‹!! I'm a psych professor and the director of the Contextual Dynamics Lab at Dartmouth College. I build models of memory and brain networks 🧠, build tools 🧰, teach some open courses πŸ§‘β€πŸ«, and do various other data sciencey things πŸ€“πŸ§‘β€πŸ’». I'm also a dad, husband, nature enthusiast 🌳🌲, and amateur cupcake baker 🧁!

Keywords:

  • Learning and memory πŸ€”
  • Education technology πŸ§‘β€πŸ«
  • Brain networks πŸ•ΈοΈ
  • Computational neuroscience, cognitive neuroscience πŸ’»
  • Natural language processing, natural language understanding πŸ“š
  • Machine learning πŸ€–
  • Data visualization πŸ“ˆ

Check out my lab website for more information about my lab, research, papers, and code!

hypertools's Issues

write a "save" function

should complement the load function by saving a dataframe, formatted text, or a list thereof to a file type that can easily be loaded back in by the load function (given the file path).
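
a minimal sketch of what this could look like (the csv/pickle split is an assumption, not the final design):

# hypothetical sketch; format choices are placeholders, not the final API
import pickle
import pandas as pd

def save(data, fname):
    # dataframes round-trip cleanly through csv; formatted text and
    # mixed lists fall back to pickle so load can recover them directly
    if isinstance(data, pd.DataFrame) and fname.endswith('.csv'):
        data.to_csv(fname)
    else:
        with open(fname, 'wb') as f:
            pickle.dump(data, f)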

documentation: sphinx

need to check over sphinx documentation and make sure everything recompiles and is updated to reflect toolbox changes

refactor "load" function

extend the load function to support the following:

  • existing named datasets (convert to CSV files, numpy arrays, pickles, or some other convenient format)
  • arbitrary URLs (download CSV files, spreadsheets, and other pandas-supported filetypes)
  • local files (any pandas-supported filetype)
  • obj files
  • seaborn datasets (named)
  • scipy/scikit-learn datasets (named)
  • other datasets we analyzed in the arXiv preprint and/or JMLR paper
  • text (txt, rtf, doc) files
  • images (jpg, png, gif, etc. that can be loaded with scikit-image)
  • nii files (via nilearn)
  • youtube videos? (extract text/images?)
  • music (mp3/wav files?)

also add some cool additional datasets.

the function should return either:

  • a pandas dataframe (if data is tabular)
  • formatted text
  • a list of text and dataframes

in other words, the load function should return an object that can be directly plotted or manipulated with other hypertools functions.

Other notes:

  • the load function's supported datatypes should be organized in modules so that support for new filetypes and/or extensions can be easily added
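
as a sketch of that organization (the loader functions here are hypothetical placeholders), dispatch could be driven by a simple extension-to-loader registry that each filetype module extends:

# hypothetical sketch: each filetype module registers its own loader
import os
import pandas as pd

def load_tabular(path):
    return pd.read_csv(path)

def load_text(path):
    with open(path, 'r') as f:
        return f.read()

LOADERS = {'.csv': load_tabular, '.txt': load_text}  # new modules extend this

def load(path):
    ext = os.path.splitext(path)[1].lower()
    return LOADERS[ext](path)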

color management bugs

this should result in black dots, but instead crashes:

hyp.plot(data, 'k.')

this should also work, but instead crashes:

hyp.plot(data, '.', color='k')

this works correctly, however:

hyp.plot(data[0], '.', color='k')
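
one possible fix: split the matplotlib-style format string into its color code and marker/linestyle parts before styling each trace (a sketch; parse_fmt is hypothetical, not the current code):

# hypothetical sketch: separate the color code from the marker/linestyle
import matplotlib.colors as mcolors

def parse_fmt(fmt):
    color, style = None, ''
    for c in fmt:
        if c in mcolors.BASE_COLORS:  # single-letter colors: b, g, r, c, m, y, k, w
            color = c
        else:
            style += c
    return color, style

# parse_fmt('k.') -> ('k', '.')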

fix caching

Caching is really useful when the goal is to re-run the same line of code multiple times (it makes subsequent runs faster). But the current implementation of caching seems to be ignoring some parameters. For example, calling plot with the reduce flag returns a DataGeometry object (as expected). However, subsequent calls to plot using the same dataset (but a different reduce argument) seem to return the same object, rather than calling reduce again with the new argument.

Ideas:

  • The simplest way to fix this would be to entirely remove caching.
  • Alternatively, we could add a cache flag to specify whether or not to cache a particular function call. When cache==False, this could force a "refresh" of whatever function was being called, as a sort of hack to fix the undesired behavior
  • The "best" solution would be to actually figure out why caching isn't working as expected and then fix it.

Plan:

  • Try the simplest solution first and then build from there...
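
for reference, a cache key that folds in every argument would look something like this (a sketch; real data arguments would need a hashable digest):

# hypothetical sketch: key the cache on all arguments, not just the dataset
import functools

def cached(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # include kwargs like reduce in the key; dataframes would need a
        # hashable digest (e.g. a hash of their values) in practice
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper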

documentation: docstrings

all docstrings need to be added. some can be copied over (and/or lightly modified) from the prior version of the toolbox.

clean up

i think we should get rid of:

  • caching
  • datageometry objects
  • defaulting to numpy arrays (replace with dataframes)
  • support for backwards compatibility (e.g. arguments that don't really do anything or that have been replaced) and multi-named keywords (e.g. 'color' and 'colors')
  • support for multi-use arguments (e.g. align=True and align='hyper' do the same thing...but we should just support align='hyper' so that the API is cleaner and more consistent)

Gaps between clusters

Need to check how datasets that span multiple clusters are plotted, particularly in animated plots. It may be necessary to add connecting segments between groups when the current segment is the last in a group but not the last overall within the current block of adjacent time points. Since the next observation could come from any group, i think it'll be necessary to search over all other groups. The connection could be labeled using the correct group and colored using either the current group or an average of the current and next group's colors.
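
a rough sketch of the bridging idea (the data layout and names here are assumptions):

# hypothetical sketch: insert a bridge segment whenever consecutive
# timepoints fall in different groups, colored by averaging group colors
import numpy as np

def bridge_segments(coords, groups, colors):
    # coords: (n_timepoints, n_dims) array; groups: label per timepoint;
    # colors: dict mapping group label -> rgb array
    bridges = []
    for t in range(len(groups) - 1):
        if groups[t] != groups[t + 1]:
            seg = np.vstack([coords[t], coords[t + 1]])
            color = (colors[groups[t]] + colors[groups[t + 1]]) / 2
            bridges.append((seg, color))
    return bridges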

smoothing and/or resampling and/or projecting into low-D using UMAP results in strange "jumps"

the following code:

import hypertools as hyp

data = hyp.load('weights')

umap3d = {'model': 'UMAP', 'args': [], 'kwargs': {'n_components': 3}}
manip = [{'model': 'Smooth', 'args': [], 'kwargs': {'kernel_width': 25}},
         {'model': 'Resample', 'args': [], 'kwargs': {'n_samples': 1000}},
         'ZScore']
hyperalign = {'model': 'HyperAlign', 'args': [], 'kwargs': {'n_iter': 2}}

fig = hyp.plot(data, manip=manip, reduce=umap3d, align=hyperalign)

results in strange "jumps" in trajectories that should be smooth:
[animation: write3d_static]

package organization

it'd be nice to organize code into modules and sub-modules, each with a consistent (well-documented!) API so that sub-modules (or modules) may be easily extended and added, along with defaults specified in the config.ini file under its own section or sub-section:

  • plot: contains a matplotlib folder for matplotlib plotting, a plotly folder for plotly plotting, etc. each folder should contain a plot.py file, an animate.py file, and a stylize.py file
  • vectorize: contains code for turning data in different formats into feature vectors. should support pandas dataframes, text data, images, nifti files, wav files, EEG data(?), etc.
  • manip: contains code for resampling, smoothing, normalizing, z-scoring, z-transforming, etc.
  • align: contains hyperalignment, SRM, and other code for aligning matrix data
  • cluster: wrap various packages (e.g. scikit-learn, HDBSCAN) for clustering data into discrete clusters or mixtures of clusters
  • reduce: wrap various packages (scikit-learn, UMAP, etc.) for projecting high-dimensional data onto lower-dimensional spaces

ensure "story trajectories" pipeline can be reproduced

using previous versions of hypertools, the demo "gif" can be reproduced as follows:

import hypertools as hyp
import timecorr as tc
from scipy.interpolate import pchip
import numpy as np

def resample(a, n):
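    # resample each column of a onto n evenly spaced timepoints
    # using monotonic (pchip) interpolation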
    b = np.zeros([n, a.shape[1]])
    x = np.linspace(1, a.shape[0], num=a.shape[0], endpoint=True)
    xx = np.linspace(1, a.shape[0], num=n, endpoint=True)
    for i in range(a.shape[1]):
        interp = pchip(x, a[:, i])
        b[:, i] = interp(xx)
    return b

pieman_data = hyp.load('weights').data
smoothed = tc.smooth(pieman_data, kernel_fun=tc.helpers.gaussian_weights, kernel_params={'var': 50})
resampled = [resample(x, 1000) for x in smoothed]

aligned = hyp.align(resampled, align='hyper')
hyp.plot(aligned, align=True, reduce='UMAP', animate=True, duration=30, tail_duration=4.0, zoom=0.5, save_path='pieman.mov')

A similar approach should work with the revamped version, but it doesn't seem to work well in practice:

import hypertools as hyp

data = hyp.load('weights')

manip = [{'model': 'Smooth', 'args': [], 'kwargs': {'kernel_width': 25}},
         {'model': 'Resample', 'args': [], 'kwargs': {'n_samples': 1000}},
         'ZScore']
hyperalign = {'model': 'HyperAlign', 'args': [], 'kwargs': {'n_iter': 2}}

hyp.plot(data, manip=manip, align=hyperalign, animate='window', reduce='UMAP', duration=30, focused=4, zoom=0.5)

there are at least a few differences that i notice:

  • prior versions of hypertools normalized data by default. this no longer happens. in the above example, i've specified that the data should be z-scored (within feature) prior to passing to UMAP, but i need to double check that the "preprocessing" is analogous to the prior version. e.g. should 'ZScore' be replaced with 'Normalize' and/or some other preprocessing step?
  • the first demo uses a Gaussian kernel (variance = 50) to smooth the data, whereas the second example uses a boxcar kernel (width=25). i can't imagine that this would substantially change the results...but you never know...
  • i don't think that the normalization step gets applied in the old hyp.align function, but it's possible the previous demo normalized the data twice (once prior to the first alignment step and again prior to the second alignment/projecting into 3D steps). if so, the second demo could be updated to normalize multiple times.

it's also possible that something is off with the revised hyperalignment implementation, even though the alignment tests are passing...

set defaults for all functions

organize default parameters for every supported model, from every module, in a single config.ini file.

each module should get a section (plot, cluster, align, reduce, etc.) and each sub-function should get a sub-section (static, animate, etc.)

this should also store locations of named datasets/models
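
a possible layout, readable with configparser (section and key names are illustrative only):

# hypothetical sketch of the config.ini layout; sections/keys are illustrative
import configparser

example = """
[plot.static]
color = k

[plot.animate]
duration = 30

[reduce]
model = UMAP
n_components = 3
"""

config = configparser.ConfigParser()
config.read_string(example)
print(config['plot.animate']['duration'])  # -> '30'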

Create "minimum viable" skeleton package

Create a highly simplified new version of hypertools with better organization and modularity. It should have the following features:

1.) support only dataframes or lists of dataframes (no "format data" functions)
2.) reduce: apply a scikit-learn dimensionality reduction model to the data to get a new dataframe (or list of dataframes)
3.) align: apply hyperalignment or SRM to the data to get a new dataframe (or list of dataframes)
4.) cluster: apply discrete clustering to the data and color datapoints according to their clusters.
5.) plot: take in one or more dataframes and generate a static 2D or 3D plot by calling reduce (if ndims > 3). data must be > 1D, or an error is thrown. generate either a scatterplot, a line plot, or a scatterplot/lineplot combo. internally, this should call the reduce and/or align and/or cluster functions, as specified by the user.
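
the public surface could then be as small as this (a sketch of signatures only; the defaults shown are illustrative):

# hypothetical sketch of the minimum viable API; bodies omitted
def reduce(data, model='IncrementalPCA', n_components=3):
    """apply a scikit-learn reduction model; return dataframe(s)."""

def align(data, model='hyper'):
    """apply hyperalignment or SRM; return dataframe(s)."""

def cluster(data, model='KMeans'):
    """assign discrete cluster labels, used for coloring datapoints."""

def plot(data, reduce='IncrementalPCA', align=None, cluster=None):
    """static 2D or 3D plot; calls reduce/align/cluster internally."""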

multilevel indices

given a dataframe with a multilevel index, the line/dot colors should be determined from the outermost index. line thickness for the outermost level should be scaled by a factor of L (the number of levels).

For each "lower" level in turn, reduce the line thickness by 1 and opacity by some amount (play around with it to see what looks good).

E.g., suppose a dataframe contains the levels timepoints --> trials --> subjects --> experiments.

Each "experiment" should show the average trajectory (or scatterplot) for everything below it, with large/thick and mostly opaque dots/markers, in one color (or colormap) per experiment. Each "subject" should show that subject's average trajectory/scatterplot within that experiment, colored the same as the experiment but smaller/thinner/less opaque. Each trial should show the timecourse across timepoints (in even smaller/thinner/less opaque lines/dots), again colored the same as the subjects/experiments.

The idea is to show a summary of each part of the dataset, but with subtle details added to show (as subtlety increases) more and more low-level detail.
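
pandas can compute the per-level averages directly via groupby over index levels (a sketch, assuming experiments are the outermost level and timepoints the innermost):

# hypothetical sketch: average trajectories at each index depth, coarsest first
import pandas as pd

def level_averages(df):
    # df indexed by e.g. (experiment, subject, trial, timepoint)
    levels = list(range(df.index.nlevels))
    out = {}
    for depth in range(1, df.index.nlevels):
        # average over everything below this depth, keeping the time axis
        out[depth] = df.groupby(level=levels[:depth] + [levels[-1]]).mean()
    return out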

add support for numpy arrays

create a simple format_data function:

  • when the user passes a numpy array, convert it to a dataframe automatically
  • always return a list of dataframes
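
a minimal sketch:

# hypothetical sketch of format_data: numpy arrays become dataframes,
# and the return value is always a list of dataframes
import numpy as np
import pandas as pd

def format_data(data):
    if not isinstance(data, list):
        data = [data]
    return [pd.DataFrame(d) if isinstance(d, np.ndarray) else d for d in data]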

deal with nested lists

nested lists could be converted to multilevel dataframes, where the outermost list level corresponds to the outermost index level.
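
pd.concat with keys handles one level of nesting, and recursion extends it to arbitrary depth (a sketch):

# hypothetical sketch: recursively fold nested lists into index levels,
# outermost list level -> outermost index level
import pandas as pd

def nest_to_multiindex(data):
    if isinstance(data, pd.DataFrame):
        return data
    return pd.concat([nest_to_multiindex(d) for d in data],
                     keys=range(len(data)))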

clustered observations not displaying correctly when mode involves lines

The following code produces the incorrect plot below:

gaussian_mixture = {'model': 'GaussianMixture', 'args': [],
                    'kwargs': {'n_components': 3, 'mode': 'fit_predict_proba'}}
hyp.plot(data[0], cluster=gaussian_mixture, cmap='husl')

[figure: cluster_failure]

There are a couple of potential fail points that need to be explored (along with other possibilities):

  • only a single trace is plotted, whereas other examples have used multiple traces
  • mode = lines versus markers

pandas stack and unstack

add support for stacking multiple dataframes (and increasing the index levels). also add a complementary "undo stack" operation that turns a multilevel dataframe into a list
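
for a flat list, the two operations are near inverses of each other (a sketch; the names stack/unstack are placeholders):

# hypothetical sketch: stack a list of dataframes into one multilevel
# dataframe, and unstack it back into a list
import pandas as pd

def stack(dfs):
    return pd.concat(dfs, keys=range(len(dfs)))

def unstack(df):
    return [df.xs(k) for k in df.index.get_level_values(0).unique()]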

add support for missing data

enhance format_data by adding support for missing data:

  • missing rows are filled in using interpolation (support scikit-learn functions)
  • missing columns (within a given row, where other rows are not missing data in that column) are filled in using PPCA
  • if a column is entirely nans, remove it
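
a sketch of the interpolation and column-dropping pieces (the PPCA fill is left as a placeholder):

# hypothetical sketch: drop all-nan columns, interpolate remaining gaps;
# per-cell PPCA imputation is left as a placeholder
import pandas as pd

def fill_missing(df):
    df = df.dropna(axis=1, how='all')            # remove columns that are entirely nans
    df = df.interpolate(limit_direction='both')  # fill missing rows by interpolation
    # remaining structured gaps could be filled with PPCA here
    return df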
