hypertools's Introduction

Hi πŸ‘‹!! I'm a psych professor and the director of the Contextual Dynamics Lab at Dartmouth College. I build models of memory and brain networks 🧠, build tools 🧰, teach some open courses πŸ§‘β€πŸ«, and do various other data sciencey things πŸ€“πŸ§‘β€πŸ’». I'm also a dad, husband, nature enthusiast 🌳🌲, and amateur cupcake baker 🧁!

Keywords:

  • Learning and memory πŸ€”
  • Education technology πŸ§‘β€πŸ«
  • Brain networks πŸ•ΈοΈ
  • Computational neuroscience, cognitive neuroscience πŸ’»
  • Natural language processing, natural language understanding πŸ“š
  • Machine learning πŸ€–
  • Data visualization πŸ“ˆ

Check out my lab website for more information about my lab, research, papers, and code!

hypertools's Issues

write a "save" function

should complement the load function by saving a dataframe, formatted text, or a list thereof to a file type that can easily be loaded back in by the load function (given the file path).
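
a minimal sketch of what this could look like (the csv/pickle split is an assumption, not the final design):

# hypothetical sketch; format choices are placeholders, not the final API
import pickle
import pandas as pd

def save(data, fname):
    # dataframes round-trip cleanly through csv; formatted text and
    # mixed lists fall back to pickle so load can recover them directly
    if isinstance(data, pd.DataFrame) and fname.endswith('.csv'):
        data.to_csv(fname)
    else:
        with open(fname, 'wb') as f:
            pickle.dump(data, f)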

documentation: sphinx

need to check over sphinx documentation and make sure everything recompiles and is updated to reflect toolbox changes

refactor "load" function

extend the load function to support the following:

  • existing named datasets (convert to CSV files, numpy arrays, pickles, or some other convenient format)
  • arbitrary URLs (download CSV files, spreadsheets, and other pandas-supported filetypes)
  • local files (any pandas-supported filetype)
  • obj files
  • seaborn datasets (named)
  • scipy/scikit-learn datasets (named)
  • other datasets we analyzed in the arXiv preprint and/or JMLR paper
  • text (txt, rtf, doc) files
  • images (jpg, png, gif, etc. that can be loaded with scikit-image)
  • nii files (via nilearn)
  • youtube videos? (extract text/images?)
  • music (mp3/wav files?)

also add some cool additional datasets.

the function should return either:

  • a pandas dataframe (if data is tabular)
  • formatted text
  • a list of text and dataframes

in other words, the load function should return an object that can be directly plotted or manipulated with other hypertools functions.

Other notes:

  • the load function's supported datatypes should be organized in modules so that support for new filetypes and/or extensions can be easily added
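
as a sketch of that organization (the loader functions here are hypothetical placeholders), dispatch could be driven by a simple extension-to-loader registry that each filetype module extends:

# hypothetical sketch: each filetype module registers its own loader
import os
import pandas as pd

def load_tabular(path):
    return pd.read_csv(path)

def load_text(path):
    with open(path, 'r') as f:
        return f.read()

LOADERS = {'.csv': load_tabular, '.txt': load_text}  # new modules extend this

def load(path):
    ext = os.path.splitext(path)[1].lower()
    return LOADERS[ext](path)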

color management bugs

this should result in black dots, but instead crashes:

hyp.plot(data, 'k.')

this should also work, but instead crashes:

hyp.plot(data, '.', color='k')

this works correctly, however:

hyp.plot(data[0], '.', color='k')
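
one possible fix: split the matplotlib-style format string into its color code and marker/linestyle parts before styling each trace (a sketch; parse_fmt is hypothetical, not the current code):

# hypothetical sketch: separate the color code from the marker/linestyle
import matplotlib.colors as mcolors

def parse_fmt(fmt):
    color, style = None, ''
    for c in fmt:
        if c in mcolors.BASE_COLORS:  # single-letter colors: b, g, r, c, m, y, k, w
            color = c
        else:
            style += c
    return color, style

# parse_fmt('k.') -> ('k', '.')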

fix caching

Caching is really useful when the goal is to re-run the same line of code multiple times (it makes subsequent runs faster). But the current implementation of caching seems to be ignoring some parameters. For example, calling plot with the reduce flag returns a DataGeometry object (as expected). However, subsequent calls to plot using the same dataset (but a different reduce argument) seem to return the same object, rather than calling reduce again with the new argument.

Ideas:

  • The simplest way to fix this would be to entirely remove caching.
  • Alternatively, we could add a cache flag to specify whether or not to cache a particular function call. When cache==False, this could force a "refresh" of whatever function was being called, as a sort of hack to fix the undesired behavior
  • The "best" solution would be to actually figure out why caching isn't working as expected and then fix it.

Plan:

  • Try the simplest solution first and then build from there...
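
for reference, a cache key that folds in every argument would look something like this (a sketch; real data arguments would need a hashable digest):

# hypothetical sketch: key the cache on all arguments, not just the dataset
import functools

def cached(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # include kwargs like reduce in the key; dataframes would need a
        # hashable digest (e.g. a hash of their values) in practice
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper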

documentation: docstrings

all docstrings need to be added. some can be copied over (and/or lightly modified) from the prior version of the toolbox.

clean up

i think we should get rid of:

  • caching
  • datageometry objects
  • defaulting to numpy arrays (replace with dataframes)
  • support for backwards compatibility (e.g. arguments that don't really do anything or that have been replaced) and multi-named keywords (e.g. 'color' and 'colors')
  • support for multi-use arguments (e.g. align=True and align='hyper' do the same thing...but we should just support align='hyper' so that the API is cleaner and more consistent)

Gaps between clusters

Need to check how datasets that span multiple clusters are plotted, particularly in animated plots. It may be necessary to add connecting segments between groups when the current segment is the last in a group but not the last overall within the current block of adjacent time points. Since the next observation could come from any group, i think it'll be necessary to search over all other groups. The connection could be labeled using the correct group and colored using either the current group or an average of the current and next group's colors.
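
a rough sketch of the bridging idea (the data layout and names here are assumptions):

# hypothetical sketch: insert a bridge segment whenever consecutive
# timepoints fall in different groups, colored by averaging group colors
import numpy as np

def bridge_segments(coords, groups, colors):
    # coords: (n_timepoints, n_dims) array; groups: label per timepoint;
    # colors: dict mapping group label -> rgb array
    bridges = []
    for t in range(len(groups) - 1):
        if groups[t] != groups[t + 1]:
            seg = np.vstack([coords[t], coords[t + 1]])
            color = (colors[groups[t]] + colors[groups[t + 1]]) / 2
            bridges.append((seg, color))
    return bridges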

smoothing and/or resampling and/or projecting into low-D using UMAP results in strange "jumps"

the following code:

import hypertools as hyp

data = hyp.load('weights')

umap3d = {'model': 'UMAP', 'args': [], 'kwargs': {'n_components': 3}}
manip = [{'model': 'Smooth', 'args': [], 'kwargs': {'kernel_width': 25}},
         {'model': 'Resample', 'args': [], 'kwargs': {'n_samples': 1000}},
         'ZScore']
hyperalign = {'model': 'HyperAlign', 'args': [], 'kwargs': {'n_iter': 2}}

fig = hyp.plot(data, manip=manip, reduce=umap3d, align=hyperalign)

results in strange "jumps" in trajectories that should be smooth:
[animation: write3d_static]

package organization

it'd be nice to organize code into modules and sub-modules, each with a consistent (well-documented!) API so that sub-modules (or modules) may be easily extended and added, along with defaults specified in the config.ini file under its own section or sub-section:

  • plot: contains a matplotlib folder for matplotlib plotting, a plotly folder for plotly plotting, etc. each folder should contain a plot.py file, an animate.py file, and a stylize.py file
  • vectorize: contains code for turning data in different formats into feature vectors. should support pandas dataframes, text data, images, nifti files, wav files, EEG data(?), etc.
  • manip: contains code for resampling, smoothing, normalizing, z-scoring, z-transforming, etc.
  • align: contains hyperalignment, SRM, and other code for aligning matrix data
  • cluster: wrap various packages (e.g. scikit-learn, HDBSCAN) for clustering data into discrete clusters or mixtures of clusters
  • reduce: wrap various packages (scikit-learn, UMAP, etc.) for projecting high-dimensional data onto lower-dimensional spaces

ensure "story trajectories" pipeline can be reproduced

using previous versions of hypertools, the demo "gif" can be reproduced as follows:

import hypertools as hyp
import timecorr as tc
from scipy.interpolate import pchip
import numpy as np

def resample(a, n):
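    # resample each column of a onto n evenly spaced timepoints
    # using monotonic (pchip) interpolation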
    b = np.zeros([n, a.shape[1]])
    x = np.linspace(1, a.shape[0], num=a.shape[0], endpoint=True)
    xx = np.linspace(1, a.shape[0], num=n, endpoint=True)
    for i in range(a.shape[1]):
        interp = pchip(x, a[:, i])
        b[:, i] = interp(xx)
    return b

pieman_data = hyp.load('weights').data
smoothed = tc.smooth(pieman_data, kernel_fun=tc.helpers.gaussian_weights, kernel_params={'var': 50})
resampled = [resample(x, 1000) for x in smoothed]

aligned = hyp.align(resampled, align='hyper')
hyp.plot(aligned, align=True, reduce='UMAP', animate=True, duration=30, tail_duration=4.0, zoom=0.5, save_path='pieman.mov')

A similar approach should work with the revamped version, but it doesn't seem to work well in practice:

import hypertools as hyp

data = hyp.load('weights')

manip = [{'model': 'Smooth', 'args': [], 'kwargs': {'kernel_width': 25}},
         {'model': 'Resample', 'args': [], 'kwargs': {'n_samples': 1000}},
         'ZScore']
hyperalign = {'model': 'HyperAlign', 'args': [], 'kwargs': {'n_iter': 2}}

hyp.plot(data, manip=manip, align=hyperalign, animate='window', reduce='UMAP', duration=30, focused=4, zoom=0.5)

there are at least a few differences that i notice:

  • prior versions of hypertools normalized data by default. this no longer happens. in the above example, i've specified that the data should be z-scored (within feature) prior to passing to UMAP, but i need to double check that the "preprocessing" is analogous to the prior version. e.g. should 'ZScore' be replaced with 'Normalize' and/or some other preprocessing step?
  • the first demo uses a Gaussian kernel (variance = 50) to smooth the data, whereas the second example uses a boxcar kernel (width=25). i can't imagine that this would substantially change the results...but you never know...
  • i don't think that the normalization step gets applied in the old hyp.align function, but it's possible the previous demo normalized the data twice (once prior to the first alignment step and again prior to the second alignment/projecting into 3D steps). if so, the second demo could be updated to normalize multiple times.

it's also possible that something is off with the revised hyperalignment implementation, even though the alignment tests are passing...

set defaults for all functions

organize default parameters for every supported model, from every module, in a single config.ini file.

each module should get a section (plot, cluster, align, reduce, etc.) and each sub-function should get a sub-section (static, animate, etc.)

this should also store locations of named datasets/models
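
a possible layout, readable with configparser (section and key names are illustrative only):

# hypothetical sketch of the config.ini layout; sections/keys are illustrative
import configparser

example = """
[plot.static]
color = k

[plot.animate]
duration = 30

[reduce]
model = UMAP
n_components = 3
"""

config = configparser.ConfigParser()
config.read_string(example)
print(config['plot.animate']['duration'])  # -> '30'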

Create "minimum viable" skeleton package

Create a highly simplified new version of hypertools with better organization and modularity. It should have the following features:

1.) support only dataframes or lists of dataframes (no "format data" functions)
2.) reduce: apply a scikit-learn dimensionality reduction model to the data to get a new dataframe (or list of dataframes)
3.) align: apply hyperalignment or SRM to the data to get a new dataframe (or list of dataframes)
4.) cluster: apply discrete clustering to the data and color datapoints according to their clusters.
5.) plot: take in one or more dataframes and generate a static 2D or 3D plot by calling reduce (if ndims > 3). data must be > 1D, or an error is thrown. generate either a scatterplot, a line plot, or a scatterplot/lineplot combo. internally, this should call the reduce and/or align and/or cluster functions, as specified by the user.
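
the public surface could then be as small as this (a sketch of signatures only; the defaults shown are illustrative):

# hypothetical sketch of the minimum viable API; bodies omitted
def reduce(data, model='IncrementalPCA', n_components=3):
    """apply a scikit-learn reduction model; return dataframe(s)."""

def align(data, model='hyper'):
    """apply hyperalignment or SRM; return dataframe(s)."""

def cluster(data, model='KMeans'):
    """assign discrete cluster labels, used for coloring datapoints."""

def plot(data, reduce='IncrementalPCA', align=None, cluster=None):
    """static 2D or 3D plot; calls reduce/align/cluster internally."""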

multilevel indices

given a dataframe with a multilevel index, the line/dot colors should be determined from the outermost index. line thickness for the outermost level should be scaled by a factor of L (the number of levels).

For each "lower" level in turn, reduce the line thickness by 1 and opacity by some amount (play around with it to see what looks good).

E.g., suppose a dataframe contains the levels timepoints --> trials --> subjects --> experiments.

Each "experiment" should show the average trajectory (or scatterplot) for everything below it, with large/thick and mostly opaque dots/markers, in one color (or colormap) per experiment. Each "subject" should show that subject's average trajectory/scatterplot within that experiment, colored the same as the experiment but smaller/thinner/less opaque. Each trial should show the timecourse across timepoints (in even smaller/thinner/less opaque lines/dots), again colored the same as the subjects/experiments.

The idea is to show a summary of each part of the dataset, but with subtle details added to show (as subtlety increases) more and more low-level detail.
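
pandas can compute the per-level averages directly via groupby over index levels (a sketch, assuming experiments are the outermost level and timepoints the innermost):

# hypothetical sketch: average trajectories at each index depth, coarsest first
import pandas as pd

def level_averages(df):
    # df indexed by e.g. (experiment, subject, trial, timepoint)
    levels = list(range(df.index.nlevels))
    out = {}
    for depth in range(1, df.index.nlevels):
        # average over everything below this depth, keeping the time axis
        out[depth] = df.groupby(level=levels[:depth] + [levels[-1]]).mean()
    return out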

add support for numpy arrays

create a simple format_data function:

  • when the user passes a numpy array, convert it to a dataframe automatically
  • always return a list of dataframes
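
a minimal sketch:

# hypothetical sketch of format_data: numpy arrays become dataframes,
# and the return value is always a list of dataframes
import numpy as np
import pandas as pd

def format_data(data):
    if not isinstance(data, list):
        data = [data]
    return [pd.DataFrame(d) if isinstance(d, np.ndarray) else d for d in data]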

deal with nested lists

nested lists could be converted to multilevel dataframes, where the outermost list level corresponds to the outermost index level.
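
pd.concat with keys handles one level of nesting, and recursion extends it to arbitrary depth (a sketch):

# hypothetical sketch: recursively fold nested lists into index levels,
# outermost list level -> outermost index level
import pandas as pd

def nest_to_multiindex(data):
    if isinstance(data, pd.DataFrame):
        return data
    return pd.concat([nest_to_multiindex(d) for d in data],
                     keys=range(len(data)))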

clustered observations not displaying correctly when mode involves lines

The following code produces the incorrect plot below:

gaussian_mixture = {'model': 'GaussianMixture', 'args': [],
                    'kwargs': {'n_components': 3, 'mode': 'fit_predict_proba'}}
hyp.plot(data[0], cluster=gaussian_mixture, cmap='husl')

[figure: cluster_failure]

There are a couple of potential fail points that need to be explored (along with other possibilities):

  • only a single trace is plotted, whereas other examples have used multiple traces
  • mode = lines versus markers

pandas stack and unstack

add support for stacking multiple dataframes (and increasing the index levels). also add a complementary "undo stack" operation that turns a multilevel dataframe into a list
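
for a flat list, the two operations are near inverses of each other (a sketch; the names stack/unstack are placeholders):

# hypothetical sketch: stack a list of dataframes into one multilevel
# dataframe, and unstack it back into a list
import pandas as pd

def stack(dfs):
    return pd.concat(dfs, keys=range(len(dfs)))

def unstack(df):
    return [df.xs(k) for k in df.index.get_level_values(0).unique()]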

add support for missing data

enhance format_data by adding support for missing data:

  • missing rows are filled in using interpolation (support scikit-learn functions)
  • missing columns (within a given row, where other rows are not missing data in that column) are filled in using PPCA
  • if a column is entirely nans, remove it
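
a sketch of the interpolation and column-dropping pieces (the PPCA fill is left as a placeholder):

# hypothetical sketch: drop all-nan columns, interpolate remaining gaps;
# per-cell PPCA imputation is left as a placeholder
import pandas as pd

def fill_missing(df):
    df = df.dropna(axis=1, how='all')            # remove columns that are entirely nans
    df = df.interpolate(limit_direction='both')  # fill missing rows by interpolation
    # remaining structured gaps could be filled with PPCA here
    return df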
