
cultivar's Introduction

Trinket

Multidimensional data explorer and visualization tool.


[Image: "Colorful Wall"]

About

This is a dataset management and visualization tool that is being built as part of the DDL Multidimensional Visualization Research Lab. See: Parallel Coordinates for more on the types of visualizations we're experimenting with.

For more information, please enjoy the documentation found at trinket.readthedocs.org.

Contributing

Trinket is open source, but because this is a District Data Labs project, we would appreciate it if you would let us know how you intend to use the software (other than simply copying and pasting code so that you can use it in your own projects). If you would like to contribute (especially if you are a student or research labs member at District Data Labs), you can do so in the following ways:

  1. Add issues or bugs to the bug tracker: https://github.com/DistrictDataLabs/trinket/issues
  2. Work on a card on the dev board: https://waffle.io/DistrictDataLabs/trinket
  3. Create a pull request in Github: https://github.com/DistrictDataLabs/trinket/pulls

Note that labels in the Github issues are defined in the blog post: How we use labels on GitHub Issues at Mediocre Laboratories.

If you are a member of the District Data Labs Faculty group, you have direct access to the repository, which is set up in a typical production/release/development cycle as described in A Successful Git Branching Model. A typical workflow is as follows:

  1. Select a card from the dev board, preferably one that is "ready", then move it to "in-progress".

  2. Create a branch off of develop called "feature-[feature name]", work and commit into that branch.

     ~$ git checkout -b feature-myfeature develop
    
  3. Once you are done working (and everything is tested) merge your feature into develop.

     ~$ git checkout develop
     ~$ git merge --no-ff feature-myfeature
     ~$ git branch -d feature-myfeature
     ~$ git push origin develop
    
  4. Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.

Throughput

[Throughput graph]

Attribution

The image used in this README, "window#1" by Namelas Frade, is licensed under CC BY-NC-ND 2.0.

Changelog

The release versions that are sent to the Python package index (PyPI) are also tagged in Github. You can see the tags through the Github web application and download the tarball of the version you'd like. Additionally, PyPI will host the various releases of Trinket (eventually).

The versioning uses a three-part "a.b.c" scheme: "a" represents a major release that may not be backwards compatible; "b" is incremented on minor releases that may contain extra features but are backwards compatible; "c" releases are bug fixes or other micro changes that developers should feel free to update to immediately.

Version 0.2

  • tag: v0.2
  • deployment: Wednesday, January 27, 2016
  • commit: (see tag)

This minor update gave a bit more functionality to the MVP prototype, even though the version was intended to have a much more impactful feature set. However, after some study, the workflow is changing, so this development branch is being pruned and deployed in preparation for the next batch. The major achievement of this version is the documentation that discusses our approach, as well as the dataset search and listing page that is now available.

Version 0.1

  • tag: v0.1
  • deployment: Tuesday, October 13, 2015
  • commit: c863e42

MVP prototype of a dataset uploader and management application. This application framework will become the basis for the research project in the DDL Multidimensional Visualization Research Lab. For now, users can upload datasets, manage their descriptions, and preview the first 20 rows.

cultivar's People

Contributors

bbengfort, ojedatony1616, rebeccabilbro, waffle-iron


cultivar's Issues

Large files "hang" uploader

There is no feedback to the user during file upload, so it appears to simply hang. This would hopefully be fixed by issue #3; otherwise some sort of "uploading" animation and blocking will be required.

3D tours

FYI - Ben doesn't want to work on this

AJAXify the uploader

Currently the uploader is just an opacity-0 file input that has been styled with Bootstrap. However, we should use dropzone.js or jQuery File Upload in order to make this process a bit more seamless, including progress bars, updating, etc.

Training vs. Testing Datasets

One interesting thing that is happening is that you have to upload training and testing datasets separately and manage them separately. It seems like they should be in the "same" dataset and managed together, even though they are two different files.

Maybe we can use #5 (related datasets) somehow to implement this functionality?

Should we create training/testing datasets for every single file we upload?

Upload Error: line contains NULL byte

    Internal Server Error: /upload/
    Traceback (most recent call last):
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/base.py", line 71, in view
        return self.dispatch(request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/braces/views/_access.py", line 98, in dispatch
        request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/base.py", line 89, in dispatch
        return handler(request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/edit.py", line 215, in post
        return self.form_valid(form)
      File "/app/coffer/views.py", line 50, in form_valid
        form.save()
      File "/app/coffer/forms.py", line 51, in save
        dataset=self.cleaned_data['dataset'], uploader=self.request.user
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 348, in create
        obj.save(force_insert=True, using=self.db)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
        force_update=force_update, update_fields=update_fields)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 758, in save_base
        update_fields=update_fields)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 201, in send
        response = receiver(signal=self, sender=sender, **named)
      File "/app/coffer/signals.py", line 51, in dataset_file_compute
        header = reader.next()
      File "/app/.heroku/python/lib/python2.7/site-packages/unicodecsv/py2.py", line 117, in next
        row = self.reader.next()
    Error: line contains NULL byte
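
The failure is unicodecsv choking on NUL bytes when the upload signal reads the header in dataset_file_compute. A minimal sketch of a guard, assuming we scrub NUL bytes from each raw line before the CSV reader sees them (the helper name is hypothetical):

    import unicodecsv

    def read_header_safely(fileobj):
        # Strip NUL bytes from each raw line; unicodecsv raises
        # "line contains NULL byte" if any slip through.
        cleaned = (line.replace(b'\x00', b'') for line in fileobj)
        reader = unicodecsv.reader(cleaned)
        return next(reader)  # the header row

An alternative is to reject the file with a validation error, since embedded NUL bytes usually mean the upload is not really a text/CSV file.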

500 error on upload w/ missing col/row values

I'm getting an error when I attempt to upload datasets that have missing values in some of the columns/rows. I'm noticing this because a lot of government datasets use the first few rows of a table to provide metadata.

IPython Notebook Download

Provide the ability to access a dataset from an IPython notebook (e.g. pd.read_trinket(url) or something like that) and, furthermore, allow the download of an IPython notebook that sets up a basic analysis for the user.

One idea here is to simply use the Django template engine to write various information into the JSON of a Jupyter notebook then send it to download.
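
One way to realize that idea without templating raw JSON is to build the nbformat v4 structure as a plain dict and serialize it. A minimal sketch, where the function name, the dataset URL parameter, and the pandas call inside the generated cell are all illustrative:

    import json

    def make_starter_notebook(dataset_url):
        # One code cell that loads the dataset into pandas and previews it.
        cell = {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "df = pd.read_csv('%s')\n" % dataset_url,
                "df.head(20)",
            ],
        }
        notebook = {
            "nbformat": 4,
            "nbformat_minor": 0,
            "metadata": {},
            "cells": [cell],
        }
        return json.dumps(notebook, indent=2)

A Django view could then return this string with a Content-Disposition attachment header so the browser downloads it as a .ipynb file.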

Create a search and listing page for datasets

Create a mechanism to search for and filter a list of datasets by their various attributes. The dataset listing page should allow users to easily find and display a particular dataset.

Dataset Overwrite/Versioning System

Right now if you upload a duplicate file, the file is modified on S3 - e.g. its "last modified" timestamp changes. We need to ask some important questions for data management:

  1. Are we simply "touching" the file or are we rewriting it?
  2. What counts as a duplicate on S3? Presumably just the filename, or are we protected by the hash?
  3. Can we use some temporary data store in S3 that gets cleaned regularly for protection?
  4. Should we save datasets according to their hash, then rename on download?

We should make sure that a dataset cannot be overwritten if someone uploads a different dataset with the same name.
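
A minimal sketch of the approach in question 4, assuming a SHA-256 content hash and a hypothetical key scheme: duplicate uploads map to the same key, while a different dataset that happens to share a filename maps to a different key and therefore cannot overwrite the original.

    import hashlib

    def s3_key_for(fileobj, blocksize=65536):
        # Hash the file contents in chunks so large uploads do not
        # have to fit in memory.
        digest = hashlib.sha256()
        for chunk in iter(lambda: fileobj.read(blocksize), b''):
            digest.update(chunk)
        fileobj.seek(0)  # rewind so the upload can re-read the file
        return "datasets/%s" % digest.hexdigest()

The original filename would then live in the database and be restored on download via the Content-Disposition header.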

SSL 404 Redirect

Accessing Trinket on Heroku via http (instead of https) leads to a 404 redirect error when signing in with Google.

Splitting into a matrix

After you do feature detection, if there is no logical matrix (triplet, ID, and index), the system should transform the data and create its own schema.

Preview overruns x-overflow

Datasets with many columns overrun the viewable area creating an awkward display. Perhaps the preview should be its own page.

Standardizing

Setting the mean to zero, scaling, and normalizing the dataset.
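
A minimal NumPy sketch of the column-wise version, assuming a purely numeric matrix:

    import numpy as np

    def standardize(X):
        # Center each column to zero mean and scale to unit variance.
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant columns
        return (X - mu) / sigma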

Add Organizations

Create a stub to allow datasets to be owned by either users or by organizations. We will probably use this more in the future, but it's best to be ready.

Related Datasets

Create a dataset linking mechanism so that we can see related datasets. This process can be intelligent/automatic by inspecting field names or descriptions; otherwise it can be user-driven.
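
A sketch of the automatic variant, assuming a simple Jaccard similarity over column names (this heuristic is an illustration, not an existing Trinket feature):

    def field_similarity(cols_a, cols_b):
        # Jaccard similarity of the two column-name sets; a score
        # near 1.0 suggests the datasets may be related.
        a = set(c.lower() for c in cols_a)
        b = set(c.lower() for c in cols_b)
        if not a or not b:
            return 0.0
        return len(a & b) / float(len(a | b))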

Continuous or Categorical

Trinket should have some feature to determine if data is continuous or categorical.

The system should guess this on behalf of the user; however, the user should ultimately have control.
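
A rough sketch of the guessing step, where the distinct-value threshold is an arbitrary assumption and a user override would always win:

    def guess_measurement(values, max_categories=20):
        # Non-numeric values mean categorical; numeric columns with
        # few distinct values are probably categorical codes too.
        distinct = set(values)
        try:
            for v in distinct:
                float(v)
        except (TypeError, ValueError):
            return "categorical"
        return "continuous" if len(distinct) > max_categories else "categorical"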

Async Upload with Celery

Right now the file upload process is synchronous and the following computations are performed:

  1. Read the file for the hash value
  2. Read the file for its length and number of dimensions
  3. Insert the record/collision into the database
  4. Store on S3 on success

Especially since we're moving to dropzone.js (#3), we might as well make this computation chain asynchronous using Celery tasks. This will allow us to do better detection and also give variable status reports for different file uploads.

This might help some of the problems/questions cited in #7.
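
A sketch of what that chain might look like with Celery; the task names and bodies are placeholders for the four steps above, not the real Trinket task layout:

    from celery import chain, shared_task

    @shared_task
    def compute_hash(upload_id):
        pass  # step 1: read the file for the hash value

    @shared_task
    def compute_stats(upload_id):
        pass  # step 2: read the file for length and dimensions

    @shared_task
    def insert_record(upload_id):
        pass  # step 3: insert the record/collision into the database

    @shared_task
    def store_on_s3(upload_id):
        pass  # step 4: store on S3 on success

    def process_upload(upload_id):
        # Kick off the whole pipeline without blocking the request;
        # .si() makes each step an immutable signature.
        return chain(
            compute_hash.si(upload_id),
            compute_stats.si(upload_id),
            insert_record.si(upload_id),
            store_on_s3.si(upload_id),
        ).apply_async()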

Research Auto-analysis Feature

Auto analysis assigns each column/feature a data type (dtype in the parlance of NumPy and Pandas), e.g. categorical, numeric, real, integer, etc. These types must be automatically inferred from the dataset.

Questions to answer:

  • How does pandas do this?
  • What does column-major mean for Trinket?
  • What types are we looking for?
  • How lightweight/heavyweight must this be?
  • Is there a certain density of data required to make a decision?
  • Do you have to go through the whole dataset to make a decision?
  • Can we use a sample approach to reading the data?
  • How do we detect if there is a header row or not?
  • Can we automatically detect delimiters and quote characters? (e.g. ; vs ,)

Interesting stuff/libraries in: Data Type Recognition/Guessing of CSV data in python
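
For the last two questions, the standard library's csv.Sniffer already implements a usable heuristic for both; a minimal sketch (the candidate delimiters and sample size are assumptions):

    import csv

    def sniff_dialect(fileobj, sample_size=8192):
        # Guess delimiter/quoting and header presence from a sample.
        # Sniffer is a heuristic, so both guesses can be wrong.
        sample = fileobj.read(sample_size)
        fileobj.seek(0)
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample, delimiters=",;\t|")
        has_header = sniffer.has_header(sample)
        return dialect, has_header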

Gather/create datasets for auto analysis validation

We need a variety of datasets to show when our auto analysis technique works, how completely it works, and when it fails.

Consider data sets with:

  • a variety of delimiters and escape characters
  • headers and no headers
  • rows of varying lengths
  • columns of many different data types
  • datasets with errors (multiple datatypes per column)
  • datasets with null values of a variety of types

Dataset Searching

At this point we have the following accomplished:

  • Listing page that displays the most recent uploads in a table
  • Pagination by 20 datasets and the pagination component
  • A button to go to the listing page from the home page

Next up:

  • put navbar links for upload/listing for easy navigation (only if authenticated)
  • create API list endpoint for the datasets
  • create ordering and search for datasets
  • create a form that will allow you to search for datasets.

Other questions include full text search, etc. However, we may need to rethink how this is done given our new workflow, so I'm backlogging the rest of this task.
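
For the ordering and search items above, a hypothetical Django queryset helper (the model field names are assumptions about Trinket's models):

    from django.db.models import Q

    def search_datasets(queryset, query):
        # Match the query against name or description, newest first.
        if query:
            queryset = queryset.filter(
                Q(name__icontains=query) | Q(description__icontains=query)
            )
        return queryset.order_by("-created")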

Feature Identification

At the very first pass, we should be able to identify the datatype of each column.

This should be in the MVP.

Dataset Licensing

Provide a variety of licenses including private ones such that we can manage the use (and fair use) of the datasets in our repository.

Implement Hierarchical Clustering

We need to have some implementation of clustering that we can use to build other features on.

Hierarchical is useful because it is ideal for brushing, zooming, and filtering.
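
A minimal sketch using SciPy's agglomerative implementation; SciPy and the linkage method are assumptions here, not settled choices:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def cluster_rows(X, num_clusters=5):
        # Build the full dendrogram; cutting it at different depths
        # yields coarser or finer clusterings, which is what makes
        # brushing, zooming, and filtering cheap.
        Z = linkage(np.asarray(X, dtype=float), method="ward")
        return fcluster(Z, t=num_clusters, criterion="maxclust")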

Clean up/Organize Documentation

Since we're going to publish this, maybe add some of the descriptive content we have from the SBIR and make the docs look good and ready to go.

Create % complete for view of dataset

Create a progress bar that describes how complete a dataset description is in terms of choosing the license, the description, field categories, etc.
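
A minimal sketch of the score behind such a progress bar; the field list is an assumption about the dataset model:

    def completeness(dataset):
        # Fraction of the metadata fields the issue mentions that
        # have been filled in, as a whole percentage.
        fields = (dataset.license, dataset.description, dataset.categories)
        filled = sum(1 for field in fields if field)
        return int(100.0 * filled / len(fields))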

Implement beta auto analysis

Create a module in Trinket for auto analysis (we need a good name for it). It should expose a single function or class at its root that can be used within a Celery task.

This function/class should take as input a file-like object and generic keyword arguments (**kwargs).

As output, the function should return a tuple/list whose length is the (maximum) number of columns in the dataset, and whose values contain the datatype of each column, ordered by column index.

Other stuff:

  • No third party dependencies except unicodecsv and numpy.
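
A sketch of the described interface; the function name and the type-guessing rules are placeholders, and it respects the dependency constraint above by using only the standard library and unicodecsv:

    import unicodecsv

    def guess_type(value):
        # Guess a single cell: integer, real, or categorical.
        for cast, name in ((int, "integer"), (float, "real")):
            try:
                cast(value)
                return name
            except (TypeError, ValueError):
                pass
        return "categorical"

    def analyze(fileobj, **kwargs):
        # Accumulate the set of guessed types seen in each column.
        reader = unicodecsv.reader(fileobj, **kwargs)
        columns = []
        for row in reader:
            while len(columns) < len(row):
                columns.append(set())
            for idx, value in enumerate(row):
                columns[idx].add(guess_type(value))
        # A column with mixed guesses falls back to the loosest type.
        order = ("categorical", "real", "integer")
        return [next((t for t in order if t in seen), "unknown")
                for seen in columns]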

Dimension Histograms and Ranking: 1D

Create 1D (histograms) and 2D (ranking) plots for features or pairs of features. This ranking should be interactive and allow the user to explore datasets.

Optimize the distance function used for Clustering

The system should help the user determine the most appropriate distance function based on the data.

We propose running a general distance function and using an optimization to find the best parameters.

We could also have user input.

Dropdown Dataset Edit Form

On the dataset detail view, the edit button should display a dropdown form for editing the details of the dataset - it shouldn't simply redirect to the admin screen.

Put in a Feedback loop

As people interact with Trinket, an automatic analysis tool should log what is being used and what is not.
