
cultivar's Introduction

Trinket

Multidimensional data explorer and visualization tool.


[Image: "Colorful Wall"]

About

This is a dataset management and visualization tool that is being built as part of the DDL Multidimensional Visualization Research Lab. See: Parallel Coordinates for more on the types of visualizations we're experimenting with.

For more information, please enjoy the documentation found at trinket.readthedocs.org.

Contributing

Trinket is open source, but because this is a District Data Labs project, we would appreciate it if you would let us know how you intend to use the software (other than simply copying and pasting code so that you can use it in your own projects). If you would like to contribute (especially if you are a student or research labs member at District Data Labs), you can do so in the following ways:

  1. Add issues or bugs to the bug tracker: https://github.com/DistrictDataLabs/trinket/issues
  2. Work on a card on the dev board: https://waffle.io/DistrictDataLabs/trinket
  3. Create a pull request in Github: https://github.com/DistrictDataLabs/trinket/pulls

Note that labels in the Github issues are defined in the blog post: How we use labels on GitHub Issues at Mediocre Laboratories.

If you are a member of the District Data Labs Faculty group, you have direct access to the repository, which is set up in a typical production/release/development cycle as described in A Successful Git Branching Model. A typical workflow is as follows:

  1. Select a card from the dev board, preferably one that is "ready", then move it to "in-progress".

  2. Create a branch off of develop called "feature-[feature name]", work and commit into that branch.

     ~$ git checkout -b feature-myfeature develop
    
  3. Once you are done working (and everything is tested) merge your feature into develop.

     ~$ git checkout develop
     ~$ git merge --no-ff feature-myfeature
     ~$ git branch -d feature-myfeature
     ~$ git push origin develop
    
  4. Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.

Throughput

[Throughput graph]

Attribution

The image used in this README, "window#1" by Namelas Frade, is licensed under CC BY-NC-ND 2.0.

Changelog

The release versions that are sent to the Python package index (PyPI) are also tagged in Github. You can see the tags through the Github web application and download the tarball of the version you'd like. Additionally, PyPI will host the various releases of Trinket (eventually).

The versioning uses a three-part "a.b.c" scheme: "a" represents a major release that may not be backwards compatible; "b" is incremented on minor releases that may contain extra features but are backwards compatible; "c" releases are bug fixes or other micro changes that developers should feel free to update to immediately.

Version 0.2

  • tag: v0.2
  • deployment: Wednesday, January 27, 2016
  • commit: (see tag)

This minor update gave a bit more functionality to the MVP prototype, even though the version was intended to have a much more impactful feature set. However, after some study, the workflow is changing, so this development branch is being pruned and deployed in preparation for the next batch. The major achievement of this version is the documentation that discusses our approach, as well as the dataset search and listing page that is now available.

Version 0.1

  • tag: v0.1
  • deployment: Tuesday, October 13, 2015
  • commit: c863e42

MVP prototype of a dataset uploader and management application. This application framework will become the basis for the research project in the DDL Multidimensional Visualization Research Lab. For now, users can upload datasets, manage their descriptions, and preview the first 20 rows.

cultivar's People

Contributors

bbengfort, ojedatony1616, rebeccabilbro, waffle-iron


cultivar's Issues

Large files "hang" uploader

There is no feedback to the user during file upload, so it appears to simply hang. This would hopefully be fixed by issue #3; otherwise some sort of "uploading" animation and blocking will be required.

3D tours

FYI - Ben doesn't want to work on this

AJAXify the uploader

Currently the uploader is just an opacity-0 file input that has been styled with Bootstrap. However, we should use dropzone.js or jQuery File Upload in order to make this process a bit more seamless, including progress bars, updating, etc.

Training vs. Testing Datasets

One interesting thing that is happening is that you have to upload training and testing datasets separately and manage them separately. It seems like they should be in the "same" dataset and managed together, even though they are two different files.

Maybe we can use #5 (related datasets) somehow to implement this functionality?

Should we create training/testing datasets for every single file we upload?

Upload Error: line contains NULL byte

    Internal Server Error: /upload/
    Traceback (most recent call last):
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/base.py", line 71, in view
        return self.dispatch(request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/braces/views/_access.py", line 98, in dispatch
        request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/base.py", line 89, in dispatch
        return handler(request, *args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/views/generic/edit.py", line 215, in post
        return self.form_valid(form)
      File "/app/coffer/views.py", line 50, in form_valid
        form.save()
      File "/app/coffer/forms.py", line 51, in save
        dataset=self.cleaned_data['dataset'], uploader=self.request.user
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 348, in create
        obj.save(force_insert=True, using=self.db)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
        force_update=force_update, update_fields=update_fields)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/base.py", line 758, in save_base
        update_fields=update_fields)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 201, in send
        response = receiver(signal=self, sender=sender, **named)
      File "/app/coffer/signals.py", line 51, in dataset_file_compute
        header = reader.next()
      File "/app/.heroku/python/lib/python2.7/site-packages/unicodecsv/py2.py", line 117, in next
        row = self.reader.next()
    Error: line contains NULL byte
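
The failure is unicodecsv choking on NUL bytes when the upload signal reads the header in dataset_file_compute. A minimal sketch of a guard, assuming we scrub NUL bytes from each raw line before the CSV reader sees them (the helper name is hypothetical):

    import unicodecsv

    def read_header_safely(fileobj):
        # Strip NUL bytes from each raw line; unicodecsv raises
        # "line contains NULL byte" if any slip through.
        cleaned = (line.replace(b'\x00', b'') for line in fileobj)
        reader = unicodecsv.reader(cleaned)
        return next(reader)  # the header row

An alternative is to reject the file with a validation error, since embedded NUL bytes usually mean the upload is not really a text/CSV file.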

500 error on upload w/ missing col/row values

I'm getting an error when I attempt to upload datasets that have missing values in some of the columns/rows. I'm noticing this because a lot of government datasets use the first few rows of a table to provide metadata.

IPython Notebook Download

Provide the ability to access a dataset from an IPython notebook (e.g. pd.read_trinket(url) or something like that) and, furthermore, allow the download of an IPython notebook that sets up a basic analysis for the user.

One idea here is to simply use the Django template engine to write various information into the JSON of a Jupyter notebook then send it to download.
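
One way to realize that idea without templating raw JSON is to build the nbformat v4 structure as a plain dict and serialize it. A minimal sketch, where the function name, the dataset URL parameter, and the pandas call inside the generated cell are all illustrative:

    import json

    def make_starter_notebook(dataset_url):
        # One code cell that loads the dataset into pandas and previews it.
        cell = {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "df = pd.read_csv('%s')\n" % dataset_url,
                "df.head(20)",
            ],
        }
        notebook = {
            "nbformat": 4,
            "nbformat_minor": 0,
            "metadata": {},
            "cells": [cell],
        }
        return json.dumps(notebook, indent=2)

A Django view could then return this string with a Content-Disposition attachment header so the browser downloads it as a .ipynb file.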

Create a search and listing page for datasets

Create a mechanism to search for and filter a list of datasets by their various attributes. The dataset listing page should allow users to easily find and display a particular dataset.

Dataset Overwrite/Versioning System

Right now if you upload a duplicate file, the file is modified on S3 - e.g. its "last modified" timestamp changes. We need to ask some important questions for data management:

  1. Are we simply "touching" the file or are we rewriting it?
  2. What counts as a duplicate on S3? Presumably just the filename, or are we protected by the hash?
  3. Can we use some temporary data store in S3 that gets cleaned regularly for protection?
  4. Should we save datasets according to their hash, then rename on download?

We should make sure that a dataset cannot be overwritten if someone uploads a different dataset with the same name.
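
A minimal sketch of the approach in question 4, assuming a SHA-256 content hash and a hypothetical key scheme: duplicate uploads map to the same key, while a different dataset that happens to share a filename maps to a different key and therefore cannot overwrite the original.

    import hashlib

    def s3_key_for(fileobj, blocksize=65536):
        # Hash the file contents in chunks so large uploads do not
        # have to fit in memory.
        digest = hashlib.sha256()
        for chunk in iter(lambda: fileobj.read(blocksize), b''):
            digest.update(chunk)
        fileobj.seek(0)  # rewind so the upload can re-read the file
        return "datasets/%s" % digest.hexdigest()

The original filename would then live in the database and be restored on download via the Content-Disposition header.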

SSL 404 Redirect

Accessing Trinket on Heroku via http (instead of https) leads to a 404 redirect error when signing in with Google.

Splitting into a matrix

After you do feature detection, if there is no logical matrix (triplet, ID, and index), the system should transform the data and create its own schema.

Preview overruns x-overflow

Datasets with many columns overrun the viewable area creating an awkward display. Perhaps the preview should be its own page.

Standardizing

Setting the mean to zero, scaling, and normalizing the dataset.
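
A minimal NumPy sketch of the column-wise version, assuming a purely numeric matrix:

    import numpy as np

    def standardize(X):
        # Center each column to zero mean and scale to unit variance.
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant columns
        return (X - mu) / sigma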

Add Organizations

Create a stub to allow datasets to be owned by either users or by organizations. We will probably use this more in the future, but it's best to be ready.

Related Datasets

Create a dataset linking mechanism so that we can see related datasets. This process can be intelligent/automatic by inspecting field names or descriptions; otherwise it can be user-driven.
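
A sketch of the automatic variant, assuming a simple Jaccard similarity over column names (this heuristic is an illustration, not an existing Trinket feature):

    def field_similarity(cols_a, cols_b):
        # Jaccard similarity of the two column-name sets; a score
        # near 1.0 suggests the datasets may be related.
        a = set(c.lower() for c in cols_a)
        b = set(c.lower() for c in cols_b)
        if not a or not b:
            return 0.0
        return len(a & b) / float(len(a | b))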

Continuous or Categorical

Trinket should have some feature to determine if data is continuous or categorical.

The system should guess this on behalf of the user; however, the user should ultimately have control.
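
A rough sketch of the guessing step, where the distinct-value threshold is an arbitrary assumption and a user override would always win:

    def guess_measurement(values, max_categories=20):
        # Non-numeric values mean categorical; numeric columns with
        # few distinct values are probably categorical codes too.
        distinct = set(values)
        try:
            for v in distinct:
                float(v)
        except (TypeError, ValueError):
            return "categorical"
        return "continuous" if len(distinct) > max_categories else "categorical"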

Async Upload with Celery

Right now the file upload process is synchronous and the following computations are performed:

  1. Read the file for the hash value
  2. Read the file for its length and number of dimensions
  3. Insert the record/collision into the database
  4. Store on S3 on success

Especially since we're moving to dropzone.js (#3), we might as well make this computation chain asynchronous using Celery tasks. This will allow us to do better detection and also give variable status reports for different file uploads.

This might help some of the problems/questions cited in #7.
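
A sketch of what that chain might look like with Celery; the task names and bodies are placeholders for the four steps above, not the real Trinket task layout:

    from celery import chain, shared_task

    @shared_task
    def compute_hash(upload_id):
        pass  # step 1: read the file for the hash value

    @shared_task
    def compute_stats(upload_id):
        pass  # step 2: read the file for length and dimensions

    @shared_task
    def insert_record(upload_id):
        pass  # step 3: insert the record/collision into the database

    @shared_task
    def store_on_s3(upload_id):
        pass  # step 4: store on S3 on success

    def process_upload(upload_id):
        # Kick off the whole pipeline without blocking the request;
        # .si() makes each step an immutable signature.
        return chain(
            compute_hash.si(upload_id),
            compute_stats.si(upload_id),
            insert_record.si(upload_id),
            store_on_s3.si(upload_id),
        ).apply_async()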

Research Auto-analysis Feature

Auto analysis assigns each column/feature a data type (dtype in the parlance of NumPy and Pandas), e.g. categorical, numeric, real, integer, etc. These types must be automatically inferred from the dataset.

Questions to answer:

  • How does pandas do this?
  • What does column-major mean for Trinket?
  • What types are we looking for?
  • How lightweight/heavyweight must this be?
  • Is there a certain density of data required to make a decision?
  • Do you have to go through the whole dataset to make a decision?
  • Can we use a sample approach to reading the data?
  • How do we detect if there is a header row or not?
  • Can we automatically detect delimiters and quote characters? (e.g. ; vs ,)

Interesting stuff/libraries in: Data Type Recognition/Guessing of CSV data in python
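
For the last two questions, the standard library's csv.Sniffer already implements a usable heuristic for both; a minimal sketch (the candidate delimiters and sample size are assumptions):

    import csv

    def sniff_dialect(fileobj, sample_size=8192):
        # Guess delimiter/quoting and header presence from a sample.
        # Sniffer is a heuristic, so both guesses can be wrong.
        sample = fileobj.read(sample_size)
        fileobj.seek(0)
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample, delimiters=",;\t|")
        has_header = sniffer.has_header(sample)
        return dialect, has_header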

Gather/create datasets for auto analysis validation

We need a variety of datasets to show when our auto analysis technique works, how completely it works, and when it fails.

Consider data sets with:

  • a variety of delimiters and escape characters
  • headers and no headers
  • rows of varying lengths
  • columns of many different data types
  • datasets with errors (multiple datatypes per column)
  • datasets with null values of a variety of types

Dataset Searching

At this point we have the following accomplished:

  • Listing page that displays the most recent uploads in a table
  • Pagination by 20 datasets and the pagination component
  • A button to go to the listing page from the home page

Next up:

  • put navbar links for upload/listing for easy navigation (only if authenticated)
  • create API list endpoint for the datasets
  • create ordering and search for datasets
  • create a form that will allow you to search for datasets.

Other questions include full text search, etc. However, we may need to rethink how this is done given our new workflow, so I'm backlogging the rest of this task.
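
For the ordering and search items above, a hypothetical Django queryset helper (the model field names are assumptions about Trinket's models):

    from django.db.models import Q

    def search_datasets(queryset, query):
        # Match the query against name or description, newest first.
        if query:
            queryset = queryset.filter(
                Q(name__icontains=query) | Q(description__icontains=query)
            )
        return queryset.order_by("-created")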

Feature Identification

At the very first pass, we should be able to identify the datatype of each column.

This should be in the MVP.

Dataset Licensing

Provide a variety of licenses including private ones such that we can manage the use (and fair use) of the datasets in our repository.

Implement Hierarchical Clustering

We need to have some implementation of clustering that we can use to build other features on.

Hierarchical is useful because it is ideal for brushing, zooming, and filtering.
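
A minimal sketch using SciPy's agglomerative implementation; SciPy and the linkage method are assumptions here, not settled choices:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def cluster_rows(X, num_clusters=5):
        # Build the full dendrogram; cutting it at different depths
        # yields coarser or finer clusterings, which is what makes
        # brushing, zooming, and filtering cheap.
        Z = linkage(np.asarray(X, dtype=float), method="ward")
        return fcluster(Z, t=num_clusters, criterion="maxclust")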

Clean up/Organize Documentation

Since we're going to publish this, maybe add some of the descriptive content we have from the SBIR and make the docs look good and ready to go.

Create % complete for view of dataset

Create a progress bar that describes how complete a dataset description is in terms of choosing the license, the description, field categories, etc.
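
A minimal sketch of the score behind such a progress bar; the field list is an assumption about the dataset model:

    def completeness(dataset):
        # Fraction of the metadata fields the issue mentions that
        # have been filled in, as a whole percentage.
        fields = (dataset.license, dataset.description, dataset.categories)
        filled = sum(1 for field in fields if field)
        return int(100.0 * filled / len(fields))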

Implement beta auto analysis

Create a module in Trinket for auto analysis (we need a good name for it). It should expose a single function or class at its root that can be used within a Celery task.

This function/class should take as input a file-like object and generic keyword arguments (**kwargs).

As output, the function should return a tuple/list whose length is the (maximum) number of columns in the dataset, and whose values contain the datatype of each column, ordered by column index.

Other stuff:

  • No third party dependencies except unicodecsv and numpy.
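
A sketch of the described interface; the function name and the type-guessing rules are placeholders, and it respects the dependency constraint above by using only the standard library and unicodecsv:

    import unicodecsv

    def guess_type(value):
        # Guess a single cell: integer, real, or categorical.
        for cast, name in ((int, "integer"), (float, "real")):
            try:
                cast(value)
                return name
            except (TypeError, ValueError):
                pass
        return "categorical"

    def analyze(fileobj, **kwargs):
        # Accumulate the set of guessed types seen in each column.
        reader = unicodecsv.reader(fileobj, **kwargs)
        columns = []
        for row in reader:
            while len(columns) < len(row):
                columns.append(set())
            for idx, value in enumerate(row):
                columns[idx].add(guess_type(value))
        # A column with mixed guesses falls back to the loosest type.
        order = ("categorical", "real", "integer")
        return [next((t for t in order if t in seen), "unknown")
                for seen in columns]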

Dimension Histograms and Ranking: 1D

Create 1D (histograms) and 2D (ranking) plots for features or pairs of features. This ranking should be interactive and allow the user to explore datasets.

Optimize the distance function used for Clustering

The system should help the user determine the most appropriate distance function based on the data.

We propose running a general distance function and using an optimization to find the best parameters.

We could also have user input.

Dropdown Dataset Edit Form

On the dataset detail view, the edit button should display a dropdown form for editing the details of the dataset - it shouldn't simply redirect to the admin screen.

Put in a Feedback loop

As people interact with Trinket, an automatic analysis tool should log what is being used and what is not.
