oudalab / bismol

The Bismol classification system prototype

License: Other

Python 37.44% HTML 1.13% JavaScript 7.13% CSS 0.37% Shell 0.83% Jupyter Notebook 25.87% R 24.95% Dockerfile 2.06% Pug 0.22%

bismol's Issues

Fix interfaces

With the jobandworker branch merge, all interface fields previously called 'url' should be changed to 'id'

Add tags field to message object

Message object will need a list of tags to be modified for learning/training/classification
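
A minimal sketch of what the tags field could look like, assuming the message object can be expressed as a dataclass. The class below is a hypothetical stand-in, not the actual bismol message object, and add_tag is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    """Hypothetical stand-in for the bismol message object."""
    text: str
    # Mutable list of labels used during learning/training/classification.
    # default_factory gives every message its own list.
    tags: List[str] = field(default_factory=list)

    def add_tag(self, tag: str) -> None:
        # Skip duplicates while preserving insertion order.
        if tag not in self.tags:
            self.tags.append(tag)


msg = Message(text="went for a run this morning")
msg.add_tag("exercise")
msg.add_tag("exercise")  # ignored, already present
```

Using default_factory rather than a shared default list keeps tags on one message from leaking into another.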

Rate Limiting on Illness/Exercise Ingestors

The USA/Africa collectors are fine; however, the illness and exercise collectors are getting HTTP 420 responses from Twitter.
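
One possible mitigation is to back off exponentially before reconnecting after a 420. Twitter's streaming guidance has suggested starting at about one minute and doubling on each attempt. The sketch below only computes the wait times. The function name, the cap of 960 seconds, and the defaults are assumptions for illustration, and wiring it into the ingestors is left out.

```python
import itertools
from typing import Iterator


def backoff_delays(initial: float = 60.0, factor: float = 2.0,
                   maximum: float = 960.0) -> Iterator[float]:
    """Yield wait times in seconds: initial, initial*factor, ... capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


# Wait times for the first five consecutive 420 responses.
first_five = list(itertools.islice(backoff_delays(), 5))
```

An ingestor would sleep for the next delay after each 420 and start a fresh generator once a connection succeeds.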

Collecting tweets over the continent of Africa

We can collect tweets for the continent of Africa but we need to make a few decisions:

Which languages will we accept? English only?
Which bounding box(es) are considered Africa?
Should we include (0°, 0°)?
Should we use multiple ingestors for different regions?
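
To make the bounding-box and (0°, 0°) questions concrete, here is a rough sketch of a client-side filter. The box coordinates are coarse placeholders, not a decided boundary, and the function names are hypothetical. Boxes are written as (west longitude, south latitude, east longitude, north latitude).

```python
from typing import List, Tuple

# (west_lon, south_lat, east_lon, north_lat)
Box = Tuple[float, float, float, float]

AFRICA_BOXES: List[Box] = [
    (-18.0, -35.0, 52.0, 38.0),  # single coarse box; placeholder values only
]


def in_any_box(lon: float, lat: float, boxes: List[Box]) -> bool:
    """Return True if the point falls inside at least one bounding box."""
    return any(w <= lon <= e and s <= lat <= n for (w, s, e, n) in boxes)


def keep_tweet(lon: float, lat: float, exclude_origin: bool = True) -> bool:
    """Filter a geotagged tweet; optionally drop the exact (0, 0) point."""
    if exclude_origin and lon == 0.0 and lat == 0.0:
        return False
    return in_any_box(lon, lat, AFRICA_BOXES)
```

Splitting the continent into several regional boxes would only mean adding entries to AFRICA_BOXES, so the same check works whether we use one ingestor or several.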

Thanks!

Checking for unknown file

When running on local system, I encounter this error:
FileNotFoundError: [Errno 2] No such file or directory: '/data/tweetsdb/tweet_health_20161231151722.json'
Should the code be checking for this file?
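
If the file is optional on a local system, one option is to guard the read. This is a sketch only: the helper name is hypothetical, and it assumes the file holds a single JSON document, which may not match the real format.

```python
import json
import os
from typing import Any, Optional


def load_tweets_if_present(path: str) -> Optional[Any]:
    """Return parsed JSON from path, or None when the file does not exist."""
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```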

Verbs and exclusion terms

We may want to link the following verbs to the following exclusion terms in the exercise data collection queries:

Run/running/ran (-wild, -late, -errands, -water)
Hike/hiking/hiked (-rent, -wage, -fare)
Yoga (-pants)
Pool (-union, -uber)
Walk/walking/walked (-dead, -into, -in, -ins)
Surf/surfing/surfed (-internet, -web, -crowd, -channel)

I think that should capture most of the more ambiguous terms without returning a lot of irrelevant data!
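
Whether the exclusions go into the Twitter query itself or get applied after collection is still open. As a sketch of the second option, the filter below keeps a tweet for an activity only if it contains one of the verb forms and none of the linked exclusion terms. It uses a subset of the rows above, and the names RULES and matched_activities are made up for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# activity -> (verb forms, exclusion terms)
RULES: Dict[str, Tuple[List[str], List[str]]] = {
    "run": (["run", "running", "ran"], ["wild", "late", "errands", "water"]),
    "hike": (["hike", "hiking", "hiked"], ["rent", "wage", "fare"]),
    "yoga": (["yoga"], ["pants"]),
}


def words(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matched_activities(text: str) -> List[str]:
    """Return activities whose verbs appear without any of their exclusion terms."""
    tokens = words(text)
    hits = []
    for name, (verbs, exclusions) in RULES.items():
        if tokens & set(verbs) and not tokens & set(exclusions):
            hits.append(name)
    return hits
```

Matching on whole words means "running late" is dropped but "relate" would not trigger the "late" exclusion by accident.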

How does a user distinguish points of interest

Points of interest are ambiguous or ill-defined points that the algorithm cannot cleanly cluster. Humans can help cluster these points to reduce the total clustering time.
To implement this we can do the following.

Keep track of the distance traveled by each point during the clustering process.
Identify the furthest and least traveled points. We can use a fixed number, a percentage, or a quartile to pick the top and bottom.
In addition to moving points, we can allow the user to (1) scatter or (2) accept the current position. Scatter randomizes the position, allowing the clustering process to reset, and accept stops the movement of a particular datapoint.
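
The first two steps could be sketched as follows, assuming we can record each point's 2D position at every iteration. The history layout and function names are assumptions, and the fixed-number variant (top and bottom k) is shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def total_distances(history: Sequence[Sequence[Point]]) -> List[float]:
    """history[t][i] is the position of point i at iteration t."""
    n_points = len(history[0])
    totals = [0.0] * n_points
    for prev, curr in zip(history, history[1:]):
        for i in range(n_points):
            totals[i] += math.dist(prev[i], curr[i])
    return totals


def points_of_interest(totals: Sequence[float], k: int) -> Dict[str, List[int]]:
    """Return indices of the k least-traveled and k most-traveled points."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    return {"least": order[:k], "most": order[len(order) - k:][::-1]}


history = [
    [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)],
    [(0.0, 0.0), (1.0, 2.0), (8.0, 9.0)],
    [(0.0, 0.0), (1.0, 3.0), (8.0, 9.0)],
]
totals = total_distances(history)
poi = points_of_interest(totals, k=1)
```

Switching to a percentage or quartile cut would only change how k is derived from the number of points.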

The scatter and accept functions can lead us into the beginning of streaming data.

Link commits related to this point by adding the issue number to the commit message.

Particles are moving quickly

The particles in the t-SNE animation are moving quickly and we need to understand a little more about the movement.

Show visual trails of the particles moving in t-SNE.
Measure and possibly show the delay between changes and updates.

Design Experiment for Users

In order to understand user perceptions of assisting with the clustering process we will need to design an experiment. This experiment will test the ability of humans to assist the clustering process and allow us to understand if humans are helping and how much they think they are helping.

We will also need to design a small training module to get users used to the clustering method.

Turn bismol into a proper python package

We want to make bismol a package that can be built and installed with pip or easy_install.

Here is a guide for achieving this: https://python-packaging.readthedocs.org/en/latest/

Make python3 required
Add the list of requirements so they can be automatically downloaded https://caremad.io/2013/07/setup-vs-requirement/
Make it easy to launch the full pipeline
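
As a starting point, the three items above map onto setuptools keyword arguments roughly like this. Every value is a placeholder: the version, the package layout, and the bismol.cli:main entry point are hypothetical and would need to match the real code.

```python
def setup_kwargs() -> dict:
    """Arguments for setuptools.setup(); every value here is a placeholder."""
    return {
        "name": "bismol",
        "version": "0.0.1",            # hypothetical version
        "packages": ["bismol"],        # assumes a top-level bismol/ package directory
        "python_requires": ">=3",      # makes python3 required
        "install_requires": [],        # runtime dependencies go here
        "entry_points": {
            # hypothetical console command that launches the full pipeline
            "console_scripts": ["bismol=bismol.cli:main"],
        },
    }


# In setup.py this would be used as:
#     from setuptools import setup
#     setup(**setup_kwargs())
```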

Remove large objects from the repo

We committed some very large objects to the repo. I believe they are the RethinkDB logs. We can remove them using something like git filter-branch --tree-filter 'rm -f filename' HEAD, where filename is the name of the large file (the -f keeps rm from failing on commits where the file is absent). But you should double-check that this command does what we expect (and doesn't delete the whole repo).
Also, it is good practice to git add files one by one so you don't include any extra log or tmp files.

Support methods of highlighting and gesturing

If we want to select multiple

If you select an item, temporarily grey out everything else so we can see its new movement.
When clicking and holding, the capture radius for items grows with the length of time the screen is held.
Show comet trails so we can see the direction of the particles
Add a gesture for multi select and un-selecting

Others?

Define functional relation between Job and worker

Worker should manage:

input and output of data
model training with job
classification with job
other functions as needed (training cycles, etc.)

And Job should provide functions:

training(**kwargs) where kwargs is a dictionary of named arguments and values
run classifier(iterator, stop_condition), where the iterator is "live", i.e. it can be modified and added to while it is being iterated over. The iterator should yield message objects, and the stop condition can be changed in real time by the worker. This should return another iterator of classified message objects.
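
One way this split could look, as a sketch rather than a settled design: the worker owns a queue and a stop flag, and the job only sees an iterator and a callable. The method name run_classifier, the live_iterator helper, and the toy KeywordJob are all assumptions, and plain strings stand in for message objects.

```python
import queue
import threading
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterator, Tuple


class Job(ABC):
    """Interface sketched from the description above."""

    @abstractmethod
    def training(self, **kwargs: Any) -> None:
        """Train the model from named arguments."""

    @abstractmethod
    def run_classifier(self, iterator: Iterator[str],
                       stop_condition: Callable[[], bool]) -> Iterator[Tuple[str, bool]]:
        """Yield classified messages until the input ends or the worker stops us."""


def live_iterator(inbox: "queue.Queue[str]",
                  stop_condition: Callable[[], bool],
                  poll_seconds: float = 0.05) -> Iterator[str]:
    """Yield messages as they arrive; finish once inbox is empty and stop is requested."""
    while True:
        try:
            yield inbox.get(timeout=poll_seconds)
        except queue.Empty:
            if stop_condition():
                return


class KeywordJob(Job):
    """Toy job (hypothetical): flags messages containing any trained keyword."""

    def __init__(self) -> None:
        self.keywords: set = set()

    def training(self, **kwargs: Any) -> None:
        self.keywords.update(kwargs.get("keywords", []))

    def run_classifier(self, iterator, stop_condition):
        for text in iterator:
            if stop_condition():
                return
            yield text, any(word in text.lower() for word in self.keywords)


# Worker side: owns the inbox and the stop flag.
inbox: "queue.Queue[str]" = queue.Queue()
stop = threading.Event()
job = KeywordJob()
job.training(keywords=["yoga", "hike"])

inbox.put("Morning yoga class")
results = job.run_classifier(live_iterator(inbox, stop.is_set), stop.is_set)
first = next(results)
inbox.put("Stuck in traffic")  # added while iteration is in progress
second = next(results)
stop.set()                     # worker changes the stop condition in real time
remaining = list(results)
```

Because both sides are generators, nothing is classified until the worker asks for the next result, which keeps the worker in control of pacing.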

Pulls Tweets with same text and different id's

I found that the labeler keeps pulling the same tweet because our database contains tweets with the same text but different ids, sometimes 20-30 copies or more.
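
A possible workaround on the labeler side is to collapse tweets by normalized text before showing them. This is only a sketch: it assumes each tweet is a dict with "id" and "text" keys, keeps the first copy it sees, and treats texts as equal after lowercasing and collapsing whitespace.

```python
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so near-identical copies match."""
    return " ".join(text.lower().split())


def dedupe_by_text(tweets: Iterable[Dict]) -> List[Dict]:
    """Keep the first tweet for each distinct normalized text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet["text"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


sample = [
    {"id": 1, "text": "Feeling sick today"},
    {"id": 2, "text": "feeling  sick today"},
    {"id": 3, "text": "Went hiking"},
]
unique_ids = [t["id"] for t in dedupe_by_text(sample)]
```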

Add ID field to message object

The first column of the neel data is actually an id. Tweets also have a unique id, see:

http://tkang.blogspot.com/2011/01/tweepy-twitter-api-status-object.html

so maybe it would be a good idea to have that in the message object? Idk, just an idea.

Allow the PostgreSQL docker server to use external volumes that are persistent.
Create a backup system for the tweet database.
Implement public networking between PostgreSQL and any other web host or docker
Upload the customized PostgreSQL Docker service to docker hub.
Create docker service for tweet collector.
Upload docker server for tweet collector to docker hub.
Update passwords and security
