entrofy's Issues

Use for-loop instead of list comprehension

entrofy/core.py has

    # Run the specified number of randomized trials
    results = [__entrofy(df_binary.values, n, rng,
                         w=target_weight,
                         q=target_prob,
                         pre_selects=pre_selects_i,
                         quantile=quantile,
                         alpha=alpha)
               for _ in range(n_trials)]

    # Select the trial with the best score
    max_score, best = results[0]
    for score, solution in results[1:]:
        if score > max_score:
            max_score = score
            best = solution

This could be replaced with

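    # Run the specified number of randomized trials, keeping the best one.
    # Start from -inf so the first trial always becomes the initial best,
    # even when every score is zero or negative.
    max_score, best = float("-inf"), None
    for _ in range(n_trials):
        score, solution = __entrofy(df_binary.values, n, rng,
                                    w=target_weight,
                                    q=target_prob,
                                    pre_selects=pre_selects_i,
                                    quantile=quantile,
                                    alpha=alpha)
        if score > max_score:
            max_score = score
            best = solution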

Doing tests

I started an IPython notebook to do some tests. I'd like to put it in the notebook folder, but I can't figure out how to import entrofy in the notebook using a relative path. Does anyone know how to do it?
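One common workaround is to put the repository root on `sys.path` before importing. The sketch below assumes the notebook lives in a folder directly beneath the repository root, next to the `entrofy` package; the helper name `add_parent_to_path` is made up for illustration.

```python
import sys
from pathlib import Path


def add_parent_to_path(notebook_dir="."):
    """Put the parent of `notebook_dir` on sys.path and return it.

    Assumption: the parent directory is the repository root that
    contains the `entrofy` package.
    """
    repo_root = str(Path(notebook_dir).resolve().parent)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root


# In the first cell of the notebook (the working directory is the
# notebook's own folder when Jupyter starts it from there):
repo_root = add_parent_to_path(".")
# import entrofy  # should now resolve, if the layout assumption holds
```

Installing the package in development mode (`pip install -e .` from the repository root) avoids the path manipulation altogether, if that is an option.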

A better name?

I figure most potential users of this thing won't have a clue what entropy means.

Can we come up with something a bit more people-friendly?

Python 3 for review

@dhuppenkothen I am wondering about the Python 3 PR #59

We are going to be using this for Waterhackweek selection next month.

Is this your preferred citation?

Huppenkothen, D., McFee, B., & Norén, L. (2019). Entrofy Your Cohort: A Data Science Approach to Candidate Selection. arXiv preprint arXiv:1905.03314.

Thanks!

--

LinAlgError: singular matrix after 2to3

I ran 2to3 (see rgaiacs@1e10d5d), but when I reran notebooks/tutorial.ipynb I got

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-9-00bdbb9c2e31> in <module>()
      3 ax1 = entrofy.plotting.plot_correlation(df, "age", "age", ax=ax1,
      4                                         xtype="continuous",
----> 5                                         ytype="continuous", cont_type="kde")
      6 ax2 = entrofy.plotting.plot_correlation(df, "age", "age", ax=ax2,
      7                                         xtype="continuous",

/home/raniere/anaconda3/lib/python3.5/site-packages/entrofy-0.0.0-py3.5.egg/entrofy/plotting.py in plot_correlation(df, xlabel, ylabel, xmapper, ymapper, ax, xtype, ytype, cmap, prefac, cat_type, cont_type, s)
    515     elif ((xtype == "continuous") & (ytype == "continuous")):
    516         ax = _plot_continuous(df, xlabel, ylabel, ax, plottype=cont_type,
--> 517                               n_levels=10, cmap="YlGnBu", shade=True)
    518 
    519     else:

/home/raniere/anaconda3/lib/python3.5/site-packages/entrofy-0.0.0-py3.5.egg/entrofy/plotting.py in _plot_continuous(df, xlabel, ylabel, ax, plottype, n_levels, cmap, shade)
    368     if plottype == "kde":
    369         sns.kdeplot(x_clean, y_clean, n_levels=n_levels, shade=shade,
--> 370                     ax=ax, cmap=cmap)
    371 
    372     elif plottype == "scatter":

/home/raniere/anaconda3/lib/python3.5/site-packages/seaborn-0.8.1-py3.5.egg/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
    651         ax = _bivariate_kdeplot(x, y, shade, shade_lowest,
    652                                 kernel, bw, gridsize, cut, clip, legend,
--> 653                                 cbar, cbar_ax, cbar_kws, ax, **kwargs)
    654     else:
    655         ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,

/home/raniere/anaconda3/lib/python3.5/site-packages/seaborn-0.8.1-py3.5.egg/seaborn/distributions.py in _bivariate_kdeplot(x, y, filled, fill_lowest, kernel, bw, gridsize, cut, clip, axlabel, cbar, cbar_ax, cbar_kws, ax, **kwargs)
    383         xx, yy, z = _statsmodels_bivariate_kde(x, y, bw, gridsize, cut, clip)
    384     else:
--> 385         xx, yy, z = _scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)
    386 
    387     # Plot the contours

/home/raniere/anaconda3/lib/python3.5/site-packages/seaborn-0.8.1-py3.5.egg/seaborn/distributions.py in _scipy_bivariate_kde(x, y, bw, gridsize, cut, clip)
    442     """Compute a bivariate kde using scipy."""
    443     data = np.c_[x, y]
--> 444     kde = stats.gaussian_kde(data.T)
    445     data_std = data.std(axis=0, ddof=1)
    446     if isinstance(bw, string_types):

/home/raniere/anaconda3/lib/python3.5/site-packages/scipy/stats/kde.py in __init__(self, dataset, bw_method)
    170 
    171         self.d, self.n = self.dataset.shape
--> 172         self.set_bandwidth(bw_method=bw_method)
    173 
    174     def evaluate(self, points):

/home/raniere/anaconda3/lib/python3.5/site-packages/scipy/stats/kde.py in set_bandwidth(self, bw_method)
    497             raise ValueError(msg)
    498 
--> 499         self._compute_covariance()
    500 
    501     def _compute_covariance(self):

/home/raniere/anaconda3/lib/python3.5/site-packages/scipy/stats/kde.py in _compute_covariance(self)
    508             self._data_covariance = atleast_2d(np.cov(self.dataset, rowvar=1,
    509                                                bias=False))
--> 510             self._data_inv_cov = linalg.inv(self._data_covariance)
    511 
    512         self.covariance = self._data_covariance * self.factor**2

/home/raniere/anaconda3/lib/python3.5/site-packages/scipy/linalg/basic.py in inv(a, overwrite_a, check_finite)
    817         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
    818     if info > 0:
--> 819         raise LinAlgError("singular matrix")
    820     if info < 0:
    821         raise ValueError('illegal value in %d-th argument of internal '

LinAlgError: singular matrix

for

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,6))

ax1 = entrofy.plotting.plot_correlation(df, "age", "age", ax=ax1,
                                        xtype="continuous",
                                        ytype="continuous", cont_type="kde")
ax2 = entrofy.plotting.plot_correlation(df, "age", "age", ax=ax2,
                                        xtype="continuous",
                                        ytype="continuous", cont_type="scatter")
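One plausible reading of the traceback, not confirmed in this issue: both axes use the same column ("age"), so the two variables passed to the KDE are perfectly correlated and their 2x2 covariance matrix cannot be inverted. The sketch below reproduces that with made-up numbers in plain Python; `covariance_2x2` and `determinant_2x2` are illustrative helpers, not part of entrofy, seaborn, or scipy.

```python
def covariance_2x2(x, y):
    """Sample covariance matrix of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cxx = sum((a - mx) * (a - mx) for a in x) / (n - 1)
    cyy = sum((b - my) * (b - my) for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return [[cxx, cxy], [cxy, cyy]]


def determinant_2x2(m):
    """Determinant of a 2x2 matrix; zero means the matrix is singular."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


age = [23.0, 31.0, 27.0, 45.0, 38.0]   # made-up values
other = [1.0, 5.0, 2.0, 8.0, 3.0]      # made-up, not a copy of `age`

det_same = determinant_2x2(covariance_2x2(age, age))      # singular case
det_other = determinant_2x2(covariance_2x2(age, other))   # invertible case
```

If that reading is right, the error is unrelated to 2to3, and plotting two different continuous columns should avoid it.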

Change colours on bars?

I know that the bars don't work yet, but when we get them to work, we should change their colours (at least away from red and green) because of colour blindness. I'll find out which colours would be good to use.

Automatic generation of max-entropy targets

Since we're already parsing the input table to binarize columns, we know which columns can be grouped into mutual exclusion sets. This way, instead of blindly initializing the target distribution to 0.5 for each column, we can normalize by the grouping instead.

This should help give better first-pass answers when the input set has gross imbalances with multiple values in a particular category.
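A minimal sketch of the idea, assuming "normalize by the grouping" means that each binarized column in a mutually exclusive group of size g gets a default target of 1/g. The `groups` mapping and the column names are hypothetical; the real binarization step may name things differently.

```python
def max_entropy_targets(groups):
    """Give every column in a mutual-exclusion group an equal share.

    `groups` maps a category name to the list of binarized columns
    derived from it (hypothetical structure, for illustration only).
    """
    targets = {}
    for columns in groups.values():
        share = 1.0 / len(columns)
        for column in columns:
            targets[column] = share
    return targets


groups = {
    "gender": ["gender_female", "gender_male"],
    "career": ["career_student", "career_postdoc",
               "career_faculty", "career_other"],
}
targets = max_entropy_targets(groups)
```

With these groups the two gender columns default to 0.5 each and the four career columns to 0.25 each, instead of 0.5 across the board.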

Use quantiles for tie-breaking

Instead of degrading the top score by a little bit and then finding all candidates above that score, we should be using quantiles in __entrofy for that task, which should be more stable and reliable.
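Roughly what quantile-based tie-breaking could look like, as a sketch only: the internals of `__entrofy` are not shown here, so `quantile` and `pick_candidate` below are stand-ins written in plain Python.

```python
import random


def quantile(values, q):
    """Linear-interpolation quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def pick_candidate(scores, q, rng):
    """Pick uniformly among candidates scoring at or above the q-quantile."""
    threshold = quantile(scores, q)
    eligible = [i for i, s in enumerate(scores) if s >= threshold]
    return rng.choice(eligible)


scores = [0.10, 0.40, 0.35, 0.40, 0.05]   # made-up marginal gains
rng = random.Random(0)
strict_pick = pick_candidate(scores, 1.0, rng)   # only the tied maxima
loose_pick = pick_candidate(scores, 0.5, rng)    # anything above the median
```

With q=1.0 only the tied top candidates (indices 1 and 3) are eligible; lowering q widens the pool without depending on an arbitrary score offset.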

Numerical error in _check_probabilities

Is it a problem that I get this error:


RuntimeError: <entrofy.mappers.ObjectMapper object at 0x12d9b5128> total target probability 1.000000000000001 > 0

Seems like I should be able to get away with that :-)
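The total here exceeds 1 only by floating-point rounding, so one option is to compare against 1 with a small tolerance. This is a sketch of that idea, not the actual body of `_check_probabilities`; the function name and the default `tol` are assumptions.

```python
import math


def check_total_probability(probabilities, tol=1e-9):
    """Raise only if the total exceeds 1 by more than `tol`."""
    total = math.fsum(probabilities)
    if total > 1.0 + tol:
        raise RuntimeError(
            "total target probability {} > 1".format(total))
    return total


# The value from the error message above now passes the check.
total = check_total_probability([1.000000000000001])
```

A genuinely inconsistent set of targets, such as 0.7 and 0.6, would still raise.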

Entrofy control widgets

I'm thinking these would look nice inside a drawer panel.

The sliders will have to be generated dynamically, since the labels depend on the table columns.

Pandas FutureWarning

df.loc[nonnulls, '{}{}'.format(self.prefix, key)] = column[nonnulls].apply(self._map[key])

We need to add .astype(float) after apply:

for key, mapper_ in self._map.items():
    df.loc[nonnulls, '{}{}'.format(self.prefix, key)] = column[nonnulls].apply(mapper_).astype(float)
    df.loc[~nonnulls, '{}{}'.format(self.prefix, key)] = None

Think carefully about automatic determination of data types

We need to think a little more carefully about the automatic determination of data types.
At the moment, it assumes any column with floats is a continuous variable and any other column is a categorical variable. This may not always be the case (e.g. there could be a column of integers that should be used as a continuous variable, or a column of, say, three distinct float values that designate categories).
We need to think carefully about how to handle this. One option is not to handle it at all and instead let the user pass in a dictionary of strings with the relevant data types.
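To make the trade-off concrete, here is a sketch that keeps the float-based guess described above but lets a user-supplied dictionary override it per column. The function name `infer_column_types` and the plain-dict table are illustrative; the real code works on a pandas DataFrame.

```python
def infer_column_types(table, overrides=None):
    """Guess "continuous" or "categorical" per column, honouring overrides.

    `table` maps column names to lists of values; `overrides` maps
    column names to an explicit type string chosen by the user.
    """
    overrides = overrides or {}
    types = {}
    for name, values in table.items():
        if name in overrides:
            types[name] = overrides[name]          # the user knows best
        elif any(isinstance(v, float) for v in values):
            types[name] = "continuous"             # current heuristic
        else:
            types[name] = "categorical"
    return types


table = {
    "age": [23, 31, 45],             # integers, but really continuous
    "rating": [1.0, 2.0, 3.0],       # floats, but really three categories
    "stage": ["student", "postdoc", "faculty"],
}
types = infer_column_types(
    table, overrides={"age": "continuous", "rating": "categorical"})
```

Both problem cases from the description are fixed by the overrides, while columns the user says nothing about still get the automatic guess.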

Weights and targets need some thought

In the current version, the sliders don't add up to one for each category: it is in principle possible to put the same value (say 80%) on all options of a given category, which doesn't make intrinsic sense (we can't have 80% students, 80% postdocs and 80% senior faculty).
Users might not have a good intuition for what the underlying code will do in this case, or for how it prioritizes the different options within the category. We should think about how to document this, or how to show the user what it's doing.
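One way to make the behaviour explicit would be to rescale the slider values within each category so they sum to one before running the model. This is only a sketch of that option, not what the current code does; `normalize_category` is a made-up name.

```python
def normalize_category(targets):
    """Rescale one category's targets so they sum to 1."""
    total = sum(targets.values())
    if total <= 0:
        raise ValueError("at least one target must be positive")
    return {option: value / total for option, value in targets.items()}


# The example from above: 80% on every option of one category.
sliders = {"student": 0.8, "postdoc": 0.8, "senior faculty": 0.8}
normalized = normalize_category(sliders)
```

Here every option ends up at one third, which at least tells the user that equal slider values mean equal shares.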

Update readme

I think it would be worth updating the readme to be more in line with the message in the newer version of the paper.

Numba acceleration?

I'm trying to run entrofy on a dataset of 30M rows, and it's... slow. Part of this is unavoidable given that the algorithm is Ω(kn), but I think there's a good amount of pythonic overhead that we could clear out with numba acceleration.

I'll play around with this to see if it significantly improves speed, but broader question: is it worth putting numba in our dependency chain? It's pretty heavy, and would only matter for weird edge cases like the one I'm currently in.

jquery.js:1496 Uncaught Error: Syntax error, unrecognized expression

When I load

$ cat applications.csv 
Name,Gender,Home institution,Career Stage
A,Male,X (Y-Z),Phase 2
B,Female,U1,Phase 2
C,Male,U2,Phase 3
D,Male,U3,Phase 3

and try to run entrofy, I get

jquery.js:1496 Uncaught Error: Syntax error, unrecognized expression: .target#Home institution_X (Y-Z)
    at Function.ga.error (jquery.js:1496)
    at ga.tokenize (jquery.js:2113)
    at ga.select (jquery.js:2517)
    at Function.ga [as find] (jquery.js:893)
    at m.fn.init.find (jquery.js:2733)
    at new m.fn.init (jquery.js:2850)
    at m (jquery.js:73)
    at p:255
    at Array.map (<anonymous>)
    at get_targets (p:253)

as a JavaScript error. The error is only visible in the web browser's developer console, which made me wonder whether the software was working (I will create another issue for this).

I changed applications.csv to be

Name,Gender,Home institution,Career Stage
A,Male,X Y-Z,Phase 2
B,Female,U1,Phase 2
C,Male,U2,Phase 3
D,Male,U3,Phase 3

and I got a different error:

Traceback (most recent call last):
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/raniere/anaconda3/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/raniere/SSI/src/entrofy/app/server.py", line 53, in sample
    pre_selects)
  File "/home/raniere/SSI/src/entrofy/app/entrofy.py", line 202, in process_table
    score, rows = entrofy(X, k, q=np.asarray([float(_) for _ in q]),
  File "/home/raniere/SSI/src/entrofy/app/entrofy.py", line 202, in <listcomp>
    score, rows = entrofy(X, k, q=np.asarray([float(_) for _ in q]),
TypeError: float() argument must be a string or a number, not 'NoneType'

Hypothesis

Special characters in the values of some columns are causing JavaScript issues. So far, I have had problems with (, ) and /.
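If the hypothesis holds, one server-side mitigation is to build element ids from a restricted character set instead of the raw column and value text. The sketch below assumes ids are formed as "column_value", which is what the failing selector `.target#Home institution_X (Y-Z)` suggests; `safe_dom_id` is a hypothetical helper.

```python
import re


def safe_dom_id(column, value):
    """Replace every character outside [A-Za-z0-9_-] with an underscore."""
    raw = "{}_{}".format(column, value)
    return re.sub(r"[^A-Za-z0-9_-]", "_", raw)


# The value that triggered the jQuery syntax error above.
element_id = safe_dom_id("Home institution", "X (Y-Z)")
```

Distinct values can collapse to the same id under this scheme (for example "X (Y-Z)" and "X [Y-Z]"), so a real fix might append a counter or hash. On the client side, jQuery 3.0 and later also provide `$.escapeSelector` for escaping such strings.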

Flask backend

Just set up the bare bones of the flask server.

Back-end functions we'll need to implement:

  • Upload/parse csv, return json. This will make our lives easier down the road
  • Run the model
  • Compute statistics of the data
  • Maybe session-level caching of data, to cut down on io/traffic
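The first item could start from something as small as the function below, which uses only the standard library. The payload shape (`columns` plus `rows`) is an assumption, and wiring it into a Flask route is left out.

```python
import csv
import io
import json


def csv_to_json(text):
    """Parse CSV text and return a JSON string with columns and rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    return json.dumps({"columns": columns, "rows": rows})


sample = (
    "Name,Gender,Home institution,Career Stage\n"
    "A,Male,X (Y-Z),Phase 2\n"
    "B,Female,U1,Phase 2\n"
)
payload = json.loads(csv_to_json(sample))
```

Every value comes back as a string, so type handling would still have to happen after parsing.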

Document __entrofy arguments

__entrofy() has some arguments whose names are one letter long. Could this function be better documented, or use more meaningful argument names?

Table code

The current implementation uses bootstrap-tables, which is okay, but has some strange behavior when it comes to pagination and exporting of selected rows.

Another option is DataTables; the API is pretty similar, and I think it would do pretty much what we need.

Multi-quantile search

This thought has been kicking around in my head for a while, and I wanted to write it down before we lose track of it.

Problem: setting quantile<1 breaks the performance guarantee of the method, but also enables it to avoid local optima. Maybe it's possible to avoid local optima while preserving the guarantee?

Proposed solution: Rather than running all n_trials with the same quantile value, we could allocate the trials across various quantile thresholds min_quantile <= q <= 1. If we always have the q=1 case in there (i.e., the deterministic/strict greedy method), then the Monte Carlo algorithm can always select it as a viable solution, and we preserve the performance guarantee.

We probably wouldn't want to distribute n_trials uniformly over the quantile range, since the larger q gets, the less variability there will be in the resulting solutions, so over-sampling here is wasted effort. Conversely, we'd do better to explore more thoroughly for lower q values. My gut says that distributing the samples geometrically over the quantile range ought to work well, and should be relatively easy to implement.
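A sketch of one reading of "geometrically": space the quantiles with a constant ratio between min_quantile and 1, which packs more trials near the low end, and always end on q=1. The function name `quantile_schedule` is hypothetical, and other geometric schemes would fit the description equally well.

```python
def quantile_schedule(n_trials, min_quantile):
    """Return n_trials quantiles from min_quantile up to exactly 1.0.

    Consecutive values share a constant ratio, so the gaps grow with q
    and the low-quantile region is sampled more densely.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if not 0 < min_quantile <= 1:
        raise ValueError("min_quantile must be in (0, 1]")
    if n_trials == 1:
        return [1.0]  # always keep the strict greedy case
    ratio = 1.0 / min_quantile
    schedule = [min_quantile * ratio ** (i / (n_trials - 1))
                for i in range(n_trials - 1)]
    schedule.append(1.0)  # set exactly, avoiding rounding error
    return schedule


schedule = quantile_schedule(5, 0.5)
```

Each trial would then run with its own quantile from the schedule, and the best-scoring trial is kept as before.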
