
gini's Introduction


gini's People

Contributors

lkev, oliviaguest

gini's Issues

Handle negative values less dangerously

Currently, the range is shifted to make values be non-negative:

    if np.amin(array) < 0:
        # Values cannot be negative:
        array -= np.amin(array)

This is dangerous: it silently alters the input and therefore the result. Shifting should instead be controlled by a user-specified function argument with a default value of False, e.g. shift_negative=False.

I know it is a documented assumption that the inputs be positive. Even so, at a minimum it is safer to raise an exception when an assumption is violated than to handle it forcibly.
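A minimal sketch of what this proposal could look like. The `shift_negative` parameter is the name suggested above; the error message is hypothetical and not part of the existing function. The sketch also leaves out the small `0.0000001` offset that the current code applies, so an all-zero input would divide by zero here.

```python
import numpy as np


def gini(array, shift_negative=False):
    """Gini coefficient of a numpy array, with opt-in shifting of negatives."""
    # Work on a flattened float copy so the caller's array is never modified.
    values = np.asarray(array, dtype=float).flatten()
    minimum = np.amin(values)
    if minimum < 0:
        if not shift_negative:
            # Hypothetical message: fail loudly instead of shifting silently.
            raise ValueError(
                "input contains negative values; "
                "pass shift_negative=True to shift them"
            )
        values = values - minimum  # shift so the smallest value becomes 0
    values = np.sort(values)  # the formula requires ascending order
    n = values.shape[0]
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * values) / (n * np.sum(values))
```

With this shape, `gini(np.array([-1, 2, 3]))` raises a `ValueError`, while `gini(np.array([-1, 2, 3]), shift_negative=True)` reproduces the current shifting behaviour.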

Numba Speed Up?

Hi :)

Do you think it would be valuable to add a Numba speed-up to the function?
Since it is clean NumPy code, it should be as easy as adding one decorator.

Some code for reproducibility

from time import time

import numpy as np
from numba import jit

import matplotlib.pyplot as plt

def gini_normal(array):
    """Calculate the Gini coefficient of a numpy array."""
    # based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
    # from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    array = array.flatten() #all values are treated equally, arrays must be 1d
    if np.amin(array) < 0:
        array -= np.amin(array) #values cannot be negative
    array += 0.0000001 #values cannot be 0
    array = np.sort(array) #values must be sorted
    index = np.arange(1,array.shape[0]+1) #index per array element
    n = array.shape[0]#number of array elements
    return ((np.sum((2 * index - n  - 1) * array)) / (n * np.sum(array)))

@jit(nopython=True)
def gini_numba(array):
    """Calculate the Gini coefficient of a numpy array."""
    # based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
    # from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    array = array.flatten() #all values are treated equally, arrays must be 1d
    if np.amin(array) < 0:
        array -= np.amin(array) #values cannot be negative
    array += 0.0000001 #values cannot be 0
    array = np.sort(array) #values must be sorted
    index = np.arange(1,array.shape[0]+1) #index per array element
    n = array.shape[0]#number of array elements
    return ((np.sum((2 * index - n  - 1) * array)) / (n * np.sum(array)))

def profiler(func):
    """Quick and dirty utility func for timing perfromance"""
    timing = []
    for max_iter in (1e1, 1e2, 1e3, 1e4, 1e5, 1e6):
        
        start = time()
        for iteration in range(int(max_iter)):

            func(np.random.random(size=(10)))
        
        timing.append(time() - start)
    
    return timing

###################################################

time_normal = profiler(gini_normal)
time_numba = profiler(gini_numba)

plt.figure(figsize=(5, 5))
plt.plot(
    [1e1, 1e2, 1e3, 1e4, 1e5, 1e6],
    time_normal, 
    label='Raw Numpy'
)
plt.plot(
    [1e1, 1e2, 1e3, 1e4, 1e5, 1e6],
    time_numba, 
    label='Numba + Numpy'
)

plt.ylabel('Seconds')
plt.xlabel('Number of Iterations')

plt.legend()
plt.show()
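One caveat about the timing loop above: the first call to a Numba-jitted function includes compilation time, and the random input is generated inside the timed region. A hedged sketch of a timing helper that separates those costs is shown below. The names `benchmark`, `n_calls`, `size` and `seed` are hypothetical and not part of the repository.

```python
from time import perf_counter

import numpy as np


def benchmark(func, n_calls, size=10, seed=0):
    """Return the seconds spent calling func on n_calls pre-generated arrays."""
    rng = np.random.default_rng(seed)
    # Generate all inputs up front so array creation is not timed.
    data = rng.random((n_calls, size))
    # Warm-up call: triggers JIT compilation if func is a jitted function.
    func(data[0])
    start = perf_counter()
    for row in data:
        func(row)
    return perf_counter() - start
```

It could be called as `benchmark(gini_normal, 10_000)` and `benchmark(gini_numba, 10_000)` to compare the two functions defined above on identical inputs.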

Question: how can it be modified for use with categorical variables?

I would like to calculate the Gini index with categorical variables.

I have data on the zones visited by people, for example:

  • person 1: [zone2, zone4, zone5, zone2, zone2]
  • person 2: [zone1, zone5, zone4, zone1, zone1, zone3]
  • person 3: [zone3, zone3, zone3, zone1, zone3]

I want to measure how dispersed each person is across the zones they visit, as a parameter from 0 to 1, so that person 3 comes out as less dispersed than person 1. I think a value of 0 should represent the least dispersion. I believe the Gini index captures this, but my variables (zones) are categorical.

Do you know how I can resolve this?
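One possible approach, sketched here as an assumption rather than as a feature of this repository, is the Gini impurity (also called the Gini-Simpson index). It is a different measure from the Gini coefficient computed by the function in this repo: it is one minus the sum of squared category proportions, so 0 means every visit was to the same zone. The helper name `gini_impurity` is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a sequence of categorical labels: 1 - sum(p_i ** 2)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


visits = {
    "person 1": ["zone2", "zone4", "zone5", "zone2", "zone2"],
    "person 2": ["zone1", "zone5", "zone4", "zone1", "zone1", "zone3"],
    "person 3": ["zone3", "zone3", "zone3", "zone1", "zone3"],
}

for person, zones in visits.items():
    print(person, round(gini_impurity(zones), 3))
```

For the example data this gives roughly 0.56, 0.67 and 0.32, so person 3 is the least dispersed. Note that the maximum for k distinct zones is 1 - 1/k rather than 1, so dividing by that bound would be needed for a strict 0 to 1 scale.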

Question about offsetting 0 value.

Hi Olivia,
Regarding this line: array += 0.0000001
Is there a particular reason for using 0.0000001 rather than some smaller positive number, such as np.nextafter(np.float64(0), np.float64(1))?

Also, why not first check if a zero exists in the array before adding 0.0000001?
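To make the question concrete, here is a small sketch comparing the fixed offset, the `np.nextafter` value, and no offset at all. The helper `gini_with_offset` is hypothetical; it reuses the formula from the repository but takes the offset as a parameter. Whether the offset is only there to avoid a zero denominator is an assumption, not something stated in the source.

```python
import numpy as np


def gini_with_offset(array, offset):
    """Gini coefficient after adding a constant offset to every value."""
    values = np.sort(np.asarray(array, dtype=float).flatten() + offset)
    n = values.shape[0]
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * values) / (n * np.sum(values))


data = np.array([0.0, 0.0, 0.0, 1.0])
# Smallest positive float64 (a subnormal number).
tiny = np.nextafter(np.float64(0), np.float64(1))

g_fixed = gini_with_offset(data, 0.0000001)  # offset used in the repository
g_tiny = gini_with_offset(data, tiny)        # offset proposed in this issue
g_none = gini_with_offset(data, 0.0)         # no offset at all

print(g_fixed, g_tiny, g_none)
```

Because the weights `2 * index - n - 1` sum to zero, a constant offset leaves the numerator unchanged and only inflates the denominator, so the fixed offset biases the result very slightly downwards. The offset only matters for avoiding a division by zero when every value is zero.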

Doesn't return negative gini

Hi, thank you for putting this together! In the README, you have an example that creates random integers and shows the Gini is "low", at 0.33. As someone who has built many ML models with a Gini under 0.33, this seemed odd to me. The other Gini function you reference as being similar (https://github.com/pysal/pysal/pull/862/files) instead returns around -0.33, rather than +0.33, when calculating the Gini of random numbers. Thus, I believe your function has an issue with returning negative Ginis.
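A small numerical check may help here. It is a sketch of a property of the formula used in this repository, not a statement about how the pysal function works. For non-negative values sorted in ascending order the formula cannot go below 0, and feeding it the same values in descending order flips the sign exactly. The helper name `gini_sorted` is hypothetical.

```python
import numpy as np


def gini_sorted(values):
    """Apply the repository's formula to values that are already ordered."""
    n = values.shape[0]
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * values) / (n * np.sum(values))


rng = np.random.default_rng(0)
data = rng.integers(1, 100, size=1000).astype(float)

ascending = gini_sorted(np.sort(data))         # order the formula expects
descending = gini_sorted(np.sort(data)[::-1])  # same data, reversed order

print(ascending, descending)
```

For uniformly distributed values the Gini coefficient is close to 1/3, which matches the README example, and the two results above differ only in sign.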
