
lifelines's Introduction


What is survival analysis and why should I learn it? Survival analysis was originally developed and applied heavily by the actuarial and medical communities. Its purpose is to answer, under uncertainty: why do events occur now rather than later (where events might refer to deaths, disease remission, etc.)? This is great for researchers who are interested in measuring lifetimes: they can answer questions like, what factors might influence deaths?

But outside of medicine and actuarial science, there are many other interesting and exciting applications of survival analysis. For example:

  • SaaS providers are interested in measuring subscriber lifetimes, or the time to some first action
  • inventory stock-outs are censoring events for the true "demand" of a good
  • sociologists are interested in measuring the lifetimes of political parties, relationships, or marriages
  • A/B tests to determine how long it takes different groups to perform an action

lifelines is a pure Python implementation of the best parts of survival analysis.

Documentation and intro to survival analysis

If you are new to survival analysis, are wondering why it is useful, or are interested in lifelines examples, API, and syntax, please read the Documentation and Tutorials page.

Contact

Development

See our Contributing guidelines.


lifelines's Issues

multivariate_logrank_test doesn't work with Series arguments

This bug affects multivariate_logrank_test but stems from group_survival_table_from_events in utils.py. If group_survival_table_from_events is called with groups as a pandas Series and any of durations, event_observed, or min_observations is a numpy array, then the code fails because of pandas-dev/pandas#6168 (which will be fixed in numpy 1.9).

multivariate_logrank_test calls group_survival_table_from_events with a numpy array, so if the groups argument to multivariate_logrank_test is a Series, we get an error on line 71 of utils.py. If the event_observed argument to multivariate_logrank_test is a Series, group_survival_table_from_events fails on line 74 (where it calls survival_table_from_events, which then fails on line 134). This is a different bug.

Test cases:

In [12]: data
Out[12]: 
   duration done_feeding  race
0        16         True     0
1         1         True     1
2         4        False     2
3         3         True     2
4        36         True     2

[5 rows x 3 columns]

In [13]: multivariate_logrank_test(data.duration, data.race, data.done_feeding)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-13-ee2bb8839bd4> in <module>()
----> 1 multivariate_logrank_test(data.duration, data.race, data.done_feeding)

/usr/local/lib/python2.7/dist-packages/lifelines/statistics.pyc in multivariate_logrank_test(event_durations, groups, event_observed, alpha, t_0, **kwargs)
    152         event_observed = np.ones((event_durations.shape[0], 1))
    153 
--> 154     unique_groups, rm, obs, _ = group_survival_table_from_events(groups, event_durations, event_observed, np.zeros_like(event_durations), t_0)
    155     n_groups = unique_groups.shape[0]
    156 

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in group_survival_table_from_events(groups, durations, event_observed, min_observations, limit)
     69     T = durations[ix]
     70     C = event_observed[ix]
---> 71     B = min_observations[ix]
     72 
     73     g_name = str(g)

IndexError: unsupported iterator index

In [14]: multivariate_logrank_test(data.duration, data.race.values, data.done_feeding)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-7853f8768dfd> in <module>()
----> 1 multivariate_logrank_test(data.duration, data.race.values, data.done_feeding)

/usr/local/lib/python2.7/dist-packages/lifelines/statistics.pyc in multivariate_logrank_test(event_durations, groups, event_observed, alpha, t_0, **kwargs)
    152         event_observed = np.ones((event_durations.shape[0], 1))
    153 
--> 154     unique_groups, rm, obs, _ = group_survival_table_from_events(groups, event_durations, event_observed, np.zeros_like(event_durations), t_0)
    155     n_groups = unique_groups.shape[0]
    156 

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in group_survival_table_from_events(groups, durations, event_observed, min_observations, limit)
     81         g_name = str(g)
     82         data = data.join(survival_table_from_events(T, C, B,
---> 83                     columns=['removed:' + g_name, "observed:" + g_name, 'censored:' + g_name, 'entrance' + g_name]),
     84                     how='outer')
     85     data = data.fillna(0)

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in survival_table_from_events(durations, event_observed, min_observations, columns, weights)
    132     df[columns[1]] = event_observed
    133     death_table = df.groupby("event_at").sum()
--> 134     death_table[columns[2]] = (death_table[columns[0]] - death_table[columns[1]]).astype(int)
    135 
    136     #deal with late births

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1633             return self._getitem_multilevel(key)
   1634         else:
-> 1635             return self._getitem_column(key)
   1636 
   1637     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1640         # get column
   1641         if self.columns.is_unique:
-> 1642             return self._get_item_cache(key)
   1643 
   1644         # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    981         res = cache.get(item)
    982         if res is None:
--> 983             values = self._data.get(item)
    984             res = self._box_item_values(item, values)
    985             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item)
   2752                 return self.get_for_nan_indexer(indexer)
   2753 
-> 2754             _, block = self._find_block(item)
   2755             return block.get(item)
   2756         else:

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _find_block(self, item)
   3063 
   3064     def _find_block(self, item):
-> 3065         self._check_have(item)
   3066         for i, block in enumerate(self.blocks):
   3067             if item in block:

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _check_have(self, item)
   3070     def _check_have(self, item):
   3071         if item not in self.items:
-> 3072             raise KeyError('no item named %s' % com.pprint_thing(item))
   3073 
   3074     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: u'no item named observed:1'

In [15]: multivariate_logrank_test(data.duration.values, data.race.values, data.done_feeding.values)
Out[15]: 
('Results\n   df: 2\n   alpha: 0.95\n   t 0: -1\n   test: logrank\n   null distribution: chi squared\n\n   __ p-value ___|__ test statistic __|__ test results __\n         0.12832 |              4.106 |     None   ',
 0.12832470243700733,
 None)
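
Until the indexing bug is fixed, a workaround consistent with In [15] above is to coerce every argument to a plain numpy array before calling:

import numpy as np

# Coerce each pandas Series to a numpy array so the positional indexing
# inside group_survival_table_from_events works.
result = multivariate_logrank_test(np.asarray(data.duration),
                                   np.asarray(data.race),
                                   np.asarray(data.done_feeding))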

Install with pip

The pip installer, when given the GitHub repo address, doesn't seem to check for existing numpy and scipy installations; it installs over whatever is already there.

pip install -U git+https://github.com/CamDavidsonPilon/lifelines.git

nosetests fail (solved: compiler on Windows)

Howdy. I cloned lifelines onto a Windows 7 computer and tried running the nosetests, and got this error:

E
======================================================================
ERROR: Failure: ImportError (No module named _statistics)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\nose\loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Anaconda\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Anaconda\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\Users\jmschr\Documents\GitHub\lifelines\lifelines\tests\test_suite.py", line 26, in <module>
    from ..statistics import (logrank_test, multivariate_logrank_test,
  File "C:\Users\jmschr\Documents\GitHub\lifelines\lifelines\statistics.py", line 10, in <module>
    from lifelines._statistics import concordance_index as _cindex
ImportError: No module named _statistics

----------------------------------------------------------------------
Ran 1 test in 1.293s

FAILED (errors=1)

I am currently using the CoxPHFitter in my work, but there seem to be some areas that do not work.
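
For reference, since the title notes this was solved by installing a compiler on Windows: the missing _statistics module is the compiled Fortran extension, so (assuming a working compiler toolchain) it can be built in place before running the tests:

python setup.py build_ext --inplace
nosetests lifelines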

import error after installing from wheel (cython issue)

I installed from the lifelines-0.4.0.0-cp27-none-win32.whl file, and am now having the following issue. Lifelines 0.3 worked fine.

Traceback (most recent call last):
File ".\x.py", line 11, in
from lifelines.statistics import logrank_test
File "E:\Python27\lib\site-packages\lifelines\statistics.py", line 10,
from lifelines._statistics import concordance_index as _cindex

Cross validation should stratify for events

When doing cross-validation on censored data, it is important to stratify the pieces on the event variable, e.g. to keep the fraction of censored cases roughly equal between the pieces.

Example and explanation

For example, consider a (small) data set with the following event variables:

[0, 0, 1, 1, 1, 1, 1, 1]

25% of cases are censored. Now, doing repeated k-fold validation with k=4 can result in the following pieces:

[1, 1]
[1, 1]
[1, 1]
[0, 0]

Validating on the final piece can, by definition, never score better than random with the c-index.

Example implementation

One way to stratify is to first divide the data set by class (in this case, censored or event) and then do the k-fold division separately within each class. Then, each time around the loop, combine the pieces from both classes to form the training data.

As a reference, here is my old non-pandas solution (see also the scikit-learn sketch after this block). I am not suggesting this code as a replacement, just as a "brainstorming thing":

import numpy as np

# data (a 2-D array), eventcol, ntimes and kfold are assumed to be defined
# by the caller.
indices = np.arange(len(data))
# I generalized to an arbitrary number of classes for some reason...
classes = np.unique(data[:, eventcol])
classindices = {}
for c in classes:
    classindices[c] = indices[data[:, eventcol] == c]

for n in range(ntimes):
    # Re-shuffle the data every time
    for c in classes:
        np.random.shuffle(classindices[c])

    for k in range(kfold):
        valindices = []
        trnindices = []

        # Join the data pieces
        for p in range(kfold):
            if k == p:  # validation piece
                for idx in classindices.values():
                    # Piece length might be a decimal number, so round it off
                    plength = int(round(len(idx) / kfold))
                    valindices.extend(idx[p*plength:(p+1)*plength])
            else:  # training piece
                for idx in classindices.values():
                    plength = int(round(len(idx) / kfold))
                    trnindices.extend(idx[p*plength:(p+1)*plength])
        # Actual model stuff follows...
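
For comparison, here is a minimal sketch of the same stratification idea using scikit-learn's StratifiedKFold (an outside library with its modern API, not part of lifelines), stratifying the folds on the event indicator; the durations T and event flags E below are synthetic:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data: durations T and event indicators E (1 = event, 0 = censored).
T = np.random.exponential(10, size=100)
E = np.random.binomial(1, 0.75, size=100)

# Stratifying on E keeps the censoring fraction roughly equal across folds.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(T.reshape(-1, 1), E):
    # Every fold now mixes censored and uncensored observations.
    print("validation fold censoring fraction:", 1 - E[val_idx].mean())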

Aalen Additive Filter fails with non-standard indices on dataframe

Example of non-numeric index:

In [33]: aaf = ll.AalenAdditiveFitter()

In [34]: example
Out[34]: 
   duration done_feeding white
a        16         True  True
b         1         True  True
c         4        False  True
d         3         True  True
e        36         True  True

[5 rows x 3 columns]

In [35]: aaf.fit(example, duration_col='duration', event_col='done_feeding')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-4c043f45cf2b> in <module>()
----> 1 aaf.fit(example, duration_col='duration', event_col='done_feeding')

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in fit(self, dataframe, duration_col, event_col, timeline, id_col, show_progress)
    443 
    444         if id_col is None:
--> 445             self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
    446         else:
    447             self._fit_varying(dataframe,duration_col,event_col,id_col,timeline,show_progress)

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in _fit_static(self, dataframe, duration_col, event_col, timeline, show_progress)
    526 
    527             relevant_individuals = (ids==id)
--> 528             assert relevant_individuals.sum() == 1.
    529 
    530             #perform linear regression step.

AttributeError: 'bool' object has no attribute 'sum'

Example of numeric index where index isn't just 0 through n-1

In [36]: example.index = pd.Index([0, 1, 2, 4, 5])

In [37]: example
Out[37]: 
   duration done_feeding white
0        16         True  True
1         1         True  True
2         4        False  True
4         3         True  True
5        36         True  True

[5 rows x 3 columns]

In [38]: aaf.fit(example, duration_col='duration', event_col='done_feeding')
 [-----------------100%-----------------] 4 of 4 complete in 0.0 sec
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-38-4c043f45cf2b> in <module>()
----> 1 aaf.fit(example, duration_col='duration', event_col='done_feeding')

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in fit(self, dataframe, duration_col, event_col, timeline, id_col, show_progress)
    443 
    444         if id_col is None:
--> 445             self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
    446         else:
    447             self._fit_varying(dataframe,duration_col,event_col,id_col,timeline,show_progress)

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in _fit_static(self, dataframe, duration_col, event_col, timeline, show_progress)
    526 
    527             relevant_individuals = (ids==id)
--> 528             assert relevant_individuals.sum() == 1.
    529 
    530             #perform linear regression step.

AssertionError: 

The same thing happens if I assign the index to be 1, 2, 3, 4, 5. When I assign the index to be 0, 1, 2, 3, 4 (the default), the code works.

From my brief look at the code, the bug stems from setting T to df[duration_col] (on line 485 of estimation.py). This gives T the index of the input dataframe, but the loop starting at line 415 assumes that id is between 0 and df.shape[0].

I'm only moderately familiar with pandas and I'm fairly busy finishing up my school year right now, but if this isn't fixed by mid-June I should have time to fix it then.
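
Until this is fixed, a workaround (an assumption based on the analysis above, not an official fix) is to give the DataFrame the default 0..n-1 index before fitting:

# _fit_static assumes ids run from 0 to df.shape[0] - 1, so resetting the
# index sidesteps the bug.
example = example.reset_index(drop=True)
aaf.fit(example, duration_col='duration', event_col='done_feeding')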

pip install lifelines - failure (solved: missing fortran)

Using Anaconda on both Linux Mint and Mac OS. pip install lifelines results in:

Downloading/unpacking lifelines
  Downloading lifelines-0.4.3.tar.gz (628kB): 628kB downloaded
  Running setup.py (path:/tmp/pip_build_root/lifelines/setup.py) egg_info for package lifelines
    build_src
    building extension "lifelines._statistics" sources
    f2py options: []
    f2py:> build/src.linux-i686-2.7/lifelines/_statisticsmodule.c
    Reading fortran codes...
        Reading file 'lifelines/_statistics.f90' (format:free)
    Post-processing...
        Block: _statistics
                Block: concordance_index
    Post-processing (stage 2)...
    Building modules...
        Building module "_statistics"...
            Creating wrapper for Fortran function "concordance_index"("concordance_index")...
            Constructing wrapper function "concordance_index"...
              cindex = concordance_index(event_times,predictions,event_observed,[rows])
        Wrote C/API module "_statistics" to file "build/src.linux-i686-2.7/lifelines/_statisticsmodule.c"
        Fortran 77 wrappers are saved to "build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f"
      adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
      adding 'build/src.linux-i686-2.7' to include_dirs.
      adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
    build_src: building npy-pkg config files

    warning: no files found matching '*' under directory 'styles'
    warning: no previously-included files matching '*.py[co]' found under directory '*'
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/lib/python2.7/dist-packages (from lifelines)
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages (from lifelines)
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7 (from lifelines)
Downloading/unpacking pandas>=0.14 (from lifelines)
  Downloading pandas-0.14.1.tar.gz (6.7MB): 6.7MB downloaded
  Running setup.py (path:/tmp/pip_build_root/pandas/setup.py) egg_info for package pandas

    warning: no files found matching 'README.rst'
    no previously-included directories found matching 'doc/build'
    warning: no previously-included files matching '*.so' found anywhere in distribution
    warning: no previously-included files matching '*.pyd' found anywhere in distribution
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '.git*' found anywhere in distribution
    warning: no previously-included files matching '.DS_Store' found anywhere in distribution
    warning: no previously-included files matching '*.png' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Requirement already satisfied (use --upgrade to upgrade): tornado in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6 in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Downloading/unpacking nose (from matplotlib->lifelines)
  Downloading nose-1.3.4.tar.gz (277kB): 277kB downloaded
  Running setup.py (path:/tmp/pip_build_root/nose/setup.py) egg_info for package nose

    no previously-included directories found matching 'doc/.build'
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/lib/python2.7/dist-packages (from pandas>=0.14->lifelines)
Installing collected packages: lifelines, pandas, nose
  Running setup.py install for lifelines
    unifing config_cc, config, build_clib, build_ext, build commands --compiler options
    unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
    build_src
    building extension "lifelines._statistics" sources
    f2py options: []
      adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
      adding 'build/src.linux-i686-2.7' to include_dirs.
      adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
    build_src: building npy-pkg config files
    customize UnixCCompiler
    customize UnixCCompiler using build_ext
    customize Gnu95FCompiler
    Found executable /usr/bin/gfortran
    customize Gnu95FCompiler
    customize Gnu95FCompiler using build_ext
    building 'lifelines._statistics' extension
    compiling C sources
    C compiler: i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

    compile options: '-Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
    i686-linux-gnu-gcc: build/src.linux-i686-2.7/fortranobject.c
    In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
    build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
    build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    error: Command "i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c build/src.linux-i686-2.7/fortranobject.c -o build/temp.linux-i686-2.7/build/src.linux-i686-2.7/fortranobject.o" failed with exit status 1
    Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lifelines/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-xn95nq-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building extension "lifelines._statistics" sources
f2py options: []
  adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
  adding 'build/src.linux-i686-2.7' to include_dirs.
  adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
build_src: building npy-pkg config files
running build_py
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/lifelines
copying lifelines/_cox_regression.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/generate_datasets.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/__init__.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/estimation.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/plotting.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/progress_bar.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/datasets.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/statistics.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/utils.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/_statistics.f90 -> build/lib.linux-i686-2.7/lifelines
copying lifelines/../README.md -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../README.txt -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../LICENSE -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../MANIFEST.in -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../Untitled0.ipynb -> build/lib.linux-i686-2.7/lifelines/..
creating build/lib.linux-i686-2.7/datasets
copying lifelines/../datasets/static_test.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/psychiatric_patients.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/gehan.dat -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/dd.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/divorce.dat -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/Divorces Rates.ipynb -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/The Gehan Survival Data.ipynb -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/panel_test.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/canadian_senators.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/lung.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/2002FemResp.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
running build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
customize Gnu95FCompiler
Found executable /usr/bin/gfortran
customize Gnu95FCompiler
customize Gnu95FCompiler using build_ext
building 'lifelines._statistics' extension
compiling C sources
C compiler: i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/build
creating build/temp.linux-i686-2.7/build/src.linux-i686-2.7
creating build/temp.linux-i686-2.7/build/src.linux-i686-2.7/lifelines
compile options: '-Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
i686-linux-gnu-gcc: build/src.linux-i686-2.7/fortranobject.c
In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
                    ^
compilation terminated.
In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
                    ^
compilation terminated.
error: Command "i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c build/src.linux-i686-2.7/fortranobject.c -o build/temp.linux-i686-2.7/build/src.linux-i686-2.7/fortranobject.o" failed with exit status 1

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lifelines/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-xn95nq-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lifelines
Storing debug log for failure in /home/juan/.pip/pip.log
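
Note that the fatal error in this log is a missing Python.h, which usually means the Python development headers are absent. On Debian-based systems (an assumption based on the i686-linux-gnu-gcc toolchain in the log), a likely fix before retrying is:

sudo apt-get install python-dev   # provides Python.h for C extension builds
pip install lifelines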

Statistical tests statistically fail

Because we run real statistical tests as examples, each has probability p of failing purely by chance. This was okay before, but with tests running on multiple Python versions it's a problem.

The solution is to think of new ways to test these tests.
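
One simple mitigation (a suggestion, not necessarily the maintainers' chosen fix) is to seed the random number generator in the test suite so the sampled data, and hence the test statistics, are deterministic across runs:

import numpy as np

# Fix the RNG state so randomly generated test data is identical on every run.
np.random.seed(0)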

Unit tests fail with Pandas 0.14

As the title says, the unit tests fail with Pandas 0.14; they run successfully with 0.13.1, though. The Python version is 3.4, and here is a pip freeze of the other packages used:

matplotlib==1.3.1
numpy==1.8.1
scipy==0.13.3

The output is as follows:

======================================================================
ERROR: test_aalen_additive_median_predictions_split_data (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 477, in test_aalen_additive_median_predictions_split_data
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_dataframe_input_with_nonstandard_index (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 539, in test_dataframe_input_with_nonstandard_index
    aaf.fit(df, duration_col='duration', event_col='done_feeding')
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_large_dimensions_for_recursion_error (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 441, in test_large_dimensions_for_recursion_error
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_tall_data_points (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 453, in test_tall_data_points
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_timeline_to_NelsonAalenFitter (__main__.StatisticalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 262, in test_timeline_to_NelsonAalenFitter
    with_list = naf.fit(T, C, timeline=timeline).cumulative_hazard_.values
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 67, in fit
    self._additive_f, self._variance_f, False)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 862, in _additive_estimate
    estimate_ = estimate_.reindex(timeline, method='pad').fillna(0)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/series.py", line 2028, in reindex
    return super(Series, self).reindex(index=index, **kwargs)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/generic.py", line 1624, in reindex
    method, fill_value, copy).__finalize__(self)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/generic.py", line 1641, in _reindex_axes
    labels, level=level, limit=limit, method=method)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/index.py", line 1375, in reindex
    limit=limit)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/index.py", line 1264, in get_indexer
    raise ValueError('Must be monotonic for forward fill')
ValueError: Must be monotonic for forward fill

======================================================================
FAIL: test_multivariate_equal_intensities (__main__.StatisticalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 227, in test_multivariate_equal_intensities
    self.assertTrue(result is None)
AssertionError: False is not true
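
For context on the TypeError in the first four errors: in Python 3, zip returns a lazy iterator, so len() fails on it. A minimal illustration (the misspelled variable name follows the traceback; wrapping the iterator in list() restores the Python 2 behavior):

# Python 2: zip() returns a list, so len() works.
# Python 3: zip() returns an iterator, and len() raises TypeError.
non_censorsed_times = zip([1, 2, 3], [0, 1, 1])
# len(non_censorsed_times)                       # TypeError on Python 3

non_censorsed_times = list(non_censorsed_times)  # materialize the iterator
n_deaths = len(non_censorsed_times)              # works on both versions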

Censored count data estimation

e.g., we are interested in the distribution of the number of uses of a product (count data), but the observation window censors our view of each user's full history.

Memory blows up for some AAF datasets

# to get the latest version of lifelines
# pip install --upgrade git+https://github.com/CamDavidsonPilon/lifelines.git

import numpy as np
import pandas as pd

# This should break:
from lifelines.generate_datasets import *
from lifelines import AalenAdditiveFitter

# generate a fake, large dataset
n = 10000
d = 20
timeline = np.linspace(0, 70, 8000)
hz, coef, X = generate_hazard_rates(n, d, timeline)
X.columns = coef.columns
cumulative_hazards = pd.DataFrame(cumulative_quadrature(coef.values.T, timeline).T,
                                  index=timeline, columns=coef.columns)
T = generate_random_lifetimes(hz, timeline)
X['T'] = T
X['E'] = np.random.binomial(1, .99, n)
print("data created")
aaf = AalenAdditiveFitter(penalizer=1., fit_intercept=False)

aaf.fit(X)  # bfill is called in here.


###
# Here's the internal data structure that will fail on bfill

df = X.copy()
df['id'] = range(df.shape[0])
df = df.set_index(['T', 'id'])  # note: the duration column is 'T'
wp = df.to_panel()
# calling bfill on wp should fail, or hang terribly.

AalenAdditiveFitter fit is extremely slow for large matrices

If X is large, like (500k, 4), the fit method will loop over all of the rows, despite the fact that 99% of them are censored. In estimation.py, the loop I'm talking about starts on line 320:
for i, time in enumerate(sorted_event_times):

The algebra is way beyond my understanding :) Maybe I'm missing something, but why is sorted_event_times so large?

Multiple comparisons testing

Multiple-comparisons corrections with something like Bonferroni would be useful. This would also require generating p-values for the logrank statistic from the chi-squared distribution; see the sketch below.
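
A minimal sketch of both pieces, assuming k pairwise logrank statistics have already been computed (the statistic and degrees of freedom below reuse the example output earlier on this page; the other p-values are hypothetical):

from scipy import stats

# p-value for a logrank statistic from the chi-squared distribution.
test_statistic, df = 4.106, 2
p = stats.chi2.sf(test_statistic, df)  # survival function = 1 - CDF, ~0.128

# Bonferroni: with k comparisons, test each at level alpha / k.
p_values = [p, 0.012, 0.200]           # hypothetical pairwise p-values
alpha, k = 0.05, len(p_values)
rejections = [pv < alpha / k for pv in p_values]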

logrank_test requires matching time values

If I call the logrank_test function like this on uncensored data...
logrank_test(T[gen0], T[gen1])
I get the following indexing error (I removed some of the stack for brevity).

IndexError                                Traceback (most recent call last)
<ipython-input-166-59a3462b6f73> in <module>()
      1 from lifelines.statistics import logrank_test
----> 2 logrank_test(T[gen0],T[gen1])
      3 T[gen0]

~/lifelines/lifelines/statistics.py in logrank_test(event_times_A, event_times_B, censorship_A, censorship_B, t_0)
     46       pass
     47     try:
---> 48       y_2 = Y_2.loc[t]
     49     except KeyError:
     50       pass

IndexError: index out of bounds

I think what is going on is that you are requiring the two survival curves to have the same event-time values. If the death events were observed at different times for the different curves, this function throws an indexing error.

If you want to replicate my error, try the following data:
T[gen1] = array([6., 13., 13., 13., 19., 19., 19., 26., 26., 26., 26., 26., 33., 33., 47., 62., 62., 9., 9., 9., 15., 15., 22., 22., 22., 22., 29., 29., 29., 29., 29., 36., 36., 43.])

T[gen0] = array([33., 54., 54., 61., 61., 61., 61., 61., 61., 61., 61., 61., 61., 61., 69., 69., 69., 69., 69., 69., 69., 69., 69., 69., 69., 32., 53., 53., 60., 60., 60., 60., 60., 68., 68., 68., 68., 68., 68., 68., 68., 68., 68., 75., 17., 51., 51., 51., 58., 58., 58., 58., 66., 66., 7., 7., 41., 41., 41., 41., 41., 41., 41., 48., 48., 48., 48., 48., 48., 48., 48., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 63., 63., 63., 63., 63., 63., 63., 63., 63., 69., 69., 38., 38., 45., 45., 45., 45., 45., 45., 45., 45., 45., 45., 53., 53., 53., 53., 53., 60., 60., 60., 60., 60., 60., 60., 60., 60., 60., 60., 66.])

date transformer

It would be nice to have the following function:

T, C = datetime_transform(start_dates, end_dates)
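
A minimal sketch of what such a helper could look like (the function name, signature, and censoring convention are all assumptions; here, a missing end date is treated as censored at the current time):

import numpy as np
import pandas as pd

def datetime_transform(start_dates, end_dates, freq='D'):
    """Hypothetical helper: convert start/end dates into durations T and
    an event indicator C (1 = event observed, 0 = censored)."""
    start = pd.to_datetime(pd.Series(start_dates))
    end = pd.to_datetime(pd.Series(end_dates))
    C = end.notnull().astype(int)          # missing end date => censored
    end = end.fillna(pd.Timestamp.now())   # censor open intervals at "now"
    T = (end - start) / np.timedelta64(1, freq)
    return T.values, C.values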

Bug in Variance of Nelson-Aalen estimator

First of all, I'm happy to see this library in Python; when I have time I'd love to contribute. I want to put the code to real use, and before doing so I first want to cross-validate the results against R.

It appears that the _variance_f function of Nelson-Aalen estimator returns a wrong value:

return (1.*d/N)**2

It should be:

return d/N/N

You may want to investigate this further, because I found this just from reading the code and I haven't run any code to prove that the results are wrong. (Note that the two expressions agree only when d = 1; with tied deaths, d > 1, they diverge.)

Install Error - KeyError: 'FARCHFLAGS'

Hello,

I'm trying to install lifelines and am receiving the following:

customize UnixCCompiler
customize UnixCCompiler using build_ext
customize Gnu95FCompiler
Found executable /usr/local/bin/gfortran
Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip_build_root/lifelines/setup.py", line 57, in <module>
        ext_modules=[ext_fstat]
...
  File "/Library/Python/2.7/site-packages/numpy-override/numpy/distutils/fcompiler/gnu.py", line 274, in _universal_flags
    farchflags = os.environ['FARCHFLAGS']
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'FARCHFLAGS'

I've tried building/installing from setup.py and pip; both return the same error. Looking at line 57 of setup.py eventually led me to _statistics.f90, which I believe is a Fortran file. I removed my Fortran compiler and re-installed it via Homebrew (brew install gcc), but that did not help either. I've also tried the following with no luck:

$ export FARCHFLAGS="-arch x86_64"
$ export CFLAGS="-arch i386 -arch x86_64" 
$ export LDFLAGS="-Wall -undefined dynamic_lookup -bundle -arch i386 -arch x86_64"
$ export FFLAGS="-arch i386 -arch x86_64"

Any help would be greatly appreciated! I'm using OS X Mavericks with Command Line Tools and all required packages installed.

python3 support for datasets.py

lifelines seems to work fine with Python 3, but when I try the Quickstart in Python 3 it complains on:

from lifelines.datasets import waltons_data

A quick 2to3 fixes the problem and I can run the Quickstart without any trouble.

thanks for all the great work!

Prediction for time-varying covariates

It's not completely clear to me how to do prediction with time-varying covariates. By prediction, I am referring to constructing a hazard curve given the covariates. Consider the static case:

hz(t) = b1(t)*X_1 + b2(t)*X_2 + ...

This works because I can extend t as far as I want and still only need to know (X_1, X_2, ...). For time-varying covariates, I can only extend as far as the observed covariates.

Access to CoxPH loglikelihood

From an email

The only thing is there's no way to get the log likelihood even if I'm willing to wait for it. In _newton_rhaphson, you're unpacking only two values out of _get_efron_values. So even if I set the argument include_likelihood to True for _get_efron_values, I'm getting a ValueError: too many values to unpack in _newton_rhaphson.

README graph doesn't reflect code

@Cmrn_DP actually I meant the way one of the data series in my last graph is truncated at 12, but both of yours are, not the test itself

- Andrew Clegg (@andrew_clegg) June 13, 2014

KMF legend labels do not play nice with LaTeX

The labels "_upper_0.95" and "_lower_0.95" break plotting if LaTeX is enabled:

import matplotlib as mpl
mpl.rcParams['text.usetex'] = True

from lifelines import KaplanMeierFitter

# Create a Kaplan-Meier fitter, fit it, then plot...
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=C)  # T, C as in the other examples
kmf.plot(ax=ax, c="#A60628", ci_force_lines=True)

The problem is that LaTeX hates underscores... The text might also look bad if matplotlib interprets the legend as math mode (not sure that it does).
One could imagine something like this:

import matplotlib as mpl

# No more underscores
label = "{} upper 0.95".format(actual_label)
# If TeX is enabled, wrap the label in a text environment
if mpl.rcParams['text.usetex']:
    label = r"\text{%s}" % label  # raw string, so \t isn't a tab
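
An alternative (a suggestion, not lifelines' eventual fix) is to escape the underscores so LaTeX renders them literally:

# \_ is a literal underscore in LaTeX, so no text environment is needed.
label = actual_label.replace("_", r"\_")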

Docs

Need to make the user feel empowered.

analysis on multiple survival curves simultaneously

Use the groupby functionality of pandas DataFrames to generate a bunch of survival functions, run the stats, and plot the curves for anything significantly different from a control population; see the sketch below.

Walton (private correspondence)
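
A minimal sketch of this workflow, using the current lifelines API (an assumption; the signatures at the time of this issue differed) and synthetic data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Synthetic data: durations T, event flags E, and a grouping column.
df = pd.DataFrame({
    'T': np.random.exponential(10, 300),
    'E': np.random.binomial(1, 0.8, 300),
    'group': np.random.choice(['control', 'a', 'b'], 300),
})

fig, ax = plt.subplots()
kmf = KaplanMeierFitter()
for name, grouped in df.groupby('group'):
    kmf.fit(grouped['T'], event_observed=grouped['E'], label=str(name))
    kmf.plot(ax=ax)  # overlay each group's curve on the same axes
# (running the stats per pair is left to logrank_test / multivariate_logrank_test)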

predict_expectation incorrect on AalenAdditiveFilter

predict_expectation does not agree with the following code:

from scipy.integrate import simps

def predict_expectation(aaf, covariates):
    # E[T] is the integral of the survival function over the timeline.
    sfs = aaf.predict_survival_function(covariates)
    xvals = sfs.index.values
    return simps(sfs.values, xvals, axis=0)

For this curve, http://imgur.com/PhJH3Tp, the current predict_expectation method returns 50 or 60, while the method above returns the much more reasonable value of 16.

You should be able to test this by doing a Kaplan-Meier fit with no censored data points, applying the quadrature method to kmf.survival_function_ (and its index), and noticing that the value of that expectation is not equal to the mean duration of the data set (since all points are uncensored, the KM fit should just be 1 - the CDF of the dataset).
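
A minimal sketch of that check, assuming synthetic uncensored data (the quadrature via scipy's simps mirrors the snippet above):

import numpy as np
from scipy.integrate import simps
from lifelines import KaplanMeierFitter

# With no censoring, the KM estimate is 1 - ECDF, so integrating the
# survival function should approximately recover the sample mean.
T = np.random.exponential(10, size=5000)
kmf = KaplanMeierFitter()
kmf.fit(T)  # all points treated as observed events

sf = kmf.survival_function_
expectation = simps(sf.values[:, 0], sf.index.values)
print(expectation, T.mean())  # these should be close; the bug is that
                              # predict_expectation disagrees with both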
