
lifelines's Introduction


What is survival analysis and why should I learn it? Survival analysis was originally developed and applied heavily by the actuarial and medical communities. Its purpose is to answer, under uncertainty: why do events occur now rather than later (where events might refer to deaths, disease remission, etc.)? This is great for researchers who are interested in measuring lifetimes: they can answer questions like, what factors might influence deaths?

But outside of medicine and actuarial science, there are many other interesting and exciting applications of survival analysis. For example:

  • SaaS providers are interested in measuring subscriber lifetimes, or the time to some first action
  • inventory stock-outs are censoring events for the true "demand" of a good
  • sociologists are interested in measuring the lifetimes of political parties, relationships, or marriages
  • A/B tests to determine how long it takes different groups to perform an action

lifelines is a pure Python implementation of the best parts of survival analysis.

Documentation and intro to survival analysis

If you are new to survival analysis, are wondering why it is useful, or are interested in lifelines examples, API, and syntax, please read the Documentation and Tutorials page.

Contact

Development

See our Contributing guidelines.


lifelines's Issues

multivariate_logrank_test doesn't work with Series arguments

This bug affects multivariate_logrank_test but stems from group_survival_table_from_events in utils.py. If group_survival_table_from_events is called with groups as a pandas Series and any of durations, event_observed, or min_observations is a numpy array, then the code fails because of pandas-dev/pandas#6168 (which will be fixed in numpy 1.9).

multivariate_logrank_test calls group_survival_table_from_events with a numpy array, so if the groups argument to multivariate_logrank_test is a Series, we get an error on line 71 of utils.py. If the event_observed argument to multivariate_logrank_test is a Series, group_survival_table_from_events fails on line 74 (where it calls survival_table_from_events, which then fails on line 134). This is a different bug.

Test cases:

In [12]: data
Out[12]: 
   duration done_feeding  race
0        16         True     0
1         1         True     1
2         4        False     2
3         3         True     2
4        36         True     2

[5 rows x 3 columns]

In [13]: multivariate_logrank_test(data.duration, data.race, data.done_feeding)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-13-ee2bb8839bd4> in <module>()
----> 1 multivariate_logrank_test(data.duration, data.race, data.done_feeding)

/usr/local/lib/python2.7/dist-packages/lifelines/statistics.pyc in multivariate_logrank_test(event_durations, groups, event_observed, alpha, t_0, **kwargs)
    152         event_observed = np.ones((event_durations.shape[0], 1))
    153 
--> 154     unique_groups, rm, obs, _ = group_survival_table_from_events(groups, event_durations, event_observed, np.zeros_like(event_durations), t_0)
    155     n_groups = unique_groups.shape[0]
    156 

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in group_survival_table_from_events(groups, durations, event_observed, min_observations, limit)
     69     T = durations[ix]
     70     C = event_observed[ix]
---> 71     B = min_observations[ix]
     72 
     73     g_name = str(g)

IndexError: unsupported iterator index

In [14]: multivariate_logrank_test(data.duration, data.race.values, data.done_feeding)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-7853f8768dfd> in <module>()
----> 1 multivariate_logrank_test(data.duration, data.race.values, data.done_feeding)

/usr/local/lib/python2.7/dist-packages/lifelines/statistics.pyc in multivariate_logrank_test(event_durations, groups, event_observed, alpha, t_0, **kwargs)
    152         event_observed = np.ones((event_durations.shape[0], 1))
    153 
--> 154     unique_groups, rm, obs, _ = group_survival_table_from_events(groups, event_durations, event_observed, np.zeros_like(event_durations), t_0)
    155     n_groups = unique_groups.shape[0]
    156 

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in group_survival_table_from_events(groups, durations, event_observed, min_observations, limit)
     81         g_name = str(g)
     82         data = data.join(survival_table_from_events(T, C, B,
---> 83                     columns=['removed:' + g_name, "observed:" + g_name, 'censored:' + g_name, 'entrance' + g_name]),
     84                     how='outer')
     85     data = data.fillna(0)

/usr/local/lib/python2.7/dist-packages/lifelines/utils.pyc in survival_table_from_events(durations, event_observed, min_observations, columns, weights)
    132     df[columns[1]] = event_observed
    133     death_table = df.groupby("event_at").sum()
--> 134     death_table[columns[2]] = (death_table[columns[0]] - death_table[columns[1]]).astype(int)
    135 
    136     #deal with late births

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1633             return self._getitem_multilevel(key)
   1634         else:
-> 1635             return self._getitem_column(key)
   1636 
   1637     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1640         # get column
   1641         if self.columns.is_unique:
-> 1642             return self._get_item_cache(key)
   1643 
   1644         # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    981         res = cache.get(item)
    982         if res is None:
--> 983             values = self._data.get(item)
    984             res = self._box_item_values(item, values)
    985             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item)
   2752                 return self.get_for_nan_indexer(indexer)
   2753 
-> 2754             _, block = self._find_block(item)
   2755             return block.get(item)
   2756         else:

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _find_block(self, item)
   3063 
   3064     def _find_block(self, item):
-> 3065         self._check_have(item)
   3066         for i, block in enumerate(self.blocks):
   3067             if item in block:

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _check_have(self, item)
   3070     def _check_have(self, item):
   3071         if item not in self.items:
-> 3072             raise KeyError('no item named %s' % com.pprint_thing(item))
   3073 
   3074     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: u'no item named observed:1'

In [15]: multivariate_logrank_test(data.duration.values, data.race.values, data.done_feeding.values)
Out[15]: 
('Results\n   df: 2\n   alpha: 0.95\n   t 0: -1\n   test: logrank\n   null distribution: chi squared\n\n   __ p-value ___|__ test statistic __|__ test results __\n         0.12832 |              4.106 |     None   ',
 0.12832470243700733,
 None)
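
Until the indexing bug is fixed, a workaround consistent with In [15] above is to coerce every argument to a plain numpy array before calling:

import numpy as np

# Coerce each pandas Series to a numpy array so the positional indexing
# inside group_survival_table_from_events works.
result = multivariate_logrank_test(np.asarray(data.duration),
                                   np.asarray(data.race),
                                   np.asarray(data.done_feeding))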

Install with pip

The pip installer, when given the GitHub repo address, doesn't seem to check for existing numpy and scipy installations; it installs over whatever is already there.

pip install -U git+https://github.com/CamDavidsonPilon/lifelines.git

nosetests fail (solved: compiler on Windows)

Howdy. I cloned lifelines onto a Windows 7 computer and tried running the nosetests, and got this error:

E
======================================================================
ERROR: Failure: ImportError (No module named _statistics)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\nose\loader.py", line 414, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Anaconda\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Anaconda\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\Users\jmschr\Documents\GitHub\lifelines\lifelines\tests\test_suite.py", line 26, in <module>
    from ..statistics import (logrank_test, multivariate_logrank_test,
  File "C:\Users\jmschr\Documents\GitHub\lifelines\lifelines\statistics.py", line 10, in <module>
    from lifelines._statistics import concordance_index as _cindex
ImportError: No module named _statistics

----------------------------------------------------------------------
Ran 1 test in 1.293s

FAILED (errors=1)

I am currently using the CoxPHFitter in my work, but there seem to be some areas that do not work.
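
For reference, since the title notes this was solved by installing a compiler on Windows: the missing _statistics module is the compiled Fortran extension, so (assuming a working compiler toolchain) it can be built in place before running the tests:

python setup.py build_ext --inplace
nosetests lifelines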

import error after installing from wheel (cython issue)

I installed from the lifelines-0.4.0.0-cp27-none-win32.whl file, and am now having the following issue. Lifelines 0.3 worked fine.

Traceback (most recent call last):
File ".\x.py", line 11, in
from lifelines.statistics import logrank_test
File "E:\Python27\lib\site-packages\lifelines\statistics.py", line 10,
from lifelines._statistics import concordance_index as _cindex

Cross validation should stratify for events

When doing cross-validation on censored data, it is important to stratify the pieces on the event variable, e.g. to keep the fraction of censored cases roughly equal between the pieces.

Example and explanation

For example, consider a (small) data set with the following event variables:

[0, 0, 1, 1, 1, 1, 1, 1]

25% of cases are censored. Now, doing repeated k-fold validation with k=4 can result in the following pieces:

[1, 1]
[1, 1]
[1, 1]
[0, 0]

Validating on the final piece can, by definition, never score better than random with the c-index.

Example implementation

One way to stratify is to first divide the data set by class (in this case, censored or event) and then do the k-fold division separately within each class. Then, each time around the loop, combine the pieces from both classes to form the training data.

As a reference, here is my old non-pandas solution (see also the scikit-learn sketch after this block). I am not suggesting this code as a replacement, just as a "brainstorming thing":

import numpy as np

# data (a 2-D array), eventcol, ntimes and kfold are assumed to be defined
# by the caller.
indices = np.arange(len(data))
# I generalized to an arbitrary number of classes for some reason...
classes = np.unique(data[:, eventcol])
classindices = {}
for c in classes:
    classindices[c] = indices[data[:, eventcol] == c]

for n in range(ntimes):
    # Re-shuffle the data every time
    for c in classes:
        np.random.shuffle(classindices[c])

    for k in range(kfold):
        valindices = []
        trnindices = []

        # Join the data pieces
        for p in range(kfold):
            if k == p:  # validation piece
                for idx in classindices.values():
                    # Piece length might be a decimal number, so round it off
                    plength = int(round(len(idx) / kfold))
                    valindices.extend(idx[p*plength:(p+1)*plength])
            else:  # training piece
                for idx in classindices.values():
                    plength = int(round(len(idx) / kfold))
                    trnindices.extend(idx[p*plength:(p+1)*plength])
        # Actual model stuff follows...
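
For comparison, here is a minimal sketch of the same stratification idea using scikit-learn's StratifiedKFold (an outside library with its modern API, not part of lifelines), stratifying the folds on the event indicator; the durations T and event flags E below are synthetic:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data: durations T and event indicators E (1 = event, 0 = censored).
T = np.random.exponential(10, size=100)
E = np.random.binomial(1, 0.75, size=100)

# Stratifying on E keeps the censoring fraction roughly equal across folds.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(T.reshape(-1, 1), E):
    # Every fold now mixes censored and uncensored observations.
    print("validation fold censoring fraction:", 1 - E[val_idx].mean())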

Aalen Additive Filter fails with non-standard indices on dataframe

Example of non-numeric index:

In [33]: aaf = ll.AalenAdditiveFitter()

In [34]: example
Out[34]: 
   duration done_feeding white
a        16         True  True
b         1         True  True
c         4        False  True
d         3         True  True
e        36         True  True

[5 rows x 3 columns]

In [35]: aaf.fit(example, duration_col='duration', event_col='done_feeding')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-4c043f45cf2b> in <module>()
----> 1 aaf.fit(example, duration_col='duration', event_col='done_feeding')

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in fit(self, dataframe, duration_col, event_col, timeline, id_col, show_progress)
    443 
    444         if id_col is None:
--> 445             self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
    446         else:
    447             self._fit_varying(dataframe,duration_col,event_col,id_col,timeline,show_progress)

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in _fit_static(self, dataframe, duration_col, event_col, timeline, show_progress)
    526 
    527             relevant_individuals = (ids==id)
--> 528             assert relevant_individuals.sum() == 1.
    529 
    530             #perform linear regression step.

AttributeError: 'bool' object has no attribute 'sum'

Example of numeric index where index isn't just 0 through n-1

In [36]: example.index = pd.Index([0, 1, 2, 4, 5])

In [37]: example
Out[37]: 
   duration done_feeding white
0        16         True  True
1         1         True  True
2         4        False  True
4         3         True  True
5        36         True  True

[5 rows x 3 columns]

In [38]: aaf.fit(example, duration_col='duration', event_col='done_feeding')
 [-----------------100%-----------------] 4 of 4 complete in 0.0 sec
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-38-4c043f45cf2b> in <module>()
----> 1 aaf.fit(example, duration_col='duration', event_col='done_feeding')

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in fit(self, dataframe, duration_col, event_col, timeline, id_col, show_progress)
    443 
    444         if id_col is None:
--> 445             self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
    446         else:
    447             self._fit_varying(dataframe,duration_col,event_col,id_col,timeline,show_progress)

/usr/local/lib/python2.7/dist-packages/lifelines/estimation.pyc in _fit_static(self, dataframe, duration_col, event_col, timeline, show_progress)
    526 
    527             relevant_individuals = (ids==id)
--> 528             assert relevant_individuals.sum() == 1.
    529 
    530             #perform linear regression step.

AssertionError: 

The same thing happens if I assign the index to be 1, 2, 3, 4, 5. When I assign the index to be 0, 1, 2, 3, 4 (the default), the code works.

From my brief look at the code, the bug stems from setting T to df[duration_col] (on line 485 of estimation.py). This gives T the index of the input dataframe, but the loop starting at line 415 assumes that id is between 0 and df.shape[0].

I'm only moderately familiar with pandas and I'm fairly busy finishing up my school year right now, but if this isn't fixed by mid-June I should have time to fix it then.
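
Until this is fixed, a workaround (an assumption based on the analysis above, not an official fix) is to give the DataFrame the default 0..n-1 index before fitting:

# _fit_static assumes ids run from 0 to df.shape[0] - 1, so resetting the
# index sidesteps the bug.
example = example.reset_index(drop=True)
aaf.fit(example, duration_col='duration', event_col='done_feeding')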

pip install lifelines - failure (solved: missing fortran)

Using Anaconda on both Linux Mint and Mac OS. pip install lifelines results in:

Downloading/unpacking lifelines
  Downloading lifelines-0.4.3.tar.gz (628kB): 628kB downloaded
  Running setup.py (path:/tmp/pip_build_root/lifelines/setup.py) egg_info for package lifelines
    build_src
    building extension "lifelines._statistics" sources
    f2py options: []
    f2py:> build/src.linux-i686-2.7/lifelines/_statisticsmodule.c
    Reading fortran codes...
        Reading file 'lifelines/_statistics.f90' (format:free)
    Post-processing...
        Block: _statistics
                Block: concordance_index
    Post-processing (stage 2)...
    Building modules...
        Building module "_statistics"...
            Creating wrapper for Fortran function "concordance_index"("concordance_index")...
            Constructing wrapper function "concordance_index"...
              cindex = concordance_index(event_times,predictions,event_observed,[rows])
        Wrote C/API module "_statistics" to file "build/src.linux-i686-2.7/lifelines/_statisticsmodule.c"
        Fortran 77 wrappers are saved to "build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f"
      adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
      adding 'build/src.linux-i686-2.7' to include_dirs.
      adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
    build_src: building npy-pkg config files

    warning: no files found matching '*' under directory 'styles'
    warning: no previously-included files matching '*.py[co]' found under directory '*'
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/lib/python2.7/dist-packages (from lifelines)
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages (from lifelines)
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7 (from lifelines)
Downloading/unpacking pandas>=0.14 (from lifelines)
  Downloading pandas-0.14.1.tar.gz (6.7MB): 6.7MB downloaded
  Running setup.py (path:/tmp/pip_build_root/pandas/setup.py) egg_info for package pandas

    warning: no files found matching 'README.rst'
    no previously-included directories found matching 'doc/build'
    warning: no previously-included files matching '*.so' found anywhere in distribution
    warning: no previously-included files matching '*.pyd' found anywhere in distribution
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '.git*' found anywhere in distribution
    warning: no previously-included files matching '.DS_Store' found anywhere in distribution
    warning: no previously-included files matching '*.png' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Requirement already satisfied (use --upgrade to upgrade): tornado in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6 in /usr/lib/python2.7/dist-packages (from matplotlib->lifelines)
Downloading/unpacking nose (from matplotlib->lifelines)
  Downloading nose-1.3.4.tar.gz (277kB): 277kB downloaded
  Running setup.py (path:/tmp/pip_build_root/nose/setup.py) egg_info for package nose

    no previously-included directories found matching 'doc/.build'
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/lib/python2.7/dist-packages (from pandas>=0.14->lifelines)
Installing collected packages: lifelines, pandas, nose
  Running setup.py install for lifelines
    unifing config_cc, config, build_clib, build_ext, build commands --compiler options
    unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
    build_src
    building extension "lifelines._statistics" sources
    f2py options: []
      adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
      adding 'build/src.linux-i686-2.7' to include_dirs.
      adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
    build_src: building npy-pkg config files
    customize UnixCCompiler
    customize UnixCCompiler using build_ext
    customize Gnu95FCompiler
    Found executable /usr/bin/gfortran
    customize Gnu95FCompiler
    customize Gnu95FCompiler using build_ext
    building 'lifelines._statistics' extension
    compiling C sources
    C compiler: i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

    compile options: '-Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
    i686-linux-gnu-gcc: build/src.linux-i686-2.7/fortranobject.c
    In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
    build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
    build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    error: Command "i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c build/src.linux-i686-2.7/fortranobject.c -o build/temp.linux-i686-2.7/build/src.linux-i686-2.7/fortranobject.o" failed with exit status 1
    Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lifelines/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-xn95nq-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building extension "lifelines._statistics" sources
f2py options: []
  adding 'build/src.linux-i686-2.7/fortranobject.c' to sources.
  adding 'build/src.linux-i686-2.7' to include_dirs.
  adding 'build/src.linux-i686-2.7/lifelines/_statistics-f2pywrappers.f' to sources.
build_src: building npy-pkg config files
running build_py
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/lifelines
copying lifelines/_cox_regression.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/generate_datasets.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/__init__.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/estimation.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/plotting.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/progress_bar.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/datasets.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/statistics.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/utils.py -> build/lib.linux-i686-2.7/lifelines
copying lifelines/_statistics.f90 -> build/lib.linux-i686-2.7/lifelines
copying lifelines/../README.md -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../README.txt -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../LICENSE -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../MANIFEST.in -> build/lib.linux-i686-2.7/lifelines/..
copying lifelines/../Untitled0.ipynb -> build/lib.linux-i686-2.7/lifelines/..
creating build/lib.linux-i686-2.7/datasets
copying lifelines/../datasets/static_test.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/psychiatric_patients.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/gehan.dat -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/dd.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/divorce.dat -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/Divorces Rates.ipynb -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/The Gehan Survival Data.ipynb -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/panel_test.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/canadian_senators.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/lung.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
copying lifelines/../datasets/2002FemResp.csv -> build/lib.linux-i686-2.7/lifelines/../datasets
running build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
customize Gnu95FCompiler
Found executable /usr/bin/gfortran
customize Gnu95FCompiler
customize Gnu95FCompiler using build_ext
building 'lifelines._statistics' extension
compiling C sources
C compiler: i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/build
creating build/temp.linux-i686-2.7/build/src.linux-i686-2.7
creating build/temp.linux-i686-2.7/build/src.linux-i686-2.7/lifelines
compile options: '-Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
i686-linux-gnu-gcc: build/src.linux-i686-2.7/fortranobject.c
In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
                    ^
compilation terminated.
In file included from build/src.linux-i686-2.7/fortranobject.c:2:0:
build/src.linux-i686-2.7/fortranobject.h:7:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
                    ^
compilation terminated.
error: Command "i686-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ibuild/src.linux-i686-2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c build/src.linux-i686-2.7/fortranobject.c -o build/temp.linux-i686-2.7/build/src.linux-i686-2.7/fortranobject.o" failed with exit status 1

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lifelines/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-xn95nq-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lifelines
Storing debug log for failure in /home/juan/.pip/pip.log
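
Note that the fatal error in this log is a missing Python.h, which usually means the Python development headers are absent. On Debian-based systems (an assumption based on the i686-linux-gnu-gcc toolchain in the log), a likely fix before retrying is:

sudo apt-get install python-dev   # provides Python.h for C extension builds
pip install lifelines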

Statistical tests statistically fail

Because we run real statistical tests as examples, each has probability p of failing purely by chance. This was okay before, but with tests running on multiple Python versions it's a problem.

The solution is to think of new ways to test these tests.
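
One simple mitigation (a suggestion, not necessarily the maintainers' chosen fix) is to seed the random number generator in the test suite so the sampled data, and hence the test statistics, are deterministic across runs:

import numpy as np

# Fix the RNG state so randomly generated test data is identical on every run.
np.random.seed(0)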

Unit tests fail with Pandas 0.14

As the title says, the unit tests fail with Pandas 0.14; they run successfully with 0.13.1, though. The Python version is 3.4, and here is a pip freeze of the other packages used:

matplotlib==1.3.1
numpy==1.8.1
scipy==0.13.3

The output is as follows:

======================================================================
ERROR: test_aalen_additive_median_predictions_split_data (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 477, in test_aalen_additive_median_predictions_split_data
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_dataframe_input_with_nonstandard_index (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 539, in test_dataframe_input_with_nonstandard_index
    aaf.fit(df, duration_col='duration', event_col='done_feeding')
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_large_dimensions_for_recursion_error (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 441, in test_large_dimensions_for_recursion_error
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_tall_data_points (__main__.AalenAdditiveModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 453, in test_tall_data_points
    aaf.fit(X)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 468, in fit
    self._fit_static(dataframe,duration_col,event_col,timeline,show_progress)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 521, in _fit_static
    n_deaths = len(non_censorsed_times)
TypeError: object of type 'zip' has no len()

======================================================================
ERROR: test_timeline_to_NelsonAalenFitter (__main__.StatisticalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 262, in test_timeline_to_NelsonAalenFitter
    with_list = naf.fit(T, C, timeline=timeline).cumulative_hazard_.values
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 67, in fit
    self._additive_f, self._variance_f, False)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 862, in _additive_estimate
    estimate_ = estimate_.reindex(timeline, method='pad').fillna(0)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/series.py", line 2028, in reindex
    return super(Series, self).reindex(index=index, **kwargs)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/generic.py", line 1624, in reindex
    method, fill_value, copy).__finalize__(self)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/generic.py", line 1641, in _reindex_axes
    labels, level=level, limit=limit, method=method)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/index.py", line 1375, in reindex
    limit=limit)
  File "/home/jonas/workspacepython/lifelines/pd1/lib/python3.4/site-packages/pandas/core/index.py", line 1264, in get_indexer
    raise ValueError('Must be monotonic for forward fill')
ValueError: Must be monotonic for forward fill

======================================================================
FAIL: test_multivariate_equal_intensities (__main__.StatisticalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 227, in test_multivariate_equal_intensities
    self.assertTrue(result is None)
AssertionError: False is not true
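
For context on the TypeError in the first four errors: in Python 3, zip returns a lazy iterator, so len() fails on it. A minimal illustration (the misspelled variable name follows the traceback; wrapping the iterator in list() restores the Python 2 behavior):

# Python 2: zip() returns a list, so len() works.
# Python 3: zip() returns an iterator, and len() raises TypeError.
non_censorsed_times = zip([1, 2, 3], [0, 1, 1])
# len(non_censorsed_times)                       # TypeError on Python 3

non_censorsed_times = list(non_censorsed_times)  # materialize the iterator
n_deaths = len(non_censorsed_times)              # works on both versions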

Censored count data estimation

e.g., we are interested in the distribution of the number of uses of a product (count data), but the observation window censors our view of each user's full history.

Memory blows up for some AAF datasets

# to get the latest version of lifelines
# pip install --upgrade git+https://github.com/CamDavidsonPilon/lifelines.git

import numpy as np
import pandas as pd

# This should break:
from lifelines.generate_datasets import *
from lifelines import AalenAdditiveFitter

# generate a fake, large dataset
n = 10000
d = 20
timeline = np.linspace(0, 70, 8000)
hz, coef, X = generate_hazard_rates(n, d, timeline)
X.columns = coef.columns
cumulative_hazards = pd.DataFrame(cumulative_quadrature(coef.values.T, timeline).T,
                                  index=timeline, columns=coef.columns)
T = generate_random_lifetimes(hz, timeline)
X['T'] = T
X['E'] = np.random.binomial(1, .99, n)
print("data created")
aaf = AalenAdditiveFitter(penalizer=1., fit_intercept=False)

aaf.fit(X)  # bfill is called in here.


###
# Here's the internal data structure that will fail on bfill

df = X.copy()
df['id'] = range(df.shape[0])
df = df.set_index(['T', 'id'])  # note: the duration column is 'T'
wp = df.to_panel()
# calling bfill on wp should fail, or hang terribly.

AalenAdditiveFitter fit is extremely slow for large matrices

If X is large, like (500k, 4), the fit method will loop over all of the rows, despite the fact that 99% of them are censored. In estimation.py, the loop I'm talking about starts on line 320:
for i, time in enumerate(sorted_event_times):

The algebra is way beyond my understanding :) Maybe I'm missing something, but why is sorted_event_times so large?

Multiple comparisons testing

Multiple-comparisons corrections with something like Bonferroni would be useful. This would also require generating p-values for the logrank statistic from the chi-squared distribution; see the sketch below.
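
A minimal sketch of both pieces, assuming k pairwise logrank statistics have already been computed (the statistic and degrees of freedom below reuse the example output earlier on this page; the other p-values are hypothetical):

from scipy import stats

# p-value for a logrank statistic from the chi-squared distribution.
test_statistic, df = 4.106, 2
p = stats.chi2.sf(test_statistic, df)  # survival function = 1 - CDF, ~0.128

# Bonferroni: with k comparisons, test each at level alpha / k.
p_values = [p, 0.012, 0.200]           # hypothetical pairwise p-values
alpha, k = 0.05, len(p_values)
rejections = [pv < alpha / k for pv in p_values]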

logrank_test requires matching time values

If I call the logrank_test function like this on uncensored data...
logrank_test(T[gen0], T[gen1])
I get the following indexing error (I removed some of the stack for brevity).

IndexError                                Traceback (most recent call last)
<ipython-input-166-59a3462b6f73> in <module>()
      1 from lifelines.statistics import logrank_test
----> 2 logrank_test(T[gen0],T[gen1])
      3 T[gen0]

~/lifelines/lifelines/statistics.py in logrank_test(event_times_A, event_times_B, censorship_A, censorship_B, t_0)
     46       pass
     47     try:
---> 48       y_2 = Y_2.loc[t]
     49     except KeyError:
     50       pass

IndexError: index out of bounds

I think what is going on is that you are requiring the two survival curves to have the same event-time values. If the death events were observed at different times for the different curves, this function throws an indexing error.

If you want to replicate my error, try the following data:
T[gen1] = array([6., 13., 13., 13., 19., 19., 19., 26., 26., 26., 26., 26., 33., 33., 47., 62., 62., 9., 9., 9., 15., 15., 22., 22., 22., 22., 29., 29., 29., 29., 29., 36., 36., 43.])

T[gen0] = array([33., 54., 54., 61., 61., 61., 61., 61., 61., 61., 61., 61., 61., 61., 69., 69., 69., 69., 69., 69., 69., 69., 69., 69., 69., 32., 53., 53., 60., 60., 60., 60., 60., 68., 68., 68., 68., 68., 68., 68., 68., 68., 68., 75., 17., 51., 51., 51., 58., 58., 58., 58., 66., 66., 7., 7., 41., 41., 41., 41., 41., 41., 41., 48., 48., 48., 48., 48., 48., 48., 48., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 56., 63., 63., 63., 63., 63., 63., 63., 63., 63., 69., 69., 38., 38., 45., 45., 45., 45., 45., 45., 45., 45., 45., 45., 53., 53., 53., 53., 53., 60., 60., 60., 60., 60., 60., 60., 60., 60., 60., 60., 66.])

date transformer

It would be nice to have the following function:

T, C = datetime_transform(start_dates, end_dates)
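
A minimal sketch of what such a helper could look like (the function name, signature, and censoring convention are all assumptions; here, a missing end date is treated as censored at the current time):

import numpy as np
import pandas as pd

def datetime_transform(start_dates, end_dates, freq='D'):
    """Hypothetical helper: convert start/end dates into durations T and
    an event indicator C (1 = event observed, 0 = censored)."""
    start = pd.to_datetime(pd.Series(start_dates))
    end = pd.to_datetime(pd.Series(end_dates))
    C = end.notnull().astype(int)          # missing end date => censored
    end = end.fillna(pd.Timestamp.now())   # censor open intervals at "now"
    T = (end - start) / np.timedelta64(1, freq)
    return T.values, C.values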

Bug in Variance of Nelson-Aalen estimator

First of all, I'm happy to see this library in Python; when I have time I'd love to contribute. I want to put the code to real use, and before doing so I first want to cross-validate the results against R.

It appears that the _variance_f function of Nelson-Aalen estimator returns a wrong value:

return (1.*d/N)**2

It should be:

return d/N/N

You may want to investigate this further, because I found this just from reading the code and I haven't run any code to prove that the results are wrong. (Note that the two expressions agree only when d = 1; with tied deaths, d > 1, they diverge.)

Install Error - KeyError: 'FARCHFLAGS'

Hello,

I'm trying to install lifelines and am receiving the following:

customize UnixCCompiler
customize UnixCCompiler using build_ext
customize Gnu95FCompiler
Found executable /usr/local/bin/gfortran
Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip_build_root/lifelines/setup.py", line 57, in <module>
        ext_modules=[ext_fstat]
...
  File "/Library/Python/2.7/site-packages/numpy-override/numpy/distutils/fcompiler/gnu.py", line 274, in _universal_flags
    farchflags = os.environ['FARCHFLAGS']
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'FARCHFLAGS'

I've tried building/installing from setup.py and pip; both return the same error. Looking at line 57 of setup.py eventually led me to _statistics.f90, which I believe is a Fortran file. I removed my Fortran compiler and re-installed it via Homebrew (brew install gcc), but that did not help either. I've also tried the following with no luck:

$ export FARCHFLAGS="-arch x86_64"
$ export CFLAGS="-arch i386 -arch x86_64" 
$ export LDFLAGS="-Wall -undefined dynamic_lookup -bundle -arch i386 -arch x86_64"
$ export FFLAGS="-arch i386 -arch x86_64"

Any help would be greatly appreciated! I'm using OS X Mavericks with Command Line Tools and all required packages installed.

python3 support for datasets.py

lifelines seems to work fine with Python 3, but when I try the Quickstart in Python 3 it complains on:

from lifelines.datasets import waltons_data

A quick 2to3 fixes the problem and I can run the Quickstart without any trouble.

thanks for all the great work!

Prediction for time-varying covariates

It's not completely clear to me how to do prediction with time-varying covariates. By prediction, I am referring to constructing a hazard curve given the covariates. Consider the static case:

hz(t) = b1(t)*X_1 + b2(t)*X_2 + ...

This works because I can extend t as far as I want and still only need to know (X_1, X_2, ...). For time-varying covariates, I can only extend as far as the observed covariates.

Access to CoxPH loglikelihood

From an email

The only thing is there's no way to get the log likelihood even if I'm willing to wait for it. In _newton_rhaphson, you're unpacking only two values out of _get_efron_values. So even if I set the argument include_likelihood to True for _get_efron_values, I'm getting a ValueError: too many values to unpack in _newton_rhaphson.

README graph doesn't reflect code

@Cmrn_DP actually I meant the way one of the data series in my last graph is truncated at 12, but both of yours are, not the test itself

- Andrew Clegg (@andrew_clegg) June 13, 2014

KMF legend labels do not play nice with LaTeX

The labels "_upper_0.95" and "_lower_0.95" break plotting if LaTeX is enabled:

import matplotlib as mpl
mpl.rcParams['text.usetex'] = True

from lifelines import KaplanMeierFitter

# Create a Kaplan-Meier fitter, fit it, then plot...
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=C)  # T, C as in the other examples
kmf.plot(ax=ax, c="#A60628", ci_force_lines=True)

The problem is that LaTeX hates underscores... The text might also look bad if matplotlib interprets the legend as math mode (not sure that it does).
One could imagine something like this:

import matplotlib as mpl

# No more underscores
label = "{} upper 0.95".format(actual_label)
# If TeX is enabled, wrap the label in a text environment
if mpl.rcParams['text.usetex']:
    label = r"\text{%s}" % label  # raw string, so \t isn't a tab
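
An alternative (a suggestion, not lifelines' eventual fix) is to escape the underscores so LaTeX renders them literally:

# \_ is a literal underscore in LaTeX, so no text environment is needed.
label = actual_label.replace("_", r"\_")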

Docs

Need to make the user feel empowered.

analysis on multiple survival curves simultaneously

Use the groupby functionality of pandas DataFrames to generate a bunch of survival functions, run the stats, and plot the curves for anything significantly different from a control population; see the sketch below.

Walton (private correspondence)
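
A minimal sketch of this workflow, using the current lifelines API (an assumption; the signatures at the time of this issue differed) and synthetic data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Synthetic data: durations T, event flags E, and a grouping column.
df = pd.DataFrame({
    'T': np.random.exponential(10, 300),
    'E': np.random.binomial(1, 0.8, 300),
    'group': np.random.choice(['control', 'a', 'b'], 300),
})

fig, ax = plt.subplots()
kmf = KaplanMeierFitter()
for name, grouped in df.groupby('group'):
    kmf.fit(grouped['T'], event_observed=grouped['E'], label=str(name))
    kmf.plot(ax=ax)  # overlay each group's curve on the same axes
# (running the stats per pair is left to logrank_test / multivariate_logrank_test)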

predict_expectation incorrect on AalenAdditiveFilter

predict_expectation does not agree with the following code:

from scipy.integrate import simps

def predict_expectation(aaf, covariates):
    # E[T] is the integral of the survival function over the timeline.
    sfs = aaf.predict_survival_function(covariates)
    xvals = sfs.index.values
    return simps(sfs.values, xvals, axis=0)

For this curve, http://imgur.com/PhJH3Tp, the current predict_expectation method returns 50 or 60, while the method above returns the much more reasonable value of 16.

You should be able to test this by doing a Kaplan-Meier fit with no censored data points, applying the quadrature method to kmf.survival_function_ (and its index), and noticing that the value of that expectation is not equal to the mean duration of the data set (since all points are uncensored, the KM fit should just be 1 - the CDF of the dataset).
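
A minimal sketch of that check, assuming synthetic uncensored data (the quadrature via scipy's simps mirrors the snippet above):

import numpy as np
from scipy.integrate import simps
from lifelines import KaplanMeierFitter

# With no censoring, the KM estimate is 1 - ECDF, so integrating the
# survival function should approximately recover the sample mean.
T = np.random.exponential(10, size=5000)
kmf = KaplanMeierFitter()
kmf.fit(T)  # all points treated as observed events

sf = kmf.survival_function_
expectation = simps(sf.values[:, 0], sf.index.values)
print(expectation, T.mean())  # these should be close; the bug is that
                              # predict_expectation disagrees with both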
