dado93 / pywearable

Python package for extraction, visualization, and analysis of physiological data collected through wearable sensors.

Home Page: https://pywearable.readthedocs.io/en/latest/

License: MIT License

Python 91.64% Jupyter Notebook 8.36%

data-visualization feature-extraction wearable-devices wearable-sensors

pywearable's Introduction

pywearable

Do you need to deal with wearable data? Then you are in the right place! In this repository you will find a Python package that you can use to analyse all data collected with several data sources. At the moment, we support:

The aim of this Python package is to offer a series of functions for the loading and analysis of the data. Furthermore, the aim is to build a command line interface (CLI) for data extraction and a web-based dashboard for:

visualization of physiological data
automatic analysis of data with feature extraction
summary of data

Installation

At this stage of development, the package has not yet been uploaded to PyPI, and thus can only be installed through a local installation:

Clone the repository with git clone git@github.com:dado93/pywearable.git
cd pywearable
pip install --editable .

Usage

The package was designed to be used in the most straightforward way:

import pywearable.loader

_BASE_FOLDER = "..." # Path to folder with data downloaded from Labfront

labfront_loader = pywearable.loader.LabfrontLoader(_BASE_FOLDER)
sleep_summaries = labfront_loader.load_garmin_connect_sleep_summary("user-01")

Contributing

Documentation

The format for documentation is numpy-style. The documentation of the package can be built locally using sphinx.

cd docs
make html
Open in your browser the file index.html that you can find in docs/_build/html

pywearable's Issues

Respiration statistics

Similar to what is available in the pywearable.sleep module, it would be good to integrate a get_respiration_statistics and get_respiration_statistic in the respiration module, providing a summary of all the metrics related to respiration.

Move processing functions to separate module

The filter_bbi and _filter_out_awake_bbi functions should be moved to a separate module, in order to keep the cardiac module "cleaner".

Handle breaths per minute equal to 0

With the current implementation, it is possible to handle breaths-per-minute values equal to 0 by setting the parameter remove_zero to True in the respiration submodule functions. It would be interesting to find out why these values are present and whether there is a better way to handle them.

Consistency across modules for parameter names

We need to have consistency across all the modules of pylabfront for the names of the parameters. Examples:

sometimes we have start_dt and others start_date
sometimes we have user_ids and others participants_ids

We need to be consistent with the names of the parameters.

Load Labfront Health Snapshot from Health API

Labfront provides Health Snapshot data from the Garmin Health API, but this feature is currently not available in pywearable.loader.LabfrontLoader.

Handle missing labfront ID in folder name

The current implementation of the LabfrontLoader requires the presence of a Labfront ID (37 digits long) in the name of the folder for a given user. If this is not present, then both the user_id and the labfront_id are not correct. Example of a folder with labfront data that triggers this behavior:

sample_data

    sample_participant_01

How to replicate

Use the sample_data folder contained in the repository.

sample_folder = 'sample_data'
labfront_loader = pylabfront.loader.LabfrontLoader(sample_folder)
print(labfront_loader.get_ids())
(['sampl'], ['sample-participant-01'])

Potential solution

We need to change the implementation of the pylabfront.loader.LabfrontLoader.get_ids() function, checking for the presence of a Labfront Id in the folder name using a regular expression. If no Labfront ID is present, then it is set to ''. So in the provided example, we would get:

sample_folder = 'sample_data'
labfront_loader = pylabfront.loader.LabfrontLoader(sample_folder)
print(labfront_loader.get_ids())
(['sample-participant-01'], [''])
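
A minimal sketch of the regular-expression check described above. It assumes that a Labfront ID is a UUID-shaped suffix joined to the user ID by an underscore; split_folder_name is a hypothetical helper, not a function of the package.

```python
import re

# Assumption: a Labfront ID is a UUID appended to the name after an underscore.
_LABFRONT_ID_PATTERN = re.compile(
    r"^(?P<user_id>.+)_(?P<labfront_id>"
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def split_folder_name(folder_name):
    """Return (user_id, labfront_id); labfront_id is '' when none is found."""
    match = _LABFRONT_ID_PATTERN.match(folder_name)
    if match is None:
        # No Labfront ID in the name: the whole folder name is the user ID.
        return folder_name, ""
    return match.group("user_id"), match.group("labfront_id")


print(split_folder_name("sample-participant-01"))
print(split_folder_name("user-01_8a5325fa-76ca-446f-a646-4e80aa3fb258"))
```

Anchoring the pattern at the end of the name keeps user IDs that themselves contain underscores intact.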

Questionnaire adherence error in number of filled questionnaires

When computing the adherence to a given questionnaire, the function adherence.get_questionnaire_adherence() does not compute the proper number of filled questionnaires. For example, if we have the following entries in a questionnaire csv file:

timezone,unixTimestampInMs,isoDate,1_1,1_2
Europe/Rome,1683613800978,2023-05-09T08:30:00.978+02:00,3,3
Europe/Rome,1683703939354,2023-05-10T09:32:19.354+02:00,4,4
Europe/Rome,1683791147618,2023-05-11T09:45:47.618+02:00,2,2
Europe/Rome,1683874852834,2023-05-12T09:00:52.834+02:00,4,4
Europe/Rome,1683960327499,2023-05-13T08:45:27.499+02:00,2,3
Europe/Rome,1684054393961,2023-05-14T10:53:13.961+02:00,1,2
Europe/Rome,1684138111504,2023-05-15T10:08:31.504+02:00,2,3
Europe/Rome,1684220484540,2023-05-16T09:01:24.540+02:00,4,4
Europe/Rome,1684306555054,2023-05-17T08:55:55.054+02:00,2,2
Europe/Rome,1684394886647,2023-05-18T09:28:06.647+02:00,3,3

and we call the function in this way:

today = datetime.datetime(2023, 5, 18)
start_date = datetime.datetime(2023, 5, 8)
number_of_days = (today - start_date).days

questionnaire_adherence = adherence.get_questionnaire_adherence(
    labfront_loader,
    number_of_days,
    start_date,
    today,
    participant_ids="all",
    questionnaire_names="all",
    return_percentage=False,
)

I get the following result: 'questionnaire': {'total': 10, 'n_filled': 9}.

The problem is related to the main LabfrontLoader.get_data_from_datetime function, which uses the exact datetime object that is passed to load data. So, today = datetime.datetime(2023, 5, 18) means today = datetime.datetime(2023, 5, 18, 0, 0, 0), and thus no data are loaded for 2023/5/18.
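
The effect can be reproduced with the standard library alone. The sketch below compares the last questionnaire entry against a midnight end bound and against an end-of-day bound; extending the bound to the end of the day is only one possible fix and is an assumption, not the loader's actual behavior.

```python
import datetime

today = datetime.datetime(2023, 5, 18)  # midnight: 2023-05-18 00:00:00
last_entry = datetime.datetime(2023, 5, 18, 9, 28, 6)  # last questionnaire row

# The entry falls after the midnight bound, so it is excluded.
included_with_midnight = last_entry <= today

# Hypothetical fix: extend the end bound to the last instant of the day.
end_of_day = datetime.datetime.combine(today.date(), datetime.time.max)
included_with_end_of_day = last_entry <= end_of_day

print(included_with_midnight, included_with_end_of_day)
```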

Handle questionnaire ID and name

With the current implementation, it is not possible to pass the Labfront ID of the questionnaire as a parameter to the questionnaire functions of the loader.py module. The functions should be updated to allow for this, and to raise an error if the name is passed as argument and duplicate questionnaires with the same name are present.

Load Labfront HRV data from Garmin Health API

Garmin provides HRV data from Health API for high-end smartwatches (see this article from tryterra: https://blog.tryterra.co/garmin-hrv-data-available-from-terra-api-46e8ce2ae9b4). We need to provide APIs to load these data

Load Movesense data from Labfront

Labfront supports the use of Movesense sensors: no support is provided for this wearable in the class pywearable.loader.LabfrontLoader. It must be integrated to allow data loading from this wearable.

Handle both User ID, Labfront ID, and full user ID

At the moment, it is not possible to use either user ID, labfront ID, or full ID when calling functions.
The way in which this should be handled is to create a helper function that, based on the value of the parameter passed to the function, checks whether or not it contains a Labfront ID, and always returns a full ID. This should also handle duplicate User IDs, by raising an error in case a User ID is passed as an argument and multiple User IDs exist with the same value. In this case, the Labfront ID or the full ID should be used.
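
A minimal sketch of such a helper. It assumes that a full ID has the form "<user_id>_<labfront_id>" and that the list of known full IDs is available; resolve_full_id and its arguments are hypothetical names.

```python
def resolve_full_id(identifier, full_ids):
    """Map a user ID, Labfront ID, or full ID to the matching full ID."""
    if identifier in full_ids:
        return identifier  # already a full ID
    # Assumption: the Labfront ID follows the last underscore of a full ID.
    matches = [
        full_id for full_id in full_ids if identifier in full_id.rsplit("_", 1)
    ]
    if not matches:
        raise KeyError(f"No participant matches {identifier!r}")
    if len(matches) > 1:
        raise ValueError(
            f"{identifier!r} is ambiguous; use the Labfront ID or the full ID"
        )
    return matches[0]


known_ids = ["user-01_aaaa", "user-01_bbbb", "user-02_cccc"]
print(resolve_full_id("user-02", known_ids))
print(resolve_full_id("bbbb", known_ids))
```

Raising on ambiguity, instead of picking the first match, keeps duplicate User IDs from silently loading the wrong participant.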

Remove questionnaire-specific functions from visualization module

In the visualization module, there are functions specific for questionnaires that were done in a certain project for data collection.
As the module should be focused on being general-purpose, these functions should be removed, and only functions applicable to questionnaires in general should be left there.

get_sleep_timestamps does not work with None as dates

If I run the following code:

loader = pylabfront.loader.LabfrontLoader(BASE_FOLDER)

start_date = None
end_date = None
user_id = ...

pylabfront.sleep.get_sleep_timestamps(loader, start_date, end_date, user_id)

I get the following error:

TypeError: unsupported operand type(s) for -: 'NoneType' and 'datetime.timedelta'

This is due to the fact that no check is done on the input parameters, even though None is set as default value for both start_date and end_date.
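
The traceback suggests that a datetime.timedelta is subtracted from start_date before any check. A minimal sketch of a guard, assuming None means "no bound"; pad_date_range and the 12-hour margin are hypothetical and only illustrate the check.

```python
import datetime


def pad_date_range(start_date=None, end_date=None,
                   margin=datetime.timedelta(hours=12)):
    """Widen the range by margin, leaving open (None) bounds untouched."""
    padded_start = None if start_date is None else start_date - margin
    padded_end = None if end_date is None else end_date + margin
    return padded_start, padded_end


print(pad_date_range(None, None))
print(pad_date_range(datetime.datetime(2023, 5, 18), None))
```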

New LabfrontLoader fails with movesense folder

The new implementation of the LabfrontLoader fails to initialize if there is a movesense-stream folder among the metrics.

Installation via pip

For the 0.1 release of pywearable, we need to offer users and developers the possibility to install it via pip, without manually downloading it from Github and installing.
In addition to creating the required pipeline to install the package via pip, we also need to add a requirements-dev.txt file with the requirements for developers wishing to contribute to the project.

Get_time_in_bed doesn't get the right period of interest

At the moment the function get_time_in_bed in the sleep module, and most other functions related to yasa, return a dictionary with keys that extend well beyond the period of interest.

The dates where the value isn't None are quite possibly correct.

Possibly in the function get_sleep_statistics called by the yasa funcs there's something setting off the time period? In a first look I'm not sure about the intervals variable in it.

Returning ordered dictionaries instead of dicts

In the current implementations most processing functions return metrics as dict (with keys the date and values the daily metric when kind isn't specified), pd.DataFrame, or multi-index pd.DataFrame.

However, when a dict is returned, it is not guaranteed to be ordered by date, although it generally is. A Python dict preserves insertion order (guaranteed since Python 3.7), not key order, so the dates come out sorted only if they were inserted in sorted order.

Several functions in the current implementation of pywearable iterate over (key, value) pairs through the .items() method of the dictionaries, assuming ordering by date. As detailed above, this might fail in some cases, and it would be safer to enforce the ordering explicitly. This could be done by initializing the data_dict variables in the processing functions as collections.OrderedDict and inserting the entries sorted by date.
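
Note that collections.OrderedDict remembers insertion order and does not sort its keys, so the entries still have to be sorted before they are returned. A minimal sketch, with order_by_date as a hypothetical helper:

```python
import datetime
from collections import OrderedDict


def order_by_date(data_dict):
    """Return a copy of data_dict whose entries are sorted by date key."""
    return OrderedDict(sorted(data_dict.items(), key=lambda item: item[0]))


# Entries inserted out of order on purpose.
daily_metric = {
    datetime.date(2023, 5, 19): 7.5,
    datetime.date(2023, 5, 18): 6.0,
}
for day, value in order_by_date(daily_metric).items():
    print(day, value)
```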

Change documentation style to Numpy

All the documentation should be changed to Numpy style.

Tz-aware timestamps

Labfront uses different time representations based on the collected metric. In Labfront files, we can find the following time information:

timezone - The geographic region based on the device settings. Ex: Asia/Taipei
timezoneOffsetInMs - The offset of the timezone in ms. Ex: 3600000
unixTimestampInMs - The timestamp of the corresponding data type in Unix time. It has 13 digits because it displays milliseconds. Ex: 1676951478000
isoDate - The converted local time in the standard format used to express a numeric calendar date. Ex: 2023-02-21 11:51:18.000 +08:00

Currently, we handle the time representation in the following ways:

If we have a timezoneOffsetInMs column, we add it to the unixTimestampInMs column. Then, we convert the obtained Unix time to a tz-naive datetime format using pd.to_datetime
Instead, if we have a timezone column, we use it to perform a tz_convert grouped by time zones, and then we perform a tz_localize operation with no timezone, thus obtaining a tz-naive datetime.

These solutions were implemented because:

Converting the isoDate string to datetime format using pd.to_datetime was too slow
Labfront has a bug in the isoDate column when there are different time zones in the same csv file

The problem is that the resulting datetime representation is tz-naive, and does not contain any kind of information about the time zone. The following issues should be addressed:

When there is a timezoneOffsetInMs column, convert Unix time into a tz-aware datetime representation
When there is a timezone column, use it in the tz_localize operation. The current problem is that, if there are multiple timezones in the same file, the resulting column of the DataFrame will have a dtype of object and not of datetime
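
For the timezoneOffsetInMs case, the conversion can be sketched with the standard library, using the example values listed above. to_tz_aware is a hypothetical helper; in pandas, pd.to_datetime(..., unit="ms", utc=True) yields the corresponding tz-aware UTC column.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_tz_aware(unix_timestamp_ms, timezone_offset_ms):
    """Build a tz-aware datetime from Unix time and a fixed offset, both in ms."""
    utc_time = _EPOCH + timedelta(milliseconds=unix_timestamp_ms)
    local_tz = timezone(timedelta(milliseconds=timezone_offset_ms))
    # The instant is unchanged; only its representation moves to local_tz.
    return utc_time.astimezone(local_tz)


# unixTimestampInMs example from above, with a +08:00 offset (Asia/Taipei).
print(to_tz_aware(1676951478000, 8 * 3600 * 1000).isoformat())
```

Unlike adding the offset to the Unix time, this keeps the instant and the offset as separate pieces of information.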

Inclusiveness wrt specific datetime and avoiding timedeltas

Currently some functions behave differently in terms of inclusiveness with respect to the end_date calendar date, using datetime.timedelta objects to obtain the desired behavior. Their behavior should be uniform, and "under the rug" filtering in the function definitions should be avoided.

In an upcoming update, inclusiveness from pylabfront.loader.LabfrontLoader.load_from_datetime will be updated. By that time, all functions (especially in the activity module) should be checked to behave accordingly.

Union types as X | Y cause dependency errors when using python 3.9

Writing union types as X | Y is supported only from Python 3.10, but setup currently requires only Python 3.9.
Furthermore, using Python versions >= 3.10 leads to errors with pyhrv dependencies (tkinter).

The easiest solution would be to avoid using the syntax X | Y in the function description. The only case where this is used for now is in the module pywearable.loader.labfront.loader.
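
A minimal sketch of the Python 3.9-compatible spelling, using typing.Union and typing.Optional. The DateLike alias and describe_range are hypothetical and do not come from the loader module.

```python
import datetime
from typing import Optional, Union

# Equivalent to datetime.datetime | datetime.date | str on Python >= 3.10.
DateLike = Union[datetime.datetime, datetime.date, str]


def describe_range(start_date: Optional[DateLike] = None,
                   end_date: Optional[DateLike] = None) -> str:
    """Toy function whose annotations are valid on Python 3.9."""
    return f"{start_date} -> {end_date}"


print(describe_range("2023-05-18", None))
```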

get_data_from_datetime with calendar day

With the current implementation, the get_data_from_datetime does not work as expected when loading sleep summary as the presence of the end_date parameter causes the loading of multiple sleep summaries. This should be fixed by creating a separate function accepting only calendar day as argument.

Questionnaire ID - data loader

Problem

With the current implementation, the name of the questionnaire/todo to be used in all the functions requires both the name of the questionnaire that was set in Labfront and the questionnaire/todo Labfront ID, which is very inconvenient.

Proposed solution

A proposed work-around is to read the header from the questionnaire/todo csv file, which contains the questionnaireName value, that can then be used in the function loader.get_time_dictionary()

Accept multiple formats for dates

In all the package, the functions require a datetime.datetime variable for both start_date and end_date.
I propose to change this behavior to accept all the following formats:

datetime.datetime
datetime.date
str: to be parsed with dateutil.parser
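
A minimal sketch of a normalizing helper for the three formats. To stay within the standard library it parses strings with datetime.datetime.fromisoformat instead of the proposed dateutil.parser, so only ISO 8601 strings are accepted here; to_datetime is a hypothetical name.

```python
import datetime


def to_datetime(value):
    """Normalize datetime.datetime, datetime.date, or ISO string to datetime."""
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        return value
    if isinstance(value, datetime.date):
        return datetime.datetime.combine(value, datetime.time.min)
    if isinstance(value, str):
        # Stand-in for dateutil.parser: ISO 8601 strings only.
        return datetime.datetime.fromisoformat(value)
    raise TypeError(f"Unsupported date type: {type(value).__name__}")


print(to_datetime(datetime.date(2023, 5, 18)))
print(to_datetime("2023-05-18T09:30:00"))
```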

get_sleep_summary_graph first and last empty days

When calling the pywearable.get_sleep_summary_graph function, if sleep data are missing from one of the days between start_date and end_date, but excluding start_date and end_date, then an empty entry is added. We need to have the same behavior also if sleep data of start_date or end_date is missing.

Load Labfront Accelerometer data from Garmin SDK

Labfront provides accelerometer data from Garmin SDK: they must be made available with pywearable.loader.LabfrontLoader

Refactor sleep module to avoid yasa dependency

With the current implementation, pylabfront relies on yasa for the computation of sleep statistics. These functions require the generation of hypnograms through pylabfront.loader.LabfrontLoader.load_hystogram. This operation takes time, so it would be better to avoid it. For this reason, we aim at refactoring the sleep module to avoid such a dependency and implement the computation of sleep metrics directly from sleep summaries and sleep stages.

Integrate bedtime and wakeup time statistics in get_sleep_statistics

At the moment bedtime and wake-up time are not returned in the general pywearable.sleep.get_sleep_statistics.
They should be added.

In order to do so a sleep._compute_sleep_timestamp must be added, which will require a sleep_summary parameter as other compute functions.
The main difference to take into account for the calculation of statistics is that the parameter kind can't be applied directly, as is the case for other sleep statistics, but must be mapped to functions in the utils module such as utils.get_earliest_bedtime.

This mapping could be done by creating a dictionary (kind_mapping) that includes only the sleep metrics which behave differently w.r.t. the kind parameter. When kind needs to be applied for the calculation of sleep_statistics, it should simply be preceded by something like kind = kind_mapping.get(sleep_metric, kind), so that for metrics which require a different treatment, such as bedtime and wake-up time, the proper function is used.

Document package

We need to have a full documentation of the Python package before releasing version 0.1.
All the pylabfront modules have to be properly documented, following numpy-style documentation. The documentation will then be made available through ReadTheDocs with automatic pipelines.

Eliminate dependency from july

At the moment the module pywearable.visualization imports the july library to get yearly sleep and stress heatmaps.
However the functions provided by july are quite rigid; furthermore, they seem to be not so hard to replicate.
A quick inspection in the july repo shows that indeed only other (more common) dependencies of pywearable (matplotlib, numpy) are needed for such plots.

Statistics based on physiological cycles and not on midnight-to-midnight

The statistics of the cardiac and respiration modules are computed over a timeframe that goes from midnight to midnight. It would be interesting to offer two different functions for each statistic:

get_daily_XX, computing the value of statistic XX from midnight to midnight (current behavior)
get_cycle_XX, computing the value of statistic XX from bedtime to the following bedtime (physiological cycle)

Of course, in order to compute statistics on a physiological cycle level, we need to:

have sleep data available
define maximum duration for a physiological cycle (e.g., what happens if a user skips a night of sleep?)

https://support.whoop.com/s/article/WHOOP-Cycles?language=en_US

What do you think @Vaeliss ?

sleep.get_total_sleep_time returns 0 instead of nan

The function pywearable.sleep.get_total_sleep_time returns 0 when all the columns of the sleepSummary are set to NaN. The function must return NaN in this case.
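
One possible cause, stated here as an assumption, is that summing only missing values yields 0 (pandas Series.sum behaves this way unless min_count=1 is passed). A minimal standard-library sketch of the intended behavior, with total_sleep_time as a hypothetical helper:

```python
import math


def total_sleep_time(stage_durations):
    """Sum the stage durations, returning NaN when every value is missing."""
    valid = [duration for duration in stage_durations if not math.isnan(duration)]
    if not valid:
        return math.nan
    return sum(valid)


print(total_sleep_time([math.nan, math.nan, math.nan]))
print(total_sleep_time([3600.0, math.nan, 1800.0]))
```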

Multiple sleep summary rows for the same day

In case there are multiple rows for a given calendar date in a sleep summary file, we need to be sure that only the one related to the night of interest is retrieved.

loader_kwargs as parameter in processing functions

Processing functions always retrieve data through a specific loading function. Each loader, which has pywearable.loader.BaseLoader as its base class, may require additional parameters in the loading function call. The functions used for processing data, such as pywearable.sleep.get_sleep_statistic, should therefore accept a loader_kwargs argument containing keyword arguments to be passed to the loading function.
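
A minimal sketch of the pattern. DummyLoader, its load_sleep_summary method, the same_day_filter parameter, and get_sleep_statistic_sketch are all hypothetical stand-ins and do not reflect the actual pywearable API.

```python
class DummyLoader:
    """Stand-in for a loader; not the real pywearable.loader.BaseLoader."""

    def load_sleep_summary(self, user_id, start_date=None, end_date=None,
                           same_day_filter=True):
        # A loader-specific option arrives here through loader_kwargs.
        return {"user_id": user_id, "same_day_filter": same_day_filter}


def get_sleep_statistic_sketch(loader, user_id, start_date=None,
                               end_date=None, loader_kwargs=None):
    """Forward loader-specific keyword arguments to the loading function."""
    # Use None as the default to avoid sharing one mutable dict across calls.
    loader_kwargs = {} if loader_kwargs is None else loader_kwargs
    return loader.load_sleep_summary(
        user_id, start_date, end_date, **loader_kwargs
    )


print(get_sleep_statistic_sketch(DummyLoader(), "user-01"))
print(get_sleep_statistic_sketch(
    DummyLoader(), "user-01", loader_kwargs={"same_day_filter": False}
))
```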

Wrong number of nights in get_nights_adherence

The function get_nights_adherence in the adherence module has the following line:

num_nights = (end_date - start_date).days - 1

But I think that the proper way to compute the number of nights is:
num_nights = (end_date - start_date).days

For example:

start_date = datetime.datetime(2023, 5, 18)
end_date = datetime.datetime(2023, 5, 23)

The nights are:

18/05->19/05
19/05->20/05
20/05->21/05
21/05->22/05
22/05->23/05

Thus, the number of nights is equal to 5, which is the same as doing (end_date - start_date).days.
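
The count can be checked by enumerating the nights explicitly:

```python
import datetime

start_date = datetime.datetime(2023, 5, 18)
end_date = datetime.datetime(2023, 5, 23)

num_nights = (end_date - start_date).days

# Each night runs from one calendar day to the next.
nights = [
    (start_date + datetime.timedelta(days=i),
     start_date + datetime.timedelta(days=i + 1))
    for i in range(num_nights)
]
for evening, morning in nights:
    print(f"{evening:%d/%m}->{morning:%d/%m}")
print(num_nights)
```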

Respiration data does not work on nightly-basis

The functions in the respiration submodule of pylabfront compute mean breaths per minute on a day-by-day basis. For rest breaths per minute, the computation should be done on a night-by-night basis.

Problem with .DS_Store file in todos

There is a problem when additional files are present in the todo folder.

Using the sample_data folder provided in the repository, the issue can be replicated with the following lines of code on a Windows machine:

base_folder = Path('sample_data')
labfront_loader = pylabfront.loader.LabfrontLoader(base_folder)
labfront_loader.get_available_todos("all")

The output is:

['.DS_Store',
 'Complete-Garmin-Setu_8a5325fa-76ca-446f-a646-4e80aa3fb258',
 'Opened-App_fe43f539-35f7-4f87-9ba1-7ac1a91632b9']

The file .DS_Store should not be there. I already added a check in the get_available_todos() function to determine if we have a file and if the file is a csv file when we return a dictionary, but this must be checked also when we do not return a dictionary.
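
A minimal sketch of a filter that can be applied regardless of the return type. It assumes that todos are entries of a folder and that hidden files such as .DS_Store can be recognized by a leading dot; list_todo_entries is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_todo_entries(todo_folder):
    """Return sorted entry names, skipping hidden files such as .DS_Store."""
    return sorted(
        entry.name
        for entry in Path(todo_folder).iterdir()
        if not entry.name.startswith(".")
    )


# Demonstration on a temporary folder containing a stray .DS_Store file.
with tempfile.TemporaryDirectory() as tmp_folder:
    (Path(tmp_folder) / ".DS_Store").write_text("")
    (Path(tmp_folder) / "Opened-App_fe43f539-35f7-4f87-9ba1-7ac1a91632b9").mkdir()
    print(list_todo_entries(tmp_folder))
```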

Hypnograms starting and/or ending with awake condition have unintended consequences on sleep metrics

In the lifesnaps branch:

Some hypnograms have more values than the duration implies (e.g. the first user in a loader.LifeSnapsLoader(), datetime.datetime(2021,6,4)). Hence they will fail in functions like visualization.get_sleep_summary_graph
Several hypnograms are filled with nans only, with valid start times and end times; as an example user with index 1 in lifesnaps, first days of recording.

Overall:

bedtime computation in sleep module assumes that the isoDate of the sleep summary is the bedtime. Given the name, this might be correct. However, in that case, there should be an asleep_time too, which returns the time the user falls asleep each day.
wakeup computation in sleep module: not sure if this is working as intended when sleep summaries don't start with sleep stages, but while still awake, or in case the hypnogram finishes with awake condition. In any case, there should be both wakeup time and getup time in sleep metrics. So far we've intended wake up time to be equal to get up time.

CPD_midpoint and CPD_duration unexpected behavior with single day computations

Both sleep.get_cpd_midpoint and sleep.get_cpd_duration assume that the period of interest is longer than a single calendar date in order to actually compute CPD metrics. If a single day is used, CPD is to be considered as simply an absolute discrepancy wrt the usual duration/midpoint (if this is not specified in chronotype_dict it will simply be 0, as there's no difference between the night considered and itself).

LabfrontLoader failed to initialize

Labfront changed the structure of the CSV files, with the keys firstUnixTimestampInMs and lastUnixTimestampInMs no longer available in the header of the CSV file, thus the main indexing function of LabfrontLoader must be rewritten.

Selection of data source

At the moment, there is no selection of the data source for different Garmin metrics, and users need to choose whether to use the "load_garmin_connect" or "load_garmin_device" functions. General functions should be added that check which sources are available and return data based on established criteria (e.g., higher resolution of the data).

Sleep onset latency isn't computed as intended

At the moment the function sleep.get_sleep_onset_latency will always return 0 for any participant for any night of sleep.
This is due to the fact that sleep epochs and sleep summaries only begin when a first sleep stage is detected.
I currently don't see a solution for this issue.

Integrate LifeSnaps dataset

LifeSnaps is a dataset with data collected in the wild using FitBit devices. It would be a good example to show the potential of pywearable, thus it would be good to integrate it into pywearable with a dedicated loader.

Add check with non datetime data in sleep statistics

When running sleep statistics functions from the sleep module related to bedtime and wake up times, the functions fail if:

returned sleep statistic series is empty
returned sleep statistic series does not have a compatible datetime dtype

Sleep stages nomenclature

The current nomenclature for sleep stages is the following:

REM
deep
light
awake
unmeasurable

The problem with this nomenclature is that it does not support the N2 sleep stage, which some wearables that we are going to support in the future may report. For this reason, it is necessary to map the values that are currently being used to the following:

REM -> REM
deep -> N3
light -> N1
awake -> awake
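
The mapping above can be sketched as a dictionary lookup. Values without an entry, such as unmeasurable, are passed through unchanged here, which is an assumption; map_sleep_stages is a hypothetical helper.

```python
STAGE_MAPPING = {
    "REM": "REM",
    "deep": "N3",
    "light": "N1",
    "awake": "awake",
}


def map_sleep_stages(stages):
    """Translate stage labels, leaving unmapped labels unchanged."""
    return [STAGE_MAPPING.get(stage, stage) for stage in stages]


print(map_sleep_stages(["awake", "light", "deep", "REM", "unmeasurable"]))
```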

Questionnaire adherence error in total computation

We have a user and the following questionnaires:

questionnaire 1: user never filled it in
questionnaire 2: user filled it once

If we use the function adherence.get_questionnaire_adherence, we get the wrong result for questionnaire 1:

{'questionnaire 1': {'total': 1, 'n_filled': 0}, 'questionnaire 2': {'total': 10, 'n_filled': 1}}

This is probably due to the fact that the file for questionnaire 1 does not exist at all, and the function does not know whether the questionnaire is a repeated task or not, as the value:

questionnaire_dict[participant_id][questionnaire][_LABFRONT_TASK_SCHEDULE_KEY]

is equal to False. This could be solved by checking whether the _LABFRONT_TASK_SCHEDULE_KEY is set to True for any of the other participants who already filled in the questionnaire.