neherlab / covid19_scenarios_data

Data preprocessing scripts and preprocessed data storage for the COVID-19 Scenarios project

Home Page: https://github.com/neherlab/covid19_scenarios
License: Other
Hey guys,
I admire this project very much and would like to contribute by providing a China data source.
I tried to write the scripts that generate the China data myself, but since I am a newbie in Python, I gave up. I would still like to share several endpoints provided by open-source projects, all of which collect their data from the official websites:
https://github.com/BlankerL/DXY-COVID-19-Crawler/blob/master/README.en.md
https://github.com/BlankerL/DXY-COVID-19-Data
Thanks
Natalia (@nataliadgepi) proposed adding data for Canada in neherlab/covid19_scenarios#162.
If someone is available, feel free to check it out and start working.
Related:
As pointed out by Marek Basler, there are inconsistent names for the same country that lead to the simulation not running and case counts not displaying. The example given was the Czech Republic vs Czechia. This reflects the fact that our json files are pulled from different data sources.
Previously, the app took care of dispatching whether a region was in the northern or southern hemisphere, which helped set the seasonal peak of the epidemic. Here, we have hard-coded January. We need a way to input this data automatically to correctly seed the JSON.
This issue is to discuss next steps, assuming that the case-counts folder structure is simplified as suggested in https://github.com/neherlab/covid19_scenarios_data/issues/63, and that direct .json generation is dropped to allow for manual verification/diffing of the .tsv data.

Proposed signature of store_data():

def store_data(data, source, cols=[]):

Let's assume the main data structure passed by the parser is called data and has the format {'USA': [{'time': '2020-01-20', 'cases': 20,...},..,{'time': '2020-03-20', 'cases': 200,...}]} or {'USA': [['2020-01-20', 20,...],..,['2020-03-20', 200,...]]}.

I would like to propose the following:

- data: at country level, the key would just be the name of the country as found in country_codes.csv. At state level, it needs to be the three-letter country code from country_codes.csv, a hyphen, and then the state name (e.g., USA-New York). We would need to update existing parsers to do so. ecdc and cds will not have to be updated. For the others, the existing exceptions dict should tell you which keys are country-level; all others will be state-level and need to be prepended with the three-letter country code.
- source is the string identifying the parser, matching sources.json. It will also be used to name the folder in case-counts for the .tsv files.
- cols will still be required to be able to parse data in case it is a dict of lists of lists (done in some parsers), and we don't want to rely on the parser passing data in the correct order.

store_data() could then be simplified a lot. I would suggest:

- Check whether data is a dict of lists of dicts. If yes, we convert it to a dict of lists of lists using dict_to_list(regions, default_cols). Either way, we then call store_tsv().
- store_tsv() can also be simplified to get rid of the world.tsv and exception handling (not needed, as state-level keys would have appropriate names). I would still recommend sanitization of API-provided strings when using them for filenames. The files would then be saved to BASE_PATH/{source}/{country-or-state-name}.tsv.
- store_json(), merge_cases(), and compare_day() can likely be reused for the later parsing of .tsv into JSON, so I would recommend not just deleting and forgetting about them.

@tryggvigy tagging you as you explicitly said you would like to do this. Hope this helps. I can also do it, let me know if I should.
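A minimal sketch of the simplified store_data() described above. The helper names (dict_to_list, store_tsv, sanitize) follow the proposal, but the exact signatures and the base-path handling are assumptions:

```python
import os
import re

def dict_to_list(regions, cols):
    # Convert {'USA': [{'time': ..., 'cases': ...}, ...]}
    # into {'USA': [[time, cases, ...], ...]} using the column order in cols.
    return {region: [[row.get(c) for c in cols] for row in rows]
            for region, rows in regions.items()}

def sanitize(name):
    # API-provided region names end up as filenames; strip unsafe characters.
    return re.sub(r'[^A-Za-z0-9 _\-]', '_', name)

def store_data(data, source, cols=None, base='case-counts'):
    # Keys of data are either a country name from country_codes.csv, or
    # '<ISO3>-<State>' for state-level entries, e.g. 'USA-New York'.
    cols = cols or ['time', 'cases', 'deaths', 'hospitalized', 'ICU', 'recovered']
    first = next(iter(data.values()))
    if first and isinstance(first[0], dict):
        data = dict_to_list(data, cols)
    store_tsv(data, source, cols, base)

def store_tsv(data, source, cols, base):
    # One .tsv per region, under <base>/<source>/ (no world.tsv special case).
    outdir = os.path.join(base, source)
    os.makedirs(outdir, exist_ok=True)
    for region, rows in data.items():
        with open(os.path.join(outdir, sanitize(region) + '.tsv'), 'w') as f:
            f.write('\t'.join(cols) + '\n')
            for row in rows:
                f.write('\t'.join('' if v is None else str(v) for v in row) + '\n')
```

Both input shapes from the issue are handled: dicts are normalized via dict_to_list, lists are written as-is.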
With aux/ it is impossible to clone the repository on Windows (including the app repository, via the submodule): aux is a reserved serial (COM) port name on Windows :)
https://stackoverflow.com/a/38457713
Also, while we are at it, can we make the name more descriptive than auxData? ;)
Related in the app:
neherlab/covid19_scenarios#3
@chriswien posted in neherlab/covid19_scenarios#18 (comment) about the data for Germany.
If someone is available, feel free to pick it up.
The French parser currently adds multiple rows for the same date, e.g. in case-counts/Europe/Western Europe/France/Nouvelle-Aquitaine.tsv. From a quick first look at the source, there might be county data below the state data, and each county adds a different line? I don't think the issue is in the store_data code; the duplicates seem to be in the dict before that function is called.
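Until the root cause in the parser is found, duplicate dates are easy to flag before the .tsv is written. A small sketch, assuming rows are [date, ...] lists as elsewhere in this repo:

```python
from collections import Counter

def duplicate_dates(rows):
    # rows: list of [date, cases, deaths, ...] records for one region.
    # Returns the dates that appear more than once, sorted.
    counts = Counter(row[0] for row in rows)
    return sorted(d for d, n in counts.items() if n > 1)
```

Running this over each region's rows would immediately surface files like Nouvelle-Aquitaine.tsv, where one date yields several lines.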
We want case counts data to be updated often, as the situation in the world evolves.

We need to:

- have a dedicated directory (say, data/case-counts/) or even a dedicated repository for this data
- put the data into .tsv files per location (country, city, etc.): data/case-counts/<region>.tsv, e.g. data/case-counts/CH-Basel-Stadt.tsv. These .tsv files will contain the following columns:

  date  cases  deaths  hospitalized  ICU  recovered
  2020-03-14  ...

- accept pull requests for updates and additions to the data
- have a set of scripts to convert the incoming data in different formats into our .tsv format
- assign a maintainer to retrieve and curate the data
- not forget to add proper citations and acknowledge the authors

In the app:

- at build time, from these TSV files, generate a JSON/TS file containing all the data
- at runtime, import the generated JSON/TS file in the app and merge the data into the app state
- make sure the generation occurs whenever the TSV files change
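The build-time step could be sketched as a script that collects all per-region .tsv files into one blob for the app. Paths and the output shape here are assumptions, not a final design:

```python
import csv
import glob
import os

def collect_case_counts(base='data/case-counts'):
    # Read every <region>.tsv under base into {region: [row-dicts]}.
    all_data = {}
    for path in sorted(glob.glob(os.path.join(base, '*.tsv'))):
        region = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            reader = csv.DictReader(f, delimiter='\t')
            all_data[region] = list(reader)
    return all_data
```

The result can then be serialized with json.dumps() into the JSON/TS file the app imports, and the script wired into the build so it reruns on TSV changes.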
Related:
I tried to run a simulation for Denmark with default data, first with the default simulation range, i.e. until September 1st, and then with one until January 1st, 2021.
Checking the death count on the 31st of August in both runs comes up with widely different numbers, i.e. 346 vs. 489.
Doing the same for USA-New York gives: 7281 vs. 10245.
I suspect you suffer from significant numeric instability.
We want to have the most up-to-date case counts data.
The case counts for Italy are available at:
https://github.com/pcm-dpc/COVID-19/blob/master/dati-json/dpc-covid19-ita-regioni.json
We need to:

- parse this .json into our .tsv format and load it into the application. The parser tools we typically put into the tools/ directory.
- keep track of when new .json files will be published.

The data is licensed under CC-BY-4.0.
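A parser along these lines could map the per-region JSON records onto our columns. The Italian field names below (denominazione_regione, totale_casi, deceduti, ...) are assumptions based on the pcm-dpc dataset and should be verified against the current schema:

```python
from collections import defaultdict

# Our .tsv columns mapped to the (assumed) pcm-dpc field names.
FIELDS = {'cases': 'totale_casi',
          'deaths': 'deceduti',
          'hospitalized': 'totale_ospedalizzati',
          'ICU': 'terapia_intensiva',
          'recovered': 'dimessi_guariti'}

def parse_italy(records):
    # records: the parsed dpc-covid19-ita-regioni.json,
    # a list of per-region, per-day dicts.
    regions = defaultdict(list)
    for rec in records:
        key = 'ITA-' + rec['denominazione_regione']
        date = rec['data'][:10]  # 'data' holds an ISO timestamp
        regions[key].append([date] + [rec.get(f) for f in FIELDS.values()])
    return dict(regions)
```

The returned dict-of-lists could then be handed straight to a store_data()-style writer; records would come from json.load() over the linked file.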
Related:
The current approach with a World.tsv AND individual .tsv files for subcountries/cities is confusing (at least to me). I see that in covid19_scenarios/tools/collect_case_data_to_json.py, the files are aggregated and integrated into one big JSON again, with the individual .tsv being preferred over world.tsv. The country-region info is pulled from the path (and apparently not used further).
Why not just have one tsv (or json) in this repo, and have the country-region data in that json as well? That would then also get rid of the parsing in covid19_scenarios. Having everything in JSON would make it easy to parse the file for each parser, and then add more entries. JSON is of course not as easily editable by hand, but given the scale of data we are talking about, manual editing is likely not feasible any more in any case.
I plan to create a parser for Iceland. The canonical data source for Iceland appears to be this public Google sheet. Would it be OK if I get the data from the sheet directly in the parser? Or would you prefer me to create a server that periodically syncs it to a csv file on GitHub, which the parser can then read?
We need a mechanism to track data sources in order to properly acknowledge the authors and make the overall setup more open and reproducible.
I propose to centralize the tracking in a .json file which would contain at least:

- urls: direct links to the source data file(s); can be one of:
  - string - source URL of a single file
  - object - with the name of the source file as a key and the source URL as a value
  - string[] - source URLs of multiple files
- URL of the website related to the dataset
- citation/acknowledgement: string
- license: string
- meta: any - optional metadata
This data registry can then be consumed by downloader/transformer scripts as well as by the build system: to include this information in the app and to generate a documentation page in Markdown.
URL strings might contain a set of pre-defined placeholders, for example to encode dates or places. These placeholders can then be automatically substituted within scripts (e.g. with the current date).
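A consumer of such a registry could resolve the three url shapes and substitute placeholders like this. The entry below and the {date} placeholder name are illustrative, not a final spec:

```python
# Hypothetical sources.json entry following the proposed fields.
SOURCES = {
    'switzerland': {
        'urls': 'https://example.org/covid19_cases_{date}.csv',
        'website': 'https://example.org',
        'citation': 'Example Health Authority',
        'license': 'CC-BY-4.0',
        'meta': {'update_frequency': 'daily'},
    }
}

def resolve_urls(entry, **placeholders):
    # urls may be a string, a {name: url} object, or a list of strings.
    urls = entry['urls']
    if isinstance(urls, str):
        urls = [urls]
    elif isinstance(urls, dict):
        urls = list(urls.values())
    return [u.format(**placeholders) for u in urls]
```

A downloader would call resolve_urls(SOURCES['switzerland'], date=date.today().isoformat()) and fetch each resulting URL; the citation and license fields feed the generated documentation page.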
Swiss case counts are currently not updating properly. They have updated files for the individual cantons, but the aggregated file fell into disrepair. I filed an issue upstream and they'll hopefully have it fixed soon; otherwise, we have to adjust our parser to use the individual cantonal files.
I think we can make a little guide for data contributors and curators.
Rough example of the guide:

## Contributing and curating data:

### Adding data for a new region:

Steps:

- Case counts data, updated frequently as the outbreak evolves:
  * Write a script that downloads and converts the raw data into .tsv format with columns: <column_list>
  * Submit the script into tools/ in the main repo? (or into data?)
  * Commit the produced .tsv file into the directory Some/Path/To/<Region>.tsv
  * These TSV files will be included in the app on the next build
- Another data type:
  * What needs to be done
- Yet another data type

### Updating data for an existing region:

Steps:

See also:
I think it would improve the utility and usability of the model if we did something a bit more intelligent regarding the population estimates in populationData.tsv, as many countries have staggered epidemics and varied testing capacities. The manual filling-in of these initial case counts can be massively improved. I'll propose three alternatives:
Hi,
I have recently been working on bringing COVID scenarios for Brazil into the simulator; the problem in my country is that we do not have any official data releases.
I see that the repository already has the CDS parser that gets data from Johns Hopkins, and that dashboard has data for Brazil. Can we work on this issue to add a Brazil scenario?
Here in South America, the country is about to see its public health system collapse in about a month.