neherlab / covid19_scenarios_data

Data preprocessing scripts and preprocessed data storage for the COVID-19 Scenarios project

Home Page: https://github.com/neherlab/covid19_scenarios
License: Other
Hey guys,
I admire this project very much and would like to contribute by providing a China data source.
I tried to write the scripts that generate the China data myself, but since I am a newbie in Python, I gave up. I would still like to share several endpoints provided by open-source projects, all of which collect their data from the official websites:
https://github.com/BlankerL/DXY-COVID-19-Crawler/blob/master/README.en.md
https://github.com/BlankerL/DXY-COVID-19-Data
Thanks
Natalia (@nataliadgepi) proposed adding data for Canada in neherlab/covid19_scenarios#162.
If someone is available, feel free to check it out and start working.
Related:
As pointed out by Marek Basler, there are inconsistent names for the same country that lead to the simulation not running and case counts not displaying. The example given was the Czech Republic vs Czechia. This reflects the fact that our json files are pulled from different data sources.
Previously, the app took care of dispatching whether a region was in the northern or southern hemisphere, which helped set the seasonal peak of the epidemic. Here, we have hard-coded January. We need a way to input this data automatically to correctly seed the JSON.
This issue is to discuss next steps, assuming that the case-counts folder structure is simplified as suggested in https://github.com/neherlab/covid19_scenarios_data/issues/63, and that direct .json generation is dropped to allow for manual verification/diffing of the .tsv data.

Proposed signature of store_data():

def store_data(data, source, cols=[]):

Let's assume the main data structure passed by the parser is called data and has the format {'USA': [{'time': '2020-01-20', 'cases': 20,...},..,{'time': '2020-03-20', 'cases': 200,...}]} or {'USA': [['2020-01-20', 20,...],..,['2020-03-20', 200,...]]}.

I would like to propose the following:

- data: at country level, the key would just be the name of the country as found in country_codes.csv. At state level, it needs to be the three-letter country code from country_codes.csv, a hyphen, and then the state name (e.g., USA-New York). We would need to update existing parsers to do so. ecdc and cds will not have to be updated. For the others, the existing exceptions dict should tell you which keys are country-level; all others will be state-level and need to be prepended with the three-letter country code.
- source is the string identifying the parser, matching sources.json. It will also be used to name the folder in case-counts for the .tsv files.
- cols will still be required to be able to parse data in case it is a dict of lists of lists (done in some parsers), and we don't want to rely on the parser passing data in the correct order.

store_data() could then be simplified a lot. I would suggest:

- Check whether data is a dict of lists of dicts. If yes, we convert it to a dict of lists of lists using dict_to_list(regions, default_cols). Either way, we then call store_tsv().
- store_tsv() can also be simplified to get rid of the world.tsv and exception handling (not needed, as state-level keys would have appropriate names). I would still recommend sanitization of API-provided strings when using them for filenames. The files would then be saved to BASE_PATH/{source}/{country-or-state-name}.tsv.
- store_json(), merge_cases(), and compare_day() can likely be reused for the later parsing of .tsv into JSON, so I would recommend not just deleting and forgetting about them.

@tryggvigy tagging you as you explicitly said you would like to do this. Hope this helps. I can also do it, let me know if I should.
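A minimal sketch of the simplified store_data() described above. The helper names (dict_to_list, store_tsv, sanitize) follow the proposal, but the exact signatures and the base-path handling are assumptions:

```python
import os
import re

def dict_to_list(regions, cols):
    # Convert {'USA': [{'time': ..., 'cases': ...}, ...]}
    # into {'USA': [[time, cases, ...], ...]} using the column order in cols.
    return {region: [[row.get(c) for c in cols] for row in rows]
            for region, rows in regions.items()}

def sanitize(name):
    # API-provided region names end up as filenames; strip unsafe characters.
    return re.sub(r'[^A-Za-z0-9 _\-]', '_', name)

def store_data(data, source, cols=None, base='case-counts'):
    # Keys of data are either a country name from country_codes.csv, or
    # '<ISO3>-<State>' for state-level entries, e.g. 'USA-New York'.
    cols = cols or ['time', 'cases', 'deaths', 'hospitalized', 'ICU', 'recovered']
    first = next(iter(data.values()))
    if first and isinstance(first[0], dict):
        data = dict_to_list(data, cols)
    store_tsv(data, source, cols, base)

def store_tsv(data, source, cols, base):
    # One .tsv per region, under <base>/<source>/ (no world.tsv special case).
    outdir = os.path.join(base, source)
    os.makedirs(outdir, exist_ok=True)
    for region, rows in data.items():
        with open(os.path.join(outdir, sanitize(region) + '.tsv'), 'w') as f:
            f.write('\t'.join(cols) + '\n')
            for row in rows:
                f.write('\t'.join('' if v is None else str(v) for v in row) + '\n')
```

Both input shapes from the issue are handled: dicts are normalized via dict_to_list, lists are written as-is.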
With aux/ it is impossible to clone the repository on Windows (including the app repository, via the submodule): aux is a reserved serial (COM) port name on Windows :)
https://stackoverflow.com/a/38457713
Also, while we are at it, can we make the name more descriptive than auxData? ;)
Related in the app:
neherlab/covid19_scenarios#3
@chriswien posted in neherlab/covid19_scenarios#18 (comment) about the data for Germany.
If someone is available, feel free to pick it up.
The French parser currently adds multiple rows for the same date, e.g. in case-counts/Europe/Western Europe/France/Nouvelle-Aquitaine.tsv. From a quick first look at the source, there might be county data below the state data, and each county adds a different line? I don't think the issue is in the store_data code; the duplicates seem to be in the dict before that function is called.
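Until the root cause in the parser is found, duplicate dates are easy to flag before the .tsv is written. A small sketch, assuming rows are [date, ...] lists as elsewhere in this repo:

```python
from collections import Counter

def duplicate_dates(rows):
    # rows: list of [date, cases, deaths, ...] records for one region.
    # Returns the dates that appear more than once, sorted.
    counts = Counter(row[0] for row in rows)
    return sorted(d for d, n in counts.items() if n > 1)
```

Running this over each region's rows would immediately surface files like Nouvelle-Aquitaine.tsv, where one date yields several lines.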
We want case counts data to be updated often, as the situation in the world evolves.

We need to:

- have a dedicated directory (say, data/case-counts/) or even a dedicated repository for this data
- put the data into .tsv files per location (country, city, etc.): data/case-counts/<region>.tsv, e.g. data/case-counts/CH-Basel-Stadt.tsv. These .tsv files will contain the following columns:

  date  cases  deaths  hospitalized  ICU  recovered
  2020-03-14  ...

- accept pull requests for updates and additions to the data
- have a set of scripts to convert the incoming data in different formats into our .tsv format
- assign a maintainer to retrieve and curate the data
- not forget to add proper citations and acknowledge the authors

In the app:

- at build time, from these TSV files, generate a JSON/TS file containing all the data
- at runtime, import the generated JSON/TS file in the app and merge the data into the app state
- make sure the generation occurs whenever the TSV files change
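The build-time step could be sketched as a script that collects all per-region .tsv files into one blob for the app. Paths and the output shape here are assumptions, not a final design:

```python
import csv
import glob
import os

def collect_case_counts(base='data/case-counts'):
    # Read every <region>.tsv under base into {region: [row-dicts]}.
    all_data = {}
    for path in sorted(glob.glob(os.path.join(base, '*.tsv'))):
        region = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            reader = csv.DictReader(f, delimiter='\t')
            all_data[region] = list(reader)
    return all_data
```

The result can then be serialized with json.dumps() into the JSON/TS file the app imports, and the script wired into the build so it reruns on TSV changes.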
Related:
I tried to run a simulation for Denmark with default data, first with the default simulation range, i.e. until September 1st, and then with one until January 1st, 2021.
Checking the death count on the 31st of August in both runs comes up with widely different numbers, i.e. 346 vs. 489.
Doing the same for USA-New York gives: 7281 vs. 10245.
I suspect you suffer from significant numeric instability.
We want to have the most up-to-date case counts data.
The case counts for Italy are available at:
https://github.com/pcm-dpc/COVID-19/blob/master/dati-json/dpc-covid19-ita-regioni.json
We need to:

- parse this .json into our .tsv format and load it into the application. The parser tools we typically put into the tools/ directory.
- keep track of when new .json files will be published.

The data is licensed under CC-BY-4.0.
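A parser along these lines could map the per-region JSON records onto our columns. The Italian field names below (denominazione_regione, totale_casi, deceduti, ...) are assumptions based on the pcm-dpc dataset and should be verified against the current schema:

```python
from collections import defaultdict

# Our .tsv columns mapped to the (assumed) pcm-dpc field names.
FIELDS = {'cases': 'totale_casi',
          'deaths': 'deceduti',
          'hospitalized': 'totale_ospedalizzati',
          'ICU': 'terapia_intensiva',
          'recovered': 'dimessi_guariti'}

def parse_italy(records):
    # records: the parsed dpc-covid19-ita-regioni.json,
    # a list of per-region, per-day dicts.
    regions = defaultdict(list)
    for rec in records:
        key = 'ITA-' + rec['denominazione_regione']
        date = rec['data'][:10]  # 'data' holds an ISO timestamp
        regions[key].append([date] + [rec.get(f) for f in FIELDS.values()])
    return dict(regions)
```

The returned dict-of-lists could then be handed straight to a store_data()-style writer; records would come from json.load() over the linked file.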
Related:
The current approach with a World.tsv AND individual .tsv files for subcountries/cities is confusing (at least to me). I see that in covid19_scenarios/tools/collect_case_data_to_json.py, the files are aggregated and integrated into one big JSON again, with the individual .tsv being preferred over world.tsv. The country-region info is pulled from the path (and apparently not used further).
Why not just have one tsv (or json) in this repo, and have the country-region data in that json as well? That would then also get rid of the parsing in covid19_scenarios. Having everything in JSON would make it easy to parse the file for each parser, and then add more entries. JSON is of course not as easily editable by hand, but given the scale of data we are talking about, manual editing is likely not feasible any more in any case.
I plan to create a parser for Iceland. The canonical data source for Iceland appears to be this public Google sheet. Would it be OK if I get the data from the sheet directly in the parser? Or would you prefer me to create a server that periodically syncs it to a csv file on GitHub, which the parser can then read?
We need a mechanism to track data sources in order to properly acknowledge the authors and make the overall setup more open and reproducible.
I propose to centralize the tracking in a .json file which would contain at least:

- urls: direct links to the source data file(s); can be one of:
  - string - source URL of a single file
  - object - with the name of the source file as a key and the source URL as a value
  - string[] - source URLs of multiple files
- URL of the website related to the dataset
- citation/acknowledgement: string
- license: string
- meta: any - optional metadata
This data registry can then be consumed by downloader/transformer scripts as well as by the build system: to include this information in the app and to generate a documentation page in Markdown.
URL strings might contain a set of pre-defined placeholders, for example to encode dates or places. These placeholders can then be automatically substituted within scripts (e.g. with the current date).
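A consumer of such a registry could resolve the three url shapes and substitute placeholders like this. The entry below and the {date} placeholder name are illustrative, not a final spec:

```python
# Hypothetical sources.json entry following the proposed fields.
SOURCES = {
    'switzerland': {
        'urls': 'https://example.org/covid19_cases_{date}.csv',
        'website': 'https://example.org',
        'citation': 'Example Health Authority',
        'license': 'CC-BY-4.0',
        'meta': {'update_frequency': 'daily'},
    }
}

def resolve_urls(entry, **placeholders):
    # urls may be a string, a {name: url} object, or a list of strings.
    urls = entry['urls']
    if isinstance(urls, str):
        urls = [urls]
    elif isinstance(urls, dict):
        urls = list(urls.values())
    return [u.format(**placeholders) for u in urls]
```

A downloader would call resolve_urls(SOURCES['switzerland'], date=date.today().isoformat()) and fetch each resulting URL; the citation and license fields feed the generated documentation page.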
Swiss case counts are currently not updating properly. They have updated files for the individual cantons, but the aggregated file fell into disrepair. I filed an issue upstream and they'll hopefully have it fixed soon; otherwise, we have to adjust our parser to use the individual cantonal files.
I think we can make a little guide for data contributors and curators.
Rough example of the guide:

## Contributing and curating data:

### Adding data for a new region:

Steps:

- Case counts data, updated frequently as the outbreak evolves:
  * Write a script that downloads and converts the raw data into .tsv format with columns: <column_list>
  * Submit the script into tools/ in the main repo? (or into data?)
  * Commit the produced .tsv file into the directory Some/Path/To/<Region>.tsv
  * These TSV files will be included in the app on the next build
- Another data type:
  * What needs to be done
- Yet another data type

### Updating data for an existing region:

Steps:

See also:
I think it would improve the utility and usability of the model if we did something a bit more intelligent regarding the population estimates in populationData.tsv, as many countries have staggered epidemics and varied testing capacities. The manual filling-in of these initial case counts can be massively improved. I'll propose three alternatives:
Hi,
I have recently been working on bringing COVID scenarios for Brazil into the simulator; the problem in my country is that we do not have any official data releases.
I see that the repository already has the CDS parser that gets data from Johns Hopkins, and that dashboard has data for Brazil. Can we work on this issue to add a Brazil scenario?
Here in South America, the country is about to see its public health system collapse in about a month.