

covid19_scenarios_data's Issues

Add China data source

Hey guys,

I admire this project very much and I want to contribute by providing a data source for China.

I tried to write the scripts that generate the China data, but since I am a newbie at Python, I had to give up for now. I still want to share several endpoints provided by open-source projects, all of which collect data from the official websites:

https://github.com/BlankerL/DXY-COVID-19-Crawler/blob/master/README.en.md
https://github.com/BlankerL/DXY-COVID-19-Data

Thanks

Inconsistent country names

As pointed out by Marek Basler, there are inconsistent names for the same country that lead to the simulation not running and case counts not displaying. The example given was the Czech Republic vs Czechia. This reflects the fact that our json files are pulled from different data sources.
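A stop-gap fix would be a small normalization table applied to every data source before merging. A minimal sketch; the alias entries are illustrative examples, and the canonical names should really come from country_codes.csv:

```python
# Example alias table mapping source-specific names to canonical ones.
# The entries here are illustrative; the canonical spelling should be
# taken from country_codes.csv.
COUNTRY_ALIASES = {
    "Czech Republic": "Czechia",
    "Korea, South": "South Korea",
}

def canonical_country(name: str) -> str:
    """Map a data-source country name to its canonical form."""
    name = name.strip()
    return COUNTRY_ALIASES.get(name, name)
```

Running every parser's output through such a helper would at least make mismatches like Czech Republic vs. Czechia show up in one place.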

Initial Condition JSON needs to read out location of scenarios.

Previously, the app took care of dispatching on whether a region is in the northern or southern hemisphere, which helped set the seasonal peak of the epidemic. Here, the peak is hard-coded to January. We need a way to input this data automatically to correctly seed the JSON.
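One way to automate this would be to derive the hemisphere from the region's latitude. A minimal sketch; the field name and the January/July convention are assumptions, not the app's actual schema:

```python
def seasonal_peak_month(latitude: float) -> int:
    """Return 1 (January) for the northern hemisphere, 7 (July) for the southern."""
    return 1 if latitude >= 0 else 7

def seed_scenario(scenario: dict, latitude: float) -> dict:
    """Copy a scenario dict and fill in an assumed seasonalPeakMonth field."""
    seeded = dict(scenario)
    seeded["seasonalPeakMonth"] = seasonal_peak_month(latitude)
    return seeded
```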

Simplify store_data() arguments, requirements on data passed

This issue is to discuss next steps, assuming that the case-counts folder structure is simplified as suggested in https://github.com/neherlab/covid19_scenarios_data/issues/63, and that direct .json generation is dropped to allow for manual verification/diffing of the .tsv data.

Proposed signature of store_data():

def store_data(data, source, cols=[]):

Lets assume the main data structure passed by the parser is called data and has the format {'USA': [{'time': '2020-01-20', 'cases': 20,...},..,{'time': '2020-03-20', 'cases': 200,...}]} or {'USA': [['2020-01-20', 20,...],..,['2020-03-20', 200,...]]}
I would like to propose the following:

  • parsers should now explicitly be responsible for setting country-level keys in data: at country level, the key would just be the name of the country as found in country_codes.csv. At state level, it needs to be the three-letter country code from country_codes.csv, a hyphen, and then the state name (e.g., USA-New York). We would need to update existing parsers to do so. ecdc and cds will not have to be updated. For the others, the existing exceptions dict should tell you which keys are country-level; all others will be state-level and need to be prefixed with the three-letter country code.
  • source is the string identifying the parser, matching sources.json. It will also be used to name the folder in case-counts for the .tsv files
  • cols will still be required to be able to parse data in case that is a dict of lists of lists (done in some parsers), and we don't want to rely on the parser passing data in the correct order
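The key construction described in the first bullet could be factored into a small helper that parsers share. A sketch; looking the names and codes up in country_codes.csv is assumed to happen in the caller:

```python
def region_key(country_code: str, country_name: str, state: str = None) -> str:
    """Build the data key for a parser.

    Country level: the country name as found in country_codes.csv.
    State level: '<three-letter code>-<state name>', e.g. 'USA-New York'.
    """
    if state is None:
        return country_name
    return f"{country_code}-{state}"
```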

store_data() could then be simplified a lot. I would suggest:

  • If we want to drop direct .json generation, then we can just check whether data is a dict of lists of dicts. If yes, we convert it to a dict of lists of lists using dict_to_list(regions, default_cols). Either way, we then call store_tsv(). That function can also be simplified to get rid of the world.tsv and exception handling (no longer needed, as state-level keys would have appropriate names). I would still recommend sanitizing API-provided strings when using them for filenames. The files would then be saved to BASE_PATH/{source}/{country-or-state-name}.tsv
  • store_json(), merge_cases(), and compare_day() can likely be reused for the later parsing of .tsv into .json, so I would recommend keeping them around rather than deleting and forgetting about them.

@tryggvigy tagging you as you explicitly said you would like to do this. Hope this helps. I can also do it, let me know if I should.

french parser has multiple lines per day in .tsv

The french parser currently adds multiple rows for the same date, e.g. in case-counts/Europe/Western Europe/France/Nouvelle-Aquitaine.tsv. From a quick first look at the source, there might be county data below the state data, and each county adds a different line? I don't think the issue is from the store_data code, it seems to be in the dict before that function is called.
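If the duplicates really come from county rows sharing the state's date, one fix would be to aggregate per date before store_data is called. A sketch assuming rows of the form [date, v1, v2, ...] with numeric columns; whether summing county rows actually reproduces the state total needs to be checked against the source:

```python
from collections import defaultdict

def aggregate_by_date(rows):
    """Collapse multiple rows per date by summing the numeric columns.

    rows: iterable of [date, v1, v2, ...]; None values are treated as 0.
    Returns one row per date, sorted by date.
    """
    totals = defaultdict(lambda: defaultdict(int))
    ncols = 0
    for date, *values in rows:
        ncols = max(ncols, len(values))
        for i, v in enumerate(values):
            if v is not None:
                totals[date][i] += v
    return [[date] + [totals[date][i] for i in range(ncols)]
            for date in sorted(totals)]
```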

Crowdsource case counts data

We want case counts data to be updated often, as the situation in the world evolves.

We need to:

  • have a dedicated directory (say, data/case-counts/) or even repository for this data

  • put data into .tsv files per location (country, city, etc.): data/case-counts/<region>.tsv e.g. data/case-counts/CH-Basel-Stadt.tsv.

  • these .tsv files will contain the following columns:

      date  cases  deaths  hospitalized  ICU  recovered
      2020-03-14 ...
  • accept pull requests for updates and additions to the data

  • have a set of scripts to convert the incoming data in different formats into our .tsv format

  • assign a maintainer to retrieve and curate the data

  • not forget to add proper citations, acknowledge the authors

In the app:

  • at build time, generate a JSON/TS file containing all the data from these TSV files

  • at runtime, import the generated JSON/TS file in the app and merge the data into the app state

  • make sure the generation occurs whenever the TSV files change

Related:

Inconsistent death count depending on simulation range

I tried to run a simulation for Denmark with default data, first with the default simulation range (i.e. until September 1st), and then with one until January 1st, 2021.

Checking the death count on the 31st of August in both runs comes up with widely different numbers: 346 vs. 489.

Did the same for USA-New York and got: 7281 vs. 10245.

I suspect the solver suffers from significant numerical instability.

Parse for Italian case counts data and integrate this data into the app

We want to have the most up-to-date case counts data.

The case counts for Italy are available at:
https://github.com/pcm-dpc/COVID-19/blob/master/dati-json/dpc-covid19-ita-regioni.json

We need to:

  • parse this .json into our .tsv format and load it into the application. Parser tools typically go into the tools/ directory
  • be able to update this data as new .json files are published

The data is licensed with CC-BY-4.0
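The conversion could be sketched as below. The field names (data, denominazione_regione, totale_casi, deceduti, totale_ospedalizzati, terapia_intensiva, dimessi_guariti) follow the pcm-dpc per-region dataset as I understand it, but should be double-checked against the current upstream schema:

```python
def parse_italy(records):
    """Map pcm-dpc per-region records to our .tsv row format.

    records: list of dicts loaded from dpc-covid19-ita-regioni.json.
    Returns {'ITA-<region>': [[date, cases, deaths, hospitalized, ICU,
    recovered], ...]} with rows sorted by date.
    """
    regions = {}
    for rec in records:
        key = "ITA-" + rec["denominazione_regione"]
        regions.setdefault(key, []).append([
            rec["data"][:10],  # ISO timestamp -> YYYY-MM-DD
            rec["totale_casi"],
            rec["deceduti"],
            rec["totale_ospedalizzati"],
            rec["terapia_intensiva"],
            rec["dimessi_guariti"],
        ])
    for rows in regions.values():
        rows.sort()
    return regions
```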

Related:

Structure for parsed data files

The current approach with a World.tsv AND individual .tsv files for subcountries/cities is confusing (at least to me). I see that in covid19_scenarios/tools/collect_case_data_to_json.py, the files are aggregated and integrated into one big JSON again, with the individual .tsv being preferred over world.tsv. The country-region info is pulled from the path (and apparently not used further).

Why not just have one tsv (or json) in this repo, and have the country-region data in that json as well? That would then also get rid of the parsing in covid19_scenarios. Having everything in JSON would make it easy to parse the file for each parser, and then add more entries. JSON is of course not as easily editable by hand, but given the scale of data we are talking about, manual editing is likely not feasible any more in any case.

[Question] parser for Iceland

I plan to create a parser for Iceland. The canonical data source for Iceland appears to be this public Google sheet. Would it be OK if I fetch the data from the sheet directly in the parser? Or would you prefer me to create a server that periodically syncs it to a CSV file on GitHub which the parser can then read?
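Fetching directly is fairly simple: public Google Sheets expose a CSV export endpoint. A sketch; the sheet ID below is a placeholder for the actual ID from the sheet's URL, and gid selects the tab:

```python
import csv
import io
import urllib.request

def sheet_export_url(sheet_id: str, gid: int = 0) -> str:
    """CSV export endpoint for a publicly shared Google Sheet."""
    return (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
            f"/export?format=csv&gid={gid}")

def fetch_sheet_rows(sheet_id: str, gid: int = 0):
    """Download the sheet and parse it into a list of dicts (one per row)."""
    with urllib.request.urlopen(sheet_export_url(sheet_id, gid)) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```

The trade-off is that the parser then depends on the sheet staying public and keeping its column layout; a synced CSV in the repo would make runs reproducible.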

Track data sources

We need a mechanism to track data sources in order to properly acknowledge the authors and make the overall setup more open and reproducible.

I propose to centralize the tracking in a .json file which would contain at least:

  • urls: direct links to the source data file(s) that can be one of:

    • string - source URL of a single file
    • object - with the name of the source file as a key and the source URL as a value
    • string[] - source URLs of multiple files
  • URL of the website related to the dataset

  • citation/acknowledgement: string

  • license: string

  • meta: any - optional metadata

This data registry can then be consumed by downloader/transformer scripts as well as by the build system: to include this information in the app and to generate a documentation page in Markdown.

URL strings might contain a set of pre-defined placeholders, for example to encode dates or places. These placeholders can then be automatically substituted within scripts (e.g. with the current date).
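A sketch of what an entry and the placeholder substitution could look like; the schema, the example entry, and the {date} placeholder syntax are all illustrative, not a finalized format:

```python
from datetime import date

# Hypothetical registry entry; keys follow the proposal above.
SOURCES = {
    "example-source": {
        "urls": "https://example.org/data-{date}.json",
        "website": "https://example.org",
        "citation": "Example Data Project",
        "license": "CC-BY-4.0",
    }
}

def resolve_url(template, when=None):
    """Substitute the pre-defined placeholders in a registry URL template."""
    when = when or date.today()
    return template.replace("{date}", when.isoformat())
```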

🇨🇭 Swiss counts

Swiss case counts are currently not updating properly. The files for individual cantons are being updated, but the aggregated file fell into disrepair. I made an issue upstream and they'll hopefully have it fixed soon:

openZH/covid_19#99

Otherwise, we will have to adjust our parser to use the individual cantonal files.

Add data contributor's guide

I think we can make a little guide for data contributors and curators.

Rough example of the guide:

## Contributing and curating data:

### Adding data for the new region:
  Steps:

 - Case counts data, updated frequently as the outbreak evolves.
    * Write a script that downloads and converts raw data into .tsv format with columns: <column_list>
    * Submit the script into tools/ in the main repo? (or into the data repo?)
    * Commit the produced .tsv file into the directory Some/Path/To/<Region>.tsv
    * These TSV files will be included in the app on the next build

 - Another data
   * What needs to be done

 - Yet another data

### Updating data for the existing region:
  Steps:

See also:

Enhancement: Utilize case-count data inside populationData.tsv

I think it would improve the utility and usability of the model if we did something a bit more intelligent regarding the population estimates utilized in populationData.tsv as many countries have staggered epidemics and varied testing capacities. I think the manual nature of filling in these initial case counts can be massively improved. I'll propose three alternatives:

  1. Replace suspectedCasesMarch1st with the first date SARS-CoV-2 was detected within each country. The benefit is that this is a rather simple change.
  2. Fit a few select parameters of the model to the case-count data we have. Importantly, this must be kept rather simple, e.g. fit the date of first introduction and the % of cases caught within a country.
  3. Keep the format the same but dynamically fill in the suspectedCasesMarch1st values with empirical ones.

🇧🇷 Add Brazil Scenarios

Hi,

I have recently been trying to bring COVID scenarios for Brazil into the app so we can run simulations. The problem is that my country does not have any official data releases.

I see that the repository already has the CDS parser that gets data from Johns Hopkins, and that dashboard includes data from Brazil. Can we work on this issue to add a Brazil scenario?

Here in South America, the country's public health system is about a month away from collapse.
