timriffe / covid_age Goto Github PK

View Code? Open in Web Editor NEW

56.0 56.0 30.0 135.85 MB

COVerAGE-DB: COVID-19 cases, deaths, and tests by age and sex

License: Other

R 91.88% Python 7.97% Batchfile 0.15%

covid_age's People

Contributors

Stargazers

Watchers

covid_age's Issues

Texas data

Hello, I have been working on the dataset (I'm conding a Dash application n Python to visualize the database).
I noticed that data from Texas was false from 2020/07/29 (values are too high).

Multiple data sources create duplicate entries

Hi @timriffe

For some regions the data is collected from multiple data sources. Sometimes we observe the case that in the final output files for the same region in the same day two different values are recorded. The 2 values can be identical or significantly different.

This is more a conceptual issues, but in my opinion the processed output should contain only 1 record (most credible) for the given region-day-sex-age. Otherwise, we risk to confuse the user. I am a bit confused too, and ponder how to move on when I come across different double entries .

We can see this behavior in 69 * ages * sex instances.

Here's my code

# read data
coverage <- read_csv(
  file = "data/Output_10_20210108.zip",
  skip = 3)

# summarise
d <- coverage %>% 
  filter(
    Sex == "b") %>%
  group_by(Country, Region, Date, Code) %>% 
  summarise(cases = sum(Cases)) %>% 
  mutate(Code2 = paste(Country, Region, Date))

# Show duplicates  
d[duplicated(d$Code2), ] %>% 
  print(n = Inf)

# A tibble: 68 x 6
# Groups:   Country, Region, Date [68]
   Country     Region       Date       Code                   cases Code2                      
   <chr>       <chr>        <chr>      <chr>                  <dbl> <chr>                      
 1 Austria     All          01.11.2020 AT01.11.2020         111110. Austria All 01.11.2020     
 2 Austria     All          08.11.2020 AT08.11.2020         155903  Austria All 08.11.2020     
 3 Austria     All          15.11.2020 AT15.11.2020         206208  Austria All 15.11.2020     
 4 Austria     All          22.11.2020 AT22.11.2020         248419  Austria All 22.11.2020     
 5 Austria     All          25.10.2020 AT25.10.2020          82536. Austria All 25.10.2020     
 6 Belgium     All          01.11.2020 BE01.11.2020         448645  Belgium All 01.11.2020     
 7 Belgium     All          25.10.2020 BE25.10.2020         349459  Belgium All 25.10.2020     
 8 Bulgaria    All          01.11.2020 BG01.11.2020          52844  Bulgaria All 01.11.2020    
 9 Bulgaria    All          25.10.2020 BG25.10.2020          37562  Bulgaria All 25.10.2020    
10 Czechia     All          01.11.2020 CZ_ECDC_01.11.2020   319569  Czechia All 01.11.2020     
11 Czechia     All          08.11.2020 CZ_ECDC_08.11.2020   392561  Czechia All 08.11.2020     
12 Czechia     All          15.11.2020 CZ_ECDC_15.11.2020   438589  Czechia All 15.11.2020     
13 Czechia     All          22.11.2020 CZ_ECDC_22.11.2020   470652  Czechia All 22.11.2020     
14 Czechia     All          25.10.2020 CZ_ECDC_25.10.2020   155582. Czechia All 25.10.2020     
15 Estonia     All          01.11.2020 EE_ECDC_01.11.2020     4882. Estonia All 01.11.2020     
16 Estonia     All          08.11.2020 EE_ECDC_08.11.2020     4882. Estonia All 08.11.2020     
17 Estonia     All          15.11.2020 EE_ECDC_15.11.2020     7450  Estonia All 15.11.2020     
18 Estonia     All          22.11.2020 EE_ECDC_22.11.2020     9534  Estonia All 22.11.2020     
19 Estonia     All          25.10.2020 EE_ECDC_25.10.2020     3964  Estonia All 25.10.2020     
20 Finland     All          01.11.2020 FI01.11.2020             NA  Finland All 01.11.2020     
21 Finland     All          08.11.2020 FI08.11.2020             NA  Finland All 08.11.2020     
22 Finland     All          15.11.2020 FI15.11.2020             NA  Finland All 15.11.2020     
23 Finland     All          25.10.2020 FI25.10.2020             NA  Finland All 25.10.2020     
24 France      All          01.11.2020 FR01.11.2020             NA  France All 01.11.2020      
25 France      All          25.10.2020 FR25.10.2020             NA  France All 25.10.2020      
26 Germany     All          01.11.2020 DE_ECDC_01.11.2020   548122  Germany All 01.11.2020     
27 Germany     All          08.11.2020 DE_ECDC_08.11.2020   548122  Germany All 08.11.2020     
28 Germany     All          15.11.2020 DE_ECDC_15.11.2020   548122  Germany All 15.11.2020     
29 Germany     All          22.11.2020 DE_ECDC_22.11.2020   894433. Germany All 22.11.2020     
30 Germany     All          25.10.2020 DE_ECDC_25.10.2020   301126  Germany All 25.10.2020     
31 Italy       All          02.04.2020 ITinfo02.04.2020     105825. Italy All 02.04.2020       
32 Italy       All          03.06.2020 ITinfo03.06.2020     235046. Italy All 03.06.2020       
33 Italy       All          06.04.2020 ITinfo06.04.2020     125464. Italy All 06.04.2020       
34 Italy       All          09.04.2020 ITinfo09.04.2020     133751  Italy All 09.04.2020       
35 Italy       All          19.03.2020 ITinfo19.03.2020      33296  Italy All 19.03.2020       
36 Italy       All          20.05.2020 ITinfo20.05.2020     228447. Italy All 20.05.2020       
37 Italy       All          23.03.2020 ITinfo23.03.2020      54291. Italy All 23.03.2020       
38 Italy       All          26.03.2020 ITinfo26.03.2020      70619  Italy All 26.03.2020       
39 Italy       All          28.04.2020 ITinfo28.04.2020     195001. Italy All 28.04.2020       
40 Italy       All          30.03.2020 ITinfo30.03.2020      91446. Italy All 30.03.2020       
41 Netherlands All          01.11.2020 NL01.11.2020         393187  Netherlands All 01.11.2020 
42 Netherlands All          08.11.2020 NL08.11.2020         431655. Netherlands All 08.11.2020 
43 Netherlands All          15.11.2020 NL15.11.2020         468804  Netherlands All 15.11.2020 
44 Netherlands All          22.11.2020 NL22.11.2020         504117  Netherlands All 22.11.2020 
45 Netherlands All          25.10.2020 NL25.10.2020         341570  Netherlands All 25.10.2020 
46 Norway      All          01.11.2020 NO01.11.2020          21471  Norway All 01.11.2020      
47 Norway      All          08.11.2020 NO08.11.2020          25519  Norway All 08.11.2020      
48 Norway      All          15.11.2020 NO15.11.2020          29607  Norway All 15.11.2020      
49 Norway      All          25.10.2020 NO25.10.2020          18327  Norway All 25.10.2020      
50 Palestine   All          31.12.2020 PS31.12.2020         155365  Palestine All 31.12.2020   
51 Portugal    All          01.11.2020 PT01.11.2020         146659. Portugal All 01.11.2020    
52 Portugal    All          08.11.2020 PT08.11.2020         183159. Portugal All 08.11.2020    
53 Portugal    All          15.11.2020 PT15.11.2020         221250. Portugal All 15.11.2020    
54 Portugal    All          22.11.2020 PT22.11.2020         259788  Portugal All 22.11.2020    
55 Portugal    All          25.10.2020 PT25.10.2020         120946  Portugal All 25.10.2020    
56 South Korea All          01.01.2021 KR31.12.2020          60740  South Korea All 01.01.2021 
57 USA         California   28.11.2020 US_CDC_CA28.11.2020      NA  USA California 28.11.2020  
58 USA         Florida      28.11.2020 US_FL28.11.2020      992661. USA Florida 28.11.2020     
59 USA         Idaho        28.11.2020 US_ID28.11.2020       99660. USA Idaho 28.11.2020       
60 USA         Iowa         28.11.2020 US_IA_28.11.2020     225784  USA Iowa 28.11.2020        
61 USA         Maine        28.11.2020 US_ME28.11.2020       11508  USA Maine 28.11.2020       
62 USA         New Jersey   28.11.2020 US_NJ28.11.2020      330275  USA New Jersey 28.11.2020  
63 USA         Ohio         28.11.2020 US_OH28.11.2020      458372  USA Ohio 28.11.2020        
64 USA         Oklahoma     28.11.2020 US_OK28.11.2020      194022. USA Oklahoma 28.11.2020    
65 USA         Pennsylvania 28.11.2020 US_PA28.11.2020      326741. USA Pennsylvania 28.11.2020
66 USA         Texas        28.11.2020 US_TX28.11.2020     1151069  USA Texas 28.11.2020       
67 USA         Vermont      28.11.2020 US_VT28.11.2020        4100. USA Vermont 28.11.2020     
68 USA         Virginia     28.11.2020 US_VA28.11.2020      233617. USA Virginia 28.11.2020

Manual inspection of the 1st case

coverage %>% 
  filter(Country == "Austria",
         Region == "All",
         Date == "01.11.2020",
         Sex == "b",
         Age == 60)

# A tibble: 2 x 10
  Country Region Code               Date         Age AgeInt Sex   Cases Deaths Tests
  <chr>   <chr>  <chr>              <chr>      <dbl>  <dbl> <chr> <dbl>  <dbl> <lgl>
1 Austria All    AT_ECDC_01.11.2020 01.11.2020    60     10 b     7555    29   NA   
2 Austria All    AT01.11.2020       01.11.2020    60     10 b     9342.   86.4 NA

US Wisconsin data - trend break

Hi Tim,

Looking at the Wisconsin data I can see a trend break between end of April and beginning of June. The data in this period seems to belong to another region. Considering that we are looking at cumulative data this look wrong to me.

Here's a quick way to visualize it:

W <- read.csv(
 file = "Output_10_20200831.csv",
 header = TRUE,
 skip = 3
) %>% 
 filter(Region == "Wisconsin",
        Sex == "b")  %>% 
 mutate(
   Date = as.Date(Date, format = "%d.%m.%Y"))

ggplot(W, aes(x = Date, y = Cases, col = factor(Age))) + 
 geom_point()

Thanks for investigating!
Marius

Offset Data source

I am maintain a similar database for SCOR but with significantly less ambitions. A good source of offset data it the WPP2019 that can be found very convenient in the {wpp2019} package. See popM and popF to get 5y-age data.

I am using the last available HMD exposures and deaths when I what to determine the excess in mortality during the covid epidemic.

Data availability for Germany

Dear COVerAGE team,
I've noticed that the data inputs and outputs for Germany have stopped in August. Could we please have an update on this?

Suggestion:

I think it would be a good idea to have a database news section somewhere where important updates and developments are shared (maybe in 1 sentence). Also we could produce a "data-availability table" where for each region to have specified the start and the end of the data, perceived quality and info about missing data. I would be of a great help to all of us who are working with your amazing data frequently.

Regards,
Marius

US Cook County mind-blowing data

... and not properly anonymized i'd say, can be found here.

I don't know if you are aware of this, but you can find seriatim death data (including COVID related). This might be a sign that more high quality data is coming from the US.

However, if you know somebody close to the source it would be nice to raise a warning and send them an email explaining that we will gladly use the data as is but this can generate a reputational risk on their side.

sex aggregation with non-congruent age groups

There are cases, such as Slovenia, where age grouping (for deaths in this case) is done on the fly to make adjacent 0s into one age group. For example age 0-34 = 0 for one sex and age 0-44 for the other sex. In this case, aggregation to b doesn't work properly in step 2. The function infer_both_sex() should be modified to detect and handle this situation. Leaving this here as a TO-DO. @kikeacosta tagging so you get the memo too

Brazil

Brazil deaths data have huge oscillations, especially for females.

Sex ratio (m/f) of the deaths from age 45 onwards has a sudden drop after the first few days.

Spanish Data

I did a comparison between the Spanish data I collected and what I see in the inputDB.csv .
The sources appear to be the same however at certain points in time consistent differences across multiple ages can be observed. E.g.: March 29, 31 and April 13.

This is related to #4.

Also, it is not clear to me what is reported when Age == UNK. Maybe unknown? For Spain very large numbers can be seen there.

See the file attached here:
ESP_Data_Comparison_20200413.xlsx

Things to do for COVERAGEDB

Why can't we upload files to the OSF repo through osfr? It seems that if we create the OSF project with osf_create_project we can upload and overwrite files but with the current OSF repo, we're not allowed to upload anything. See ropensci/osfr#126
Get a notification when osfr uploads the files every day.
Implement an automatic twitter bot that announces that name of the countries (name) and the number of increase in states and provinces (the number). Include also the world map with the current countries (see https://raw.githubusercontent.com/timriffe/covid_age/master/assets/coveragemap.svg). Another type of tweet would be: we have X countries out of X in Africa/North America/Etc..
We have a bottleneck in the table that summarizes which transformation happened for which country/province/date. We read all the data from here and then apply some functions to check which chunks got transformed (here). We should get all of this to run inside a profiler to check which functions need to be optimized. Currently, this process is slow.
The big data in https://timriffe.github.io/covid_age/DataAvail.html --> 'By country, region and date' is throwing this warning:

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Why is this?

Look into including pointblank validation reporting as part of the daily build cycle. https://github.com/rich-iannone/pointblank/
write an external consistency check to compare crude totals with Hopkins data. Huge abrupt deviations should be flagged.

Texas total cases by age is off from total cases by ~10x

As of August 13, the total case count for Texas when added up by age groups in this dataset is 43,409. The "TOT" line for all ages is 513,575, which is in line with numbers I see elsewhere.

I'm guessing that Texas simply didn't report ages for more than 90% of cases, even though it is reporting them for all or nearly all of deaths. Does the discrepancy above come from the state's reporting or is it an actual error?

If the former, should there be some way that such data limitations are reported within the data set?

Lastly, thanks for making all of this data available! It's a wonderful resource to have in one place.

zip files

inputDB.csv, Output_5.csv and Output_10.csv should all be zipped prior to OSF upload, ASAP.
then it works just the same readr::read_csv("inputDB.zip")

Datasource for Israel C19 info

Hi @timriffe,
Currently for Israel little or no info is collected in the database except for number of new cases.
I would like to suggest a data source that might or might not be known to you: https://data.gov.il/dataset/covid-19

The website is in Hebrew, however I was able to navigate it in Chrome with auto translation.

One can find there details about weekly death and tests and more. One drawback is that for records higher than 0 and lower than 15, "<15" is displayed instead of a numerical value. I suspect that this is done in order to anonymized the data.

I think Israel deserves increased attention from us because is the leading country that is offering information about the vaccine effectiveness, so I guess everyone's eyes are on it.

reformat output

make Code is just the short code portion.
change Sex = "b" to "t" everywhere
output in isoweeks
exclude duplicate sources: unique source per population / measure
Tests in different file out of main output.

Extra columns in inputDB

Thanks for compiling such a great resource!
There appear to be extra columns y and 2499 in inputDB, such that it has 13 columns rather than 11:

> inputDB <-  read_csv("inputDB.zip",
+                      skip = 1,
+                      col_types = "cccccciccdc")
|===================================================| 100% 1092 MB
Warning: 15278446 parsing failures.
row col   expected     actual          file
  1  -- 11 columns 13 columns 'inputDB.zip'
  2  -- 11 columns 13 columns 'inputDB.zip'
  3  -- 11 columns 13 columns 'inputDB.zip'
  4  -- 11 columns 13 columns 'inputDB.zip'
  5  -- 11 columns 13 columns 'inputDB.zip'
... ... .......... .......... .............
See problems(...) for more details.

Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
> inputDB <-  read_csv("inputDB.zip",
+                      skip = 1)

── Column specification ──────────────────────────────────────────
cols(
  Country = col_character(),
  Region = col_character(),
  Code = col_character(),
  Date = col_character(),
  Sex = col_character(),
  Age = col_character(),
  AgeInt = col_double(),
  Metric = col_character(),
  Measure = col_character(),
  Value = col_double(),
  Short = col_character(),
  y = col_logical(),
  `2499` = col_logical()
)

|===================================================| 100% 1092 MB

This seems to cause downloading and reading in with covidAgeData to fail:

> inputDB = download_covid("inputDB",progress = F)
Reading /var/folders/xq/qzl74ksd6sxc9rfvlpfslt980000gn/T//RtmpWykb7A/Data/inputDB.csv
Error in data.table::fread(filepath, sep = ",", colClasses = rinfo[[2]],  : 
  colClasses= is an unnamed vector of types, length 11, but there are 13 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.

Sorry, I haven't been able to look at the source code to see what might be causing this.

Israel death counts collection automation

Hi everyone,

Thought I'd ask for help here. I've finally been working on automating the collection of death counts on Israel's side.

Now, the only way to get complete death counts currently is by manually downloading cumulative counts from the following dashboard (In Hebrew): https://datadashboard.health.gov.il/COVID-19/general

If you go on that website and click the following download button:
body > ngx-app > ngx-pages > nb-layout > div.scrollable-container > div > div > div > div > nb-layout-column > ngx-general > section:nth-child(3) > div > ngx-tile-wrapper > ngx-horizontal-bar-by-age-gender > nb-card > nb-card-header > ngx-select-header-infected-level > div:nth-child(1) > excel-download-button > div > button
(Open Developer Tools and search that selector)
Then you get a csv file that includes a column of current cumulative death counts.

I've been manually downloading files for quite a while, because I couldn't figure out how to automate this process. The dashboard page isn't automation-friendly unfortunately.
If you can think of ideas of how to do that, it would be greatly appreciated!

What I have currently: https://github.com/eshom/covid-dashboard-processing
What the scripts I wrote do, they download the data I semi-regularly upload to Israel's Google Drive folder. It then processes it to fit our format, And finally saves the output to /data/deaths.csv, which is pretty much ready to be joined with other measures.

A known issue with this data source, is that counts aren't strictly cumulative. At some periods, for some undocumented reasons, deaths were reduced. This was checked, and the change in counts is visible on the raw file.

Pinging you here, because I believe you wanted to get this going?
@timriffe

Data Duplication in US Subset

Hi all,

Today, when working with the US subset of the data, I noticed some duplication but with differing values. Specifically, for any given Region, Sex, and Age I was expecting there to be a single observation; however, this is not the case. For example, Arizona on 9/26/20 for Sex equal to b and Age equal to 0 has two observations with differing case, test, and death values - the lines from the CSV file are included below for clarity. Can you provide some clarification here? As of this writing, I identified 1407 such instances in the US data.

USA, Arizona, US_AZ 26.09.2020, 26.09.2020, b, 0, 5, 3531.4, 0.4, 24559.3

USA, Arizona, US_CDC_AZ26.09.2020, 26.09.2020, b, 0, 5, NA, 1, NA

Thanks!

U.K. COVID data

Great to find this. Exquisite work!

Everybody is talking about China but I am still amazed how difficult is to find detailed info for the US and UK.

Because I did not find UK on the list I will add here what I know about it:
https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales
See the April 3 update. In the Excel file you will find death counts by age but not confirmed cases.

Regards,
Marius

Not a number in Germany data

In the output for Germany I can see:

    Country Region         Code       Date Sex Age AgeInt Cases Deaths
1   Germany    All DE01.03.2020 01.03.2020   f   0     10   NaN    0.0
2   Germany    All DE01.03.2020 01.03.2020   f  10     10   NaN    0.0
3   Germany    All DE01.03.2020 01.03.2020   f  20     10   NaN    0.0
4   Germany    All DE01.03.2020 01.03.2020   f  30     10   NaN    0.0
5   Germany    All DE01.03.2020 01.03.2020   f  40     10   NaN    0.0
6   Germany    All DE01.03.2020 01.03.2020   f  50     10   NaN    0.0
7   Germany    All DE01.03.2020 01.03.2020   f  60     10   NaN    1.3
8   Germany    All DE01.03.2020 01.03.2020   f  70     10   NaN    1.7
9   Germany    All DE01.03.2020 01.03.2020   f  80     10   NaN    0.0
10  Germany    All DE01.03.2020 01.03.2020   f  90     10   NaN    0.0
11  Germany    All DE01.03.2020 01.03.2020   f 100     10   NaN    3.9

Is there a reason why I get NaN instead ofa number or NA's for Germany? The data appears to be available?

Danish females

Clearly this is a project that I want in. So I am continuing to post my observations.

I couldn't find the source for the Danish data or the original data in inputDB.csv. Can we have a database with sources? Also would be nice to indicate in the output the degree to which the data has been modeled and the perceived data quality (good, bad, mediocre...).

Now, looking at the cfr in the Danish female population as of April 9 we can see an unusual pattern.

Can we explain this?

US New Jersey Data

In Output_10.csv at the state of New Jersey i can see several weeks in a row with the same records, day after day. Does this indicate zero new cases in the period/region or the lack of new data?

I did this:

> NJ <- read.csv(
+   file = "Output_10_20200901.csv",
+   header = TRUE,
+   skip = 3
+ ) %>%
+   filter(Region == "New Jersey",
+          Sex == "b")  %>%
+   mutate(
+     Date = as.Date(Date, format = "%d.%m.%Y"))
> 
> 
> NJ %>% select(Region, Date, Age, Cases) %>%
+   filter(Date > "2020-08-01") %>% 
+   pivot_wider(names_from = Date, values_from = Cases)
# A tibble: 11 x 24
   Region   Age `2020-08-02` `2020-08-03` `2020-08-04` `2020-08-05` `2020-08-06` `2020-08-07` `2020-08-08`
   <fct>  <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1 New J~     0       1967.        1967.        1967.        1967.        1967.        1967.        1967. 
 2 New J~    10       4962.        4962.        4962.        4962.        4962.        4962.        4962. 
 3 New J~    20      23854.       23854.       23854.       23854.       23854.       23854.       23854. 
 4 New J~    30      31052.       31052.       31052.       31052.       31052.       31052.       31052. 
 5 New J~    40      26123        26123        26123        26123        26123        26123        26123  
 6 New J~    50      34429.       34429.       34429.       34429.       34429.       34429.       34429. 
 7 New J~    60      26415.       26415.       26415.       26415.       26415.       26415.       26415. 
 8 New J~    70      14328.       14328.       14328.       14328.       14328.       14328.       14328. 
 9 New J~    80       9587.        9587.        9587.        9587.        9587.        9587.        9587. 
10 New J~    90       8184.        8184.        8184.        8184.        8184.        8184.        8184. 
11 New J~   100         70.2         70.2         70.2         70.2         70.2         70.2         70.2
# ... with 15 more variables: `2020-08-09` <dbl>, `2020-08-10` <dbl>, `2020-08-11` <dbl>,
#   `2020-08-12` <dbl>, `2020-08-13` <dbl>, `2020-08-14` <dbl>, `2020-08-19` <dbl>, `2020-08-20` <dbl>,
#   `2020-08-21` <dbl>, `2020-08-22` <dbl>, `2020-08-23` <dbl>, `2020-08-24` <dbl>, `2020-08-25` <dbl>,
#   `2020-08-26` <dbl>, `2020-08-27` <dbl>

Spain all-region cases and deaths have lost a lot

Comparing to the database on Sep.1, there seems to be an issue with Spain.

library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
# Empty data.table: 
dt5_ori[Country %like% "Spain" & Region == "All" & Sex == "b" & Date == "21.07.2020",]

Many thanks

A question on presenting data by time --- is the database built cumulatively?

Thanks again for the great work. I wish to double-check a question:
If I want to show how is the situation in, for example, last October, could I just filter the latest dataset by Date?
I assume the database is built cumulatively: so whenever there is an update in a country, the new data is appended to the old one. Is this roughly the case?
Many thanks!

USA all-region cases seems to be 0 in the output dataset?

I just refreshed the data, and it seems the cases for USA become 0?

library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
                            verbose = FALSE, progress = FALSE, return = "data.table")
dt5_ori[Country=="USA" & Region == "All" & !is.na(Cases), table(Cases)]

Cases
0
3633

The numbers exist last week in the earlier versions of the database.

Negative increase of cumulative deaths and cases

Often the series of Cases or Deaths, which should be a non-decreasing function, show a decrease compared to the day before. I wonder if this can have something to do with the smoothing (perhaps the smoothing goes by one dimension at a time - either age or date - instead of both, and this somehow breaks the continuity of the increase? Just guessing...sorry if it is completely irrelevant.)

This is the data for some countries (only the rows that have a negative change of Deaths and Cases compared to the day before).
Algeria
Algeria.txt
Austria
Austria.txt
Belgium
Belgium.txt
Brazil
Brazil.txt

Denmark strange patterns

Deaths: As the deaths reported in the output are the cumulative ones over time, they cannot be decreasing. The total number of deaths increases day by day, as expected. But if I break them down by age, some age groups are decreasing in some days. Please see below.

Sex ratio of the Cases: there is a strange convergence of many ages towards a value of 1 on one single day. Please see below.

Will there be any update for India?

The data for India was updated in 2020-10, just curious if there will be an update for India? (I understand it is not easy...)
Many thanks!

UK Case datasource

Hello @timriffe,

I have noticed that your dataset does not include any case data for the UK and I could not find any topics on why. Hence, I wanted to suggest a data source. At official UK Government website for data and insights on Coronavirus there is detailed data by region and 5-year age groups. They also offer permanent download links. For the aforementioned dataset its https://api.coronavirus.data.gov.uk/v2/data?areaType=region&metric=newCasesBySpecimenDateAgeDemographics&format=csv

Please be aware that the source operates a little strangely. To obtain the dataset, I set "Area type" to "Regional" and "Metrics" to "newCasesBySpecimenDataAgeDemographics". For some combinations of "Area Type" and "Metrics" there are no datasets available, and the download-button will simply do nothing apart from changing color (falsely indicating a successful download). I am also no expert on the UK's regional structure so I am not sure if this data can be seamlessly merged with your existing datasets. I have yet to compare the regions to the ONS dataset you use.

I hope the data meets your quality standards. I am probably not the only person interested in UK case data and it would be greatly appreciated if you find a way to implement it.

Output consistency checks required!

The harmonization process should maintain the consistency between sexes and both sex.

Below I am looking at the data in Output_10.csv and I can see that the sum of the values for m and f populations are far away from the count data in b by a large margin. Somewhere along the way from InputDB to Output must be a much bigger bug hidden in the lines.

> path <- "https://raw.githubusercontent.com/timriffe/covid_age/master/Data/Output_10.csv"
> cfr10 <- read.csv(
+   file = path,
+   header = TRUE,
+   skip = 1
+   )
> 
> cfr10 %>% 
+   filter(Country == "Spain",
+          Date == "02.05.2020") %>% 
+   select(-Cases, -Tests) %>% 
+   pivot_wider(names_from = Sex, values_from = Deaths) %>% 
+   mutate(`m+f` = m + f)
# A tibble: 22 x 10
   Country Region Code         Date         Age AgeInt       b     f     m `m+f`
   <fct>   <fct>  <fct>        <fct>      <int>  <int>   <dbl> <dbl> <dbl> <dbl>
 1 Spain   All    ES02.05.2020 02.05.2020     0     10     2.9    1     1      2
 2 Spain   All    ES02.05.2020 02.05.2020    10     10     7.4    3     2      5
 3 Spain   All    ES02.05.2020 02.05.2020    20     10    32.4    8    14     22
 4 Spain   All    ES02.05.2020 02.05.2020    30     10    84     22    35     57
 5 Spain   All    ES02.05.2020 02.05.2020    40     10   268.    69   113    182
 6 Spain   All    ES02.05.2020 02.05.2020    50     10   831.   160   404    564
 7 Spain   All    ES02.05.2020 02.05.2020    60     10  2233.   438  1077   1515
 8 Spain   All    ES02.05.2020 02.05.2020    70     10  6263.  1349  2900   4249
 9 Spain   All    ES02.05.2020 02.05.2020    80     10 10423.  3084  3987   7071
10 Spain   All    ES02.05.2020 02.05.2020    90     10  4266.  1654. 1206.  2860
# ... with 12 more rows

Data availability for France

Hi Tim,
Could you please let me know the reason why the availability of French data stops in July? Can we help you somehow to bring it up to date? Thanks a lot.

observed jumps in Maine cases

Hi @timriffe,

I am looking at the confirmed cases for Maine state and I see periods with significant jumps. I think it is an isolated event.
This might need some attention.

> read_csv(
+   file = "data/Output_10_20201208.zip",
+   skip = 3)%>% 
+   mutate(
+     Date = as.Date(Date, format = "%d.%m.%Y")) %>% 
+   filter(Sex == "b",
+          Region == "Maine",
+          Age == 60) %>% 
+   arrange(Date) %>% 
+   ggplot(aes(x = Date, y = Cases)) + 
+   geom_line(size = 1) + 
+   labs(title = "Confirmed cases in the 60-70 age group")

Looking at weekly no of cases per 100k inhabitants we would see this:

UK death data

Hello @timriffe,
I've noticed that there's some abnormal data in death data of United Kingdom. As a cumulative death data, using age 80 data as example, it's 34500.8 on 2021.01.01 but 9721.2 on 2021.01.08. I'm wondering how's that happended. I've compared with the data sourse of UK you provided, and it seems fine in the data source. Could you provide any possible explaination of that. Thank you so much!

US Ohio data

I can see that in Output_10.csv we have recorded cases as early as January 2, 2020. Supposedly, the virus was present in China only at that time.

Or a 2D smoothing is used at this stage to reconstruct the early records?

High number of deaths for ages 0-5 in the United Kingdom

The data (in Output_5) shows a very high number of deaths from covid for ages 0-5 in the UK data set, higher than any other age group. This isn't present, as far as I can see, in the England, England and Wales, Scotland, and Northern Ireland data. A jump in deaths seems to occur at the start of 2021, so this might be to do with swapping to the 2021 ONS dataset?

Time series of cumulative count could have bumps, and not always increasing

I just realized this interesting issue since my colleagues were asking for time series of cases/deaths from the database. Because the database was built in an accumulative way, it is possible to get time series for some countries that (luckily) have frequently data updates.

This is not common as data for most countries are quite consistent. But happened in several countries like the deaths of Indonesia. I plotted the deaths against different update dates and they should monotonically increase as in most countries. I assume this bump is caused by a change in the data source?

The effect is less obvious if zoom out to all age groups:

Just to give some other examples:

library(covidAgeData)
library(data.table)
library(ggplot2)
dt5_ori <- covidAgeData::download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
c_list <- sort(unique(dt5_ori$Country))
plot_by_c <- function(country0, measure0 = "Deaths"){
  dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All"  & !is.na(get(measure0))]
  # dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All" & Age <=20 & !is.na(get(measure0))]
  if(nrow(dt5_c) == 0) return(NULL)
  dt5_c[, Date:= as.Date(Date, format = "%d.%m.%Y")]
  dt5_c[, Age:=factor(as.factor(Age), levels = seq(0, 100 ,by = 5))]
  g_IDN <- ggplot(data = dt5_c, aes_string(x = "Date", y = measure0, color = "Age", group = "Age")) +
    # geom_bar(stat="identity", width=0.5, show.legend = FALSE, color = "#0058AB") +
    geom_line() +
    labs(x = "", y = "") + 
    scale_x_date(date_labels = "%Y-%m") +
    scale_y_continuous(expand = c(0,0)) +
    ggtitle(paste(country0, "-", measure0)) + 
    theme_classic() 
  return(g_IDN)
}
plot_by_c("Indonesia", measure0 = "Deaths")
# too see all the countries 
plist <- invisible(lapply(c_list, plot_by_c, measure0 = "Deaths"))
plist <- plist[!sapply(plist, is.null)]
plist <- lapply(plist, ggplotGrob)
ggsave(filename = "time series of deaths_all_age.pdf",
       plot = gridExtra::marrangeGrob(grobs = plist, ncol = 2, nrow = 2), width = 20, height = 15)

Metadata Update

Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data frame http://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/JWULOO
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.

France all-region cases seem to lost (mostly)

library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
dt5_ori[Country=="France" & Region == "All" & Date=="21.02.2021" & Sex=="b" , sum(Cases, na.rm = TRUE)]

gives 5700.1.
On last Monday (3/1) the number was 2.8 million
(Date == "21.02.2021" is the latest date with data available)

This is comparing the "Output_5" datasets I downloaded on 3/8 (column COVerAGE-DB) to the 3/1 one (column COVerAGE-DB_old), but this issue emerged since the 3/3 dataset.

As we can see, Spain and Portugal also have some decrease in numbers.

I am finalizing a dashboard that updates weekly using COVerAGE-DB, so I guess I will check these general consistency things for you every week.

Argentina Data

hi! we're keeping uptodate an open data Argentina repo over here: https://github.com/SistemasMapache/Covid19arData

Italian Output

Checking the Italian output i can see:

   Country Region             Code       Date Sex Age AgeInt   Cases Deaths
1    Italy    All ITinfo15.04.2020 2020-04-15   b   0     10  1008.8    1.0
2    Italy    All ITinfo15.04.2020 2020-04-15   b  10     10     0.0    0.0
3    Italy    All ITinfo15.04.2020 2020-04-15   b  20     10  7061.9    7.0
4    Italy    All ITinfo15.04.2020 2020-04-15   b  30     10 13114.9   39.0
5    Italy    All ITinfo15.04.2020 2020-04-15   b  40     10 19392.1  173.0
6    Italy    All ITinfo15.04.2020 2020-04-15   b  50     10 30103.7  746.0
7    Italy    All ITinfo15.04.2020 2020-04-15   b  60     10 23808.5 2242.1
8    Italy    All ITinfo15.04.2020 2020-04-15   b  70     10 25532.0 6074.3
9    Italy    All ITinfo15.04.2020 2020-04-15   b  80     10 26097.4 7890.4
10   Italy    All ITinfo15.04.2020 2020-04-15   b  90     10  6907.3 1359.0
11   Italy    All ITinfo15.04.2020 2020-04-15   b 100     10  2440.5  976.1

When Age == 10 the number of cases is 0 (unlikely!). I think the cases here are estimated based on the death records, information which is taken from daily summarized info-graphics.

However, note that on a weekly basis complete information by age and sex is provided (see April 16). The modeled daily data should be consistent with the complete weekly data at the nearest calendar date ...or removed.

Also if data is already reported in the desired age classes, I don't think any altering of the data should be applied. This is the case for the number of deaths displayed above. Only the deaths in the last reported age group 90+ should be split between 90 and 100. For all the other ages should be read as reported. I know this complicates a bit the coding but would be the right way to do it.

Possibly wrong new year assigned in input - Wisconsin case

Hi @timriffe

I can see this:

coverage <- read_csv(
  file = "data/Output_10_20210107.zip",
  skip = 3) %>% 
  filter(Region == "Wisconsin",
         Sex == "b") %>% 
  mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>% 
  arrange(Date) %>% 
  print(n = 70)

# A tibble: 3,124 x 10
   Country Region    Code             Date         Age AgeInt Sex     Cases Deaths Tests
   <chr>   <chr>     <chr>            <date>     <dbl>  <dbl> <chr>   <dbl>  <dbl> <lgl>
 1 USA     Wisconsin US_WI_01.01.2020 2020-01-01     0     10 b     18816      0   NA   
 2 USA     Wisconsin US_WI_01.01.2020 2020-01-01    10     10 b     54477      2   NA   
 3 USA     Wisconsin US_WI_01.01.2020 2020-01-01    20     10 b     92503     16   NA   
 4 USA     Wisconsin US_WI_01.01.2020 2020-01-01    30     10 b     75498     37   NA   
 5 USA     Wisconsin US_WI_01.01.2020 2020-01-01    40     10 b     68817     82   NA   
 6 USA     Wisconsin US_WI_01.01.2020 2020-01-01    50     10 b     73912    268   NA   
 7 USA     Wisconsin US_WI_01.01.2020 2020-01-01    60     10 b     52605    641   NA   
 8 USA     Wisconsin US_WI_01.01.2020 2020-01-01    70     10 b     27593   1217   NA   
 9 USA     Wisconsin US_WI_01.01.2020 2020-01-01    80     10 b     13546   1539   NA   
10 USA     Wisconsin US_WI_01.01.2020 2020-01-01    90     10 b      5216.  1060.  NA   
11 USA     Wisconsin US_WI_01.01.2020 2020-01-01   100      5 b        24.2    7.5 NA   
12 USA     Wisconsin US_WI_02.01.2020 2020-01-02     0     10 b     18864      0   NA   
13 USA     Wisconsin US_WI_02.01.2020 2020-01-02    10     10 b     54588      2   NA   
14 USA     Wisconsin US_WI_02.01.2020 2020-01-02    20     10 b     92705     16   NA   
15 USA     Wisconsin US_WI_02.01.2020 2020-01-02    30     10 b     75672     37   NA   
16 USA     Wisconsin US_WI_02.01.2020 2020-01-02    40     10 b     68979     82   NA   
17 USA     Wisconsin US_WI_02.01.2020 2020-01-02    50     10 b     74056    268   NA   
18 USA     Wisconsin US_WI_02.01.2020 2020-01-02    60     10 b     52729    641   NA   
19 USA     Wisconsin US_WI_02.01.2020 2020-01-02    70     10 b     27662   1217   NA   
20 USA     Wisconsin US_WI_02.01.2020 2020-01-02    80     10 b     13581   1540   NA   
21 USA     Wisconsin US_WI_02.01.2020 2020-01-02    90     10 b      5225.  1060.  NA   
22 USA     Wisconsin US_WI_02.01.2020 2020-01-02   100      5 b        24.2    7.5 NA   
23 USA     Wisconsin US_WI_03.01.2020 2020-01-03     0     10 b     19036      0   NA   
24 USA     Wisconsin US_WI_03.01.2020 2020-01-03    10     10 b     54876      2   NA   
25 USA     Wisconsin US_WI_03.01.2020 2020-01-03    20     10 b     93085     16   NA   
26 USA     Wisconsin US_WI_03.01.2020 2020-01-03    30     10 b     76059     37   NA   
27 USA     Wisconsin US_WI_03.01.2020 2020-01-03    40     10 b     69299     82   NA   
28 USA     Wisconsin US_WI_03.01.2020 2020-01-03    50     10 b     74394    269   NA   
29 USA     Wisconsin US_WI_03.01.2020 2020-01-03    60     10 b     53018    642   NA   
30 USA     Wisconsin US_WI_03.01.2020 2020-01-03    70     10 b     27801   1217   NA   
31 USA     Wisconsin US_WI_03.01.2020 2020-01-03    80     10 b     13685   1542   NA   
32 USA     Wisconsin US_WI_03.01.2020 2020-01-03    90     10 b      5254.  1060.  NA   
33 USA     Wisconsin US_WI_03.01.2020 2020-01-03   100      5 b        24.2    7.5 NA   
34 USA     Wisconsin US_WI_04.01.2020 2020-01-04     0     10 b     19139      0   NA   
35 USA     Wisconsin US_WI_04.01.2020 2020-01-04    10     10 b     55072      2   NA   
36 USA     Wisconsin US_WI_04.01.2020 2020-01-04    20     10 b     93318     16   NA   
37 USA     Wisconsin US_WI_04.01.2020 2020-01-04    30     10 b     76268     37   NA   
38 USA     Wisconsin US_WI_04.01.2020 2020-01-04    40     10 b     69467     82   NA   
39 USA     Wisconsin US_WI_04.01.2020 2020-01-04    50     10 b     74581    269   NA   
40 USA     Wisconsin US_WI_04.01.2020 2020-01-04    60     10 b     53153    642   NA   
41 USA     Wisconsin US_WI_04.01.2020 2020-01-04    70     10 b     27877   1219   NA   
42 USA     Wisconsin US_WI_04.01.2020 2020-01-04    80     10 b     13758   1545   NA   
43 USA     Wisconsin US_WI_04.01.2020 2020-01-04    90     10 b      5281.  1064.  NA   
44 USA     Wisconsin US_WI_04.01.2020 2020-01-04   100      5 b        24.3    7.5 NA   
45 USA     Wisconsin US_WI_05.01.2020 2020-01-05     0     10 b     19279      0   NA   
46 USA     Wisconsin US_WI_05.01.2020 2020-01-05    10     10 b     55447      2   NA   
47 USA     Wisconsin US_WI_05.01.2020 2020-01-05    20     10 b     93869     16   NA   
48 USA     Wisconsin US_WI_05.01.2020 2020-01-05    30     10 b     76818     37   NA   
49 USA     Wisconsin US_WI_05.01.2020 2020-01-05    40     10 b     69972     84   NA   
50 USA     Wisconsin US_WI_05.01.2020 2020-01-05    50     10 b     75131    278   NA   
51 USA     Wisconsin US_WI_05.01.2020 2020-01-05    60     10 b     53543    655   NA   
52 USA     Wisconsin US_WI_05.01.2020 2020-01-05    70     10 b     28091   1241   NA   
53 USA     Wisconsin US_WI_05.01.2020 2020-01-05    80     10 b     13850   1572   NA   
54 USA     Wisconsin US_WI_05.01.2020 2020-01-05    90     10 b      5316.  1086.  NA   
55 USA     Wisconsin US_WI_05.01.2020 2020-01-05   100      5 b        24.5    7.7 NA   
56 USA     Wisconsin US_WI_29.03.2020 2020-03-29     0     10 b         4      0   NA   
57 USA     Wisconsin US_WI_29.03.2020 2020-03-29    10     10 b        14      0   NA   
58 USA     Wisconsin US_WI_29.03.2020 2020-03-29    20     10 b       148      0   NA   
59 USA     Wisconsin US_WI_29.03.2020 2020-03-29    30     10 b       169      0   NA   
60 USA     Wisconsin US_WI_29.03.2020 2020-03-29    40     10 b       186      0   NA   
61 USA     Wisconsin US_WI_29.03.2020 2020-03-29    50     10 b       203      4   NA   
62 USA     Wisconsin US_WI_29.03.2020 2020-03-29    60     10 b       220      3   NA   
63 USA     Wisconsin US_WI_29.03.2020 2020-03-29    70     10 b       111      3   NA   
64 USA     Wisconsin US_WI_29.03.2020 2020-03-29    80     10 b        46      2   NA   
65 USA     Wisconsin US_WI_29.03.2020 2020-03-29    90     10 b        10.9    1   NA   
66 USA     Wisconsin US_WI_29.03.2020 2020-03-29   100      5 b         0.1    0   NA   
67 USA     Wisconsin US_WI_30.03.2020 2020-03-30     0     10 b         4      0   NA   
68 USA     Wisconsin US_WI_30.03.2020 2020-03-30    10     10 b        16      0   NA   
69 USA     Wisconsin US_WI_30.03.2020 2020-03-30    20     10 b       161      0   NA   
70 USA     Wisconsin US_WI_30.03.2020 2020-03-30    30     10 b       179      0   NA   
# ... with 3,054 more rows

Is it possible that the wrong year is assigned to first 5 recorded days? 2020 instead of 2021?
Note the jump in at line 56, from January to March.

US Virginia Data

In Output_10.csv the number of confirmed cases and deaths in Virginia appear exceptionally high. We can see more than 7 mil cases.

According to the US Census data the entire state has a mid-2019 population of 4,248,373.

> path <- "Output_10_20200828.csv"
> data <- read.csv(
+   file = path,
+   header = TRUE,
+   skip = 3
+ ) %>% 
+   filter(Region == "Virginia",
+          Date == "27.08.2020") %>% 
+   group_by(Region) %>% 
+   summarise_if(is.numeric, sum )
>   
> 
> data
# A tibble: 1 x 6
  Region     Age AgeInt    Cases  Deaths Tests
  <fct>    <int>  <int>    <dbl>   <dbl> <dbl>
1 Virginia   550    105 7922068. 208726.    NA

Typo: The name of the state Louisiana

In Output_10.csv one can see that the name of the state is misspelled Lousiana instead of Louisiana.
I did not check the other output or input files.

data availability tweaks, coverage

Add another tab, just one row per country (Region == "All"), with another column indicating the most recent date.

Add a coverage graphic: Cases and Deaths as a fraction of totals from worldometers or JHU.

ungroup::pclm estimates

The {ungroup} package is great at ungrouping, however I think we should pay attention to the resulted CFR curves.

In my tests using the New York data where I harmonized the age-bands (10y groups) I noticed that the CFR at young ages is overestimated compared with the values for other regions where no ungrouppig of the data was applied (ITA, ESP, KOR, NLD, ...). This might be due to the fact that in each age band the death counts are skewed to the right (towards old ages) but not necessarily the number of confirmed cases. I assume the infections to be more uniformly distributed.

It is difficult for me to believe that we can see something that different in NY, but not impossible. For now, I am attributing this different outcome to the pclm() limitation. Maybe Silvia has a solution.

At young ages we are talking about small values however considering the age-structure of the population these can become significant.

I will try to post here some figures to support my claim.

Add selectize to shiny

Putting it here for me to remember. Add selectize to shiny for improved search.

South Africa

Age specific deaths available for SA on seemingly daily basis:

https://twitter.com/DrZweliMkhize/status/1268261539418701825

How can we help to get these into the database?

Your dataset in CoronaWhy Dataverse

Hello,

Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 dataframe https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/JWULOO.

Would you be willing to help with maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for medical science community.

Availability dash, add date range

Note to include both earliest and most recent dates in the availability dashboard.

Some countries in the database before are no longer there

Compare Output_5 to last week, it seems 8 countries were lost:

"Croatia" only misses Region == "All" while the rest 7 c("Cyprus", "Lithuania", "Luxembourg", "Latvia", "Malta", "Poland", "Slovakia") are no longer there.

library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
# Empty data.table: 
dt5_ori[Country %like% "Croatia" & Region == "All" & Sex == "b" ,]
dt5_ori[Country %in% c("Cyprus", "Lithuania", "Luxembourg", "Latvia", "Malta", "Poland", "Slovakia"),]

Null bytes in the first two lines of inputDB confuse 'read' functions.

The first two lines of in 'inputDB.csv' have some bytes which cause me problems when trying to read the file.
In R, the following error message is printed while trying to run readr::read_csv("inputDB.csv", skip = 1):

Error in make.names(x) : invalid multibyte string at '<a2><9b>y<bd>X'

I've attached the first 10 lines of the problematic 'inputDB':
inputDB-head.zip

On Linux, it looks like these bytes cause the file to be recognized as a gzip type, while it should be plain text.

$ file inputDB-head.csv
inputDB-head.csv: gzip compressed data, from HPFS filesystem (OS/2, NT), original size modulo 2^32 172376364

timriffe / covid_age Goto Github PK

covid_age's People

Contributors

Stargazers

Watchers

Forkers

covid_age's Issues

Recommend Projects

Recommend Topics

Recommend Org