timriffe / covid_age Goto Github PK
View Code? Open in Web Editor NEWCOVerAGE-DB: COVID-19 cases, deaths, and tests by age and sex
License: Other
COVerAGE-DB: COVID-19 cases, deaths, and tests by age and sex
License: Other
Hi @timriffe
For some regions the data is collected from multiple data sources. Sometimes we observe the case that in the final output files for the same region in the same day two different values are recorded. The 2 values can be identical or significantly different.
This is more a conceptual issues, but in my opinion the processed output should contain only 1 record (most credible) for the given region-day-sex-age. Otherwise, we risk to confuse the user. I am a bit confused too, and ponder how to move on when I come across different double entries .
We can see this behavior in 69 * ages * sex instances.
Here's my code
# read data
coverage <- read_csv(
file = "data/Output_10_20210108.zip",
skip = 3)
# summarise
d <- coverage %>%
filter(
Sex == "b") %>%
group_by(Country, Region, Date, Code) %>%
summarise(cases = sum(Cases)) %>%
mutate(Code2 = paste(Country, Region, Date))
# Show duplicates
d[duplicated(d$Code2), ] %>%
print(n = Inf)
# A tibble: 68 x 6
# Groups: Country, Region, Date [68]
Country Region Date Code cases Code2
<chr> <chr> <chr> <chr> <dbl> <chr>
1 Austria All 01.11.2020 AT01.11.2020 111110. Austria All 01.11.2020
2 Austria All 08.11.2020 AT08.11.2020 155903 Austria All 08.11.2020
3 Austria All 15.11.2020 AT15.11.2020 206208 Austria All 15.11.2020
4 Austria All 22.11.2020 AT22.11.2020 248419 Austria All 22.11.2020
5 Austria All 25.10.2020 AT25.10.2020 82536. Austria All 25.10.2020
6 Belgium All 01.11.2020 BE01.11.2020 448645 Belgium All 01.11.2020
7 Belgium All 25.10.2020 BE25.10.2020 349459 Belgium All 25.10.2020
8 Bulgaria All 01.11.2020 BG01.11.2020 52844 Bulgaria All 01.11.2020
9 Bulgaria All 25.10.2020 BG25.10.2020 37562 Bulgaria All 25.10.2020
10 Czechia All 01.11.2020 CZ_ECDC_01.11.2020 319569 Czechia All 01.11.2020
11 Czechia All 08.11.2020 CZ_ECDC_08.11.2020 392561 Czechia All 08.11.2020
12 Czechia All 15.11.2020 CZ_ECDC_15.11.2020 438589 Czechia All 15.11.2020
13 Czechia All 22.11.2020 CZ_ECDC_22.11.2020 470652 Czechia All 22.11.2020
14 Czechia All 25.10.2020 CZ_ECDC_25.10.2020 155582. Czechia All 25.10.2020
15 Estonia All 01.11.2020 EE_ECDC_01.11.2020 4882. Estonia All 01.11.2020
16 Estonia All 08.11.2020 EE_ECDC_08.11.2020 4882. Estonia All 08.11.2020
17 Estonia All 15.11.2020 EE_ECDC_15.11.2020 7450 Estonia All 15.11.2020
18 Estonia All 22.11.2020 EE_ECDC_22.11.2020 9534 Estonia All 22.11.2020
19 Estonia All 25.10.2020 EE_ECDC_25.10.2020 3964 Estonia All 25.10.2020
20 Finland All 01.11.2020 FI01.11.2020 NA Finland All 01.11.2020
21 Finland All 08.11.2020 FI08.11.2020 NA Finland All 08.11.2020
22 Finland All 15.11.2020 FI15.11.2020 NA Finland All 15.11.2020
23 Finland All 25.10.2020 FI25.10.2020 NA Finland All 25.10.2020
24 France All 01.11.2020 FR01.11.2020 NA France All 01.11.2020
25 France All 25.10.2020 FR25.10.2020 NA France All 25.10.2020
26 Germany All 01.11.2020 DE_ECDC_01.11.2020 548122 Germany All 01.11.2020
27 Germany All 08.11.2020 DE_ECDC_08.11.2020 548122 Germany All 08.11.2020
28 Germany All 15.11.2020 DE_ECDC_15.11.2020 548122 Germany All 15.11.2020
29 Germany All 22.11.2020 DE_ECDC_22.11.2020 894433. Germany All 22.11.2020
30 Germany All 25.10.2020 DE_ECDC_25.10.2020 301126 Germany All 25.10.2020
31 Italy All 02.04.2020 ITinfo02.04.2020 105825. Italy All 02.04.2020
32 Italy All 03.06.2020 ITinfo03.06.2020 235046. Italy All 03.06.2020
33 Italy All 06.04.2020 ITinfo06.04.2020 125464. Italy All 06.04.2020
34 Italy All 09.04.2020 ITinfo09.04.2020 133751 Italy All 09.04.2020
35 Italy All 19.03.2020 ITinfo19.03.2020 33296 Italy All 19.03.2020
36 Italy All 20.05.2020 ITinfo20.05.2020 228447. Italy All 20.05.2020
37 Italy All 23.03.2020 ITinfo23.03.2020 54291. Italy All 23.03.2020
38 Italy All 26.03.2020 ITinfo26.03.2020 70619 Italy All 26.03.2020
39 Italy All 28.04.2020 ITinfo28.04.2020 195001. Italy All 28.04.2020
40 Italy All 30.03.2020 ITinfo30.03.2020 91446. Italy All 30.03.2020
41 Netherlands All 01.11.2020 NL01.11.2020 393187 Netherlands All 01.11.2020
42 Netherlands All 08.11.2020 NL08.11.2020 431655. Netherlands All 08.11.2020
43 Netherlands All 15.11.2020 NL15.11.2020 468804 Netherlands All 15.11.2020
44 Netherlands All 22.11.2020 NL22.11.2020 504117 Netherlands All 22.11.2020
45 Netherlands All 25.10.2020 NL25.10.2020 341570 Netherlands All 25.10.2020
46 Norway All 01.11.2020 NO01.11.2020 21471 Norway All 01.11.2020
47 Norway All 08.11.2020 NO08.11.2020 25519 Norway All 08.11.2020
48 Norway All 15.11.2020 NO15.11.2020 29607 Norway All 15.11.2020
49 Norway All 25.10.2020 NO25.10.2020 18327 Norway All 25.10.2020
50 Palestine All 31.12.2020 PS31.12.2020 155365 Palestine All 31.12.2020
51 Portugal All 01.11.2020 PT01.11.2020 146659. Portugal All 01.11.2020
52 Portugal All 08.11.2020 PT08.11.2020 183159. Portugal All 08.11.2020
53 Portugal All 15.11.2020 PT15.11.2020 221250. Portugal All 15.11.2020
54 Portugal All 22.11.2020 PT22.11.2020 259788 Portugal All 22.11.2020
55 Portugal All 25.10.2020 PT25.10.2020 120946 Portugal All 25.10.2020
56 South Korea All 01.01.2021 KR31.12.2020 60740 South Korea All 01.01.2021
57 USA California 28.11.2020 US_CDC_CA28.11.2020 NA USA California 28.11.2020
58 USA Florida 28.11.2020 US_FL28.11.2020 992661. USA Florida 28.11.2020
59 USA Idaho 28.11.2020 US_ID28.11.2020 99660. USA Idaho 28.11.2020
60 USA Iowa 28.11.2020 US_IA_28.11.2020 225784 USA Iowa 28.11.2020
61 USA Maine 28.11.2020 US_ME28.11.2020 11508 USA Maine 28.11.2020
62 USA New Jersey 28.11.2020 US_NJ28.11.2020 330275 USA New Jersey 28.11.2020
63 USA Ohio 28.11.2020 US_OH28.11.2020 458372 USA Ohio 28.11.2020
64 USA Oklahoma 28.11.2020 US_OK28.11.2020 194022. USA Oklahoma 28.11.2020
65 USA Pennsylvania 28.11.2020 US_PA28.11.2020 326741. USA Pennsylvania 28.11.2020
66 USA Texas 28.11.2020 US_TX28.11.2020 1151069 USA Texas 28.11.2020
67 USA Vermont 28.11.2020 US_VT28.11.2020 4100. USA Vermont 28.11.2020
68 USA Virginia 28.11.2020 US_VA28.11.2020 233617. USA Virginia 28.11.2020
Manual inspection of the 1st case
coverage %>%
filter(Country == "Austria",
Region == "All",
Date == "01.11.2020",
Sex == "b",
Age == 60)
# A tibble: 2 x 10
Country Region Code Date Age AgeInt Sex Cases Deaths Tests
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 Austria All AT_ECDC_01.11.2020 01.11.2020 60 10 b 7555 29 NA
2 Austria All AT01.11.2020 01.11.2020 60 10 b 9342. 86.4 NA
Hi Tim,
Looking at the Wisconsin data I can see a trend break between end of April and beginning of June. The data in this period seems to belong to another region. Considering that we are looking at cumulative data this look wrong to me.
Here's a quick way to visualize it:
W <- read.csv(
file = "Output_10_20200831.csv",
header = TRUE,
skip = 3
) %>%
filter(Region == "Wisconsin",
Sex == "b") %>%
mutate(
Date = as.Date(Date, format = "%d.%m.%Y"))
ggplot(W, aes(x = Date, y = Cases, col = factor(Age))) +
geom_point()
Thanks for investigating!
Marius
I am maintain a similar database for SCOR but with significantly less ambitions. A good source of offset data it the WPP2019 that can be found very convenient in the {wpp2019}
package. See popM
and popF
to get 5y-age data.
I am using the last available HMD
exposures and deaths when I what to determine the excess in mortality during the covid epidemic.
Dear COVerAGE team,
I've noticed that the data inputs and outputs for Germany have stopped in August. Could we please have an update on this?
Suggestion:
I think it would be a good idea to have a database news section somewhere where important updates and developments are shared (maybe in 1 sentence). Also we could produce a "data-availability table" where for each region to have specified the start and the end of the data, perceived quality and info about missing data. I would be of a great help to all of us who are working with your amazing data frequently.
Regards,
Marius
... and not properly anonymized i'd say, can be found here.
I don't know if you are aware of this, but you can find seriatim death data (including COVID related). This might be a sign that more high quality data is coming from the US.
However, if you know somebody close to the source it would be nice to raise a warning and send them an email explaining that we will gladly use the data as is but this can generate a reputational risk on their side.
There are cases, such as Slovenia, where age grouping (for deaths in this case) is done on the fly to make adjacent 0s into one age group. For example age 0-34 = 0 for one sex and age 0-44 for the other sex. In this case, aggregation to b
doesn't work properly in step 2. The function infer_both_sex()
should be modified to detect and handle this situation. Leaving this here as a TO-DO. @kikeacosta tagging so you get the memo too
I did a comparison between the Spanish data I collected and what I see in the inputDB.csv
.
The sources appear to be the same however at certain points in time consistent differences across multiple ages can be observed. E.g.: March 29, 31 and April 13.
This is related to #4.
Also, it is not clear to me what is reported when Age == UNK
. Maybe unknown? For Spain very large numbers can be seen there.
See the file attached here:
ESP_Data_Comparison_20200413.xlsx
Why can't we upload files to the OSF repo through osfr
? It seems that if we create the OSF project with osf_create_project
we can upload and overwrite files but with the current OSF repo, we're not allowed to upload anything. See ropensci/osfr#126
Get a notification when osfr
uploads the files every day.
Implement an automatic twitter bot that announces that name of the countries (name) and the number of increase in states and provinces (the number). Include also the world map with the current countries (see https://raw.githubusercontent.com/timriffe/covid_age/master/assets/coveragemap.svg). Another type of tweet would be: we have X countries out of X in Africa/North America/Etc..
We have a bottleneck in the table that summarizes which transformation happened for which country/province/date. We read all the data from here and then apply some functions to check which chunks got transformed (here). We should get all of this to run inside a profiler to check which functions need to be optimized. Currently, this process is slow.
The big data in https://timriffe.github.io/covid_age/DataAvail.html --> 'By country, region and date' is throwing this warning:
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
Why is this?
Look into including pointblank
validation reporting as part of the daily build cycle. https://github.com/rich-iannone/pointblank/
write an external consistency check to compare crude totals with Hopkins data. Huge abrupt deviations should be flagged.
As of August 13, the total case count for Texas when added up by age groups in this dataset is 43,409. The "TOT" line for all ages is 513,575, which is in line with numbers I see elsewhere.
I'm guessing that Texas simply didn't report ages for more than 90% of cases, even though it is reporting them for all or nearly all of deaths. Does the discrepancy above come from the state's reporting or is it an actual error?
If the former, should there be some way that such data limitations are reported within the data set?
Lastly, thanks for making all of this data available! It's a wonderful resource to have in one place.
inputDB.csv
, Output_5.csv
and Output_10.csv
should all be zipped prior to OSF upload, ASAP.
then it works just the same readr::read_csv("inputDB.zip")
Hi @timriffe,
Currently for Israel little or no info is collected in the database except for number of new cases.
I would like to suggest a data source that might or might not be known to you: https://data.gov.il/dataset/covid-19
The website is in Hebrew, however I was able to navigate it in Chrome with auto translation.
One can find there details about weekly death and tests and more. One drawback is that for records higher than 0 and lower than 15, "<15" is displayed instead of a numerical value. I suspect that this is done in order to anonymized the data.
I think Israel deserves increased attention from us because is the leading country that is offering information about the vaccine effectiveness, so I guess everyone's eyes are on it.
Code
is just the short code portion.Sex = "b"
to "t"
everywhereThanks for compiling such a great resource!
There appear to be extra columns y
and 2499
in inputDB
, such that it has 13 columns rather than 11:
> inputDB <- read_csv("inputDB.zip",
+ skip = 1,
+ col_types = "cccccciccdc")
|===================================================| 100% 1092 MB
Warning: 15278446 parsing failures.
row col expected actual file
1 -- 11 columns 13 columns 'inputDB.zip'
2 -- 11 columns 13 columns 'inputDB.zip'
3 -- 11 columns 13 columns 'inputDB.zip'
4 -- 11 columns 13 columns 'inputDB.zip'
5 -- 11 columns 13 columns 'inputDB.zip'
... ... .......... .......... .............
See problems(...) for more details.
Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
> inputDB <- read_csv("inputDB.zip",
+ skip = 1)
── Column specification ──────────────────────────────────────────
cols(
Country = col_character(),
Region = col_character(),
Code = col_character(),
Date = col_character(),
Sex = col_character(),
Age = col_character(),
AgeInt = col_double(),
Metric = col_character(),
Measure = col_character(),
Value = col_double(),
Short = col_character(),
y = col_logical(),
`2499` = col_logical()
)
|===================================================| 100% 1092 MB
This seems to cause downloading and reading in with covidAgeData
to fail:
> inputDB = download_covid("inputDB",progress = F)
Reading /var/folders/xq/qzl74ksd6sxc9rfvlpfslt980000gn/T//RtmpWykb7A/Data/inputDB.csv
Error in data.table::fread(filepath, sep = ",", colClasses = rinfo[[2]], :
colClasses= is an unnamed vector of types, length 11, but there are 13 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.
Sorry, I haven't been able to look at the source code to see what might be causing this.
Hi everyone,
Thought I'd ask for help here. I've finally been working on automating the collection of death counts on Israel's side.
Now, the only way to get complete death counts currently is by manually downloading cumulative counts from the following dashboard (In Hebrew): https://datadashboard.health.gov.il/COVID-19/general
If you go on that website and click the following download button:
body > ngx-app > ngx-pages > nb-layout > div.scrollable-container > div > div > div > div > nb-layout-column > ngx-general > section:nth-child(3) > div > ngx-tile-wrapper > ngx-horizontal-bar-by-age-gender > nb-card > nb-card-header > ngx-select-header-infected-level > div:nth-child(1) > excel-download-button > div > button
(Open Developer Tools and search that selector)
Then you get a csv file that includes a column of current cumulative death counts.
I've been manually downloading files for quite a while, because I couldn't figure out how to automate this process. The dashboard page isn't automation-friendly unfortunately.
If you can think of ideas of how to do that, it would be greatly appreciated!
What I have currently: https://github.com/eshom/covid-dashboard-processing
What the scripts I wrote do, they download the data I semi-regularly upload to Israel's Google Drive folder. It then processes it to fit our format, And finally saves the output to /data/deaths.csv
, which is pretty much ready to be joined with other measures.
A known issue with this data source, is that counts aren't strictly cumulative. At some periods, for some undocumented reasons, deaths were reduced. This was checked, and the change in counts is visible on the raw file.
Pinging you here, because I believe you wanted to get this going?
@timriffe
Hi all,
Today, when working with the US subset of the data, I noticed some duplication but with differing values. Specifically, for any given Region, Sex, and Age I was expecting there to be a single observation; however, this is not the case. For example, Arizona on 9/26/20 for Sex equal to b and Age equal to 0 has two observations with differing case, test, and death values - the lines from the CSV file are included below for clarity. Can you provide some clarification here? As of this writing, I identified 1407 such instances in the US data.
USA, Arizona, US_AZ 26.09.2020, 26.09.2020, b, 0, 5, 3531.4, 0.4, 24559.3
USA, Arizona, US_CDC_AZ26.09.2020, 26.09.2020, b, 0, 5, NA, 1, NA
Thanks!
Great to find this. Exquisite work!
Everybody is talking about China but I am still amazed how difficult is to find detailed info for the US and UK.
Because I did not find UK on the list I will add here what I know about it:
https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales
See the April 3 update. In the Excel file you will find death counts by age but not confirmed cases.
Regards,
Marius
In the output for Germany I can see:
Country Region Code Date Sex Age AgeInt Cases Deaths
1 Germany All DE01.03.2020 01.03.2020 f 0 10 NaN 0.0
2 Germany All DE01.03.2020 01.03.2020 f 10 10 NaN 0.0
3 Germany All DE01.03.2020 01.03.2020 f 20 10 NaN 0.0
4 Germany All DE01.03.2020 01.03.2020 f 30 10 NaN 0.0
5 Germany All DE01.03.2020 01.03.2020 f 40 10 NaN 0.0
6 Germany All DE01.03.2020 01.03.2020 f 50 10 NaN 0.0
7 Germany All DE01.03.2020 01.03.2020 f 60 10 NaN 1.3
8 Germany All DE01.03.2020 01.03.2020 f 70 10 NaN 1.7
9 Germany All DE01.03.2020 01.03.2020 f 80 10 NaN 0.0
10 Germany All DE01.03.2020 01.03.2020 f 90 10 NaN 0.0
11 Germany All DE01.03.2020 01.03.2020 f 100 10 NaN 3.9
Is there a reason why I get NaN
instead ofa number or NA
's for Germany? The data appears to be available?
Clearly this is a project that I want in. So I am continuing to post my observations.
I couldn't find the source for the Danish data or the original data in inputDB.csv
. Can we have a database with sources? Also would be nice to indicate in the output the degree to which the data has been modeled and the perceived data quality (good, bad, mediocre...).
Now, looking at the cfr in the Danish female population as of April 9 we can see an unusual pattern.
Can we explain this?
In Output_10.csv
at the state of New Jersey i can see several weeks in a row with the same records, day after day. Does this indicate zero new cases in the period/region or the lack of new data?
I did this:
> NJ <- read.csv(
+ file = "Output_10_20200901.csv",
+ header = TRUE,
+ skip = 3
+ ) %>%
+ filter(Region == "New Jersey",
+ Sex == "b") %>%
+ mutate(
+ Date = as.Date(Date, format = "%d.%m.%Y"))
>
>
> NJ %>% select(Region, Date, Age, Cases) %>%
+ filter(Date > "2020-08-01") %>%
+ pivot_wider(names_from = Date, values_from = Cases)
# A tibble: 11 x 24
Region Age `2020-08-02` `2020-08-03` `2020-08-04` `2020-08-05` `2020-08-06` `2020-08-07` `2020-08-08`
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New J~ 0 1967. 1967. 1967. 1967. 1967. 1967. 1967.
2 New J~ 10 4962. 4962. 4962. 4962. 4962. 4962. 4962.
3 New J~ 20 23854. 23854. 23854. 23854. 23854. 23854. 23854.
4 New J~ 30 31052. 31052. 31052. 31052. 31052. 31052. 31052.
5 New J~ 40 26123 26123 26123 26123 26123 26123 26123
6 New J~ 50 34429. 34429. 34429. 34429. 34429. 34429. 34429.
7 New J~ 60 26415. 26415. 26415. 26415. 26415. 26415. 26415.
8 New J~ 70 14328. 14328. 14328. 14328. 14328. 14328. 14328.
9 New J~ 80 9587. 9587. 9587. 9587. 9587. 9587. 9587.
10 New J~ 90 8184. 8184. 8184. 8184. 8184. 8184. 8184.
11 New J~ 100 70.2 70.2 70.2 70.2 70.2 70.2 70.2
# ... with 15 more variables: `2020-08-09` <dbl>, `2020-08-10` <dbl>, `2020-08-11` <dbl>,
# `2020-08-12` <dbl>, `2020-08-13` <dbl>, `2020-08-14` <dbl>, `2020-08-19` <dbl>, `2020-08-20` <dbl>,
# `2020-08-21` <dbl>, `2020-08-22` <dbl>, `2020-08-23` <dbl>, `2020-08-24` <dbl>, `2020-08-25` <dbl>,
# `2020-08-26` <dbl>, `2020-08-27` <dbl>
Comparing to the database on Sep.1, there seems to be an issue with Spain.
library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
verbose = FALSE, progress = FALSE, return = "data.table")
# Empty data.table:
dt5_ori[Country %like% "Spain" & Region == "All" & Sex == "b" & Date == "21.07.2020",]
Many thanks
Thanks again for the great work. I wish to double-check a question:
If I want to show how is the situation in, for example, last October, could I just filter the latest dataset by Date
?
I assume the database is built cumulatively: so whenever there is an update in a country, the new data is appended to the old one. Is this roughly the case?
Many thanks!
I just refreshed the data, and it seems the cases for USA become 0?
library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
verbose = FALSE, progress = FALSE, return = "data.table")
dt5_ori[Country=="USA" & Region == "All" & !is.na(Cases), table(Cases)]
Cases
0
3633
The numbers exist last week in the earlier versions of the database.
Often the series of Cases or Deaths, which should be a non-decreasing function, show a decrease compared to the day before. I wonder if this can have something to do with the smoothing (perhaps the smoothing goes by one dimension at a time - either age or date - instead of both, and this somehow breaks the continuity of the increase? Just guessing...sorry if it is completely irrelevant.)
This is the data for some countries (only the rows that have a negative change of Deaths and Cases compared to the day before).
Algeria
Algeria.txt
Austria
Austria.txt
Belgium
Belgium.txt
Brazil
Brazil.txt
Deaths: As the deaths reported in the output are the cumulative ones over time, they cannot be decreasing. The total number of deaths increases day by day, as expected. But if I break them down by age, some age groups are decreasing in some days. Please see below.
Sex ratio of the Cases: there is a strange convergence of many ages towards a value of 1 on one single day. Please see below.
The data for India was updated in 2020-10, just curious if there will be an update for India? (I understand it is not easy...)
Many thanks!
Hello @timriffe,
I have noticed that your dataset does not include any case data for the UK and I could not find any topics on why. Hence, I wanted to suggest a data source. At official UK Government website for data and insights on Coronavirus there is detailed data by region and 5-year age groups. They also offer permanent download links. For the aforementioned dataset its https://api.coronavirus.data.gov.uk/v2/data?areaType=region&metric=newCasesBySpecimenDateAgeDemographics&format=csv
Please be aware that the source operates a little strangely. To obtain the dataset, I set "Area type" to "Regional" and "Metrics" to "newCasesBySpecimenDataAgeDemographics". For some combinations of "Area Type" and "Metrics" there are no datasets available, and the download-button will simply do nothing apart from changing color (falsely indicating a successful download). I am also no expert on the UK's regional structure so I am not sure if this data can be seamlessly merged with your existing datasets. I have yet to compare the regions to the ONS dataset you use.
I hope the data meets your quality standards. I am probably not the only person interested in UK case data and it would be greatly appreciated if you find a way to implement it.
The harmonization process should maintain the consistency between sexes and both sex.
Below I am looking at the data in Output_10.csv
and I can see that the sum of the values for m
and f
populations are far away from the count data in b
by a large margin. Somewhere along the way from InputDB
to Output
must be a much bigger bug hidden in the lines.
> path <- "https://raw.githubusercontent.com/timriffe/covid_age/master/Data/Output_10.csv"
> cfr10 <- read.csv(
+ file = path,
+ header = TRUE,
+ skip = 1
+ )
>
> cfr10 %>%
+ filter(Country == "Spain",
+ Date == "02.05.2020") %>%
+ select(-Cases, -Tests) %>%
+ pivot_wider(names_from = Sex, values_from = Deaths) %>%
+ mutate(`m+f` = m + f)
# A tibble: 22 x 10
Country Region Code Date Age AgeInt b f m `m+f`
<fct> <fct> <fct> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 Spain All ES02.05.2020 02.05.2020 0 10 2.9 1 1 2
2 Spain All ES02.05.2020 02.05.2020 10 10 7.4 3 2 5
3 Spain All ES02.05.2020 02.05.2020 20 10 32.4 8 14 22
4 Spain All ES02.05.2020 02.05.2020 30 10 84 22 35 57
5 Spain All ES02.05.2020 02.05.2020 40 10 268. 69 113 182
6 Spain All ES02.05.2020 02.05.2020 50 10 831. 160 404 564
7 Spain All ES02.05.2020 02.05.2020 60 10 2233. 438 1077 1515
8 Spain All ES02.05.2020 02.05.2020 70 10 6263. 1349 2900 4249
9 Spain All ES02.05.2020 02.05.2020 80 10 10423. 3084 3987 7071
10 Spain All ES02.05.2020 02.05.2020 90 10 4266. 1654. 1206. 2860
# ... with 12 more rows
Hi Tim,
Could you please let me know the reason why the availability of French data stops in July? Can we help you somehow to bring it up to date? Thanks a lot.
Hi @timriffe,
I am looking at the confirmed cases for Maine state and I see periods with significant jumps. I think it is an isolated event.
This might need some attention.
> read_csv(
+ file = "data/Output_10_20201208.zip",
+ skip = 3)%>%
+ mutate(
+ Date = as.Date(Date, format = "%d.%m.%Y")) %>%
+ filter(Sex == "b",
+ Region == "Maine",
+ Age == 60) %>%
+ arrange(Date) %>%
+ ggplot(aes(x = Date, y = Cases)) +
+ geom_line(size = 1) +
+ labs(title = "Confirmed cases in the 60-70 age group")
Looking at weekly no of cases per 100k inhabitants we would see this:
Hello @timriffe,
I've noticed that there's some abnormal data in death data of United Kingdom. As a cumulative death data, using age 80 data as example, it's 34500.8 on 2021.01.01 but 9721.2 on 2021.01.08. I'm wondering how's that happended. I've compared with the data sourse of UK you provided, and it seems fine in the data source. Could you provide any possible explaination of that. Thank you so much!
I can see that in Output_10.csv
we have recorded cases as early as January 2, 2020. Supposedly, the virus was present in China only at that time.
Or a 2D smoothing is used at this stage to reconstruct the early records?
The data (in Output_5) shows a very high number of deaths from covid for ages 0-5 in the UK data set, higher than any other age group. This isn't present, as far as I can see, in the England, England and Wales, Scotland, and Northern Ireland data. A jump in deaths seems to occur at the start of 2021, so this might be to do with swapping to the 2021 ONS dataset?
I just realized this interesting issue since my colleagues were asking for time series of cases/deaths from the database. Because the database was built in an accumulative way, it is possible to get time series for some countries that (luckily) have frequently data updates.
This is not common as data for most countries are quite consistent. But happened in several countries like the deaths of Indonesia. I plotted the deaths against different update dates and they should monotonically increase as in most countries. I assume this bump is caused by a change in the data source?
The effect is less obvious if zoom out to all age groups:
Just to give some other examples:
library(covidAgeData)
library(data.table)
library(ggplot2)
dt5_ori <- covidAgeData::download_covid(data = "Output_5", temp = TRUE,
verbose = FALSE, progress = FALSE, return = "data.table")
c_list <- sort(unique(dt5_ori$Country))
plot_by_c <- function(country0, measure0 = "Deaths"){
dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All" & !is.na(get(measure0))]
# dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All" & Age <=20 & !is.na(get(measure0))]
if(nrow(dt5_c) == 0) return(NULL)
dt5_c[, Date:= as.Date(Date, format = "%d.%m.%Y")]
dt5_c[, Age:=factor(as.factor(Age), levels = seq(0, 100 ,by = 5))]
g_IDN <- ggplot(data = dt5_c, aes_string(x = "Date", y = measure0, color = "Age", group = "Age")) +
# geom_bar(stat="identity", width=0.5, show.legend = FALSE, color = "#0058AB") +
geom_line() +
labs(x = "", y = "") +
scale_x_date(date_labels = "%Y-%m") +
scale_y_continuous(expand = c(0,0)) +
ggtitle(paste(country0, "-", measure0)) +
theme_classic()
return(g_IDN)
}
plot_by_c("Indonesia", measure0 = "Deaths")
# too see all the countries
plist <- invisible(lapply(c_list, plot_by_c, measure0 = "Deaths"))
plist <- plist[!sapply(plist, is.null)]
plist <- lapply(plist, ggplotGrob)
ggsave(filename = "time series of deaths_all_age.pdf",
plot = gridExtra::marrangeGrob(grobs = plist, ncol = 2, nrow = 2), width = 20, height = 15)
Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data frame http://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/JWULOO
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.
library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
verbose = FALSE, progress = FALSE, return = "data.table")
dt5_ori[Country=="France" & Region == "All" & Date=="21.02.2021" & Sex=="b" , sum(Cases, na.rm = TRUE)]
gives 5700.1.
On last Monday (3/1) the number was 2.8 million
(Date == "21.02.2021"
is the latest date with data available)
This is comparing the "Output_5" datasets I downloaded on 3/8 (column COVerAGE-DB
) to the 3/1 one (column COVerAGE-DB_old
), but this issue emerged since the 3/3 dataset.
As we can see, Spain and Portugal also have some decrease in numbers.
I am finalizing a dashboard that updates weekly using COVerAGE-DB, so I guess I will check these general consistency things for you every week.
hi! we're keeping uptodate an open data Argentina repo over here: https://github.com/SistemasMapache/Covid19arData
Checking the Italian output i can see:
Country Region Code Date Sex Age AgeInt Cases Deaths
1 Italy All ITinfo15.04.2020 2020-04-15 b 0 10 1008.8 1.0
2 Italy All ITinfo15.04.2020 2020-04-15 b 10 10 0.0 0.0
3 Italy All ITinfo15.04.2020 2020-04-15 b 20 10 7061.9 7.0
4 Italy All ITinfo15.04.2020 2020-04-15 b 30 10 13114.9 39.0
5 Italy All ITinfo15.04.2020 2020-04-15 b 40 10 19392.1 173.0
6 Italy All ITinfo15.04.2020 2020-04-15 b 50 10 30103.7 746.0
7 Italy All ITinfo15.04.2020 2020-04-15 b 60 10 23808.5 2242.1
8 Italy All ITinfo15.04.2020 2020-04-15 b 70 10 25532.0 6074.3
9 Italy All ITinfo15.04.2020 2020-04-15 b 80 10 26097.4 7890.4
10 Italy All ITinfo15.04.2020 2020-04-15 b 90 10 6907.3 1359.0
11 Italy All ITinfo15.04.2020 2020-04-15 b 100 10 2440.5 976.1
When Age == 10
the number of cases is 0 (unlikely!). I think the cases
here are estimated based on the death records, information which is taken from daily summarized info-graphics.
However, note that on a weekly basis complete information by age and sex is provided (see April 16). The modeled daily data should be consistent with the complete weekly data at the nearest calendar date ...or removed.
Also if data is already reported in the desired age classes, I don't think any altering of the data should be applied. This is the case for the number of deaths displayed above. Only the deaths in the last reported age group 90+
should be split between 90
and 100
. For all the other ages should be read as reported. I know this complicates a bit the coding but would be the right way to do it.
Hi @timriffe
I can see this:
coverage <- read_csv(
file = "data/Output_10_20210107.zip",
skip = 3) %>%
filter(Region == "Wisconsin",
Sex == "b") %>%
mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>%
arrange(Date) %>%
print(n = 70)
# A tibble: 3,124 x 10
Country Region Code Date Age AgeInt Sex Cases Deaths Tests
<chr> <chr> <chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 USA Wisconsin US_WI_01.01.2020 2020-01-01 0 10 b 18816 0 NA
2 USA Wisconsin US_WI_01.01.2020 2020-01-01 10 10 b 54477 2 NA
3 USA Wisconsin US_WI_01.01.2020 2020-01-01 20 10 b 92503 16 NA
4 USA Wisconsin US_WI_01.01.2020 2020-01-01 30 10 b 75498 37 NA
5 USA Wisconsin US_WI_01.01.2020 2020-01-01 40 10 b 68817 82 NA
6 USA Wisconsin US_WI_01.01.2020 2020-01-01 50 10 b 73912 268 NA
7 USA Wisconsin US_WI_01.01.2020 2020-01-01 60 10 b 52605 641 NA
8 USA Wisconsin US_WI_01.01.2020 2020-01-01 70 10 b 27593 1217 NA
9 USA Wisconsin US_WI_01.01.2020 2020-01-01 80 10 b 13546 1539 NA
10 USA Wisconsin US_WI_01.01.2020 2020-01-01 90 10 b 5216. 1060. NA
11 USA Wisconsin US_WI_01.01.2020 2020-01-01 100 5 b 24.2 7.5 NA
12 USA Wisconsin US_WI_02.01.2020 2020-01-02 0 10 b 18864 0 NA
13 USA Wisconsin US_WI_02.01.2020 2020-01-02 10 10 b 54588 2 NA
14 USA Wisconsin US_WI_02.01.2020 2020-01-02 20 10 b 92705 16 NA
15 USA Wisconsin US_WI_02.01.2020 2020-01-02 30 10 b 75672 37 NA
16 USA Wisconsin US_WI_02.01.2020 2020-01-02 40 10 b 68979 82 NA
17 USA Wisconsin US_WI_02.01.2020 2020-01-02 50 10 b 74056 268 NA
18 USA Wisconsin US_WI_02.01.2020 2020-01-02 60 10 b 52729 641 NA
19 USA Wisconsin US_WI_02.01.2020 2020-01-02 70 10 b 27662 1217 NA
20 USA Wisconsin US_WI_02.01.2020 2020-01-02 80 10 b 13581 1540 NA
21 USA Wisconsin US_WI_02.01.2020 2020-01-02 90 10 b 5225. 1060. NA
22 USA Wisconsin US_WI_02.01.2020 2020-01-02 100 5 b 24.2 7.5 NA
23 USA Wisconsin US_WI_03.01.2020 2020-01-03 0 10 b 19036 0 NA
24 USA Wisconsin US_WI_03.01.2020 2020-01-03 10 10 b 54876 2 NA
25 USA Wisconsin US_WI_03.01.2020 2020-01-03 20 10 b 93085 16 NA
26 USA Wisconsin US_WI_03.01.2020 2020-01-03 30 10 b 76059 37 NA
27 USA Wisconsin US_WI_03.01.2020 2020-01-03 40 10 b 69299 82 NA
28 USA Wisconsin US_WI_03.01.2020 2020-01-03 50 10 b 74394 269 NA
29 USA Wisconsin US_WI_03.01.2020 2020-01-03 60 10 b 53018 642 NA
30 USA Wisconsin US_WI_03.01.2020 2020-01-03 70 10 b 27801 1217 NA
31 USA Wisconsin US_WI_03.01.2020 2020-01-03 80 10 b 13685 1542 NA
32 USA Wisconsin US_WI_03.01.2020 2020-01-03 90 10 b 5254. 1060. NA
33 USA Wisconsin US_WI_03.01.2020 2020-01-03 100 5 b 24.2 7.5 NA
34 USA Wisconsin US_WI_04.01.2020 2020-01-04 0 10 b 19139 0 NA
35 USA Wisconsin US_WI_04.01.2020 2020-01-04 10 10 b 55072 2 NA
36 USA Wisconsin US_WI_04.01.2020 2020-01-04 20 10 b 93318 16 NA
37 USA Wisconsin US_WI_04.01.2020 2020-01-04 30 10 b 76268 37 NA
38 USA Wisconsin US_WI_04.01.2020 2020-01-04 40 10 b 69467 82 NA
39 USA Wisconsin US_WI_04.01.2020 2020-01-04 50 10 b 74581 269 NA
40 USA Wisconsin US_WI_04.01.2020 2020-01-04 60 10 b 53153 642 NA
41 USA Wisconsin US_WI_04.01.2020 2020-01-04 70 10 b 27877 1219 NA
42 USA Wisconsin US_WI_04.01.2020 2020-01-04 80 10 b 13758 1545 NA
43 USA Wisconsin US_WI_04.01.2020 2020-01-04 90 10 b 5281. 1064. NA
44 USA Wisconsin US_WI_04.01.2020 2020-01-04 100 5 b 24.3 7.5 NA
45 USA Wisconsin US_WI_05.01.2020 2020-01-05 0 10 b 19279 0 NA
46 USA Wisconsin US_WI_05.01.2020 2020-01-05 10 10 b 55447 2 NA
47 USA Wisconsin US_WI_05.01.2020 2020-01-05 20 10 b 93869 16 NA
48 USA Wisconsin US_WI_05.01.2020 2020-01-05 30 10 b 76818 37 NA
49 USA Wisconsin US_WI_05.01.2020 2020-01-05 40 10 b 69972 84 NA
50 USA Wisconsin US_WI_05.01.2020 2020-01-05 50 10 b 75131 278 NA
51 USA Wisconsin US_WI_05.01.2020 2020-01-05 60 10 b 53543 655 NA
52 USA Wisconsin US_WI_05.01.2020 2020-01-05 70 10 b 28091 1241 NA
53 USA Wisconsin US_WI_05.01.2020 2020-01-05 80 10 b 13850 1572 NA
54 USA Wisconsin US_WI_05.01.2020 2020-01-05 90 10 b 5316. 1086. NA
55 USA Wisconsin US_WI_05.01.2020 2020-01-05 100 5 b 24.5 7.7 NA
56 USA Wisconsin US_WI_29.03.2020 2020-03-29 0 10 b 4 0 NA
57 USA Wisconsin US_WI_29.03.2020 2020-03-29 10 10 b 14 0 NA
58 USA Wisconsin US_WI_29.03.2020 2020-03-29 20 10 b 148 0 NA
59 USA Wisconsin US_WI_29.03.2020 2020-03-29 30 10 b 169 0 NA
60 USA Wisconsin US_WI_29.03.2020 2020-03-29 40 10 b 186 0 NA
61 USA Wisconsin US_WI_29.03.2020 2020-03-29 50 10 b 203 4 NA
62 USA Wisconsin US_WI_29.03.2020 2020-03-29 60 10 b 220 3 NA
63 USA Wisconsin US_WI_29.03.2020 2020-03-29 70 10 b 111 3 NA
64 USA Wisconsin US_WI_29.03.2020 2020-03-29 80 10 b 46 2 NA
65 USA Wisconsin US_WI_29.03.2020 2020-03-29 90 10 b 10.9 1 NA
66 USA Wisconsin US_WI_29.03.2020 2020-03-29 100 5 b 0.1 0 NA
67 USA Wisconsin US_WI_30.03.2020 2020-03-30 0 10 b 4 0 NA
68 USA Wisconsin US_WI_30.03.2020 2020-03-30 10 10 b 16 0 NA
69 USA Wisconsin US_WI_30.03.2020 2020-03-30 20 10 b 161 0 NA
70 USA Wisconsin US_WI_30.03.2020 2020-03-30 30 10 b 179 0 NA
# ... with 3,054 more rows
Is it possible that the wrong year is assigned to first 5 recorded days? 2020 instead of 2021?
Note the jump in at line 56, from January to March.
In Output_10.csv
the number of confirmed cases and deaths in Virginia appear exceptionally high. We can see more than 7 mil cases.
According to the US Census data the entire state has a mid-2019 population of 4,248,373.
> path <- "Output_10_20200828.csv"
> data <- read.csv(
+ file = path,
+ header = TRUE,
+ skip = 3
+ ) %>%
+ filter(Region == "Virginia",
+ Date == "27.08.2020") %>%
+ group_by(Region) %>%
+ summarise_if(is.numeric, sum )
>
>
> data
# A tibble: 1 x 6
Region Age AgeInt Cases Deaths Tests
<fct> <int> <int> <dbl> <dbl> <dbl>
1 Virginia 550 105 7922068. 208726. NA
In Output_10.csv
one can see that the name of the state is misspelled Lousiana
instead of Louisiana
.
I did not check the other output or input files.
Add another tab, just one row per country (Region == "All")
, with another column indicating the most recent date.
Add a coverage graphic: Cases and Deaths as a fraction of totals from worldometers or JHU.
The {ungroup}
package is great at ungrouping, however I think we should pay attention to the resulted CFR curves.
In my tests using the New York data where I harmonized the age-bands (10y groups) I noticed that the CFR at young ages is overestimated compared with the values for other regions where no ungrouppig of the data was applied (ITA, ESP, KOR, NLD, ...). This might be due to the fact that in each age band the death counts are skewed to the right (towards old ages) but not necessarily the number of confirmed cases. I assume the infections to be more uniformly distributed.
It is difficult for me to believe that we can see something that different in NY, but not impossible. For now, I am attributing this different outcome to the pclm()
limitation. Maybe Silvia has a solution.
At young ages we are talking about small values however considering the age-structure of the population these can become significant.
I will try to post here some figures to support my claim.
Putting it here for me to remember. Add selectize to shiny for improved search.
Age specific deaths available for SA on seemingly daily basis:
https://twitter.com/DrZweliMkhize/status/1268261539418701825
How can we help to get these into the database?
Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 dataframe https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/JWULOO.
Would you be willing to help with maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for medical science community.
Note to include both earliest and most recent dates in the availability dashboard.
Compare Output_5
to last week, it seems 8 countries were lost:
"Croatia"
only misses Region == "All" while the rest 7 c("Cyprus", "Lithuania", "Luxembourg", "Latvia", "Malta", "Poland", "Slovakia")
are no longer there.
library(covidAgeData)
library(data.table)
dt5_ori <- download_covid(data = "Output_5", temp = TRUE,
verbose = FALSE, progress = FALSE, return = "data.table")
# Empty data.table:
dt5_ori[Country %like% "Croatia" & Region == "All" & Sex == "b" ,]
dt5_ori[Country %in% c("Cyprus", "Lithuania", "Luxembourg", "Latvia", "Malta", "Poland", "Slovakia"),]
The first two lines of in 'inputDB.csv' have some bytes which cause me problems when trying to read the file.
In R, the following error message is printed while trying to run readr::read_csv("inputDB.csv", skip = 1)
:
Error in make.names(x) : invalid multibyte string at '<a2><9b>y<bd>X'
I've attached the first 10 lines of the problematic 'inputDB':
inputDB-head.zip
On Linux, it looks like these bytes cause the file to be recognized as a gzip
type, while it should be plain text.
$ file inputDB-head.csv
inputDB-head.csv: gzip compressed data, from HPFS filesystem (OS/2, NT), original size modulo 2^32 172376364
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.