edwindj / cbsodatar

Statistics Netherlands (CBS) OpenData API Client for R

Home Page: https://edwindj.github.io/cbsodataR

R 100.00%

cbs census-data officialstatistics opendata r statistics-netherlands

cbsodatar's Issues

cbs_get_meta() gives error: fixed but not yet on CRAN

Hi!

I ran into the cbs_get_meta() error (Error in strsplit(params$`$select`, ", ") : non-character argument ) that was reported and fixed by @sarahouweling a few days ago.
The CRAN version is not yet updated, so the solution for now is to install from github.

devtools::install_github("edwindj/cbsodataR")

Thanks for creating this awesome package!

setwd() in package limits cronjobs

Hi Edwindj,

When I start my R script from the terminal (instead of from RStudio), I get a message that setwd() is called inside the package. This is not allowed in scripts run from the terminal. Could you make the paths dynamic, so that the package does not depend on setwd()?

The problem is in cbs_download_meta.R
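
A minimal sketch of the kind of change requested here, not the package's actual code: each output path is built with file.path(), so the working directory is never changed. The function name write_meta_tables, its arguments, and the toy metadata list are hypothetical.

```r
# Hypothetical helper: write each metadata table into `dir` without calling setwd().
write_meta_tables <- function(meta, dir = tempdir()) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  paths <- character(0)
  for (name in names(meta)) {
    # The full path is built explicitly; the working directory stays untouched.
    path <- file.path(dir, paste0(name, ".csv"))
    utils::write.csv(meta[[name]], path, row.names = FALSE)
    paths <- c(paths, path)
  }
  invisible(paths)
}

# Toy metadata list standing in for a real download.
meta <- list(
  TableInfos = data.frame(ID = 1, Title = "demo"),
  Perioden   = data.frame(Key = "2013JJ00", Title = "2013")
)
before <- getwd()
files  <- write_meta_tables(meta, file.path(tempdir(), "cbs_demo"))
```

Because nothing depends on the current directory, a script written this way behaves the same from RStudio, from the terminal, or from a cron job.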

Column labels

Why is there a number behind the colnames? Is this how you receive the data from CBS?

# Get the data (doodsoorzaken)
doodsoorzaken <- get_data('81452NED')
colnames(doodsoorzaken)
 [1] "ID"                                     "Geslacht"                              
 [3] "Leeftijd"                               "Perioden"                              
 [5] "TotaalDodelijkeOngevallen_1"            "TotaalDodelijkeVervoersongevallen_2"   
 [7] "Voetganger_3"                           "Fiets_4"                               
 [9] "BromEnSnorfietsEnBrommobiel_5"          "GemotInvalidenvoertuigScootmobiel_6"   
[11] "Motorfiets_7"                           "Personenauto_8"                        
[13] "BestelautoVrachtauto_9"                 "OverigOnbekend_10"                     
[15] "AccidenteleVal_11"                      "AccidenteleVerdrinking_12"             
[17] "TotaalAccidenteleVergiftiging_13"       "Medicijnen_14"                         
[19] "Drugs_15"                               "Alcohol_16"                            
[21] "OverigOnbekend_17"                      "TotaalOverigeDodelijkeOngevallen_18"   
[23] "MechanischEffect_19"                    "RookVuurEnVlammen_20"                  
[25] "Verstikking_21"                         "OverigInclLaatGevolg_22"               
[27] "TotaalDodelijkeOngevallen_23"           "DodelijkeVervoersongevallen_24"        
[29] "AccidenteleVal_25"                      "AccidenteleVerdrinking_26"             
[31] "AccidenteleVergiftigingInclOpzetOnb_27" "OverigeDodelijkeOngevallen_28"

I am not so sure what the recode argument does, but it doesn't change anything.

extract default selection from tableinfo

cbsodataR 0.3 errors on some windows configurations

Version 0.3 of cbsodataR may generate an error on some Windows configurations.
This is due to proxy configurations and will be solved in a coming version.

`cbs_get_data` with `base_url="http://dataderden.cbs.nl"` fails

cbs_get_data fails, trying to connect to http://opendata.cbs.nl. Fixed in version 0.5.2.

Remedies for now:

use the catalog parameter, that works... (and was intended to make the base_url obsolete): https://edwindj.github.io/cbsodataR/reference/cbs_get_data.html
or, set verbose=TRUE, for now.
or, set progress=TRUE for now.

Thanks to Mirjam Zengers for reporting

method to keep track of table updates?

This is a feature request, not a bug I think.

Use case: I am interested in table 85067NED, "Gebieden in Nederland"

library(cbsodataR)
library(dplyr)

toc = cbs_get_toc()
filter(toc, Identifier=='85067NED') |> select(Identifier, Title, Period)

  Identifier Title                      Period
  <chr>      <chr>                      <chr> 
1 85067NED   Gebieden in Nederland 2022 2022

However, newer versions of the table may become available and I would like to always get the most recent version. I don't see any pattern to the Identifier of this table and previous versions. I thought cbs_search might do the trick but it returns only one version of the table and that is an old one.

s=cbs_search('gebieden in nederland',language='nl')
select(s, Title, Identifier, Period)

   Title                                                                         Identifier Period                            
   <chr>                                                                         <chr>      <chr>                             
 1 "Regionale kerncijfers Nederland"                                             70072ned   Jaarcijfers 1995 - 2023           
 2 "Conjunctuurenquête Nederland; kwartaal, bedrijfstakken"                      82435NED   1e kwartaal 2012 - 1e kwartaal 20…
 3 "Banen van werknemers in december; economische activiteit (SBI2008), regio"   83582NED   2010-2021                         
 4 "Bevolkingsontwikkeling; regio per maand"                                     37230ned   Januari 2002 - maart 2023         
 5 "Bedrijven; bedrijfstak"                                                      81589NED   2007 kwI - 2023 kw II             
 6 "Winning, invoer en uitvoer van materialen naar soort; nationale rekeningen " 83180NED   1996-2020                         
 7 "Bodemgebruik; uitgebreide gebruiksvorm, per gemeente"                        70262ned   1996, 2000, 2003, 2006, 2008, 201…
 8 "Winning, invoer en uitvoer materialen per continent; nationale rekeningen"   83177NED   2004-2020                         
 9 "Onderwijsinstellingen; grootte, soort, levensbeschouwelijke grondslag"       03753      1990/'91 - 2021/'22               
10 "Gebieden in Nederland 2020"                                                  84721NED   2020

So my request: is there a way, or could there be a way, in which I can get the most recent version of a specific table? And a second note: it is not clear to me from the help where cbs_search actually searches. Apparently not in the title or it would have found the most recent version as well.
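
One possible workaround, sketched under assumptions: if successive versions of a table share a title that ends in a year (as "Gebieden in Nederland 2020" and "Gebieden in Nederland 2022" do), the newest one can be picked from the table of contents by title. The helper latest_table is hypothetical, and the toy data frame below only mimics the Identifier and Title columns shown above.

```r
# Hypothetical helper: return the row of `toc` with the most recent year in its title.
latest_table <- function(toc, pattern) {
  hits <- toc[grepl(pattern, toc$Title, ignore.case = TRUE), , drop = FALSE]
  if (nrow(hits) == 0L) return(NULL)
  # Assumption: matching titles end in a four-digit year.
  year <- suppressWarnings(
    as.integer(sub(".*([0-9]{4})[[:space:]]*$", "\\1", hits$Title))
  )
  hits[which.max(year), , drop = FALSE]
}

# Toy table of contents using identifiers quoted in this issue.
toc <- data.frame(
  Identifier = c("84721NED", "85067NED", "70072ned"),
  Title = c("Gebieden in Nederland 2020",
            "Gebieden in Nederland 2022",
            "Regionale kerncijfers Nederland"),
  stringsAsFactors = FALSE
)
newest <- latest_table(toc, "^Gebieden in Nederland")
```

With the package loaded, the same helper could be applied to the result of cbs_get_toc() instead of the toy data frame.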

Many thanks for the package by the way, it greatly simplifies things.

error cbs_download_data

I run the following code (2 lines):

library(cbsodataR)
dt <- cbs_download_data(id='84910NED')

It errors out with:

library(cbsodataR)
dt <- cbs_download_data(id='84910NED')
Error in isTRUE(catalog != "CBS") :
promise already under evaluation: recursive default argument reference or earlier problems?

I re-installed R, RStudio and installed the cbsodataR package again but no difference.

Can you help me?

cbs_get_toc with `select` argument fails

In version 0.3 cbs_get_toc fails when either:

one column is selected with select
or the date columns are omitted in the select statement.

(Thanks to Rob van Harrevelt for reporting!)

add option to convert data columns to numeric

Currently all data columns are characters. This is part of the API of Statistics Netherlands, which uses multiple special values, e.g. "not possible", "unknown", "strictly zero".

An option for get_data and download_table could be to automatically change data columns into numeric columns, thereby changing these special values into NA.
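
A rough sketch of what such an option could do, assuming the special values are encoded as non-numeric strings such as ".": the chosen columns are passed through as.numeric(), and everything that is not a number becomes NA. The helper to_numeric_columns and the toy data are hypothetical.

```r
# Hypothetical helper: convert the named character columns to numeric.
# Values that cannot be parsed as numbers (e.g. ".") become NA.
to_numeric_columns <- function(df, columns) {
  for (col in columns) {
    df[[col]] <- suppressWarnings(as.numeric(df[[col]]))
  }
  df
}

# Toy data: one dimension column and one data column with a special value.
raw <- data.frame(
  Perioden = c("2013JJ00", "2014JJ00", "2015JJ00"),
  Mannen_2 = c(".", "2521", "2550"),
  stringsAsFactors = FALSE
)
clean <- to_numeric_columns(raw, "Mannen_2")
```

Note that this loses the distinction between the different special values, which may matter for some analyses.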

Filter not working in cbs_get_data, on column SoortRegio_2

I am only interested in data from the gemeentes from the 'Kerncijfers wijken en buurten'.

So my code looks like this:

cbs_get_data("84583NED", SoortRegio_2 = "Gemeente  ", verbose = TRUE)

(with two spaces after 'gemeente'). This does not seem to work however, all types of regions are loaded. Any idea why?

Could it possibly have something to do with the '_' in the column name?

add function to convert CBS periods into date/datetime

SN data contains temporal indicators: years (YYYYJJ00), months (YYYYMMxx), and quarters (YYYYKWxx). A utility function that converts them to date/datetimes would be helpful.
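
A minimal sketch of such a utility, assuming the two digits after MM or KW are the month or quarter number, and choosing to map each period to its first day. The function name cbs_period_to_date is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: map CBS period codes to the first day of the period.
# Codes of any other type (e.g. week codes) yield NA.
cbs_period_to_date <- function(period) {
  year <- substr(period, 1, 4)
  type <- substr(period, 5, 6)
  num  <- as.integer(substr(period, 7, 8))

  month <- rep(NA_integer_, length(period))
  month[type == "JJ"] <- 1L                       # year: 1 January
  is_month <- type == "MM"
  month[is_month] <- num[is_month]                # month: first day of that month
  is_quarter <- type == "KW"
  month[is_quarter] <- (num[is_quarter] - 1L) * 3L + 1L  # quarter: first month

  # An explicit format makes unparseable strings return NA instead of an error.
  as.Date(sprintf("%s-%02d-01", year, month), format = "%Y-%m-%d")
}

dates <- cbs_period_to_date(c("2013JJ00", "2013MM05", "2013KW02", "1995W101"))
```

Returning the first day of the period is a design choice; a mid-period or end-of-period date would be equally defensible.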

please document output values

The help documentation is pretty thin. It could be greatly improved by adding a few lines in the @description section and adding a @return everywhere.

Strip whitespace

It seems to be a good idea to strip whitespace. See the output below

> t <- get_data('37556')
Writing TableInfos.csv...
Writing DataProperties.csv...
Writing CategoryGroups.csv...
Writing Perioden.csv...
Retrieving data from table '37556'
Done!
> t$Mannen_2
  [1] "       ." "    2521" "    2550" "    2584" "    2622" "    2660" "    2699" "    2737"
  [9] "    2777" "    2817" "    2855" "    2899" "    2945" "    2987" "    3037" "    3088"
 [17] "    3141" "    3188" "    3236" "    3282" "    3311" "    3352" "    3410" "    3465"
 [25] "    3516" "    3574" "    3629" "    3683" "    3735" "    3785" "    3838" "    3886"
 [33] "    3943" "    4006" "    4068" "    4124" "    4177" "    4221" "    4264" "    4307"
 [41] "    4353" "    4408" "    4454" "    4497" "    4530" "    4558" "    4603" "    4634"
 [49] "    4748" "    4838" "    4926" "    4998" "    5084" "    5146" "    5198" "    5256"
 [57] "    5321" "    5391" "    5460" "    5529" "    5619" "    5686" "    5754" "    5838"
 [65] "    5924" "    6001" "    6091" "    6178" "    6262" "    6317" "    6383" "    6465"
 [73] "    6550" "    6624" "    6676" "    6722" "    6772" "    6837" "    6872" "    6907"
 [81] "    6945" "    6994" "    7048" "    7082" "    7103" "    7124" "    7150" "    7185"
 [89] "    7224" "    7274" "    7317" "    7358" "    7420" "    7480" "    7535" "    7586"
 [97] "    7627" "    7662" "    7697" "    7740" "    7793" "    7846" "    7910" "    7972"
[105] "    8015" "    8046" "    8066" "    8077" "    8089" "    8112" "    8156" "    8203"
[113] "    8243" "    8283" "    8307" "    8334" "    8373" "    8417"

Maybe add strip.white=TRUE to line https://github.com/edwindj/cbsodataR/blob/master/R/get-data.R#L31? Not tested.
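
Until this is handled in the package, the padding can also be removed after the download. The sketch below trims every character column of a data frame with base R's trimws(); the helper strip_whitespace and the toy data are hypothetical.

```r
# Hypothetical helper: trim leading and trailing whitespace in all character columns.
strip_whitespace <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], trimws)
  df
}

# Toy data mimicking the padded values shown above.
padded <- data.frame(
  Perioden = c("1899JJ00", "1900JJ00"),
  Mannen_2 = c("       .", "    2521"),
  stringsAsFactors = FALSE
)
trimmed <- strip_whitespace(padded)
```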

error with `cbs_add_columns()`

Hi Edwin,

When I use the function cbs_add_columns() as shown below, I get an error:

##download cbs package
#devtools::install_github("edwindj/cbsodataR")
##libraries
lib <- c("cbsodataR", "dplyr")
##Attach
lapply(lib, require, character.only = TRUE)
##Specify the 21 neighbourhoods (wijken) of Haarlem
Wijken <- c("WK039201  ", "WK039202  ", "WK039203  ", "WK039204  ", "WK039205  ", "WK039206  ", "WK039207  ", "WK039208  ", "WK039209  ", "WK039210  ",
            "WK039211  ", "WK039212  ", "WK039213  ", "WK039214  ", "WK039215  ", "WK039216  ", "WK039217  ", "WK039218  ", "WK039219  ",
            "WK039220  ", "WK039221  ")
##Retrieve the data
kern <- cbs_get_data("84286NED",
                     select = c("WijkenEnBuurten",
                                "Gemeentenaam_1",
                                "SoortRegio_2", "Codering_3",
                                "PersonenautoSTotaal_89",
                                "PersonenautoSBrandstofBenzine_90",
                                "PersonenautoSOverigeBrandstof_91",
                                "PersonenautoSPerHuishouden_92",
                                "PersonenautoSNaarOppervlakte_93",
                                "GemiddeldeHuishoudensgrootte_32"),
                     WijkenEnBuurten = Wijken) %>% cbs_add_label_columns()

The error is as follows:

Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  : 
  factor level [177] is duplicated

I use both a MacBook and a Windows 10 computer. On the Windows computer the labels are returned as expected, whereas on the Mac the error above is generated.

I would like to hear whether you have an idea where this error comes from.

> ##Show session
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.8.0.1   cbsodataR_0.3.2 forcats_0.3.0   stringr_1.4.0  
 [5] purrr_0.3.1     readr_1.1.1     tidyr_0.8.3     tibble_2.0.1   
 [9] tidyverse_1.2.1 spData_0.2.9.4  leaflet_2.0.2   plotly_4.8.0   
[13] ggplot2_3.1.0   shiny_1.3.2    

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5    haven_1.1.2         lattice_0.20-35    
 [4] colorspace_1.3-2    htmltools_0.3.6     viridisLite_0.3.0  
 [7] yaml_2.1.19         rlang_0.3.1         later_0.8.0        
[10] pillar_1.3.1        glue_1.3.0          withr_2.1.2        
[13] readxl_1.1.0        modelr_0.1.2        plyr_1.8.4         
[16] cellranger_1.1.0    munsell_0.5.0       gtable_0.2.0       
[19] rvest_0.3.2         devtools_1.13.5     htmlwidgets_1.3    
[22] memoise_1.1.0       knitr_1.20          httpuv_1.5.0       
[25] crosstalk_1.0.0     curl_3.2            broom_0.5.0        
[28] Rcpp_1.0.0          xtable_1.8-2        scales_1.0.0       
[31] promises_1.0.1      backports_1.1.2     jsonlite_1.6       
[34] mime_0.6            hms_0.4.2           digest_0.6.18      
[37] stringi_1.3.1       grid_3.4.4          cli_1.0.1          
[40] tools_3.4.4         magrittr_1.5        lazyeval_0.2.1     
[43] crayon_1.3.4        whisker_0.3-2       pkgconfig_2.0.2    
[46] rsconnect_0.8.12    xml2_1.2.0          data.table_1.10.4-3
[49] lubridate_1.7.4     assertthat_0.2.0    httr_1.4.0         
[52] rstudioapi_0.7      R6_2.4.0            nlme_3.1-131.1     
[55] git2r_0.21.0        compiler_3.4.4

Kind regards,

Tobias Brils

api change (0.3)

The naming of the functions will change: prefix with cbs_ from version 0.3 and on.

HTTP versus HTTPS

cbsodataR retrieves data via http://opendata.cbs.nl/ODataApi/odata/xxx
In many places, http APIs may no longer be accessed. Is it possible to switch to an https connection?

regional variables

Many tables of SN contain regional data: it would be nice if these tables can be converted into sf objects.

However:

many maps of SN are time-dependent, so the user should be explicit about which year should be used for the conversion to sf
it would be nice not to depend on sf (use it only when it is installed).

Change GPL license

Hello,

Is it an idea to change the license into a license with less restrictions like the BSD license? This package is hard to include in some projects due to the GPL license.

What are your thoughts about this?

week periods are not converted with `cbs_add_date_column`

Converting the cbs date to the regular date format does not seem to work:

-----------code snippet----------------
library(cbsodataR)

dt <- cbs_get_data("70895ned")
dt <- cbs_add_date_column (dt)

head (dt)
-----------output----------------
Geslacht LeeftijdOp31December Perioden Perioden_Date Perioden_freq Overledenen_1
1 1100 10000 1995X000 394
2 1100 10000 1995W101 2719
3 1100 10000 1995W102 2823
4 1100 10000 1995W103 2609
5 1100 10000 1995W104 2664
6 1100 10000 1995W105 2577
-----------end-----------------

The code did add the columns, but it did not fill in the correct date and the correct frequency.
N.B.: not shown here is that for yearly periods both columns are filled in correctly.

filter data with geq, neq and not in

Hi Edwin,
very nice package.
In your examples you show how to filter on a specific variable-value:
cbs_get_data(id="03759ned", Perioden=c("2013JJ00","2014JJ00"), Geslacht="T001038")

however, in this large file of 45 million records I want to filter to e.g.
Perioden > "1990JJ00",
Geslacht != "T001038",
! Leeftijd %in% c(10000, 60100,60200,60300,60400,60500,60600,60700,60800,60900,21900)
Is that possible?
What would be the correct syntax for "not equal", "greater than" or "not in"?

Or filter on substr(RegioS,1,2) == "GM" to keep just the municipalities :-)
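
For reference, the OData query language behind the service has comparison operators (eq, ne, gt, ge, lt, le) and logical operators (and, or, not). The sketch below only builds a `$filter` string for the conditions listed above; the helper names are hypothetical, and whether the CBS server and cbs_get_data accept such an expression for this table is an assumption that is not confirmed here.

```r
# Hypothetical helper: quote a value for an OData filter (single quotes are doubled).
odata_quote <- function(x) paste0("'", gsub("'", "''", x, fixed = TRUE), "'")

# Hypothetical helper: "<column> <op> '<value>'" for one comparison operator.
odata_compare <- function(column, op = c("eq", "ne", "gt", "ge", "lt", "le"), value) {
  op <- match.arg(op)
  paste(column, op, odata_quote(value))
}

# Hypothetical helper: "not in" written as a chain of `ne` comparisons joined by `and`.
odata_not_in <- function(column, values) {
  paste0("(", paste(odata_compare(column, "ne", values), collapse = " and "), ")")
}

filter <- paste(
  odata_compare("Perioden", "gt", "1990JJ00"),
  odata_compare("Geslacht", "ne", "T001038"),
  odata_not_in("Leeftijd", c("10000", "60100")),
  "substring(RegioS,0,2) eq 'GM'",   # OData substring() is zero-based
  sep = " and "
)
```

If the server rejects part of the expression, the fallback is to download a coarser selection and apply the remaining conditions locally.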

error Connection refused

Hi,
I am trying to pull out data but it seems I have a problem with the proxy and I get this error.

tables_en <- cbs_get_toc(Language="en")
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to opendata.cbs.nl port 443: Connection refused

Could you help me to get a solution to it?
Best,

Nmta

cbs_get_data() no response error

cbs_get_data(id = "84378NED")

error:
Error in open.connection(con, "rb") :
cannot open the connection to 'http://opendata.cbs.nl/ODataApi/odata/84378NED'
In addition: Warning messages:
1: In open.connection(con, "rb") :
URL 'http://opendata.cbs.nl/ODataApi/odata/84378NED': status was 'Server returned nothing (no headers, no data)'
2: Failing: http://opendata.cbs.nl/ODataApi/odata/84378NED
Retrying...
3: In open.connection(con, "rb") :
URL 'http://opendata.cbs.nl/ODataApi/odata/84378NED': status was 'Server returned nothing (no headers, no data)'

Any idea? The site does show a page when I load it in the browser..

add progressbar to cbs_get_data etc

It would be helpful if a progressbar was shown when downloading a table
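
A small sketch of how this could look with base R's utils::txtProgressBar(), assuming the table is fetched page by page. The function fetch_page is a hypothetical stand-in for the real download step, and the page count is made up.

```r
# Hypothetical stand-in for downloading one page of a table.
fetch_page <- function(i) data.frame(page = i, value = i * 10)

n_pages <- 5
pb <- utils::txtProgressBar(min = 0, max = n_pages, style = 3)
pages <- vector("list", n_pages)
for (i in seq_len(n_pages)) {
  pages[[i]] <- fetch_page(i)
  utils::setTxtProgressBar(pb, i)   # advance the bar after each page
}
close(pb)
result <- do.call(rbind, pages)
```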

cbs_add_label_columns(data) - not all columns get a codebook

id <- '80302ned'

library(cbsodataR)
meta <- cbs_get_meta(id,verbose=TRUE,cache=TRUE)
View(meta)
View(meta$Voertuigtypes)
View(meta$Perioden)
data <- cbs_get_data(id)
View(data)

# Add the metadata to the data itself
data2 <- cbs_add_label_columns(data)
View(data2)
# works for the year but not for the vehicle type
# in the data the vehicle type is a number and in the metadata a string, e.g. 0 and '00'; perhaps that is the cause?
# naming the column explicitly does not help either
data2 <- cbs_add_label_columns(data,columns = 'Voertuigtypes')

cbs_get_toc for a particular city

stringsAsFactors option (with FALSE default)

The title says it all. I may do a PR if I get round to it.

cbsodataR in Remote Access environment

Dear Edwin,

Thank you for developing this great package! I tried to run it in the microdata-remoteaccess environment of CBS, but I could not get it running. The same code worked perfectly fine outside the environment, but inside it the data simply was not downloaded (even though the URL is accessible in the environment).
Are you familiar with this problem?

Best,

Benedikt

More informative error message when request too long

When there are too many filter statements, the request URL to the odata portal becomes too long and the request fails. Currently, the resulting error message is very uninformative. Most people probably don't know how to address this error.

Example:

library(cbsodataR)

tbl <- "70072ned"
meta <- cbs_get_meta(tbl)

gemeentes <- meta$RegioS$Key
gemeentes <- gemeentes[grep("^GM", gemeentes)]
jaren <- meta$Perioden$Key

onderwerpen <- c("TotaleBevolking_1",
  "JongerDan5Jaar_4",
  "k_5Tot10Jaar_5",
  "k_10Tot15Jaar_6",
  "k_15Tot20Jaar_7",
  "k_20Tot25Jaar_8",
  "k_25Tot45Jaar_9",
  "k_45Tot65Jaar_10",
  "k_65Tot80Jaar_11",
  "k_80JaarOfOuder_12",
  "Bevolkingsdichtheid_57",
  "VestigingUitAndereGemeente_69",
  "VertrekNaarAndereGemeente_70",
  "BinnenlandsMigratiesaldo_71",
  "BinnenlandsMigratiesaldoRelatief_72",
  "VerhuismobiliteitRelatief_73",
  "Bevolkingsgroei_79",
  "TotaalAantalParticuliereHuishoudens_82",
  "VoorraadOp1Januari_90",
  "Nieuwbouwwoningen_91",
  "Woningen_97",
  "GemiddeldeWoningwaarde_99",
  "TotaleOppervlakte_187")

dta <- cbs_get_data(id = tbl, Perioden = jaren, 
  RegioS = gemeentes, 
  select = c("Perioden", "RegioS", onderwerpen))

This results in either the following error message:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

Or sometimes the following:

Error in get_json(url, verbose = verbose) : 
  Request-URI Too Long (HTTP 414). Failed to Client error: (414) Request-URI Too Long.

Perhaps add the following lines to cbs_download_data after url <- URLencode(url):

if (nchar(url) > 2000L) 
  warning(paste0(c("The request URL is longer than 2000 characters. ", 
    "This could cause the request to fail on some platforms. ", 
    "If so, try to reduce the number of filter statements and filter the data afterwards.")))

Or, catch the error: something like:

res <- tryCatch({
  get_json(url, verbose = verbose)
}, error = function(e) {
    # Append a hint only when the URL is long enough to be the likely cause.
    hint <- if (nchar(url) < 2000L) "" else 
      paste0("\n\nThe request URL is longer than 2000 characters. ", 
        "This could cause the request to fail on some platforms. ", 
        "Try to reduce the number of filter statements and ",
        "filter the data afterwards.")
    stop("Request failed with the following message:\n", e$message, hint)
})
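
Until the message improves, a user-side workaround is to split a long list of filter values into smaller batches and request each batch separately. The sketch below shows only the batching step; split_into_batches is a hypothetical helper, and the municipality codes are made-up stand-ins for meta$RegioS$Key.

```r
# Hypothetical helper: split a vector of filter values into batches of at most `size`.
split_into_batches <- function(values, size) {
  unname(split(values, ceiling(seq_along(values) / size)))
}

# Made-up codes standing in for the real municipality keys.
gemeentes <- sprintf("GM%04d", 1:25)
batches <- split_into_batches(gemeentes, 10)

# With the package loaded, each batch could then be requested on its own
# and the parts combined afterwards, for example:
# parts <- lapply(batches, function(b) cbs_get_data(id = tbl, RegioS = b))
# dta   <- do.call(rbind, parts)
```

The batch size that keeps the URL short enough depends on the length of the codes and the other filters, so it needs some experimentation.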

Solution for no connection with Windows 7/8 and IE11

The security certificates on https://opendata.cbs.nl have been updated, causing IE11, Windows 7, and Windows 8 to have trouble connecting to the open data server.

solution

Add the following lines to your R script (before making any calls with cbsodataR):

Sys.setenv(CURL_SSL_BACKEND = "openssl")
options("url.method" = "libcurl")

Thanks to Jasper Dupont for reporting the issue and problem.

Clarify README

Hello Edwin,

Maybe it would be good to mention somewhere in the README that you have to load dplyr to run this example, for the more inexperienced R users... I was trying to call View, but it fails without dplyr.

> get_data('71509ENG') %>% select(2:5) %>% head

Source: local data frame [6 x 4]

  FruitFarmingRegions Periods TotalAppleVarieties_1 CoxSOrangePippin_2
1   Total Netherlands    1997                   420                 43
2   Total Netherlands    1998                   518                 40
3   Total Netherlands    1999                   568                 39
4   Total Netherlands    2000                   461                 27
5   Total Netherlands    2001                   408                 30
6   Total Netherlands    2002                   354                 17

Very nice package btw. Is CBS going to support it?

Kind regards Jonathan

problems loading 70747ned

Hi,

I'm having problems loading the 70747ned table; other tables seem to be all right. When I try to get the table, I get the following:

Attempt 1
code:
fname = "70747ned"
bucket1 <- cbs_get_data(id = fname)

Output:

Error: parse error: premature EOF
{ "odata.metadata":"https://
(right here) ------^
In addition: Warning message:
In fun(libname, pkgname) : couldn't connect to display ":0"

Attempt 2
code:
fname = "70747ned"
cbs_download_table(id=fname, dir=fname)

output:

[1] "https://opendata.cbs.nl/ODataFeed/odata/70747ned/UntypedDataSet?$format=json&$skip=7060000"
Writing...
Reading...
[1] "https://opendata.cbs.nl/ODataFeed/odata/70747ned/UntypedDataSet?$format=json&$skip=7070000"
Error: parse error: premature EOF
{ "odata.metadata":"https://
(right here) ------^

Regards,

Maarten.
