This repository has been archived. The former README is now in README-NOT.md.
:bike: Extract data from public hire bicycle systems
Home Page: https://docs.ropensci.org/bikedata
This is actually a two-part task:
These are introduced after "20...24Aug2016-30Aug2016" in place of "Start/End Station Id". All well and good, except that the "Logical Terminal" numbers in no way map onto the actual station numbers! Work out what these are in order to be able to read these latest files.
Indexes for both times and dates as well as demographic characteristics could then be explicitly constructed
At the moment this is handled with the single line
indx <- which (!file.exists (files) & grepl (paste (dates, collapse = "|"), files))
meaning the dates passed to either dl_bikedata() or store_bikedata() must perfectly match the formats given in the data files. More flexible entry of dates should be possible.
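One way to allow more flexible dates would be to normalise any numeric or delimited input down to the "YYYYMM" form used in the file names before matching. This is only a sketch, assuming months are the matching unit; normalise_dates() is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: reduce assorted date inputs to "YYYYMM" strings
# so they can be matched against file names. Handles numeric input
# (201607), "2016-07", "2016/07", and full dates like "2016/07/01".
normalise_dates <- function (dates) {
    dates <- gsub ("[^0-9]", "", as.character (dates)) # strip delimiters
    substr (dates, 1, 6) # keep year + month only
}

normalise_dates (c (201607, "2016-07", "2016/07/01"))
# all three reduce to "201607"
```

Month names ("Jul 2016") would still need separate handling, but this would already cover the common numeric variants.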
Thanks @krlmlr for the great useR talk. Compelling reasons to make the switch.
@richardellison Hopefully a final question for you, and this one just in case you've got any insight: do you have any idea how to get AppVeyor working with sqlite3? I'm currently loading it by pre-installing RSQLite, but this fails because the sqlite3.h file can't be found. Note that it currently also fails on devtools, presumably for the same reason, so it seems a solution won't be easy. Would you perchance happen to have any idea how to help here?
(The good news is that the package is otherwise done and dusted and will be submitted this week - first a CRAN upload, then rOpenSci.)
Add metadata to trip matrices, particularly:
New York, for example, currently does not, and without the leading 0 these dates are not properly translated by SQLite3. Likely related to #16.
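A sketch of the kind of zero-padding needed, assuming the date components arrive as separate numeric fields (the actual parsing in the package may differ):

```r
# Pad month and day fields with a leading zero so SQLite date functions
# can parse them as ISO-8601 ("2016-07-01", not "2016-7-1").
pad_date <- function (y, m, d) {
    sprintf ("%04d-%02d-%02d", y, m, d)
}

pad_date (2016, 7, 1) # "2016-07-01"
```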
Some cities distinguish between registered/casual users (DC), or between monthly-pass/walk-up users (LA, PH). These are a kind of demographic statistic, so should be noted.
Same as tripmat, so they can be extracted independent of potential changes in numbers of stations in operation.
I think it may be worth adding a vignette with some examples of possible analyses that can be done with the data, either making use of other packages or more standard examples (possibly using some spatial queries).
In that theme, as promised, below is the code used to generate this image. The sum_network_links function is in ropensci/stplanr#185. The create_index argument of the store_bikedata function is in #3.
library(rgdal)
library(stplanr)
library(rgeos)
library(dplyr)
library(RSQLite)
library(tmap) # needed for the tm_shape()/tm_fill() plotting calls below
library(bikedata)
# Download and read in the New York State street layer (could use the NYC Tiger
# database as well potentially). This will be used to create the network (later)
download.file("http://gis.ny.gov/gisdata/fileserver/?DSID=932&file=streets_shp.zip",
destfile = "~/Downloads/streets_shp.zip")
unzip("~/Downloads/streets_shp.zip", exdir = "~/Downloads")
nystreets <- readOGR("~/Downloads/Streets_shp", "StreetSegment")
# Create a directory to store the data files and then download Citibike data for
# October 2016 to December 2016
dir.create("~/Downloads/citibikedata")
dl_bikedata(data_dir = "~/Downloads/citibikedata/",
dates = c("201610","201611","201612"))
# Store downloaded data into a database
store_bikedata("~/Downloads/citibikedata/", "citifq2016", create_index = TRUE)
# Connect to the database
dbcon <- dbConnect(SQLite(), "citifq2016")
# Retrieve stations from the database and create a SpatialPointsDataFrame
# from the result.
nycstations <- dbGetQuery(dbcon, "SELECT * FROM stations")
nycstations$geom <- NULL
nycstations <- SpatialPointsDataFrame(coords = nycstations[,c('longitude','latitude')],
proj4string = CRS("+init=epsg:4326"),
data = nycstations)
# Reproject to the same projection as the streets layer
nycstations <- spTransform(nycstations, nystreets@proj4string)
# Clip the streets layer to the area around the stations then
# remove the full New York State dataset.
nycstreets <- gclip(nystreets, bbox(gBuffer(gEnvelope(nycstations, byid = FALSE),
byid = FALSE,width=1000)))
rm(nystreets)
# Create a new network with the length parameter as the default weight
nycnet <- SpatialLinesNetwork(sl = nycstreets)
# Find the closest node to each station
nycstations@data$nodeid <- stplanr::find_network_nodes(
nycnet,
nycstations@coords[,1],
nycstations@coords[,2]
)
# Query the database to count the number of trips between each pair of stations.
routetrips <- dbGetQuery(dbcon, "SELECT start_station_id, end_station_id,
COUNT(*) as numtrips
FROM trips
WHERE start_station_id <> end_station_id
GROUP BY start_station_id, end_station_id")
# Join the routetrips table to the nycstations layer to match the Node IDs
routetrips <- routetrips %>%
inner_join(
nycstations@data %>%
select(start_station_id = id, startnodeid = nodeid)
) %>%
inner_join(
nycstations@data %>%
select(end_station_id = id, endnodeid = nodeid)
) %>%
select(
startnodeid,
endnodeid,
numtrips
)
# Run the sum_network_links function to aggregate the number of trips
# on each part of the network.
# Note that since the default weight (length) has not been changed,
# this is the simple shortest path.
nycbicycleusage <- sum_network_links(nycnet, routetrips)
# Download and read in some layers to set the geographic context
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_36061_areawater.zip",
destfile = "~/Downloads/citibikedata/nycountyareawater.zip")
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_34017_areawater.zip",
destfile = "~/Downloads/citibikedata/njcountyareawater.zip")
unzip("~/Downloads/citibikedata/nycountyareawater.zip", exdir = "~/Downloads/citibikedata/")
unzip("~/Downloads/citibikedata/njcountyareawater.zip", exdir = "~/Downloads/citibikedata/")
nywater <- readOGR("~/Downloads/citibikedata","tl_2016_36061_areawater")
njwater <- readOGR("~/Downloads/citibikedata","tl_2016_34017_areawater")
nywater <- spTransform(nywater, nycbicycleusage@proj4string)
njwater <- spTransform(njwater, nycbicycleusage@proj4string)
# Plot the water and routes layers
tm_shape(nywater, is.master = FALSE) +
tm_fill(col="#000011") +
tm_shape(njwater, is.master = FALSE) +
tm_fill(col="#000011") +
tm_shape(nycbicycleusage, is.master=TRUE) +
tm_lines(col="numtrips",
lwd="numtrips",
title.col = "Number of trips",
breaks = c(0,20000,40000,60000,80000,100000,Inf),
legend.lwd.show = FALSE,
scale = 2
) +
tm_layout(
bg.color="black",
legend.position = c("right","bottom"),
legend.bg.color = "white",
legend.bg.alpha = 0.5
)
# Save resulting map.
save_tmap(filename = "citibikeexample.png")
The distances between the stations using bike_distmat() seem off for a large part of the station pairs. For instance, with the code below I calculate the distance for the London stations and add a column with euclidean distance (using raster package). The resulting plot shows that for a lot of cases the euclidean distance exceeds the distance calculated using bike_distmat, which cannot be correct. Also doing some manual tests on google maps shows that some distances are simply too short.
Thank you in advance.
library(bikedata)
library(data.table)
library(raster)
# set up the bikedb
data_dir <- paste0(getwd(), "/data/Rbikedata")
bikedb <- file.path(data_dir, "testdb")
dl_bikedata(city = "London", data_dir = data_dir, dates = 201701:201702)
store_bikedata(data_dir = data_dir, bikedb = bikedb)
# load stations data
dtSs <- data.table(bike_stations(bikedb = bikedb))
# load distance matrix
dtDs <- data.table(bike_distmat(bikedb = bikedb, city = "London",
                                expand = 0, long = TRUE, quiet = FALSE))
# join distance matrix with start_station data
setkey(dtSs, "stn_id")
setkey(dtDs, "start_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name", "longitude", "latitude"),
         c("start_station_name", "lon_start", "lat_start"))
# join distance matrix with end_station data
setkey(dtDs, "end_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name", "longitude", "latitude"),
         c("end_station_name", "lon_end", "lat_end"))
# calculate euclidean distance using raster package
dtDs[, Dist_eucl := pointDistance(dtDs[, .(lon = lon_start, lat = lat_start)],
                                  dtDs[, .(lon = lon_end, lat = lat_end)],
                                  lonlat = TRUE)]
# select a subset for faster plotting
vSubset <- 1:(0.05 * nrow(dtDs))
# plot the euclidean against the over-the-network distance
plot(dtDs[vSubset, Dist_eucl], dtDs[vSubset, distance])
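One quick sanity check for this kind of problem (a sketch with toy coordinates, not the actual London data) is to compute the great-circle distance directly and flag any pair whose reported network distance falls below it, since a network distance can never legitimately be shorter than the straight line:

```r
# Haversine great-circle distance in metres between two lon/lat points.
haversine <- function (lon1, lat1, lon2, lat2, r = 6371000) {
    to_rad <- pi / 180
    dlon <- (lon2 - lon1) * to_rad
    dlat <- (lat2 - lat1) * to_rad
    a <- sin (dlat / 2) ^ 2 +
        cos (lat1 * to_rad) * cos (lat2 * to_rad) * sin (dlon / 2) ^ 2
    2 * r * asin (sqrt (a))
}

# Toy example: two central-London points a bit over 1 km apart; any
# network distance below the haversine value would indicate a problem
# in the distance matrix.
d_eucl <- haversine (-0.1276, 51.5072, -0.1426, 51.5033)
d_net <- 900 # hypothetical network distance from bike_distmat()
d_net < d_eucl # TRUE here, so this pair would be flagged as suspect
```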
It doesn't actually do any checks for whether files are already in the database, and unzips any and all files regardless. This has to be fixed!
At the moment the raw data are processed with the script in this gist. It might be better to keep all raw data in the package itself, along with that script?
@richardellison dumping issues on you here. The strtokm function is great, but actually no longer really necessary, because the NYC data files were only double-quoted up until 201412, and quotes disappeared from 201501 onwards. At present, I just pass a delim option to read_one_line which is either "," or just ,. Obviously we could - and should, hence this issue - just boost::replace_all (line, "\"", "") at the outset. This would then enable reversion to direct strtok rather than the multi-char strtokm, but strtok behaves slightly differently, and prevents the last token from being extracted in the neat way it currently is.
I implemented an ugly work-around by sticking an extra delim on the end of the line, but didn't commit that because there must be a better way. I'd really appreciate it if you could give it a try with std::strtok instead of strtokm and see if you can find a neat solution.
"Call a Bike is a bike hire system run by Deutsche Bahn (DB) in several German cities." (Wikipedia).
Call a Bike's data is available on Deutsche Bahn's Open Data portal.
Just a few translations to find your way around:
| German | English |
|---|---|
| Buchungen | bookings |
| Fahrzeugdaten | vehicle (data) |
| Stationen | stations / rental zones |
| Tarifklassen | tariff class (by vehicle category) |
| Herunterladen | download |

The data itself has English names, according to the documentation.
The data is licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).
Great article on the state of American bike-share systems here, with new systems including LA and Portland. Full list of systems (with direct links to data, and excluding London, UK):
Systems not yet part of this package which we hope to add:
Additional systems that do not (yet?) provide data:
Systems which have died an ungraceful death yet which still provide (historical) data:
Niceride MN has full data, all good to incorporate
You've probably already looked into Seattle's Pronto program. It ended up shutting down. Would trip data have been useful in understanding why? Or how that could have been avoided?
Found a site with what seems to be some data here:
https://www.kaggle.com/pronto/cycle-share-dataset
If I get some time I might open a PR with a pointer to it.
The current implementation of dl_bikedata() checks whether files exist and only downloads files that don't already exist. This should be extended to only download those files whose contents aren't already in the nominated SQLite3 database. Data should be downloaded and added only if the data do not already exist either as downloaded files or in the database.
Initially try matching files to database entries by start dates alone, but first ensure that all files for any nominated time contain only rides that start within that time.
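The datafiles table already records which files have been read in, so the check could be as simple as comparing basenames against that table. A minimal sketch, assuming the table has a single name column (the real schema may differ), demonstrated against a throwaway in-memory database:

```r
library (DBI)
library (RSQLite)

# Return only those files whose names are not yet recorded in the
# database's datafiles table (single "name" column assumed here).
files_not_in_db <- function (db, files) {
    stored <- dbGetQuery (db, "SELECT name FROM datafiles")$name
    files [!basename (files) %in% stored]
}

# Demo with an in-memory database standing in for the bikedb.
db <- dbConnect (SQLite (), ":memory:")
dbExecute (db, "CREATE TABLE datafiles (name TEXT)")
dbExecute (db, "INSERT INTO datafiles VALUES ('201610-citibike-tripdata.zip')")
todo <- files_not_in_db (db, c ("dl/201610-citibike-tripdata.zip",
                                "dl/201611-citibike-tripdata.zip"))
todo # only the 201611 file remains to be processed
dbDisconnect (db)
```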
The London data files are structured in a distinctly different way from all others, so simple grepping of data file names to download will not work. Implement a way to store data for particular months for London.
And then delete the data from the ./tests/ directory.
copy of main data issue - data are here
This keeps causing "unpredictable" Travis failures, because tests for total trips for the 6 cities vary by 1 or 2. Work out why and resolve this problem! Likely related to #13
Thanks @kruse-alex for the heads up. These data are different from most, and seem restricted to Hamburg only - right, Alex? But it reads as if they intend to maintain ongoing releases on an annual basis, making them acceptable for incorporation in bikedata. Alex: I've put this issue here in my package, because all of the infrastructure for reading and storing is there, so it'll be the easiest way for us both to get the data into R. Please jump in and help! Note also that the repo should soon move to ropensci (pending review).
This can now be done using the bike_write_test_data() function.
Hello,
I am trying out your package and I wanted to download and store the data for every city in a database outside the temporary directory, to be able to play with the data without having to re-download it each time.
I use the following code
bike_dt <- file.path('data/database', 'bikedata.sqlite')
store_bikedata(bikedb = bike_dt)
# enter 'yes' in the console
But this fails with the error Error in store_bikedata(bikedb = bike_dt) : argument "city" is missing, with no default. If I take this line out of the function source code and retry, I get this error: Error in store_bikedata(bikedb = bike_dt) : argument "city" is missing, with no default.
This package looks nice and I am looking forward to looking into the data.
Thanks for your help,
Mathieu
Station location data for DC had to be hard-coded because the data given at opendata.dc.gov/datasets/capital-bike-share-locations became too unreliable, and simply returned an unknown error. Blame the underlying opendata.arcgis.com server! In the meantime, the data have been parked in R/sysdata.rda, but this can't be a long-term solution as the system expands and changes. This issue is a flag to address this when (hopefully) the live data become more stable.
The NABSA systems (Philly, LA) have all station coordinates in the trip data files, with station files containing only station numbers, names, dates, and operating status. These station files could at least be used to insert names into the station tables, which these systems currently lack. This would be a bit tricky only because the station data are inserted into the SQLite stations table during reading of the raw data files, so it would require subsequent modification of the existing table.
Is there a reason that the filter_tripmat_by_datetime function returns all the trips that match the query, rather than just using a combination of COUNT(*) and GROUP BY?
I also wonder if there is value in switching to parameterised queries where possible (it isn't in all cases in this package)? SQL injection probably isn't an issue here, but it is standard practice and can make the code somewhat easier to read.
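Both suggestions can be combined: RSQLite supports parameterised queries via the params argument of dbGetQuery(), and the counting can happen in SQL. A sketch against an in-memory stand-in for the trips table (the column names are assumed from the examples above, not taken from the package schema):

```r
library (DBI)
library (RSQLite)

db <- dbConnect (SQLite (), ":memory:")
dbExecute (db, "CREATE TABLE trips (start_station_id TEXT, start_time TEXT)")
dbExecute (db, "INSERT INTO trips VALUES
           ('ny1', '2016-10-01 08:00'), ('ny1', '2016-10-01 09:00'),
           ('ny2', '2016-10-02 08:00')")

# Parameterised query: the date bounds are bound as parameters rather
# than pasted into the SQL string, and COUNT(*)/GROUP BY aggregates in
# the database instead of returning every matching trip.
counts <- dbGetQuery (db,
    "SELECT start_station_id, COUNT(*) AS numtrips
     FROM trips
     WHERE start_time >= ? AND start_time < ?
     GROUP BY start_station_id",
    params = list ("2016-10-01", "2016-10-02"))
counts # one row: ny1 with numtrips = 2
dbDisconnect (db)
```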
This is already described in the vignette, but not yet implemented
Now I understand why RSQLite does this: the GitHub language statistics automatically ignore all files in any (^|/)[Vv]+endor/ directories. The other alternative is adding a .gitattributes with an explicit linguist-documentation entry, but I've not got a .gitattributes anywhere else, so ./src/vendor/ seems like a cleaner solution.
@richardellison a plea for help here: Travis currently fails because the sqlite3_exec() statement at the end of rcpp_import_stn_df returns 1 on Travis, yet I get 0. I can reproduce the Travis failure in a trusty container, but not in xenial, so this must have something to do with sqlite3 versions? Any insight from your side would be very much appreciated!
These lines suffice to reproduce (in this case, for Chicago test data, but the same result arises for London):
devtools::load_all (".", export_all=TRUE)
bikedb <- "junkdb"
data_dir <- "./tests"
rcpp_create_sqlite3_db (bikedb)
flists <- bike_unzip_files_chicago (data_dir, bikedb)
ch_stns <- bike_get_chicago_stations (flists)
head (ch_stns)
nstations <- rcpp_import_stn_df (bikedb, ch_stns, 'ch')
The query structure is absolutely okay and works fine on >= 16.04, so this really does just seem to be an internal SQLite3 thing, but one for which we really need to find a solution. Set-up of the stations table is here - could it be an issue with the UNIQUE statement? All other otherwise entirely equivalent sqlite3_exec() statements return 0 as expected, so that's the only real difference that jumps out at me.
Oh, and going the full _prepare_v2() -> _step() -> _reset() path also returns identical results (trusty = fail, xenial = pass).
Write a short boolean function to confirm that all bikedb arguments reference a database with the stations, trips, and datafiles tables, and use it in all functions which have this argument.
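A minimal sketch of such a check (bike_db_ok() is a hypothetical name, not an existing package function), demonstrated here on an in-memory database that deliberately lacks the tables:

```r
library (DBI)
library (RSQLite)

# TRUE only if the database behind `bikedb` contains all three tables
# the package writes ("stations", "trips", "datafiles").
bike_db_ok <- function (bikedb) {
    db <- dbConnect (SQLite (), bikedb)
    on.exit (dbDisconnect (db))
    all (c ("stations", "trips", "datafiles") %in% dbListTables (db))
}

ok <- bike_db_ok (":memory:")
ok # FALSE - a fresh database has no tables at all
```

Functions taking a bikedb argument could then fail early with a clear message instead of a raw SQLite error.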
Lots of args like start/end_date/time could and should be accepted in more flexible formats using a rename_args() function like rename_aes() in ggplot. It should also be possible to American-spell anything and everything for those of that inclination.
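A sketch of how such a rename_args() helper might look; the synonym table here is purely illustrative and not taken from the package:

```r
# Map alternative argument names onto their canonical forms, so that
# e.g. American spellings are accepted transparently.
rename_args <- function (args, synonyms) {
    hits <- names (args) %in% names (synonyms)
    names (args) [hits] <- synonyms [names (args) [hits]]
    args
}

synonyms <- c (color = "colour", start_date = "start_time")
args <- rename_args (list (color = "red", n = 1), synonyms)
names (args) # "colour" "n"
```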
I'm looking to learn more about the bikedata package, and I was going through this vignette: https://ropensci.github.io/bikedata/
However, when I attempt the following: store_bikedata (city = 'nyc', bikedb = 'bikedb', dates = 201601:201603), I get the following error:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file C:\Users\PSTRAFC:\Users\PStraforelli\Documents1\AppData\Local\Temp\RtmpwZUPl6\201601-citibike-tripdata.zip.
I understand that this error may be unrelated to the bikedata package, but I was hoping I could at least get some pointers on how I could debug this. I haven't been able to find anything via Google.
Their AWS format just changed from monthly to annual dumps for all prior years, plus quarterly dumps of the current year. The bike_convert_dates() function currently maps dates to quarters, and so no longer matches any files, causing tests to fail. FIX!
This field probably occupies at least half the entire database size, yet is not really necessary. Remove?
This is not necessarily straightforward, because some station tables are constructed straight from trip data, which doesn't have this info.
I had no problem getting the data for London in the rest of 2017 (except after December 5; I'm assuming those are not available yet?) - but these files seem not to download correctly.
dl_bikedata("lo", paste0(getwd(),"/bikedata"), dates = 201703:201705)
#> Downloading 50 Journey Data Extract 22Mar2017-28Mar2017.csv
#> Downloading 51 Journey Data Extract 29Mar2017-04Apr2017.csv
#> Downloading 52 Journey Data Extract 05Apr2017-11Apr2017.csv
#> Downloading 55JourneyData Extract26Apr2017-02May2017.csv
#> Downloading 56JourneyDataExtract 03May2017-09May2017.csv
I'm getting the message that they downloaded, but those specific files do not appear in the directory. They stand out from the other files in that they have spaces - which is potentially what's causing the issue. Is there another way to acquire them?
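If the spaces are indeed the problem, percent-encoding the file name before downloading may work around it. A sketch using base R's utils::URLencode(); the base URL here is a placeholder, not the actual TfL endpoint:

```r
# Percent-encode spaces in a file name before downloading. The base
# URL below is illustrative only, not the real data endpoint.
fname <- "55JourneyData Extract26Apr2017-02May2017.csv"
url <- paste0 ("https://example.com/data/", utils::URLencode (fname))
url # spaces become "%20" in the encoded URL
# download.file (url, destfile = file.path ("bikedata", fname))
```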
As described in vignette.
There are some recent raw .csv files that are junk, and these may cause the functions to fail? Check!
Some tests that use expect_silent()
fail, probably since tidyverse/dplyr#2878.
Related: tidyverse/dbplyr#18.
It's probably better to replace the current approach to tests with the RSQLite approach of storing an SQLite database in /inst/db/. Tests could then still download and store data, but all actual tests of data extraction could then just use this pre-stored database and would give more reliable values. (Things such as numbers of stations in London will vary, but numbers of trips should then vary only slightly, and only at times at which corresponding stations are potentially closed.)
@Robinlovelace Can you please provide details of the Mexico City data you mentioned? It'd be great to incorporate that if possible
Because both the stations table and the long = TRUE tripmat are tibbles by default, it's probably better to do this for all functions.
These both use the same web pages (LA here and Philly here). Figure out how to scrape these pages so data can be automatically updated.
should be: https://ropensci.github.io/bikedata/
copy of main data issue - data are here. Unfortunately not a NABSA system, so this'll take a bit more work than Philadelphia.