This repository has been archived. The former README is now in README-NOT.md.
:bike: Extract data from public hire bicycle systems
Home Page: https://docs.ropensci.org/bikedata
This is actually a two-part task:
These are introduced after "20...24Aug2016-30Aug2016" in place of "Start/End Station Id". All well and good, except that the "Logical Terminal" numbers in no way map onto the actual station numbers! Work out what these are in order to be able to read these latest files.
Indexes for both times and dates as well as demographic characteristics could then be explicitly constructed
At the moment this is handled with the single line
indx <- which (!file.exists (files) & grepl (paste (dates, collapse = "|"), files))
meaning the dates passed to either dl_bikedata() or store_bikedata() must perfectly match the formats given in the data files. More flexible entry of dates should be possible.
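One way to allow more flexible dates would be to normalise any numeric or delimited input down to the "YYYYMM" form used in the file names before matching. This is only a sketch, assuming months are the matching unit; normalise_dates() is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: reduce assorted date inputs to "YYYYMM" strings
# so they can be matched against file names. Handles numeric input
# (201607), "2016-07", "2016/07", and full dates like "2016/07/01".
normalise_dates <- function (dates) {
    dates <- gsub ("[^0-9]", "", as.character (dates)) # strip delimiters
    substr (dates, 1, 6) # keep year + month only
}

normalise_dates (c (201607, "2016-07", "2016/07/01"))
# all three reduce to "201607"
```

Month names ("Jul 2016") would still need separate handling, but this would already cover the common numeric variants.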
Thanks @krlmlr for the great useR talk. Compelling reasons to make the switch.
@richardellison Hopefully a final question for you, and this one just in case you've got any insight: do you have any idea how to get AppVeyor working with sqlite3? I'm currently loading it by pre-installing RSQLite, but this fails because the sqlite3.h file can't be found. Note that it currently also fails on devtools, presumably for the same reason, so it seems a solution won't be easy. Would you perchance happen to have any idea how to help here?
(The good news is that the package is otherwise done and dusted and will be submitted this week - first a CRAN upload, then rOpenSci.)
Add metadata to trip matrices, particularly:
New York, for example, currently does not, and without the leading 0 these dates are not properly translated by SQLite3. Likely related to #16.
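A sketch of the kind of zero-padding needed, assuming the date components arrive as separate numeric fields (the actual parsing in the package may differ):

```r
# Pad month and day fields with a leading zero so SQLite date functions
# can parse them as ISO-8601 ("2016-07-01", not "2016-7-1").
pad_date <- function (y, m, d) {
    sprintf ("%04d-%02d-%02d", y, m, d)
}

pad_date (2016, 7, 1) # "2016-07-01"
```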
Some cities distinguish between registered/casual users (DC), or between monthly-pass/walk-up users (LA, PH). These are a kind of demographic statistic, so should be noted.
Same as tripmat, so they can be extracted independent of potential changes in numbers of stations in operation.
I think it may be worth adding a vignette with some examples of possible analyses that can be done with the data, either making use of other packages or more standard examples (possibly using some spatial queries).
In that theme, as promised, below is the code used to generate this image. The sum_network_links function is in ropensci/stplanr#185. The create_index argument of the store_bikedata function is in #3.
library(rgdal)
library(stplanr)
library(rgeos)
library(dplyr)
library(RSQLite)
library(tmap) # needed for the tm_shape()/tm_fill() plotting calls below
library(bikedata)
# Download and read in the New York State street layer (could use the NYC Tiger
# database as well potentially). This will be used to create the network (later)
download.file("http://gis.ny.gov/gisdata/fileserver/?DSID=932&file=streets_shp.zip",
destfile = "~/Downloads/streets_shp.zip")
unzip("~/Downloads/streets_shp.zip", exdir = "~/Downloads")
nystreets <- readOGR("~/Downloads/Streets_shp", "StreetSegment")
# Create a directory to store the data files and then download Citibike data for
# October 2016 to December 2016
dir.create("~/Downloads/citibikedata")
dl_bikedata(data_dir = "~/Downloads/citibikedata/",
dates = c("201610","201611","201612"))
# Store downloaded data into a database
store_bikedata("~/Downloads/citibikedata/", "citifq2016", create_index = TRUE)
# Connect to the database
dbcon <- dbConnect(SQLite(), "citifq2016")
# Retrieve stations from the database and create a SpatialPointsDataFrame
# from the result.
nycstations <- dbGetQuery(dbcon, "SELECT * FROM stations")
nycstations$geom <- NULL
nycstations <- SpatialPointsDataFrame(coords = nycstations[,c('longitude','latitude')],
proj4string = CRS("+init=epsg:4326"),
data = nycstations)
# Reproject to the same projection as the streets layer
nycstations <- spTransform(nycstations, nystreets@proj4string)
# Clip the streets layer to the area around the stations then
# remove the full New York State dataset.
nycstreets <- gclip(nystreets, bbox(gBuffer(gEnvelope(nycstations, byid = FALSE),
byid = FALSE,width=1000)))
rm(nystreets)
# Create a new network with the length parameter as the default weight
nycnet <- SpatialLinesNetwork(sl = nycstreets)
# Find the closest node to each station
nycstations@data$nodeid <- stplanr::find_network_nodes(
nycnet,
nycstations@coords[,1],
nycstations@coords[,2]
)
# Query the database to count the number of trips between each pair of stations.
routetrips <- dbGetQuery(dbcon, "SELECT start_station_id, end_station_id,
COUNT(*) as numtrips
FROM trips
WHERE start_station_id <> end_station_id
GROUP BY start_station_id, end_station_id")
# Join the routetrips table to the nycstations layer to match the Node IDs
routetrips <- routetrips %>%
inner_join(
nycstations@data %>%
select(start_station_id = id, startnodeid = nodeid)
) %>%
inner_join(
nycstations@data %>%
select(end_station_id = id, endnodeid = nodeid)
) %>%
select(
startnodeid,
endnodeid,
numtrips
)
# Run the sum_network_links function to aggregate the number of trips
# on each part of the network.
# Note that since the default weight (length) has not been changed,
# this is the simple shortest path.
nycbicycleusage <- sum_network_links(nycnet, routetrips)
# Download and read in some layers to set the geographic context
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_36061_areawater.zip",
destfile = "~/Downloads/citibikedata/nycountyareawater.zip")
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_34017_areawater.zip",
destfile = "~/Downloads/citibikedata/njcountyareawater.zip")
unzip("~/Downloads/citibikedata/nycountyareawater.zip", exdir = "~/Downloads/citibikedata/")
unzip("~/Downloads/citibikedata/njcountyareawater.zip", exdir = "~/Downloads/citibikedata/")
nywater <- readOGR("~/Downloads/citibikedata","tl_2016_36061_areawater")
njwater <- readOGR("~/Downloads/citibikedata","tl_2016_34017_areawater")
nywater <- spTransform(nywater, nycbicycleusage@proj4string)
njwater <- spTransform(njwater, nycbicycleusage@proj4string)
# Plot the water and routes layers
tm_shape(nywater, is.master = FALSE) +
tm_fill(col="#000011") +
tm_shape(njwater, is.master = FALSE) +
tm_fill(col="#000011") +
tm_shape(nycbicycleusage, is.master=TRUE) +
tm_lines(col="numtrips",
lwd="numtrips",
title.col = "Number of trips",
breaks = c(0,20000,40000,60000,80000,100000,Inf),
legend.lwd.show = FALSE,
scale = 2
) +
tm_layout(
bg.color="black",
legend.position = c("right","bottom"),
legend.bg.color = "white",
legend.bg.alpha = 0.5
)
# Save resulting map.
save_tmap(filename = "citibikeexample.png")
The distances between the stations using bike_distmat() seem off for a large part of the station pairs. For instance, with the code below I calculate the distance for the London stations and add a column with euclidean distance (using raster package). The resulting plot shows that for a lot of cases the euclidean distance exceeds the distance calculated using bike_distmat, which cannot be correct. Also doing some manual tests on google maps shows that some distances are simply too short.
Thank you in advance.
library(bikedata)
library(data.table)
library(raster)
# set up the bikedb
data_dir <- paste0(getwd(), "/data/Rbikedata")
bikedb <- file.path(data_dir, "testdb")
dl_bikedata(city = "London", data_dir = data_dir, dates = 201701:201702)
store_bikedata(data_dir = data_dir, bikedb = bikedb)
# load stations data
dtSs <- data.table(bike_stations(bikedb = bikedb))
# load distance matrix
dtDs <- data.table(bike_distmat(bikedb = bikedb, city = "London",
                                expand = 0, long = TRUE, quiet = FALSE))
# join distance matrix with start_station data
setkey(dtSs, "stn_id")
setkey(dtDs, "start_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name", "longitude", "latitude"),
         c("start_station_name", "lon_start", "lat_start"))
# join distance matrix with end_station data
setkey(dtDs, "end_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name", "longitude", "latitude"),
         c("end_station_name", "lon_end", "lat_end"))
# calculate euclidean distance using raster package
dtDs[, Dist_eucl := pointDistance(dtDs[, .(lon = lon_start, lat = lat_start)],
                                  dtDs[, .(lon = lon_end, lat = lat_end)],
                                  lonlat = TRUE)]
# select a subset for faster plotting
vSubset <- 1:(0.05 * nrow(dtDs))
# plot the euclidean against the over-the-network distance
plot(dtDs[vSubset, Dist_eucl], dtDs[vSubset, distance])
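One quick sanity check for this kind of problem (a sketch with toy coordinates, not the actual London data) is to compute the great-circle distance directly and flag any pair whose reported network distance falls below it, since a network distance can never legitimately be shorter than the straight line:

```r
# Haversine great-circle distance in metres between two lon/lat points.
haversine <- function (lon1, lat1, lon2, lat2, r = 6371000) {
    to_rad <- pi / 180
    dlon <- (lon2 - lon1) * to_rad
    dlat <- (lat2 - lat1) * to_rad
    a <- sin (dlat / 2) ^ 2 +
        cos (lat1 * to_rad) * cos (lat2 * to_rad) * sin (dlon / 2) ^ 2
    2 * r * asin (sqrt (a))
}

# Toy example: two central-London points a bit over 1 km apart; any
# network distance below the haversine value would indicate a problem
# in the distance matrix.
d_eucl <- haversine (-0.1276, 51.5072, -0.1426, 51.5033)
d_net <- 900 # hypothetical network distance from bike_distmat()
d_net < d_eucl # TRUE here, so this pair would be flagged as suspect
```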
It doesn't actually do any checks for whether files are already in the database, and unzips any and all files regardless. This has to be fixed!
At the moment the raw data are processed with the script in this gist. It might be better to keep all raw data in the package itself, along with that script?
@richardellison dumping issues on you here. The strtokm function is great, but actually no longer really necessary, because the NYC data files were only double-quoted up until 201412, and quotes disappeared from 201501 onwards. At present, I just pass a delim option to read_one_line which is either "," or just ,. Obviously we could - and should, hence this issue - just boost::replace_all (line, "\"", "") at the outset. This would then enable reversion to direct strtok rather than the multi-char strtokm, but strtok behaves slightly differently, and prevents the last token from being extracted in the neat way it currently is.
I implemented an ugly work-around by sticking an extra delim on the end of the line, but didn't commit that because there must be a better way. I'd really appreciate it if you could give it a try with std::strtok instead of strtokm and see if you can find a neat solution.
"Call a Bike is a bike hire system run by Deutsche Bahn (DB) in several German cities." (Wikipedia).
Call a Bike's data is available on Deutsche Bahn's Open Data portal.
Just a few translations to find your way around:
| German | English |
|---|---|
| Buchungen | bookings |
| Fahrzeugdaten | vehicle (data) |
| Stationen | stations / rental zones |
| Tarifklassen | tariff class (by vehicle category) |
| Herunterladen | download |

The data itself has English names, according to the documentation.
The data is licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).
Great article on the state of American bike-share systems here, with new systems including LA and Portland. Full list of systems (with direct links to data, and excluding London, UK):
Systems not yet part of this package which we hope to add:
Additional systems that do not (yet?) provide data:
Systems which have died an ungraceful death yet which still provide (historical) data:
Niceride MN has full data, all good to incorporate
You've probably already looked into Seattle's Pronto program. It ended up shutting down. Would trip data have been useful in understanding why? Or how that could have been avoided?
Found a site with what seems to be some data here:
https://www.kaggle.com/pronto/cycle-share-dataset
If I get some time I might open a PR with a pointer to it.
The current implementation of dl_bikedata() checks whether files exist and only downloads files that don't already exist. This should be extended to only download those files whose contents aren't already in the nominated SQLite3 database. Data should be downloaded and added only if the data do not already exist either as downloaded files or in the database.
Initially try matching files to database entries by start dates alone, but first ensure that all files for any nominated time contain only rides that start within that time.
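The datafiles table already records which files have been read in, so the check could be as simple as comparing basenames against that table. A minimal sketch, assuming the table has a single name column (the real schema may differ), demonstrated against a throwaway in-memory database:

```r
library (DBI)
library (RSQLite)

# Return only those files whose names are not yet recorded in the
# database's datafiles table (single "name" column assumed here).
files_not_in_db <- function (db, files) {
    stored <- dbGetQuery (db, "SELECT name FROM datafiles")$name
    files [!basename (files) %in% stored]
}

# Demo with an in-memory database standing in for the bikedb.
db <- dbConnect (SQLite (), ":memory:")
dbExecute (db, "CREATE TABLE datafiles (name TEXT)")
dbExecute (db, "INSERT INTO datafiles VALUES ('201610-citibike-tripdata.zip')")
todo <- files_not_in_db (db, c ("dl/201610-citibike-tripdata.zip",
                                "dl/201611-citibike-tripdata.zip"))
todo # only the 201611 file remains to be processed
dbDisconnect (db)
```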
The London data files are structured in a distinctly different way from all others, so simple grepping of data file names to download will not work. Implement a way to store data for particular months for London.
And then delete the data from the ./tests/ directory.
copy of main data issue - data are here
This keeps causing "unpredictable" Travis failures, because tests for total trips for the 6 cities vary by 1 or 2. Work out why and resolve this problem! Likely related to #13
Thanks @kruse-alex for the heads up. These data are different from most, and seem restricted to Hamburg only - right, Alex? But it reads as if they intend to maintain ongoing releases on an annual basis, making them acceptable for incorporation in bikedata. Alex: I've put this issue here in my package, because all of the infrastructure for reading and storing is there, so it'll be the easiest way for us both to get the data into R. Please jump in and help! Note also that the repo should soon move to ropensci (pending review).
This can now be done using the bike_write_test_data() function.
Hello,
I am trying out your package and I wanted to download and store the data for every city in a database outside the temporary directory, to be able to play with the data without having to re-download it each time.
I use the following code
bike_dt <- file.path('data/database', 'bikedata.sqlite')
store_bikedata(bikedb = bike_dt)
# enter 'yes' in the console
But this fails with the error Error in store_bikedata(bikedb = bike_dt) : argument "city" is missing, with no default. If I take this line out of the function source code and retry, I get this error: Error in store_bikedata(bikedb = bike_dt) : argument "city" is missing, with no default.
This package looks nice and I am looking forward to looking into the data.
Thanks for your help,
Mathieu
Station location data for DC had to be hard-coded because the data given at opendata.dc.gov/datasets/capital-bike-share-locations became too unreliable, and simply returned an unknown error. Blame the underlying opendata.arcgis.com server! In the meantime, the data have been parked in R/sysdata.rda, but this can't be a long-term solution as the system expands and changes. This issue is a flag to address this when (hopefully) the live data become more stable.
The NABSA systems (Philly, LA) have all station coordinates in the trip data files, with station files containing only station numbers, names, dates, and operating status. These station files could at least be used to insert names into the station tables, which these systems currently lack. This would be a bit tricky only because the station data are inserted into the SQLite stations table during reading of the raw data files, so it would require subsequent modification of the existing table.
Is there a reason that the filter_tripmat_by_datetime function returns all the trips that match the query, rather than just using a combination of COUNT(*) and GROUP BY?
I also wonder if there is value in switching to parameterised queries where possible (it isn't in all cases in this package)? SQL injection probably isn't an issue here, but it is standard practice and can make the code somewhat easier to read.
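Both suggestions can be combined: RSQLite supports parameterised queries via the params argument of dbGetQuery(), and the counting can happen in SQL. A sketch against an in-memory stand-in for the trips table (the column names are assumed from the examples above, not taken from the package schema):

```r
library (DBI)
library (RSQLite)

db <- dbConnect (SQLite (), ":memory:")
dbExecute (db, "CREATE TABLE trips (start_station_id TEXT, start_time TEXT)")
dbExecute (db, "INSERT INTO trips VALUES
           ('ny1', '2016-10-01 08:00'), ('ny1', '2016-10-01 09:00'),
           ('ny2', '2016-10-02 08:00')")

# Parameterised query: the date bounds are bound as parameters rather
# than pasted into the SQL string, and COUNT(*)/GROUP BY aggregates in
# the database instead of returning every matching trip.
counts <- dbGetQuery (db,
    "SELECT start_station_id, COUNT(*) AS numtrips
     FROM trips
     WHERE start_time >= ? AND start_time < ?
     GROUP BY start_station_id",
    params = list ("2016-10-01", "2016-10-02"))
counts # one row: ny1 with numtrips = 2
dbDisconnect (db)
```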
This is already described in the vignette, but not yet implemented
Now I understand why RSQLite does this: the GitHub language statistics automatically ignore all files in any (^|/)[Vv]+endor/ directories. The other alternative is adding a .gitattributes with an explicit linguist-documentation entry, but I've not got a .gitattributes anywhere else, so ./src/vendor/ seems like a cleaner solution.
@richardellison a plea for help here: Travis currently fails because the sqlite3_exec() statement at the end of rcpp_import_stn_df returns 1 on Travis, yet I get 0. I can reproduce the Travis failure in a trusty container, but not in xenial, so this must have something to do with sqlite3 versions? Any insight from your side would be very much appreciated!
These lines suffice to reproduce (in this case, for Chicago test data, but the same result arises for London):
devtools::load_all (".", export_all=TRUE)
bikedb <- "junkdb"
data_dir <- "./tests"
rcpp_create_sqlite3_db (bikedb)
flists <- bike_unzip_files_chicago (data_dir, bikedb)
ch_stns <- bike_get_chicago_stations (flists)
head (ch_stns)
nstations <- rcpp_import_stn_df (bikedb, ch_stns, 'ch')
The query structure is absolutely okay and works fine on >= 16.04, so this really does just seem to be an internal SQLite3 thing, but one for which we really need to find a solution. Set-up of the stations table is here - could it be an issue with the UNIQUE statement? All other otherwise entirely equivalent sqlite3_exec() statements return 0 as expected, so that's the only real difference that jumps out at me.
Oh, and going the full _prepare_v2() -> _step() -> _reset() path also returns identical results (trusty = fail, xenial = pass).
Write a short boolean function to confirm that all bikedb arguments reference a database with the stations, trips, and datafiles tables, and use it in all functions which have this argument.
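A minimal sketch of such a check (bike_db_ok() is a hypothetical name, not an existing package function), demonstrated here on an in-memory database that deliberately lacks the tables:

```r
library (DBI)
library (RSQLite)

# TRUE only if the database behind `bikedb` contains all three tables
# the package writes ("stations", "trips", "datafiles").
bike_db_ok <- function (bikedb) {
    db <- dbConnect (SQLite (), bikedb)
    on.exit (dbDisconnect (db))
    all (c ("stations", "trips", "datafiles") %in% dbListTables (db))
}

ok <- bike_db_ok (":memory:")
ok # FALSE - a fresh database has no tables at all
```

Functions taking a bikedb argument could then fail early with a clear message instead of a raw SQLite error.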
Lots of args like start/end_date/time could and should be accepted in more flexible formats using a rename_args() function like rename_aes() in ggplot. It should also be possible to American-spell anything and everything for those of that inclination.
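A sketch of how such a rename_args() helper might look; the synonym table here is purely illustrative and not taken from the package:

```r
# Map alternative argument names onto their canonical forms, so that
# e.g. American spellings are accepted transparently.
rename_args <- function (args, synonyms) {
    hits <- names (args) %in% names (synonyms)
    names (args) [hits] <- synonyms [names (args) [hits]]
    args
}

synonyms <- c (color = "colour", start_date = "start_time")
args <- rename_args (list (color = "red", n = 1), synonyms)
names (args) # "colour" "n"
```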
I'm looking to learn more about the bikedata package, and I was going through this vignette: https://ropensci.github.io/bikedata/
However, when I attempt the following: store_bikedata (city = 'nyc', bikedb = 'bikedb', dates = 201601:201603), I get the following error:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file C:\Users\PSTRAFC:\Users\PStraforelli\Documents1\AppData\Local\Temp\RtmpwZUPl6\201601-citibike-tripdata.zip.
I understand that this error may be unrelated to the bikedata package, but I was hoping I could at least get some pointers on how I could debug this. I haven't been able to find anything via Google.
Their AWS format just changed from monthly to annual dumps for all prior years, plus quarterly dumps of the current year. The bike_convert_dates() function currently maps dates to quarters, and so no longer matches any files, causing tests to fail. FIX!
This field probably occupies at least half the entire database size, yet is not really necessary. Remove?
This is not necessarily straightforward, because some station tables are constructed straight from trip data, which doesn't have this info.
I had no problem getting the data for London in the rest of 2017 (except after December 5; I'm assuming those are not available yet?) - but these files seem not to download correctly.
dl_bikedata("lo", paste0(getwd(),"/bikedata"), dates = 201703:201705)
#> Downloading 50 Journey Data Extract 22Mar2017-28Mar2017.csv
#> Downloading 51 Journey Data Extract 29Mar2017-04Apr2017.csv
#> Downloading 52 Journey Data Extract 05Apr2017-11Apr2017.csv
#> Downloading 55JourneyData Extract26Apr2017-02May2017.csv
#> Downloading 56JourneyDataExtract 03May2017-09May2017.csv
I'm getting the message that they downloaded, but those specific files do not appear in the directory. They stand out from the other files in that they have spaces - which is potentially what's causing the issue. Is there another way to acquire them?
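If the spaces are indeed the problem, percent-encoding the file name before downloading may work around it. A sketch using base R's utils::URLencode(); the base URL here is a placeholder, not the actual TfL endpoint:

```r
# Percent-encode spaces in a file name before downloading. The base
# URL below is illustrative only, not the real data endpoint.
fname <- "55JourneyData Extract26Apr2017-02May2017.csv"
url <- paste0 ("https://example.com/data/", utils::URLencode (fname))
url # spaces become "%20" in the encoded URL
# download.file (url, destfile = file.path ("bikedata", fname))
```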
As described in vignette.
There are some recent raw .csv files that are junk, and these may cause the functions to fail? Check!
Some tests that use expect_silent()
fail, probably since tidyverse/dplyr#2878.
Related: tidyverse/dbplyr#18.
It's probably better to replace the current approach to tests with the RSQLite approach of storing an SQLite database in /inst/db/. Tests could then still download and store data, but all actual tests of data extraction could then just use this pre-stored database and would give more reliable values. (Things such as numbers of stations in London will vary, but numbers of trips should then vary only slightly, and only at times at which corresponding stations are potentially closed.)
@Robinlovelace Can you please provide details of the Mexico City data you mentioned? It'd be great to incorporate that if possible
Because both the stations table and the long = TRUE tripmat are tibbles by default, it's probably better to do this for all functions.
These both use the same web pages (LA here and Philly here). Figure out how to scrape these pages so data can be automatically updated.
should be: https://ropensci.github.io/bikedata/
copy of main data issue - data are here. Unfortunately not a NABSA system, so this'll take a bit more work than Philadelphia.