bikedata's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

bikedata's People

Contributors

arfon, graceli8, jimhester, maelle, mpadge, richardellison, sckott, szymanskir, tbuckl

bikedata's Issues

extract daily trips

Actually a two-part task:

  1. Extract the date of the first trip for each station, so that station counts can be standardised independently of how long stations have been in operation; and
  2. Extract a simple daily time series of total rides, with the possibility of specifying a particular station.
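
A minimal sketch of both extractions, querying the trips table directly. The start_station_id column appears in queries elsewhere in this tracker; the start_time column name and database path are assumptions:

library (RSQLite)
db <- dbConnect (SQLite (), "bikedb.sqlite")

# 1. date of first trip for each station
first_trips <- dbGetQuery (db,
    "SELECT start_station_id, MIN(start_time) AS first_trip
     FROM trips GROUP BY start_station_id")

# 2. daily time series of total rides, optionally for a single station
daily_trips <- function (db, station = NULL) {
    qry <- "SELECT DATE(start_time) AS day, COUNT(*) AS ntrips FROM trips"
    if (!is.null (station)) {
        qry <- paste (qry, "WHERE start_station_id = ? GROUP BY day")
        dbGetQuery (db, qry, params = list (station))
    } else
        dbGetQuery (db, paste (qry, "GROUP BY day"))
}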

Fix London files with "Station Logical Terminal"

These are introduced after "20...24Aug2016-30Aug2016" in place of "Start/End Station Id". All well and good, except that the "Logical Terminal" numbers in no way map onto the actual station numbers! Find out what the hell these are in order to be able to read these latest files.

more flexible date args for dl_bikedata

At the moment this is handled with the single line

indx <- which (!file.exists (files) & grepl (paste (dates, collapse = "|"), files))

meaning the dates passed to either dl_bikedata() or store_bikedata() must exactly match the formats given in the data file names. More flexible entry of dates should be possible.
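
One possible approach - purely a sketch; bike_convert_dates() is a hypothetical helper, not an existing function - is to normalise whatever the user passes into the YYYYMM form used in the file names before grepping:

# normalise flexible date input to "YYYYMM" strings
bike_convert_dates <- function (dates) {
    dates <- gsub ("[[:space:]/-]", "", as.character (dates))
    yrs <- grepl ("^[0-9]{4}$", dates)  # expand bare years to all 12 months
    dates <- c (dates [!yrs],
                unlist (lapply (dates [yrs], function (y)
                                sprintf ("%s%02d", y, 1:12))))
    sort (unique (dates))
}
bike_convert_dates (c ("2016/05", "2016-06", 2017))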

appveyor

@richardellison Hopefully a final question for you, and this one just in case you've got any insight: Do you have any idea how to get appveyor working with sqlite3? I'm currently loading it through pre-installing RSQLite, but this fails because the sqlite3.h file can't be found. Note that it currently also fails on devtools, presumably for the same reason, so it seems like a solution ain't easy.

Would you perchance happen to have any idea how to help here?

(The good news is that the package is otherwise done and dusted and will be submitted this week - first a CRAN upload, then ropensci.)

metadata

Add metadata to trip matrices, particularly:

  1. Version number of bikedata
  2. Dates of first and last trips used to calculate the trip matrix
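
A sketch of how this could look, attaching the metadata as attributes on the matrix returned by bike_tripmat() (the date values here are illustrative only; they would be queried from the database):

library (bikedata)
bikedb <- file.path (tempdir (), "bikedb.sqlite")  # assumed to exist already
tm <- bike_tripmat (bikedb = bikedb, city = "ny")
attr (tm, "bikedata_version") <- as.character (packageVersion ("bikedata"))
attr (tm, "first_trip") <- "2016-10-01"
attr (tm, "last_trip") <- "2016-12-31"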

update bike_demographic_data

Some cities distinguish between registered/casual users (DC), or between monthly-pass/walk-up users (LA, PH). These are a kind of demographic statistic, so they should be noted.

Vignette with example of uses

I think it may be worth adding a vignette with some examples of possible analyses that can be done with the data, either making use of other packages or more standard examples (possibly using some spatial queries).

In that theme, as promised, below is the code used to generate this image. The sum_network_links function is in ropensci/stplanr#185. The create_index argument of the store_bikedata function is in #3.

[image: nycitibikeexample - Citibike trip counts aggregated onto the New York City street network]

library(rgdal)
library(stplanr)
library(rgeos)
library(dplyr)
library(RSQLite)
library(bikedata)
library(tmap)     # needed for the tm_shape()/tm_lines() calls below


# Download and read in the New York State street layer (could use the NYC Tiger 
# database as well potentially). This will be used to create the network (later)
download.file("http://gis.ny.gov/gisdata/fileserver/?DSID=932&file=streets_shp.zip",
              destfile = "~/Downloads/streets_shp.zip")
unzip("~/Downloads/streets_shp.zip", exdir = "~/Downloads/Streets_shp")
nystreets <- readOGR("~/Downloads/Streets_shp", "StreetSegment")

# Create a directory to store the data files and then download Citibike data for
# October 2016 to December 2016
dir.create("~/Downloads/citibikedata")
dl_bikedata(data_dir = "~/Downloads/citibikedata/",
            dates = c("201610","201611","201612"))

# Store downloaded data into a database
store_bikedata("~/Downloads/citibikedata/", "citifq2016", create_index = TRUE)

# Connect to the database
dbcon <- dbConnect(SQLite(), "citifq2016")

# Retrieve stations from the database and create a SpatialPointsDataFrame 
# from the result.
nycstations <- dbGetQuery(dbcon, "SELECT * FROM stations")
nycstations$geom <- NULL
nycstations <- SpatialPointsDataFrame(coords = nycstations[,c('longitude','latitude')], 
                                      proj4string = CRS("+init=epsg:4326"), 
                                      data = nycstations)

# Reproject to the same projection as the streets layer
# (nycnet is only created below, so take the projection from nystreets)
nycstations <- spTransform(nycstations, nystreets@proj4string)

# Clip the streets layer to the area around the stations then 
# remove the full New York State dataset.
nycstreets <- gclip(nystreets, bbox(gBuffer(gEnvelope(nycstations, byid = FALSE),
                                            byid = FALSE,width=1000)))
rm(nystreets)

# Create a new network with the length parameter as the default weight
nycnet <- SpatialLinesNetwork(sl = nycstreets)

# Find the closest node to each station
nycstations@data$nodeid <- stplanr::find_network_nodes(
  nycnet, 
  nycstations@coords[,1], 
  nycstations@coords[,2]
)

# Query the database to count the number of trips between each pair of stations.
routetrips <- dbGetQuery(dbcon, "SELECT start_station_id, end_station_id, 
                         COUNT(*) as numtrips
                         FROM trips 
                         WHERE start_station_id <> end_station_id
                         GROUP BY start_station_id, end_station_id")

# Join the routetrips table to the nycstations layer to match the Node IDs
routetrips <- routetrips %>% 
  inner_join(
    nycstations@data %>%
      select(start_station_id = id, startnodeid = nodeid),
    by = "start_station_id"
  ) %>%
  inner_join(
    nycstations@data %>%
      select(end_station_id = id, endnodeid = nodeid),
    by = "end_station_id"
  ) %>%
  select(
    startnodeid,
    endnodeid,
    numtrips
  )

# Run the sum_network_links function to aggregate the number of trips
# on each part of the network.
# Note that since the default weight (length) has not been changed,
# this is the simple shortest path.
nycbicycleusage <- sum_network_links(nycnet, routetrips)

# Download and read in some layers to set the geographic context
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_36061_areawater.zip",
              destfile = "~/Downloads/citibikedata/nycountyareawater.zip")
download.file("https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_34017_areawater.zip",
              destfile = "~/Downloads/citibikedata/njcountyareawater.zip")
unzip("~/Downloads/citibikedata/nycountyareawater.zip", exdir = "~/Downloads/citibikedata/")
unzip("~/Downloads/citibikedata/njcountyareawater.zip", exdir = "~/Downloads/citibikedata/")
nywater <- readOGR("~/Downloads/citibikedata","tl_2016_36061_areawater")
njwater <- readOGR("~/Downloads/citibikedata","tl_2016_34017_areawater")
nywater <- spTransform(nywater, nycbicycleusage@proj4string)
njwater <- spTransform(njwater, nycbicycleusage@proj4string)

# Plot the water and routes layers
citimap <- tm_shape(nywater, is.master = FALSE) + 
  tm_fill(col="#000011") + 
tm_shape(njwater, is.master = FALSE) + 
  tm_fill(col="#000011") + 
tm_shape(nycbicycleusage, is.master=TRUE) + 
  tm_lines(col="numtrips", 
           lwd="numtrips", 
           title.col = "Number of trips",
           breaks = c(0,20000,40000,60000,80000,100000,Inf),
           legend.lwd.show = FALSE,
           scale = 2
          ) + 
  tm_layout(
    bg.color="black",
    legend.position = c("right","bottom"), 
    legend.bg.color = "white", 
    legend.bg.alpha = 0.5
  )

citimap  # draw the map

# Save resulting map.
save_tmap(citimap, filename = "citibikeexample.png")

Distances calculated with bike_distmat() seem incorrect

The distances between stations returned by bike_distmat() seem wrong for a large proportion of station pairs. For instance, with the code below I calculate the distances for the London stations and add a column with the Euclidean distance (using the raster package). The resulting plot shows that in many cases the Euclidean distance exceeds the distance calculated with bike_distmat(), which cannot be correct. Manual checks against Google Maps also show that some distances are simply too short.

Thank you in advance.

library(bikedata)
library(data.table)
library(raster)

#set up the bikedb
data_dir <- paste0(getwd(), "/data/Rbikedata")
bikedb <- file.path (data_dir,'testdb')
dl_bikedata (city ='London', data_dir=data_dir, dates = 201701:201702)
store_bikedata (data_dir = data_dir, bikedb = bikedb)

#load stations data
dtSs <- data.table(bike_stations(bikedb = bikedb))

#load distance matrix
dtDs <- data.table(bike_distmat(bikedb=bikedb, city="London", expand = 0, long = T, quiet = F))

#join distance matrix with start_station data
setkey(dtSs, "stn_id")
setkey(dtDs, "start_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name","longitude","latitude") , c("start_station_name","lon_start","lat_start") )

#join distance matrix with end_station data
setkey(dtDs, "end_station_id")
dtDs <- dtDs[dtSs[, .(stn_id, name, latitude, longitude)]]
setnames(dtDs, c("name","longitude","latitude") , c("end_station_name","lon_end","lat_end") )

#calculate euclidean distance using raster package
dtDs[, Dist_eucl:=pointDistance(dtDs[,.(lon=lon_start,lat=lat_start)], dtDs[,.(lon=lon_end,lat=lat_end)], lonlat = T)]

#select a subset for faster plotting
vSubset <- 1:(.05*nrow(dtDs))

#plot the euclidean against the over the network distance
plot(dtDs[vSubset, Dist_eucl], dtDs[vSubset, distance])

[image: bikedata_plot - Euclidean vs. network distances for London station pairs]

fix store_bikedata()

It doesn't actually do any checks for whether files are already in the database, and unzips any and all files regardless. This has to be fixed!
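
A sketch of the missing check, using the datafiles table that the database schema already includes (the file-name column and paths are assumptions):

library (RSQLite)
db <- dbConnect (SQLite (), "bikedb.sqlite")     # illustrative path
data_dir <- "~/Downloads/citibikedata"           # as in the vignette example above
stored <- dbGetQuery (db, "SELECT name FROM datafiles")$name  # column name assumed
files <- list.files (data_dir, pattern = "\\.zip$", full.names = TRUE)
files <- files [!basename (files) %in% stored]
# only the remaining files then need to be unzipped and read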

strtokm function

@richardellison dumping issues on you here. The strtokm function is great, but is actually no longer necessary, because the NYC data files were only double-quoted up until 201412, with quotes disappearing from 201501 onwards. At present, I just pass a delim option to read_one_line, which is either "," (a comma flanked by quote characters) or just a plain ,. Obviously we could - and should, hence this issue - just boost::replace_all (line, "\"", "") at the outset. This would then enable reversion to direct strtok rather than the multi-character strtokm, but strtok behaves slightly differently, and prevents the last token from being extracted in the neat way it currently is.

I implemented an ugly workaround by sticking an extra delim on the end of the line, but didn't commit that because there must be a better way. I'd really appreciate it if you could give it a try with std::strtok instead of strtokm and see if you can find a neat solution.

Add Deutsche Bahn Call A Bike (Germany)

"Call a Bike is a bike hire system run by Deutsche Bahn (DB) in several German cities." (Wikipedia).

Call a Bike's data is available on Deutsche Bahn's Open Data portal.

Just a few translations to find your way around:

German           English
Buchungen        bookings
Fahrzeugdaten    vehicle (data)
Stationen        stations / rental zones
Tarifklassen     tariff class (by vehicle category)
Herunterladen    download

According to the documentation, the data itself has English column names.

The data is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

data to add

Great article on the state of American bike-share systems here, with new systems including LA and Portland. Full list of systems (with direct links to data, and excluding London, UK):

  1. NYC
  2. Washington DC
  3. Chicago
  4. Boston
  5. LA
  6. Philadelphia
  7. Minneapolis/St Paul
  8. San Francisco Bay Area (issue here)
  9. bixi montreal (issue here)
  10. mibici Guadalajara (issue here)

Systems not yet part of this package which will hopefully be added:

  1. Ciudad de Mexico (issue here) - awaiting open data on station locations
  2. Vancouver Mobi (https://www.mobibikes.ca/en/system-data) (issue here) - awaiting open data on station locations

Additional systems that do not (yet?) provide data:

  1. Miami
  2. Portland
  3. Baltimore

Systems which have died an ungraceful death yet which still provide (historical) data:

  1. Seattle

get dl_bikedata to update database not just files

The current implementation of dl_bikedata() checks whether files exist and only downloads files that don't already exist. This should be extended to also skip files whose contents are already in the nominated SQLite3 database: data should be downloaded and added only if they do not already exist either as downloaded files or in the database.

Initially, try matching files to database entries by start dates alone, but first ensure that all files for any nominated time period contain only rides that start within that period.

enable specific months to be stored for london

The London data files are structured in a distinctly different way from all others, so simple grepping of data file names to download will not work. Implement a way to store data for particular months for London.
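
Since the London file names carry ranges like "22Mar2017-28Mar2017" (see the download issue below), one sketch is to map numeric months onto those month-name patterns before grepping:

# build grep patterns such as "Mar2017" from numeric dates such as 201703
london_date_patterns <- function (dates) {
    dates <- as.character (dates)
    mo <- as.integer (substr (dates, 5, 6))
    paste0 (month.abb [mo], substr (dates, 1, 4))
}
london_date_patterns (201703:201705)
# [1] "Mar2017" "Apr2017" "May2017"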

tests give variable numbers of trips

This keeps causing "unpredictable" Travis failures, because tests for total trips for the 6 cities vary by 1 or 2. Work out why and resolve this problem! Likely related to #13

call-a-bike data?

Thanks @kruse-alex for the heads up. These data are different to most, and seem restricted to Hamburg only - right, Alex? But it reads as if they intend to maintain ongoing releases on an annual basis, making them acceptable for incorporation in bikedata. Alex: I've put this issue here in my package, because all of the infrastructure for reading and storing is there, so it'll be the easiest way for us both to get the data into R. Please jump in and help! Note also that the repo should soon move to ropensci (pending review).

Can't use store_bikedata without specifying a city

Hello,

I am trying out your package, and I wanted to download and store the data for every city in a database outside the temporary directory, so that I can play with the data without having to redownload it each time.

I use the following code

bike_dt <- file.path('data/database', 'bikedata.sqlite')
store_bikedata(bikedb = bike_dt)
# enter 'yes' in the console 

But this fails with the error Error in store_bikedata(bikedb = bike_dt) : argument "city" is missing, with no default. If I take that line out of the function source code and retry, I get the same error.

This package looks nice and I am looking forward to digging into the data.

Thanks for your help,

Mathieu

add nabsa station names

The NABSA systems (Philly, LA) have all station coordinates in the trip data files, with station files containing only station numbers, names, dates, and operating status. These station files could at least be used to insert names into the station tables, which these systems currently lack. This would be a bit tricky only because the station data are inserted into the SQLite station table during reading of the raw data files, so it would require subsequent modification of the existing table.
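
A sketch of that subsequent modification, assuming the station files have been read into a data.frame stns with stn_id and name columns (names matching the columns returned by bike_stations()):

library (RSQLite)
db <- dbConnect (SQLite (), "bikedb.sqlite")  # illustrative path
# stns: station numbers and names read from the raw NABSA station files
for (i in seq_len (nrow (stns))) {
    dbExecute (db, "UPDATE stations SET name = ? WHERE stn_id = ?",
               params = list (stns$name [i], stns$stn_id [i]))
}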

Query construction in tripmat function

Is there a reason that the filter_tripmat_by_datetime function returns all the trips that match the query rather than just using a combination of a COUNT(*) and GROUP BY?

I also wonder if there would be value in switching to parameterised queries where possible (it isn't possible in all cases in this package)? SQL injection probably isn't an issue here, but it is standard practice and can make the code somewhat easier to read.
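
For illustration, a parameterised, aggregated version of the station-pair count used in the vignette example above (start_time is an assumed column name; bikedata's internals may differ):

library (RSQLite)
db <- dbConnect (SQLite (), "bikedb.sqlite")  # illustrative path
routetrips <- dbGetQuery (db,
    "SELECT start_station_id, end_station_id, COUNT(*) AS numtrips
     FROM trips
     WHERE start_time >= ? AND start_time < ?
     GROUP BY start_station_id, end_station_id",
    params = list ("2016-10-01", "2017-01-01"))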

sqlite3_exec() returns 1 not 0

@richardellison a plea for help here: Travis currently fails because the sqlite3_exec() statement at the end of rcpp_import_stn_df returns 1 on Travis, yet I get 0. I can reproduce the Travis failure in a trusty container, but not in xenial, so this must have something to do with sqlite3 versions? Any insight from your side very much appreciated!

These lines suffice to reproduce (in this case, for Chicago test data, but same result arises for London):

devtools::load_all (".", export_all=TRUE)
bikedb <- "junkdb"
data_dir <- "./tests"
rcpp_create_sqlite3_db (bikedb)
flists <- bike_unzip_files_chicago (data_dir, bikedb)
ch_stns <- bike_get_chicago_stations (flists)
head (ch_stns)
nstations <- rcpp_import_stn_df (bikedb, ch_stns, 'ch')

The query structure is absolutely okay and works fine on >= 16.04, so this really does just seem to be an internal SQLite3 thing, but one for which we really need to find a solution. Set-up of the stations table is here - could it be an issue with the UNIQUE statement? All other otherwise entirely equivalent sqlite3_exec() statements return 0 as expected, so that's the only real difference that jumps out at me.

Oh, and going the full _prepare_v2() -> _step() -> _reset() path also returns identical results (trusty = fail, xenial = pass).

ensure SQLite3 database has correct tables

Write a short boolean function to confirm that any bikedb argument references a database containing the stations, trips, and datafiles tables, and use it in all functions which have this argument.
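
A minimal sketch (bike_db_ok() is a name chosen here, not an existing function):

library (RSQLite)
bike_db_ok <- function (bikedb) {
    db <- dbConnect (SQLite (), bikedb)
    on.exit (dbDisconnect (db))
    all (c ("stations", "trips", "datafiles") %in% dbListTables (db))
}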

Error in curl::curl_fetch_disk() when using store_bikedata()

I'm looking to learn more about the bikedata package, and I was going through this vignette: https://ropensci.github.io/bikedata/

However, when I attempt the following: store_bikedata (city = 'nyc', bikedb = 'bikedb', dates = 201601:201603), I get the following error:

Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failed to open file C:\Users\PSTRAFC:\Users\PStraforelli\Documents1\AppData\Local\Temp\RtmpwZUPl6\201601-citibike-tripdata.zip.

I understand that this error may be unrelated to the bikedata package, but I was hoping I could at least get some pointers on how to debug this. I haven't been able to find anything via Google.

add station size to stations table

This is not necessarily straightforward, because some station tables are constructed straight from trip data, which doesn't have this info.

Trouble downloading London files for March to May 2017

I had no problem getting the data for London for the rest of 2017 (except after December 5 - I'm assuming those are not available yet?), but these files seem not to download correctly.

dl_bikedata("lo", paste0(getwd(),"/bikedata"), dates = 201703:201705)
#> Downloading 50 Journey Data Extract 22Mar2017-28Mar2017.csv
#> Downloading 51 Journey Data Extract 29Mar2017-04Apr2017.csv
#> Downloading 52 Journey Data Extract 05Apr2017-11Apr2017.csv
#> Downloading 55JourneyData Extract26Apr2017-02May2017.csv
#> Downloading 56JourneyDataExtract 03May2017-09May2017.csv

I'm getting the message that they downloaded, but those specific files do not appear in the directory. They stand out from the other files in that they have spaces - which is potentially what's causing the issue. Is there another way to acquire them?
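
If the spaces are indeed the problem, one workaround sketch is to percent-encode the URL before downloading; the base URL here is an assumption for illustration:

# URLencode() turns spaces into %20 before the request is made
f <- "55JourneyData Extract26Apr2017-02May2017.csv"
base_url <- "https://cycling.data.tfl.gov.uk/usage-stats/"  # assumed location
download.file (paste0 (base_url, URLencode (f)),
               destfile = file.path ("bikedata", f))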

add database in /inst/db for tests

It's probably better to replace the current approach to tests with the RSQLite approach of storing an SQLite database in /inst/db/. Tests could then still download and store data, but all actual tests of data extraction could just use this pre-stored database, and would give more reliable values. (Things such as numbers of stations in London will vary, but numbers of trips should then vary only slightly, and only at times at which corresponding stations are potentially closed.)

Ciudad Mexico

@Robinlovelace Can you please provide details of the Mexico City data you mentioned? It'd be great to incorporate that if possible.

return tibbles from all functions

Because both the stations table and the long = TRUE tripmat are tibbles by default, it's probably better to do this for all functions.
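
The change itself would be mechanical - a sketch, assuming an open connection db as in the examples above:

stations <- tibble::as_tibble (dbGetQuery (db, "SELECT * FROM stations"))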
