Minjie XU's Projects

bluebikes-project

1> Background information

Bluebikes is Metro Boston’s public bike share program, with more than 1,800 bikes at over 200 stations across Boston and nearby areas. The program launched in 2011 and is intended for short-term, paid use. Riders can borrow a bike from one docking station and return it to another after use, which makes it ideal for one-way trips. The City of Boston is committed to providing bike share as part of the public transportation system. However, to build a transport system that encourages bicycling, it is important to understand current bicycle flows and the factors that influence potential riders' decisions about whether to cycle. It is reasonable to hypothesize that age, gender, bicycle infrastructure, and perceived safety are possible determinants of bicycling. In the short term, weather has been shown to play an important role in the decision to cycle.

2> Data collection

Bluebikes collects system data and makes it available to the public. The datasets used in the project can be downloaded from https://www.bluebikes.com/system-data. This time series dataset (from 2017-01-01 00:00:00 to 2019-03-31 23:00:00) includes: trip duration, start time and date, stop time and date, start station name and id, end station name and id, bike id, user type (casual or subscribed), birth year, and gender. Trips shorter than 60 seconds are considered potential false starts and have already been removed from the datasets. The number of bicycles in use during a particular period varies over time with several factors, including current weather conditions, time of day, time of year, and riders' current interest in using the bicycle as a mode of transport.
The current interest differs between subscribed and casual users, so we analyze the two groups separately. Factors such as season, day of the week, month, hour, and whether the day is a holiday can be extracted from the date and time columns in the datasets. Because we analyze the hourly bicycle rental flow, we need hourly weather data from 2017-01-01 00:00:00 to 2019-03-31 23:00:00 to complete our regression model. The weather data used in the project was scraped with Python Selenium from the Logan airport station (42.38 °N, 71.04 °W) page (https://www.wunderground.com/history/daily/us/ma/boston/KBOS/date/2019-7-15) maintained by Weather Underground. The hourly weather observations include time, temperature, dew point, humidity, wind, wind speed, wind gust, pressure, precipitation, accumulated precipitation, and condition.

3> The problem

The project aims to gain insight into the factors that explain short-term variation in bicycle flows in Boston. It also aims to investigate how busy each station is, the split of trip directions and the trip durations at a busy station, and how mean flows vary within a day or over the study period. In addition to the factors included in the regression model, other factors influence how bicycle flows vary over longer periods of time, for example the general tendency to use the bicycle. There is therefore potential to improve the accuracy of the regression model by incorporating a long-term trend estimate taken over the time series of bicycle usage. The results of the machine-learning regression model should then be compared with those of time series forecasting models.
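The calendar factors and the hourly flows per user group described under data collection can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names, the `HOLIDAYS` set (two example dates only), the meteorological season mapping, the user type labels `"subscribed"` and `"casual"`, and the sample trips are all assumptions made for the example.

```python
from collections import Counter
from datetime import date, datetime

# Illustrative holiday list only; a real analysis would need a complete calendar.
HOLIDAYS = {date(2017, 1, 1), date(2017, 7, 4)}

# Assumed meteorological seasons (December to February is winter, and so on).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def calendar_features(ts):
    """Derive calendar factors from a timestamp."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
        "month": ts.month,
        "season": SEASON_BY_MONTH[ts.month],
        "is_holiday": ts.date() in HOLIDAYS,
    }


def hourly_counts(trips):
    """Count trips per (start of hour, user type) so each group gets its own series."""
    counts = Counter()
    for start_time, user_type in trips:
        hour_start = start_time.replace(minute=0, second=0, microsecond=0)
        counts[(hour_start, user_type)] += 1
    return counts


# Made-up trips: (start time, user type).
trips = [
    (datetime(2017, 7, 4, 8, 15), "subscribed"),
    (datetime(2017, 7, 4, 8, 50), "subscribed"),
    (datetime(2017, 7, 4, 8, 5), "casual"),
    (datetime(2017, 7, 4, 9, 30), "casual"),
]

for (hour_start, user_type), n in sorted(hourly_counts(trips).items()):
    print(hour_start, user_type, n, calendar_features(hour_start))
```

Keying the counts by the start of the hour gives a value that hourly weather observations could be joined on, once they are rounded to the same hour.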
4> Possible solutions

Data preprocessing, exploration, and variable selection: date approximation and manipulation, correlation analysis among variables, merging data, removing duplicate records, checking for errors, interpolating missing values, handling outliers and skewness, binning low-frequency levels, and encoding categorical variables.

Data visualization: split bike usage counts by subscribed and casual users to build time series; build a heatmap to show how busy each station is and to locate the busiest station in the busiest period of a busy day; use boxplots and histograms to check for outliers and choose an appropriate data transformation; and build a word cloud from the weather condition text.

Time series trend curve estimates: we considered two approaches, either fitting polynomials of various degrees to the data points in the time series, or using time series decomposition and forecast functions to extract and forecast the trend. We emphasize that the trend curve estimates should not follow the seasonal variations: those should be captured explicitly by the weather-related input variables in the regression model.

Prediction, regression, and time series forecasting: a multilayer perceptron neural network regressor could be built to predict from all date, time, and weather variables. However, for the sake of interpretability, we prefer regression models based on machine learning algorithms (such as random forest or SVM), built separately for subscribed and casual users. The regressor would then be combined with the trend curve extracted and forecast by ARIMA, and the result compared with time series forecasts from STL (Seasonal and Trend decomposition using Loess) with multiple seasonal periods and from TBATS (Trigonometric Seasonal, Box-Cox Transformation, ARMA residuals, Trend and Seasonality).
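The first trend option mentioned above, fitting polynomials of various degrees, can be sketched with ordinary least squares. This is a minimal pure-Python illustration on synthetic hourly counts. The function names and the data are hypothetical, and the sketch does not reproduce the ARIMA, STL, or TBATS models named in the text.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (M[r][m] - tail) / M[r][r]
    return x


def fit_polynomial_trend(values, degree):
    """Least-squares polynomial fit; returns the fitted trend at each time step."""
    n = len(values)
    # Scale time to [0, 1] to keep the normal equations well conditioned.
    ts = [i / (n - 1) for i in range(n)]
    m = degree + 1
    # Normal equations: A[j][k] = sum of t^(j+k), b[j] = sum of y * t^j.
    A = [[sum(t ** (j + k) for t in ts) for k in range(m)] for j in range(m)]
    b = [sum(y * t ** j for t, y in zip(ts, values)) for j in range(m)]
    coeffs = solve_linear_system(A, b)
    return [sum(c * t ** k for k, c in enumerate(coeffs)) for t in ts]


# Synthetic two-week hourly series: slow growth plus a daily cycle.
counts = [100 + 0.5 * h + 30 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]
trend = fit_polynomial_trend(counts, degree=1)
detrended = [c - t for c, t in zip(counts, trend)]
print(round(trend[0], 1), round(trend[-1], 1))
```

Keeping the degree low stops the fitted curve from following the daily cycle, which matches the requirement that seasonal variation be left to the regression inputs.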

boston_crime

Public safety is essential to general well-being and happiness, and a city's safety can be a decisive factor in choosing where to study, work, and live. Boston's violent crime rate in 2016 was reported to be 12.27% higher than the national average [1]. As international students pursuing master's degrees in Boston, we would like to know how safe the city is, and which kinds of crime occur in our neighborhoods and how often. The objective of this project is therefore to study the distribution of crime incidents over the most recent five years and, if possible, to find patterns or make predictions that help address crime. By providing a model that identifies the main crime hotspots and the type, location, and time of crimes, we want to raise people's awareness of dangerous areas at certain times. Police forces could also use the results to improve the accuracy of crime prediction and allocate their resources more efficiently.

The crime incident report dataset is provided by the Boston Police Department and contains records that capture the type of incident as well as when and where it occurred. The records cover 2015-06-15 to 2019-11-20. The dataset has 440,606 rows, with 7 numerical variables (offense_code, reporting_area, year, month, hour, lat, long) and 10 categorical variables (incident_number, offense_code_group, offense_description, district, shooting, occurred_on_date, day_of_week, ucr_part, street, location). Of these, offense_code_group, reporting_area, ucr_part, incident_number, and location are outside our focus, so we removed them before the analysis.

In this project, we reviewed this large dataset and extracted and plotted information such as whether a shooting was involved and the time, location, and type of each crime, to help people discover crime patterns in Boston.
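The column removal and a simple hotspot count can be sketched as follows. This is a minimal standard-library illustration, not the project's model. The column names come from the dataset description above, while the function names and the sample records are made up for the example.

```python
from collections import Counter

# Columns the project removed before the analysis.
DROPPED_COLUMNS = {
    "offense_code_group", "reporting_area", "ucr_part",
    "incident_number", "location",
}


def drop_unused_columns(record):
    """Return a copy of one incident record without the dropped columns."""
    return {k: v for k, v in record.items() if k not in DROPPED_COLUMNS}


def hotspot_counts(records):
    """Count incidents per (district, day of week, hour)."""
    return Counter((r["district"], r["day_of_week"], r["hour"]) for r in records)


# Made-up sample records; real rows carry all 17 variables.
raw_records = [
    {"incident_number": "X1", "district": "B2", "day_of_week": "Friday",
     "hour": 22, "ucr_part": "Part One"},
    {"incident_number": "X2", "district": "B2", "day_of_week": "Friday",
     "hour": 22, "ucr_part": "Part Two"},
    {"incident_number": "X3", "district": "D4", "day_of_week": "Monday",
     "hour": 9, "ucr_part": "Part Three"},
]

records = [drop_unused_columns(r) for r in raw_records]
for key, n in hotspot_counts(records).most_common(2):
    print(key, n)
```

Counting by district, day, and hour is only a descriptive summary. It shows where and when incidents concentrate but makes no prediction.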

h1b_analysis

This project studies the geographic distribution of H1B visa petitions, the distribution of popular job roles and companies, and time trends in applications. It aims to gain insight into the employment situation of international students and the characteristics of popular occupations, and it analyzes how city population and position wage affect H1B visa petitions.

imbalanced-dataset-analysis

In this project we address classification on a class-imbalanced dataset: predicting whether a client will subscribe to a bank term deposit. We use several machine learning and deep learning algorithms (random forest, logistic regression, linear support vector machine, and multi-layer perceptron neural network) to perform the classification. Combining these algorithms with four rebalancing methods, we compare the precision, recall, 10-fold cross-validation scores, and ROC AUC of each combination. We also discuss why the rebalancing methods behave differently across algorithms. LinearSVC performs best when both processing time and performance metrics are considered: with random undersampling, it achieves a recall (sensitivity) of 0.67 and a precision of 0.27 on class “yes”, with an AUC of 0.72. Finally, we took logistic regression as an example to build a cost-based model, using cross-validation to find the best cost rate for class “no” (with the cost for class “yes” fixed at 1). We also compared it with a model tuned for F1 score using the best class_weight parameter.
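Random undersampling and the precision and recall metrics mentioned above can be sketched in plain Python. This is a minimal illustration under assumptions: the function names, the seed, and the toy labels are hypothetical, and it does not reproduce the project's classifiers or its reported scores.

```python
import random


def random_undersample(rows, labels, majority_label, seed=0):
    """Keep every minority row and an equally sized random sample of majority rows."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y != majority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y == majority_label]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    balanced = minority + kept
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [y for _, y in balanced]


def precision_recall(y_true, y_pred, positive="yes"):
    """Precision and recall for the positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy imbalanced data: 10 "yes" clients and 90 "no" clients, rows are just ids.
sample_rows = list(range(100))
sample_labels = ["yes"] * 10 + ["no"] * 90

balanced_rows, balanced_labels = random_undersample(
    sample_rows, sample_labels, majority_label="no"
)
print(len(balanced_rows), balanced_labels.count("yes"), balanced_labels.count("no"))
```

Undersampling should be applied to the training split only, so that precision and recall are still measured on data with the original class balance.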
