Concepts and tools needed throughout the entire data science pipeline, from asking the right kinds of questions to making inferences and publishing results.
Includes:
- Practical application of statistical computing: reading data into R, using R packages, writing R functions, and debugging, profiling, organizing, and commenting R code.
- Basic data cleaning of an activity recognition dataset collected from 30 subjects wearing waist-mounted smartphone sensors. Includes R code to load and tidy the raw data, with the processing steps formalized in a markdown-based codebook.
- Exploratory analysis techniques in R for summarizing data, including multivariate statistical techniques and plotting systems for summarizing high-dimensional data.
- Use of R tools to produce data analyses in markdown documents, with a focus on easily reproducible results. The R Markdown source weaves live R code into the document via knitr and related tools.
- Collection of R scripts employing fundamentals of statistical inference, spanning the frequentist, Bayesian, and likelihood paradigms.
- Regression analysis of a collection of cars, exploring the relationship between car features and fuel consumption. Includes special cases of the regression model, ANOVA and ANCOVA with dummy variables, multivariable adjustment, residuals, and variability.
- Application of machine learning algorithms (decision trees, random forests, and generalized boosted regression) in R to explore personal activity data and predict the manner in which individuals performed particular exercises.
- A simple yet scalable web application built with Shiny, R packages, and interactive graphics, focused on automating statistical inference on a dataset of Titanic passengers.
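As an illustrative sketch only (not the project's actual code), the regression analysis above can be reproduced with R's built-in `mtcars` data, which is the same 1974 Motor Trend dataset: model fuel efficiency on weight, treat transmission type as a dummy variable, and compare nested models with ANOVA.

```r
# Hedged sketch of the regression described above, using the built-in
# mtcars data (1974 Motor Trend US magazine, 32 automobiles).
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

# mpg on weight, with transmission type entering as a dummy variable
fit <- lm(mpg ~ wt + am, data = mtcars)
summary(fit)$coefficients

# Nested-model ANOVA: does transmission add anything beyond weight?
fit0 <- lm(mpg ~ wt, data = mtcars)
anova(fit0, fit)
```

A notable result of this adjustment: the manual-transmission coefficient is near zero once weight is in the model, so the apparent mpg advantage of manual cars is largely explained by their lower weight.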
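Similarly, the inference scripts can be sketched with R's built-in `ToothGrowth` data (the guinea-pig vitamin C study listed among the datasets below); this is a hedged illustration, not the project's actual code:

```r
# Hedged sketch of basic statistical inference on the built-in
# ToothGrowth data: odontoblast (tooth cell) length by delivery method.
data(ToothGrowth)

# Welch two-sample t-test: orange juice (OJ) vs ascorbic acid (VC)
tt <- t.test(len ~ supp, data = ToothGrowth)
tt$conf.int   # 95% confidence interval for the mean difference
tt$p.value    # not significant at the 5% level

# Dose-response: mean length at each vitamin C dose level
aggregate(len ~ dose, data = ToothGrowth, FUN = mean)
```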
Datasets:
- Air pollution monitoring data at 332 locations in the US. [link]
- Patient quality of care statistics for over 4,000 US hospitals from the Medicare.gov Hospital Compare service. [link]
- Activity recognition data set built from the recordings of 30 subjects performing basic activities and postural transitions while carrying a waist-mounted smartphone with embedded inertial sensors, from the UCI Machine Learning Repository. [link]
- Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years, from the UCI Machine Learning Repository. [link]
- Fine particulate matter (PM2.5) air pollutant data for the US for the period of 1999-2008, from the EPA National Emissions Inventory. [link]
- Step counts from a personal activity monitoring device worn by an anonymous individual between Oct-Nov 2014. [link]
- Data from the 'Storm Data' publication of the National Oceanic and Atmospheric Administration (NOAA) for the period of 1950-2011. [link]
- The response in the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs who received one of three dose levels of vitamin C by one of two delivery methods. [link]
- Data extracted from the 1974 Motor Trend US magazine, comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). [link]
- Weight lifting exercise data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. [link]
- Data on passengers (age, gender, fare, cabin, etc.) who were onboard the Titanic. [link]