
DataScience_03_Project

Coursera Data Science Specialization Track class 03 Course Project


title: "Coursera Project Tidy Data" author: "David Parker" date: "Sunday, July 27, 2014" output: html_document

Objective:

To create a "Tidy" dataset from data published in an experiment using wearable smart devices. This is a "Data Science, Wearable Computing" experiment. The data from this experiment, titled "Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine" 1, is publicly avaible, see citations.

Process:

A single R script handles all the processing. It is located here. Script processing occurs in the following order.

Housekeeping:

In preparation to acquire the "Human Activity Recognition Using Smartphones Data Set", the script first checks to see if a copy is present in the working directory. If not, it downloads it from:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The script then unzips the data into the working directory.
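A minimal sketch of this housekeeping step is shown below; the local zip file name and the use of mode = "wb" are assumptions for illustration, not details taken from the script itself.

zipUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"  
zipFile <- "UCI_HAR_Dataset.zip"  # local file name is an assumption  
if (!file.exists(zipFile)) download.file(zipUrl, destfile = zipFile, mode = "wb")  
unzip(zipFile)  # extracts the "UCI HAR Dataset" folder into the working directory  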

Initial Data Loading:

Next, the script reads in the datasets in the following order (a sketch of these reads follows this list).
First, it gathers data common to both test and train:

  • Read activity_labels.txt into dataframe: xActivityLabels
  • Read features.txt labels into dataframe: xColNames (these are column names common to Train & Test)

Then it collects data for the test subjects:

  • Read x_test.txt variables into dataframe: xTestData (using the column names above)
  • Read y_test.txt activities into dataframe: yTestData
  • Read subject_test.txt subjects into dataframe: TestSubject

Next it collects data for the train subjects:

  • Read x_train.txt variables into dataframe: xTrainData (using the column names above)
  • Read y_train.txt activities into dataframe: yTrainData
  • Read subject_train.txt subjects into dataframe: TrainSubject
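The reads might look like the following sketch, assuming the standard "UCI HAR Dataset" folder layout produced by the unzip step; the exact paths and the use of col.names with the features.txt labels are assumptions.

xActivityLabels <- read.table("UCI HAR Dataset/activity_labels.txt")  
xColNames <- read.table("UCI HAR Dataset/features.txt", stringsAsFactors = FALSE)  
xTestData <- read.table("UCI HAR Dataset/test/X_test.txt", col.names = xColNames$V2)  
yTestData <- read.table("UCI HAR Dataset/test/y_test.txt")  
TestSubject <- read.table("UCI HAR Dataset/test/subject_test.txt")  
xTrainData <- read.table("UCI HAR Dataset/train/X_train.txt", col.names = xColNames$V2)  
yTrainData <- read.table("UCI HAR Dataset/train/y_train.txt")  
TrainSubject <- read.table("UCI HAR Dataset/train/subject_train.txt")  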

Now the steps for processing the data begin:

## Step 1
Merges the training and the test sets to create one data set.
Both the Test and Train datasets should contain the same number of rows.
Verify test data row counts.

nrow(xTestData)  # expect 2947  
identical(nrow(xTestData), nrow(yTestData))  # expect TRUE  
identical(nrow(xTestData), nrow(TestSubject))  # expect TRUE  

The 3 tables in test have the same number of rows, 2947.
Create one test data frame combining columns from xTestData, TestSubject, yTestData.

xTestData <- cbind(xTestData, TestSubject)   
xTestData <- cbind(xTestData, yTestData)   

Verify train data row counts

nrow(xTrainData)  # expect 7352  
identical(nrow(xTrainData), nrow(yTrainData))    # expect TRUE  
identical(nrow(xTrainData), nrow(TrainSubject))  # expect TRUE  

The 3 tables in train have the same number of rows, 7352.
Create one train data frame combining xTrainData, TrainSubject, yTrainData.

xTrainData <- cbind(xTrainData, TrainSubject)  
xTrainData <- cbind(xTrainData, yTrainData)  

Verify colnames prior to merging.

identical(colnames(xTestData), colnames(xTrainData))  # expect TRUE  

Merge all the data into one data frame.

xMergeData <- rbind(xTrainData, xTestData)  

Verify that the row count of the merged data equals the sum of Test + Train.

identical(nrow(xMergeData), (nrow(xTestData) + nrow(xTrainData) ) )  # expect TRUE  

## Step 2
Extracts only the measurements on the mean and standard deviation for each measurement.

Get all the column indices for standard deviation, mean, and the generic V1 / V1.1 columns added from the Subject and y data.
Note: columns with meanFreq in their names get selected here, pulled in with the other mean columns, but they will be dropped at the end of this step. It is easier that way.

xMeanStdCol <- grepl(".std|.mean|^V", colnames(xMergeData) )  
sum(xMeanStdCol)  # sum of filtered column indices | expect 81  

Extract the data using the filtered column indices into a new dataset: xMeanStdData.

xMeanStdData <- xMergeData[, xMeanStdCol]  

Now to drop columns with meanFreq in their names.

cnames <- colnames(xMeanStdData)  # grab column names  
xMeanFreqCol <- grepl(".Freq", cnames)  # return a logical vector of indices to drop  
sum(xMeanFreqCol)  # expect 13 | columns to drop  
colDrop <- cnames[xMeanFreqCol]  # returns a vector with names to be dropped  

Recreate data frame by subsetting on column names NOT to be dropped.

xMeanStdData <- xMeanStdData[, !(colnames(xMeanStdData) %in% colDrop)]  

Verify the tidier dataset.

ncol(xMeanStdData)  # expect 68 columns  
nrow(xMeanStdData)  # expect 10299  
colnames(xMeanStdData)  

Examine sample data head & tail | first & last columns.

head(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68)])  
tail(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68)])  

## Step 3
Use descriptive activity names to name the activities in the data set.

colnames(xActivityLabels)  # review Activity Labels column names for merging  

Join the merged activity-number column (from the y data) with xActivityLabels to attach the activity names.

xMeanStdData <- merge(xMeanStdData, xActivityLabels, by.x = "V1.1", by.y = "V1", all = TRUE)  
ncol(xMeanStdData)  # expect 69 (1 new column)  
colnames(xMeanStdData)  

Examine sample data head & tail | first & last columns.

head(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68, 69)])  
tail(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68, 69)])  

We no longer need V1.1, as the textual activity name has been merged in for each corresponding activity number.

xMeanStdData$V1.1 <- NULL  
ncol(xMeanStdData)  # expect 68 - we basically replaced Activity number with its name  

## Step 4
Appropriately labels the data set with descriptive variable names.

All but the last 2 column labels already carry descriptive names, read in via read.table with the features.txt labels (xColNames) applied to the X data files. The activity names come from xActivityLabels.

colnames(xMeanStdData)  # review column names  

Rename the generic V & V2 to appropriate Subject and Activity labels.

colnames(xMeanStdData)[67] <- "Subject.ID"  
colnames(xMeanStdData)[68] <- "Activity"  

Convert Subject.ID to numeric for proper sorting in Step 5.

is.numeric(as.numeric(xMeanStdData$Subject.ID)) # test validity | expect TRUE  
xMeanStdData$Subject.ID <- as.numeric(as.character(xMeanStdData$Subject.ID))  

Reorder the columns, placing Subject.ID and Activity 1st.
These are not measured variables; they are the grouping variables the measured variables will be summarized by in Step 5.

xMeanStdData <- xMeanStdData[,c(67,68,1:66)]  
colnames(xMeanStdData)  # observe new column arrangement  

## Step 5
Creates a second, independent tidy data set with the average of each variable for each Activity and each Subject.
Utilize the ddply function from the plyr library. It accomplishes 2 tasks:

  1. It breaks down the tidy dataset xMeanStdData created above into groups by Subject.ID and Activity.
  2. It summarizes the 66 measured mean and standard-deviation variables by way of an anonymous function that takes the column means of those variables.

The resulting new dataframe TidyDataSet is sorted by Subject.ID and Activity.

library(plyr)  
TidyDataSet <- ddply(xMeanStdData, c("Subject.ID", "Activity"),  
                     function(x) colMeans(x[c(colnames(xMeanStdData)[3:68])]))  

Examine the resulting tidy dataset.

dim(TidyDataSet)  
# [1] 180  68  
head(TidyDataSet,12)  

Write TidyDataSet to TidyData.txt into the working directory.

write.table(TidyDataSet, file = "TidyData.txt", sep = ",", eol = "\r", row.names = FALSE, col.names = TRUE)  
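Given the write.table options above, the tidy file can be read back in for a quick check; this read-back snippet is illustrative only and is not part of the project script.

check <- read.table("TidyData.txt", header = TRUE, sep = ",")  
dim(check)  # expect 180 68  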

Retrieve the file info.

file.info("TidyData.txt")  

TidyData.txt size is 224153 bytes.

Write xMeanStdData to MeanStdData.txt in the working directory.

write.table(xMeanStdData, file = "MeanStdData.txt", sep = ",", eol = "\r", row.names = FALSE, col.names = TRUE)  

Tidy Data processing is complete.

The Tidy Dataset TidyData.txt is located in the working directory of the script.

The file can be retrieved here: TidyData.txt

The Code Book for the data is located in the working directory of the script.

The file can be accessed here: codebook.Rmd


Citations:
1 Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.
