
DataScience_03_Project

Coursera Data Science Specialization Track class 03 Course Project


title: "Coursera Project Tidy Data" author: "David Parker" date: "Sunday, July 27, 2014" output: html_document

Objective:

To create a "Tidy" dataset from data published in an experiment using wearable smart devices. This is a "Data Science, Wearable Computing" experiment. The data from this experiment, titled "Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine" 1, is publicly avaible, see citations.

Process:

A single R script handles all the processing. It is located here. Script processing occurs in the following order.

Housekeeping:

In preparation to acquire the "Human Activity Recognition Using Smartphones Data Set", the script first checks to see if a copy is present in the working directory. If not, it downloads it from:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The script then unzips the data into the working directory.
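A minimal sketch of this housekeeping step is shown below; the local zip file name and the use of mode = "wb" are assumptions for illustration, not details taken from the script itself.

zipUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"  
zipFile <- "UCI_HAR_Dataset.zip"  # local file name is an assumption  
if (!file.exists(zipFile)) download.file(zipUrl, destfile = zipFile, mode = "wb")  
unzip(zipFile)  # extracts the "UCI HAR Dataset" folder into the working directory  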

Initial Data Loading:

Next, the script reads in the datasets in the following order (a sketch of these reads follows this list).
First, it gathers data common to both test and train:

  • Read activity_labels.txt into dataframe: xActivityLabels
  • Read features.txt labels into dataframe: xColNames (these are column names common to Train & Test)

Then it collects data for the test subjects:

  • Read x_test.txt variables into dataframe: xTestData (using the column names above)
  • Read y_test.txt activities into dataframe: yTestData
  • Read subject_test.txt subjects into dataframe: TestSubject

Next it collects data for the train subjects:

  • Read x_train.txt variables into dataframe: xTrainData (using the column names above)
  • Read y_train.txt activities into dataframe: yTrainData
  • Read subject_train.txt subjects into dataframe: TrainSubject
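The reads might look like the following sketch, assuming the standard "UCI HAR Dataset" folder layout produced by the unzip step; the exact paths and the use of col.names with the features.txt labels are assumptions.

xActivityLabels <- read.table("UCI HAR Dataset/activity_labels.txt")  
xColNames <- read.table("UCI HAR Dataset/features.txt", stringsAsFactors = FALSE)  
xTestData <- read.table("UCI HAR Dataset/test/X_test.txt", col.names = xColNames$V2)  
yTestData <- read.table("UCI HAR Dataset/test/y_test.txt")  
TestSubject <- read.table("UCI HAR Dataset/test/subject_test.txt")  
xTrainData <- read.table("UCI HAR Dataset/train/X_train.txt", col.names = xColNames$V2)  
yTrainData <- read.table("UCI HAR Dataset/train/y_train.txt")  
TrainSubject <- read.table("UCI HAR Dataset/train/subject_train.txt")  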

Now the steps for processing the data begin:

## Step 1
Merges the training and the test sets to create one data set.
Both the Test and Train datasets should contain the same number of rows.
Verify test data row counts.

nrow(xTestData)  # expect 2947  
identical(nrow(xTestData), nrow(yTestData))  # expect TRUE  
identical(nrow(xTestData), nrow(TestSubject))  # expect TRUE  

The 3 tables in test have the same number of rows, 2947.
Create one test data frame combining columns from xTestData, TestSubject, yTestData.

xTestData <- cbind(xTestData, TestSubject)   
xTestData <- cbind(xTestData, yTestData)   

Verify train data row counts

nrow(xTrainData)  # expect 7352  
identical(nrow(xTrainData), nrow(yTrainData))    # expect TRUE  
identical(nrow(xTrainData), nrow(TrainSubject))  # expect TRUE  

The 3 tables in train have the same number of rows, 7352.
Create one train data frame combining xTrainData, TrainSubject, yTrainData.

xTrainData <- cbind(xTrainData, TrainSubject)  
xTrainData <- cbind(xTrainData, yTrainData)  

Verify colnames prior to merging.

identical(colnames(xTestData), colnames(xTrainData))  # expect TRUE  

Merge all the data into one data frame.

xMergeData <- rbind(xTrainData, xTestData)  

Verify that the row count of the merged data equals the sum of Test + Train.

identical(nrow(xMergeData), (nrow(xTestData) + nrow(xTrainData) ) )  # expect TRUE  

## Step 2
Extracts only the measurements on the mean and standard deviation for each measurement.

Get all the column indices for standard deviation, mean, and the generic V1 / V1.1 columns added from the Subject and y data.
Note: columns with meanFreq in their names get selected here, pulled in with the other mean columns, but they will be dropped at the end of this step. It is easier that way.

xMeanStdCol <- grepl(".std|.mean|^V", colnames(xMergeData) )  
sum(xMeanStdCol)  # sum of filtered column indices | expect 81  

Extract the data using the filtered column indices into a new dataset: xMeanStdData.

xMeanStdData <- xMergeData[, xMeanStdCol]  

Now to drop columns with meanFreq in their names.

cnames <- colnames(xMeanStdData)  # grab column names  
xMeanFreqCol <- grepl(".Freq", cnames)  # return a logical vector of indices to drop  
sum(xMeanFreqCol)  # expect 13 | columns to drop  
colDrop <- cnames[xMeanFreqCol]  # returns a vector with names to be dropped  

Recreate data frame by subsetting on column names NOT to be dropped.

xMeanStdData <- xMeanStdData[, !(colnames(xMeanStdData) %in% colDrop)]  

Verify the tidier dataset.

ncol(xMeanStdData)  # expect 68 columns  
nrow(xMeanStdData)  # expect 10299  
colnames(xMeanStdData)  

Examine sample data head & tail | first & last columns.

head(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68)])  
tail(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68)])  

## Step 3
Use descriptive activity names to name the activities in the data set.

colnames(xActivityLabels)  # review Activity Labels column names for merging  

Join the merged activity-number column (from the y data) with xActivityLabels to attach the activity names.

xMeanStdData <- merge(xMeanStdData, xActivityLabels, by.x = "V1.1", by.y = "V1", all = TRUE)  
ncol(xMeanStdData)  # expect 69 (1 new column)  
colnames(xMeanStdData)  

Examine sample data head & tail | first & last columns.

head(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68, 69)])  
tail(xMeanStdData[, c(1, 2, 3, 4, 5, 6, 66, 67, 68, 69)])  

We no longer need V1.1, as the textual activity name has been merged in for each corresponding activity number.

xMeanStdData$V1.1 <- NULL  
ncol(xMeanStdData)  # expect 68 - we basically replaced Activity number with its name  

## Step 4
Appropriately labels the data set with descriptive variable names.

All but the last 2 column labels already carry descriptive names, read in via read.table with the features.txt labels (xColNames) applied to the X data files. The activity names come from xActivityLabels.

colnames(xMeanStdData)  # review column names  

Rename the generic V & V2 to appropriate Subject and Activity labels.

colnames(xMeanStdData)[67] <- "Subject.ID"  
colnames(xMeanStdData)[68] <- "Activity"  

Convert Subject.ID to numeric for proper sorting in Step 5.

is.numeric(as.numeric(xMeanStdData$Subject.ID)) # test validity | expect TRUE  
xMeanStdData$Subject.ID <- as.numeric(as.character(xMeanStdData$Subject.ID))  

Reorder the columns, placing Subject.ID and Activity 1st.
These are not measured variables; they are the grouping variables the measured variables will be summarized by in Step 5.

xMeanStdData <- xMeanStdData[,c(67,68,1:66)]  
colnames(xMeanStdData)  # observe new column arrangement  

## Step 5
Creates a second, independent tidy data set with the average of each variable for each Activity and each Subject.
Utilize the ddply function from the plyr library. It accomplishes 2 tasks:

  1. It breaks down the tidy dataset xMeanStdData created above into groups by Subject.ID and Activity.
  2. It summarizes the 66 measured mean and standard-deviation variables by way of an anonymous function that takes the column means of those variables.

The resulting new dataframe TidyDataSet is sorted by Subject.ID and Activity.

library(plyr)  
TidyDataSet <- ddply(xMeanStdData, c("Subject.ID", "Activity"),  
                     function(x) colMeans(x[c(colnames(xMeanStdData)[3:68])]))  

Examine the resulting tidy dataset.

dim(TidyDataSet)  
# [1] 180  68  
head(TidyDataSet,12)  

Write TidyDataSet to TidyData.txt into the working directory.

write.table(TidyDataSet, file = "TidyData.txt", sep = ",", eol = "\r", row.names = FALSE, col.names = TRUE)  
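Given the write.table options above, the tidy file can be read back in for a quick check; this read-back snippet is illustrative only and is not part of the project script.

check <- read.table("TidyData.txt", header = TRUE, sep = ",")  
dim(check)  # expect 180 68  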

Retrieve the file info.

file.info("TidyData.txt")  

TidyData.txt size is 224153 bytes.

Write xMeanStdData to MeanStdData.txt in the working directory.

write.table(xMeanStdData, file = "MeanStdData.txt", sep = ",", eol = "\r", row.names = FALSE, col.names = TRUE)  

Tidy Data processing is complete.

The Tidy Dataset TidyData.txt is located in the working directory of the script.

The file can be retrieved here: TidyData.txt

The Code Book for the data is located in the working directory of the script.

The file can be accessed here: codebook.Rmd


Citations:
1 Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.
