Course Project: Getting and Cleaning Data

Introduction

The run_analysis.R script performs steps 1-5 of the course project.

The script has one section per step, with comments in the code explaining each section's purpose and actions.

Directory structure

The run_analysis.R script assumes it is located in the parent directory of the UCI HAR Dataset directory (the directory that contains the test and train sub-directories as well as related files).

The expected directory structure is as follows:

\run_analysis.R
\UCI HAR Dataset\*.*
\UCI HAR Dataset\test\*.*
\UCI HAR Dataset\train\*.*
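A minimal sketch of how this layout could be checked before running the script, assuming the working directory is the one that holds run_analysis.R. The variable names (data_dir, expected, missing_dirs) are illustrative and do not come from the script itself.

```r
# Assumption: the working directory is the one holding run_analysis.R.
data_dir <- "UCI HAR Dataset"

# file.path() joins path parts with "/", which R accepts on every platform.
expected <- c(
  file.path(data_dir, "test"),
  file.path(data_dir, "train")
)

# dir.exists() returns one TRUE/FALSE per path.
missing_dirs <- expected[!dir.exists(expected)]
if (length(missing_dirs) > 0) {
  message("Missing directories: ", paste(missing_dirs, collapse = ", "))
}
```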

Library dependencies

  • dplyr

Step-by-step Description of Script

Step 1: Merges the training and the test sets to create one data set

  1. Load the train and test data into data tables
    • 3 files are loaded for each:
      • The data itself (train / test)
      • The Labels (i.e. the Activity codes, numeric range 1:6)
      • The Subjects (i.e. test subject identifiers, as numeric with range 1:30)
  2. Bind the Labels and Subject columns as new columns of the train and test sets (respectively)
  3. Bind the rows of the test and train data sets into a single data set, appropriately named allData
  4. Load the descriptive header names and assign them to the allData data set
    • (this was required in step 4, but I decided to do it in this step since it feels more aesthetic to have proper column names on a data set)
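The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: small hand-made data frames stand in for the files that the real script reads from disk, there are 3 measurement columns instead of 561, and every name other than allData is hypothetical.

```r
# Toy stand-ins for the tables loaded from disk in the real script
# (3 measurement columns here instead of 561).
trainData     <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4), V3 = c(0.5, 0.6))
trainLabels   <- data.frame(V1 = c(1, 2))   # activity codes
trainSubjects <- data.frame(V1 = c(1, 3))   # subject identifiers

testData      <- data.frame(V1 = 0.7, V2 = 0.8, V3 = 0.9)
testLabels    <- data.frame(V1 = 5)
testSubjects  <- data.frame(V1 = 2)

# Append the Activity and Subject columns to each set.
train <- cbind(trainData, Activity = trainLabels$V1, Subject = trainSubjects$V1)
test  <- cbind(testData,  Activity = testLabels$V1,  Subject = testSubjects$V1)

# Stack the two sets into a single data set.
allData <- rbind(train, test)

# Assign descriptive header names (hypothetical names for the toy columns).
featureNames <- c("feature_a", "feature_b", "feature_c")
names(allData) <- c(featureNames, "Activity", "Subject")
```

With the real data the same pattern leaves the measurements in columns 1:561 and the two added columns at the end.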

Step 2: Extracts only the measurements on the mean and standard deviation for each measurement

I used the apply function to calculate the mean and sd of all measurement columns:

> apply(allData[1:561], 2, mean)  
> apply(allData[1:561], 2, sd)

I restricted the column range to 1:561 since I added (in step 1) two columns (562:563 for Activity and Subject info).

The result of this step is that the mean and standard deviation of every measurement column is printed to the console.
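As a small illustration of the same apply calls, here is a sketch on a toy data frame with two measurement columns followed by the Activity and Subject columns. The column names and values are made up for the example; in the real data set the measurement range is 1:561.

```r
# Toy data set: two measurement columns plus the two added columns.
allData <- data.frame(
  m1       = c(1, 2, 3),
  m2       = c(2, 4, 6),
  Activity = c(1, 2, 1),
  Subject  = c(1, 1, 2)
)

# Restrict to the measurement columns (1:561 in the real data set).
measurementCols <- 1:2

# MARGIN = 2 applies the function to each column.
measurementMeans <- apply(allData[measurementCols], 2, mean)
measurementSds   <- apply(allData[measurementCols], 2, sd)

print(measurementMeans)
print(measurementSds)
```

Each result is a named numeric vector with one entry per measurement column.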

Step 3: Uses descriptive activity names to name the activities in the data set

In this step I add descriptive values for the activities by mapping the activity codes 1:6 to activity names such as "WALKING" and "SITTING".

This is done using the dplyr library's inner_join function.

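The inner join is performed on the column name the two sets share ("Activity"). Because each activity code matches exactly one row of the activity-name table, the joined set keeps the same number of rows as the original data set.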
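A minimal sketch of such a join on toy data, assuming a two-column lookup table. The names activityNames and ActivityName are hypothetical, and only two of the six activity codes are included to keep the example short.

```r
library(dplyr)

# Toy data set with numeric activity codes.
allData <- data.frame(
  m1       = c(0.1, 0.2, 0.3),
  Activity = c(1, 4, 1)
)

# Partial lookup table mapping codes to names (hypothetical column name).
activityNames <- data.frame(
  Activity     = c(1, 4),
  ActivityName = c("WALKING", "SITTING"),
  stringsAsFactors = FALSE
)

# Join on the shared "Activity" column; each code has exactly one match,
# so the result has as many rows as allData.
joined <- inner_join(allData, activityNames, by = "Activity")
print(joined)
```

If a code in the data had no match in the lookup table, the inner join would drop that row, so comparing nrow(joined) with nrow(allData) is a cheap sanity check.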

Step 4: Appropriately labels the data set with descriptive variable names

This was done in step 1, as described above.

Step 5: From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject

Using the aggregate function (part of base R's stats package, not dplyr), I calculate the average (mean) of every measurement, grouped by Activity and Subject.

This means that for 6 activities, and 30 subjects, there will be 180 rows (6x30) in the resulting data set, each containing the average measurements for a specific activity performed by a specific subject.
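The grouping can be illustrated on a toy data set. This sketch uses the formula interface of aggregate(), which is one possible form and not necessarily the call used in the script. The column names and values are made up.

```r
# Toy data set: two measurements, three distinct Activity/Subject pairs.
allData <- data.frame(
  m1       = c(1, 3, 5, 7),
  m2       = c(2, 4, 6, 8),
  Activity = c("WALKING", "WALKING", "SITTING", "SITTING"),
  Subject  = c(1, 1, 1, 2)
)

# "." on the left-hand side means "every column not used for grouping".
meanByActivityAndSubject <- aggregate(
  . ~ Activity + Subject,
  data = allData,
  FUN  = mean
)

# One row per Activity/Subject pair present in the data.
print(meanByActivityAndSubject)
```

Here there are 3 distinct pairs, so the result has 3 rows. With all 6 activities observed for all 30 subjects, the same call yields the 180 rows described above.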

The resulting data set was saved using the write.table function as a tab-separated text file named "meanByActivityAndSubject.txt" and uploaded using the Coursera website's user interface.
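A minimal sketch of writing a tab-separated file with write.table and reading it back. Two things here are assumptions that the description above does not state: the file is written to a temporary directory so the example does not touch the working directory, and row.names = FALSE is used to leave out the row-number column.

```r
# Toy summary table standing in for the real 180-row result.
meanByActivityAndSubject <- data.frame(
  Activity = c("WALKING", "SITTING"),
  Subject  = c(1, 1),
  m1       = c(2, 5)
)

# Assumption: write to a temporary directory for the sake of the example.
outFile <- file.path(tempdir(), "meanByActivityAndSubject.txt")

# sep = "\t" produces a tab-separated file;
# row.names = FALSE (an assumption) omits the row-number column.
write.table(meanByActivityAndSubject, file = outFile, sep = "\t", row.names = FALSE)

# Read the file back to confirm it round-trips.
readBack <- read.table(outFile, header = TRUE, sep = "\t")
print(readBack)
```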
