Summary

Using a boosted-trees model (“gbm”) from R’s caret package, exercise correctness is predicted with greater than 98% accuracy on a 4904-observation test set.

Background

The ability to predict how well, or how correctly, a particular exercise is performed is a new frontier at the intersection of exercise, technology, and predictive analytics.
While fitness trackers, embedded sensors, and IoT devices have made exercise tracking more accessible, current integrations of these systems focus on measuring the amount of exercise rather than its correctness. The latter is a harder problem, but solving it has significant commercial impact. A validated predictive model of exercise correctness:

  1. offers “personal trainer”-style feedback to a wider range of exercisers
  2. helps prevent potential injuries
  3. makes better use of exercise time. [1]

I used the WLE dataset [1] to develop a predictive model of whether a lift was performed correctly and, if not, which type of error the exerciser was making while lifting dumbbells. In the dataset, these outcomes are encoded in a variable called “classe”. The candidate predictors were recorded by sensors on the weight and on the lifter’s body.

Data and Cross-Validation

The relevant data set was provided as a csv file, “pml-training”, with 19622 observations of 160 variables: 159 possible predictors and 1 response, where the response is the “classe” of lifting error discussed above.

Below, I show the basic data cleanup and partitioning completed to prepare for training a predictive model. Predictors were removed from the data for three possible reasons: (1) they may bias the response (name of the participant), (2) they are not relevant (raw time stamps and windowing information), or (3) they are incomplete, blank, or NA.

For validation, I split the available data into training and testing sets.

  library(ElemStatLearn)
  library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
  training<-read.csv("pml-training.csv", stringsAsFactors = FALSE)
    #final<-read.csv("pml-testing.csv", stringsAsFactors = FALSE)
  training$classe<-as.factor(training$classe)
  
  # Manually remove biasing or irrelevant predictors (participant name, raw
  # timestamps, windowing information); the 0 index in the original seq() was
  # a no-op and is dropped here
  colsToRemove <- c(1:5, 12:20, 43:48, 52:60, 74:82)
  
  # Programmatically remove columns that are mostly NA
  na_count <- sapply(training, function(y) sum(is.na(y)))
  na_count <- data.frame(na_count)
  na_count$colnames <- row.names(na_count)
  colsToTake <- na_count$colnames[na_count$na_count < 1000]
  training <- training[, colsToTake]
  # Swap the last kept name for training's actual last column name so the
  # same selection can be applied to the final test set
  colsToTake <- replace(colsToTake, length(colsToTake), colnames(training)[length(training)])
  #testing<-testing[,colsToTake]
  
  training <- training[, -colsToRemove]
  #final<<-testing[,-colsToRemove]
  
  trainIndex<-createDataPartition(y=training$classe, p=0.75, list=FALSE)
  testing<-training[-trainIndex,]
  training<-training[trainIndex,]
  
  #For Debug
  #testing<<-testing
  #training<<-training
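The NA-count filter used above can be sketched on a toy data frame (base R only; the data frame and the threshold of 5 here are illustrative, not from the WLE data, where the threshold was 1000):

```r
# Toy data frame: one clean column, one mostly-NA column
df <- data.frame(
  good = 1:10,
  bad  = c(1, rep(NA, 9))
)

# Count NAs per column, as in the cleanup code above
na_count <- sapply(df, function(y) sum(is.na(y)))

# Keep only columns below the NA threshold
keep <- names(na_count)[na_count < 5]
df_clean <- df[, keep, drop = FALSE]

colnames(df_clean)  # "good"
```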

Building Model

Using the “caret” package in R, a boosted-trees model (“gbm”) is trained and then used to predict on the testing dataset.

  library(AppliedPredictiveModeling)
  library(caret)
  
  modelfit<-train(classe~., method="gbm", data=training ,verbose=FALSE)
  #modelfit
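Note that by default caret’s train() resamples with the bootstrap; an explicit k-fold cross-validation can be requested through trainControl instead. A minimal sketch (the 5-fold choice is an assumption for illustration, not what was run above):

```r
library(caret)

# Request 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)

# Same model call as above, with the explicit resampling scheme
# modelfit <- train(classe ~ ., method = "gbm", data = training,
#                   trControl = ctrl, verbose = FALSE)
```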

Validation

We use a confusion matrix to assess response accuracy on the 4904-observation test set. This validation shows that the model predicts well, at >98% accuracy.

confusionMatrix(testing$classe,predict(modelfit,testing))
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ElemStatLearn':
## 
##     ozone
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    7  932   10    0    0
##          C    0   10  844    1    0
##          D    1    0   13  789    1
##          E    0    2    3   10  886
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9882         
##                  95% CI : (0.9847, 0.991)
##     No Information Rate : 0.2861         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.985          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9943   0.9873   0.9701   0.9862   0.9989
## Specificity            1.0000   0.9957   0.9973   0.9963   0.9963
## Pos Pred Value         1.0000   0.9821   0.9871   0.9813   0.9834
## Neg Pred Value         0.9977   0.9970   0.9936   0.9973   0.9998
## Prevalence             0.2861   0.1925   0.1774   0.1631   0.1809
## Detection Rate         0.2845   0.1900   0.1721   0.1609   0.1807
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9971   0.9915   0.9837   0.9913   0.9976
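The headline accuracy can be recovered directly from the counts in the confusion matrix above, and the expected out-of-sample error is then simply 1 − accuracy. A base-R check using those printed counts:

```r
# Confusion matrix counts copied from the caret output above
# (rows = Prediction, columns = Reference)
cm <- matrix(c(1395,   0,   0,   0,   0,
                  7, 932,  10,   0,   0,
                  0,  10, 844,   1,   0,
                  1,   0,  13, 789,   1,
                  0,   2,   3,  10, 886),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

accuracy  <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
oos_error <- 1 - accuracy

round(accuracy, 4)   # 0.9882, matching the caret output
round(oos_error, 4)  # 0.0118
```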

Works Cited

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

http://groupware.les.inf.puc-rio.br/har#ixzz4LyHt342I