Using a boosted-trees model (“gbm”) in R’s caret package, exercise correctness is predicted with greater than 98% accuracy on a roughly 4,900-observation test set.
The ability to predict how well, or how correctly, a particular exercise is performed is a new frontier at the intersection of exercise, technology, and predictive analytics.
While fitness trackers, embedded sensors, and IoT devices have become more accessible for exercise tracking, current integrations of these systems focus on measuring the amount of exercise rather than its correctness. The latter is a more difficult problem, but solving it has significant commercial impact, and it starts with validating predictive models that can determine exercise correctness.
I used the WLE dataset [1] to develop a predictive model of whether a dumbbell lift was performed correctly or, if not, which type of error the exerciser was making. In the dataset, these outcomes are recorded in a variable called “classe”. The candidate predictors come from sensor data capturing the motion of the weight and of the lifter.
The data were provided as a csv file, “pml-training.csv”, containing 19,622 observations of 160 variables: 159 possible predictors and 1 response, where the response is the “classe” of lifting error discussed above.
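As a quick sanity check (not part of the original analysis), the reported dimensions and the distribution of the response can be confirmed after loading the file; this sketch assumes “pml-training.csv” sits in the working directory.

raw <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
dim(raw)                        # expected: 19622 rows, 160 columns
table(raw$classe)               # counts of the five lifting classes A-E
sum(colSums(is.na(raw)) > 0)    # number of columns containing NAs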
Below, I show the basic data cleanup and partitioning completed to prepare for training a predictive model. Predictors were removed from the data for three possible reasons: (1) they could bias the response (the participant’s name); (2) they are not relevant (raw timestamps and windowing information); (3) they are incomplete, blank, or NA.
For validation, I split the available data into training and testing datasets.
library(ElemStatLearn)
library(caret)
training<-read.csv("pml-training.csv", stringsAsFactors = FALSE)
#final<-read.csv("pml-testing.csv", stringsAsFactors = FALSE)
training$classe<-as.factor(training$classe)
# Manually remove biased or irrelevant predictors by column index
# (identifiers, raw timestamps, windowing info, and blank columns);
# indices refer to the data frame after the NA-based filtering below
colsToRemove <- c(1:5, 12:20, 43:48, 52:60, 74:82)
# Programmatically remove columns that are mostly NA
na_count <- sapply(training, function(y) sum(is.na(y)))
na_count <- data.frame(na_count)
na_count$colnames <- row.names(na_count)
colsToTake <- na_count$colnames[na_count$na_count < 1000]
training <- training[, colsToTake]
# Swap the final entry for the current last column name so the same
# selection can be applied to the held-out test file (commented out below)
colsToTake <- replace(colsToTake, length(colsToTake), colnames(training)[length(training)])
#testing <- testing[, colsToTake]
training <- training[, -colsToRemove]
#final <- testing[, -colsToRemove]
trainIndex<-createDataPartition(y=training$classe, p=0.75, list=FALSE)
testing<-training[-trainIndex,]
training<-training[trainIndex,]
#For Debug
#testing<<-testing
#training<<-training
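As a quick check (not part of the original code), one can verify the partition sizes and that the class proportions are similar across the two splits:

dim(training)                                  # roughly 75% of the 19622 rows
dim(testing)                                   # the ~4900-observation validation set
round(prop.table(table(training$classe)), 3)   # class proportions, training
round(prop.table(table(testing$classe)), 3)    # class proportions, testing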
Using the “caret” package in R, a boosting-with-trees model (“gbm”) is trained and then used to predict on the testing dataset.
library(AppliedPredictiveModeling)
library(caret)
modelfit <- train(classe ~ ., method = "gbm", data = training, verbose = FALSE)
#modelfit
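By default, caret’s train() estimates accuracy with bootstrap resampling, which is slow for gbm on roughly 14,700 training rows. A minimal sketch of an alternative, assuming the same training data frame: explicit 5-fold cross-validation via trainControl, which typically runs faster and reports a cross-validated accuracy for each tuning combination. This is not the model fitted above, and modelfit_cv is a hypothetical name.

# Sketch: same gbm specification, but with 5-fold CV resampling
ctrl <- trainControl(method = "cv", number = 5)
modelfit_cv <- train(classe ~ ., method = "gbm", data = training,
                     trControl = ctrl, verbose = FALSE)
modelfit_cv$results   # resampled Accuracy/Kappa per tuning combination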
We use a confusion matrix to show prediction accuracy on the 4,904-observation test set. This validation shows that the model predicts well, at greater than 98% accuracy.
confusionMatrix(testing$classe,predict(modelfit,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 7 932 10 0 0
## C 0 10 844 1 0
## D 1 0 13 789 1
## E 0 2 3 10 886
##
## Overall Statistics
##
## Accuracy : 0.9882
## 95% CI : (0.9847, 0.991)
## No Information Rate : 0.2861
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9943 0.9873 0.9701 0.9862 0.9989
## Specificity 1.0000 0.9957 0.9973 0.9963 0.9963
## Pos Pred Value 1.0000 0.9821 0.9871 0.9813 0.9834
## Neg Pred Value 0.9977 0.9970 0.9936 0.9973 0.9998
## Prevalence 0.2861 0.1925 0.1774 0.1631 0.1809
## Detection Rate 0.2845 0.1900 0.1721 0.1609 0.1807
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9971 0.9915 0.9837 0.9913 0.9976
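The estimated out-of-sample error is simply one minus the validation accuracy (about 1.2% here). A short sketch of how it could be extracted, and how the fitted model could then be applied to the held-out 20-case file (the commented-out “pml-testing.csv” read above); this assumes that file is present and that its predictor columns match those used in training.

cm <- confusionMatrix(testing$classe, predict(modelfit, testing))
1 - cm$overall["Accuracy"]            # estimated out-of-sample error rate
# Hypothetical application to the held-out cases (file path assumed)
final <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
predict(modelfit, newdata = final)    # predicted classe for the 20 cases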
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.