I have updated my work on the data set provided by Kaggle for the NCAA competition. To reiterate, the data is provided at: https://www.kaggle.com/c/march-machine-learning-mania-2015
I also utilized the blog_utility.r script from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/
Since the original post on July 6th, I have made the following changes: I added BPI as an additional predictor, I computed additional log-loss values for logistic regression, and I computed log-loss values using random forest and k-nearest neighbors. Details are below.
I compared the models by their log-loss values: the lower the log-loss, the more closely a model's predicted probabilities match the actual results. I first used the following variables as predictors: SEED, WST6, and TWPCT (these variables are explained at https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/). I also included BPI as a predictor, which is a rough estimate of how good teams really are. For the 2011-12 season and afterwards, I used the BPI rankings from http://espn.go.com/mens-college-basketball/bpi/_/season/2012 . These BPI rankings were computed by ESPN, and I believe they do not incorporate tournament results, because that page includes “NCAA tournament information” that projects the seeds and which teams will make the tournament. For seasons before 2011-12, I used the Pomeroy rankings instead (http://kenpom.com/index.php?y=2014), which also try to measure how good teams really are.
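To make the metric concrete, here is a minimal Python sketch of binary log-loss. It is not the code from my repository: the function name `log_loss`, the clipping constant `eps`, and the four made-up game outcomes are all assumptions for illustration, and I am assuming the natural-log definition that Kaggle uses for this kind of competition.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (natural log)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) can never occur
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up outcomes: 1 means the first-listed team won.
outcomes = [1, 0, 1, 1]
coin_flip = [0.5, 0.5, 0.5, 0.5]   # a model with no information
informed = [0.9, 0.2, 0.8, 0.7]    # a model leaning the right way each time

print(round(log_loss(outcomes, coin_flip), 4))  # 0.6931, i.e. ln(2)
print(log_loss(outcomes, informed) < log_loss(outcomes, coin_flip))  # True
```

Predicting 0.5 for every game always scores ln(2), about 0.693, under this definition, so that number is a useful reference point when reading the log-loss values reported below.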
I concluded that logistic regression was clearly a better model than rpart: with 2010-2012 as the training set and 2013 as the test set, it produced a log-loss of .692, compared with 1.06 for rpart. The log-loss for logistic regression dropped to .684 when I used only BPI, SEED and WST6 as the predictors, and to .682 when I used just BPI and WST6 (dropping SEED because it has a much higher p-value than A_WST6 and A_BPI). The improvement was more noticeable, to .64, when I used only BPI.
With 2008-2012 as the training set instead, the log-loss was .688 using BPI, SEED and A_WST6 as predictors. It remained the same with just BPI and A_WST6 (no SEED) as predictors, and it decreased noticeably to .656 with only BPI.
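The train-on-earlier-seasons, test-on-2013 procedure with a single predictor can be sketched as follows. This is a Python illustration only, not the code behind the numbers above: the rows are made up, `bpi_diff` is a hypothetical feature (one team's BPI rating minus the other's), and the model is fit with plain gradient descent rather than a statistics package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.05, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi  # gradient of log-loss w.r.t. the logit
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def mean_log_loss(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(x)

# Made-up rows: (season, bpi_diff, won). Not real tournament data.
rows = [(2008, -8, 0), (2008, 9, 1), (2009, -5, 0), (2009, -3, 0),
        (2010, -2, 0), (2010, -1, 1), (2011, 1, 0), (2011, 2, 1),
        (2012, 4, 1), (2012, 6, 1),
        (2013, -4, 0), (2013, 3, 1), (2013, 7, 1), (2013, -1, 0)]

train = [(x, y) for season, x, y in rows if 2008 <= season <= 2012]
test = [(x, y) for season, x, y in rows if season == 2013]

b0, b1 = fit_logistic([x for x, _ in train], [y for _, y in train])
test_ll = mean_log_loss([x for x, _ in test], [y for _, y in test], b0, b1)
print(b1 > 0, test_ll < math.log(2))
```

The key point is that the 2013 rows are never seen during fitting, so the test log-loss measures how well the model generalizes to a new tournament rather than how well it memorized past ones.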
I then tried to determine the log-loss for Support Vector Machine, but I could not get the code for it to work.
I then tried random forest. With the 2008-2012 seasons as the training set, 2013 as the test set, and BPI, SEED and WST6 as the predictors, it produced a log-loss of .651 with mtry=2 (2 of the 3 predictors are considered as candidates at each tree split) and ntree=5000 (5000 trees). The log-loss did not change much when I varied the number of trees: it was .657 for ntree=500, .650 for ntree=1000, and .648 for ntree=10000. These values are very similar to the values from logistic regression.
Finally, I used k-nearest neighbors. The best log-loss was .6947, at k=320, using 2008-2012 as the training set and 2013 as the test set. This is slightly larger than the log-loss values from random forest and logistic regression, so the predictions from those two models are slightly more accurate than those from k-nearest neighbors (using 2008-12 as the training set and 2013 as the test set).
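For readers unfamiliar with how k-nearest neighbors yields a probability, here is a minimal Python sketch under simplifying assumptions: a single hypothetical predictor `bpi_diff`, made-up training rows, and a helper `knn_win_probability` that I am inventing for illustration (it is not the code in my repository). The predicted probability is simply the share of wins among the k closest training games.

```python
def knn_win_probability(train, query, k):
    """Estimate P(win) as the share of wins among the k nearest training games."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return sum(won for _, won in nearest) / len(nearest)

# Made-up rows: (bpi_diff, won). Not real tournament data.
train = [(-8, 0), (-5, 0), (-3, 0), (-2, 0), (-1, 1),
         (1, 0), (2, 1), (4, 1), (6, 1), (9, 1)]

for k in (1, 3, 5):
    print(k, knn_win_probability(train, 0.4, k))
```

With k=1 the prediction here is exactly 0, and log-loss punishes a prediction of 0 or 1 very heavily whenever it turns out wrong. Averaging over more neighbors pulls the predictions away from those extremes, which may be part of why a large k scored best in my runs.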
The changes I made are labeled as "second commit" at https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM, which provides all the data and code I used.
Hello world, my name is Jerry Kim. I have a background in physics and programming and I am interested in a career as a software engineer or data scientist. I created this website to showcase a small sample of the things that I have worked on