Blog Archives

Linear Regression and Bayesian Regression with MCMC on data from the NBA

7/6/2015

In this post I describe how I applied linear regression and Bayesian regression with MCMC simulations to NBA data for a recent course project in the Bayesian statistical methods course I enrolled in. Before performing the Bayesian regression, I tried to determine which variables (such as STL, AST, and REB) best predict wins (Winpct) by applying the Best Subset method and the Forward/Backward Stepwise methods to a linear regression model. I then used JAGS to run the MCMC and checked the trace plots and autocorrelation plots to confirm that the chains behaved as expected.

INTRODUCTION
During a National Basketball Association (NBA) season, large amounts of data are recorded in every game. There are "traditional" team and player statistics that are recorded in "box scores", such as the number of assists (AST), steals (STL), rebounds (REB), and field goal percentage (FG%). However, with the recent rise of statistics and data science in basketball, there are now "advanced" statistics such as Effective Field Goal Percentage (EFG%), Turnover Percentage (TOV%), Rebounding Percentage (REB%), and Free Throws Per Field Goal Attempt (FTpFGA).

Previous studies have generated models to determine which basketball variables best predict the winning percentage of teams. For example, the statistician Dean Oliver came up with the "Four Factors" in one of his studies. He pointed out that EFG%, TOV%, REB%, and FTpFGA were the four important factors that determined whether teams would win or lose games. However, these factors can be refined into more specific variables, which could give more detail about what best predicts whether a team will win or lose a game. For example, REB% could be further divided into Offensive Rebounding Percentage (ORB%) and Defensive Rebounding Percentage (DRB%). We wish to investigate our own linear regression model and compare our results with the “Four Factors”. In addition, we are interested in applying Markov Chain Monte Carlo (MCMC) simulations to a Bayesian regression model. Therefore, the first purpose of this project is to establish a more refined model using general linear regression and to investigate whether we obtain results similar to the Four Factors. The second purpose is to evaluate the similarity between frequentist and Bayesian estimates.

DATA AND VARIABLES
The data for our project were collected from NBA.com, basketball-reference.com, and nbaminer.com. We use the winning percentages and statistics for all 30 NBA teams. The statistics we use as predictors are: 3-Point Make Percentage (X3Ppct), 2-Point Make Percentage (X2Ppct), Assist Per Turnover Ratio (ASTpTO), Assist Ratio (ASTRatio), Steal Percentage (STLpct), Block Percentage (BLKpct), Turnover Ratio (TORatio), Personal Fouls Drawn Rate (PFDRate), Free Throw Attempt Rate (FTARate), Free Throws per Field Goal Attempt (FTpFGA), Turnover Percentage (TOVpct), OREBpct, DREBpct, and the respective variables for each team's opponents. We use the winning percentage as the dependent variable.

MODELS
We use the winning percentage and statistics for each team from the 2013-14 NBA season to fit the multiple linear regression model. It is possible that all the predictors we consider are strongly associated with teams' winning percentages, but it is more likely that the response is related to only a subset of the predictors. When trying to determine the relevant predictors, there are a total of 2^p possible models, one for each subset of the p predictors. This means that even for moderate p, trying out every possible subset of the predictors is infeasible. For example, with p=26 predictors we would have to consider 2^26=67,108,864 models. It is clearly not practical to try all of these possible models. Therefore, we use the Forward, Backward, and Stepwise Selection methods to determine which predictors are most relevant to predicting a team's winning percentage. By combining the results generated from all three methods, we obtained the variables that are most relevant to predicting a team's winning percentage. Two of the most common numerical measures of fit, RSE and R^2, were used to compare the models.
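To make the size of the search concrete, here is a small sketch that compares the number of models an exhaustive search would fit with the number forward stepwise selection fits. It assumes the standard forward procedure (fit the null model, then at each step try adding each remaining predictor), which gives 1 + p(p+1)/2 fits. The function names are my own and are not from the project scripts.

```python
def best_subset_model_count(p: int) -> int:
    """Number of models when every subset of p predictors is tried."""
    return 2 ** p


def forward_stepwise_model_count(p: int) -> int:
    """Number of models fit by forward stepwise selection.

    One null model, then p candidates at step 1, p - 1 at step 2, and so on,
    which sums to 1 + p * (p + 1) / 2.
    """
    return 1 + p * (p + 1) // 2


p = 26
print(best_subset_model_count(p))       # 67108864
print(forward_stepwise_model_count(p))  # 352
```

Backward selection fits the same number of models, which is why the stepwise methods stay practical even when exhaustive search does not.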

VARIABLE SELECTION
Combining the results from Forward, Backward, and Mixed Selection, we found that the most relevant predictors are X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct. All of these variables fall under the "Four Factors" mentioned above, but our predictor variables are more refined. For example, the shooting factor is measured using Effective Field Goal Percentage (EFG%), which is a combination of 2Ppct, 3Ppct, Opp2Ppct, and Opp3Ppct. From our results, only 2Ppct, Opp2Ppct, and Opp3Ppct are relevant. The rebounding factor is measured using Offensive and Defensive Rebound Percentage. Our results show that only Offensive Rebound Percentage (ORBpct) has a significant impact on winning games.
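For reference, the usual definition of Effective Field Goal Percentage weights a made 3-pointer as 1.5 made field goals: EFG% = (FGM + 0.5 * 3PM) / FGA. The sketch below only illustrates how 2-point and 3-point makes are folded into one number. The shot totals are made up for illustration and are not taken from the project data.

```python
def effective_fg_pct(fg_made: float, fg3_made: float, fg_attempted: float) -> float:
    """Effective field goal percentage: (FGM + 0.5 * 3PM) / FGA."""
    if fg_attempted <= 0:
        raise ValueError("fg_attempted must be positive")
    return (fg_made + 0.5 * fg3_made) / fg_attempted


# Hypothetical game: 40 made field goals (10 of them 3-pointers) on 80 attempts.
efg = effective_fg_pct(fg_made=40, fg3_made=10, fg_attempted=80)
print(efg)  # 0.5625
```

Because EFG% merges both shot types, a regression on 2Ppct and 3Ppct separately can find that one matters more than the other, which EFG% alone cannot show.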

COEFFICIENTS FROM GENERAL LINEAR REGRESSION
After determining the relevant variables, we find that the regression coefficients are 3.84, -.0528, .00722, .00763, .0219, -2.54, and -5.53 for X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct, respectively.

BAYESIAN REGRESSION USING MCMC
After performing linear regression, we performed Bayesian regression. We use non-informative prior distributions for the parameters. That is, we use “flat” normal distributions. We then perform Bayesian regression with MCMC simulations using OpenBUGS and rjags from within R. Our results show that the posterior mean coefficient values for the predictors are 3.86, -.0523, .00755, .00377, .0227, -2.36, and -5.60 for X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct, respectively. We verify that our Bayesian regression model is valid by checking that the trace plots for each predictor variable converge. The plots are below.
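The project itself ran the sampler through JAGS/OpenBUGS. As a rough illustration of the idea only, here is a hand-rolled random-walk Metropolis sampler for a one-predictor regression with wide ("flat") normal priors. Everything in it is an assumption made for the sketch: the data are synthetic, there is a single centered predictor, the noise standard deviation is treated as known, and the step size and iteration counts are arbitrary choices.

```python
import math
import random


def ols_fit(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def log_posterior(a, b, x, y, noise_sd, prior_sd):
    """Log posterior (up to a constant) with Normal(0, prior_sd^2) priors."""
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    log_lik = -sse / (2.0 * noise_sd ** 2)
    log_prior = -(a * a + b * b) / (2.0 * prior_sd ** 2)
    return log_lik + log_prior


def metropolis(x, y, noise_sd, prior_sd=100.0, n_iter=20000,
               burn_in=5000, step=0.03, seed=1):
    """Random-walk Metropolis over (intercept, slope); returns post-burn-in draws."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    current = log_posterior(a, b, x, y, noise_sd, prior_sd)
    draws = []
    for i in range(n_iter):
        a_new = a + rng.gauss(0.0, step)
        b_new = b + rng.gauss(0.0, step)
        proposed = log_posterior(a_new, b_new, x, y, noise_sd, prior_sd)
        # Accept uphill moves always, downhill moves with probability exp(diff).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            a, b, current = a_new, b_new, proposed
        if i >= burn_in:
            draws.append((a, b))
    return draws


# Synthetic data (hypothetical): 30 "teams", one centered predictor.
data_rng = random.Random(0)
x = [-1.0 + 2.0 * i / 29 for i in range(30)]
y = [0.5 + 0.3 * xi + data_rng.gauss(0.0, 0.1) for xi in x]

ols_a, ols_b = ols_fit(x, y)
draws = metropolis(x, y, noise_sd=0.1)
post_a = sum(d[0] for d in draws) / len(draws)
post_b = sum(d[1] for d in draws) / len(draws)
print(f"OLS:       intercept={ols_a:.3f}, slope={ols_b:.3f}")
print(f"Posterior: intercept={post_a:.3f}, slope={post_b:.3f}")
```

With priors this wide, the posterior means should land very close to the least-squares estimates, which is the same comparison made between the two sets of coefficients in this post.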

We also obtain values close to 1.00 for the Gelman diagnostics, with the highest values being 1.08 for Opp2Ppct and 1.06 for Opp3Ppct. The largest cross-correlation in magnitude is -.467, between Opp2Ppct and Opp3Ppct. Initially, the autocorrelation plots for Opp2Ppct, Opp3Ppct, and X2Ppct do not decay to 0. However, after increasing the thinning, the autocorrelation plots below show the decay to 0 that we expect.
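To show what these two checks measure, here is a minimal sketch that computes a basic Gelman-Rubin statistic and the lag-1 autocorrelation before and after thinning. It is not the coda implementation (coda's gelman.diag applies an additional degrees-of-freedom correction), and the chains are synthetic AR(1) series that I use as a stand-in for strongly autocorrelated MCMC output.

```python
import random


def autocorr(chain, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain)
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return cov / var


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(
        sum((v - mu) ** 2 for v in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5


def ar1_chain(n, phi, rng):
    """Synthetic autocorrelated chain: x_t = phi * x_{t-1} + noise."""
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(42)
chains = [ar1_chain(20000, 0.9, rng) for _ in range(3)]

rhat = gelman_rubin(chains)
acf_raw = autocorr(chains[0], 1)
thinned = chains[0][::20]          # keep every 20th draw
acf_thin = autocorr(thinned, 1)

print(f"R-hat: {rhat:.3f}")
print(f"lag-1 autocorrelation, raw chain:     {acf_raw:.3f}")
print(f"lag-1 autocorrelation, thinned chain: {acf_thin:.3f}")
```

Thinning does not add information; it only discards draws so that the ones that remain are closer to independent, which is why the autocorrelation plots look better afterwards.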

DISCUSSION
We see that the general linear regression and Bayesian regression produced similar values for the coefficients of the predictors. The one exception is the FTpFGA variable, which had a coefficient of .00763 in the linear regression model but .00377 in the Bayesian regression model. The similar values match our expectations because we used Bayesian methods with non-informative priors, which should produce values similar to those from frequentist methods. For future research, and to generate more accurate models, there are several possibilities we could investigate. First, we could use informative priors for the Bayesian regression model. Second, we could use machine learning algorithms, such as Random Forest, to determine which predictors are most relevant. In addition, we could use a Support Vector Machine with our predictor variables to try to predict the winners of future NBA games.
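The size of the gap between the two sets of estimates is easier to see as a relative difference. The short sketch below uses only the coefficient values reported in this post; the variable names in the code are my own.

```python
predictors = ["X2Ppct", "TORatio", "ORBpct", "FTpFGA",
              "OppTOVpct", "Opp3Ppct", "Opp2Ppct"]
ols_coefs = [3.84, -0.0528, 0.00722, 0.00763, 0.0219, -2.54, -5.53]
bayes_coefs = [3.86, -0.0523, 0.00755, 0.00377, 0.0227, -2.36, -5.60]

# Relative difference of the Bayesian posterior mean from the OLS estimate.
rel_diff = {
    name: abs(bayes - ols) / abs(ols)
    for name, ols, bayes in zip(predictors, ols_coefs, bayes_coefs)
}

for name, diff in rel_diff.items():
    print(f"{name:10s} {100 * diff:5.1f}%")

largest = max(rel_diff, key=rel_diff.get)
print("largest relative difference:", largest)
```

FTpFGA differs by about half of its OLS value, while every other coefficient differs by well under 10%.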

ADDITIONAL STUDIES
I also completed analyses using Principal Component Analysis, Ridge regression, and Lasso regression. In addition to Forward Selection, Backward Selection, and Best Subset Selection, I used Principal Component Analysis to perform dimension reduction and variable selection. I wanted to know how much of the information and variance in the data is lost if I assume everything is contained in a few principal components. The share of the total variance captured by each principal component is its proportion of variance explained (PVE). To compute the cumulative PVE of the first n principal components, we can simply sum the first n PVEs.
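The PVE calculation described above can be sketched in a few lines. The component variances here are hypothetical numbers chosen for illustration and are not the values from PCA.r.

```python
from itertools import accumulate


def cumulative_pve(component_variances):
    """Return (PVE per component, cumulative PVE) from component variances."""
    total = sum(component_variances)
    pve = [v / total for v in component_variances]
    return pve, list(accumulate(pve))


def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative PVE reaches threshold."""
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    return len(cumulative)


# Hypothetical variances of five principal components, largest first.
variances = [5.0, 3.0, 1.0, 0.5, 0.5]
pve, cum = cumulative_pve(variances)
n_needed = components_needed(cum, threshold=0.85)
print([round(v, 2) for v in pve])
print([round(v, 2) for v in cum])
print(n_needed)
```

In R, the variances would typically come from squaring the `sdev` component of a `prcomp` fit.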

Looking at the figure below, one can see that the first 8 principal components explain almost 90% of the total variance. Unfortunately, there is no single answer to the question of how many principal components we need to use. There is a reasonable increase in the cumulative variance explained until around 6-8 principal components. Beyond that, adding more principal components only marginally increases the variance explained.

In addition to the subset selection models, I can fit a model using a constraint that shrinks the coefficient estimates towards zero. Ridge regression uses a shrinkage penalty and a tuning parameter "lambda". In the ridge-lambda plot below on the left, each curve corresponds to the ridge regression coefficient estimate for one of the predictor variables, plotted as a function of lambda. The coefficients shrink towards 0 as lambda increases. A plot of the cross-validation MSE vs log(lambda) is below on the right (in that plot, the "24" at the top indicates that all 24 predictors are used for each lambda).
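The shrinkage is easiest to see with a single predictor and no intercept, where the ridge estimate has the closed form beta(lambda) = sum(x*y) / (sum(x^2) + lambda). This is a simplified sketch on made-up data, not the multi-predictor fit from RidgeLasso.r.

```python
def ridge_coefficient(x, y, lam):
    """Ridge estimate for one predictor without intercept.

    Minimizes sum((y_i - beta * x_i)^2) + lam * beta^2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical centered data with a true slope of roughly 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

lambdas = [0.0, 1.0, 10.0, 90.0]
betas = [ridge_coefficient(x, y, lam) for lam in lambdas]
for lam, beta in zip(lambdas, betas):
    print(f"lambda={lam:5.1f}  beta={beta:.3f}")
```

With lambda = 0 this is the ordinary least-squares slope. As lambda grows the estimate moves towards 0 but never reaches it exactly, which is why ridge keeps every predictor in the model.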

One disadvantage of Ridge regression is that it includes all the predictors in the final model, unlike subset selection. Lasso uses a different penalty that also performs variable selection. As with Ridge regression, a good value of lambda must be selected using cross-validation. Plots of the coefficients vs lambda (the numbers at the top indicate the number of non-zero coefficients in the model) and of the MSE vs lambda are located below, to the left and right, respectively. The minimum CV MSE occurs when using around 21 predictors, but the MSE rises rapidly at around 8-9 predictors, so we use 8 predictors.
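Why the Lasso penalty sets coefficients exactly to zero can be illustrated with the soft-thresholding rule. In the special case of an orthonormal design, with the objective written as 0.5 * RSS + lambda * sum(|beta|), each Lasso coefficient is the least-squares coefficient pulled towards zero by lambda and clipped at zero. The sketch below assumes that special case, and the starting coefficients are hypothetical.

```python
def soft_threshold(beta, lam):
    """Shrink beta towards zero by lam, setting it to exactly 0 if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


# Hypothetical least-squares coefficients under an orthonormal design.
ols_coefs = [3.0, -1.5, 0.4, -0.1]

lambdas = [0.0, 0.2, 0.5, 2.0]
nonzero_counts = []
for lam in lambdas:
    lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
    nonzero = sum(1 for b in lasso_coefs if b != 0.0)
    nonzero_counts.append(nonzero)
    print(f"lambda={lam:3.1f}  non-zero coefficients={nonzero}")
```

The count of non-zero coefficients drops as lambda increases, which is what the numbers along the top of the coefficient plot track.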

For more details, see my github page at https://github.com/jk34/Bayes1_MarkovChainMonteCarlo_SDS384/tree/master/NBAproject

The completed report is the file "Report.pdf". The Principal Component Analysis, Ridge regression, and Lasso regression analyses are in the files "PCA.r" and "RidgeLasso.r".

Random forest, regression and other statistical analysis on NBA data sets

7/6/2015

In NBA_WinsPredict1415.r, I obtained data from http://www.basketball-reference.com/leagues/NBA_2015_advanced.html and modified the file so that WinPct is the winning percentage of the player's team as of 3/19/15.

I wanted to determine which player statistics best predict the team's winning percentage.

I set WinPct as the dependent variable. I used the following as predictors: PER, Ts_pct, ORBpct, DRBpct, TRBpct, ASTpct, STLpct, BLKpct, TOVpct, OWS, DWS, WS, WSp48, OBPM, DBPM, BPM, and VORP (see basketball-reference.com for the meaning of these variables). I ran a Random Forest using nearly all of these predictors. The relative importance of each predictor is displayed in ggplot_RF.jpg (at the bottom of this post). Based on this, WSp48 (the player's win shares per 48 minutes) best predicts a team's winning percentage, followed by DWS (defensive win shares). It seems strange that WSp48 and DWS are stronger predictors of a team's victories than OWS and WS, since WS is just OWS+DWS (see NBA_WinsPredict1415.r for details).

I then performed a multiple linear regression on that same dataset. PER, ORBpct, DRBpct, TRBpct, ASTpct, TOVpct, WSp48, OBPM, DBPM, and BPM all have p-values less than 0.05. However, OWS, DWS, and WS all have p-values greater than .74. Using relimp, the variables with the highest relative importance were WSp48, ORBpct, DRBpct, and TRBpct.

I also wanted to see if the regression and random forest results would be any different if I focused only on guards: point guards (PG) and shooting guards (SG). The statistical analysis for guards only is in "NBA_WinsPredict1314_weighted_PG_SG.r".

In PTPM.r, I downloaded data from https://docs.google.com/spreadsheets/d/1GtCDQw94kpcOw_kPhyH8F5cIjPT3QTsOGqvrX_hMCo8/edit?pli=1#gid=0 and used a multiple linear regression to determine which variables had the greatest impact on the team's winning percentage. TeamDefEffect had the highest relative importance value, but its p-value is .585.

I also wanted to see which of PTPM, WSp48 and VORP best predicts the team win percentage for a given player. The R-script is in "NBA_WinsPredict1415_WS_VORP_vsPTPM.r"

I also created a heat map for PTPM, PER, DWS, WS, WSp48, and VORP, but only for the players with the top 15 values of WinPctMP. See "Heatmap1415_WS_VORP_PTPM.r" (on my github page; see details at the bottom of this post) and the heat map "WSvsVORPvsPTPM_top15_Heatmap.jpg" at the bottom of this post.

I also created line charts of VORP vs Season and WSp48 vs Season for several NBA players. See LineChartOfPlayersVORP_WSp.r and LineChart_VORP_WSpvsSeason.jpg (the plot is at the bottom of this post). I also created Loess regression plots, as seen in NBA_WinsPredict1314_Loess.r and Loess_WinPctMPvsWS_VORP_PER1.jpg (see bottom).

The equivalent work I have done using Python can be seen at https://github.com/jk34/Python/

Kaggle NCAA competition

7/6/2015

This post covers my work using machine learning algorithms on NCAA basketball tournament data. I obtained the data from https://www.kaggle.com/c/march-machine-learning-mania-2015

I also utilize the blog_utility.r from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/

The analysis I have performed is in ncaa.r. I have tried trees, logistic regression, and support vector machines to build predictive models from previous seasons and tournament results, using the 2014 NCAA tournament results as the test data.

So far, the log-loss value of .5935 for logistic regression is better (lower) than the 1.008 log-loss for rpart.
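For context on these numbers, here is a minimal sketch of the binary log-loss metric, assuming the usual natural-log definition. The outcomes and probabilities are made up for illustration. A model that always predicts 0.5 scores ln(2), roughly 0.693, so a value of .5935 beats that baseline while 1.008 does not.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical outcomes (1 = first team won) and predicted probabilities.
outcomes = [1, 0, 1, 1, 0]
coin_flip = [0.5] * len(outcomes)
confident_wrong = [0.1, 0.9, 0.1, 0.1, 0.9]

baseline = log_loss(outcomes, coin_flip)
bad_model = log_loss(outcomes, confident_wrong)
print(f"always 0.5:        {baseline:.4f}")
print(f"confidently wrong: {bad_model:.4f}")
```

Log-loss punishes confident wrong predictions heavily, which is one way a model can end up scoring worse than a constant 0.5 guess.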

See https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM for more details

Author

Hello world, my name is Jerry Kim. I have a Master's Degree in Physics and years of work experience in Image Processing, Machine Learning, and Deep Learning. I have mostly used C++, Matlab, and Python. I created this website to showcase a small sample of the things that I have worked on.
