Blog Posts

My work using Python and SQLite for University of Michigan/Coursera course on databases

12/20/2015

I just completed all the assignments for the "Using Databases with Python" online course offered by the University of Michigan and Coursera (at https://www.coursera.org/learn/python-databases/home/welcome). You can view my work at: https://github.com/jk34/DatabasesPython_SQL_Coursera

The "HW1" folder contains the work for the first assignment. I created a SQLite database, created a new table, and inserted some values into it. I then modified the given "email.py" Python script to read the "mbox.txt" file of email records and count the number of email messages per email-address domain name
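
The counting step can be sketched roughly as follows. This is a minimal illustration and not the course's "email.py": the sample lines, the in-memory database, and the table name "Counts" are assumptions made for the example.

```python
import sqlite3

# Hypothetical sample lines in the style of an mbox file (not the real mbox.txt).
lines = [
    "From: stephen.marquard@uct.ac.za",
    "From: louis@media.berkeley.edu",
    "From: zqian@umich.edu",
    "From: rjlowe@iupui.edu",
    "From: cwen@iupui.edu",
    "Subject: not an address line",
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (org TEXT PRIMARY KEY, count INTEGER)")

for line in lines:
    if not line.startswith("From: "):
        continue
    email = line.split()[1]
    org = email.split("@")[1]  # domain name of the address
    # Create the row on first sight of a domain, then increment its counter.
    cur.execute("INSERT OR IGNORE INTO Counts (org, count) VALUES (?, 0)", (org,))
    cur.execute("UPDATE Counts SET count = count + 1 WHERE org = ?", (org,))
conn.commit()

rows = cur.execute(
    "SELECT org, count FROM Counts ORDER BY count DESC, org"
).fetchall()
for org, count in rows:
    print(org, count)
```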

In the "hw2" folder, I modified the given Python script "tracks.py" to read an iTunes export file in XML ("Library.xml") and then use sqlite3 and its executescript method within Python to generate a database, "trackdb.sqlite", that stores the artist, genre, track, and album of each song in that iTunes file.
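
As a rough sketch of how executescript can set up such a database, the example below creates four linked tables and inserts two tracks. The schema, the helper add_track, and the sample track data are assumptions for illustration, not the contents of "tracks.py" or "trackdb.sqlite".

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a file such as trackdb.sqlite
cur = conn.cursor()

# executescript runs several SQL statements in one call.
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Genre  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT UNIQUE);
CREATE TABLE Track  (id INTEGER PRIMARY KEY, title TEXT UNIQUE,
                     album_id INTEGER, genre_id INTEGER);
""")

def add_track(title, artist, album, genre):
    """Insert one track, reusing existing artist/genre/album rows (hypothetical helper)."""
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES (?)", (artist,))
    artist_id = cur.execute("SELECT id FROM Artist WHERE name = ?", (artist,)).fetchone()[0]
    cur.execute("INSERT OR IGNORE INTO Genre (name) VALUES (?)", (genre,))
    genre_id = cur.execute("SELECT id FROM Genre WHERE name = ?", (genre,)).fetchone()[0]
    cur.execute("INSERT OR IGNORE INTO Album (title, artist_id) VALUES (?, ?)", (album, artist_id))
    album_id = cur.execute("SELECT id FROM Album WHERE title = ?", (album,)).fetchone()[0]
    cur.execute(
        "INSERT OR REPLACE INTO Track (title, album_id, genre_id) VALUES (?, ?, ?)",
        (title, album_id, genre_id),
    )

# Made-up tracks; a real run would take these values from the parsed XML.
add_track("Track A", "Artist X", "Album 1", "Rock")
add_track("Track B", "Artist X", "Album 1", "Rock")
conn.commit()

rows = cur.execute("""
SELECT Track.title, Artist.name, Album.title, Genre.name
FROM Track
JOIN Album  ON Track.album_id = Album.id
JOIN Artist ON Album.artist_id = Artist.id
JOIN Genre  ON Track.genre_id = Genre.id
ORDER BY Track.title
""").fetchall()
artist_count = cur.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]
```

Storing each artist, genre, and album once and linking them by id avoids repeating the same strings on every track row.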

In the "hw3" folder, I used the given "roster.py" file to read a JSON file (roster_data.json) that contains the names of students, the courses they enrolled in, and a "role" value of 0 or 1. The script reads the values from the JSON file and stores them in the database "rosterdb.sqlite". I modified "roster.py" so that it stores the "role" value from the JSON file in the "role" column of the "Member" table
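
A minimal sketch of that idea is below. The JSON layout (name, course, role triples), the "User" and "Course" tables, and the sample names are assumptions for the example; only the "Member" table with its "role" column comes from the description above.

```python
import json
import sqlite3

# Hypothetical roster entries: [student name, course title, role].
roster_json = '[["Alice", "si110", 1], ["Bob", "si110", 0], ["Alice", "si106", 0]]'

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE User   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER, role INTEGER,
                     PRIMARY KEY (user_id, course_id));
""")

for name, title, role in json.loads(roster_json):
    cur.execute("INSERT OR IGNORE INTO User (name) VALUES (?)", (name,))
    user_id = cur.execute("SELECT id FROM User WHERE name = ?", (name,)).fetchone()[0]
    cur.execute("INSERT OR IGNORE INTO Course (title) VALUES (?)", (title,))
    course_id = cur.execute("SELECT id FROM Course WHERE title = ?", (title,)).fetchone()[0]
    # The role value travels with the (student, course) pair in the Member table.
    cur.execute(
        "INSERT OR REPLACE INTO Member (user_id, course_id, role) VALUES (?, ?, ?)",
        (user_id, course_id, role),
    )
conn.commit()

members = cur.execute("""
SELECT User.name, Course.title, Member.role
FROM Member
JOIN User   ON Member.user_id = User.id
JOIN Course ON Member.course_id = Course.id
ORDER BY User.name, Course.title
""").fetchall()
```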

Finally, the "peer" folder contains my work for the last assignment. It starts with "where.data", which stores the names of places (such as Northeastern University, University of London, etc.). "geoload.py" reads that file, uses Google to look up the addresses of those places and obtain data about them, and stores the results in the database "geodata.sqlite". Then "geodump.py" reads that database and produces "where.js", which allows the user to open "where.html" to see a visualization of those places on a Google map. I added the place "Westminster Mall" to "where.data", ran geoload.py and geodump.py, and then opened where.html to see Westminster Mall's location on the map. You can see this in "where.png"

2 Comments

Kaggle competition on credit scoring to predict defaults

10/23/2015

0 Comments

One problem banks are interested in is determining the credit score of their customers in order to predict the likelihood that their customers would default on a potential loan. In this blog post I will talk about the project I worked on that dealt with this problem. I obtained data from: https://www.kaggle.com/c/GiveMeSomeCredit

I used Python to work on this data set, starting with Logistic Regression. Because the training set had lots of NA values, I first dropped the entries that contained any NA values

The 'NumberOfTime30-59DaysPastDueNotWorse' variable contained the values 96 and 98, which are typos, so I replaced them with the median of that variable

I then generated histograms of the 'age' and 'NumberOfTime30-59DaysPastDueNotWorse' variables

I then generated a KDE (kernel density estimate) plot to further visualize the data. A KDE plot is similar to a histogram, except that it places a smooth kernel (typically a Gaussian) on each data point and sums the kernels to estimate the probability density function

I also wanted to generate plots that show how many of the entries were defaults and non-defaults, along with factor plots (using the Seaborn package) that show how the defaults vary with the age and number of dependents of each person
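
To make the idea behind a KDE concrete, here is a small pure-Python sketch that places a Gaussian kernel on each data point and averages the kernels into a density estimate. The ages and the bandwidth are made-up values, and this is not the plotting code I used (Seaborn handles all of this internally).

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density as an average of Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical ages, not taken from the Kaggle data set.
ages = [25, 31, 38, 42, 45, 47, 52, 58, 63, 70]
density = gaussian_kde(ages, bandwidth=5.0)

# A density should integrate to roughly 1; check with a simple Riemann sum
# over the interval [-50, 150) in steps of 0.1.
step = 0.1
area = sum(density(-50.0 + i * step) * step for i in range(2000))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual data points more closely.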

I then generated linear plots using Seaborn to see how the number of defaults correlates with the 'NumberOfTime30-59DaysPastDueNotWorse' variable

I then proceeded with Logistic Regression. I first set the dependent variable as the defaults and converted it into a 1-d array as required by Scikit-learn

I then fit Logistic Regression on the entire training set and computed its accuracy score, which was 93.06%. However, this was only a marginal improvement over the actual percentage of non-defaults in the dataset, which is 93.05%. In other words, a model that always predicts "no default" would score almost as well
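
The comparison against the share of non-defaults is a majority-class baseline. A minimal sketch of how such a baseline can be computed is below; the labels are made up (93 non-defaults and 7 defaults) and are not the Kaggle data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical labels: 0 = no default, 1 = default.
labels = [0] * 93 + [1] * 7
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

On imbalanced data like this, accuracy alone says little, so a model is only useful if it clearly beats this baseline.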

To improve on this score, I tried regularization with the Lasso (l1) penalty. I first split the training set into a training set and a validation/test set. By default, the scikit-learn split puts 75% of the original set into the new training set and the remaining 25% into the validation set. The plot below shows the coefficients as a function of the log of C, where C=1/lambda and lambda is the strength of the penalty term. The greater the lambda, the more the coefficients of the predictors tend towards 0, thus eliminating the irrelevant predictors

Many of the coefficients go towards 0 as C approaches 0 (or lambda approaches infinity). The accuracy scores I obtained were 1.0 for C values of 1, 316.2, 100000, 3.16e7, etc. However, the score was .9301 when C=1e-5 and .99963 when C=.003 (logC = -2.5). From the plot, it is hard to determine which predictors become 0 as C decreases. I concluded that the most relevant predictors were DebtRatio, age, NumberRealEstateLoansOrLines, and NumberOfOpenCreditLinesAndLoans.

I then used only these relevant predictors in another Logistic Regression analysis. However, with just these predictors, the accuracy dropped to .9277

For future studies, I plan to utilize Random Forest and Support Vector Machine to compare the accuracy score with Logistic Regression. I also want to see if using an ensemble of these methods can further improve the accuracy score

Full details of the code at: https://github.com/jk34/Kaggle_Credit_Default_Loan

Using Python for the Coursera course "Computational Investing"

10/11/2015

0 Comments

I used Python to complete the assignments for the Coursera course "Computational Investing". The description of the course and the assignments is located at: http://wiki.quantsoftware.org/index.php?title=Computational_Investing_I

For the assignments, I used the QSTK package, which supports portfolio construction and management. It is described further at: http://wiki.quantsoftware.org/index.php?title=QuantSoftware_ToolKit

For the 1st assignment, I wrote the program hw1.py, which simulates how the stocks in a given portfolio perform over time and computes statistics on the final values of the stocks to see how much profit/loss you got with this portfolio. The program also contains a portfolio optimizer that tests every "legal" set of allocations to the 4 given stocks to see which allocation produces the best portfolio. The plot below (which is also located in my Github repository at https://github.com/jk34/Computational_Investing_Python_Coursera) shows the value of the portfolio compared to a benchmark (S&P 500 index) over time
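
The brute-force search over allocations can be sketched as below. This is an illustration under stated assumptions, not hw1.py: I assume a "legal" allocation means weights in 10% steps that sum to 100%, and the scoring function and per-stock returns are placeholders (a real optimizer would score each allocation by a statistic of the simulated portfolio).

```python
from itertools import product

def legal_allocations(n_stocks=4, steps=10):
    """Yield every allocation in increments of 1/steps whose weights sum to 1."""
    for combo in product(range(steps + 1), repeat=n_stocks):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)

# Hypothetical cumulative returns of the 4 stocks over the period.
hypothetical_returns = [1.10, 1.05, 1.20, 0.95]

def score(weights):
    """Placeholder score: the weighted cumulative return of the portfolio."""
    return sum(w * r for w, r in zip(weights, hypothetical_returns))

allocations = list(legal_allocations())
best = max(allocations, key=score)
print(len(allocations), best)
```

With 4 stocks and 10% steps there are only a few hundred candidates, so checking all of them is cheap.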

For the 2nd assignment, the program hw2.py (code details at my github page) conducts "event studies" to see how stock price "events" affect future prices. An event occurs when the actual close of a stock drops below $5.00 after being at least $5.00 the previous day. The program uses the Event Profiler provided in QSTK. The event profiler output, which shows how stocks perform after a market event, is displayed in the plot below
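
The event definition itself is simple to express in code. The sketch below is plain Python rather than the QSTK-based hw2.py, and the closing prices are made up.

```python
def find_events(closes, threshold=5.0):
    """Return the day indices where the close drops below the threshold
    after being at or above it on the previous day."""
    return [
        i
        for i in range(1, len(closes))
        if closes[i - 1] >= threshold and closes[i] < threshold
    ]

# Hypothetical actual closes for one stock.
closes = [5.40, 5.10, 4.90, 4.70, 5.20, 5.00, 4.99]
events = find_events(closes)
print(events)
```

Note that day 3 is not an event because the stock was already below $5.00 on day 2, and day 5 is not an event because a close of exactly $5.00 is not below the threshold.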

For the 3rd assignment, I first wrote "hw3_marketsim.py", which creates a market simulator that accepts trading orders (buy and/or sell stocks) and keeps track of the value of the portfolio containing all the equities by using the values of the stocks in historical data. The market simulator is used if you have a trading strategy containing trades you want to execute. The simulator then simulates those trades by executing them. "hw3_analyze.py" then analyzes the performance of that portfolio by computing the Sharpe Ratio, Standard Deviation, Average Daily Return of Fund, and Total/Cumulative Return of your strategy. The "marketsim-guidelines.pdf" file explains how to build the simulator, and is located at http://wiki.quantsoftware.org/index.php?title=CompInvesti_Homework_3

The plot below shows the value of the portfolio compared to a benchmark (S&P 500 index) over time
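
The four statistics can be computed from a series of daily portfolio values roughly as follows. This is a sketch, not hw3_analyze.py: it assumes 252 trading days per year, a zero risk-free rate, the population standard deviation, and cumulative return defined as final value divided by initial value. The portfolio values are made up.

```python
import math
import statistics

def portfolio_stats(values, trading_days=252):
    """Summary statistics for a list of daily portfolio values."""
    daily_returns = [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]
    avg = statistics.mean(daily_returns)
    std = statistics.pstdev(daily_returns)
    return {
        "average_daily_return": avg,
        "std_daily_return": std,
        # Annualized Sharpe ratio, assuming a zero risk-free rate.
        "sharpe_ratio": math.sqrt(trading_days) * avg / std,
        "cumulative_return": values[-1] / values[0],
    }

# Hypothetical daily portfolio values.
values = [1000000, 1010000, 1005000, 1020000, 1015000, 1030000]
stats = portfolio_stats(values)
for name, value in stats.items():
    print(name, value)
```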

For the 4th assignment, my program "hw4.py" combines the Event Study in "hw2.py" with the market simulator in "hw3_marketsim.py" by taking the output of the Event Study as a trading strategy and feeding it into the market simulator. The trading strategy is that when an event occurs, we buy 100 shares of the equity on that day and then sell them 5 trading days later. The plot below shows the value of the portfolio compared to a benchmark (S&P 500 index) over time
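
Turning events into orders can be sketched as below. The tuple layout of an order, the symbol "XYZ", and the choice to sell on the last available day when the 5-day window runs past the end of the data are assumptions for illustration, not details taken from hw4.py.

```python
def orders_from_events(event_days, symbol, n_days, hold=5, shares=100):
    """Create a buy order on each event day and a sell order `hold` trading days later."""
    orders = []
    for day in event_days:
        # Assumption: if the holding period runs past the data, sell on the last day.
        sell_day = min(day + hold, n_days - 1)
        orders.append((day, symbol, "Buy", shares))
        orders.append((sell_day, symbol, "Sell", shares))
    return sorted(orders)

# Hypothetical event days (indices of trading days) for one symbol.
orders = orders_from_events([2, 6], "XYZ", n_days=10)
for order in orders:
    print(order)
```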

For the 5th assignment, "hw5.py" computes the rolling mean of the stock price and the upper and lower bands around it, which together form the Bollinger bands. The results are plotted below
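
A rough sketch of the calculation is below. The window length, the band width of 2 standard deviations, the use of the population standard deviation, and the prices are assumptions for illustration; hw5.py may use different settings.

```python
import statistics

def bollinger_bands(prices, window=20, num_std=2.0):
    """Return (rolling mean, lower band, upper band) for each full window of prices."""
    bands = []
    for i in range(window - 1, len(prices)):
        chunk = prices[i - window + 1 : i + 1]
        mean = statistics.mean(chunk)
        std = statistics.pstdev(chunk)
        bands.append((mean, mean - num_std * std, mean + num_std * std))
    return bands

# Hypothetical closing prices, with a short window so the output is easy to follow.
prices = [10.0, 11.0, 12.0, 11.0, 10.0, 9.0, 10.0]
bands = bollinger_bands(prices, window=3)
for mean, lower, upper in bands:
    print(round(mean, 3), round(lower, 3), round(upper, 3))
```

A price that moves outside the bands is unusually far from its recent average, which is what makes the bands useful as a trading signal.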

Kaggle Crime Classification competition

10/10/2015

0 Comments

In this blog post I will discuss my work on the data provided for the San Francisco Crime Classification competition by Kaggle. The data and description of the competition are located at: https://www.kaggle.com/c/sf-crime

I ran Linear Discriminant Analysis and Random Forest on the training data in order to predict the type of crime that occurred in the test set. I could not try Principal Component Analysis to perform dimension reduction because the data only contains categorical variables. I chose LDA over logistic regression because the outcome variable in the dataset has more than 2 outcomes. As explained in the book "An Introduction to Statistical Learning" by James, Witten, Hastie, and Tibshirani, LDA is popular when there are more than 2 response classes, and the parameter estimates for logistic regression can be unstable when the classes are well separated, which is not a problem for LDA

I got a better value for the log-loss score when using LDA than with Random Forest. For LDA, I used the first 100000 rows of the training set file as the validation set and the remaining rows as the training set for Cross Validation. The log-loss I obtained was 2.547. I could not do this with Random Forest because I kept getting memory errors, since Random Forest uses up a lot of the computer's RAM. Therefore, I had to use smaller data for the training and validation sets. The log-loss was -3.18 when I used only rows 850001:878049 of the original training set file as the training set, the 1st 100 rows of that as the validation set, and ntree=100.

I then tried to get a better log-loss by using dplyr to build a training set of 6 samples for each outcome of the dependent variable (crime Category). I then used the first 50000 rows of the training set file as the validation set for Cross Validation. I then ran Random Forest with 5000 trees and computed the log-loss as 3.856. It worsened to 4.856 when using 200 samples for each possible outcome of the crime category.
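
For reference, the multi-class log-loss can be computed as in the sketch below. My analysis was done in R, so this Python version is only an illustration; the four category names and the probabilities are made up, and the clipping value eps is an assumption to avoid taking the log of 0.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs.get(label, 0.0), eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(true_labels)

# Hypothetical categories and predictions for two incidents.
classes = ["ASSAULT", "BURGLARY", "THEFT", "VANDALISM"]
true_labels = ["THEFT", "ASSAULT"]

uniform = {c: 1.0 / len(classes) for c in classes}
uniform_loss = multiclass_log_loss(true_labels, [uniform, uniform])

informed = [
    {"ASSAULT": 0.1, "BURGLARY": 0.1, "THEFT": 0.7, "VANDALISM": 0.1},
    {"ASSAULT": 0.6, "BURGLARY": 0.2, "THEFT": 0.1, "VANDALISM": 0.1},
]
informed_loss = multiclass_log_loss(true_labels, informed)
print(uniform_loss, informed_loss)
```

Predicting the same probability for each of K classes gives a log-loss of log(K), which is a useful baseline when judging the values above. Lower is better.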

So the log-loss for LDA was better than any of the log-loss values computed from Random Forest

I then used k-fold cross validation on LDA before creating a submission file containing the predicted probabilities on the test data provided by Kaggle. With 10 folds, the average log-loss was 2.668.
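
The splitting behind k-fold cross validation can be sketched as follows. This is a generic illustration in Python (my analysis used R), with a made-up row count.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

# Hypothetical data set of 23 rows split into 10 folds.
folds = list(k_fold_indices(23, 10))
for train, validation in folds:
    print(len(train), validation)
```

Each row is used for validation exactly once. The model is fit on each training part, scored on the matching validation part, and the k scores are averaged.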

In the future, I plan to modify this by further tuning the parameters for the Random Forest method to get the best possible log-loss

You can find the code I used for this analysis at: https://github.com/jk34/Kaggle_SF_Crime_Classification/blob/master/run.r

Using Python to implement MapReduce for Data Science Coursera class

9/30/2015

0 Comments

In this post I will discuss the work I completed for the assignments that are part of the "Data Science at Scale" specialization offered by Coursera and the University of Washington. This is the same as the Coursera course "Introduction to Data Science"

In "assignment1", I used Python to access the Twitter API and determined the sentiment value (a measure of how positive or negative the text is) of tweets. Full details at: https://www.coursera.org/learn/data-manipulation/programming/AxbQn/twitter-sentiment-analysis

In "assignment3", I used the code given in MapReduce.py, provided by Coursera, that implements MapReduce in Python. I then implemented my own MapReduce algorithms to complete the assignments, which included counting the number of friends and determining the asymmetric friendships in an example social network with (person, friend) as key-value pairs. Full details at: https://www.coursera.org/learn/data-manipulation/programming/Dp7qI/thinking-in-mapreduce
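
The two friendship tasks can be sketched with a tiny in-process map/reduce runner. The runner below is my own minimal stand-in and does not use the API of the course's MapReduce.py; the names in the friend list are made up.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Group mapper output by key, then apply the reducer to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results

def count_mapper(record):
    person, friend = record
    yield person, 1

def count_reducer(person, ones):
    yield person, sum(ones)

def asym_mapper(record):
    person, friend = record
    # Key on the unordered pair so both directions land in the same group.
    yield tuple(sorted((person, friend))), (person, friend)

def asym_reducer(pair, directed_edges):
    # Only one direction present means the friendship is not reciprocated.
    if len(set(directed_edges)) == 1:
        yield directed_edges[0]

# Hypothetical (person, friend) records.
friends = [("Ann", "Bob"), ("Bob", "Ann"), ("Ann", "Cat"), ("Dan", "Ann")]
friend_counts = run_mapreduce(friends, count_mapper, count_reducer)
asymmetric = run_mapreduce(friends, asym_mapper, asym_reducer)
print(friend_counts)
print(asymmetric)
```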

The code I used to complete these assignments is in the "assignment1" and "assignment2" folders at the repository located at: https://github.com/jk34/Coursera_DataManipulation_MapReduce

Update to Kaggle NCAA competition

9/30/2015

0 Comments

I have updated my work on the data set provided by Kaggle on the NCAA competition. To reiterate, the data is provided at: https://www.kaggle.com/c/march-machine-learning-mania-2015

I also utilized the blog_utility.r script from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/

The changes I've made since the original post on July 6th are that I used BPI as an additional predictor, I got additional log-loss values for Logistic Regression, and I determined log-loss values using Random Forest and K-Nearest Neighbors. Details are below

I used the log-loss values of each model to determine which one predicts results that more accurately match actual results. I first used the following variables as predictors: SEED, WST_6, TWPCT (details about these variables are explained at https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/). I also included BPI as a predictor, which is a rough estimate of how good teams really are. For the 2011-12 season and afterwards, I used the BPI rankings from http://espn.go.com/mens-college-basketball/bpi/_/season/2012 . These BPI rankings were computed by ESPN, and I believe they do not factor in tournament results. That is because that link has “NCAA tournament information” which predicts the seeds and which teams will make the tournament or not. For seasons before 2011-12, I used the Pomeroy rankings instead (http://kenpom.com/index.php?y=2014), which also try to determine how good teams really are

I concluded that Logistic Regression was clearly a better model than rpart because it produced a lower log-loss value of .692 when using 2010-2012 as the training set and 2013 as the test set (1.06 for rpart). The log-loss score for Logistic Regression was further reduced to .684 when I only used BPI, SEED and WST6 as the predictors. It was further reduced to .682 when I just used BPI and WST6 (dropping SEED since it has a much higher p-value than A_WST6 and A_BPI). I got a more noticeable improvement to .64 when only using BPI.

If I used 2008-2012 as the training set instead, the log-loss was .688 when using BPI, SEED and A_WST6 as predictors. The log-loss remained the same if I just used BPI and A_WST6 (no SEED) as predictors. The log-loss noticeably decreased to .656 when only using BPI
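
When reading these numbers, it helps to know what a "no information" prediction scores. The sketch below computes the binary log-loss for made-up game outcomes and shows that predicting 0.5 for every game gives ln(2), about 0.693. The outcomes and probabilities are hypothetical and the code is Python rather than the R I used.

```python
import math

def binary_log_loss(outcomes, probs, eps=1e-15):
    """Average negative log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(outcomes)

# Hypothetical results: 1 means the first-listed team won.
outcomes = [1, 0, 1, 1]
coin_flip = binary_log_loss(outcomes, [0.5] * len(outcomes))
informed = binary_log_loss(outcomes, [0.8, 0.3, 0.6, 0.7])
print(coin_flip, informed)
```

So a log-loss near .69 is about what a coin flip would score, values such as .64 are a real improvement, and values above 1 are worse than guessing 0.5 every time.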

I then tried to determine the log-loss for Support Vector Machine, but I could not get the code for it to work.

I then tried Random Forest. It computed the log-loss value as .651 when I used the 2008-2012 seasons as the training set and 2013 as the test set with BPI, SEED and WST6 as the predictors, mtry=2 (randomly sampling 2 of the 3 predictors as candidates at each tree split), and ntree=5000 (using 5000 different trees). The log-loss values did not change much when I varied the number of trees: it was .650 for ntree=1000, .657 for ntree=500, and .648 for ntree=10000. These log-loss values are very similar to the values from logistic regression

Finally, I used K-nearest neighbors. The best log-loss value was .6947, for k=320, when using 2008-2012 as the training set and 2013 as the test set. This is slightly larger than the log-loss values from random forest and logistic regression. So the predictions from random forest and logistic regression are slightly more accurate than those from K-nearest neighbors (using 2008-12 as the training set and 2013 as the test set)

The changes I made are labeled as "second commit" at https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM, which provides all the data and code I used

Simulation of Yu-Gi-Oh card game written in C++ and C#

9/19/2015

1 Comment

In this post, I will talk about how I worked on simulating the Yu-Gi-Oh card game (http://www.yugioh-card.com/en/) written in C++ and C# as a personal project

In the game of Yu-Gi-Oh (http://www.yugioh-card.com/en/), there are 3 types of cards: Monsters, Traps, and Magic cards. An example of a card is the Blue-Eyes White Dragon (http://yugioh.wikia.com/wiki/Blue-Eyes_White_Dragon). As seen in the link, it is a Monster card. Monster cards have an Attack power, Defense power, and Star level. Magic and Trap cards do not. All types of cards can be played either in "face-down" mode or "face-up" mode. Whereas all Trap and Magic cards have an "effect", only some monsters have an "effect". For example, the Blue-Eyes White Dragon has no effect. However, the monster "Blade Knight" (http://yugioh.wikia.com/wiki/Blade_Knight) has an effect, which is "While you have 1 or less cards in your hand, this card gains 400 ATK. If you control no other monsters, negate the effects of Flip monsters destroyed by battle with this card."

This explains why in "Card.h" and "Card.cpp", I have the functions:

Also, in the game of Yu-Gi-Oh, there are 2 players playing against each other at a time. For this program, I specify the two users as "Kaiba" (http://www.yugioh.com/characters/seto-kaiba) and "Yugi" (http://www.yugioh.com/characters/yugi-muto). There are various phases to the game (http://www.wikihow.com/Play-Yu-Gi-Oh!).

In "Yugioh.cpp", I program the game. There is the deck, hand, and graveyard as explained in http://yugioh.wikia.com/wiki/Field.
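
To give a feel for the data involved, here is a short Python sketch of a card, a player's zones, and the first part of Blade Knight's effect quoted earlier. It is not a translation of my C++ or C# code: the class names, field names, and helper function are hypothetical, and the card stats are typed in by hand for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Card:
    name: str
    kind: str                      # "Monster", "Trap", or "Magic"
    attack: Optional[int] = None   # only Monster cards have these three
    defense: Optional[int] = None
    level: Optional[int] = None
    face_up: bool = False

@dataclass
class Player:
    name: str
    deck: List[Card] = field(default_factory=list)
    hand: List[Card] = field(default_factory=list)
    graveyard: List[Card] = field(default_factory=list)

    def draw(self) -> Card:
        """Move the top card of the deck into the hand."""
        card = self.deck.pop()
        self.hand.append(card)
        return card

def blade_knight_attack(card: Card, hand_size: int) -> int:
    """Blade Knight gains 400 ATK while its controller has 1 or fewer cards in hand."""
    return card.attack + 400 if hand_size <= 1 else card.attack

blue_eyes = Card("Blue-Eyes White Dragon", "Monster", attack=3000, defense=2500, level=8)
blade_knight = Card("Blade Knight", "Monster", attack=1600, defense=1000, level=4)

kaiba = Player("Kaiba", deck=[blue_eyes])
kaiba.draw()
print(kaiba.hand[0].name, blade_knight_attack(blade_knight, hand_size=1))
```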

The code for the C++ version of the program is at: https://github.com/jk34/Card_game_cpp

The code for the C# version with GUI is at: https://github.com/jk34/Card_game_Csharp

An image for one of the cards is:

Linear Regression and Bayesian Regression with MCMC on data from the NBA

7/6/2015

1 Comment

In this post I will talk about how I used Linear Regression and Bayesian Regression using MCMC simulations on NBA data for a recent course project in the Bayesian statistical methods course I enrolled in. Before performing a Bayesian Regression, I tried to determine which variables (such as STL, AST, REB, etc.) best predict wins (Winpct) by implementing the Best Subset method and Forward/Backward Step methods on a Linear Regression model. Then, I used JAGS to run the MCMC and checked that the chains had converged by inspecting the trace plots and autocorrelation plots

INTRODUCTION
During a National Basketball Association (NBA) season, large amounts of data are recorded during each game. There are "traditional" team and player statistics that are recorded in "box scores", such as the number of assists (AST), steals (STL), rebounds (REB), and field goal percentage (FG%). However, with the recent rise of applying statistics and data science to the game of basketball, there are now new "advanced" statistics such as Effective Field Goal Percentage (EFG%), Turnover Percentage (TOV%), Rebounding Percentage (REB%), and Free Throws Per Field Goal Attempt (FTpFGA), etc.

Previous studies have generated models to determine what basketball variables best predict the winning percentage of teams. For example, the statistician Dean Oliver came up with the "Four Factors" in one of his studies. He pointed out that EFG%, TOV%, REB%, and FTpFGA were the four important factors that determined whether teams would win or lose games. However, these factors can be refined into more specific variables, which could give more details as to what best predicts whether a team will win or lose a game. For example, REB% could be further divided into Offensive Rebounding Percentage (ORB%) and Defensive Rebounding Percentage (DRB%). We wish to investigate our own linear regression model and compare our results with the “Four Factors”. In addition, we are interested in applying Markov Chain Monte Carlo (MCMC) simulations to a Bayesian regression model. Therefore, the first purpose of this project is to establish a more advanced model using general linear regression to investigate whether we obtain similar results as the Four Factors. The second purpose is to evaluate the similarity between frequentist and Bayesian estimates

DATA AND VARIABLES
The data for our project was collected from NBA.com, basketball-reference.com, and nbaminer.com. We use the winning percentages and statistics for all 30 NBA teams. The statistics we will use as predictors are: 3-Point Make Percentage (X3Ppct), 2-Point Make Percentage (X2Ppct), Assist Per Turnover Ratio (ASTpTO), Assist Ratio (ASTRatio), Steal Percentage (STLpct), Block Percentage (BLKpct), Turnover Ratio (TORatio), Personal Fouls Drawn Rate (PFDRate), Free Throw Attempt Rate (FTARate), Free Throws per Field Goal Attempt (FTpFGA), Turnover Percentage (TOVpct), OREBpct, DREBpct, and the respective variables for each team's opponents. We use the winning percentage as the dependent variable

MODELS
We use the winning percentage and statistics for each team from the 2013-14 NBA season data to determine the multiple linear regression model. It is possible that all the predictors that we consider are strongly associated with teams' winning percentages, but it is more likely that the response is only related to a subset of the predictors. When trying to determine the relevant predictors, there are a total of 2^p models that contain subsets of p variables. This means that even for moderate p, trying out every possible subset of the predictors is infeasible. For example, if we try 26 different predictors, then with p=26, we have to consider 2^26=67,108,864 models. It is clearly not practical to try all of these possible models. Therefore, we use the Forward, Backward, and Stepwise Selection methods to determine which predictors are most relevant to predicting a team's winning percentage. By combining the results generated from all three methods, we obtained the variables that are most relevant to predicting a team's winning percentage. Two of the most common numerical measures of model fit, RSE and R^2, were used to compare models.
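
The arithmetic behind this choice is easy to check. The short sketch below counts the models that best subset selection would have to fit for p=26 and compares that with forward stepwise selection, which fits the null model plus p-k candidate models at step k.

```python
p = 26

# Best subset selection: every subset of the p predictors is a candidate model.
all_subsets = 2 ** p

# Forward stepwise selection: 1 null model, then p - k candidates at step k.
forward_models = 1 + sum(p - k for k in range(p))  # equals 1 + p(p+1)/2

print(all_subsets, forward_models)
```

Forward selection therefore looks at a few hundred models instead of tens of millions, at the cost of not being guaranteed to find the best subset.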

VARIABLE SELECTION
Combining the results from Forward, Backward, and Mixed Selection, we found that the most relevant predictors are X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct. All of these variables are contained in the "Four Factors" mentioned above, but our results have predictor variables that are more refined. For example, the shooting factor is measured using Effective Field Goal Percentage (EFG%), which is a combination of 2Ppct, 3Ppct, Opp2Ppct, and Opp3Ppct. From our results, only 2Ppct, Opp2Ppct, and Opp3Ppct are relevant. The rebounding factor is measured using Offensive and Defensive Rebound Percentage. Our results show that only Offensive Rebound Percentage (ORBpct) has a significant impact on winning games.

COEFFICIENTS FROM GENERAL LINEAR REGRESSION
After determining the relevant variables, we determine that the values for the regression coefficients are 3.84, -.0528, .00722, .00763, .0219, -2.54, and -5.53 for X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct, respectively.

BAYESIAN REGRESSION USING MCMC
After performing linear regression, we performed Bayesian regression. We use non-informative prior distributions for the parameters. That is, we use “flat” normal distributions. We then perform Bayesian regression using MCMC simulations using OpenBUGS and rjags from within R. Our results show that the mean coefficient values for the predictors are 3.86, -.0523, .00755, .00377, .0227, -2.36, and -5.60 for X2Ppct, TORatio, ORBpct, FTpFGA, OppTOVpct, Opp3Ppct, and Opp2Ppct, respectively. We verify that our Bayesian regression model is valid by checking that the trace plots for each predictor variable converge. The plots are below

We also obtain values close to 1.00 for the Gelman diagnostics, with the highest values being 1.08 for Opp2Ppct and 1.06 for Opp3Ppct. Also, the largest cross-correlation in magnitude that we obtain is -.467, between Opp2Ppct and Opp3Ppct. Initially, our auto-correlation plots do not decay to 0 for Opp2Ppct, Opp3Ppct, and X2Ppct. However, after increasing the thinning, the auto-correlation plots below decay to 0 as we expect
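
For readers unfamiliar with these diagnostics, the sketch below shows the basic form of the Gelman-Rubin statistic and what thinning does to a chain. It is a simplified Python illustration with made-up chains, not the R code we ran, and diagnostic tools in R may apply further corrections, so their values can differ slightly.

```python
import math
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

def thin(chain, step):
    """Keep every step-th draw to reduce auto-correlation between kept draws."""
    return chain[::step]

# Two short hypothetical chains for one coefficient.
chains = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
rhat = gelman_rubin(chains)
thinned = thin(list(range(10)), 3)
print(rhat, thinned)
```

Values close to 1 mean the chains agree with each other, which is why values such as 1.06 and 1.08 are taken as acceptable here.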

DISCUSSION
We see that the general linear regression and Bayesian regression produced similar values for the coefficients of the predictors. The one exception is the FTpFGA variable, which had a coefficient of .00763 for the linear regression model but a value of .00377 for the Bayesian regression model. These similar values match our expectations because we used Bayesian methods with non-informative priors, which should produce values similar to those from frequentist methods. For future research and to generate more accurate methods, there are several possibilities we could investigate. First, we could use informative priors for the Bayesian regression model. Second, we could utilize machine learning algorithms, such as the Random Forest algorithm, to determine which predictors are most relevant. In addition, we could use Support Vector Machine with our predictor variables to try to predict the winners of future NBA games.

ADDITIONAL STUDIES
I also completed analyses using Principal Component Analysis, Ridge regression, and Lasso regression. In addition to Forward Selection, Backward Selection, and Best Subset Selection, I used Principal Component Analysis to perform dimension reduction and variable selection. I wanted to know how much of the information and variance in the data is lost if I just assume everything is contained in a few principal components. This variance is the proportion of variance explained (PVE) for each principal component. In order to compute the cumulative PVE of the first n principal components, we can simply sum over each of the first n PVEs.
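
The PVE calculation can be sketched in a few lines of Python. The component variances below are made up, and my actual analysis is in "PCA.r".

```python
from itertools import accumulate

def cumulative_pve(variances):
    """Return the PVE of each principal component and the running total."""
    total = sum(variances)
    pve = [v / total for v in variances]
    return pve, list(accumulate(pve))

# Hypothetical variances of 5 principal components, largest first.
variances = [4.0, 2.0, 1.0, 0.5, 0.5]
pve, cumulative = cumulative_pve(variances)
print(pve)
print(cumulative)
```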

Looking at the figure below, one can see that the first 8 principal components explain almost 90% of the total variance. Unfortunately, there is no single answer to the question of how many principal components we need to use. There is a reasonable increase in the cumulative variance explained until around 6-8 principal components. Afterwards, adding more principal components only marginally increases the cumulative variance explained

In addition to the subset selection models, I can fit a model using a constraint which can shrink the coefficient estimates towards zero. Ridge regression uses a shrinkage penalty and tuning parameter "lambda". In the ridge-lambda plot below on the left, each curve corresponds to the ridge-regression coefficient estimate for one of the predictor variables, plotted as a function of lambda. The coefficients shrink towards 0 as lambda increases. A plot of the MSE vs log(lambda) for cross-validation is below on the right (in the plot, the "24" at the top indicates that all 24 predictors are used for each lambda).

One disadvantage of Ridge regression is that it includes all the predictors in the final model, unlike subset selection. Lasso uses a different penalty and also performs variable selection. It also must select a good value of lambda using cross-validation. Plots of coefficients vs lambda (numbers at top indicate number of non-zero coefficients used in model) and MSE vs lambda (the minimum CV MSE occurs when using around 21 predictors, but the CV MSE only rises rapidly once fewer than about 8-9 predictors remain, so we use 8 predictors) are located below, to the left and right, respectively
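
The difference between the two penalties is easiest to see in the simplified special case discussed in "An Introduction to Statistical Learning", where each coefficient can be shrunk separately (one observation per coefficient, identity design, no intercept). In that case ridge divides each least-squares estimate by (1 + lambda), while lasso moves it towards 0 by lambda/2 and sets it to exactly 0 if it is smaller than that. The sketch below uses made-up coefficient values and is not the glmnet-style fit in "RidgeLasso.r".

```python
def ridge_shrink(estimate, lam):
    """Ridge in the simple special case: proportional shrinkage, never exactly 0."""
    return estimate / (1.0 + lam)

def lasso_shrink(estimate, lam):
    """Lasso in the simple special case: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if estimate > threshold:
        return estimate - threshold
    if estimate < -threshold:
        return estimate + threshold
    return 0.0

# Hypothetical least-squares estimates.
estimates = [3.0, -1.5, 0.4, -0.2]
lam = 1.0
ridge = [ridge_shrink(b, lam) for b in estimates]
lasso = [lasso_shrink(b, lam) for b in estimates]
print(ridge)
print(lasso)
```

This is why lasso performs variable selection: small coefficients are set to exactly 0, whereas ridge keeps every predictor with a smaller coefficient.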

For more details, see my github page at https://github.com/jk34/Bayes1_MarkovChainMonteCarlo_SDS384/tree/master/NBAproject

The completed report is the file "Report.pdf". The analyses using Principal Component Analysis, Ridge regression, and Lasso regression are in the files "PCA.r" and "RidgeLasso.r"

Random forest, regression and other statistical analysis on NBA data sets

7/6/2015

0 Comments

In NBA_WinsPredict1415.r, I used data obtained from http://www.basketball-reference.com/leagues/NBA_2015_advanced.html I also modified the file so that WinPct is the winning percentage of the player's team as of 3/19/15

I wanted to determine which player statistics best predict the team's winning percentage

I set WinPct as the dependent variable. I used the following as predictors: PER, Ts_pct, ORBpct, DRBpct, TRBpct, ASTpct, STLpct, BLKpct, TOVpct, OWS, DWS, WS, WSp48, OBPM, DBPM, BPM, and VORP (see basketball-reference.com for the meaning of these variables). I ran a Random Forest on nearly every predictor variable. The relative importance of each predictor is displayed in ggplot_RF.jpg (at the bottom of this post). Based on this, WSp48 (the player's win shares per 48 minutes) best predicts a team's winning percentage, followed by DWS (defensive win shares). It seems strange that WSp48 and DWS are stronger predictors of a team's victories than OWS and WS, since WS is just OWS+DWS (see NBA_WinsPredict1415.r for details)

I then performed a multiple linear regression on that same dataset. PER, ORBpct, DRBpct, TRBpct, ASTpct, TOVpct, WSp48, OBPM, DBPM, and BPM all have p-values less than 0.05. However, OWS, DWS, and WS all have p-values greater than .74. Using relimp, the variables with the highest relative importance were WSp48, ORBpct, DRBpct, and TRBpct

I also wanted to see if the regression and random forest would be any different if I focused only on guards (PG) and (SG). The statistical analysis for only Guards is in "NBA_WinsPredict1314_weighted_PG_SG.r"

In PTPM.r, I used data that I downloaded from https://docs.google.com/spreadsheets/d/1GtCDQw94kpcOw_kPhyH8F5cIjPT3QTsOGqvrX_hMCo8/edit?pli=1#gid=0 I used a multiple linear regression to determine which variables had the greatest impact on the team's winning percentage. TeamDefEffect had the highest relative importance value, but its p-value is .585

I also wanted to see which of PTPM, WSp48 and VORP best predicts the team win percentage for a given player. The R-script is in "NBA_WinsPredict1415_WS_VORP_vsPTPM.r"

I also created a heat map for PTPM, PER, DWS, WS, WSp48, and VORP, but only for the players with the top 15 values of WinPctMP. See "Heatmap1415_WS_VORP_PTPM.r" (on my github page; see details at the bottom of this post) and the heat map "WSvsVORPvsPTPM_top15_Heatmap.jpg" at the bottom of this post

I also created line charts of the VORP vs Season and WSp48 vs Season for several NBA players. See LineChartOfPlayersVORP_WSp.r and LineChart_VORP_WSpvsSeason.jpg (plot is at the bottom of this post). I also created Loess regression plots as seen in NBA_WinsPredict1314_Loess.r and Loess_WinPctMPvsWS_VORP_PER1.jpg (See bottom)

The equivalent work I have done using Python can be seen at https://github.com/jk34/Python/

Kaggle NCAA competition

7/6/2015

0 Comments

This post covers my work using machine learning algorithms on NCAA basketball tournament data. I obtained the data from https://www.kaggle.com/c/march-machine-learning-mania-2015

I also utilize the blog_utility.r from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/

The analysis I have performed is in ncaa.r. I have tried to use trees, logistic regression, and support vector machines to generate predictive models on previous seasons and tournament results and use the 2014 NCAA tournament results as the test data

So far, the log-loss value of .5935 for logistic regression is better than the 1.008 log-loss for rpart

See https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM for more details

Author

Hello world, my name is Jerry Kim. I have a Master's Degree in Physics and years of work experience in Image Processing, Machine Learning, and Deep Learning. I have mostly used C++, Matlab, and Python. I created this website to showcase a small sample of the things that I have worked on
