I have updated my work on the data set that Kaggle provides for its NCAA competition. To reiterate, the data is available at: https://www.kaggle.com/c/march-machine-learning-mania-2015
I also utilized the blog_utility.r script from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/

In this post, I converted the R code provided in the link above into Python code with Pandas. I will briefly discuss how the code works:

1st step: Look at all tournament games for a given season (the training set should contain multiple seasons) by reading in

    tourneyRes = pd.read_csv("tourney_compact_results.csv")

Then we loop through each row of that dataframe and concatenate the season “seasonletter” with the teamID of the winning team (wteam in tourneyRes) and the teamID of the losing team (lteam in tourneyRes). We place these newly formed strings, together with the results, into a new dataframe, model_data_frame. model_data_frame now contains “matchup” (e.g. 2008_1234_5678), a Pandas Series, concatenated with a “result/Win” Pandas Series of 0’s and 1’s, whose value depends on the comparison below:

    ixs = season_matches['wteam'] < season_matches['lteam']

Then we create a new dataframe df2 that splits “matchup” (for example, 2008_1234_5678) and stores the teamIDs in HomeID and AwayID. Then we JOIN these columns onto model_data_frame.

2nd step: Use the team_metrics_by_season function,

    teamMetrics = team_metrics_by_season(seasonletter)

to read in tourney_seeds.csv, which contains the teamID, BPI, and SEED values for each team that made the tournament in a given season. Then we look through all the results for the given regular season (the season according to “seasonletter”), which we obtain from regular_season_compact_results.csv. We keep track of the number of wins and losses, and then compute the number of wins divided by the total number of games (TWPCT) for each tournament team in that regular season.
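The first two steps can be sketched roughly as follows. This is a minimal illustration, not the actual ncaa.py code: the toy dataframes stand in for the Kaggle CSV files, the "season" column name and the helper names build_training_frame and win_pct are hypothetical, and the sketch assumes the lower team ID goes first in the matchup string, with Win equal to 1 when that lower-ID team won.

```python
import pandas as pd

# Toy stand-ins for tourney_compact_results.csv and
# regular_season_compact_results.csv (all values are made up).
tourney_res = pd.DataFrame({
    "season": [2008, 2008, 2008],
    "wteam": [1291, 1164, 1400],
    "lteam": [1164, 1400, 1291],
})
regular_res = pd.DataFrame({
    "season": [2008, 2008, 2008, 2008],
    "wteam": [1164, 1164, 1164, 1291],
    "lteam": [1291, 1400, 1291, 1164],
})


def build_training_frame(results, season):
    """Return Matchup, Win, HomeID, AwayID for one season's tournament games."""
    season_matches = results[results["season"] == season]
    # Assumption: the lower team ID goes first in the matchup key.
    low = season_matches[["wteam", "lteam"]].min(axis=1)
    high = season_matches[["wteam", "lteam"]].max(axis=1)
    # Same comparison as in the post: True when the winner has the lower ID.
    ixs = season_matches["wteam"] < season_matches["lteam"]
    frame = pd.DataFrame({
        "Matchup": str(season) + "_" + low.astype(str) + "_" + high.astype(str),
        "Win": ixs.astype(int),
        "HomeID": low,
        "AwayID": high,
    })
    return frame.reset_index(drop=True)


def win_pct(regular_results, season, team_ids):
    """Wins divided by games played (TWPCT) for each listed team."""
    games = regular_results[regular_results["season"] == season]
    wins = games["wteam"].value_counts()
    losses = games["lteam"].value_counts()
    rows = []
    for team in team_ids:
        w = int(wins.get(team, 0))
        l = int(losses.get(team, 0))
        total = w + l
        # A team with no games gets NaN rather than a division error.
        rows.append({"TEAMID": team, "TWPCT": w / total if total else float("nan")})
    return pd.DataFrame(rows)


model_data_frame = build_training_frame(tourney_res, 2008)
team_twpct = win_pct(regular_res, 2008, [1164, 1291, 1400])
```

Building the columns with vectorized min/max avoids the row-by-row loop, but a loop over the rows would give the same frame.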
“teamMetrics” now has its columns set as

    team_metrics.columns = ['TEAMID', 'A_TWPCT', 'A_SEED', 'A_BPI']

3rd step: Then we MERGE model_data_frame (containing HomeID, AwayID, and Results of 0’s and 1’s for the tournament) with teamMetrics (containing TEAMID, TWPCT, SEED, BPI based on regular season data) ON HomeID=TEAMID. Then we do the same for AwayID.

The resulting training set looks like:

head(trainData)
    Matchup         Win  HomeID  AwayID  A_TWPCT  A_SEED  A_BPI  B_TWPCT  B_SEED  B_BPI
12  2008_1164_1291  0    1164    1291    0.412    16      288    0.562    16      163

4th step: The test set is similar to the above, except we first start off by reading in all the teams and their seeds for a particular season:

    tourneySeeds = pd.read_csv("tourney_seeds.csv", sep=',')
    playoffTeams = season_seeds['team']
    playoffTeams = playoffTeams.sort_values(ascending=True)

Then we create a Pandas Series containing every possible matchup of teams in a particular tournament:

    idcol = pd.Series(str_seasonletter + "_" + "_".join([str(a), str(b)]) for a, b in combinations(playoffTeams, 2))
    form = idcol.to_frame()
    form.columns = ['Matchup']
    form['result'] = np.nan

We assign NaN to every matchup because we don’t yet know the results for a test set. The resulting test set looks identical to the training set, except the “Win” column is all NA values.

In addition to Logistic Regression, I also tried to use Neural Networks to generate predictions of which team would win in a potential matchup. However, I could not get the code to work. I will attempt to fix this in the near future.

The Python script is the "ncaa.py" file at https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM, which provides all the data and code I used.

LINEAR DISCRIMINANT ANALYSIS (LDA)

I also used LDA in the R code (the updated "ncaa.r" file on my Github page). I obtained a logloss score of .631 (when using just BPI, SEED, and TWPCT as features). Since a lower logloss is better, this improves on the .65 values I got from Logistic Regression and Random Forest.
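Going back to the 3rd and 4th steps, the merge and the test-set construction can be sketched roughly as below, together with the logloss metric used to compare the models. This is only an illustration under stated assumptions: the toy team_metrics values are made up, the helper names attach_metrics, build_test_frame, and log_loss are hypothetical rather than taken from ncaa.py, and the test frame uses a "Win" column so that it lines up with the training set.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy stand-in for the output of team_metrics_by_season (values are made up).
team_metrics = pd.DataFrame({
    "TEAMID": [1164, 1291, 1400],
    "TWPCT": [0.412, 0.562, 0.700],
    "SEED": [16, 16, 1],
    "BPI": [288, 163, 5],
})


def attach_metrics(frame, metrics):
    """Merge per-team metrics twice: A_ columns for HomeID, B_ for AwayID."""
    out = frame
    for prefix, id_col in (("A_", "HomeID"), ("B_", "AwayID")):
        renamed = metrics.rename(
            columns=lambda c: c if c == "TEAMID" else prefix + c
        )
        out = out.merge(renamed, left_on=id_col, right_on="TEAMID", how="left")
        out = out.drop(columns="TEAMID")
    return out


def build_test_frame(season, playoff_teams):
    """One row per possible pairing, lower team ID first, result unknown."""
    teams = sorted(playoff_teams)
    form = pd.DataFrame({
        "Matchup": [f"{season}_{a}_{b}" for a, b in combinations(teams, 2)],
    })
    form["Win"] = np.nan  # results are not known for a test set
    ids = form["Matchup"].str.split("_", expand=True)
    form["HomeID"] = ids[1].astype(int)
    form["AwayID"] = ids[2].astype(int)
    return form


def log_loss(y_true, p, eps=1e-15):
    """Mean binary log loss; lower is better."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


test_data = attach_metrics(
    build_test_frame(2008, team_metrics["TEAMID"]), team_metrics
)
```

As a reference point for the scores above, predicting 0.5 for every game gives a logloss of ln 2, roughly 0.693, so both .631 and .65 beat a coin flip.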
Author: Hello world, my name is Jerry Kim. I have a Master's Degree in Physics and years of work experience in Image Processing, Machine Learning, and Deep Learning. I have mostly used C++, Matlab, and Python. I created this website to showcase a small sample of the things that I have worked on.
March 2017