In this blog post I will discuss my updated work on the San Francisco Crime Classification competition by Kaggle. The data and description of the competition are located at: https://www.kaggle.com/c/sf-crime
I had previously used Linear Discriminant Analysis (LDA) and Random Forest. I have now been able to run Boosting and obtained much better log-loss scores.
First of all, following https://brittlab.uwaterloo.ca/2015/11/01/KaggleSFcrime/ I was able to generate the new features "Intersection" and "Night", use the data.table package to read large csv files faster, and use sparse matrices to save memory.
I then implemented Gradient Boosting using the "caret" and "xgboost" packages. I first tried eta=0.3. (eta is the learning rate that shrinks the contribution of each new tree, so the larger eta is, the weaker the regularization.) With Cross-Validation using 3 folds, I found that the 16th iteration produced the smallest logloss.mean value of 2.56. However, my previous submission to Kaggle using LDA produced a log-loss of around 2.58. Because the validation error tends to be smaller than the test error, the test score would likely be little or no better than the 2.58 from LDA, so I knew that this 2.56 value was unacceptable.
I then guessed that the boosted model was overfitting the training set, so I increased the regularization by decreasing eta to 0.1. According to the xgboost documentation page, if you decrease eta, you must increase the number of boosting iterations. I thus tried 50 iterations. I then submitted this to Kaggle, and my logloss score was 2.43! That was much better than the 2.58 I got from LDA.
I should also note that I tried parameter tuning with "caret". However, it ran too slowly on my machine. Just a 2-fold CV with 40 max iterations on 3 different eta values ran for over 8 hours!
In addition to Boosting, I also tried Random Forests, the bigRF package, and Neural Networks. In my previous analysis using Random Forests, I kept running into errors because my computer did not have enough memory. I recently purchased a new laptop with more RAM, but I still get the same insufficient-memory errors when running Random Forests. I also could not get the bigRF package to work; I believe this is because it doesn't work with my version of R. As for Neural Networks, I am working on that as I type this post.
You can find the code I used for this analysis at: https://github.com/jk34/Kaggle_SF_Crime_Classification/blob/master/run_improved.r
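As a rough illustration of the "Intersection" and "Night" features mentioned above, here is a minimal Python sketch (the analysis itself was done in R). The rules are assumptions, not taken from the linked post: an address counts as an intersection when it contains "/", and "night" is taken to be 22:00 to 06:00. The column names Dates and Address and the sample rows are also assumed for illustration.

```python
from datetime import datetime

def is_intersection(address: str) -> int:
    """Return 1 when the address names two streets separated by '/' (assumed rule)."""
    return int("/" in address)

def is_night(timestamp: str, start_hour: int = 22, end_hour: int = 6) -> int:
    """Return 1 when the hour falls in the assumed night window [22:00, 06:00)."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    return int(hour >= start_hour or hour < end_hour)

# Hypothetical sample rows, made up for this example.
rows = [
    {"Dates": "2015-05-13 23:53:00", "Address": "OAK ST / LAGUNA ST"},
    {"Dates": "2015-05-13 14:10:00", "Address": "1500 Block of LOMBARD ST"},
]

features = [
    {"Intersection": is_intersection(r["Address"]), "Night": is_night(r["Dates"])}
    for r in rows
]
```

Both features are simple 0/1 indicators, which is also why they fit well in a sparse matrix.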
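To make the log-loss scores quoted above easier to interpret, here is a small Python sketch of the standard multi-class log-loss: the average of minus the log of the probability assigned to the true class. The function name, the clipping constant eps, and the toy data are my own choices for illustration, not code from run_improved.r.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average of -log(probability assigned to the true class)."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a predicted probability of exactly 0 does not give log(0).
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)

labels = [0, 2, 1, 0]

# Guessing uniformly over 3 classes gives ln(3) regardless of the labels.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]
baseline = multiclass_log_loss(labels, uniform)

# Putting more probability on the true class lowers the score.
confident = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]
better = multiclass_log_loss(labels, confident)
```

Lower is better, and a uniform guess over K classes always scores ln(K), which gives a reference point for judging values such as 2.58 or 2.43.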
I have updated my work on the data set provided by Kaggle on the NCAA competition. To reiterate, the data is provided at: https://www.kaggle.com/c/march-machine-learning-mania-2015
I also used the blog_utility.r script from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/ In this post, I converted the R code provided in the link above into Python code with Pandas. I will briefly discuss how the code works:
1st step: Look at all games from the tournament for a given season (the training set should contain multiple seasons) by reading in
    tourneyRes = pd.read_csv("tourney_compact_results.csv")
Then we loop through each row of that dataframe and concatenate the season “seasonletter” with the teamID of the winning team (wteam in tourneyRes) and the teamID of the losing team (lteam in tourneyRes). We place these newly formed strings, together with the results, into a new data frame model_data_frame. model_data_frame now contains “matchup” (ex. 2008_1234_5678), which is a Pandas Series, concatenated with the “result/Win” Pandas Series. The latter contains 0’s and 1’s, depending on whether the winning team's ID is lower than the losing team's ID:
    ixs = season_matches['wteam'] < season_matches['lteam']
Then we create a new dataframe df2 that splits “matchup” (for example, 2008_1234_5678) and stores the teamIDs in HomeID and AwayID. Then we JOIN these columns onto model_data_frame.
2nd step: Use the teamMetrics = team_metrics_by_season(seasonletter) function to read in tourney_seeds.csv, which contains teamID, BPI, and SEED values for each team that made the tournament in a given season. Then we go through all the results for the given regular season (the season according to “seasonletter”), which we obtain from regular_season_compact_results.csv. We keep track of the number of wins and losses, and then compute the number of wins divided by the total number of games (TWPCT) for each tournament team in that regular season. “teamMetrics” now has its columns set as
    team_metrics.columns = ['TEAMID', 'A_TWPCT', 'A_SEED', 'A_BPI']
3rd step: Then we MERGE model_data_frame (containing HomeID, AwayID, and Results of 0’s and 1’s for the tournament) with teamMetrics (containing TEAMID, TWPCT, SEED, BPI based on regular season data) ON HomeID=TEAMID. Then we do the same for AwayID. The resulting training set looks like:
    head(trainData)
               Matchup  Win  HomeID  AwayID  A_TWPCT  A_SEED  A_BPI  B_TWPCT  B_SEED  B_BPI
    12  2008_1164_1291    0    1164    1291    0.412      16    288    0.562      16    163
4th step: The test set is built similarly, except we start by reading in all the teams and their seeds for a particular season:
    tourneySeeds = pd.read_csv("tourney_seeds.csv", sep=',')
    playoffTeams = season_seeds['team']
    playoffTeams = playoffTeams.sort_values(ascending=True)
Then we create a Pandas Series containing every possible matchup of teams in a particular tournament:
    idcol = pd.Series(str_seasonletter + "_" + "_".join([str(a), str(b)]) for a, b in combinations(playoffTeams, 2))
    form = idcol.to_frame()
    form.columns = ['Matchup']
    form['result'] = np.nan
We assign NaN to every matchup because we don’t yet know the results for a test set. The resulting test set looks identical to the training set, except the “Win” column is all NA values.
In addition to Logistic Regression, I also tried to use Neural Networks to predict which team would win in a potential matchup. However, I could not get the code to work. I will attempt to fix this in the near future.
The Python script is the "ncaa.py" file at https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM, which provides all the data and code I used.
LINEAR DISCRIMINANT ANALYSIS (LDA)
I also used LDA in the R code (the updated "ncaa.r" file on my Github page). I obtained a logloss score of .631 (when using just BPI, SEED, and TWPCT as features). This is better than the .65 values I got from Logistic Regression and Random Forest.
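The 2nd and 3rd steps can be sketched in the same way. The following is an illustration in plain Python with made-up game results, not the actual team_metrics_by_season function. It computes only TWPCT; SEED and BPI would come from tourney_seeds.csv and are left out. The names win_percentages and attach_metrics are hypothetical.

```python
from collections import defaultdict

def win_percentages(regular_season_games, tournament_teams):
    """Return {team_id: wins / games_played} for each tournament team."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for game in regular_season_games:
        wins[game["wteam"]] += 1
        played[game["wteam"]] += 1
        played[game["lteam"]] += 1
    return {t: wins[t] / played[t] for t in tournament_teams if played[t] > 0}

def attach_metrics(row, metrics_by_team):
    """Copy a matchup row and add A_* (HomeID) and B_* (AwayID) columns."""
    merged = dict(row)
    for prefix, id_column in (("A", "HomeID"), ("B", "AwayID")):
        for name, value in metrics_by_team[row[id_column]].items():
            merged[f"{prefix}_{name}"] = value
    return merged

# Made-up regular-season results for illustration only.
games = [
    {"wteam": 1164, "lteam": 1291},
    {"wteam": 1291, "lteam": 1164},
    {"wteam": 1291, "lteam": 1200},
]
twpct = win_percentages(games, [1164, 1291])
metrics = {team: {"TWPCT": pct} for team, pct in twpct.items()}
row = attach_metrics(
    {"Matchup": "2008_1164_1291", "Win": 0, "HomeID": 1164, "AwayID": 1291},
    metrics,
)
```

Looking the metrics up once by HomeID and once by AwayID plays the role of the two MERGE operations, giving the A_ and B_ prefixed columns.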
Author
Hello world, my name is Jerry Kim. I have a Master's Degree in Physics and years of work experience in Image Processing, Machine Learning, and Deep Learning. I mostly have used C++, Matlab, and Python. I created this website to showcase a small sample of the things that I have worked on.
Archives
March 2017