In this blog post I will discuss my updated work on the San Francisco Crime Classification competition by Kaggle. The data and a description of the competition are available at: https://www.kaggle.com/c/sf-crime
In my previous analysis I used Linear Discriminant Analysis and Random Forests. I have since been able to run boosting and obtained much better log-loss scores.
First of all, following this write-up, https://brittlab.uwaterloo.ca/2015/11/01/KaggleSFcrime/, I generated the new features "Intersection" and "Night", used the data.table package to read in the large CSV files faster, and used sparse matrices to save memory.
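A rough sketch of that feature engineering, assuming the competition's column names ("Address", "Dates"); my actual code follows the linked write-up:

```r
library(data.table)
library(Matrix)

train <- fread("train.csv")  # fread is much faster than read.csv on a file this size

# "Intersection": crossing addresses contain a "/" (e.g. "OAK ST / LAGUNA ST")
train[, Intersection := as.integer(grepl("/", Address))]

# "Night": flag incidents between 18:00 and 06:00, based on the timestamp hour
train[, Hour := as.integer(substr(Dates, 12, 13))]
train[, Night := as.integer(Hour >= 18 | Hour < 6)]

# Build a sparse design matrix so the one-hot-encoded factors fit in memory
X <- sparse.model.matrix(~ . - 1,
                         data = train[, .(PdDistrict, Intersection, Night, X, Y)])
```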
I then implemented gradient boosting using the "caret" and "xgboost" packages. I first tried eta = 0.3 (eta is the learning rate, or shrinkage: the larger eta is, the less each boosting step is shrunk, and so the weaker the regularization). With 3-fold cross-validation, I found that the 16th iteration produced the smallest mean log-loss, 2.56. However, my previous submission to Kaggle using LDA had scored a log-loss of around 2.58, and because cross-validation error tends to understate test error, I knew that this 2.56 value was unacceptable.
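The cross-validation run looked roughly like this; here "X" is the sparse feature matrix and "y" the 0-indexed crime-category labels (both names are my assumptions, not necessarily what my script calls them):

```r
library(xgboost)

params <- list(objective  = "multi:softprob",
               eval_metric = "mlogloss",
               num_class  = 39,   # the competition has 39 crime categories
               eta        = 0.3)

# 3-fold CV; inspect the per-iteration test mlogloss afterwards to find the
# best round (the 16th, at about 2.56, in my run)
cv <- xgb.cv(params = params, data = X, label = y,
             nrounds = 40, nfold = 3)
```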
I then guessed that the eta = 0.3 boosting model was overfitting the training set, so I increased the regularization by decreasing eta to 0.1. According to the xgboost documentation page, if you decrease eta, you must increase the number of boosting iterations, so I tried 50 iterations. I then submitted this to Kaggle, and my log-loss score was 2.43! That was much better than the 2.58 I got from LDA.
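The final fit and submission can be sketched as follows, again assuming the "X", "y", and "Xtest" objects from above:

```r
library(xgboost)

params$eta <- 0.1   # smaller learning rate, so more rounds are needed
fit <- xgboost(params = params, data = X, label = y, nrounds = 50)

# Predict class probabilities for the test set; xgboost returns one long
# vector, so reshape it to one row per incident and one column per crime
# category, which is the layout the Kaggle submission file expects
prob <- predict(fit, Xtest)
prob <- matrix(prob, ncol = 39, byrow = TRUE)
```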
I should also note that I tried parameter tuning with "caret". However, it ran too slowly on my machine: just a 2-fold CV over 3 different eta values, with a maximum of 40 iterations, ran for over 8 hours!
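For reference, the caret tuning I attempted was roughly the following; the exact grid values here are illustrative, and the train() call is left commented out since it is the step that ran for hours:

```r
library(caret)

# 2-fold CV, scoring by multinomial log-loss
ctrl <- trainControl(method = "cv", number = 2,
                     classProbs = TRUE, summaryFunction = mnLogLoss)

# caret's "xgbTree" method requires all seven of these tuning parameters;
# only nrounds and eta are actually varied here
grid <- expand.grid(nrounds = seq(10, 40, by = 10),
                    eta = c(0.1, 0.2, 0.3),
                    max_depth = 6, gamma = 0,
                    colsample_bytree = 1, min_child_weight = 1,
                    subsample = 1)

# fit <- train(x = as.matrix(X), y = yFactor, method = "xgbTree",
#              trControl = ctrl, tuneGrid = grid, metric = "logLoss")
```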
In addition to boosting, I also tried Random Forests, the bigRF package, and Neural Networks. In my previous analysis, Random Forests kept failing with out-of-memory errors. I recently purchased a new laptop with more RAM, but I still get the same out-of-memory errors when running Random Forests. I also could not get the bigRF package to work; I believe it is incompatible with my version of R. As for Neural Networks, I am working on that as I type this post.
You can find the code I used for this analysis at: https://github.com/jk34/Kaggle_SF_Crime_Classification/blob/master/run_improved.r
Hello world, my name is Jerry Kim. I have a background in physics and programming, and I am interested in a career as a software engineer or data scientist. I created this website to showcase a small sample of the things I have worked on.