In this blog post I will discuss my work on the data provided for the San Francisco Crime Classification competition on Kaggle. The data and a description of the competition are located at: https://www.kaggle.com/c/sf-crime
I ran Linear Discriminant Analysis (LDA) and Random Forest on the training data in order to predict the type of crime that occurred in the test set. I could not use Principal Component Analysis for dimension reduction because the data contains only categorical variables. As explained in the book "An Introduction to Statistical Learning" by James, Witten, Hastie, and Tibshirani, LDA is a popular choice when the outcome variable has more than 2 classes, as it does in this dataset. The book also notes that the parameter estimates for logistic regression can be unstable when the classes are well separated, whereas LDA does not have this problem.
I got a better log-loss score with LDA than with Random Forest. For LDA, I used the first 100000 rows of the training file as the validation set and the remaining rows as the training set. The log-loss I obtained was 2.547. I could not do the same with Random Forest because I kept running into memory errors, since Random Forest uses a lot of the computer's RAM. Therefore, I had to use smaller training and validation sets. With ntree=100, the log-loss was -3.18 when I used only rows 850001:878049 of the original training file as the training set and the first 100 rows of that subset as the validation set.
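For readers unfamiliar with the metric, here is a minimal sketch of multiclass log-loss in Python (my actual analysis is in R, so this is only an illustration). It assumes the usual definition, the average negative log of the probability assigned to the true class, with probabilities clipped away from 0 and 1. The function name, the clipping constant, and the toy categories are my own choices for the example, not something taken from the competition code.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    true_labels: list of class names, one per row.
    predicted_probs: list of dicts mapping class name -> predicted probability.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a predicted probability of 0 does not produce log(0).
        p = min(max(probs.get(label, 0.0), eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)


# Toy example with three hypothetical categories.
labels = ["THEFT", "ASSAULT"]
probs = [
    {"THEFT": 0.7, "ASSAULT": 0.2, "FRAUD": 0.1},
    {"THEFT": 0.5, "ASSAULT": 0.25, "FRAUD": 0.25},
]
score = multiclass_log_loss(labels, probs)

# Reference point: a uniform guess over K classes scores ln(K).
uniform = {c: 0.25 for c in ["A", "B", "C", "D"]}
uniform_score = multiclass_log_loss(["A", "C"], [uniform, uniform])
```

Under this definition the score is never negative and lower is better, so a negative figure usually means the sign was flipped somewhere in the calculation. The uniform-guess value of ln(K) is a handy baseline to compare a model against.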
I then tried to improve the log-loss by using dplyr to build a training set that contained 6 sampled rows for each outcome of the dependent variable (crime Category). I used the first 50000 rows of the training file as the validation set. I then ran Random Forest with 5000 trees and computed a log-loss of 3.856. It worsened to 4.856 when I used 200 samples for each possible crime category.
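The per-category sampling step can be sketched roughly as follows. I did this with dplyr in R; the Python version below is only an illustration of the idea, and the function name, the seed, and the toy rows are hypothetical. It groups rows by their Category value and draws at most a fixed number of rows from each group.

```python
import random
from collections import defaultdict

def sample_per_class(rows, label_key, n_per_class, seed=0):
    """Draw up to n_per_class rows (without replacement) from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sample = []
    for group in by_class.values():
        # Rare classes may have fewer rows than requested; take what exists.
        k = min(n_per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: hypothetical rows with very unequal class sizes.
rows = (
    [{"Category": "THEFT", "PdDistrict": "MISSION"}] * 10
    + [{"Category": "ASSAULT", "PdDistrict": "CENTRAL"}] * 4
    + [{"Category": "FRAUD", "PdDistrict": "RICHMOND"}] * 2
)
balanced = sample_per_class(rows, "Category", n_per_class=3)
counts = {}
for row in balanced:
    counts[row["Category"]] = counts.get(row["Category"], 0) + 1
```

A sample like this keeps memory use small and guarantees every category is seen, but it also flattens the class proportions. That may be one reason the log-loss did not improve, since the metric rewards probabilities that match how often each crime actually occurs.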
So the log-loss for LDA was better than any of the log-loss values I computed with Random Forest.
I then used k-fold cross validation on LDA before creating a submission file containing the predicted probabilities on the test data provided by Kaggle. With 10 folds, the average log-loss was 2.668.
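The k-fold procedure itself can be sketched like this. To keep the example self-contained in plain Python, I swap in a much simpler stand-in model (smoothed class frequencies) in place of LDA, so the numbers it produces say nothing about the results above. All function names, the smoothing parameter, and the toy labels are assumptions made for the illustration.

```python
import math
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_class_frequencies(labels, all_classes, alpha=1.0):
    """Stand-in model (NOT LDA): smoothed class frequencies."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(all_classes)
    return {c: (counts[c] + alpha) / total for c in all_classes}

def log_loss(labels, probs):
    """Average negative log-probability of the true labels."""
    return -sum(math.log(probs[y]) for y in labels) / len(labels)

def cross_validated_log_loss(labels, k=10, seed=0):
    """Hold out each fold once, fit on the rest, and average the scores."""
    all_classes = sorted(set(labels))
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train = [y for i, y in enumerate(labels) if i not in held_out_set]
        valid = [labels[i] for i in held_out]
        probs = fit_class_frequencies(train, all_classes)
        scores.append(log_loss(valid, probs))
    return sum(scores) / len(scores), scores


# Toy labels with hypothetical categories.
labels = ["THEFT"] * 60 + ["ASSAULT"] * 30 + ["FRAUD"] * 10
mean_score, fold_scores = cross_validated_log_loss(labels, k=10)
```

Every row ends up in the validation fold exactly once, so the averaged score is less sensitive to which rows happen to be held out than a single train/validation split.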
In the future, I plan to tune the Random Forest parameters further to get the best possible log-loss.
You can find the code I used for this analysis at: https://github.com/jk34/Kaggle_SF_Crime_Classification/blob/master/run.r
Hello world, my name is Jerry Kim. I have a background in physics and programming, and I am interested in a career as a software engineer or data scientist. I created this website to showcase a small sample of the things that I have worked on.