In this post I will discuss my work on the assignments from the "Data Science at Scale" specialization offered by Coursera and the University of Washington. This covers the same material as the Coursera course "Introduction to Data Science".
In "assignment1", I used Python to access the Twitter API and computed a sentiment value for each tweet (a measure of how positive or negative its words are). Full details at: https://www.coursera.org/learn/data-manipulation/programming/AxbQn/twitter-sentiment-analysis
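The scoring step can be sketched as follows. This is a minimal illustration assuming an AFINN-style sentiment file with one "term&lt;TAB&gt;score" entry per line (as in the assignment), not the exact submitted code:

```python
# Minimal sketch of tweet sentiment scoring, assuming an AFINN-style
# sentiment file ("term<TAB>score" per line); illustrative only,
# not the exact code submitted for the assignment.

def load_scores(lines):
    """Parse 'term<TAB>score' lines into a word -> score dict."""
    scores = {}
    for line in lines:
        term, score = line.rstrip("\n").split("\t")
        scores[term] = int(score)
    return scores

def tweet_sentiment(text, scores):
    """Sum the scores of known words in the tweet; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.lower().split())

afinn = load_scores(["good\t3", "bad\t-3", "terrible\t-4"])
print(tweet_sentiment("Good game but a terrible ending", afinn))  # 3 - 4 = -1
```

The same two functions extend directly to the assignment's full AFINN file and to tweet text pulled from the Twitter API.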
In "assignment3", I started from the MapReduce.py framework provided by Coursera, which implements MapReduce in Python, and wrote my own MapReduce algorithms to complete the assignment. These included counting each person's number of friends and finding the asymmetric friendships in an example social network with (person, friend) key-value pairs. Full details at: https://www.coursera.org/learn/data-manipulation/programming/Dp7qI/thinking-in-mapreduce
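The asymmetric-friendship idea can be sketched with a simple in-memory map/group/reduce loop. The function names here are my own illustration of the shape of the job, not the MapReduce.py interface itself:

```python
# Sketch of the asymmetric-friendship MapReduce job over (person, friend)
# records; an in-memory stand-in for the course's MapReduce.py framework.
from collections import defaultdict

def mapper(person, friend):
    # Key by the unordered pair so (A,B) and (B,A) group together,
    # keeping the original direction as the value.
    yield (tuple(sorted((person, friend))), (person, friend))

def reducer(pair, directions):
    # A friendship is asymmetric if only one direction was observed.
    if len(set(directions)) == 1:
        yield directions[0]

def run(records):
    groups = defaultdict(list)   # the "shuffle" phase: group values by key
    for person, friend in records:
        for key, value in mapper(person, friend):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

edges = [("A", "B"), ("B", "A"), ("A", "C")]  # A->C has no reverse edge
print(run(edges))  # [('A', 'C')]
```

Counting friends is the same pattern with the person as the key and a reducer that sums the group's size.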
The code I used to complete these assignments is in the "assignment1" and "assignment2" folders of the repository at: https://github.com/jk34/Coursera_DataManipulation_MapReduce
I have updated my work on the Kaggle NCAA March Machine Learning Mania data set. To reiterate, the data is available at: https://www.kaggle.com/c/march-machine-learning-mania-2015
I also utilized the blog_utility.r script from https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/
One of the changes I've made since the original post on July 6th is adding BPI as a predictor. I also computed additional log-loss values for logistic regression, and determined log-loss values for random forest and k-nearest neighbors. Details are below.
I used the log-loss values of each model to determine which one produces predictions that more closely match the actual results. I first used the following variables as predictors: SEED, WST_6, and TWPCT (these variables are explained at https://statsguys.wordpress.com/2014/03/15/data-analytics-for-beginners-march-machine-learning-mania-part-ii/). I also included BPI as a predictor, which is a rough estimate of how good teams really are. For the 2011-12 season and later, I used the BPI rankings from http://espn.go.com/mens-college-basketball/bpi/_/season/2012 . These rankings were computed by ESPN, and I believe they do not incorporate tournament results, since that page includes “NCAA tournament information” that predicts seeds and which teams will make the tournament. For seasons before 2011-12, I used the Pomeroy rankings (http://kenpom.com/index.php?y=2014) instead, which similarly try to estimate how good teams really are.
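For reference, log-loss is the mean negative log-likelihood of the actual outcomes under the predicted win probabilities, so lower is better and confident wrong predictions are penalized heavily. A small sketch of the metric (my own illustration, not code from the repository):

```python
# Sketch of the log-loss metric used to compare the models below.
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes y_true under
    predicted probabilities y_prob."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Predicting 0.5 for every game scores log(2) ~ 0.693, which is why
# values in the .64-.69 range are the interesting ones here.
print(round(log_loss([1, 0, 1], [0.5, 0.5, 0.5]), 3))  # 0.693
```

Any model that cannot beat 0.693 is doing no better than a coin flip.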
I concluded that logistic regression was clearly a better model than rpart, because it produced a lower log-loss of .692 when using 2010-2012 as the training set and 2013 as the test set (versus 1.06 for rpart). The log-loss for logistic regression dropped further to .684 when I used only BPI, SEED, and WST6 as predictors, and to .682 with just BPI and WST6 (dropping SEED, since it had a much higher p-value than A_WST6 and A_BPI). I got a more noticeable improvement, to .64, when using BPI alone.
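As an illustration of the model family (the actual analysis was done in R with the blog_utility.r helpers), here is a self-contained Python sketch of fitting a logistic model by stochastic gradient descent; the two synthetic features are stand-ins for predictors such as BPI and WST6:

```python
# Illustrative logistic-regression fit by SGD on synthetic data; the real
# analysis used R, and the features below are stand-ins for BPI and WST6.
import math
import random

def sigmoid(z):
    z = max(min(z, 35.0), -35.0)  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Stochastic gradient descent on the log-loss; returns weights, intercept."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear term
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1 if x[0] + 0.5 * x[1] > 0 else 0 for x in X]  # synthetic "win" labels
w, b = fit_logistic(X, y)
preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 for x in X]
accuracy = sum(p == bool(t) for p, t in zip(preds, y)) / len(y)
print(accuracy)  # well above chance on this separable toy data
```

In the real analysis, the fitted probabilities (rather than hard 0/1 predictions) are what get scored by log-loss.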
If I instead used 2008-2012 as the training set, the log-loss was .688 with BPI, SEED, and A_WST6 as predictors. The log-loss stayed the same with just BPI and A_WST6 (no SEED) as predictors, and noticeably decreased, to .656, when using BPI alone.
I then tried to determine the log-loss for a support vector machine, but I could not get the code for it to work.
I then tried random forest. It gave a log-loss of .651 when using the 2008-2012 seasons as the training set and 2013 as the test set, with BPI, SEED, and WST6 as predictors, mtry=2 (only 2 of the 3 predictors considered at each tree split), and ntree=5000 (5000 different trees). The log-loss did not change much as I varied the number of trees: .650 with ntree=1000, .657 with ntree=500, and .648 with ntree=10000. These values are very similar to those from logistic regression.
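To make the mtry/ntree terminology concrete, here is a toy forest of depth-1 trees ("stumps") in Python. Real random forests grow full trees, and the analysis itself was done in R, so this is purely a sketch of the two knobs:

```python
# Toy illustration of ntree (number of bootstrapped trees) and mtry
# (features considered per split) using depth-1 "stumps"; real random
# forests grow full trees, so this only sketches the two parameters.
import random

def best_stump(X, y, feats):
    """Best (feature, threshold, flip) on the given feature subset,
    by misclassification count, allowing flipped polarity."""
    best = None
    for j in feats:
        for t in set(x[j] for x in X):
            raw = sum((x[j] > t) != bool(yi) for x, yi in zip(X, y))
            err, flip = min((raw, False), (len(y) - raw, True))
            if best is None or err < best[0]:
                best = (err, j, t, flip)
    return best[1:]

def fit_forest(X, y, ntree, mtry):
    n, d = len(X), len(X[0])
    stumps = []
    for _ in range(ntree):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = random.sample(range(d), mtry)          # mtry random features
        stumps.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return stumps

def predict(stumps, x):
    votes = sum((x[j] > t) != flip for j, t, flip in stumps)  # majority vote
    return 1 if 2 * votes > len(stumps) else 0

random.seed(1)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)]
y = [1 if x[0] > 0 else 0 for x in X]  # only feature 0 matters
forest = fit_forest(X, y, ntree=25, mtry=2)
accuracy = sum(predict(forest, x) == t for x, t in zip(X, y)) / len(y)
print(accuracy)
```

Increasing ntree mostly stabilizes the vote (which matches the small log-loss changes above), while mtry controls how decorrelated the individual trees are.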
Finally, I used k-nearest neighbors. The best log-loss was .6947, at k=320, using 2008-2012 as the training set and 2013 as the test set. This is slightly larger than the log-loss values from random forest and logistic regression, so for this training/test split the predictions from random forest and logistic regression are slightly more accurate than those from k-nearest neighbors.
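The k-NN probability estimate itself is simple: the predicted win probability is just the fraction of the k nearest training games that were wins, which is why a large k like 320 gives smoother, less confident probabilities. A small illustrative sketch:

```python
# Sketch of k-NN probability estimation: the predicted probability is the
# fraction of the k nearest training points (Euclidean distance) with label 1.
import math

def knn_proba(X_train, y_train, x, k):
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    nearest = [yi for _, yi in dists[:k]]
    return sum(nearest) / k

X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1, 1]
print(knn_proba(X, y, [2.5], k=3))  # 2 of the 3 nearest neighbors are wins
```

These fractional probabilities are what get scored by log-loss when comparing k-NN against the other models.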
The changes I made are labeled as "second commit" at https://github.com/jk34/Kaggle_NCAA_logistic_trees_SVM, which provides all the data and code I used
In this post, I will talk about a personal project in which I simulated the Yu-Gi-Oh card game (http://www.yugioh-card.com/en/), written in C++ and C#.
In the game of Yu-Gi-Oh (http://www.yugioh-card.com/en/), there are 3 types of cards: Monster, Trap, and Magic cards. An example of a card is the Blue-Eyes White Dragon (http://yugioh.wikia.com/wiki/Blue-Eyes_White_Dragon). As seen in the link, it is a Monster card. Monster cards also have an Attack power, a Defense power, and a Star level; Magic and Trap cards do not. All types of cards can be played in either "face-down" or "face-up" mode. Whereas all Trap and Magic cards have an "effect", only some Monsters do. For example, the Blue-Eyes White Dragon has no effect, but the monster "Blade Knight" (http://yugioh.wikia.com/wiki/Blade_Knight) has one: "While you have 1 or less cards in your hand, this card gains 400 ATK. If you control no other monsters, negate the effects of Flip monsters destroyed by battle with this card."
These distinctions explain the functions I defined in "Card.h" and "Card.cpp".
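As a rough illustration of that card model (the project itself is written in C++; the class and member names below are hypothetical and not taken from Card.h), the split between monster stats and card-wide effects could be sketched like this:

```python
# Hypothetical Python sketch of the card model described above; the real
# project is C++, and these names are illustrative, not from Card.h.
class Card:
    def __init__(self, name, effect=None):
        self.name = name
        self.effect = effect   # all Magic/Trap cards have one; monsters may not
        self.face_up = False   # any card can be played face-down or face-up

    def has_effect(self):
        return self.effect is not None

class Monster(Card):
    def __init__(self, name, attack, defense, level, effect=None):
        super().__init__(name, effect)
        self.attack = attack   # Attack power
        self.defense = defense # Defense power
        self.level = level     # Star level

blue_eyes = Monster("Blue-Eyes White Dragon", attack=3000, defense=2500, level=8)
blade_knight = Monster("Blade Knight", 1600, 1000, 4,
                       effect="While you have 1 or less cards in your hand, "
                              "this card gains 400 ATK...")
print(blue_eyes.has_effect(), blade_knight.has_effect())  # False True
```

The key point is that effects belong to the base card type, while Attack/Defense/Star level belong only to monsters.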
Also, in the game of Yu-Gi-Oh, there are 2 players playing against each other at a time. For this program, I specify the two users as "Kaiba" (http://www.yugioh.com/characters/seto-kaiba) and "Yugi" (http://www.yugioh.com/characters/yugi-muto). There are various phases to the game (http://www.wikihow.com/Play-Yu-Gi-Oh!).
In "Yugioh.cpp", I implement the game itself, including the deck, hand, and graveyard, as explained at http://yugioh.wikia.com/wiki/Field.
The code for the C++ version of the program is at: https://github.com/jk34/Card_game_cpp
The code for the C# version with GUI is at: https://github.com/jk34/Card_game_Csharp
Hello world, my name is Jerry Kim. I have a background in physics and programming, and I am interested in a career as a software engineer or data scientist. I created this website to showcase a small sample of the things I have worked on.