KaggleSFcrime

During graduate school, at least in Cognitive Psychology, it is often the case that the only two career trajectories that you hear discussed and advertised are academia and teaching. Nearing the end of my PhD, I realized I wasn't interested in either of those two paths and had to do a lot of research to determine what kinds of things a Cognitive Psychologist can do in industry. It was during this search that I discovered the massively expanding field of Data Science.

Data Scientists are typically computer scientists and statisticians who design information systems for collecting, processing, analyzing, and making inferences about massive amounts of data. Harvard Business Review has called Data Scientist the sexiest job of the 21st century.

Data Scientists are highly educated. In one survey, of the approximately 615 respondents included in the analysis, 44% had a master's degree and 23% had a doctorate. About two thirds of the respondents (67%) worked in the United States. Data Scientists are also well compensated, with a median salary of $104,000 USD in the US. The largest proportion of Data Scientists (23%) work in software.

The connection to Cognitive Psychology and Cognitive Neuroscience may not seem obvious; however, many data science problems involve making inferences about data produced by humans, an area in which we have the advantage. Further, 52% of the respondents used R and 51% used Python, programming languages that are used extensively in the Anderson lab. After five and a half years of graduate school, the most important advice I have is not to underestimate the utility of mastering at least one computational programming language. Here is a comparison between R and Python in Data Science.

It is admittedly difficult to break into a field dominated by computer scientists, but I did it! I recently accepted a job offer for a position as a Data Scientist at Game Circus, a mobile gaming company. I shared my enthusiasm with the lab and convinced Alex and Syaheed to enter a Kaggle competition together as a team. Kaggle competitions typically involve the application of machine learning to real-world data science problems and submissions are scored based on how accurately you have predicted the outcome of the test data.

We chose to join the San Francisco Crime Classification competition. Here I will outline our first solution, which resulted in a 139th place submission (out of 683 total submissions).

Solution 1: Logistic Regression

The goal of the competition is to predict the category of each crime. The data files were large (train: 119 MB; test: 86 MB), so they were read in using the fread function in the data.table package, which is typically much faster than read.csv on files of this size.
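As a rough illustration of the reading step, here is a minimal sketch. It writes a tiny made-up CSV to a temporary file so that it runs without the competition data; the file names in the commented-out lines are assumptions, not paths taken from our actual script.

```r
library(data.table)

# Hypothetical stand-in for the real file: a tiny CSV written to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Dates,Category,X,Y",
  "2015-05-13 23:53:00,WARRANTS,-122.4259,37.77460",
  "2015-05-13 23:30:00,LARCENY/THEFT,-122.4270,37.80087"
), csv_path)

toy <- fread(csv_path)   # fread returns a data.table
str(toy)

# With the real data the calls would look something like this (paths assumed):
# train <- fread("train.csv")
# test  <- fread("test.csv")
```

Keep in mind that the result is a data.table rather than a plain data.frame, which matters for some of the indexing later on.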

This is what the data looked like:

> head(train)

                 Dates       Category                       Descript DayOfWeek
1: 2015-05-13 23:53:00       WARRANTS                 WARRANT ARREST Wednesday   
2: 2015-05-13 23:53:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   
3: 2015-05-13 23:33:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   
4: 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday   
5: 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday       
6: 2015-05-13 23:30:00  LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO Wednesday  

   PdDistrict     Resolution                   Address         X        Y
1:   NORTHERN ARREST, BOOKED        OAK ST / LAGUNA ST -122.4259 37.77460
2:   NORTHERN ARREST, BOOKED        OAK ST / LAGUNA ST -122.4259 37.77460
3:   NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.4244 37.80041
4:   NORTHERN           NONE  1500 Block of LOMBARD ST -122.4270 37.80087
5:       PARK           NONE 100 Block of BRODERICK ST -122.4387 37.77154
6:  INGLESIDE           NONE       0 Block of TEDDY AV -122.4033 37.71343

All of the following applies to the train and test datasets, but I will only display the result for the training dataset. The first step was to clean up the data. First, we separated the Dates variable into Year, Month, Day, Hour, Minute and Second using the lubridate package. The fast_strptime function is strptime optimized for use on large datasets.

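# lt=FALSE returns POSIXct; the default POSIXlt is not well suited to a data.table column
train$Dates<-fast_strptime(train$Dates, format="%Y-%m-%d %H:%M:%S", tz="UTC", lt=FALSE)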
train$Day<-day(train$Dates)
train$Month<-month(train$Dates)
train$Year<-year(train$Dates)
train$Hour<-hour(train$Dates)
train$Minute<-minute(train$Dates)
train$Second<-second(train$Dates)
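To make the effect of these calls concrete, here is a small sketch on two made-up timestamps in the same format as the Dates column. The vector and data frame names are only for illustration.

```r
library(lubridate)

# Two toy timestamps in the same format as the Dates column
stamps <- c("2015-05-13 23:53:00", "2015-01-02 04:05:06")

# lt = FALSE asks for POSIXct instead of the default POSIXlt
parsed <- fast_strptime(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC", lt = FALSE)

# Pull out each component with the matching lubridate accessor
parts <- data.frame(
  Year   = year(parsed),
  Month  = month(parsed),
  Day    = day(parsed),
  Hour   = hour(parsed),
  Minute = minute(parsed),
  Second = second(parsed)
)
print(parts)
```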

Next we created a new indicator variable that was 1 if the crime was committed at an intersection (the address contains a "/") and 0 if it was not.

train$Intersection<-as.integer(grepl("/", train$Address, fixed=TRUE))

Another indicator variable was created that was 1 if the crime was committed at night (defined as occurring from 11pm until just before 6am) and 0 otherwise.

train$Night<-ifelse(train$Hour > 22 | train$Hour < 6,1,0)

Finally, an indicator variable was made for whether the crime was committed on a weekday (1) or not (0).

train$Week<-ifelse(train$DayOfWeek=="Saturday" | train$DayOfWeek=="Sunday",0,1)
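The three indicator variables are easy to check on a handful of rows. The sketch below uses made-up values with the same column names as the training data, and only base R.

```r
# Toy rows; column names follow the training data, values are made up
toy <- data.frame(
  Address   = c("OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST", "0 Block of TEDDY AV"),
  Hour      = c(23, 12, 5),
  DayOfWeek = c("Wednesday", "Saturday", "Sunday"),
  stringsAsFactors = FALSE
)

# 1 if the address contains a slash, i.e. names two streets
toy$Intersection <- as.integer(grepl("/", toy$Address, fixed = TRUE))

# 1 for hours 23, 0, 1, ..., 5
toy$Night <- ifelse(toy$Hour > 22 | toy$Hour < 6, 1, 0)

# 1 for Monday through Friday
toy$Week <- ifelse(toy$DayOfWeek %in% c("Saturday", "Sunday"), 0, 1)

print(toy[, c("Intersection", "Night", "Week")])
```

Here the first row is a weekday night-time crime at an intersection, the second is a weekend daytime crime on a block, and the third is a weekend crime at 5am, which still counts as night.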

In order to perform a logistic regression, the category outcome was converted into a dummy binary matrix:

categoryMatrix<-data.frame(with(train,model.matrix(~Category+0)))
names(categoryMatrix)<-sort(unique(train$Category))
train<-cbind(categoryMatrix,train)
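For anyone unfamiliar with model.matrix, here is a toy version of the same conversion with three made-up categories. Each row ends up with a single 1 in the column for its category.

```r
# Four toy crimes in three categories
toy <- data.frame(
  Category = c("WARRANTS", "LARCENY/THEFT", "WARRANTS", "ASSAULT"),
  stringsAsFactors = FALSE
)

# "+0" drops the intercept so that every category gets its own 0/1 column
dummies <- data.frame(model.matrix(~ Category + 0, data = toy))

# Restore readable column names (data.frame mangles characters like "/")
names(dummies) <- sort(unique(toy$Category))

print(dummies)
```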

The evaluation metric for the competition was multi-class logarithmic loss, so we created a function to evaluate our model:

MMLL <- function(act, pred, eps=1e-15) {
  pred[pred < eps] <- eps
  pred[pred > 1 - eps] <- 1 - eps
  -1/nrow(act)*(sum(act*log(pred)))
}
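To get a feel for the scale of this metric, it helps to try two extreme cases on toy data: a model that guesses every class with equal probability, and a model that is always right. The matrices below are made up, and the function is repeated so the sketch runs on its own.

```r
MMLL <- function(act, pred, eps = 1e-15) {
  pred[pred < eps] <- eps
  pred[pred > 1 - eps] <- 1 - eps
  -1 / nrow(act) * sum(act * log(pred))
}

# Three toy observations over three classes, one-hot encoded
act <- matrix(c(1, 0, 0,
                0, 1, 0,
                0, 0, 1), nrow = 3, byrow = TRUE)

uniform <- matrix(1 / 3, nrow = 3, ncol = 3)  # equal probability for every class
perfect <- act                                # probability 1 on the true class

MMLL(act, uniform)  # equals log(3), the loss of an uninformed guess over 3 classes
MMLL(act, perfect)  # essentially 0 (clipping keeps log() finite)
```

More generally, a uniform guess over K classes scores log(K), so any useful model should come in below that.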

As a first pass, the features for the model were selected to minimize the AIC, which was done with a brute-force backwards step procedure. The final model included the following predictors: PdDistrict, X, Y, Hour, Minute, Intersection and Night. In order to evaluate the model, the training dataset was divided into a training set (70% of the data) and a held-out test set (the remaining 30%):

train.tr.index<-sample(1:nrow(train),0.7*nrow(train))
train.tr<-train[train.tr.index,]
train.test<-train[-train.tr.index,]
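As an aside on the backwards AIC selection mentioned above, one way to do this kind of search in base R is the step function. The sketch below runs it on simulated data and is only meant to show the idea; the variables, the seed, and the use of step itself are assumptions for illustration, not a record of the exact procedure we ran.

```r
set.seed(1)  # arbitrary seed, only so the toy data are reproducible
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))

# The outcome depends on x1 and x2 but not on the noise column
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))

# Start from the full model and let step() drop terms while the AIC keeps falling
full <- glm(y ~ x1 + x2 + noise, data = toy, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Because step only accepts a deletion when it lowers the AIC, the reduced model never has a higher AIC than the model it started from.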

The first model, fit with glm, produced fitted probabilities of 0 or 1; therefore, glmnet from the glmnet package was used instead to perform a penalized regression to correct for this. The glmnet function requires that the input be in the form of a matrix, so the train and test sets were converted into sparse model matrices.

matMod.tr<-sparse.model.matrix(~as.factor(PdDistrict)+X+Y+Hour+Minute+Intersection+Night,data=train.tr)
matMod.test<-sparse.model.matrix(~as.factor(PdDistrict)+X+Y+Hour+Minute+Intersection+Night,data=train.test)

The model was then run for each Category and the predicted probabilities computed:

# Fit one binomial model per category column (one-vs-rest), starting with column 1
m<-glmnet(matMod.tr,train.tr[,1],family="binomial")
pred<-as.data.frame(predict(m,matMod.test,s=1e-15,type="response"))
numCat<-length(unique(train.tr$Category))
pb <- txtProgressBar(min = 1, max = numCat, style = 3)
# Repeat for the remaining categories, appending one column of probabilities each time
for (i in 2:numCat) {
  m<-glmnet(matMod.tr,train.tr[,i],family="binomial")
  pred<-cbind(pred,predict(m,matMod.test,s=1e-15,type="response"))
  setTxtProgressBar(pb, i)
}
close(pb)

Before re-running the model on the full training set, and then predicting the probabilities for the test set to make our submission, we evaluated the performance of this model.

> act<-as.data.frame(train.test[1:numCat])
> MMLL(act,pred)
[1] 2.524884

As of today, first place is 2.11638, so this isn't a bad first pass!