Kaggle – Data Science London + Scikit-learn (Using k-nn in R)

This post won't make much sense unless you have read the previous post first: http://siddhantj.wordpress.com/2013/12/31/kaggle-data-science-london-scikit-learn-using-svm-in-r/


Alright, so moving on.


I implemented the k-nearest neighbour model next, and there was a significant improvement in my leaderboard ranking (16 positions!) along with a 2.5% increase in accuracy. I tried two different values of k: 5 and 15. With k=15 my accuracy dropped slightly (0.3%), so k=5 seems to give a decent result. I am left with just one allowed submission in the next 6 hours, and I am tempted to try another value of k (12; no sound logical reasoning behind the number, just a hunch).

I ended up spending a lot of time figuring out how to use k-NN in R. I have little to no idea about data types in R, which is where I got stuck. So far I have been getting by on examples, and I am hoping that I will eventually learn R properly as well. Most of my time went into fixing a trivial error caused by the variable storing the training-set labels not being of type factor. The problem, and the solution I ended up using, are almost exactly explained here
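The factor fix described above can be sketched in isolation (the label values here are made up; the point is only the type conversion, since class::knn() expects its class argument as a factor rather than a raw data-frame column):

```r
# Hypothetical labels read in as a one-column data frame, e.g. from read.csv()
trainLabels <- data.frame(V1 = c(1, 0, 1, 0, 1))

cl <- trainLabels[, 1]   # extract the column as a plain vector
cl <- as.factor(cl)      # convert to a factor, the type knn() expects

is.factor(cl)            # TRUE
```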

Again, appending the commands that worked:

#assuming train, test and trainLabels are as defined in the previous post
library(class)                        #provides the knn() function
cl <- trainLabels[,1]                 #labels; must be a factor for knn()
answer <- knn(train, test, cl, k=5)
write.csv(answer, "answer2.csv")


PS: I did try that k=12 hunch and failed. Must learn how to do cross-validation next.
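For what it's worth, the class package ships a leave-one-out cross-validation helper, knn.cv(), which would let different values of k be compared without spending submissions. A minimal sketch on made-up data (train and cl below are stand-ins for the real training set and labels):

```r
library(class)  #provides knn.cv()

set.seed(1)
train <- matrix(rnorm(200), ncol = 2)          #hypothetical features
cl    <- factor(rep(c("a", "b"), each = 50))   #hypothetical labels

#leave-one-out cross-validated accuracy for a few candidate values of k
for (k in c(5, 12, 15)) {
  pred <- knn.cv(train, cl, k = k)
  cat("k =", k, "accuracy =", mean(pred == cl), "\n")
}
```

The value of k with the best cross-validated accuracy would then be the one worth a real submission.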