Approximate Nearest Neighbours in R and Spark

Background: K-Nearest Neighbour is a commonly used algorithm, but it is expensive to compute on big data. Spark implements two methods for finding approximate nearest neighbours using Locality Sensitive Hashing: Bucketed Random Projection for Euclidean distance and MinHash for Jaccard distance. The work to add these methods was done in collaboration with Uber, which you can read about here. Whereas traditional KNN algorithms find the exact nearest neighbours, these approximate methods find the nearest neighbours only with high probability.
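To make the MinHash idea concrete, here is a minimal sketch in plain Python. It is not Spark's implementation and does not use sparklyr. The function names, the choice of random linear hash functions, and the prime modulus are all assumptions made for illustration. The point it shows is that the fraction of positions where two signatures disagree approximates the Jaccard distance, so short signatures can stand in for the full sets.

```python
import random

def jaccard_distance(a, b):
    """Exact Jaccard distance between two non-empty sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def make_hash_functions(num_hashes, prime=2_147_483_647, seed=0):
    """Random linear hashes h(x) = (a * x + b) mod prime (an illustrative choice)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime)
            for _ in range(num_hashes)]

def minhash_signature(items, hash_functions):
    """For each hash function, keep the smallest hash value over the set.

    Items are assumed to be non-negative integers smaller than the prime.
    """
    return [min((a * x + b) % p for x in items) for a, b, p in hash_functions]

def estimated_jaccard_distance(sig_a, sig_b):
    """Share of signature positions that differ; approximates Jaccard distance."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)

# Hypothetical example data: two overlapping sets of integer item ids.
hashes = make_hash_functions(num_hashes=128)
set_a = {3, 17, 42, 99, 256, 1024}
set_b = {3, 17, 42, 99, 500, 2048}
print("exact:    ", jaccard_distance(set_a, set_b))
print("estimated:", estimated_jaccard_distance(
    minhash_signature(set_a, hashes), minhash_signature(set_b, hashes)))
```

The estimate is only approximate, and it tightens as the number of hash functions grows. A full LSH scheme would additionally group signatures into buckets so that only items sharing a bucket are compared.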

Feature selection by cross-validation with sparklyr

Overview: In this post we’ll run through how to do feature selection by cross-validation in sparklyr. You can see previous posts for some background on cross-validation and sparklyr. Our aim will be to loop over a set of features, refitting a model with each feature excluded in turn. We can then compare the performance of these reduced models to that of a model containing all the features. This way, we’ll quantify the effect of removing each feature on performance.
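The loop described above can be sketched without Spark at all. The following plain-Python example is an illustration only, not the sparklyr code from the post: the nearest-centroid classifier, the round-robin fold assignment, the `label` key, and every function name are assumptions chosen to keep the example self-contained. It scores a model on all features, then rescores it with each feature left out, and reports the change in cross-validated accuracy.

```python
def k_fold_indices(n_rows, k):
    """Assign row indices to k folds round-robin (fold i gets i, i+k, i+2k, ...)."""
    return [list(range(i, n_rows, k)) for i in range(k)]

def nearest_centroid_accuracy(train, test, features):
    """Fit per-label centroids on `train`, return classification accuracy on `test`."""
    sums, counts = {}, {}
    for row in train:
        label = row["label"]
        counts[label] = counts.get(label, 0) + 1
        totals = sums.setdefault(label, [0.0] * len(features))
        for j, name in enumerate(features):
            totals[j] += row[name]
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(row):
        # Pick the label whose centroid is closest in squared Euclidean distance.
        return min(centroids, key=lambda lab: sum(
            (row[name] - c) ** 2 for name, c in zip(features, centroids[lab])))

    return sum(predict(row) == row["label"] for row in test) / len(test)

def cv_score(rows, features, k=5):
    """Mean accuracy over k folds, each fold held out once as the test set."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [row for i, row in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        scores.append(nearest_centroid_accuracy(train, test, features))
    return sum(scores) / len(scores)

def leave_one_feature_out(rows, features, k=5):
    """Return the full-model score and, per feature, the score change when it is dropped."""
    baseline = cv_score(rows, features, k)
    effects = {}
    for name in features:
        reduced = [other for other in features if other != name]
        effects[name] = cv_score(rows, reduced, k) - baseline
    return baseline, effects

# Hypothetical data: "signal" separates the two labels, "noise" does not.
rows = [{"label": i % 2,
         "signal": 10.0 * (i % 2) + 0.1 * (i % 3),
         "noise": float((i // 2) % 2)} for i in range(20)]
baseline, effects = leave_one_feature_out(rows, ["signal", "noise"])
print("baseline accuracy:", baseline)
for name, change in effects.items():
    print(f"dropping {name}: change in accuracy {change:+.2f}")
```

A negative change means the model got worse without that feature, so the feature is worth keeping; a change near zero suggests the feature contributes little.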

Cross-validation with sparklyr 2: Electric Boogaloo

Overview: I’ve previously written about doing cross-validation with sparklyr. This post serves as an update, given the changes that have since been made to sparklyr. The first half of my previous post may still be worth reading, but the section on cross-validation is out of date: the function provided there no longer works. If you want an overview of sparklyr and how it compares to SparkR, see this post. Bear in mind, however, that that post was written in December 2017, and both packages have added functionality since then.

Machine learning and k-fold cross validation with sparklyr

Update, 2019. I have now written an updated post on cross-validation with sparklyr, as well as a follow-up on using cross-validation for feature selection. Those posts are the better ones to read, as the code here no longer works following changes to sparklyr. In this post I’m going to run through a brief example of using sparklyr in R. This package provides a way to connect to Spark from within R, while using the dplyr functions we all know and love.