Approximate Nearest Neighbours in R and Spark

Background K-Nearest Neighbour is a commonly used algorithm, but is difficult to compute for big data. Spark implements a couple of methods for getting approximate nearest neighbours using Local Sensitivity Hashing; Bucketed Random Projection for Euclidean Distance and MinHash for Jaccard Distance. The work to add these methods was done in collaboration with Uber, which you can read about here. Whereas traditional KNN algorithms find the exact nearest neighbours, these approximate methods will only find the nearest neighbours with high probability. [Read More]

Feature selection by cross-validation with sparklyr

Overview In this post we’ll run through how to do feature selection by cross-validation in sparklyr. You can see previous posts for some background on cross-validation and sparklyr. Our aim will be to loop over a set of features refitting a model with each feature excluded. We can then compare the performance of these reduced models to a model containing all the features. This way, we’ll quantify the effect of removing a particular feature on performance. [Read More]

Cross-validation with sparklyr 2: Electric Boogaloo

Overview I’ve previously written about doing cross-validation with sparklyr. This post will serve as an update, given the changes that have been made to sparklyr. The first half of my previous post may be worth reading, but the section on cross-validation is wrong, in that the function provided no longer works. If you want an overview of sparklyr, and how it compared to SparkR, see this post. Bear in mind, however, that post was written in December 2017, and both packages have added functionality since then. [Read More]

SparkR vs sparklyr for interacting with Spark from R

This post grew out of some notes I was making on the differences between SparkR and sparklyr, two packages that provide an R interface to Spark. I’m currently working on a project where I’ll be interacting with data in Spark, so wanted to get a sense of options using R. Those unfamiliar with sparklyr might benefit from reading the first half of this previous post, where I cover the idea of having R objects for connections to Spark DataFrames. [Read More]
Code  Spark  R  sparklyr