Approximate Nearest Neighbours in R and Spark

Background K-Nearest Neighbour is a commonly used algorithm, but is difficult to compute for big data. Spark implements a couple of methods for getting approximate nearest neighbours using Local Sensitivity Hashing; Bucketed Random Projection for Euclidean Distance and MinHash for Jaccard Distance. The work to add these methods was done in collaboration with Uber, which you can read about here. Whereas traditional KNN algorithms find the exact nearest neighbours, these approximate methods will only find the nearest neighbours with high probability. [Read More]

Feature selection by cross-validation with sparklyr

Overview In this post we’ll run through how to do feature selection by cross-validation in sparklyr. You can see previous posts for some background on cross-validation and sparklyr. Our aim will be to loop over a set of features refitting a model with each feature excluded. We can then compare the performance of these reduced models to a model containing all the features. This way, we’ll quantify the effect of removing a particular feature on performance. [Read More]

Cross-validation with sparklyr 2: Electric Boogaloo

Overview I’ve previously written about doing cross-validation with sparklyr. This post will serve as an update, given the changes that have been made to sparklyr. The first half of my previous post may be worth reading, but the section on cross-validation is wrong, in that the function provided no longer works. If you want an overview of sparklyr, and how it compared to SparkR, see this post. Bear in mind, however, that post was written in December 2017, and both packages have added functionality since then. [Read More]

SparkR vs sparklyr for interacting with Spark from R

This post grew out of some notes I was making on the differences between SparkR and sparklyr, two packages that provide an R interface to Spark. I’m currently working on a project where I’ll be interacting with data in Spark, so wanted to get a sense of options using R. Those unfamiliar with sparklyr might benefit from reading the first half of this previous post, where I cover the idea of having R objects for connections to Spark DataFrames. [Read More]
Code  Spark  R  sparklyr 

Machine learning and k-fold cross validation with sparklyr

Update, 2019. I have now written an updated post on cross-validation with sparklyr, as well as a follow-up on using cross-validation for feature selection. These posts would be better to read as the code here no longer works following changes to sparklyr. In this post I’m going to run through a brief example of using sparklyr in R. This package provides a way to connect to Spark from within R, while using the dplyr functions we all know and love. [Read More]

Writing your thesis with bookdown

This post details some tips and tricks for writing a thesis/dissertation using the bookdown R package by Yihui Xie. The idea of this post is to supplement the fantastic book that Xie has written about bookdown, which can be found here. I will assume that readers know a bit about R Markdown; a decent knowledge of R Markdown is going to be essential to using bookdown. The first thing to highlight is that I’m not a pandoc or LaTeX expert. [Read More]

Intro to R slides

For the Perception Action and Cognition Lab Open Science Week, 2017 (University of Leeds) I gave two talks introducing R. You can see the slides below. The code for the slides can be found over at GitHub. An introduction to R In this introduction to R I focused on tools from the tidyverse, as well as trying to provide some motivation for learning R. The audience was academics and postgraduates in a psychology department. [Read More]
R  Talks