Playing Kaggle with Scala

Davi de Castro Reis
Jul 31, 2017

I had some fun, and some challenges, with the unusual choice of Spark ML as my primary building block in a Kaggle competition. Read on to learn more.

The competition problem statement was to detect duplicate Quora questions, which involves a lot of text analysis, an area we at WorldSense like a lot. The solution I attempted leverages only algorithms readily available in Spark ML, and is a simple pipeline with data cleaning, TF-IDF, LDA, and a linear classifier.

The tools of the trade in these competitions are invariably from the Python ecosystem. The completeness of the machine learning libraries, the interactive programming model with Jupyter, and the robust visualization support are unmatched in other languages. On the other hand, Python can be quite slow at runtime when you need to do something that is not readily available as a computation supported by a library implemented in C, and the lack of a static type system can be unnerving.

So, while we developed a competitive model in the top 11% using Keras, Tensorflow, Scikit-Learn and Gensim, I also thought it would be interesting to see how hard it would be to write a complete solution for the problem using the Spark/Scala ecosystem. I did manage to reach a 0.45 log loss score, which is somewhat bleh, but still meaningful. There were many possible improvements, but the amount of work it took to get there was just too much, and I did not feel like improving it further.

Keep reading if you want to learn the details of the code I wrote for each step in this machine learning task from Quora.

Loading the data

The very first step when doing machine learning is to load your data into the data structures you need. In my case, since I will be using Spark's machine learning libraries, I want to have the data in a Spark dataframe (analogous to a pandas dataframe in Python, which in turn is analogous to R's dataframes). The original data is stored in a CSV file which can be directly loaded in Python, the lingua franca of Kaggle competitors. In Spark, I needed a few custom settings to get it right.

During data loading, it is also usual to do small transformations to match the expectations of the environment. For example, here I renamed a field with an underscore in the name, which is not polite Scala. After that, I could leverage the strongly typed siblings of dataframes in Spark 2.0, called datasets. Those are a joy to work with, so as soon as I loaded the data I converted it to the Features case class, surfacing those types and making coding easier.

Data Cleaning

I have loaded the data with a minimal amount of transformation, but before feeding it to our machine learning algorithms, I needed to clean it. This is inevitable: all real-world data comes with some noise that either breaks or harms the learning process. Even Kaggle competition data has its share of problems, ranging from issues in the data itself, like missing values, to completely borked rows that need to be dropped.

For the integer values, one usually needs to look out for values outside the valid range of the domain, or for violations of dataset-level constraints, like non-unique ids. In this specific dataset, I only had to handle missing values, which could be easily replaced. For free text values, like what is given in question1 and question2, there are a bunch of transformations which are usually required, regarding encoding, diacritical representation, how to deal with math symbols, and so on. Some of those decisions are not strictly in the domain of data cleaning, and are heuristic in nature, while others are more straightforward. Here is what I did:

The implementation of UTF-8 normalization and diacritical handling relies on IBM ICU, a state-of-the-art library for dealing with Unicode. I highly recommend feeding all your text through it due to its robustness in dealing with crazy input. This will often save you from all sorts of failures when you feed your input to more fragile code further down in your pipeline.
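As a rough illustration of this kind of cleaning, here is a minimal sketch that uses the JDK's `java.text.Normalizer` as a stand-in for ICU, so it is less robust than the ICU-based approach recommended above. The object and method names (`TextCleaning`, `orEmpty`, `normalize`) are made up for the example, and the exact choices (NFKD decomposition, lowercasing, whitespace collapsing) are assumptions rather than the competition code.

```scala
import java.text.Normalizer
import java.util.Locale

object TextCleaning {
  // Matches Unicode combining marks (accents and similar) left behind by decomposition.
  private val combiningMarks = "\\p{M}+".r

  /** Treats a missing (null) question as empty text. */
  def orEmpty(text: String): String = Option(text).getOrElse("")

  /** Decomposes characters (NFKD), drops the accents, lowercases and collapses whitespace. */
  def normalize(text: String): String = {
    val decomposed = Normalizer.normalize(orEmpty(text), Normalizer.Form.NFKD)
    combiningMarks
      .replaceAllIn(decomposed, "")
      .toLowerCase(Locale.ROOT)
      .replaceAll("\\s+", " ")
      .trim
  }
}
```

Replacing a missing question with an empty string before normalizing keeps every later stage free of null checks.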


Machine learning works in the domain of vectors and matrices, a very natural world for the image processing community, which developed most of the field over the last decades. Words and characters are somewhat strangers in this universe, and machine learning with text will invariably involve converting your words to a vector representation.

I started by converting our phrases to sequences of words, a process named tokenization. I also did small, lossy transformations (where we lose potentially useful information) to normalize the resulting representation. After this step, punctuation will be gone, as will the so-called stopwords, which are words, like "the", that are considered to carry very little information and are believed to introduce noise in our learning. The exact set of stopwords is problem specific, and we started with this list.
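To make the tokenization step concrete, here is a minimal sketch in plain Scala. It is not the Spark ML tokenizer, and the stopword set is a tiny hypothetical list chosen only for illustration, not the list referred to above.

```scala
import java.util.Locale

object Tokenizer {
  // Hypothetical, tiny stopword list for illustration only.
  val stopwords: Set[String] = Set("the", "a", "an", "is", "of", "to", "what", "how")

  /** Lowercases, splits on anything that is not a letter or digit, and drops stopwords. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase(Locale.ROOT)
      .split("[^\\p{L}\\p{N}]+")
      .filter(_.nonEmpty)
      .filterNot(stopwords.contains)
      .toSeq
}
```

Splitting on non-alphanumeric characters is what makes the punctuation disappear, and it is lossy: "C++" and "C" end up as the same token, which is the kind of trade-off mentioned above.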

After that, I transformed each word, which is a variable length series of characters, into a fixed size vector. In the world of deep learning, this is often done through a table of word vectors, trained with an unsupervised learning algorithm over a large corpus. I used a more traditional technique, where we weight terms by their frequency times their inverse document frequency*, the so-called TF-IDF vectorization. Its biggest downside is that it loses the positional information of the words, but despite this significant shortcoming it can still yield interesting results. On the upside, it is readily available in Spark ML and easily understood (*Learn more: inverse document frequency, or IDF, was conceived by Karen Sparck Jones, one of the many great women in computer science).

You will notice the class MultiColumnPipeline in both of these snippets. This is an adapter I had to write to deal with the fact that the questions are given in two columns. The adapter takes a traditional Spark ML estimator, which trains on the data in a column and applies its transformation to the same column, and wraps it in a modified pipeline which trains on the concatenated data of multiple columns and applies its transformation to each individual column.
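The "fit on all columns together, transform each column separately" idea can be sketched without Spark. The following is a simplified, assumption-laden illustration and not the MultiColumnPipeline class itself: `SharedTfIdf`, `fitIdf` and `transform` are hypothetical names, documents are already tokenized, vectors are represented as sparse maps, and the IDF uses the smoothed form log((N + 1) / (df + 1)), which as far as I know is the formula given in the Spark ML documentation.

```scala
object SharedTfIdf {
  type Doc = Seq[String]

  /** Learns IDF weights from the documents of all columns put together. */
  def fitIdf(columns: Seq[Seq[Doc]]): Map[String, Double] = {
    val docs = columns.flatten
    val n = docs.size.toDouble
    // Document frequency: in how many documents each term appears at least once.
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    df.map { case (term, count) => term -> math.log((n + 1.0) / (count + 1.0)) }
  }

  /** Applies the learned weights to one document: term count times IDF. */
  def transform(idf: Map[String, Double], doc: Doc): Map[String, Double] =
    doc.groupBy(identity).map { case (term, occ) =>
      // Terms never seen during fitting get a weight of zero.
      term -> occ.size * idf.getOrElse(term, 0.0)
    }
}
```

Calling `fitIdf(Seq(question1Docs, question2Docs))` once and then `transform` on each column gives both questions of a pair weights from the same vocabulary statistics, which is what makes their vectors comparable.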

Latent Dirichlet Allocation

We now have, representing each question in our training set, a variable length sequence of numeric vectors. But machine learning algorithms do not usually handle variable length sequences directly, since they don't know how to assemble those vectors in a matrix representation. This needs to be done somehow, either in a straightforward way, through padding and pruning, through an attention technique, or through an algorithm that is capable of doing this transformation. In our case, we use the Latent Dirichlet Allocation algorithm, which is readily available in Spark ML, and has the nice property of being somewhat interpretable, an important aspect often overlooked by machine learning practitioners. A more modern architecture, using word vectors and attention, is described in this excellent writeup from the authors of the Python library spaCy.

Logistic Regression

Now that we have fixed size vectors representing each of our sentences, in principle all we need is to calculate the distance between these two vectors and feed the result to Kaggle. For example, we could use straightforward algebraic formulations, like the cosine between the two vectors, or take advantage of LDA's probabilistic nature and compute the Hellinger distance between the vectors. That choice would define the typical numerical values we get in the output, and we would need to adjust the client code consuming these numbers to properly interpret the distribution of values we generate.
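Both options are short to write down. The sketch below assumes the topic vectors are plain arrays of doubles of equal length and, for the Hellinger distance, that each one is a probability distribution (non-negative entries summing to one). `TopicDistance` is a hypothetical name used only here.

```scala
object TopicDistance {
  /** Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same size")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  /** Hellinger distance between two discrete probability distributions, in [0, 1]. */
  def hellinger(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "distributions must have the same size")
    val sumOfSquares = p.zip(q).map { case (x, y) =>
      val d = math.sqrt(x) - math.sqrt(y)
      d * d
    }.sum
    math.sqrt(sumOfSquares) / math.sqrt(2.0)
  }
}
```

Note that the two point in opposite directions: a high cosine means similar questions, while a high Hellinger distance means dissimilar ones, so whatever consumes the numbers has to know which one it is getting.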

We do have a specific goal for the distance though, which is minimizing log loss in the Kaggle competition. In a real world scenario, other goals could be present, like how fast the distance computation is, or how easy it is to interpret. But since in our problem statement all we need is to optimize our metric, an interesting option is to just use another component in our machine learning pipeline to do that. Here, we simply decided to feed each pair of vectors to a Logistic Regression learner, which will yield the probability of the two questions being duplicates, as defined by our training data.
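For reference, log loss is the mean negative log-likelihood of the true labels under the predicted probabilities, so lower is better. A minimal sketch, assuming labels are 0 or 1 and predictions are probabilities of the label being 1; the object name and the clipping constant `eps` are choices made for this example.

```scala
object LogLoss {
  /** Mean negative log-likelihood of binary labels (0 or 1) given predicted probabilities. */
  def apply(labels: Seq[Int], probabilities: Seq[Double], eps: Double = 1e-15): Double = {
    require(labels.size == probabilities.size && labels.nonEmpty, "need matching, non-empty inputs")
    val losses = labels.zip(probabilities).map { case (y, p) =>
      // Clip so that a confident wrong answer costs a lot, but not infinity.
      val clipped = math.min(math.max(p, eps), 1.0 - eps)
      if (y == 1) -math.log(clipped) else -math.log(1.0 - clipped)
    }
    losses.sum / losses.size
  }
}
```

Answering 0.5 for every pair scores ln 2, roughly 0.693, which is a useful baseline when reading the scores quoted in this post.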

Evaluation Metrics

We have been given a specific metric to optimize, but when working on a machine learning problem, you often want to look at different metrics during development, until you build an intuition of what each one is capturing and grow confident that what you are optimizing really matters. The Spark ML library is not very rich in terms of implemented metrics or visualization tools, but regardless I tried to look at what was convenient enough. First, the code that puts the stages together and trains the model we want to inspect:

And now the helper functions that compute the metrics and print them on the screen in our development iterations. The LDA topics, albeit not a metric per se, serve the same noble goal here: understanding what is happening as we play with the code.
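To give an idea of the kind of numbers worth printing, here is a minimal sketch in plain Scala rather than with the Spark ML evaluators. It assumes 0/1 labels and predicted probabilities; `BinaryMetrics`, `Report` and the default threshold of 0.5 are hypothetical choices for the example.

```scala
object BinaryMetrics {
  final case class Report(accuracy: Double, precision: Double, recall: Double)

  /** Compares thresholded probabilities against 0/1 labels. */
  def report(labels: Seq[Int], probabilities: Seq[Double], threshold: Double = 0.5): Report = {
    require(labels.size == probabilities.size && labels.nonEmpty, "need matching, non-empty inputs")
    val pairs = labels.zip(probabilities.map(p => if (p >= threshold) 1 else 0))
    val truePositives = pairs.count { case (y, yHat) => y == 1 && yHat == 1 }
    val falsePositives = pairs.count { case (y, yHat) => y == 0 && yHat == 1 }
    val falseNegatives = pairs.count { case (y, yHat) => y == 1 && yHat == 0 }
    val correct = pairs.count { case (y, yHat) => y == yHat }
    // Ratios with an empty denominator are reported as 0.0 rather than NaN.
    def ratio(num: Int, den: Int): Double = if (den == 0) 0.0 else num.toDouble / den
    Report(
      accuracy = ratio(correct, pairs.size),
      precision = ratio(truePositives, truePositives + falsePositives),
      recall = ratio(truePositives, truePositives + falseNegatives)
    )
  }
}
```

Printing `BinaryMetrics.report(labels, probabilities)` next to the log loss on each iteration shows whether a change in the loss comes from better ranking of the pairs or only from better calibrated probabilities.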

It is important not to try to optimize the numbers you see here, otherwise you will be falling into the overfitting trap, since you do not know how well your model will generalize to unseen data. In the following section we discuss how to alleviate that problem, but nonetheless looking at numbers at every step in your machine learning pipeline is key to developing intuition.


As the last piece of our puzzle, we will now wrap the fully functional machine learning pipeline we have in a loop that will find the best parameters for the machine learning algorithms themselves. This is called hyper-parameter selection, and even though it could be seen as a fine tuning step, the impact of hyper-parameters on the learning process can often be huge, and searching that space is a must to achieve good results in a task.

This is the time we harvest the benefits of having expressed all of our computation as a Pipeline, the abstraction Spark ML borrowed from Scikit-Learn. This means that Spark knows how to run our learning algorithm multiple times and compare the results of each run.

There are many interesting hyper-parameters to tune, like the list of stopwords, or how one normalizes diacriticals and capitalization, which arguably is a change of the learning algorithm itself, but ultimately that is what all hyper-parameters are. However, each hyper-parameter you add to the search multiplies the time to train, so the total cost grows exponentially with the number of hyper-parameters. The reason is that the search assumes that every parameter in the model depends on every other, and therefore tries every combination. In practice, it is often ok to assume independence among most parameters, and that can be expressed by tuning one parameter set at a time.
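The difference in cost is easy to see by counting the models each strategy has to train. The sketch below is plain Scala, not Spark's ParamGridBuilder, and it simplifies "one parameter set at a time" to one single parameter at a time; `ParamGrid`, the parameter names and their candidate values are all hypothetical.

```scala
object ParamGrid {
  type Params = Map[String, Any]

  /** Full grid: every value of every parameter combined with every other. */
  def full(grid: Seq[(String, Seq[Any])]): Seq[Params] =
    grid.foldLeft(Seq[Params](Map.empty)) { case (combos, (name, values)) =>
      for (combo <- combos; value <- values) yield combo + (name -> value)
    }

  /** One at a time: vary a single parameter, keep the others at their first value. */
  def oneAtATime(grid: Seq[(String, Seq[Any])]): Seq[Params] = {
    val defaults: Params = grid.map { case (name, values) => name -> values.head }.toMap
    grid.flatMap { case (name, values) => values.map(v => defaults + (name -> v)) }.distinct
  }

  // Hypothetical search space, for illustration only.
  val example: Seq[(String, Seq[Any])] = Seq(
    "vocabSize" -> Seq(1000, 5000, 20000),
    "numTopics" -> Seq(10, 50, 100),
    "maxIter" -> Seq(20, 50)
  )
}
```

For this search space `ParamGrid.full(ParamGrid.example)` yields 3 × 3 × 2 = 18 configurations, while `ParamGrid.oneAtATime(ParamGrid.example)` yields 6, and the gap widens with every parameter added. With 5-fold cross validation each configuration is trained five times, so the saving matters.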

Below we see the code we used to decide on vocabulary size, minimum frequency, lists of stopwords, number of topics and max iterations for the LDA and logistic regression steps, with a total of over a hundred models being trained. Luckily, the cross validator has a hook to define the metric we want to optimize, and even though I had to implement a custom evaluator for log-loss, it was straightforward to plug it in.

Another important aspect here is that I cross validated the model over 5 folds for each hyper-parameter set, which helps avoid overfitting the hyper-parameters to the training data, a risk we were exposed to before introducing cross validation.
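Conceptually, k-fold cross validation only needs a way to split the row indices so that every row is held out exactly once. A minimal sketch of that split, independent of Spark's own implementation; `KFold`, `splits` and the fixed seed are assumptions made for the example.

```scala
import scala.util.Random

object KFold {
  /** Returns, for each fold, the (training indices, validation indices) of n rows. */
  def splits(n: Int, k: Int, seed: Long = 42L): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    val shuffled = new Random(seed).shuffle((0 until n).toList)
    (0 until k).map { fold =>
      // The row at position i of the shuffled order is held out in fold (i mod k).
      val (validation, training) =
        shuffled.zipWithIndex.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }
}
```

Each hyper-parameter set is then trained k times, once per fold, and scored on the held-out indices; the average of those k scores is what gets compared across hyper-parameter sets.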

Usually this is not the step where one stops in a Kaggle competition. In fact, there is not a single best set of hyper-parameters, and in practice each set captures different angles of the problem. In particular, when you start to think of models as hyper-parameters themselves, you will see that you want to build many different models and combine them somehow. And the secret to scoring high in those competitions is to mix and match a lot of techniques and discoveries and create an ensemble, using something like XGBoost. We will not go into this here, even though XGBoost does have nice bindings for Spark/Scala.


Now that I have a model using the best set of hyper-parameters that had been found, I need to apply it to the test data provided by Kaggle and submit the results. First, I loaded the test data into the same format that the model expects as input.

Then I applied the model and finally wrote the output file. I just needed some tricks to rename the output columns, and again I had to set the right CSV parameters.

To wrap it up, see how simple and clear the main function can be when one adheres to the pipeline model, be it in the Scala or in the Scikit-Learn world.


Even with the recent advancements in the area, machine learning is still a complex field, and achieving above average results in practical problems depends on a lot of tools which unfortunately are still sub-par in the Scala/Spark ecosystem. Even though I have presented an end-to-end solution with all the typical bells and whistles, the lack of an interactive environment with visualization support like Jupyter prevents exploring the problem space efficiently, and I stuck with the initial solution I thought of. In the same fashion, the lack of a large, integrated, easy-to-use collection of data cleaning algorithms, learners, transformers and evaluation metrics places a big cost on each idea one wants to try, and since much in this field is still experimentation, development cost dominates. For example, not having a readily available LSTM implementation already puts you way too far behind the state of the art.

On the bright side, having the type system on your side is very helpful. An integrated environment like IntelliJ with proper auto-completion, good testing frameworks, and the superior speed of data crunching with Scala bring their own benefits. If I had Keras and some subset of Scikit-Learn in Scala, I would probably never look back.


  1. There are a handful of projects which provide Scala or Spark support in Jupyter, or provide a similar experience in another environment. Databricks' proprietary notebooks are a good option if they fit in your budget. I use jupyter-scala and plotly-scala, but the productivity is just not the same yet.
  2. Unfortunately the data provided by Quora for this competition was not properly sanitized and inane things like supposedly random identifiers were actually leaking information about the results, making the overall competition less interesting than it could have been. See for details.
  3. Our competitive solution, which benefitted from the leaky data and was written with keras/scikit-learn using LSTM and other more modern techniques, scored 0.15664, which was position 335 overall by the end of the competition.
  4. Python can often outperform java/scala, but that is only the case when the majority of your code is actually running in C, and is properly using multiple cores and/or the GPU (cython being one of the possible workarounds nowadays). In the JVM, your performance will usually be more uniform, and depending on what you are doing, that may be a better trade-off. In many ways, it is sad that things just don't simply work.
  5. There are many other machine learning libraries in Java other than Spark ML. But I don't see any of them matching scikit-learn's breadth of scope and maturity, and I chose Spark ML due to the deep integration with Spark itself.
  6. It would be only natural that Spark ML cross validator would train the models in parallel in the cluster. Unfortunately, that is not the case:
  7. Writing MultiColumnPipeline was the main software engineering task to make this work. That class is an adapter that takes two columns, concatenates them, trains the given learner on the full data, and then applies the learned model to each of the columns individually. Had this code not been written as an adapter pipeline, we would have deviated from the Spark ML Pipeline model and would not have been able to use its cross validation functionality.