A little bit of Data Science in Scala

It is stating the obvious to say that nowadays, Data Science (DS) is dominated by python (with R being a decent challenger). Python is however arguably not the best fit for Data Engineering (1), where more robust languages such as Scala tend to shine. As we shall see, though, Scala is also a prime choice for tinkering with data. It has all the tools you as a data scientist need, and it has several advantages over python, which I list at the end.

In order to see that Scala is a good fit for DS, we have to cover all of a data scientist’s needs. Actually, they don’t need much:

  • a way to “munge” the data they have
  • a way to visualise these data
  • a way to run many machine learning algorithms on them (ranging from random forests to neural networks, with a decent share of clustering methods)
  • a way to do all that in a “notebook”,
    and they want to do that with the fewest keystrokes possible (that box is immediately ticked, as Scala can be extremely concise).

Today, we only consider “medium-sized” data. That is, data that easily fit into the memory of one (decent) computer (the kind of data that you can handle with pandas). In order to do “big data”, you should turn to Spark, which is actually written in Scala.

In this post, we will manipulate the data of all the cards of the game Magic the Gathering. This dataset comes from kaggle, and contains JSON representations of the cards.

In the following, we will describe how to use notebooks. Then, we will see how to manipulate data in Scala. We will then visualise these data and finally apply machine learning to them. The last section will be devoted to listing some of the advantages of using Scala.

(1) For the purpose of this post, let’s call “Data Science” the part that consists in searching for the best algorithms to leverage the data you have, and “Data Engineering” the part that consists in putting these algorithms into production.

The notebook

If you don’t know what a notebook is, we’ll simply say that it is a non-linear console, where input code lies in “cells” and the corresponding output (which can be plots and charts) is displayed directly below them. A notebook is a tool that data scientists like because it is highly interactive, offers quick iterations of “trial and error”, and allows you to display the whole thing nicely.

The most popular one is probably jupyter notebook. We said that a notebook is like a non-linear console, and it is actually backed by a genuine console running in the background, called a “kernel”. Obviously, the best-known kernel is the python one, but there are kernels for other languages, such as Scala.

An excellent choice for a Scala jupyter notebook kernel is almond (2). Importing (and automatically downloading if needed) libraries within almond is extremely easy, so it gets you working in no time. It also has nice Spark support, provided your server has enough CPUs to make it worthwhile (in my company, we have one!).

We will use almond in the sequel. You can find the final version of the notebook here (graphs are not visible from GitHub). All the code snippets below can be put into separate cells.

(2) You’ll find all the instructions to install it on their website.

Munging data

If you use python in your everyday life, there are two things that you are probably not used to: a fast language, and a comprehensive standard collection library. Scala enjoys both of these things. That means that you don’t need third party libraries such as pandas or numpy to manipulate data.

The Magic cards file `AllCards.json` contains one big JSON object of the form `cardName: cardInfoObject`. Let’s read it, using os-lib to read the file and ujson to parse the object. We import these libraries with

import $ivy.`com.lihaoyi::upickle:0.7.5`
import $ivy.`com.lihaoyi::os-lib:0.3.0`

and we use them by doing

val cards = ujson.read(os.read(os.pwd / "AllCards.json")).obj

`cards` is now a “Map” (a dictionary, if you prefer) from the names of the cards to their JSON info. There are `cards.size` (= 19848) cards in total. We can for example get the information of the “Island” card: `cards("Island")`.

Here we are in one of those fairy tales where the data are clean. But we can still start manipulating them, so that we have something to display in the next section.

Converted mana cost
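
Here is a minimal sketch of this step, assuming the MTGJSON field names (`convertedManaCost` in recent dumps, `cmc` in older ones):

// Converted mana cost of every card; cards without the field (e.g. lands) count as 0.
val ccms: Seq[Double] =
  cards.values.toSeq.map(_.obj.get("convertedManaCost").flatMap(_.numOpt).getOrElse(0.0))

// Number of cards per converted mana cost, sorted by cost.
val ccmCounts: Seq[(Double, Int)] =
  ccms.groupBy(identity).map { case (ccm, group) => ccm -> group.size }.toSeq.sortBy(_._1)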

The zero CCM cards
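
Under the same assumption on field names (`types` is an array such as `["Land"]` or `["Artifact"]`), a sketch could be:

// Zero-cost cards grouped by their types, most frequent first.
val zeroCcmTypes: Seq[(String, Int)] = cards.values.toSeq
  .filter(_.obj.get("convertedManaCost").flatMap(_.numOpt).forall(_ == 0.0))
  .flatMap(_.obj.get("types").toSeq.flatMap(_.arr).map(_.str))
  .groupBy(identity)
  .map { case (tpe, group) => tpe -> group.size }
  .toSeq
  .sortBy(-_._2)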

Power/Toughness of creatures
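
Power and toughness are stored as strings (`"2"`, `"3"`, but also `"*"` or `"1+*"`), so this sketch only keeps the values that are plain integers:

// Helpers reused below: parse a numeric stat, and detect creatures.
def statOpt(card: ujson.Value, key: String): Option[Int] =
  card.obj.get(key).flatMap(_.strOpt).flatMap(s => scala.util.Try(s.toInt).toOption)

def isCreature(card: ujson.Value): Boolean =
  card.obj.get("types").exists(_.arr.exists(_.str == "Creature"))

// (power, toughness) of every creature with purely numeric stats.
val creatureStats: Seq[(Int, Int)] = cards.values.toSeq
  .filter(isCreature)
  .flatMap(card =>
    for {
      power     <- statOpt(card, "power")
      toughness <- statOpt(card, "toughness")
    } yield (power, toughness)
  )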

As you can see, manipulating “complex” data (i.e., data more involved than a data frame structure) is easy and doesn’t require you to learn complicated APIs.

Let’s now visualise what we obtained above.

Visualising data

Let’s first import the library and all the needed packages
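
Assuming plotly-scala’s almond bindings (the exact version below is a guess, adapt it to your setup):

import $ivy.`org.plotly-scala::plotly-almond:0.7.0`

import plotly._
import plotly.element._
import plotly.layout._
import plotly.Almond._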

Let’s plot the distribution of CCMs:
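
A sketch, using the `ccmCounts` computed above:

// Bar chart of the number of cards per converted mana cost.
Bar(
  ccmCounts.map(_._1.toString),
  ccmCounts.map(_._2)
).plot(title = "Number of cards per CCM")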

which gives

The plot for the different types of zero CCM cards is similar:
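
Same kind of chart, this time on `zeroCcmTypes`:

// Bar chart of the types of zero CCM cards.
Bar(
  zeroCcmTypes.map(_._1),
  zeroCcmTypes.map(_._2)
).plot(title = "Types of zero CCM cards")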

We then draw a scatter plot of the pairs `(CCM, sum of stats)`. Since many pairs coincide, we account for that in the size of the markers.
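
A sketch of it; treat the per-point `Marker(size = ...)` as an assumption about the plotly-scala version at hand:

// For each creature, take (CCM, power + toughness), then count identical pairs.
val costVsStats: Seq[((Double, Int), Int)] = cards.values.toSeq
  .filter(isCreature)
  .flatMap(card =>
    for {
      ccm       <- card.obj.get("convertedManaCost").flatMap(_.numOpt)
      power     <- statOpt(card, "power")
      toughness <- statOpt(card, "toughness")
    } yield (ccm, power + toughness)
  )
  .groupBy(identity)
  .map { case (pair, group) => pair -> group.size }
  .toSeq

// Scatter plot where the marker size reflects the number of creatures on the point.
Scatter(
  costVsStats.map(_._1._1),
  costVsStats.map(_._1._2.toDouble),
  mode = ScatterMode(ScatterMode.Markers),
  marker = Marker(size = costVsStats.map(_._2))
).plot(title = "CCM vs sum of stats")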

As you can see, outputting charts in the notebook is as simple as it should be. Other examples can be found in this notebook. Another excellent library for plotting in the notebook is Vegas, which will be much closer to what python users know with matplotlib.

(3) As I shall discuss below, advantages of being typesafe in a notebook are real.

Applying machine learning algorithms

We will use Smile, a machine learning library for the JVM. A `DataFrame` in Smile is simply a matrix of features `x`, together with an array of targets (when relevant) `y`. Internally, these arrays are represented as `Double`s, with the types recorded in an array of `Attribute`s.

We’re going to build an artificial neural network (ANN, called multilayer perceptron in Smile) that, given the cost of a creature (accounting for the colours) and its power, will try to determine its toughness. Spoiler alert: we will fail. But that’s not the point. Since the abilities of a creature can highly affect its cost (in both directions, depending on whether the ability is a bonus or a drawback), we’ll only consider creatures without abilities (these are called “vanilla”).

In the JSON, the costs of the cards are represented as a `String`, where each mana symbol is enclosed in curly brackets. For example, a card that costs 2 white mana and 2 colourless mana will be written `{2}{W}{W}` (4). Let’s make a class `Colour` that is going to help us parse this colour information:
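
A possible shape for it (the actual class in the post may differ): one counter per colour, plus the generic part of the cost.

// Parses costs such as "{2}{W}{W}" into Colour(white = 2, ..., generic = 2).
final case class Colour(white: Int, blue: Int, black: Int, red: Int, green: Int, generic: Int) {
  def toArray: Array[Double] = Array(white, blue, black, red, green, generic).map(_.toDouble)
}

object Colour {
  def parse(cost: String): Colour = {
    val symbols = "\\{([^}]+)\\}".r.findAllMatchIn(cost).map(_.group(1)).toList
    def count(symbol: String): Int = symbols.count(_ == symbol)
    val generic = symbols.flatMap(s => scala.util.Try(s.toInt).toOption).sum
    Colour(count("W"), count("U"), count("B"), count("R"), count("G"), generic)
  }
}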

Now we can get a list of vanilla creatures that contains the colour and the stats:
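
Still under the same field-name assumptions (true vanilla creatures have no `text` field):

// Vanilla creatures, with their parsed cost and their stats.
final case class Vanilla(colour: Colour, power: Int, toughness: Int)

val vanillas: Seq[Vanilla] = cards.values.toSeq
  .filter(card => isCreature(card) && card.obj.get("text").forall(_.strOpt.forall(_.isEmpty)))
  .flatMap(card =>
    for {
      cost      <- card.obj.get("manaCost").flatMap(_.strOpt)
      power     <- statOpt(card, "power")
      toughness <- statOpt(card, "toughness")
    } yield Vanilla(Colour.parse(cost), power, toughness)
  )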

We are ready to enter Smile. We first import it:
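
Assuming a Smile version from that era (the coordinates are the usual ones, the version is a guess):

import $ivy.`com.github.haifengl::smile-scala:1.5.3`

import smile.classification._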

The first thing to do is to create a `DataFrame`. We need to provide the matrix `x` of features, the array `y` of targets and the corresponding attributes:
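
Here is a sketch of the two arrays only; the `Attribute`s and the `DataFrame` wrapping itself follow the documentation of the Smile version you use:

// One row per vanilla creature: the colour breakdown of its cost plus its power.
// The toughness is what we will try to predict.
val x: Array[Array[Double]] = vanillas.map(v => v.colour.toArray :+ v.power.toDouble).toArray
val y: Array[Int]           = vanillas.map(_.toughness).toArray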

We normalize the data in the (0, 1) range:
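
A plain Scala min-max scaling does the job:

// Min-max normalisation, column by column.
val mins = x.transpose.map(_.min)
val maxs = x.transpose.map(_.max)

val xScaled: Array[Array[Double]] = x.map(_.zipWithIndex.map { case (value, j) =>
  if (maxs(j) == mins(j)) 0.0 else (value - mins(j)) / (maxs(j) - mins(j))
})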

We can then simply try an artificial neural network on that data, and look at the accuracy of the model:
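
Below is a sketch using Smile 1.x’s `NeuralNetwork`; the constructor and enum names come from that generation of the API and may need adapting to your version:

// The toughness values are used directly as class labels.
val numClasses = y.max + 1

def train(xs: Array[Array[Double]], ys: Array[Int], epochs: Int = 30): NeuralNetwork = {
  val ann = new NeuralNetwork(
    NeuralNetwork.ErrorFunction.CROSS_ENTROPY,
    NeuralNetwork.ActivationFunction.SOFTMAX,
    xs.head.length, 10, numClasses          // input, hidden and output layer sizes
  )
  (1 to epochs).foreach { _ =>
    xs.zip(ys).foreach { case (row, label) => ann.learn(row, label) }  // one pass per epoch
  }
  ann
}

def accuracy(model: NeuralNetwork, xs: Array[Array[Double]], ys: Array[Int]): Double =
  xs.zip(ys).count { case (row, label) => model.predict(row) == label }.toDouble / ys.length

val fullModel = train(xScaled, y)
accuracy(fullModel, xScaled, y)   // training accuracy on the whole dataset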

You should get something that’s not that terrible (like 70%), but there’s a good chance we overfit. To check that, we need to make a train/test split on the data. This is rather straightforward:
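
For instance, with a plain shuffle of the indices (an 80/20 split):

// Shuffle the row indices and keep 80% of them for training.
val shuffled = scala.util.Random.shuffle(xScaled.indices.toList)
val (trainIdx, testIdx) = shuffled.splitAt((0.8 * xScaled.length).toInt)

val (xTrain, yTrain) = (trainIdx.map(i => xScaled(i)).toArray, trainIdx.map(i => y(i)).toArray)
val (xTest, yTest)   = (testIdx.map(i => xScaled(i)).toArray, testIdx.map(i => y(i)).toArray)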

and we can then train it
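
Reusing the `train` and `accuracy` helpers sketched above:

// Train on the training subset only, then compare both accuracies.
val splitModel = train(xTrain, yTrain)

val trainAccuracy = accuracy(splitModel, xTrain, yTrain)
val testAccuracy  = accuracy(splitModel, xTest, yTest)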

And you should see something close to 0.66 for the test accuracy, which is actually not that far from the training accuracy.

History of accuracy

The implementation of the training is just a `foreach` iteration over the epochs (see the `train` sketch above).

We simply need to change that iteration to keep track of the accuracy after each epoch. We need a training and a test subset for that.
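
A sketch of that modified loop, reusing the helpers above:

// Same training loop, but we record the train and test accuracy after each epoch.
val annHist = new NeuralNetwork(
  NeuralNetwork.ErrorFunction.CROSS_ENTROPY,
  NeuralNetwork.ActivationFunction.SOFTMAX,
  xTrain.head.length, 10, numClasses
)

val epochs = 30
val history: Seq[(Double, Double)] = (1 to epochs).map { _ =>
  xTrain.zip(yTrain).foreach { case (row, label) => annHist.learn(row, label) }
  (accuracy(annHist, xTrain, yTrain), accuracy(annHist, xTest, yTest))
}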

We can run it, and then plot the history:
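
For instance with two `Scatter` traces (again assuming plotly-scala):

// One curve per subset; the gap between them shows where overfitting starts.
Seq(
  Scatter((1 to epochs).map(_.toDouble), history.map(_._1), name = "train"),
  Scatter((1 to epochs).map(_.toDouble), history.map(_._2), name = "test")
).plot(title = "Accuracy history")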

This would give you a plot that looks like the following, where we can clearly see the point where the model started to overfit.

(4) There are both black and blue cards in Magic, so to avoid ambiguity the letter symbol for blue is `U`.

Conclusion

I’ll try to list below the advantages of using Scala instead of python (or R) for DS, while also trying not to oversell the whole thing:

  • manipulating data doesn’t require third party libraries, even when the data are complex. Compared to pandas, which is “mandatory” in python and only handles data frames, Scala’s standard collection library can handle any kind of data, without sacrificing efficiency.
  • Scala is typesafe. When working in a notebook, in a script-like environment, it can seem useless. However, it has the big advantage that when you use functions and manipulate data, errors are spotted more clearly by the compiler. When you give the wrong argument to a function in python, you often get an error so deep in the stack trace that it is difficult to understand what was wrong.
  • it is easier to learn new libraries. This is very important and linked to the previous point. The fact that compilation errors explain better what is wrong is a big asset when learning new libraries. Source code is also often easier to read because it is way cleaner.
  • you will be more familiar with Spark. The Spark API is very much like the standard collection library, based on methods like `map`, `flatMap`, `filter`… Being familiar with the Scala API makes learning Spark a breeze.
  • you can use monkey patching. Monkey patching means adding methods to existing types. For example, it’s possible to add a `mean` method to lists of Doubles, if you deem it necessary (see the sketch below).
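
In Scala this is typically done with an implicit class; a minimal sketch:

// Adds a mean method to any sequence of Doubles.
implicit class RichDoubleSeq(values: Seq[Double]) {
  def mean: Double = if (values.isEmpty) 0.0 else values.sum / values.size
}

List(1.0, 2.0, 3.0).mean   // returns 2.0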

There are also drawbacks, among which the fact that the community is much smaller. That is not an issue when discussing what type of model to use (which is of course not language dependent), but implementation tricks can’t be shared as much.

Overall, I think that Scala is way better than python for manipulating the data, especially in a preprocessing phase. For visualising data and applying machine learning, I think they are both more or less on the same level. The big advantage of python, though, is its immense community.

I hope this will give you a taste of the possibilities and the tools at your disposal for data science in Scala.

Mathematician, Scala enthusiast, father of two.