A little bit of Data Science in Scala

Antoine Doeraene
Oct 14, 2019

It is stating the obvious to say that nowadays, Data Science (DS) is dominated by python (with R being a decent challenger). It is however arguably not the best language for Data Engineering (1), where more robust languages such as Scala tend to shine. As we shall see, though, Scala is also a prime choice for tinkering with data. It has all the tools a data scientist needs, and it has several advantages over python, which I list at the end.

In order to see that Scala is a good fit for DS, we have to accommodate all of a data scientist’s requests. Actually, they don’t need much:

  • a way to “munge” the data they have
  • a way to visualise these data
  • a way to run many machine learning algorithms on them (ranging from random forests to neural networks, with a decent share of clustering methods)
  • a way to do all that in a “notebook”,
    and they want to do all of that with as few keystrokes as possible (a requirement that is immediately met, as Scala can be extremely concise).

Today, we only consider “medium-sized” data. That is, data that easily fit into the memory of one (decent) computer (the kind of data that you can handle with pandas). In order to do “big data”, you should turn to Spark, which is actually written in Scala.

In this post, we will manipulate the data of all the cards of the game Magic the Gathering. This dataset comes from kaggle, and contains JSON representations of the cards.

In the following, we will describe how to use notebooks. Then, we will see how to manipulate data in Scala. We will then visualise these data and finally apply machine learning to them. The last section will be devoted to listing some of the advantages of using Scala.

(1) For the purpose of this post, let’s consider as “Data Science” the part consisting of searching for the best algorithms to leverage the data you have, and “Data Engineering” for putting these algorithms into production.

The notebook

As we said, one important thing that data scientists like is the ability to work in a “notebook”.

If you don’t know what a notebook is, we’ll simply say that it is a non-linear console, where input code lies in “cells” and the corresponding output (which can be plots and charts) is directly displayed below them. A notebook is a tool that data scientists like because it is highly interactive, offers quick iterations on “trial and error”, and allows you to nicely display the whole thing.

The most popular one is probably jupyter notebook. We said that a notebook is like a non-linear console. And it is actually a genuine console running in the background, called a “kernel”. Obviously, the best-known kernel is the one for python, but there are kernels for other languages, such as Scala.

An excellent choice for a Scala jupyter notebook kernel is almond (2). Importing (and automatically downloading if needed) libraries within almond is extremely easy, so it gets you working in no time. It also has nice Spark support, provided your server has enough CPUs to make it worthwhile (in my company, we have one!).

We will use almond in what follows. You can find the final version of the notebook here (graphs are not visible from GitHub). All the code snippets below can be put into separate cells.

(2) You’ll find all the instructions to install it on their website.

Munging data

When a data scientist receives data “from the business”, they are never clean and beautiful (that only happens in fairy tales and machine learning tutorials). Instead, they are messy, badly organised, sometimes missing, and so on. The first job of a data scientist is thus always to massage the data in order to be able to work with it easily. The final form that data take is (nearly) always a “data frame”, which is simply a glorified table where each piece of data is a row, and each data feature has its column. Munging data also goes hand in hand with the next section, which is about visualising data.

If you use python in your everyday life, there are two things that you are probably not used to: a fast language, and a comprehensive standard collection library. Scala enjoys both of these things. That means that you don’t need third party libraries such as pandas or numpy to manipulate data.

The Magic cards file `AllCards.json` contains a big JSON object of the form `cardName: cardInfoObject`. Let’s read it, using os-lib for reading the file, and ujson for parsing the object. We import these libraries with

import $ivy.`com.lihaoyi::upickle:0.7.5`
import $ivy.`com.lihaoyi::os-lib:0.3.0`

and we use them by doing

val cards = ujson.read(os.read(os.pwd / "AllCards.json")).obj

(`cards` is now a `Map` (a dictionary, if you prefer) from the names of the cards to their JSON info.) There are `cards.size` (= 19848) cards in total. We can for example get the information of the “Island” card: `cards("Island")`.

Here we are in one of those fairy tales where the data are clean. But we can still start to manipulate them to display stuff in the next section.

Converted mana cost

We may for example want to see the distribution of “converted mana costs” (CCM). If you are not familiar with the game, it is simply the “total cost” of a card, regardless of colours. This is simply done with
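a cell along these lines (a sketch only: the numeric `convertedManaCost` field name is an assumption about this particular dump):

// Distribution of converted mana costs: how many cards cost 0, 1, 2, … mana in total.
val ccmDistribution = cards.values
  .map(_("convertedManaCost").num.toInt)   // assumes every card carries this field
  .groupBy(identity)
  .map { case (ccm, group) => ccm -> group.size }
  .toSeq
  .sortBy(_._1)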

The zero CCM cards

Among the cards that cost nothing, we can distinguish three interesting cases: being only a “Land”, being a “Land” and something else (say, “Composite”), or not being a “Land” at all. We can easily get the distribution of those as well:
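For instance (same caveat about field names, here `types` and `convertedManaCost`):

// Categorise the zero-cost cards by their types.
val zeroCcmTypes = cards.values
  .filter(_("convertedManaCost").num == 0)
  .map { card =>
    val types = card("types").arr.map(_.str).toList
    if (types == List("Land")) "Land only"
    else if (types.contains("Land")) "Composite"
    else "Not a Land"
  }
  .groupBy(identity)
  .map { case (category, group) => category -> group.size }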

Power/Toughness of creatures

“Creatures” are a special kind of card that have a “Power” and a “Toughness”. The power and toughness of a creature are integers (although I believe nothing in the rules says that they couldn’t be non-integers) or a “*” (in which case they are computed by the special ability of the creature). Obviously, the CCM of a creature has an impact on its stats (power and toughness). The ability of the creature does as well, but as it is free text, it is difficult to measure. We will look at the correlation between the CCM and the sum of the stats. We will remove all creatures that have an ability, and all creatures for which the power or toughness is not an integer (to remove weird cases):
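A possible way to do it (the `text`, `power` and `toughness` field names are, again, assumptions about the dump):

// Keep creatures with no rules text and integer power/toughness,
// as pairs (CCM, power + toughness).
val statsByCcm = cards.values.toList
  .filter(_.obj.get("types").exists(_.arr.exists(_.str == "Creature")))
  .filter(_.obj.get("text").forall(_.str.isEmpty))            // no ability
  .flatMap { card =>
    val power     = card.obj.get("power").fold("*")(_.str)
    val toughness = card.obj.get("toughness").fold("*")(_.str)
    if (power.nonEmpty && toughness.nonEmpty && (power + toughness).forall(_.isDigit))
      Some(card("convertedManaCost").num.toInt -> (power.toInt + toughness.toInt))
    else None
  }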

As you can see, manipulating “complex” data (i.e., more involved than a data-frame structure) is easy and doesn’t require learning complicated APIs.

Let’s now visualise what we obtained above.

Visualising data

The Almond Scala kernel comes with bindings for plotly. Plotly is a JSON-based library for outputting interactive SVG graphs. The binding API is based on case classes to model plotly’s JSONs, providing a typesafe (3) and intuitive syntax.

Let’s first import the library and all the needed packages:
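For instance (the version number is only indicative; pick whatever is current):

import $ivy.`org.plotly-scala::plotly-almond:0.7.2`

import plotly._, plotly.element._, plotly.layout._, plotly.Almond._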

Let’s plot the distribution of CCMs:
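Reusing the `ccmDistribution` computed above, a bar chart can be produced with something like:

val (ccms, counts) = ccmDistribution.unzip

// One bar per converted mana cost.
Bar(ccms.map(_.toString), counts).plot(title = "CCM distribution")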

which gives

The plot for the different types of zero-CCM cards is similar:
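For instance:

val (categories, categoryCounts) = zeroCcmTypes.toSeq.unzip

Bar(categories, categoryCounts).plot(title = "Types of zero CCM cards")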

We then plot a scatter plot of the pairs `(CCM, sum of stats)`. Since many pairs are identical, we account for their multiplicity in the size of the markers.
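Here is a sketch of such a cell (passing a sequence of sizes to `Marker` assumes plotly-scala accepts per-point sizes, as plotly.js does):

// Count how many creatures share each (CCM, power + toughness) pair.
val grouped = statsByCcm
  .groupBy(identity)
  .map { case ((ccm, stats), group) => (ccm, stats, group.size) }
  .toSeq

Scatter(
  grouped.map(_._1),
  grouped.map(_._2),
  mode = ScatterMode(ScatterMode.Markers),
  marker = Marker(size = grouped.map(_._3))   // marker sized by multiplicity
).plot(title = "Sum of stats vs CCM")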

As you can see, outputting charts in the notebook is as simple as it should be. Other examples can be found in this notebook. Another excellent library for plotting in the notebook is Vegas, which will be much closer to what python users know with matplotlib.

(3) As I shall discuss below, advantages of being typesafe in a notebook are real.

Applying machine learning algorithms

It is now time to apply some machine learning to our data. A nice library that provides many high-quality ML implementations is Smile. As Smile describes itself, it is a “fast and comprehensive machine learning engine”. Once you have your data in their `DataFrame` abstraction, you can run a plethora of ML algorithms with a single line of code.

A `DataFrame` in Smile is simply a matrix of features `x`, together with an array of targets `y` (when relevant). Internally, these arrays are represented as `Double`s, with the types recorded in an array of `Attribute`s.

We’re going to build an artificial neural network (ANN, called a multilayer perceptron in Smile) that, given the cost (accounting for the colours) and the power of a creature, will try to determine its toughness. Spoiler alert: we will fail. But that’s not the point. Since the abilities of a creature can highly affect its cost (in both directions, depending on whether the ability is a bonus), we’ll only consider creatures without abilities (these are called “vanilla”).

In the JSON, the costs of the cards are represented as a `String`, where each mana symbol is enclosed in curly brackets. For example, a card that costs 2 white mana and 2 colourless mana will be written `{2}{W}{W}` (4). Let’s make a class `Colour` that is going to help us parse this cost information:
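The original class is not reproduced here, but it could look like this:

// Counts the generic part and the number of symbols of each colour in a
// mana cost string such as "{2}{W}{W}".
final case class Colour(generic: Int, white: Int, blue: Int, black: Int, red: Int, green: Int) {
  def asVector: Vector[Double] = Vector(generic, white, blue, black, red, green).map(_.toDouble)
}

object Colour {
  private val symbol = "\\{([^}]+)\\}".r

  def parse(manaCost: String): Colour = {
    val symbols = symbol.findAllMatchIn(manaCost).map(_.group(1)).toList
    Colour(
      generic = symbols.filter(s => s.nonEmpty && s.forall(_.isDigit)).map(_.toInt).sum,
      white   = symbols.count(_ == "W"),
      blue    = symbols.count(_ == "U"),
      black   = symbols.count(_ == "B"),
      red     = symbols.count(_ == "R"),
      green   = symbols.count(_ == "G")
    )
  }
}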

Now we can get a list of vanilla creatures that contains the colour and the stats:
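For example (with the same field-name assumptions as before):

// Vanilla creatures with a parseable cost and integer power/toughness.
val vanillaCreatures = cards.values.toList
  .filter(_.obj.get("types").exists(_.arr.exists(_.str == "Creature")))
  .filter(_.obj.get("text").forall(_.str.isEmpty))
  .flatMap { card =>
    for {
      manaCost  <- card.obj.get("manaCost").map(_.str)
      power     <- card.obj.get("power").map(_.str) if power.nonEmpty && power.forall(_.isDigit)
      toughness <- card.obj.get("toughness").map(_.str) if toughness.nonEmpty && toughness.forall(_.isDigit)
    } yield (Colour.parse(manaCost), power.toInt, toughness.toInt)
  }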

We are ready to enter Smile. We first import it:
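With almond, something like this does the trick (a 1.5.x version was the current one at the time of writing; adjust the version as needed):

import $ivy.`com.github.haifengl::smile-scala:1.5.3`

import smile.classification._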

The first thing to do is to create a `DataFrame`. We need to provide the matrix `x` of features, the array `y` of targets and the corresponding attributes:
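How these arrays are then wrapped into Smile’s data structures depends on the Smile version, so here is only a sketch of the raw `x` and `y` arrays, built from the vanilla creatures of the previous section:

// Features: the parsed cost plus the power. Target: the toughness.
val x: Array[Array[Double]] = vanillaCreatures
  .map { case (colour, power, _) => (colour.asVector :+ power.toDouble).toArray }
  .toArray

val y: Array[Int] = vanillaCreatures.map { case (_, _, toughness) => toughness }.toArray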

We normalize the data to the (0, 1) range:
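A plain min-max scaling does the job:

// Scale each feature column to (0, 1).
val mins = x.transpose.map(_.min)
val maxs = x.transpose.map(_.max)

val xScaled: Array[Array[Double]] = x.map { row =>
  row.zipWithIndex.map { case (value, j) =>
    if (maxs(j) == mins(j)) 0.0 else (value - mins(j)) / (maxs(j) - mins(j))
  }
}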

We can then simply try an artificial neural network on that data, and look at the accuracy of the model:
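The exact training call is not reproduced here (it depends on the Smile version), but whatever model you end up with, its accuracy can be computed with a small helper like this one (Smile classifiers expose a `predict(Array[Double]): Int` method):

// Fraction of examples for which the prediction matches the target.
def accuracy(predict: Array[Double] => Int, xs: Seq[Array[Double]], ys: Seq[Int]): Double =
  xs.zip(ys).count { case (features, target) => predict(features) == target }.toDouble / xs.length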

You should get something that’s not that terrible (like 70%), but there’s a good chance we overfit. To check that, we need to make a train/test split of the data. This is rather straightforward:
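For example:

// Shuffle once, then keep 80% of the creatures for training and 20% for testing.
val rng      = new scala.util.Random(42)
val shuffled = rng.shuffle(xScaled.zip(y).toVector)

val (train, test)    = shuffled.splitAt((shuffled.length * 0.8).toInt)
val (trainX, trainY) = train.unzip
val (testX, testY)   = test.unzip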

and we can then train it

And you should see something close to 0.66 for the test accuracy, which is actually not that far away from the training accuracy.

History of accuracy

The final accuracy of a neural network is obviously important, but the history of that accuracy during training can also give a lot of information. The Smile implementation of the NN training is actually so simple that we can adapt it to keep that history.

The implementation of the training is:

We simply need to change the `foreach` iteration, and keep track of the accuracy. We need a training and a test subset for that.
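The adapted code is not shown here, but the idea can be sketched in a library-agnostic way, assuming a function performing one epoch of training and a prediction function (both provided by the model):

// After each epoch, record the accuracy on the training and test subsets.
def trainWithHistory(
  epochs:        Int,
  learnOneEpoch: () => Unit,
  predict:       Array[Double] => Int
): Seq[(Double, Double)] =
  (1 to epochs).map { _ =>
    learnOneEpoch()
    (accuracy(predict, trainX, trainY), accuracy(predict, testX, testY))
  }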

We can use it, and then plot the history
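For instance, assuming `history` holds the pairs returned by `trainWithHistory` above:

val epochs        = history.indices.map(_ + 1)
val trainAccuracy = Scatter(epochs, history.map(_._1), name = "train")
val testAccuracy  = Scatter(epochs, history.map(_._2), name = "test")

Seq(trainAccuracy, testAccuracy).plot(title = "Accuracy per epoch")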

This would give you a plot that looks like the following, where we can clearly see the point where the model started to overfit.

(4) There are both black and blue cards in Magic, so the letter symbol for blue is `U` (`B` being taken by black).

Conclusion

We’ve gone through all the steps that fall to a data scientist, in Scala. Obviously, it was a toy example and we didn’t even achieve anything amazing, but again, that was not the point. It mostly gives you an idea of how one can do Data Science with Scala.

I’ll try to list below the advantages of using Scala instead of python (or R) for DS, while also trying not to oversell the whole thing:

  • manipulating data doesn’t require third party libraries, even when the data are complex. Compared to pandas, which is “mandatory” in python and only handles data frames, Scala’s standard collection library can handle any kind of data, without sacrificing efficiency.
  • Scala is typesafe. When working in a notebook, in a script-like environment, it can seem useless. However, it has the big advantage that when using functions and manipulating data, errors are more clearly spotted by the compiler. When you give the wrong argument to a function in python, you often get an error so deep in the stack trace that it is difficult to understand what went wrong.
  • it is easier to learn new libraries. This is very important and linked to the previous point. The fact that compilation errors explain better what is wrong is a big asset when learning new libraries. Source code is also often easier to read because it is way cleaner.
  • you will be more familiar with Spark. The Spark API is very much like the standard collection library, based on methods like `map`, `flatMap`, `filter`… Being familiar with the Scala API makes learning Spark a breeze.
  • you can use monkey patching. Monkey patching means adding methods to existing types. For example, it’s possible to add a `mean` method to lists of Doubles, if you deem it necessary (see the sketch below).
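To illustrate that last point, here is a minimal sketch using an implicit class:

// "Monkey patching" in Scala: an implicit class adds a `mean` method to List[Double].
implicit class DoubleListOps(xs: List[Double]) {
  def mean: Double = if (xs.isEmpty) Double.NaN else xs.sum / xs.length
}

List(1.0, 2.0, 3.0).mean   // 2.0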

There are also drawbacks, among which the fact that the community is much smaller. It’s not an issue when discussing what type of model to use (that is of course not language dependent), but implementation tricks can’t be shared as much.

Overall, I think that Scala is way better than python for manipulating the data, especially in a preprocessing phase. For visualising data and applying machine learning, I think they are both more or less on the same level. The big advantage of python, though, is its immense community.

I hope this will give you a taste of the possibilities and the tools at your disposal for data science in Scala.
