A little bit of Data Science in Scala

Broadly, a data scientist wants:

  • a way to “munge” the data they have
  • a way to visualise these data
  • a way to run many machine learning algorithms on them (ranging from random forests to neural networks, with a decent share of clustering methods)
  • a way to do all that in a “notebook”,
    and they want to do all of this with the fewest keystrokes possible (that box is immediately ticked, as Scala can be extremely concise).

The notebook

As we said, one important thing that data scientists like is the ability to work in a “notebook”.

Munging data

When a data scientist receives data “from the business”, they are never clean and beautiful (that only happens in fairy tales and machine learning tutorials). Instead, they are messy, poorly organised, sometimes missing, and so on. The first job of a data scientist is therefore always to massage the data into a form that is easy to work with. The final form that data take is (nearly) always a “data frame”, which is simply a glorified table in which each piece of data is a row and each feature has its own column. Munging data also goes hand in hand with the next section, which is about visualising data.

import $ivy.`com.lihaoyi::upickle:0.7.5`
import $ivy.`com.lihaoyi::os-lib:0.3.0`
val cards = ujson.read(os.read(os.pwd / "AllCards.json")).obj

Converted mana cost

We may, for example, want to look at the distribution of “converted mana costs” (CMC). If you are not familiar with the game, the CMC is simply the “total cost” of a card, regardless of colours. This is simply done with:
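As a sketch of that computation (in the notebook the same fields would be read from the `cards` JSON object loaded above; here a hand-made `Card` case class and sample values stand in for the mtgjson data):

```scala
// Sketch: CMC distribution with the standard collection library.
// `Card` and the sample values are made up for illustration.
case class Card(name: String, cmc: Int, types: Seq[String])

val sample = Seq(
  Card("Llanowar Elves", 1, Seq("Creature")),
  Card("Shock", 1, Seq("Instant")),
  Card("Grizzly Bears", 2, Seq("Creature")),
  Card("Forest", 0, Seq("Land"))
)

// Group the cards by CMC and count each group.
val cmcDistribution: Map[Int, Int] =
  sample.groupBy(_.cmc).view.mapValues(_.size).toMap
// cmcDistribution(1) == 2, cmcDistribution(0) == 1, cmcDistribution(2) == 1
```

The whole munging step is a `groupBy` followed by a count, with no third-party library in sight.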

The zero CCM cards

Among the cards that cost nothing, we can distinguish three interesting cases: being only a “Land”, being a “Land” and something else, or not being a “Land” at all. We can easily get the distribution of those as well:
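A possible sketch of that classification, again with made-up sample data (the `types` field name follows the mtgjson schema, and the three labels are purely illustrative):

```scala
// Classify zero-CMC cards by whether "Land" is their only type, one of
// several types, or absent. Sample cards and labels are illustrative.
case class Card(name: String, types: Seq[String])

val zeroCmcCards = Seq(
  Card("Forest", Seq("Land")),
  Card("Dryad Arbor", Seq("Land", "Creature")),
  Card("Ornithopter", Seq("Artifact", "Creature")),
  Card("Island", Seq("Land"))
)

def landKind(card: Card): String =
  if (card.types == Seq("Land")) "LandOnly"
  else if (card.types.contains("Land")) "LandPlus"
  else "NoLand"

val distribution: Map[String, Int] =
  zeroCmcCards.groupBy(landKind).view.mapValues(_.size).toMap
// distribution("LandOnly") == 2
```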

Power/Toughness of creatures

“Creatures” are a special kind of card that has a “Power” and a “Toughness”. The power and toughness of a creature are integers (although I believe nothing in the rules says that they couldn’t be non-integers) or a “*” (in which case the value is computed by a special ability of the creature). Obviously, the CMC of a creature has an impact on its stats (power and toughness). The abilities of the creature do as well, but as they are text, their effect is difficult to measure. We will look at the correlation between the CMC and the sum of the stats. We will remove all creatures that have an ability, and all creatures for which the power or toughness is not an integer (to remove weird cases):
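A sketch of that filtering, followed by a hand-rolled Pearson correlation on what remains (the sample creatures are made up; in the mtgjson data, power and toughness arrive as strings such as "2" or "*"):

```scala
// Keep only creatures whose power and toughness parse as integers,
// then correlate CMC with power + toughness. Sample data is made up.
case class Creature(name: String, cmc: Double, power: String, toughness: String)

val creatures = Seq(
  Creature("Grizzly Bears", 2, "2", "2"),
  Creature("Tarmogoyf", 2, "*", "1+*"), // dropped below: non-integer stats
  Creature("Hill Giant", 4, "3", "3")
)

// (CMC, power + toughness) pairs for the creatures with integer stats.
val points: Seq[(Double, Double)] = creatures.flatMap { c =>
  for {
    p <- c.power.toIntOption
    t <- c.toughness.toIntOption
  } yield (c.cmc, (p + t).toDouble)
}

// Plain Pearson correlation, written out with the standard library.
def pearson(pairs: Seq[(Double, Double)]): Double = {
  val n = pairs.size
  val (xs, ys) = pairs.unzip
  val mx = xs.sum / n
  val my = ys.sum / n
  val cov = pairs.map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}
// pearson(points) == 1.0 for the two points left in this tiny sample
```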

Visualising data

The Almond Scala kernel comes with bindings for plotly, a JSON-based library for drawing interactive SVG graphs. The binding API is based on case classes that model plotly’s JSON structures, providing a typesafe and intuitive syntax.
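As a hedged sketch of what a plot looks like in a notebook cell (the artifact coordinates, version, and the exact `plot` helper are assumptions — check the plotly-scala README for your setup):

```scala
// Notebook-only sketch: load the plotly bindings and draw a bar chart.
// Artifact name/version and helper signatures may differ in your
// Almond / plotly-scala versions.
import $ivy.`org.plotly-scala::plotly-almond:0.7.0`

import plotly._
import plotly.element._
import plotly.layout._
import plotly.Almond._

// A bar chart is just a case class; `.plot()` renders it in the cell.
// The counts here are made-up placeholders for the CMC distribution.
Bar(Seq("0", "1", "2", "3"), Seq(120, 340, 410, 380))
  .plot(title = "Cards per converted mana cost")
```

Because the chart is an ordinary case class, it can be built, transformed, and inspected with the usual collection methods before being rendered.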

Applying machine learning algorithms

It is now time to apply some machine learning to our data. A nice library that provides many high-quality ML implementations is Smile. As Smile describes itself, it is a “fast and comprehensive machine learning engine”. Once you have your data inside its `DataFrame` abstraction, you can run a plethora of ML algorithms with a single line of code.
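As a rough, notebook-only sketch (the artifact version and exact signatures are assumptions — Smile’s Scala API has changed across major versions, so consult the documentation for yours):

```scala
// Notebook-only sketch: train a random forest with Smile's Scala API
// (1.5.x-era array-based signatures assumed).
import $ivy.`com.github.haifengl::smile-scala:1.5.3`

import smile.classification._

// Toy features (x) and labels (y); real code would fill these from the
// card data munged above.
val x: Array[Array[Double]] = Array(
  Array(1.0, 2.0), Array(2.0, 1.0), Array(8.0, 9.0), Array(9.0, 8.0)
)
val y: Array[Int] = Array(0, 0, 1, 1)

val model = randomForest(x, y) // one line of ML, as promised
val prediction = model.predict(Array(8.5, 8.5))
```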

History of accuracy

The final accuracy of a neural network is obviously important, but the history of that accuracy during training can also give a lot of information. The Smile implementation of the NN training is actually so simple that we can adapt it to record that history.
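Independently of Smile’s internals, the idea is simply to drive the epoch loop yourself and record the accuracy after each pass. A schematic, self-contained version, where `trainEpoch` and `accuracy` are hypothetical stand-ins for one of Smile’s online-update passes and an evaluation on a held-out set:

```scala
// Generic sketch of recording an accuracy history during training.
// `trainEpoch` and `accuracy` are placeholders, NOT real Smile calls.
def trainEpoch(weights: Vector[Double]): Vector[Double] =
  weights.map(_ * 0.9) // placeholder for one online-update pass

def accuracy(weights: Vector[Double]): Double =
  1.0 - weights.map(math.abs).sum / weights.size // placeholder metric

val epochs = 5
val history: Seq[Double] =
  (1 to epochs)
    .scanLeft(Vector(1.0, -1.0))((w, _) => trainEpoch(w)) // weights per epoch
    .tail                // drop the initial weights
    .map(accuracy)       // evaluate after each epoch
// `history` holds one accuracy value per epoch, ready to be plotted
```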

Conclusion

We’ve seen, in Scala, all the steps that typically fall to a data scientist. Obviously, this was a toy example and we didn’t achieve anything amazing, but again, that was not the point. It should mostly give you an idea of how one can do Data Science with Scala.

  • manipulating data doesn’t require third-party libraries, even when the data are complex. Compared to pandas, which is “mandatory” in Python and only handles data frames, Scala’s standard collection library can handle any kind of data, without sacrificing efficiency.
  • Scala is typesafe. When working in a notebook, in a script-like environment, this can seem useless. However, it has the big advantage that when using functions and manipulating data, errors are spotted more clearly by the compiler. When you give the wrong argument to a function in Python, you often get an error so deep into the stack trace that it is difficult to understand what went wrong.
  • it is easier to learn new libraries. This is very important and linked to the previous point. The fact that compilation errors explain better what is wrong is a big asset when learning new libraries. Source code is also often easier to read, because it is much cleaner.
  • you will be more familiar with Spark. The Spark API is very much like the standard collection library, based on methods like `map`, `flatMap`, `filter`… Being familiar with the Scala API makes learning Spark a breeze.
  • you can use monkey patching. Monkey patching is the practice of adding methods to existing types. For example, it’s possible to add a `mean` method to lists of `Double`s, if you deem it necessary.
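For the record, the Scala flavour of monkey patching is an implicit class (an extension method in Scala 3); here is the `mean` example from the last bullet, as a minimal sketch:

```scala
// Add a `mean` method to List[Double] via an implicit class.
implicit class DoubleListOps(xs: List[Double]) {
  def mean: Double =
    if (xs.isEmpty) Double.NaN else xs.sum / xs.size
}

List(1.0, 2.0, 3.0).mean // 2.0
```

Unlike monkey patching in dynamic languages, the added method is resolved at compile time, so it cannot silently clash with anything at runtime.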
