A little bit of Data Science in Scala

A data scientist typically needs:

  • a way to “munge” the data they have;
  • a way to visualise these data;
  • a way to run many machine learning algorithms on them (ranging from random forests to neural networks, with a decent share of clustering methods);
  • a way to do all of that in a “notebook”, with as few keystrokes as possible (a requirement Scala meets immediately, as it can be extremely concise).

The notebook

Munging data

// Ammonite-style imports for JSON parsing and file-system access
import $ivy.`com.lihaoyi::upickle:0.7.5`
import $ivy.`com.lihaoyi::os-lib:0.3.0`

// Parse AllCards.json into a mutable map of card name -> card JSON object
val cards = ujson.read(os.read(os.pwd / "AllCards.json")).obj

Converted mana cost
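To give an idea of the kind of manipulation involved, here is a minimal, self-contained sketch. The card names and converted mana costs are hardcoded here (in the notebook they would come from the cards' converted-mana-cost field of `cards`), and the distribution is computed with the standard collections alone.

```scala
// Tiny stand-in for the parsed AllCards.json data: card name -> converted mana cost.
val cmcs: Map[String, Double] = Map(
  "Black Lotus"    -> 0.0,
  "Lightning Bolt" -> 1.0,
  "Counterspell"   -> 2.0,
  "Shivan Dragon"  -> 6.0
)

// Distribution of converted mana costs, using only the standard collections.
val distribution: Map[Double, Int] =
  cmcs.groupBy(_._2).map { case (cmc, group) => cmc -> group.size }

println(distribution.toList.sortBy(_._1))
```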

The zero CMC cards
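Selecting the zero-cost cards is a one-liner with the standard collections. A sketch, again on a small hardcoded sample rather than the full JSON:

```scala
// Hypothetical sample of card name -> converted mana cost.
val cmcs = Map(
  "Black Lotus"    -> 0.0,
  "Ornithopter"    -> 0.0,
  "Lightning Bolt" -> 1.0
)

// Names of the cards whose converted mana cost is zero.
val zeroCmcCards: List[String] =
  cmcs.collect { case (name, cmc) if cmc == 0.0 => name }.toList.sorted

println(zeroCmcCards)
```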

Power/Toughness of creatures
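Power and toughness are stored as strings in the JSON, since some cards carry values like “*” that are not numbers. A sketch of cleaning that up, assuming Scala 2.13's `toIntOption` and a hardcoded sample in place of the real data:

```scala
// Hypothetical sample: creature name -> (power, toughness), kept as strings
// because values such as "*" are not numbers.
val creatures = Map(
  "Shivan Dragon" -> ("5", "5"),
  "Grizzly Bears" -> ("2", "2"),
  "Tarmogoyf"     -> ("*", "1+*")
)

// Keep only the creatures whose power and toughness are purely numeric.
val numeric: Map[String, (Int, Int)] =
  creatures.flatMap { case (name, (p, t)) =>
    for (pw <- p.toIntOption; tg <- t.toIntOption) yield name -> (pw, tg)
  }

println(numeric)
```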

Visualising data
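The notebook used a plotting library for this step; as a library-free stand-in, here is a toy ASCII bar chart of a hypothetical converted-mana-cost distribution, just to show the shape of the data being plotted:

```scala
// Hypothetical distribution: converted mana cost -> number of cards.
val distribution = List(0.0 -> 3, 1.0 -> 8, 2.0 -> 12, 3.0 -> 7)

// One row per CMC, with a bar of '#' characters proportional to the count.
val rows = distribution.map { case (cmc, count) => f"$cmc%4.1f | ${"#" * count}" }
rows.foreach(println)
```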

Applying machine learning algorithms
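The notebook presumably relied on a dedicated machine learning library for this part; as a toy illustration of how little code a simple method needs, here is a one-dimensional k-means written with the standard library only (a sketch, not the article's actual approach):

```scala
// Toy one-dimensional k-means: assign each point to its nearest centroid,
// recompute centroids as cluster means, and repeat for a fixed number of steps.
def kMeans(points: List[Double], centroids: List[Double], steps: Int): List[Double] =
  if (steps == 0) centroids
  else {
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    val updated  = centroids.map(c => clusters.get(c).fold(c)(ps => ps.sum / ps.size))
    kMeans(points, updated, steps - 1)
  }

val points    = List(1.0, 1.2, 0.8, 9.0, 9.5, 10.0)
val centroids = kMeans(points, List(0.0, 5.0), steps = 10)
println(centroids.sorted)
```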

History of accuracy

Conclusion

  • manipulating data doesn’t require third-party libraries, even when the data are complex. Compared with pandas, which is all but mandatory in Python and only handles data frames, Scala’s standard collection library can handle any kind of data without sacrificing efficiency.
  • Scala is type-safe. In a notebook, in a script-like environment, this can seem superfluous. It has the big advantage, however, that when you use functions and manipulate data, errors are spotted more clearly by the compiler. When you pass the wrong argument to a function in Python, the error often surfaces so deep in the stack trace that it is hard to tell what went wrong.
  • it is easier to learn new libraries. This is important and linked to the previous point: compilation errors that explain clearly what is wrong are a big asset when learning a new library. The source code is also often easier to read, because it is much cleaner.
  • you will be more comfortable with Spark. The Spark API closely resembles the standard collection library, built on methods like `map`, `flatMap`, `filter`… Being familiar with the Scala collections makes learning Spark a breeze.
  • you can use monkey patching. Monkey patching means adding methods to existing types. For example, you can add a `mean` method to lists of Doubles, if you deem it necessary.
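As a sketch, such a `mean` extension can be written with an implicit class:

```scala
// "Monkey patching" in Scala: an implicit class adds a `mean` method to List[Double].
implicit class DoubleListOps(xs: List[Double]) {
  def mean: Double =
    if (xs.isEmpty) Double.NaN else xs.sum / xs.size
}

val m = List(1.0, 2.0, 3.0, 4.0).mean  // 2.5
```

The extension is scoped: `mean` is only available where `DoubleListOps` is in implicit scope, which avoids the global side effects of monkey patching in dynamic languages.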

--

Antoine Doeraene

Mathematician, Scala enthusiast, father of two.