A little bit of Data Science in Scala

  • a way to “munge” the data they have
  • a way to visualise these data
  • a way to run many machine learning algorithms on them (ranging from random forests to neural networks, with a decent share of clustering methods)
  • a way to do all that in a “notebook”,
    and they want to do that with the fewest keystrokes possible (a requirement Scala meets immediately, as it can be extremely concise).

The notebook

Munging data

import $ivy.`com.lihaoyi::upickle:0.7.5`
import $ivy.`com.lihaoyi::os-lib:0.3.0`
val cards = ujson.read(os.read(os.pwd / "AllCards.json")).obj

Converted mana cost

The zero CMC cards

Power/Toughness of creatures

Visualising data

Applying machine learning algorithms

History of accuracy

Conclusion

  • manipulating data doesn’t require third-party libraries, even when the data are complex. pandas is all but “mandatory” in Python and only handles data frames, whereas Scala’s standard collection library can handle any kind of data without sacrificing efficiency.
  • Scala is type-safe. In a notebook, a script-like environment, this can seem useless. However, it has the big advantage that errors in function calls and data manipulation are caught by the compiler and reported clearly. When you pass the wrong argument to a function in Python, the error often surfaces so deep in the stack trace that it is difficult to understand what went wrong.
  • it is easier to learn new libraries. This is very important and linked to the previous point: compilation errors that explain what is wrong are a big asset when learning a new library. Source code is also often easier to read because it is much cleaner.
  • you will be more familiar with Spark. The Spark API closely resembles the standard collection library, with methods like `map`, `flatMap`, `filter`… Being familiar with the Scala API makes learning Spark a breeze.
  • you can use monkey patching. Monkey patching means adding methods to existing types. For example, it’s possible to add a `mean` method to lists of Doubles, if you deem it necessary.
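To make the points above concrete, here is a minimal sketch using only the standard library. The `Card` case class, its field names, the `DoubleSeqOps` helper and the four hand-typed sample cards are assumptions made for illustration; they are not the data model used earlier in the notebook.

```scala
// Hypothetical card record; the field names are illustrative only.
final case class Card(name: String, cmc: Int, power: Option[Double])

// A tiny hand-typed sample standing in for the real dataset.
val sampleCards = List(
  Card("Grizzly Bears", 2, Some(2.0)),
  Card("Hill Giant", 4, Some(3.0)),
  Card("Lightning Bolt", 1, None), // not a creature, so no power
  Card("Llanowar Elves", 1, Some(1.0))
)

// "Monkey patching" in Scala: an implicit class adds `mean` to any Seq[Double].
implicit class DoubleSeqOps(private val xs: Seq[Double]) {
  def mean: Double = if (xs.isEmpty) Double.NaN else xs.sum / xs.size
}

// flatMap keeps only the cards that have a power value.
val powers: List[Double] = sampleCards.flatMap(_.power)
val meanPower: Double = powers.mean

// filter and map, the same vocabulary the Spark API uses.
val cheapNames: List[String] = sampleCards.filter(_.cmc <= 1).map(_.name)

// groupBy: number of cards per converted mana cost.
val countByCmc: Map[Int, Int] =
  sampleCards.groupBy(_.cmc).map { case (cmc, cs) => cmc -> cs.size }
```

Because `mean` is only defined for sequences of `Double`, a call such as `List("a", "b").mean` is rejected at compile time rather than failing at run time, which is the type-safety benefit described in the second point.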

Mathematician, Scala enthusiast, father of two.

Antoine Doeraene
