Julia, Python, R & Scala

- a comparison from experience -

Overview

Stephan Sahm

founder of Jolin.io
organizer of Julia User Group Munich
full stack data consultant
10 years of experience in data science
5 years in consultancy
professional programmer
hobby: philosophy of mind

Outline

my personal time travel
- 2013 Python
- 2015 R
- 2018 Scala
- 2021 Julia
comparison
- Data Science
- Distributed Batch Computation
- Distributed Real-time Computation

2013 Osnabrück: Python

I was in academia, in Cognitive Science, where Matlab was widespread.
What brought me to Python was that it was said to be almost like Matlab, but more general-purpose and open source.

Magic

completely dynamic execution

object oriented with MetaClasses and multiple inheritance
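The two "magic" points above can be sketched in a few lines of plain Python. This is only an illustration: the names `Registered`, `registry`, `Walker`, `Swimmer` and `Duck` are made up for this example.

```python
# Hypothetical example: a metaclass that records every class created with it.
registry = []

class Registered(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        registry.append(name)  # runs at class-definition time
        return cls

class Walker(metaclass=Registered):
    def move(self):
        return "walk"

class Swimmer(metaclass=Registered):
    def move(self):
        return "swim"

# Multiple inheritance: the method resolution order (MRO) decides which move() wins.
class Duck(Walker, Swimmer):
    pass

duck = Duck()
first_move = duck.move()   # Walker comes first in the MRO

# Completely dynamic: replace a method on the class while the program runs.
Duck.move = lambda self: "fly"
second_move = duck.move()  # the existing instance sees the new method

print(registry, first_move, second_move)
```

Existing objects pick up the replaced method immediately, which is convenient interactively and one reason why static guarantees are hard to give.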

Main feeling

easy interactive procedural
standard object oriented for moving to production
if Python is fast, then it is not Python (but some underlying C)
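A rough way to see the last point, assuming CPython, where the built-in `sum` is implemented in C: time it against the same loop written in Python. The helper `python_loop_sum` is a name invented for this sketch, and the measured numbers will differ from machine to machine.

```python
import timeit

def python_loop_sum(values):
    # The same computation as sum(), but every step runs in the interpreter.
    total = 0
    for v in values:
        total += v
    return total

values = list(range(100_000))

loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)

print(f"pure Python loop: {loop_seconds:.4f}s")
print(f"built-in sum:     {builtin_seconds:.4f}s")
```

Both variants return the same result; only the place where the loop runs differs.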

Use for

glue language | plug & play
deep learning
data automation
performance optimization
large software development
(with strong open–closed principle requirements)

2015 Nijmegen: R

I came to R during my master's in applied stochastics.
It is still the standard language at numerous universities for analyses of all kinds.

Magic

copy-on-modify

lazy execution

meta programming everywhere
(inspect not only values but also syntax)
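For readers coming from Python, copy-on-modify and lazy argument evaluation can be approximated by contrast. The sketch below is Python, not R, and only emulates the two behaviours; all function names are invented for the illustration.

```python
import copy

def append_in_place(values):
    values.append(99)            # Python default: the caller's list is mutated
    return values

def append_copy_on_modify(values):
    values = copy.copy(values)   # emulate R: copy before the first modification
    values.append(99)
    return values

original = [1, 2, 3]
modified = append_copy_on_modify(original)   # original stays untouched

shared = [1, 2, 3]
append_in_place(shared)                      # shared itself has changed

# Lazy execution, emulated with a thunk: the argument is only evaluated if used.
evaluations = []

def expensive():
    evaluations.append("evaluated")
    return 42

def pick(use_default, default_thunk):
    return default_thunk() if use_default else 0

skipped = pick(False, expensive)   # expensive() is never called here
used = pick(True, expensive)       # now it is called once

print(original, modified, shared, skipped, used, evaluations)
```

In R both behaviours are the default, which is a large part of why scripting there feels safe.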

Main feeling

easy safe scripting
awesome plots, reports (RMarkdown) and dashboards (RShiny)

Use for

data analytics
reports
dashboards
ad-hoc analysis
performance optimization (R is especially memory intensive)
software development

2018 München: Scala

During my second year in industry, Apache Spark crossed my path as a way to do Big Data in R.
While you can use Spark as an R library, many features are only available in Scala. That is how I came to Scala.

Magic

by-type arguments

functional & object oriented (with multiple inheritance)

clean interface definition (traits) & extension methods

Main feeling

empowerment – super powerful language with perfect IDE support
few mandatory type annotations
generic code only via traits
can be quite overwhelming
(especially with changes between Scala 2 & Scala 3)

Use for

professional software development
Big Data (Spark)
Real-time (Kafka, Flink)
Good native performance
Data Science
Machine Learning

2021 München: Julia

Already at university a good friend told me about Julia. Ever since, it had been the new hot thing, the next generation.
In 2018 the stable Julia v1.0 was released and I took it as the occasion to learn the language. In 2021 I founded my own Julia consultancy.

Magic

generic functions

functions as generic interfaces (Multimethod)

works for custom objects too (fast like C structs)

just-in-time compilation
with extremely powerful meta programming capabilities
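The idea of a function as a generic interface that dispatches on the types of all its arguments can be imitated in Python with a small registry. This is a toy sketch, not how Julia implements multimethods: `MultiMethod`, `collide`, `Asteroid` and `Ship` are hypothetical names, and the lookup only matches exact types (no subtypes, no compilation).

```python
class MultiMethod:
    """Toy multimethod: picks an implementation by the types of ALL arguments."""

    def __init__(self, name):
        self.name = name
        self.methods = {}

    def register(self, *types):
        def decorator(func):
            self.methods[types] = func
            return func
        return decorator

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        if key not in self.methods:
            names = ", ".join(t.__name__ for t in key)
            raise TypeError(f"no method {self.name}({names})")
        return self.methods[key](*args)

# Custom types know nothing about the generic function ...
class Asteroid:
    pass

class Ship:
    pass

# ... and methods are added from the outside, independent of the class definitions.
collide = MultiMethod("collide")

@collide.register(Asteroid, Ship)
def _(asteroid, ship):
    return "asteroid hits ship"

@collide.register(Ship, Ship)
def _(left, right):
    return "ships bump"

print(collide(Asteroid(), Ship()))
print(collide(Ship(), Ship()))
```

In Julia this mechanism is built into every function and resolved by the just-in-time compiler, so it costs nothing at runtime in the typical case.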

Main feeling

empowerment – easy as Python, faster than Java/Scala, more flexible than both
feeling at home – everything is Julia from performance-critical to high-level
generic programming with builtin data science support

Use for

applied mathematics & data science
scientific computing (e.g. simulation, optimization, learning)
software development
performance optimization
traditional object orientation
non-data ecosystem is still small

Comparison

Python / R / Scala / Julia

generic programming
• polymorphism
- Python: inheritance (defined as part of object definition)
- R: not recommended (many different approaches exist in R)
- Scala: by-type arguments (defined independent of object definition)
- Julia: built into Multimethod (no need for extra definition)

• packaging
- Python: multiple common solutions: virtualenv, conda, scripts
- R: commonly no packages, just scripts with function definitions
- Scala: standard package system `sbt` with virtualenv support
- Julia: builtin package system with virtualenv support

multi dimensional array
• convenient syntax
- Python: numpy
- R: builtin (extra: nameable axes)
- Scala: there is no multidimensional array
- Julia: builtin

• broadcasting
- Python: mainly for numpy values/functions
- R: builtin elementwise operations (but no broadcasting)
- Julia: works for any function by prepending `.`

• missing
- Python: missing values represented as NaN (`float("nan")`)
- R: special treatment of missing
- Scala: represented as `null`, impacting performance because of boxing
- Julia: special treatment of missing with extra care for performance

• performance
- Python: bad performance on custom Python objects & functions
- R: performance better than native Python
- Scala: native performance is much faster than Python/R, but slower than C or Julia
- Julia: optimal performance, also for custom Julia types

dataframe

ecosystem
• plotting

excellent GPU-acceleration

• statistics & analytics

• scientific computing

• machine learning & deep learning

excellent in terms of flexibility
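The Python entry for missing values in the comparison above has practical consequences that are easy to overlook. A small standard-library sketch (the variable names are made up):

```python
import math

nan = float("nan")
readings = [1.5, nan, 2.5]

# NaN is a regular float, so nothing marks the value as "missing" in the type.
nan_equals_itself = (nan == nan)   # comparisons with NaN are always False

# Aggregations silently propagate NaN instead of skipping it.
naive_total = sum(readings)

# Missing values have to be filtered out explicitly.
clean = [x for x in readings if not math.isnan(x)]
clean_mean = sum(clean) / len(clean)

print(nan_equals_itself, naive_total, clean_mean)
```

R and Julia instead have a dedicated missing value that the language and its libraries treat specially.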

Distributed Batch Computation

Distributed Table/SQL: Apache Spark

Spark is written in Scala and often used from Python. Two styles:

Python/R: Plug & play Big Data Pipelines. Without much performance focus.
Scala: Performance oriented, custom Big Data Pipelines.

Difficulties with Spark:

5 SQL-like APIs (functional & sql-columns, typed & untyped, RDD)
machine learning support, but not suited to build custom ML algorithms
(underlying map-reduce computing model is too inflexible)

	Python	R
change elements (User Defined Functions)
aggregate rows (User Defined Aggregate Function)	🔲	🔲
aggregate rows in group
create new rows
no serialization	🔲	🔲
optimal performance	🔲	🔲
automated typing	🔲	🔲
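The four kinds of user-defined functions named in the table can be told apart by their shape: what goes in and what comes out. The following is a plain-Python sketch on a list of dictionaries, deliberately not using the Spark API; the data and names are invented.

```python
from collections import defaultdict

rows = [
    {"user": "a", "words": "x y"},
    {"user": "a", "words": "z"},
    {"user": "b", "words": "x"},
]

# 1. change elements (UDF): one value in, one value out
def word_count(words):
    return len(words.split())

counts = [word_count(r["words"]) for r in rows]

# 2. aggregate rows (UDAF): many rows in, one value out
total_words = sum(counts)

# 3. aggregate rows in group: many rows in, one value out per key
words_per_user = defaultdict(int)
for r in rows:
    words_per_user[r["user"]] += word_count(r["words"])

# 4. create new rows: one row in, zero or more rows out
exploded = [
    {"user": r["user"], "word": w}
    for r in rows
    for w in r["words"].split()
]

print(counts, total_words, dict(words_per_user), len(exploded))
```

In a distributed engine each of these shapes needs its own support for partitioning and serialization, which is why language bindings differ in what they offer.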

Distributed Machine/Deep Learning: Python Ray

relatively young (v1.0 Sept 2020)
actor model (distributed Classes) is more intuitive for building new algorithms
GPU support
written in Python, C++ & Java
performance optimization not really possible due to many layers
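To make the actor model concrete without depending on Ray, here is a minimal single-process sketch using only the standard library. It is not Ray's API: `CounterActor` and its message names are hypothetical, and a thread with a mailbox stands in for a remote worker.

```python
import queue
import threading

class CounterActor:
    """Owns private state; other code talks to it only through messages."""

    def __init__(self):
        self._count = 0
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled one at a time, so the state needs no locks.
        while True:
            message, reply = self._mailbox.get()
            if message == "stop":
                reply.put(None)
                return
            if message == "increment":
                self._count += 1
            reply.put(self._count)

    def send(self, message):
        # Returns immediately; the reply queue acts like a future.
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply

actor = CounterActor()
pending = [actor.send("increment") for _ in range(3)]
results = [reply.get() for reply in pending]
current = actor.send("get").get()
actor.send("stop").get()

print(results, current)
```

An algorithm built from such stateful actors reads much like ordinary object-oriented code, which is what makes the model intuitive compared to map-reduce pipelines.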

Distributed Array/Table: Python Dask / Julia

Dask Array / Dask DataFrame, can be run on top of Python Ray (v2022.04.2)
Julia DistributedArrays.jl (v0.6)
Julia Dagger.jl DTable (v0.14)

Flexible GPU Acceleration: CUDA.jl

most array algorithms work out of the box on CUDA.jl arrays
easily write custom CUDA kernels
widely supported (v3.9)

High Performance Computing: Julia

builtin support for distributed computation
100% Julia, hence highly optimized performance
very good MPI support via MPI.jl (v0.19)
strong alternative to Fortran HPC

Distributed Real-time Computation

Go-to platform: Apache Kafka (written in Java/Scala). Distributed. Resilient.

Java/Scala support is best

Apache Flink
Kafka Streams
Apache Spark
Apache Storm
Apache Samza

"Structured Streaming" for R/Python

table-functions supported like for standard Spark
implicit micro-batching
no stateful functions

Additional Python support is good

Apache Flink PyFlink

stateful functions are supported (since July 2021)
full support for table functions UDF, UDTF, UDAGG, UDTAGG

Python Faust ("Kafka Streams for Python")

stateful functions supported
original company dropped support silently
now community supported (but not everyone knows yet)

R / Python are in general suboptimal for real-time

reasonably fast only with large micro-batches
in step-by-step pipelines (e.g. with stateful functions) native R/Python performance is what counts (very slow)
iterators are slow in R/Python
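The difference between micro-batching and step-by-step processing can be sketched in plain Python. The helper names `micro_batches` and `running_total` are invented for this example; no streaming framework is involved.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a stream into lists of at most batch_size events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def running_total(stream):
    """Stateful step-by-step function: one output per incoming event."""
    total = 0
    for event in stream:
        total += event   # executed in the interpreter for every single event
        yield total

# Micro-batching: per-event overhead is paid once per batch,
# and the work inside a batch can be handed to fast vectorized code.
batch_sums = [sum(batch) for batch in micro_batches(range(10), 4)]

# Step-by-step: latency is minimal, but native language speed decides throughput.
step_totals = list(running_total(range(5)))

print(batch_sums, step_totals)
```

Larger batches amortize the interpreter overhead but increase latency, which is the trade-off behind the points above.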

real-time performance

Julia is very well suited for real-time streaming.

extremely fast iterators
iterators are easy and cheap to fork/combine
optimal performance also for step-by-step code
high-performance streaming libraries: Transducers.jl (v0.4), OnlineStats.jl (v1.5)
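To illustrate what single-pass ("online") statistics and forked iterators mean, here is a sketch in Python rather than Julia. It does not reproduce the API of OnlineStats.jl or Transducers.jl: `OnlineMeanVar` is a hypothetical class using Welford's algorithm, and `itertools.tee` stands in for forking an iterator.

```python
from itertools import tee

class OnlineMeanVar:
    """Mean and sample variance in one pass with constant memory (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def fit(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stream = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Fork the stream so that two independent consumers see the same events.
for_stats, for_max = tee(stream)

stats = OnlineMeanVar()
for x in for_stats:
    stats.fit(x)

largest = max(for_max)

print(stats.mean, stats.variance, largest)
```

In Python every `fit` call goes through the interpreter; the point of the Julia libraries is that the same per-event style compiles to tight machine code.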

However, as of now

still lacks support for distributed streaming (work in progress – please reach out if you are interested)

Current best choice for distributed streaming: Scala.

Summary

my personal time travel
- 2013 Python
- 2015 R
- 2018 Scala
- 2021 Julia
comparison
- Data Science
- Distributed Batch Computation
- Distributed Real-time Computation

Thank you for your attention.

I am happy to answer all your questions.

Contact jolin.io

stephan.sahm@jolin.io

+49 152 2406 7803

linkedin.com/company/jolin-io