Julia, Python, R & Scala
- a comparison from experience -
Overview
Stephan Sahm

- founder of Jolin.io
- organizer of Julia User Group Munich
- full-stack data consultant
- 10 years of experience in data science
- 5 years in consultancy
- professional programmer
- hobby: philosophy of mind
Outline
- my personal time travel
  - 2013 Python
  - 2015 R
  - 2018 Scala
  - 2021 Julia
- comparison
  - Data Science
  - Distributed Batch Computation
  - Distributed Real-time Computation

2013 Osnabrück: Python
I was in academia, in Cognitive Science, where Matlab was widespread.
What brought me to Python is that it was said to be almost like Matlab, but more generic and open source.
Magic
- completely dynamic execution
- object oriented with MetaClasses and multiple inheritance
Main feeling
- easy, interactive, procedural
- standard object oriented for moving to production
- if Python is fast, then it is not Python (but some underlying C)
Use for
- glue language | plug & play
- deep learning
- data automation
- performance optimization
- large software development (with strong open-closed principle requirements)

2015 Nijmegen: R
I came to R during my master's in applied stochastics.
It is still the standard language at numerous universities for analyses of all kinds.
Magic
- copy-on-modify
- lazy evaluation
- meta programming everywhere (inspect not only values but also syntax)
Main feeling
- easy safe scripting
- awesome plots, reports (RMarkdown) and dashboards (RShiny)
Use for
- data analytics
- reports
- dashboards
- ad-hoc analysis
- performance optimization (R is especially memory intensive)
- software development

2018 München: Scala
During my second year in industry, Apache Spark crossed my path for doing Big Data in R.
While you can use Spark as an R library, many features are only available in Scala. Hence, I came across Scala.
Magic
- by-type arguments
- functional & object oriented (with multiple inheritance)
- clean interface definition (traits) & extension methods
Main feeling
- empowerment – super powerful language with perfect IDE support
- few mandatory type annotations
- generic code only via traits
- can be quite overwhelming
(especially with changes between Scala2 & Scala3)
Use for
- professional software development
- Big Data (Spark)
- Real-time (Kafka, Flink)
- Good native performance
- Data Science
- Machine Learning

2021 München: Julia
Already at university, a good friend told me about Julia. Since then it was always the new hot thing, the next generation.
In 2018 Julia v1.0 stable appeared and I took it as an occasion to learn the language. In 2021 I founded my own Julia consultancy.
Magic
- generic functions
- functions as generic interfaces (Multimethod; see the sketch below)
- works for custom objects too (fast like C structs)
- just-in-time compilation with extremely powerful meta programming capabilities
Main feeling
- empowerment – easy as Python, faster than Java/Scala, more flexible than both
- feeling at home – everything is Julia from performance-critical to high-level
- generic programming with builtin data science support
Use for
- applied mathematics & data science
- scientific computing (e.g. simulation, optimization, learning)
- software development
- performance optimization
- traditional object orientation
- non-data ecosystem is still small
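
To make the multimethod and custom-struct magic concrete, here is a minimal Julia sketch (the `Point` type and `distance` function are made up for illustration):

```julia
# A custom struct is as compact and fast as a C struct.
struct Point
    x::Float64
    y::Float64
end

# `distance` is one generic function; the method is chosen by the
# types of *all* arguments (multiple dispatch / multimethods).
distance(a::Point, b::Point) = sqrt((a.x - b.x)^2 + (a.y - b.y)^2)
distance(a::Number, b::Number) = abs(a - b)

distance(Point(0, 0), Point(3, 4))  # 5.0, just-in-time compiled to specialized native code
distance(1, 4.5)                    # 3.5, another method of the same generic function
```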
Comparison

Data Science

generic programming

| | Python | R | Scala | Julia |
|---|---|---|---|---|
| polymorphism | inheritance (defined as part of object definition) | not recommended (many different approaches exist in R) | by-type arguments (defined independent of object definition) | built into Multimethod (no need for extra definition) |
| packaging | multiple common solutions: virtualenv, conda, scripts | commonly no packages, just scripts with function definitions | standard package system `sbt` with virtualenv support | builtin package system with virtualenv support |

multi-dimensional array

| | Python | R | Scala | Julia |
|---|---|---|---|---|
| convenient syntax | numpy | builtin (extra: nameable axes) | there is no multidimensional array | builtin |
| broadcasting | mainly for numpy values/functions | builtin elementwise operations (but no broadcasting) | | works for any function by prepending `.` |
| missing | missing values represented as Float.nan | special treatment of missing | represented as `null`, impacting performance because of boxing | special treatment of missing with extra care for performance |
| performance | bad performance on custom Python objects & functions | performance better than native Python | native performance is much faster than Python/R, but slower than C or Julia | optimal performance, also for custom Julia types |

dataframe

ecosystem
• plotting (excellent GPU-acceleration)
• statistics & analytics
• scientific computing
• machine learning & deep learning (excellent in terms of flexibility)
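
As a quick illustration of the broadcasting and missing rows on the Julia side (a minimal sketch, no packages needed):

```julia
f(x) = 3x^2 + 1

# broadcasting: any function works elementwise by prepending a dot
f.([1.0, 2.0, 3.0])          # [4.0, 13.0, 28.0]

# missing values are a dedicated type instead of an abused NaN
v = [1.0, missing, 3.0]
sum(skipmissing(v))          # 4.0
```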
Distributed Batch Computation
Distributed Table/SQL: 
Spark is written in Scala and often used from Python. Two styles:
- Python/R: Plug & play Big Data Pipelines. Without much performance focus.
- Scala: Performance oriented, custom Big Data Pipelines.
Difficulties with Spark:
- 5 SQL-like APIs (functional & sql-columns, typed & untyped, RDD)
- machine learning support, but not suited for building custom ML algorithms
(underlying map-reduce computing model is too inflexible)
| | Scala | Python | R |
|---|---|---|---|
| change elements (User Defined Functions) | | | |
| aggregate rows (User Defined Aggregate Function) | | 🔲 | 🔲 |
| aggregate rows in group | | | |
| create new rows | | | |
| no serialization | | 🔲 | 🔲 |
| optimal performance | | 🔲 | 🔲 |
| automated typing | | 🔲 | 🔲 |
Distributed Machine/Deep Learning:
Python Ray
- relatively young (v1.0 Sept 2020)
- actor model (distributed Classes) is more intuitive for building new algorithms
- GPU support
- written in Python, C++ & Java
- performance optimization not really possible due to many layers
Distributed Array/Table:
- Python Dask: Dask Array / Dask DataFrame, can be run on top of Python Ray (v2022.04.2)
- Julia DistributedArrays.jl (v0.6)
- Julia Dagger.jl DTable (v0.14)
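
On the Julia side, a minimal sketch of how such a distributed array is used, assuming a local multi-process setup with DistributedArrays.jl:

```julia
using Distributed
addprocs(4)                        # start 4 local worker processes
@everywhere using DistributedArrays

A = drand(10_000, 1_000)           # random array, distributed across the workers
total = sum(A)                     # reductions run in parallel on the workers
B = map(sqrt, A)                   # elementwise map keeps the result distributed
```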
Flexible GPU Acceleration:
CUDA.jl
- most array algorithms work out of the box on CUDA.jl arrays
- easily write custom CUDA kernels
- widely supported (v3.9)
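
A minimal sketch of both styles, assuming a CUDA-capable GPU with CUDA.jl installed:

```julia
using CUDA

# array style: ordinary broadcasting runs on the GPU
x = CUDA.fill(1.0f0, 1_024)
y = CUDA.fill(2.0f0, 1_024)
y .= 3.0f0 .* x .+ y               # fused into a single GPU kernel automatically

# kernel style: a hand-written SAXPY kernel
function saxpy!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

@cuda threads=256 blocks=cld(length(y), 256) saxpy!(y, 3.0f0, x)
```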
High Performance Computing:
- builtin support for distributed computation
- 100% Julia, hence highly optimized performance
- very good MPI support: MPI.jl (v0.19)
- strong alternative to Fortran HPC
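
A minimal sketch of the builtin distributed computation (just the Distributed standard library; the `estimate_pi` helper is made up for illustration, MPI.jl would look different):

```julia
using Distributed
addprocs(8)                        # local workers; on a cluster a ClusterManager is used instead

@everywhere estimate_pi(n) = 4 * count(_ -> rand()^2 + rand()^2 <= 1, 1:n) / n

# run the estimate on every worker and average the results
estimates = pmap(_ -> estimate_pi(10^7), 1:nworkers())
sum(estimates) / length(estimates)
```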
Distributed Real-time Computation
Go-to platform: Apache Kafka (written in Java/Scala). Distributed. Resilient.
Java/Scala support is best
- Apache Flink
- Kafka Streams
- Apache Spark
- Apache Storm
- Apache Samza
"Structured Streaming" for R/Python
- table-functions supported like for standard Spark
- implicit micro-batching
- no stateful functions
Additionally, Python support is good
Apache Flink: PyFlink
- stateful functions are supported (since July 2021)
- full support for table functions UDF, UDTF, UDAGG, UDTAGG
Python Faust ("Kafka Streams for Python")
- stateful functions supported
- original company dropped support silently
- now community supported (but not everyone knows yet)
R/Python are in general suboptimal for real-time
- only reasonably fast with large micro-batches
- in step-by-step pipelines (e.g. with stateful functions), native R/Python performance counts (and it is very slow)
- iterators are slow in R/Python
Real-time performance: Julia is very well suited for real-time streaming.
- extremely fast iterators
- iterators are easy and cheap to fork/combine
- optimal performance also for step-by-step code
- highly performant streaming libraries: Transducers.jl (v0.4), OnlineStats.jl (v1.5) (see the sketch at the end of this section)
However, as of now
- still lacks support for distributed streaming (work in progress – please reach out if you are interested)
Current best choice for distributed streaming: Scala.
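
To illustrate the streaming style these libraries enable, a minimal single-process sketch (the distributed part is exactly what is still missing):

```julia
using Transducers, OnlineStats

events = randn(1_000_000)          # stand-in for an incoming event stream

# Transducers.jl: composable stream transformations, folded without temporary arrays
total = events |> Map(abs) |> Filter(x -> x > 0.1) |> Map(x -> x^2) |> sum

# OnlineStats.jl: statistics updated one observation at a time, ideal for unbounded streams
m = Mean()
for x in events
    fit!(m, x)
end
value(m)
```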
Summary
- my personal time travel
  - 2013 Python
  - 2015 R
  - 2018 Scala
  - 2021 Julia
- comparison
  - Data Science
  - Distributed Batch Computation
  - Distributed Real-time Computation
Thank you for your attention.
I am happy to answer all your questions.