Julia, Python, R & Scala
- a comparison from experience -

Overview

Stephan Sahm

  • founder of Jolin.io
  • organizer of Julia User Group Munich
  • full stack data consultant
  • 10 years experience in data science
  • 5 years in consultancy
  • professional programmer
  • hobby: philosophy of mind

Outline

  • my personal time travel
    • 2013 Python
    • 2015 R
    • 2018 Scala
    • 2021 Julia
  • comparison
    • Data Science
    • Distributed Batch Computation
    • Distributed Real-time Computation

2013 Osnabrück: Python

I was in academia, in Cognitive Science, where Matlab was widespread.
What brought me to Python was its reputation of being almost like Matlab, but more generic and open source.

Magic

  • completely dynamic execution
  • object oriented with MetaClasses and multiple inheritance
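
The two points above can be sketched in a few lines. This is only an illustration, not code from the talk: the names `double`, `Point`, `Registry`, `Walker`, `Swimmer` and `Duck` are made up for this example.

```python
# Dynamic execution: code can be built as a string and run at runtime.
namespace = {}
exec("def double(x):\n    return 2 * x", namespace)
double = namespace["double"]


# Classes are ordinary runtime objects: a method can be attached after definition.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


Point.norm1 = lambda self: abs(self.x) + abs(self.y)


# A metaclass customizes class creation; this one records every class it creates.
class Registry(type):
    classes = []

    def __init__(cls, name, bases, attrs):
        super().__init__(name, bases, attrs)
        Registry.classes.append(name)


class Base(metaclass=Registry):
    pass


class Walker(Base):
    def move(self):
        return "walk"


class Swimmer(Base):
    def move(self):
        return "swim"


# Multiple inheritance: the method resolution order finds Walker.move first.
class Duck(Walker, Swimmer):
    pass
```

Here `Duck().move()` resolves to `Walker.move` because `Walker` comes first in the list of base classes, and `Registry.classes` ends up holding the names of all four classes in definition order.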

Main feeling

  • easy interactive procedural
  • standard object oriented for moving to production
  • if Python is fast, then it is not Python (but some underlying C)

Use for

  • glue language | plug & play
  • deep learning
  • data automation
  • large software development
    (with strong open–closed principle requirements)

Less suited for

  • performance optimization

2015 Nijmegen: R

I came to R during my master's in applied stochastics.
It is still the standard language at numerous universities for analyses of all kinds.

Magic

  • copy-on-modify
  • lazy execution
  • meta programming everywhere
    (inspect not only values but also syntax)
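
For readers who know Python better than R, the three ideas above can be approximated in Python. This is an emulation under stated assumptions, not R code and not how R implements them: R copies lazily on the first modification while this sketch copies eagerly, and R arguments are lazy by default while this sketch needs explicit zero-argument callables. The helper names `with_extra_element` and `choose` are hypothetical.

```python
import ast
import copy


# 1) Copy-on-modify, emulated: the function works on its own copy,
#    so the caller's object is never changed.
def with_extra_element(values):
    values = copy.copy(values)  # eager copy; R would copy only when modifying
    values.append(99)
    return values


original = [1, 2, 3]
modified = with_extra_element(original)


# 2) Lazy arguments, emulated with zero-argument callables (thunks):
#    an argument is evaluated only if it is actually used.
def choose(condition, when_true, when_false):
    return when_true() if condition else when_false()


chosen = choose(True, lambda: "cheap", lambda: 1 / 0)  # the division never runs


# 3) Inspecting syntax instead of values: parse an expression and look at its tree.
tree = ast.parse("a + b * 2", mode="eval")
operator_name = type(tree.body.op).__name__
variables = sorted(node.id for node in ast.walk(tree) if isinstance(node, ast.Name))
```

After running this, `original` is unchanged, `chosen` holds the result of the first thunk only, and `variables` lists the names that appear in the parsed expression.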

Main feeling

  • easy safe scripting
  • awesome plots, reports (RMarkdown) and dashboards (RShiny)

Use for

  • data analytics
  • reports
  • dashboards
  • ad-hoc analysis
  • software development

Less suited for

  • performance optimization (R is especially memory intensive)

2018 München: Scala

During my second year in industry I came across Apache Spark for doing Big Data in R.
While you can use Spark as an R library, many features are only available in Scala. That is how I came to Scala.

Magic

  • by-type arguments
  • functional & object oriented (with multiple inheritance)
  • clean interface definition (traits) & extension methods

Main feeling

  • empowerment – super powerful language with perfect IDE support
  • few mandatory type annotations
  • generic code only via traits
  • can be quite overwhelming
    (especially with changes between Scala 2 & Scala 3)

Use for

  • professional software development
  • Big Data (Spark)
  • Real-time (Kafka, Flink)
  • Good native performance
  • Data Science
  • Machine Learning

2021 München: Julia

Back at university a good friend told me about Julia. Ever since, it was the new hot stuff, the next generation.
In 2018 the stable Julia v1.0 was released and I took the occasion to learn the language. In 2021 I founded my own Julia consultancy.

Magic

  • generic functions
  • functions as generic interfaces (Multimethod)
  • works for custom objects too (fast like C structs)
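
The idea of a function acting as a generic interface can be hinted at in Python with `functools.singledispatch`. This is only a rough analogue: `singledispatch` selects an implementation by the type of the first argument, whereas Julia's multimethods dispatch on the types of all arguments, and the performance claim for custom objects does not carry over to Python. The names `describe` and `Celsius` are made up for this sketch.

```python
from dataclasses import dataclass
from functools import singledispatch


# The generic function: this fallback is used when no specific method matches.
@singledispatch
def describe(x):
    return f"something: {x!r}"


# Methods are added from outside, independent of any class definition.
@describe.register
def _(x: int):
    return f"integer {x}"


@describe.register
def _(x: list):
    return "list of " + ", ".join(describe(item) for item in x)


# A custom type takes part in the same interface without inheriting from anything.
@dataclass
class Celsius:
    degrees: float


@describe.register
def _(x: Celsius):
    return f"{x.degrees} degrees Celsius"
```

Calling `describe([1, Celsius(21.5)])` goes through the list method, which in turn dispatches to the integer method and the `Celsius` method for its elements.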

Main feeling

  • empowerment – easy as Python, faster than Java/Scala, more flexible than both
  • feeling at home – everything is Julia from performance-critical to high-level
  • generic programming with builtin data science support

Use for

  • applied mathematics & data science
  • scientific computing (e.g. simulation, optimization, learning)
  • software development
  • performance optimization

Less suited for

  • traditional object orientation
  • non-data tasks (the non-data ecosystem is still small)

Comparison

generic programming

  • polymorphism
    • Python: inheritance (defined as part of the object definition)
    • R: not recommended (many different approaches exist in R)
    • Scala: by-type arguments (defined independently of the object definition)
    • Julia: built into multimethods (no need for an extra definition)
  • packaging
    • Python: multiple common solutions: virtualenv, conda, scripts
    • R: commonly no packages, just scripts with function definitions
    • Scala: standard package system `sbt` with virtualenv support
    • Julia: builtin package system with virtualenv support

multi dimensional array

  • convenient syntax
    • Python: numpy
    • R: builtin (extra: nameable axes)
    • Scala: there is no multidimensional array
    • Julia: builtin
  • broadcasting
    • Python: mainly for numpy values/functions
    • R: builtin elementwise operations (but no broadcasting)
    • Julia: works for any function by prepending `.`
  • missing
    • Python: missing values represented as NaN (`float("nan")`, `numpy.nan`)
    • R: special treatment of missing
    • Scala: represented as `null`, impacting performance because of boxing
    • Julia: special treatment of missing with extra care for performance
  • performance
    • Python: bad performance on custom Python objects & functions
    • R: performance better than native Python
    • Scala: native performance is much faster than Python/R, but slower than C or Julia
    • Julia: optimal performance, also for custom Julia types

dataframe

ecosystem

  • plotting
    • excellent GPU-acceleration
  • statistics & analytics
  • scientific computing
  • machine learning & deep learning
    • excellent in terms of flexibility
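
The Python entry on missing values deserves a small illustration. The sketch below uses only the standard library and made-up temperature readings; it shows why NaN as a stand-in for "missing" needs explicit care, since NaN never compares equal to anything and silently propagates through arithmetic.

```python
import math

# Hypothetical readings, with NaN standing in for a missing value.
readings = [21.5, float("nan"), 19.0]

# NaN is not equal to itself, so an equality check finds nothing.
naive_missing = [r for r in readings if r == float("nan")]

# The explicit check is required to detect the missing entry.
missing = [r for r in readings if math.isnan(r)]

# Arithmetic does not fail on NaN, it just propagates it.
mean_all = sum(readings) / len(readings)

# Missing values have to be filtered out by hand before aggregating.
present = [r for r in readings if not math.isnan(r)]
mean_present = sum(present) / len(present)
```

Here `naive_missing` stays empty, `mean_all` is itself NaN, and only `mean_present` gives a usable average.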

Distributed Batch Computation

Distributed Table/SQL:

Spark is written in Scala and often used from Python. Two styles:

  • Python/R: Plug & play Big Data Pipelines. Without much performance focus.
  • Scala: Performance oriented, custom Big Data Pipelines.

Difficulties with Spark:

  • 5 SQL-like APIs (functional & sql-columns, typed & untyped, RDD)
  • machine learning support, but not suited to building custom ML algorithms
    (underlying map-reduce computing model is too inflexible)

                                                   Scala   Python   R
change elements (User Defined Functions)           yes     yes      yes
aggregate rows (User Defined Aggregate Function)   yes     no       no
aggregate rows in group                            yes     yes      yes
create new rows                                    yes     yes      yes
no serialization                                   yes     no       no
optimal performance                                yes     no       no
automated typing                                   yes     no       no
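
The operation types listed above (change elements, aggregate rows, aggregate rows in group, create new rows) differ in their input and output shape. The sketch below shows these shapes in plain Python on a tiny in-memory table. It deliberately does not use the Spark API, and the names `udf_double`, `udaf_total` and `explode` as well as the sample rows are hypothetical.

```python
from collections import defaultdict

rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "a", "clicks": 4},
]


# Change elements: one value in, one value out.
def udf_double(clicks):
    return clicks * 2


changed = [{**row, "clicks": udf_double(row["clicks"])} for row in rows]


# Aggregate rows: many values in, one value out.
def udaf_total(values):
    return sum(values)


total = udaf_total(row["clicks"] for row in rows)

# Aggregate rows in group: the same aggregate, applied per key.
groups = defaultdict(list)
for row in rows:
    groups[row["user"]].append(row["clicks"])
per_user = {user: udaf_total(values) for user, values in groups.items()}


# Create new rows: one row in, several rows out.
def explode(row):
    return [{"user": row["user"], "click": i} for i in range(row["clicks"])]


exploded = [new_row for row in rows for new_row in explode(row)]
```

In a distributed engine the same four shapes apply, but each function additionally has to be shipped to the workers, which is where serialization and typing start to matter.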

Distributed Machine/Deep Learning: Python Ray

  • relatively young (v1.0 Sept 2020)
  • actor model (distributed Classes) is more intuitive for building new algorithms
  • GPU support
  • written in Python, C++ & Java
  • performance optimization not really possible due to many layers
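
To make the actor idea concrete: an actor is an object that owns its state and handles one message at a time, while callers get back a handle for the result. The sketch below emulates this inside a single process with a thread and queues. It is not the Ray API and nothing here is distributed; the class `CounterActor` and its methods are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Owns a private counter and processes messages strictly one after another."""

    def __init__(self):
        self._count = 0
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            amount, reply = self._mailbox.get()
            if amount is None:  # stop signal
                break
            self._count += amount
            reply.put(self._count)

    def add(self, amount):
        """Send a message; the returned queue acts as a simple future for the result."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((amount, reply))
        return reply

    def stop(self):
        self._mailbox.put((None, None))
        self._thread.join()


actor = CounterActor()
futures = [actor.add(n) for n in (1, 2, 3)]  # calls return immediately
results = [future.get() for future in futures]  # block until each result is ready
actor.stop()
```

Because only the actor's own thread touches `_count`, no locks are needed, which is a large part of why the model feels natural for stateful algorithms.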

Distributed Array/Table: Python Dask / Julia

  • Dask Array / Dask DataFrame, can be run on top of Python Ray (v2022.04.2)
  • Julia DistributedArrays.jl (v0.6)
  • Julia Dagger.jl DTable (v0.14)

Flexible GPU Acceleration: Julia CUDA.jl

  • most array algorithms work out of the box on CUDA.jl arrays
  • easily write custom CUDA kernels
  • widely supported (v3.9)

High Performance Computing: Julia

  • builtin support for distributed computation
  • 100% Julia, hence highly optimized performance
  • very good MPI support MPI.jl (v0.19)
  • strong alternative to Fortran HPC

Distributed Real-time Computation

Go-to platform: Apache Kafka (written in Java/Scala). Distributed. Resilient.

Java/Scala support is best

  • Apache Flink
  • Kafka Streams
  • Apache Spark
  • Apache Storm
  • Apache Samza

Spark "Structured Streaming" for R/Python

  • table functions are supported as in standard Spark
  • implicit micro-batching
  • no stateful functions

Additional Python support is good

Apache Flink PyFlink
  • stateful functions are supported (since July 2021)
  • full support for table functions UDF, UDTF, UDAGG, UDTAGG
Python Faust ("Kafka Streams for Python")
  • stateful functions supported
  • original company dropped support silently
  • now community supported (but not everyone knows yet)

R/Python are in general suboptimal for real-time

  • reasonably fast only with large micro-batches
  • in step-by-step pipelines (e.g. with stateful functions) native R/Python performance counts (very slow)
  • iterators are slow in R/Python
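
The two processing styles mentioned above can be sketched with plain Python iterators. Micro-batching groups events so that the per-call overhead is paid once per batch, while a stateful step-by-step function touches every single event in interpreted code. The function names `micro_batches` and `running_mean` and the sample events are made up for this illustration, and the sketch makes no claim about actual timings.

```python
from itertools import islice


def micro_batches(stream, size):
    """Group an event stream into lists of at most `size` events."""
    iterator = iter(stream)
    while batch := list(islice(iterator, size)):
        yield batch


def running_mean(stream):
    """Stateful step-by-step function: emits the mean of all events seen so far."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


events = [4.0, 8.0, 6.0, 2.0, 10.0]

# Micro-batching: one aggregate call per batch.
batch_sums = [sum(batch) for batch in micro_batches(events, size=2)]

# Step-by-step: one state update per event.
means = list(running_mean(events))
```

With batches, the inner work (`sum` here) can be handed to optimized native code; the running mean cannot, because its state changes with every event.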

Julia real-time performance

Julia is very well suited for real-time streaming.
  • extremely fast iterators
  • iterators are easy and cheap to fork/combine
  • optimal performance also for step-by-step code
  • high-performance streaming libraries: Transducers.jl (v0.4), OnlineStats.jl (v1.5)
However, as of now
  • still lacks support for distributed streaming (work in progress – please reach out if you are interested)

Current best choice for distributed streaming: Scala.

Summary

  • my personal time travel
    • 2013 Python
    • 2015 R
    • 2018 Scala
    • 2021 Julia
  • comparison
    • Data Science
    • Distributed Batch Computation
    • Distributed Real-time Computation

Thank you for your attention.

I am happy to answer all your questions.

Contact: jolin.io