Gregor Godbersen : The effectiveness of Polars DataFrames
Published

The effectiveness of Polars DataFrames

In this article, we illustrate the effectiveness and flexibility of the Polars DataFrame library. This article is inspired by the discussion raised in the blog post Why pandas feels clunky when coming from R by Rasmus Bååth. We extend their case study and compare possible implementations in Data Frame libraries for Python, R, and Julia in terms of flexibility and performance. We show that Polars features an expressive interface and best-in-class performance.

Case study

A list of purchases with a country of purchase, a purchase amount, and a discount is given.

  country amount discount
1 USA       2000       10
2 USA       3500       15
3 USA       3000       20
4 Canada     120       12
5 Canada     180       18
6 Canada    3100       21

The case study then describes several sequential incoming data analysis tasks and compares how the analyst needs to change their code to answer them:

  • “How much do we sell..? Let’s take the total sum!”
  • “Ah, they wanted it by country…”
  • “And I guess I should deduct the discount.”
  • “Oh, and Maria asked me to remove any outliers. Let’s remove everything 10x larger than the median.”
  • “I probably should use the median within each country. Prices are quite different across the globe…”

In the original blog post Rasmus Bååth shows that the implementation in R allows not only for a very expressive formulation of the queries but is also very easy to adapt to changing requirements. The presented equivalent Pandas formulation is much more verbose and difficult to adapt.

In the following, we show how the same case study can be solved with the Polars library in Python.

”How much do we sell..? Let’s take the total sum!”

We read the CSV file and take the sum of a single column.

df = pl.read_csv("/tmp/data.csv")
result = ( 
    df["amount"].sum()
)
print(result)

”Ah, they wanted it by country…”

We group by country, compute a total column, and sort the groups by country to match the output of the R/tidyverse reference implementation, which does so by default. While the chained functions are a little more verbose than R’s pipe syntax, the control flow is very obvious and readable.

df = pl.read_csv("/tmp/data.csv")
result = (
            df
              .group_by("country") # added
              .agg(total = pl.col("amount").sum()) #added
              .sort("country") # added
         )

print(result)

”And I guess I should deduct the discount.”

We change the aggregation function to subtract the discount.

df = pl.read_csv("/tmp/data.csv")
result = (
            df
              .group_by("country")
              .agg(total = (
                          pl.col("amount") - pl.col("discount")) #changed
                            .sum()
              )
              .sort("country")
         )

print(result)

”Oh, and Maria asked me to remove any outliers. Let’s remove everything 10x larger than the median.”

We add a simple filter function at the start of the chain

df = pl.read_csv("/tmp/data.csv")
result = (
            df
              .filter(pl.col("amount") <= pl.col("amount").median() * 10) # added
              .group_by("country")
              .agg(total = (
                          pl.col("amount") - pl.col("discount"))
                            .sum()
               ) 
              .sort("country")
         )

print(result)

”I probably should use the median within each country. Prices are quite different across the globe…”

We move the filter function into the aggregation function

df = pl.read_csv("/tmp/data.csv")
result = (
          df
            .group_by("country")
            .agg( total = (pl.col("amount") - pl.col("discount"))
                      .filter(pl.col("amount") <= pl.col("amount").median() * 10) # moved
                      .sum()
            )
            .sort("country")
          )

print(result)

Benchmark

For a quick benchmark study of the final query, we create a large purchase.csv file using random sampling. We create 50 million rows, where each row gets assigned a random country, an amount value picked from a normal distribution around the country’s median from the original file, and a discount normally distributed around N(10,2). With a probability of 0.5%, a row is designated an outlier. In this case, we multiply its amount by 10. The Rust code to generate this file is provided in the appendix.

We run the benchmarks using hyperfine which applies benchmarking best practices such as warmup loops and averaging over multiple runs.

Implementations

We use the Pandas and R/tidyverse implementations from the original blog post and the Polars implementation shown above. For comparison, we also add the following Julia code based on the DataFrames.jl library.

# julia 
df = CSV.read("/tmp/data.csv", DataFrame);
result = sort(
 combine(
  groupby(df, :country),
   [:amount, :discount] => ((a,d) -> sum( (a .- d)[ a .<= median(a) * 10])) => :total
 )
, :country);

println(result);

Note that there are likely more expressive and or faster implementations possible in Julia (e.g. the Query.jl library). We applied a basic set of performance improvements for Julia, such as wrapping the code in a function, as Julia does not optimize untyped global variables, and setting a NUM_THREADS variable to nprocs. As Julia applies dynamic compilation of code, we further add a “Julia-Repl/DataFrames” variant to the benchmark that runs a benchmark using BenchmarkTools.jl within an active Julia REPL to see the difference to a cold-start “Julia/DataFrames”.

For the Polars library, we further add a variant that uses lazy computation and explicit column type information as “Python/polars-opt”

df = pl.scan_csv("/tmp/data.csv", dtypes= { # Changed
  "country": pl.Categorical, # Added
  "amount": pl.Int32, # Added
  "discount": pl.Int32  #Added
  })

result = (
          df
            .group_by("country")
            .agg( total = (pl.col("amount") - pl.col("discount"))
                      .filter(pl.col("amount") <= pl.col("amount").median() * 10)
                      .sum()
            )
            .sort("country")
          ).collect() # Added
          
println(result);

Results

The results show Polars’ significant performance advantage over the other implementations. We can find results in 2.6 seconds, less than one-tenth of the original Pandas implementation, with minimal optimizations. The lack of multithreading can explain a large part of the underperformance of Pandas. Julia has a huge startup cost, as evident by the difference between the “Julia-Repl/DataFrames” and “Julia/DataFrames” results. Enabling lazy computation further improves the processing time of Polars.

x-axis: Wall Time [s] Python/pandas 28.55 ±0.495 Julia/DataFrames 15.888 ±0.126 R/tidyverse 13.723 ±0.269 Julia-Repl/DataFrames 5.35 ±0.158 Python/polars 3.752 ±0.036 Python/polars-opt 2.655 ±0.094
Wall-time performance comparison for the different implementations.

Polars shares the expressiveness and flexibility that is possible with R, while achieving significantly higher performance. This makes it an excellent choice for a DataFrame library.

Appendix

Rust script to create purchase file.

use rand::{rngs::StdRng, Rng, SeedableRng};
use rand_distr::{Normal, Distribution};

fn main() {
    let rows = 50_000_000;
    let countries = ["USA", "Spain", "Japan", "India",
    "Brazil", "Italy", "Germany",
    "France", "UK", "Australia", "Canada"];
    let medians = [
        250.0, 3000.0, 200.0,
        240.0, 230.0, 180.0,
        160.0, 230.0, 210.0,
        220.0, 170.0,
    ];
    let discount = Normal::new(10.0,2.0).unwrap();

    let rng = &mut StdRng::seed_from_u64(1);
    let devs : Vec<Normal<f64>> = medians.into_iter().map( |m|
        Normal::new(m, 50.0).unwrap()
    ).collect();
    let prob_outlier = 0.005;

    println!("country,amount,discount");

    for _ in 0..rows {

        let country_idx = rng.gen_range(0..countries.len());
        let is_outlier = rng.gen::<f64>() <= prob_outlier;
        let amount_pre_outlier = devs[country_idx].sample(rng);
        let discount = discount.sample(rng);

        println!("{}", [
            countries[country_idx],
            &format!("{:.0}",
                 if is_outlier {
                    amount_pre_outlier * 10.0
                } else { amount_pre_outlier }
            ),
            &format!("{:.0}", discount),
        ].join(","));
    }

}