The effectiveness of Polars DataFrames
In this article, we illustrate the effectiveness and flexibility of the Polars DataFrame library. The article is inspired by the discussion raised in the blog post Why pandas feels clunky when coming from R by Rasmus Bååth. We extend its case study and compare possible implementations in DataFrame libraries for Python, R, and Julia in terms of flexibility and performance. We show that Polars features an expressive interface and best-in-class performance.
Case study
A list of purchases with a country of purchase, a purchase amount, and a discount is given.
  country amount discount
1     USA   2000       10
2     USA   3500       15
3     USA   3000       20
4  Canada    120       12
5  Canada    180       18
6  Canada   3100       21
The case study then describes several sequential incoming data analysis tasks and compares how the analyst needs to change their code to answer them:
- “How much do we sell..? Let’s take the total sum!”
- “Ah, they wanted it by country…”
- “And I guess I should deduct the discount.”
- “Oh, and Maria asked me to remove any outliers. Let’s remove everything 10x larger than the median.”
- “I probably should use the median within each country. Prices are quite different across the globe…”
In the original blog post, Rasmus Bååth shows that the R implementation not only allows for a very expressive formulation of the queries but is also very easy to adapt to changing requirements. The equivalent Pandas formulation presented there is much more verbose and harder to adapt.
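For reference, the final query (per-country totals with per-country outlier removal) can be written in pandas roughly as follows. This is a sketch in the spirit of the blog post's solution, not a verbatim copy of it; note how the per-group median has to be computed separately and aligned back to the rows before filtering.

import pandas as pd

df = pd.read_csv("/tmp/data.csv")
# Per-row copy of each country's median, so rows can be compared against it.
median_by_country = df.groupby("country")["amount"].transform("median")
df = df[df["amount"] <= median_by_country * 10]
result = (
    (df["amount"] - df["discount"])
    .groupby(df["country"])
    .sum()
    .reset_index(name="total")
)
print(result)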
In the following, we show how the same case study can be solved with the Polars library in Python.
“How much do we sell..? Let’s take the total sum!”
We read the CSV file and take the sum of a single column.
import polars as pl

df = pl.read_csv("/tmp/data.csv")
result = (
    df["amount"].sum()
)
print(result)  # 11900 for the sample data
“Ah, they wanted it by country…”
We group by country, compute a total column, and sort the groups by country to match the output of the R/tidyverse reference implementation, which does so by default. While the chained functions are a little more verbose than R’s pipe syntax, the control flow is very obvious and readable.
df = pl.read_csv("/tmp/data.csv")
result = (
    df
    .group_by("country")                # added
    .agg(total=pl.col("amount").sum())  # added
    .sort("country")                    # added
)
print(result)
“And I guess I should deduct the discount.”
We change the aggregation function to subtract the discount.
df = pl.read_csv("/tmp/data.csv")
result = (
    df
    .group_by("country")
    .agg(
        total=(pl.col("amount") - pl.col("discount")).sum()  # changed
    )
    .sort("country")
)
print(result)
“Oh, and Maria asked me to remove any outliers. Let’s remove everything 10x larger than the median.”
We add a simple filter function at the start of the chain.
df = pl.read_csv("/tmp/data.csv")
result = (
    df
    .filter(pl.col("amount") <= pl.col("amount").median() * 10)  # added
    .group_by("country")
    .agg(
        total=(pl.col("amount") - pl.col("discount")).sum()
    )
    .sort("country")
)
print(result)
“I probably should use the median within each country. Prices are quite different across the globe…”
We move the filter function into the aggregation, where expressions are evaluated per group. For the sample data this now removes Canada’s 3100 purchase (Canada’s median is 180, so the threshold is 1800), which the global median had left in.
df = pl.read_csv("/tmp/data.csv")
result = (
    df
    .group_by("country")
    .agg(
        total=(pl.col("amount") - pl.col("discount"))
        .filter(pl.col("amount") <= pl.col("amount").median() * 10)  # moved
        .sum()
    )
    .sort("country")
)
print(result)
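As an aside, the same per-country outlier rule can also be expressed with a window expression: median().over("country") evaluates the median within each country, which lets the filter stay at the start of the chain. This is an equivalent alternative formulation, not the variant used in the benchmark below.

df = pl.read_csv("/tmp/data.csv")
result = (
    df
    # The window expression computes each row's country-level median,
    # so the threshold is per country even before grouping.
    .filter(pl.col("amount") <= pl.col("amount").median().over("country") * 10)
    .group_by("country")
    .agg(total=(pl.col("amount") - pl.col("discount")).sum())
    .sort("country")
)
print(result)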
Benchmark
For a quick benchmark of the final query, we create a large purchase CSV file using random sampling. We create 50 million rows; each row is assigned a random country, an amount drawn from a normal distribution around a per-country median, and a discount drawn from N(10, 2). With a probability of 0.5%, a row is designated an outlier, in which case its amount is multiplied by 10. The Rust code used to generate this file is provided in the appendix.
We run the benchmarks using hyperfine, which applies benchmarking best practices such as warmup runs and averaging over multiple runs.
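For illustration, a typical invocation might look as follows; the script file names are placeholders, not files from the original post.

hyperfine --warmup 3 \
    'python polars_query.py' \
    'python pandas_query.py' \
    'Rscript tidyverse_query.R' \
    'julia dataframes_query.jl'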
Implementations
We use the Pandas and R/tidyverse implementations from the original blog post and the Polars implementation shown above. For comparison, we also add the following Julia code based on the DataFrames.jl library.
# julia
using CSV, DataFrames, Statistics

df = CSV.read("/tmp/data.csv", DataFrame);
result = sort(
    combine(
        groupby(df, :country),
        [:amount, :discount] => ((a, d) -> sum((a .- d)[a .<= median(a) * 10])) => :total
    ),
    :country
);
println(result);
Note that more expressive and/or faster implementations are likely possible in Julia (e.g. with the Query.jl library). We applied a basic set of performance improvements for Julia, such as wrapping the code in a function, since Julia does not optimize untyped global variables, and setting the JULIA_NUM_THREADS environment variable to the number of processors. As Julia compiles code dynamically, we further add a “Julia-Repl/DataFrames” variant that runs the benchmark using BenchmarkTools.jl within an active Julia REPL, to show the difference from a cold-start “Julia/DataFrames” run.
For the Polars library, we further add a variant, “Python/polars-opt”, that uses lazy computation and explicit column type information. Lazy mode allows Polars to optimize the query plan as a whole before executing it.
df = pl.scan_csv("/tmp/data.csv", dtypes={  # changed
    "country": pl.Categorical,              # added
    "amount": pl.Int32,                     # added
    "discount": pl.Int32,                   # added
})
result = (
    df
    .group_by("country")
    .agg(
        total=(pl.col("amount") - pl.col("discount"))
        .filter(pl.col("amount") <= pl.col("amount").median() * 10)
        .sum()
    )
    .sort("country")
).collect()  # added
print(result)
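As a further, version-dependent option, the lazy query can be run on Polars’ streaming engine, which processes the input in batches and lowers peak memory usage. A minimal sketch, assuming a Polars release where collect accepts a streaming flag (newer releases use engine="streaming" instead); operations the streaming engine does not support fall back to the in-memory engine.

# Same query, executed with the streaming engine
# (version-dependent API; see the note above).
result = (
    pl.scan_csv("/tmp/data.csv")
    .group_by("country")
    .agg(
        total=(pl.col("amount") - pl.col("discount"))
        .filter(pl.col("amount") <= pl.col("amount").median() * 10)
        .sum()
    )
    .sort("country")
    .collect(streaming=True)
)
print(result)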
Results
The results show Polars’ significant performance advantage over the other implementations. With minimal optimization, Polars returns the result in 2.6 seconds, less than one-tenth of the time taken by the original Pandas implementation. A large part of Pandas’ underperformance can be explained by its lack of multithreading. Julia has a huge startup cost, as evidenced by the difference between the “Julia-Repl/DataFrames” and “Julia/DataFrames” results. Enabling lazy computation further improves the processing time of Polars.
Polars offers the expressiveness and flexibility possible in R while achieving significantly higher performance. This makes it an excellent choice for a DataFrame library.
Appendix
Rust script to create the purchase file.
use rand::{rngs::StdRng, Rng, SeedableRng};
use rand_distr::{Normal, Distribution};

fn main() {
    let rows = 50_000_000;
    let countries = ["USA", "Spain", "Japan", "India",
                     "Brazil", "Italy", "Germany",
                     "France", "UK", "Australia", "Canada"];
    // Per-country median amounts, in the same order as `countries`.
    let medians = [
        250.0, 3000.0, 200.0,
        240.0, 230.0, 180.0,
        160.0, 230.0, 210.0,
        220.0, 170.0,
    ];
    let discount = Normal::new(10.0, 2.0).unwrap();
    // Fixed seed so the generated file is reproducible.
    let rng = &mut StdRng::seed_from_u64(1);
    // One normal distribution per country, centered on its median.
    let devs: Vec<Normal<f64>> = medians.into_iter().map(|m|
        Normal::new(m, 50.0).unwrap()
    ).collect();
    let prob_outlier = 0.005;

    // CSV header followed by one row per purchase, written to stdout.
    println!("country,amount,discount");
    for _ in 0..rows {
        let country_idx = rng.gen_range(0..countries.len());
        let is_outlier = rng.gen::<f64>() <= prob_outlier;
        let amount_pre_outlier = devs[country_idx].sample(rng);
        let discount = discount.sample(rng);
        println!("{}", [
            countries[country_idx],
            &format!("{:.0}",
                if is_outlier {
                    // Outliers get 10x the sampled amount.
                    amount_pre_outlier * 10.0
                } else { amount_pre_outlier }
            ),
            &format!("{:.0}", discount),
        ].join(","));
    }
}
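Assuming a binary crate with the rand and rand_distr dependencies declared in Cargo.toml, the file can be produced by redirecting the program’s standard output to the path used by the queries above:

cargo run --release > /tmp/data.csv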