🏆 Highlights

- Add Extension types (#25322)

🚀 Performance improvements

- Don't always rechunk on gather of nested types (#26478)
- Enable zero-copy object_store put upload for IPC sink (#26288)
- Resolve file schemas and metadata concurrently (#26325)
- Run elementwise CSE for the streaming engine (#26278)
- Disable morsel splitting for fast-count on streaming engine (#26245)
- Implement streaming decompression for scan_ndjson and scan_lines (#26200)
- Improve string slicing performance (#26206)
- Refactor scan_delta to use python dataset interface (#26190)
- Add dedicated kernel for group-by arg_max/arg_min (#26093)
- Add streaming merge-join (#25964)
- Generalize Bitmap::new_zeroed opt for Buffer::zeroed (#26142)
- Reduce fs stat calls in path expansion (#26173)
- Lower streaming group_by n_unique to unique().len() (#26109)
- Speed up SQL interface "UNION" clauses (#26039)
- Speed up SQL interface "ORDER BY" clauses (#26037)
- Add fast kernel for is_nan and use it for numpy NaN->null conversion (#26034)
- Optimize ArrayFromIter implementations for ObjectArray (#25712)
- New streaming NDJSON sink pipeline (#25948)
- New streaming CSV sink pipeline (#25900)
- Dispatch partitioned usage of sink_* functions to new-streaming by default (#25910)
- Replace ryu with faster zmij (#25885)
- Reduce memory usage for .item() count in grouped first/last (#25787)
- Skip schema inference if schema provided for scan_csv/ndjson (#25757)
- Add width-aware chunking to prevent degradation with wide data (#25764)
- Use new sink pipeline for write/sink_ipc (#25746)
- Reduce memory usage when scanning multiple parquet files in streaming (#25747)
- Don't call cluster_with_columns optimization if not needed (#25724)
- Tune partitioned sink_parquet cloud performance (#25687)
- New single file IO sink pipeline enabled for sink_parquet (#25670)
- New partitioned IO sink pipeline enabled for sink_parquet (#25629)
- Correct overly eager local predicate insertion for unpivot (#25644)
- Reduce HuggingFace API calls (#25521)
- Use strong hash instead of traversal for CSPE equality (#25537)
- Fix panic in is_between support in streaming Parquet predicate push down (#25476)
- Faster kernels for rle_lengths (#25448)
- Allow detecting plan sortedness in more cases (#25408)
- Enable predicate expressions on unsigned integers (#25416)
- Mark output of more non-order-maintaining ops as unordered (#25419)
- Fast find start window in group_by_dynamic with large offset (#25376)
- Add streaming native LazyFrame.group_by_dynamic (#25342)
- Add streaming sorted Group-By (#25013)
- Add parquet prefiltering for string regexes (#25381)
- Use fast path for agg_min/agg_max when nulls present (#25374)
- Fuse positive slice into streaming LazyFrame.rolling (#25338)
- Mark Expr.reshape((-1,)) as row separable (#25326)
- Use bitmap instead of Vec<bool> in first/last with skip_nulls (#25318)
- Return references from aexpr_to_leaf_names_iter (#25319)

✨ Enhancements

- Add primitive filter -> agg lowering in streaming GroupBy (#26459)
- Support for the SQL FETCH clause (#26449)
- Add get() to retrieve a byte from binary data (#26454)
- Remove with_context in SQL lowering (#26416)
- Avoid OOM for scan_ndjson and scan_lines if input is compressed and a negative slice is used (#26396)
- Add JoinBuildSide (#26403)
- Support anonymous agg in-mem (#26376)
- Add unstable arrow_schema parameter to sink_parquet (#26323)
- Improve error message formatting for structs (#26349)
- Remove parquet field overwrites (#26236)
- Enable zero-copy object_store put upload for IPC sink (#26288)
- Improved disambiguation for qualified wildcard columns in SQL projections (#26301)
- Expose upload_concurrency through env var (#26263)
- Allow quantile to compute multiple quantiles at once (#25516)
- Allow empty LazyFrame in LazyFrame.group_by(...).map_groups (#26275)
- Use delta file statistics for batch predicate pushdown (#26242)
- Add streaming UnorderedUnion (#26240)
- Implement compression support for sink_ndjson (#26212)
- Add unstable record batch statistics flags to {sink/scan}_ipc (#26254)
- Cloud retry/backoff configuration via storage_options (#26204)
- Use same sort order for expanded paths across local / cloud / directory / glob (#26191)
- Expose physical plan NodeStyle (#26184)
- Add streaming merge-join (#25964)
- Serialize optimization flags for cloud plan (#26168)
- Add compression support to write_csv and sink_csv (#26111)
- Add scan_lines (#26112)
- Support regex in str.split (#26060)
- Add unstable IPC Statistics read/write to scan_ipc/sink_ipc (#26079)
- Add nulls support for all rolling_by operations (#26081)
- ArrowStreamExportable and sink_delta (#25994)
- Release musl builds (#25894)
- Implement streaming decompression for CSV COUNT(*) fast path (#25988)
- Add nulls support for rolling_mean_by (#25917)
- Add lazy collect_all (#25991)
- Add streaming decompression for NDJSON schema inference (#25992)
- Improved handling of unqualified SQL JOIN columns that are ambiguous (#25761)
- Expose record batch size in {sink,write}_ipc (#25958)
- Add null_on_oob parameter to expr.get (#25957)
- Suggest correct timezone if timezone validation fails (#25937)
- Support streaming IPC scan from S3 object store (#25868)
- Implement streaming CSV schema inference (#25911)
- Support hashing of meta expressions (#25916)
- Improve SQLContext recognition of possible table objects in the Python globals (#25749)
- Add pl.Expr.(min|max)_by (#25905)
- Improve MemSlice Debug impl (#25913)
- Implement or fix json encode/decode for (U)Int128, Categorical, Enum, Decimal (#25896)
- Expand scatter to more dtypes (#25874)
- Implement streaming CSV decompression (#25842)
- Add Series sql method for API consistency (#25792)
- Mark Polars as safe for free-threading (#25677)
- Support Binary and Decimal in arg_(min|max) (#25839)
- Allow Decimal parsing in str.json_decode (#25797)
- Add shift support for Object data type (#25769)
- Add node status to NodeMetrics (#25760)
- Allow scientific notation when parsing Decimals (#25711)
- Allow creation of Object literal (#25690)
- Don't collect schema in SQL union processing (#25675)
- Add bin.slice(), bin.head(), and bin.tail() methods (#25647)
- Add SQL support for the QUALIFY clause (#25652)
- New partitioned IO sink pipeline enabled for sink_parquet (#25629)
- Add SQL syntax support for CROSS JOIN UNNEST(col) (#25623)
- Add separate env var to log tracked metrics (#25586)
- Expose fields for generating physical plan visualization data (#25562)
- Allow pl.Object in pivot value (#25533)
- Extend SQL UNNEST support to handle multiple array expressions (#25418)
- Minor improvement for as_struct repr (#25529)
- Temporal quantile in rolling context (#25479)
- Add support for Float16 dtype (#25185)
- Add strict parameter to pl.concat(how='horizontal') (#25452)
- Add leftmost option to str.replace_many / str.find_many / str.extract_many (#25398)
- Add quantile for missing temporals (#25464)
- Expose and document pl.Categories (#25443)
- Support decimals in search_sorted (#25450)
- Use reference to Graph pipes when flushing metrics (#25442)
- Add SQL support for named WINDOW references (#25400)
- Add Extension types (#25322)
- Add having to group_by context (#23550)
- Allow elementwise Expr.over in aggregation context (#25402)
- Add SQL support for ROW_NUMBER, RANK, and DENSE_RANK functions (#25409)
- Automatically Parquet dictionary encode floats (#25387)
- Add empty_as_null and keep_nulls to {Lazy,Data}Frame.explode (#25369)
- Allow hash for all List dtypes (#25372)
- Support unique_counts for all datatypes (#25379)
- Add maintain_order to Expr.mode (#25377)
- Display function of streaming physical plan map node (#25368)
- Allow slice on scalar in aggregation context (#25358)
- Allow implode and aggregation in aggregation context (#25357)
- Add empty_as_null and keep_nulls flags to Expr.explode (#25289)
- Add ignore_nulls to first / last (#25105)
- Move GraphMetrics into StreamingQuery (#25310)
- Allow Expr.unique on List/Array with non-numeric types (#25285)
- Allow Expr.rolling in aggregation contexts (#25258)
- Support additional forms of SQL CREATE TABLE statements (#25191)
- Add LazyFrame.pivot (#25016)
- Support column-positional SQL UNION operations (#25183)
- Allow arbitrary expressions as the Expr.rolling index_column (#25117)
- Allow arbitrary Expressions in "subset" parameter of unique frame method (#25099)
- Support arbitrary expressions in SQL JOIN constraints (#25132)

🐞 Bug fixes

- Do not overwrite used names in cluster_with_columns pushdown (#26467)
- Do not mark output of concat_str on multiple inputs as sorted (#26468)
- Fix CSV schema inference content line duplication bug (#26452)
- Fix InvalidOperationError using scan_delta with filter (#26448)
- Fix alias giving missing column after streaming GroupBy CSE (#26447)
- Ensure by_name selector selects only names (#26437)
- Restore compatibility of strings written to parquet with pyarrow filter (#26436)
- Update schema in cluster_with_columns optimization (#26430)
- Fix negative slice in groups slicing (#26442)
- Don't run CPU check on aarch64 musl (#26439)
- Remove the POLARS_IDEAL_MORSEL_SIZE monkeypatching in the parametric merge-join test (#26418)
- Correct off-by-one in RLE row counting for nullable dictionary-encoded columns (#26411)
- Support very large integers in env var limits (#26399)
- Fix PlPath panic from incorrect slicing of UTF8 boundaries (#26389)
- Fix Float dtype for spearman correlation (#26392)
- Fix optimizer panic in right joins with type coercion (#26365)
- Don't serialize retry config from local environment vars (#26289)
- Fix PartitionBy with scalar key expressions and diff() (#26370)
- Add {Float16, Float32} -> Float32 lossless upcast (#26373)
- Fix panic using with_columns and collect_all (#26366)
- Add multi-page support for writing dictionary-encoded Parquet columns (#26360)
- Ensure slice advancement when skipping non-inlinable values in is_in with inlinable needles (#26361)
- Pin xlsx2csv version temporarily (#26352)
- Fix bugs in ViewArray total_bytes_len (#26328)
- Fix overflow in i128::abs in Decimal fits check (#26341)
- Make Expr.hash on Cate