Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
It would be nice to have a hash aggregate function that returns the first value of a column within each hash group.
If row order within groups is non-deterministic, then effectively this would return one arbitrary value. This is a very computationally cheap operation.
This can be quite useful when querying a non-normalized table. For example, if you have a table with a country column and a country_abbr column, and you want to group by either or both of those columns but return the values from both, you could do
SELECT country, country_abbr FROM table GROUP BY country, country_abbr
but it would be more efficient to do
SELECT country, first(country_abbr) FROM table GROUP BY country
because then the engine does not need to scan all the values of the country_abbr column.
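To illustrate the requested semantics, here is a minimal pure-Python sketch of a "first value per group" aggregation. It is not Arrow's implementation; the function name `first_per_group` and the sample rows are hypothetical and chosen only for this example. It keeps the first value encountered for each group key in a single pass, so when input order is not deterministic the result is effectively one arbitrary value per group.

```python
def first_per_group(rows, key, value):
    """Return {group key: first value seen} in a single pass over rows."""
    firsts = {}
    for row in rows:
        # setdefault stores the value only the first time a key appears;
        # later rows in the same group leave the stored value unchanged.
        firsts.setdefault(row[key], row[value])
    return firsts


# Hypothetical non-normalized data: country_abbr is determined by country.
rows = [
    {"country": "France", "country_abbr": "FR"},
    {"country": "Japan", "country_abbr": "JP"},
    {"country": "France", "country_abbr": "FR"},
]

result = first_per_group(rows, "country", "country_abbr")
```

Only the country values are hashed here; country_abbr is simply copied once per group, which is why the operation is cheap compared with grouping on both columns.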
Issue Links
- blocks: ARROW-14045 [R] Support for .keep_all = TRUE with distinct() (In Progress)
- is related to: ARROW-15717 [Docs] Add hash_one to the documentation (Resolved)
- links to