[ARROW-18137] [Python][Docs] Allow passing no aggregations to TableGroupBy.aggregate - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 9.0.0
Fix Version/s: 11.0.0
Component/s: Documentation, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/33333

Description

If TableGroupBy.aggregate could accept no aggregation functions, it would behave like pandas drop_duplicates, returning one row per distinct combination of the group keys.

t.group_by(['keys', 'values']).aggregate()

I ran some naive benchmarks, and it looks like this should be about 30% faster than converting to pandas and deduplicating there. This was my naive test:

t.append_column('i', pa.array([1] * len(t), pa.int64())).group_by(['keys', 'values']).aggregate([('i', 'max')]).drop(['i_max'])

On a small 5M table it took 245 ms, compared with 359 ms for t.to_pandas().drop_duplicates().

An actual aggregation without the dummy column should be even faster. It would also provide drop_duplicates functionality until a better implementation arrives.

Issue Links

links to

GitHub Pull Request #14482

People

Assignee: Jacek Pliszka

Reporter: Jacek Pliszka

Votes: 0

Watchers: 4

Dates

Created: 23/Oct/22 18:26

Updated: 11/Jan/23 11:58

Resolved: 24/Oct/22 21:53

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h