Details
- New Feature
- Status: Resolved
- Minor
- Resolution: Fixed
- 9.0.0
Description
If TableGroupBy.aggregate accepted no aggregation functions, it would behave like pandas drop_duplicates, returning one row per distinct combination of the grouping keys:
t.group_by(['keys', 'values']).aggregate()
I ran some naive benchmarks, and it looks like this should be about 30% faster than converting to pandas and deduplicating there. This was my naive test:
t.append_column('i', pa.array([1]*len(t),pa.int64())).group_by(['keys', 'values']).aggregate([("i", "max")]).drop(['i_max'])
On a small 5M table this took 245 ms, compared with 359 ms for t.to_pandas().drop_duplicates()
A real aggregation that does not add the dummy column should be even faster, and it would provide drop_duplicates functionality until a better implementation arrives.