Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
I haven't investigated very deeply, but this seems symptomatic of a problem:
In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})

In [28]: pq.write_table(pa.table(df), 'test.parquet')

In [29]: timeit pq.read_table('test.parquet')
79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
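
For reference, a self-contained script that reproduces the comparison above outside IPython is sketched below. It assumes a pyarrow version in which pq.read_table still accepts the use_legacy_dataset flag (as in the versions discussed here); the file name and loop count are just illustrative choices.

import timeit

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build the same 10-million-row, single-column frame as in the report
# and write it to a local Parquet file.
df = pd.DataFrame({'A': np.random.randn(10_000_000)})
pq.write_table(pa.table(df), 'test.parquet')

# Time the default read path (the new Dataset-based reader).
t_new = timeit.timeit(lambda: pq.read_table('test.parquet'), number=10) / 10

# Time the legacy read path for comparison.
t_legacy = timeit.timeit(
    lambda: pq.read_table('test.parquet', use_legacy_dataset=True),
    number=10) / 10

print(f"default (Dataset) reader: {t_new * 1e3:.1f} ms per read")
print(f"legacy reader:            {t_legacy * 1e3:.1f} ms per read")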
Attachments
Issue Links
- fixes
  - ARROW-9983 [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API (Resolved)
- is related to
  - ARROW-9983 [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API (Resolved)
- links to