Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
Description
Example with Python:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
gives a non-deterministic result: the order in which the row groups of the Parquet file appear in the output varies between calls:
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
Issue Links
- duplicates
  ARROW-8447 [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks (Resolved)
- is duplicated by
  ARROW-8447 [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks (Resolved)
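
The title of ARROW-8447 suggests that the scrambled order comes from scan tasks being collected in a different order than they were issued. The sketch below illustrates that general effect using only the Python standard library. It is not Arrow's implementation: `ROW_GROUPS`, `DELAYS`, `scan_task`, and both `collect_*` helpers are hypothetical names invented for this illustration, and the delays are made-up values standing in for variable read times.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the file above: 12 rows in row groups of 3.
ROW_GROUPS = [list(range(start, start + 3)) for start in range(0, 12, 3)]
# Hypothetical read times in seconds; the second row group is the slowest.
DELAYS = [0.0, 0.2, 0.05, 0.1]


def scan_task(index):
    """Pretend to read one row group; a later task may finish earlier."""
    time.sleep(DELAYS[index])
    return ROW_GROUPS[index]


def collect_in_completion_order():
    """Append chunks as tasks finish, so row order depends on timing."""
    rows = []
    with ThreadPoolExecutor(max_workers=len(ROW_GROUPS)) as pool:
        futures = [pool.submit(scan_task, i) for i in range(len(ROW_GROUPS))]
        for future in as_completed(futures):
            rows.extend(future.result())
    return rows


def collect_in_task_order():
    """Append chunks in the order the tasks were issued."""
    with ThreadPoolExecutor(max_workers=len(ROW_GROUPS)) as pool:
        # Executor.map yields results in submission order,
        # regardless of which task finishes first.
        chunks = pool.map(scan_task, range(len(ROW_GROUPS)))
        return [row for chunk in chunks for row in chunk]
```

Both helpers return the same set of rows. Only `collect_in_task_order` guarantees that they come back in row-group order, which is the property the linked issue asks `Scanner::ToTable` to preserve.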