We are reading a very simple CSV file (see below).
The file is only 245 bytes, which is far below the default block_size in ReadOptions, so we would expect the resulting table to contain only one batch. At least, that is our expectation if we understand correctly that a block refers to a slice of input of roughly that byte size?
The docs state: "This will determine multi-threading granularity as well as the size of individual chunks in the Table." To us, that implies it also determines the size of individual batches?
Previously, we assumed that by setting block_size to the total file size we could guarantee that even files larger than 1 MB yield a pa.Table with a single batch. This mini file seems to prove us wrong?
Additionally, if we convert the table to pandas and back, we end up with only one batch.