[ARROW-7661] [Python] Non-optimal CSV chunking when no newline at end - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.14.1, 0.15.0, 0.15.1
Fix Version/s: 0.16.0
Component/s: C++, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/23907

Description

We are reading a very simple CSV file (see below).
The file is only 245 bytes, well below the default block_size in ReadOptions, so we would expect the resulting table to have only one batch. At least, that is what I expect if I understand correctly that a block is a run of lines up to a certain size in bytes?

The docs state: "This will determine multi-threading granularity as well as the size of individual chunks in the Table." To me, that implies it also determines the size of individual batches?

Previously, we thought that by setting block_size to the total file size, we would get a pa.Table with only one batch even for files larger than 1MB. This small file seems to prove us wrong?

Additionally, if I convert to pandas and back, we get only one batch.

To reproduce:

import os
import pyarrow as pa
from pyarrow import csv as pc

path = "test.csv"
# Set block_size to the exact file size so the whole file fits in one block
read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
df = pc.read_csv(path, read_options=read_options)
print(len(df.to_batches()))
# prints 2
print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
# prints the last line of the file
# Round-tripping through pandas yields a single batch
pdf = df.to_pandas()
ndf = pa.Table.from_pandas(pdf)
print(len(ndf.to_batches()))
# prints 1
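
Since the issue title points to a missing newline at the end of the file, it may help to confirm that condition before reading. The following is a small sketch using only the standard library; the helper name ends_with_newline is hypothetical and not part of pyarrow.

```python
import os


def ends_with_newline(path):
    """Return True if the file's last byte is a line feed.

    Hypothetical helper for illustration; an empty file counts as False.
    """
    if os.stat(path).st_size == 0:
        return False
    with open(path, "rb") as f:
        # Jump to the final byte and compare it with LF.
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"
```

A file for which this returns False matches the condition named in the issue title.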

test.csv:

"Name","Month","Change in %"
"Surrey Quays","Sep 18","1.01"
"Surrey Quays","Oct 18","0.38"
"Surrey Quays","Nov 18","0.97"
"Surrey Quays","Dec 18","1.28"
"Surrey Quays","Jan 19","2.43"
"Surrey Quays","Feb 19","2.49"
"Surrey Quays","Mar 19","0.81"
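
The reported result is consistent with a chunker that ends each chunk at the last newline inside a block and carries the remaining bytes over to the next one. The sketch below is a simplified model of that idea in pure Python, not pyarrow's actual implementation; the function name chunk_on_newlines and its carry-over logic are assumptions made for illustration. It also rebuilds the file contents above, which come to exactly 245 bytes when the eight lines are joined with single line feeds and no final newline.

```python
from typing import List


def chunk_on_newlines(data: bytes, block_size: int) -> List[bytes]:
    """Split data into chunks that each end on a line boundary.

    Simplified, assumed model: every block is cut after its last newline,
    and the bytes after that newline are carried into the next block.
    """
    chunks = []
    carry = b""
    for start in range(0, len(data), block_size):
        block = carry + data[start:start + block_size]
        cut = block.rfind(b"\n") + 1  # 0 when the block contains no newline
        if cut:
            chunks.append(block[:cut])
        carry = block[cut:]
    if carry:
        # A trailing line without a newline ends up as its own chunk.
        chunks.append(carry)
    return chunks


rows = [
    '"Name","Month","Change in %"',
    '"Surrey Quays","Sep 18","1.01"',
    '"Surrey Quays","Oct 18","0.38"',
    '"Surrey Quays","Nov 18","0.97"',
    '"Surrey Quays","Dec 18","1.28"',
    '"Surrey Quays","Jan 19","2.43"',
    '"Surrey Quays","Feb 19","2.49"',
    '"Surrey Quays","Mar 19","0.81"',
]
data = "\n".join(rows).encode("ascii")  # no newline after the last row
print(len(data))  # 245

chunks = chunk_on_newlines(data, block_size=len(data))
print(len(chunks))  # 2: the last row is split off

with_newline = data + b"\n"
chunks_nl = chunk_on_newlines(with_newline, block_size=len(with_newline))
print(len(chunks_nl))  # 1
```

Under this model, setting block_size to the file size is not enough to get a single chunk when the final line has no terminating newline, which matches the second batch holding only the last line of the file.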

Issue Links

links to

GitHub Pull Request #6305

People

Assignee: Antoine Pitrou

Reporter: Sascha Hofmann

Votes: 0

Watchers: 6

Dates

Created: 23/Jan/20 11:11

Updated: 11/Jan/23 07:55

Resolved: 29/Jan/20 11:32

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 40m