Apache Arrow / ARROW-7661

[Python] Non-optimal CSV chunking when no newline at end

    Details

      Description

      We are reading a very simple CSV file (see below).
      The file is only 245 bytes, far below the default block_size in ReadOptions, so we would expect the resulting table to have only one batch. At least, that is what I expect if I understand correctly that a block is a run of lines up to a certain byte size?

      The docs state: "This will determine multi-threading granularity as well as the size of individual chunks in the Table." To me, that also means the size of individual batches?

      Previously, we thought that by setting block_size to the total file size we would get a pa.Table with only one batch, even for files larger than 1MB. This tiny file seems to prove us wrong?

      Additionally, if I convert the table to pandas and back, we get only one batch.

       

      To reproduce:

      import os
      from pyarrow import csv as pc
      import pyarrow as pa

      path = "test.csv"
      # Set block_size to the exact file size so the whole file fits in one block.
      read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
      df = pc.read_csv(path, read_options=read_options)
      print(len(df.to_batches()))
      # prints 2
      print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
      # prints the last line of the file
      # Round trip through pandas.
      pdf = df.to_pandas()
      ndf = pa.Table.from_pandas(pdf)
      print(len(ndf.to_batches()))
      # prints 1

      test.csv:

      "Name","Month","Change in %"
      "Surrey Quays","Sep 18","1.01"
      "Surrey Quays","Oct 18","0.38"
      "Surrey Quays","Nov 18","0.97"
      "Surrey Quays","Dec 18","1.28"
      "Surrey Quays","Jan 19","2.43"
      "Surrey Quays","Feb 19","2.49"
      "Surrey Quays","Mar 19","0.81"
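      One possible explanation for the second batch, given that the file has no newline at the end, is sketched below. This is a toy model and not Arrow's actual chunker: split_into_chunks is a hypothetical helper that cuts each block at its last newline and emits whatever is left over at end of file as its own chunk. Under that assumption, a file without a trailing newline yields a final chunk holding only the last line, which matches what we observe.

```python
# The contents of test.csv, joined WITHOUT a trailing newline.
LINES = [
    '"Name","Month","Change in %"',
    '"Surrey Quays","Sep 18","1.01"',
    '"Surrey Quays","Oct 18","0.38"',
    '"Surrey Quays","Nov 18","0.97"',
    '"Surrey Quays","Dec 18","1.28"',
    '"Surrey Quays","Jan 19","2.43"',
    '"Surrey Quays","Feb 19","2.49"',
    '"Surrey Quays","Mar 19","0.81"',
]
DATA = "\n".join(LINES).encode("ascii")


def split_into_chunks(data, block_size):
    """Toy chunker (hypothetical): cut each block at its last newline."""
    chunks = []
    leftover = b""
    for start in range(0, len(data), block_size):
        block = leftover + data[start:start + block_size]
        cut = block.rfind(b"\n") + 1  # 0 when the block has no newline
        if cut:
            chunks.append(block[:cut])
        leftover = block[cut:]  # partial line carried to the next block
    if leftover:
        # End of input: the unterminated last line becomes its own chunk.
        chunks.append(leftover)
    return chunks


no_newline = split_into_chunks(DATA, block_size=len(DATA))
with_newline = split_into_chunks(DATA + b"\n", block_size=len(DATA) + 1)
print(len(DATA), len(no_newline), len(with_newline))
```

      In this model, appending a newline to the file makes the whole input a single chunk. Whether the real reader behaves the same way is an assumption that would need checking against the Arrow source.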
      

       

       

       

              People

              • Assignee:
                apitrou Antoine Pitrou
                Reporter:
                saschahofmann Sascha Hofmann
              • Votes:
                0
                Watchers:
                5

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m