Apache Arrow / ARROW-7661

[Python] Non-optimal CSV chunking when no newline at end

    Description

      We are reading a very simple CSV file (see below).
      The file is only 245 bytes, far below the default block_size in ReadOptions, so we would expect the resulting table to have only one batch. That is, assuming I understand correctly that a block is a run of lines totalling at most a given number of bytes?

      The docs state: "This will determine multi-threading granularity as well as the size of individual chunks in the Table." To me, that means it also determines the size of individual batches?

      Previously, we thought that by setting block_size to the total file size, we would ensure that even files larger than 1 MB produce a pa.Table with only one batch. This tiny file seems to prove us wrong.

      Additionally, if I convert the table to pandas and back, we get only one batch.

       

      To reproduce:

      import os
      from pyarrow import csv as pc
      import pyarrow as pa
      path = "test.csv"
      read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
      df = pc.read_csv(path, read_options=read_options)
      print(len(df.to_batches()))
      # prints 2
      print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
      # prints the last line of the file
      pdf = df.to_pandas()
      ndf = pa.Table.from_pandas(pdf)
      print(len(ndf.to_batches()))
      # prints 1

      test.csv:

      "Name","Month","Change in %"
      "Surrey Quays","Sep 18","1.01"
      "Surrey Quays","Oct 18","0.38"
      "Surrey Quays","Nov 18","0.97"
      "Surrey Quays","Dec 18","1.28"
      "Surrey Quays","Jan 19","2.43"
      "Surrey Quays","Feb 19","2.49"
      "Surrey Quays","Mar 19","0.81"
      
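Because the behaviour depends on the file having no newline after the last row, it may help to generate test.csv byte for byte rather than copy it from this page. The following is a minimal sketch, assuming "\n" line endings and no trailing newline; that assumption is consistent with the 245-byte size and the issue title, but the original file itself is not attached here.

```python
from pathlib import Path

# Rows exactly as listed in the report.
LINES = [
    '"Name","Month","Change in %"',
    '"Surrey Quays","Sep 18","1.01"',
    '"Surrey Quays","Oct 18","0.38"',
    '"Surrey Quays","Nov 18","0.97"',
    '"Surrey Quays","Dec 18","1.28"',
    '"Surrey Quays","Jan 19","2.43"',
    '"Surrey Quays","Feb 19","2.49"',
    '"Surrey Quays","Mar 19","0.81"',
]

# Assumption: lines are separated by "\n" and the last line has no newline.
CSV_BYTES = "\n".join(LINES).encode("ascii")

# Write in binary mode so no platform newline translation happens.
Path("test.csv").write_bytes(CSV_BYTES)

print(len(CSV_BYTES))             # 245
print(CSV_BYTES.endswith(b"\n"))  # False
```

Appending b"\n" to CSV_BYTES before writing gives a variant with a trailing newline, which can be used to compare the number of batches between the two cases.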

       

       

       

            People

              apitrou Antoine Pitrou
              saschahofmann Sascha Hofmann

                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 40m