Apache Arrow / ARROW-5318

[Python] pyarrow hdfs reader overrequests

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.14.0
    • Component/s: Python
    • Labels: None

    Description

      I am using pyarrow's HdfsFilesystem interface. When I request a read of n bytes, I often see 0%-300% more data sent over the network than I asked for. My suspicion is that pyarrow is reading ahead.

      The pyarrow parquet reader doesn't have this behavior, and I am looking for a way to turn off read ahead for the general HDFS interface.

      I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 through 0.13 (the newest released version). I am on Python 2.7.

      I have been using wireshark to track the packets passed on the network.
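
      To put a number on the over-request, the bytes observed on the wire can be compared with the bytes requested. This is a minimal sketch: `overrequest_percent` is a hypothetical helper, and the observed byte count below is an illustrative value, not a measurement from this report.

```python
def overrequest_percent(requested_bytes, observed_bytes):
    """Return how much more data was transferred than requested, in percent."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return 100.0 * (observed_bytes - requested_bytes) / requested_bytes


# Illustrative numbers only: a 3,000,000-byte read that moved 12,000,000 bytes.
print(overrequest_percent(3000000, 12000000))
```

      A result of 0 means the transfer matched the request; anything above 0 is extra data.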

      I suspect read ahead because the first read takes much longer than the second read.
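
      The timing comparison can be reproduced roughly as follows. This is a sketch under assumptions: `time_consecutive_reads` is a hypothetical helper, it uses `time.perf_counter` and so assumes Python 3, and an in-memory `io.BytesIO` stands in for the HDFS file handle so the snippet runs without a cluster. Against a real cluster, the object returned by `fs.open(file_path)` would be passed instead.

```python
import io
import time


def time_consecutive_reads(f, n_bytes, offset=0):
    """Time two back-to-back reads of n_bytes starting at offset.

    Returns (first_seconds, second_seconds, first_len, second_len).
    """
    f.seek(offset)
    start = time.perf_counter()
    first = f.read(n_bytes)
    mid = time.perf_counter()
    second = f.read(n_bytes)
    end = time.perf_counter()
    return mid - start, end - mid, len(first), len(second)


# Stand-in for an HDFS file handle so the sketch runs without a cluster.
fake_file = io.BytesIO(b"\0" * 10)
t1, t2, len1, len2 = time_consecutive_reads(fake_file, 4)
print(len1, len2)
```

      A second read that is consistently much faster than the first would be consistent with read ahead, though caching elsewhere in the stack could produce the same pattern.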

       

      The regular pyarrow reader

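      import pyarrow as pa

      # hostname is the HDFS namenode host (not shown in the report)
      fs = pa.hdfs.connect(hostname, driver='libhdfs')
      file_path = 'dataset/train/piece0000'
      f = fs.open(file_path)
      f.seek(0)
      n_bytes = 3000000
      f.read(n_bytes)  # request exactly n_bytes from the start of the file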
      

       

      Parquet code without the same issue

      import pyarrow.parquet as pq

      parquet_path = 'dataset/train/parquet/part-22e3'
      pf = fs.open(parquet_path)
      pqf = pq.ParquetFile(pf)
      data = pqf.read_row_group(0, columns=['col_name'])
       

       

       

            People

              Assignee: Unassigned
              Reporter: Ivan Dimitrov (dimitrov)
              Votes: 0
              Watchers: 3