Apache Arrow / ARROW-4027

Reading partitioned datasets using RecordBatchFileReader into R

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.11.1
    • Fix Version/s: None
    • Component/s: R
    • Labels: None
    • Environment: Ubuntu 16.04, building R package from master on GitHub

    Description

      I have a Parquet dataset (originally produced by Hive) stored locally in the directory `data/`. It contains 4 files:

      ```
      data/foo1
      data/foo2
      data/foo3
      data/foo4
      ```

      Using pyarrow I can read them via

      `pq.read_table("data/foo1").to_pandas()`

      I am trying to read them into R using `read_table("data/foo1")`, but I receive this error:

      ```
      Error in ipc__RecordBatchFileReader_Open(file) :
      Invalid: Not an Arrow file
      ```

      From debugging, I've traced it to this line: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112, which then calls this Rcpp code: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85. This C++ function appears to expect a single "[file-like object](https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e)". My guess is that, because my data is split across several files, the footer that is supposed to contain the file layout and schema cannot be found, hence the error `Not an Arrow file`.
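
One way to narrow this down is to check which on-disk format a file such as `data/foo1` actually uses. The sketch below inspects the magic bytes: unencrypted Parquet files begin and end with `PAR1`, while Arrow IPC files begin and end with `ARROW1`. The helper name `sniff_format` is hypothetical and not part of any Arrow API, and the check assumes plain, unencrypted files.

```python
import os

PARQUET_MAGIC = b"PAR1"        # leading and trailing marker of a Parquet file
ARROW_FILE_MAGIC = b"ARROW1"   # leading and trailing marker of an Arrow IPC file


def sniff_format(path):
    """Guess the file format from its first and last six bytes.

    Hypothetical helper for illustration; it only looks at magic bytes
    and does not validate the rest of the file.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(6)
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if head[:4] == PARQUET_MAGIC and tail[-4:] == PARQUET_MAGIC:
        return "parquet"
    if head == ARROW_FILE_MAGIC and tail == ARROW_FILE_MAGIC:
        return "arrow-file"
    return "unknown"
```

If this reports `"parquet"` for `data/foo1`, then a reader for the Arrow IPC file format would be expected to reject the file no matter how many files the dataset is split into, which may be a simpler explanation for `Not an Arrow file` than a missing dataset-level footer.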

       

      If I pass the whole directory using `read_table("data/")` I will get

      ```
      Error in ipc__RecordBatchFileReader_Open(file) :
      IOError: Error reading bytes from file: Is a directory
      ```

      So, how can I use the R package to correctly read multiple Parquet files? If I need to call `RecordBatchFileReader` with a pointer to the footer, file layout, and schema, how do I find the footer of the dataset?

      I cannot post the original dataset online, and I don't know which aspect of my data causes the code to break, so I don't know how to provide a reproducible example. Tips on how to generate a partitioned dataset would be great.

          People

            Assignee: Unassigned
            Reporter: Jeffrey Wong (jeffreyw)
            Votes: 0
            Watchers: 3
