[ARROW-4027] Reading partitioned datasets using RecordBatchFileReader into R - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.11.1
Fix Version/s: None
Component/s: R
Labels: None
Environment: Ubuntu 16.04, building R package from master on github
External issue URL: https://github.com/apache/arrow/issues/20629

Description

I have a Parquet dataset (which originally came from Hive) stored locally in the directory `data/`. It contains four files:

```
data/foo1
data/foo2
data/foo3
data/foo4
```

Using pyarrow, I can read each of them with, for example:

`pq.read_table("data/foo1").to_pandas()`

I am trying to read them into R using `read_table("data/foo1")`, but I get this error:

```
Error in ipc__RecordBatchFileReader_Open(file) :
Invalid: Not an Arrow file
```

From debugging, I've traced it to this line: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112, which then calls this Rcpp code: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85. It seems that this C++ function expects a single "[file-like object](https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e)". I think that, because my data is split across several files, the footer that is supposed to contain the file layout and schema cannot be found, hence the error `Not an Arrow file`.

If I pass the whole directory using `read_table("data/")`, I get:

```
Error in ipc__RecordBatchFileReader_Open(file) :
IOError: Error reading bytes from file: Is a directory
```
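One way to narrow down both errors is to check what kind of file each path actually is before handing it to a reader. Parquet files begin with the 4-byte magic `PAR1`, while files in the Arrow IPC file format begin with `ARROW1`, so a Parquet file passed to an Arrow IPC file reader would be expected to be rejected regardless of how the dataset is split. The sketch below uses only the Python standard library; the function name `sniff_format` and its return labels are hypothetical and exist only for illustration.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"       # Parquet files start (and end) with these bytes
ARROW_FILE_MAGIC = b"ARROW1"  # Arrow IPC *file* format starts with these bytes


def sniff_format(path):
    """Classify a path as 'directory', 'arrow-file', 'parquet', or 'unknown'.

    Hypothetical helper: it only inspects the leading bytes and does not
    validate the rest of the file.
    """
    p = Path(path)
    if p.is_dir():
        # A directory cannot be opened as a single file-like object.
        return "directory"
    with p.open("rb") as fh:
        head = fh.read(len(ARROW_FILE_MAGIC))
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    return "unknown"
```

Calling `sniff_format("data/foo1")` and `sniff_format("data/")` on the layout described above would show whether the inputs are Parquet files and a directory rather than Arrow IPC files, which would be consistent with the two error messages.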

So, how can I use the R package to correctly read multiple Parquet files? If I need to call `RecordBatchFileReader` with a pointer to the footer, file layout, and schema, how do I find the footer of the dataset?

I cannot post the original dataset online, and I don't know what aspect of my data causes the code to break, so I don't quite know how to post a reproducible example. Tips on how to generate a partitioned dataset would be great.

People

Assignee: Unassigned
Reporter: Jeffrey Wong
Votes: 0
Watchers: 3

Dates

Created: 14/Dec/18 04:20
Updated: 11/Jan/23 07:31
Resolved: 14/Dec/18 05:43