[ARROW-8899] [R] Add R metadata like pandas metadata for round-trip fidelity - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: R
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25033

Description

Arrow Schema and Field objects have custom_metadata fields that store arbitrary strings as key-value pairs. Pandas stores JSON under a "pandas" key and uses it to improve the fidelity of round-tripping data to Arrow/Parquet/Feather and back. This is described briefly at https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format.

You can see this pandas metadata in the sample Parquet file:

tab <- read_parquet(system.file("v0.7.1.parquet", package="arrow"), as_data_frame = FALSE)
tab

# Table
# 10 rows x 11 columns
# $carat <double>
# $cut <string>
# $color <string>
# $clarity <string>
# $depth <double>
# $table <double>
# $price <int64>
# $x <double>
# $y <double>
# $z <double>
# $__index_level_0__ <int64>

tab$metadata

# $pandas
# [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": [{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\", \"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\": \"0.20.1\"}"

We should do something similar in R: store the "attributes" for each column in a data.frame when we convert to Arrow, and restore those attributes when we read from Arrow.

Since ARROW-8703, you could naively do this all in R, something like:

tab$metadata$r <- lapply(df, attributes)

on the conversion to Arrow, and in as.data.frame(), do

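if (!is.null(tab$metadata$r)) {
  # Map() always returns a list, so df[] keeps its data.frame shape
  df[] <- Map(function(col, meta) {
    attributes(col) <- meta
    col  # return the column itself, not the value of the assignment
  }, df, tab$metadata$r)
}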
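
The capture-and-restore idea can be tried without Arrow at all. The following is a minimal base-R sketch, not the arrow implementation: strip_attributes() is a hypothetical stand-in for the conversion to Arrow, which is assumed here to drop per-column attributes, and the column names and attribute values are made up for illustration.

```r
# Hypothetical stand-in for a conversion that loses per-column attributes.
strip_attributes <- function(df) {
  df[] <- lapply(df, function(col) {
    attributes(col) <- NULL
    col
  })
  df
}

df <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))
attr(df$x, "label") <- "An integer column"
attr(df$y, "units") <- "cm"

# Capture the attributes of each column before "converting".
r_meta <- lapply(df, attributes)

stripped <- strip_attributes(df)

# Restore: the function must return the modified column, otherwise
# the columns would be replaced by the metadata lists themselves.
restored <- stripped
restored[] <- Map(function(col, meta) {
  attributes(col) <- meta
  col
}, stripped, r_meta)
```

Note that this sketch keeps r_meta as an in-memory R list, which sidesteps the serialization question entirely.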

However, it's trickier than this because:

- tab$metadata$r needs to be serialized to a string and deserialized on the way back. Pandas uses JSON, but arrow doesn't currently have a JSON R dependency. We could dput() to dump the R attributes, but that could introduce risks since you have to parse/eval code to consume it. My best idea at the moment is to try rawToChar(serialize(x, NULL, ascii = TRUE)) on the way out (ascii = TRUE doesn't mean it requires ASCII inputs; it's about how the output is encoded) and unserialize(charToRaw(x)) on the way back. But maybe there's some lower-level way to do this better.
- We'll need to do the same in all places where Tables and RecordBatches are created/converted.
- We'll need to make sure that nested types (structs) get the same coverage.
- This metadata is attached only to Schemas, meaning that Arrays/ChunkedArrays don't have a place to store extra metadata. So we probably want to attach a metadata/attributes field to the R6 (Chunked)Array objects so that if we convert an R vector to an array, or if we extract an array out of a record batch, we don't lose the attributes.
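
The serialize()-based idea from the first point can be sketched in base R as follows. The helper names encode_r_metadata() and decode_r_metadata() are hypothetical, and the sample attribute list is made up for illustration; this shows only the string round trip, not how the string would be attached to a Schema.

```r
# Encode an arbitrary R object as a single character string.
# ascii = TRUE selects a text encoding of the serialized object, so the
# resulting raw vector can be turned into a string with rawToChar().
encode_r_metadata <- function(x) {
  rawToChar(serialize(x, connection = NULL, ascii = TRUE))
}

# Decode a string produced by encode_r_metadata() back into an R object.
decode_r_metadata <- function(s) {
  unserialize(charToRaw(s))
}

# Example per-column attributes, as lapply(df, attributes) might return.
meta <- list(
  x = list(label = "An integer column"),
  y = list(class = c("POSIXct", "POSIXt"), tzone = "UTC")
)

encoded <- encode_r_metadata(meta)
decoded <- decode_r_metadata(encoded)
```

Unlike dput(), reading the string back does not parse or evaluate R source code, although unserialize() should still only be applied to input from a trusted source.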

Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.

Finally, a note about this custom metadata vs. extension types. Extension types can be defined by adding metadata to a Field (in a Schema). I think this is out of scope here because we're only concerned with R roundtrip fidelity. If there were a type that (for example) R and Pandas both had that Arrow did not, we could define an extension type so that we could share that across the implementations. But unless/until there is value in establishing that extension type standard, let's not worry about it. (In other words, in R we should ignore pandas metadata; if there's anything that pandas wants to share with R, it will define it somewhere else.)

Issue Links

is depended upon by

ARROW-8867 [R] Support converting POSIXlt type (Resolved)

ARROW-4390 [R] Serialize "labeled" metadata in Feather files, IPC messages (Resolved)

links to

GitHub Pull Request #7524

People

Assignee: Romain Francois
Reporter: Neal Richardson
Votes: 0
Watchers: 3

Dates

Created: 22/May/20 22:07
Updated: 11/Jan/23 08:03
Resolved: 30/Jun/20 15:17

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2h 40m