Apache Arrow / ARROW-8899

[R] Add R metadata like pandas metadata for round-trip fidelity


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: R

    Description

      Arrow Schema and Field objects have custom_metadata fields to store arbitrary strings in a key-value store. Pandas stores JSON in a "pandas" key and uses that to improve the fidelity of round-tripping data to Arrow/Parquet/Feather and back. https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format describes this a bit.

      You can see this pandas metadata in the sample Parquet file:

      library(arrow)

      tab <- read_parquet(system.file("v0.7.1.parquet", package = "arrow"), as_data_frame = FALSE)
      tab
      
      # Table
      # 10 rows x 11 columns
      # $carat <double>
      # $cut <string>
      # $color <string>
      # $clarity <string>
      # $depth <double>
      # $table <double>
      # $price <int64>
      # $x <double>
      # $y <double>
      # $z <double>
      # $__index_level_0__ <int64>
      
      tab$metadata
      
      # $pandas
      # [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": [{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\", \"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\": \"0.20.1\"}"
      

      We should do something similar in R: store the "attributes" for each column in a data.frame when we convert to Arrow, and restore those attributes when we read from Arrow.

      Since ARROW-8703, you could naively do this all in R, something like:

      tab$metadata$r <- lapply(df, attributes)
      

      on the conversion to Arrow, and in as.data.frame(), do

      if (!is.null(tab$metadata$r)) {
        df[] <- mapply(function(col, meta) {
          attributes(col) <- meta
          col  # return the modified column, not the metadata
        }, col = df, meta = tab$metadata$r, SIMPLIFY = FALSE)
      }
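
      As a toy base-R illustration of what that attribute round trip buys us (the "units" attribute here is just a stand-in for whatever attributes a real data.frame column might carry):

      # A column attribute that conversion to Arrow and back would otherwise drop
      df <- data.frame(x = 1:3)
      attr(df$x, "units") <- "seconds"

      r_meta <- lapply(df, attributes)   # what we'd stash in tab$metadata$r

      restored <- data.frame(x = 1:3)    # the column as it comes back from Arrow
      restored[] <- mapply(function(col, meta) {
        attributes(col) <- meta
        col
      }, col = restored, meta = r_meta, SIMPLIFY = FALSE)

      attr(restored$x, "units")
      # [1] "seconds"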
      

      However, it's trickier than this because:

      • tab$metadata$r needs to be serialized to a string and deserialized on the way back. Pandas uses JSON, but the arrow R package doesn't currently have a JSON dependency. We could dput() to dump the R attributes, but that is risky since consuming it requires parsing and evaluating code. My best idea at the moment is to try rawToChar(serialize(x, ascii = TRUE)) on the way out (ascii = TRUE doesn't mean it requires ASCII inputs, it's about how it serializes) and unserialize(charToRaw(x)) on the way back; see the sketch after this list. But maybe there's some lower-level way to do this better.
      • We'll need to do the same for all places where Tables and RecordBatches are created/converted
      • We'll need to make sure that nested types (structs) get the same coverage
      • This metadata is only attached to Schemas, which means that Arrays/ChunkedArrays don't have a place to store extra metadata. So we probably want to attach a metadata/attributes field to the R6 (Chunked)Array objects, so that if we convert an R vector to an Array, or extract an Array out of a RecordBatch, we don't lose the attributes.
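
      Here is a rough sketch of the serialize/unserialize idea from the first bullet. The helper names (.serialize_arrow_r_metadata, .unserialize_arrow_r_metadata) are placeholders, and it assumes the metadata assignment works as in the naive example above:

      # Hypothetical helpers: serialize(..., ascii = TRUE) gives a printable
      # serialization we can store as a string value in the Schema's key-value metadata
      .serialize_arrow_r_metadata <- function(x) {
        rawToChar(serialize(x, connection = NULL, ascii = TRUE))
      }

      .unserialize_arrow_r_metadata <- function(x) {
        unserialize(charToRaw(x))
      }

      # On the way to Arrow:
      tab$metadata$r <- .serialize_arrow_r_metadata(lapply(df, attributes))

      # On the way back, before reapplying attributes in as.data.frame():
      r_meta <- .unserialize_arrow_r_metadata(tab$metadata$r)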

      Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.

      Finally, a note about this custom metadata vs. extension types. Extension types can be defined by adding metadata to a Field (in a Schema). I think this is out of scope here because we're only concerned with R round-trip fidelity. If there were a type that (for example) R and pandas both had but Arrow did not, we could define an extension type so that we could share it across implementations. But unless/until there is value in establishing that extension-type standard, let's not worry about it. (In other words, in R we should ignore pandas metadata; if there's anything that pandas wants to share with R, it will define it somewhere else.)

          People

            Assignee: Romain Francois
            Reporter: Neal Richardson
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 40m
