Apache Arrow / ARROW-11178

[Rust] StructArray: handling duplicate field names

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Rust
    • Labels: None

    Description

      The Arrow spec leaves the handling of duplicate field names to implementors.

      The C++ implementation offers two options: ignore or raise an error. The Java implementation offers four: ignore, append, replace, or raise an error. Both use ignore as the default. Here are the references:

      I'm not an expert in databases or data science, but as far as I know, in the traditional RDBMS domain it is unusual to allow duplicate field names. Furthermore, in the data analysis domain, perhaps it is usual to normalize or clean various kinds of bad or dirty data interactively with tools like `pandas`?

      Back to the problem, here is an example: given the duplicate field names A A A B B, a user who knows the actual data MAY choose to replace the first A with the second A, append the third A, and ignore the second B. Or was the duplication just a mistake?
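
      The A A A B B example can be sketched with a per-occurrence action list. This is only an illustration in plain Rust: `Action`, `apply_actions`, and the meaning given to each variant are assumptions made for this sketch, not part of the Arrow crate.

```rust
/// What to do with a field whose name has already been seen.
/// The variant names are hypothetical, chosen for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    /// Keep the field alongside the earlier one(s) with the same name.
    Append,
    /// Overwrite the most recent kept field with the same name.
    Replace,
    /// Drop the field.
    Ignore,
}

/// Applies one action per input field and returns the surviving fields as
/// (name, original index) pairs. The action given for the first occurrence
/// of a name is not consulted, because there is nothing to conflict with.
pub fn apply_actions(
    names: &[&str],
    actions: &[Action],
) -> Result<Vec<(String, usize)>, String> {
    if names.len() != actions.len() {
        return Err(format!(
            "expected {} actions, got {}",
            names.len(),
            actions.len()
        ));
    }
    let mut kept: Vec<(String, usize)> = Vec::new();
    for (idx, (&name, &action)) in names.iter().zip(actions.iter()).enumerate() {
        // Position of the most recent surviving field with the same name.
        let previous = kept.iter().rposition(|(n, _)| n.as_str() == name);
        match (previous, action) {
            (None, _) | (Some(_), Action::Append) => kept.push((name.to_string(), idx)),
            (Some(pos), Action::Replace) => kept[pos].1 = idx,
            (Some(_), Action::Ignore) => {}
        }
    }
    Ok(kept)
}
```

      With the names A A A B B and the actions Append, Replace, Append, Append, Ignore, the surviving fields are the second A, the third A, and the first B. The point of the sketch is that such a result depends entirely on choices only the user can make.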

      Quote from nevi_me: "I also prefer raising an error by default, as that'll make users aware very quickly". It is not acceptable to silently append, ignore, or replace duplicate fields, producing unexpected results that the user is not aware of at all.
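
      A minimal sketch of the raise-an-error behavior, assuming the field names are available as a slice of strings. `DuplicateFieldError` and `check_unique_field_names` are hypothetical names for illustration and do not exist in the Arrow crate.

```rust
use std::collections::HashMap;
use std::fmt;

/// Error describing the first duplicate field name found.
/// Hypothetical type, not part of the Arrow crate.
#[derive(Debug, PartialEq)]
pub struct DuplicateFieldError {
    pub name: String,
    pub first_index: usize,
    pub duplicate_index: usize,
}

impl fmt::Display for DuplicateFieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "duplicate field name '{}' at index {} (first seen at index {})",
            self.name, self.duplicate_index, self.first_index
        )
    }
}

impl std::error::Error for DuplicateFieldError {}

/// Returns Ok(()) if every name is unique, otherwise an error that points
/// at the first name seen twice.
pub fn check_unique_field_names(names: &[&str]) -> Result<(), DuplicateFieldError> {
    // Maps each name to the index where it was first seen.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    for (idx, &name) in names.iter().enumerate() {
        if let Some(&first) = seen.get(name) {
            return Err(DuplicateFieldError {
                name: name.to_string(),
                first_index: first,
                duplicate_index: idx,
            });
        }
        seen.insert(name, idx);
    }
    Ok(())
}
```

      Reporting both indices in the error gives the user enough information to decide whether the duplication was a mistake or intended.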

      If we choose to support `replace`, `ignore` or `append`, we must at least let the user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest we just record this problem here and keep raising an error until it is really necessary to support other solutions.
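
      If other behaviors are ever supported, one possible shape is an explicit policy argument whose default is to raise an error. The sketch below is plain Rust with hypothetical names (`DuplicatePolicy`, `resolve_fields`). The semantics of each variant are assumptions made for this sketch and may not match the C++ or Java behavior exactly.

```rust
/// How duplicate field names are handled. Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DuplicatePolicy {
    /// Keep the first field with a given name and drop later ones.
    Ignore,
    /// Let a later field take the place of the earlier one.
    Replace,
    /// Keep every field, duplicates included.
    Append,
    /// Fail on the first duplicate.
    Error,
}

impl Default for DuplicatePolicy {
    /// Raising an error is the default, so nothing is dropped silently.
    fn default() -> Self {
        DuplicatePolicy::Error
    }
}

/// Resolves duplicates under one policy and returns the surviving fields as
/// (name, original index) pairs.
pub fn resolve_fields(
    names: &[&str],
    policy: DuplicatePolicy,
) -> Result<Vec<(String, usize)>, String> {
    let mut out: Vec<(String, usize)> = Vec::new();
    for (idx, &name) in names.iter().enumerate() {
        // First surviving field with the same name, if any.
        let existing = out.iter().position(|(n, _)| n.as_str() == name);
        match (existing, policy) {
            (None, _) | (Some(_), DuplicatePolicy::Append) => {
                out.push((name.to_string(), idx))
            }
            (Some(_), DuplicatePolicy::Ignore) => {}
            (Some(pos), DuplicatePolicy::Replace) => out[pos].1 = idx,
            (Some(_), DuplicatePolicy::Error) => {
                return Err(format!("duplicate field name: {}", name))
            }
        }
    }
    Ok(out)
}
```

      A caller who wants anything other than an error has to say so, for example `resolve_fields(&names, DuplicatePolicy::Ignore)`, while `DuplicatePolicy::default()` keeps the current behavior.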

          People

            Assignee: Unassigned
            Reporter: Qingyou Meng (mqy)