[ARROW-11178] [Rust] StructArray: handling duplicate field names - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Rust
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/27083

Description

The Arrow spec leaves the handling of `duplicate field names` to implementors.

The C++ implementation offers two options: ignore or raise an error. The Java implementation offers four: ignore, append, replace or raise an error. Both use ignore as the default. Here are the references:

I'm not an expert in databases or data science, but as far as I know, in the traditional RDBMS domain it is unusual to allow duplicate field names. Furthermore, in the data analysis domain, perhaps it is usual to normalize/clean various kinds of bad/dirty data interactively with tools like `pandas`?

Back to the problem, here is an example: given the duplicate field names A A A B B, a user who knows the actual data MAY choose to replace the first A with the second A, append the third A, and ignore the second B. Or was the duplication just a mistake?
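Before a user can make any such choice, they need to know which names collide and where. A minimal sketch of that first step, using only the standard library; `duplicate_positions` is a hypothetical helper for illustration and not part of the arrow crate:

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not an arrow API): maps every field name that occurs
/// more than once to the positions at which it occurs.
fn duplicate_positions(names: &[&str]) -> BTreeMap<String, Vec<usize>> {
    let mut positions: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (index, name) in names.iter().enumerate() {
        positions.entry(name.to_string()).or_default().push(index);
    }
    // Keep only the names that appear at more than one position.
    positions.retain(|_, indices| indices.len() > 1);
    positions
}
```

For the field names A A A B B this reports A at positions 0, 1, 2 and B at positions 3, 4. Whether each collision is intentional or a mistake is still something only the user can answer.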

Quote from nevi_me: "I also prefer raising an error by default, as that'll make users aware very quickly". It is not acceptable to silently append/ignore/replace duplicate fields, producing unexpected results that the user is not aware of at all.

If we choose to support `replace`, `ignore` or `append`, we must at least let the user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest we just record this problem here and keep raising an error until it is really necessary to support other solutions.
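To make the options concrete, here is a rough sketch of an explicit, user-selected policy that defaults to raising an error. The names `DuplicatePolicy` and `resolve_fields` are hypothetical, the fields are plain `(name, value)` pairs rather than real Arrow arrays, and the semantics chosen for each policy are my own assumptions, not necessarily what the C++ or Java implementations do:

```rust
use std::collections::HashMap;

/// Hypothetical policy type; the variant semantics are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DuplicatePolicy {
    /// Reject the input as soon as a name repeats.
    Error,
    /// Keep the first occurrence of a name and drop later ones.
    Ignore,
    /// A later occurrence overwrites the earlier one in place.
    Replace,
    /// Keep every occurrence, duplicates included.
    Append,
}

impl Default for DuplicatePolicy {
    // Raising an error is the default, as suggested in this issue.
    fn default() -> Self {
        DuplicatePolicy::Error
    }
}

/// Applies one policy uniformly to a list of (name, value) pairs.
fn resolve_fields(
    fields: &[(&str, i32)],
    policy: DuplicatePolicy,
) -> Result<Vec<(String, i32)>, String> {
    let mut out: Vec<(String, i32)> = Vec::new();
    // Index in `out` of the first occurrence of each name.
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    for &(name, value) in fields {
        match (first_seen.get(name).copied(), policy) {
            (None, _) => {
                first_seen.insert(name, out.len());
                out.push((name.to_string(), value));
            }
            (Some(_), DuplicatePolicy::Error) => {
                return Err(format!("duplicate field name: {}", name));
            }
            (Some(_), DuplicatePolicy::Ignore) => {}
            (Some(slot), DuplicatePolicy::Replace) => out[slot].1 = value,
            (Some(_), DuplicatePolicy::Append) => out.push((name.to_string(), value)),
        }
    }
    Ok(out)
}
```

Note that a single uniform policy cannot express the mixed per-field choice from the A A A B B example, which is one more reason to keep raising an error until a real use case shows what kind of control users need.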

People

Assignee: Unassigned

Reporter: Qingyou Meng

Votes: 0

Watchers: 2

Dates

Created: 08/Jan/21 03:41

Updated: 11/Jan/23 08:18