[ARROW-10514] [C++][Parquet] Data inconsistency in parquet-reader output modes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0
Component/s: C++
Labels:
- pull-request-available

Flags: Patch
External issue URL:
https://github.com/apache/arrow/issues/26483

Description

I tried reading the metadata of a Parquet file with nested maps using the parquet-reader tool.

This file has the following structure:

required group field_id=0 spark_schema {
  optional group field_id=1 a (Map) {
    repeated group field_id=2 key_value {
      required binary field_id=3 key (String);
      optional group field_id=4 value (Map) {
        repeated group field_id=5 key_value {
          required int32 field_id=6 key;
          required boolean field_id=7 value;
        }
      }
    }
  }
  required int32 field_id=8 b;
  required double field_id=9 c;
}

When I print it using DebugPrint, I see:

$ ./parquet-reader nested_maps.snappy.parquet --only-metadata
<some text is omitted for the sake of readability>
Column 0: a.key_value.key (BYTE_ARRAY/UTF8)
Column 1: a.key_value.value.key_value.key (INT32)
Column 2: a.key_value.value.key_value.value (BOOLEAN)
Column 3: b (INT32)
Column 4: c (DOUBLE)
</some text is omitted for the sake of readability>
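The column paths in this listing follow from a depth-first walk over the schema above, joining group names with dots. Below is a minimal sketch of that walk. It does not use the Arrow/Parquet API: `SchemaNode`, `Leaf`, `Group`, `LeafPaths` and `NestedMapsSchema` are hypothetical names introduced only for illustration, and physical types and repetition levels are left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified schema node (not parquet::schema::Node).
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;  // empty for a leaf (primitive) column
};

SchemaNode Leaf(std::string name) { return SchemaNode{std::move(name), {}}; }

SchemaNode Group(std::string name, std::vector<SchemaNode> children) {
  return SchemaNode{std::move(name), std::move(children)};
}

// Depth-first walk: append the dotted path of every leaf below `node`.
void CollectLeafPaths(const SchemaNode& node, const std::string& prefix,
                      std::vector<std::string>* out) {
  assert(out != nullptr);
  const std::string path = prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    out->push_back(path);
    return;
  }
  for (const SchemaNode& child : node.children) {
    CollectLeafPaths(child, path, out);
  }
}

// The root group name (spark_schema) is not part of the column paths.
std::vector<std::string> LeafPaths(const SchemaNode& root) {
  std::vector<std::string> out;
  for (const SchemaNode& child : root.children) {
    CollectLeafPaths(child, "", &out);
  }
  return out;
}

// The schema from this issue, reduced to names and nesting.
SchemaNode NestedMapsSchema() {
  SchemaNode inner_map = Group("value", {Group("key_value", {Leaf("key"), Leaf("value")})});
  SchemaNode outer_map = Group("a", {Group("key_value", {Leaf("key"), inner_map})});
  return Group("spark_schema", {outer_map, Leaf("b"), Leaf("c")});
}
```

Calling `LeafPaths(NestedMapsSchema())` yields five entries, in the same order as Column 0 through Column 4 in the listing.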

When I print it using JSONPrint, I see:

$ ./parquet-reader nested_maps.snappy.parquet --json
<some text is omitted for the sake of readability>
"Columns": [
  { "Id": "0", "Name": "key", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} },
  { "Id": "1", "Name": "key", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "2", "Name": "value", "PhysicalType": "BOOLEAN", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "3", "Name": "b", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
  { "Id": "4", "Name": "c", "PhysicalType": "DOUBLE", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} }
]
</some text is omitted for the sake of readability>

Column 0 and Column 1 have the same Name in the JSON output, which is very confusing. It would be more correct to output the full path of the column (key -> a.key_value.key), as DebugPrint does.

This issue can be corrected by changing a single line: https://github.com/apache/arrow/blob/master/cpp/src/parquet/printer.cc#L218

The proposed patch is attached.
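To illustrate why the leaf name alone is ambiguous and the dotted path is not, here is a small standalone sketch. It is not the patch itself and does not call into Arrow: `ColumnPath`, `LeafName` and `DotPath` are hypothetical helpers. As far as I know, the corresponding pieces in Arrow C++ are `ColumnDescriptor::name()` and `ColumnDescriptor::path()->ToDotString()`, but treat that mapping as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a column path: the field names from the
// top-level field down to the leaf column.
using ColumnPath = std::vector<std::string>;

// What the JSON output shows in this report: only the last path component.
std::string LeafName(const ColumnPath& path) {
  assert(!path.empty());  // a leaf column has at least one component
  return path.back();
}

// What the DebugPrint output shows: all components joined with dots.
std::string DotPath(const ColumnPath& path) {
  assert(!path.empty());
  std::string out = path.front();
  for (std::size_t i = 1; i < path.size(); ++i) {
    out += ".";
    out += path[i];
  }
  return out;
}
```

For Column 0 (`{"a", "key_value", "key"}`) and Column 1 (`{"a", "key_value", "value", "key_value", "key"}`), `LeafName` returns `key` in both cases, while `DotPath` returns two distinct strings.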

Attachments

0001-Make-the-column-name-the-same-for-both-output-format.patch
07/Nov/20 18:45
1 kB
Zosimova Zhanna

Issue Links

links to

GitHub Pull Request #9649

People

Assignee: Zosimova Zhanna

Reporter: Zosimova Zhanna

Votes: 0

Watchers: 4

Dates

Created: 07/Nov/20 18:46

Updated: 11/Jan/23 08:13

Resolved: 10/Mar/21 13:14

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m