Apache Arrow / ARROW-10514

[C++][Parquet] Data inconsistency in parquet-reader output modes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0
    • Component/s: C++

    Description

      I tried to print the metadata of a Parquet file with nested maps using the parquet-reader tool.

      This file has the following structure:

      required group field_id=0 spark_schema {
        optional group field_id=1 a (Map) {
          repeated group field_id=2 key_value {
            required binary field_id=3 key (String);
            optional group field_id=4 value (Map) {
              repeated group field_id=5 key_value {
                required int32 field_id=6 key;
                required boolean field_id=7 value;
              }
            }
          }
        }
        required int32 field_id=8 b;
        required double field_id=9 c;
      } 

      When I print it using DebugPrint, I see:

      $ ./parquet-reader nested_maps.snappy.parquet --only-metadata
      <some text is omitted for the sake of readability>
      Column 0: a.key_value.key (BYTE_ARRAY/UTF8)
      Column 1: a.key_value.value.key_value.key (INT32)
      Column 2: a.key_value.value.key_value.value (BOOLEAN)
      Column 3: b (INT32)
      Column 4: c (DOUBLE)
      </some text is omitted for the sake of readability>

      When I print it using JSONPrint, I see:

      $ ./parquet-reader nested_maps.snappy.parquet --json
      <some text is omitted for the sake of readability>
      "Columns": [
        { "Id": "0", "Name": "key", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} },
        { "Id": "1", "Name": "key", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
        { "Id": "2", "Name": "value", "PhysicalType": "BOOLEAN", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
        { "Id": "3", "Name": "b", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
        { "Id": "4", "Name": "c", "PhysicalType": "DOUBLE", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} }
      ]
      </some text is omitted for the sake of readability>

      Column 0 and Column 1 have the same Name in the JSON output, which is very confusing. It would be more correct to output the full path of the column (key -> a.key_value.key), as DebugPrint already does.
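
      To illustrate the difference between the two naming schemes, here is a minimal, self-contained sketch. It does not use the Parquet library: ColumnPath, LeafName, DottedPath, ExampleColumns and PrintColumns are hypothetical names introduced only for this example, and the paths are copied from the DebugPrint output above. In the Parquet C++ API the corresponding accessors are, as far as I can tell, ColumnDescriptor::name() and ColumnDescriptor::path()->ToDotString(), but the sketch does not depend on that.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for a leaf column: the node names from just below
// the schema root ("spark_schema") down to the leaf.
using ColumnPath = std::vector<std::string>;

// Mirrors what the JSON output shows: only the last path component.
std::string LeafName(const ColumnPath& path) {
  return path.empty() ? std::string() : path.back();
}

// Mirrors what the issue proposes: all components joined with dots.
std::string DottedPath(const ColumnPath& path) {
  std::string out;
  for (std::size_t i = 0; i < path.size(); ++i) {
    if (i > 0) out += '.';
    out += path[i];
  }
  return out;
}

// The five leaf columns of nested_maps.snappy.parquet, as listed by
// the --only-metadata output in this report.
std::vector<ColumnPath> ExampleColumns() {
  return {
      {"a", "key_value", "key"},
      {"a", "key_value", "value", "key_value", "key"},
      {"a", "key_value", "value", "key_value", "value"},
      {"b"},
      {"c"},
  };
}

// Prints both forms side by side so the collision is visible.
void PrintColumns() {
  const std::vector<ColumnPath> columns = ExampleColumns();
  for (std::size_t i = 0; i < columns.size(); ++i) {
    std::cout << i << ": name=" << LeafName(columns[i])
              << " path=" << DottedPath(columns[i]) << "\n";
  }
}
```

      Calling PrintColumns() from main shows that columns 0 and 1 share the leaf name "key", while their dotted paths stay distinct.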

       

      This issue can be corrected by changing a single line: https://github.com/apache/arrow/blob/master/cpp/src/parquet/printer.cc#L218

       

      The proposed patch is in the attachment.

            People

              Assignee: FawnD2 Zosimova Zhanna
              Reporter: FawnD2 Zosimova Zhanna
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 10m