PARQUET-1455: [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf



    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.12.0
    • None


      Background -

      In protobuf, an enum behaves more like an integer than a string, and it is encoded as an integer on the wire.
      Each enum value is associated with an integer, and an enum field can be set directly from a number, whether or not that number corresponds to a value defined in the schema. When an enum field is set to a number that matches no defined value, reading the field through the protobuf reflection API (as parquet-protobuf does) returns a generated label "UNKNOWN_ENUM_<enumName>_<number>". parquet-protobuf therefore writes the string "UNKNOWN_ENUM_<enumName>_<number>" into the enum column whenever its protobuf schema does not recognize the number.
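
      The write-side behavior described above can be sketched in plain Java. This is only an illustration: `EnumSchemaSketch` and `labelFor` are hypothetical names, not part of protobuf or parquet-protobuf, and the label format is the one quoted in this description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a protobuf enum descriptor; NOT the protobuf API.
final class EnumSchemaSketch {
    private final String enumName;
    private final Map<Integer, String> labelsByNumber;

    EnumSchemaSketch(String enumName, Map<Integer, String> labelsByNumber) {
        this.enumName = enumName;
        this.labelsByNumber = new HashMap<>(labelsByNumber);
    }

    // Returns the string that would end up in the Parquet enum column
    // for a given number read from the protobuf message.
    String labelFor(int number) {
        String known = labelsByNumber.get(number);
        if (known != null) {
            return known; // number is defined in the schema
        }
        // Number is not defined in the schema: fall back to a generated label,
        // using the format quoted in this issue (an assumption of this sketch).
        return "UNKNOWN_ENUM_" + enumName + "_" + number;
    }
}
```

      With a schema that defines only RED = 0 and GREEN = 1, `labelFor(1)` yields the defined label, while `labelFor(7)` yields a generated "unknown" label that encodes the number.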


      Problem -

      There are two cases in which parquet-protobuf encounters an unknown enum:
      1. The protobuf message already contains an unknown enum value when it is written to Parquet (people sometimes set enums by number), so the writer stores a label "UNKNOWN_ENUM_*" as a string in Parquet. When the data is read back into protobuf, the reader finds this "true" unknown value.
      2. The protobuf message contains a valid value when it is written to Parquet, but the reader uses an outdated proto schema that lacks some enum values. The values missing from the old schema are "unknown" to the reader.

      The current behavior of the parquet-protobuf reader is to reject both cases with a runtime exception. This does not make sense in case 1: the write path respects protobuf enum behavior while the read path does not. Case 2 should also be handled when the protobuf user is interested in the number rather than the label.
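
      One possible shape for a more tolerant read path is sketched below. It is not the actual fix. `UnknownEnumReaderSketch`, `resolveNumber`, and the `writerSchema` parameter are hypothetical, and the sketch assumes that the writer's label-to-number mapping is somehow available to the reader for case 2, since a plain label alone does not carry its number.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical read-side helper; names and design are assumptions of this sketch.
final class UnknownEnumReaderSketch {
    // Matches the generated label format quoted in this issue.
    private static final Pattern UNKNOWN_LABEL =
            Pattern.compile("UNKNOWN_ENUM_(.+)_(-?\\d+)");

    // Resolves the enum number for a label read from the Parquet enum column.
    static OptionalInt resolveNumber(String label,
                                     String enumName,
                                     Map<String, Integer> readerSchema,
                                     Map<String, Integer> writerSchema) {
        // Normal case: the reader's schema knows the label.
        Integer known = readerSchema.get(label);
        if (known != null) {
            return OptionalInt.of(known);
        }
        // Case 1: the writer stored a generated "unknown" label; recover the number.
        Matcher m = UNKNOWN_LABEL.matcher(label);
        if (m.matches() && m.group(1).equals(enumName)) {
            try {
                return OptionalInt.of(Integer.parseInt(m.group(2)));
            } catch (NumberFormatException e) {
                return OptionalInt.empty(); // digits do not fit in an int
            }
        }
        // Case 2: the label was valid for the writer but is missing from the
        // reader's outdated schema; use the writer's mapping if it is available.
        Integer fromWriter = writerSchema.get(label);
        if (fromWriter != null) {
            return OptionalInt.of(fromWriter);
        }
        // Nothing can be recovered; the caller decides whether to reject.
        return OptionalInt.empty();
    }
}
```

      Returning an empty `OptionalInt` rather than throwing leaves the decision to reject with the caller, so a reader could stay strict for labels that cannot be explained by either case.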



              q.xu Qinghui Xu
              q.xu Qinghui Xu