Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-16289

Values.parseString on heterogeneous lists and maps sometimes corrupts data by inferring incorrect schema

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 3.8.0
    • connect
    • None

    Description

      The Values.parseString function makes a best-effort conversion of strings to Connect-schema'd data. It supports reading arrays and maps as delimited by `[,]` and `{:,}` characters, and attempts to infer the common type of these structures from the types of the elements. The algorithm it follows is:

      1. Parse the elements of the list in one-pass. Infer the smallest/strictest type which can contain each value individually.
      2. Iterate over the schemas inferred for each element, and repeatedly merge two schemas together to the smallest type which covers both element schemas.
      3. Convert the parsed elements to the common element schema.

      The implementation of step 2 here: https://github.com/apache/kafka/blob/ead2431c37ace9255df88ffe819bb905311af088/connect/api/src/main/java/org/apache/kafka/connect/data/Values.java#L805-L823 has a flaw in it however. The `elementSchema` variable has `null` as a sentinel both of the situations "no elements seen so far" and "no common schema possible" among the seen elements.

      When processing the first element of the list, `null` is used to adopt the schema of the first element as the initial common schema. Later when an incompatible element is found, the common schema is set to null to indicate that there is no common element schema. However, a following iteration can misinterpret the `null` as being at the start of the list again, and inject a schema which works for some of the elements and not others.

      When the values are converted in step 3, each element has one of the following happen:
      1. The value is left-as is (e.g. no common schema inferred)
      2. The value is converted correctly to the destination type (e.g. int -> long)
      3. An exception is thrown because the type could not be converted (e.g. string -> struct)
      4. The value is silently corrupted (e.g. long -> int, decimal -> long)

      In normal circumstances either case (1) happens to all of the elements, or case(2) does, depending on if a common schema was found. But when this bug is triggered by having heterogeneous types, case (2) or case (3) can happen to some of the elements in the array.

      The effects depend on the order of elements in the array, as the sentinel value bug is dependent on the iteration order of the elements. For example:

      • `[1,2,"3"]` returns Byte, Byte, String
      • `["1",2,3]` returns Byte, Byte, Byte (safely converts the data, case 2)
      • `[1,2,{}]` returns Byte, Byte, Map
      • `[{},2,3]` experiences an exception and returns String (exception, case 3)
      • `[1, 2, 1000000000000000000000]` returns Byte, Byte, BigDecimal
      • `[1000000000000000000000, 1, 2]` returns Byte, Byte, Byte (corruption, case 4)

      Fixing this bug would entail changing all of these to return heterogeneous lists without a common schema, and not convert the values at all. However, this is a backwards-incompatible change because these are all situations in which we return data without an exception, so downstream users could be relying on the result.

      However, this behavior is very opaque and unpredictable, and I think anyone that observes this in the wild would need to work-around it or avoid it, rather than rely on it happening. I think that fixing it to address the silent corruption case is a bigger benefit to users than the harm done by changing the other cases.

      Attachments

        Issue Links

          Activity

            People

              gharris1727 Greg Harris
              gharris1727 Greg Harris
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: