Spark / SPARK-27237

Introduce State schema validation among query restart


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels:
      None

      Description

      Even though the Spark Structured Streaming guide clearly documents that "Any change in number or type of grouping keys or aggregates is not allowed.", Spark does nothing to enforce this when end users attempt such a change, which can end in nondeterministic output or unexpected exceptions.

      Even worse, if the query happens not to crash, it can write the new, incompatible values into state, which corrupts the state entirely unless end users manually edit the checkpoint to roll back to a specific batch.

      The restriction is clear: the number of columns, and the data type of each, must not change between query runs. We can store the schema of the state along with the state itself, and on restart verify whether the (possibly new) schema is compatible. With this validation we can stop a query from running and behaving nondeterministically when the schema is incompatible, and give end users a more informative error message.
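The proposed validation can be sketched roughly as follows. This is a minimal illustration, not Spark's actual implementation: the state schema persisted in the checkpoint is compared against the query's current schema on restart, and the query fails fast with an informative error instead of silently corrupting state. All names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateField:
    """One column of the state schema (hypothetical model)."""
    name: str
    data_type: str
    nullable: bool


def is_compatible(stored, current):
    """Compatible when the field count and each field's data type match.

    Field renames and nullability changes are tolerated here; adding,
    removing, or retyping a column is not.
    """
    if len(stored) != len(current):
        return False
    return all(s.data_type == c.data_type for s, c in zip(stored, current))


def validate_on_restart(stored, current):
    """Check the (possibly new) schema against the one stored in the checkpoint.

    Raises an informative error rather than letting the query run and write
    incompatible values into the state store.
    """
    if not is_compatible(stored, current):
        raise ValueError(
            "State schema changed between query runs: "
            f"stored={stored}, current={current}")
```

Under this sketch, a restart with a renamed column would pass validation, while changing a column's type (or the number of columns) would be rejected before the query writes any state.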

              People

              • Assignee:
                kabhwan Jungtaek Lim
              • Reporter:
                kabhwan Jungtaek Lim
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                • Updated:
                • Resolved: