  1. Spark
  2. SPARK-27237

Introduce State schema validation among query restart

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Even though the Spark Structured Streaming guide clearly documents that "Any change in number or type of grouping keys or aggregates is not allowed.", Spark does nothing to stop end users from making such a change, which can result in nondeterministic output or unexpected exceptions.

      Even worse, if the query happens not to crash, it can write the corrupted values to state, which completely breaks the state unless end users roll back to a specific batch by manually editing the checkpoint.

      The restriction is clear: the number of columns and the data type of each column must not change across query runs. We can store the state schema alongside the state and, on restart, verify that the (possibly) new schema is compatible with the stored one. With this validation we can prevent the query from running and showing nondeterministic behavior when the schema is incompatible, and we can give end users more informative error messages.
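
      The store-and-verify idea can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's implementation: the schema is modeled as an ordered list of (name, type) pairs, it is persisted as a JSON file, and the names `validate_or_store`, `check_compatible` and `StateSchemaIncompatibleError` are hypothetical. The sketch compares only the column count and per-column types, as the description requires, and ignores column names.

```python
import json
import os
import tempfile


class StateSchemaIncompatibleError(Exception):
    """Hypothetical error raised when the new schema does not match the stored one."""


def check_compatible(stored, new):
    """Compare column count and per-column data type; column names are ignored."""
    if len(stored) != len(new):
        raise StateSchemaIncompatibleError(
            f"column count changed: stored={len(stored)}, new={len(new)}"
        )
    for index, ((old_name, old_type), (_, new_type)) in enumerate(zip(stored, new)):
        if old_type != new_type:
            raise StateSchemaIncompatibleError(
                f"column {index} ({old_name!r}) changed type: "
                f"stored={old_type}, new={new_type}"
            )


def validate_or_store(schema_path, new_schema):
    """On first run, persist the schema; on restart, verify against the stored one."""
    if os.path.exists(schema_path):
        with open(schema_path) as f:
            stored = [tuple(field) for field in json.load(f)]
        check_compatible(stored, new_schema)
    else:
        with open(schema_path, "w") as f:
            json.dump(new_schema, f)


# Usage: the first run stores the schema, an unchanged restart passes,
# and a restart with a changed aggregate type is rejected before it touches state.
checkpoint_dir = tempfile.mkdtemp()
path = os.path.join(checkpoint_dir, "state_schema.json")  # hypothetical file name

validate_or_store(path, [("word", "string"), ("count", "long")])
validate_or_store(path, [("word", "string"), ("count", "long")])
try:
    validate_or_store(path, [("word", "string"), ("count", "int")])
except StateSchemaIncompatibleError as e:
    print(f"query restart rejected: {e}")
```

      Failing before any state is read or written is the point of the check: the user gets a message naming the offending column instead of a corrupted state store.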

            People

              Assignee: kabhwan Jungtaek Lim
              Reporter: kabhwan Jungtaek Lim