Spark / SPARK-39839

Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.2.0, 3.3.0
    • Fix Version/s: 3.3.1, 3.2.3, 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      The UnsafeRow structural integrity check in UnsafeRowUtils.validateStructuralIntegrity was added in Spark 3.1.0. It is supposed to validate that a given UnsafeRow conforms to the format that UnsafeRowWriter would have produced.

      Currently the check expects that every field marked as null also has its fixed-length part set to all zeros. It needs to be updated to handle a special case for variable-length Decimals, where UnsafeRowWriter may mark a field as null but leave the fixed-length part of the field as OffsetAndSize(offset=current_offset, size=0). This can happen when the Decimal being written is either a real null or has overflowed the specified precision.
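
      For illustration only (not part of this ticket), a minimal sketch of a row that hits this case, loosely modeled on the style of UnsafeRowWriterSuite; the class name NullDecimalRepro is made up:

        import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
        import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
        import org.apache.spark.sql.catalyst.util.UnsafeRowUtils;
        import org.apache.spark.sql.types.DataTypes;
        import org.apache.spark.sql.types.Decimal;
        import org.apache.spark.sql.types.StructType;

        public class NullDecimalRepro {
          public static void main(String[] args) {
            // One decimal field with precision > Decimal.MAX_LONG_DIGITS (18), so the value
            // is stored in the variable-length region rather than inlined as a long.
            StructType schema = new StructType().add("d", DataTypes.createDecimalType(38, 18));

            UnsafeRowWriter writer = new UnsafeRowWriter(1);
            writer.resetRowWriter();
            // Writing a null Decimal takes the special code path quoted below: the null bit is
            // set, but the fixed-length slot keeps OffsetAndSize(offset=current_offset, size=0).
            writer.write(0, (Decimal) null, 38, 18);
            UnsafeRow row = writer.getRow();

            System.out.println(row.isNullAt(0));  // true
            System.out.println(row.getLong(0));   // non-zero: the offset sits in the upper 32 bits

            // On the affected versions (3.1.0 - 3.3.0) this returns false for a well-formed row;
            // with the fix the check accepts it.
            System.out.println(UnsafeRowUtils.validateStructuralIntegrity(row, schema));
          }
        }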

      Logic in UnsafeRowWriter:

      in general:

        public void setNullAt(int ordinal) {
          BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
          write(ordinal, 0L);                                      // also zero out the fixed-length field
        } 

      special case for DecimalType:

            // Make sure Decimal object has the same scale as DecimalType.
            // Note that we may pass in null Decimal object to set null for it.
            if (input == null || !input.changePrecision(precision, scale)) {
              BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
              // keep the offset for future update
              setOffsetAndSize(ordinal, 0);                            // doesn't zero out the fixed-length field
            } 

      The special case was introduced to keep all DecimalTypes (both fixed-length and variable-length ones) mutable, so the writer needs to reserve space for the variable-length value even when it is currently null.
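
      As a rough sketch of why the kept offset matters (again not from this ticket; same imports as the sketch above), it is what allows the reserved 16-byte region to be filled in place later, e.g. via UnsafeRow.setDecimal:

        // Setup as in the sketch above: a row whose variable-length decimal was
        // written as null, with the offset kept in the fixed-length slot.
        UnsafeRowWriter writer = new UnsafeRowWriter(1);
        writer.resetRowWriter();
        writer.write(0, (Decimal) null, 38, 18);
        UnsafeRow row = writer.getRow();

        // Because the offset was kept, the reserved 16-byte region can be updated in place.
        Decimal d = Decimal.apply("123.456789");
        d.changePrecision(38, 18);        // align the scale with DecimalType(38, 18)
        row.setDecimal(0, d, 38);         // writes the unscaled bytes and clears the null bit

        System.out.println(row.isNullAt(0));            // false
        System.out.println(row.getDecimal(0, 38, 18));  // 123.456789, padded to scale 18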

      Note that this special case in UnsafeRowWriter has been there since Spark 1.6.0, whereas the integrity check was only added in Spark 3.1.0. The check was originally added for Structured Streaming's checkpoint evolution validation, so that a newer version of Spark can check whether an older Structured Streaming checkpoint file is still supported and whether its contents are corrupted.

      Attachments

      Activity

      People

        Assignee: Kris Mok (rednaxelafx)
        Reporter: Kris Mok (rednaxelafx)
        Votes: 0
        Watchers: 3

      Dates

        Created:
        Updated:
        Resolved: