Description
The UnsafeRow structural integrity check in UnsafeRowUtils.validateStructuralIntegrity was added in Spark 3.1.0. It is supposed to validate that a given UnsafeRow conforms to the format that the UnsafeRowWriter would have produced.
Currently the check expects that every field marked as null also has its fixed-length part set to all zeros. It needs to be updated to handle a special case for variable-length Decimals, where the UnsafeRowWriter may mark a field as null but leave the fixed-length part of the field as OffsetAndSize(offset=current_offset, size=0). This can happen when the Decimal being written is either a real null or has overflowed the specified precision.
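For illustration, here is a minimal sketch (not the actual UnsafeRowUtils code) of the relaxed rule the check needs for null fields. The method name isValidNullField and its signature are made up for this example; the slot decoding follows the OffsetAndSize layout described above (offset in the upper 32 bits, size in the lower 32 bits):
{code:java}
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.Decimal;
import org.apache.spark.sql.types.DecimalType;

final class NullFieldCheckSketch {
  // fixedSlot is the raw 8-byte fixed-length word of a field whose null bit is set.
  static boolean isValidNullField(long fixedSlot, DataType fieldType) {
    if (fixedSlot == 0L) {
      // Normal case: setNullAt also zeroes out the fixed-length part.
      return true;
    }
    if (fieldType instanceof DecimalType
        && ((DecimalType) fieldType).precision() > Decimal.MAX_LONG_DIGITS()) {
      // Variable-length decimal: the writer may leave OffsetAndSize(offset, 0),
      // i.e. a non-zero offset in the upper 32 bits with size == 0 in the lower 32 bits.
      int offset = (int) (fixedSlot >>> 32);
      int size = (int) fixedSlot;
      return offset > 0 && size == 0;
    }
    return false;
  }
}
{code}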
Logic in UnsafeRowWriter:
in general:
{code:java}
public void setNullAt(int ordinal) {
  BitSetMethods.set(getBuffer(), startingOffset, ordinal);  // set null bit
  write(ordinal, 0L);                                       // also zero out the fixed-length field
}
{code}
special case for DecimalType:
{code:java}
// Make sure Decimal object has the same scale as DecimalType.
// Note that we may pass in null Decimal object to set null for it.
if (input == null || !input.changePrecision(precision, scale)) {
  BitSetMethods.set(getBuffer(), startingOffset, ordinal);  // set null bit
  // keep the offset for future update
  setOffsetAndSize(ordinal, 0);                             // doesn't zero out the fixed-length field
}
{code}
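To make the difference between the two paths concrete, here is a small standalone sketch assuming the public UnsafeRowWriter API from Spark 3.x; the printed values are illustrative for a two-field row:
{code:java}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

public class NullSlotDemo {
  public static void main(String[] args) {
    // Two fields: ordinal 0 is an ordinary fixed-length field, ordinal 1 is a decimal(20, 2),
    // which is variable-length because its precision exceeds 18.
    UnsafeRowWriter writer = new UnsafeRowWriter(2);
    writer.resetRowWriter();

    writer.setNullAt(0);            // general path: null bit set, fixed-length part zeroed
    writer.write(1, null, 20, 2);   // decimal path: null bit set, slot kept as OffsetAndSize(offset, 0)

    UnsafeRow row = writer.getRow();
    System.out.println(row.isNullAt(0) + " " + row.isNullAt(1));   // true true
    System.out.println(row.getLong(0));                            // 0
    System.out.println(Long.toHexString(row.getLong(1)));          // non-zero, e.g. 1800000000 (offset 24, size 0)
  }
}
{code}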
The special case was introduced to allow all {{DecimalType}}s (including both fixed-length and variable-length ones) to be mutable, which requires leaving space for the variable-length field even if it is currently null.
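That kept offset is what makes the in-place update possible later, via UnsafeRow.setDecimal. A minimal sketch, assuming Spark 3.x's UnsafeRow / UnsafeRowWriter APIs and an illustrative decimal(20, 2) value:
{code:java}
import java.math.BigDecimal;

import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
import org.apache.spark.sql.types.Decimal;

public class MutableDecimalDemo {
  public static void main(String[] args) {
    UnsafeRowWriter writer = new UnsafeRowWriter(1);
    writer.resetRowWriter();
    writer.write(0, null, 20, 2);          // null decimal(20, 2): null bit set, offset kept in the slot

    UnsafeRow row = writer.getRow();
    System.out.println(row.isNullAt(0));   // true

    // The in-place update relies on the offset that the writer kept for this field.
    Decimal d = Decimal.apply(new BigDecimal("123456789012345678.90"));
    row.setDecimal(0, d, 20);
    System.out.println(row.getDecimal(0, 20, 2));   // 123456789012345678.90
  }
}
{code}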
Note that this special case in UnsafeRowWriter has been there since Spark 1.6.0, whereas the integrity check was added in Spark 3.1.0. The check was originally added for Structured Streaming's checkpoint evolution validation, so that a newer version of Spark can check whether an older checkpoint file for Structured Streaming queries can still be supported, and/or whether the contents of the checkpoint file are corrupted.