Currently, index rebuilds for global index tables are done on the server side. The Phoenix client generates an aggregate plan using ServerBuildIndexCompiler to scan every data table row on the server side. This compiler sets the scan attributes so that the row mutations scanned by UngroupedRegionObserver are replayed on the data table, which rebuilds the index table rows. During this replay, data table row updates are skipped and only index table rows are updated.
Phoenix allows column entries to have null values. A null value is represented by an HBase column delete marker. This means that an index rebuild must replay these delete markers along with the put mutations. To do that, ServerBuildIndexCompiler should use raw scans, but currently it uses regular scans. This leads to incorrect index rebuilds when null values are used.
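The effect can be illustrated with a toy model. This is only a sketch, not Phoenix or HBase code: the `Cell` type and the `scan` and `replay` helpers are hypothetical, a column delete marker is assumed to mask all older versions of that column, and the index row is assumed to already hold a stale covered value before the rebuild.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one HBase cell version."""
    column: str
    ts: int
    kind: str          # "put" or "delete" (column delete marker)
    value: str = ""


def scan(cells: List[Cell], raw: bool) -> List[Cell]:
    """Return cells newest-first.

    A raw scan returns every cell, including delete markers.
    A regular scan returns only the newest cell per column, and
    hides the column entirely if that newest cell is a delete marker.
    """
    ordered = sorted(cells, key=lambda c: c.ts, reverse=True)
    if raw:
        return ordered
    visible, seen = [], set()
    for cell in ordered:
        if cell.column in seen:
            continue
        seen.add(cell.column)
        if cell.kind == "put":
            visible.append(cell)
    return visible


def replay(index_row: Dict[str, str], cells: List[Cell]) -> Dict[str, str]:
    """Apply the scanned cells to a copy of the index row, oldest first."""
    rebuilt = dict(index_row)
    for cell in sorted(cells, key=lambda c: c.ts):
        if cell.kind == "put":
            rebuilt[cell.column] = cell.value
        else:
            rebuilt.pop(cell.column, None)  # delete marker clears the column
    return rebuilt


# Data table history: b and c are written, then c is set to null.
history = [
    Cell("b", 1, "put", "b1"),
    Cell("c", 1, "put", "c1"),
    Cell("c", 2, "delete"),
]
# Assumption: the index row still carries the old covered value.
stale_index_row = {"b": "b1", "c": "c1"}

with_regular_scan = replay(stale_index_row, scan(history, raw=False))
with_raw_scan = replay(stale_index_row, scan(history, raw=True))

print(with_regular_scan)  # {'b': 'b1', 'c': 'c1'}  stale value survives
print(with_raw_scan)      # {'b': 'b1'}             c is null, as expected
```

In this model the regular scan never surfaces the delete marker, so the replay has nothing that clears the covered column. The raw scan hands the marker to the replay, and the stale value is removed.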
A simple test is sufficient to reproduce this problem: a data table with one global index whose covered column can take a null value.
- Create a data table with columns a, b, and c, where a is the primary key and c can be null
- Write one row with non-null values
- Overwrite the covered column with null (i.e., set it to null)
- Create an index on the table where b is the secondary key and c is a covered column
- Rebuild the index
- Dump the index table
The index table row should have a null value for the covered column. However, it has the non-null value written in the second step.