Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.16.0
- None
Description
Drill's value vectors contain many counts that must be maintained in sync. Drill provides a utility, BatchValidator, to check (a subset of) these values for consistency.
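To make the kind of check concrete, here is a minimal sketch of a row-count versus value-count comparison. It is a simplified model, not Drill's BatchValidator: `VectorCountCheck`, `VectorInfo`, and `findErrors` are hypothetical names, and the real utility may check more (or different) counts. Only the message format is taken from the log output reported in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a row-count vs. value-count check.
// Hypothetical code for illustration; this is not Drill's BatchValidator.
final class VectorCountCheck {

  // Hypothetical stand-in for a value vector: only the fields the check needs.
  static final class VectorInfo {
    final String name;
    final String type;
    final int valueCount;

    VectorInfo(String name, String type, int valueCount) {
      this.name = name;
      this.type = type;
      this.valueCount = valueCount;
    }
  }

  // Returns one message per vector whose value count differs from the batch row count.
  static List<String> findErrors(int rowCount, List<VectorInfo> vectors) {
    List<String> errors = new ArrayList<>();
    for (VectorInfo v : vectors) {
      if (v.valueCount != rowCount) {
        errors.add(String.format("%s - %s: Row count = %d, but value count = %d",
            v.name, v.type, rowCount, v.valueCount));
      }
    }
    return errors;
  }

  public static void main(String[] args) {
    // Counts mirror one of the FilterRecordBatch failures reported in this issue.
    List<VectorInfo> batch = Arrays.asList(
        new VectorInfo("n_nationkey", "IntVector", 25),
        new VectorInfo("n_regionkey", "IntVector", 25));
    for (String error : findErrors(2, batch)) {
      System.out.println(error);
    }
  }
}
```

Returning the full list of mismatches, rather than stopping at the first, matches how the log groups several vector errors under a single operator.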
The IteratorValidatorBatchIterator class is used in tests to validate the state of each operator (AKA "record batch") as Drill runs the Volcano iterator. This class can also validate vectors by setting the VALIDATE_VECTORS constant to `true`.
With VALIDATE_VECTORS set to `true`, the unit tests were run and many failed. Examples:
[INFO] Running org.apache.drill.TestUnionDistinct
18:44:26.742 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    key - NullableBitVector: Row count = 0, but value count = 2
18:44:26.745 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    key - NullableBitVector: Row count = 0, but value count = 2
[INFO] Running org.apache.drill.TestUnionDistinct
8:44:48.302 [22d4256e-c90b-847c-5104-02d6cdf5223e:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    key - NullableBitVector: Row count = 0, but value count = 2
18:44:48.703 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    n_nationkey - IntVector: Row count = 2, but value count = 25
    n_regionkey - IntVector: Row count = 2, but value count = 25
18:44:48.731 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    n_nationkey - IntVector: Row count = 4, but value count = 25
    n_regionkey - IntVector: Row count = 4, but value count = 25
18:44:49.039 [22d4256f-6b39-d2ab-d145-4f2b0db315a3:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    n_nationkey - IntVector: Row count = 2, but value count = 25
18:44:49.363 [22d4256e-3d91-850f-9ab4-5939219ac0d0:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    c_custkey - IntVector: Row count = 4, but value count = 1500
18:44:49.597 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    n_nationkey - IntVector: Row count = 5, but value count = 25
    n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:49.610 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    r_regionkey - IntVector: Row count = 1, but value count = 5
18:44:53.029 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    n_nationkey - IntVector: Row count = 0, but value count = 25
    n_name - VarCharVector: Row count = 0, but value count = 25
    n_regionkey - IntVector: Row count = 0, but value count = 25
18:44:53.033 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:53.331 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    n_nationkey - IntVector: Row count = 5, but value count = 25
    n_name - VarCharVector: Row count = 5, but value count = 25
    n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:53.337 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    n_regionkey - IntVector: Row count = 0, but value count = 25
18:44:53.646 [22d42569-c293-ced0-c3d0-e9153cc4a70a:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from LimitRecordBatch
    key - NullableBitVector: Row count = 0, but value count = 2
Running org.apache.drill.TestTpchSingleMode
18:45:01.299 [22d42563-0ed6-1501-86a1-4cb375a9cad4:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
Running org.apache.drill.TestMergeFilterPlan
18:45:03.738 [22d4255f-b322-fd56-2f93-34b7f5c709c1:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    o_orderkey - IntVector: Row count = 561, but value count = 15000
    o_orderdate - DateVector: Row count = 561, but value count = 15000
    o_orderpriority - VarCharVector: Row count = 561, but value count = 15000
18:45:03.828 [22d4255f-b322-fd56-2f93-34b7f5c709c1:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    l_orderkey - IntVector: Row count = 20580, but value count = 32767
    l_commitdate - DateVector: Row count = 20580, but value count = 32767
    l_receiptdate - DateVector: Row count = 20580, but value count = 32767
18:45:03.990 [22d4255f-b322-fd56-2f93-34b7f5c709c1:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    l_orderkey - IntVector: Row count = 17317, but value count = 27408
    l_commitdate - DateVector: Row count = 17317, but value count = 27408
    l_receiptdate - DateVector: Row count = 17317, but value count = 27408
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.041 s - in org.apache.drill.TestMergeFilterPlan
18:45:04.929 [22d4255f-040c-f4c9-7d23-b90702db4a1e:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    o_orderkey - IntVector: Row count = 2287, but value count = 15000
    o_custkey - IntVector: Row count = 2287, but value count = 15000
    o_orderdate - DateVector: Row count = 2287, but value count = 15000
18:45:04.944 [22d4255f-040c-f4c9-7d23-b90702db4a1e:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    r_regionkey - IntVector: Row count = 1, but value count = 5
    r_name - VarCharVector: Row count = 1, but value count = 5
[INFO] Running org.apache.drill.TestSelectWithOption
18:45:06.120 [22d4255e-5f13-aabb-40bb-bd09dc3d35e1:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    l_quantity - Float8Vector: Row count = 594, but value count = 32767
    l_extendedprice - Float8Vector: Row count = 594, but value count = 32767
    l_discount - Float8Vector: Row count = 594, but value count = 32767
    l_shipdate - DateVector: Row count = 594, but value count = 32767
18:45:06.156 [22d4255e-5f13-aabb-40bb-bd09dc3d35e1:frag:0:0] ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from FilterRecordBatch
    l_quantity - Float8Vector: Row count = 543, but value count = 27408
    l_extendedprice - Float8Vector: Row count = 543, but value count = 27408
    l_discount - Float8Vector: Row count = 543, but value count = 27408
    l_shipdate - DateVector: Row count = 543, but value count = 27408
And many, many more. (Note that the test names might not be accurate: Maven runs multiple tests in parallel and it is hard to correlate log messages with tests in this output format.)
The problem with these errors is that they make operators very fragile: once we accept invalid vectors, it is very hard to detect when an operator makes vectors even more invalid. It is also hard to reason about the code if the inputs (or outputs) can be corrupt in normal operation.
Suggestions:
1. Extend BatchValidator to cover the vector types not yet checked (maps, repeated maps).
2. Work step-by-step through tests.
3. Identify operators that corrupt vectors.
4. Fix the source of corruption and retest.
5. Continue until no vector corruption errors occur.
6. Change the IteratorValidatorBatchIterator to check vectors by default, and to throw a fatal error if corruption is found.
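The last suggestion, validating by default and treating corruption as fatal, could look roughly like the sketch below. Everything here is hypothetical (`FailFastValidation`, `validate`, and the map of value counts are illustrative stand-ins); it does not reproduce IteratorValidatorBatchIterator, and only the constant name and message prefix echo the text and log in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of suggestion 6: validation enabled by default, corruption is fatal.
// Hypothetical code; this is not Drill's IteratorValidatorBatchIterator.
final class FailFastValidation {

  // Plays the role of the VALIDATE_VECTORS constant, but enabled by default.
  static final boolean VALIDATE_VECTORS = true;

  // Throws if any vector's value count disagrees with the batch row count.
  // valueCounts maps a (hypothetical) vector name to its value count.
  static void validate(String operator, int rowCount, Map<String, Integer> valueCounts) {
    if (!VALIDATE_VECTORS) {
      return;
    }
    StringBuilder errors = new StringBuilder();
    for (Map.Entry<String, Integer> e : valueCounts.entrySet()) {
      if (e.getValue() != rowCount) {
        errors.append(String.format("%n  %s: Row count = %d, but value count = %d",
            e.getKey(), rowCount, e.getValue()));
      }
    }
    if (errors.length() > 0) {
      // Fail the query instead of only logging, so corruption cannot go unnoticed.
      throw new IllegalStateException(
          "Found one or more vector errors from " + operator + errors);
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("r_regionkey", 5);
    try {
      validate("FilterRecordBatch", 1, counts);
    } catch (IllegalStateException ex) {
      System.out.println("Fatal: " + ex.getMessage());
    }
  }
}
```

Throwing rather than logging is the design point: a test that produces an inconsistent batch then fails outright, instead of passing with an ERROR line buried in parallel Maven output.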
Issue Links
- incorporates
  - DRILL-7325 Many operators do not set container record count (Resolved)
- is contained by
  - DRILL-7325 Many operators do not set container record count (Resolved)
- is duplicated by
  - DRILL-7303 Filter record batch does not handle zero-length batches (Resolved)
  - DRILL-7305 Multiple operators do not handle empty batches (Resolved)
  - DRILL-7311 Partial fixes for empty batch bugs (Resolved)
- links to