[ORC-1024] BloomFilter hash computation is inconsistent between Java and C++ clients - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.7.0, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11
Fix Version/s: 1.7.1, 1.6.12
Component/s: C++
Labels:
- releasenotes

Docs Text:
Because of the inconsistent hashing bug in bloom filters, bloom filters written by old C++ clients are not used when reading ORC files. This may result in a performance regression.

Description

drorke found that the C++ reader can incorrectly filter out some row groups when reading Hive-generated ORC files with a SearchArgument of the form "x = value" for certain values. This only happens when Hive writes bloom filters into these files.

I finally reproduced this by using the Java tool (with ~~ORC-1023~~) to generate an ORC file with bloom filters and reading it with the C++ reader. The ORC file is attached (id_name_with_bloom_filters.orc). It contains 2 columns and 3 rows:

{"id": 0, "name": "Alice"}
{"id": 1, "name": "Bob"}
{"id": 18000000000, "name": "Mike"}

With the SearchArgument "id = 18000000000", the C++ reader returns no rows.

Looking at the code, the Java implementation uses long as the hash key, while the C++ implementation uses uint64_t. long in Java is signed, so it should correspond to int64_t in C++. I think this causes the issue.

In the Java code, the hash key of 18000000000 is -1097054448615658549. In the C++ code, it is 15298148493198126027. The two keys therefore produce different results in testHash().
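As a quick sanity check on the two reported numbers, the following sketch reinterprets the Java value as an unsigned 64-bit integer and compares it with the C++ value. The constant names are hypothetical and exist only for this illustration; nothing here is taken from the ORC sources.

```cpp
#include <cstdint>

// Hash keys quoted in this report for the value 18000000000.
constexpr int64_t kJavaHash = -1097054448615658549LL;
constexpr uint64_t kCppHash = 15298148493198126027ULL;

// Reinterpret the Java value as an unsigned 64-bit pattern (modulo 2^64).
constexpr uint64_t kJavaHashBits = static_cast<uint64_t>(kJavaHash);

// If both clients computed the same 64 bits and only printed them
// differently, these two values would be equal.
constexpr bool kSameBitPattern = (kJavaHashBits == kCppHash);
```

Here kJavaHashBits is 17349689625093893067, which is not the C++ value, so the two clients compute different 64-bit patterns rather than showing the same bits with different signedness.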

Java code:
https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/java/core/src/java/org/apache/orc/util/BloomFilter.java#L195-L204
C++ code:
https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/c%2B%2B/src/BloomFilter.cc#L106-L115
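One way signedness can change a hash result is through right shifts: on a signed value with the sign bit set, `>>` fills the vacated bits with ones, while on an unsigned value it fills them with zeros. The sketch below is not the ORC code. It assumes a hash that contains an xor-shift step such as `key ^ (key >> 24)`, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: one xor-shift mixing step on an unsigned key.
// For uint64_t, >> is a logical shift: vacated high bits become 0.
inline uint64_t mixUnsigned(uint64_t key) {
  return key ^ (key >> 24);
}

// Hypothetical helper: the same step on a signed key, matching how Java
// treats `long` with `>>`. For negative int64_t values, >> is an arithmetic
// shift on mainstream compilers (and guaranteed since C++20): vacated high
// bits become 1.
inline int64_t mixSigned(int64_t key) {
  return key ^ (key >> 24);
}

// Returns true when both variants yield the same 64-bit pattern for `key`.
inline bool sameBits(int64_t key) {
  return static_cast<uint64_t>(mixSigned(key)) ==
         mixUnsigned(static_cast<uint64_t>(key));
}
```

With these helpers, sameBits(18000000000) is true because the input is non-negative, while sameBits(-1) is false. In a multi-step hash, an intermediate value can have its sign bit set even when the original input is non-negative, and that is where a signed and an unsigned implementation can start to diverge.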

Attachments

id_name_with_bloom_filters.orc
10/Oct/21 09:10
0.4 kB
Quanlong Huang

Issue Links

blocks

IMPALA-10873 Push down EQUALS, IS NULL and IN-list predicate to ORC reader

Resolved

causes

ORC-1043 Fix C++ conversion compilation error in CentOS 7

Closed

links to

GitHub Pull Request #934

GitHub Pull Request #937

People

Assignee: Quanlong Huang

Reporter: Quanlong Huang

Votes: 0

Watchers: 4

Dates

Created: 10/Oct/21 09:19

Updated: 02/Feb/22 23:52

Resolved: 15/Oct/21 05:20