Description
The original ORC RLE bit-packing decoder decodes values one by one. Intel AVX-512 provides 512-bit vector operations that can accelerate this decode process: far fewer CPU instructions are needed to unpack the same amount of data, so AVX-512 vector decode performs much better than the original path. In the functional micro-performance test, I estimate that the vector function unrolledUnpackVectorN brings an average 6X ~ 7X latency improvement over the original RLE bit-packing decode function plainUnpackLongs. In practice, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode makes this processing more efficient.
In practice, data values in ORC are usually narrower than 32 bits, so this PR supplies vector decode code for value sizes below 32 bits. For values whose size is a multiple of 8 bits (8-bit, 16-bit, and so on), the performance improvement is relatively small compared with sizes that are not a multiple of 8 bits.
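For reference, the one-value-at-a-time decoding that the vector path replaces can be sketched as follows. This is a minimal illustration, not the ORC code: the function name unpackScalar is hypothetical, and it assumes values are packed most-significant bit first with a bit width between 1 and 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ORC): decodes `count` values of `bitWidth`
// bits each from a buffer packed most-significant bit first, one value at a time.
std::vector<uint32_t> unpackScalar(const std::vector<uint8_t>& packed,
                                   uint32_t bitWidth, size_t count) {
  assert(bitWidth >= 1 && bitWidth <= 32);
  assert(packed.size() * 8 >= static_cast<uint64_t>(bitWidth) * count);

  std::vector<uint32_t> out;
  out.reserve(count);

  size_t bytePos = 0;     // index of the next byte to load
  uint32_t bitsLeft = 0;  // unread bits remaining in `current`
  uint8_t current = 0;    // byte currently being consumed

  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    uint32_t need = bitWidth;
    while (need > 0) {
      if (bitsLeft == 0) {
        current = packed[bytePos++];
        bitsLeft = 8;
      }
      // Take as many bits as possible from the current byte (at most 8).
      const uint32_t take = need < bitsLeft ? need : bitsLeft;
      const uint32_t chunk =
          (static_cast<uint32_t>(current) >> (bitsLeft - take)) & ((1u << take) - 1u);
      value = (value << take) | chunk;
      bitsLeft -= take;
      need -= take;
    }
    out.push_back(static_cast<uint32_t>(value));
  }
  return out;
}
```

When the bit width is a multiple of 8, every value starts on a byte boundary and the inner loop reduces to plain byte loads, which is one plausible reason such widths gain less from vectorization than the unaligned widths.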
Official Intel reference for AVX-512 instructions:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
1. Added a CMake option named "ENABLE_AVX512_BIT_PACKING" to enable or disable this feature at build time.
The default value of ENABLE_AVX512_BIT_PACKING is OFF.
For example: cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native" -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON -DSNAPPY_HOME=/usr/local
2. Added the macro "ENABLE_AVX512" to control whether this feature's code is built into ORC.
3. Added the function "detect_platform" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but runs on a platform that does not support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
4. Added the functions "unrolledUnpackVectorN" to decode N-bit values in place of the original functions plainUnpackLongs and unrolledUnpackN.
5. Added the test cases "RleV2_basic_vector_decode_Nbit" in the new test file TestRleVectorDecoder.cc to verify AVX-512 vector decode of N-bit values.
6. Modified the function plainUnpackLongs to add an output parameter uint64_t& startBit, which stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data in each unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is shorter than 512 bits, AVX-512 is not used, and decoding falls back to the original path, which unpacks values one by one.
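The runtime check in item 3 and the 512-bit threshold in item 7 can be sketched together as below. This is only an illustration under stated assumptions: cpuSupportsAvx512, DecodePlan and planDecode are hypothetical names rather than the functions added in this PR, the detection relies on the GCC/Clang builtin __builtin_cpu_supports and checks only the AVX-512 Foundation feature, and the split into whole 512-bit blocks plus a remainder is a simplification of how the real decoder divides the work.

```cpp
#include <cstdint>

// Hypothetical stand-in for the PR's detect_platform: reports whether the CPU
// the process is running on supports AVX-512 Foundation instructions.
// On compilers or architectures without the builtin it conservatively says no.
bool cpuSupportsAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

// Hypothetical description of how one run of packed data would be split.
struct DecodePlan {
  uint64_t vectorBlocks;  // full 512-bit blocks handled by the vector path
  uint64_t tailBits;      // remaining bits handled by the one-by-one path
};

// Decide how much of `numValues` values of `bitWidth` bits each goes to the
// vector path. Inputs shorter than 512 bits, or platforms without AVX-512,
// are decoded entirely by the original path.
DecodePlan planDecode(uint64_t numValues, uint32_t bitWidth, bool hasAvx512) {
  const uint64_t kBlockBits = 512;
  const uint64_t totalBits = numValues * bitWidth;
  if (!hasAvx512 || totalBits < kBlockBits) {
    return {0, totalBits};
  }
  return {totalBits / kBlockBits, totalBits % kBlockBits};
}
```

A caller would evaluate cpuSupportsAvx512() once, cache the result, and pass it to planDecode for each run. With 1000 values of 3 bits each (3000 bits), the plan is five vector blocks plus 440 bits for the original path; with 100 such values (300 bits), everything goes to the original path.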