Details
Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Versions: Impala 2.1, Impala 2.3.0
Description
If a bz2 file contains multiple streams, Impala stops after decompressing only the first stream in the file, usually about 900K of data. This happens because the bzip2 API that Impala calls does not support multi-stream files.
You can verify whether a file contains multiple streams this way:
~/tmp$ bzip2 -tvvv part-r-00011.bz2
If the file contains multiple streams, you will see multiple "combined CRCs" entries, one per stream, like the following:
part-r-00011.bz2:
    [1: huff+mtf rt+rld {0x2c1dd887, 0x2c1dd887}]
    combined CRCs: stored = 0x2c1dd887, computed = 0x2c1dd887
    [1: huff+mtf rt+rld {0x54f3cc6d, 0x54f3cc6d}]
    combined CRCs: stored = 0x54f3cc6d, computed = 0x54f3cc6d
    [1: huff+mtf rt+rld {0x4d154663, 0x4d154663}]
    combined CRCs: stored = 0x4d154663, computed = 0x4d154663
    ...
    combined CRCs: stored = 0x0c669d2c, computed = 0x0c669d2c
    [1: huff+mtf rt+rld {0xc98168b2, 0xc98168b2}]
    combined CRCs: stored = 0xc98168b2, computed = 0xc98168b2
    ok
For comparison, a file that contains a single stream shows many blocks but only one "combined CRCs" entry:
~/tmp$ bzip2 -tvvv test0.bz2
test0.bz2:
    [1: huff+mtf rt+rld {0x1f9e828d, 0x1f9e828d}]
    [2: huff+mtf rt+rld {0x3f92b829, 0x3f92b829}]
    [3: huff+mtf rt+rld {0xad17755a, 0xad17755a}]
    ...
    [721: huff+mtf rt+rld {0x000e4eb1, 0x000e4eb1}]
    [722: huff+mtf rt+rld {0x0894b080, 0x0894b080}]
    [723: huff+mtf rt+rld {0x647b4336, 0x647b4336}]
    combined CRCs: stored = 0x1caafbf0, computed = 0x1caafbf0
    ok
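The same check can be done programmatically. The following is a minimal Python sketch, not part of Impala, that counts the concatenated streams in a bz2 byte string. The function name `count_bz2_streams` is hypothetical. It relies on the fact that a `bz2.BZ2Decompressor` object stops at the end of one stream and exposes the remaining bytes through `unused_data`. It holds the whole input and each stream's output in memory, so it is only suitable for small test files.

```python
import bz2


def count_bz2_streams(data: bytes) -> int:
    """Count the concatenated bzip2 streams in `data` (hypothetical helper)."""
    streams = 0
    while data:
        # One decompressor object handles exactly one stream.
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)  # decompressed output is discarded here
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        # Bytes after the end of this stream belong to the next stream, if any.
        data = decomp.unused_data
    return streams
```

For example, `count_bz2_streams(open("part-r-00011.bz2", "rb").read())` would be expected to return a value greater than 1 for the multi-stream file above, matching the number of "combined CRCs" entries.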
Impala should use the bzip2 high-level interface to handle embedded compressed data streams better.
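To illustrate the behavior difference, here is a short Python sketch. It is not Impala's C++ decompressor, and the function names are hypothetical. `decompress_first_stream` mimics the reported bug by stopping at the end of the first stream. `decompress_all_streams` restarts a fresh decompressor on the leftover bytes until the input is exhausted, which is the general shape a multi-stream-aware reader needs.

```python
import bz2


def decompress_first_stream(data: bytes) -> bytes:
    """Mimic the bug: decode one stream and ignore everything after it."""
    return bz2.BZ2Decompressor().decompress(data)


def decompress_all_streams(data: bytes) -> bytes:
    """Decode every concatenated stream and join the results."""
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        # Continue with the bytes that follow the finished stream.
        data = decomp.unused_data
    return b"".join(chunks)


if __name__ == "__main__":
    # Build a two-stream input by concatenating two independent streams.
    multi = bz2.compress(b"first\n") + bz2.compress(b"second\n")
    print(decompress_first_stream(multi))
    print(decompress_all_streams(multi))
```

In Python 3.3 and later, the one-shot `bz2.decompress()` already handles multi-stream input, so the explicit loop is shown only to make the restart-per-stream step visible.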
Issue Links
- is related to: IMPALA-2154 Fix decompressor to allow parsing gzips with multiple streams (Resolved)