IMPALA-1886

Impala doesn't support multi-stream bz2 compressed files


Details

    Description

      If a bz2 file contains multiple streams, Impala stops after decompressing only the first stream in the file, usually 900K. This is because the bzip2 API that Impala calls doesn't support multi-stream files.
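
      The behavior described above can be reproduced outside Impala. The following is a minimal sketch using Python's standard `bz2` module, not Impala's C++ code; the payload is a made-up two-stream example built in memory. It shows that a single decompressor object stops at the first end-of-stream marker and leaves the rest of the input unread.

```python
import bz2

# Two independently compressed streams concatenated into one payload,
# similar in shape to the output of `cat a.bz2 b.bz2 > both.bz2`.
first = b"first stream\n"
second = b"second stream\n"
payload = bz2.compress(first) + bz2.compress(second)

# A single decompressor handles exactly one stream and then stops.
decomp = bz2.BZ2Decompressor()
partial = decomp.decompress(payload)

print(partial)                  # only the data from the first stream
print(decomp.eof)               # True: the end-of-stream marker was reached
print(len(decomp.unused_data))  # size of the second stream, left unread
```

      A reader that treats the first end-of-stream marker as the end of the file silently drops everything held in `unused_data`, which matches the symptom reported here.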

      You can verify whether the file contains multiple streams this way:
      ~/tmp$ bzip2 -tvvv part-r-00011.bz2
      If the file contains multiple streams, you will see multiple "combined CRCs" lines, one per stream, like the following:

        part-r-00011.bz2: 
          [1: huff+mtf rt+rld {0x2c1dd887, 0x2c1dd887}]
          combined CRCs: stored = 0x2c1dd887, computed = 0x2c1dd887
          [1: huff+mtf rt+rld {0x54f3cc6d, 0x54f3cc6d}]
          combined CRCs: stored = 0x54f3cc6d, computed = 0x54f3cc6d
          [1: huff+mtf rt+rld {0x4d154663, 0x4d154663}]
          combined CRCs: stored = 0x4d154663, computed = 0x4d154663
          ...
          combined CRCs: stored = 0x0c669d2c, computed = 0x0c669d2c
          [1: huff+mtf rt+rld {0xc98168b2, 0xc98168b2}]
          combined CRCs: stored = 0xc98168b2, computed = 0xc98168b2
          ok
      

      For comparison, a file that contains a single stream shows only one "combined CRCs" line:
      ~/tmp$ bzip2 -tvvv test0.bz2

        test0.bz2: 
          [1: huff+mtf rt+rld {0x1f9e828d, 0x1f9e828d}]
          [2: huff+mtf rt+rld {0x3f92b829, 0x3f92b829}]
          [3: huff+mtf rt+rld {0xad17755a, 0xad17755a}]
          ...
          [721: huff+mtf rt+rld {0x000e4eb1, 0x000e4eb1}]
          [722: huff+mtf rt+rld {0x0894b080, 0x0894b080}]
          [723: huff+mtf rt+rld {0x647b4336, 0x647b4336}]
          combined CRCs: stored = 0x1caafbf0, computed = 0x1caafbf0
          ok
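
      The same check can be sketched programmatically. The snippet below is an illustration in Python using the standard `bz2` module; the function name `count_bz2_streams` is hypothetical, and it reads the whole input into memory, so it is only suitable for small test files.

```python
import bz2

def count_bz2_streams(data: bytes) -> int:
    """Count the bzip2 streams in `data` by decompressing each in turn."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)  # output is discarded, only the count matters
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        # Whatever follows the end-of-stream marker belongs to the next stream.
        data = decomp.unused_data
    return streams

# In-memory stand-ins for a single-stream and a multi-stream file.
single = bz2.compress(b"a" * 1000)
multi = bz2.compress(b"a") + bz2.compress(b"b") + bz2.compress(b"c")

print(count_bz2_streams(single))  # 1
print(count_bz2_streams(multi))   # 3
```

      For a file on disk, the bytes could be read with `open(path, "rb").read()` and passed to the function; a result greater than 1 corresponds to multiple "combined CRCs" lines in the `bzip2 -tvvv` output.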
      

      Impala should use the bzip2 high-level interface to better handle embedded compressed data streams.
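
      As an illustration of what handling multiple streams involves, here is a minimal Python sketch, not the Impala fix itself. The helper name `decompress_multistream` is hypothetical. The idea is to start a fresh decompressor whenever one stream ends and feed it the leftover bytes. In libbz2, the high-level read interface exposes `BZ2_bzReadGetUnused` for recovering bytes read past the end of a stream; whether the eventual Impala change uses it is not stated in this report.

```python
import bz2

def decompress_multistream(data: bytes) -> bytes:
    """Decompress every bzip2 stream in `data` and concatenate the results."""
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise EOFError("compressed data ended before the end-of-stream marker")
        # Continue with the bytes that follow the stream that just ended.
        data = decomp.unused_data
    return b"".join(chunks)

payload = bz2.compress(b"first stream\n") + bz2.compress(b"second stream\n")
result = decompress_multistream(payload)
print(result)

# Python's one-shot bz2.decompress also accepts multi-stream input,
# so it can serve as a cross-check.
print(result == bz2.decompress(payload))
```

      The loop keeps the per-stream logic unchanged and only adds the restart step, which is the part a single-stream reader is missing.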

      Attachments

        1. data-bzip2.bz2
          1.14 MB
          Juan Yu
        2. data-pbzip2.bz2
          1.15 MB
          Juan Yu

            People

              jyu@cloudera.com Juan Yu
              Votes: 0
              Watchers: 6
