IMPALA / IMPALA-9939

Fix Hive interop for HLL with STRING types


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: Not Applicable
    • Component/s: Backend
    • Labels: None

    Description

      It turns out that Impala hashes STRING values differently from Hive.

      Impala's implementation simply hashes the original byte array (e.g. a UTF-8 encoded string), while Hive hashes the UTF-16 encoded char array behind Java strings. If the STRING is cast to BINARY in Hive, e.g. ds_hll_sketch(cast(s as binary)), the resulting sketch is interoperable with Impala's current implementation.
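The mismatch can be illustrated with a minimal sketch. It is not the code of either engine: the function names are hypothetical, SHA-256 is only a stand-in for whatever hash the sketches really use, and little-endian byte order for the UTF-16 code units is an assumption. The point is that the same string yields two different byte sequences, so any byte-oriented hash will disagree.

```python
import hashlib


def impala_style_input(s: str) -> bytes:
    # Per the description, Impala hashes the stored bytes as-is;
    # here that is the UTF-8 encoding of the string.
    return s.encode("utf-8")


def hive_style_input(s: str) -> bytes:
    # Per the description, Hive hashes the UTF-16 code units behind the
    # Java string. Little-endian order is an assumption of this sketch.
    return s.encode("utf-16-le")


def stand_in_hash(data: bytes) -> str:
    # Stand-in hash for illustration only, not the hash used by the sketches.
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    s = "abc"
    utf8 = impala_style_input(s)
    utf16 = hive_style_input(s)
    print(len(utf8), utf8)    # 3 bytes
    print(len(utf16), utf16)  # 6 bytes
    print(stand_in_hash(utf8) == stand_in_hash(utf16))
```

Casting to BINARY on the Hive side corresponds to feeding the first byte sequence to the hash on both engines, which is why that workaround is interoperable.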

      I am not sure how to proceed. We could UTF-16 encode the strings in Impala before hashing, but this would be pretty slow. I also think Hive itself could be faster if it hashed the UTF-8 bytes: STRINGs are stored as org.apache.hadoop.io.Text, so they are currently UTF-8 decoded to a Java string first, whereas they could be hashed directly without any conversion. That change would break compatibility with existing Hive-produced sketches, though.

          People

            Assignee: Unassigned
            Reporter: Csaba Ringhofer (csringhofer)
            Votes: 0
            Watchers: 4
