HIVE-4244: Make string dictionaries adaptive in ORC

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: File Formats
    • Labels: None

      Description

      The ORC writer should adaptively switch between dictionary and direct encoding. I'd propose looking at the first 100,000 values in each column and deciding whether there is sufficient loading in the dictionary to use dictionary encoding.
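      A minimal sketch of what such a check could look like (the class and method names, the sample-size constant, and the threshold parameter are all illustrative assumptions, not the actual ORC writer code):

        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        // Hypothetical sketch: sample the first values of a column and
        // decide whether dictionary encoding looks worthwhile.
        public class EncodingChooser {
          private static final int SAMPLE_SIZE = 100_000;

          // Returns true when the fraction of distinct values in the
          // sample is low enough that a dictionary should pay off.
          static boolean useDictionary(List<String> columnValues,
                                       double maxDistinctRatio) {
            int sampled = Math.min(SAMPLE_SIZE, columnValues.size());
            Set<String> distinct = new HashSet<>();
            for (int i = 0; i < sampled; i++) {
              distinct.add(columnValues.get(i));
            }
            double distinctRatio = (double) distinct.size() / sampled;
            return distinctRatio <= maxDistinctRatio;
          }
        }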

        Activity

        Kevin Wilfong added a comment -

        Some initial thoughts based on some experiments.

        Dictionary encoding seems to be less effective than Zlib alone at compressing values when the number of distinct values is more than ~80% of the total number of values. This number can be made configurable. The dictionary is still smaller in memory, so we may be able to get away with keeping it while building the stripe and writing the data out directly when the stripe is written. This should be comparable in performance to the conversion of the dictionary index that is already done.

        Also, if the uncompressed (but encoded) size of the dictionary + index (data stream) is greater than the uncompressed size of the original data, the compressed data tends to be larger as well, despite the sorting. This will be more expensive to figure out, as we don't know the size of the index until it has been run-length encoded.
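        As a rough sketch, the two observations combine into a predicate like the one below (a method-level sketch that could sit alongside the one above; the names and the sizes passed in are assumptions, and the sizes would have to be tracked or estimated by the writer):

          // Hypothetical: prefer direct encoding when either observation
          // above says the dictionary will not pay off.
          static boolean preferDirect(long distinctValues, long totalValues,
                                      long dictionaryBytes, long indexBytes,
                                      long rawDataBytes,
                                      double distinctCutoff) { // e.g. 0.8
            // Observation 1: too many distinct values; Zlib alone wins.
            if ((double) distinctValues / totalValues > distinctCutoff) {
              return true;
            }
            // Observation 2: encoded dictionary + index outgrow the raw
            // data, in which case the compressed result tends to as well.
            return dictionaryBytes + indexBytes > rawDataBytes;
          }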

        Owen O'Malley added a comment -

        We should play with different values, but I was guessing the right cutover point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values).

        We aren't really going to know whether the heuristic is right or wrong unless we compare both encodings, which is much too expensive. By taking a good guess after looking at the start of the stripe, we can get good performance most of the time.
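        For concreteness, "loading" here is total values divided by distinct values, so a cutover loading of 2 corresponds to 50% distinct values and a loading of 3 to about 33%. A tiny illustrative predicate (not project code):

          // Hypothetical: loading = totalValues / distinctValues.
          static boolean loadingFavorsDictionary(long totalValues,
                                                 long distinctValues,
                                                 double minLoading) { // 2-3
            return (double) totalValues / distinctValues >= minLoading;
          }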


          People

          • Assignee: Kevin Wilfong
          • Reporter: Owen O'Malley
          • Votes: 0
          • Watchers: 2
