Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Hive already supports the Decimal data type for Parquet, but predicate pushdown (PPD) does not work for Decimal columns. This Jira will add Decimal support to PPD for Parquet.

          Activity

          dongc Dong Chen added a comment -

          After investigating this, I found that we might need some changes on the Parquet side.

          Problem:
          Decimal in Hive is mapped to Binary in Parquet. When predicates and statistics are used to filter values, comparing Binary values in Parquet does not reflect the ordering of the corresponding Decimal values in Hive. This type mapping causes 2 problems:
          1. When writing a Decimal column, Binary.compareTo() is used to determine the column statistics (min, max). The resulting statistics are not correct from a Decimal perspective.
          2. When reading with a Predicate (or Filter), the expected Decimal value is converted to Binary, and Binary.compareTo() is used to compare it against the column statistics. Both sides are compared as Binary, so the result is also wrong.
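
To make the mismatch concrete, here is a minimal sketch in plain Java. It assumes the decimal is stored as the big-endian two's complement bytes of its unscaled value, and it uses a simple lexicographic byte comparison as a stand-in for Binary.compareTo(); the real Parquet method may differ in detail. The class and method names are hypothetical and are not part of Hive or Parquet.

```java
import java.math.BigDecimal;

// Hypothetical demo class; not part of Hive or Parquet.
class DecimalBytesOrderDemo {

    // Encode the unscaled value as big-endian two's complement bytes (minimal length).
    // setScale throws ArithmeticException if the value would need rounding.
    static byte[] encode(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    // Assumed stand-in for a raw binary comparison: lexicographic, bytes treated
    // as unsigned, and a shorter array sorts first when it is a prefix of the other.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        BigDecimal one = new BigDecimal("1.00");
        BigDecimal two = new BigDecimal("2.00");
        BigDecimal minusOne = new BigDecimal("-1.00");

        // Numerically 2.00 > 1.00, but 200 encodes as {0x00, 0xC8} and 100 as {0x64},
        // so the byte comparison puts 2.00 before 1.00.
        System.out.println("decimal order: " + Integer.signum(two.compareTo(one)));
        System.out.println("byte order:    " + Integer.signum(compareBytes(encode(two, 2), encode(one, 2))));

        // Numerically -1.00 < 1.00, but -100 encodes as {0x9C}, which sorts after {0x64}
        // when bytes are treated as unsigned.
        System.out.println("decimal order: " + Integer.signum(minusOne.compareTo(one)));
        System.out.println("byte order:    " + Integer.signum(compareBytes(encode(minusOne, 2), encode(one, 2))));
    }
}
```

Under these assumptions, a min/max computed with the byte comparison could record 2.00 as the minimum of a column containing 1.00 and 2.00, which is exactly the kind of wrong statistic described in problem 1.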

          An idea:
          I was wondering whether we could add a custom comparator as an attribute of the Binary class, supplied by higher-level users such as Hive, since Hive knows how to decode the binary into a Decimal and compare it. Binary.compareTo() could then switch between the custom and the original comparison method.

          I am not sure this solution is OK, since it requires changing the Parquet API.
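
A rough sketch of the comparator idea, to show the shape of the change rather than a real patch. BinarySketch is a hypothetical stand-in for Parquet's Binary class, and DECIMAL_ORDER is a hypothetical comparator that Hive could supply; it assumes both values use the same scale and are non-empty big-endian two's complement encodings.

```java
import java.math.BigInteger;
import java.util.Comparator;

// Hypothetical stand-in for Parquet's Binary; not the real class or API.
class BinarySketch implements Comparable<BinarySketch> {

    // Hypothetical decimal-aware order: decode both sides as signed big-endian
    // integers (the unscaled values) and compare numerically.
    // Assumes equal scale; new BigInteger(byte[]) rejects empty arrays.
    static final Comparator<byte[]> DECIMAL_ORDER =
            (a, b) -> new BigInteger(a).compareTo(new BigInteger(b));

    private final byte[] bytes;
    private final Comparator<byte[]> custom; // null means "use the original comparison"

    BinarySketch(byte[] bytes, Comparator<byte[]> custom) {
        this.bytes = bytes;
        this.custom = custom;
    }

    // Assumed "original" comparison: lexicographic over unsigned bytes.
    static int compareRawBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Switch between the custom and the original comparison, as proposed above.
    @Override
    public int compareTo(BinarySketch other) {
        return custom != null
                ? custom.compare(bytes, other.bytes)
                : compareRawBytes(bytes, other.bytes);
    }
}
```

One design concern this sketch exposes: every value carries its own comparator reference, and two values with different comparators have no well-defined order.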

          Any thoughts? Other ideas?

          dongc Dong Chen added a comment -

          Hi Sergio Peña, Ryan Blue, could you take a look at this when you have time? Thanks!

          rdblue Ryan Blue added a comment -

          Dong Chen, I think you're right that we need a comparator, but I think this should be at the column-level rather than associated with the Binary class. Could you open a Parquet issue to discuss this in the Parquet community?
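
As an illustration of what a column-level comparator might look like, here is a small sketch of a min/max tracker that is given its ordering once per column instead of per value. ColumnMinMaxSketch and its methods are hypothetical names, not Parquet classes, and the decimal ordering in the usage note assumes a fixed scale for the column.

```java
import java.util.Comparator;

// Hypothetical per-column statistics holder; not a Parquet class.
class ColumnMinMaxSketch {

    private final Comparator<byte[]> order; // chosen once, from the column's logical type
    private byte[] min;
    private byte[] max;

    ColumnMinMaxSketch(Comparator<byte[]> order) {
        this.order = order;
    }

    // Write path: update min and max using the column's ordering.
    void update(byte[] value) {
        if (min == null || order.compare(value, min) < 0) {
            min = value;
        }
        if (max == null || order.compare(value, max) > 0) {
            max = value;
        }
    }

    // Read path: a predicate "column > literal" cannot match any row
    // when the recorded max is less than or equal to the literal.
    boolean canDropForGreaterThan(byte[] literal) {
        return max != null && order.compare(max, literal) <= 0;
    }

    byte[] min() {
        return min;
    }

    byte[] max() {
        return max;
    }
}
```

A Decimal column could be constructed with an ordering such as `(a, b) -> new BigInteger(a).compareTo(new BigInteger(b))`, while plain binary columns keep a byte-wise ordering, so the write path and the read path use the same comparison for a given column.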

          dongc Dong Chen added a comment -

          Thanks for your feedback, Ryan Blue.

          PARQUET-281 was created for this. Column-level is a good idea.
          I will investigate the Parquet code to see how to implement it. It does not look as straightforward as changing the Binary class, and will need some code digging.


            People

            • Assignee: dongc Dong Chen
            • Reporter: dongc Dong Chen
            • Votes: 0
            • Watchers: 2
