Parquet / PARQUET-281

Statistics and Filter need a mechanism to get a customized comparator from higher-layer users

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      As discussed in HIVE-10254, we might need a customized comparator supplied by a higher-layer user, both for generating statistics when writing and for applying filters when reading.

      The problem, using the Decimal type in Hive as an example:
      Decimal in Hive is mapped to Binary in Parquet. When predicates and statistics are used to filter values, comparing Binary values in Parquet does not reflect the correct ordering of the Decimal values in Hive. This type mapping causes 2 problems:
      1. When writing a Decimal column, Binary.compareTo() is used to evaluate and set the column statistics (min, max). The generated statistics are not correct from a Decimal perspective.
      2. When reading with a Predicate (or Filter), the expected Decimal value is converted to Binary, and Binary.compareTo() is used to compare the expected value with the column statistics. Both are compared from a Binary perspective, so the result is also wrong.
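The mismatch described above can be sketched in plain Java. This is only an illustration under stated assumptions: `compareBytes` is a hypothetical byte-wise (unsigned, lexicographic) comparison standing in for a Binary-level ordering, not Parquet's actual `Binary.compareTo()`, and the decimals are assumed to be encoded as big-endian two's-complement unscaled values with a known scale.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalOrderDemo {
    // Hypothetical stand-in for a byte-wise ordering: lexicographic over unsigned bytes.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Assumed encoding: the unscaled value as big-endian two's-complement bytes.
    static byte[] encode(BigDecimal d) {
        return d.unscaledValue().toByteArray();
    }

    // Decode both operands with the same (known) scale and compare them as decimals.
    static int compareAsDecimal(byte[] a, byte[] b, int scale) {
        BigDecimal left = new BigDecimal(new BigInteger(a), scale);
        BigDecimal right = new BigDecimal(new BigInteger(b), scale);
        return left.compareTo(right);
    }

    public static void main(String[] args) {
        byte[] minusOne = encode(new BigDecimal("-1.00")); // unscaled -100 -> 0x9C
        byte[] plusOne = encode(new BigDecimal("1.00"));   // unscaled  100 -> 0x64

        // Byte-wise ordering ranks -1.00 above 1.00 because 0x9C > 0x64 as unsigned bytes.
        System.out.println("bytes say -1.00 > 1.00: " + (compareBytes(minusOne, plusOne) > 0));
        // Decoding first gives the expected decimal ordering.
        System.out.println("decimal says -1.00 < 1.00: " + (compareAsDecimal(minusOne, plusOne, 2) < 0));
    }
}
```

With an ordering like this, min/max statistics computed on the raw bytes would record -1.00 as the maximum of a column containing both values.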

      We could add an interface for a customized comparator, and a higher-level user such as Hive would provide the comparator to Parquet, since Hive knows how to decode the binary into a Decimal and compare it. Parquet could then switch between the customized and the original comparison method.

        Issue Links

          Activity

          dongc Dong Chen added a comment -

          Trying to find a way to add the comparator at the column level rather than in the Binary class.

          dongc Dong Chen added a comment -

          Hi Ryan Blue, as we discussed in HIVE-10254, here are some thoughts about adding a comparator at the column level rather than in the Binary class. Could you take a look when you have time? Thanks.

          The customized comparator would be injected and used in 3 places:

          • generating block statistics when writing
          • filtering blocks with a predicate when reading
          • filtering records with a predicate when reading

          1. Writing
          The Statistics instance holds the data and is compared and updated whenever a record is written. It is initialized in ColumnWriter inside Parquet and is not exposed to Hive.

          In order to transit the comparator from Hive to Parquet, how about adding params (like parquet.customized.comparator.type and p.c.c.class) in conf or WriteContext.extraMetaData? Then add a delegated comparator in Statistics. Statistics could extract the params and instantiate the comparator based on the data type.
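A minimal sketch of the configuration-driven idea above, in plain Java. Everything here is an assumption for illustration: the key `parquet.customized.comparator.class` is the expanded form of the parameter proposed in the comment, not an existing Parquet setting, a plain `Map` stands in for the conf object, and `UnscaledDecimalComparator` is a hypothetical comparator a caller might supply.

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class ComparatorConfigDemo {
    // Hypothetical configuration key taken from the proposal; not a real Parquet option.
    static final String COMPARATOR_CLASS_KEY = "parquet.customized.comparator.class";

    // Hypothetical user-supplied comparator: orders two's-complement unscaled decimal values.
    public static class UnscaledDecimalComparator implements Comparator<byte[]> {
        @Override
        public int compare(byte[] a, byte[] b) {
            return new BigInteger(a).compareTo(new BigInteger(b));
        }
    }

    // Instantiates the configured comparator reflectively, or returns null when none is set.
    @SuppressWarnings("unchecked")
    static Comparator<byte[]> load(Map<String, String> conf) {
        String className = conf.get(COMPARATOR_CLASS_KEY);
        if (className == null) {
            return null; // caller falls back to the default comparison
        }
        try {
            return (Comparator<byte[]>) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate comparator " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(COMPARATOR_CLASS_KEY, UnscaledDecimalComparator.class.getName());

        Comparator<byte[]> cmp = load(conf);
        // 0x9C is -100 and 0x64 is 100 as two's-complement values.
        System.out.println(cmp.compare(new byte[] {(byte) 0x9C}, new byte[] {0x64}) < 0);
    }
}
```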

          2. Reading
          Methods like FilterApi.binaryColumn are exposed, so we could pass the comparator in from Hive. The Operators.Column class would then need an attribute to store the comparator.

          For filtering blocks, modify the visit methods in StatisticsFilter to get the comparator through Column and use it if it exists.

          For filtering records, modify the update methods in IncrementallyUpdatedFilterPredicate.ValueInspector (the implementation is actually in IncrementallyUpdatedFilterPredicateGenerator) to get the comparator through Column and use it if it exists.
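The block-filtering step can be sketched as follows. This is a simplified illustration, not the StatisticsFilter code: `canDropEq` is a hypothetical helper for an equality predicate, `BYTEWISE` is an assumed default byte-wise ordering, and the values are assumed to be decimals at scale 2 stored as two's-complement unscaled bytes, with block statistics that are already correct in decimal terms.

```java
import java.math.BigInteger;
import java.util.Comparator;

public class BlockFilterSketch {
    // Assumed default ordering: lexicographic over unsigned bytes.
    static final Comparator<byte[]> BYTEWISE = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    // Hypothetical custom ordering: compare the two's-complement unscaled values.
    static final Comparator<byte[]> UNSCALED_DECIMAL =
            (a, b) -> new BigInteger(a).compareTo(new BigInteger(b));

    // A block can be skipped for "column == value" when value lies outside [min, max].
    // The custom comparator is used if present; otherwise the default ordering applies.
    static boolean canDropEq(byte[] value, byte[] min, byte[] max, Comparator<byte[]> custom) {
        Comparator<byte[]> cmp = (custom != null) ? custom : BYTEWISE;
        return cmp.compare(value, min) < 0 || cmp.compare(value, max) > 0;
    }

    public static void main(String[] args) {
        byte[] min = {(byte) 0x9C}; // -1.00 (unscaled -100)
        byte[] max = {0x64};        //  1.00 (unscaled  100)
        byte[] value = {0x32};      //  0.50 (unscaled   50)

        // Byte-wise: 0x32 < 0x9C, so the block is dropped even though 0.50 is in range.
        System.out.println("default drops block: " + canDropEq(value, min, max, null));
        // Decimal-aware: -1.00 <= 0.50 <= 1.00, so the block is kept.
        System.out.println("custom drops block: " + canDropEq(value, min, max, UNSCALED_DECIMAL));
    }
}
```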

          How does this sound?

          rdblue Ryan Blue added a comment -

          In order to transit the comparator from Hive to Parquet . . .

          There should be no need to pass a comparator between Hive and Parquet. This would be completely inside of Parquet because Parquet defines the types and implements the predicates. That means that Parquet should be able to have a custom comparator for any type, determined by its logical type. For this, I would add a `getComparator` method to the `Type`, but I'd like to hear what Alex Levenson's opinion is.

          For example, UINT32 will need a custom comparator that sorts negative numbers after positive ones, because the sign bit isn't a sign; it is data.

          Once you can get a `Comparator` from the `Type`, you should pass only the type and get the comparator when it is needed.
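A minimal sketch of this suggestion, assuming a hypothetical `SketchType` enum in place of Parquet's `Type` (the `getComparator` method is the one proposed in the comment, not an API confirmed to exist). It uses the JDK's `Integer.compareUnsigned` to show the UINT32 ordering in which values with the high bit set sort after the positive ones.

```java
import java.util.Arrays;
import java.util.Comparator;

public class TypeComparatorSketch {
    // Hypothetical stand-in for a type stored as a 32-bit int; not Parquet's real Type class.
    enum SketchType {
        INT32 {
            @Override
            Comparator<Integer> getComparator() {
                return Integer::compare; // ordinary signed ordering
            }
        },
        UINT32 {
            @Override
            Comparator<Integer> getComparator() {
                return Integer::compareUnsigned; // the high bit is data, not a sign
            }
        };

        abstract Comparator<Integer> getComparator();
    }

    // Callers pass only the type; the comparator is looked up when it is needed.
    static Integer[] sorted(SketchType type, Integer... values) {
        Integer[] copy = values.clone();
        Arrays.sort(copy, type.getComparator());
        return copy;
    }

    public static void main(String[] args) {
        // Signed view: -1 is the smallest value.
        System.out.println(Arrays.toString(sorted(SketchType.INT32, -1, 1, 0)));  // [-1, 0, 1]
        // Unsigned view: the bit pattern of -1 is 0xFFFFFFFF, the largest UINT32.
        System.out.println(Arrays.toString(sorted(SketchType.UINT32, -1, 1, 0))); // [0, 1, -1]
    }
}
```

Keeping the lookup on the type means the statistics and filter code need no extra plumbing from the caller.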

          dongc Dong Chen added a comment -

          Sounds good. Thanks Ryan Blue.

          One last question: do we need to consider the possibility that user code and Parquet use different conversions between a logical type and its 'Type', such as Decimal <-> Binary?
          If that cannot happen, Parquet can build the custom comparator internally for any type, based on its logical type.

          rdblue Ryan Blue added a comment -

          Dong Chen, I'd prefer to keep the logical type representations separate from this. The Comparator should work on the primitive type.


            People

            • Assignee:
              dongc Dong Chen
              Reporter:
              dongc Dong Chen