Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.3.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:
      None

      Description

      The CREATE TABLE LIKE PARQUET statement works on the file attached. However, executing

      SELECT * FROM table20
      

      causes the following error:

      File 'hdfs://localhost:20500/taras/table20.parquet' has an incompatible Parquet schema for column 'taras_tmp.table20.col2'. Column type: DECIMAL(9, 2), Parquet schema:
      required byte_array col2 [i:1 d:0 r:0]
      

      This is supported according to the Parquet spec: https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal
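For the BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY cases, the spec stores the unscaled value as a big-endian two's-complement integer. A minimal sketch of that decoding (plain Python for illustration; this is not Impala's scanner code):

```python
from decimal import Decimal

def decode_decimal_bytes(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet DECIMAL stored as BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY:
    the bytes hold the unscaled value as a big-endian two's-complement int."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# DECIMAL(9, 2): the bytes 0x30 0x39 encode unscaled 12345 -> 123.45
print(decode_decimal_bytes(b"\x30\x39", scale=2))   # 123.45
print(decode_decimal_bytes(b"\xcf\xc7", scale=2))   # -123.45
```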

      Attachments

      1. table20.parquet
        0.6 kB
        Taras Bobrovytsky


          Activity

          Skye Wanderman-Milne added a comment -

          It looks like the Impala parquet scanner doesn't recognize enough decimal types (we only support fixed_len_byte_array): https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

          [~rblue], do you know what parquet-mr will use for decimals by default? Alternatively, we could implement all of them, but we should have something that works by default in place for the release. Also, did this change recently? It appears to have worked before.

          Taras Bobrovytsky added a comment -

          Note: The parquet file was generated using an unreleased branch of parquet-mr.

          Dan Hecht added a comment -

          Ryan, please see Skye's question. Please determine whether this is an issue that needs fixing in 2.3 or not.

          Ryan Blue added a comment -

          Impala needs to support reading all of the storage options from the Parquet decimal spec. Parquet MR allows all of them depending on the user's schema.

          This isn't a blocker for 2.3 because we don't expect many people to be writing decimals outside of Hive and Impala (which both use a minimal-sized fixed). In fact, Taras is using a version of parquet-avro that I built for him to make it easy to write decimals; it won't be in 5.5 and hasn't been committed upstream because it relies on Avro 1.8, which isn't out yet. I think this is low priority, since either we write the decimals correctly ourselves or it is difficult to write them at all.

          Dan Hecht added a comment -

          Thanks for clarifying. Updating the target version / priority accordingly.

          Taras Bobrovytsky added a comment -

          Silvius Rus, should this be assigned to the team in Budapest?

          Zoltan Ivanfi added a comment -

          Attila, could you please take a look? Thanks!

          Zoltan Ivanfi added a comment -

          I was wondering how different physical types for the same logical type should be handled across multiple files. Does a table schema only define logical types, so that a field can be stored using any primitive type capable of representing its logical type? If so, is an HDFS directory considered a valid Impala table if it contains two Parquet files with matching logical types but different primitive types?

          For example, given these two files:

          • `test/file1.par`: single column: name = `num`, primitive type = BYTE_ARRAY, logical type = DECIMAL.
          • `test/file2.par`: single column: name = `num`, primitive type = FIXED_LEN_BYTE_ARRAY, logical type = DECIMAL.

          Should we be able to define a table over these two files in Impala? I think the answer is yes, but I would like to get your feedback. The alternative approach that requires Impala users to specify non-default primitive types for table columns is less flexible and seems much more complicated to implement.
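If resolution is done by logical type, the compatibility check could look roughly like the following. This is a hypothetical sketch; the function and field names are illustrative and are not Impala's actual code.

```python
# Physical types the Parquet spec permits for the DECIMAL logical type.
DECIMAL_PHYSICAL_TYPES = {"INT32", "INT64", "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}

def columns_compatible(table_col: dict, file_col: dict) -> bool:
    """A file column matches the table schema if the logical types agree
    and the file's physical type is one the logical type permits."""
    if table_col["logical"] != file_col["logical"]:
        return False
    if file_col["logical"] == "DECIMAL":
        return file_col["physical"] in DECIMAL_PHYSICAL_TYPES
    return True

# The two files from the example above: same logical type, different
# physical types -- both should resolve against the same table column.
file1 = {"name": "num", "physical": "BYTE_ARRAY", "logical": "DECIMAL"}
file2 = {"name": "num", "physical": "FIXED_LEN_BYTE_ARRAY", "logical": "DECIMAL"}
table = {"name": "num", "logical": "DECIMAL"}
print(columns_compatible(table, file1), columns_compatible(table, file2))  # True True
```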

          Attila Jeges added a comment -

          Unassigning it for now because I won't have time to work on it for the next few days.

          Taras Bobrovytsky added a comment -

          Here's what I think: yes, the table schema should define the logical type; the primitive type does not matter. So if a directory contains multiple files with different primitive types but the same logical type, the table should be valid in Impala.

          I am pretty sure this is how Hive handles the situation (though I haven't tested it). We should verify what Hive does and match its behavior in Impala.

          Dan Hecht added a comment -

          Henry Robinson, it looks like you had a change for this, https://gerrit.cloudera.org/#/c/5115/, that was abandoned. Do we want to resolve this as "won't fix"? Also, if you don't plan to work on it again, please unassign.

          Tim Armstrong added a comment -

          Dan Hecht, this is a part of the Parquet spec we don't implement (the INT32 and INT64 encodings are also unsupported). See https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal. I think we should plan on fixing it.
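For the INT32 and INT64 cases the spec is simpler than the byte-array ones: the stored integer is the unscaled value itself, with INT32 permitted for precision up to 9 and INT64 up to 18. A minimal sketch (illustrative Python, not Impala code):

```python
from decimal import Decimal

def decode_decimal_int(unscaled: int, scale: int) -> Decimal:
    """Decode a Parquet DECIMAL stored as INT32 or INT64: the stored
    integer is the unscaled value, interpreted with the column's scale."""
    return Decimal(unscaled).scaleb(-scale)

# A DECIMAL(9, 2) value fits in INT32: unscaled 12345 with scale 2.
print(decode_decimal_int(12345, 2))   # 123.45
```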

          Tim Armstrong added a comment -

          IMPALA-2515 and IMPALA-5542 are similar cases where we are stricter than the Parquet spec about decimals.

          Henry Robinson added a comment -

          My patch worked as a proof-of-concept (and at the time we wanted to add support in a hurry), but if we were doing it again with more time, the code would probably come out cleaner if we added a template parameter that encodes the Parquet type being decoded (see the comments on the code review). I agree with Tim that this is worth fixing.
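The template-parameter idea can be pictured outside C++ as choosing one decoder per column up front, keyed on the physical type, instead of branching on it per value. A hypothetical sketch of that shape (the names are illustrative; the C++ version would bake the choice in as a template parameter):

```python
from decimal import Decimal

def _from_twos_complement(raw: bytes, scale: int) -> Decimal:
    # BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY: big-endian two's-complement bytes.
    return Decimal(int.from_bytes(raw, "big", signed=True)).scaleb(-scale)

def _from_int(value: int, scale: int) -> Decimal:
    # INT32 / INT64: the stored integer is the unscaled value.
    return Decimal(value).scaleb(-scale)

# One decoder per physical type, selected once when the column reader is
# built rather than inside the per-value decode loop.
DECIMAL_DECODERS = {
    "INT32": _from_int,
    "INT64": _from_int,
    "BYTE_ARRAY": _from_twos_complement,
    "FIXED_LEN_BYTE_ARRAY": _from_twos_complement,
}

decode = DECIMAL_DECODERS["BYTE_ARRAY"]
print(decode(b"\x30\x39", 2))  # 123.45
```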

          Dan Hecht added a comment -

          Yeah, if it's part of the spec we should fix.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Taras Bobrovytsky
            • Votes:
              0
            • Watchers:
              9
