Hive / HIVE-8120 Umbrella JIRA tracking Parquet improvements / HIVE-11611

A bad performance regression issue with Parquet happens if Hive does not select any columns


Details

    • Type: Sub-task
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      A performance issue can occur in the code below when running a query that selects no columns, such as SELECT count(1) FROM parquetTable.

      if (!ColumnProjectionUtils.isReadAllColumns(configuration) && !indexColumnsWanted.isEmpty()) {
        MessageType requestedSchemaByUser =
            getSchemaByIndex(tableSchema, columnNamesList, indexColumnsWanted);
        return new ReadContext(requestedSchemaByUser, contextMetadata);
      } else {
        return new ReadContext(tableSchema, contextMetadata);
      }

      If neither columns nor indexes are selected, indexColumnsWanted is empty, so the condition fails and the else branch requests the full table schema from Parquet, even though Hive never uses those values.
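
One possible way to avoid the full read is to treat "no columns selected" as its own case instead of letting it fall through to the else branch. The sketch below is a simplified stand-in, not the attached patch and not Hive's actual code: the class ProjectionSketch and the method selectSchema are hypothetical, a list of column names stands in for MessageType, and a boolean stands in for ColumnProjectionUtils.isReadAllColumns(configuration). Whether the Parquet reader accepts an empty requested schema is also an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the schema-selection branch.
// A List<String> of column names stands in for Parquet's MessageType.
class ProjectionSketch {

    // Returns the column names that should be requested from the reader.
    static List<String> selectSchema(List<String> tableSchema,
                                     boolean readAllColumns,
                                     List<Integer> indexColumnsWanted) {
        if (readAllColumns) {
            // Hive asked for every column: request the full table schema.
            return tableSchema;
        }
        if (indexColumnsWanted.isEmpty()) {
            // No columns selected (e.g. SELECT count(1)): request nothing
            // instead of falling back to the full table schema.
            return Collections.emptyList();
        }
        // Otherwise request only the projected columns, skipping
        // indexes that fall outside the table schema.
        List<String> requested = new ArrayList<>();
        for (int index : indexColumnsWanted) {
            if (index >= 0 && index < tableSchema.size()) {
                requested.add(tableSchema.get(index));
            }
        }
        return requested;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name", "price");
        // No columns selected: empty request.
        System.out.println(selectSchema(schema, false, Collections.<Integer>emptyList()));
        // Columns 0 and 2 selected.
        System.out.println(selectSchema(schema, false, Arrays.asList(0, 2)));
        // Read-all flag set: full schema.
        System.out.println(selectSchema(schema, true, Collections.<Integer>emptyList()));
    }
}
```

Splitting the original two-part condition into separate checks makes the empty-projection case explicit, so it can no longer be confused with the read-all case.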

      Attachments

        1. HIVE-11611.patch
          3 kB
          Ferdinand Xu

            People

              Assignee: Ferdinand Xu (Ferd)
              Reporter: Sergio Peña (spena)
              Votes: 0
              Watchers: 5
