Hive / HIVE-8120 Umbrella JIRA tracking Parquet improvements / HIVE-11611

A bad performance regression issue with Parquet happens if Hive does not select any columns

    Details

    • Type: Sub-task
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      A performance issue may occur in the code below when running a query such as SELECT count(1) FROM parquetTable.

      if (!ColumnProjectionUtils.isReadAllColumns(configuration) && !indexColumnsWanted.isEmpty()) {
        MessageType requestedSchemaByUser =
            getSchemaByIndex(tableSchema, columnNamesList, indexColumnsWanted);
        return new ReadContext(requestedSchemaByUser, contextMetadata);
      } else {
        return new ReadContext(tableSchema, contextMetadata);
      }
      

      If neither columns nor indexes are selected, the code above takes the else branch and requests the full table schema from Parquet, even though Hive does not use any of the values it reads.
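
      The branching problem can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the Hive or Parquet APIs: the class ProjectionSketch, the method requestedColumns, and the use of plain column-name lists in place of MessageType are all hypothetical stand-ins. The assumption is that "read all columns" and "no columns wanted" should be treated as separate cases, so that a query like SELECT count(1) requests an empty projection instead of the full schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the schema-selection logic; not Hive or Parquet code.
public class ProjectionSketch {

  /**
   * Returns the column names to request from the reader.
   * Column-name lists stand in for Parquet's MessageType (an assumption).
   */
  static List<String> requestedColumns(boolean readAllColumns,
                                       List<String> tableColumns,
                                       List<Integer> indexColumnsWanted) {
    if (readAllColumns) {
      // e.g. SELECT *: the full table schema is genuinely needed.
      return new ArrayList<>(tableColumns);
    }
    if (indexColumnsWanted.isEmpty()) {
      // e.g. SELECT count(1): no column values are used, so request none
      // instead of falling through to the full schema.
      return Collections.emptyList();
    }
    // Otherwise project only the wanted column indexes.
    List<String> projected = new ArrayList<>();
    for (int index : indexColumnsWanted) {
      if (index >= 0 && index < tableColumns.size()) {
        projected.add(tableColumns.get(index));
      }
    }
    return projected;
  }

  public static void main(String[] args) {
    List<String> table = Arrays.asList("id", "name", "price");
    // No columns wanted: empty projection.
    System.out.println(requestedColumns(false, table, Collections.<Integer>emptyList()));
    // Two columns wanted: only those are requested.
    System.out.println(requestedColumns(false, table, Arrays.asList(0, 2)));
    // Read-all flag set: full schema.
    System.out.println(requestedColumns(true, table, Collections.<Integer>emptyList()));
  }
}
```

      In the quoted code the first two cases share the else branch, which is why an empty column selection ends up reading every column.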

        Attachments

        1. HIVE-11611.patch
          3 kB
          Ferdinand Xu

              People

              • Assignee:
                Ferd Ferdinand Xu
              • Reporter:
                spena Sergio Peña
              • Votes:
                0
              • Watchers:
                4

                Dates

                • Created:
                  Updated: