Hive / HIVE-14792

AvroSerde reads the remote schema-file at least once per mapper, per table reference.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1, 2.1.0
    • Fix Version/s: 2.2.1, 2.4.0, 3.0.0
    • None

    Description

      Avro tables that use "external" schema files stored on HDFS can cause excessive calls to FileSystem::open(), especially for queries that spawn large numbers of mappers.

      This is because of the following code in AvroSerDe::initialize():

      AvroSerDe.java
      public void initialize(Configuration configuration, Properties properties) throws SerDeException {
        // ...
        if (hasExternalSchema(properties)
            || columnNameProperty == null || columnNameProperty.isEmpty()
            || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
          schema = determineSchemaOrReturnErrorSchema(configuration, properties);
        } else {
          // Get column names and sort order
          columnNames = Arrays.asList(columnNameProperty.split(","));
          columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);

          schema = getSchemaFromCols(properties, columnNames, columnTypes, columnCommentProperty);
          properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(), schema.toString());
        }
        // ...
      }
      

      For tables using avro.schema.url, the schema file is read remotely every time the SerDe is initialized (i.e. at least once per mapper). For queries with thousands of mappers, this leads to a stampede on the handful (3?) of datanodes that host the schema file. In the best case, this causes slowdowns.
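
To make the scale concrete, here is a toy model of the access pattern described above. It is not Hive code: `SchemaOpenCounter`, `CountingSchemaStore`, and the HDFS path are hypothetical names used only for illustration, and the model counts exactly one read per SerDe initialization, which is the lower bound the description states.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the schema-read pattern; not Hive code. */
class SchemaOpenCounter {

    /** Stand-in for a remote file open; it only counts how often it is called. */
    static final class CountingSchemaStore {
        private final AtomicLong opens = new AtomicLong();

        String readSchema(String url) {
            opens.incrementAndGet();
            return "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}";
        }

        long opens() {
            return opens.get();
        }
    }

    /**
     * Assumes each mapper initializes one SerDe per table reference and that
     * each initialization reads the schema file exactly once.
     */
    static long simulate(int mappers, int tableReferences) {
        CountingSchemaStore store = new CountingSchemaStore();
        for (int m = 0; m < mappers; m++) {
            for (int t = 0; t < tableReferences; t++) {
                store.readSchema("hdfs:///schemas/example.avsc"); // hypothetical path
            }
        }
        return store.opens();
    }

    public static void main(String[] args) {
        // 5000 mappers, 2 references to the same table
        System.out.println(simulate(5000, 2)); // prints 10000
    }
}
```

Under these assumptions the number of remote opens grows as mappers times table references, and all of them target the same few replicas of one small file.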

      It would be preferable to distribute the Avro-schema to all mappers as part of the job-conf. The alternatives aren't exactly appealing:

      1. One can't rely solely on the column.list.types stored in the Hive metastore (see HIVE-14789).
      2. avro.schema.literal might not always be usable, because of the size limit on table parameters. In my limited experience, an Avro schema file is typically between 0.5 and 3 MB. Bumping the maximum table-parameter size isn't a great solution.

      If the schema file referenced by avro.schema.url were read once during query planning and made available as part of the table properties (but not serialized into the metastore), the downstream logic would remain largely intact. I have a patch that does this.
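
The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch: `PlanTimeSchemaSketch`, `resolveAtPlanTime`, `schemaForTask`, and the `remoteReader` function are hypothetical, the HDFS path is made up, and the task-side rule of preferring the literal over the URL is an assumption of this sketch. Only the property names avro.schema.url and avro.schema.literal come from Hive.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Sketch of resolving the schema once at planning time; not the actual patch. */
class PlanTimeSchemaSketch {
    static final String SCHEMA_URL = "avro.schema.url";
    static final String SCHEMA_LITERAL = "avro.schema.literal";

    /**
     * Planner side: read the schema once and attach it to an in-memory copy of
     * the table properties. The original properties (the metastore view) are
     * left untouched, so nothing large is persisted.
     */
    static Properties resolveAtPlanTime(Properties tableProps,
                                        Function<String, String> remoteReader) {
        Properties jobProps = new Properties();
        jobProps.putAll(tableProps);
        String url = tableProps.getProperty(SCHEMA_URL);
        if (url != null && tableProps.getProperty(SCHEMA_LITERAL) == null) {
            jobProps.setProperty(SCHEMA_LITERAL, remoteReader.apply(url));
        }
        return jobProps;
    }

    /** Task side: use the shipped literal if present, else fall back to a remote read. */
    static String schemaForTask(Properties jobProps,
                                Function<String, String> remoteReader) {
        String literal = jobProps.getProperty(SCHEMA_LITERAL);
        if (literal != null) {
            return literal;
        }
        return remoteReader.apply(jobProps.getProperty(SCHEMA_URL));
    }

    /** Counts remote reads for one planning step followed by the given number of mappers. */
    static int remoteReadsFor(int mappers) {
        AtomicInteger reads = new AtomicInteger();
        Function<String, String> reader = url -> {
            reads.incrementAndGet();
            return "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}";
        };

        Properties tableProps = new Properties();
        tableProps.setProperty(SCHEMA_URL, "hdfs:///schemas/example.avsc"); // hypothetical path

        Properties jobProps = resolveAtPlanTime(tableProps, reader);
        for (int i = 0; i < mappers; i++) {
            schemaForTask(jobProps, reader);
        }
        return reads.get();
    }

    public static void main(String[] args) {
        System.out.println(remoteReadsFor(5000)); // prints 1
    }
}
```

In this model the schema file is read once regardless of mapper count, at the cost of carrying the schema text in the per-job properties.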

      Attachments

        1. HIVE-14792.1.patch
          9 kB
          Mithun Radhakrishnan
        2. HIVE-14792.3.patch
          5 kB
          Mithun Radhakrishnan
        3. HIVE-14792.4.patch
          4 kB
          Aihua Xu
        4. HIVE-14792.5.patch
          16 kB
          Aihua Xu
        5. HIVE-14792.patch.addendum
          16 kB
          Aihua Xu

          People

            Assignee: Aihua Xu (aihuaxu)
            Reporter: Mithun Radhakrishnan (mithun)
            Votes: 1
            Watchers: 6
