      Description

      IMPALA-4172 changed the behavior of "REFRESH" so that it uses a new code path that reloads the file block metadata of the entire table rather than reusing the metadata of already loaded partitions. This was done on the assumption that metadata loading had been sped up enough that recomputing the block metadata of already loaded files would have very little impact. Based on our testing, this is clearly not the case, especially for very large tables. We need to revert this behavior.

          Activity

          bharathv bharath v added a comment -

          IMPALA-4840: Fix REFRESH performance regression.

          The fix for IMPALA-4172 introduced a regression in the
          performance of the REFRESH command. The regression
          stems from the fact that we reload the block metadata
          of every valid data file without considering whether it
          has changed since the last load. This causes unnecessary
          metadata loads for unchanged files and thus increases
          the runtime.

          The fix makes the refresh code path (and other operations
          that use the same code path, such as INSERT) reload the
          metadata of only the modified files by doing a
          listStatus() on the partition directory and checking the
          last modified time of each file. Without this patch, we relied
          on listFiles(), which fetched the block locations regardless of
          whether a file had changed and was significantly slower on
          unchanged tables. The initial/invalidate metadata load still
          fetches the block locations in bulk using listFiles(). A
          side effect of this change is that REFRESH no longer picks up
          block location changes after HDFS block rebalancing. For that
          case, we suggest using "invalidate metadata", which loads the
          metadata from scratch.

          Additionally, this commit enables the reuse of metadata during
          table refresh (which was disabled in IMPALA-4172) to avoid
          reloading metadata from HMS every time.

          Change-Id: I859b9fe93563ba886d0b5db6db42a14c88caada8
          Reviewed-on: http://gerrit.cloudera.org:8080/6009
          Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
          Tested-by: Impala Public Jenkins
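
The incremental refresh described in the commit message can be modeled roughly as follows. This is a minimal Python sketch of the idea only, not the actual Impala code (which is Java and uses the Hadoop FileSystem API). The names `FileStatus`, `FileDescriptor`, `refresh_partition`, and `load_blocks` are hypothetical and chosen for illustration. The `listing` argument stands in for the cheap result of listStatus(), and `load_blocks` stands in for the expensive block location fetch. Dropping cached entries for files that no longer appear in the listing is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FileStatus:
    """Cheap listing entry (hypothetical stand-in for a listStatus() result)."""
    name: str
    length: int
    mtime: int  # last modification time


@dataclass
class FileDescriptor:
    """Cached per-file metadata, including the expensive block locations."""
    status: FileStatus
    block_locations: List[str]


def refresh_partition(
    cached: Dict[str, FileDescriptor],
    listing: List[FileStatus],
    load_blocks: Callable[[FileStatus], List[str]],
) -> Tuple[Dict[str, FileDescriptor], int]:
    """Return the refreshed cache and the number of files that were reloaded.

    A file is reloaded only if it is new or its modification time differs
    from the cached one. Unchanged files keep their cached descriptor, so
    their block locations are never re-fetched.
    """
    refreshed: Dict[str, FileDescriptor] = {}
    reloaded = 0
    for status in listing:
        old = cached.get(status.name)
        if old is not None and old.status.mtime == status.mtime:
            # Unchanged since the last load: reuse the cached metadata.
            refreshed[status.name] = old
        else:
            # New or modified file: pay for the block location fetch.
            refreshed[status.name] = FileDescriptor(status, load_blocks(status))
            reloaded += 1
    # Files missing from the listing are not copied over (assumed deleted).
    return refreshed, reloaded
```

The sketch also shows the side effect mentioned in the commit message: if only the block locations of a file change while its modification time stays the same, `refresh_partition` keeps the stale cached locations, which is why a full reload ("invalidate metadata") is suggested after HDFS block rebalancing.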


            People

            • Assignee:
              bharathv bharath v
              Reporter:
              bharathv bharath v