  IMPALA / IMPALA-3885

Parquet files with multiple blocks cause remote reads

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Information Provided
    • Affects Version/s: Impala 2.7.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:
      None

      Description

      For parquet files with multiple blocks we schedule the scan ranges across all replicas of the blocks, aiming for local execution on the impalads running on the datanodes where the blocks are located. However, there seems to be a high number of remote reads in these scenarios.

      The scheduler makes local assignments:

      I0613 16:43:01.741288 36424 simple-scheduler.cc:605] Total remote scan volume = 0
      I0613 16:43:01.741441 36424 simple-scheduler.cc:607] Total local scan volume = 1426.95 GB
      I0613 16:43:01.741576 36424 simple-scheduler.cc:609] Total cached scan volume = 0
      

      However the profile shows remote scans:

               - RemoteScanRanges: 283 (283)
               - RowsRead: 139.36M (139355074)
               - RowsReturned: 239 (239)
               - RowsReturnedRate: 127.00 /sec
               - ScanRangesComplete: 304 (304)
      

      Somehow Impala seems to read data from the wrong datanodes, possibly reading everything from the node it read the footer from.

      Sailesh Mukil - We briefly talked about this before in person. Do you have time to look at this? If not, feel free to re-assign it to me.

        Issue Links

          Activity

          dhecht Dan Hecht added a comment -

          Are these the footer ranges (which we expect to be mostly remote)?

          lv Lars Volker added a comment -

          My suspicion was that we open the file once to read the footer, which makes an HDFS connection to the (probably remote) node holding the footer block, and that subsequent reads then pick the wrong replica host, somehow going through the footer replica. However, after thinking about it some more, this doesn't seem to make sense at all.

          sailesh Sailesh Mukil added a comment -

          (Copying my response from mail thread)

          Here is how the Parquet scanner currently works:

          • The frontend finds all blocks for all files associated with the table and creates splits. These splits are usually 1:1 with blocks, i.e. a split is a block. (A split is a logical block; in some very extreme special cases a split and a block may not be 1:1.)
          • The scheduler then looks at the splits and schedules them on different hosts with locality as one of the main factors. This means that usually a host would be assigned a block that is local to it.
          • The Parquet scanner on each backend receives the list of files and their corresponding splits, and the scanner's IssueInitialRanges() function issues a scan range for the footer for every split. This means that every split will read the footer once, be it remote or local, because every split needs the metadata before it can start scanning the actual data. (There are ways to make each backend read the footer only once, but a design decision was made over a year ago not to maintain global state for the footer across splits, trading extra compute for lower scanner complexity.)
          • Once this footer scan range is issued per split, the ProcessSplit() function reads the footer and then decides which row groups to read. It does this by checking whether a row group's mid-point falls within its assigned split. If there are row groups whose mid-points fall within its split, it scans them; otherwise it returns without doing any work.
            Note: Why the mid-point? Because if a row group's mid-point falls in a node's split, then even if ~49% of the row group crosses a block boundary into a remote block, that node is guaranteed that at least half of the row group is local to it. (A rough sketch of this check follows after this list.)
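
          A rough, simplified sketch of the mid-point check described above (the struct and
          function names here are illustrative, not the actual Impala source):

            #include <cstdint>
            #include <vector>

            // Byte range of one row group within the Parquet file.
            struct RowGroupRange { int64_t offset; int64_t length; };

            // Byte range of the split (HDFS block) assigned to this scan node.
            struct SplitRange { int64_t offset; int64_t length; };

            // Returns the indices of the row groups this split should scan. A row group
            // is scanned by the split that contains its mid-point, so across all splits
            // each row group is scanned exactly once.
            std::vector<int> RowGroupsForSplit(const std::vector<RowGroupRange>& row_groups,
                                               const SplitRange& split) {
              std::vector<int> to_scan;
              for (int i = 0; i < static_cast<int>(row_groups.size()); ++i) {
                int64_t mid_point = row_groups[i].offset + row_groups[i].length / 2;
                if (mid_point >= split.offset && mid_point < split.offset + split.length) {
                  to_scan.push_back(i);
                }
              }
              return to_scan;  // Empty => this split returns without doing any work.
            }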

          Now, the cases where I see remote reads happening are those where the row groups do not fall within their assigned splits. For example:

          • The entire Parquet file has only one row group, but has >1 block: this means that if there are 10 blocks (assume all blocks are on different nodes), there will be 10 splits but only one row group spanning all of them. So only the node that has been assigned the fifth split will do any scanning (because the mid-point of the single row group will most likely fall within the fifth split). That node will scan some data locally and the rest of the data remotely, and all other scanners will return without doing any work.

          So basically, if the file isn't written with row groups aligned to block boundaries, we will see a lot of remote reads for Parquet files.
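
          To make the example above concrete, here is what the sketch from earlier in this
          comment would print for a hypothetical 10-block file with a single row group
          (block and footer sizes are assumptions, purely for illustration; the snippet
          reuses RowGroupRange, SplitRange and RowGroupsForSplit from that sketch):

            #include <cstdio>

            int main() {
              // Hypothetical file: 10 blocks of 256 MB and one row group covering
              // everything except a small footer.
              const int64_t kBlockSize = 256LL * 1024 * 1024;
              std::vector<RowGroupRange> row_groups = {{0, 10 * kBlockSize - 1024}};

              for (int s = 0; s < 10; ++s) {
                SplitRange split{s * kBlockSize, kBlockSize};
                // Only s == 4 (the fifth split) gets a non-empty result: the row group's
                // mid-point (~5 * kBlockSize) falls just inside [4*kBlockSize, 5*kBlockSize).
                // The other nine splits do no work, and the fifth node ends up reading
                // roughly 90% of the row group's bytes remotely.
                printf("split %d scans %zu row group(s)\n", s,
                       RowGroupsForSplit(row_groups, split).size());
              }
              return 0;
            }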

          lv Lars Volker added a comment -

          Thanks Sailesh Mukil for the investigation.

          It sounds like there's not much we can do in the cases where blocks and row groups don't align. Should we add a separate counter for splits that ended up not scanning any data because the row group's mid-point was outside the split? Then we could issue a warning pointing out that the data is poorly laid out and queries are therefore less efficient.
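
          A minimal sketch of what such a counter might look like, assuming a counter API
          along the lines of Impala's RuntimeProfile; the counter name, member, and call
          sites are illustrative, not necessarily what IMPALA-3989 ends up implementing:

            // In the scanner's setup path (illustrative member and counter name):
            num_skipped_splits_counter_ =
                ADD_COUNTER(scan_node_->runtime_profile(), "SplitsWithNoRowGroup", TUnit::UNIT);

            // In ProcessSplit(), when no row group mid-point falls inside this split:
            if (row_groups_to_scan.empty()) {
              num_skipped_splits_counter_->Add(1);  // surfaces the mismatch in the query profile
              return Status::OK();                  // same as today: no data scanned for this split
            }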

          John Russell - Should we also add this behavior to the documentation?

          sailesh Sailesh Mukil added a comment -

          Lars Volker That sounds like a good idea, to display a warning message. I've opened IMPALA-3989 to track this.

          dhecht Dan Hecht added a comment -

          Sailesh Mukil, Lars Volker, what's remaining here besides IMPALA-3989?

          lv Lars Volker added a comment -

          Dan Hecht, John Russell, we need to decide whether we want to make changes to the documentation to describe under what circumstances this can happen and how it can be mitigated. Other than that, I can't think of anything we can do here.

          sailesh Sailesh Mukil added a comment -

          + John Russell
          dhecht Dan Hecht added a comment -

          The doc changes are tracked elsewhere.


            People

            • Assignee:
              sailesh Sailesh Mukil
            • Reporter:
              lv Lars Volker
            • Votes:
              0
            • Watchers:
              6
