Hadoop HDFS
HDFS-4697

short-circuit reads do not honor readahead settings

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.3-alpha
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None

      Description

      Neither the new nor the legacy short-circuit read implementation honors dfs.datanode.readahead.bytes. This can result in scenarios where non-short-circuit reads are faster for long sequential reads, simply because they are doing readahead and short-circuit reads are not. We should do readahead in both cases when it is configured.
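
      For concreteness, a minimal, hypothetical sketch of supplying the property through the standard org.apache.hadoop.conf.Configuration API; the 4 MB value is purely illustrative and not taken from this issue.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ReadaheadConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Property named in this issue. It is honored on the DataNode's remote
    // read path, but (per this report) ignored by short-circuit reads.
    // The 4 MB value is only an example, not a recommendation.
    conf.setLong("dfs.datanode.readahead.bytes", 4L * 1024 * 1024);
    System.out.println("readahead bytes = "
        + conf.getLong("dfs.datanode.readahead.bytes", 0));
  }
}
{code}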

        Issue Links

          Activity

          Liang Xie added a comment -

          If that's true, then sequential reads in HBase, e.g. large scans, are suboptimal, right? Nice finding!

          Kihwal Lee added a comment -

          The OS-level readahead will happen even without this for most sequential reads. The fadvise done by manageOsCache() helps trigger OS readahead, which may not happen for slow, small reads. The gaps between reads are worse for remote reads, and that's where manageOsCache() helps most. It does not, however, prolong the lifetime of cached data, so if the reader is slow and the memory pressure is high, the data may get thrown away before the reader gets to it. In such cases, it may actually lower the overall system throughput by causing extra reads.

          I agree that this needs to be fixed, but I am also curious how much performance improvement can be obtained for short-circuit reads. If we have important use cases that require precise control of caching and disk activity, aio + direct I/O can be used. What are the common and performance-critical access patterns of HBase? All I know from the old days is that it does a lot of random reads of about 64KB.
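
          As an aside, the idea behind manual readahead can be illustrated with a small, self-contained Java sketch: warm the region the reader is about to need in a background thread so the OS pulls it into the page cache. This is not Hadoop's manageOsCache() path, which issues posix_fadvise(POSIX_FADV_WILLNEED) hints through native code; it only demonstrates the concept, and the 4 MB readahead size is an arbitrary example.

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ManualReadaheadSketch {
  private static final int READAHEAD_BYTES = 4 * 1024 * 1024; // illustrative only

  public static void main(String[] args) throws Exception {
    String path = args[0];
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      byte[] buf = new byte[64 * 1024];
      long pos = 0;
      long nextHint = 0;
      long len = file.length();
      while (pos < len) {
        // When the reader is about to catch up with the last hinted region,
        // warm the next region in the background.
        if (pos + buf.length >= nextHint) {
          final long hintOffset = nextHint;
          pool.submit(() -> warm(path, hintOffset, READAHEAD_BYTES));
          nextHint = hintOffset + READAHEAD_BYTES;
        }
        int n = file.read(buf);
        if (n < 0) {
          break;
        }
        pos += n;
        // ... process buf[0..n) ...
      }
    } finally {
      pool.shutdown();
    }
  }

  // Best-effort: read and discard the region so the kernel caches it.
  private static void warm(String path, long offset, int length) {
    byte[] scratch = new byte[64 * 1024];
    try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
      f.seek(offset);
      long remaining = length;
      while (remaining > 0) {
        int n = f.read(scratch, 0, (int) Math.min(scratch.length, remaining));
        if (n < 0) {
          break;
        }
        remaining -= n;
      }
    } catch (IOException ignored) {
      // A failed hint is harmless.
    }
  }
}
{code}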

          Colin Patrick McCabe added a comment -

          It is certainly true that the OS does some readahead of its own. However, we found that doing manual readahead provided a performance boost in many scenarios, especially ones involving long sequential reads. That's why the Datanode currently does readahead by default. These settings should be honored when using short-circuit local reads, so that the behavior is consistent and configurable.

          Most of HBase's reads are random reads, which readahead will not benefit. The current readahead code in the DN does not do readahead when small, random reads are being performed, and we should follow suit in BlockReaderLocal. I do think HBase will see some benefit when doing long scans and compactions.

          As you mentioned, it's true that readahead is not always a win when memory pressure is extremely high. However, when memory pressure is so high that sections that got read ahead have to be purged prior to use, the system usually has other problems that make it unstable and essentially unusable, like the OOM killer triggering.
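
          A rough, hypothetical sketch of the "readahead only for sequential access" heuristic described above; the class and method names are illustrative and do not correspond to Hadoop's actual BlockReaderLocal code. It tracks the expected next offset and only returns a readahead start position when reads are contiguous.

{code:java}
// Hypothetical policy: small random reads pay no readahead cost,
// while contiguous reads periodically trigger a readahead hint.
public class SequentialReadaheadPolicy {
  private final long readaheadBytes;      // e.g. dfs.datanode.readahead.bytes
  private long expectedNextOffset = -1;   // where a sequential read would start
  private long readaheadCursor = 0;       // end of the last hinted region

  public SequentialReadaheadPolicy(long readaheadBytes) {
    this.readaheadBytes = readaheadBytes;
  }

  /**
   * Called before each read. Returns the offset from which a readahead hint
   * should be issued, or -1 if no hint is warranted.
   */
  public long onRead(long offset, int length) {
    boolean sequential = (offset == expectedNextOffset);
    expectedNextOffset = offset + length;
    if (readaheadBytes <= 0 || !sequential) {
      // Random or first read: skip readahead, mirroring the behavior
      // described above for small random reads.
      readaheadCursor = expectedNextOffset;
      return -1;
    }
    if (expectedNextOffset < readaheadCursor) {
      return -1;  // still inside the region covered by the last hint
    }
    long hintFrom = readaheadCursor;
    readaheadCursor = hintFrom + readaheadBytes;
    return hintFrom;
  }
}
{code}

          A caller in this style would invoke onRead() before each read and, whenever it returns a non-negative offset, issue a readahead hint (for example via fadvise) covering the next readaheadBytes bytes from that offset.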

          Andrew Wang added a comment -

          Colin Patrick McCabe, I think this was fixed by the BlockReaderLocal (BRL) rewrite in HDFS-5634. Are we good to close this out?


            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Colin Patrick McCabe
            • Votes: 0
            • Watchers: 11

              Dates

              • Created:
              • Updated:

                Development