  1. Hadoop Common
  2. HADOOP-17222

Create socket address leveraging URI cache


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: common, hdfs-client
    • Environment:
      HBase version: 2.1.0
      JVM: -Xmx2g -Xms2g
      Hadoop HDFS version: 2.7.4
      Disk: SSD
      OS: CentOS Linux release 7.4.1708 (Core)
      JMH Benchmark: @Fork(value = 1), @Warmup(iterations = 300), @Measurement(iterations = 300)

    • Hadoop Flags: Reviewed
    • Release Note:
      The DFS client can use the newly added URI cache when creating socket addresses for read operations. The cache is disabled by default. When it is enabled, socket address creation reuses a cached URI object keyed by host:port, which reduces how often URI objects are created.

      To enable it, set the following config key to true:
      <property>
        <name>dfs.client.read.uri.cache.enabled</name>
        <value>true</value>
      </property>

    Description

      Note: The benefit is not limited to the HDFS client. All callers of NetUtils.createSocketAddr benefit; the HDFS client is used here only as an example.

      The HDFS client selects the best DataNode (DN) for each HDFS block. The method call stack is:

      DFSInputStream.chooseDataNode -> getBestNodeDNAddrPair -> NetUtils.createSocketAddr

      NetUtils.createSocketAddr creates the corresponding InetSocketAddress from the host and port. The method contains some relatively expensive operations, for example URI.create(target), so each call to NetUtils.createSocketAddr is comparatively costly.
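
To make the cost concrete, here is a minimal sketch of the kind of parsing step described above. It is a simplified, hypothetical stand-in (the class name SocketAddrSketch, the placeholder scheme, and the example host are assumptions), not the Hadoop source. It uses an unresolved address so that it runs without DNS.

```java
import java.net.InetSocketAddress;
import java.net.URI;

/** Hypothetical, simplified stand-in for the parsing step; not the Hadoop source. */
class SocketAddrSketch {

    /** Parses "host:port" (or "scheme://host:port") into an unresolved address. */
    static InetSocketAddress createSocketAddr(String target) {
        // java.net.URI only recognizes host and port after a scheme,
        // so a placeholder scheme is prepended when none is present.
        boolean hasScheme = target.contains("://");
        URI uri = URI.create(hasScheme ? target : "dummy://" + target);
        String host = uri.getHost();
        int port = uri.getPort();
        if (host == null || port < 0) {
            throw new IllegalArgumentException("Not a host:port pair: " + target);
        }
        // createUnresolved skips the DNS lookup so the sketch runs offline.
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = createSocketAddr("dn1.example.com:50010");
        System.out.println(addr.getHostString() + ":" + addr.getPort());
    }
}
```

Every call parses the full string again and allocates a new URI, which is the repeated work the issue aims to avoid.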

      The following is my performance report, which is based on HBase calling HDFS. HBase accesses HDFS at high frequency because its read operations often fetch a small DataBlock (about 64 KB) rather than an entire HFile. Under this high-frequency access, the time spent in NetUtils.createSocketAddr becomes significant.

      Test Environment:

      HBase version: 2.1.0
      JVM: -Xmx2g -Xms2g
      Hadoop HDFS version: 2.7.4
      Disk: SSD
      OS: CentOS Linux release 7.4.1708 (Core)
      JMH Benchmark: @Fork(value = 1), @Warmup(iterations = 300), @Measurement(iterations = 300)

      Before Optimization FlameGraph:

      In the figure, DFSInputStream.getBestNodeDNAddrPair accounts for 4.86% of total CPU time, and URI creation makes up a large share of that.

      Optimization ideas:

      NetUtils.createSocketAddr creates an InetSocketAddress from a host and port, so we can add a cache for InetSocketAddress objects. The cache key is the host and port, and the value is the InetSocketAddress.
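
The idea above can be sketched as a map from a host:port key to an InetSocketAddress. This is only an illustration under assumptions: the class and method names are hypothetical, and the addresses are left unresolved so that the example runs without DNS.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of the proposed cache; names are illustrative, not from Hadoop. */
class SocketAddrCacheSketch {

    // Key: "host:port" string. Value: the address built for that key.
    private static final ConcurrentMap<String, InetSocketAddress> CACHE =
        new ConcurrentHashMap<>();

    static InetSocketAddress getOrCreate(String host, int port) {
        String key = host + ":" + port;
        // computeIfAbsent builds the address at most once per key, even under concurrency.
        return CACHE.computeIfAbsent(key,
            k -> InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        InetSocketAddress first = getOrCreate("dn1.example.com", 50010);
        InetSocketAddress second = getOrCreate("dn1.example.com", 50010);
        // The second call is served from the cache, so both references are the same object.
        System.out.println(first == second);
    }
}
```

A cache of this shape never evicts entries, and caching a resolved address would also pin the DNS result for the lifetime of the entry, so a real implementation has to weigh both points.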

      After Optimization FlameGraph:

      In the figure, DFSInputStream.getBestNodeDNAddrPair accounts for 0.54% of total CPU time. A ConcurrentHashMap serves as the cache, and lookups go through ConcurrentHashMap.get(). The CPU share of DFSInputStream.getBestNodeDNAddrPair drops from 4.86% to 0.54%.
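
The cached lookup described above can be sketched as follows. This variant follows the release note in caching the parsed URI, rather than the InetSocketAddress, under the host:port string and behind an on/off flag. All names here (UriCacheSketch, parse, toAddr, the useCache parameter, the placeholder scheme) are assumptions for illustration and are not taken from the Hadoop patch.

```java
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a URI cache keyed by host:port; not the Hadoop source. */
class UriCacheSketch {

    private static final ConcurrentMap<String, URI> URI_CACHE = new ConcurrentHashMap<>();

    static URI parse(String target, boolean useCache) {
        if (!useCache) {
            // Cache disabled: parse on every call, as before the optimization.
            return URI.create("dummy://" + target);
        }
        // Fast path: a plain get() avoids re-parsing targets seen before.
        URI uri = URI_CACHE.get(target);
        if (uri == null) {
            uri = URI.create("dummy://" + target);
            // putIfAbsent keeps the first stored instance if another thread got there first.
            URI previous = URI_CACHE.putIfAbsent(target, uri);
            if (previous != null) {
                uri = previous;
            }
        }
        return uri;
    }

    static InetSocketAddress toAddr(String target, boolean useCache) {
        URI uri = parse(target, useCache);
        // Only the parsed URI is cached; the address object is built on every call.
        return InetSocketAddress.createUnresolved(uri.getHost(), uri.getPort());
    }

    public static void main(String[] args) {
        String target = "dn1.example.com:50010";
        System.out.println(parse(target, true) == parse(target, true));
        System.out.println(toAddr(target, true));
    }
}
```

Caching only the URI keeps the expensive string parsing out of the hot path while still building a fresh address object per call, which leaves hostname resolution behavior to the code that constructs the address.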

      Original FlameGraph link:

      Before Optimization

      After Optimization FlameGraph

      Attachments

        1. After optimization.svg
          382 kB
          Rui Fan
        2. After Optimization remark.png
          334 kB
          Rui Fan
        3. Before optimization.svg
          470 kB
          Rui Fan
        4. Before Optimization remark.png
          366 kB
          Rui Fan


          People

            Assignee: Rui Fan (fanrui)
            Reporter: Rui Fan (fanrui)
            Votes: 0
            Watchers: 5


              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 5h 50m