Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-3672

Performance is poor when multiple queries occur in short period

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: v2.5.0
    • Fix Version/s: v2.5.2
    • Component/s: Query Engine
    • Labels:
    • Environment:
      CentOS 6.7, HBase 1.2.0+cdh5.14.2+456
    • Flags:
      Patch

      Description

      Hi, Kylin Team

      We found one Kylin performance bug during performance tuning for our BI report integrate with Kylin.

       

      Background

      Our BI report show customer usage report to enterprise customers, provide 15 usage charts in report page.

      Each chart need send API request to Kylin with different SQLs. So it means for one user, it will trigger 15 API calls(by JDBC) to Kylin.

      For our product scale, we need support at least 20 users to review the report at same time for each Kylin query node.

      So it means each Kylin node should be able to handle 15 * 20 = 300 queries  per second.

       

      Performance Report

      To reduce the network impact. We built up Kylin cluster and testing machine in the same network with Hadoop system.

      We use gatling and Jmeter tools to do several round testing, result as follow.

       

      Thread Handled Queries (in 60 seconds) Handled Queries (per second) Mean Response Time
      (ms)
      1 773 13 77
      15 3245 54 279
      25 3844 64 390
      50 4912 82 612
      75 5405 90 841
      100 5436 91 1108
      150 5434 91 1688

       

      And draw the trend chart as follow:

       

      Conclusion

      From the trend, when the thread count reach 75, the handled queries per second reaches peak data 90, and cannot improved by increase the thread count.

      Each Kylin query engine can handle 90 queries per second, it means only support 90/15 = 6 users to review report page at same time.

      Even we setup 3 query nodes, can extend to 18 users at same time, this performance capacity cannot meet our business requirement.

       

      Analyze

      From test result, response for one thread is fast, but as the thread increase, throughput of Kylin not increased as we expected.

      We have full code review for Kylin query engine, and use Jstack and JProfile to do analyze, found the root cause for this performance bottleneck.

      This is one regression bug introduced by new feature involved one year before.

      With bug fixing, one Kylin node can handle 350+ queries per second. Submit this bug for contribute patch to Kylin.

      Jstack Log Analyze

      We use Jstack to capture thread info during performance testing. Already attach one of them 'jstackBeforeBugFix.log'.

      From the log, we can found that 

      One thread locked at sun.misc.URLClassPath.getNextLoader. TID is 0x000000048007a180

       
      "Query e9c44a2d-6226-ff3b-f984-ce8489107d79-3425" #3425 daemon prio=5 os_prio=0 tid=0x000000000472b000 nid=0x1433 waiting }}{{for monitor entry [}}\\\{{0x00007f272e40d000}}{{]
       
         java.lang.Thread.State: BLOCKED (on object monitor)
       
          at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:469)
       
          - locked <0x000000048007a180> (a sun.misc.URLClassPath)
       
          at sun.misc.URLClassPath.findResource(URLClassPath.java:214)
       
          at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
       
          at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
       
          at java.security.AccessController.doPrivileged(Native Method)
       
          at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
       
          at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
       
          at java.lang.ClassLoader.getResource(ClassLoader.java:1091)
       
          at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:1666)
       
          at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:338)
       
      43 threads waiting to lock <0x000000048007a180
       
      "Query f1f0bbec-a3f7-04b2-1ac6-fd3e03a0232d-4002" #4002 daemon prio=5 os_prio=0 tid=0x00007f27e71e7800 nid=0x1676 waiting }}{{for monitor entry [}}\\\{{0x00007f279f503000}}{{]
         java.lang.Thread.State: BLOCKED (on object monitor)
          at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:469)
          - waiting to lock <0x000000048007a180> (a sun.misc.URLClassPath)
          at sun.misc.URLClassPath.findResource(URLClassPath.java:214)
          at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
          at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
          at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
          at java.lang.ClassLoader.getResource(ClassLoader.java:1091)
          at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:1666)
          at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:338)

       

      After dig into code, we can found that when query engine build request to Hbase CoProcessor, it will export Kylin properties as String, this issue caused by relative logic.

      Code Analyze

      In KylinConfig.class

      function private static OrderedProperties buildSiteOrderedProps() 

      1.  For each thread, will call getResouce to get "kylin-defaults.properties"

      // 1. load default configurations from classpath.
      // we have a kylin-defaults.properties in kylin/core-common/src/main/resources
      URL resource = Thread.currentThread().getContextClassLoader().getResource("kylin-defaults.properties");
      Preconditions.checkNotNull(resource);
      logger.info("Loading kylin-defaults.properties from {}", resource.getPath());
      OrderedProperties orderedProperties = new OrderedProperties();
      loadPropertiesFromInputStream(resource.openStream(), orderedProperties);

      2. Exist unused logics to call 10 times to getResouce for  "kylin-defaults" + + ".properties", thread LOCKED occurs here.

      for (int i = 0; i < 10; i++) {
          String fileName = "kylin-defaults" + + ".properties";
       URL additionalResource = Thread.currentThread().getContextClassLoader().getResource(fileName);
       if (additionalResource != null) {
              logger.info("Loading {} from {} ", fileName, additionalResource.getPath());
       loadPropertiesFromInputStream(additionalResource.openStream(), orderedProperties);
       }
      }

      Those logics involved in 2017/6/7

      What's the solution

      1. Because kylin-defaults.properties is built in kylin-core-common-xxxx.jar, no need to getResource for it every time.

      Move it into method getInstanceFromEnv(). Because it only need load once.

      And not to impact CoProcessor logic. In CoProcessor class CubeVisitService, it will use KylinConfig as util class to generate config object from String, it's dangerous to involve any logic of load properties file with CoProcessor due to there is no Kylin.properties in its package, so it will not call  KylinConfig.getInstanceFromEnv().

      buildDefaultOrderedProperties();

      2. Suggest remove unused logic to getResource for "kylin-defaults" + ( i ) + ".properties". To reduce the risk, just keep those logic in the static block, we'd better have a discuss whether we need this logic, because there is no ' kylin-defaults" + ( i ) + ".properties' exist from those logic checked in.

      Performance Testing Result After Bug Fix

      Users Handled Queries (in 60 seconds) Handled Queries (per second) Mean Response Time
      (ms)
      1 2451 41 12
      15 12422 207 37
      25 15600 260 56
      50 18481 308 129
      75 21055 351 136
      100 24036 400 251
      150 28014 467 277

      Trend Chart

      Kylin Server Info

      Role vCPU Memory(GB) Volume(GB)
      Query Engine 16 (2.4G) 128 1024

       

      Kylin Package

      apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)

       

      Query SQL

      SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no PreparedStatement cache, performance will be more worse). Filter will hit all 6 segments.

       

      Cube Info

      Segment Number: 6 Total Size: 47 MB

       

      Segment: 20180101000000_20181011000000

      Start Time: 2018-01-01 00:00:00

      End Time: 2018-10-11 00:00:00

      Source Count: 351934019

      HBase Table: KYLIN_69Q9A850DZ

      Region Count: 1

      Size: 47 MB

       

      Segment: 20181011000000_20181012000000

      Start Time: 2018-10-11 00:00:00

      End Time: 2018-10-12 00:00:00

      Source Count: 7085485

      HBase Table: KYLIN_ZCT39S8FUA

      Region Count: 1

      Size: less than 1 MB

       

       

      Segment: 20181012000000_20181013000000

      Start Time: 2018-10-12 00:00:00

      End Time: 2018-10-13 00:00:00

      Source Count: 5534968

      HBase Table: KYLIN_RKRRLA958T

      Region Count: 1

      Size: less than 1 MB

       

      Segment: 20181013000000_20181014000000

      Start Time: 2018-10-13 00:00:00

      End Time: 2018-10-14 00:00:00

      Source Count: 242856

      HBase Table: KYLIN_Q6DKCONN81

      Region Count: 1

      Size: less than 1 MB

       

      Segment: 20181014000000_20181015000000

      Start Time: 2018-10-14 00:00:00

      End Time: 2018-10-15 00:00:00

      Source Count: 236122

      HBase Table: KYLIN_JY4WQD2MJH

      Region Count: 1

      Size: less than 1 MB

       

      Segment: 20181015000000_20181016000000

      Start Time: 2018-10-15 00:00:00

      End Time: 2018-10-16 00:00:00

      Source Count: 6172353

      HBase Table: KYLIN_E2ELLINV22

      Region Count: 1

      Size: less than 1 MB

       

      HBase Region Server 

      Count: 6

      hbase.regionserver.handler.count: 120.

      Not blocked in CoProcessor RPC call.

        Attachments

        1. TrendChartBeforeFix.png
          145 kB
          Zongwei Li
        2. jstackBeforeBugFix.log
          3.13 MB
          Zongwei Li
        3. codeChangedCausedThisBug.png
          23 kB
          Zongwei Li
        4. TrendChartAfterFix.png
          47 kB
          Zongwei Li
        5. KYLIN-3672.master.002.patch
          9 kB
          Zongwei Li

          Activity

            People

            • Assignee:
              zonli Zongwei Li
              Reporter:
              zonli Zongwei Li
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: