Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-740

Slowness with many IN() values

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.1, v1.4.0
    • Component/s: None
    • Labels:
      None
    • Sprint:
      Sprint 43, Sprint 44, Sprint 45, Sprint 46, Sprint 47, Sprint 48

      Description

      [from dev mail list]

      Huang Hua <huanghua@mininglamp.com>

      We were testing the query efficiency with some sql containing 'IN' keyword
      on a relatively big table which holds 22+ Million records.

      It seems to us that sql with a number of items in 'IN' show very poor
      performance in terms of running time.

      The fact table used has three columns: themonth, id, trans_at, and we
      created the cube with 2 dimensions and 1 measure:

      Dimensions: themonth, id

      Measure: sum(trans_at)

      Basically our queries are all similar to the one:

      select themonth, id, sum(trans_at) from the_table where id IN (id1, id2,
      id3, .) group by themonth, id limit 10

      When the number of ids reaches about 30-50, the response time becomes
      considerable long in terms of minutes. By the way, we are running kylin on
      12 nodes with 2 name nodes and 10 data nodes.

      Besides, we tried to break the above sql with 'IN' keyword into several
      sub-sql and merge the results of each sub-sql to get the final result like
      the following:

      select themonth, id, sum(trans_at) from the_table where id = id1 group by
      themonth, id limit 10

      select themonth, id, sum(trans_at) from the_table where id = id2 group by
      themonth, id limit 10

      select themonth, id, sum(trans_at) from the_table where id = id3 group by
      themonth, id limit 10

      .

      Surprisedly, the total running time of those sub-sql is much less than the
      running time of the orginal sql with 'IN' keyword.

      Initially I thought the backend query engine should handle 'IN' keyword in
      the similar way as the individual sql with '=' keyword, but it seems not.

      Can anybody provide any thoughts regarding this? Any ideas on how to tune
      the queries containing 'IN' keyword?

        Attachments

          Activity

            People

            • Assignee:
              mahongbin Hongbin Ma
              Reporter:
              liyang.gmt8@gmail.com liyang
            • Votes:
              3 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: