Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-740

Slowness with many IN() values

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • v1.1, v1.4.0
    • None
    • None
    • Sprint 43, Sprint 44, Sprint 45, Sprint 46, Sprint 47, Sprint 48

    Description

      [from dev mail list]

      Huang Hua <huanghua@mininglamp.com>

      We were testing the query efficiency with some sql containing 'IN' keyword
      on a relatively big table which holds 22+ Million records.

      It seems to us that sql with a number of items in 'IN' show very poor
      performance in terms of running time.

      The fact table used has three columns: themonth, id, trans_at, and we
      created the cube with 2 dimensions and 1 measure:

      Dimensions: themonth, id

      Measure: sum(trans_at)

      Basically our queries are all similar to the one:

      select themonth, id, sum(trans_at) from the_table where id IN (id1, id2,
      id3, .) group by themonth, id limit 10

      When the number of ids reaches about 30-50, the response time becomes
      considerable long in terms of minutes. By the way, we are running kylin on
      12 nodes with 2 name nodes and 10 data nodes.

      Besides, we tried to break the above sql with 'IN' keyword into several
      sub-sql and merge the results of each sub-sql to get the final result like
      the following:

      select themonth, id, sum(trans_at) from the_table where id = id1 group by
      themonth, id limit 10

      select themonth, id, sum(trans_at) from the_table where id = id2 group by
      themonth, id limit 10

      select themonth, id, sum(trans_at) from the_table where id = id3 group by
      themonth, id limit 10

      .

      Surprisedly, the total running time of those sub-sql is much less than the
      running time of the orginal sql with 'IN' keyword.

      Initially I thought the backend query engine should handle 'IN' keyword in
      the similar way as the individual sql with '=' keyword, but it seems not.

      Can anybody provide any thoughts regarding this? Any ideas on how to tune
      the queries containing 'IN' keyword?

      Attachments

        Activity

          People

            mahongbin Hongbin Ma
            liyang.gmt8@gmail.com liyang
            Votes:
            3 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: