Kylin / KYLIN-5153

Big in-list query causes slow performance


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: v4.0.1
    • Fix Version/s: None
    • Component/s: Query Engine
    • Labels: None

    Description

      select SELLER_ID,sum(PRICE) from KYLIN_SALES where SELLER_ID in (10000001,10000002,10000003,10000004,10000005,10000006)  GROUP BY SELLER_ID 

      Currently, the above SQL is converted into a Spark physical plan like this:

      Project [2#122L AS F__KYLIN_SALES_SELLER_ID__1_4392b2b0__0#128L, 5#124 AS F__SUM_PRICE__1_4392b2b0__2#130]
      +- Filter ((((((2#122L = 10000001) || (2#122L = 10000002)) || (2#122L = 10000003)) || (2#122L = 10000004)) || (2#122L = 10000005)) || (2#122L = 10000006))
         +- FileScan parquet [2#122L,5#124] Batched: false, Format: Parquet, Location: FilePruner[file:/Users/hujiahua/work/project/yz-kylin/examples/test_case_data/sample_local/defaul..., PartitionFilters: [], PushedFilters: [Or(Or(Or(Or(Or(EqualTo(2,10000001),EqualTo(2,10000002)),EqualTo(2,10000003)),EqualTo(2,10000004)..., ReadSchema: struct<2:bigint,5:decimal(29,4)> 

      The IN-LIST expression is always converted into a chain of OR expressions. When the list is relatively small, this works fine. But when the list grows larger (1000+ values in our production case), it causes performance issues (the response time exceeded 10 seconds). Too many OR expressions make Spark spend too much time in the plan optimization phase and the code generation phase.
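      For illustration only, below is a minimal Spark (Scala) sketch that contrasts the OR-chain form shown in the plan above with a single IN expression. It is not Kylin's actual converter code; the column names SELLER_ID/PRICE and the ID list come from the example query, while the object name, SparkSession setup, and toy data are hypothetical.

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, sum}

      object InListVsOrChain {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("in-list-vs-or-chain")
            .master("local[*]")
            .getOrCreate()
          import spark.implicits._

          // Toy stand-in for KYLIN_SALES with only the two columns used in the query.
          val kylinSales = Seq(
            (10000001L, 12.5), (10000002L, 7.0), (10000009L, 3.3)
          ).toDF("SELLER_ID", "PRICE")

          val ids = Seq(10000001L, 10000002L, 10000003L, 10000004L, 10000005L, 10000006L)

          // OR-chain form, matching the Filter shown in the plan above:
          // one EqualTo per value folded with Or, so the expression tree
          // (and the generated code) grows linearly with the list size.
          val orChain = ids.map(id => col("SELLER_ID") === id).reduce(_ || _)

          // IN form: a single In expression; Spark rewrites it to InSet
          // (a hash-set lookup) once the list size exceeds
          // spark.sql.optimizer.inSetConversionThreshold (default 10).
          val inList = col("SELLER_ID").isin(ids: _*)

          kylinSales.filter(orChain).groupBy("SELLER_ID").agg(sum("PRICE")).explain()
          kylinSales.filter(inList).groupBy("SELLER_ID").agg(sum("PRICE")).explain()

          spark.stop()
        }
      }

      Comparing the two explain() outputs should show the first plan carrying the nested Or tree while the second carries a compact In/InSet predicate, which is the kind of plan this improvement is asking for.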

       


          People

            Assignee: Unassigned
            Reporter: sleep1661 hujiahua
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: