Apache Drill / DRILL-5502

Parallelized external sort is slower compared to the single fragment scenario on some data sets

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • 1.10.0
    • None
    • None

    Description

      git.commit.id.abbrev=1e0a14c

      The query below runs in a single fragment and completes in about 14 minutes (832.7 seconds):

      ALTER SESSION SET `exec.sort.disable_managed` = false;
      alter session set `planner.width.max_per_node` = 1;
      alter session set `planner.memory.max_query_memory_per_node` = 62600000;
      alter session set `planner.width.max_per_query` = 17;
      select count(*) from (select * from dfs.`/drill/testdata/resource-manager/5kwidecolumns_500k.tbl` order by columns[0]) d where d.columns[0] = '4041054511';
      +---------+
      | EXPR$0  |
      +---------+
      | 0       |
      +---------+
      1 row selected (832.705 seconds)
      

      Next, I increased the parallelization (`planner.width.max_per_node`) to 10 and also increased the memory allocated to the sort tenfold, so that each individual fragment still gets a similar amount of memory. In this case, however, the query takes about 31 minutes (1845.5 seconds) to complete, more than twice as long as the single-fragment run, which is unexpected:

      ALTER SESSION SET `exec.sort.disable_managed` = false;
      alter session set `planner.width.max_per_node` = 10;
      alter session set `planner.memory.max_query_memory_per_node` = 626000000;
      alter session set `planner.width.max_per_query` = 17;
      select count(*) from (select * from dfs.`/drill/testdata/resource-manager/5kwidecolumns_500k.tbl` order by columns[0]) d where d.columns[0] = '4041054511';
      +---------+
      | EXPR$0  |
      +---------+
      | 0       |
      +---------+
      1 row selected (1845.508 seconds)
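
The reasoning that each fragment ends up with a similar amount of memory can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the original report. It assumes the per-node budget (`planner.memory.max_query_memory_per_node`) is split evenly across the fragments allowed on the node (`planner.width.max_per_node`) and that the sort is the only memory-buffering operator in the query. The helper name `memory_per_fragment` and the dictionary layout are hypothetical.

```python
# Settings and timings copied from the two runs in this report.
runs = {
    "single fragment": {
        "width_per_node": 1,
        "max_query_memory_per_node": 62_600_000,
        "seconds": 832.705,
    },
    "ten fragments": {
        "width_per_node": 10,
        "max_query_memory_per_node": 626_000_000,
        "seconds": 1845.508,
    },
}


def memory_per_fragment(run):
    # Assumption: the per-node budget is divided evenly among the
    # fragments on the node, with one sort per fragment.
    return run["max_query_memory_per_node"] // run["width_per_node"]


for name, run in runs.items():
    minutes = run["seconds"] / 60
    print(f"{name}: {memory_per_fragment(run):,} bytes per fragment, {minutes:.1f} min")

# Ratio of the parallel run's wall-clock time to the single-fragment run's.
slowdown = runs["ten fragments"]["seconds"] / runs["single fragment"]["seconds"]
print(f"slowdown: {slowdown:.2f}x")
```

Under that assumption both runs give each fragment the same sort budget, so the difference in run time is not explained by less memory per fragment.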
      

      My data set contains wide columns (5k chars wide). I will try to reproduce this with a data set where the column width is < 256 bytes.

      I have attached the profile and log file for both scenarios. The data set is too large to attach to a JIRA.

      Attachments

        1. multiple_fragments.log
          13.81 MB
          Rahul Kumar Challapalli
        2. multiple_fragments.sys.drill
          38 kB
          Rahul Kumar Challapalli
        3. single_fragment.log
          2.46 MB
          Rahul Kumar Challapalli
        4. single_fragment.sys.drill
          13 kB
          Rahul Kumar Challapalli

          People

            Assignee: Paul Rogers (paul-rogers)
            Reporter: Rahul Kumar Challapalli (rkins)