Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7720

Issue observed in performance of UNION ALL between Parquet and DB query

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17.0
    • Fix Version/s: None
    • Component/s: Functions - Drill
    • Labels:
      None

      Description

      Consider the below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets (Parquet + DB).

       

      Scenario 1- Parquet UNION ALL Parquet

      Individual execution time of 1st query - 5 secs

      Individual execution time of 2nd query - 5 secs

      UNION ALL of both queries execution time - 10 secs

       

      Scenario 2 - DB query UNION ALL DB query

      Individual execution time of 1st query - 5 secs

      Individual execution time of 2nd query - 5 secs

      UNION ALL of both queries execution time - 10 secs

       

      Scenario 3 - Parquet UNION ALL DB query

      Individual execution time of 1st query - 5 secs

      Individual execution time of 2nd query - 1 sec

      UNION ALL execution time - 20 secs

      Ideally the execution time should not be more than 6 secs.

       

      Config-

      HEAP memory - 16 GB

      DRILL_MAX_DIRECT_MEMORY - 32 GB

      2 Drillbits

       

      Observation-

      Observed that the query is distributed in 2 NODES when we are executing individual query or executing UNION ALL between same type datasets. But query is executing only on 1 NODE when we are executing UNION ALL between 2 types datasets (like Parquet UNION ALL DB). The Union query is not being parallelized i.e. split into multiple 'Minor Fragments'

       

      Storage-

      Storage is HDFS.

       

      Parquet file size - 849 MB

       

      Nature of query-

      Both Parquet and DB query have some filter criteria. Those doesn't have sort or join.

       

      Time taken-

        SCAN Total
      Parquet 2.018s 5.419 sec
      DB 0.146s 0.257 sec
      Parquet UNION ALL DB 15.632s 20.729 sec

      The query we used-
      SELECT column1, column2,...column25
      FROM    dfs.`root`.`Parquet`
      WHERE column1 = <> AND column2 = <> AND column3 = <>
      UNION ALL
      SELECT  column1, column2,...column25
      FROM    db.`Database`
      WHERE column1 = <> AND column2 = <> AND column3 = <>
       
      Below is the Plan after redacted names-
      00-00 Screen
      00-01 Project(Column1=[$0],...,Column25=[$24])
      00-02 UnionExchange
      01-01 UnionAll(all=[true])
      01-03 Project (Column1=[$0],...,Column25=[$24])
      01-05 SelectionVectorRemover
      01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
      01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>, numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false, filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
      01-02  Project(Column1=[$0],...,Column25=[$24])
      01-04 Jdbc()
       
      json-
      {
      "head" :

      { "version":1, "generator" :

      {"type":"ExplainHandler","info":""}

      ,
      "type" : "APACHE_DRILL_PHYSICAL","options" : [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
      },
      "graph":[

      {"pop":"jdbc-scan"}

      ,

      {"pop":"project","initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {memoryCost}

      :4194304,"outputRowCount":1},

      {"pop":"parquet-scan", "cost":
      {"memoryCost":4194304,"outputRowCount":1.2287038E7}

      },

      {"pop":"filter" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
      {"memoryCost":4194304,"outputRowCount":41468.75}

        },

      {"pop":"selection-vector-remover" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
      {memoryCost}

      :4194304,"outputRowCount":41468.75  }  ,

      {"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {"memoryCost":4194304,"outputRowCount":41468.75}

        }    

      {"pop":"union-all" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {"memoryCost":4194304,"outputRowCount":41468.75}

        }    ,

      {"pop":"union-exchange" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {"memoryCost":4194304,"outputRowCount":41468.75 }

      }    ,

      {"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {"memoryCost":4194304,"outputRowCount":41468.75}

      ,"outputProj":true  },

      {"pop":"screen" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":

      {"memoryCost":4194304,"outputRowCount":41468.75}

      }       
      ]
      }

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              bhabani.sreeparna@gmail.com Sreeparna Bhabani
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: