Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-4982

Hive Queries degrade when queries switch between different formats

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: None
    • Labels:
      None

      Description

      We have seen degraded performance by doing these steps:
      1) generate the repro data:

      python script repro.py as below:

      import string
      import random

      for i in range(30000000):
      x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
      x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
      x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
      x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
      x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
      x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))

      print "

      {0}".format(x1),"{0}

      ".format(x2),"

      {0}".format(x3),"{0}

      ".format(x4),"

      {0}".format(x5),"{0}

      ".format(x6)

      python repro.py > repro.csv

      2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt, use the following sql command to create an external table:
      CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'

      3) create Hive's table of ORC|PARQUET format:
      CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
      CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;

      4) Query switch between these two tables, then the query time on the same table significantly lengthened. On my setup, for ORC, it was 15sec -> 26secs. Queries on table of other formats, after injecting a query to other formats, all have significant slow down.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                karthikm Karthikeyan Manivannan
                Reporter:
                cshi Chunhui Shi
                Reviewer:
                Dechang Gu
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: