[DRILL-4982] Hive Queries degrade when queries switch between different formats - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.10.0
Component/s: None
Labels:
None

Description

We have seen degraded performance by doing these steps:
1) generate the repro data:

python script repro.py as below:

import string
import random

for i in range(30000000):
x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))

print "

{0}".format(x1),"{0}

".format(x2),"

{0}".format(x3),"{0}

".format(x4),"

{0}".format(x5),"{0}

".format(x6)

python repro.py > repro.csv

2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt, use the following sql command to create an external table:
CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'

3) create Hive's table of ORC|PARQUET format:
CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;

4) Query switch between these two tables, then the query time on the same table significantly lengthened. On my setup, for ORC, it was 15sec -> 26secs. Queries on table of other formats, after injecting a query to other formats, all have significant slow down.

Attachments

Issue Links

is related to

DRILL-5106 Refactor SkipRecordsInspector to exclude check for predefined file formats

Resolved

links to

GitHub Pull Request #638

Activity

People

Assignee:: Karthikeyan Manivannan

Reporter:: Chunhui Shi

Reviewer:: Dechang Gu

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 30/Oct/16 08:04

Updated:: 08/May/17 18:42

Resolved:: 05/Dec/16 16:13