Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
None
-
None
-
None
Description
We have seen degraded performance by doing these steps:
1) generate the repro data:
python script repro.py as below:
import string
import random
for i in range(30000000):
x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
print "
{0}".format(x1),"{0}".format(x2),"
{0}".format(x3),"{0}".format(x4),"
{0}".format(x5),"{0}".format(x6)
python repro.py > repro.csv
2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt, use the following sql command to create an external table:
CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'
3) create Hive's table of ORC|PARQUET format:
CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
4) Query switch between these two tables, then the query time on the same table significantly lengthened. On my setup, for ORC, it was 15sec -> 26secs. Queries on table of other formats, after injecting a query to other formats, all have significant slow down.
Attachments
Issue Links
- is related to
-
DRILL-5106 Refactor SkipRecordsInspector to exclude check for predefined file formats
- Resolved
- links to