Spark / SPARK-19145

Timestamp-to-String casting slows the query significantly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I have a time-series table with a timestamp column.
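
      A minimal sketch of such a table (hypothetical DDL; the plans below show the data is the Parquet table default.cstat, and `value` is a placeholder column):

      CREATE TABLE `default`.`cstat` (`time` TIMESTAMP, `value` DOUBLE)
      USING PARQUET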

      The following query

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

      is significantly SLOWER than

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
      AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800') LIMIT 50000

      After investigating, I found that in the first query the time column is cast to String before the filter is applied, whereas in the second query no such cast is performed and the filter compares long values.
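
      A workaround (a sketch, not verified on this dataset) is to cast the literal instead of letting the column be cast: the cast of a constant folds into a timestamp literal, so the comparison stays on the timestamp column and remains eligible for pushdown, as in the faster plan below.

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
      AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP) LIMIT 50000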

      Below is the generated physical plan for the slower execution, followed by the physical plan for the faster execution.

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

      == Physical Plan ==
      CollectLimit 50000
      +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
         +- Exchange SinglePartition
            +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
               +- *Project
                  +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
                     +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
      AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800') LIMIT 50000

      == Physical Plan ==
      CollectLimit 50000
      +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
         +- Exchange SinglePartition
            +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
               +- *Project
                  +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
                     +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
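
      For reference, both plans above can be reproduced with EXPLAIN (a sketch; the table name is the same placeholder used in the queries above):

      EXPLAIN
      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000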

      In Impala, both queries run efficiently with no performance difference.
      Spark should be able to parse the date string and convert it to a long/timestamp while generating the optimized logical plan, so that both queries perform similarly.
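
      Concretely, the desired rewrite would turn the column-side cast of the slower plan into the folded literal comparison already seen in the faster plan (a sketch of the filter expressions, not actual optimizer output):

      -- current:  cast(time#3314 as string) >= 2017-01-02 19:53:51
      -- desired:  time#3314 >= 1483404831000000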


            People

              Assignee: Unassigned
              Reporter: gagan taneja
              Herman van Hövell
              Votes: 0
              Watchers: 8
