Spark / SPARK-19145

Timestamp-to-String casting slows the query significantly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I have a time-series table with a timestamp column.
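
      A minimal sketch of such a table (hypothetical DDL; the plans below show the data is the Parquet table default.cstat, and `value` is a placeholder column):

      CREATE TABLE `default`.`cstat` (`time` TIMESTAMP, `value` DOUBLE)
      USING PARQUET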

      The following query

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

      is significantly SLOWER than

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
      AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800') LIMIT 50000

      After investigating, I found that in the first query the time column is cast to String before the filter is applied, whereas in the second query no such cast is performed and the filter compares long values.
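
      A workaround (a sketch, not verified on this dataset) is to cast the literal instead of letting the column be cast: the cast of a constant folds into a timestamp literal, so the comparison stays on the timestamp column and remains eligible for pushdown, as in the faster plan below.

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
      AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP) LIMIT 50000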

      Below is the generated physical plan for the slower execution, followed by the physical plan for the faster execution.

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

      == Physical Plan ==
      CollectLimit 50000
      +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
         +- Exchange SinglePartition
            +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
               +- *Project
                  +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
                     +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>

      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
      AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800') LIMIT 50000

      == Physical Plan ==
      CollectLimit 50000
      +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
         +- Exchange SinglePartition
            +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
               +- *Project
                  +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
                     +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
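
      For reference, both plans above can be reproduced with EXPLAIN (a sketch; the table name is the same placeholder used in the queries above):

      EXPLAIN
      SELECT COUNT(*) AS `count`
      FROM `default`.`table`
      WHERE `time` >= '2017-01-02 19:53:51'
      AND `time` <= '2017-01-09 19:53:51' LIMIT 50000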

      In Impala, both queries run efficiently with no performance difference.
      Spark should be able to parse the date string and convert it to a long/timestamp while generating the optimized logical plan, so that both queries perform similarly.
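
      Concretely, the desired rewrite would turn the column-side cast of the slower plan into the folded literal comparison already seen in the faster plan (a sketch of the filter expressions, not actual optimizer output):

      -- current:  cast(time#3314 as string) >= 2017-01-02 19:53:51
      -- desired:  time#3314 >= 1483404831000000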


            People

              Assignee: Unassigned
              Reporter: gagan taneja
              Herman van Hövell
              Votes: 0
              Watchers: 8
