SPARK-27594

spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Using spark.sql.orc.impl=native together with spark.sql.orc.enableVectorizedReader=true causes TIMESTAMP columns in Hive tables stored as ORC to be read incorrectly: the millisecond component of each timestamp is doubled (for example, 12:15:31.123 is read back as 12:15:31.246).
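
      A condensed, self-contained reproduction of the same behavior (a sketch, assuming a Hive-enabled SparkSession on an affected 2.3.0 cluster; the table name default.ts_repro is hypothetical):

      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.enableHiveSupport().getOrCreate()
      spark.conf.set("spark.sql.orc.impl", "native")
      
      spark.sql("CREATE TABLE IF NOT EXISTS default.ts_repro (time TIMESTAMP) USING ORC")
      
      # Write one row with a millisecond-precision timestamp.
      spark.createDataFrame([("2019-01-01 12:15:31.123",)], ["ts_string"]) \
          .selectExpr("CAST(ts_string AS TIMESTAMP) AS time") \
          .write.insertInto("default.ts_repro")
      
      # Expected: .123 under both settings; observed: .246 with the
      # vectorized reader enabled.
      for vectorized in ("true", "false"):
          spark.conf.set("spark.sql.orc.enableVectorizedReader", vectorized)
          spark.read.table("default.ts_repro").show(truncate=False)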

      The full input/output of a Zeppelin session demonstrating the issue:

      %pyspark
      
      from pprint import pprint
      
      spark.conf.set("spark.sql.orc.impl", "native")
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      
      pprint(spark.sparkContext.getConf().getAll())
      --------------------
      [('sql.stacktrace', 'false'),
       ('spark.eventLog.enabled', 'true'),
       ('spark.app.id', 'application_1556200632329_0005'),
       ('importImplicit', 'true'),
       ('printREPLOutput', 'true'),
       ('spark.history.ui.port', '18081'),
       ('spark.driver.extraLibraryPath',
        '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
       ('spark.driver.extraJavaOptions',
        ' -Dfile.encoding=UTF-8 '
        '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
        '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
       ('concurrentSQL', 'false'),
       ('spark.driver.port', '40195'),
       ('spark.executor.extraLibraryPath',
        '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
       ('useHiveContext', 'true'),
       ('spark.jars',
        'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
       ('spark.history.provider',
        'org.apache.spark.deploy.history.FsHistoryProvider'),
       ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
       ('spark.submit.deployMode', 'client'),
       ('spark.ui.filters',
        'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
       ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
        'sandbox-hdp.hortonworks.com'),
       ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
       ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
       ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
       ('master', 'yarn'),
       ('spark.yarn.dist.archives',
        '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
       ('spark.scheduler.mode', 'FAIR'),
       ('spark.yarn.queue', 'default'),
       ('spark.history.kerberos.keytab',
        '/etc/security/keytabs/spark.headless.keytab'),
       ('spark.executor.id', 'driver'),
       ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
       ('spark.history.kerberos.enabled', 'false'),
       ('spark.master', 'yarn'),
       ('spark.sql.catalogImplementation', 'hive'),
       ('spark.history.kerberos.principal', 'none'),
       ('spark.driver.extraClassPath',
        ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
       ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
       ('spark.repl.class.outputDir',
        '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
       ('spark.yarn.isPython', 'true'),
       ('spark.app.name', 'Zeppelin'),
       ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
        'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
       ('maxResult', '1000'),
       ('spark.executorEnv.PYTHONPATH',
        '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
       ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
      
      %pyspark
      
      spark.sql("""
      DROP TABLE IF EXISTS default.hivetest
      """)
      
      spark.sql("""
      CREATE TABLE default.hivetest (
          day DATE,
          time TIMESTAMP,
          timestring STRING
      )
      USING ORC
      """)
      
      %pyspark
      
      df1 = spark.createDataFrame(
          [
              ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
          ],
          schema=("date", "timestamp", "string")
      )
      
      df2 = spark.createDataFrame(
          [
              ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
          ],
          schema=("date", "timestamp", "string")
      )
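      
      # (The literals above are plain strings, in columns named "date",
      # "timestamp" and "string"; insertInto below resolves columns by
      # position and casts them to the table's DATE/TIMESTAMP/STRING types.)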
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      df1.write.insertInto("default.hivetest")
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      df1.write.insertInto("default.hivetest")
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      spark.read.table("default.hivetest").show(2, False)
      
      """
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      spark.read.table("default.hivetest").show(2, False)
      
      """
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      
      The same incorrect result is returned from the Scala interpreter:
      
      import spark.sql
      import spark.implicits._
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      
      sql("SELECT * FROM default.hivetest").show(2, false)
      
      """
      import spark.sql
      import spark.implicits._
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      

      Querying the table through Hive also returns the correct data:

      select * from default.hivetest;
      
      day       |time                   |timestring             |
      ----------|-----------------------|-----------------------|
      2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
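
      Until the cause is identified, disabling the vectorized reader (as in the session above) works around the problem; a sketch:

      %pyspark
      
      # Workaround: fall back to the non-vectorized native reader ...
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      # ... or switch to the Hive ORC reader entirely.
      # spark.conf.set("spark.sql.orc.impl", "hive")
      
      spark.read.table("default.hivetest").show(2, False)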
      

      People

        Assignee: Unassigned
        Reporter: Jan-Willem van der Sijp (dutch_gecko)
        Votes: 0
        Watchers: 3
