Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version/s: 2.3.0
- Fix Version/s: None
- Component/s: None
Description
Using spark.sql.orc.impl=native and spark.sql.orc.enableVectorizedReader=true causes TIMESTAMP columns of Hive tables stored as ORC to be read incorrectly. Specifically, the milliseconds of the timestamp are doubled.
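The linked ORC issue (ORC-546) describes the reader applying the sub-second portion of the timestamp twice. A minimal arithmetic sketch of that effect — the function names are illustrative, not the actual ORC reader code — showing how .123 becomes .246:

```python
# ORC stores a timestamp as whole seconds plus a separate nanosecond field.
# Working in integer milliseconds to keep the arithmetic exact.

def decode_correct_ms(seconds: int, nanos: int) -> int:
    """Combine the two fields once -- the intended behaviour."""
    return seconds * 1000 + nanos // 1_000_000

def decode_buggy_ms(seconds: int, nanos: int) -> int:
    """Add the sub-second millis twice, as described in ORC-546."""
    millis = nanos // 1_000_000
    return seconds * 1000 + millis + millis

# 31.123 s stored as (31 s, 123_000_000 ns):
print(decode_correct_ms(31, 123_000_000))  # 31123 -> reads back as .123
print(decode_buggy_ms(31, 123_000_000))    # 31246 -> reads back as .246
```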
Input/output of a Zeppelin session to demonstrate:
%pyspark
from pprint import pprint
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
pprint(spark.sparkContext.getConf().getAll())
--------------------
[('sql.stacktrace', 'false'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.app.id', 'application_1556200632329_0005'),
 ('importImplicit', 'true'),
 ('printREPLOutput', 'true'),
 ('spark.history.ui.port', '18081'),
 ('spark.driver.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('spark.driver.extraJavaOptions',
  ' -Dfile.encoding=UTF-8 '
  '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
  '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
 ('concurrentSQL', 'false'),
 ('spark.driver.port', '40195'),
 ('spark.executor.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('useHiveContext', 'true'),
 ('spark.jars', 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.history.provider', 'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'sandbox-hdp.hortonworks.com'),
 ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
 ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
 ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
 ('master', 'yarn'),
 ('spark.yarn.dist.archives', '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
 ('spark.scheduler.mode', 'FAIR'),
 ('spark.yarn.queue', 'default'),
 ('spark.history.kerberos.keytab', '/etc/security/keytabs/spark.headless.keytab'),
 ('spark.executor.id', 'driver'),
 ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
 ('spark.history.kerberos.enabled', 'false'),
 ('spark.master', 'yarn'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.history.kerberos.principal', 'none'),
 ('spark.driver.extraClassPath', ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
 ('spark.repl.class.outputDir', '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
 ('spark.yarn.isPython', 'true'),
 ('spark.app.name', 'Zeppelin'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
 ('maxResult', '1000'),
 ('spark.executorEnv.PYTHONPATH', '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
 ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
%pyspark
spark.sql("""
DROP TABLE IF EXISTS default.hivetest
""")
spark.sql("""
CREATE TABLE default.hivetest (
    day DATE,
    time TIMESTAMP,
    timestring STRING
)
USING ORC
""")
%pyspark
df1 = spark.createDataFrame(
    [
        ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
    ],
    schema=("date", "timestamp", "string")
)
df2 = spark.createDataFrame(
    [
        ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
    ],
    schema=("date", "timestamp", "string")
)
%pyspark
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
df1.write.insertInto("default.hivetest")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
df1.write.insertInto("default.hivetest")
%pyspark
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.read.table("default.hivetest").show(2, False)
"""
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
%pyspark
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
spark.read.table("default.hivetest").show(2, False)
"""
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
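In the reads above, the incorrect value differs from the intended one (preserved in the timestring column) by exactly the original millisecond component. A small helper — illustrative, not part of the report — to quantify the discrepancy per row:

```python
from datetime import datetime

def millis_delta(time_col: str, timestring_col: str) -> int:
    """Millisecond difference between the TIMESTAMP column as read and
    the intended value carried in the STRING column."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    read_back = datetime.strptime(time_col, fmt)
    intended = datetime.strptime(timestring_col, fmt)
    return round((read_back - intended).total_seconds() * 1000)

# Vectorized read: delta equals the original millis component (123).
print(millis_delta("2019-01-01 12:15:31.246", "2019-01-01 12:15:31.123"))  # 123
# Non-vectorized read: no discrepancy.
print(millis_delta("2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123"))  # 0
```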
The same incorrect values appear from the Scala interpreter:
import spark.sql
import spark.implicits._
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
sql("SELECT * FROM default.hivetest").show(2, false)
"""
import spark.sql
import spark.implicits._
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
Querying the table directly in Hive also produces the correct data:
select * from default.hivetest;

day       |time                   |timestring             |
----------|-----------------------|-----------------------|
2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
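Based on the session above, the doubled millis appear only on the native implementation's vectorized read path; writes and non-vectorized reads are unaffected. A sketch of session-level settings that sidestep the bad path on the affected version — a workaround suggestion drawn from the reproduction, not a verified fix:

```python
%pyspark
# Either fall back to the Hive ORC reader implementation...
spark.conf.set("spark.sql.orc.impl", "hive")
# ...or keep the native implementation but disable its vectorized reader:
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
```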
Attachments
Issue Links
- is caused by: ORC-546 "The timestamps are getting duplicated millis after ORC-306." (Closed)