SPARK-27594

spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Using spark.sql.orc.impl=native together with spark.sql.orc.enableVectorizedReader=true causes TIMESTAMP columns in Hive tables stored as ORC to be read incorrectly: the millisecond component of each timestamp is doubled (for example, 12:15:31.123 is read back as 12:15:31.246).
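
      A condensed, self-contained reproduction of the same behavior (a sketch, assuming a Hive-enabled SparkSession on an affected 2.3.0 cluster; the table name default.ts_repro is hypothetical):

      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.enableHiveSupport().getOrCreate()
      spark.conf.set("spark.sql.orc.impl", "native")
      
      spark.sql("CREATE TABLE IF NOT EXISTS default.ts_repro (time TIMESTAMP) USING ORC")
      
      # Write one row with a millisecond-precision timestamp.
      spark.createDataFrame([("2019-01-01 12:15:31.123",)], ["ts_string"]) \
          .selectExpr("CAST(ts_string AS TIMESTAMP) AS time") \
          .write.insertInto("default.ts_repro")
      
      # Expected: .123 under both settings; observed: .246 with the
      # vectorized reader enabled.
      for vectorized in ("true", "false"):
          spark.conf.set("spark.sql.orc.enableVectorizedReader", vectorized)
          spark.read.table("default.ts_repro").show(truncate=False)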

      The full input/output of a Zeppelin session demonstrating the issue:

      %pyspark
      
      from pprint import pprint
      
      spark.conf.set("spark.sql.orc.impl", "native")
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      
      pprint(spark.sparkContext.getConf().getAll())
      --------------------
      [('sql.stacktrace', 'false'),
       ('spark.eventLog.enabled', 'true'),
       ('spark.app.id', 'application_1556200632329_0005'),
       ('importImplicit', 'true'),
       ('printREPLOutput', 'true'),
       ('spark.history.ui.port', '18081'),
       ('spark.driver.extraLibraryPath',
        '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
       ('spark.driver.extraJavaOptions',
        ' -Dfile.encoding=UTF-8 '
        '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
        '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
       ('concurrentSQL', 'false'),
       ('spark.driver.port', '40195'),
       ('spark.executor.extraLibraryPath',
        '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
       ('useHiveContext', 'true'),
       ('spark.jars',
        'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
       ('spark.history.provider',
        'org.apache.spark.deploy.history.FsHistoryProvider'),
       ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
       ('spark.submit.deployMode', 'client'),
       ('spark.ui.filters',
        'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
       ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
        'sandbox-hdp.hortonworks.com'),
       ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
       ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
       ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
       ('master', 'yarn'),
       ('spark.yarn.dist.archives',
        '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
       ('spark.scheduler.mode', 'FAIR'),
       ('spark.yarn.queue', 'default'),
       ('spark.history.kerberos.keytab',
        '/etc/security/keytabs/spark.headless.keytab'),
       ('spark.executor.id', 'driver'),
       ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
       ('spark.history.kerberos.enabled', 'false'),
       ('spark.master', 'yarn'),
       ('spark.sql.catalogImplementation', 'hive'),
       ('spark.history.kerberos.principal', 'none'),
       ('spark.driver.extraClassPath',
        ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
       ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
       ('spark.repl.class.outputDir',
        '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
       ('spark.yarn.isPython', 'true'),
       ('spark.app.name', 'Zeppelin'),
       ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
        'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
       ('maxResult', '1000'),
       ('spark.executorEnv.PYTHONPATH',
        '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
       ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
      
      %pyspark
      
      spark.sql("""
      DROP TABLE IF EXISTS default.hivetest
      """)
      
      spark.sql("""
      CREATE TABLE default.hivetest (
          day DATE,
          time TIMESTAMP,
          timestring STRING
      )
      USING ORC
      """)
      
      %pyspark
      
      df1 = spark.createDataFrame(
          [
              ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
          ],
          schema=("date", "timestamp", "string")
      )
      
      df2 = spark.createDataFrame(
          [
              ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
          ],
          schema=("date", "timestamp", "string")
      )
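      
      # (The literals above are plain strings, in columns named "date",
      # "timestamp" and "string"; insertInto below resolves columns by
      # position and casts them to the table's DATE/TIMESTAMP/STRING types.)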
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      df1.write.insertInto("default.hivetest")
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      df1.write.insertInto("default.hivetest")
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      spark.read.table("default.hivetest").show(2, False)
      
      """
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      
      %pyspark
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      spark.read.table("default.hivetest").show(2, False)
      
      """
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      
      The same incorrect result is returned from the Scala interpreter:
      
      import spark.sql
      import spark.implicits._
      
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
      
      sql("SELECT * FROM default.hivetest").show(2, false)
      
      """
      import spark.sql
      import spark.implicits._
      +----------+-----------------------+-----------------------+
      |day       |time                   |timestring             |
      +----------+-----------------------+-----------------------+
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
      +----------+-----------------------+-----------------------+
      """
      

      Querying the table through Hive also returns the correct data:

      select * from default.hivetest;
      
      day       |time                   |timestring             |
      ----------|-----------------------|-----------------------|
      2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
      2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
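
      Until the cause is identified, disabling the vectorized reader (as in the session above) works around the problem; a sketch:

      %pyspark
      
      # Workaround: fall back to the non-vectorized native reader ...
      spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
      # ... or switch to the Hive ORC reader entirely.
      # spark.conf.set("spark.sql.orc.impl", "hive")
      
      spark.read.table("default.hivetest").show(2, False)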
      

      People

        Assignee: Unassigned
        Reporter: Jan-Willem van der Sijp (dutch_gecko)
        Votes: 0
        Watchers: 3
