Spark / SPARK-9643

Error serializing datetimes with timezones using Dataframes and Parquet

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.6.0
    • Component/s: PySpark

    Description

      Trying to serialize a DataFrame with a datetime column that includes a timezone fails with the following error.

      net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2
          at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
          at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
          at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
          at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
          at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
          at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
          at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
          at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150)
          at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
          at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
          at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
          at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.org$apache$spark$sql$execution$datasources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:185)
          at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163)
          at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163)
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64)
          at org.apache.spark.scheduler.Task.run(Task.scala:86)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
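
The "expected 1 or 7 args, got 2" message is consistent with how Python pickles datetimes: a naive datetime reduces to a single packed-state argument, while a timezone-aware one carries its tzinfo as a second argument. The sketch below illustrates this with the standard library only. It is not the reporter's code, and `datetime.timezone.utc` is an assumption standing in for whatever timezone object the original DataFrame column held.

```python
import datetime

# Naive datetime: no tzinfo attached.
naive = datetime.datetime(2015, 8, 5, 12, 0, 0)

# Aware datetime: the stdlib UTC tzinfo is an assumed stand-in for the
# timezone object used in the original report.
aware = naive.replace(tzinfo=datetime.timezone.utc)


def reduce_arg_count(value):
    """Return how many constructor arguments pickle records for `value`."""
    _constructor, args = value.__reduce__()[:2]
    return len(args)


naive_args = reduce_arg_count(naive)  # packed state only
aware_args = reduce_arg_count(aware)  # packed state plus tzinfo

print(naive_args, aware_args)  # prints "1 2" on CPython 3
```

An unpickler that only accepts the 1-argument (packed state) or 7-argument (year through microsecond) forms would reject the 2-argument form produced for aware datetimes, which matches the exception above.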
      

      According to Davies Liu, timezone serialization is done directly in Spark and does not depend on Pyrolite, but I was not able to confirm that.

      Upgrading to Pyrolite 4.9 fixed this issue.

      https://github.com/apache/spark/pull/7950

          People

            Assignee: Alex Angelini (alexangelini)
            Reporter: Alex Angelini (alexangelini)
            Votes: 0
            Watchers: 3
