[ZEPPELIN-1134] Possible Wildcard Parsing Bug


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: GUI
    • Environment: Spark 1.6.2, Ubuntu server, Zeppelin 0.6.0 binary (all interpreters)

    Description

      I have an issue where loading more than one file into a DataFrame using a wildcard, e.g.

      %pyspark
      df = sqlContext.read.json("/jsonfiles/*.json")

      throws an exception (full stack trace below), whereas loading a single named file succeeds:

      %pyspark
      df = sqlContext.read.json("/jsonfiles/namedfile.json")

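      To help isolate whether the wildcard path handling itself is at fault, the same files can also be read through an RDD of JSON strings (a sketch only; this still runs the same JVM-side schema inference, so an identical failure here would point away from wildcard parsing):

      %pyspark
      # textFile() expands the same Hadoop-style glob itself, so this reads
      # every matching file as lines of raw JSON text.
      lines = sc.textFile("/jsonfiles/*.json")
      # json() also accepts an RDD of JSON strings; schema inference still
      # runs on the executors, as in the failing call above.
      df = sqlContext.read.json(lines)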
      Stack trace of the failing wildcard read:

      Py4JJavaError: An error occurred while calling o147.json.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 21.0 failed 4 times, most recent failure: Lost task 9.3 in stage 21.0 (TID 296, cti-u-125.tipic.on.bell.ca): java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$1$$anonfun$apply$14
      at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:84)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:278)
      at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
      at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      at org.apache.spark.scheduler.Task.run(Task.scala:89)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
      at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
      at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
      at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
      at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
      at org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(InferSchema.scala:65)
      at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
      at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
      at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
      at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
      at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
      at py4j.Gateway.invoke(Gateway.java:259)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:209)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$1$$anonfun$apply$14
      at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:84)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:278)
      at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
      at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
      at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      at org.apache.spark.scheduler.Task.run(Task.scala:89)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      ... 1 more
      (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o147.json.\n', JavaObject id=o148), <traceback object at 0x7f651cc11cf8>)
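      Note that the class the executor cannot load, org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$1$$anonfun$apply$14, is internal to Spark itself (apparently reached via the treeAggregate call inside JSON schema inference, per InferSchema.infer in the trace), not user or REPL code. That pattern looks more consistent with a driver/executor Spark version or classpath mismatch than with wildcard parsing. A hypothetical sanity check from the same notebook:

      %pyspark
      # Hypothetical check: compare the PySpark version the driver sees with
      # the version reported from inside each executor's Python worker. Any
      # mismatch would suggest a deployment problem rather than a Zeppelin bug.
      print("driver: " + sc.version)
      executor_versions = (sc.range(0, 8)
                           .map(lambda _: __import__("pyspark").__version__)
                           .distinct()
                           .collect())
      print("executors: " + str(executor_versions))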

    Attachments

      1. ZEPPELIN-1134.png (63 kB, uploaded by venkatramanan)


    People

      Assignee: vensant (venkatramanan)
      Reporter: mwd102 (Michael)
      Votes: 0
      Watchers: 4
