Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16232

Getting error by making columns using DataFrame

    XMLWordPrintableJSON

    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: MLlib, PySpark
    • Environment:

      Winodws, ipython notebook

      Description

      I am using pyspark in ipython notebook for analysis.
      I am following an example toturial this
      http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb

      I am getting error on this step in the 7th cell of notebook
      pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

      Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

      Here is the full error :
      Py4JJavaError Traceback (most recent call last)
      <ipython-input-10-d3dfeab0b119> in <module>()
      ----> 1 pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

      C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\dataframe.py in take(self, num)
      303 with SCCallSiteSync(self._sc) as css:
      304 port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
      --> 305 self._jdf, num)
      306 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
      307

      C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in _call_(self, *args)
      536 answer = self.gateway_client.send_command(command)
      537 return_value = get_return_value(answer, self.gateway_client,
      --> 538 self.target_id, self.name)
      539
      540 for temp_arg in temp_args:

      C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\utils.py in deco(*a, **kw)
      34 def deco(*a, **kw):
      35 try:
      ---> 36 return f(*a, **kw)
      37 except py4j.protocol.Py4JJavaError as e:
      38 s = e.java_exception.toString()

      C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
      298 raise Py4JJavaError(
      299 'An error occurred while calling

      {0} {1} {2}

      .\n'.
      --> 300 format(target_id, '.', name), value)
      301 else:
      302 raise Py4JError(

      Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
      vs = list(itertools.islice(iterator, batch))
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1417, in <lambda>
      func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
      File "<ipython-input-7-6db2287430d4>", line 5, in <lambda>
      KeyError: False

      at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
      at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
      at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
      at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397)
      at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      at org.apache.spark.scheduler.Task.run(Task.scala:88)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
      at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
      at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
      at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
      at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
      at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
      at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
      at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
      at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:127)
      at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:497)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
      at py4j.Gateway.invoke(Gateway.java:259)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:207)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
      vs = list(itertools.islice(iterator, batch))
      File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1417, in <lambda>
      func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
      File "<ipython-input-7-6db2287430d4>", line 5, in <lambda>
      KeyError: False

      at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
      at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
      at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
      at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397)
      at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      at org.apache.spark.scheduler.Task.run(Task.scala:88)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ... 1 more

      In [ ]:

      ‚Äč

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              Nomii5007 Inam Ur Rehman
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: