SPARK-20802

kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: MLlib, PySpark
    • Labels: Important
Description
      ***********************************************************************************************
      Call stack on error (When data is normally distributed):
      ***********************************************************************************************
      17/05/17 21:59:22 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
      17/05/17 21:59:22 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
      17/05/17 21:59:22 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
      17/05/17 21:59:22 INFO DAGScheduler: Parents of final stage: List()
      17/05/17 21:59:22 INFO DAGScheduler: Missing parents: List()
      17/05/17 21:59:22 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
      17/05/17 21:59:22 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
      17/05/17 21:59:22 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
      17/05/17 21:59:22 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:38047 (size: 2.8 KB, free: 413.9 MB)
      17/05/17 21:59:22 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
      17/05/17 21:59:22 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
      17/05/17 21:59:22 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
      17/05/17 21:59:22 WARN TaskSetManager: Stage 14 contains a task of very large size (204 KB). The maximum recommended task size is 100 KB.
      17/05/17 21:59:22 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 209337 bytes)
      17/05/17 21:59:22 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
      17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
      net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      17/05/17 21:59:23 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      17/05/17 21:59:23 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
      17/05/17 21:59:23 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
      17/05/17 21:59:23 INFO TaskSchedulerImpl: Cancelling stage 14
      17/05/17 21:59:23 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) failed in 1.253 s due to Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      Driver stacktrace:
      17/05/17 21:59:23 INFO DAGScheduler: Job 14 failed: count at KolmogorovSmirnovTest.scala:67, took 1.499121 s
      Traceback (most recent call last):
        File "/home/bsrsharma/work/python/Features.py", line 271, in <module>
          rc = findFeatures("/home/bsrsharma/work/python/arran.csv", "/home/bsrsharma/work/python/features.csv" )
        File "/home/bsrsharma/work/python/Features.py", line 245, in findFeatures
          testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], numericSD[j])
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 301, in kolmogorovSmirnovTest
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
        File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o273.kolmogorovSmirnovTest.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
      at scala.Option.foreach(Option.scala:257)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
      at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
      at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:67)
      at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:85)
      at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:187)
      at org.apache.spark.mllib.stat.Statistics$.kolmogorovSmirnovTest(Statistics.scala:220)
      at org.apache.spark.mllib.api.python.PythonMLLibAPI.kolmogorovSmirnovTest(PythonMLLibAPI.scala:1135)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:280)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:214)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
      at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ... 1 more

      17/05/17 21:59:24 INFO SparkContext: Invoking stop() from shutdown hook
      17/05/17 21:59:24 INFO SparkUI: Stopped Spark web UI at http://192.168.0.115:4040
      17/05/17 21:59:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
      17/05/17 21:59:25 INFO MemoryStore: MemoryStore cleared
      17/05/17 21:59:25 INFO BlockManager: BlockManager stopped
      17/05/17 21:59:25 INFO BlockManagerMaster: BlockManagerMaster stopped
      17/05/17 21:59:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
      17/05/17 21:59:25 INFO SparkContext: Successfully stopped SparkContext
      17/05/17 21:59:25 INFO ShutdownHookManager: Shutdown hook called
      17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a/pyspark-60290f87-c16d-4f4b-a4df-c4e40eaf61a1
      17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a
      bash-4.3$
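
      The `PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)` suggests the JVM-side unpickler is receiving numpy scalar objects (it cannot reconstruct `numpy.dtype`) rather than plain Python floats. A possible workaround sketch, assuming the failing `vecRDD` was built from numpy values (the `to_native_floats` helper is hypothetical, not part of the reported script):

      ```python
      import numpy as np

      # Hypothetical workaround: convert numpy scalars to native Python
      # floats before calling Statistics.kolmogorovSmirnovTest, so the
      # data pickles as plain doubles instead of numpy.dtype objects.
      def to_native_floats(values):
          """Convert numpy scalars (e.g. numpy.float64) to plain Python floats."""
          return [float(v) for v in values]

      # In the failing script this would be applied when building the RDD, e.g.
      #   vecRDD = sc.parallelize(to_native_floats(column_values))
      # before the kolmogorovSmirnovTest call.
      sample = np.asarray([1.5, 2.0, 2.5], dtype=np.float64)
      converted = to_native_floats(sample)
      assert all(type(v) is float for v in converted)
      ```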

      ***********************************************************************************************
      Output when there is no error (When the data is NOT normally distributed!)
      ***********************************************************************************************
      17/05/18 11:41:20 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
      17/05/18 11:41:20 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
      17/05/18 11:41:20 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
      17/05/18 11:41:20 INFO DAGScheduler: Parents of final stage: List()
      17/05/18 11:41:20 INFO DAGScheduler: Missing parents: List()
      17/05/18 11:41:20 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
      17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
      17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
      17/05/18 11:41:20 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:37499 (size: 2.8 KB, free: 413.9 MB)
      17/05/18 11:41:20 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
      17/05/18 11:41:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
      17/05/18 11:41:20 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
      17/05/18 11:41:20 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 96099 bytes)
      17/05/18 11:41:20 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
      17/05/18 11:41:20 INFO PythonRunner: Times: total = 70, boot = -4396, init = 4409, finish = 57
      17/05/18 11:41:20 INFO Executor: Finished task 0.0 in stage 14.0 (TID 14). 1680 bytes result sent to driver
      17/05/18 11:41:20 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) finished in 0.353 s
      17/05/18 11:41:20 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 14) in 357 ms on localhost (executor driver) (1/1)
      17/05/18 11:41:20 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
      17/05/18 11:41:20 INFO DAGScheduler: Job 14 finished: count at KolmogorovSmirnovTest.scala:67, took 0.664443 s
      17/05/18 11:41:21 INFO SparkContext: Starting job: collect at KolmogorovSmirnovTest.scala:71
      17/05/18 11:41:21 INFO DAGScheduler: Registering RDD 22 (sortBy at KolmogorovSmirnovTest.scala:68)
      17/05/18 11:41:21 INFO DAGScheduler: Got job 15 (collect at KolmogorovSmirnovTest.scala:71) with 1 output partitions
      17/05/18 11:41:21 INFO DAGScheduler: Final stage: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71)
      17/05/18 11:41:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 15)
      17/05/18 11:41:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 15)
      17/05/18 11:41:21 INFO DAGScheduler: Submitting ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68), which has no missing parents
      17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 6.1 KB, free 413.7 MB)
      17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.7 MB)
      17/05/18 11:41:21 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.0.115:37499 (size: 3.6 KB, free: 413.9 MB)
      17/05/18 11:41:21 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:996
      17/05/18 11:41:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68)
      17/05/18 11:41:21 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
      17/05/18 11:41:21 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 15, localhost, executor driver, partition 0, PROCESS_LOCAL, 96173 bytes)
      17/05/18 11:41:21 INFO Executor: Running task 0.0 in stage 15.0 (TID 15)
      17/05/18 11:41:22 INFO PythonRunner: Times: total = 67, boot = -1033, init = 1042, finish = 58
      17/05/18 11:41:22 INFO Executor: Finished task 0.0 in stage 15.0 (TID 15). 2043 bytes result sent to driver
      17/05/18 11:41:22 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 15) in 994 ms on localhost (executor driver) (1/1)
      17/05/18 11:41:22 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
      17/05/18 11:41:22 INFO DAGScheduler: ShuffleMapStage 15 (sortBy at KolmogorovSmirnovTest.scala:68) finished in 1.002 s
      17/05/18 11:41:22 INFO DAGScheduler: looking for newly runnable stages
      17/05/18 11:41:22 INFO DAGScheduler: running: Set()
      17/05/18 11:41:22 INFO DAGScheduler: waiting: Set(ResultStage 16)
      17/05/18 11:41:22 INFO DAGScheduler: failed: Set()
      17/05/18 11:41:22 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68), which has no missing parents
      17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 19.9 KB, free 413.6 MB)
      17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 7.9 KB, free 413.6 MB)
      17/05/18 11:41:23 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.0.115:37499 (size: 7.9 KB, free: 413.9 MB)
      17/05/18 11:41:23 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:996
      17/05/18 11:41:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68)
      17/05/18 11:41:23 INFO TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
      17/05/18 11:41:23 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 16, localhost, executor driver, partition 0, ANY, 5805 bytes)
      17/05/18 11:41:23 INFO Executor: Running task 0.0 in stage 16.0 (TID 16)
      17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
      17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 100 ms
      17/05/18 11:41:26 INFO Executor: Finished task 0.0 in stage 16.0 (TID 16). 2061 bytes result sent to driver
      17/05/18 11:41:26 INFO TaskSetManager: Finished task 0.0 in stage 16.0 (TID 16) in 3009 ms on localhost (executor driver) (1/1)
      17/05/18 11:41:26 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
      17/05/18 11:41:26 INFO DAGScheduler: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71) finished in 3.007 s
      17/05/18 11:41:26 INFO DAGScheduler: Job 15 finished: collect at KolmogorovSmirnovTest.scala:71, took 4.701565 s
      Kolmogorov-Smirnov test summary:
      degrees of freedom = 0
      statistic = 0.9987456949896243
      pValue = 3.845022078508009E-10
      Very strong presumption against null hypothesis: Sample follows theoretical distribution.
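
      For reference, the statistic reported above can be reproduced conceptually without Spark. This is an illustrative pure-Python sketch of the one-sample KS statistic against a normal CDF, not Spark's actual implementation (which computes it distributively in KolmogorovSmirnovTest.scala); the sample data here is made up:

      ```python
      import math

      def normal_cdf(x, mean=0.0, sd=1.0):
          """CDF of the normal distribution via the error function."""
          return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

      def ks_statistic(sample, cdf):
          """D = sup_x |F_n(x) - F(x)|, evaluated at the sorted sample points."""
          xs = sorted(sample)
          n = len(xs)
          d = 0.0
          for i, x in enumerate(xs):
              f = cdf(x)
              # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
              d = max(d, abs((i + 1) / n - f), abs(i / n - f))
          return d

      data = [-1.2, -0.4, 0.1, 0.6, 1.3]
      print(ks_statistic(data, normal_cdf))
      ```

      A small p-value for D, as in the summary above, argues against the null hypothesis that the sample follows the theoretical distribution.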
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965) at org.apache.spark.rdd.RDD.count(RDD.scala:1158) at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:67) at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:85) at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:187) at org.apache.spark.mllib.stat.Statistics$.kolmogorovSmirnovTest(Statistics.scala:220) at org.apache.spark.mllib.api.python.PythonMLLibAPI.kolmogorovSmirnovTest(PythonMLLibAPI.scala:1135) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349) at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more 17/05/17 21:59:24 INFO SparkContext: Invoking stop() from shutdown hook 17/05/17 21:59:24 INFO SparkUI: Stopped Spark web UI at http://192.168.0.115:4040 17/05/17 21:59:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/05/17 21:59:25 INFO MemoryStore: MemoryStore cleared 17/05/17 21:59:25 INFO BlockManager: BlockManager stopped 17/05/17 21:59:25 INFO BlockManagerMaster: BlockManagerMaster stopped 17/05/17 21:59:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/05/17 21:59:25 INFO SparkContext: Successfully stopped SparkContext 17/05/17 21:59:25 INFO ShutdownHookManager: Shutdown hook called 17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a/pyspark-60290f87-c16d-4f4b-a4df-c4e40eaf61a1 17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a bash-4.3$ *********************************************************************************************** Output when there is no error (When the data is NOT normally distributed!) 
***********************************************************************************************
17/05/18 11:41:20 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
17/05/18 11:41:20 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
17/05/18 11:41:20 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
17/05/18 11:41:20 INFO DAGScheduler: Parents of final stage: List()
17/05/18 11:41:20 INFO DAGScheduler: Missing parents: List()
17/05/18 11:41:20 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
17/05/18 11:41:20 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:37499 (size: 2.8 KB, free: 413.9 MB)
17/05/18 11:41:20 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
17/05/18 11:41:20 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 96099 bytes)
17/05/18 11:41:20 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
17/05/18 11:41:20 INFO PythonRunner: Times: total = 70, boot = -4396, init = 4409, finish = 57
17/05/18 11:41:20 INFO Executor: Finished task 0.0 in stage 14.0 (TID 14). 1680 bytes result sent to driver
17/05/18 11:41:20 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) finished in 0.353 s
17/05/18 11:41:20 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 14) in 357 ms on localhost (executor driver) (1/1)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
17/05/18 11:41:20 INFO DAGScheduler: Job 14 finished: count at KolmogorovSmirnovTest.scala:67, took 0.664443 s
17/05/18 11:41:21 INFO SparkContext: Starting job: collect at KolmogorovSmirnovTest.scala:71
17/05/18 11:41:21 INFO DAGScheduler: Registering RDD 22 (sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO DAGScheduler: Got job 15 (collect at KolmogorovSmirnovTest.scala:71) with 1 output partitions
17/05/18 11:41:21 INFO DAGScheduler: Final stage: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71)
17/05/18 11:41:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Submitting ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 6.1 KB, free 413.7 MB)
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.7 MB)
17/05/18 11:41:21 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.0.115:37499 (size: 3.6 KB, free: 413.9 MB)
17/05/18 11:41:21 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
17/05/18 11:41:21 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 15, localhost, executor driver, partition 0, PROCESS_LOCAL, 96173 bytes)
17/05/18 11:41:21 INFO Executor: Running task 0.0 in stage 15.0 (TID 15)
17/05/18 11:41:22 INFO PythonRunner: Times: total = 67, boot = -1033, init = 1042, finish = 58
17/05/18 11:41:22 INFO Executor: Finished task 0.0 in stage 15.0 (TID 15). 2043 bytes result sent to driver
17/05/18 11:41:22 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 15) in 994 ms on localhost (executor driver) (1/1)
17/05/18 11:41:22 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
17/05/18 11:41:22 INFO DAGScheduler: ShuffleMapStage 15 (sortBy at KolmogorovSmirnovTest.scala:68) finished in 1.002 s
17/05/18 11:41:22 INFO DAGScheduler: looking for newly runnable stages
17/05/18 11:41:22 INFO DAGScheduler: running: Set()
17/05/18 11:41:22 INFO DAGScheduler: waiting: Set(ResultStage 16)
17/05/18 11:41:22 INFO DAGScheduler: failed: Set()
17/05/18 11:41:22 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 19.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 7.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.0.115:37499 (size: 7.9 KB, free: 413.9 MB)
17/05/18 11:41:23 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:23 INFO TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
17/05/18 11:41:23 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 16, localhost, executor driver, partition 0, ANY, 5805 bytes)
17/05/18 11:41:23 INFO Executor: Running task 0.0 in stage 16.0 (TID 16)
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 100 ms
17/05/18 11:41:26 INFO Executor: Finished task 0.0 in stage 16.0 (TID 16). 2061 bytes result sent to driver
17/05/18 11:41:26 INFO TaskSetManager: Finished task 0.0 in stage 16.0 (TID 16) in 3009 ms on localhost (executor driver) (1/1)
17/05/18 11:41:26 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
17/05/18 11:41:26 INFO DAGScheduler: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71) finished in 3.007 s
17/05/18 11:41:26 INFO DAGScheduler: Job 15 finished: collect at KolmogorovSmirnovTest.scala:71, took 4.701565 s
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.9987456949896243
pValue = 3.845022078508009E-10
Very strong presumption against null hypothesis: Sample follows theoretical distribution.

    Description

      In Scala (correct behavior), the call:
      testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), stdDev(j))
      produces:
      17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary:
      degrees of freedom = 0
      statistic = 0.005495681749849268
      pValue = 0.9216108887428276
      No presumption against null hypothesis: Sample follows theoretical distribution.

      In Python (incorrect behavior), the equivalent call:
      testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], numericSD[j])

      fails with this error:
      17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
      net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
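
      The exception message names numpy.dtype, which suggests (this is an inference from the trace, not something confirmed in the report) that the RDD handed to kolmogorovSmirnovTest contains numpy.float64 scalars that Spark's JVM-side Pyrolite unpickler cannot reconstruct. A minimal workaround sketch, assuming that diagnosis: coerce every element (and the distribution parameters) to builtin Python floats before the call. The variable names mirroring the report (vecRDD, numericMean, numericSD) are hypothetical here.

      ```python
      # Sketch of a workaround, assuming the RDD elements are numpy scalars.
      # Builtin floats pickle to plain opcodes that Pyrolite handles, so the
      # numpy.dtype ClassDict construction is never attempted on the JVM side.

      def to_builtin_floats(values):
          """Coerce a sequence of (possibly numpy) scalars to builtin floats."""
          return [float(v) for v in values]

      # In the reporter's pipeline this would look roughly like (hypothetical):
      #   vecRDD = sc.parallelize(to_builtin_floats(column_values))
      #   testResult = Statistics.kolmogorovSmirnovTest(
      #       vecRDD, 'norm', float(numericMean[j]), float(numericSD[j]))

      if __name__ == "__main__":
          sample = [1.0, 2.5, 3.25]  # stand-ins for numpy.float64 values
          converted = to_builtin_floats(sample)
          print(all(type(v) is float for v in converted))  # True
      ```

      The conversion is cheap relative to the test itself, and float() works uniformly on builtin and numpy scalars, so applying it unconditionally is safe.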

          People

            Assignee: Unassigned
            Reporter: bsrsharma (Bettadapura Srinath Sharma)