Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.1
Fix Version/s: None
Environment:
Linux version 4.4.14-smp
x86/fpu: Legacy x87 FPU detected.
using command line:
bash-4.3$ ./bin/spark-submit ~/work/python/Features.py
bash-4.3$ pwd
/home/bsrsharma/spark-2.1.1-bin-hadoop2.7
export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121
Labels: Important
***********************************************************************************************
Call stack on error (When data is normally distributed):
***********************************************************************************************
17/05/17 21:59:22 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
17/05/17 21:59:22 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
17/05/17 21:59:22 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
17/05/17 21:59:22 INFO DAGScheduler: Parents of final stage: List()
17/05/17 21:59:22 INFO DAGScheduler: Missing parents: List()
17/05/17 21:59:22 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
17/05/17 21:59:22 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
17/05/17 21:59:22 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
17/05/17 21:59:22 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:38047 (size: 2.8 KB, free: 413.9 MB)
17/05/17 21:59:22 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
17/05/17 21:59:22 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
17/05/17 21:59:22 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
17/05/17 21:59:22 WARN TaskSetManager: Stage 14 contains a task of very large size (204 KB). The maximum recommended task size is 100 KB.
17/05/17 21:59:22 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 209337 bytes)
17/05/17 21:59:22 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 21:59:23 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 21:59:23 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
17/05/17 21:59:23 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
17/05/17 21:59:23 INFO TaskSchedulerImpl: Cancelling stage 14
17/05/17 21:59:23 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) failed in 1.253 s due to Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
17/05/17 21:59:23 INFO DAGScheduler: Job 14 failed: count at KolmogorovSmirnovTest.scala:67, took 1.499121 s
Traceback (most recent call last):
File "/home/bsrsharma/work/python/Features.py", line 271, in <module>
rc = findFeatures("/home/bsrsharma/work/python/arran.csv", "/home/bsrsharma/work/python/features.csv" )
File "/home/bsrsharma/work/python/Features.py", line 245, in findFeatures
testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], numericSD[j])
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 301, in kolmogorovSmirnovTest
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/bsrsharma/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o273.kolmogorovSmirnovTest.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:67)
at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:85)
at org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest$.testOneSample(KolmogorovSmirnovTest.scala:187)
at org.apache.spark.mllib.stat.Statistics$.kolmogorovSmirnovTest(Statistics.scala:220)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.kolmogorovSmirnovTest(PythonMLLibAPI.scala:1135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
17/05/17 21:59:24 INFO SparkContext: Invoking stop() from shutdown hook
17/05/17 21:59:24 INFO SparkUI: Stopped Spark web UI at http://192.168.0.115:4040
17/05/17 21:59:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/05/17 21:59:25 INFO MemoryStore: MemoryStore cleared
17/05/17 21:59:25 INFO BlockManager: BlockManager stopped
17/05/17 21:59:25 INFO BlockManagerMaster: BlockManagerMaster stopped
17/05/17 21:59:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/05/17 21:59:25 INFO SparkContext: Successfully stopped SparkContext
17/05/17 21:59:25 INFO ShutdownHookManager: Shutdown hook called
17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a/pyspark-60290f87-c16d-4f4b-a4df-c4e40eaf61a1
17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a
bash-4.3$
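For reference, the PickleException above ("expected zero arguments for construction of ClassDict (for numpy.dtype)") is characteristic of passing numpy scalar types (such as numpy.float64, which numpy statistics on the CSV columns would produce) through Py4J into MLlib: the Java-side unpickler (Pyrolite) cannot reconstruct numpy.dtype. A minimal sketch of the usual workaround, casting the distribution parameters to plain Python floats before the call. FakeNumpyFloat and safe are hypothetical stand-ins for illustration only; the commented call reuses names from the traceback above:

```python
# Sketch only: numpy.float64 subclasses Python float but pickles via
# numpy.dtype, which Pyrolite on the Java side cannot reconstruct.
# Casting to a plain builtin float before the MLlib call avoids that.

class FakeNumpyFloat(float):
    """Hypothetical stand-in for numpy.float64: a float subclass."""

def safe(x):
    # float() strips the subclass, so the JVM sees a builtin float.
    return float(x)

mean = FakeNumpyFloat(3.2)
assert type(mean) is not float        # subclass, like numpy.float64
assert type(safe(mean)) is float      # plain float after the cast

# With names from the traceback, the guarded call would look like:
# testResult = Statistics.kolmogorovSmirnovTest(
#     vecRDD, 'norm', safe(numericMean[j]), safe(numericSD[j]))
```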
***********************************************************************************************
Output when there is no error (When the data is NOT normally distributed!)
***********************************************************************************************
17/05/18 11:41:20 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
17/05/18 11:41:20 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
17/05/18 11:41:20 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
17/05/18 11:41:20 INFO DAGScheduler: Parents of final stage: List()
17/05/18 11:41:20 INFO DAGScheduler: Missing parents: List()
17/05/18 11:41:20 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
17/05/18 11:41:20 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:37499 (size: 2.8 KB, free: 413.9 MB)
17/05/18 11:41:20 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
17/05/18 11:41:20 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 96099 bytes)
17/05/18 11:41:20 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
17/05/18 11:41:20 INFO PythonRunner: Times: total = 70, boot = -4396, init = 4409, finish = 57
17/05/18 11:41:20 INFO Executor: Finished task 0.0 in stage 14.0 (TID 14). 1680 bytes result sent to driver
17/05/18 11:41:20 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) finished in 0.353 s
17/05/18 11:41:20 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 14) in 357 ms on localhost (executor driver) (1/1)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
17/05/18 11:41:20 INFO DAGScheduler: Job 14 finished: count at KolmogorovSmirnovTest.scala:67, took 0.664443 s
17/05/18 11:41:21 INFO SparkContext: Starting job: collect at KolmogorovSmirnovTest.scala:71
17/05/18 11:41:21 INFO DAGScheduler: Registering RDD 22 (sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO DAGScheduler: Got job 15 (collect at KolmogorovSmirnovTest.scala:71) with 1 output partitions
17/05/18 11:41:21 INFO DAGScheduler: Final stage: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71)
17/05/18 11:41:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Submitting ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 6.1 KB, free 413.7 MB)
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.7 MB)
17/05/18 11:41:21 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.0.115:37499 (size: 3.6 KB, free: 413.9 MB)
17/05/18 11:41:21 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
17/05/18 11:41:21 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 15, localhost, executor driver, partition 0, PROCESS_LOCAL, 96173 bytes)
17/05/18 11:41:21 INFO Executor: Running task 0.0 in stage 15.0 (TID 15)
17/05/18 11:41:22 INFO PythonRunner: Times: total = 67, boot = -1033, init = 1042, finish = 58
17/05/18 11:41:22 INFO Executor: Finished task 0.0 in stage 15.0 (TID 15). 2043 bytes result sent to driver
17/05/18 11:41:22 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 15) in 994 ms on localhost (executor driver) (1/1)
17/05/18 11:41:22 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
17/05/18 11:41:22 INFO DAGScheduler: ShuffleMapStage 15 (sortBy at KolmogorovSmirnovTest.scala:68) finished in 1.002 s
17/05/18 11:41:22 INFO DAGScheduler: looking for newly runnable stages
17/05/18 11:41:22 INFO DAGScheduler: running: Set()
17/05/18 11:41:22 INFO DAGScheduler: waiting: Set(ResultStage 16)
17/05/18 11:41:22 INFO DAGScheduler: failed: Set()
17/05/18 11:41:22 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 19.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 7.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.0.115:37499 (size: 7.9 KB, free: 413.9 MB)
17/05/18 11:41:23 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:23 INFO TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
17/05/18 11:41:23 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 16, localhost, executor driver, partition 0, ANY, 5805 bytes)
17/05/18 11:41:23 INFO Executor: Running task 0.0 in stage 16.0 (TID 16)
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 100 ms
17/05/18 11:41:26 INFO Executor: Finished task 0.0 in stage 16.0 (TID 16). 2061 bytes result sent to driver
17/05/18 11:41:26 INFO TaskSetManager: Finished task 0.0 in stage 16.0 (TID 16) in 3009 ms on localhost (executor driver) (1/1)
17/05/18 11:41:26 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
17/05/18 11:41:26 INFO DAGScheduler: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71) finished in 3.007 s
17/05/18 11:41:26 INFO DAGScheduler: Job 15 finished: collect at KolmogorovSmirnovTest.scala:71, took 4.701565 s
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.9987456949896243
pValue = 3.845022078508009E-10
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1349) at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1348) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more 17/05/17 21:59:24 INFO SparkContext: Invoking stop() from shutdown hook 17/05/17 21:59:24 INFO SparkUI: Stopped Spark web UI at http://192.168.0.115:4040 17/05/17 21:59:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/05/17 21:59:25 INFO MemoryStore: MemoryStore cleared 17/05/17 21:59:25 INFO BlockManager: BlockManager stopped 17/05/17 21:59:25 INFO BlockManagerMaster: BlockManagerMaster stopped 17/05/17 21:59:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/05/17 21:59:25 INFO SparkContext: Successfully stopped SparkContext 17/05/17 21:59:25 INFO ShutdownHookManager: Shutdown hook called 17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a/pyspark-60290f87-c16d-4f4b-a4df-c4e40eaf61a1 17/05/17 21:59:25 INFO ShutdownHookManager: Deleting directory /tmp/spark-8509ad84-e38e-485b-addd-04d8258ee73a bash-4.3$ *********************************************************************************************** Output when there is no error (When the data is NOT normally distributed!) 
***********************************************************************************************
17/05/18 11:41:20 INFO SparkContext: Starting job: count at KolmogorovSmirnovTest.scala:67
17/05/18 11:41:20 INFO DAGScheduler: Got job 14 (count at KolmogorovSmirnovTest.scala:67) with 1 output partitions
17/05/18 11:41:20 INFO DAGScheduler: Final stage: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67)
17/05/18 11:41:20 INFO DAGScheduler: Parents of final stage: List()
17/05/18 11:41:20 INFO DAGScheduler: Missing parents: List()
17/05/18 11:41:20 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 4.4 KB, free 413.7 MB)
17/05/18 11:41:20 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.7 MB)
17/05/18 11:41:20 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.0.115:37499 (size: 2.8 KB, free: 413.9 MB)
17/05/18 11:41:20 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[21] at mapPartitions at PythonMLLibAPI.scala:1345)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
17/05/18 11:41:20 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14, localhost, executor driver, partition 0, PROCESS_LOCAL, 96099 bytes)
17/05/18 11:41:20 INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
17/05/18 11:41:20 INFO PythonRunner: Times: total = 70, boot = -4396, init = 4409, finish = 57
17/05/18 11:41:20 INFO Executor: Finished task 0.0 in stage 14.0 (TID 14). 1680 bytes result sent to driver
17/05/18 11:41:20 INFO DAGScheduler: ResultStage 14 (count at KolmogorovSmirnovTest.scala:67) finished in 0.353 s
17/05/18 11:41:20 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 14) in 357 ms on localhost (executor driver) (1/1)
17/05/18 11:41:20 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
17/05/18 11:41:20 INFO DAGScheduler: Job 14 finished: count at KolmogorovSmirnovTest.scala:67, took 0.664443 s
17/05/18 11:41:21 INFO SparkContext: Starting job: collect at KolmogorovSmirnovTest.scala:71
17/05/18 11:41:21 INFO DAGScheduler: Registering RDD 22 (sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO DAGScheduler: Got job 15 (collect at KolmogorovSmirnovTest.scala:71) with 1 output partitions
17/05/18 11:41:21 INFO DAGScheduler: Final stage: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71)
17/05/18 11:41:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 15)
17/05/18 11:41:21 INFO DAGScheduler: Submitting ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 6.1 KB, free 413.7 MB)
17/05/18 11:41:21 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.7 MB)
17/05/18 11:41:21 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.0.115:37499 (size: 3.6 KB, free: 413.9 MB)
17/05/18 11:41:21 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 15 (MapPartitionsRDD[22] at sortBy at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:21 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
17/05/18 11:41:21 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 15, localhost, executor driver, partition 0, PROCESS_LOCAL, 96173 bytes)
17/05/18 11:41:21 INFO Executor: Running task 0.0 in stage 15.0 (TID 15)
17/05/18 11:41:22 INFO PythonRunner: Times: total = 67, boot = -1033, init = 1042, finish = 58
17/05/18 11:41:22 INFO Executor: Finished task 0.0 in stage 15.0 (TID 15). 2043 bytes result sent to driver
17/05/18 11:41:22 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 15) in 994 ms on localhost (executor driver) (1/1)
17/05/18 11:41:22 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
17/05/18 11:41:22 INFO DAGScheduler: ShuffleMapStage 15 (sortBy at KolmogorovSmirnovTest.scala:68) finished in 1.002 s
17/05/18 11:41:22 INFO DAGScheduler: looking for newly runnable stages
17/05/18 11:41:22 INFO DAGScheduler: running: Set()
17/05/18 11:41:22 INFO DAGScheduler: waiting: Set(ResultStage 16)
17/05/18 11:41:22 INFO DAGScheduler: failed: Set()
17/05/18 11:41:22 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68), which has no missing parents
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 19.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 7.9 KB, free 413.6 MB)
17/05/18 11:41:23 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.0.115:37499 (size: 7.9 KB, free: 413.9 MB)
17/05/18 11:41:23 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:996
17/05/18 11:41:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 16 (MapPartitionsRDD[25] at mapPartitions at KolmogorovSmirnovTest.scala:68)
17/05/18 11:41:23 INFO TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
17/05/18 11:41:23 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 16, localhost, executor driver, partition 0, ANY, 5805 bytes)
17/05/18 11:41:23 INFO Executor: Running task 0.0 in stage 16.0 (TID 16)
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/05/18 11:41:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 100 ms
17/05/18 11:41:26 INFO Executor: Finished task 0.0 in stage 16.0 (TID 16). 2061 bytes result sent to driver
17/05/18 11:41:26 INFO TaskSetManager: Finished task 0.0 in stage 16.0 (TID 16) in 3009 ms on localhost (executor driver) (1/1)
17/05/18 11:41:26 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
17/05/18 11:41:26 INFO DAGScheduler: ResultStage 16 (collect at KolmogorovSmirnovTest.scala:71) finished in 3.007 s
17/05/18 11:41:26 INFO DAGScheduler: Job 15 finished: collect at KolmogorovSmirnovTest.scala:71, took 4.701565 s
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.9987456949896243
pValue = 3.845022078508009E-10
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
Description
In Scala (correct behavior), the code:
testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), stdDev(j))
produces:
17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.005495681749849268
pValue = 0.9216108887428276
No presumption against null hypothesis: Sample follows theoretical distribution.
In Python (incorrect behavior), the code:
testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], numericSD[j])
causes this error:
17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
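The failure is consistent with `vecRDD` containing numpy scalar values rather than native Python floats: Spark's JVM-side unpickler (Pyrolite) cannot reconstruct `numpy.dtype` objects, which pickle as a constructor call rather than as a plain float opcode. The following sketch is Spark-independent (it only assumes numpy is installed) and illustrates the difference in the pickle streams; it is a hypothesis about the cause, not a confirmed diagnosis:

```python
import pickle
import pickletools

import numpy as np


def opcodes(obj):
    """Return the names of the pickle opcodes used to serialize obj
    (protocol 2, the protocol PySpark 2.x uses for Python-to-JVM transfer)."""
    return [op.name for op, arg, pos in pickletools.genops(pickle.dumps(obj, protocol=2))]


# A native Python float serializes as a single BINFLOAT opcode,
# which the JVM-side unpickler understands directly.
native_ops = opcodes(1.5)

# A numpy scalar instead serializes as a constructor call (REDUCE) that
# rebuilds a numpy.dtype -- the "ClassDict (for numpy.dtype)" that
# Pyrolite refuses to construct with arguments.
numpy_ops = opcodes(np.float64(1.5))

print("BINFLOAT" in native_ops, "REDUCE" in numpy_ops)  # prints: True True
```

If this is indeed the cause, a possible workaround (hypothetical, not tested against the reporter's `Features.py`) is to coerce the RDD elements to native floats before the call, e.g. `vecRDD.map(float)`.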