Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.1.1
-
None
-
Important
-
Description
When data is NOT normally distributed (correct behavior):
This code:
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
- Find density estimates for the given values
densities = kd.estimate(samplePoints)
produces:
17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at KernelDensity.scala:92
17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at KernelDensity.scala:92) with 1 output partitions
17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate at KernelDensity.scala:92)
17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List()
17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List()
17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 6.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB)
17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:996
17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345)
17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 24, localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes)
17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24)
17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = 1844, finish = 53
17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 2476 bytes result sent to driver
17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at KernelDensity.scala:92) finished in 1.001 s
17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 24) in 1004 ms on localhost (executor driver) (1/1)
17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks have all completed, from pool
17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at KernelDensity.scala:92, took 1.136263 s
17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB)
5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001
,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
But if Data IS normally distributed:
I see:
17/05/18 15:50:16 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 24)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
On Scala, the correct result is:
Code:
vecRDD = sc.parallelize(colVec)
kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
// Find density estimates for the given values
densities = kd.estimate(samplePoints)
[0.04113814235801906,1.0994865517293571E-163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,]