[SPARK-2433] In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.1
Fix Version/s: 0.9.2
Component/s: MLlib, PySpark
Labels:
- easyfix
- test
Environment: Any
Target Version/s: 0.9.2

Description

I don't have much experience reporting bugs; this is my first report. If anything is unclear, please feel free to contact me (details given below).

The bug is in the PySpark MLlib library.
Path: \spark-0.9.1\python\pyspark\mllib\classification.py

Class: NaiveBayesModel

Method: self.predict

Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta))

New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))

No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))

Explanation:
I believe No:1 is correct. I am not sure about No:2.

Error 1:
The matrix self.theta has dimensions [n_classes, n_features],
while the vector x has dimensions [1, n_features].

Taking the dot product will not work, because it multiplies [1, n_features] by [n_classes, n_features].
Unless n_classes equals n_features, it raises: "ValueError: matrices are not aligned"
In the commented example given in classification.py, n_classes = n_features = 2. That is why no error appears there.

Both Implementation No:1 and Implementation No:2 take care of this.
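
The shape mismatch described above can be reproduced with plain NumPy. The following is only a sketch: the 3-class, 4-feature sizes, the parameter values, and the helper names (`misaligned_scores`, `scores_no1`, `scores_no2`) are made up for illustration and are not taken from classification.py.

```python
import numpy as np

# Hypothetical parameters: 3 classes, 4 features (values are made up).
theta = np.log(np.array([[0.10, 0.20, 0.30, 0.40],
                         [0.40, 0.30, 0.20, 0.10],
                         [0.25, 0.25, 0.25, 0.25]]))  # shape (n_classes, n_features)
pi = np.log(np.array([0.5, 0.3, 0.2]))                # shape (n_classes,)
x = np.array([1.0, 0.0, 2.0, 0.0])                    # shape (n_features,)

def misaligned_scores(x, pi, theta):
    # x (n_features,) times theta (n_classes, n_features): inner dimensions differ
    return pi + np.dot(x, theta)

def scores_no1(x, pi, theta):
    # theta (n_classes, n_features) times x (n_features,) -> (n_classes,)
    return pi + np.log(np.dot(np.exp(theta), x))

def scores_no2(x, pi, theta):
    # x (n_features,) times theta.T (n_features, n_classes) -> (n_classes,)
    return pi + np.dot(x, theta.T)

try:
    misaligned_scores(x, pi, theta)
    raised = False
except ValueError:
    raised = True  # NumPy refuses to multiply the mismatched shapes

shape_no1 = scores_no1(x, pi, theta).shape
shape_no2 = scores_no2(x, pi, theta).shape

# With a square theta (n_classes == n_features) the misaligned product runs
# without an exception, so the problem stays hidden.
square_shape = misaligned_scores(x[:3], pi, theta[:, :3]).shape
```

In this sketch `raised` ends up `True`, both proposed variants return one score per class, and the square case runs without any exception, which is consistent with the example in classification.py not failing.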

Error 2:
The basic implementation of naive Bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample), where P(sample) is a constant,

and taking the class with the maximum value.
That is what Implementation No:1 is doing.

Implementation No:2 instead takes the class with the maximum value of:
exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)

I don't know whether it gives the exact result.
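
One way to check what the two proposed expressions compute is to compare them numerically against explicit formulas. The sketch below assumes that `theta` holds log conditional probabilities, `pi` holds log class priors, and `x` holds feature counts; these are assumptions, and every value is invented for illustration. Under those assumptions, No:2 matches the log of P(class) times the product of P(feature_i | class) raised to count_i, which is the usual multinomial naive Bayes form, while No:1 matches the log of P(class) times the count-weighted sum of P(feature_i | class).

```python
import numpy as np

# Made-up parameters: 2 classes, 3 features. Each row of probs sums to 1.
probs = np.array([[0.45, 0.45, 0.10],
                  [0.90, 0.05, 0.05]])  # P(feature_i | class)
priors = np.array([0.5, 0.5])           # P(class)
theta = np.log(probs)                   # assumed: log conditional probabilities
pi = np.log(priors)                     # assumed: log class priors
x = np.array([1.0, 1.0, 0.0])           # feature counts for one sample

# The two expressions proposed in the report
no1 = pi + np.log(np.dot(np.exp(theta), x))
no2 = pi + np.dot(x, theta.T)

# Explicit reference formulas, written out in probability space
product_form = np.log(priors * np.prod(probs ** x, axis=1))  # P(c) * prod_i P(f_i|c)^count_i
sum_form = np.log(priors * np.dot(probs, x))                 # P(c) * sum_i count_i * P(f_i|c)

no1_matches_sum = np.allclose(no1, sum_form)
no2_matches_product = np.allclose(no2, product_form)

predicted_no1 = int(np.argmax(no1))
predicted_no2 = int(np.argmax(no2))
```

With these made-up numbers the two expressions pick different classes (`predicted_no1` is 1, `predicted_no2` is 0), so they are not interchangeable in general.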

Thanks
Rahul Bhojwani
rahulbhojwani2003@gmail.com

People

Assignee: Xiangrui Meng

Reporter: Rahul K Bhojwani

Votes: 0

Watchers: 3

Dates

Created: 10/Jul/14 16:37

Updated: 17/Jul/14 05:27

Resolved: 17/Jul/14 03:12

Time Tracking: Not Specified