#### Description

I don't have much experience reporting errors; this is my first time. If something is not clear, please feel free to contact me (details are given below).

In the pyspark mllib library.

Path : \spark-0.9.1\python\pyspark\mllib\classification.py

Class: NaiveBayesModel

Method: self.predict

Earlier Implementation:

    def predict(self, x):
        """Return the most likely class for a data vector x"""
        return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))

New Implementation:

No:1

    def predict(self, x):
        """Return the most likely class for a data vector x"""
        return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))

No:2

    def predict(self, x):
        """Return the most likely class for a data vector x"""
        return numpy.argmax(self.pi + dot(x, self.theta.T))

Explanation:

In my view No:1 is correct. I am not sure about No:2.

Error 1:

The matrix self.theta has dimension [n_classes, n_features], while the vector x has dimension [1, n_features].

Taking the dot product of x with self.theta does not work, because the shapes are [1, n_features] x [n_classes, n_features].

Whenever n_classes is not equal to n_features, it gives the error: "ValueError: matrices are not aligned"

In the commented example given in classification.py, n_classes = n_features = 2, which is why the error does not show up there.

Both Implementation No:1 and Implementation No:2 take care of this.
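The shape argument above can be checked with a small NumPy sketch. The parameter values are made up for illustration, x is assumed to be a 1-D array of length n_features (the report does not show how the vector is actually passed in), and `shapes_align` is a hypothetical helper that is not part of pyspark.

```python
import numpy as np

# Illustrative values only: 3 classes, 2 features, so n_classes != n_features.
theta = np.log(np.array([[0.7, 0.3],
                         [0.4, 0.6],
                         [0.5, 0.5]]))  # shape (n_classes, n_features)
x = np.array([2.0, 1.0])                # shape (n_features,)

def shapes_align(a, b):
    """Return True if np.dot(a, b) succeeds, False if it raises ValueError."""
    try:
        np.dot(a, b)
        return True
    except ValueError:
        return False

# x . theta: inner dimensions are 2 and 3, so the product is rejected.
fails_x_theta = not shapes_align(x, theta)

# Product used in No:1: (3, 2) . (2,) gives one value per class.
works_no1 = shapes_align(np.exp(theta), x)

# Product used in No:2: (2,) . (2, 3) gives one value per class.
works_no2 = shapes_align(x, theta.T)

print(fails_x_theta, works_no1, works_no2)  # -> True True True
```

If x were instead a 2-D array of shape (1, n_features), the product in No:1 would also be misaligned unless x is transposed first, so the 1-D assumption matters here.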

Error 2:

As I understand it, the basic implementation of naive Bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample), where P(sample) is a constant,

and then taking the class with the maximum value.

That's what implementation 1 is doing.

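In Implementation 2:

It basically takes the class with the maximum value of:

P(feature_1 | class_n)^count_feature_1 * ... * P(feature_n | class_n)^count_feature_n * P(class_n)

(assuming self.theta holds log P(feature | class) and self.pi holds log P(class), since exp(count * log P) = P^count)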

I don't know whether it gives the same result as Implementation 1.
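One way to check whether the two candidates agree is to evaluate both on a small input. The sketch below is only an illustration: the probabilities and counts are invented, and it assumes self.theta stores log P(feature | class) and self.pi stores log P(class). Under those assumptions No:1 scores each class by log(P(class) * sum_i count_i * P(feature_i | class)), while No:2 scores it by log(P(class) * prod_i P(feature_i | class)^count_i), which is the form usually given for multinomial naive Bayes.

```python
import numpy as np

# Invented parameters: 2 classes, 2 features, equal priors.
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])          # P(feature_i | class)
theta = np.log(probs)                   # shape (n_classes, n_features)
pi = np.log(np.array([0.5, 0.5]))       # log P(class)
x = np.array([3.0, 1.0])                # feature counts for one sample

# No:1: log prior plus log of a weighted SUM of feature probabilities.
score_no1 = pi + np.log(np.dot(np.exp(theta), x))

# No:2: log prior plus a weighted sum of LOG probabilities,
# i.e. the log of a product of probabilities raised to the counts.
score_no2 = pi + np.dot(x, theta.T)

# Direct product form, computed without logs, for comparison with No:2.
direct_product = np.exp(pi) * np.prod(probs ** x, axis=1)

print(int(np.argmax(score_no1)))  # -> 0
print(int(np.argmax(score_no2)))  # -> 1
print(np.allclose(np.exp(score_no2), direct_product))  # -> True
```

On this input the two candidates pick different classes, so they are not interchangeable in general; which one matches the intended model depends on how theta and pi are defined by the training code.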

Thanks

Rahul Bhojwani

rahulbhojwani2003@gmail.com