Spark / SPARK-2433

In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.9.2
    • Component/s: MLlib, PySpark
    • Any

    Description

      I don't have much experience reporting bugs; this is my first time. If something is not clear, please feel free to contact me (details given below).

      In the pyspark mllib library.
      Path : \spark-0.9.1\python\pyspark\mllib\classification.py

      Class: NaiveBayesModel

      Method: self.predict

      Earlier Implementation:
      def predict(self, x):
          """Return the most likely class for a data vector x"""
          return numpy.argmax(self.pi + dot(x, self.theta))

      New Implementation:
      No:1
      def predict(self, x):
          """Return the most likely class for a data vector x"""
          return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))

      No:2
      def predict(self, x):
          """Return the most likely class for a data vector x"""
          return numpy.argmax(self.pi + dot(x, self.theta.T))

      Explanation:
      I believe No:1 is correct. I am not sure about No:2.

      Error 1:
      The matrix self.theta has dimension [n_classes, n_features],
      while the matrix x has dimension [1, n_features].

      Taking the dot product of x with self.theta will not work, as it is [1, n_features] x [n_classes, n_features].
      Whenever n_classes != n_features it gives the error: "ValueError: matrices are not aligned"
      In the commented example given in classification.py, n_classes = n_features = 2. That is why no error occurs there.

      Both Implementation No:1 and Implementation No:2 take care of this.
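
      The shape problem can be reproduced outside Spark with plain NumPy. This is a minimal sketch, not the MLlib code: the sizes, the uniform parameters, and the helper shape_or_error are hypothetical, chosen only so that n_classes != n_features.

```python
import numpy as np

n_classes, n_features = 3, 4

# Hypothetical model parameters, for illustration only.
theta = np.log(np.full((n_classes, n_features), 1.0 / n_features))  # [n_classes, n_features]
pi = np.log(np.full(n_classes, 1.0 / n_classes))                    # [n_classes]
x = np.array([1.0, 0.0, 2.0, 1.0])                                  # one sample, [n_features]


def shape_or_error(fn):
    """Return the shape of fn()'s result, or the name of the ValueError it raises."""
    try:
        return np.shape(fn())
    except ValueError as exc:
        return type(exc).__name__


# dot(x, theta): (4,) against (3, 4), inner dimensions do not match.
original = shape_or_error(lambda: pi + np.dot(x, theta))
# No:1, dot(exp(theta), x): (3, 4) against (4,), one score per class.
candidate_1 = shape_or_error(lambda: pi + np.log(np.dot(np.exp(theta), x)))
# No:2, dot(x, theta.T): (4,) against (4, 3), one score per class.
candidate_2 = shape_or_error(lambda: pi + np.dot(x, theta.T))

print(original, candidate_1, candidate_2)  # ValueError (3,) (3,)
```

      With n_classes = n_features the first expression runs without error, which matches the 2 x 2 example in classification.py.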

      Error 2:
      The basic implementation of Naive Bayes is:
      P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample)

      where P(sample) is a constant, and then taking the class with the max value.
      That's what Implementation 1 is doing.

      Implementation 2 basically takes the class with the max value of:
      exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)

      I don't know if it gives the same result.
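
      One way to check what Implementation 2 computes is to compare it numerically with multinomial Naive Bayes written out directly, P(class) * prod_j P(feature_j | class) ^ count_j. The sketch below assumes that self.pi and self.theta hold the logs of the class priors and of the per-class feature probabilities; every parameter value is made up for illustration.

```python
import numpy as np

# Hypothetical parameters: 2 classes, 3 features.
priors = np.array([0.5, 0.5])                 # P(class)
cond = np.array([[0.98, 0.01, 0.01],          # P(feature | class 0)
                 [0.40, 0.20, 0.40]])         # P(feature | class 1)
pi, theta = np.log(priors), np.log(cond)      # log-space parameters (assumed layout)
x = np.array([1.0, 0.0, 1.0])                 # feature counts for one sample

# Implementation 2: log prior plus count-weighted sum of log probabilities.
scores_2 = pi + np.dot(x, theta.T)

# Multinomial Naive Bayes written out directly, up to the constant P(sample).
direct = priors * np.prod(cond ** x, axis=1)

# Implementation 1: log prior plus log of a count-weighted sum of probabilities.
scores_1 = pi + np.log(np.dot(np.exp(theta), x))

print(np.allclose(np.exp(scores_2), direct))     # True
print(np.argmax(scores_1), np.argmax(scores_2))  # 0 1
```

      Under that assumption, exp of the Implementation 2 score equals the direct product, and on this made-up sample Implementations 1 and 2 pick different classes, so the two are not interchangeable.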

      Thanks
      Rahul Bhojwani
      rahulbhojwani2003@gmail.com

          People

            Assignee: mengxr Xiangrui Meng
            Reporter: rahul1993 Rahul K Bhojwani
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified