Mahout
MAHOUT-1786

Make classes implements Serializable for Spark 1.5+

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.13.1
    • Component/s: Math
    • Labels:

      Description

      Spark 1.5 comes with a new, very efficient serializer that uses code generation; it is twice as fast as Kryo. When using Mahout, we have to set KryoSerializer because some classes aren't serializable otherwise.

      I suggest declaring Math classes as "implements Serializable" where needed. For instance, to use the cooccurrence package in Spark 1.5, we had to modify AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it work without Kryo.
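The proposed change can be illustrated with a minimal, self-contained sketch. `SimpleDenseVector` below is a hypothetical stand-in for classes like DenseVector, not actual Mahout code; the point is that once a class implements `java.io.Serializable`, a plain Java serialization round trip succeeds instead of throwing `NotSerializableException`:

```java
import java.io.*;

// Hypothetical simplified vector class illustrating the proposed change:
// adding "implements Serializable" lets Spark's default JavaSerializer
// handle it without requiring the KryoSerializer setting.
class SimpleDenseVector implements Serializable {
    private static final long serialVersionUID = 1L;
    private final double[] values;
    SimpleDenseVector(double[] values) { this.values = values; }
    double get(int i) { return values[i]; }
}

public class SerializableSketch {
    public static void main(String[] args) throws Exception {
        SimpleDenseVector v = new SimpleDenseVector(new double[] {1.0, 2.0, 3.0});

        // Round-trip through plain Java serialization; this throws
        // NotSerializableException unless the class implements Serializable.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(v);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            SimpleDenseVector copy = (SimpleDenseVector) ois.readObject();
            System.out.println("roundtrip ok: " + copy.get(2));
        }
    }
}
```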

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user michellemay opened a pull request:

          https://github.com/apache/mahout/pull/174

          MAHOUT-1786: Make classes implements Serializable for Spark 1.5+

          Add some "implements Serializable" for Apache Spark 1.5+
          There might be other classes that would benefit from the same modification.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/michellemay/mahout fix-mahout-1786

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/174.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #174


          commit a0aad6a716d001c5bbe1ee08e8865dbb83b6f1e4
          Author: michellemay <mlemay@gmail.com>
          Date: 2015-11-06T15:34:04Z

          implements Serializable


          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-154476424

          -1. Unfortunately, I think just adding "Serializable" to every matrix class is not going to cut it.

          If custom Java serialization were implemented a bit more efficiently (as it is currently done in the Writable and Kryo serializations), I would vote -0.1 (per Apache voting guidelines).

          The reason is, Java serialization is still not the best way to pack the tensor data.

          Another reason is that there is no motivation in Mahout for Java serialization support: all of our supported backends already support both the Kryo and Writable protocols.

          There are other minor reasons not to use Java serialization as well (such as class compatibility checks, etc.).

          There may admittedly be external reasons, but I feel Kryo serialization should be a good enough answer for those as well.
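The "more efficient custom Java serialization" referred to here can be sketched as follows. `CompactVector` is a hypothetical illustration, not Mahout code: overriding `writeObject`/`readObject` packs just the raw values instead of letting the default path walk the object graph, similar in spirit to what the Writable and Kryo paths do.

```java
import java.io.*;

// Hypothetical sketch of custom Java serialization: the values array is
// marked transient so the default path skips it, and writeObject/readObject
// pack it as a tight, flat run of doubles.
class CompactVector implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient double[] values; // skipped by default serialization

    CompactVector(double[] values) { this.values = values; }
    double get(int i) { return values[i]; }
    int size() { return values.length; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v); // flat packing
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int n = in.readInt();
        values = new double[n];
        for (int i = 0; i < n; i++) values[i] = in.readDouble();
    }
}

public class CustomSerializationSketch {
    public static void main(String[] args) throws Exception {
        // Round-trip to show the custom path restores the transient data.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new CompactVector(new double[] {4.0, 5.0, 6.0}));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            CompactVector copy = (CompactVector) ois.readObject();
            System.out.println("size=" + copy.size() + " last=" + copy.get(2));
        }
    }
}
```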

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-154478671

          Although I am not sure about the efficiency point made in the original issue (the bytecode-generated serializer in Spark 1.5), I am dubious it will work well without custom serialization algorithms; it needs a benchmark, IMO. There may still be a significant difference between serializing a data structure directly vs. serializing data structure iterators and rebuilding the structure.

          githubbot ASF GitHub Bot added a comment -

          Github user michellemay commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-154481034

          Forcing the use of Kryo might not be a valid status quo for Spark 1.5+, though.

          For reference, here is project Tungsten:
          https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

          "The above chart compares the performance of shuffling 8 million complex rows in one thread using the Kryo serializer and a code generated custom serializer. The code generated serializer exploits the fact that all rows in a single shuffle have the same schema and generates specialized code for that. This made the generated version over 2X faster to shuffle than the Kryo version."
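For context, "forcing the use of Kryo" refers to the standard Spark configuration Mahout users set today. A configuration fragment (not a complete program; assumes spark-core on the classpath, and the registrator class name is taken from Mahout's Spark bindings, so verify it against your Mahout version):

```java
// Without this setting, Spark falls back to its default JavaSerializer,
// which fails on Mahout classes that do not implement Serializable.
SparkConf conf = new SparkConf()
    .setAppName("mahout-job")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
         "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator");
```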

          githubbot ASF GitHub Bot added a comment -

          Github user michellemay commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-154481151

          I leave it to you to find a way to use those new codegen optimizations in Spark.

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-154542945

          Michel:

          Just to be clear: can you or can you not confirm that this change

          (1) is known to work correctly (as in all unit tests pass in a
          java-serialized session); and

          (2) that it is actually indeed faster than the same things in a kryo
          session?

          I think those are the main things to confirm in this issue.

          On Fri, Nov 6, 2015 at 9:37 AM, Michel Lemay <notifications@github.com>
          wrote:

          > I leave it to you to find a way to use that new codegen optimizations in
          > spark.

          githubbot ASF GitHub Bot added a comment -

          Github user michellemay commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-155052769

          I have a hard time testing vanilla 'master' on Windows (my dev environment).
          I get tons of NPEs (92 in total) at __randomizedtesting.SeedInfo.seed and java.lang.ProcessBuilder.start.

          githubbot ASF GitHub Bot added a comment -

          Github user michellemay commented on the pull request:

          https://github.com/apache/mahout/pull/174#issuecomment-155447141

          I'm not 100% sure; I still have other tests to do, but it looks like the default serializer in Spark 1.5.1 is 19% SLOWER than Kryo when performing a large matrix AtA.

          That is really sad, since I see huge gains elsewhere in our Spark tasks.

          Is there any possibility of having the best of both worlds?

          githubbot ASF GitHub Bot added a comment -

          Github user andrewmusselman commented on the issue:

          https://github.com/apache/mahout/pull/174

          I see this is old; @michellemay could you let us know if there's any room to progress?

          Andrew_Palumbo Andrew Palumbo added a comment -

          This (MAHOUT-1786) may be part of the broken-broadcast issues.

          pferrel Pat Ferrel added a comment -

          It sounds like we could remove Kryo altogether and improve performance by using the new Spark serializer. It also sounds like this uses the more standard approach of extending Serializable, which is built into many Scala classes, IIRC.

          Removing Kryo with a performance gain seems like a big win; Kryo causes many config problems for new users.

          Andrew_Palumbo Andrew Palumbo added a comment -

          It would be one less setting to worry about.

          pferrel Pat Ferrel added a comment -

          Hmm, removing Kryo altogether is probably a good idea. I have never touched this code and do not maintain classes that need this; all my classes use either data in the above types or base Scala types that are serializable.

          I'm sending this back to Andrew Palumbo for reassignment or further discussion.

          If the new serializer is better than Kryo, by all means let's move there ASAP.

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the issue:

          https://github.com/apache/mahout/pull/174

          Michel, I would like to revisit this issue, but I need to understand whether there is indeed any merit to it.

          One thing I don't understand is why you think supporting Java serialization has anything to do with enabling the codegen serializer in Spark. Where is the information source that explicitly makes such guarantees? I couldn't immediately find any.

          All I can see is that it is used in Catalyst, perhaps to serialize specific types there with keys and payloads (tuples?), but I can't see how this would apply to any non-DataFrame pipeline or type. If you have any pointers, please let me know.

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/174

          @michellemay let me know if there is anything I can do to help push this forward.

          Thanks for your hard work!

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on the issue:

          https://github.com/apache/mahout/pull/174

          @rawkintrevo I don't believe at this point that this issue contains a valid assertion w.r.t. being Java-serializable triggering the codegen serializer, nor is there any experimental evidence of such. And there's Spark 2.0.0 guidance to the contrary, as I mentioned.


  People

    • Assignee: Pat Ferrel (pferrel)
    • Reporter: Michel Lemay (FlamingMike)
    • Votes: 0
    • Watchers: 5

  Dates

    • Created:
    • Updated: