Mahout
  1. Mahout
  2. MAHOUT-510

Standardize serialization mechanisms

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels:
      None

      Description

      At the moment this is tracking a broader concern: to standardize as much as possible how we approach serialization. The long-term goal is notionally to use the following "encodings" as the input/output of Mahout stuff, and by extension, probably internally too.

      • Text
      • Vector Writable
      • (maybe Avro)

      not

      • Serializable
      • GSON / JSON
      1. MAHOUT-510.patch
        60 kB
        Sean Owen
      2. MAHOUT-510.patch
        184 kB
        Sean Owen
      3. MAHOUT-510.patch
        186 kB
        Sean Owen
      4. MAHOUT-510.patch
        184 kB
        Sean Owen
      5. MAHOUT-510_part_2.patch
        17 kB
        Sean Owen

        Activity

        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #726 (See https://hudson.apache.org/hudson/job/Mahout-Quality/726/)
        MAHOUT-510 part 2 take out more JSON and remove defunct output selection option in a few jobs

        Show
        Hudson added a comment - Integrated in Mahout-Quality #726 (See https://hudson.apache.org/hudson/job/Mahout-Quality/726/ ) MAHOUT-510 part 2 take out more JSON and remove defunct output selection option in a few jobs
        Hide
        Sean Owen added a comment -

        As an encore, I did find more use of JSON and here's another change as a result. It came up while looking at the vector-related Driver classes, which have an option to dump to SequenceFile or JSON. I removed the JSON option, though it could be left in as a "dump as some text" option though I don't know if that's useful or just more confusing. I renamed the "JWriter*" classes as they no longer involve "J"SON.

        Show
        Sean Owen added a comment - As an encore, I did find more use of JSON and here's another change as a result. It came up while looking at the vector-related Driver classes, which have an option to dump to SequenceFile or JSON. I removed the JSON option, though it could be left in as a "dump as some text" option though I don't know if that's useful or just more confusing. I renamed the "JWriter*" classes as they no longer involve "J"SON.
        Hide
        Sean Owen added a comment -

        It does quite little now, if not nothing. It essentially tests the getter/setter. I suppose I wanted to err on the side of less change than more, and it had non-zero content.

        Show
        Sean Owen added a comment - It does quite little now, if not nothing. It essentially tests the getter/setter. I suppose I wanted to err on the side of less change than more, and it had non-zero content.
        Hide
        Lance Norskog added a comment -

        MatrixTest still has a testLabelBindingSerialization which does nothing.

        Show
        Lance Norskog added a comment - MatrixTest still has a testLabelBindingSerialization which does nothing.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #722 (See https://hudson.apache.org/hudson/job/Mahout-Quality/722/)
        MAHOUT-510 remove use of JSON

        Show
        Hudson added a comment - Integrated in Mahout-Quality #722 (See https://hudson.apache.org/hudson/job/Mahout-Quality/722/ ) MAHOUT-510 remove use of JSON
        Hide
        Sean Owen added a comment -

        Here's one more iteration, with ModelSerializer's binary methods restored along with tests.

        Show
        Sean Owen added a comment - Here's one more iteration, with ModelSerializer's binary methods restored along with tests.
        Hide
        Ted Dunning added a comment -

        Yes. Preserving those is all that is needed.

        That will let TrainNewGroups remain unmodified (and still work)

        Show
        Ted Dunning added a comment - Yes. Preserving those is all that is needed. That will let TrainNewGroups remain unmodified (and still work)
        Hide
        Sean Owen added a comment -

        Ted are you saying it's simply sufficient to retain the Writable-related methods like writeBinary() and readBinary()? That's quite simple, I can do that.

        Show
        Sean Owen added a comment - Ted are you saying it's simply sufficient to retain the Writable-related methods like writeBinary() and readBinary()? That's quite simple, I can do that.
        Hide
        Ted Dunning added a comment -

        That stitching has to be kind of fancy because there is a lot of polymorphism that goes on.

        That was a big point of the ModelSerializer ... as I remember the readBinary allows a super
        class to be passed in and the actual class is burned into the serialized representations.
        This gives type safety with a simple API. Since we don't know the actual type to be returned,
        we can't use the normal Writable structure. This polymorphism is internal as well as external
        although most of the internal stuff should be hidden by the PolymorphicWritable static methods.

        Anyway, the upshot as I see it is that the static convenience methods in ModelSerializer really
        did have value in that they put an official imprimatur on the required idioms.

        Show
        Ted Dunning added a comment - That stitching has to be kind of fancy because there is a lot of polymorphism that goes on. That was a big point of the ModelSerializer ... as I remember the readBinary allows a super class to be passed in and the actual class is burned into the serialized representations. This gives type safety with a simple API. Since we don't know the actual type to be returned, we can't use the normal Writable structure. This polymorphism is internal as well as external although most of the internal stuff should be hidden by the PolymorphicWritable static methods. Anyway, the upshot as I see it is that the static convenience methods in ModelSerializer really did have value in that they put an official imprimatur on the required idioms.
        Hide
        Sean Owen added a comment -

        No problem I can stitch that back in with a Writable-based serialization.

        Show
        Sean Owen added a comment - No problem I can stitch that back in with a Writable-based serialization.
        Hide
        Ted Dunning added a comment -

        The deletion of ModelSerializer in its entirety is a real problem because we lose the writeBinary method.

        Losing the JSON stuff is fine, but writeBinary is the only way to serialize SGD models properly.

        Show
        Ted Dunning added a comment - The deletion of ModelSerializer in its entirety is a real problem because we lose the writeBinary method. Losing the JSON stuff is fine, but writeBinary is the only way to serialize SGD models properly.
        Hide
        Sean Owen added a comment -

        Yes it's just a few wiki pages, it seems – I just search on cwiki.apache.org. I can update those. The book "Mahout in Action" is another concern, though I don't know if it appears in the book and arguably well the code is going to keep changing anyway and it's gone to print at this point, so...

        I would like to commit but do encourage others to take a peek. It's a substantial chop, and non-trivial surgery to the two bits I noted above.

        Show
        Sean Owen added a comment - Yes it's just a few wiki pages, it seems – I just search on cwiki.apache.org. I can update those. The book "Mahout in Action" is another concern, though I don't know if it appears in the book and arguably well the code is going to keep changing anyway and it's gone to print at this point, so... I would like to commit but do encourage others to take a peek. It's a substantial chop, and non-trivial surgery to the two bits I noted above.
        Hide
        Isabel Drost-Fromm added a comment -

        > The only remaining concern was that this will mean some examples on the wiki (and in the book) are a little
        > out of date then. How big an issue is that vs. wanting to commit this at some near point?

        I'd rather commit this now and fix the wiki ASAP. Do we have a list of pages that this patch renders out of date? I think the issue should not only make the code simpler - but reduce complexity in the core documentation as well?!

        > Nice thing is that this results in a net decrease of over 1,600 lines of code !

        Yeah!

        Show
        Isabel Drost-Fromm added a comment - > The only remaining concern was that this will mean some examples on the wiki (and in the book) are a little > out of date then. How big an issue is that vs. wanting to commit this at some near point? I'd rather commit this now and fix the wiki ASAP. Do we have a list of pages that this patch renders out of date? I think the issue should not only make the code simpler - but reduce complexity in the core documentation as well?! > Nice thing is that this results in a net decrease of over 1,600 lines of code ! Yeah!
        Hide
        Sean Owen added a comment -

        This is my proposed final patch. Highlights:

        It removes all use of JSON. This was a stated goal, to reduce the number of serialization approaches, and this does by 1.

        Dirichlet had used JSON to serialize ModelDistribution, when really what it was doing was serializing a simple description of the instance as a string. So, there's now DistributionDescription used throughout to encapsulate these params and transport them around as a string.

        LogisticModelParameters serialized itself with JSON too. This was the only bit that looked a bit hard to know what to do with. Its main component, OnlineLogisticRegression, was already Writable. So I made LMP Writable too and used this serialization to write/load it to a stream. At least, it's using an existing serialization mechanism.

        Tests pass of course.

        The only remaining concern was that this will mean some examples on the wiki (and in the book) are a little out of date then. How big an issue is that vs. wanting to commit this at some near point?

        Thoughts on the original concept of this JIRA to begin with?

        Nice thing is that this results in a net decrease of over 1,600 lines of code !

        Show
        Sean Owen added a comment - This is my proposed final patch. Highlights: It removes all use of JSON. This was a stated goal, to reduce the number of serialization approaches, and this does by 1. Dirichlet had used JSON to serialize ModelDistribution, when really what it was doing was serializing a simple description of the instance as a string. So, there's now DistributionDescription used throughout to encapsulate these params and transport them around as a string. LogisticModelParameters serialized itself with JSON too. This was the only bit that looked a bit hard to know what to do with. Its main component, OnlineLogisticRegression, was already Writable. So I made LMP Writable too and used this serialization to write/load it to a stream. At least, it's using an existing serialization mechanism. Tests pass of course. The only remaining concern was that this will mean some examples on the wiki (and in the book) are a little out of date then. How big an issue is that vs. wanting to commit this at some near point? Thoughts on the original concept of this JIRA to begin with? Nice thing is that this results in a net decrease of over 1,600 lines of code !
        Hide
        Sean Owen added a comment -

        Here's my latest work on the subject. It's a big patch which cleans out all use of JSON. Nice and all that.

        The only piece I can't get to work is this Dirichlet business. I tried to carefully revise it to pass those same configuration values through simple Configuration object params, and recreate the model on the other end, but it's not working.

        It's a bit of a black box... I would be very grateful if Jeff would have any time to apply, maybe spot the issue, and/or fix it. That would complete the patch. Otherwise I'm going to have a hard time understanding how to look into this.

        Show
        Sean Owen added a comment - Here's my latest work on the subject. It's a big patch which cleans out all use of JSON. Nice and all that. The only piece I can't get to work is this Dirichlet business. I tried to carefully revise it to pass those same configuration values through simple Configuration object params, and recreate the model on the other end, but it's not working. It's a bit of a black box... I would be very grateful if Jeff would have any time to apply, maybe spot the issue, and/or fix it. That would complete the patch. Otherwise I'm going to have a hard time understanding how to look into this.
        Hide
        Robin Anil added a comment -

        The changes are fine. What are we moving towards for model? Sequence Files seem like an obvious choice? Any thoughts?

        Show
        Robin Anil added a comment - The changes are fine. What are we moving towards for model? Sequence Files seem like an obvious choice? Any thoughts?
        Hide
        Ted Dunning added a comment -

        (replied instead of commenting ... sorry for the duplicate email)

        Putting data objects in the Configuration is a bit of a misuse (it has been the subject of an argument on the hadoop mailing lists for a long time now).

        I would leave this use in place for now and later refactor to read from HDFS. That has more legs in any case when it comes to using the clustering on new data without retraining.

        Show
        Ted Dunning added a comment - (replied instead of commenting ... sorry for the duplicate email) Putting data objects in the Configuration is a bit of a misuse (it has been the subject of an argument on the hadoop mailing lists for a long time now). I would leave this use in place for now and later refactor to read from HDFS. That has more legs in any case when it comes to using the clustering on new data without retraining.
        Hide
        Sean Owen added a comment -

        (BTW I'm not committing this for some time.)

        I've managed to take out almost all the usages. The only real usage of it is in the dirichlet implementation, which uses it to serialize a ModelDistribution and pass it as a string to Hadoop workers via the Configuration object.

        Now, per the issue description, we could re-do serialization here to use Writable. That's not hard and makes it possible to write these things out to HDFS later in a more Hadoop-ish way later. But that gives you a serialization to bytes, not String. I could Base64-encode it; it's not huge.

        That's starting to get a little weird. Is the better answer to look at writing the ModelDistribution to HDFS? or just leave this use of JSON?

        Show
        Sean Owen added a comment - (BTW I'm not committing this for some time.) I've managed to take out almost all the usages. The only real usage of it is in the dirichlet implementation, which uses it to serialize a ModelDistribution and pass it as a string to Hadoop workers via the Configuration object. Now, per the issue description, we could re-do serialization here to use Writable. That's not hard and makes it possible to write these things out to HDFS later in a more Hadoop-ish way later. But that gives you a serialization to bytes, not String. I could Base64-encode it; it's not huge. That's starting to get a little weird. Is the better answer to look at writing the ModelDistribution to HDFS? or just leave this use of JSON?
        Hide
        Ted Dunning added a comment -

        I think that this patch looks pretty reasonable. It affects a lot of code that Jeff has committed so that he should comment. The Naive Bayes code is hit less so, but it would be good to hear from Robin on that side.

        A key thing to remember is that asFormatString won't work on large objects anyway. It has occasionally been useful to assist with equality checks, but that is a small justification. As far as the Naive Bayes stuff is concerned, the serialized form hasn't been sufficient to fully re-instantiate a usable model, so it is hard to see what the point is.

        Show
        Ted Dunning added a comment - I think that this patch looks pretty reasonable. It affects a lot of code that Jeff has committed so that he should comment. The Naive Bayes code is hit less so, but it would be good to hear from Robin on that side. A key thing to remember is that asFormatString won't work on large objects anyway. It has occasionally been useful to assist with equality checks, but that is a small justification. As far as the Naive Bayes stuff is concerned, the serialized form hasn't been sufficient to fully re-instantiate a usable model, so it is hard to see what the point is.
        Hide
        Sean Owen added a comment -

        Here is one initial pass at removing JSON that seems completely unused. This is the easy bit.

        Show
        Sean Owen added a comment - Here is one initial pass at removing JSON that seems completely unused. This is the easy bit.
        Hide
        Sean Owen added a comment -

        So, I'm starting to take a crack at this. I've identified about 1/3 of all the JSON code so far that is not apparently used (except for code testing it), removed it, and everything seems OK. Before I keep going – are people still keen on this goal? we had some small assent in an earlier thread.

        That is to say, if JSON disappears, can anyone think of a real problem it causes?

        Show
        Sean Owen added a comment - So, I'm starting to take a crack at this. I've identified about 1/3 of all the JSON code so far that is not apparently used (except for code testing it), removed it, and everything seems OK. Before I keep going – are people still keen on this goal? we had some small assent in an earlier thread. That is to say, if JSON disappears, can anyone think of a real problem it causes?

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development