Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.3
    • Component/s: Math
    • Labels: None

      Description

      Currently, the matrix package has asFormatString() methods to serialize matrices, but there are no corresponding methods to decode the serialized matrices. At the moment I do not think any of the code base uses the matrix asFormatString() methods; however, for the Gaussian Mixture Model (GMM) code I am working on, I will need to serialize and deserialize covariance matrices.

      The following matrix classes will require decoding methods:
      1. DenseMatrix
      2. SparseMatrix
      3. SparseColumnMatrix
      4. SparseRowMatrix
      5. MatrixView

        Activity

        Daniel Nee added a comment -

        Patch to implement the decoding methods for the five classes listed above, plus updates to their respective unit tests.

        To make the decoding easier, I made some changes to what asFormatString() produces for the matrix classes. The string representing the actual data in the matrix is now prefixed by a two-character code indicating the matrix type, followed by the string '(N,M)', where N is the number of rows and M is the number of columns. For instance, in TestSparseMatrix the 3x2 sparse matrix produces the following output from asFormatString(): sm(3,2)[ [s2, 0:1.1, 1:2.2, ] [s2, 0:3.3, 1:4.4, ] [s2, 0:5.5, 1:6.6, ] ]
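As a rough illustration of the prefix described above, the sketch below splits a formatted string into its type code, dimensions, and bracketed body. The class name MatrixHeader and the regular expression are assumptions made for this example and are not taken from the patch. Only the "sm" code appears in the comment, so the pattern simply assumes that every code is two lowercase letters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the prefix of a formatted matrix string such as "sm(3,2)[ ... ]". */
final class MatrixHeader {
  // Assumed layout: two lowercase letters, "(rows,cols)", then the bracketed data.
  private static final Pattern HEADER =
      Pattern.compile("([a-z]{2})\\((\\d+),(\\d+)\\)(\\[.*\\])", Pattern.DOTALL);

  final String typeCode;
  final int rows;
  final int columns;
  final String body;

  private MatrixHeader(String typeCode, int rows, int columns, String body) {
    this.typeCode = typeCode;
    this.rows = rows;
    this.columns = columns;
    this.body = body;
  }

  /** Splits the formatted string into type code, dimensions, and data section. */
  static MatrixHeader parse(String formatted) {
    Matcher m = HEADER.matcher(formatted.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a formatted matrix: " + formatted);
    }
    return new MatrixHeader(
        m.group(1), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3)), m.group(4));
  }
}
```

Reading the dimensions before touching the data lets a decoder allocate the target matrix up front and reject malformed input early.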

        As much as possible I have tried to keep the way in which matrices are decoded similar to the way in which vectors are currently decoded. Thus to decode a matrix you would call the static method decodeMatrix in AbstractMatrix, which would in turn call the respective decodeFormat method in the appropriate matrix class. In order to decode the rows of the matrix I have used the decodeVector method.
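The dispatch described above might look roughly like the following sketch. It is not the patch's code: the class MatrixDecoding, the decoder registry, and the use of double[][] in place of Mahout's matrix classes are all assumptions made to keep the example self-contained. Only the "sm" layout is registered, because it is the only one shown in the comment. The number after "s" in each row is assumed to be the row's cardinality and is not checked.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in for the decodeMatrix/decodeFormat dispatch; all names are hypothetical. */
final class MatrixDecoding {
  // Maps a two-character type code to the decoder for that matrix layout.
  private static final Map<String, Function<String, double[][]>> DECODERS = new HashMap<>();
  private static final Pattern HEADER =
      Pattern.compile("([a-z]{2})\\((\\d+),(\\d+)\\)\\[(.*)\\]", Pattern.DOTALL);
  // One row such as "[s2, 0:1.1, 1:2.2, ]"; the captured group holds the index:value pairs.
  private static final Pattern ROW = Pattern.compile("\\[s\\d+,([^\\]]*)\\]");

  static {
    DECODERS.put("sm", MatrixDecoding::decodeSparseRows);
  }

  /** Entry point: picks a decoder from the leading two-character type code. */
  static double[][] decodeMatrix(String formatted) {
    String s = formatted.trim();
    String code = s.length() >= 2 ? s.substring(0, 2) : s;
    Function<String, double[][]> decoder = DECODERS.get(code);
    if (decoder == null) {
      throw new IllegalArgumentException("Unknown matrix type code: " + code);
    }
    return decoder.apply(s);
  }

  /** Plays the role of a per-class decodeFormat for the "sm" layout. */
  static double[][] decodeSparseRows(String s) {
    Matcher header = HEADER.matcher(s);
    if (!header.matches()) {
      throw new IllegalArgumentException("Malformed matrix: " + s);
    }
    int rows = Integer.parseInt(header.group(2));
    int cols = Integer.parseInt(header.group(3));
    double[][] result = new double[rows][cols];
    Matcher row = ROW.matcher(header.group(4));
    int r = 0;
    while (row.find()) {
      if (r >= rows) {
        throw new IllegalArgumentException("More rows than the declared " + rows);
      }
      decodeRow(row.group(1), result[r++]);
    }
    if (r != rows) {
      throw new IllegalArgumentException("Expected " + rows + " rows, found " + r);
    }
    return result;
  }

  /** Stand-in for decodeVector: fills one row from "index:value" pairs. */
  private static void decodeRow(String pairs, double[] target) {
    for (String pair : pairs.split(",")) {
      String p = pair.trim();
      if (p.isEmpty()) {
        continue; // the format leaves a trailing comma before the closing bracket
      }
      String[] indexAndValue = p.split(":");
      target[Integer.parseInt(indexAndValue[0])] = Double.parseDouble(indexAndValue[1]);
    }
  }
}
```

Keeping the registry in one place means a new matrix type only needs a new entry, which is one way to reduce the duplicated decoding code.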

        A couple of things I am currently not happy with:

        SparseColumnMatrix does not output a sparse representation, because the asFormatString method of SparseVector produces a row-wise representation of the vector. At the moment it basically outputs the same as a DenseMatrix representation. Arguably we could output the transpose of the matrix as a series of sparse rows and just ensure that decoding yields the correct matrix.

        There is a fair amount of shared code between the decodeFormat methods implemented in each of the five classes. Potentially some of it could be moved into the AbstractMatrix method.

        Your thoughts and suggestions would be welcome.

        Jeff Eastman added a comment -

        We have been living with the ad-hoc asFormatString methods since the early days of Vectors and Matrices. In MAHOUT-30, I introduced Google's Gson implementation of JSON for storing and retrieving models and states, and this works pretty well IMHO. We have also discussed alternative serialization/deserialization methods in other threads and at other times. I'd rather we converge on a single mechanism with broad generality than keep inventing more ad-hoc serialization code.

        The current format strings really break down once we decide to implement e.g. MAHOUT-65 Element Labels. Could you take a look at Gson and see if it would fill the bill?

        Daniel Nee added a comment -

        Sure, I will have a look at Gson. I certainly think a JSON-based approach will offer a lot more flexibility than the current asFormatString methods.

        Isabel Drost-Fromm added a comment -

        I would suggest providing one standard serialization mechanism but making the implementation interchangeable, so people can provide their own way of (de-)serializing matrices and vectors.

        I don't like the idea of forcing people to use one particular method for storing and distributing pre-processed data sets, especially as there are already various more or less established ways (mainly file formats) to encode datasets.
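One way to read this suggestion is as a small codec interface with a default implementation that callers can replace. The sketch below is only an illustration under that assumption. MatrixCodec, DelimitedMatrixCodec, and MatrixCodecs are hypothetical names, double[][] stands in for the real matrix types, and the delimited text format is invented for this example and does not correspond to any existing Mahout format.

```java
/** Hypothetical extension point: callers depend on this interface, not on one format. */
interface MatrixCodec {
  String encode(double[][] matrix);

  double[][] decode(String text);
}

/** A deliberately simple default: rows separated by ';', values by ','. */
final class DelimitedMatrixCodec implements MatrixCodec {
  @Override
  public String encode(double[][] matrix) {
    StringBuilder out = new StringBuilder();
    for (int r = 0; r < matrix.length; r++) {
      if (r > 0) {
        out.append(';');
      }
      for (int c = 0; c < matrix[r].length; c++) {
        if (c > 0) {
          out.append(',');
        }
        out.append(matrix[r][c]); // Double.toString output round-trips through parseDouble
      }
    }
    return out.toString();
  }

  @Override
  public double[][] decode(String text) {
    if (text.isEmpty()) {
      return new double[0][]; // an empty string is treated as a matrix with no rows
    }
    String[] rows = text.split(";");
    double[][] matrix = new double[rows.length][];
    for (int r = 0; r < rows.length; r++) {
      // Assumes every row has at least one value.
      String[] cells = rows[r].split(",");
      matrix[r] = new double[cells.length];
      for (int c = 0; c < cells.length; c++) {
        matrix[r][c] = Double.parseDouble(cells[c]);
      }
    }
    return matrix;
  }
}

/** Example caller that works with whichever codec it is handed. */
final class MatrixCodecs {
  static double[][] roundTrip(MatrixCodec codec, double[][] matrix) {
    return codec.decode(codec.encode(matrix));
  }
}
```

A JSON-based codec or a reader for an established dataset file format could then implement the same interface without changing the calling code.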

        Sean Owen added a comment -

        Pinging this issue – still live?

        Ted Dunning added a comment -

        Hasn't this been subsumed by other work?

        Sean Owen added a comment -

        Sounds like this is either done, or stale, or a bit of both – closing for now.


          People

          • Assignee: Unassigned
          • Reporter: Daniel Nee
          • Votes: 0
          • Watchers: 0
