Mahout / MAHOUT-18

Embrace interoperability with other software

    Details

    • Type: New JIRA Project
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This is an open issue. It relates to every component, whether it exists today or is yet to be created.

      ML or DM models normally go through two phases: training and scoring (or predicting). If we count "updating" as a phase of its own, there are three.

      There is a lot of ML/DM software out there. We want Mahout users to be able to import models built with other software, update them, and/or use them for scoring. To achieve this, we need to recognize the commonly used formats.

      Besides, users may choose Mahout because it is fast at learning. Once a model is ready, they may want to export the trained model, view it with a visualization tool, or import it into other software or applications for scoring (or predicting). In that case, they will expect export to a widely recognized format.

      Finally, importing and exporting will not affect the ongoing work, so developers of other components need not worry about this.

        Activity

        Sean Owen added a comment -

        I agree with Ted's assessment. If/when this becomes a real issue – we have a particular format to support for interchange with a particular framework, and there's demand for it – then make a new, more specific issue.

        Ted Dunning added a comment -

        This will be important someday.

        At that time, we should open a new JIRA and implement it. Right now, we are working on getting relevant capabilities. Until we have them, interchange is fruitless.

        Sean Owen added a comment -

        Same, sounds like something to archive?

        Isabel Drost-Fromm added a comment -

        > To me all this is a bit of overkill, at least right now. But something is needed. I have seen others speak of similar things and sort of need it right now.

        I think it is overkill for algorithms mainly used for data exploration - e.g. clustering is used for exploring large amounts of data, grouping it in manageable pieces. Once we start working on algorithms that create models of the data that are later applied to new incoming data (stuff like classification, regression, ...) we will need some way to store the resulting model. If that model can later be imported into one of the standard tools - all the better.

        Maybe it is possible to start out with supporting just a subset of the standard that is really relevant for us?
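
To make the "subset of the standard" idea concrete, here is a minimal sketch of writing a linear regression model as a small PMML-style document. The class name `PmmlSubsetSketch`, the method `toPmml`, and the choice of PMML version 4.4 are assumptions made for illustration; nothing here is Mahout code, and the output has not been validated against the DMG schema. Field names are assumed to be plain identifiers that need no XML escaping.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout.
final class PmmlSubsetSketch {
    private PmmlSubsetSketch() {}

    // Declares one continuous numeric field in the data dictionary.
    private static String dataField(String name) {
        return "    <DataField name=\"" + name
                + "\" optype=\"continuous\" dataType=\"double\"/>\n";
    }

    // Serializes y = intercept + sum(coefficient_i * x_i) using only the
    // handful of PMML elements a regression model needs.
    static String toPmml(String target, double intercept, Map<String, Double> coefficients) {
        StringBuilder xml = new StringBuilder();
        xml.append("<PMML version=\"4.4\" xmlns=\"http://www.dmg.org/PMML-4_4\">\n");
        xml.append("  <Header/>\n");
        xml.append("  <DataDictionary>\n");
        xml.append(dataField(target));
        for (String name : coefficients.keySet()) {
            xml.append(dataField(name));
        }
        xml.append("  </DataDictionary>\n");
        xml.append("  <RegressionModel functionName=\"regression\">\n");
        xml.append("    <MiningSchema>\n");
        xml.append("      <MiningField name=\"").append(target)
           .append("\" usageType=\"target\"/>\n");
        for (String name : coefficients.keySet()) {
            xml.append("      <MiningField name=\"").append(name).append("\"/>\n");
        }
        xml.append("    </MiningSchema>\n");
        xml.append("    <RegressionTable intercept=\"").append(intercept).append("\">\n");
        for (Map.Entry<String, Double> entry : coefficients.entrySet()) {
            xml.append("      <NumericPredictor name=\"").append(entry.getKey())
               .append("\" coefficient=\"").append(entry.getValue()).append("\"/>\n");
        }
        xml.append("    </RegressionTable>\n");
        xml.append("  </RegressionModel>\n");
        xml.append("</PMML>\n");
        return xml.toString();
    }
}
```

Supporting only one model type like this keeps the exporter small. Whether such a reduced document is accepted by a given PMML consumer would have to be checked against that tool.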

        Karl Wettin added a comment -

        > Isabel Drost - 18/Mar/08 11:54 PM
        >
        > PMML describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves.
        >
        > I wonder whether "the inputs" here means meta information to input data or the dataset itself.

        I think both. PMML seems to be an XML schema for feature attributes, data transformation, classifier parameter values, etc. It also defines a sparse/dense matrix for instance data. All in the same XML file.

        > According to the FAQ it is implemented by JSR-73 (see MAHOUT-8):
        > > PMML is complementary to many other data mining standards. It's XML interchange formats is supported by several other standards, such as
        > > XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining.
        >
        > Karl, you had a look at the FAQ, can you confirm this?

        JSR 73 says: http://jcp.org/en/jsr/detail?id=73

        JDMAPI will be based on a highly-generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such OMG's CWM, SQL/MM for Data Mining, and DMG's PMML. The JDMAPI model will support four conceptual areas that are generally of key interest to users of data mining systems: settings, models, transformations, and results.

        I have very little idea what these meta models really are. I also suppose that whoever implements JSR 73 is expected to implement the reading and writing of all these formats as well, but I'm just guessing here.

        To me all this is a bit of overkill, at least right now. But something is needed. I have seen others speak of similar things and sort of need it right now. When calculating the Jaccard index on the vector spaces of two text documents, I store the values in a Mahout Vector along with a Map<Feature, Index>. (I could just store them in a Map<Feature, Double>, but I thought it would be nice if others wanted to use the distance class.)

        If one implements this map as a new class and fills it with text on what it represents in JSR 73, PMML, CWM and so on, then at least people who want to dig in will know where to start.
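
As an illustration of the Jaccard setup described above, the following is a minimal sketch, not Mahout code. The class `JaccardSketch` and its methods are hypothetical, a plain `double[]` stands in for a Mahout Vector, and `Map<String, Integer>` stands in for `Map<Feature, Index>`. A feature is treated as present in a document when its value is non-zero, which is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; none of these names come from Mahout.
final class JaccardSketch {
    private JaccardSketch() {}

    // Collects the features whose slot in the dense value array is non-zero.
    static Set<String> activeFeatures(Map<String, Integer> featureIndex, double[] values) {
        Set<String> active = new HashSet<>();
        for (Map.Entry<String, Integer> entry : featureIndex.entrySet()) {
            if (values[entry.getValue()] != 0.0) {
                active.add(entry.getKey());
            }
        }
        return active;
    }

    // Jaccard index: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty documents count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Because both documents share one feature-to-index map, their value arrays line up slot by slot, which is what would let a generic distance class work on the vectors alone.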

        Isabel Drost-Fromm added a comment -

        > How does this relate to MAHOUT-8? Seems like that is a similar thing, trying to define common I/O, or am I misinterpreting?

        I think this is misinterpreted. Maybe the explanation on the PMML website helps:

        PMML describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves.

        I wonder whether "the inputs" here means meta information to input data or the dataset itself.

        According to the FAQ it is implemented by JSR-73 (see MAHOUT-8):
        > PMML is complementary to many other data mining standards. It's XML interchange formats is supported by several other standards, such as
        > XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining.

        Karl, you had a look at the FAQ, can you confirm this?

        > What are the criteria that we should use to decide which formats to support?

        I think one criterion should be how expressive the format is; the second should be the number of tools supporting the format. Of course there is an obvious criterion as well: the format should at least be open.

        The group developing the format is part of the standards group xml.org, so there is some standardization process backing it up.

        I what is supported by the format and what cannot be expressed.

        Grant Ingersoll added a comment -

        How does this relate to MAHOUT-8? Seems like that is a similar thing, trying to define common I/O, or am I misinterpreting?

        Ted Dunning added a comment -

        What are the possible formats?

        Do any of the formats express parallel execution?

        What are the criteria that we should use to decide which formats to support?


          People

          • Assignee:
            Unassigned
            Reporter:
            Shunkai Fu
          • Votes:
            0
            Watchers:
            1
