Isabel Drost - 18/Mar/08 11:54 PM
PMML describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves.
I wonder whether "the inputs" here means meta information about the input data or the dataset itself.
I think both. PMML seems to be an XML schema for feature attributes, data transformations, classifier parameter values, etc. It also defines a sparse/dense matrix for instance data, all in the same XML file.
According to the FAQ, it is implemented by JSR-73 (see Mahout-8):
> > PMML is complementary to many other data mining standards. It's XML interchange formats is supported by several other standards, such as
> > XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining.
Karl, you had a look at the FAQ, can you confirm this?
JSR 73 says: http://jcp.org/en/jsr/detail?id=73
JDMAPI will be based on a highly-generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such OMG's CWM, SQL/MM for Data Mining, and DMG's PMML. The JDMAPI model will support four conceptual areas that are generally of key interest to users of data mining systems: settings, models, transformations, and results.
I have very little idea what these meta models really are. I also suppose they expect whoever implements JSR 73 to also implement the code that reads and writes all these formats, but I'm just guessing here.
To me all this is a bit of overkill, at least right now. But something is needed. I have seen others speak of similar things, and I sort of need it right now myself. When calculating the Jaccard index on the vector spaces of two text documents, I store the values in a Mahout Vector along with a Map<Feature, Index>. (I could just store them in a Map<Feature, Double>, but I thought it would be nice if others wanted to use the distance class.)
If one implements this map as a new class and documents in it what it represents in JSR 73, PMML, CWM and whatnot, then at least people who want to dig in will know where to start.
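To make the idea concrete, here is a rough sketch of the feature-to-index map as its own class, together with a Jaccard index computed on the resulting vectors. It is only an illustration: JaccardSketch, FeatureIndex, register, encode and jaccard are hypothetical names, a plain double[] stands in for the Mahout Vector, String stands in for Feature, and returning 0 for two empty documents is just one possible convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Mahout. */
final class JaccardSketch {

    /** Maps each feature (here: a token) to a fixed position in the vector. */
    static final class FeatureIndex {
        private final Map<String, Integer> indexByFeature = new LinkedHashMap<>();

        /** Assigns the next free position to every token not seen before. */
        void register(String[] tokens) {
            for (String token : tokens) {
                if (!indexByFeature.containsKey(token)) {
                    indexByFeature.put(token, indexByFeature.size());
                }
            }
        }

        /** Returns the position of a registered feature. */
        int indexOf(String feature) {
            Integer idx = indexByFeature.get(feature);
            if (idx == null) {
                throw new IllegalArgumentException("Unknown feature: " + feature);
            }
            return idx;
        }

        int size() {
            return indexByFeature.size();
        }
    }

    /** Encodes a document as a dense presence vector (1.0 = feature occurs). */
    static double[] encode(String[] tokens, FeatureIndex index) {
        double[] vector = new double[index.size()];
        for (String token : tokens) {
            vector[index.indexOf(token)] = 1.0;
        }
        return vector;
    }

    /** Jaccard index: shared non-zero positions divided by all non-zero positions. */
    static double jaccard(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        int intersection = 0;
        int union = 0;
        for (int i = 0; i < a.length; i++) {
            boolean inA = a[i] != 0.0;
            boolean inB = b[i] != 0.0;
            if (inA && inB) {
                intersection++;
            }
            if (inA || inB) {
                union++;
            }
        }
        // Assumed convention: two empty documents have similarity 0.
        return union == 0 ? 0.0 : (double) intersection / union;
    }
}
```

Both documents have to be registered in the same FeatureIndex before either is encoded, so that the vectors share one set of positions. Keeping the map in a separate class is what lets the distance code work on bare vectors, and it gives one obvious place to note how the mapping relates to JSR 73, PMML or CWM.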