Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None
    • Target Version/s:

      Description

      This task is the first port of spark.mllib.fpm functionality to spark.ml (Scala).

      This will require a brief design doc to confirm a reasonable DataFrame-based API, with details for this class. The doc could also look ahead to the other fpm classes, especially if their API decisions will affect FPGrowth.

        Issue Links

          Activity

          Hide
          yuhaoyan yuhao yang added a comment - - edited

          Thanks for reporting that. I just found there's a misplaced distinct along the iterations of updates. Do you want to send a PR with some unit tests? Maciej Szymkiewicz

          I can revert the related change in https://github.com/apache/spark/pull/17130 since I didn't add new unit test for the case.

          Show
          yuhaoyan yuhao yang added a comment - - edited Thanks for reporting that. I just found there's a misplaced distinct along the iterations of updates. Do you want to send a PR with some unit tests? Maciej Szymkiewicz I can revert the related change in https://github.com/apache/spark/pull/17130 since I didn't add new unit test for the case.
          Hide
          zero323 Maciej Szymkiewicz added a comment -

          I think we should keep only unique predictions in transform otherwise it is possible to get results like this:

          scala> val data = spark.read.text("data/mllib/sample_fpgrowth.txt").select(split($"value", "\\s+").alias("features")) 
          data: org.apache.spark.sql.DataFrame = [features: array<string>]
          
          scala> val data = spark.read.text("data/mllib/sample_fpgrowth.txt").select(split($"value", "\\s+").alias("features")) 
          data: org.apache.spark.sql.DataFrame = [features: array<string>]
          
          scala> fpm.transform(Seq(Array("t", "s")).toDF("features")).show(1, false)
          +--------+---------------------+
          |features|prediction           |
          +--------+---------------------+
          |[t, s]  |[y, x, z, x, y, x, z]|
          +--------+---------------------+
          
          
          Show
          zero323 Maciej Szymkiewicz added a comment - I think we should keep only unique predictions in transform otherwise it is possible to get results like this: scala> val data = spark.read.text( "data/mllib/sample_fpgrowth.txt" ).select(split($ "value" , "\\s+" ).alias( "features" )) data: org.apache.spark.sql.DataFrame = [features: array<string>] scala> val data = spark.read.text( "data/mllib/sample_fpgrowth.txt" ).select(split($ "value" , "\\s+" ).alias( "features" )) data: org.apache.spark.sql.DataFrame = [features: array<string>] scala> fpm.transform(Seq(Array( "t" , "s" )).toDF( "features" )).show(1, false ) +--------+---------------------+ |features|prediction | +--------+---------------------+ |[t, s] |[y, x, z, x, y, x, z]| +--------+---------------------+
          Hide
          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 15415
          https://github.com/apache/spark/pull/15415

          Show
          josephkb Joseph K. Bradley added a comment - Issue resolved by pull request 15415 https://github.com/apache/spark/pull/15415
          Hide
          josephkb Joseph K. Bradley added a comment -

          Sorry for the slow reply. I actually haven't read enough about PrefixSpan to say, and the original paper doesn't seem to cover rule generation.

          We're going with PrefixSpanModel currently. The benefit from combining the models seems pretty small given the unknowns here, and if they can share implementations, we can do so internally in the future.

          Show
          josephkb Joseph K. Bradley added a comment - Sorry for the slow reply. I actually haven't read enough about PrefixSpan to say, and the original paper doesn't seem to cover rule generation. We're going with PrefixSpanModel currently. The benefit from combining the models seems pretty small given the unknowns here, and if they can share implementations, we can do so internally in the future.
          Hide
          mlnick Nick Pentreath added a comment -

          Seems PrefixSpan even takes different input: Array[Array[T]] vs FPGrowth: Array[T]. So it may be tricky to unify.

          However we do have the case where e.g. QuantileDiscretizer returns a Bucketizer as Model from fit. In that case Bucketizer can be instantiated directly and independently, but it could in theory be the case that some other estimator returns a Bucketizer as its model.

          So we could perhaps think about both FPGrowth and PrefixSpan returning an AssociationRuleModel from fit. It could work if the input can be generalized to Seq[T] where for FPGrowth it would be Seq[Item] and for PrefixSpan it would be Seq[Seq[Item]]. The output of transform for the model would be the predicted items as above. It would expose getFreqItems and getAssociationRules both returning a DataFrame.

          Is there something in the nature of PrefixSpan vs FPGrowth that makes this too difficult? (I'll have to go read the papers when I get some time!)

          But having said that it could be pretty complex to try to support this. If so, unless there's a compelling argument I'd go for Joseph K. Bradley's suggestion above, and hide the association rule class for now (can expose later as needed). Then PrefixSpan will be totally independent and return its own PrefixSpanModel (that may also expose a transform method that has similar semantics but different internals).

          Show
          mlnick Nick Pentreath added a comment - Seems PrefixSpan even takes different input: Array[Array [T] ] vs FPGrowth: Array [T] . So it may be tricky to unify. However we do have the case where e.g. QuantileDiscretizer returns a Bucketizer as Model from fit . In that case Bucketizer can be instantiated directly and independently, but it could in theory be the case that some other estimator returns a Bucketizer as its model. So we could perhaps think about both FPGrowth and PrefixSpan returning an AssociationRuleModel from fit . It could work if the input can be generalized to Seq [T] where for FPGrowth it would be Seq [Item] and for PrefixSpan it would be Seq[Seq [Item] ] . The output of transform for the model would be the predicted items as above. It would expose getFreqItems and getAssociationRules both returning a DataFrame . Is there something in the nature of PrefixSpan vs FPGrowth that makes this too difficult? (I'll have to go read the papers when I get some time!) But having said that it could be pretty complex to try to support this. If so, unless there's a compelling argument I'd go for Joseph K. Bradley 's suggestion above, and hide the association rule class for now (can expose later as needed). Then PrefixSpan will be totally independent and return its own PrefixSpanModel (that may also expose a transform method that has similar semantics but different internals).
          Hide
          josephkb Joseph K. Bradley added a comment -

          Sounds good, thank you!

          Show
          josephkb Joseph K. Bradley added a comment - Sounds good, thank you!
          Hide
          yuhaoyan yuhao yang added a comment -

          Thanks for the suggestions Joseph K. Bradley

          About ItemType, previously the blocking issue is about saving and loading the frequent itemsets and association rules DataFrame. I'll test it further and let you know what's the blocking issues.

          On API level, I agree we can make AssociationRules private or Developer API for now, until obvious requirements come up.

          Show
          yuhaoyan yuhao yang added a comment - Thanks for the suggestions Joseph K. Bradley About ItemType, previously the blocking issue is about saving and loading the frequent itemsets and association rules DataFrame. I'll test it further and let you know what's the blocking issues. On API level, I agree we can make AssociationRules private or Developer API for now, until obvious requirements come up.
          Hide
          josephkb Joseph K. Bradley added a comment -

          There are a couple of design issues which have been mentioned in either the design doc or the PR, but which should probably be discussed in more detail in JIRA:

          • Item type: It looks like this currently assumes every item is represented as a String. I'd like us to support any Catalyst type. If that's hard to do (until we port the implementation over to DataFrames), then just supporting String is OK as long as it's clearly documented.
          • FPGrowth vs AssociationRules: The APIs are a bit fuzzy right now. I’ve listed them out below. The problem is that AssociationRules is tightly tied to FPGrowth. While I like the idea of being able to use AssociationRules to analyze the output of multiple FPM algorithms, I don’t think it’s applicable to PrefixSpan since it does not take the ordering of the itemsets into account. I’d propose we provide a single API under the name "FPGrowth."
            • Q: Have you heard of anyone needing the AssociationRules API without going through FPGrowth first? If so, then we could expose the AssociationRules algorithm as a @DeveloperApi static method.

          What does everyone think?

          Current APIs

          • FPGrowth
            • Input to fit() and transform(): Seq(items)
            • Output
              • transform(): —> same as AssociationRules
              • getFreqItems:
                DataFrame["items", "freq"]
          • AssociationRules
            • Input
              • fit(): (output of FPGrowth)
              • transform(): Seq(items) —> Not good that fit/transform take different inputs
            • Output
              • transform(): predicted items for each Seq(items)
              • associationRules:
                DataFrame["antecedent", "consequent", "confidence"]

          Proposal: Combine under FPGrowth

          • FPGrowth
            • Input to fit() and transform(): Seq(items)
            • Output
              • transform(): predicted items for each Seq(items)
              • getFreqItems:
                DataFrame["items", "freq"]
              • associationRules:
                DataFrame["antecedent", "consequent", "confidence"]
          Show
          josephkb Joseph K. Bradley added a comment - There are a couple of design issues which have been mentioned in either the design doc or the PR, but which should probably be discussed in more detail in JIRA: Item type: It looks like this currently assumes every item is represented as a String. I'd like us to support any Catalyst type. If that's hard to do (until we port the implementation over to DataFrames), then just supporting String is OK as long as it's clearly documented. FPGrowth vs AssociationRules: The APIs are a bit fuzzy right now. I’ve listed them out below. The problem is that AssociationRules is tightly tied to FPGrowth. While I like the idea of being able to use AssociationRules to analyze the output of multiple FPM algorithms, I don’t think it’s applicable to PrefixSpan since it does not take the ordering of the itemsets into account. I’d propose we provide a single API under the name "FPGrowth." Q: Have you heard of anyone needing the AssociationRules API without going through FPGrowth first? If so, then we could expose the AssociationRules algorithm as a @DeveloperApi static method. What does everyone think? Current APIs FPGrowth Input to fit() and transform(): Seq(items) Output transform(): —> same as AssociationRules getFreqItems: DataFrame[ "items" , "freq" ] AssociationRules Input fit(): (output of FPGrowth) transform(): Seq(items) —> Not good that fit/transform take different inputs Output transform(): predicted items for each Seq(items) associationRules: DataFrame[ "antecedent" , "consequent" , "confidence" ] Proposal: Combine under FPGrowth FPGrowth Input to fit() and transform(): Seq(items) Output transform(): predicted items for each Seq(items) getFreqItems: DataFrame[ "items" , "freq" ] associationRules: DataFrame[ "antecedent" , "consequent" , "confidence" ]
          Hide
          apachespark Apache Spark added a comment -

          User 'hhbyyh' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15415

          Show
          apachespark Apache Spark added a comment - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/15415
          Hide
          yuhaoyan yuhao yang added a comment -

          Yes, let me just send what I got.

          Show
          yuhaoyan yuhao yang added a comment - Yes, let me just send what I got.
          Hide
          holdenk holdenk added a comment -

          Jeff Zhang & yuhao yang are you still working on this?

          Show
          holdenk holdenk added a comment - Jeff Zhang & yuhao yang are you still working on this?
          Hide
          holdenk holdenk added a comment -

          +1 for porting the current functionality then updating from there.

          Show
          holdenk holdenk added a comment - +1 for porting the current functionality then updating from there.
          Hide
          mlnick Nick Pentreath added a comment -

          Shall we focus first on the porting of what is currently in mllib FPM to ml? Thereafter we can potentially redesign or add features?

          Show
          mlnick Nick Pentreath added a comment - Shall we focus first on the porting of what is currently in mllib FPM to ml ? Thereafter we can potentially redesign or add features?
          Hide
          yuhaoyan yuhao yang added a comment -

          Link two related issue about adding "support" to Association Rule in MLlib Association Rule.
          https://issues.apache.org/jira/browse/SPARK-15930
          https://issues.apache.org/jira/browse/SPARK-15938

          I've the current work in https://github.com/hhbyyh/spark/tree/mlfpm/mllib/src/main/scala/org/apache/spark/ml/fpm. Maybe we'd better first decide if MLlib Association Rule should have "support" or not.

          Show
          yuhaoyan yuhao yang added a comment - Link two related issue about adding "support" to Association Rule in MLlib Association Rule. https://issues.apache.org/jira/browse/SPARK-15930 https://issues.apache.org/jira/browse/SPARK-15938 I've the current work in https://github.com/hhbyyh/spark/tree/mlfpm/mllib/src/main/scala/org/apache/spark/ml/fpm . Maybe we'd better first decide if MLlib Association Rule should have "support" or not.
          Hide
          yuhaoyan yuhao yang added a comment -

          Hi Jeff, welcome to contribute.
          I'm discussing with some industry users to see what's the optimal interface for FPM, especially what should the output column contains. Appreciate if you can share some thoughts.

          Show
          yuhaoyan yuhao yang added a comment - Hi Jeff, welcome to contribute. I'm discussing with some industry users to see what's the optimal interface for FPM, especially what should the output column contains. Appreciate if you can share some thoughts.
          Hide
          zjffdu Jeff Zhang added a comment -

          Gayathri Murali yuhao yang Do you still work on this ? If not, I can help to continue

          Show
          zjffdu Jeff Zhang added a comment - Gayathri Murali yuhao yang Do you still work on this ? If not, I can help to continue
          Show
          yuhaoyan yuhao yang added a comment - design doc draft https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub
          Hide
          GayathriMurali Gayathri Murali added a comment - - edited

          Joseph K. Bradley : yuhao yang and I can work on this. Will submit a design doc shortly

          Show
          GayathriMurali Gayathri Murali added a comment - - edited Joseph K. Bradley : yuhao yang and I can work on this. Will submit a design doc shortly

            People

            • Assignee:
              yuhaoyan yuhao yang
              Reporter:
              josephkb Joseph K. Bradley
              Shepherd:
              Joseph K. Bradley
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development