Mahout / MAHOUT-1541

Create CLI Driver for Spark Cooccurrence Analysis

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: CLI

      Description

      Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.
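
The flow described above (translate external IDs to internal indices, count cooccurrences, then reattach the external IDs) can be sketched in plain Python. This is only an illustration under stated assumptions: `BiDict` and `cooccurrence` are hypothetical names, not Mahout's IndexedDataset or CooccurrenceAnalysis, and raw pair counts stand in for whatever scoring the Spark job applies.

```python
from collections import defaultdict
from itertools import combinations


class BiDict:
    """Hypothetical two-way map between external IDs and contiguous indices."""

    def __init__(self):
        self.to_index = {}     # external ID -> internal index
        self.to_external = []  # internal index -> external ID

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self.to_index:
            self.to_index[external_id] = len(self.to_external)
            self.to_external.append(external_id)
        return self.to_index[external_id]

    def external_of(self, index):
        return self.to_external[index]


def cooccurrence(interactions):
    """Count item pairs that share a user; input is (user_id, item_id) pairs."""
    users, items = BiDict(), BiDict()
    rows = defaultdict(set)  # user index -> set of item indices
    for user_id, item_id in interactions:
        rows[users.index_of(user_id)].add(items.index_of(item_id))

    counts = defaultdict(int)  # (item index, item index) -> shared users
    for item_set in rows.values():
        for a, b in combinations(sorted(item_set), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1

    # Reattach the external item IDs before handing results back.
    return {
        (items.external_of(a), items.external_of(b)): n
        for (a, b), n in counts.items()
    }


result = cooccurrence([
    ("u1", "iphone"), ("u1", "ipad"),
    ("u2", "iphone"), ("u2", "ipad"), ("u2", "nexus"),
])
```

Keeping the dictionaries next to the data means the numeric work only ever sees integer indices, while callers only ever see the IDs they supplied.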

      Ultimately it should be able to read the same input as the legacy MapReduce job, and it will also support externally defined IDs and flexible formats. Output will be either in the legacy format or in text files laid out to the user's specification, with item IDs reattached.
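
One way to picture "flexible formats" is a reader and writer whose delimiters and column positions are parameters. The sketch below is an assumption made for illustration: the function names, parameter names, and defaults are hypothetical and are not the driver's actual options.

```python
def parse_interactions(lines, delimiter="\t", user_col=0, item_col=1):
    """Extract (user_id, item_id) pairs from delimited text lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        pairs.append((fields[user_col].strip(), fields[item_col].strip()))
    return pairs


def format_row(item_id, similar, row_delim="\t", element_delim=" ", pair_delim=":"):
    """Render one output row: an item ID followed by (item ID, strength) pairs."""
    elements = element_delim.join(
        f"{other}{pair_delim}{strength}" for other, strength in similar
    )
    return f"{item_id}{row_delim}{elements}"


pairs = parse_interactions(
    ["u1,purchase,iphone", "", "u2,purchase,ipad\n"],
    delimiter=",", user_col=0, item_col=2,
)
row = format_row("iphone", [("ipad", 2), ("nexus", 1)])
```

Because the external IDs pass through as plain strings, the same reader handles log-style input with extra columns as well as simple two-column files.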

      Whether to support the legacy formats is an open question; users can always fall back on the legacy code if they need them. Internally, the IndexedDataset wraps a Spark DRM, so jobs can be pipelined without writing intermediate files, and the legacy sequence file output may not be needed.

      Opinions?

            People

            • Assignee: Pat Ferrel (pferrel)
            • Reporter: Pat Ferrel (pferrel)
            • Votes: 0
            • Watchers: 7
