Mahout / MAHOUT-1568

Build an I/O model that can replace sequence files for import/export

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: CLI
    • Labels:
    • Environment:

      Scala, Spark

      Description

      Implement mechanisms to read and write data from/to flexible stores. These will support tuple streams and DRMs, with extensions that allow keeping user-defined values for IDs. The mechanism can, in some sense, replace Sequence Files for import/export and will make the operation much easier for users, who in many cases can have their input files consumed directly.

      Start with text-delimited files for input/output in the Spark version of ItemSimilarity.
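
To make the idea of "keeping user-defined values for IDs" concrete, here is a minimal sketch in plain Scala (no Spark). It assumes each input line holds one (row ID, column ID) tuple, and it assigns dense integer indexes while keeping dictionaries back to the original strings. The object name, method names, and default delimiter are all hypothetical and are not taken from the Mahout code.

```scala
// Hypothetical sketch, not Mahout's API: parse delimited (row, column) tuples
// and assign dense integer indexes while remembering the original string IDs.
object TupleReaderSketch {

  /** Assigns indexes 0, 1, 2, ... to IDs in order of first appearance. */
  def buildDictionary(ids: Seq[String]): Map[String, Int] =
    ids.distinct.zipWithIndex.toMap

  /** Splits each non-blank line on the delimiter and keeps the two chosen fields. */
  def parseTuples(lines: Seq[String],
                  delimiter: String = "\t",
                  rowPos: Int = 0,
                  colPos: Int = 1): Seq[(String, String)] = {
    // Quote the delimiter so characters such as "|" are not treated as regex syntax.
    val splitOn = java.util.regex.Pattern.quote(delimiter)
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      val fields = line.split(splitOn)
      (fields(rowPos), fields(colPos))
    }
  }

  /** Returns the indexed elements plus the row and column ID dictionaries. */
  def index(tuples: Seq[(String, String)])
      : (Seq[(Int, Int)], Map[String, Int], Map[String, Int]) = {
    val rowIDs = buildDictionary(tuples.map(_._1))
    val colIDs = buildDictionary(tuples.map(_._2))
    val elements = tuples.map { case (r, c) => (rowIDs(r), colIDs(c)) }
    (elements, rowIDs, colIDs)
  }
}
```

The integer pairs are what a matrix computation would consume; the two dictionaries are what allow results to be written back out with the user's own IDs.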

      A proposal is running with ItemSimilarity on Spark and is documented on the GitHub wiki here: https://github.com/pferrel/harness/wiki

      Comments are appreciated.

        Activity

        pferrel Pat Ferrel added a comment -

        The code to read and write text-delimited files is in MAHOUT-1541. It has a set of abstract classes and traits for I/O and implements reading text-delimited tuples into a DRM and writing a DRM out in a text-delimited format.

        To further this Jira, a version of "rowsimilarity" will be created that reads a text-delimited DRM as input and writes one as output.
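
As a rough illustration of what "abstract classes and traits for I/O" could look like, the sketch below separates a format description from generic reader and writer traits and then mixes both into one text-delimited implementation. Every name here (`Schema`, `Reader`, `Writer`, `TupleReaderWriter`) and every field is an assumption made for the example; the actual definitions in MAHOUT-1541 may be shaped differently.

```scala
// Hypothetical names throughout; Mahout's actual traits may differ.
object ReaderWriterSketch {

  /** Format parameters, kept separate from the reading and writing logic. */
  case class Schema(delimiter: String = "\t", rowPos: Int = 0, colPos: Int = 1)

  /** Reads lines of text into some in-memory type T. */
  trait Reader[T] {
    def schema: Schema
    def read(lines: Iterator[String]): T
  }

  /** Writes an in-memory type T out as lines of text. */
  trait Writer[T] {
    def schema: Schema
    def write(data: T): Iterator[String]
  }

  /** One concrete format: delimited (row ID, column ID) tuples. */
  class TupleReaderWriter(val schema: Schema)
      extends Reader[Seq[(String, String)]]
      with Writer[Seq[(String, String)]] {

    // Quote the delimiter so it is matched literally by String.split.
    private val splitOn = java.util.regex.Pattern.quote(schema.delimiter)

    def read(lines: Iterator[String]): Seq[(String, String)] =
      lines.filter(_.trim.nonEmpty).map { line =>
        val fields = line.split(splitOn)
        (fields(schema.rowPos), fields(schema.colPos))
      }.toList

    // For simplicity the writer always emits row ID first, then column ID.
    def write(data: Seq[(String, String)]): Iterator[String] =
      data.iterator.map { case (r, c) => s"$r${schema.delimiter}$c" }
  }
}
```

Keeping the schema as data means a driver could swap delimiters or field positions from command-line options without touching the reader or writer code.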

        hudson Hudson added a comment -

        FAILURE: Integrated in Mahout-Quality #2682 (See https://builds.apache.org/job/Mahout-Quality/2682/)
        MAHOUT-1561, MAHOUT-1568, MAHOUT-1569 text-delimited Spark readers and writers with drivers and a CLI for 'spark-itemsimilarity' closes apache/mahout#22 (pat: rev 2b65475c3ab682ebd47cffdc6b502698799cd2c8)

        • spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
        • spark/src/main/scala/org/apache/mahout/drivers/FileSysUtils.scala
        • spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
        • spark/pom.xml
        • spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/test/scala/org/apache/mahout/sparkbindings/test/MahoutLocalContext.scala
        • bin/mahout
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/main/assembly/job.xml
        • spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • CHANGELOG
        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        pferrel Pat Ferrel added a comment -

        First cut of readers and writers for text-delimited files is working. Input is tuple only; output is DRM-ish, with application-specific IDs. Once we have DRM-ish input we can trivially do an RSJ calc, even cross-RSJ.
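
A small sketch of what "DRM-ish output with application-specific IDs" might mean in practice: each sparse row is written as one line, with integer indexes translated back to the original string IDs. The line layout (row ID, a tab, then space-separated `columnID:value` pairs), the descending sort by value, and all names are assumptions chosen for the example, not a description of the committed writer.

```scala
// Hypothetical sketch: write sparse rows as text lines using external IDs.
object DrmWriterSketch {

  /** Inverts an ID -> index dictionary so indexes can be mapped back to IDs. */
  def invert(dict: Map[String, Int]): Map[Int, String] = dict.map(_.swap)

  /** Formats each (rowIndex, elements) pair as "rowID<TAB>colID:value colID:value". */
  def formatRows(rows: Seq[(Int, Seq[(Int, Double)])],
                 rowIDs: Map[String, Int],
                 colIDs: Map[String, Int],
                 rowKeyDelim: String = "\t",
                 elementDelim: String = " ",
                 valueDelim: String = ":"): Seq[String] = {
    val rowName = invert(rowIDs)
    val colName = invert(colIDs)
    rows.map { case (rowIndex, elements) =>
      val cells = elements
        .sortBy { case (_, value) => -value } // largest value first (an assumed convention)
        .map { case (c, value) => s"${colName(c)}$valueDelim$value" }
      s"${rowName(rowIndex)}$rowKeyDelim${cells.mkString(elementDelim)}"
    }
  }
}
```

Because the dictionaries travel with the data, the output can be read by downstream tools without any knowledge of the internal integer indexes.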

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2684 (See https://builds.apache.org/job/Mahout-Quality/2684/)
        MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 fixed a build test problem, drivers have a new option to not search for MAHOUT_HOME and SPARK_HOME (pat: rev 32badb1d360ddf514e6b253f2dea9ae7e5df078a)

        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2688 (See https://builds.apache.org/job/Mahout-Quality/2688/)
        MAHOUT-1541, MAHOUT-1568, added option to ItemSimilarityDriver to allow output that is directly search engine indexable, also some default schemas for input and output of TDF tuples and DRMs (pat: rev 9bfb767323833586873272af4db446f68f357f1f)

        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2733 (See https://builds.apache.org/job/Mahout-Quality/2733/)
        MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)

        • math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
        • spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2768 (See https://builds.apache.org/job/Mahout-Quality/2768/)
        MAHOUT-1604 add a CLI and associated code for spark-rowsimilarity, also cleans up some things in MAHOUT-1568 and MAHOUT-1569, closes apache/mahout#47 (pat: rev 149c98592fe447c98dfb5afc67b5809725cc3056)

        • spark/pom.xml
        • spark/src/main/scala/org/apache/mahout/drivers/RowSimilarityDriver.scala
        • CHANGELOG
        • spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala
        • spark/src/main/scala/org/apache/mahout/drivers/FileSysUtils.scala
        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        • spark/src/test/scala/org/apache/mahout/drivers/RowSimilarityDriverSuite.scala
        • math-scala/src/main/scala/org/apache/mahout/math/drm/RLikeDrmOps.scala
        • math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
        • bin/mahout
        • spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
        • spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
        • math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MatrixOpsSuite.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        pferrel Pat Ferrel added a comment -

        Created IndexedDataset, which has readers and writers for text-delimited formats. It supports reading by row or element and writing by row.

        IndexedDataset is a step along the road to DataFrames and so may be refactored when those are added.
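
To illustrate the "reading by row" case, the sketch below pairs a sparse matrix with row and column ID dictionaries and fills it from DRM-style lines. The container `Indexed`, the in-memory `Map` representation, and the assumed line layout (row ID, a tab, then space-separated `columnID:value` pairs) are stand-ins for the example only; the real IndexedDataset wraps a distributed matrix and is not reproduced here.

```scala
// Hypothetical sketch of a matrix-plus-dictionaries container, read by row.
object IndexedDatasetSketch {

  /** Sparse rows keyed by integer index, plus dictionaries for the external string IDs. */
  case class Indexed(rows: Map[Int, Map[Int, Double]],
                     rowIDs: Map[String, Int],
                     colIDs: Map[String, Int])

  /** Parses "rowID<TAB>colID:value colID:value" into (rowID, cells). */
  def parseRow(line: String): (String, Seq[(String, Double)]) = {
    val parts = line.split("\t", 2)
    val cells =
      if (parts.length < 2 || parts(1).trim.isEmpty) Seq.empty[(String, Double)]
      else parts(1).trim.split(" ").toSeq.map { cell =>
        // Split on the last ':' so column IDs may themselves contain ':'.
        val i = cell.lastIndexOf(':')
        (cell.substring(0, i), cell.substring(i + 1).toDouble)
      }
    (parts(0), cells)
  }

  /** Builds the dictionaries in order of first appearance, then indexes every cell. */
  def readRows(lines: Seq[String]): Indexed = {
    val parsed = lines.filter(_.trim.nonEmpty).map(parseRow)
    val rowIDs = parsed.map(_._1).distinct.zipWithIndex.toMap
    val colIDs = parsed.flatMap(_._2.map(_._1)).distinct.zipWithIndex.toMap
    // If a row ID repeats, the later line replaces the earlier one in this sketch.
    val rows = parsed.map { case (r, cells) =>
      rowIDs(r) -> cells.map { case (c, v) => colIDs(c) -> v }.toMap
    }.toMap
    Indexed(rows, rowIDs, colIDs)
  }
}
```

Bundling the dictionaries with the matrix is what lets a driver read user IDs in, compute on integer indexes, and write user IDs back out.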

        sslavic Stevo Slavic added a comment -

        Bulk closing all 0.10.0 resolved issues


          People

          • Assignee:
            pferrel Pat Ferrel
            Reporter:
            pferrel Pat Ferrel
          • Votes:
            1
            Watchers:
            3
