Mahout / MAHOUT-621

Support more data import mechanisms

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      We should have more ways of getting data in:

      1. ARFF (MAHOUT-155)
      2. CSV (MAHOUT-548)
      3. Databases
      4. Behemoth (Tika, Map-Reduce)
      5. Other

        Activity

        Sean Owen added a comment -

        FWIW I envision this as a set of utilities, perhaps in mahout-utils, that make it very easy to import into Vectors. Is that about right? It'd be good to have one theory of what data looks like coming in, and provide means to ingest data from m sources into that format for use in n algorithms, rather than support m*n source/algo combinations.
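The m-sources/one-format/n-algorithms idea can be sketched in plain Java. The names below (VectorSource, CsvSource, PairSource, and the Map-based sparse form) are purely illustrative, not Mahout API: every adapter targets the same index-to-weight representation, so algorithms only ever see that one format.

```java
import java.util.*;

// One common sparse form (index -> weight); each source implements one
// adapter interface, so m sources feed n algorithms through one format.
interface VectorSource {
    Map<Integer, Double> next(String record);
}

class CsvSource implements VectorSource {
    // "1.0,0.0,2.5" -> {0=1.0, 2=2.5} (zeros dropped for sparsity)
    public Map<Integer, Double> next(String record) {
        Map<Integer, Double> v = new TreeMap<>();
        String[] fields = record.split(",");
        for (int i = 0; i < fields.length; i++) {
            double d = Double.parseDouble(fields[i].trim());
            if (d != 0.0) v.put(i, d);
        }
        return v;
    }
}

class PairSource implements VectorSource {
    // "3:1.0 7:0.5" -> {3=1.0, 7=0.5}
    public Map<Integer, Double> next(String record) {
        Map<Integer, Double> v = new TreeMap<>();
        for (String pair : record.split("\\s+")) {
            String[] kv = pair.split(":");
            v.put(Integer.parseInt(kv[0]), Double.parseDouble(kv[1]));
        }
        return v;
    }
}
```

Adding a new source then costs one adapter rather than one converter per algorithm.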

        Ted Dunning added a comment -

        The data sources that I have mostly seen include:

        • Document-like things with semi-structured fields. This includes most of our recommendation-style inputs if you do a group-by on user id and collect
          the values of the items being rated. It also includes document inputs, where the Lucene document is an excellent example.
        • SQL queries which ultimately produce something that looks like a document, possibly by denormalizing the final query result.
        • Time series. The OpenTSDB project has the nicest time series schema that I have seen.
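The first case above, grouping (user, item, rating) triples by user id into document-like records, can be sketched with plain collections; the class and method names here are hypothetical, for illustration only.

```java
import java.util.*;

// Sketch: turn flat (user, item, rating) rows into "document like" records
// by grouping on user id and collecting the rated items under each user.
class RatingGrouper {
    // each input row is {userId, itemId, rating}
    static Map<String, List<String>> byUser(List<String[]> rows) {
        Map<String, List<String>> docs = new TreeMap<>();
        for (String[] row : rows) {
            docs.computeIfAbsent(row[0], u -> new ArrayList<>())
                .add(row[1] + "=" + row[2]);
        }
        return docs;
    }
}
```

The resulting per-user records have the same shape as any other semi-structured document, which is what lets one ingestion path serve both recommendation and text inputs.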
        Shannon Quinn added a comment -

        On the note of OpenTSDB, is that something Hadoop supports or could support? I understand it's built on top of HBase, but could Mahout theoretically use this data transparently?

        Ted Dunning added a comment -

        It would be easy to do an HBase query and pass the data to Mahout. It would not be easy for Mahout to use the data without the good offices of HBase.

        Lance Norskog added a comment -

        Can there be export mechanisms too?

        Julien Nioche added a comment -

        Re Behemoth: I've started working on a Mahout module (https://github.com/jnioche/behemoth/tree/master/modules/mahout) which will help convert the Behemoth sequence files into vectors, as done by seq2sparse.

        I'm searching for a way to get round https://issues.apache.org/jira/browse/MAHOUT-368 but I think this is the last hurdle in the way before the module is fully functional.

        Ted Dunning added a comment -

        > Am searching for a way to get round https://issues.apache.org/jira/browse/MAHOUT-368 but I think this is the last hurdle in the way before the module is fully functional.

        How is this different from having more than one dependency?

        Can't you just use jar-with-dependencies (with maven) or the ant-ish equivalent?

        Julien Nioche added a comment -

        From https://issues.apache.org/jira/browse/MAHOUT-368

        > Why not have a bundle artifact where all the Mahout submodules would be put in a single jar?

        > How is this not trivial for you to handle with maven?
        > If you are writing your own maven project (recommended), then jar-with-dependencies will do what you want.
        > If you are extending Mahout (ok for prototypes), just put your code in the examples job jar and all will be good.

        I am not extending Mahout, and as you've probably seen in the comments above, the point is to be able to generate Mahout data structures from Behemoth, so putting the code in examples is not an option anyway.

        Back to the original problem. I generate a job file for my Mahout module in Behemoth (https://github.com/jnioche/behemoth/tree/master/modules/mahout) and manage the dependencies with Ivy. The main class (SparseVectorsFromBehemoth) is a slightly modified version of SparseVectorsFromSequenceFiles which gets the Tokens from Behemoth documents instead of using Lucene and generates the data structures expected by the classifiers and clusterers.

        The job file contains:

        • the Behemoth classes for the Mahout module
        • the dependencies in /lib including
          • mahout-math-0.4.jar
          • mahout-core-0.4.jar

        The problem I had was the same as Han Hui Wen's (MAHOUT-368), i.e. I was getting a class-not-found exception on org.apache.mahout.math.VectorWritable. My understanding of the problem is that my main class calls DictionaryVectorizer, which in my job file was in lib/mahout-core-0.4.jar, and this has a dependency on VectorWritable, which is in lib/mahout-math-0.4.jar. For some reason MapReduce was not able to find VectorWritable, which I assume has to do with the jobs in DictionaryVectorizer calling 'job.setJarByClass(DictionaryVectorizer.class)'.

        I could of course use jar-with-dependencies on the Mahout code to generate a single jar, then manage that jar locally. However, this means that I have very little control over the dependencies used by Mahout (e.g. potentially conflicting versions with other components in my job files), and I'd rather rely on external published jars anyway. A better option would be to simply unpack the content of the mahout core and math jars into the root of my job file. At least the Mahout dependencies would be handled and versioned normally.

        I've tried with Hadoop 0.21.0 and did not get this issue so I suppose that something must have changed in the way the classloader handles dependencies within a job file.
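The "unpack the jars into the job file root" option can be sketched with nothing but java.util.zip. This is a minimal illustration of merging jar contents, not how Behemoth or Mahout actually build their job files, and the class name is hypothetical.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

// Merge the entries of several jars into one archive root, so classes
// live at the top level instead of under lib/. A real job jar would also
// carry the module's own classes and a manifest; this shows only the merge.
class JobJarMerger {
    static void merge(Path outJar, Path... inputJars) throws IOException {
        Set<String> seen = new HashSet<>();
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(outJar))) {
            for (Path jar : inputJars) {
                try (ZipInputStream in = new ZipInputStream(Files.newInputStream(jar))) {
                    for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                        if (!seen.add(e.getName())) continue; // skip duplicate names
                        out.putNextEntry(new ZipEntry(e.getName()));
                        in.transferTo(out);
                        out.closeEntry();
                    }
                }
            }
        }
    }
}
```

With the classes at the archive root, the task classloader finds them directly rather than depending on nested lib/ jar handling.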

        Makes sense?

        Ted Dunning added a comment -

        > Makes sense?

        No.

        I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies. Why do you have a lib directory at all?

        Julien Nioche added a comment - edited

        > I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies.

        You've obviously not understood the explanations above, or Han Hui Wen's: we do get the dependencies OK.

        > Why do you have a lib directory at all?

        This is within the job file that I generate, and is used to store the dependencies. AFAIK this is a pattern used in other Hadoop-related projects and is not particularly unusual or stupid.

        Show
        Julien Nioche added a comment - - edited I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies. you've obviously not understood the explanations above or Han Hui Wen's : we do get the dependencies OK? Why do you have a lib directory at all? this is within the job file that I generate and used to store the dependencies, AFAIK this is a patterns used in other Hadoop related projects and is not particularly unusual or stupid
        Hide
        Sean Owen added a comment -

        No action on this issue; it hasn't been touched in 7 months. I think it's a bit too high-level, and it could come back as more specific sub-issues. Note that we did refactor, package and improve some of the integration code into the integration/ module, so it's kind of been addressed.


          People

          • Assignee: Unassigned
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 2
