MAHOUT-274: Use avro for serialization of structured documents

      Description

      Explore the intersection between Writables and Avro to see how serialization can be improved within Mahout.

      An intermediate goal is to provide a structured document format that can be serialized using Avro via an Input/OutputFormat and a Writable.

      Attachments

      1. mahout-avro-examples.tar.gz (455 kB) - Drew Farris
      2. mahout-avro-examples.tar.gz (6 kB) - Drew Farris

        Activity

        Ted Dunning added a comment -

        I think that this is definitely stale, but it is a pity to lose such nice work.

        (drew tends to do nice work)

        Sean Owen added a comment -

        I think this has gone fairly stale... there's no other use of Avro yet, and this already lives on github. Is there any reason to move forward with this patch in Mahout?

        Ted Dunning added a comment -

        Still a good idea, but not for 0.4, I think.

        Drew Farris added a comment -

        Tracking some interesting things happening over at AVRO-493 that would be a viable alternative to following the pattern established by MAPREDUCE-815.

        Drew Farris added a comment -

        Pushed to github: http://github.com/drewfarris/mahout-avro-testbed

        Need to update for the latest version of Avro (1.3.1), in addition to getting back to work on this in earnest.

        Drew Farris added a comment -

        (this is really the right tarball this time, honest)

        Drew Farris added a comment -

        Status update with a new tarball containing a maven project (mvn clean install should do the trick).

        README.txt is included; relevant portions below:

        Provided are two different versions of AvroInputFormat/AvroOutputFormat that are compatible with the mapred (pre-0.20) and mapreduce (0.20+) APIs. They are based on code provided as part of MAPREDUCE-815 and other patches. Also provided are backports of the SerializationBase/AvroSerialization classes from the current hadoop-core trunk.

        When writing a job using the pre-0.20 APIs:

        Add serializations:

            conf.setStrings("io.serializations",
                new String[] {
                  WritableSerialization.class.getName(),
                  AvroSpecificSerialization.class.getName(),
                  AvroReflectSerialization.class.getName(),
                  AvroGenericSerialization.class.getName()
                });
        

        Set up input and output formats:

            conf.setInputFormat(AvroInputFormat.class);
            conf.setOutputFormat(AvroOutputFormat.class);
            
            AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
            AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
        

        AvroInputFormat provides the specified class as the key and a LongWritable file offset as the value.
        AvroOutputFormat expects the specified class as the key and a NullWritable as the value.
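        To make that contract concrete, a minimal identity mapper sketch under the pre-0.20 API (class name hypothetical; assumes the generated AvroDocument class from the attached project):

            import java.io.IOException;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.NullWritable;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.Mapper;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reporter;

            // Identity pass-through: AvroInputFormat hands us (document, offset),
            // AvroOutputFormat wants (document, NullWritable) back.
            public class AvroPassthroughMapper extends MapReduceBase
                implements Mapper<AvroDocument, LongWritable, AvroDocument, NullWritable> {

              public void map(AvroDocument doc, LongWritable offset,
                  OutputCollector<AvroDocument, NullWritable> output, Reporter reporter)
                  throws IOException {
                output.collect(doc, NullWritable.get());
              }
            }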

        If an Avro-serializable class is passed between the map and reduce phases, it is necessary to set the following:

            AvroComparator.setSchema(AvroDocument._SCHEMA);
            conf.setClass("mapred.output.key.comparator.class", 
              AvroComparator.class, RawComparator.class);
        

        So far I've been using Avro 'specific' serialization, which compiles an Avro schema into a Java class; see
        src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is currently compiled into the classes o.a.m.avro.document.(AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler (eventually to be replaced by a maven plugin; generated sources are currently checked in).
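        (For orientation only: the real schema is in the tarball, but a document/field record pair along these lines would be a plausible shape. The docId field here is a guess; originalText and tokens come from the examples described below.)

            {"type": "record",
             "name": "AvroDocument",
             "namespace": "org.apache.mahout.avro.document",
             "fields": [
               {"name": "docId", "type": "long"},
               {"name": "fields", "type": {"type": "array", "items": {
                 "type": "record",
                 "name": "AvroField",
                 "fields": [
                   {"name": "name", "type": "string"},
                   {"name": "originalText", "type": "string"},
                   {"name": "tokens", "type": {"type": "array", "items": "string"}}
                 ]}}}
             ]}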

        Helper classes for AvroDocument and AvroField include o.a.m.avro.document.Avro(Document|Field)Builder and o.a.m.avro(Document|Field)Reader. This seems to work ok here, but I'm not certain that this is the best pattern to use, especially when there are many pre-existing classes (as there are in the case of Vector).

        Avro also provides reflection-based serialization and schema-based serialization; both should be supported by the infrastructure that has been backported here, but that's something else to explore.
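        For example, a reflection-based round trip needs neither a generated class nor a hand-written schema. A minimal sketch, assuming Avro 1.3-era APIs and a made-up POJO:

            import java.io.ByteArrayOutputStream;
            import org.apache.avro.Schema;
            import org.apache.avro.io.BinaryEncoder;
            import org.apache.avro.io.DecoderFactory;
            import org.apache.avro.reflect.ReflectData;
            import org.apache.avro.reflect.ReflectDatumReader;
            import org.apache.avro.reflect.ReflectDatumWriter;

            public class ReflectRoundTrip {
              static class SimpleDoc {      // made-up POJO standing in for a real class
                String name;
                int tokenCount;
              }

              public static void main(String[] args) throws Exception {
                // The schema is derived from the class by reflection.
                Schema schema = ReflectData.get().getSchema(SimpleDoc.class);

                SimpleDoc in = new SimpleDoc();
                in.name = "content";
                in.tokenCount = 42;

                // Write the record as raw Avro binary.
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                BinaryEncoder encoder = new BinaryEncoder(baos);
                new ReflectDatumWriter<SimpleDoc>(schema).write(in, encoder);
                encoder.flush();

                // Read it back.
                SimpleDoc out = new ReflectDatumReader<SimpleDoc>(schema).read(null,
                    DecoderFactory.defaultFactory().createBinaryDecoder(baos.toByteArray(), null));
                System.out.println(out.name + " " + out.tokenCount);
              }
            }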

        Examples:

        These are quick and dirty and need much cleanup work before they can be taken out to the dance.

        See o.a.m.avro.text, o.a.m.avro.text.mapred and o.a.m.avro.text.mapreduce:

        • AvroDocumentsFromDirectory: a quick and dirty port of SequenceFilesFromDirectory to use AvroDocuments. Writes a file containing documents in Avro format; each file's contents are stored in a single field named 'content', in the originalText portion of that field.
        • AvroDocumentsDumper: dumps an Avro documents file to standard output.
        • AvroDocumentsWordCount: performs a word count on an Avro document input file (a trimmed-down mapper sketch follows this list).
        • AvroDocumentProcessor: tokenizes the text found in the input document file; reads from the originalText of the field named 'content' and writes the original document plus tokens to the output file.
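        The sketch of the word count mapper mentioned above (the field-access helper is hypothetical; the attached AvroDocumentsWordCount is the working version):

            import java.io.IOException;
            import java.util.StringTokenizer;
            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.Mapper;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reporter;

            public class AvroWordCountMapper extends MapReduceBase
                implements Mapper<AvroDocument, LongWritable, Text, IntWritable> {

              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              public void map(AvroDocument doc, LongWritable offset,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
                  throws IOException {
                // Hypothetical helper: the real code walks the document's fields
                // to find 'content' and read its originalText.
                String text = extractOriginalText(doc, "content");
                StringTokenizer tokens = new StringTokenizer(text);
                while (tokens.hasMoreTokens()) {
                  word.set(tokens.nextToken());
                  output.collect(word, ONE);
                }
              }

              private String extractOriginalText(AvroDocument doc, String fieldName) {
                throw new UnsupportedOperationException("placeholder; see the attached project");
              }
            }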

        Running the examples:

        (haven't tested with the hadoop driver yet)

        mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
          -Dexec.args='--parent /home/drew/mahout/20news-18828 \
          --outputDir /home/drew/mahout/20news-18828-example \
          --charset UTF-8'
        
        mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
           -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed' 
        
        mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
          -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
        

        The Wikipedia stuff is in there, but isn't working yet. Many thanks (and apologies) to Robin for the starting point for much of this code, which I've hacked to pieces so badly.

        Ted Dunning added a comment -


        Those discussions seem pretty future tense and very much not something that will work in v20.

        Drew Farris added a comment -

        I suspect providing a writable wrapper that implements avro serialization may not be the best way to go here. After looking at the hadoop/avro APIs and code, the best approach may be to implement something like AvroSerialization/AvroDeserializer/AvroSerializer. I need to look more closely at MAPREDUCE-815 and MAPREDUCE-1126 because much of the work for this has probably been done there already.
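        For reference, the rough shape such a class would take against Hadoop's org.apache.hadoop.io.serializer interfaces (a sketch only, limited to Avro specific records; the MAPREDUCE-815 code is the real starting point):

            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;
            import org.apache.avro.Schema;
            import org.apache.avro.io.BinaryEncoder;
            import org.apache.avro.io.DecoderFactory;
            import org.apache.avro.specific.SpecificData;
            import org.apache.avro.specific.SpecificDatumReader;
            import org.apache.avro.specific.SpecificDatumWriter;
            import org.apache.avro.specific.SpecificRecord;
            import org.apache.hadoop.io.serializer.Deserializer;
            import org.apache.hadoop.io.serializer.Serialization;
            import org.apache.hadoop.io.serializer.Serializer;

            public class SketchAvroSpecificSerialization<T extends SpecificRecord>
                implements Serialization<T> {

              public boolean accept(Class<?> c) {
                return SpecificRecord.class.isAssignableFrom(c);
              }

              public Serializer<T> getSerializer(Class<T> c) {
                final Schema schema = SpecificData.get().getSchema(c);
                return new Serializer<T>() {
                  private OutputStream out;
                  private final SpecificDatumWriter<T> writer = new SpecificDatumWriter<T>(schema);
                  public void open(OutputStream out) { this.out = out; }
                  public void serialize(T t) throws IOException {
                    // One binary-encoded record per call; no per-record schema overhead.
                    BinaryEncoder encoder = new BinaryEncoder(out);
                    writer.write(t, encoder);
                    encoder.flush();
                  }
                  public void close() throws IOException { out.close(); }
                };
              }

              public Deserializer<T> getDeserializer(Class<T> c) {
                final Schema schema = SpecificData.get().getSchema(c);
                return new Deserializer<T>() {
                  private InputStream in;
                  private final SpecificDatumReader<T> reader = new SpecificDatumReader<T>(schema);
                  public void open(InputStream in) { this.in = in; }
                  public T deserialize(T reuse) throws IOException {
                    return reader.read(reuse,
                        DecoderFactory.defaultFactory().createBinaryDecoder(in, null));
                  }
                  public void close() throws IOException { in.close(); }
                };
              }
            }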

        Drew Farris added a comment -

        A very rudimentary exploration of using Avro to produce Writables.

        Uses Avro's specific Java class generation facility to produce a structured document class, which is wrapped in a generic Writable container for serialization.

        • Classes in o.a.m.avro are produced from the schema in src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc using o.a.m.avro.util.AvroDocumentCompiler
        • Provides a generic Avro Writable implementation in o.a.m.avro.mapred.SpecificAvroWritable (a condensed sketch of the idea follows below)
        • See the test in src/test/java, o.a.m.avro.mapred.SpecificAvroWritableTest, for how this can be used

        'mvn clean install' will run the whole shebang.
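        The gist of the wrapper, condensed into a sketch (the attached SpecificAvroWritable is the real implementation; this version length-prefixes the record's binary encoding and sidesteps the empty-construction problem by requiring a datum up front):

            import java.io.ByteArrayOutputStream;
            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;
            import org.apache.avro.io.BinaryEncoder;
            import org.apache.avro.io.DecoderFactory;
            import org.apache.avro.specific.SpecificDatumReader;
            import org.apache.avro.specific.SpecificDatumWriter;
            import org.apache.avro.specific.SpecificRecord;
            import org.apache.hadoop.io.Writable;

            public class AvroWritableSketch<T extends SpecificRecord> implements Writable {
              private T datum;   // the wrapped, Avro-generated record

              public AvroWritableSketch(T datum) { this.datum = datum; }
              public T get() { return datum; }

              public void write(DataOutput out) throws IOException {
                // Encode to a buffer first so the record can be length-prefixed;
                // readFields then knows exactly how many bytes to consume.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                BinaryEncoder encoder = new BinaryEncoder(buffer);
                new SpecificDatumWriter<T>(datum.getSchema()).write(datum, encoder);
                encoder.flush();
                byte[] bytes = buffer.toByteArray();
                out.writeInt(bytes.length);
                out.write(bytes);
              }

              public void readFields(DataInput in) throws IOException {
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                // Reuses the current datum (and its schema); a production version
                // must also cope with being constructed empty by Hadoop.
                datum = new SpecificDatumReader<T>(datum.getSchema()).read(datum,
                    DecoderFactory.defaultFactory().createBinaryDecoder(bytes, null));
              }
            }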


          People

          • Assignee: Drew Farris
          • Reporter: Drew Farris
          • Votes: 0
          • Watchers: 4
