TinkerPop / TINKERPOP-1133

ScriptRecordReader should allow any class implementing/extending RecordReader to bust up blocks, not just LineRecordReader

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 3.2.0-incubating, 3.1.2-incubating
    • Fix Version/s: None
    • Component/s: documentation, hadoop, io
    • Labels: None

    Description

      I put a slightly modified XmlRecordReader class from the Apache Mahout project into org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader so that ScriptInputFormat can bulk load XML. My notes are here:
      https://github.com/dylanht/thamyris

      I'm not sure which other formats would need a custom record reader, but why not allow it and let any class that implements RecordReader feed the user's Groovy script? I was thinking the config would be something like:

      // Enum for <Format>RecordReaders TinkerPop provides, otherwise fully qualified class name
      gremlin.hadoop.scriptInputFormat.reader=XML // vs. LINE or a.b.myReader
      
      // omit the closing angle bracket so the block split starts before the attributes
      gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
      gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
      // An idea for later, because the above has big issues with nested elements
      gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
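      The reader lookup proposed above could be sketched roughly as follows. This is only an illustration: ReaderResolver, BuiltIn, LineReader and XmlReader are hypothetical names, and the nested classes stand in for real RecordReader subclasses so the example runs without Hadoop on the classpath.

```java
public class ReaderResolver {

    // Hypothetical stand-ins for the readers TinkerPop would provide.
    static class LineReader {}
    static class XmlReader {}

    // Hypothetical enum of built-in reader names (LINE, XML).
    enum BuiltIn {
        LINE(LineReader.class),
        XML(XmlReader.class);

        final Class<?> type;

        BuiltIn(Class<?> type) {
            this.type = type;
        }
    }

    // Resolve the configured value: a built-in enum name first,
    // otherwise treat it as a fully qualified class name.
    static Class<?> resolve(String value) throws ClassNotFoundException {
        for (BuiltIn builtIn : BuiltIn.values()) {
            if (builtIn.name().equals(value)) {
                return builtIn.type;
            }
        }
        return Class.forName(value);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolve("XML").getSimpleName());
        // Any loadable class name works here; a real config would name a reader class.
        System.out.println(resolve("java.lang.StringBuilder").getName());
    }
}
```

      A real implementation would presumably also check the loaded class with RecordReader.class.isAssignableFrom(...) before instantiating it.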
      
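      To make the startTag/endTag idea and its nesting problem concrete, here is a minimal sketch of tag-based block splitting over an in-memory string. It is not the Mahout XmlRecordReader, which scans a byte stream within an input split; TagBlockSplitter and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class TagBlockSplitter {

    // Collect every substring that begins at startTag and ends at the
    // first endTag found after it. Nesting is not tracked.
    static List<String> split(String input, String startTag, String endTag) {
        List<String> blocks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break;
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated block: stop rather than emit a partial record
            }
            end += endTag.length();
            blocks.add(input.substring(start, end));
            from = end;
        }
        return blocks;
    }

    public static void main(String[] args) {
        String xml = "<top><myCustomer id=\"1\"><name>a</name></myCustomer>"
                + "<myCustomer id=\"2\"><name>b</name></myCustomer></top>";
        for (String block : split(xml, "<myCustomer", "</myCustomer>")) {
            System.out.println(block);
        }
    }
}
```

      With nested input such as <myCustomer><myCustomer></myCustomer></myCustomer>, this returns a single block that is cut off at the inner closing tag, which is the issue noted above. A start tag without its closing angle bracket also matches longer element names that share the prefix.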

      Hadoop's RecordReader (an abstract class in the org.apache.hadoop.mapreduce API) declares the checked InterruptedException on several methods, whereas LineRecordReader's overrides of those methods do not declare it. That's fine when LineRecordReader is imported directly, as it is now, or when XmlRecordReader is an odd hidden inner class, the way I had it before. But to initialize an arbitrary RecordReader, it seems LineRecordReader and XmlRecordReader would both have to end up in the org.apache.tinkerpop.gremlin.hadoop.structure.io.script package with something like this added:

      // same pattern for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
      public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
          // the current code does not wrap this call in a try/catch
          try {
              // delegate to the wrapped RecordReader here
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt(); // restore the interrupt flag
              throw new RuntimeException(e.getMessage(), e);
          }
      }
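
      For a runnable view of that pattern, the sketch below wraps a delegate whose method declares InterruptedException behind a signature that only declares IOException. The Reader interface is a hypothetical stand-in that mirrors the throws clause of RecordReader#nextKeyValue so that no Hadoop dependency is needed; InterruptAwareDelegate is likewise an invented name.

```java
import java.io.IOException;

public class InterruptAwareDelegate {

    // Hypothetical stand-in: same throws clause as RecordReader#nextKeyValue.
    interface Reader {
        boolean nextKeyValue() throws IOException, InterruptedException;
    }

    // Exposes an IOException-only signature; an interrupt is converted
    // into an unchecked exception after the thread's flag is restored.
    static boolean nextKeyValue(Reader delegate) throws IOException {
        try {
            return delegate.nextKeyValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new RuntimeException(e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(nextKeyValue(() -> true)); // prints "true"
        try {
            nextKeyValue(() -> { throw new InterruptedException("stopped"); });
        } catch (RuntimeException e) {
            // Thread.interrupted() reports the restored flag and clears it again.
            System.out.println(e.getMessage() + " " + Thread.interrupted()); // prints "stopped true"
        }
    }
}
```

      Whether wrapping in a RuntimeException is the right choice, rather than widening the throws clause, is the open question here.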
      

      I don't know how good an idea it is to pull LineRecordReader and XmlRecordReader into that package, or how best to handle the InterruptedException. If there are other useful "<Format>RecordReader" classes that could be implemented, I would like to know about them. So I thought I would raise this here before trying a PR. What do you think?

      References:
      https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
      https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java

          People

            Assignee: spmallette (Stephen Mallette)
            Reporter: dylanht (Dylan Bethune-Waddell)