Details
Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 3.2.0-incubating, 3.1.2-incubating
Description
I stuck a slightly modified XmlRecordReader class from the Apache Mahout project into org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader to bulk load XML with ScriptInputFormat, which I have notes on here:
https://github.com/dylanht/thamyris
I'm not sure what other formats would need a custom record reader, but why not allow it and let any class that extends RecordReader feed the user's Groovy script? I was thinking the config would be something like:
# Enum name for the <Format>RecordReaders TinkerPop provides, otherwise a fully qualified class name
# (XML vs. LINE or a.b.myReader)
gremlin.hadoop.scriptInputFormat.reader=XML
# omit the closing angle bracket so the block split starts before the attributes
gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
# An idea for later, because the above has big issues with nested elements
gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
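To make the "enum name, otherwise fully qualified class name" idea concrete, here is a minimal sketch of how the value of gremlin.hadoop.scriptInputFormat.reader might be resolved. Everything in it is an assumption for illustration: ReaderResolver, BuiltIn, resolveClassName and instantiate are hypothetical names, the XML entry points at a class that does not exist in TinkerPop today, and falling back to LINE when the property is unset is just a guess at a sensible default. It uses only the JDK, so the base type is passed in rather than hard-coding Hadoop's RecordReader.

```java
class ReaderResolver {

    /** Hypothetical short names; the XML class name is a placeholder, not an existing TinkerPop class. */
    enum BuiltIn {
        LINE("org.apache.hadoop.mapreduce.lib.input.LineRecordReader"),
        XML("org.apache.tinkerpop.gremlin.hadoop.structure.io.script.XmlRecordReader");

        final String className;

        BuiltIn(String className) {
            this.className = className;
        }
    }

    /** Maps a configured value to a class name: built-in short name first, else taken as-is. */
    static String resolveClassName(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return BuiltIn.LINE.className; // assumed default: keep the current line-based behaviour
        }
        String value = configured.trim();
        for (BuiltIn builtIn : BuiltIn.values()) {
            if (builtIn.name().equalsIgnoreCase(value)) {
                return builtIn.className;
            }
        }
        return value; // treated as a fully qualified class name supplied by the user
    }

    /** Loads the class and creates it through its no-argument constructor, checking the base type. */
    static <T> T instantiate(String className, Class<T> baseType) {
        try {
            return Class.forName(className)
                    .asSubclass(baseType)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot create a " + baseType.getSimpleName() + " from " + className, e);
        }
    }
}
```

With Hadoop on the classpath, the call would presumably look like ReaderResolver.instantiate(ReaderResolver.resolveClassName(value), RecordReader.class), which only works for readers that have a no-argument constructor.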
Hadoop's RecordReader abstract class declares InterruptedException as a checked exception on several methods, whereas LineRecordReader doesn't throw it from the corresponding methods. That's fine if LineRecordReader is imported directly, as it is now, or if XmlRecordReader is a weird hidden inner class the way I had it before. But to initialize an arbitrary RecordReader subclass, it seems LineRecordReader and XmlRecordReader would both have to end up in the org.apache.tinkerpop.gremlin.hadoop.structure.io.script package with something like this added in:
// same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
public void initialize() throws IOException, InterruptedException {
    // doesn't enclose things in a try/catch as is
    try {
        // things
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e.getMessage(), e);
    }
}
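One possible alternative to copying the readers into the package is to hold the delegate by its base type and deal with InterruptedException in the delegating class. The sketch below shows two options under stated assumptions. SimpleRecordReader is a simplified stand-in for Hadoop's RecordReader (the real initialize() takes an InputSplit and a TaskAttemptContext, and the real class has more methods), and LinesReader and DelegatingReader are hypothetical names. It is not how ScriptRecordReader is actually written.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Simplified stand-in for Hadoop's RecordReader: same throws clauses, fewer methods.
abstract class SimpleRecordReader {
    abstract boolean nextKeyValue() throws IOException, InterruptedException;
    abstract String getCurrentValue() throws IOException, InterruptedException;
}

// Like LineRecordReader, a subclass may declare fewer checked exceptions than its parent.
class LinesReader extends SimpleRecordReader {
    private final String[] lines;
    private int index = -1;

    LinesReader(String... lines) {
        this.lines = lines;
    }

    @Override
    boolean nextKeyValue() {
        return ++index < lines.length;
    }

    @Override
    String getCurrentValue() {
        return lines[index];
    }
}

// Works with any SimpleRecordReader because it only relies on the base-type signatures.
class DelegatingReader {
    private final SimpleRecordReader delegate;

    DelegatingReader(SimpleRecordReader delegate) {
        this.delegate = delegate;
    }

    // Option 1: declare the same checked exceptions and let them propagate to the caller.
    boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    // Option 2: for call sites that may only throw IOException, restore the
    // interrupt flag and translate the exception instead of swallowing it.
    String currentValueOrIoError() throws IOException {
        try {
            return delegate.getCurrentValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            InterruptedIOException translated = new InterruptedIOException(e.getMessage());
            translated.initCause(e);
            throw translated;
        }
    }
}
```

Option 1 is probably the simpler fit wherever the delegating method overrides a RecordReader method, since the parent signature already allows InterruptedException. Option 2 keeps the failure a checked IOException rather than a RuntimeException, which is a design choice and not something Hadoop requires.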
I don't know how good an idea it is to pull LineRecordReader and XmlRecordReader into that package, or how best to handle the InterruptedException. If there are more useful "<Format>RecordReader" classes that could be implemented, I would like to know about them. So I thought I would throw this up here before trying a PR. What do you think?
References:
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java