Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Invalid
Description
To use the DataFileWriter#appendTo API, one needs to pass an object implementing the SeekableInput interface. Avro provides a ready-made implementation (SeekableFileInput) for anything that can be represented by a java.io.File, but on Hadoop, files on HDFS and other FileSystem implementations can't be represented by a File object, so users have to take the longer route of implementing the interface themselves.
We can add a simple class, say HadoopSeekableFSInput, that takes Hadoop-provided objects and wraps them in a SeekableInput ready for passing to Avro.
I propose something along the following lines:
public static class HadoopSeekableFSInput implements SeekableInput {
  private final FSDataInputStream in;
  // File length is passed in by the caller (e.g. from FileStatus#getLen).
  private final long length;

  public HadoopSeekableFSInput(FSDataInputStream in, long length) {
    this.in = in;
    this.length = length;
  }

  public void close() throws IOException { in.close(); }

  public void seek(long p) throws IOException { in.seek(p); }

  // Current position in the underlying stream.
  public long tell() throws IOException { return in.getPos(); }

  public long length() throws IOException { return length; }

  public int read(byte[] b, int off, int len) throws IOException {
    return in.read(b, off, len);
  }
}
The above can be constructed by users via a simple call such as new HadoopSeekableFSInput(fs.open(filePath), fs.getFileStatus(filePath).getLen()).
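To illustrate the wrapper's contract without Hadoop or Avro on the classpath, here is a minimal stand-alone sketch of the same adapter idea. Everything in it is an assumption made for illustration. The nested SeekableInput interface is a local stand-in that only mirrors the method shapes of Avro's interface. java.nio's SeekableByteChannel plays the role of FSDataInputStream. The names SeekableAdapterSketch and ChannelSeekableInput are hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekableAdapterSketch {

  /** Local stand-in with the same method shapes as Avro's SeekableInput (assumption). */
  interface SeekableInput extends Closeable {
    void seek(long p) throws IOException;
    long tell() throws IOException;
    long length() throws IOException;
    int read(byte[] b, int off, int len) throws IOException;
  }

  /** Adapter over a seekable source; the channel stands in for FSDataInputStream. */
  static class ChannelSeekableInput implements SeekableInput {
    private final SeekableByteChannel in;
    private final long length; // supplied by the caller, as in the proposal

    ChannelSeekableInput(SeekableByteChannel in, long length) {
      this.in = in;
      this.length = length;
    }

    @Override public void close() throws IOException { in.close(); }

    @Override public void seek(long p) throws IOException { in.position(p); }

    @Override public long tell() throws IOException { return in.position(); }

    @Override public long length() { return length; }

    @Override public int read(byte[] b, int off, int len) throws IOException {
      // Read into the caller's array at the requested offset.
      return in.read(ByteBuffer.wrap(b, off, len));
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("seekable", ".bin");
    Files.write(tmp, new byte[] {10, 20, 30, 40, 50});
    try (SeekableInput input =
        new ChannelSeekableInput(Files.newByteChannel(tmp), Files.size(tmp))) {
      input.seek(2); // jump past the first two bytes
      byte[] buf = new byte[2];
      int n = input.read(buf, 0, buf.length);
      System.out.println(n + " byte(s) read, position now " + input.tell());
    } finally {
      Files.delete(tmp);
    }
  }
}
```

As in the proposed class, the length is captured once at construction time, so it will not reflect data appended to the file after the wrapper is created.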
Ideally this class would belong in the Avro core module, but core does not depend on Hadoop-Common today, so another module may be a more suitable home.
This lets users write Avro-append code such as https://gist.github.com/QwertyManiac/4724582 more easily.