AVRO-1130

MapReduce jobs can write SortedKeyValueFiles directly

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.7.1
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

      Description

      It would be nice if MapReduce jobs could write directly to SortedKeyValueFiles.

      See harsh@'s response on this thread http://goo.gl/OT1rN for more information on what needs to be done.

        Activity

        Steven Willis added a comment -

        I'd very much like this as well. How would you imagine the output would be structured? With a normal SortedKeyValueFile you've got a single directory containing exactly two files: data and index. With a MapReduce job that has multiple reducers, I wonder how this should look.

        Maybe:

        output_path/data-part-00000
        output_path/data-part-00001
        output_path/data-part-00002
        output_path/index-part-00000
        output_path/index-part-00001
        output_path/index-part-00002
        

        But then if you wanted to treat output_path as a SortedKeyValueFile, you'd have to modify the code to allow for multiple data and index files. Perhaps any directory containing exactly the same number of data* and index* files could be treated as an SKVF, as long as the trailing portion of each data filename matches an index filename.
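
        A minimal sketch of that check (class and method names are illustrative only, assuming the Hadoop FileSystem API): a directory qualifies as a multi-part SKVF when its data-* and index-* files pair up exactly.

        import java.io.IOException;
        import java.util.Set;
        import java.util.TreeSet;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class FlatSkvfLayout {
          // Hypothetical check for the flat layout: every data-* file must
          // have an index-* file with the same trailing part name.
          public static boolean looksLikeMultiPartSkvf(Path dir, Configuration conf)
              throws IOException {
            FileSystem fs = dir.getFileSystem(conf);
            Set<String> dataParts = new TreeSet<String>();
            Set<String> indexParts = new TreeSet<String>();
            for (FileStatus st : fs.listStatus(dir)) {
              String name = st.getPath().getName();
              if (name.startsWith("data-")) {
                dataParts.add(name.substring("data-".length()));   // e.g. "part-00000"
              } else if (name.startsWith("index-")) {
                indexParts.add(name.substring("index-".length()));
              }
            }
            // Same number of files and matching trailing portions.
            return !dataParts.isEmpty() && dataParts.equals(indexParts);
          }
        }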

        Or would something like this be better:

        output_path/part-00000/data
        output_path/part-00000/index
        output_path/part-00001/data
        output_path/part-00001/index
        output_path/part-00002/data
        output_path/part-00002/index
        

        That way, each part is an SKVF and works with the existing code. But then you wouldn't be able to treat output_path as a single SKVF. Maybe the new SKVFInputFormat would allow the input path to be either an SKVF directory or a directory containing SKVF directories, as in the sketch below.
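
        A hypothetical sketch of that input-path resolution (SkvfInputPaths and resolve are illustrative names; the "data"/"index" filenames follow the existing SortedKeyValueFile convention):

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SkvfInputPaths {
          // Expands inputPath into the list of SKVF directories to read:
          // either inputPath itself, or its part-* subdirectories.
          public static List<Path> resolve(Path inputPath, Configuration conf)
              throws IOException {
            FileSystem fs = inputPath.getFileSystem(conf);
            List<Path> skvfDirs = new ArrayList<Path>();
            if (fs.exists(new Path(inputPath, "data"))
                && fs.exists(new Path(inputPath, "index"))) {
              skvfDirs.add(inputPath);                // a single SKVF directory
            } else {
              FileStatus[] parts = fs.globStatus(new Path(inputPath, "part-*"));
              if (parts != null) {
                for (FileStatus st : parts) {
                  if (st.isDirectory()) {
                    skvfDirs.add(st.getPath());       // one SKVF per reducer
                  }
                }
              }
            }
            return skvfDirs;
          }
        }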

        I think I'd lean towards the first approach myself.

        Doug Cutting added a comment -

        I'd expect this to work like Hadoop's MapFileOutputFormat, which is the latter of your two examples. Note that MapFileOutputFormat#getReaders() can be used to open all of the files. The array can then be accessed using the Partitioner that was used by the MapReduce job, e.g.:

        SortedKeyValueFile.Reader<K,V>[] readers;  // one reader per partition
        Partitioner<K,V> partitioner;              // same partitioner the job used

        public V getValue(K key) throws IOException {
          // Look the key up in the partition it was written to.
          return readers[partitioner.getPartition(key, null, readers.length)].get(key);
        }
        
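        A hypothetical getReaders() helper in that spirit, mirroring MapFileOutputFormat#getReaders() (SkvfReaders is an illustrative name; the Reader.Options calls assume the current SortedKeyValueFile API, with key/value schemas supplied by the caller):

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.avro.Schema;
        import org.apache.avro.hadoop.file.SortedKeyValueFile;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SkvfReaders {
          // Opens one reader per part-NNNNN SKVF directory under outputPath.
          // globStatus returns paths in sorted order, so index i in the list
          // lines up with partition i from the job's Partitioner.
          public static <K, V> List<SortedKeyValueFile.Reader<K, V>> getReaders(
              Path outputPath, Schema keySchema, Schema valueSchema,
              Configuration conf) throws IOException {
            FileSystem fs = outputPath.getFileSystem(conf);
            List<SortedKeyValueFile.Reader<K, V>> readers =
                new ArrayList<SortedKeyValueFile.Reader<K, V>>();
            FileStatus[] parts = fs.globStatus(new Path(outputPath, "part-*"));
            if (parts != null) {
              for (FileStatus st : parts) {
                SortedKeyValueFile.Reader.Options options =
                    new SortedKeyValueFile.Reader.Options()
                        .withConfiguration(conf)
                        .withPath(st.getPath())
                        .withKeySchema(keySchema)
                        .withValueSchema(valueSchema);
                readers.add(new SortedKeyValueFile.Reader<K, V>(options));
              }
            }
            return readers;
          }
        }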

          People

          • Assignee: Harsh J
          • Reporter: Jeremy Lewi
          • Votes: 1
          • Watchers: 4
