Hadoop Common / HADOOP-2120

dfs -getMerge does not do what it says it does

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.14.3
    • Fix Version/s: None
    • Component/s: documentation, fs
    • Labels: None
    • Environment: All

      Description

      dfs -getMerge, which calls FileUtil.copyMerge, contains this javadoc:

      /** Get all the files in the directories that match the source file pattern
       *  and merge and sort them to only one file on local fs.
       *  srcf is kept. */

      However, it only concatenates the set of input files, rather than merging them in sorted order.

      Ideally, the copyMerge should be equivalent to a map-reduce job with IdentityMapper and IdentityReducer and numReducers = 1. However, not having to run this as a map-reduce job has some advantages, since a single-reducer job would under-utilize the cluster during its reduce phase.
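
      For reference, here is a minimal sketch (not from this issue) of the single-reducer identity job the description alludes to, written against the old org.apache.hadoop.mapred API of that era; the class name SortedMergeJob and the argument handling are illustrative assumptions:

         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.mapred.FileInputFormat;
         import org.apache.hadoop.mapred.FileOutputFormat;
         import org.apache.hadoop.mapred.JobClient;
         import org.apache.hadoop.mapred.JobConf;
         import org.apache.hadoop.mapred.lib.IdentityMapper;
         import org.apache.hadoop.mapred.lib.IdentityReducer;

         // Illustrative sketch: a job that passes records through unchanged,
         // relying on the shuffle to sort them and on a single reducer to
         // produce exactly one merged, sorted output file.
         public class SortedMergeJob {
           public static void main(String[] args) throws Exception {
             JobConf conf = new JobConf(SortedMergeJob.class);
             conf.setJobName("sorted-merge");
             conf.setMapperClass(IdentityMapper.class);
             conf.setReducerClass(IdentityReducer.class);
             conf.setNumReduceTasks(1); // one reducer = one merged, sorted file
             FileInputFormat.setInputPaths(conf, new Path(args[0]));
             FileOutputFormat.setOutputPath(conf, new Path(args[1]));
             JobClient.runJob(conf);
           }
         }

      The single reducer is exactly the drawback the description names: every record funnels through one task, leaving the rest of the cluster idle during the reduce phase.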

        Activity

        Ari Pollak added a comment -

        Even if the copyMerge comment is somewhat correct, the documentation for fs -getMerge is still confusing, so I think this bug should be reopened; from CopyCommands.java:
        public static final String DESCRIPTION =
            "Get all the files in the directories that\n" +
            "match the source file pattern and merge and sort them to only\n" +
            "one file on local fs. <src> is kept.";

        The command neither sorts nor merges the actual source files according to the traditional definition (e.g. the sort and sort -m commands); it merely concatenates them, in order of the filenames. It should look more like this:

        Concatenate all the files in the directories that match the pattern in <src> and output to only one file on local fs. <src> is kept.

        Ranadip added a comment -

        > hadoop fs -get dir/* - > out

        > has the same behavior, no?

        Not exactly the same, I think. In this case, out is a file on the local filesystem, while -getmerge is supposed to create the merged file on HDFS.
        Of course, we can still achieve that by doing something like:

        hadoop fs -get dir/* - | hadoop fs -put - dir/dest

        Milind Bhandarkar added a comment -

        This is not a merge in the commonly used sense (well, squinting a little bit, it is a merge if the comparator always returns -1).

        hadoop fs -get dir/* - > out
        

        has the same behavior, no?

        Harsh J added a comment -

        I believe the sorting earlier referred to the sorting of the file list?

        In that case, although FSNamesystem gives consistent sorting for HDFS's listStatus and the like, note that Java's File APIs do not provide the same consistency when using getmerge over a LocalFileSystem. I've opened HADOOP-7659 for this, btw.
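
        A minimal sketch (an illustration, not code from HADOOP-7659) of imposing a deterministic filename order on a listing before concatenating, which is the kind of consistency that issue asks for on LocalFileSystem:

        import java.io.IOException;
        import java.util.Arrays;
        import java.util.Comparator;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SortedListing {
          // Lists a directory and sorts the entries by file name, so the
          // concatenation order no longer depends on whatever order the
          // underlying File API happens to return.
          public static FileStatus[] listSorted(Path dir, Configuration conf)
              throws IOException {
            FileSystem fs = dir.getFileSystem(conf);
            FileStatus[] statuses = fs.listStatus(dir);
            Arrays.sort(statuses, new Comparator<FileStatus>() {
              public int compare(FileStatus a, FileStatus b) {
                return a.getPath().getName().compareTo(b.getPath().getName());
              }
            });
            return statuses;
          }
        }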

        Uma Maheswara Rao G added a comment -

        Hi Milind,

        It looks to me that the doc has been updated in the current code base:

         /** Copy all files in a directory to one output file (merge). */
         public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                         FileSystem dstFS, Path dstFile,
                                         boolean deleteSource,
                                         Configuration conf, String addString) throws IOException {
           dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);
        
        

        I am closing this as Not a Problem. If you have any concerns, please reopen.

        Thanks
        Uma
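
        For readers landing here, a minimal usage sketch of the copyMerge signature quoted above; the paths are illustrative assumptions, not from this issue:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        public class MergeExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);
            FileSystem local = FileSystem.getLocal(conf);
            // Concatenates every file under the source directory into one
            // local file, keeping the sources (deleteSource = false) and
            // inserting nothing between them (addString = null).
            boolean ok = FileUtil.copyMerge(hdfs, new Path("/user/foo/output"),
                                            local, new Path("/tmp/merged.txt"),
                                            false, conf, null);
            System.out.println("copyMerge returned " + ok);
          }
        }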

        Lohit Vijayarenu added a comment -

        Visualizing this as a map-reduce job that actually merges/sorts into a single file, shouldn't it be available as a separate package (like distcp, maybe)?
        This feature of merging files would be very useful for users who would like to have only one output file. For now, they would have to stick to a single reducer and avoid submitting a job with multiple reducers (even though that gives better machine utilization). A generic merge utility that understands the format and merges would be useful, something motivated by https://issues.apache.org/jira/browse/HADOOP-2113; a sketch of that idea follows.
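
        A hedged sketch of the kind of format-aware merge described above, using SequenceFile.Sorter, which can merge/sort SequenceFiles into a single output without running a map-reduce job; the paths and the Text key/value classes are illustrative assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class FormatAwareMerge {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Sorter sorter =
                new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
            // Merge-sorts the input SequenceFiles into one output file,
            // keeping the inputs (deleteInput = false).
            sorter.sort(new Path[] { new Path("part-00000"), new Path("part-00001") },
                        new Path("merged.seq"), false);
          }
        }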


          People

          • Assignee: Unassigned
          • Reporter: Milind Bhandarkar
          • Votes: 1
          • Watchers: 3
