Hadoop Map/Reduce
  MAPREDUCE-3772

MultipleOutputs output lost if baseOutputPath starts with ../

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: client
    • Labels:
      None
    • Target Version/s:

      Description

      Let's say you have the output directory set:

      FileOutputFormat.setOutputPath(job, new Path("/tmp/multi1/out"));

      and you want to place output from MultipleOutputs into /tmp/multi1/extra.

      I expect the following code to work:

      mos = new MultipleOutputs<Text, IntWritable>(context);
      mos.write(new Text("zrr"), value, "../extra/");

      but no exception is thrown, and the expected output directory /tmp/multi1/extra does not even exist. All data written to this output vanishes without a trace.

      To make it work, the full path must be used:
      mos.write(new Text("zrr"), value, "/tmp/multi1/extra/");

      The output is listed correctly in the statistics from MultipleOutputs:

      org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
      ../gaja1/=13333 (* everything is lost *)
      /tmp/multi1/out/../ksd34/=13333 (* this using full path works *)
      list1=6667
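A minimal sketch of the path arithmetic behind the report, using plain java.nio and a hypothetical task attempt directory standing in for $mapred.out.dir/_temporary/_$taskid (the location FileOutputFormat actually writes to while a task runs):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BaseOutputPathDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for $mapred.out.dir/_temporary/_$taskid.
        Path attemptDir = Paths.get("/tmp/multi1/out/_temporary/_attempt_001");

        // A relative baseOutputPath such as "../extra/" is resolved
        // against the attempt directory...
        Path resolved = attemptDir.resolve("../extra/part-r-00000").normalize();
        System.out.println(resolved);

        // ...so it lands under _temporary but outside the _$taskid
        // directory. The commit step promotes only files under _$taskid,
        // and _temporary is deleted when the job finishes, which is why
        // the data vanishes without a trace.
    }
}
```

Running this prints /tmp/multi1/out/_temporary/extra/part-r-00000, i.e. the file ends up in a directory that the commit procedure never promotes.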

        Activity

        Radim Kolar added a comment -

        Not a bug, but user error.

        Alejandro Abdelnur added a comment -

        MultipleOutputs was implemented to work properly when speculative execution is enabled with FileOutputFormat implementations (Text, SequenceFile). FileOutputFormats, to handle speculative execution, write the output to the path $mapred.out.dir/_temporary/_$taskid during execution. If speculative execution is in progress for a given task, there will be two task IDs for it; this means that while the 'competing' tasks are running, their outputs go to different directories. When the first speculative task completes, its output is committed (moved to $mapred.out.dir) and the second speculative task is discarded, as well as its output. MultipleOutputs creates the named outputs under the _$taskid directory, thus leveraging all the speculative execution functionality and behavior implemented by FileOutputFormat. If the named output file is not within the _$taskid directory, none of the logic just described works, as the task commit procedure only moves files from the _$taskid directory to $mapred.out.dir.

        Because of this, I think that Priyo's suggestion of logging a warning makes sense. There is a caveat: with the MO.write(K,V,NamedOutputPath) method, the warning would be logged in the log of the task that is creating the named output with an absolute path.
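A toy re-enactment of that commit protocol, using local java.nio file operations rather than Hadoop (all directory and file names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitProtocolSketch {
    public static void main(String[] args) throws IOException {
        // Stand-ins for $mapred.out.dir and its task attempt directory.
        Path outDir = Files.createTempDirectory("mapred.out.dir");
        Path attemptDir = Files.createDirectories(
                outDir.resolve("_temporary/_attempt_001"));

        // A named output written inside the attempt directory...
        Files.writeString(attemptDir.resolve("extra-r-00000"), "data");
        // ...and one written via "../", which lands outside of it.
        Path stray = attemptDir.resolve("../extra-r-00001").normalize();
        Files.writeString(stray, "data");

        // Commit: promote only the files under the attempt directory,
        // then clean up the temporary tree, as the committer does.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.move(f, outDir.resolve(f.getFileName()));
            }
        }
        Files.delete(attemptDir);
        Files.delete(stray);                        // swept away with _temporary
        Files.delete(outDir.resolve("_temporary"));

        // Only the file written under _$taskid survives the commit.
        System.out.println(Files.exists(outDir.resolve("extra-r-00000")));
        System.out.println(Files.exists(stray));
    }
}
```

The file written under the attempt directory survives the commit; the one written via "../" disappears with the temporary tree, matching the data loss described in the report.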

        Radim Kolar added a comment -

        Priyo, why not fix the root of the problem instead of inventing workarounds?

        Alejandro Abdelnur added a comment -

        Radim, if you have speculative execution enabled, any file written by MultipleOutputs outside of the output dir of the job (where the temp dir mechanism for output files of tasks is not in place) is at risk of having incomplete data.

        Priyo Mustafi added a comment -

        In that case, for current releases we should emit a very visible warning in the MapReduce output when somebody uses a baseOutputPath with a fully qualified path outside the main output directory, along with the javadoc changes. That way, folks who are currently using this feature incorrectly are aware that there will be data loss. In a later release we should enforce it, so that whether or not somebody reads the javadoc in detail, they are never faced with data loss.

        Radim Kolar added a comment -

        Data loss with a ../ path is not related to speculative execution.

        Harsh J added a comment -

        I agree that adding some javadocs doesn't make the bug into a feature. Any problem with silent data loss is a very serious issue.

        It's not so much data loss as a lack of understanding of how speculative execution may affect a task's write attempts when the concept of attempt IDs (i.e. output committing) isn't utilized. This does not apply to MO alone; it also applies to general writes done from any job. We've hit it with MO because MO provides an API (quite accidental, if you ask me) minus enforcement, which has led to all this.

        Priyo Mustafi added a comment -

        I agree that adding some javadocs doesn't make the bug into a feature. Any problem with silent data loss is a very serious issue. Also, suddenly enforcing it is going to break a whole lot of code out there in production.

        Radim Kolar added a comment -

        Yes, rename logic needs to be added to the OC. FileOutputCommitter would check for the existence of a variable like mo.rename and move the files there. The user would not need to write their own OC to move files around. The best workflow is to just submit the job and exit, not wait for the end and then rename files.

        A second enhancement is to mark directories containing files from MO with "_SUCCESS".

        Harsh J added a comment -

        Radim,

        Could you propose your idea more formally?

        At present, the baseOutputPath is directly evaluated to a path (either relative or absolute), and a record writer is created upon it.

        If we begin handling this (absolute or external-to-output paths) for the user (which, IMO, we should NOT do, as MultipleOutputs never really intended to expose these APIs originally - see the stable API, there's no such thing), then there are quite a few things to take care of:

        1. Detect all forms of out-of-dir paths (/foo, ../foo, ./../foo, etc.) and create temporary name mappings for the real write, to rename them out later.
        2. End-rename logic needs to be done in the OC stage, which MO does not control by its design (the stable API had a direct OutputFormat for this, which could do it perhaps).

        Any other things we'll need addressed?

        I also wonder if it's worth doing all this, when user logic can take care of sub-dir movement in the post-job stage with a few moves - which costs even less.
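As a sketch of the detection in step 1, a check along these lines could flag out-of-dir paths (escapesOutputDir is an invented helper for illustration, not an existing Hadoop API):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutOfDirCheck {
    // Hypothetical helper: does baseOutputPath resolve outside outputDir?
    // Catches /foo, ../foo, ./../foo and similar forms uniformly.
    static boolean escapesOutputDir(String outputDir, String baseOutputPath) {
        Path out = Paths.get(outputDir).normalize();
        // resolve() returns baseOutputPath unchanged if it is absolute.
        Path resolved = out.resolve(baseOutputPath).normalize();
        return !resolved.startsWith(out);
    }

    public static void main(String[] args) {
        System.out.println(escapesOutputDir("/tmp/multi1/out", "extra/part"));
        System.out.println(escapesOutputDir("/tmp/multi1/out", "../extra/part"));
        System.out.println(escapesOutputDir("/tmp/multi1/out", "/abc/def/xyz"));
    }
}
```

A relative subdirectory passes the check, while ../ paths and absolute paths outside the output directory are flagged.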

        Radim Kolar added a comment -

        Write into a temporary file, and save into the JobConf a mapping of temporary file = real file if this is an absolute path.

        Radim Kolar added a comment -

        -1. You should fix the root cause of this problem instead of documenting why it cannot be done.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555386/MAPREDUCE-3772.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3083//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3083//console

        This message is automatically generated.

        Alejandro Abdelnur added a comment -

        Yes, I noticed that. But this JIRA is not about cleaning up imports, just javadocs.

        Harsh J added a comment -

        Those are unused imports; I removed them as I was editing the file anyway. Will remove the change upon commit (awaiting Jenkins), if you insist.

        Alejandro Abdelnur added a comment -

        +1 after reverting the import changes.

        Harsh J added a comment -

        Here's a patch that just javadocs the warning.

        Alejandro Abdelnur added a comment -

        I meant: clearly document it in the javadocs.

        Harsh J added a comment -

        We should enforce disallowing absolute paths as they would end up lying outside of the generic FileOutputCommitter.

        Sorry for the repeated back-and-forth, but should we disallow, or merely warn and document? Because without speculative execution, MO tasks writing outside the job dir do work fine.

        Alejandro Abdelnur added a comment -

        makes sense Harsh, thx.

        Harsh J added a comment -

        I think that, at least, the javadocs should be updated to reflect that if you want to use speculative execution, the baseOutputPath must not be a path but a name. I would prefer to enforce it, as IMO it is a bug that it is not enforced.

        Subdirectories are a feature users utilize today. We should enforce disallowing absolute paths, as they would end up lying outside of the generic FileOutputCommitter. The 'why' for this is documented clearly at http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

        Alejandro Abdelnur added a comment -

        I think that, at least, the javadocs should be updated to reflect that if you want to use speculative execution, the baseOutputPath must not be a path but a name. I would prefer to enforce it, as IMO it is a bug that it is not enforced.

        Radim Kolar added a comment -

        If you have a budget, I can fix it for you.

        Priyo Mustafi added a comment -

        MultipleOutputs exposes two methods:
        1) public <K,V> void write(String namedOutput,K key,V value)
        2) public <K,V> void write(String namedOutput,K key,V value,String baseOutputPath)
        where
        namedOutput - the named output name
        baseOutputPath - base-output path to write the record to. Note: Framework will generate unique filename for the baseOutputPath

        We use the second one, which allows you to provide a baseOutputPath where the data needs to be written. I don't see anywhere in the javadoc a mention that baseOutputPath shouldn't be a fully qualified path, so the JIRA is definitely valid. Either the javadoc needs to be fixed or the code needs to be fixed, and I would prefer the latter, as we have developed extensive data pipelines based on this. If it is not fixed, we have to change the absolute paths to subdirectory paths and then, once the job is done, move all those directories out to the expected locations.

        Aside from that, if we provide baseOutputPath as "abc/def/xyz", it puts the directory under the main output directory, i.e. you get files like <main-output-dir>/abc/def/xyz-r-00000. Instead, if you use baseOutputPath as "/abc/def/xyz", where the path isn't a subdirectory of the main output directory, the problem is seen.

        Radim Kolar added a comment -

        With a slight code modification, MultipleOutputs can write to any file. Just write to the temporary directory as usual, but after the task is done, you can overwrite any file. Just create a mapping table between the temporary file and the final name.

        Alejandro Abdelnur added a comment -

        Priyo,

        MultipleOutputs has been designed assuming that named outputs are written in the output-dir of the job. This enables speculative execution to work. While a task is writing its output, this is done in the task's temporary directory within the output directory; when the task is completed, the files from the task's temporary directory are promoted to the output-dir. This includes all named outputs. Even if speculative execution is ON, there is no conflict/overwriting, because the two competing speculative tasks are using two different temporary directories and only one will be promoted.

        The assumption for this to work is that named outputs are NAMEs, not paths. Then, they end up next to the default 'part-####' output files.

        Please verify this is the case, and if so, please close this JIRA as invalid.

        Radim Kolar added a comment -

        There is no data loss. In HDFS you cannot overwrite an open file.

        Priyo Mustafi added a comment -

        We are seeing the same issue in 0.20.203, with partial data loss. We write hundreds of GB, and we are seeing a 7-10% loss in the written records compared to what is reported by the MultipleOutputs counters. Sometimes we are seeing zero-sized SequenceFiles as well, which are invalid. That is how we first noticed the problem, as our jobs sometimes get an EOFException while reading a SequenceFile.

        The issue happens when:
        1) MultipleOutputs writes data outside of the main output directory of the reducer, i.e. by giving an absolute path which is not inside the main output directory. This matches the above comment by Radim, as he is using ../, which moves the MultipleOutputs directory outside of the main output directory.

        2) It happens only when speculative execution is turned on for the reducer. Without speculative execution, everything works whether the directory is inside or outside.

        By the way, my code used MultipleOutputs only in the reducer, so I am not sure whether the same problem exists for the mapper.


          People

          • Assignee: Harsh J
          • Reporter: Radim Kolar
          • Votes: 1
          • Watchers: 11

            Dates

            • Created:
              Updated:
              Resolved:
