Hadoop Map/Reduce
MAPREDUCE-5247

FileInputFormat should filter files with '._COPYING_' suffix

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      FsShell copy/put creates staging files with a '._COPYING_' suffix. These files should be considered hidden by FileInputFormat. A simple fix is to add the following conjunct to the existing hiddenFilter:

      !name.endsWith("._COPYING_")
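
      For reference, a minimal sketch of what the amended filter could look like, assuming the existing hidden-file filter in FileInputFormat rejects names starting with "_" or "." (field and method names here follow that assumption and may differ across versions):

      // Sketch only: FileInputFormat's hidden-file filter with the
      // proposed '._COPYING_' conjunct added. Path and PathFilter are
      // from org.apache.hadoop.fs.
      private static final PathFilter hiddenFileFilter = new PathFilter() {
          public boolean accept(Path p) {
              String name = p.getName();
              return !name.startsWith("_")
                  && !name.startsWith(".")
                  && !name.endsWith("._COPYING_"); // proposed conjunct
          }
      };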
      

      After upgrading to CDH 4.2.0 we encountered this bug. We have a legacy data loader which uses 'hadoop fs -put' to load data into hourly partitions. We also have intra-hourly jobs scheduled to execute several times per hour using the same hourly partition as input. Thus, as new data is continuously loaded, these staging files (i.e., '._COPYING_') break our jobs, since the staging files are renamed away when the copy/put completes.

      As a workaround, we've defined a custom input path filter and registered it via "mapred.input.pathFilter.class".
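
      One way to write such a filter, sketched here with the hypothetical class name CopyingFileFilter (any class implementing org.apache.hadoop.fs.PathFilter and registered via "mapred.input.pathFilter.class" would do):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.PathFilter;

      // Hypothetical workaround filter: reject FsShell staging files in
      // addition to the names FileInputFormat already treats as hidden.
      public class CopyingFileFilter implements PathFilter {
          @Override
          public boolean accept(Path path) {
              String name = path.getName();
              return !name.startsWith("_")
                  && !name.startsWith(".")
                  && !name.endsWith("._COPYING_");
          }
      }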


          Activity

          Stan Rosenberg added a comment -

          Robert, what is the intended operational meaning of 'hadoop fs -put local dst'? Is it not that the file denoted by local is "atomically" transferred into dst? If that's the case, then I'd argue that it's broken: from the MR perspective the transfer is not truly atomic, since files with the '._COPYING_' suffix are visible.

          As I've indicated above, we have jobs which execute as soon as new data is available for that (hdfs) partition. The external scheduler knows when new data has finished loading, namely when all pending hdfs 'put' operations complete.
          (Think of it as a special type of job in the sense that it runs many times per hour, every time processing a superset of the input files.)

          Your claim that MR was not designed to run on data that is changing underneath it seems rather putative. What is wrong with the above approach, assuming the intended semantics of 'put' is an atomic transfer with no (MR-)observable '._COPYING_' side effect? (In other words, if MR is oblivious to '._COPYING_', then the data is not changing underneath it.)

          Robert Joseph Evans added a comment -

          I am happy to hear arguments as to why this is really necessary, but I would rather have my job fail than have the job give me partial/inconsistent results.

          Robert Joseph Evans added a comment -

          Why are you running a Map/Reduce job with input from a directory that has not finished being copied? MR was not designed to run on data that is changing underneath it. When the job is done, how do you know which of the input files were actually used to produce the output? This issue existed prior to 2.0 but was even worse without the '._COPYING_' suffix: in those cases the files were opened in place and data started to be copied into them. You may have gotten only part of a file in your MR job, not all of it, and the file could have disappeared out from under the MR job if an error occurred.

          This is not behavior that I want to make a common part of Map/Reduce. If you want to do this and you know the risks, then you can filter '._COPYING_' files out of your list of input files to the MR job. But I don't want the framework to do it automatically for everyone.

          Kousuke Saruta added a comment -

          Thanks Stan. I have asked in HADOOP-7771.

          Stan Rosenberg added a comment -

          I have no objections, but perhaps we can get the committers on HADOOP-7771 to comment on whether they think this belongs in HDFS?

          Kousuke Saruta added a comment -

          May I recreate this jira as an HDFS issue?

          Kousuke Saruta added a comment -

          As Devaraj said, we can use "mapred.input.pathFilter.class", but as far as I know the name of the temporary file is undocumented, and changes to the specification or implementation of HDFS should not affect existing HDFS users.
          So I think we should reconsider the name of the temporary file. It may be good for the name to start with "." or "_".

          Stan Rosenberg added a comment -

          Kousuke,

          HADOOP-7771 is a very interesting find! Based on my reading of the history, the naming convention uses the suffix explicitly to make the corresponding temporary (staging) file non-hidden. Indeed, the last comment in that jira is a question by Daryn about whether or not partially copied files should be visible; there are no follow-ups to the question, but the jira is closed. Since the decision to make temporary files visible seems to be arbitrary, I propose that we fix FsShell by making them hidden. Otherwise, this notion of partially loaded files needs to be lifted to the level of FileInputFormat: jobs should not be failing under normal conditions.

          Devaraj K added a comment -

          I don't think giving this responsibility to FileInputFormat is a good idea. FileInputFormat already provides extensibility for adding new filters via the "mapred.input.pathFilter.class" configuration. If users want to filter specific files out of the input directory for some jobs, they can achieve that with the current behavior, as in the sketch below.
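
          For illustration, a sketch of registering such a filter with the old mapred API, reusing the hypothetical CopyingFileFilter from the description above (this snippet belongs inside job setup):

          import org.apache.hadoop.mapred.FileInputFormat;
          import org.apache.hadoop.mapred.JobConf;

          JobConf conf = new JobConf();
          // Registers the filter class on the job; this sets the
          // "mapred.input.pathFilter.class" key under the hood.
          FileInputFormat.setInputPathFilter(conf, CopyingFileFilter.class);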

          Kousuke Saruta added a comment -

          I found the jira where the code that creates the '._COPYING_' temporary file was added:
          https://issues.apache.org/jira/browse/HADOOP-7771
          In that jira, they discussed an NPE problem when using copyToLocal, and the reason the '._COPYING_' file is created is to make the copy reliable.
          So I think the temporary file does not necessarily have to be given a name with the '._COPYING_' suffix.

          Stan Rosenberg added a comment -

          I agree. It appears this change is confined to FsShell and nothing else. Do we know why this particular file-naming convention was chosen, or was it just an oversight?

          Kousuke Saruta added a comment -

          Stan, I think we should modify FsShell to create the in-flight file with an underscore prefix, so that FileInputFormat can ignore it, rather than modify FileInputFormat to treat a file with the '._COPYING_' suffix as hidden.
          It's purely an HDFS matter, and I think the specification change shouldn't affect MapReduce.
          What do you think?

          Kousuke Saruta added a comment -

          I succeeded in reproducing this on branch-2.1-beta.
          I saw a temporary file with the '._COPYING_' suffix while putting a file into HDFS.
          As you say, running MapReduce jobs while '._COPYING_' files are present in a directory that the jobs use as an input path causes problems.

          Kousuke Saruta added a comment -

          OK. I think we should change Affects Version/s to trunk.

          Stan Rosenberg added a comment -

          Correct, the above holds in the community version; before submitting this jira I checked the (Apache) trunk.

          Kousuke Saruta added a comment -

          We should discuss just the community version, not a specific distribution.
          But what you describe seems to affect Hadoop 2 and trunk.


            People

            • Assignee: Unassigned
            • Reporter: Stan Rosenberg
            • Votes: 0
            • Watchers: 5
