  1. Hadoop HDFS
  2. HDFS-13604

DistCp filtering conflicts with snapshotting



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: distcp
    • Labels: None


      DistCp has an option to filter (not copy) files that match one of the file patterns in a file.  DistCp also has options where it optimizes incremental copying based on snapshots present at the source and target location.  When enabling both options, files that should be copied from source to target are missing on the target.

      To reproduce the issue:

      • Create two directories, source and target.
      • In source, put two files, A and B, with some random content.
      • Create a filter file with a pattern that matches A, so that A is not copied.
      • Create a snapshot, snapshot_old, of the source directory.
      • Use distcp to copy the content of source to target.
      • As expected, the target directory will contain only file B; A is filtered.
      • Take a snapshot of the target directory, snapshot_old.
      • In the source directory, rename A to C.
      • Take a new snapshot of the source directory, snapshot_new.
      • Now, perform an incremental distcp copy using the created snapshots so as to optimize the incremental copy process: distcp -update -filters filters.txt -diff snapshot_old snapshot_new ... ...
      • You will find that the newly created file C is not copied to the target directory.
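
      The steps above can be written down as a command sequence. The following is only a sketch: the paths /tmp/source and /tmp/target, the local file names, the -allowSnapshot calls, and the -update flag on the first copy are assumptions that do not appear in the report, and filters.txt is assumed to hold a single regular expression such as .*/A. The function only builds and prints the commands; it does not run them.

```python
def build_repro_commands(source, target, filters_file):
    """Return the reproduction steps as argv lists, in the order they are run."""
    return [
        # Create the two directories and upload local files A and B.
        ["hdfs", "dfs", "-mkdir", "-p", source, target],
        ["hdfs", "dfs", "-put", "A", "B", source],
        # Snapshots need snapshottable directories (assumed; not listed in the report).
        ["hdfs", "dfsadmin", "-allowSnapshot", source],
        ["hdfs", "dfsadmin", "-allowSnapshot", target],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_old"],
        # Initial filtered copy; -update is an assumption so that the contents
        # of source land directly in the existing target directory.
        ["hadoop", "distcp", "-update", "-filters", filters_file, source, target],
        ["hdfs", "dfs", "-createSnapshot", target, "snapshot_old"],
        # Rename A to C on the source and take the second snapshot.
        ["hdfs", "dfs", "-mv", source + "/A", source + "/C"],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_new"],
        # Incremental copy that combines -diff with -filters.
        ["hadoop", "distcp", "-update", "-filters", filters_file,
         "-diff", "snapshot_old", "snapshot_new", source, target],
        # Inspect the result; per the report, C is missing here.
        ["hdfs", "dfs", "-ls", target],
    ]


# Hypothetical paths; the commands are printed, not executed.
for argv in build_repro_commands("/tmp/source", "/tmp/target", "filters.txt"):
    print(" ".join(argv))
```

      Each argv list could be passed to subprocess.run on a test cluster to execute the steps for real.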

      I suspect that the reason for this is that distcp concludes from analyzing the difference between snapshot_old and snapshot_new of the source that A was renamed to C. This can be confirmed by using snapshotDiff to compare the two snapshots: it reports that A has been renamed to C.

      distcp seems to then assume that the data for C is already present in the target directory and only needs to be renamed.  However, due to the filtering, A is not present on the target and cannot be renamed to C.
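
      The suspected mechanism can be illustrated with a toy model. This is not DistCp's code: directories are plain dictionaries, the diff is a hand-written dictionary standing in for the snapshotDiff report, and all function names are hypothetical. It only shows how applying a rename to a target that never received the filtered file leaves C missing without any error.

```python
import re


def is_filtered(name, patterns):
    """True if the name fully matches any filter pattern."""
    return any(re.fullmatch(p, name) for p in patterns)


def full_copy(source, target, patterns):
    """Initial copy: every unfiltered source file is copied to the target."""
    for name, data in source.items():
        if not is_filtered(name, patterns):
            target[name] = data


def diff_sync(source, target, diff, patterns):
    """Incremental copy driven by a snapshot diff (suspected behaviour)."""
    for old, new in diff["renamed"]:
        # The rename is replayed on the target. If the old name was never
        # copied there, nothing happens and no error is raised.
        if old in target:
            target[new] = target.pop(old)
    for name in diff["created"] + diff["modified"]:
        if not is_filtered(name, patterns):
            target[name] = source[name]


patterns = ["A"]
source = {"A": "data-a", "B": "data-b"}
target = {}
full_copy(source, target, patterns)      # target now holds only B

source["C"] = source.pop("A")            # rename A to C on the source
diff = {"renamed": [("A", "C")], "created": [], "modified": []}
diff_sync(source, target, diff, patterns)

print(sorted(target))                    # ['B']: C never arrives
```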

      Although the final distcp fails to create a copy of the C file in the target directory, distcp does not report any failure, nor can I find any trace of errors in the job logs of the jobs created by distcp to execute the actual copy.

      So, some options:

      • Combining -diff and -filters could be disallowed.
      • distcp could assume that files that have been filtered are not present and should be replicated in ordinary fashion.
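
      Both options could be sketched roughly as follows. This is a minimal illustration under assumptions, not a patch against DistCp: the diff is a toy dictionary, and check_options and adjust_diff are hypothetical names. The second function only handles a rename whose old name matches a filter; a rename from an unfiltered name to a filtered one is left as it is.

```python
import re


def is_filtered(name, patterns):
    """True if the name fully matches any filter pattern."""
    return any(re.fullmatch(p, name) for p in patterns)


def check_options(use_diff, filters_file):
    """Option 1: refuse to combine -diff with -filters."""
    if use_diff and filters_file is not None:
        raise ValueError("-diff cannot be combined with -filters")


def adjust_diff(diff, patterns):
    """Option 2: treat a rename from a filtered name as a newly created file."""
    adjusted = {"renamed": [], "created": list(diff["created"])}
    for old, new in diff["renamed"]:
        if is_filtered(old, patterns):
            # The old name was never copied, so the new name needs a full copy.
            adjusted["created"].append(new)
        else:
            adjusted["renamed"].append((old, new))
    return adjusted


print(adjust_diff({"renamed": [("A", "C")], "created": []}, ["A"]))
```

      With the adjusted diff, C would go through the ordinary copy path instead of being treated as a rename on the target.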






            Assignee: Unassigned
            Reporter: Mark Christiaens