MAPREDUCE-6874: Make DistributedCache check if the content of a directory has changed


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix

    Description

      DistributedCache does not recursively check whether the contents of a directory have changed when the directory is added to the cache with DistributedCache.addCacheFile().

      Background

      I have an Oozie workflow on HDFS:

      example_workflow
      ├── job.properties
      ├── lib
      │   ├── components
      │   │   ├── sub-component.sh
      │   │   └── subsub
      │   │       └── subsub.sh
      │   ├── main.sh
      │   └── sub.sh
      └── workflow.xml
      

      I executed the workflow, then made some changes to subsub.sh and replaced the file on HDFS. When I re-ran the workflow, DistributedCache did not notice the change because the timestamp of the components directory had not changed. As a result, the old script was materialized.

      This behaviour might be related to determineTimestamps().
      To get the new script picked up during workflow execution, I had to update the whole components directory.
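
      As an illustration, here is a minimal sketch of what a recursive check could look like using the public FileSystem API. The class and helper name are hypothetical, not existing Hadoop code; the point is that a walk with FileSystem.listFiles(dir, true) would surface the changed subsub.sh where the directory's own timestamp does not.

      import java.io.IOException;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.LocatedFileStatus;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.RemoteIterator;

      public class RecursiveTimestamp {
        // Hypothetical helper: return the newest modification time found
        // anywhere under 'dir', rather than the modification time of the
        // directory entry itself. A timestamp check built on this would
        // notice a replaced nested file such as components/subsub/subsub.sh.
        static long latestModificationTime(FileSystem fs, Path dir) throws IOException {
          long latest = fs.getFileStatus(dir).getModificationTime();
          RemoteIterator<LocatedFileStatus> files = fs.listFiles(dir, true /* recursive */);
          while (files.hasNext()) {
            latest = Math.max(latest, files.next().getModificationTime());
          }
          return latest;
        }
      }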

      Some more info:

      In Oozie, DistributedCache.addCacheFile() is used to add files to the distributed cache.
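
      For reference, a minimal sketch of that call against the example workflow above. The HDFS path is hypothetical (the tree does not show where example_workflow lives), and the deprecated static DistributedCache entry point is shown because it is the one named in this issue.

      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.filecache.DistributedCache;

      public class AddToCache {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Hypothetical path to the directory from the tree above. Per this
          // issue, only the timestamp of this entry itself is recorded, so a
          // later change to a nested file like subsub/subsub.sh goes unnoticed.
          DistributedCache.addCacheFile(
              new URI("hdfs:///user/oozie/example_workflow/lib/components"), conf);
        }
      }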


          People

            Assignee: Unassigned
            Reporter: Attila Sasvári (asasvari)
            Votes: 0
            Watchers: 2
