  1. Hadoop YARN
  2. YARN-543 [Umbrella] NodeManager localization related issues
  3. YARN-547

Race condition in Public / Private Localizer may result in a resource being downloaded again



    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.1.0-beta
    • Hadoop Flags: Reviewed


      Public Localizer:
      At present, when multiple containers request the same localized resource:

      • If the resource is not present, it is first created and resource localization starts (the LocalizedResource is in the DOWNLOADING state).
      • If multiple ResourceRequestEvents arrive while it is in this state, ResourceLocalizationEvents are sent for all of them.

      Most of the time this does not result in a duplicate download, but a race condition exists. Inside ResourceLocalization (for public downloads), every request is added to a local attempts map. When a new request comes in, this map is checked before a new download is started for it. While a download is in progress, its request is in the map, so another request for the same resource is rejected (the resource is already being downloaded). However, once the current download completes, the request is removed from the map. If a LocalizerRequestEvent arrives after this removal, the request is no longer in the map and the resource is downloaded again.
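
      The ordering described above can be replayed in a small single-threaded model. This is only a sketch: PublicLocalizerSketch, its fields, and the hdfs://nn/app.jar path are hypothetical stand-ins, not YARN's actual classes, and the checkState guard is one possible way to consult the resource state, not necessarily what the attached patches do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the public localizer's bookkeeping; not YARN code.
class PublicLocalizerSketch {
    enum State { DOWNLOADING, LOCALIZED }

    final Map<String, State> resourceState = new HashMap<>(); // state per resource
    final Set<String> pending = new HashSet<>();              // in-flight downloads
    final boolean checkState;                                 // true = sketched guard enabled
    int downloads = 0;

    PublicLocalizerSketch(boolean checkState) {
        this.checkState = checkState;
    }

    // Handles one queued localizer request for the given resource.
    synchronized void onLocalizerRequest(String resource) {
        if (pending.contains(resource)) {
            return; // a download is already in flight, reject the request
        }
        if (checkState && resourceState.get(resource) == State.LOCALIZED) {
            return; // guard: the resource has already been localized
        }
        pending.add(resource);
        resourceState.put(resource, State.DOWNLOADING);
        downloads++;
    }

    // Called when a download finishes; the pending entry is removed.
    synchronized void onDownloadComplete(String resource) {
        pending.remove(resource);
        resourceState.put(resource, State.LOCALIZED);
    }

    // Replays the problematic ordering: request, completion, then a late request.
    static int replay(boolean checkState) {
        PublicLocalizerSketch s = new PublicLocalizerSketch(checkState);
        String rsrc = "hdfs://nn/app.jar";  // hypothetical resource path
        s.onLocalizerRequest(rsrc);         // first container's request starts the download
        s.onDownloadComplete(rsrc);         // download finishes, pending entry removed
        s.onLocalizerRequest(rsrc);         // event that was queued while DOWNLOADING
        return s.downloads;
    }
}
```

      In this model, replay(false) counts two downloads because only the pending set is consulted, while replay(true) counts one because the late request sees the LOCALIZED state and is dropped.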

      PrivateLocalizer:
      A different but similar race condition is present here.

      • Inside the findNextResource method call, each LocalizerRunner tries to grab a lock on the LocalizedResource. If the lock is not acquired, it keeps trying until the resource state changes to LOCALIZED. The LocalizerRunner that holds the lock releases it when the download completes.
      • If another ContainerLocalizer grabs the lock on the resource after it is released but before the LocalizedResource state changes to LOCALIZED, the resource is downloaded again.
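
      The window between the lock release and the state change can be modeled with a plain java.util.concurrent.Semaphore. This is a sketch under assumptions: PrivateLocalizerSketch and its methods are hypothetical and do not mirror YARN's LocalizedResource or LocalizerRunner. The guarded variant assumes the state is set to LOCALIZED before the lock is released, which is one possible arrangement rather than a description of the actual fix. The replay is single-threaded so that the interleaving is deterministic.

```java
import java.util.concurrent.Semaphore;

// Hypothetical model of a single resource shared by several runners; not YARN code.
class PrivateLocalizerSketch {
    enum State { DOWNLOADING, LOCALIZED }

    final Semaphore lock = new Semaphore(1);
    State state = State.DOWNLOADING;
    int downloads = 0;

    // Models a runner claiming the resource by lock alone.
    boolean claimLockOnly() {
        if (!lock.tryAcquire()) {
            return false; // someone else is downloading
        }
        downloads++;
        return true;
    }

    // Sketched guard: take the lock, then re-check the state before downloading.
    boolean claimWithStateCheck() {
        if (!lock.tryAcquire()) {
            return false;
        }
        if (state != State.DOWNLOADING) {
            lock.release(); // already localized, give the lock back
            return false;
        }
        downloads++;
        return true;
    }

    // Lock released before the state changes: the second runner downloads again.
    static int replayLockOnly() {
        PrivateLocalizerSketch r = new PrivateLocalizerSketch();
        r.claimLockOnly();         // runner A starts the download
        r.lock.release();          // A finishes and releases the lock
        r.claimLockOnly();         // runner B gets the lock before the state changes
        r.state = State.LOCALIZED; // the state transition arrives too late
        return r.downloads;
    }

    // State updated before the release, and checked after acquiring the lock.
    static int replayWithStateCheck() {
        PrivateLocalizerSketch r = new PrivateLocalizerSketch();
        r.claimWithStateCheck();   // runner A starts the download
        r.state = State.LOCALIZED; // state is updated first
        r.lock.release();          // then the lock is released
        r.claimWithStateCheck();   // runner B sees LOCALIZED and backs off
        return r.downloads;
    }
}
```

      Here replayLockOnly() counts two downloads and replayWithStateCheck() counts one; the only differences are the order of the state update relative to the release and the state check after acquiring the lock.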

      In both places, the root cause is that all the threads try to acquire the lock on the resource, but the current state of the LocalizedResource is not taken into consideration.


        1. yarn-547-20130418.patch (46 kB, Omkar Vinit Joshi)
        2. yarn-547-20130416.patch (19 kB, Omkar Vinit Joshi)
        3. yarn-547-20130416.1.patch (34 kB, Omkar Vinit Joshi)
        4. yarn-547-20130415.patch (18 kB, Omkar Vinit Joshi)
        5. yarn-547-20130412.patch (15 kB, Omkar Vinit Joshi)
        6. yarn-547-20130411.patch (16 kB, Omkar Vinit Joshi)
        7. yarn-547-20130411.1.patch (20 kB, Omkar Vinit Joshi)

              Assignee: Omkar Vinit Joshi (ojoshi)
              Reporter: Omkar Vinit Joshi (ojoshi)