Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1526

Dfs client name for a map/reduce task should have some randomness

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Make a client name has this format: DFSClient_applicationid_randomint_threadid, where applicationid = mapred.task.id or else = "NONMAPREDUCE".

      Description

      Fsck shows one of the files in our dfs cluster is corrupt.

      /bin/hadoop fsck aFile -files -blocks -locations
      aFile: 4633 bytes, 2 block(s):
      aFile: CORRUPT block blk_-4597378336099313975
      OK
      0. blk_-4597378336099313975_2284630101 len=0 repl=3 [...]
      1. blk_5024052590403223424_2284630107 len=4633 repl=3 [...]Status: CORRUPT

      On disk, these two blocks are of the same size and the same content. It turns out the writer of the file is from a multiple threaded map task. Each thread may write to the same file. One possible interaction of two threads might make this to happen:
      [T1: create aFile] [T2: delete aFile] [T2: create aFile][T1: addBlock 0 to aFile][T2: addBlock1 to aFile]...

      Because T1 and T2 have the same client name, which is the map task id, the above interactions could be done without any lease exception, thus eventually leading to a corrupt file. To solve the problem, a mapreduce task's client name could be formed by its task id followed by a random number.

      1. randClientId3.patch
        1 kB
        Hairong Kuang
      2. randClientId2.patch
        0.9 kB
        Hairong Kuang
      3. randClientId1.patch
        13 kB
        Hairong Kuang
      4. clientName.patch
        0.5 kB
        Hairong Kuang

        Activity

        Hide
        Hairong Kuang added a comment -

        Here comes the patch.

        Show
        Hairong Kuang added a comment - Here comes the patch.
        Hide
        Koji Noguchi added a comment -

        Hairong, using a random number (with time as a seed) for getting a unique filename reminds me of an old old data corruption bug we looked at. When dfsclient was writing to temporary file before Dhruba's HADOOP-1707, we had cases of multiple tasks using the same seed/time resulting in data corruption. Wouldn't your patch have the same problem?

        Show
        Koji Noguchi added a comment - Hairong, using a random number (with time as a seed) for getting a unique filename reminds me of an old old data corruption bug we looked at. When dfsclient was writing to temporary file before Dhruba's HADOOP-1707 , we had cases of multiple tasks using the same seed/time resulting in data corruption. Wouldn't your patch have the same problem?
        Hide
        Hairong Kuang added a comment -

        Koji, your comment makes sense if all threads happen to have the same seed. Another option for this problem is to concatenate the thread id, which might be a safer.

        Show
        Hairong Kuang added a comment - Koji, your comment makes sense if all threads happen to have the same seed. Another option for this problem is to concatenate the thread id, which might be a safer.
        Hide
        Allen Wittenauer added a comment -

        -1 on random without some sort of check for uniqueness
        +1 on thread id if it is guaranteed to be unique

        ... and this sounds like it should be a blocker for 0.22 and 0.21.1.

        Show
        Allen Wittenauer added a comment - -1 on random without some sort of check for uniqueness +1 on thread id if it is guaranteed to be unique ... and this sounds like it should be a blocker for 0.22 and 0.21.1.
        Hide
        dhruba borthakur added a comment -

        To address Allen's comments, we should make the Random object be a static object so that the generated random numbers do not repeat within the same DistributedFileSystem object.

        Show
        dhruba borthakur added a comment - To address Allen's comments, we should make the Random object be a static object so that the generated random numbers do not repeat within the same DistributedFileSystem object.
        Hide
        Hairong Kuang added a comment -

        This patch makes random number generator to be static and concatenates a random number and the thread id to the client name.

        Show
        Hairong Kuang added a comment - This patch makes random number generator to be static and concatenates a random number and the thread id to the client name.
        Hide
        dhruba borthakur added a comment -

        I think you attached the wrong patch to this JIRA

        Show
        dhruba borthakur added a comment - I think you attached the wrong patch to this JIRA
        Hide
        Hairong Kuang added a comment -

        Oops, here is the right one.

        Show
        Hairong Kuang added a comment - Oops, here is the right one.
        Hide
        dhruba borthakur added a comment -

        +1. I do not think we need any unit tests for this one.

        Show
        dhruba borthakur added a comment - +1. I do not think we need any unit tests for this one.
        Hide
        Mahadev konar added a comment -

        was just thinking abt this, would it be better for log parsing if both the DFSClient names looked the same? Meaning with and without mapreduce?

        something like:

        DFSClient_applicationid_randomint_threadid (where applicationid = mapred.task.id or else = "null" or some other constant).

        I dont know much abt the dfsclient code so dont know if this would be useful or not.

        Show
        Mahadev konar added a comment - was just thinking abt this, would it be better for log parsing if both the DFSClient names looked the same? Meaning with and without mapreduce? something like: DFSClient_applicationid_randomint_threadid (where applicationid = mapred.task.id or else = "null" or some other constant). I dont know much abt the dfsclient code so dont know if this would be useful or not.
        Hide
        dhruba borthakur added a comment -

        I agree with Mahadev, we could follow his suggestion.

        Show
        dhruba borthakur added a comment - I agree with Mahadev, we could follow his suggestion.
        Hide
        Hairong Kuang added a comment -

        Which constant string we should choose? null seems not user friendly. How about an empty string or "NONMAPREDUCE"?

        Show
        Hairong Kuang added a comment - Which constant string we should choose? null seems not user friendly. How about an empty string or "NONMAPREDUCE"?
        Hide
        Mahadev konar added a comment -

        agreed. NONMAPREDUCE seems fine (or you can use MAHADEV, just kidding!! )

        Show
        Mahadev konar added a comment - agreed. NONMAPREDUCE seems fine (or you can use MAHADEV, just kidding!! )
        Hide
        Hairong Kuang added a comment -

        Here is the patch that uses "NONMAPREDUCE" as app name for dfs clients that are not mapreduce tasks. (MAHADEV sounds good to me too.

        Show
        Hairong Kuang added a comment - Here is the patch that uses "NONMAPREDUCE" as app name for dfs clients that are not mapreduce tasks. (MAHADEV sounds good to me too.
        Hide
        Mahadev konar added a comment -

        +1 the patch looks good.

        Show
        Mahadev konar added a comment - +1 the patch looks good.
        Hide
        Mahadev konar added a comment -

        resubmitting for hudson.

        Show
        Mahadev konar added a comment - resubmitting for hudson.
        Hide
        Hairong Kuang added a comment -

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appe
        [exec] ar to include any new or modified tests.
        [exec] Please justify why no new tests are needed for this patch.
        [exec] Also please list what manual steps were performed to verify this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        [exec]
        [exec] +1 system test framework. The patch passed system test framework compile

        Show
        Hairong Kuang added a comment - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appe [exec] ar to include any new or modified tests. [exec] Please justify why no new tests are needed for this patch. [exec] Also please list what manual steps were performed to verify this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile
        Hide
        Hairong Kuang added a comment -

        All unit tests passed except for known failed ones.

        Show
        Hairong Kuang added a comment - All unit tests passed except for known failed ones.
        Hide
        Hairong Kuang added a comment -

        I just committed this!

        Show
        Hairong Kuang added a comment - I just committed this!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/ )
        Hide
        Harsh J added a comment -

        Noticed this today while poking around with 0.23.0:

        If its not mapreduce, labelling it as 'NONMAPREDUCE' only makes it harder to grep, cause there's still some 'MAPREDUCE' in it? Its a nitpick (cause IDs don't carry that string), but perhaps you may consider switching to something more 'REGULAR'?

        Show
        Harsh J added a comment - Noticed this today while poking around with 0.23.0: If its not mapreduce, labelling it as 'NONMAPREDUCE' only makes it harder to grep, cause there's still some 'MAPREDUCE' in it? Its a nitpick (cause IDs don't carry that string), but perhaps you may consider switching to something more 'REGULAR'?

          People

          • Assignee:
            Hairong Kuang
            Reporter:
            Hairong Kuang
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development