Hadoop Common / HADOOP-158

dfs should allocate a random blockid range to a file, then assign ids sequentially to blocks in the file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      A random number generator is used to allocate block ids in dfs. Sometimes a block id is allocated that is already used in the filesystem, which causes filesystem corruption.

      A short-term fix for this is to simply check when allocating block ids whether any file is already using the newly allocated id, and, if it is, generate another one. There can still be collisions in some rare conditions, but these are harder to fix and will wait, since this simple fix will handle the vast majority of collisions.
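      As an editorial illustration of this short-term fix (a minimal sketch with hypothetical class and method names, not the actual namenode code), the allocation loop could look roughly like this:

          import java.util.HashSet;
          import java.util.Random;
          import java.util.Set;

          // Hypothetical sketch: draw random 64-bit block ids and re-draw on collision.
          public class RandomBlockIdAllocator {
              private final Random random = new Random();
              // Ids of every block the namenode currently knows about (illustrative bookkeeping only).
              private final Set<Long> idsInUse = new HashSet<>();

              // Keep drawing random ids until one is not already in use, then reserve it.
              public synchronized long allocateBlockId() {
                  long id;
                  do {
                      id = random.nextLong();
                  } while (!idsInUse.add(id)); // add() returns false if the id was already present
                  return id;
              }

              // Called when a block is removed, so its id can eventually be reused.
              public synchronized void releaseBlockId(long id) {
                  idsInUse.remove(id);
              }
          }

      As the description notes, this handles the common case; the check and the reservation are a single synchronized step here so that two concurrent allocations cannot both claim the same id.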

          Activity

          Doug Cutting added a comment -

          Using the formulas in:

          http://en.wikipedia.org/wiki/Birthday_paradox#Generalisation

          I think it is very unlikely that, with 64-bit block ids and a decent random number generator, we are actually seeing collisions. It seems more likely that the symptoms ascribed to duplicate block id allocations are actually the result of other bugs. Still, it would be more comfortable not to rely on random block id allocation long-term.
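          As a rough editorial check of this claim (not part of the original comment), the generalized birthday approximation for n ids drawn uniformly at random from d possible values gives

              p ≈ 1 - e^(-n(n-1)/(2d)) ≈ n^2 / (2d)

          With d = 2^64 ≈ 1.8 × 10^19, even a filesystem holding n = 10^7 blocks has a collision probability of only about (10^7)^2 / (2 × 1.8 × 10^19) ≈ 2.7 × 10^-6, which supports the view that observed corruption is more likely caused by other bugs.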

          Sameer Paranjpye added a comment -

          Block id collisions have largely been addressed by the fix in HADOOP-146. The namenode first checks for the presence of a randomly generated block id before assigning it to a block.

          Longer term, we should implement a scheme where the namenode allocates large block id ranges to files. When a file is created, the namenode generates a 5-byte integer, a range-id, for the file, checking for collisions and re-generating the range-id if necessary. Blocks for the new file are then assigned 8-byte block ids sequentially, where the high 5 bytes are the range-id and the low 3 bytes correspond to the block number within the file. Blocks in a file then get the ids rangeid-0, rangeid-1, ..., rangeid-(N-1), where N is the number of blocks in the file. This lets us assign up to a trillion ranges and lets each file grow to 0.5, 1, 2, ... petabytes depending on whether the block size is 32, 64, 128, ... MB. The scheme has the additional benefit of saving some memory per block at the namenode.

          There is the scenario of a file being deleted while a node holding some of its blocks is down, its range-id being re-assigned to another file, and collisions appearing when the node later comes back. To get around this, the namenode tags each block in a file with the creation time of the file. When a collision occurs the timestamps will differ; the most recent timestamp wins and old blocks are discarded.
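          As an editorial sketch of how such a composite id could be packed (hypothetical names and bit layout inferred from the comment above, not actual Hadoop code):

              // Hypothetical packing of a 64-bit block id from a 40-bit (5-byte) range-id
              // and a 24-bit (3-byte) block number, as described in the comment above.
              public class RangeBlockId {
                  static final int BLOCK_BITS = 24;                      // low 3 bytes: block number within the file
                  static final long BLOCK_MASK = (1L << BLOCK_BITS) - 1; // 0xFFFFFF
                  static final long RANGE_MASK = (1L << 40) - 1;         // 5-byte range-id

                  // Compose a block id: high 5 bytes = range-id, low 3 bytes = block index.
                  static long blockId(long rangeId, int blockIndex) {
                      return ((rangeId & RANGE_MASK) << BLOCK_BITS) | (blockIndex & BLOCK_MASK);
                  }

                  static long rangeIdOf(long blockId)    { return blockId >>> BLOCK_BITS; }
                  static int  blockIndexOf(long blockId) { return (int) (blockId & BLOCK_MASK); }

                  public static void main(String[] args) {
                      long id = blockId(0x12345678AL, 3);   // fourth block of range 0x12345678A
                      System.out.printf("id=%x range=%x index=%d%n", id, rangeIdOf(id), blockIndexOf(id));
                  }
              }

          With 40 bits of range-id there are 2^40 ≈ 1.1 trillion ranges, and 24 bits of block number allow 2^24 ≈ 16.8 million blocks per file, which at a 32 MB block size is roughly 0.5 PB, matching the figures above.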

          Doug Cutting added a comment -

          Why must the file-id part of the block id be random? Can't that be sequential?

          Sameer Paranjpye added a comment -

          It can be sequential. In that case, the namenode would need to determine the lowest unused file-id at startup and start file-id assignments from that point.

          Even sequential allocation of file-ids should probably do the collision check, because you don't need a trillion files in the system before you wrap around; you only need a trillion file creation events. If you're doing the collision check in both schemes, the random file-id assignment keeps things simpler.

          The possibility of collision with sequential assignment of file-ids is very remote, but why expose ourselves? I'm probably being paranoid so ignore me on this one if you want.

          Yoram Arnon added a comment -

          It could, but that would make upgrading an existing file system harder: one would have to "compress" the ids before upgrading, then keep track of the high-water-mark file id. The upgrade wouldn't be as smooth as when selecting 'clean' ranges randomly, and would require metadata conversion. Blocks on individual datanodes may even need to change their ids.
          Or one could copy/move the entire filesystem to a new, clean one.
          The selected method allows for a seamless upgrade, requiring no conversion, and would work equally well with a 2^64 id address space.

          Doug Cutting added a comment -

          I would think that random allocation would make collisions more likely, not less. We always know which block ids are used by complete files. The concern is only about block ids which have been recently allocated to a file, but the file is somehow not yet complete. So, with sequential allocation, a collision can only happen if the probe key (the next block id to allocate) wraps all the way around before a file is completed, while with random allocation it can happen much more frequently. We simply have to make sure that probe key increments are logged to the edits file along with other file system changes. Am I missing something?
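          A minimal editorial sketch of this sequential probe-key idea (hypothetical names; logEdit is a placeholder for whatever edit-log write the namenode would actually perform):

              import java.util.HashSet;
              import java.util.Set;

              // Hypothetical sketch: hand out block ids sequentially, skipping any id still in use,
              // and persist the probe key so it survives a namenode restart.
              public class SequentialBlockIdAllocator {
                  private long probeKey;                            // next candidate block id
                  private final Set<Long> idsInUse = new HashSet<>();

                  public synchronized long allocateBlockId() {
                      // Advance past ids still in use; a collision requires the key to wrap
                      // all the way around while such an id is still live.
                      while (!idsInUse.add(probeKey)) {
                          probeKey++;
                      }
                      long id = probeKey++;
                      logEdit(probeKey);                            // record the increment with other edits
                      return id;
                  }

                  // Placeholder: a real namenode would append this to the edits file.
                  private void logEdit(long nextProbeKey) { }
              }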

          Sameer Paranjpye added a comment -

          Yes, random assignment of file-ids makes collisions more likely. However, collisions are possible even with sequential assignment, and if they are possible they need to be detected. Since collision detection code is needed with both random and sequential assignment, random assignment makes the system simpler because the namenode doesn't have to track the 'high watermark' file-id.

          I don't think recently assigned file-ids that belong to incomplete files are a concern, since the namenode will be aware of all file-ids used, whether they belong to incomplete files or not.

          Wrap-around before a file completes is not the only collision scenario. In the sequential assignment scheme, suppose the first million files in the system get the file-ids 0-999999. These files are archival data of some kind, so they are never deleted. Life goes on, lots of files are created and removed, and at any given time there are only a few million files total (complete + incomplete) in the system. At some point the system will have gone through a trillion file creation events, the file-ids will wrap, and they will start to collide with the first million files.

          Doug Cutting added a comment -

          > At some point, the system will have gone through a trillion file creation events [ ... ]

          We generally aim for no more than one block created per drive every 100 milliseconds, so that transfer dominates seek. With 10,000 nodes, each with four drives, that would give a maximum block creation rate of 400k/second (assuming a replication level of one). At that rate it would take well over 100,000 years to exhaust all 64-bit block ids. I wonder what version Hadoop will have then?
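          (Editorial check of the arithmetic, not part of the original comment: 10,000 nodes × 4 drives × 10 blocks per drive per second = 4 × 10^5 allocations per second; 2^64 ≈ 1.8 × 10^19 ids divided by 4 × 10^5 per second is about 4.6 × 10^13 seconds, or roughly 1.5 million years, so the 100,000-year figure is indeed conservative.)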


            People

            • Assignee: Konstantin Shvachko
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 0
