Hadoop HDFS / HDFS-15420

approx scheduled blocks not resetting over time


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0, 3.0.0
    • Fix Version/s: None
    • Component/s: block placement
    • Labels: None
    • Environment: Our 2.6.0 environment is a 3-node cluster running cdh5.15.0.
      Our 3.0.0 environment is a 4-node cluster running cdh6.3.0.

    Description

      We have been accumulating large numbers of scheduled blocks that never get cleared out. This prevents new blocks from being placed even when there is plenty of space on the system.
      Here is an example of the block growth over 24 hours on one of our systems running 2.6.0 (see attached screenshot).

      Here is an example of the block growth over 24 hours on one of our systems running 3.0.0 (see attached screenshot).

      https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue we were having on 2.6.0, so the growth has decreased since upgrading to 3.0.0. However, there still appears to be a systemic growth in scheduled blocks over time, and we still occasionally need to restart the namenode to reset this count. I have not determined what is causing the leaked blocks in 3.0.0.

      Looking into the issue, I discovered that the intention is for the scheduled count to slowly drain back to 0 after errors cause blocks to be leaked:

        /** Increment the number of blocks scheduled. */
        void incrementBlocksScheduled(StorageType t) {
          currApproxBlocksScheduled.add(t, 1);
        }
        
        /** Decrement the number of blocks scheduled. */
        void decrementBlocksScheduled(StorageType t) {
          if (prevApproxBlocksScheduled.get(t) > 0) {
            prevApproxBlocksScheduled.subtract(t, 1);
          } else if (currApproxBlocksScheduled.get(t) > 0) {
            currApproxBlocksScheduled.subtract(t, 1);
          } 
          // its ok if both counters are zero.
        }
        
        /** Adjusts curr and prev number of blocks scheduled every few minutes. */
        private void rollBlocksScheduled(long now) {
          if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) {
            prevApproxBlocksScheduled.set(currApproxBlocksScheduled);
            currApproxBlocksScheduled.reset();
            lastBlocksScheduledRollTime = now;
          }
        }
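
      The self-correction relies on rollBlocksScheduled() overwriting prevApproxBlocksScheduled rather than adding to it: whatever count is still sitting in prev when the roll fires is simply discarded. On an idle datanode, a leaked block therefore drains in at most two rolls. A hypothetical trace, assuming one leaked block and no other traffic:

        leak:  curr=1, prev=0
        roll:  curr=0, prev=1   // the leak moves into prev
        roll:  curr=0, prev=0   // prev is overwritten by the empty curr; leak discarded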
      

      However, this code does not do what is intended when the system has a constant flow of written blocks. Once a leaked count makes it into prevApproxBlocksScheduled, the next scheduled block increments currApproxBlocksScheduled, and when that block completes, the decrement drains prevApproxBlocksScheduled instead. The leaked count is never removed from the approximate total; it is merely handed back to currApproxBlocksScheduled before the next roll. So, for errors to be corrected, we would have to write no data for the entire 10-minute roll period. Because the number of blocks we write per 10 minutes is quite high, the error in the approximate counts can grow very large.
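
      To make the failure mode concrete, here is a minimal, self-contained simulation of the curr/prev logic quoted above (a hypothetical ScheduledBlocksLeakSim class written for this ticket, not Hadoop code). It assumes a steady flow of writes in which each block completes within the same roll interval it was scheduled in, which is realistic given that blocks take seconds to write while the roll interval is 10 minutes:

        /** Hypothetical stand-alone model of the approx-scheduled counters. */
        public class ScheduledBlocksLeakSim {
          static long curr = 0; // models currApproxBlocksScheduled
          static long prev = 0; // models prevApproxBlocksScheduled

          static void increment() { curr++; }

          /** Same policy as decrementBlocksScheduled(): drain prev before curr. */
          static void decrement() {
            if (prev > 0) {
              prev--;
            } else if (curr > 0) {
              curr--;
            }
          }

          /** Same as rollBlocksScheduled(): prev is overwritten, curr restarts at 0. */
          static void roll() {
            prev = curr;
            curr = 0;
          }

          /** One roll interval with a steady flow of writes. */
          static void interval(long writes) {
            for (long i = 0; i < writes; i++) {
              increment(); // a block is scheduled
              decrement(); // it is reported received; this drains prev, not curr
            }
            roll();
          }

          public static void main(String[] args) {
            increment(); // the leak: one block scheduled but never reported back

            for (int n = 0; n < 1000; n++) {
              interval(10_000); // constant traffic across 1000 roll intervals
            }
            System.out.println("busy: " + (curr + prev)); // prints "busy: 1"

            interval(0); // a single idle roll interval
            System.out.println("idle: " + (curr + prev)); // prints "idle: 0"
          }
        }

      Every completion that should cancel an increment in curr cancels the stale count in prev instead, handing the leak back to curr just in time for the next roll; only a fully idle interval lets it drain.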

      The comments in the ticket for the original implementation (https://issues.apache.org/jira/browse/HADOOP-3707) suggest this issue was known, though it's not clear to me whether its severity was understood at the time:
      > So if there are some blocks that are not reported back by the datanode, they will eventually get adjusted (usually 10 min; bit longer if datanode is continuously receiving blocks).
      The comment suggests the count will eventually get cleared out, but in our case, it never does.

      Attachments

      People

        Assignee: Unassigned
        Reporter: Max Mizikar (maxmzkr)
        Votes: 0
        Watchers: 5
