Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:
      10-node write-heavy cluster

Description

The .oldlogs folder is never cleaned up. hbase.master.logcleaner.ttl has been set so that old logs are removed, but the cleanup never kicks in. The limit of 10 files is not the problem. After running for 5 days, not a single log file has been deleted, even though the log cleaner TTL is set to 2 days (down from the default of 7 days). Our assumption is that the replication changes, which try to keep these logs around in case they are needed, are blocking the cleanup. No replication is defined (knowingly).
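
For reference, the TTL is expressed in milliseconds in hbase-site.xml. A minimal sketch of the 2-day setting described above (hbase.master.logcleaner.ttl is the real property name; the value is just the 2-day figure from this report):

    <!-- hbase-site.xml: how long archived logs stay in .oldlogs before the cleaner may delete them -->
    <property>
      <name>hbase.master.logcleaner.ttl</name>
      <value>172800000</value> <!-- 2 days = 2 * 24 * 3600 * 1000 ms -->
    </property>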

Activity

        Wayne created issue -
        Jean-Daniel Cryans added a comment -

        Hey Wayne,

        Could you grep the Namenode's log for "oldlogs" and attach the result in this jira? I'd like to see what kind of log archival rate you have and if logs are really deleted at all. Thanks!
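
        Something along these lines would do it (the NameNode log location varies per install, so the path below is an assumption):

            # Pull every NameNode log line that mentions the WAL archive directory
            grep "oldlogs" /var/log/hadoop/*namenode*.log > oldlog.txt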

        Wayne added a comment -

        Attached are the name node entries with .oldlog. After running on .90 for 5 days I had more than 18 TB of old logs, while the data size was only ~2 TB. Not having known about this until 2 days ago, I had miscalculated our required production cluster size (good news). I think these logs should normally be deleted after 1 day, and at a pace able to keep up with heavy writes.

        I believe there is a bug from replication in .90, as no logs ever seemed to get deleted. But even if it worked as designed, I would have to wait 7 days, and at the rate I was load testing, the 40 TB limit of our test cluster might have been reached with actual data at only 10% of that. I question whether the 7-day limit is a good default. It causes novices like myself to think the data size is a lot bigger than it is; I was even convinced LZO compression was not working, given the spike in disk usage.

        Wayne made changes -
        Attachment: oldlog.txt [ 12469770 ]
        Todd Lipcon added a comment -

        Anyone reproduced this recently?

        Josh Wymer added a comment -

        We are seeing this on our replication cluster using 0.90.4. The /hbase/.oldlogs directory is filled with logs that are ~1 month old.
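
        A quick way to confirm the size and age of what is sitting in the archive (standard HDFS shell commands, using the path from the comment above):

            hadoop fs -dus /hbase/.oldlogs        # total bytes in the archive (spelled -du -s on newer Hadoop)
            hadoop fs -ls /hbase/.oldlogs | head  # modification times show how old the logs are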

        Josh Wymer added a comment -

        After turning replication off on the slave cluster, the .oldlogs were cleaned up. So it appears HBase thinks that the slave cluster intends to replicate as well, and therefore doesn't clean the logs.
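
        (For reference, the 0.90.x shell exposed global replication switches; a sketch, run from the hbase shell:)

            stop_replication     # halts shipping cluster-wide
            start_replication    # resumes it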

        Josh Wymer made changes -
        Link: This issue relates to HBASE-5222
        Neil Yalowitz added a comment -

        We are seeing this as well using 0.90.4. The .oldlogs folder contains items from as far back as ~1 month ago, although not as a perfect sequential run of days: the datestamps show "banding," with several sequential days, then several days skipped, then several more sequential days.

        Harsh J added a comment -

        Hi Neil,

        0.90.x releases are no longer maintained upstream - perhaps you can try upgrading to a current 0.92- or 0.94-based release, which haven't gotten reports of this bug?

        Dave Latham added a comment -

        The ReplicationLogCleaner prevents any oldlogs from being cleaned if replication is enabled in the config (hbase-site.xml) but stopped (via stop_replication). That caused this issue for us, and my guess is it might be behind other people's cases too.
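
        Concretely, the stuck combination is replication switched on in the site config while shipping has been stopped cluster-wide. An illustrative hbase-site.xml fragment (hbase.replication is the real switch; if replication isn't actually used, setting it to false lets the TTL cleaner run):

            <!-- With this set to true, ReplicationLogCleaner vetoes deletion of .oldlogs
                 even after stop_replication has halted shipping -->
            <property>
              <name>hbase.replication</name>
              <value>true</value>
            </property>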

        Lars Hofhansl added a comment -

        So is this still the case? That seems bad and wrong.

        Lars Hofhansl added a comment -

        Especially since replication is now enabled by default... Jean-Daniel Cryans, FYI.

        Jean-Daniel Cryans added a comment -

        We removed the kill switch; start_replication/stop_replication don't exist anymore.
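
        (For later readers: replication is instead managed per peer from the hbase shell; the peer id below is illustrative:)

            list_peers           # show configured replication peers
            remove_peer '1'      # stop shipping to, and retaining logs for, peer 1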


People

    • Assignee: Unassigned
    • Reporter: Wayne
    • Votes: 0
    • Watchers: 11
