HBASE-26398

CellCounter fails for large tables filling up local disk


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.7, 2.5.0, 3.0.0-alpha-2, 2.3.7, 2.4.8
    • Fix Version/s: 2.5.0, 2.2.8, 3.0.0-alpha-2, 2.3.8, 2.4.9
    • Component/s: mapreduce
    • Labels: None

    Description

      CellCounter dumps all cell coordinates into its output, which can become huge.

      The spill can fill the local disk on the reducer.
      CellCounter hardcodes mapreduce.job.reduces to 1, so it is not possible to use multiple reducers to get around this.

      This is easy to fix: if mapreduce.job.reduces is no longer hardcoded, it still defaults to 1 but can be overridden by the user.
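
The "default to 1, but overridable" behaviour can be sketched as below. This is a minimal illustration only, not the actual CellCounter patch: a plain Map stands in for the Hadoop job configuration, and the class and method names (ReducerCountSketch, resolveReducers) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ReducerCountSketch {
    static final String REDUCES_KEY = "mapreduce.job.reduces";

    // Stand-in for reading the job configuration (assumption: a plain Map,
    // not Hadoop's Configuration). Returns the user's value, or 1 if unset.
    static int resolveReducers(Map<String, String> conf) {
        String value = conf.get(REDUCES_KEY);
        return value == null ? 1 : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // No user setting: the job keeps its single reducer.
        System.out.println(resolveReducers(conf));

        // A user-supplied override is respected instead of being overwritten.
        conf.put(REDUCES_KEY, "4");
        System.out.println(resolveReducers(conf));
    }
}
```

The point of the design is that the tool never writes the property itself; it only reads it, so an unset value behaves exactly as before.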

      CellCounter also generates two extra records with constant keys for each cell, which have to be processed by the reducer.
      Even with multiple reducers, these (1/3 of the total records) will go to the same reducer, which can also fill up the disk.
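
Why constant keys defeat multiple reducers can be illustrated with the formula used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. The sketch below is an assumption-laden stand-in: it uses String.hashCode() in place of the real key type, and the key names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class ConstantKeySkewSketch {
    // Same formula as Hadoop's default HashPartitioner; String.hashCode()
    // stands in for the hashCode() of the real key type (assumption).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        String counterKey = "Total Cells"; // hypothetical constant key

        // Emit the constant-key record once per cell and collect its targets.
        Set<Integer> targets = new HashSet<>();
        for (int cell = 0; cell < 1000; cell++) {
            targets.add(partition(counterKey, numReducers));
        }
        // A constant key always hashes to the same partition, so all 1000
        // records land on one reducer no matter how many reducers exist.
        System.out.println("distinct reducers used: " + targets.size());
    }
}
```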

      This can be fixed by adding a Combiner to the Mapper that sums the counter records, thereby reducing the Mapper output records to 1/3 of their previous amount, which can then be evenly distributed between the reducers.
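
The effect of a summing combiner can be sketched in plain Java. This is a simulation under stated assumptions, not the code from the fix: each cell is assumed to yield one coordinate record plus two constant-key counter records, and the key names COUNTER_A and COUNTER_B are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Mapper stand-in: per cell, one coordinate record and two
    // constant-key counter records, each with a count of 1.
    static List<Map.Entry<String, Long>> map(List<String> cells) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String cell : cells) {
            out.add(Map.entry(cell, 1L));
            out.add(Map.entry("COUNTER_A", 1L)); // hypothetical key
            out.add(Map.entry("COUNTER_B", 1L)); // hypothetical key
        }
        return out;
    }

    // Combiner stand-in: sums the counts per key on the mapper side, so each
    // constant key leaves the mapper as a single record.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : records) {
            sums.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> cells = List.of("row1;cf:a", "row1;cf:b", "row2;cf:a", "row2;cf:b");
        List<Map.Entry<String, Long>> mapped = map(cells);
        Map<String, Long> combined = combine(mapped);
        System.out.println(mapped.size() + " records before, " + combined.size() + " after");
    }
}
```

For N distinct cells this simulation emits 3N records before combining and N + 2 after, which approaches one third as N grows. The remaining N coordinate records have distinct keys, so they can spread across reducers.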


          People

            Assignee: Istvan Toth (stoty)
            Reporter: Istvan Toth (stoty)