  1. Hadoop Common
  2. HADOOP-11188

hadoop-azure: automatically expand page blobs when they become full

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7.0
    • Component/s: fs
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size) and cannot be expanded. This task is to make them automatically expand when they get to be nearly full.

      Design: if a write does not have enough room left in the file to finish, then flush all preceding operations, extend the file, and complete the write. Access to PageBlobOutputStream will be synchronized during this sequence (so it has exclusive access), so there won't be race conditions.
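
      The flush, extend, write sequence can be sketched as follows. This is only an illustration under stated assumptions: the class ExpandingPageBlobSketch and its in-memory bookkeeping are hypothetical stand-ins, not the real PageBlobOutputStream, and no Azure calls are made. It also assumes every write is no larger than the extension size, so a single extension is always enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a page blob output stream.
// It only tracks sizes and the order of operations; it stores no data.
final class ExpandingPageBlobSketch {
    private long capacity;            // current size of the "blob" in bytes
    private long position;            // bytes written so far
    private final long extensionSize; // bytes added per extension
    private final List<String> events = new ArrayList<>();

    ExpandingPageBlobSketch(long initialCapacity, long extensionSize) {
        this.capacity = initialCapacity;
        this.extensionSize = extensionSize;
    }

    // synchronized gives one writer exclusive access for the whole
    // flush -> extend -> write sequence.
    public synchronized void write(int length) {
        if (position + length > capacity) {
            flush();                   // finish all preceding operations
            capacity += extensionSize; // extend the file
            events.add("extend");
        }
        position += length;            // complete the write
        events.add("write");
    }

    public synchronized void flush() {
        events.add("flush");
    }

    public synchronized long capacity() {
        return capacity;
    }

    public synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        ExpandingPageBlobSketch blob = new ExpandingPageBlobSketch(1024, 4096);
        blob.write(1000); // fits in the initial 1024 bytes
        blob.write(100);  // does not fit: flush, extend, then write
        System.out.println(blob.events() + " capacity=" + blob.capacity());
    }
}
```

      Because Java monitors are reentrant, the call to flush() from inside write() does not deadlock.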

      The file will be extended by fs.azure.page.blob.extension.size bytes, which must be a multiple of 512. The internal default for fs.azure.page.blob.extension.size will be 128 * 1024 * 1024 bytes. The minimum extension size will be 4 * 1024 * 1024 bytes, which is the maximum write size, so the new write will always be able to finish.

      Extension will stop when the file size reaches 1TB. The final extension may be less than fs.azure.page.blob.extension.size if the remainder (1TB - current_file_size) is smaller than fs.azure.page.blob.extension.size.
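
      The size rules above can be worked through in a few lines. This is a sketch, not the committed implementation: the class and method names are hypothetical, 1TB is assumed to mean 2^40 bytes, and rejecting an undersized extension (rather than rounding it up to the minimum) is also an assumption.

```java
// Hypothetical helper illustrating the extension size arithmetic.
final class PageBlobExtensionMath {
    static final long PAGE_SIZE = 512L;
    static final long MIN_EXTENSION = 4L * 1024 * 1024;           // maximum write size
    static final long DEFAULT_EXTENSION = 128L * 1024 * 1024;
    static final long MAX_BLOB_SIZE = 1024L * 1024 * 1024 * 1024; // 1TB, assumed 2^40 bytes

    // Checks a configured fs.azure.page.blob.extension.size value.
    static long validateExtensionSize(long configured) {
        if (configured % PAGE_SIZE != 0) {
            throw new IllegalArgumentException("extension size must be a multiple of 512");
        }
        if (configured < MIN_EXTENSION) {
            // Assumption: reject values below the minimum instead of clamping.
            throw new IllegalArgumentException("extension size must be at least 4MB");
        }
        return configured;
    }

    // Returns the file size after one extension, never exceeding 1TB.
    static long nextSize(long currentSize, long extensionSize) {
        long remainder = MAX_BLOB_SIZE - currentSize;
        if (remainder <= 0) {
            return currentSize; // already at the limit: extension stops
        }
        // The final extension may be smaller than extensionSize.
        return currentSize + Math.min(extensionSize, remainder);
    }

    public static void main(String[] args) {
        long ext = validateExtensionSize(DEFAULT_EXTENSION);
        System.out.println(nextSize(0L, ext));
        System.out.println(nextSize(MAX_BLOB_SIZE - 1024L * 1024, ext) == MAX_BLOB_SIZE);
    }
}
```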

      An alternative to this is to make the default size 1TB. This is much simpler to implement. It's a one-line change. Or even simpler, don't change it at all because it is adequate for HBase.

      Rationale for this file size extension feature:

      1) be able to download files to local disk easily with CloudXplorer and similar tools. Downloading a 1TB page blob is not practical if you don't have 1TB disk space since on the local side it expands to the full file size, locally filled with zeros where there is no valid data.

      2) don't make customers uncomfortable when they see large 1TB files. They often ask if they have to pay for it, even though they only pay for the space actually used in the page blob.

      I think rationale 2 is a relatively minor issue, because 98% of customers for HBase will never notice. They will just use it and not look at what kind of files are used for the logs. They don't pay for the unused space, so it is not a problem for them. We can document this. Also, if they use hadoop fs -ls, they will see the actual size of the files since I put in a fix for that.

      Rationale 1 is a minor issue because you cannot interpret the data on your local file system anyway due to the data format. So really, the only reason to copy data locally in its binary format would be if you are moving it around or archiving it. Copying a 1TB page blob from one location in the cloud to another is pretty fast with smart copy utilities that don't actually move the 0-filled parts of the file.

      Nevertheless, this is a convenience feature for users. They won't have to worry about setting fs.azure.page.blob.size under normal circumstances and can make the files grow as big as they want.

      If we make the change to extend the file size on the fly, that introduces new possible error or failure modes for HBase. We should include retry logic.
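
      One possible shape for that retry logic is a small wrapper with exponential backoff around the extend operation. This is a generic sketch: RetrySketch and withRetries are hypothetical names, the choice to retry only on IOException is an assumption, and nothing here reflects how the actual patch handles failures.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry helper; not part of hadoop-azure.
final class RetrySketch {
    // Runs op, retrying on IOException up to maxAttempts times in total.
    // The delay doubles after each failed attempt.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated extend operation that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure");
            }
            return "extended";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```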

              People

              • Assignee:
                ehans Eric Hanson
                Reporter:
                ehans Eric Hanson