LIBCLOUD-269: Multipart upload for Amazon S3

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.1
    • Component/s: Storage
    • Labels: None

    Description

      This patch adds support for streaming data uploads using Amazon's multipart upload API, documented at http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html

      With the current behaviour, the upload_object_via_stream() API reads the entire object into memory and then uploads it to S3. This becomes problematic with large files (think HD videos of around 4 GB) and is a huge hit on the memory footprint and performance of the Python application.

      With this patch, upload_object_via_stream() uses the S3 multipart upload feature to upload the data in 5 MB chunks, reducing the overall memory impact on the application.
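
      For context, here is a minimal usage sketch of the streaming API (the container and file names are made up; with this patch the iterator is consumed in 5 MB parts instead of being buffered in full):

        from libcloud.storage.types import Provider
        from libcloud.storage.providers import get_driver

        driver = get_driver(Provider.S3)('my api key', 'my api secret')
        container = driver.get_container(container_name='my-videos')

        # The iterator can be any iterable yielding data, e.g. an open file.
        with open('/tmp/movie.mkv', 'rb') as iterator:
            obj = driver.upload_object_via_stream(iterator=iterator,
                                                  container=container,
                                                  object_name='movie.mkv')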

      Design of this feature:

      • The S3StorageDriver() is not used just for Amazon S3; it is subclassed by other S3-compatible cloud storage providers such as Google Storage.
      • Amazon's multipart upload is not (or may not be) supported by those other providers, which will prefer the chunked upload mechanism.

      We can solve this problem in two ways:
      1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver) which implements the new multipart upload mechanism, while the other storage providers keep subclassing S3StorageDriver. This is the cleaner approach.
      2) Introduce an attribute supports_s3_multipart_upload and, based on its value, control the callback function passed to the _put_object() API. This makes the code look a bit hacky, but the approach is better for supporting such features in the future: we don't have to keep adding subclasses for each feature.

      The current patch implements approach (2), though I prefer (1). After discussions with the community and knowing their preferences, we can select a final approach.
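
      To make approach (2) concrete, here is a rough sketch of the idea. Only supports_s3_multipart_upload, _put_object() and _upload_multipart() are names taken from this patch; everything else, including _upload_in_chunks() and the exact _put_object() signature, is illustrative:

        from libcloud.storage.base import StorageDriver

        class S3StorageDriver(StorageDriver):
            # Providers without multipart support (e.g. Google Storage)
            # would override this with False in their subclass.
            supports_s3_multipart_upload = True

            def upload_object_via_stream(self, iterator, container,
                                         object_name, extra=None):
                if self.supports_s3_multipart_upload:
                    # Stream the data to S3 in 5 MB parts.
                    upload_func = self._upload_multipart
                else:
                    # Fall back to the existing upload path.
                    upload_func = self._upload_in_chunks
                return self._put_object(container=container,
                                        object_name=object_name,
                                        upload_func=upload_func,
                                        iterator=iterator,
                                        extra=extra)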

      Design notes:

      • The implementation has three main steps, plus an abort path for failures (a rough sketch of the whole flow follows this list):
        1) POST to /container/object_name?uploads. This returns an XML document with a unique uploadId. This step is handled as part of _put_object(), which ensures that all S3-related parameters are set correctly.
        2) Each chunk is uploaded via PUT to /container/object_name?partNumber=X&uploadId=*** - this is done via the callback passed to _put_object(), named _upload_multipart().
        3) An XML document listing the part numbers and the ETag headers returned for each part is POSTed to /container/object_name?uploadId=***, implemented via _commit_multipart().
        4) If steps (2) or (3) fail, the upload is deleted from S3 with a DELETE request to /container/object_name?uploadId=****, implemented via _abort_multipart().
      • The chunk size for uploads was set to 5 MB - the minimum part size allowed as per the Amazon S3 docs.
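
      Put together, the request flow looks roughly like the sketch below. This is a simplification, not the patch itself: read_in_chunks() is a tiny local stand-in for whatever splits the stream into 5 MB pieces, the XML and ETag handling is condensed, and in the real code the steps live in _put_object(), _upload_multipart(), _commit_multipart() and _abort_multipart() as described above.

        from xml.etree import ElementTree as ET

        CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB - the minimum part size S3 allows
        S3_NS = '{http://s3.amazonaws.com/doc/2006-03-01/}'

        def read_in_chunks(fp, chunk_size):
            # Stand-in helper: yield successive chunk_size pieces from a
            # file-like object.
            while True:
                data = fp.read(chunk_size)
                if not data:
                    return
                yield data

        def multipart_upload_sketch(connection, object_path, fp):
            # object_path is e.g. '/my-container/movie.mkv'
            # Step 1: initiate the upload and read the uploadId from the XML.
            resp = connection.request(object_path + '?uploads', method='POST')
            upload_id = resp.object.findtext('.//%sUploadId' % S3_NS)

            try:
                # Step 2: PUT each 5 MB part and remember the ETag header
                # returned for it.
                etags = []
                for number, chunk in enumerate(read_in_chunks(fp, CHUNK_SIZE), 1):
                    resp = connection.request(
                        '%s?partNumber=%d&uploadId=%s'
                        % (object_path, number, upload_id),
                        method='PUT', data=chunk)
                    etag = resp.headers.get('etag') or resp.headers.get('ETag')
                    etags.append((number, etag))

                # Step 3: POST the part-number/ETag list to commit the upload.
                root = ET.Element('CompleteMultipartUpload')
                for number, etag in etags:
                    part = ET.SubElement(root, 'Part')
                    ET.SubElement(part, 'PartNumber').text = str(number)
                    ET.SubElement(part, 'ETag').text = etag
                connection.request('%s?uploadId=%s' % (object_path, upload_id),
                                   method='POST', data=ET.tostring(root))
            except Exception:
                # Step 4: on any failure, abort the upload so S3 discards
                # the parts that were already stored.
                connection.request('%s?uploadId=%s' % (object_path, upload_id),
                                   method='DELETE')
                raise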

      Other changes:

      • Did some PEP8 cleanup on s3.py
      • s3.get_container() used to iterate through the list of all containers to find the requested entry. This can be simplified by making a HEAD request for the container. The only downside is that 'created_time' is not available for the container. Let me know if this approach is OK or if I should revert it.
      • Introduced the following APIs for the S3StorageDriver(), to make some functionality easier.
        get_container_cdn_url()
        get_object_cdn_url()
      • In libcloud.common.base.Connection, the request() method is the basis for all HTTP requests made by libcloud. It has a limitation that became apparent while implementing S3 multipart uploads. To initialize an upload, the API invoked is
        /container/object_name?uploads
        The 'uploads' parameter has to be passed as-is, without any value. Had the "params" argument of request() been used, it would have been sent as 'uploads=***'. To prevent this, the 'action' is set to /container/object_name?uploads and slight modifications were made to how parameters are appended (see the sketch after the next paragraph).

      This also forced a change in BaseMockHttpObject._get_method_name()
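
      The idea, in a simplified standalone form (an illustration of the intended behaviour, not the actual change to Connection.request(); the uploadId value is made up):

        from urllib.parse import urlencode

        def append_params(action, params=None):
            # Leave a bare query component such as '?uploads' untouched and
            # append any key=value pairs after it with the right separator.
            if not params:
                return action
            separator = '&' if '?' in action else '?'
            return action + separator + urlencode(params)

        print(append_params('/container/object_name?uploads'))
        # -> /container/object_name?uploads
        print(append_params('/container/object_name',
                            {'partNumber': 1, 'uploadId': 'abc123'}))
        # -> /container/object_name?partNumber=1&uploadId=abc123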

      Bug fixes in test framework

      • While working on the test cases, I noticed a small issue. Not sure if it was a bug or as per design.
        MockRawResponse._get_response_if_not_availale() would return two different values on subsequent invocations.
        if not self._response:
            ...
            return self        <----- this was inconsistent
        return self._response

      While adding test cases for the Amazon S3 functionality, I noticed that instead of getting back a MockResponse, I was getting a MockRawResponse instance (which does not have methods like read() or parse_body()). So I fixed this issue. Because of this, other test cases started failing, and they were subsequently fixed. I am not sure whether this had to be fixed or whether it was done on purpose; if someone can throw some light on it, I can work on it further. As of now, all test cases pass.

      • In test_s3.py, the driver was being set to S3StorageDriver everywhere. The same test cases are reused for GoogleStorageDriver, where the driver then showed up as S3StorageDriver instead of GoogleStorageDriver. This was fixed by changing the code to use driver=self.driver_type.
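
      Roughly, the pattern looks like this (simplified; the real test classes also wire up the mock HTTP connection classes):

        import unittest

        from libcloud.storage.drivers.s3 import S3StorageDriver
        from libcloud.storage.drivers.google_storage import GoogleStorageDriver

        class S3Tests(unittest.TestCase):
            driver_type = S3StorageDriver

            def setUp(self):
                # Instantiate whatever driver the (sub)class declares instead
                # of hard-coding S3StorageDriver.
                self.driver = self.driver_type('key', 'secret')

        class GoogleStorageTests(S3Tests):
            driver_type = GoogleStorageDriver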

      Attachments

        1. libcloud-269.diff
          48 kB
          Mahendra M
        2. libcloud-269.diff
          33 kB
          Mahendra M

          People

            Assignee: Tomaz Muraus (kami)
            Reporter: Mahendra M (mahendra.m)
            Votes: 0
            Watchers: 2
