SLING-2707: Support of chunked file upload into Sling

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Servlets Post 2.3.4
    • Component/s: Servlets
    • Labels: None

      Description

      Use cases:
      1. Large file upload - With high-speed internet connections, the advent of the cloud, and HD content going mainstream, Sling should support uploads of large files (> 2 GB).
      2. Fault-tolerant uploads - Sling should provide the capability to resume an upload from the point of failure, rather than requiring the client to restart the complete upload.
      3. Faster upload - Sling should allow its clients to initiate multiple connections and upload file chunks in parallel.

      Attachments

      1. SLING-2707-svn.patch
        51 kB
        Shashank Gupta
      2. SLING-2707.patch
        54 kB
        Shashank Gupta
      3. uploadclient.jar
        4.95 MB
        Shashank Gupta

        Activity

        Julian Reschke added a comment -

        re 3: why would multiple connections actually help?

        Felix Meschberger added a comment -

        I agree with Julian on #3: If a client can fill its pipe, fine. But do we want the client to fill the pipe of the server? Probably not.

        #1 is certainly already possible, but in the case of interruption there is no recovery other than to restart from zero.

        #2 is probably really key, I agree.

        Yet I wonder whether this requires specialized (fat?) clients or whether it can be done from a vanilla browser. Using HTML5 features is fine for me.

        Shashank Gupta added a comment -

        >re: why would multiple connections actually help?
        Like download accelerators, which open multiple connections to the server, download file chunks in parallel, and finally merge them; e.g. http://www.internetdownloadmanager.com/

        > If a client can fill its pipe, fine. But do we want the client to fill the pipe of the server ? Probably not.
        A client can already upload multiple files in parallel and fill the server's pipe, and the server may have a high-bandwidth pipe. IMO, Sling should have the capability to accept parallel connections for the same file, and it can dictate to its clients whether it will accept them.

        >Yet, I wonder, whether this requires specialized (fat?) clients or whether this is possible to be done from a vanilla browser. Using HTML5 features is fine for me.
        HTML5 has the File API, which provides file slicing capability.
        http://www.html5rocks.com/en/tutorials/file/dndfiles/
        http://stackoverflow.com/questions/7594760/html5-file-api-slicing-or-not

        Julian Reschke added a comment -

        The only reason multiple connections help is that they work around the server's limits (either intentional, or a consequence of having to serve many other users at the same time) - why would we want to support this?

        Shashank Gupta added a comment -

        > why would we want to support this?
        Faster upload of large files. Here is a comparison we did some time back.

        File size: 460.10 MB
        Normal upload: 225.26 sec
        Chunked upload with 6 parallel connections (throttled by browser): 74 sec

        File size: 221.53 MB
        Normal upload: 108.9 sec
        Chunked upload with 6 parallel connections (throttled by browser): 31 sec

        File size: 105.6 MB
        Normal upload: 63 sec
        Chunked upload with 6 parallel connections (throttled by browser): 22 sec

        File size: 51.4 MB
        Normal upload: 39 sec
        Chunked upload with 6 parallel connections (throttled by browser): 16 sec

        Alexander Klimetschek added a comment -

        Agree with Julian & Felix: #2 (resuming a broken upload) makes sense, but I don't get why parallel uploads should help. What was used as the server/endpoint for the above comparison?

        Shashank Gupta added a comment -

        > What was used as server/endpoint for the above comparison?
        CQ 5.6 server with S3 datastore.

        I have implemented a servlet endpoint on an Amazon EC2 instance which can accept parallel file chunks and merge them. Also attached is an upload client which can upload chunks in parallel.

        Environment: Amazon ec2 instance
        OS: RHEL-6.3
        Server: CQ 5.6 Load 20
        CRX datastore: file system
        Server endpoint: http://ec2-23-20-237-244.compute-1.amazonaws.com:4502
        Usage: java -DparallelThreads=<number of parallel connections> -DbaseUrl=http://ec2-23-20-237-244.compute-1.amazonaws.com:4502 -DfilePath=<file-path> -jar uploadclient.jar
        If the number of parallel connections = 1, all chunks are uploaded serially.
        file-path = path of the file being uploaded; the file is uploaded into /content/dam/<file_name>

        Results:
        Client: my laptop (my org network)
        File size: 54 MB
        Chunk size: 5 MB
        Number of chunks: 11
        Time taken in serial upload: 1735 sec ~= 29 minutes
        Time taken in parallel upload with 6 parallel connections: 691 sec ~= 11 minutes

        There is a clear advantage in upload speed when using parallel connections.

        PS: Since the CRX datastore detects duplicate binaries, please change the binary a bit between runs so that the SHA-1 hashes differ.
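
        For illustration, here is a minimal sketch of such a chunked upload client in Java. The target path, port, and ".chunk.<n>.html" URL scheme are assumptions made up for the example, not the actual protocol of the attached uploadclient.jar:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        // Slices a file into fixed-size chunks and POSTs them over N parallel
        // connections. URL scheme and chunk numbering are illustrative assumptions.
        public class ParallelChunkUploader {
            static final int CHUNK_SIZE = 5 * 1024 * 1024; // 5 MB, as in the test above

            public static void main(String[] args) throws Exception {
                Path file = Path.of(args[0]);
                String base = "http://localhost:4502/content/dam/" + file.getFileName();
                byte[] data = Files.readAllBytes(file);
                int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;

                HttpClient client = HttpClient.newHttpClient();
                ExecutorService pool = Executors.newFixedThreadPool(6); // 6 parallel connections
                List<Future<Integer>> statuses = new ArrayList<>();
                for (int i = 0; i < chunks; i++) {
                    final int n = i;
                    statuses.add(pool.submit(() -> {
                        byte[] slice = Arrays.copyOfRange(data, n * CHUNK_SIZE,
                                Math.min((n + 1) * CHUNK_SIZE, data.length));
                        HttpRequest req = HttpRequest.newBuilder()
                                .uri(URI.create(base + ".chunk." + (n + 1) + ".html")) // hypothetical scheme
                                .POST(HttpRequest.BodyPublishers.ofByteArray(slice))
                                .build();
                        return client.send(req, HttpResponse.BodyHandlers.discarding()).statusCode();
                    }));
                }
                for (Future<Integer> s : statuses) System.out.println("chunk status: " + s.get());
                pool.shutdown();
            }
        }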

        Julian Reschke added a comment -

        Shashank: unless I'm missing something, there is no reason why using parallel connections speeds things up, unless the server actually intends to limit the throughput per connection. In that case, using parallel connections essentially means abusing server resources. Why would we want that?

        So if you do tests, please do them with a server that is idle and which does not rely on yet another network connection for storage.

        Shashank Gupta added a comment - edited

        > unless I'm missing something there is no reason why using parallel connections speeds things up,
        One possible reason: http://en.wikipedia.org/wiki/TCP_tuning

        TCP speed limits
        Maximum achievable throughput for a single TCP connection is determined by different factors. One trivial limitation is the maximum bandwidth of the slowest link in the path. But there are also other, less obvious limits for TCP throughput. Bit errors can create a limitation for the connection as well as round-trip time.

        >So if you do tests, please do them with a server that is idle and which does not rely on yet another network connection for storage.
        The server hosted on the EC2 instance is idle and has datastore storage on the file system.

        Alexander Klimetschek added a comment -

        Do the parallel connections from the uploadclient to CQ map to parallel connections to S3 in the datastore implementation? AFAIK S3 is optimized for parallel uploads, as each connection might go to a different (internal) S3 endpoint.

        Shashank Gupta added a comment -

        Other links:
        http://stackoverflow.com/questions/259553/tcp-is-it-possible-to-achieve-higher-transfer-rate-with-multiple-connections
        http://www.eecis.udel.edu/~nataraja/papers/MultTCPTR.pdf

        Julian Reschke added a comment -

        Shashank: for a reliable test you need a server on a single machine that doesn't do anything else, and which has a stable, consistent network connection to the client. I don't think this is the case here.

        Also, what you cite about TCP speed limits doesn't seem to be relevant for the question about why using multiple connections would help.

        Shashank Gupta added a comment -

        >Also, what you cite about TCP speed limits doesn't seem to be relevant for the question about why using multiple connections would help.
        The primary reason a single TCP connection cannot acquire the full bandwidth of the pipe is head-of-line blocking (http://en.wikipedia.org/wiki/Head-of-line_blocking), i.e. the sender stops sending data packets until it receives acknowledgement of the packets already sent; using multiple connections is a workaround for the problem.

        Quote from http://www.eecis.udel.edu/~nataraja/papers/MultTCPTR.pdf
        "The current workaround to improve an end user's perceived WWW performance is to download an HTTP transfer over multiple TCP connections. An application employing multiple TCP senders exhibits an aggressive sending rate, and consumes a higher share of the bottleneck bandwidth than an application using fewer or single TCP connection(s) [12, 5]. Therefore, multiple TCP connections not only reduce HOL blocking, but are also expected to improve HTTP throughput. As a result, multiple TCP connections have remained an attractive option to improve web response times."

        Quote from http://stackoverflow.com/questions/259553/tcp-is-it-possible-to-achieve-higher-transfer-rate-with-multiple-connections
        "A brief summary: in high-latency, high-bandwidth environments, reliable communications such as TCP are often limited by the amount of data in flight at any given time. Multiple connections are one way around this, as the bandwidth delay product applies to each connection individually."

        "This will vary depending on the TCP stack of your OS, but a not-uncommon value for TCP receive window size is 64Kbytes. This is obviously far too small to allow you to make full use of the end to end bandwidth; once 64kbytes (512kbits) of data have been sent, your sending process will wait for a window update from the receiver indicating that some data has been consumed before putting any more data on the wire."

        "Having multiple TCP sessions open gets around this by virtue of the fact that each TCP session will have its own send/receive buffers"

        Shashank Gupta added a comment -

        > Do the parallel connections from the uploadclient to CQ map to parallel connections to S3 in the datastore implementation? AFAIK S3 is optimized for parallel uploads, as each connection might go to a different (internal) S3 endpoint
        The above endpoint has nothing to do with S3. It is OOTB CQ load 20 with an additional servlet endpoint to support chunked file upload into Sling. It saves file chunks to the local file system, finally merges them into one file, and saves it to the CRX datastore (filesystem).

        > for a reliable test you need a server on a single machine that doesn't do anything else, and which has a stable, consistent network connection to the client.
        I tried the upload test between two machines on my org intranet; the time taken using single and parallel connections is almost the same. But when I upload files over the unreliable internet, I see a huge performance gain.
        I have attached a test case and its usage instructions at https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13552658&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13552658
        You can run the test case and observe that using parallel connections increases the upload speed over the internet.

        > In that case, using parallel connections essentially means abusing server resources
        IMO, by default Sling should allow a single upload connection. A client should send a special request to Sling to use parallel connections, which Sling must approve; e.g. a client wanting to upload a large file can request that Sling allow it to use parallel connections.

        Alexander Klimetschek added a comment -

        Did you also run real-world scenarios with multiple people accessing the Sling instance (mostly reads) and others uploading in parallel? And compare the overall system performance, not just for an individual user on an idle machine?

        Shashank Gupta added a comment -

        Alexander Klimetschek: can you elaborate on the point you want to emphasize? In a real-world scenario most reads would be served from a CDN, a dispatcher, or a publish CQ instance.
        My point is that Sling should have support for accepting multiple parallel connections for chunked upload. Whether it allows a client to use parallel connections or restricts it to a single connection is at its sole discretion.

        Julian Reschke added a comment -

        My gut feeling is that the "parallel upload" feature adds complexity where it's absolutely not clear it will be useful or desirable at all.

        Shashank Gupta added a comment -

        Yes, I agree that "parallel chunk upload" will increase the complexity. OK, let's first target fault-tolerant, resumable upload, but the protocol should be flexible enough to support parallel chunk upload later.

        Bertrand Delacretaz added a comment -

        I agree that chunked file uploads might be useful to support resumable uploads, assuming there are actual clients that support it.

        As for more sophisticated upload strategies, we might provide extension points to allow users to plug in whatever they want - I agree with Julian that supporting multi-connection uploads adds too much complexity and is probably only useful in very specific cases.

        Shashank Gupta added a comment -

        Please find the draft spec at https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support.
        Please review and provide your valuable comments.

        Julian Reschke added a comment -

        1) The upload requests should return 201, not 200.

        2) The GET on an unknown upload should return a 4xx status code, not a 5xx. Probably simply 404.

        3) The "Location" response header field on "Get last successful upload chunk response" looks fishy. What is it for? Doesn't the information belong into the payload?

        Also: do we absolutely need the ".html" in the request URI?

        Shashank Gupta added a comment -

        > 1) The upload requests should return 201, not 200.
        First/intermediate uploads do not create any resource, so 200 should be fine. The last chunk upload does the actual creation, which returns 201.

        >2) The GET on an unknown upload should return a 4xx status code, not a 5xx. Probably simply 404.
        OK, will correct.

        >3) The "Location" response header field on "Get last successful upload chunk response" looks fishy. What is it for? Doesn't the information belong into the payload?
        It is not required. Will correct.

        >Also: do we absolutely need the ".html" in the request URI?
        IIRC, for a non-existing resource, if I don't specify an extension, the selector is used as the extension, so in that sense it is required. I will recheck.

        Julian Reschke added a comment -

        > First/intermediate uploads do not create any resource, so 200 should be fine. The last chunk upload does the actual creation, which returns 201.

        Well, it does create a resource, just not the final resource.

        Alexander Klimetschek added a comment -

        I wonder if you couldn't do range uploads instead of chunks?
        1) In case an upload fails (connection terminates, no 2xx response), the server makes sure to save what it got so far (not sure if Sling does that already),
        2) then the client does a HEAD on the file to see if it exists (partially) and what its current length is.
        3) Now you do another PUT or PATCH (not sure what's "right" for range uploads) with a Range that starts at the existing length.

        Then there is no need to manage chunks on the server side, you only need to append to an existing binary. Also it's all in the spec already AFAICS, no need to invent a custom scheme.

        AFAICS, this should also be possible in a (modern) browser, using Blob.slice() from the html5 File API: http://dev.w3.org/2006/webapi/FileAPI/#dfn-slice It should also be possible to build it in a way that it falls back to re-uploading everything in case the server does not allow range uploads.
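
        A minimal client-side sketch of steps 1) to 3), assuming the server reports the partial length via Content-Length on HEAD and accepts an append PATCH; the media type is made up for the example:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.Arrays;

        // Resume an interrupted upload: ask the server how much it already has,
        // then send only the remaining bytes. Conventions here are assumptions.
        public class ResumeUpload {
            public static void main(String[] args) throws Exception {
                HttpClient client = HttpClient.newHttpClient();
                URI target = URI.create("http://localhost:8080/content/dam/asset.pdf");

                // 2) HEAD to find out how many bytes the server already stored
                HttpResponse<Void> head = client.send(
                        HttpRequest.newBuilder(target)
                                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                                .build(),
                        HttpResponse.BodyHandlers.discarding());
                long have = head.headers().firstValueAsLong("Content-Length").orElse(0);

                // 3) PATCH the bytes from `have` to the end of the local file
                byte[] all = Files.readAllBytes(Path.of("asset.pdf"));
                byte[] rest = Arrays.copyOfRange(all, (int) have, all.length);
                HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(target)
                                .header("Content-Type", "application/x-append-binary") // made-up type
                                .method("PATCH", HttpRequest.BodyPublishers.ofByteArray(rest))
                                .build(),
                        HttpResponse.BodyHandlers.ofString());
                System.out.println("status: " + resp.statusCode());
            }
        }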

        Still some replies to the current design:
        > Also: do we absolutely need the ".html" in the request URI?

        Yes, the extension should really correspond to the response format. Note that it could be different responses for the very same feature, one html, one json, ... (depending on the client).

        I don't see any HTML here, only JSON (on the "Query to retrieve last successful chunk upload").

        I would also put the chunk number in a selector instead of the somewhat long "/chunk/<number>". For example, for the 2nd chunk:

        POST /content/dam/dam-folder/asset.pdf.chunk.2.html

        And avoid action verbs like "upload" in the URL... but as noted above, no selector should be necessary at all.

        Shashank Gupta added a comment -

        > Well, it does create a resource, just not the final resource.
        Until the last chunk gets uploaded, a GET call to the URL will return 404, so 201 doesn't look correct to me. 206 looks more appropriate, but it is restricted to GET. http://benramsey.com/blog/2008/05/206-partial-content-and-range-requests/

        >I wonder if you couldn't do range uploads instead of chunks?
        The reason we are not doing range uploads is that the Sling endpoint then wouldn't be extensible to support parallel chunk upload. With range-based upload, chunks can vary in size and it would be impossible to handle out-of-sequence chunk arrival. That is probably the reason Amazon AWS uses chunk numbers: http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html.

        > AFAICS, this should also be possible in a (modern) browser, using Blob.slice() from the html5 File API....
        Yes, I agree. IMO, OPTIONS http://<slingserver>:<port> should tell the client whether Sling supports chunked upload. I am investigating ways to support that. The other approach could be that all HTML5-capable browsers do chunked upload and the rest do plain upload.
        Any other ideas are welcome here.
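
        A sketch of what such a probe could look like; the "Sling-Chunked-Upload" response header is invented here purely for illustration and does not exist in Sling:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;

        // Probe the server for chunked upload support via an OPTIONS request.
        public class ChunkSupportProbe {
            public static void main(String[] args) throws Exception {
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:4502/"))
                        .method("OPTIONS", HttpRequest.BodyPublishers.noBody())
                        .build();
                HttpResponse<Void> resp = HttpClient.newHttpClient()
                        .send(req, HttpResponse.BodyHandlers.discarding());
                boolean chunked = resp.headers()
                        .firstValue("Sling-Chunked-Upload").isPresent(); // hypothetical header
                System.out.println("chunked upload supported: " + chunked);
            }
        }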

        > And avoid action verbs like "upload" in the URL... but as noted above, no selector should be necessary at all.
        Not sure I get it. How will the request hit my endpoint if I don't specify a selector?

        >POST /content/dam/dam-folder/asset.pdf.chunk.2.html
        [chunk, 2] are selectors here, correct?

        > Then there is no need to manage chunks on the server side, you only need to append to an existing binary. Also it's all in the spec already AFAICS, no need to invent a custom scheme
        That is an implementation detail. For an extensible parallel-connection design/protocol, it is required to keep the chunks temporarily.

        Julian Reschke added a comment -

        > Until the last chunk gets uploaded, a GET call to the URL will return 404, so 201 doesn't look correct to me. 206 looks more appropriate, but it is restricted to GET. http://benramsey.com/blog/2008/05/206-partial-content-and-range-requests/

        The point being that any chunk you upload is a temporary resource, and there's no reason why a GET on its URI should return the content with status code 200. With respect to 206: the relevant specs are RFC 2616 and http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p5-range-latest.html for the latest & greatest...

        Bertrand Delacretaz added a comment -

        Shouldn't this discussion happen on the dev list?

        I'm not sure there's even agreement on what's supposed to be fixed/improved by this issue; the initial use cases at the top list three somewhat distinct things. From what I understand, the focus is on fault-tolerant/resumable uploads, so we probably need to discuss the basics of that and the corresponding protocol first.

        Alexander Klimetschek added a comment -

        Actually, since my proposal is append-only, you only need a request (POST/PATCH) somewhat indicating that this is appending to the existing binary. The Range/Content-Range headers aren't fully designed for uploads. Found this proposal which is quite close: http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

        > The reason we are not doing range uploads because sling endpoint won't be extensible to support parallel chunk upload.
        As noted, I don't see this as a goal at all, so a lot of complexity can be stripped.

        > OPTIONS should tell client whether sling support chunk upload
        Not only Sling...

        > how request will hit my endpoint if don't specify selector.
        I mean the range-upload instead of chunks.

        >>POST /content/dam/dam-folder/asset.pdf.chunk.2.html
        > [chunk, 2 ] are selector here. correct ?
        Yes, considering /content/dam/dam-folder/asset.pdf exists as resource.

        > implementation detail. for extensible parallel connection design/protocol, it require to keep chunks temporary.
        Well, exposing a certain URL format is not an implementation detail. And by the way, it's not RESTful - the client has to know how the chunk mechanism works (but that's probably necessary in case you want to do chunks).

        Julian Reschke added a comment -

        > Actually, since my proposal is append-only, you only need a request (POST/PATCH) somewhat indicating that this is appending to the existing binary.

        Agreed. A PATCH with "append binary" content-type would work.

        > Range/Content-Range headers aren't fully designed for uploads. Found this proposal which is quite close: http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

        That proposal IMHO is dead; better ignore it.

        Felix Meschberger added a comment -

        I have attached my comments to the proposal in the wiki.

        Felix Meschberger added a comment -

        I understand the implementation consists of two parts: (1) the server-side implementation in one or more servlets, and (2) the JavaScript to build the client-side UI.

        Finally, some integration tests to (a) test and (b) document the feature would be required, too.

        Shashank Gupta added a comment -

        Incorporated all suggestions and updated the wiki at https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support. Please review and provide your comments.
        Once we finalize the protocol, I will spec out the design and implementation.

        Thanks.

        Julian Reschke added a comment -

        You can't simply replace POST by PATCH. They have different semantics. You'll also need custom media types describing the PATCH format.

        Show
        Julian Reschke added a comment - You can't simply replace POST by PATCH. They have different semantics. You'll also need custom media types describing the PATCH format.
        Hide
        Shashank Gupta added a comment -

        Updated spec at https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support to include PATCH formats.

        Julian Reschke added a comment -

        Why do we need two protocols? I really don't see the use case for PATCH here, unless we'd be manipulating a resource-in-place (which we do not).

        Regarding the POST-based proposal: returning 201 every time with the same Location header field looks wrong to me. On the 2nd request, we're just updating an existing resource, right?

        Shashank Gupta added a comment -

        >Why do we need two protocols? I really don't see the use case for PATCH here, unless we'd be manipulating a resource-in-place (which we do not).
        Yes, we are not updating the resource in place. It was added per the suggestion at https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13559699&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13559699. I have no objection to removing it.

        > returning 201 every time with the same Location header field looks wrong to me.
        Added as per the suggestion at https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13559551&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13559551.
        IMO, since no resource gets created until the endpoint receives the last chunk, 200 OK should be returned for first/intermediate chunk uploads. For the last chunk upload, since the resource gets created, 201 CREATED should be returned.
        https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13559556&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13559556

        Alexander Klimetschek added a comment -

        > IMO, since no resource get created until endpoint receives last chunk so 200 OK

        The assumption here is that the final file is the only "resource" here - but IMHO that is not the case. The protocol allows addressing the chunk resource(s) while the final file is not finished yet, so from the point of view of HTTP those intermediary steps are indeed resources. That would speak for 201 all the way...

        @Julian: what do you think of a HTTP-level solution involving the existing Range headers for uploads as I proposed above? Would that be in line with the spec? I really don't like the arbitrary ".chunk.<idx>" URLs...

        Julian Reschke added a comment -

        Alexander Klimetschek: that would be "append in place", for which PUT does not work (see <http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p2-semantics-21.html#rfc.section.5.3.4.p.10>). We can do that with PATCH (if we defined an "append" patch format). In any case, I would avoid two different protocols unless it's clear that we need both.

        Shashank Gupta added a comment -

        Well, there is overhead in appending the binary to the JCR node on each upload request. It will create junk files in the datastore and generate JCR events; also, there is no use for partial content in JCR.
        The idea here is to store chunks in a file system temp location, merge them on the last request, and create the JCR node out of the result. Given this implementation context, 200 OK for first/intermediate requests and 201 CREATED for the last probably make more sense.
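
        A sketch of that merge step, assuming chunks are staged as <index>.part files in a per-upload temp directory; names and layout are illustrative, not taken from the actual implementation:

        import java.io.IOException;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;

        // Concatenates staged chunks in order once the last one has arrived.
        public class ChunkMerger {
            public static void merge(Path chunkDir, int totalChunks, Path target) throws IOException {
                try (OutputStream out = Files.newOutputStream(target)) {
                    for (int i = 1; i <= totalChunks; i++) {
                        Path chunk = chunkDir.resolve(i + ".part");
                        Files.copy(chunk, out); // append chunk i in order
                    }
                }
                // At this point the merged file would be written to the JCR node
                // and the staged chunks deleted.
            }
        }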

        Alexander Klimetschek added a comment -

        @Julian: If we use Range headers, then why would we need a custom patch format? How would you send the actual content-type (e.g. application/pdf) then?

        Ideally we can make it work in a browser (so it works with what the HTML5 File API offers) while at the same time keeping it as simple as possible (so that one can easily write other HTTP client apps). I am not sure if using the File blob in HTML5 will always use POST + multipart form or if it also allows a custom PATCH/POST with just the binary and a custom content-type. You could maybe also build a custom payload (mixing some custom data + the binary from the File object), but even if that works, it seems unnecessarily complicated.

        I see these options:
        a) PATCH or POST + Range headers; plain (sliced) binary as body; custom content-type or encoding?
        b) PATCH or POST + Range headers; body multipart-form as given by browser
        c) PATCH or POST; body multipart form, one part (sliced) binary, another part including the range information

        I prefer a), it seems the simplest.

        Julian Reschke added a comment -

        Alexander Klimetschek: the "Range" header field is undefined for POST, so you're basically inventing a new HTTP API but using an existing header field name; not good.

        Regarding patch formats: a custom format for binary create/append would need a way to specify the type; right.

        Alexander Klimetschek added a comment -

        Ok, good, so reusing RANGE with POST and PATCH is not a feasible, spec-compliant option.

        Then I would prefer something like c) which is baked right into the Sling POST servlet, as an extension to the existing file upload support [0], which works with multipart/form-data:

        file=<partial binary>
        file@RangeStart=<bytes>
        file@RangeEnd=<bytes>
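
        For illustration, a sketch of what such a request could look like from a Java client; the @RangeStart/@RangeEnd field names follow the proposal above and are not an implemented API:

        import java.io.ByteArrayOutputStream;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.charset.StandardCharsets;

        // Builds a multipart/form-data POST carrying one file slice plus the
        // proposed @RangeStart/@RangeEnd fields. The fields are a proposal only.
        public class RangePost {
            public static void main(String[] args) throws Exception {
                String boundary = "----sling-chunk";
                byte[] slice = new byte[5 * 1024 * 1024]; // bytes rangeStart..rangeEnd of the file

                ByteArrayOutputStream body = new ByteArrayOutputStream();
                String fields = "--" + boundary + "\r\n"
                        + "Content-Disposition: form-data; name=\"file@RangeStart\"\r\n\r\n0\r\n"
                        + "--" + boundary + "\r\n"
                        + "Content-Disposition: form-data; name=\"file@RangeEnd\"\r\n\r\n5242879\r\n"
                        + "--" + boundary + "\r\n"
                        + "Content-Disposition: form-data; name=\"file\"; filename=\"asset.pdf\"\r\n"
                        + "Content-Type: application/pdf\r\n\r\n";
                body.write(fields.getBytes(StandardCharsets.UTF_8));
                body.write(slice);
                body.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));

                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8080/content/dam/asset.pdf"))
                        .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                        .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                        .build();
                int status = HttpClient.newHttpClient()
                        .send(req, HttpResponse.BodyHandlers.discarding()).statusCode();
                System.out.println("status: " + status);
            }
        }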

        I am not sure if the complexity of storing and maintaining the chunks temporarily in the file system is worth it - considering improvements that Oak (Jackrabbit 3) will bring. I'd like to see the binary being stored in JCR right away.

        It could be updated every time a new chunk is added (the missing parts might be filled with zeros) and a special node type (sling:PartialFile < nt:file) might indicate that it's not finished and keep the already-uploaded ranges in e.g. a multi-value property. Alternatively, it could be a set of child nt:file nodes each holding a chunk (and maybe encoding the range in the node name, e.g. "1000-2500").

        [0] http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html#ManipulatingContent-TheSlingPostServlet%28servlets.post%29-FileUploads

        Ian Boston added a comment - edited

        For info only, in case it was missed earlier.

        The AWS multipart incremental upload referenced earlier started life as a request for Content-Range on PUT requests [1]. Also referenced in that thread is Microsoft Azure's use of Range headers in PUT as a solution (probably a spec extension/violation?). It is interesting to note that the reasoning from AWS for using multipart chunks is the difficulty of reassembling the parts in a cloud environment where parts of an S3 bucket could arrive concurrently from distributed hosts.
        If this has already been looked at and considered, then sorry for the noise.

        [1] https://forums.aws.amazon.com/thread.jspa?threadID=10752&start=25&tstart=0

        Amit Gupta added a comment -

        Introducing a new node type sling:PartialFile seems like unnecessary complexity. And whether to store chunks in JCR or in the file system is anyway an implementation detail; the implementation can be changed/improved when Oak comes in.

        The current spec with part numbers gives an opportunity for future extension to support uploading of parallel chunks. It takes the best of Amazon S3 multipart upload by using the part-number approach; by using part numbers, Amazon S3 upload also encourages concurrent uploads of parts. I think we all agree that multi-connection upload is added complexity at this point, but if just by using part numbers (which are already simple enough) we can keep scope for future extension, then that is not a bad idea at all.

        Alexander Klimetschek added a comment -

        > Introducing a new node type sling:PartialFile seems unnecessary complexity

        Ok, the implementation specifics can be discussed separately.

        > Current spec with part number gives opportunity for future extension for supporting uploading of parallel chunks

        My proposal using the sling post servlet and @RangeStart/End allows for the same; it might be necessary to (initially) include the full file size (e.g. @FileLength) as well, so the server side knows when it is finished and can assemble the final file.

        Regarding S3: as Ian mentioned, the use case for their chunks was different from the resumable upload we are discussing here, so it should not dominate how our solution looks. And IMHO it's quite intuitive to follow the Range concept of HTTP GETs - byte ranges are more flexible and self-describing than a fixed number of chunks. The only thing is that we can't use the Range headers directly, as it's not spec compliant to reuse that header for PUT/POST/PATCH, and we don't want to introduce a proprietary header (X-* considered harmful). Finally, having the chunk number in the URL just complicates things - the Sling POST servlet feels like the natural place to add this feature.
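
        To make the byte-range variant concrete, a minimal client sketch could look as follows - assuming Apache HttpClient 4.x with the httpmime module; the file@RangeStart/file@RangeEnd/file@FileLength parameter names come from the proposal in this thread, not from any released spec:

            import java.util.Arrays;
            import org.apache.http.HttpResponse;
            import org.apache.http.client.HttpClient;
            import org.apache.http.client.methods.HttpPost;
            import org.apache.http.entity.mime.MultipartEntity;
            import org.apache.http.entity.mime.content.ByteArrayBody;
            import org.apache.http.entity.mime.content.StringBody;
            import org.apache.http.impl.client.DefaultHttpClient;

            public class RangeChunkUploader {

                // Uploads the byte range [start, end] of "data" to a Sling node,
                // using the multipart parameter names proposed above.
                public static void uploadRange(String url, byte[] data,
                        int start, int end, long fileLength) throws Exception {
                    MultipartEntity entity = new MultipartEntity();
                    entity.addPart("file", new ByteArrayBody(
                            Arrays.copyOfRange(data, start, end + 1), "file.bin"));
                    entity.addPart("file@RangeStart", new StringBody(String.valueOf(start)));
                    entity.addPart("file@RangeEnd", new StringBody(String.valueOf(end)));
                    // the total size lets the server detect when the file is complete
                    entity.addPart("file@FileLength", new StringBody(String.valueOf(fileLength)));

                    HttpPost post = new HttpPost(url);
                    post.setEntity(entity);
                    HttpClient client = new DefaultHttpClient();
                    HttpResponse response = client.execute(post);
                    System.out.println(response.getStatusLine());
                }
            }

        Resuming after a failure would then simply mean re-sending the ranges the server has not yet acknowledged.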

        Alexander Klimetschek added a comment -

        Came across this open source project / effort to "fix file uploading": http://felixge.de/2013/03/27/lets-fix-file-uploading.html

        They want to use the Content-Range header with POST and PUT ( http://www.tus.io/docs/http-protocol.html ) and have an HTML5/JavaScript client ( https://github.com/tus/tus-html5-client ).
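
        For illustration, a single chunk in the style of that draft could be sent like this (a plain HttpURLConnection sketch; the tus protocol was still in flux at this point, so the header usage mirrors the draft linked above rather than a finished spec):

            import java.io.OutputStream;
            import java.net.HttpURLConnection;
            import java.net.URL;

            public class ContentRangePut {

                // Sends one chunk of an upload with a Content-Range header,
                // mirroring the draft's reuse of that header on PUT.
                public static int putChunk(String uploadUrl, byte[] chunk,
                        long offset, long totalLength) throws Exception {
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(uploadUrl).openConnection();
                    conn.setRequestMethod("PUT");
                    conn.setDoOutput(true);
                    conn.setRequestProperty("Content-Range", "bytes " + offset + "-"
                            + (offset + chunk.length - 1) + "/" + totalLength);
                    OutputStream out = conn.getOutputStream();
                    out.write(chunk);
                    out.close();
                    return conn.getResponseCode(); // status handling left to the caller
                }
            }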

        Felix Geisendörfer added a comment -

        Hi,

        I'm the author of the post linked by Alexander Klimetschek, and this discussion was pointed out to me by Julian Reschke on Twitter.

        We're still in the very early stages, so none of the implementation details are final yet. The only reason we're already working on clients/servers is that it gives us a better picture of what can be done on the client platforms (HTML5, iOS, Android) while using their standard HTTP libraries.

        Anyway, we're serious about coming up with a good solution, as this has been a sore spot on the web for way too long. I tend to think that byte ranges are preferable to chunk numbers, and we're currently discussing PUT vs PATCH [1] and which headers to use for the byte ranges [2].

        We'd really like to be fully compliant with existing specs without inventing our own headers, so hopefully this can be accomplished.

        Let me know if there is an interest from your end to join us in our quest. If not, I'd still love to exchange ideas.

        [1] https://github.com/tus/tus-resumable-upload-protocol/issues/14
        [2] https://github.com/tus/tus-resumable-upload-protocol/issues/2

        Shashank Gupta added a comment - edited

        Attached patch (SLING-2707.patch) with feature details at https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support (a client sketch based on that spec follows after the list below).
        Salient features:
        1. Extension to the SlingPostServlet.
        2. Chunk upload is supported in the "modify" operation of the SlingPostServlet (i.e. the default operation); no new operation is introduced for it.
        3. Chunks are stored within the actual path in a mixin node type; [sling:chunk] and [sling:chunks] are defined for this purpose.
        4. Added six integration test cases validating different scenarios:
        SlingPostChunkUploadTest#testChunkUpload
        SlingPostChunkUploadTest#testInterruptedChunkUpload
        SlingPostChunkUploadTest#testConcurrentChunkUpload
        SlingPostChunkUploadTest#testMidwayChunkUpload
        SlingPostChunkUploadTest#testChunkUploadOnExistingNode
        SlingPostChunkUploadTest#testChunkUploadOnStreaming
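
        A minimal sketch of a client driving this extension, assuming Apache HttpClient 4.x and the file@Offset/file@Length parameter names described on the cwiki page (see the spec there for the authoritative names); chunks are sent strictly in sequence, matching the patch's validation:

            import java.io.File;
            import java.io.FileInputStream;
            import java.util.Arrays;
            import org.apache.http.client.HttpClient;
            import org.apache.http.client.methods.HttpPost;
            import org.apache.http.entity.mime.MultipartEntity;
            import org.apache.http.entity.mime.content.ByteArrayBody;
            import org.apache.http.entity.mime.content.StringBody;
            import org.apache.http.impl.client.DefaultHttpClient;
            import org.apache.http.util.EntityUtils;

            public class SlingChunkUploadClient {

                // Uploads "file" to the given Sling node in sequential chunks.
                public static void upload(String url, File file, int chunkSize)
                        throws Exception {
                    HttpClient client = new DefaultHttpClient();
                    FileInputStream in = new FileInputStream(file);
                    try {
                        byte[] buf = new byte[chunkSize];
                        long offset = 0;
                        int read;
                        while ((read = in.read(buf)) > 0) {
                            MultipartEntity entity = new MultipartEntity();
                            entity.addPart("file", new ByteArrayBody(
                                    Arrays.copyOf(buf, read), file.getName()));
                            // offset of this chunk, plus the total file length
                            entity.addPart("file@Offset", new StringBody(String.valueOf(offset)));
                            entity.addPart("file@Length", new StringBody(String.valueOf(file.length())));
                            HttpPost post = new HttpPost(url);
                            post.setEntity(entity);
                            EntityUtils.consume(client.execute(post).getEntity());
                            offset += read;
                        }
                    } finally {
                        in.close();
                    }
                }
            }

        A resume after an interruption would restart the loop from the last offset the server accepted.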

        Shashank Gupta added a comment -

        Pull request created at https://github.com/apache/sling/pull/8
        Shashank Gupta added a comment -
        1. Added an svn-compatible patch.
        2. Moved constants from JcrResourceConstants.java to SlingPostConstants.java.
        Carsten Ziegeler added a comment -

        Thanks for your patch, Shashank. I've applied it to trunk. I'll leave this issue open for a little while to give others a chance to comment.

        Julian Reschke added a comment -

        How does overloading the parameter name with offset and length work for cases where the filename happens to contain "@"?

        Shashank Gupta added a comment -

        Thanks Carsten.
        "@" is handled in RequestProperty. I quickly tested the scenario(even with multiple "@" chars in fileName) in integration testcases, it worked fine.

        Alexander Klimetschek added a comment -

        Feedback:

        • are the node types required in the jcr/resource bundle, in a cnd which otherwise only defines the sling:Resource node type?
        • IMO the :applyToChunks flag is not necessary; if you want to abort, you simply DELETE (or :operation=delete) the target file
        • also, we cannot rely on the client always sending this correctly; the usual "abort" case will be that the upload has been aborted and nothing else is happening anymore. We need to handle partially uploaded nt:file nodes in some way then... garbage collection?
        • the validation in SlingFileUploadHandler lines 271 ff. seems very strict to me:
          • only sequential chunk upload allowed (chunkOffset != currentLength)
          • the totalLength != expectedLength check in line 286 seems wrong to me; IIUC, this would fail if the current chunk's length (totalLength) does not match the total file length (expectedLength, stored as sling:fileLength), which defeats the purpose of multiple chunks (although I wonder how the tests pass; there must be an oversight)
        • What is the sling:length property needed for, then? IMO it's not needed; it's too easy to get out of sync with what's in the sling:chunk child nodes anyway
        • Is this thread-safe? Assuming multiple chunk requests arriving in parallel.
        • Merging is done through a temporary file stored on the file system, using a custom-defined memory buffer of 16KB. I think this can be done more simply with a SequenceInputStream just piping them all together (see the sketch after this list). (And don't do things like new byte[16*1024]; use commons-io when needed.)
        • the RequestProperty extensions (getOffset, getLength, etc.) might be useful to separate out into a Chunk object that you get via RequestProperty.getChunk() (null if not a chunk), to avoid polluting that object with methods that only make sense together in a certain scenario
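
        A sketch of the SequenceInputStream idea from the merging point above, assuming the chunk streams can be obtained in order from the sling:chunk child nodes (retrieval itself is out of scope here):

            import java.io.InputStream;
            import java.io.SequenceInputStream;
            import java.util.Collections;
            import java.util.List;

            public class ChunkMerger {

                // Pipes the ordered chunk streams together without a temporary
                // file or a hand-rolled 16KB copy buffer.
                public static InputStream merge(List<InputStream> chunksInOrder) {
                    return new SequenceInputStream(Collections.enumeration(chunksInOrder));
                }
            }

        The merged stream can then be handed straight to the JCR binary; any buffering (e.g. via commons-io) stays out of the upload handler.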
        Shashank Gupta added a comment -

        Thanks, Alex, for your detailed feedback - much appreciated. I agree with most of the points.

        >are the node types required in the jcr/resource bundle, in a cnd which otherwise only defines the sling:Resource node type?
        No, they are not required there; I think they can be moved to the Sling POST bundle.

        >IMO the :applyToChunks flag is not necessary, if you want to abort, you simple DELETE (or :operation=delete) the target file
        It is required in the update case, where you don't want to delete the original file. As of now, deleting chunks on a new file results in an empty nt:file; it should delete the file itself. wdyt?

        >Also, we cannot rely that this is always sent correctly by the client, the usual "abort" case will be that the upload has been aborted and nothing else is happening anymore. We need to handle partially uploaded nt:file nodes in some way then... garbage collection?
        Yes, I agree. A scheduled job is required that deletes all partial uploads older than 12 hours, with a job frequency of, say, 6 hours. It would be configurable anyway.
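
        A hypothetical sketch of such a job, registered with the Sling Commons Scheduler via the whiteboard pattern (Felix SCR annotations); the 6-hour period and 12-hour expiry are the values suggested above, and the actual lookup and removal of stale chunk nodes is left as a stub:

            import org.apache.felix.scr.annotations.Component;
            import org.apache.felix.scr.annotations.Property;
            import org.apache.felix.scr.annotations.Service;

            @Component
            @Service(Runnable.class)
            @Property(name = "scheduler.period", longValue = 6 * 60 * 60) // every 6 hours
            public class StaleChunkCleanupJob implements Runnable {

                private static final long MAX_AGE_MS = 12L * 60 * 60 * 1000; // 12 hours

                public void run() {
                    long cutoff = System.currentTimeMillis() - MAX_AGE_MS;
                    // find nodes carrying the chunk mixin whose last modification
                    // is older than 'cutoff' and remove their partial chunks here
                }
            }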

        >only sequential chunk upload allowed (chunkOffset != currentLength)
        Yes, parallel upload is not supported as of now; there would be other considerations for parallel chunks in the current implementation too.
        As per the first few comments on this issue, the community doesn't support the idea of parallel chunk upload, although my experiment with CQ deployed on the AWS cloud clearly demonstrated the advantage [1]. As per your own comment: "Agree with Julian & Felix: #2 (resume broken upload) makes sense, but I don't get why parallel uploads should help. What was used as server/endpoint for the above comparison?"

        I would be happy to support parallel chunk upload if the community agrees to it.

        [1]https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13552658&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13552658

        >the totalLength != expectedLength check in line 286 seems wrong to me
        The file length sent in the first chunk upload request is persisted in "sling:fileLength". On subsequent requests, the @Length parameter is validated against the persisted value; if there is a mismatch, a proper exception is thrown. The total file length cannot change between the first and subsequent chunk upload requests.
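
        A sketch of that check in plain JCR API - the property name sling:fileLength is from the patch, while the method shape and exception are illustrative:

            import javax.jcr.Node;
            import javax.jcr.RepositoryException;

            public class ChunkLengthCheck {

                // Rejects a chunk whose declared @Length differs from the total
                // file length persisted with the first chunk.
                public static void validate(Node fileNode, long declaredLength)
                        throws RepositoryException {
                    long expected = fileNode.getProperty("sling:fileLength").getLong();
                    if (declaredLength != expected) {
                        throw new RepositoryException("@Length mismatch: expected "
                                + expected + ", got " + declaredLength);
                    }
                }
            }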

        >What else is the need for the sling:length property then? IMO it's not needed, too easy to get out of sync with what's in the sling:chunk child nodes anyway
        Yes, it can be derived: you need to read all chunks and add their respective sizes. I kept it simple by keeping it in a property; also, the user can see the current size from the UI at any time.

        >Is this thread-safe? Assuming multiple chunk requests coming in in parallel.
        Yes, unless they operate on the same node. The test cases SlingPostChunkUploadTest#testConcurrentChunkUpload and SlingPostChunkUploadTest#testChunkUploadOnExistingNode cover this scenario.

        >Merging is done through a temporary file stored on the file system.
        Yes, this can be improved.

        > RequestProperty extensions (getOffset, getLength, etc.) might be useful to separate out into a Chunk object..
        Yes, it can be refactored.

        Shashank Gupta added a comment -

        Logged SLING-3036 to track feedback.

        ASF GitHub Bot added a comment -

        Github user shashank-itbhu closed the pull request at:

        https://github.com/apache/sling/pull/8


          People

          • Assignee:
            Carsten Ziegeler
          • Reporter:
            Shashank Gupta
          • Votes:
            0
          • Watchers:
            10
