Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Memory can be used as a storage medium for smaller/transient files for fast write throughput.

      More information/design will be added later.

        Issue Links

          Activity

          Henry Saputra added a comment -

          Thanks Arpit Agarwal, will check out the new JIRA

          Arpit Agarwal added a comment -

          I have just created a branch for HDFS-6581. It is fairly limited in scope compared to the original feature proposal.

          Henry Saputra, Konstantin Shvachko the proposal attached to HDFS-6581 should address your queries, please let me know if it doesn't. DDM is not in scope for this phase.

          Colin Patrick McCabe added a comment -

          Let's create a branch for this work, just like we did for the original HDFS-4949 work. This is clearly a big feature, and I think we'll need to iterate a bit to get the best design.

          Konstantin Shvachko added a comment -

          A few questions, as I don't see these covered in the design document:

          1. Heterogeneous storage is implemented but not enabled; for now it can only allocate StorageType.DEFAULT blocks. This seems to be the first extension to other StorageTypes. Why is the memory type being prioritized ahead of, say, SSDs?
          2. I understand the design assumes that only one DN will hold a memory replica of a DDM block. This will increase the latency of accessing that block for a single client, but it also makes this DN a bottleneck for many clients trying to access the same data.
          3. If I understood correctly, the proposal is not to introduce new APIs for discarding unnecessary or lost data, but to handle it using a discardability policy. What is the policy?
            • In this regard Eric's idea of ZK ephemeral nodes is interesting, but probably not directly applicable, as a file should not be discarded only because its creator quit.
          4. Eviction policy is another thing that needs clarification.
          5. What do you mean by "static allocation of memory"? A configuration parameter for DNs?
          Andrew Purtell added a comment -

          In terms of discardability, what is the "eviction" policy for such data, and how do we control or fine-tune it if needed?

          Related, I was talking with Colin Patrick McCabe in the context of HDFS-4949 about possible LRU or LFU policy based eviction, and how that might work. Interesting open question of how to revoke access to mapped pages shared by the datanode with another process without causing the client process to segfault. I don't see this issue addressed in the design doc on this issue. One possibility is a callback protocol advising the client process of pending invalidations?
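
          As a concrete illustration of the LRU option discussed here (purely hypothetical; none of this is HDFS code, and the class name is invented), Java's LinkedHashMap in access-order mode gives a minimal eviction policy:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU eviction for in-memory block replicas (illustration only). */
public class LruBlockCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruBlockCache(int maxEntries) {
        super(16, 0.75f, true); // access-order mode: get() moves the entry to the tail
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry when over capacity
    }
}
```

          An LFU variant would track access counts instead of recency. The open question above — how to revoke mapped pages already handed to a client without crashing it — is orthogonal to which of the two policies is chosen.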

          Henry Saputra added a comment -

          Hi Sanjay Radia, I was looking at the JIRA and proposal and I have some questions:
          1. I did not see where the memory will be allocated for the DDM proposal. Is it similar to HDFS-4949 in using memory from the Datanode?
          2. As for the APIs, would these be new Hadoop FS (Java) APIs or a higher-level construct for storing data in memory? It seemed the proposal relies only on the file path to indicate use of the in-memory cache.
          3. The problem statement of the proposal suggests there would be a policy to manage how data should be stored in memory per application, but I could not find details on how to achieve it. Some applications may need quick access to a small but more significant portion of data (e.g. newer time-series data), whereas others may need to store more (e.g. a large Hive query).
          4. In terms of discardability, what is the "eviction" policy for such data, and how do we control or fine-tune it if needed?

          Maybe this was discussed in the earlier in-person meeting, but I could not find it in the meeting summary.
          Thanks for driving this new feature.

          Sanjay Radia added a comment -

          Slightly updated doc on DDM:

          • Clarified the separation of mechanism from DDM policy as per Colin's and my comment on being able to create memory cached file anywhere in the HDFS namespace.
          • Explained how DDMs fit with materialized queries (Julian's DIMMQ).
          • Updated references marked TBD
          • Minor improvements to the RDD/Tachyon comparison.
            Note this text was written prior to Ali and Li's comment and hence does not address their concern. I am rereading the Tachyon paper and will be meeting Li over the next couple of days and will update further as needed.
          Sanjay Radia added a comment -

          Colin wrote:

          why a separate namespace under hdfs://<namespace>/.reserved/ddm ? We have xattrs now, so files ...

          I did not explain it well. It is a separation of policy and mechanism. HDFS has to support such files under ANY name. Hence we can use an xattr to mark files as write-cached.

          The policy of managing the memory space and the underlying swap space (e.g. hdfs://namepace/.reserved/ddm) is separate from the write-cache mechanism that HDFS needs to support in ANY part of its namespace; so I believe we are in agreement here. I will explain the policy I am proposing in a separate comment.

          Todd Lipcon added a comment -

          Yep, the native checksumming that James is working on is one big part of it. The other half is the work that Trevor Robinson was doing on using DirectByteBuffers on the DN side to avoid some copies to/from byte arrays.
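
          For readers unfamiliar with the direct-buffer technique Todd mentions, here is a minimal, self-contained sketch (the class and method names are invented for illustration, not the actual DN code): a DirectByteBuffer lives outside the Java heap, so FileChannel can hand it to the OS without first copying the bytes out of a heap byte[].

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustration only: write through a direct (off-heap) buffer. */
public class DirectBufferDemo {

    /** Write data to a temp file through a direct buffer, read it back, clean up. */
    public static byte[] roundTrip(byte[] data) {
        try {
            Path p = Files.createTempFile("direct", ".bin");
            ByteBuffer buf = ByteBuffer.allocateDirect(data.length); // off-heap
            buf.put(data);
            buf.flip(); // switch from filling the buffer to draining it
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
                while (buf.hasRemaining()) {
                    ch.write(buf); // no intermediate heap-to-native copy for direct buffers
                }
            }
            byte[] back = Files.readAllBytes(p);
            Files.delete(p);
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

          With a heap byte[] the JVM must copy the bytes into native memory before each write(); the direct buffer skips that copy, which is the saving Trevor's DN-side work was after.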

          Arpit Agarwal added a comment -

          Thanks for the pointer, James!

          James Thomas added a comment -

          Hi Arpit, I posted some preliminary work on native checksumming in the write path at HDFS-3528.

          Arpit Agarwal added a comment -

          Todd Lipcon - I think you mentioned you had prototyped some write pipeline improvements which showed significant improvements. Are you able to share those out?

          Andrew Purtell added a comment -

          Can we make these files transient like ZK ephemeral nodes?

          This is an interesting idea but ZK ephemeral nodes rely on session semantics. Would the HDFS equivalent be close-on-delete, lease == session? Perhaps it can be done generically for all types of files on all media by implementing delete-on-close with refcounting. You could imagine a file marked delete-on-close open for append, with multiple readers tailing data behind the writer. Once the writer and all readers close the file it could be garbage collected.
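
          The delete-on-close-with-refcounting idea above could be sketched like this (hypothetical class and method names, not an HDFS API):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of delete-on-close with refcounting (illustration only). */
public class TransientFile {
    private final AtomicInteger openHandles = new AtomicInteger();
    private volatile boolean deleted = false;

    /** The writer or a tailing reader opens the file: bump the refcount. */
    public void open() {
        if (deleted) throw new IllegalStateException("already garbage-collected");
        openHandles.incrementAndGet();
    }

    /** The last close garbage-collects the file. */
    public void close() {
        if (openHandles.decrementAndGet() == 0) {
            deleted = true; // a real DN would free the block's memory here
        }
    }

    public boolean isDeleted() { return deleted; }
}
```

          The writer and each reader hold a handle, and the file is only collected once the last handle closes — giving ZK-ephemeral-like semantics without tying the file's lifetime to a single creator session.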

          Haoyuan Li added a comment -

          Sanjay Radia, Ali Ghodsi: Yes. Tachyon has had discardability from day one. In addition to working with Spark, Tachyon can also work with other frameworks, e.g. MapReduce, Tez, and Impala.

          Ali Ghodsi added a comment -

          Sanjay Radia, Tachyon has indeed had discardability since about a year back; the lineage support was only added in the latest version. In fact, Spark's current releases use Tachyon without lineage stored in Tachyon. If data falls out of Tachyon, Spark will recompute it. Please update the design doc accordingly. Thanks.

          Colin Patrick McCabe added a comment -

          Todd is correct here. It's the DataNode which pins the block file and metadata file, and it can un-pin them if the client takes too long.

          A few concerns:

          • why a separate namespace under hdfs://<namespace>/.reserved/ddm ? We have xattrs now, so files which have been created in the write cache could be identified with a given (system) xattr.
          • discarding files when memory gets tight is one policy; spilling them to disk is another. We should have both policies available so that this can be useful in things like Spark.
          Todd Lipcon added a comment -

          >> The case of a local short circuit read having access to the open file is interesting... does this pin the memory until the possibly misbehaved client process closes the socket / FD?
          > Yes this is correct. This should be the existing behavior with short-circuit reads.

          Not quite - it doesn't pin the memory, since the memory pinning is implemented by the datanode calling "mlock". The datanode can still munlock the memory even if the client holds the fd open, and the client will fall back to reading from disk (through the normal Linux read path).

          Arpit Agarwal added a comment -

          Hi eric baldeschwieler, I apologize for the long delayed response.

          The case of a local short circuit read having access to the open file is interesting... does this pin the memory until the possibly misbehaved client process closes the socket / FD?

          Yes this is correct. This should be the existing behavior with short-circuit reads.

          Single replicas? Why would one want to triple replicate discardable memory? One should at least have the option to only keep a single local copy in HDFS.

          We will have a single replica to begin with. There are use cases for triple replicas to share memory across sessions/applications. Multiple replicas would have to be optional as you suggest.

          If we can not prevent random access writes to DDM (we could presumably limit this in client API), then I don't think we can checksum or replicate until a file is closed. My gut is delaying such until close is the right call...

          Direct writes to DN memory are appealing for performance reasons but there are some open questions including the ones you raised. Hence phase 1 will avoid short-circuit writes.

          I will let Sanjay respond to the remaining questions since he is more familiar with those aspects.

          Arpit Agarwal added a comment -

          Minutes from Google Hangout:

          With regard to the mechanism to support memory caching, there was high-level agreement on the implementation phases, roughly:

          • 1st phase - streaming socket write, but mlock on DN side so that it keeps it for readers.
            • Make this work for a single replica
            • Separately (in another Jira) investigate write-pipeline improvements because the write-pipeline has not been optimized. This should give us some initial performance numbers and one can start using this mechanism. Todd Lipcon has a prototype.
          • 2nd phase - Explore short-circuit write, but datanode still mlocks. We had a quick discussion on short-circuit write being tricky
            • Recovery issues (RBW)
            • Client can do things that can get the DN confused (e.g. truncate/append the file after close)
          • Future phases
            • Add lazy replication to other replicas (note earlier phases allowed only 1 replica)
            • Direct writes to memory by memory-mapping the file

          Discussion on discardability:

          • Shouldn't this be a property of the file (such as a replica count of 1) rather than a property of /.reserved/ddm?
            • This needs further discussion on the jira.
          • Why the two layer approach?
            • We don't want to necessarily put load on NN for intermediate files and hence the 2nd layer.
          Colin Patrick McCabe added a comment -

          This link is a direct link to the hangout: https://plus.google.com/hangouts/_/stream/72cpimt8aamrmkuj0036v19u08?pqs=1&authuser=0&hl=en
          Andrew Wang added a comment -

          Hey Arpit/Sanjay, as a heads up, a bunch of us from Cloudera are planning on attending in person (myself, Colin, ATM, Todd, Charlie, maybe Eli). Looking forward to the meeting tomorrow.

          Sanjay Radia added a comment -

          BTW we will host the meeting at Hortonworks for those that are local and want to attend in person:
          Hortonworks
          3460 W. Bayshore Rd
          Palo Alto CA 94303

          Arpit Agarwal added a comment - edited

          I scheduled a Google+ hangout to discuss this topic for 4/30 3-5pm PDT - link here.

          Let me know if you are unable to access it.

          I will be attending remotely as I am not located in the bay area. I have reserved a conference room at the Hortonworks Palo Alto office for anyone who wants to attend in person. Please check with Sanjay Radia for access to the Hortonworks office ahead of time if you plan to attend. The address is in Sanjay's comment below.

          (edit: increased the scheduled time from 1 hour to 2 hours in case we go over)

          Sanjay Radia added a comment - edited

          Added a comparison to Tachyon in the doc. There is also an implementation difference that I don't cover (Tachyon, I believe, uses RamFS rather than memory that is mapped to an HDFS file, but I need to verify that).

          I have reproduced the text from the updated doc here for convenience:
          Recently, Spark has added an RDD implementation called Tachyon [4]. Tachyon is outside the address space of an application and allows sharing RDDs across applications. Both Tachyon and DDMs use memory-mapped files and lazy writing to reduce the need to recompute. Tachyon, since it is an RDD implementation, records the computation in order to regenerate the data in case of loss, whereas DDMs rely on the application to regenerate. Tachyon and RDDs do not have a notion of discardability, which is fundamental to DDMs where data can be discarded when it is under memory and/or backing store pressure. DDMs are closest to virtual memory/anti-caching in that they virtualize memory, with the twist that data can be discarded.
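
          The memory-mapped-file mechanism the comparison refers to can be illustrated with plain Java NIO (a generic sketch, not code from either project; names are invented):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustration only: write through a memory mapping and read it back. */
public class MmapDemo {

    /** Store one byte via a read-write mapping; the OS page cache backs the
     *  mapping and lazily flushes it to the file. */
    public static int writeAndRead(byte value) {
        try {
            Path p = Files.createTempFile("mmap", ".bin");
            try (FileChannel ch = FileChannel.open(p,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping past the current length grows the file to 8 bytes.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8);
                map.put(0, value); // a plain memory store, not a write() syscall
                return map.get(0);
            } finally {
                Files.delete(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

          Because reads and writes go through the mapping rather than syscalls, the data lives in memory until the OS (or an explicit flush) writes it back — the "lazy writing" both systems exploit.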

          eric baldeschwieler added a comment -

          The case of a local short circuit read having access to the open file is interesting... does this pin the memory until the possibly misbehaved client process closes the socket / FD?

          Single replicas? Why would one want to triple replicate discardable memory? One should at least have the option to only keep a single local copy in HDFS.

          If we can not prevent random access writes to DDM (we could presumably limit this in client API), then I don't think we can checksum or replicate until a file is closed. My gut is delaying such until close is the right call...

          How are discarded or lost (node fails) blocks / files handled? Do the names remain in the NN and get reported in FSCK and other operations? We want to be sure this doesn't add work to operators.

          Can we make these files transient like ZK ephemeral nodes?

          Once one assumes you don't need to replicate discardable files, then one can think about allocating only an arena name (think directory) in the NN and then creating individual files only at the DN, limiting NN interaction. This would be a lot faster. (You could still have remote access via .../<ARENA>/<DN-NAME>/<name> style URLs.) With this you could vastly reduce NN interactions, which is probably good for latency reduction and scalability. You could then imagine using this mechanism for MR / Tez / Spark shuffle files ... which has been a long term project goal... Maybe we should break this idea out into another JIRA... ? happy to chat if folks want to flesh this out.

          Involving Yarn in HDFS resource management is interestingly circular. Is this needed? One would want the right abstraction to allow other solutions to be applied to Yarnless deployments.

          Colin Patrick McCabe added a comment -

          Next Wednesday at 3pm works for me. I can come by in person if you are hosting at the Hortonworks office. Alternately, we can host at Cloudera, if you like. Thanks

          Arpit Agarwal added a comment -

          How about 4/30 (next Wednesday) at 3pm PDT? We will set up a Google+ hangout or webex.

          I will also attend remotely since I am not in the Bay Area. If there is interest, we can reserve a conference room at the Hortonworks office in Palo Alto for folks to attend in person.

          Colin Patrick McCabe added a comment -

          I took a quick look at the design doc. I think the focus on "discardable" memory makes sense in light of next-gen frameworks like Spark, Tez, etc. One note: Tachyon, Spark's caching layer, does not currently incorporate the concept of RDDs, although that support is planned, as I understand. It's just caching (serialized) files at this point, and I think the semantics match up pretty well with what we're talking about here. The execution framework can re-generate the data if needed... this re-generating support does not need to be included in HDFS.

          I think that some HDFS applications will want the ability to treat multiple files as a single eviction unit... i.e., if you evict one file, you evict them all. (Things like Hive tables are multiple files, but probably ought to be treated as a single unit for caching purposes.) There are also some questions about when eviction can occur... it seems like it would be very inconvenient to do it while the file was being read. On the other hand, we probably need a timeout to prevent a selfish process (or a process on a disconnected node) from pinning something in the cache forever by keeping a file open.

          Clearly we want the ability to do things like skip checksums when reading the cached files. This will reuse a lot of the HDFS-4949 code. It's less clear what other aspects of the HDFS-4949 code we'll want to reuse. I think cache pools might be one such thing. There is a potential to reuse some of the implementation as well, such as mlocking and so forth. An mlocked file in /dev/shm could be a good way to go here.

          I am free all of next week, except for Friday. Let's schedule a webex so we can figure this stuff out.
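          The "multiple files as a single eviction unit" idea above can be sketched as a cache that tracks files by group and always evicts a whole group together. This is a toy illustration only, not HDFS code; the GroupedCache name and its API are invented for the example:

```python
from collections import OrderedDict

class GroupedCache:
    """Toy cache where files belong to named groups; evicting any
    member evicts the whole group (e.g. all files of a Hive table)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        # group name -> {file name -> size}; OrderedDict gives LRU order over groups
        self.groups = OrderedDict()

    def put(self, group, name, size):
        files = self.groups.setdefault(group, {})
        self.groups.move_to_end(group)   # touching any file makes its group most recent
        self.used += size - files.get(name, 0)
        files[name] = size
        # Evict least-recently-used groups wholesale until we fit,
        # never evicting the group we just touched.
        while self.used > self.capacity and len(self.groups) > 1:
            _evicted, members = self.groups.popitem(last=False)
            self.used -= sum(members.values())

    def contains(self, group, name):
        return name in self.groups.get(group, {})
```

Here all files of one Hive table would share a group, so evicting the table frees every one of its files at once rather than leaving a partially cached table behind.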

          Sanjay Radia added a comment -

          Please see the attached document that identifies some use cases and a proposal for using memory for intermediate data. We introduce the notion of Discardable Distributed Memory (DDM), which exploits the property that the data can be reconstructed. Further, by using HDFS files as a backing store to which DDM data is lazily written, we give the impression of a much larger memory size and also give the system an additional degree of freedom to manage the scarce memory resource. The main implementation mechanism is memory-mapped files that are lazily replicated; this mechanism provides weak persistence, which may have other direct use cases beyond DDMs.
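          The memory-mapped, lazily-persisted mechanism described above can be illustrated with a plain mmap'd file: writes land in memory (dirty pages) at memory speed, and only reach the backing file when explicitly flushed. A minimal sketch using Python's mmap module; the ddm_sketch function and single-file layout are invented for illustration, and the actual proposal operates on HDFS block replicas:

```python
import mmap
import os
import tempfile

def ddm_sketch():
    # The backing file plays the role of the lazily-written HDFS replica.
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, 4096)
    with mmap.mmap(fd, 4096) as mm:
        mm[0:5] = b"hello"   # write at memory speed; page is dirty, not yet durable
        mm.flush()           # lazy persistence step: push dirty pages to the file
    os.close(fd)
    with open(path, "rb") as f:
        data = f.read(5)     # after the flush, the data survives in the backing file
    os.unlink(path)
    return data
```

If the process (or node) dies before the flush, the backing file may hold stale data; this is exactly the weak-persistence trade-off the document describes, with the application expected to regenerate anything lost.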

          Colin Patrick McCabe added a comment -

          Thanks, Arpit. It would be nice to get the time worked out soon so everyone can fit it into their schedule.

          Arpit Agarwal added a comment -

          Hi Colin, a call sounds like a good idea and we are open to collaborating on the feature implementation too. Let's have a call next week. We would like to get an initial proposal out by then. There will be ample time to discuss the approach within the community. I will propose a date and time within a couple of days.

          Colin Patrick McCabe added a comment -

          Hey guys, how does Thursday (April 24) at 3pm-4pm sound for a webex? I can organize.

          I'd like to figure out:

          • what are the use cases for HDFS-5851
          • how is it different from HDFS-4949 (what motivates a separate implementation)
          • how it fits into our long-term plans for heterogeneous storage

          If we have time we can brainstorm about implementation (Andrew and I have actually thought about some ways of extending HDFS-4949 recently, so we'd like to share some of those ideas with the community).

          Colin Patrick McCabe added a comment -

          Let's organize a webex about this. It shouldn't take more than an hour of everyone's time.

          If the community gets involved later rather than sooner, I think this may get unpleasant (it's always unpleasant to "reject" a design doc and I want to avoid that by sharing use cases and ideas up front).

          Andrew Wang added a comment -

          > Does CCM support block-level caching, or does it just cache all blocks in a file?

          Right now, the user APIs are all in terms of files, but the backend could do single blocks pretty easily. We didn't add block-level cache directives because we feel automatic cache eviction is a better solution (and could operate at the block or subblock level).

          Looking forward to the design doc! Feel free to ping us even with preliminary usecases/ideas.

          Arpit Agarwal added a comment -

          Hi Andrew,

          > There are also plans to move towards sub-block caching. Whole-block caching is wasteful for columnar formats like ORC and Parquet. With sub-block caching, automatic cache replacement looks a lot more attractive (another planned feature). These are both things we can support with HDFS-4949's infrastructure. I'm not sure about with HSM.

          Does CCM support block-level caching, or does it just cache all blocks in a file?

          > Anyway, either pulling this towards HDFS-4949 or vice versa, we should figure out these details before moving ahead. I'll echo Colin's desire for a meeting to discuss this. We're willing to host at our Palo Alto office.

          Thanks for the offer! We are discussing use cases and the proposed API at a high level. I am not sure there is much overlap between CCM and our work, and I expect them to solve different use cases. However, I am also open to discussing how and whether we can align them more closely once we have shared our initial proposal.

          Andrew Wang added a comment -

          I'd really like to integrate this with HDFS-4949 where possible. One concern is that we should avoid having another pool of memory carved off from the cluster. HDFS-4949's cache pools were designed to eventually integrate with YARN, but this might introduce another separate pool for a memory quota, putting us back in the same place.

          There are also plans to move towards sub-block caching. Whole-block caching is wasteful for columnar formats like ORC and Parquet. With sub-block caching, automatic cache replacement looks a lot more attractive (another planned feature). These are both things we can support with HDFS-4949's infrastructure. I'm not sure about with HSM.

          It'd also be nice if apps could ZCR these memory-only replicas, ideally reusing the existing auto-ZCR infrastructure.

          Anyway, either pulling this towards HDFS-4949 or vice versa, we should figure out these details before moving ahead. I'll echo Colin's desire for a meeting to discuss this. We're willing to host at our Palo Alto office.

          Arpit Agarwal added a comment -

          Hi Eric,

          Unlike Tachyon, we won't deal with data regeneration or checkpointing, leaving them to the application. We are still discussing use cases, and this task got pushed out due to the 2.4 release.

          • No durability or replication guarantees.
          • Lost files/blocks will be discarded.
          • The application is responsible for timely checkpointing, i.e., moving blocks to persistent storage.
          eric baldeschwieler added a comment -

          A very interesting design space!

          • How does this relate to Tachyon's design goals?
          • What is the target use case?
          • What durability / replication guarantees are you providing?
          • What happens when RAM is exhausted?

          Assuming low durability, what happens when a file or block is lost?

          Colin Patrick McCabe added a comment -

          The HDFS-4949 work was centered around caching small, often-used files that were already stored durably to disk. If we supported a "temporary / non-durable" storage tier, there would be some overlap with internal implementation, but probably not much with interface. We should probably have a conference call about this at some point.

          Arpit Agarwal added a comment -

          I have not given much thought to the specifics except that it would fit within the Heterogeneous Storage framework.

          Spilling writes is an interesting idea. It could be done with extensions to Storage Preferences. Alternatively, we could avoid spilling writes silently and instead limit memory consumption with the quota extensions we described in HDFS-2832; the DFSClient or the app could handle the failure.

          Do you see overlap with your CCM work?

          Colin Patrick McCabe added a comment -

          Hi Arpit,

          I don't know if you were present for some of the discussions around in-memory caching and HDFS-4949. See https://issues.apache.org/jira/browse/HDFS-4949?focusedCommentId=13707389&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13707389 for some discussion around this.

          In the past, we've talked about having a "transient tier" for files that we write, but don't necessarily want to put on-disk. I think many applications would choose to write to a tier that would put stuff into memory if space was available, but if not, would spill it to disk. It's crucial to implement spilling, though. Otherwise, we make the applications worry about how much memory is left on the DataNode, which I think would lead to limited adoption. In this sense, memory gets used as a temporary area during a job, not so much a "storage area" (at least that's how I look at it.) Does this line up with your thinking in this area?
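          The "write to memory if space is available, otherwise spill to disk" behavior described above could look roughly like this from a client's perspective. This is a hypothetical TieredWriter helper invented for illustration; in HDFS the tier choice would live in the DataNode/DFSClient, not in the application:

```python
import os

class TieredWriter:
    """Toy two-tier writer: prefer a RAM-backed directory (think /dev/shm),
    silently fall back to a disk directory once the memory budget is spent."""

    def __init__(self, mem_dir, disk_dir, mem_budget_bytes):
        self.mem_dir = mem_dir
        self.disk_dir = disk_dir
        self.mem_free = mem_budget_bytes

    def write(self, name, data):
        if len(data) <= self.mem_free:
            target = self.mem_dir
            self.mem_free -= len(data)
        else:
            # Transparent spill: the application never sees an out-of-memory
            # failure, it just gets disk speed instead of memory speed.
            target = self.disk_dir
        path = os.path.join(target, name)
        with open(path, "wb") as f:
            f.write(data)
        return path
```

The key property is the one argued for above: the application never has to reason about how much memory is left on the DataNode, because running out of the fast tier degrades performance rather than failing the write.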


            People

            • Assignee: Arpit Agarwal
            • Reporter: Arpit Agarwal
            • Votes: 5
            • Watchers: 80
