Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: hdfs-client, namenode
    • Labels:
      None
    • Tags:
      ttl

      Description

       In production environments we often have a scenario like this: we want to back up files on HDFS for some time and then have them deleted automatically. For example, we keep only one day's logs on local disk due to limited disk space, but we need to keep about one month's logs in order to debug program bugs, so we keep all the logs on HDFS and delete any logs older than one month. This is a typical scenario for HDFS TTL, so we propose that HDFS support TTL.

      Following are some details of this proposal:
       1. HDFS can support a TTL on a specified file or directory
       2. If a TTL is set on a file, the file will be deleted automatically after the TTL expires
       3. If a TTL is set on a directory, its child files and directories will be deleted automatically after the TTL expires
       4. A child file/directory's TTL configuration overrides its parent directory's
       5. A global configuration option is needed to control whether deleted files/directories go to the trash
       6. A global configuration option is needed to control whether a directory with a TTL is deleted once the TTL mechanism has emptied it

      1. HDFS-TTL-Design.pdf
        106 kB
        Zesheng Wu
      2. HDFS-TTL-Design -2.pdf
        115 kB
        Zesheng Wu
      3. HDFS-TTL-Design-3.pdf
        122 kB
        Zesheng Wu

        Issue Links

          Activity

          Zesheng Wu added a comment -

           Thanks Doug Cutting.
           I will move the trash emptier into the TTL daemon after HDFS-6525 and HDFS-6526 are resolved.

          Doug Cutting added a comment -

          But the trash emptier runs inside NN as a daemon thread, instead of a separate daemon process.

          The trash emptier was embedded in the NN mostly just to avoid making folks have to manage another daemon process. However, embedding the emptier has many of the hazards that Chris and Colin described above for embedding TTL. So, if we add a separate daemon process for TTL, then we might also have that process empty the trash and remove the embedded emptier.

          Zesheng Wu added a comment -

           Hi guys, I've uploaded initial implementations to HDFS-6525 and HDFS-6526 separately; I hope you can take a look, and any comments will be appreciated. Thanks in advance.

          Zesheng Wu added a comment -

           Updated the document according to the implementation.

          Steve Loughran added a comment -

          It's sort of ridiculous to require YARN running for what is fundamentally a file system problem. It simply doesn't work in the real world.

          Allen, that's like saying "it's ridiculous to require bash scripts to perform what is fundamentally a unix filesystem problem". One is data, the other is the mechanism to run code near the data. I don't try and hide any local /tmp cleanup init.d scripts inside an ext3 plugin, after all.
           YARN:

          1. handles security by having you include kerberos tickets in the launch.
          2. stops you having to choose a specific server to run this thing (hence point of failure).
          3. lets you scale up when needed.
          Colin Patrick McCabe added a comment -

          Plus, in the places that need this the most, one has to deal with getting what essentially becomes a critical part of uptime getting scheduled, competing with all of the other things running.... and, to remind you, to just delete files. It's sort of ridiculous to require YARN running for what is fundamentally a file system problem. It simply doesn't work in the real world.

          In the examples you give, you're already using YARN for Hive and Pig, so it's already a critical part of the infrastructure. Anyway, you should be able to put the cleanup job in a different queue. It's not like YARN is strictly FIFO.

          One eventually gets to the point that the auto cleaner job is now running hourly just so /tmp doesn't overrun the rest of HDFS. Because these run outside of HDFS, they are slow and tedious and generally fall in the lap of teams that don't do Java so end up doing all sorts of squirrely things to make these jobs work. This also sucks.

          Well, presumably the implementation in this JIRA won't be done by a team that "doesn't do Java" so we should skip that problem, right?

          The comments about /tmp are, I think, another example of how this needs to be highly configurable. Rather than modifying Hive or Pig to set TTLs on things, we probably want to be able to configure the scanner to look at everything under /tmp. Perhaps the scanner should attach a TTL to things in /tmp that don't already have one.
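
           A rough sketch of that last idea - a scanner attaching a default TTL to /tmp entries that don't already carry one - assuming TTLs live in a "user.ttl" xattr; the xattr name and the "7d" default are illustrative only, not an agreed format:

           import java.io.IOException;
           import java.nio.charset.StandardCharsets;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.FileStatus;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           public class TmpDefaultTtl {
             public static void main(String[] args) throws IOException {
               FileSystem fs = FileSystem.get(new Configuration());
               byte[] defaultTtl = "7d".getBytes(StandardCharsets.UTF_8);  // hypothetical default
               for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
                 // Only tag entries that do not carry a TTL policy yet.
                 if (fs.getXAttrs(status.getPath()).get("user.ttl") == null) {
                   fs.setXAttr(status.getPath(), "user.ttl", defaultTtl);
                 }
               }
             }
           }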

          Running this under YARN has an intuitive appeal to the upstream developers, since YARN is a scheduler. If we write our own scheduler for this inside HDFS, we're kind of duplicating some of that work, including the monitoring, logging, etc. features. I think Steve's comments (and a lot of the earlier comments) reflect that. Of course, to users not already using YARN, a standalone daemon might seem more appealing.

          The proposal to put this in the balancer seems like a reasonable compromise. We can reuse some of the balancer code, and that way, we're not adding another daemon to manage. I wonder if we could have YARN run the balancer periodically? That might be interesting.

          Zesheng Wu added a comment -

          Steve Loughran Thanks for your feedback.
           We have discussed whether to use an MR job or a standalone daemon, and most people upstream have come to an agreement that a standalone daemon is reasonable and acceptable. You can go through the earlier discussion.

          Allen Wittenauer Thanks for your feedback.
           Your suggestion is really valuable and strengthens our confidence in implementing it as a standalone daemon.

          Allen Wittenauer added a comment -

          Let's take a step back and give a more concrete example: /tmp

           /tmp is the plague of Hadoop operations teams everywhere. Between Hive and Pig leaving bits hanging around because their clients can't properly handle signals, and frameworks that by default leave (seemingly) as much crap lying around as they possibly can because someone somewhere might want to debug it someday, /tmp is the cesspool of HDFS. Every competent admin ends up writing some sort of auto-/tmp-cleaner because of these issues. At scale, /tmp can have hundreds of TB and millions of objects in it in less than 24 hours. It sucks.

          One eventually gets to the point that the auto cleaner job is now running hourly just so /tmp doesn't overrun the rest of HDFS. Because these run outside of HDFS, they are slow and tedious and generally fall in the lap of teams that don't do Java so end up doing all sorts of squirrely things to make these jobs work. This also sucks.

           Now, I can see why using an MR job is appealing (easy!), but it isn't very effective. For one, we've already been here once and the result was distch. Hell, there was a big fight just to get distch written and that - years and years later! - still isn't documented because of how slow it works. Throw in directories like /tmp that simply have WAY too much churn and one can see that depending upon MR to work here just isn't viable. Plus, in the places that need this the most, one has to deal with getting what essentially becomes a critical part of uptime getting scheduled, competing with all of the other things running.... and, to remind you, to just delete files. It's sort of ridiculous to require YARN running for what is fundamentally a file system problem. It simply doesn't work in the real world.

           While at Hadoop Summit, a bunch of us sat around a table and were talking about this issue with regards specifically to /tmp. (We didn't know about this JIRA, BTW.) The solution we came up with was basically a service that would bootstrap by reading fsimage and then reading the edits stream by sending the audit information to Kafka. One of the big advantages of this is that we can get near real-time updates of the parts of the file system we need to operate on. Since we only care about a subsection of the file system, the memory requirements are significantly lower and it might be possible to coalesce deletes in a smart way to cut back on RPCs. I suspect it wouldn't be hard to generalize this type of solution to handle multiple use cases. But for me, this is critical admin functionality that HDFS needs desperately and throwing the problem to MR just isn't workable.

          Steve Loughran added a comment -

          My comments

          1. this can be done as an MR job.
          2. If you are worried about excessive load, start exactly one mapper, and consider throttling requests. As some object stores throttle heavy load & reject on a very high DELETE rate, throttling is going to be needed for anything that works against them.
           3. you can then use Oozie as the scheduler.
           4. MR restart handles failures: you just re-enumerate the directories and deleted files don't show up.
           5. If you really, really can't do it as MR, write it as a one-node YARN app, for which I'd recommend Apache Twill as the starting point. In fact, this project would make for a nice example.

           Don't rush to write a new service here for an intermittent job; that just adds a new cost: "a service to install and monitor". Especially when you consider that this new service will need:

          1. a launcher entry point
          2. tests
          3. commitment from the HDFS team to maintain it

           We can implement TTL within a MapReduce job that is similar to DistCp. We could run this MapReduce job over and over again, or nightly or weekly, to delete the expired files and directories.

           Yes, and schedule with Oozie

           (1) Advantages:
           The major advantage of the MapReduce framework is concurrency control; if we want to run multiple tasks concurrently, choosing a MapReduce approach will ease concurrency control.

          There are other advantages

          1. The MR job will be simple to write and can be submitted remotely.
          2. it's trivial to test and therefore maintain.
          3. no need to wait for a new version of Hadoop. You can evolve it locally.
          4. different users, submitting jobs with different kerberos tickets can work on their own files securely.
          5. there's no need to install and maintain a new service.

           (2) Disadvantages:
           For implementing the TTL functionality, one task is enough; multiple tasks will put too much contention and load on the NameNode.

          1. Demonstrate this by writing an MR job and assessing its load when you have a throttled executor.

             On the other hand, using a MapReduce job will introduce additional dependencies and additional overheads.

          1. additional dependencies? In a cluster with MapReduce installed? The only additional dependency is the JAR with the mapper and the reducer.
          2. What "additional overheads"? Are they really any less than running another service in your cluster, with its own classpath, failure modes, security needs?

          My recommendation, before writing a single line of a new service, is to write it as an MR job. You will find it easy to write and maintain; server load is handled by making sleep time a configurable parameter.

           If you can then actually demonstrate that this is inadequate on a large cluster, then consider a service. But start with MapReduce first. If you haven't written an MR job before, don't worry - it doesn't take that long to learn, and having done it you'll understand your users' workflow better.
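
           For reference, a minimal sketch of the kind of single-mapper cleanup job described above, assuming the expired paths have already been enumerated into the job's input (one path per line); the sleep-based throttle and the "ttl.delete.sleep.ms" key are illustrative, not an existing configuration:

           import java.io.IOException;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.io.LongWritable;
           import org.apache.hadoop.io.NullWritable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapreduce.Mapper;

           /** Deletes each path in its input split, sleeping between deletes to throttle NN load. */
           public class TtlDeleteMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
             private FileSystem fs;
             private long sleepMs;

             @Override
             protected void setup(Context context) throws IOException {
               Configuration conf = context.getConfiguration();
               fs = FileSystem.get(conf);
               sleepMs = conf.getLong("ttl.delete.sleep.ms", 10);  // illustrative key
             }

             @Override
             protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
               Path expired = new Path(value.toString().trim());
               if (fs.delete(expired, true)) {              // recursive delete
                 context.write(value, NullWritable.get());  // record what was removed
               }
               Thread.sleep(sleepMs);                       // crude rate limiting
             }
           }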

          Zesheng Wu added a comment -

          I filed two sub-tasks to track the development of this feature.

          Zesheng Wu added a comment -

          Tsz Wo Nicholas Sze, Thanks for your valuable suggestions.

          Using xattrs for TTL is a good idea. Do we really need ttl in milliseconds? Do you think that the daemon could guarantee such accuracy? We don't want to waste namenode memory space to store trailing zeros/digits for each ttl. How about supporting symbolic ttl notation, e.g. 10h, 5d?

           Yes, I agree with you that the daemon can't guarantee millisecond accuracy, and in fact there's no need to guarantee such accuracy. As you suggested, we can use encoded bytes to save NN memory.

          The name "Supervisor" sounds too general. How about calling it "TtlManager" for the moment? If there are more new features added to the tool, we may change the name later.

          OK, "TtlManager" is more suitable for the moment.

           For setting ttl on a directory foo, write permission on the parent directory of foo is not enough. Namenode also checks rwx for all subdirectories of foo for recursive delete.

           Nice catch. If we want to conform to the delete semantics mentioned by Colin, we should check the subdirectories recursively.

          BTW, permission could be changed from time to time. A user may be able to delete a file/dir at the time of setting TTL but the same user may not have permission to delete the same file/dir when the ttl expires.

           The deletion will be done by a superuser (which the "TtlManager" runs as), so it seems this is not a problem?

          I suggest not to check additional permission requirement on setting ttl but run as the particular user when deleting the file. Then we need to add username to the ttl xattr.

           Good point, but adding the username to the ttl xattr requires more NN memory; we should weigh whether it's worth doing.

          Tsz Wo Nicholas Sze added a comment -

          > How about supporting symbolic ttl notation, e.g. 10h, 5d?

          The actual value in xattr could be encoded as two bytes – 3 bits for the unit (year, month, week, day, hour, minute, second, millisecond) and 13 bits for value. If we store it using milliseconds, it probably needs eight bytes.
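
           For illustration, a sketch of that two-byte packing - 3 bits of unit, 13 bits of value; the unit ordering and bit layout here are assumptions, not a committed format:

           public final class TtlCodec {
             public enum Unit { MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, YEAR }

             /** Packs a unit and a 13-bit value into two bytes: uuuv vvvv vvvv vvvv. */
             public static byte[] encode(Unit unit, int value) {
               if (value < 0 || value > 0x1FFF) {
                 throw new IllegalArgumentException("value must fit in 13 bits: " + value);
               }
               int packed = (unit.ordinal() << 13) | value;
               return new byte[] { (byte) (packed >>> 8), (byte) packed };
             }

             public static Unit unit(byte[] encoded) {
               return Unit.values()[(encoded[0] & 0xFF) >>> 5];          // top 3 bits
             }

             public static int value(byte[] encoded) {
               return ((encoded[0] & 0x1F) << 8) | (encoded[1] & 0xFF);  // low 13 bits
             }
           }

           For example, encode(Unit.DAY, 30) stores a 30-day TTL ("30d") in two bytes instead of the eight bytes a millisecond count would need.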

          Tsz Wo Nicholas Sze added a comment -

          Checked the design doc. It looks good. Some comments:

          • "Standalone Daemon Approach ... To Implement a completely new standalone daemon can rarely reuse existing code, will need lots of work to do."
            I don't agree. We may refactor Balancer or other tools if necessary.
          • Using xattrs for TTL is a good idea. Do we really need ttl in milliseconds? Do you think that the daemon could guarantee such accuracy? We don't want to waste namenode memory space to store trailing zeros/digits for each ttl. How about supporting symbolic ttl notation, e.g. 10h, 5d?
          • The name "Supervisor" sounds too general. How about calling it "TtlManager" for the moment? If there are more new features added to the tool, we may change the name later.
           • For setting ttl on a directory foo, write permission on the parent directory of foo is not enough. Namenode also checks rwx for all subdirectories of foo for recursive delete. BTW, permission could be changed from time to time. A user may be able to delete a file/dir at the time of setting TTL but the same user may not have permission to delete the same file/dir when the ttl expires.
            I suggest not to check additional permission requirement on setting ttl but run as the particular user when deleting the file. Then we need to add username to the ttl xattr.
          Zesheng Wu added a comment -

           Updated the documents to address Colin's suggestions.
           Thanks, Colin, for your valuable suggestions.

          Zesheng Wu added a comment -

          Even if it's not implemented at first, we should think about the configuration required here. I think we want the ability to email the admins when things go wrong. Possibly the notifier could be pluggable or have several policies. There was nothing in the doc about configuration in general, which I think we need to fix. For example, how is rate limiting configurable? How do we notify admins that the rate is too slow to finish in the time given?

          OK, I will update the document and post a new version soon.

          You can't delete a file in HDFS unless you have write permission on the containing directory. Whether you have write permission on the file itself is not relevant. So I would expect the same semantics here (probably enforced by setfacl itself).

           That's reasonable; I'll spell it out clearly in the document.

          Colin Patrick McCabe added a comment -

           You mean that we scan the whole namespace first and then split it into 5 pieces according to the hash of the path? Why don't we just complete the work during the first scan? If I misunderstood your meaning, please point it out.

          You need to make one RPC for each file or directory you delete. In contrast, when listing a directory you make only one RPC for every dfs.ls.limit elements (by default 1000). So if you have 5 workers all listing all directories, but only calling delete on some of the files, you still might come out ahead in terms of number of RPCs, provided you had a high ratio of files to directories.
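
           To put rough, purely illustrative numbers on that: with 1,000,000 files spread over 2,000 directories and dfs.ls.limit at its default of 1000, a full listing costs on the order of a few thousand RPCs, so even five workers each doing their own full listing add only around ten to fifteen thousand listing RPCs; if 100,000 of those files are expired, the 100,000 delete RPCs dominate either way, and spreading those deletes across workers is where the parallelism pays off.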

          There are other ways to partition the namespace which are smarter, but rely on some knowledge of what is in it, which you'd have to keep track of.

           A single-node design will work for now, though, considering that you probably want rate-limiting anyway.

           For simplicity, in the initial version we will use logs to record which files/directories are deleted by TTL, and any errors during the deletion process.

          Even if it's not implemented at first, we should think about the configuration required here. I think we want the ability to email the admins when things go wrong. Possibly the notifier could be pluggable or have several policies. There was nothing in the doc about configuration in general, which I think we need to fix. For example, how is rate limiting configurable? How do we notify admins that the rate is too slow to finish in the time given?

           It doesn't need to be an administrator command; users can only setTtl on files/directories that they have write permission on, and can getTtl on files/directories that they have read permission on.

          You can't delete a file in HDFS unless you have write permission on the containing directory. Whether you have write permission on the file itself is not relevant. So I would expect the same semantics here (probably enforced by setfacl itself).

          Zesheng Wu added a comment -

          Thanks Colin Patrick McCabe for your feedback.

          For the MR strategy, it seems like this could be parallelized fairly easily. For example, if you have 5 MR tasks, you can calculate the hash of each path, and then task 1 can do all the paths that are 0 mod 5, task 2 can do all the paths that are 1 mod 5, and so forth. MR also doesn't introduce extra dependencies since HDFS and MR are packaged together.

           You mean that we scan the whole namespace first and then split it into 5 pieces according to the hash of the path? Why don't we just complete the work during the first scan? If I misunderstood your meaning, please point it out.

          I don't understand what you mean by "the mapreduce strategy will have additional overheads." What overheads are you foreseeing?

           Possible overheads: starting a MapReduce job needs to split the input, start an AppMaster, and collect results from random machines (perhaps 'overheads' is not the proper word here).

          I don't understand what you mean by this. What will be done automatically?

          Here "automatically" means we do not have to rely on external tools, the daemon itself can manage the work well.

          How are you going to implement HA for the standalone daemon?

           Good point. As you suggested, one approach is to save the state in HDFS and simply restart the daemon when it fails. But managing the state is complex work, and I am considering how to simplify this. One simpler approach is to treat the daemon as stateless and simply restart it when it fails: we needn't checkpoint, and can just scan from the beginning when it restarts. Because we can require that the work the daemon does is idempotent, starting from the beginning is harmless. Possible drawbacks of the latter approach are that it may waste some time and may delay the work, but they are acceptable.

          I don't see a lot of discussion of logging and monitoring in general. How is the user going to become aware that a file was deleted because of a TTL? Or if there is an error during the delete, how will the user know?

           For simplicity, in the initial version we will use logs to record which files/directories are deleted by TTL, and any errors during the deletion process.

          Does this need to be an administrator command?

           It doesn't need to be an administrator command; users can only setTtl on files/directories that they have write permission on, and can getTtl on files/directories that they have read permission on.

          Colin Patrick McCabe added a comment -

          For the MR strategy, it seems like this could be parallelized fairly easily. For example, if you have 5 MR tasks, you can calculate the hash of each path, and then task 1 can do all the paths that are 0 mod 5, task 2 can do all the paths that are 1 mod 5, and so forth. MR also doesn't introduce extra dependencies since HDFS and MR are packaged together.
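
           A sketch of that partitioning, assuming each of N workers walks the same listing and claims only its own share of paths; the class and names are illustrative:

           import org.apache.hadoop.fs.Path;

           /** Task i of numTasks claims a path iff the path's hash lands in bucket i. */
           public final class PathPartitioner {
             private final int taskIndex;
             private final int numTasks;

             public PathPartitioner(int taskIndex, int numTasks) {
               this.taskIndex = taskIndex;
               this.numTasks = numTasks;
             }

             public boolean mine(Path path) {
               // Keep the modulo non-negative even when hashCode() is negative.
               int bucket = (path.toString().hashCode() % numTasks + numTasks) % numTasks;
               return bucket == taskIndex;
             }
           }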

          I don't understand what you mean by "the mapreduce strategy will have additional overheads." What overheads are you forseeing?

          It is true that you need to avoid overloading the NameNode. But this is a concern with any approach, not just the MR one. It would be good to see a section on this. I think the simplest way to do it is to rate-limit RPCs to the NameNode to a configurable rate.
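
           One way to realize that configurable rate limit, sketched with Guava's RateLimiter (already on Hadoop's classpath); the wrapper class and the permits-per-second figure are illustrative:

           import java.io.IOException;
           import com.google.common.util.concurrent.RateLimiter;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           /** Wraps delete calls so the NameNode sees at most maxRpcPerSecond requests. */
           public class ThrottledDeleter {
             private final FileSystem fs;
             private final RateLimiter limiter;

             public ThrottledDeleter(FileSystem fs, double maxRpcPerSecond) {
               this.fs = fs;
               this.limiter = RateLimiter.create(maxRpcPerSecond);  // e.g. 100.0
             }

             public boolean delete(Path path) throws IOException {
               limiter.acquire();             // blocks until a permit is available
               return fs.delete(path, true);  // recursive delete, one NN RPC
             }
           }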

          [for the standalone daemon] The major advantage of this approach is that we don’t need any extra work to finish the TTL work, all will be done in the daemon automatically.

          I don't understand what you mean by this. What will be done automatically?

          How are you going to implement HA for the standalone daemon? I suppose if all the state is kept in HDFS, you can simply restart it when it fails. However, it seems like you need to checkpoint how far along in the FS you are, so that if you die and later get restarted, you don't have to redo the whole FS scan. This implies reading directories in alphabetical order, or similar. You also need to somehow record when the last scan was, perhaps in a file in HDFS.
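
           A small sketch of the "record when the last scan was, perhaps in a file in HDFS" idea; writing to a temporary file and renaming keeps the update close to atomic. The marker paths are assumptions:

           import java.io.IOException;
           import java.nio.charset.StandardCharsets;
           import org.apache.hadoop.fs.FSDataOutputStream;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           /** Persists the completion time of the last full scan under a well-known path. */
           public class ScanMarker {
             private static final Path MARKER = new Path("/system/ttl/last-scan");         // assumed location
             private static final Path MARKER_TMP = new Path("/system/ttl/last-scan.tmp");

             public static void recordScanFinished(FileSystem fs, long finishedAtMillis) throws IOException {
               try (FSDataOutputStream out = fs.create(MARKER_TMP, true)) {
                 out.write(Long.toString(finishedAtMillis).getBytes(StandardCharsets.UTF_8));
               }
               fs.delete(MARKER, false);       // the old marker may not exist yet; ignore the result
               fs.rename(MARKER_TMP, MARKER);  // near-atomic publish of the new timestamp
             }
           }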

          I don't see a lot of discussion of logging and monitoring in general. How is the user going to become aware that a file was deleted because of a TTL? Or if there is an error during the delete, how will the user know? Logging is one choice here. Creating a file in HDFS is another.

          The setTtl command seems reasonable. Does this need to be an administrator command?

          Zesheng Wu added a comment -

          An initial version of design doc.

          Hangjun Ye added a comment -

           Thanks Colin. We will start drafting a design doc and ask for you guys' help reviewing it.

           Yes, xattrs take away the big burden of storing the policy; the major question left is where to run the logic.

           Besides these 3 options, another related piece might be the "trash". Currently trash is implemented as a client-side capability: the trash cleanup logic (trash emptier) depends on FileSystem to operate on the namespace and is basically a client-side function. But the trash emptier runs inside NN as a daemon thread, instead of a separate daemon process. I guess it interacts with NN via RPC even though it runs inside NN.

           We can observe some similarities between trash, the balancer, and the proposed TTL: they mainly need data from the NN; they could be implemented as client-side capabilities (via RPC); and they need to run periodically. So could we unify all of these in one framework/daemon? It also echoes Haohui's earlier points. And if it's implemented cleanly enough, the user could optionally run it inside the NN as a daemon thread to have fewer processes to maintain, as long as the user is willing to take the risk of running additional logic inside the NN (without changing the NN's logic for this, as it still interacts with the NN like a client).

           That's just a preliminary idea; we might still want to have the TTL as a separate daemon at first, as that's the most straightforward. Let's discuss more after we have the design doc.

          Colin Patrick McCabe added a comment -

          The xattrs branch was merged to trunk two weeks ago. Since trunk is where development happens anyway, you should be able to start now if you like.

          Maybe post a design doc first if you want feedback. It seems like the big question to be answered is: where is this going to live? We have had proposals for doing this as an MR job, a separate daemon, or part of the balancer. They all have pros and cons... it would be good to write down the benefits and disadvantages of each option before making a choice.

          I think any of these 3 options is possible and I wouldn't vote against any of them. It's up to you. If it's a separate daemon, at minimum, we can put it in contrib/. But you may find that some options have a higher maintenance burden on you. I also think that users don't like running more daemons if they can help it. But perhaps there is something I haven't thought of that makes a separate daemon a good choice.

          Hangjun Ye added a comment -

           Thanks Colin! That's exactly what we want. It seems it's on the way to being merged to branch-2; we will wait for it.

          Colin Patrick McCabe added a comment -

           So I'm just wondering if it's possible to add an "opaque feature" in INode to store arbitrary bytes? The NN just stores it and doesn't interpret it. As an analogy, HBase supports "tags" to store arbitrary metadata at a cell: https://issues.apache.org/jira/browse/HBASE-8496

          It sounds like extended attributes (xattrs) might work here. They were recently implemented in HDFS-2006 and subtasks. They basically let you associate some arbitrary key/value pairs with each inode. Check out https://issues.apache.org/jira/secure/attachment/12644341/HDFS-XAttrs-Design-3.pdf
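
           A minimal illustration of those xattr calls (added in HDFS-2006), just to show where a TTL policy could live; the "user.ttl" key, the "30d" value, and the path are examples, not a settled format:

           import java.nio.charset.StandardCharsets;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           public class TtlXAttrExample {
             public static void main(String[] args) throws Exception {
               FileSystem fs = FileSystem.get(new Configuration());
               Path dir = new Path("/backup/logs");  // hypothetical path

               // Attach a TTL policy as a user-namespace extended attribute.
               fs.setXAttr(dir, "user.ttl", "30d".getBytes(StandardCharsets.UTF_8));

               // Later, a scanner reads it back to decide whether the entry has expired.
               String ttl = new String(fs.getXAttr(dir, "user.ttl"), StandardCharsets.UTF_8);
               System.out.println(dir + " has ttl " + ttl);
             }
           }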

          Hangjun Ye added a comment -

          Thanks Haohui and Colin.

           The balancer or a balancer-like standalone daemon sounds like a feasible approach to us. A special requirement of the TTL cleanup is that we need persistent storage to hold all the TTL policies set by users, which the balancer and DistCp don't require. It would be nice if the namenode could store such information so we don't have to find somewhere else.

           So I'm just wondering if it's possible to add an "opaque feature" in INode to store arbitrary bytes? The NN just stores it and doesn't interpret it. As an analogy, HBase supports "tags" to store arbitrary metadata at a cell: https://issues.apache.org/jira/browse/HBASE-8496

           Then we could have external tools/daemons let end-users set their TTL policies and do the cleanup logic. The only change to the NN is to add the new feature and expose APIs to set/get it; the complicated and volatile logic (metadata encoding, interpretation, cleanup) is done outside the NN. And the change might have much broader usage than just TTL.

          Any thoughts?

          Colin Patrick McCabe added a comment -

          I'm -1 on the idea of putting this in the NameNode. Let's see if we can work together to figure out where the best place for it is, though. Can you comment on why MR is not an option for you? I am concerned that there will be a lot of wheel reinvention if we don't use MR (authentication, resource management, scheduling, etc. etc.) Why not do as DistCp does? As Haohui said, another option is the balancer or a completely standalone daemon.

          Haohui Mai added a comment -

          I think the comments against implementing it in NN are legit. Popping up one level, I'm wondering what is the best approach to meet the following requirements:

           1. Fine-tune the behavior of HDFS, which requires information from HDFS's internal data structures.
           2. Perform the above task without MapReduce, to simplify the operations of the cluster.

           To meet the above requirements, today it looks to me like there is no way other than making massive changes in HDFS.

           What I'm wondering is whether it is possible to architect the system to make things easier. For example, is it possible to generalize the architecture of the balancer we have today to accomplish these types of tasks? From a very high level it looks to me that most of the code can sit outside of the NN while meeting the above requirements. Since this is aiming for advanced usages, there is more freedom in the design of the architecture. For instance, the architecture might choose to expose the details of the implementation and not guarantee compatibility (like an Exokernel type of system).

          Thoughts?

          Hangjun Ye added a comment -

          Thanks Colin for your summarization, I'd like to try to address some of your concerns and questions:

          security / correctness concerns: it's easy to make a mistake that could bring down the NameNode or entire FS

          I agree that's the cost, the developer must be careful and guarantee the code quality.

          non-generality to systems using s3 or another FS in addition to HDFS

Yes, it's only applicable to HDFS. I guess snapshots are only applicable to HDFS too (I could be wrong here, as I haven't read the snapshot code), so it shouldn't bring much confusion.

          issues with federation (which NN does the cleanup? How do you decide?)

Each NN only takes care of the cleanup of files/directories in its own namespace. Let's consider TTL as an attribute attached to files/directories; there's not much difference between federated and non-federated configurations.

          complexities surrounding our client-side Trash implementation and our server-side snapshots

Not much difference whether it's implemented inside or outside the NN.

          configuration burden on sysadmins

We need to think about the total cost of ownership. Implementing it inside the NN increases HDFS's own configuration burden for sure, but implementing it in a separate system just moves the burden from HDFS to a new system, which would have a higher total cost in general.

          inability to change the cleanup code without restarting the NameNode

Yes, that's the cost, but it should be minor. Users might change TTL policies frequently according to their requirements, but the cleanup code shouldn't change frequently (unless the implementation is crappy).

          HA concerns (need to avoid split-brain or lost updates)

That's a good question. We haven't thought this over; it seems the cleanup code should only run on the active NN, as the standby doesn't have the latest updates and can't initiate edits.
It shouldn't introduce split-brain, as it doesn't change the NN's core flow, but it should be implemented carefully anyway.

          error handling (where do users find out about errors?)

I haven't thought of any runtime errors (at the cleanup stage) that the end users need to be notified of. It should be the sysadmin who cares about errors at this stage, and he/she could find them in the logs. For errors when users set a TTL through the command line or APIs, the users should be notified directly.

          semantics: disappearing or time-limited files is an unfamiliar API, not like the traditional FS APIs we usually implement

First, there's not much difference whether it's implemented inside or outside the NN. Moreover, as long as users have the requirement of TTL-based cleanup, it shouldn't be difficult for them to accept such an API.

          Making this pluggable doesn't fix any of those problems, and it adds some more:

The motivation isn't to fix possible problems of implementing the TTL policy on the server side; it's to separate the mechanism from specific jobs. It provides an elegant approach to implementing such an extension to the NN and makes the common part of such extensions reusable.

          The only points I've seen raised in favor of doing this in the NameNode are:...

          IMHO, the major points for doing this in NN are:

1. it's a more natural way for end-users: in most cases they don't have to interact with HDFS directly, but they would have to resort to another system for the TTL requirement.
2. lower maintenance cost (and possibly lower implementation cost too, but that depends on the current state of the NN).

          To the second point, HBase doesn't use coprocessors for cleanup jobs... it uses them for things like secondary indices, a much better-defined problem.

The HBase coprocessor is just an analogy... possibly not a good one, but I can't think of a better one right now. HBase could use coprocessors for cleanup jobs. HBase's default cleanup policies are "Number of Versions" and "TTL", which are configured per Column Family. If you have a special requirement to clean up cells based on their content, for example using the value of a specific column as the "Number of Versions" to keep, you could do it using a coprocessor. You could do the same thing in an MR job for sure. I'm not saying using coprocessors is a good practice in general, but for some use cases it might be.

A little bit of background about what we are doing: both Zesheng and I are from Xiaomi, a fast-growing mobile internet company in China. We are on a team that supports the company's data infrastructure using the open-source Hadoop ecosystem, and our role might be similar to that of some teams at Facebook. We make improvements to the open-source software per the requirements of our products and would like to contribute our improvements back to the community. We have contributed quite a few patches to the HBase community, and two members of our team, Liang Xie and Honghua Feng, became HBase committers recently. We improve HDFS at the same time and are also happy to collaborate with the community.

For this specific feature proposal, a NN-side TTL implementation and a general NN extension mechanism, the feasibility isn't very clear to us, as it's just an idea so far. We'd like to spend time investigating its feasibility further. It's still preferable if feasible. If we encounter insurmountable technical challenges, we will give up for sure. So how about keeping this JIRA issue open for now (we might open another JIRA issue to track the general NN extension mechanism), and we will get back after we do the investigation? Whatever approach we choose eventually, we always appreciate you guys' help in working out the solution.

          Colin Patrick McCabe added a comment -

          Chris, Andrew, and I have brought up a lot of reasons why this probably doesn't make sense in the NameNode.

          Just to summarize:

          • security / correctness concerns: it's easy to make a mistake that could bring down the NameNode or entire FS
          • non-generality to systems using s3 or another FS in addition to HDFS
          • issues with federation (which NN does the cleanup? How do you decide?)
          • complexities surrounding our client-side Trash implementation and our server-side snapshots
          • configuration burden on sysadmins
          • inability to change the cleanup code without restarting the NameNode
          • HA concerns (need to avoid split-brain or lost updates)
          • error handling (where do users find out about errors?)
          • semantics: disappearing or time-limited files is an unfamiliar API, not like the traditional FS APIs we usually implement

          Making this pluggable doesn't fix any of those problems, and it adds some more:

          • API stability issues (the INode and Feature classes have changed a lot, and we make no guarantees there)
          • CLASSPATH issues (if I want to send an email about a cleanup job with the FooEmailer library, how do I get that into the NameNode's CLASSPATH?) How do I avoid jar conflicts?

          The only points I've seen raised in favor of doing this in the NameNode are:

          • the NameNode already has an authorization system which this could use.
          • HBase has coprocessors which also allow loading arbitrary code.

          To the first point, there are lots of other ways to deal with authorization, like by using YARN (which also has authorization), or configuring the cleanup using files in HDFS.

          To the second point, HBase doesn't use coprocessors for cleanup jobs... it uses them for things like secondary indices, a much better-defined problem. The functionality you want is not something that should be implemented as a coprocessor, even if we had those.

          Hangjun Ye added a comment -

BTW: TTL is one of the applications that could benefit from a general mechanism. Haohui gave several nice use cases that would benefit as well.

          Hangjun Ye added a comment -

I think we have two discussions here now: a TTL cleanup policy (implemented inside or outside the NN), and a general mechanism to help implement such a policy easily inside the NN.

I've been convinced that a specific TTL cleanup policy implementation is NOT likely to fly directly in the NN's core code; I'm more interested in pursuing a mechanism that enables such a policy implementation.

Consider HBase's coprocessors (https://blogs.apache.org/hbase/entry/coprocessor_introduction): people can extend the functionality easily (w/o extending the base classes), for things such as counting rows or secondary indexes. We could argue that most of these usages do NOT necessarily need to be implemented on the server side, but having such a mechanism gives users an opportunity to choose what is most suitable for their requirements.

If the NN had such an extensible mechanism (as Haohui suggested earlier), we could implement a TTL cleanup policy in the NN in an elegant way (w/o touching the base classes). The NN has already abstracted out "INode.Feature", so we could implement a TTLFeature to hold the metadata. The policy implementation doesn't have to go into the community's codebase if it's too specific; we could keep it in our private branch. But building on a general mechanism (w/o touching the base classes) makes it easy to maintain (considering we upgrade to new Hadoop releases regularly).
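
To make the TTLFeature idea concrete, a very rough sketch is below. It assumes INode.Feature remains the plain marker interface it is today; that interface is a private, unstable NameNode API (which is exactly the stability concern raised elsewhere in this thread), so this is illustrative only, not an agreed design.

{code:java}
// Rough sketch only. INode.Feature is a private, unstable NameNode API; this
// assumes it stays a plain marker interface and is not an agreed design.
import org.apache.hadoop.hdfs.server.namenode.INode;

public final class TtlFeature implements INode.Feature {
  /** Absolute expiration time, in milliseconds since the epoch. */
  private final long expiryTimeMs;

  public TtlFeature(long expiryTimeMs) {
    this.expiryTimeMs = expiryTimeMs;
  }

  public long getExpiryTimeMs() {
    return expiryTimeMs;
  }

  public boolean isExpired(long nowMs) {
    return nowMs >= expiryTimeMs;
  }
}
{code}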

          If you guys think such a general mechanism deserves to be considered, we are happy to contribute some efforts.

          Andrew Wang added a comment -

Even if implementing security were a major hurdle (and I really don't think it is as hard as you think, considering we have quite a few examples of Hadoop auth besides the NN), the rest of Chris's points still stand. I also think that, semantically, a TTL is not an expected type of file attribute for those of us with a Unix background, which leads to TOCTOU issues like the one Chris pointed out, even if just because of user expectations.

          So, at this point, I think there are strong technical reasons not to implement this in the NN, and strong reasons to do this type of data lifecycle management externally.

          Colin Patrick McCabe added a comment -

One approach, as you suggested, is that we implement a separate cleanup platform: users submit their policy to this platform, and we do the real cleanup on HDFS on behalf of users (as a superuser or other powerful user). But the separate platform has to implement an authentication/authorization mechanism to make sure the user is who they claim to be and has the permission (authentication is a must; authorization might be optional, but it would be better to have it). It duplicates what the NameNode has already done with Kerberos/ACLs.... If it's implemented inside the NameNode, we could leverage the NameNode's authentication/authorization mechanism.

          YARN / MR / etc already have authentication frameworks that you can use. For example, you can set up a YARN queue with certain permissions so that only certain users or groups can submit to it.

          Another idea is to have an HDFS directory where each group (or user) puts their files containing the cleanup policies they want, and let HDFS take care of permissions.
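
As a rough illustration of that last idea (not an agreed design), a cleanup tool could read per-user policy files from such a directory and let the directory's HDFS permissions control who may write policies. The /cleanup-policies location and the one-policy-per-line "path maxAgeMillis" format below are assumptions made only for this sketch.

{code:java}
// Purely illustrative: policies live as plain files under an HDFS directory
// whose permissions control who may write them. Location and line format are
// assumptions for this sketch.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PolicyFileCleanup {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path policyDir = new Path("/cleanup-policies");      // hypothetical location
    long now = System.currentTimeMillis();

    for (FileStatus policyFile : fs.listStatus(policyDir)) {
      if (policyFile.isDirectory()) {
        continue;                                        // flat layout assumed
      }
      try (BufferedReader in = new BufferedReader(new InputStreamReader(
          fs.open(policyFile.getPath()), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.trim().split("\\s+");
          if (parts.length != 2) {
            continue;                                    // skip malformed lines
          }
          Path target = new Path(parts[0]);
          long maxAgeMs = Long.parseLong(parts[1]);
          if (!fs.exists(target)) {
            continue;
          }
          for (FileStatus st : fs.listStatus(target)) {
            if (!st.isDirectory()
                && now - st.getModificationTime() > maxAgeMs) {
              // Normal permission checks apply to whoever runs this tool.
              fs.delete(st.getPath(), false);
            }
          }
        }
      }
    }
  }
}
{code}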

          Hangjun Ye added a comment -

          This logic is subject to time of check/time of use race conditions, possibly resulting in incorrect deletion of data. For example, imagine the following sequence: ...

It doesn't sound like a "race condition" to me. We could consider TTL an "independent" attribute of a file, just like the file owner and the replication factor. In the above scenario, it seems to work as expected: the admin only changes the owner of /file1 but leaves the TTL attribute as is, so the TTL should still be effective. If the admin doesn't want that to happen, he/she should "unset" the TTL attribute (i.e. set it to infinity) first, before changing the owner of /file1, and the new owner of /file1 could set a new TTL attribute later if needed.

          Chris Nauroth added a comment -

          The implemented mechanism inside the NameNode would (maybe periodically) execute all policies specified by users, and it would do it as a superuser safely, as authentication/authorization have been done when user set their policies to the NameNode.

          This logic is subject to time of check/time of use race conditions, possibly resulting in incorrect deletion of data. For example, imagine the following sequence:

          1. A user calls the setttl command on /file1. Authentication is successful, and the authenticated user is the file owner, so NN decides the user is authorized to set a TTL.
          2. An admin changes the owner of /file1 in order to revoke the user's access.
          3. Now the NN's background expiration thread/job starts running. It finds a TTL on /file1 and deletes it. Since this is running as the HDFS super-user, nothing blocks the delete, even though the user who set the TTL really no longer has permission to delete.

          With an external process, authentication and authorization are enforced at the time of delete for the specific user, so there is no time of check/time of use race condition, and there is no chance of an incorrect delete.

          Running some code as a privileged user might look expedient in some ways, but it also compromises the file system permissions model somewhat.
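
To illustrate the enforcement-at-delete-time point, the sketch below shows one way an external cleanup service could issue deletes as the policy's owner via proxy-user impersonation, so the NameNode's permission check applies at the moment of the delete. It assumes the service principal is configured as a proxy user (hadoop.proxyuser.* in core-site.xml); user and path values are placeholders, and running deletes directly as each user is equally possible.

{code:java}
// Sketch: issue the delete as the user who owns the policy (here via
// proxy-user impersonation) instead of as the HDFS super-user, so the NN
// enforces permissions at the moment of the delete.
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class DeleteAsPolicyOwner {
  public static boolean deleteAs(String user, Path path, Configuration conf)
      throws Exception {
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        user, UserGroupInformation.getLoginUser());
    // If the owner's access was revoked (as in the scenario above), this
    // delete now fails instead of silently succeeding as the super-user.
    return proxy.doAs((PrivilegedExceptionAction<Boolean>) () -> {
      FileSystem fs = FileSystem.get(conf);
      return fs.delete(path, true);
    });
  }
}
{code}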

          Hangjun Ye added a comment -

          Thanks Chris and Colin for your valuable comments, I'd like to address your concern about the "security" problem.

First, our scenario is as follows:
We have a Hadoop cluster shared by multiple teams for their storage and computation requirements, and "we" are the dev/supporting team that ensures the functionality and availability of the cluster. The cluster is security-enabled to ensure every team can only access the files that it should. So every team is a common user of the cluster, and "we" own the superuser.

Currently several teams have the requirement to clean up files based on a TTL policy. Obviously they could run cron jobs to do that themselves, but that would mean a lot of repeated work, so we'd better have a mechanism that lets them specify/implement their policy easily.

One approach, as you suggested, is that we implement a separate cleanup platform: users submit their policy to this platform, and we do the real cleanup on HDFS on behalf of users (as a superuser or other powerful user). But the separate platform has to implement an authentication/authorization mechanism to make sure the user is who they claim to be and has the permission (authentication is a must; authorization might be optional, but it would be better to have it). It duplicates what the NameNode has already done with Kerberos/ACLs.

If it's implemented inside the NameNode, we could leverage the NameNode's authentication/authorization mechanism. For example, we could provide a "./bin/hdfs dfs -setttl <path/file>" command (just like -setrep). Users could specify their policy with it, and the NameNode would persist it somewhere, maybe as an attribute of the file like the replication factor. The mechanism implemented inside the NameNode would (maybe periodically) execute all policies specified by users, and it could do so safely as a superuser, as authentication/authorization were already done when the users set their policies on the NameNode.

          To address several detailed concerns you raised:

          • "buggy or malicious code": The proposed concept (actually Haohui proposed) should be pretty similar to HBase's coprocessor (http://hbase.apache.org/book.html#cp), it's a plug-in or extension of NameNode and most likely enabled at deployment time. A common user can't submit it, the cluster owner could do. So the code is not arbitrary and the quality/safety could be guaranteed.
          • "Who exactly is the effective user running the delete, and how do we manage their login and file permission enforcement": the extension is run as superuser/system, a specific extension implementation could do any permission enforcement if needed. For the "TTL-based cleanup policy executor", no permission enforcement is needed at this stage as authentication/authorization have been done when user set policy.

I think the idea proposed by Haohui is to have an extensible mechanism in the NameNode to run jobs that depend heavily on namespace data, while keeping the specific job's code as decoupled from the NameNode's core code as possible. It's certainly not easy, as Chris pointed out several problems, like HA and concurrency, but it might deserve to be thought about.
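
As a hedged aside on the "-setttl" idea above: on releases that already support extended attributes, a per-file TTL could be persisted today with no NameNode change, since the NN authenticates and authorizes the setXAttr call when the user sets the value, and an external scanner can read it back later. The "user.ttl" xattr name and the millisecond encoding below are assumptions for this sketch, not a proposed interface.

{code:java}
// Not the proposed '-setttl' command: a sketch of persisting a per-file TTL
// via extended attributes, readable later by an external cleanup scanner.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XAttrTtl {
  private static final String TTL_XATTR = "user.ttl";

  /** The NN authenticates and authorizes this call for the calling user. */
  public static void setTtl(FileSystem fs, Path path, long ttlMs)
      throws IOException {
    fs.setXAttr(path, TTL_XATTR,
        Long.toString(ttlMs).getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the TTL in milliseconds, or -1 if none is set. */
  public static long getTtl(FileSystem fs, Path path) throws IOException {
    try {
      byte[] raw = fs.getXAttr(path, TTL_XATTR);
      return Long.parseLong(new String(raw, StandardCharsets.UTF_8));
    } catch (IOException e) {
      return -1;                     // xattr absent, or xattrs unsupported
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    setTtl(fs, new Path(args[0]), 30L * 24 * 60 * 60 * 1000);   // 30 days
  }
}
{code}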

          Colin Patrick McCabe added a comment -

          I agree with Chris' comments here. There are just so many advantages to running outside the NameNode, that I think that's the design we should start with. If we later find something that would work better with NN support, we can think about it then.

          Hangjun Ye wrote:

Another benefit to having it inside the NN is that we don't have to handle the authentication/authorization problem in a separate system. For example, we have a shared HDFS cluster for many internal users, and we don't want someone to set a TTL policy on someone else's files. The NN could handle it easily with its own authentication/authorization mechanism.

          The client handles authentication/authorization very well, actually. You can choose to run your cleanup job as superuser (can do anything) or some other less powerful user who is limited (safer). But when you run inside the NameNode, there are no safeguards... everything is effectively superuser. And you can destroy or corrupt the entire filesystem very easily that way, especially if your cleanup code is buggy.

          Chris Nauroth added a comment -

          ...run a job (maybe periodically) over the namespace inside the NN...

          Please correct me if I misunderstood, but this sounds like execution of arbitrary code inside the NN process. If so, this opens the risk of resource exhaustion at the NN by buggy or malicious code. Even if there is a fork for process isolation, it's still sharing machine resources with the NN process. If the code is running as the HDFS super-user, then it has access to sensitive resources like the fsimage file. If multiple such "in-process jobs" are submitted concurrently, then it would cause resource contention with the main work of the NN. Multiple concurrent jobs also gets into the realm of scheduling. There are lots of tough problems here that would increase the complexity of the NN.

          Even putting that aside, I see multiple advantages in implementing this externally instead of embedded inside the NN. Here is a list of several problems that an embedded design would need to solve, and which I believe are already easily addressed by an external design. This includes/expands on issues brought up by others in earlier comments too.

          • Trash: The description mentions trash capability as a requirement. Trash functionality is currently implemented as a client-side capability.
            • Embedded: We'd need to reimplement trash inside the NN, or heavily refactor for code sharing.
  • External: The client already has the trash capability, so this problem is already solved (see the sketch after this list).
          • Integration: Many Hadoop deployments use an alternative file system like S3 or Azure storage. In these deployments, there is no NameNode.
            • Embedded: The feature is only usable for HDFS-based deployments. Users of alternative file systems can't use the feature.
            • External: The client already has the capability to target any Hadoop file system implementation, so this problem is already solved.
          • HA: In the event of a failover, we must guarantee that the former active NN does not drive any expiration activity.
            • Embedded: Any background thread or "in-process jobs" running inside the NN must coordinate shutdown during a failover.
            • External: Thanks to our client-side retry policies, an external process automatically transitions to the new active NN after a failover, and there is no risk of split-brain scenario, so this problem is already solved.
          • Authentication/Authorization: Who exactly is the effective user running the delete, and how do we manage their login and file permission enforcement?
            • Embedded: You mention there is an advantage to running embedded, but I didn't quite understand. Are you suggesting running the deletes inside a UserGroupInformation#doAs for the specific user?
            • External: The client already knows how to authenticate RPC, and the NN already knows how to enforce authorization on files for that authenticated user, so this problem is already solved.
          • Error Handling: How do users find out when the deletes don't work?
            • Embedded: There is no mechanism for asynchronous user notification inside the NN. As others have mentioned, there is a lot of complexity in this area. If it's email, then you need to solve the problem of reliable email delivery (i.e. retries if SMTP gateways are down). If it's monitoring/alerting, then you need to expose new monitoring endpoints to publish sufficient information.
            • External: The client's exception messages are sufficient to identify file paths that failed during synchronous calls, and the NN audit log is another source of troubleshooting information, so this problem is already solved.
          • Federation: With federation, the HDFS namespace is split across multiple NameNodes.
            • Embedded: The design needs to coordinate putting the right expiration work on the right NN hosting that part of the namespace.
            • External: The client has the capability to configure a client-side mount table that joins together multiple federated namespaces, and ViewFileSystem then routes RPC to the correct NN depending on the target file path, so this problem is already solved.
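
Illustrating the Trash point in the list above: an external cleanup process gets trash behaviour by calling the existing client-side helper rather than deleting outright, so nothing needs to be reimplemented in the NN. Whether to use trash at all would be the tool's own configuration choice; the class below is only a sketch.

{code:java}
// Sketch: use the existing client-side trash helper, falling back to a plain
// delete when trash is disabled or the move fails.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class ExpireToTrash {
  public static void expire(FileSystem fs, Path path, Configuration conf,
      boolean useTrash) throws IOException {
    if (useTrash && Trash.moveToAppropriateTrash(fs, path, conf)) {
      return;                        // moved into the user's .Trash directory
    }
    fs.delete(path, true);           // fall back to a plain recursive delete
  }
}
{code}
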
          Hangjun Ye added a comment -

          Thanks Haohui, that's clear to us now.
          That's interesting and we'd like to pursue the more general approach.
          We will take time to work out a rough design and ask you guys to review.

          Haohui Mai added a comment -

          Your suggestion is that we'd better have a general mechanism/framework to run a job (maybe periodically) over the namespace inside the NN, and the TTL policy is just a specific job that might be implemented by user?

          This is correct. There are a couple additional use cases that might be useful to keep in mind:

1. Archiving data. TTL is one of the use cases here.
2. Backing up or syncing data between clusters. It's nice to back up / sync data between clusters for disaster recovery, without running an MR job.
          3. Balancing data between data nodes.

          A mechanism that can support the above use cases can be quite powerful and improve the state of the art. I'm happy to collaborate if this is the direction you guys want to pursue.

We are heavy users of Hadoop and also make some in-house improvements per our business requirements. We definitely want to contribute the improvements back to the community.

          This is great to hear. Patches are welcome.

          Hangjun Ye added a comment -

          Thanks Haohui for your reply.

          Let me confirm I got your point. Your suggestion is that we'd better have a general mechanism/framework to run a job (maybe periodically) over the namespace inside the NN, and the TTL policy is just a specific job that might be implemented by user?
          That's an interesting direction, we will think about it.

We are heavy users of Hadoop and also make some in-house improvements per our business requirements. We definitely want to contribute the improvements back to the community, as long as it's helpful for the community.

          Haohui Mai added a comment -

TTL is a very simple (but general) policy, and we might even consider it an attribute of a file, like the number of replicas. It seems it wouldn't introduce much complexity to handle it in the NN.

Another benefit to having it inside the NN is that we don't have to handle the authentication/authorization problem in a separate system. For example, we have a shared HDFS cluster for many internal users, and we don't want someone to set a TTL policy on someone else's files. The NN could handle it easily with its own authentication/authorization mechanism.

I agree that running jobs over the namespace without MR should be the direction to go. However, I think the main holdback here is that the design mixes the mechanism (running jobs over the namespace without MR) and the policy (TTL) together.

As Colin Patrick McCabe pointed out earlier, every user has his/her own policy. Given that HDFS has a wide range of users, this type of design/implementation is unlikely to fly in the ecosystem.

Currently HDFS does not have the above mechanism; you're more than welcome to contribute a patch.

          Hangjun Ye added a comment -

Implementing it outside the NN is definitely another option, and I agree with Colin that it's not feasible to implement a complex cleanup policy (like one based on storage space) inside the NN.

TTL is a very simple (but general) policy, and we might even consider it an attribute of a file, like the number of replicas. It seems it wouldn't introduce much complexity to handle it in the NN.

Another benefit to having it inside the NN is that we don't have to handle the authentication/authorization problem in a separate system. For example, we have a shared HDFS cluster for many internal users, and we don't want someone to set a TTL policy on someone else's files. The NN could handle it easily with its own authentication/authorization mechanism.

So far a TTL-based cleanup policy is good enough for our scenario (Zesheng and I are from the same company, and we are supporting our company's internal usage of Hadoop), and it would be nice to have a simple and workable solution in HDFS.

          Jian Wang added a comment -

I think it is better for you to provide a (backup & cleanup) platform for your users; you can implement a lot of cleanup strategies for the users in your company.
This can reduce a lot of repeated work.

          Colin Patrick McCabe added a comment -

Why do you think that putting the cleanup mechanism into the NameNode seems questionable? Can you point out some details?

          Andrew and Chris commented about this earlier. See:
          https://issues.apache.org/jira/browse/HDFS-6382?focusedCommentId=13998933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13998933

          I would add to that:

• Every user of this is going to want a slightly different deletion policy. It's just way too much configuration for the NameNode to reasonably handle. Much easier to do it in a user process. For example, maybe you want to keep at least 100 GB of logs, 100 GB of "foo" data, and 1000 GB of "bar" data. It's easy to handle this complexity in a user process, but incredibly complex and frustrating to handle it in the NameNode.
          • Your nightly MR job (or whatever) also needs to be able to do things like email sysadmins when the disks are filling up, which the NameNode can't reasonably be expected to do.
          • I don't see a big advantage to doing this in the NameNode, and I see a lot of disadvantages (more complexity to maintain, difficult configuration, need to restart to update config)

          Maybe I could be convinced otherwise, but so far the only argument that I've seen for doing it in the NN is that it would be re-usable. And this could just as easily apply to an implementation outside the NN. For example, as I pointed out earlier, DistCp is reusable, without being in the NameNode.
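
As a sketch of the kind of per-dataset policy described in the first bullet above, handled entirely in a user process: delete the oldest files under a directory until the total size fits a byte budget. The /logs path and the 100 GB figure are just example values, not anything proposed here.

{code:java}
// Sketch: a size-budget retention policy running as an ordinary HDFS client.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class KeepAtMost {
  public static void enforce(FileSystem fs, Path dir, long budgetBytes)
      throws IOException {
    FileStatus[] files = fs.listStatus(dir);
    Arrays.sort(files,
        Comparator.comparingLong(FileStatus::getModificationTime));

    long total = 0;
    for (FileStatus st : files) {
      total += st.getLen();
    }
    // Drop oldest-first until we are back under budget.
    for (FileStatus st : files) {
      if (total <= budgetBytes) {
        break;
      }
      if (!st.isDirectory() && fs.delete(st.getPath(), false)) {
        total -= st.getLen();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    enforce(fs, new Path("/logs"), 100L * 1024 * 1024 * 1024);  // keep <= 100 GB
  }
}
{code}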

          Zesheng Wu added a comment -

          Like I said, we should write such a tool and add it to the base Hadoop distribution. This is similar to what we did with DistCp. Then users would not need to write their own versions of this stuff.

          Sure, this is another good option.

          It's important to distinguish between creating a tool to handle deleting old files (which we all agree we should do), and putting this into the NameNode (which seems questionable).

Why do you think that putting the cleanup mechanism into the NameNode seems questionable? Can you point out some details?

          Colin Patrick McCabe added a comment -

But if there's no internal cleanup mechanism in HDFS, all users (across companies) need to write their own cleanup tools, which means a lot of repeated work.

          Like I said, we should write such a tool and add it to the base Hadoop distribution. This is similar to what we did with DistCp. Then users would not need to write their own versions of this stuff.

          It's important to distinguish between creating a tool to handle deleting old files (which we all agree we should do), and putting this into the NameNode (which seems questionable).
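
          To make concrete what "such a tool" shipped with the base Hadoop distribution might look like, here is a minimal, purely illustrative skeleton using the standard Tool/ToolRunner pattern that DistCp itself follows. The class name and the "-olderThan" option mentioned in the comments are assumptions for illustration, not anything proposed on this issue.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.conf.Configured;
            import org.apache.hadoop.util.Tool;
            import org.apache.hadoop.util.ToolRunner;

            /**
             * Skeleton of a shared cleanup utility packaged as a standard Hadoop Tool
             * (the same pattern DistCp uses), so that every site does not have to
             * reinvent the same script.  Hypothetical sketch only.
             */
            public class CleanupTool extends Configured implements Tool {
              @Override
              public int run(String[] args) throws Exception {
                // Parse options such as a hypothetical "-olderThan 30d" plus a target
                // path, then list and delete expired files through the FileSystem API
                // (see the FileSystem-based sketch further down in this thread).
                return 0;
              }

              public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(new Configuration(), new CleanupTool(), args));
              }
            }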

          Zesheng Wu added a comment -

          Thanks Colin Patrick McCabe. It's true that an outside cleanup tool is not too difficult or complex to implement, and there are many ways to satisfy the requirements. But if there's no internal cleanup mechanism in HDFS, all users (across companies) need to write their own cleanup tools, which is a lot of repeated work. If HDFS supported an internal cleanup mechanism, that would clearly be more convenient; do you agree?

          Regarding snapshots, I think a snapshotted file that is deleted by the TTL mechanism behaves exactly the same as a snapshotted file that is deleted manually by a user.

          Colin Patrick McCabe added a comment -

          I don't think a nightly (or weekly) cleanup job that lives outside HDFS is that difficult or complex to write. If it were done as a MapReduce job, it could easily work on the whole cluster. This is something we could consider putting upstream.

          Another issue to consider here is snapshots. Deleting files is not going to free space if they exist in a snapshot.
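
          As a rough sketch of what the core of such a nightly or weekly cleanup job could look like, assuming a made-up path and retention period and using only the public FileSystem API (a real tool would recurse, apply per-directory policy, and report what it did):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            /** Sketch: delete files whose modification time is older than a retention period. */
            public class OldFileCleaner {
              public static void main(String[] args) throws Exception {
                Path root = new Path("/backup/logs");           // illustrative path
                long retentionMs = 30L * 24 * 60 * 60 * 1000;   // e.g. keep 30 days
                long cutoff = System.currentTimeMillis() - retentionMs;

                FileSystem fs = FileSystem.get(new Configuration());
                for (FileStatus status : fs.listStatus(root)) {
                  if (status.isFile() && status.getModificationTime() < cutoff) {
                    // As noted above, this frees no space for blocks still held by a snapshot.
                    fs.delete(status.getPath(), false);
                  }
                }
                fs.close();
              }
            }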

          Zesheng Wu added a comment -

          Thanks Chris Nauroth. I agree with your example MapReduce scenario and the risk, but this risk can't be avoided even if we use outside tools. For example, suppose we use a nightly cron job as Andrew mentioned: a MapReduce job gets submitted, we derive input splits from a file, and then the file is deleted by the cron job after input split calculation but before the map tasks start running and reading the blocks; the risk is exactly the same. My point is that TTL is just a convenient way to accomplish the tasks described in the proposal; users should learn how to use it correctly, rather than rely on a more complicated approach that offers no obvious advantage.

          Chris Nauroth added a comment -

          I agree with Andrew's opinion that this is better implemented outside the file system. An automatic delete based on a TTL introduces a high risk of concurrency bugs for applications. For example, imagine a MapReduce job gets submitted, we derive input splits from a file, and then the file expires after input split calculation but before the map tasks start running and reading the blocks. Overall, I think it's preferable to put delete into the hands of the calling application for explicit control.

          Zesheng Wu added a comment -

          Thanks Andrew Wang. Of course a nightly cron job can do this for us, but our users have various kinds of data backup requirements, of which log backup is just one; we want to provide a more convenient way for them to satisfy these requirements. Imagine that we have many backup requirements, each with a different TTL configuration. One way to handle this is for each user to maintain his own cron job; another is for the cluster administrator to maintain all the cron jobs for all users. Neither way is very convenient, and both require a lot of manual operational work. If HDFS supports TTL as proposed above, these requirements can be satisfied very easily.

          Andrew Wang added a comment -

          This is just my opinion, but isn't this something better done in userspace? A nightly cron job could do this for you, and log files are typically even already timestamped for easy parsing and removal.
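
          For instance, assuming logs are laid out in date-stamped directories such as /logs/2014-06-01 (an illustrative convention, not something defined by HDFS), the cron-driven job could simply parse the directory name and remove anything older than the retention window. A minimal sketch:

            import java.text.ParseException;
            import java.text.SimpleDateFormat;
            import java.util.Date;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            /** Sketch: remove date-stamped log directories older than 30 days. */
            public class LogDirCleaner {
              public static void main(String[] args) throws Exception {
                SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
                long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;

                FileSystem fs = FileSystem.get(new Configuration());
                for (FileStatus status : fs.listStatus(new Path("/logs"))) {  // illustrative layout
                  if (!status.isDirectory()) {
                    continue;
                  }
                  Date stamp;
                  try {
                    stamp = fmt.parse(status.getPath().getName());  // e.g. "2014-06-01"
                  } catch (ParseException e) {
                    continue;  // not a date-stamped directory, leave it alone
                  }
                  if (stamp.getTime() < cutoff) {
                    fs.delete(status.getPath(), true);  // recursive; FileSystem.delete bypasses trash
                  }
                }
                fs.close();
              }
            }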


            People

            • Assignee:
              Zesheng Wu
            • Reporter:
              Zesheng Wu
            • Votes:
              2
            • Watchers:
              28
