Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: fs
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Introduced archive feature to Hadoop. A Map/Reduce job can be run to create an archive with indexes. A FileSystem abstraction is provided over the archive.

Description

This is a new feature for archiving and unarchiving files in HDFS.

      1. hadoop-3307_4.patch
        61 kB
        Mahadev konar
      2. hadoop-3307_2.patch
        61 kB
        Mahadev konar
      3. hadoop-3307_1.patch
        54 kB
        Mahadev konar

Issue Links

Activity

          Mahadev konar added a comment - edited

          Here is the design for the archives.

          Archiving files in HDFS

          • Motivation

          The Namenode is a limited resource, and we usually end up with lots of small files that users do not use very often. We would like to create an archiving utility that can archive these files into archives that are semi-transparent and usable by map/reduce.

          • Why not just concatenate the files?
            Concatenation of files might be useful, but it is not a full-fledged solution for archiving files. Users want to keep their files as distinct files and would sometimes like to unarchive without losing the file layout.
          • Requirements
          • Transparent or semi-transparent usage of archives.
          • Must be able to archive and unarchive in parallel.
          • Mutable archives are not a requirement, but the design should not prevent them from being implemented later.
          • Compression is not a goal.
          • Archive Format
          • Conventional archive formats like tar are not convenient for parallel archive creation.
          • Here is a proposal that will allow archive creation in parallel.

          The format of an archive as a filesystem path is:

          /user/mahadev/foo.har/_index*
          /user/mahadev/foo.har/part-*

          The indexes store the filenames and their offsets within the part files (a read-path sketch follows at the end of this comment).

          • URI Syntax
            The Har FileSystem is a client-side filesystem which is semi-transparent.
          • har:<archivePath>!<fileInArchive> (similar to a jar URI)
            example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive
          • How will map/reduce work with this new filesystem?
            No changes to map/reduce are required to use archives as input to map/reduce jobs.
          • How will the dfs commands work?

          The DFS commands will have to specify the whole URI for doing dfs operations on the files. Archives are immutable, so renames, deletes, and creates will throw an exception in the initial versions of archives.

          • How will permissions work with archives?
            In the first version of HAR, all the files that are archived into HAR will lose the permissions that they initially had. In later versions of HAR, permissions can be stored in the metadata, making it possible to unarchive without losing permissions.
          • Future Work
          • Transparent use of archives.
            This will need changes to the Hadoop FileSystem to support mounts that point to archives, and changes to DFSClient to transparently walk such a mount to the real archive, allowing transparent use of archives.

          Comments?
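          A minimal sketch of the read path implied by this layout, assuming a hypothetical IndexEntry record and lookupIndex() helper (the real index format is spelled out later in this issue): resolve the archived file through the index to a (part file, offset, length) triple, then read that slice from the underlying filesystem.

            import java.io.IOException;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class HarReadSketch {
              /** Hypothetical record resolved from the archive's _index files. */
              static class IndexEntry {
                String partFile;   // e.g. "part-0"
                long offset;       // where the archived file starts inside the part file
                long length;       // how many bytes belong to the archived file
              }

              /** Placeholder for the index lookup; the on-disk format is defined by the patch. */
              static IndexEntry lookupIndex(FileSystem fs, Path archive, String fileInArchive)
                  throws IOException {
                throw new UnsupportedOperationException("illustrative stub");
              }

              static byte[] readFromArchive(Configuration conf, Path archive, String fileInArchive)
                  throws IOException {
                FileSystem fs = archive.getFileSystem(conf);
                IndexEntry e = lookupIndex(fs, archive, fileInArchive);
                FSDataInputStream in = fs.open(new Path(archive, e.partFile));
                try {
                  in.seek(e.offset);                     // jump to the archived file's first byte
                  byte[] buf = new byte[(int) e.length];
                  in.readFully(buf);                     // read exactly that file's bytes
                  return buf;
                } finally {
                  in.close();
                }
              }
            }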

          Doug Cutting added a comment -

          Note that har:<path>! is a non-hierarchical, opaque URI. Much of the Path code assumes that URIs are hierarchical and would need to be altered to support opaque uris.

          One alternative would be to always "mount" hars before access. Mounting would just require setting a "fs.har.<name>" property to a har file.

          For example, a job could add a mount with:

          job.set("fs.har.myfiles", "hdfs://host:port/dir/my.har");

          Then specify its input as:

          job.addInputPath("har://myfiles/path/in/har");

          Another alternative could be to somehow escape paths in the authority of har: uris, e.g.:

          har://hdfs-c-s-shost-c999-sdir/path_in_har

          Where -c and -s are escapes for colon and slash. Then the uris could still be hierarchical. The downside is that paths would look really ugly. Sigh.

          If we wanted to make it transparent, then we might do it by adding symbolic links to the FileSystem API, rather than hacking DFSClient. Then one could "mount" a har file by simply linking to a har: URI.

          Mahadev konar added a comment -

          You are right, Doug. The opaque URI would not work with how Path works right now, and I think it would be difficult to get it working as well. The suggestion with URI escaping looks really ugly and would be difficult for users to understand. Not sure how to get around this problem.

          Lohit Vijayarenu added a comment -

          How about treating file names ending with .har as a special file format?

          Doug Cutting added a comment -

          > How about assuming file names ending with .har to be considered as special file format.

          I'd prefer not doing a naming hack like that directly in HDFS or in FileSystem. But I don't mind doing it in a layered filesystem, perhaps something like:

          har://hdfs-host:port/dir/my.har/file/in/har

          So the "har" FileSystem could pull the nested scheme off the front of the host, and scan the path for a ".har", parse the index there (caching it, presumably), and finally access the file. In the above case, the har path would be hdfs://host:port/dir/my.har. No changes to FileSystem or HDFS are required. That's not too ugly, is it?

          Mahadev konar added a comment -

          No, it isn't that ugly. The real problem is not the '!' in the path URI that I suggested, but having an opaque URI.

          har://hdfs-host:port/dir/my.har/file/in/har is still opaque.

          So paths do not work with this either. Are you suggesting that we change Path to work with opaque URIs?

          Mahadev konar added a comment -

          Sorry, I think I missed this.

          Would the URI be

          har://hdfs-host:port/dir/my.har/file/in/har

          or har:hdfs://host:port/dir/my.har/file/in/har?

          The first one is not opaque but will only work with HDFS. We are implicitly assuming that this is an HDFS archive.

          That might be OK though. I am not against it.

          Andrzej Bialecki added a comment -

          Why not do an equivalent of NFS mount on *nix? I.e. tell NameNode that any file ops under the mount point are handled by a handler, and the handler application (module in a task?) is initialized with the specs from the JobConf.

          Mahadev konar added a comment - edited

          Isn't this the same as the mounts I suggested? Wouldn't this also require changes to the DFS Namenode? I am thinking of implementing archives without changes on the namenode side (for the first version at least).

          Doug Cutting added a comment -

          What I suggested is not opaque, but hierarchical. The nested scheme is pasted onto the front of the authority with a dash. This would fail for schemes that have a dash. If that's a problem, we could use something like hdfs-ar://host:port/dir/my.har/file. This would require adding an entry to the configuration per embedded scheme, rather than a single entry for all schemes--not a big deal.

          Andrzej Bialecki added a comment -

          Mahadev,

          Wouldn't this also require changes to the DFS Namenode?

          Perhaps not - the way I was thinking about it, this would be a function of a DFS client (i.e. the FileSystem on the API level). The FileSystem client would be aware of the current mounts (from Configuration), and for matching path prefixes it would translate file ops to operations on the archive file retrieved from the underlying configured FileSystem.

          This way we don't have to modify the Namenode, we don't have to hack the url schemas, the URIs are transparent, and all processing load is at the cost of a user task, i.e. we don't create an additional load on the namenode. Additionally, the scope of the "mount" is the current configuration, e.g. a job, so it's not a permanent mount, it can be different for each job, and no cleanup or unmount is needed.

          Doug Cutting added a comment -

          > Why not do an equivalent of NFS mount

          On unix, mounts are not managed by a filesystem implementation, but by the kernel. If we were to add a mount mechanism, we should add it at the generic FileSystem level, not within HDFS.

          In fact, we already have a mount mechanism, but it only permits mounts by scheme, not at any point in the path. We could add a mechanism to mount a filesystem under an arbitrary path, or even a regex like "*.har". This could be confusing, however, since a path that looks like an hdfs: path would really be using some other protocol. And I don't yet see that we need to add a new mount feature, since I think the existing one is sufficient to implement this feature. Also, if we use "har:" or "hdfs-ar:" then it is clear that these are not normal HDFS files.

          A feature that might be good to add to the generic FileSystem is symbolic links. Then one could add a link in HDFS to an archive URI, thus grafting it into the namespace. If one linked hdfs://host:port/dir/foo/ to hdfs-ar://host:port/dir/foo.har then one could list the files in the former to get the uris in the latter. But that's beyond the scope of this issue. This would be good future work to make transparent archives possible.

          Andrzej Bialecki added a comment -

          On unix, mounts are not managed by a filesystem implementation, but by the kernel.

          (Off-topic) Strictly speaking, yes - but in reality many filesystems are implemented as loadable modules (handlers) so it's not the monolithic kernel that handles all file ops. And many implementations exist only in user space (FUSE). Kernel then only makes sure that file ops under this mount point are delegated to the appropriate handler. Which is the model that I suggested here - in our case the client FileSystem abstraction would play this role.

          If we were to add a mount mechanism, we should add it at the generic FileSystem level, not within HDFS.

          Correct, that's what I was suggesting.

          Mahadev konar added a comment -

          OK, coming back to the topic of discussion...

          I like Doug's idea of
          har://hdfs-host:port/dir/my.har/file/in/har

          and we assume any directory ending with .har is an archive and the path following it is the path within the archive.

          Tsz Wo Nicholas Sze added a comment -

          > we assume any directory ending with .har is an archive and the path following it is the path within the archive.

          Consider har://hdfs-host:port/dir/my.har/file/in/har. I think the assumption should be

          • if my.har is a file, then follow the path in archive.
          • if my.har is a directory, treat it as a normal directory.

          Questions: Do we support nested archives? What will you do for something like har://hdfs-host:port/dir/foo.har/bar.har/file?

          Mahadev konar added a comment -

          I don't see why nested archives cannot be supported. For now (since I have just started coding) I think we should be able to support it, but I'll keep that in mind!

          Doug Cutting added a comment -

          Nicholas, archives are directories, not files, right? The har filesystem implementation should take the path up to the first ".har" element and assume that names an archive directory. Attempting to open har://hdfs-host:port/dir/my.har/foo.bar should throw an exception if my.har is not an archive-formatted directory. This should naturally permit nested har files, if that's desired.

          Tsz Wo Nicholas Sze added a comment -

          Oops, I thought that the archives are files like tar. Then I have another question:
          In har://hdfs-host:port/dir/foo.har/bar.har/file, what is the behavior if foo.har is indeed a directory and bar.har is an archive?

          Lohit Vijayarenu added a comment -

          Can we distinguish a directory ending with .har as an archive only if it has an index file in it?

          Doug Cutting added a comment -

          > In har://hdfs-host:port/dir/foo.har/bar.har/file, what is the behavior if foo.har is indeed a directory and bar.har is an archive?

          As I said before, I think it would be nice and not too difficult to make nested archives work. Not essential, but convenient if it's not too difficult. So if you pack hdfs://h/bar/* into hdfs://h/foo/bar.har, and then pack hdfs://h/foo/* into hdfs://h/dir/foo.har, then har://hdfs-h/dir/foo.har/bar.har/file should either (a) contain the content of the original file, if we implement nested archives, or (b) throw FileNotFoundException if we don't implement nested archives. Is that what you were asking?

          > Can we distinguish a directory ending with .har to be an archive only if it has index file in it.

          If a path component of a har: uri ends with ".har" then I think it should be an error if it is not a ".har" format directory. It's fine to have files named .har in HDFS that are not har-format, but if one tries to access them using the archive mechanism, we shouldn't silently ignore them, but rather throw a MalformedArchive exception, no?

          Joydeep Sen Sarma added a comment -

          If 'har' is truly a client-side abstraction, then the assumption that the protocol is HDFS breaks this abstraction, no? One could imagine har archives on top of the local file system, or for that matter KFS or any other future file system (say Lustre?).

          Also, the 'har' protocol is redundantly indicated in the URI scheme as well as the file extension. Conceivably, one could drop it from the URI scheme (and thereby retain the ability to work with different file systems) and use the presence of the .har extension in the file path to automatically layer on an archive file system.

          If done right, one should be able to support any archive format, no? Essentially, we are just associating the .har extension as a trigger to switch over to some nested file system (in this case, the har file system). One would think that in the future a .zip extension could be associated with a ZIP file system provider which would allow a nested view of the files/directories underneath. (This would be quite nice, since many data sets float around as zip files. One could just copy them into HDFS and, pronto, we are all set.)

          I am also curious about the 'parallel creation' aspect (since that seems to be the main argument for using a new archive format). How do we populate a single HDFS file (backing the archive) in parallel?

          Doug Cutting added a comment -

          > the assumption that the protocol is hdfs

          No, the nested URI scheme is appended to the front of the host. A KFS archive would be:

          har://kfs-host:port/dir/foo.har/a/b

          which would name the file /a/b within the archive kfs://host:port/dir/foo.har.

          > if done right - one should be able to support any archive format no?

          That's a different goal. Most archive formats are not Hadoop friendly. The goal here is to develop a Hadoop-friendly archive format.

          Mahadev konar added a comment -

          > am also curious about the 'parallel creation' aspect (since that seems to be the main argument for using a new archive format). how do we populate a single hdfs file (backing the archive) in parallel?

          The archive isn't a single file backed by an index, but multiple files.

          Quoting from the design posted earlier in the comments:

          The format of an archive as a filesystem path is:

          /user/mahadev/foo.har/_index*
          /user/mahadev/foo.har/part-*

          The indexes store the filenames and their offsets within the part files.

          Each map would create part-$i files, and a single reduce or multiple reduces would create the index files in the archive directory (a sketch follows below). Does that help in understanding the design?
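          As a rough sketch of the map side under this scheme (names and the record format here are made up for illustration, not taken from the patch): each map streams its assigned source files into one part file and notes where every file landed, and those (file, part, offset, length) records are what the reduce turns into the index files.

            import java.io.IOException;
            import java.util.List;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.IOUtils;

            public class ArchiveMapSketch {
              /** Copy this map's files into one part file and record each file's location. */
              static void writePartFile(Configuration conf, Path archiveDir, int partId,
                                        List<Path> assignedFiles) throws IOException {
                FileSystem fs = archiveDir.getFileSystem(conf);
                FSDataOutputStream part = fs.create(new Path(archiveDir, "part-" + partId));
                try {
                  for (Path src : assignedFiles) {
                    long offset = part.getPos();                // start of this file in the part file
                    FSDataInputStream in = fs.open(src);
                    try {
                      IOUtils.copyBytes(in, part, conf, false); // append the bytes, keep part open
                    } finally {
                      in.close();
                    }
                    long length = part.getPos() - offset;
                    // emit (src, "part-" + partId, offset, length) for the reduce to write into _index
                  }
                } finally {
                  part.close();
                }
              }
            }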

          Mahadev konar added a comment -

          What about using this URI for the har filesystem:

          har://hdfs-host:port/dir/foo.har?pathinsideharfilesystem

          So the query part of the URI is actually the path inside the har filesystem.

          This might require some changes to Path, but it looks like a cleaner way than assuming the .har extension marks the har archive.

          An example would be

          har://hdfs-host:port/dir/foo.har?dir1/file1

          Doug Cutting added a comment -

          > har://hdfs-host:port/dir/foo.har?dir1/file1

          The problem with this is that path operations like getParent() wouldn't work.

          Mahadev konar added a comment -

          The intent is to change Path to make it work...

          Would it not be possible to make these changes without breaking things that used to work?

          Doug Cutting added a comment -

          > the intent is to change path to make it work....

          Would you special case the handling of "har:" uri's in Path? Or would you always parse queries as part of the hierarchical path? Both of these sound like bad ideas to me.

          We should not add special functionality to FileSystem or Path for "har:" uris. We have a proposal that layers cleanly on top of the existing FileSystem and Path implementations. Alternately, we might consider generic extensions to FileSystem and/or Path, like symbolic links or mount points, to see whether these might facilitate a more transparent archive implementation. But we should not add special-purpose hacks for a particular archive format to these generic classes.

          Mounts of various sorts would be fairly easy to add, but perhaps not that easy to use. I proposed a simple version above that requires no changes to existing code. A mount capability that permitted one to attach a FileSystem implementation at an arbitrary point in the URI space would not be overly hard to add.

          The primary downside of mount-based approaches is that they require state. One would have to add something to the configuration or job for each mount point, or require all FileSystem implementations to know how to store a mount, or add a mount file type, or somesuch. Note that this is not a problem with Unix mount, since there's only one system involved, but in a distributed system like Hadoop we need to either transmit the mount points with code (e.g., in the job) or somehow store them in the filesystem.

          The current proposal, embedding the URI of the archive within a "har:" uri, will both solve the problems at hand and require no architectural changes to the filesystem. The only downside is that archive file naming is a little obtuse. Long-term, the addition of symbolic links to FileSystem might address that, no?

          Mahadev konar added a comment -

          I don't think just symbolic links would solve the problem. We would need mounts to make it transparent, and some storage in the filesystem to know what we are supposed to do with the mounts.
          I was going to go with the second option (parsing for a query on every path), which does seem like a bad idea. The syntax right now is a little ambiguous and has implicit assumptions about the har filesystem. I was trying to get rid of the implicit assumption.

          Doug Cutting added a comment -

          > We would need mounts to make it transparent and some storage in the filesystem to know what we are supposed to do with the mounts.

          Perhaps. But we don't want to go there in this issue, right?

          > I was trying to get rid of the implicit assumption. [ ... ]

          The assumption that an archive directory name ends with ".har"? Is that what troubles you?

          Here's another option: use percent-encoding to name the archive dir in the authority, e.g., har://hdfs:%2F%2Fhost:port%2Fdir%2Farchive/a/b. This is harder to read, but otherwise elegant and general.

          Or yet another option: use the name of a file within the archive directory, like the index or parts which will also presumably be fixed, not user-variable. Then a path might look something like har://hdfs-host:port/dir/archive/har.index/a/b, where "har.index" is hardwired.

          Mahadev konar added a comment -

          > Perhaps. But we don't want to go there in this issue, right?
          True.

          I think I'll go with the original proposal of implicit hars. Using the name of a file within the archive directory is more confusing.

          Mahadev konar added a comment -

          This patch addresses the archives issue.

          This patch includes the following:

          • har:///user/mahadev/foo.har

          denotes a Hadoop archive. This is the default URI, which will use the default underlying filesystem specified in your conf.

          In case you want to be explicit or use some other HDFS (not the default one),

          then the URI is:

          har://hdfs-host:port/user/mahadev/foo.har

          The URIs have an implicit assumption about which part of the URI denotes the directory for Hadoop archives. The code scans the path from the end and assumes the component matching *.har to be the directory that is the archive.

          • It has a filesystem layer, so commands like

          hadoop fs -ls har:///user/mahadev/foo.har

          work. Most of the mutating commands are not implemented for archives; -cat and -copyToLocal work as expected.

          • Works with map/reduce.

          So the input to a map/reduce job could be har:///user/mahadev/foo.har and this would work fine (see the usage sketch at the end of this comment).

          Code design and explanation:

          • There are two index files. The _index file contains entries of the form
            filename <dir>/<file> partfile startindex size childpathnames_if_directory.
            The _index file is sorted by the hashcode of the filenames.
            The second index file, _masterindex, contains pointers into the _index file to speed up the lookup of files inside it.
          • To create an archive, the user needs to run
            bin/hadoop archives -archiveName foo.har inputpaths outputdir

          This is a map/reduce job wherein all the files are distributed amongst the maps, which create part files of around 2GB or so. The reduce then gets the startindex and size from the maps for all the files and creates the _index and _masterindex.

          • Permissions are not persisted, so the permissions returned by the Har filesystem are the same as those of the index files.
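          A usage sketch based on the behavior described above (paths are examples, and the exact API surface in 0.18 may differ slightly): the archive is created with the command-line tool, then listed, read, or fed to map/reduce through the har filesystem layer like any other path.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.mapred.FileInputFormat;
            import org.apache.hadoop.mapred.JobConf;

            public class HarUsageSketch {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Equivalent of "hadoop fs -ls har:///user/mahadev/foo.har",
                // going through the har filesystem layer.
                Path har = new Path("har:///user/mahadev/foo.har");
                FileSystem harFs = har.getFileSystem(conf);
                for (FileStatus stat : harFs.listStatus(har)) {
                  System.out.println(stat.getPath());
                }

                // The same archive path can be used directly as map/reduce input;
                // no map/reduce-side changes are needed.
                JobConf job = new JobConf(conf);
                FileInputFormat.addInputPath(job, har);
              }
            }
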
          Doug Cutting added a comment -

          This sounds great! Using the default filesystem makes the URIs much more readable!

          > bin/hadoop archives -archiveName foo.har inputpaths outputdir

          • can we name the command 'archive' instead of 'archives'?
          • can the output name and directory be combined?

          If so, the command might look like:

          bin/hadoop archive dir/foo.har dir1 dir2 [ ... ]

          Mahadev konar added a comment -

          This is an updated patch with better comments and a few bug fixes which I found while testing corner cases.

          Mahadev konar added a comment -

          In reply to Doug's comments:

          > can we name the command 'archive' instead of 'archives'?

          The name was always 'archive'; I just mistyped it.

          > can the output name and directory be combined?

          I am not strongly against it; I just feel the current command line is more user friendly, making it clear that there is an archive name which should end with .har. I might be wrong. I can go either way.

          Mahadev konar added a comment -

          Trying Hudson again...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12382982/hadoop-3307_2.patch
          against trunk revision 661918.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 9 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/console

          This message is automatically generated.

          Devaraj Das added a comment -

          1) The query part in the creation of the URI can be removed (in fact we probably should flag an error if the har path contains a '?' since it is not a valid Path)
          2) decodeURI should be done first and then the har archive path can be extracted
          3) getHarAuth needn't be parsing the uri everytime since it is constant. The auth can just be stored in a class variable.
          4) open() & other filesystem calls should support taking just the fragment path to a file within the archive
          5) Why is fileStatusInIndex storing the Store object in a list while going through the master index? Isn't the list always going to be of size 1 (if the file is present in the archive)?
          6) The index files are not closed in the fileStatusInIndex call. This might lead to problems in the cases where the underlying filesystem is the localfs (where open actually returns a filedescriptor). But I am also not sure whether we should open and close on every call to fileStatusInIndex. Can we somehow cache the handles to the index files and reuse them.
          7) When we create a part file, can we record the things like replication factor, permissions, etc. and emit them just like we emit the other info like partfilename, etc. during archive creation and store them in the index file. That way we don't have to fake everything in the listStatus.
          8) In listStatus, the start and end braces are missing for the if/else block
          9) In listStatus, the check hstatus.isDir()?0:hstatus.getLength() seems redundant. hstatus.isDir is always going to be false
          10) I don't understand clearly why makeRelative is done in the listStatus and getFileStatus calls
          11) Do you enforce the .har in the archive name when it is created?

          I am not done reviewing the entire patch yet...

          Devaraj Das added a comment -

          Some more comments:
          1) In writeTopLevelDirs, remove the comment "invert the paths"
          2) the SRC_LIST_LABEL file needs to have a high replication factor (maybe 10 or something)
          3) Use NullOutputFormat instead of the HarOutputFormat
          4) Overall the coding convention is that we have starting/terminating braces even for single statement blocks. Please update the code w.r.t that.

          Overall, the path manipulations (like makeRelative/absolute) confuse me. It'd be nice to clean up the code in that respect if possible.

          Devaraj Das added a comment -

          In the test case, it'd be nice if you read the whole file content instead of just 4 bytes, and then validate. That'd tell you that there are no extra (spurious) bytes in the archived files, right?

          Mahadev konar added a comment -

          Response to Devaraj's comments:

          > 1) The query part in the creation of the URI can be removed (in fact we probably should flag an error if the har path contains a '?' since it is not a valid Path)
          agreed

          > 2) decodeURI should be done first and then the har archive path can be extracted
          agreed

          > 3) getHarAuth needn't be parsing the uri everytime since it is constant. The auth can just be stored in a class variable.
          will do

          > 4) open() & other filesystem calls should support taking just the fragment path to a file within the archive
          makes sense

          > 5) why is fileStatusInIndex storing the Store object in a list while going through the master index? Isn't the list going to be always of size 1 (if the file is present in the archive)
          No, the list can have more than one entry, since hashcodes will have collisions and colliding files can end up in the same bucket.

          > 6) the index files are not closed in the fileStatusInIndex call. This might lead to problems in the cases where the underlying filesystem is the localfs (where open actually returns a filedescriptor). But I am also not sure whether we should open and close on every call to fileStatusInIndex. Can we somehow cache the handles to the index files and reuse them.
          For now I will just be opening and closing the files. I'll leave this optimization for later.

          > 7) When we create a part file, can we record the things like replication factor, permissions, etc. and emit them just like we emit the other info like partfilename, etc. during archive creation and store them in the index file. That way we don't have to fake everything in the listStatus.

          It was stated in the design that we will be ignoring permissions in this version. In later versions we can persist the permissions as well.

          > 8) In listStatus, the start and end braces are missing for the if/else block
          will fix

          > 9) In listStatus, the check hstatus.isDir()?0:hstatus.getLength() seems redundant. hstatus.isDir is always going to be false
          will fix

          > 10) I don't understand clearly why makeRelative is done in the listStatus and getFileStatus calls
          It's just to join two paths that are both absolute.
          A path inside the archive is persisted as an absolute path like /user/mahadev, so to use it as the trailing component under the archive path it is made relative before creating the new Path.
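
          A minimal sketch of that joining step (the class and method names are made up; this is just the idea, not the patch's actual makeRelative):

          import org.apache.hadoop.fs.Path;

          // Sketch only: an absolute child path would replace the parent's path
          // during resolution, so the leading '/' is stripped before joining.
          class MakeRelativeSketch {
            static Path join(Path archiveRoot, String storedAbsolutePath) {
              String relative = storedAbsolutePath.startsWith("/")
                  ? storedAbsolutePath.substring(1)
                  : storedAbsolutePath;
              return new Path(archiveRoot, relative);
            }

            public static void main(String[] args) {
              // should print /user/mahadev/foo.har/user/mahadev/somefile
              System.out.println(join(new Path("/user/mahadev/foo.har"),
                                      "/user/mahadev/somefile"));
            }
          }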

          > Do you enforce the .har in the archive name when it is created?
          yes.

          > 1) In writeTopLevelDirs, remove the comment "invert the paths"
          will do

          > 2) the SRC_LIST_LABEL file needs to have a high replication factor (maybe 10 or something)
          makes sense
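
          A minimal sketch of bumping the replication of the generated file list so the many map tasks reading it do not all hit the same few datanodes (the path below is a placeholder, not the actual SRC_LIST_LABEL location):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Sketch only: raise the replication factor of the source-list file.
          class SrcListReplicationSketch {
            public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              Path srcList = new Path("/tmp/har-job/_src_files");  // placeholder
              FileSystem fs = srcList.getFileSystem(conf);
              fs.setReplication(srcList, (short) 10);  // "maybe 10 or something"
            }
          }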

          > Use NullOutputFormat instead of the HarOutputFormat
          I did not know we had a NullOutputFormat!
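
          A minimal sketch of the suggested change, assuming the archiving job writes its part files itself as a side effect so the job's declared output can simply be discarded:

          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.lib.NullOutputFormat;

          // Sketch only: drop the job's declared output instead of keeping a
          // custom do-nothing HarOutputFormat.
          class ArchiveJobOutputSketch {
            static void configureOutput(JobConf job) {
              job.setOutputFormat(NullOutputFormat.class);
            }
          }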

          > 4) Overall the coding convention is that we have starting/terminating braces even for single statement blocks. Please update the code w.r.t that.
          will do

          > Overall, the path manipulations (like makeRelative/absolute) confuses me. It'd be nice to cleanup the code in that aspect if possible.
          will try to clean up the code.

          > In the testcase, it'd be nice if you read the whole file content instead of just 4 bytes, and then validate. That'd tell you that there is no extra (spurious) bytes in the archived files, right?
          makes sense.
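
          A minimal sketch of such a check: read the archived copy to the end and compare it against the expected bytes, so any spurious trailing bytes fail the test (the helper name is made up and the assertion style is simplified relative to the actual testcase):

          import java.io.IOException;
          import java.util.Arrays;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Sketch only: verify the whole content, then verify there is nothing
          // left in the stream beyond the expected length.
          class FullContentCheckSketch {
            static void assertSameContent(FileSystem harFs, Path archived,
                                          byte[] expected) throws IOException {
              byte[] actual = new byte[expected.length];
              FSDataInputStream in = harFs.open(archived);
              try {
                in.readFully(actual);      // read exactly expected.length bytes
                if (in.read() != -1) {     // anything left over is spurious
                  throw new AssertionError("archived file is longer than expected");
                }
              } finally {
                in.close();
              }
              if (!Arrays.equals(expected, actual)) {
                throw new AssertionError("archived file content differs");
              }
            }
          }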

          Mahadev konar added a comment -

          This patch addresses all of Devaraj's comments.

          Mahadev konar added a comment -

          It also fixes the Findbugs warnings.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383352/hadoop-3307_3.patch
          against trunk revision 663079.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/console

          This message is automatically generated.

          Mahadev konar added a comment -

          Deleted the patch.

          Mahadev konar added a comment -

          Attaching a new patch that gets rid of the Findbugs warnings.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383399/hadoop-3307_4.patch
          against trunk revision 663337.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/console

          This message is automatically generated.

          Devaraj Das added a comment -

          I just committed this. Thanks, Mahadev!


            People

            • Assignee:
              Mahadev konar
            • Reporter:
              Mahadev konar
            • Votes:
              0
            • Watchers:
              3
