Hadoop Common / HADOOP-619

Unify Map-Reduce and Streaming to take the same globbed input specification

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels: None

      Description

      Right now streaming input is specified very differently from other map-reduce input. It would be good if these two apps could take much more similar input specs.

      In particular, -input in streaming expects a file or glob pattern while MR takes a directory. It would be cool if both could take a glob pattern of files and if both took a directory by default (with some pattern excluded to allow logs, metadata and other framework output to be safely stored).

      We want to be sure that MR input is backward compatible over this change. I propose that a single file should be accepted as an input, or a single directory. Globs should only match directories if the pattern is '/' terminated, to avoid massive inputs being specified by mistake.
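
      For concreteness, under this proposal both kinds of jobs could specify input along the following lines. This is only an illustrative sketch against the old JobConf API; the paths are hypothetical and nothing here is taken from a patch.

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.JobConf;

        public class GlobInputExample {
          public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Today MR takes a directory while streaming takes a file or glob;
            // under the proposal both would accept either form.
            conf.setInputPath(new Path("/logs/2006-11-01"));        // a directory, as before
            conf.addInputPath(new Path("/logs/2006-11-*/part-*"));  // a glob of files
          }
        }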

      Thoughts?

      1. Hadoop-619_1.patch
        8 kB
        Sanjay Dahiya
      2. Hadoop-619_1.patch
        9 kB
        Sanjay Dahiya
      3. Hadoop-619_2.patch
        9 kB
        Sanjay Dahiya
      4. Hadoop-619_2.patch
        8 kB
        Sanjay Dahiya
      5. Hadoop-619_3.patch
        10 kB
        Sanjay Dahiya
      6. Hadoop-619_4.patch
        10 kB
        Doug Cutting
      7. Hadoop-619.patch
        12 kB
        Sanjay Dahiya
      8. Hadoop-619.patch
        12 kB
        Sanjay Dahiya
      9. Hadoop-619.patch
        11 kB
        Sanjay Dahiya


          Activity

          eric baldeschwieler created issue -
          Sameer Paranjpye made changes -
          Component/s mapred [ 12310690 ]
          Sanjay Dahiya added a comment -

           If we have patterns in the input path to MR, do we still check for the existence of files matching that pattern in DFS while validating the command?
           Currently it's a single check which validates that the directory exists. With patterns, we will need to fetch all the files in the directory, match them against the pattern, and, if no file matches, show an error and exit. This will be done in the job client.

          comments?

          Sanjay Dahiya made changes -
          Assignee Sanjay Dahiya [ sanjay.dahiya ]
          Runping Qi added a comment -

           We should check the existence of the path(s), but leave the pattern matching to the InputFormatBase class.

           We have three cases to consider: all paths exist, none exist, and some do, some do not.
           Cases 1 and 2 are easy. For case 3, I think it is reasonable that the job proceeds, but records that in the job
           status.

           When a job starts, we should record all the input directories and the number of files under each matching the pattern. If no file matches the pattern, the job should stop. Otherwise, the job generates splits from the matched files and proceeds.

          Owen O'Malley added a comment -

          I think that streaming should stop using the filename globbing for --input and instead list a set of input directories.

          I think that the InputFormatBase should support a regex filter on filenames that should default to something like "[^_].*" so that any file that starts with an "_" is not treated as input. This will allow us to put things like _LOGS as a filename to store the log files for a job.

          Streaming should then support the regex filename filters using "--input-filter" with a regex that filters the filenames.
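
           A minimal sketch of the default filter Owen describes, assuming the org.apache.hadoop.fs.PathFilter interface; the class name is hypothetical and the snippet is illustrative only:

           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.fs.PathFilter;

           // Accept any path whose final name component matches "[^_].*",
           // i.e. does not start with an underscore.
           public class DefaultInputFilter implements PathFilter {
             public boolean accept(Path path) {
               return path.getName().matches("[^_].*");
             }
           }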

          Owen O'Malley added a comment -

           After a bit more consideration, I think it would be fine for JobConf.getInputPaths to contain both files and directories. InputFormatBase should use the HDFS globbing to expand wildcards. For files, it would include just that file; for directories, it would keep the current semantics.

          Owen O'Malley added a comment -

           > Globs should only match directories if the pattern is '/' terminated, to avoid massive inputs
           > specified by mistake.

          I don't like this. As I commented on HADOOP-778, I think that having different semantics for trailing slash is confusing. I think we'd be better off warning the user if they have a trailing "/" on an input path. (Since "foo/" is likely a mistake and they probably meant "foo" or "foo/log*.gz" or something.)

          Runping Qi added a comment -

          A couple of points:

           1. Specifying input:
           I think the user should specify the input files in the following way:

           spec_1, spec_2, ...

           Each spec above is a "root" directory, with an optional pattern specification
           (regex, or simple wildcard-based patterns).
           The semantics for a spec are that all the files directly or indirectly under the root are input files,
           as long as the paths match the pattern.
           If the pattern is omitted in the spec, all the files under the root dir are the input files.

          Here are a couple of examples:

          "foo/" all the files under foo tree

          "foo/;*.gz": all the files under foo tree, with extension .gz

          "foo/; /bar/.gz: all the files under foo tree, with extension .gz and with bar as a intermediate directory.

          2. Checking and matching:
           The job client should check the existence of the root dirs.
           If none of the root dirs exist, then the job should fail immediately.
           If some root dirs do not exist, the job client should generate a warning.

           The InputFormatBase should perform the file list generation and matching. The list of matched files
           should be part of the job's status, so that the user can examine them through the web UI.

          Owen O'Malley added a comment -

           I really don't like this proposal. It adds custom syntax, namely the ';' operator, which binds the wrong way in that it is tighter than ','. I'm OK with either:

          1. The input paths are all directories and there is a single glob or regex filter on filenames.
          2. The input paths are files or directories.

          In either case, I think that the input paths should be globbed by the file system.

          eric baldeschwieler added a comment -

           Perhaps we should just limit to either globbing or a single directory per argument, and simply drop directories from globbing? This seems fairly simple and not too restrictive in practice.

          I agree that if a directory is used we should exclude files starting with "_". This will allow us to put metadata in output directories. I think we should also simply exclude subdirectories in input directories. Again, I doubt this will prove restrictive in practice.

           It seems to me we should error out if any glob matches no files or a listed input directory is not present. Perhaps we could provide another switch for an optional input, in case users actually want a job to run if an input spec doesn't match any input.

          Owen O'Malley added a comment -

          just to clarify my proposal, it would look like:

           for p in job.getInputPaths():
               for g in p.globPaths():
                   if g.isFile():
                       fileList.add(g)
                   else:
                       for f in g.listFiles():
                           fileList.add(f)

          I don't understand what you are proposing, Eric. Is it:

           for p in job.getInputPaths():
               if p.isDirectory():
                   for f in p.listFiles():
                       fileList.add(f)
               else:
                   for g in p.globPaths():
                       if g.isFile():
                           fileList.add(g)

          eric baldeschwieler added a comment -

          owen: yes

           arkady: -input ...somepath/*/*.gz or somesuch should work in this proposal, so that should solve your problem, right?

           Maybe I should relax about allowing directories in glob patterns. It is just that this sort of matching can be really messy on typos. Standard Unix tools don't allow it. But I guess as long as it is not recursive, the limits of input formats and such should keep this from going too nuts.

          So I'm ok with owen's proposal. (No recursion, right?)

          Runping Qi added a comment -

          How flexible is globPaths()?

           Is p.globPaths() legal for the following p:
           p = "root1/*/foo/*/*.gz"
           p = "root2/*/200611??/*/*.gz"

           In other words, can I specify the following input paths:
           "root1/*/foo/*/*.gz,root2/*/200611??/*/*.gz" in the jobconf?
          I am OK with Owen's proposal.

           Another point, related to specifying different mapper classes for different input files: we should
           use some similar syntax:
           "mapperClass1@root1/*/foo/*/*.gz,mapperClass2@root2/*/200611??/*/*.gz"

          thoughts?

          Sanjay Dahiya added a comment -

          Runping:
           The patterns you mentioned work. Basically it breaks the input path into parts and matches files/directories at each step, going forward (without recursing), so it should be OK with these types of patterns. A sketch of this stepwise expansion follows below.

          >The list of matched files
          > should be part of the job's status so that the user can examine them through web UI.
           The list of input paths will be a part of the JobClient, so it will be available through the JobTracker UI anyway. Does this sound OK?
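
           A rough sketch of the stepwise, non-recursive expansion described above, written against the later FileSystem API (listStatus) purely for illustration; the class and helper names are hypothetical and this is not the patch's code:

           import java.io.IOException;
           import java.util.ArrayList;
           import java.util.List;

           import org.apache.hadoop.fs.FileStatus;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           public class ComponentGlob {
             // Expand a glob one path component at a time. Each step lists only
             // the immediate children of the current matches, so no recursion.
             public static List<Path> expand(FileSystem fs, String pattern) throws IOException {
               List<Path> matches = new ArrayList<Path>();
               matches.add(new Path("/"));
               for (String component : pattern.split("/")) {
                 if (component.length() == 0) {
                   continue; // skip the empty component before a leading '/'
                 }
                 List<Path> next = new ArrayList<Path>();
                 for (Path dir : matches) {
                   for (FileStatus child : fs.listStatus(dir)) {
                     if (matchesComponent(child.getPath().getName(), component)) {
                       next.add(child.getPath());
                     }
                   }
                 }
                 matches = next;
               }
               return matches;
             }

             // Translate a single glob component ('*' and '?') into a regex and match.
             private static boolean matchesComponent(String name, String component) {
               String regex = component.replace(".", "\\.")
                                       .replace("?", ".")
                                       .replace("*", ".*");
               return name.matches(regex);
             }
           }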

          Sanjay Dahiya added a comment -

           These changes cause the TestEmptyJobWithDFS test to fail, since in that test the number of input files is 0. With the new changes, a job fails if there are no input files in the input paths after glob expansion.

           Do we need the ability to run empty jobs, or do we want to specify some extra flag stating that input is not needed for the job? The test case will need to be changed then as well.

          Comments?

          Sanjay Dahiya added a comment -

           Here is a patch which adds globbing support to JobClient. It's not commit-ready yet, as it breaks unit tests, as I mentioned in the previous comment. The unit tests need to be changed if we agree on that.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346474 ]
          Owen O'Malley added a comment -

           This patch doesn't change anything in streaming yet.

          A lot of the code in StreamInputFormat should be removable after this patch.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346474 ]
          Sanjay Dahiya added a comment -

           This patch makes the following changes:
           1. Accept empty input directories only if not specified as a glob.
           2. Remove validation code from StreamInputFormat; there are a few other related changes to streaming as part of HADOOP-476 (not included in this patch).

          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346542 ]
          Sanjay Dahiya made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346542 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346543 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346548 ]
          Hadoop QA added a comment -

          +1, http://issues.apache.org/jira/secure/attachment/12346548/Hadoop-619.patch applied and successfully tested against trunk revision r482999

          Sanjay Dahiya added a comment -

           Updating patch with javadoc for JobConf setInputPath and addInputPath, which now accept glob patterns.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346617 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346548 ]
          Doug Cutting added a comment -

           I can't get this patch to apply to trunk anymore. It seems to conflict with other patches that were recently committed. Sorry! Can you please re-generate it against the current trunk? Thanks!

          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Sanjay Dahiya added a comment -

          Updated patch with current trunk.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619.patch [ 12346747 ]
          Doug Cutting added a comment -

          I should have followed this patch more closely earlier. Sorry!

           This logic should not be in JobClient, but rather in InputFormatBase. The kernel should know very little about how inputs are specified; it should delegate all of that to the InputFormat. Inputs will usually be files in a Hadoop FileSystem, but they could be something else entirely, e.g., something not easily modelled by Hadoop's FileSystem API. So in mapred kernel code, like JobClient, we shouldn't assume that inputs are from a FileSystem.

          I have not been as good at enforcing this distinction in the past. For example, JobConf.addInputPath(Path) should be a static InputFormatBase.addInputPath(JobConf,Path). The method is provided on JobConf as a convenience, for jobs that specify their inputs with FileSystem paths, so this isn't a gross violation: applications are not forced to use this method.

          More seriously, the InputFormat method areValidInputDirectories(Path[]) should be more abstract: we must not assume that inputs are always named with FileSystem Paths. Rather, this method should probably be something like hasValidInput(JobConf). Then InputFormatBase can implement it to glob, etc. To make this even more clear, we should rename InputFormatBase to be FileSystemInputFormat.

          But those changes are beyond the scope of this issue and should be made under separate issues. For this issue we need to move as much of the logic as possible out of JobClient and into InputFormatBase.

          Sanjay Dahiya added a comment -

           Thanks, Doug, for the review.

           For using InputFormatBase to validate and expand globs, I think it would be a good idea to change the signature of InputFormatBase.areValidInputDirectories() to return a list of valid Paths as part of this patch itself. If some Path is found to be invalid (which would cause the job to fail), it should throw an InvalidArgumentException.
           This will prevent the glob expansion from happening twice: once on the job client end (InputFormatBase.areValidInputDirectories()) and then on the JobTracker end while assigning splits (InputFormatBase.listPaths()).

           areValidInputDirectories() is a public method in InputFormat, and in Hadoop it's used only in JobClient. Otherwise we will need to expand the globs twice.

          Does this sound reasonable?

          Doug Cutting added a comment -

          > change the signature of InputFormatBase.areValidInputDirectories() to return a list of valid Paths

           I don't think we should do this, since it assumes that inputs are always representable by Paths. The long-term goal is that inputs and outputs are only assumed to be (1) specifiable in a Configuration and (2) representable as a serializable Split implementation. In particular, we should not assume that they are FileSystem files, and hence should not pass Path in a primitive MapReduce API. Thus areValidInputDirectories(Path[]) should evolve into hasValidInput(JobConf), but not as a part of this patch.

          > Otherwise we will need to expand the globs twice

           For now, I think the glob can happen twice: once when checking the input in the JobClient, and once on the JobTracker when constructing splits. Longer term we may wish to try to optimize this, but I'm not convinced that's required. If it is, then perhaps, instead of hasValidInput(JobConf), we could have a prepareInput(JobConf) method that validates inputs and is permitted to alter the job. But we shouldn't modify the InputFormat API for this issue. Owen's already working on that separately.

          Owen O'Malley added a comment -

           Actually, I'd prefer to have the globbing happen and stay internal to the InputFormat, since it may not make sense for all InputFormats. How about this interface for validating input directories:

           InputFormat:

           void hasValidInput(JobConf) throws IOException, InvalidInputException;

           public class InvalidInputException extends IOException {
             List<IOException> getProblems();
           }

          Although this will cause the system to glob twice (once on job client and once on job tracker), I think in the medium term that the job client should generate the splits and write them to the system directory parallel to the job.xml. After that, the glob will only happen once.
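
           A minimal sketch of what such an exception could look like; the constructor and message below are assumptions for illustration, not part of Owen's spec:

           import java.io.IOException;
           import java.util.List;

           // An IOException that aggregates the individual per-path problems, so
           // the client can report all invalid inputs at once instead of failing
           // on the first one.
           public class InvalidInputException extends IOException {
             private final List<IOException> problems;

             public InvalidInputException(List<IOException> problems) {
               super(problems.size() + " invalid input path(s)");
               this.problems = problems;
             }

             public List<IOException> getProblems() {
               return problems;
             }
           }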

          Doug Cutting added a comment -

          > I'd prefer to have the globbing happen and stay internal to the InputFormat

          +1

          > void hasValidInput(JobConf) throws IOException, InvalidInputException;

          If it returns void, then it ought to be named 'validateInput'. But otherwise, this looks fine.

          I still don't think we should try to address this change in this issue. Adding globbing does not need to change the InputFormat API. This discussion should be a part of HADOOP-372 and HADOOP-451, the grand job input interface restructuring.

          Sanjay Dahiya added a comment -

           Updated patch; it moves all changes to InputFormatBase and doesn't change the API. Glob expansion is done twice, as discussed before.

           Thanks, Doug and Owen, for the comments.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12346936 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12346936 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12346984 ]
          Owen O'Malley added a comment -

          The globbing should be done via FileSystem.globPaths() instead of manipulating regular expressions. Otherwise, we'll have to continue to maintain two code bases. Other than that, it looks good.

          Sanjay Dahiya added a comment -

           Thanks, Owen, for the review. I hadn't looked at the glob implementation in FileSystem.

           This patch uses FileSystem.globPaths(). It also fixes an issue which caused a server:port pattern to give an error during glob expansion.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12347163 ]
          Sanjay Dahiya made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Sanjay Dahiya made changes -
          Link This issue blocks HADOOP-788 [ HADOOP-788 ]
          Hadoop QA added a comment -

          +1, since http://issues.apache.org/jira/secure/attachment/12347163/Hadoop-619_1.patch applied and successfully tested against trunk revision r486901.

          Owen O'Malley added a comment -

           A couple of things:
           1. You shouldn't use the "ignored" FileSystem. You should use the file system for that particular path (path.getFileSystem(conf)).
           2. I think the error conditions should be that either:
              a. the list of input directories is empty (conf.getInputPaths().length == 0), or
              b. any glob pattern matches no files or directories.
           3. The hiddenFileFilter is cleaner as:

           public boolean accept(Path p) {
             String name = p.getName();  // 'name' was left implicit in the snippet as posted
             return !name.startsWith("_") && !name.startsWith(".");
           }
          Owen O'Malley made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Fix Version/s 0.10.0 [ 12312207 ]
          Affects Version/s 0.9.1 [ 12312214 ]
          Sanjay Dahiya added a comment -

           Thanks, Owen, for the review.
           This patch incorporates the review comments and removes hidden files from the directory listing on the JobTracker end while generating splits.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12347302 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619_1.patch [ 12347302 ]
          Sanjay Dahiya made changes -
          Attachment Hadoop-619_2.patch [ 12347303 ]
          Sanjay Dahiya made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12347303/Hadoop-619_2.patch applied and successfully tested against trunk revision r486901.

          Doug Cutting added a comment -

           + String filename = filePattern.toString();
           + int schemeIdx = filename.lastIndexOf(':');
           + // remove scheme/server/port from path for glob match
           + if( -1 != schemeIdx ) {
           +   filename = filename.substring(
           +       filename.indexOf(Path.SEPARATOR_CHAR, schemeIdx));
           + }

          This should not be a string manipulation, but rather use the Path's URI, e.g. simply:

          String filename = filePattern.toUri().getPath();

          The spacing in this patch is also unusual.

          http://java.sun.com/docs/codeconv/html/CodeConventions.doc7.html#682

          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Sanjay Dahiya added a comment -

           Updated patch with Doug's comments.
           Thanks, Doug, for the review.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619_2.patch [ 12347570 ]
          Sanjay Dahiya made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting added a comment -

          The spacing in this patch is still non-standard.

          I note that the "mapred.input.subdir" feature is removed. I think this is okay, as no one uses it, but thought it should be noted.

          Why can't validateInput() simply call globPaths() and check that the results exist? The current implementation is not only much more complicated, but I'm not sure that it's correct, since it fails if any glob pattern fails to have matches. Is that what we want? I would think that non-matching glob expressions, like empty directories, should be ignored so long as some of the inputs exist.

          Finally, it looks like globPaths() still calls Path#toString() instead of Path#toUri().getPath().

          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Sanjay Dahiya added a comment -

           > Why can't validateInput() simply call globPaths() and check that the results exist? The current implementation is not only much more complicated, but I'm not sure that it's correct, since it fails if any glob pattern fails to have matches. Is that what we want? I would think that non-matching glob expressions, like empty directories, should be ignored so long as some of the inputs exist.

           Doug:
           From an earlier discussion with Owen (http://issues.apache.org/jira/browse/HADOOP-619#action_12458633) we decided that non-matching patterns should throw errors, whereas empty directories are acceptable.

           I'll see what's happening to the indentation.

          Sanjay Dahiya added a comment -

          Updated patch, with spacing changes.

          >Finally, it looks like globPaths() still calls Path#toString() instead of Path#toUri().getPath().

           Doug: The earlier patch didn't use Path.toString(); I replaced it with Path.toUri().
           Also, I have not changed the implementation to accept empty patterns yet; if you feel that's the right semantics, then I will change it and resubmit the patch.

          Sanjay Dahiya made changes -
          Attachment Hadoop-619_3.patch [ 12347738 ]
          Doug Cutting added a comment -

          > we decided that non matching patterns should throw errors

          Sorry, I missed that. Yes, that's probably reasonable.

          There're still a few unspaced '}else{' clauses, but I can fix those easily enough.

          One other thing I just noticed is that it looks like globbed paths are not fully qualified with scheme and authority. This will cause problems if patterns from multiple filesystems are included in a job's input. We don't test that yet, so this won't break anything today, but we'd like that to work.

          Sanjay Dahiya added a comment -

          >One other thing I just noticed is that it looks like globbed paths are not fully qualified with scheme and authority. This will cause problems if patterns from multiple filesystems are included in a job's input. We don't test that yet, so this won't break anything today, but we'd like that to work.

          Doug:
           Actually, fully qualified paths should work with globbing; for example, the following works with this patch:
           bin/hadoop jar build/hadoop-0.9.3-dev-examples.jar wordcount hdfs://<host>:9015/<dir>/*.txt <out>
           Or were you referring to something else?

          Doug Cutting added a comment -

           InputFormatBase#listPaths() no longer calls fs.makeQualified() on each path, so, if a non-default FileSystem implementation returns paths that are not fully qualified (i.e., without scheme and authority), then things would be broken. Now, perhaps it is a bug for a FileSystem#listPaths implementation ever to return a non-fully-qualified path, but they don't all currently.

          So, to be clear, the case I'm concerned about is if hdfs://xxx:1234/ is your default fs, but you specify s3://yyy/zzz/*.txt as an input, and that glob pattern expands to simply /yyy/foo.txt, then this file would not be found. Does that make sense?
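
           The fix Doug describes amounts to qualifying each globbed path against the FileSystem it came from. A small illustrative sketch; the class and helper names are hypothetical:

           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           // Qualify each globbed path so it carries its scheme and authority
           // (e.g. s3://yyy) even when it is not from the default FileSystem.
           public class QualifyPaths {
             public static Path[] qualify(FileSystem fs, Path[] globbed) {
               Path[] out = new Path[globbed.length];
               for (int i = 0; i < globbed.length; i++) {
                 out[i] = fs.makeQualified(globbed[i]);
               }
               return out;
             }
           }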

          Doug Cutting made changes -
          Attachment Hadoop-619_4.patch [ 12348405 ]
          Doug Cutting added a comment -

          I've attached a version of this that fully qualifies paths. I'll commit this soon unless there are objections.

          Doug Cutting made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting added a comment -

          I just committed this. Thanks, Sanjay!

          Doug Cutting made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]
          Gavin made changes -
          Link This issue blocks HADOOP-788 [ HADOOP-788 ]
          Gavin made changes -
          Link This issue is depended upon by HADOOP-788 [ HADOOP-788 ]

             People

             • Assignee:
               Sanjay Dahiya
             • Reporter:
               eric baldeschwieler
             • Votes:
               0
             • Watchers:
               1
