Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Target Version/s:

      Description

      After the discussion on HADOOP-8709 it was decided that we need better APIs for globbing to remove some of the inconsistencies with other APIs. Inorder to maintain backwards compatibility we should deprecate the existing APIs and add in new ones.

      See HADOOP-8709 for more information about exactly how those APIs should look and behave.

      1. HADOOP-8724.patch
        37 kB
        Allen Wittenauer
      2. HADOOP-8724.txt
        37 kB
        Robert Joseph Evans

        Activity

        Hide
        Daryn Sharp added a comment -

        It would be helpful to describe the intended new api here. If I understand the proposal in the other jira, there would be a new Glob object. Something to consider is that callers of globStatus(Path) don't know if the path is a glob or not prior to calling. Simply instantiating a Glob object when the path might not be a glob may be problematic. Perhaps a better way to handle might be to have a static Path.create(String) which returns either a Path or GlobPath.

        Show
        Daryn Sharp added a comment - It would be helpful to describe the intended new api here. If I understand the proposal in the other jira, there would be a new Glob object. Something to consider is that callers of globStatus(Path) don't know if the path is a glob or not prior to calling. Simply instantiating a Glob object when the path might not be a glob may be problematic. Perhaps a better way to handle might be to have a static Path.create(String) which returns either a Path or GlobPath .
        Hide
        Robert Joseph Evans added a comment -

        There were two proposals. The first was to simply have a base path and a string pattern as the parameters to globStatus. I thought it would be better to encapsulate the two into a single Glob object so it is obvious when an API takes a glob and when it does not. The point of that is simply to indicate to the user of the API if expansion will happen or not. I personally think it is difficult right now to know which APIs will expand a pattern and which will not.

        I would find it much more confusing to have Glob be a subclass of Path because now how do I indicate that this API will not do expansion, or I have to update all APIs to optionally do expansion when a GlobPath is passed in which seems like a much bigger change.

        Show
        Robert Joseph Evans added a comment - There were two proposals. The first was to simply have a base path and a string pattern as the parameters to globStatus. I thought it would be better to encapsulate the two into a single Glob object so it is obvious when an API takes a glob and when it does not. The point of that is simply to indicate to the user of the API if expansion will happen or not. I personally think it is difficult right now to know which APIs will expand a pattern and which will not. I would find it much more confusing to have Glob be a subclass of Path because now how do I indicate that this API will not do expansion, or I have to update all APIs to optionally do expansion when a GlobPath is passed in which seems like a much bigger change.
        Hide
        Chris Douglas added a comment - - edited

        Simply instantiating a Glob object when the path might not be a glob may be problematic. Perhaps a better way to handle might be to have a static Path.create(String) which returns either a Path or GlobPath.

        Distinguishing intent where the Path is created (as above) could solve part of the problem with HADOOP-8709 (caller can resolve which API to use?), but I don't think a subtype of Path will solve the other issues. Dispatch is still on the static type, so nothing is solved for the callee.

        The first was to simply have a base path and a string pattern as the parameters to globStatus. I thought it would be better to encapsulate the two into a single Glob object so it is obvious when an API takes a glob and when it does not.

        Having an API advertise its "globiness" is useful, I like it. Though users specifying a single resource will need to create Glob objects on top of Path objects which are really URIs... which seems unnecessarily confusing. Still, methods for Configuration are also straightforward; setGlob() would need to escape everything in the URI side first (so users could continue to specify globs on the commandline as Paths with special characters), but aside from that it seems straightforward. "Correcting" it everywhere in the code may be prohibitive, though...

        Since many of these are user-facing, do you think we need a more specific type than String for the glob part? ls /users/hadoop/*.foo translated into:

        fs.globStatus(new Glob(new Path("hdfs://nn:8020/users/hadoop"), Pattern.compile("*.foo")))

        seems like it's strayed from sanity...

        Show
        Chris Douglas added a comment - - edited Simply instantiating a Glob object when the path might not be a glob may be problematic. Perhaps a better way to handle might be to have a static Path.create(String) which returns either a Path or GlobPath. Distinguishing intent where the Path is created (as above) could solve part of the problem with HADOOP-8709 (caller can resolve which API to use?), but I don't think a subtype of Path will solve the other issues. Dispatch is still on the static type, so nothing is solved for the callee. The first was to simply have a base path and a string pattern as the parameters to globStatus. I thought it would be better to encapsulate the two into a single Glob object so it is obvious when an API takes a glob and when it does not. Having an API advertise its "globiness" is useful, I like it. Though users specifying a single resource will need to create Glob objects on top of Path objects which are really URIs... which seems unnecessarily confusing. Still, methods for Configuration are also straightforward; setGlob() would need to escape everything in the URI side first (so users could continue to specify globs on the commandline as Paths with special characters), but aside from that it seems straightforward. "Correcting" it everywhere in the code may be prohibitive, though... Since many of these are user-facing, do you think we need a more specific type than String for the glob part? ls /users/hadoop/*.foo translated into: fs.globStatus( new Glob( new Path( "hdfs: //nn:8020/users/hadoop" ), Pattern.compile( "*.foo" ))) seems like it's strayed from sanity...
        Hide
        Daryn Sharp added a comment -

        The issue is that user code doesn't know if the path is a glob, so listStatus is used if the path is considered a literal, whereas globStatus handles both literals and globs. So code just calls globStatus if a glob is allowed.

        A new Glob with a ctor that takes a regexp might be useful, but it has a few problems. How would it handle paths with multiple globs in it? Variadic? I might not understand your full intent, but how would a user input provided path be decomposed?

        Show
        Daryn Sharp added a comment - The issue is that user code doesn't know if the path is a glob, so listStatus is used if the path is considered a literal, whereas globStatus handles both literals and globs. So code just calls globStatus if a glob is allowed. A new Glob with a ctor that takes a regexp might be useful, but it has a few problems. How would it handle paths with multiple globs in it? Variadic? I might not understand your full intent, but how would a user input provided path be decomposed?
        Hide
        Robert Joseph Evans added a comment -

        I like the points that have been made so far, but I would prefer to discuss a real API over a theoretical one. The implementation still needs a lot of work but the existing tests so far pass.

        I am mostly interested in feedback about the APIs themselves.

        In general there is a Glob class. It is made up of a path and a pattern. The Path is optional. If the Path was supplied and does not exist an exception will be thrown. The pattern is required. A Glob can also be created from just a Path, but that is mostly for backwards compatibility. In most cases I would expect a Glob to be created with just a pattern.

        The result of running a glob is a GlobResult which is a RemoteIterator<FileStatus>, but also provides some extra APIs so that someone can easily tell if nothing matched the glob, or (and this this impl is a real hack right now) the glob looked more like a path then a glob. This is only really needed by PathData so the shell can adjust the error message appropriately.

        Show
        Robert Joseph Evans added a comment - I like the points that have been made so far, but I would prefer to discuss a real API over a theoretical one. The implementation still needs a lot of work but the existing tests so far pass. I am mostly interested in feedback about the APIs themselves. In general there is a Glob class. It is made up of a path and a pattern. The Path is optional. If the Path was supplied and does not exist an exception will be thrown. The pattern is required. A Glob can also be created from just a Path, but that is mostly for backwards compatibility. In most cases I would expect a Glob to be created with just a pattern. The result of running a glob is a GlobResult which is a RemoteIterator<FileStatus>, but also provides some extra APIs so that someone can easily tell if nothing matched the glob, or (and this this impl is a real hack right now) the glob looked more like a path then a glob. This is only really needed by PathData so the shell can adjust the error message appropriately.
        Hide
        Robert Joseph Evans added a comment -

        Submitting patch so that others are more likely to look at the patch and provide feedback.

        Show
        Robert Joseph Evans added a comment - Submitting patch so that others are more likely to look at the patch and provide feedback.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12542617/HADOOP-8724.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The applied patch generated 2091 javac compiler warnings (more than the trunk's current 2059 warnings).

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.TestHftpDelegationToken

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
        Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//artifact/trunk/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12542617/HADOOP-8724.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. -1 javac. The applied patch generated 2091 javac compiler warnings (more than the trunk's current 2059 warnings). +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHftpDelegationToken +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1364//console This message is automatically generated.
        Hide
        Allen Wittenauer added a comment -

        Renaming current patch so test-patch will process for bug bash.

        Show
        Allen Wittenauer added a comment - Renaming current patch so test-patch will process for bug bash.
        Hide
        Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        -1 patch 0m 0s The patch command could not apply the patch during dryrun.



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12729951/HADOOP-8724.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / 6ae2a0d
        Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6418/console

        This message was automatically generated.

        Show
        Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment -1 patch 0m 0s The patch command could not apply the patch during dryrun. Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12729951/HADOOP-8724.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / 6ae2a0d Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6418/console This message was automatically generated.

          People

          • Assignee:
            Robert Joseph Evans
            Reporter:
            Robert Joseph Evans
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:

              Development