Hadoop HDFS / HDFS-2461

Support HDFS file name globbing in libhdfs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: libhdfs
    • Labels: None

      Description

      This issue proposes enhancing the C API in libhdfs to support HDFS file name globbing. The proposal keeps the new API simple: it returns a list of matched HDFS path names, and callers can use the existing hdfsGetPathInfo() to get additional information on each matched path. The following code snippet shows the proposed API enhancements:

      hdfs.h
      /**
       * hdfsGlob - Get all the HDFS file names that match a glob pattern.  The
       * returned result will be sorted by the file names.  The last element in the
       * array is NULL.  The function hdfsFreeGlob() should be called to free this
       * array and its contents.
       * @param fs The configured filesystem handle.
       * @param globPattern The glob pattern to match file names against.  Note that
       * this is not a POSIX regular expression but rather a POSIX glob pattern.
       * @return Returns a dynamically-allocated array of strings; if there is no
       * match, an array with one entry that has a NULL value will be returned.  If
       * there is an error, NULL will be returned.
       */
      char ** hdfsGlob(hdfsFS fs, const char *globPattern);
      
      /**
       * hdfsFreeGlob - Free up the array returned by hdfsGlob().
       * @param globResult The array of dynamically-allocated strings returned by
       * hdfsGlob().
       */
      void hdfsFreeGlob(char **globResult);
      

      Please comment on the above proposed API. I will start the implementation and testing. However, I need a committer to work with.

      Thanks.
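The NULL-terminated array convention described above can be sketched as follows. Since hdfsGlob() and hdfsFreeGlob() are only proposed here, the sketch uses hypothetical stand-ins (demoGlobResult, demoFreeGlob, demoCountMatches) that do not contact HDFS; they only illustrate how a caller would walk and release such a result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed hdfsGlob(): builds a
 * NULL-terminated array holding copies of the given names.
 * Returns NULL on allocation failure, mirroring the proposed error value. */
char **demoGlobResult(const char *const *names, size_t count)
{
    char **result = malloc((count + 1) * sizeof(*result));
    if (result == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(names[i]) + 1;
        result[i] = malloc(len);
        if (result[i] == NULL) {
            /* Release what was copied so far before reporting the error. */
            for (size_t j = 0; j < i; j++)
                free(result[j]);
            free(result);
            return NULL;
        }
        memcpy(result[i], names[i], len);
    }
    result[count] = NULL; /* terminator; the only entry when nothing matched */
    return result;
}

/* Hypothetical counterpart of the proposed hdfsFreeGlob():
 * frees every string, then the array itself. */
void demoFreeGlob(char **globResult)
{
    if (globResult == NULL)
        return;
    for (char **p = globResult; *p != NULL; p++)
        free(*p);
    free(globResult);
}

/* Caller-side loop: walk the array until the NULL terminator. */
size_t demoCountMatches(char **globResult)
{
    assert(globResult != NULL);
    size_t n = 0;
    while (globResult[n] != NULL)
        n++;
    return n;
}
```

With this convention a caller can tell "no match" (a non-NULL array whose first entry is NULL) apart from an error (a NULL return) without a separate count parameter.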

      1. HDFS-2461.0.patch
        8 kB
        Mariappan Asokan

        Activity

        M. C. Srivas added a comment -

        o.a.h.fs.FileSystem defines

        globStatus( String pattern)
        globStatus( String pattern, PathFilter filter)

        Can the libhdfs functions use identical names and return identical values?

        Mariappan Asokan added a comment -

        Srivas,
        Thanks for your comments. I am aware of the methods in the FileSystem class. However, I wanted the C API to be simpler. Callers can iterate through the array and call hdfsGetPathInfo() to get the equivalent of a FileStatus object if they wish. The caller can also pass each file name to a filter function.

        Having said that, the use cases of the API will dictate the function signature (simplicity versus convenience). My current requirement is just to get the file names that match wildcard patterns. I would like to hear opinions from other developers before finalizing the API.

        Mariappan Asokan added a comment -

        I thought more about this. Since the Java globStatus() method already queries the NameNode to retrieve the status information, I think we can change the function signature for the sake of efficiency. Also, to conform to the existing hdfsListDirectory(), I decided to return an array of structures rather than an array of pointers. This allows reuse of the existing C function hdfsFreeFileInfo(). I also added a path filter function to the interface; filtering will be done in the C implementation. The prototype of the single function is described below:

        hdfs.h
        /**
         * Path filter function prototype.
         * @param pathName path name passed to this function.
         * @return 0 if the path name has to be excluded; non-zero otherwise.
         */
        typedef int (*PathFilter)(const char * pathName);
        
        /**
         * hdfsGlobStatus - Get status for all HDFS file names that match a glob
         * pattern.  The returned result will be an array of hdfsFileInfo structures.
         * The array is sorted by file names.
         * The function hdfsFreeFileInfo() should be called to free this array and its
         * contents.
         * @param fs The configured filesystem handle.
         * @param globPattern The glob pattern (as supported by the Hadoop
         * implementation) to match file names against.
         * @param filter A path filter function.  If this is NULL, no filtering will be
         * done after glob expansion.
         * @param numEntries Pointer to an integer that receives the number of
         * entries in the returned array.  This will be set to -1 in case of error.
         * @return Returns a dynamically-allocated array of hdfsFileInfo structures; if
         * there is no match or an error, a NULL value will be returned.  An error
         * condition can be identified by testing numEntries.
         */
        hdfsFileInfo * hdfsGlobStatus(hdfsFS fs, const char *globPattern,
                                      PathFilter filter, int *numEntries);
        

        If anyone has any comments, please let me know.
        Thanks.
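The filter step and the numEntries convention described above can be sketched as follows. This is only an illustration under stated assumptions: DemoFileInfo is a hypothetical, cut-down stand-in for hdfsFileInfo that keeps just the name, and demoApplyFilter and demoSkipHidden are hypothetical helpers; none of this is taken from the attached patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the PathFilter typedef in the proposal. */
typedef int (*PathFilter)(const char *pathName);

/* Hypothetical, cut-down stand-in for hdfsFileInfo (name only). */
typedef struct {
    const char *mName;
} DemoFileInfo;

/* Copies the entries accepted by `filter` into a newly allocated array.
 * Follows the proposed convention: NULL with *numEntries == 0 means
 * "no match"; NULL with *numEntries == -1 means "error". */
DemoFileInfo *demoApplyFilter(const DemoFileInfo *in, int numIn,
                              PathFilter filter, int *numEntries)
{
    assert(numEntries != NULL);
    if (numIn < 0 || (in == NULL && numIn > 0)) {
        *numEntries = -1;
        return NULL;
    }
    DemoFileInfo *out = NULL;
    if (numIn > 0) {
        out = malloc((size_t)numIn * sizeof(*out));
        if (out == NULL) {
            *numEntries = -1;
            return NULL;
        }
    }
    int kept = 0;
    for (int i = 0; i < numIn; i++) {
        /* A NULL filter means "keep everything". */
        if (filter == NULL || filter(in[i].mName) != 0)
            out[kept++] = in[i];
    }
    if (kept == 0) {
        free(out); /* free(NULL) is a no-op */
        out = NULL;
    }
    *numEntries = kept;
    return out;
}

/* Example filter: exclude paths whose last component starts with '_' or '.'. */
int demoSkipHidden(const char *pathName)
{
    const char *base = strrchr(pathName, '/');
    base = (base != NULL) ? base + 1 : pathName;
    return base[0] != '_' && base[0] != '.';
}
```

Because NULL is returned both for "no match" and for an error, the caller has to inspect the count before deciding how to proceed.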

        Mariappan Asokan added a comment -

        I am attaching the patch file for this Jira. The patch was applied to trunk, though it can also be applied to 0.23 since the files involved have not changed. The patch was tested, and the libhdfs tests ran successfully. I would appreciate feedback from developers so that this can be committed.

        Colin Patrick McCabe added a comment -

        There is already a libc function called fnmatch() that does what you want.

        NAME
               fnmatch - match filename or pathname
        
        SYNOPSIS
               #include <fnmatch.h>
        
               int fnmatch(const char *pattern, const char *string, int flags);
        
        DESCRIPTION
               The  fnmatch()  function checks whether the string argument matches the
               pattern argument, which is a shell wildcard pattern.
        

        As far as I can tell, org.apache.hadoop.fs.FileSystem#globStatus does everything client-side (there is no support for server-side globs; correct me if I'm wrong), so there is no point in duplicating libc functionality in libhdfs.
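A minimal sketch of that client-side approach, assuming the candidate path names have already been collected (for example, from an hdfsListDirectory() result). The helper demoFilterNames is hypothetical; only fnmatch() and its FNM_PATHNAME and FNM_NOMATCH constants come from libc.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Compacts `names` in place so that its first N entries are the ones that
 * match `pattern`, preserving their order. Returns N, or -1 if fnmatch()
 * reports an error. FNM_PATHNAME keeps '*' and '?' from matching '/'. */
int demoFilterNames(const char *pattern, const char **names, int count)
{
    assert(pattern != NULL);
    int kept = 0;
    for (int i = 0; i < count; i++) {
        int rc = fnmatch(pattern, names[i], FNM_PATHNAME);
        if (rc == 0)
            names[kept++] = names[i];
        else if (rc != FNM_NOMATCH)
            return -1;
    }
    return kept;
}
```

Note that fnmatch() only tests one string against one pattern. The caller still has to list the candidate directories, so a pattern with wildcards in several path components needs one listing per level.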


          People

          • Assignee: Unassigned
          • Reporter: Mariappan Asokan
          • Votes: 0
          • Watchers: 7
