Hadoop Common
HADOOP-7601

Move common fs implementations to a hadoop-fs module

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fs
    • Labels:
      None
    • Tags:
      fs, module, modularization

      Description

      Many of hadoop-common's dependencies come from the fs implementations. We have more fs implementations on the way (ceph, lafs, etc.). I propose that we move all the fs implementations to a hadoop-fs module under hadoop-common-project.

          Activity

          jay vyas added a comment -

          I've looked into this a bit. Shall we proceed? Is there still interest?

          jay vyas added a comment -

          What is the status on this? I think it relates to our JIRA to move test implementations into fs/*/ packages (HADOOP-10461), so that test implementations are separated from HCFS Test utilities?

          jay vyas made changes -
          Link This issue is related to HADOOP-10461 [ HADOOP-10461 ]
          Luke Lu made changes -
          Link This issue is superseded by HADOOP-9385 [ HADOOP-9385 ]
          Harsh J made changes -
          Fix Version/s 0.24.0 [ 12317652 ]
          Harsh J added a comment -

          Hey Luke,

          Looks like you may want to file those comments as a separate JIRA to get more traction, rather than leaving them under this topic of moving the fs-package stuff to its own module. Mind filing a new JIRA, Luke? Else, holler and I'll get it done.

          Arun C Murthy made changes -
          Fix Version/s 0.24.0 [ 12317652 ]
          Fix Version/s 0.23.0 [ 12315569 ]
          Luke Lu added a comment -

          "Asking for a default constructor for plugins is a reasonable (and common practice) request."

          For plugin classes, yes. But not for things like FileSystem, which was not built as a plugin. A provider class is a nice plugin class.

          Luke Lu made changes -
          Link This issue is related to HADOOP-7549 [ HADOOP-7549 ]
          Luke Lu added a comment -

          "All default constructors are cheap in the case of FS impls."

          That is not guaranteed for future fs impls, but the main issue is not the object creation. It is the class loading (which includes static initialization), along with the loading of all its dependencies. I don't want to load any classes that I don't use. Mandating that no fs impl, and none of its dependencies, does anything expensive in static initialization is unreasonable. It also limits packaging options: e.g., I want to be able to distribute a small hadoop-fs-client jar and selected underlying fs driver jars for a mobile device.

          Alejandro Abdelnur added a comment -

          @Luke,

          Asking for a default constructor for plugins is a reasonable (and common practice) request. Note that all the FS impls already have default constructors.

          Also, initialization is already decoupled from the constructor via the initialize() method. All default constructors are cheap in the case of FS impls.
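
The constructor/initialize() split described here can be illustrated with a minimal sketch. It does not use the real Hadoop classes: `LazyInitFs` is a hypothetical stand-in for a FileSystem implementation, and a plain `Map` stands in for Hadoop's `Configuration`. The only point is that the no-argument constructor does no work and all setup is deferred to `initialize()`.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a FileSystem implementation; not the Hadoop class. */
class LazyInitFs {
  private URI uri;
  private Map<String, String> conf;
  private boolean initialized;

  /** Cheap, side-effect-free default constructor: safe for a plugin loader to call. */
  LazyInitFs() {
  }

  /** All real setup happens here, only for the file system that is actually used. */
  void initialize(URI name, Map<String, String> settings) {
    this.uri = name;
    this.conf = new HashMap<>(settings); // Map stands in for Configuration (assumption)
    this.initialized = true;
  }

  boolean isInitialized() {
    return initialized;
  }

  String getScheme() {
    if (!initialized) {
      throw new IllegalStateException("initialize() has not been called");
    }
    return uri.getScheme();
  }
}

class LazyInitDemo {
  public static void main(String[] args) {
    LazyInitFs fs = new LazyInitFs();
    System.out.println(fs.isInitialized());
    fs.initialize(URI.create("demo://host/path"), new HashMap<String, String>());
    System.out.println(fs.getScheme());
  }
}
```

Note that this only shows object construction being cheap. It says nothing about the cost of loading the class and its dependencies, which is a separate concern.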

          Luke Lu added a comment -

          "This is not true. It loads as many file system classes as are defined in the service file."

          That's good to know, thanks. But the real problem is that in order to query the schemes you have to instantiate every file system object. That can be very expensive (esp. all the resulting loading of dependent classes), is unnecessary (people typically only use one dfs implementation), and mandates a side-effect-free default constructor for every FileSystem implementation. Loading a single provider (per jar), then querying it for support and using a factory, is much more efficient and flexible. ServiceLoader is a primitive, load-time-only DI solution with limited functionality. That's probably the reason that I never use non-provider bindings in the service files.
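
A rough way to picture the difference between the two approaches is the simulation below. Everything in it is hypothetical: `Fs`, `RecordingFs` and `Provider` are stand-ins invented for this sketch, the scheme names are just labels, and `Supplier` objects take the place of what ServiceLoader would instantiate. It only counts which file system objects get constructed for a single lookup; it does not measure class loading or static initialization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class EagerVsProviderSketch {
  /** Stand-in for a FileSystem implementation; not the Hadoop class. */
  interface Fs {
    String scheme();
  }

  /** Hypothetical implementation that records every construction in a log. */
  static class RecordingFs implements Fs {
    private final String scheme;

    RecordingFs(String scheme, List<String> log) {
      this.scheme = scheme;
      log.add(scheme);
    }

    @Override
    public String scheme() {
      return scheme;
    }
  }

  /** Hypothetical lightweight provider: it knows its schemes without building an Fs. */
  static class Provider {
    final List<String> schemes;
    final Supplier<Fs> factory;

    Provider(List<String> schemes, Supplier<Fs> factory) {
      this.schemes = schemes;
      this.factory = factory;
    }
  }

  /** Approach 1: every implementation is constructed just to ask for its scheme. */
  static Fs findByInstantiatingAll(String scheme, List<Supplier<Fs>> impls) {
    Fs match = null;
    for (Supplier<Fs> impl : impls) {
      Fs fs = impl.get();
      if (match == null && fs.scheme().equals(scheme)) {
        match = fs;
      }
    }
    return match;
  }

  /** Approach 2: providers are queried first; only the matching factory is invoked. */
  static Fs findViaProviders(String scheme, List<Provider> providers) {
    for (Provider provider : providers) {
      if (provider.schemes.contains(scheme)) {
        return provider.factory.get();
      }
    }
    return null;
  }

  /** Returns [schemes built eagerly, schemes built via providers] for an "hdfs" lookup. */
  static List<List<String>> compare() {
    List<String> eagerLog = new ArrayList<>();
    List<Supplier<Fs>> impls = new ArrayList<>();
    List<String> providerLog = new ArrayList<>();
    List<Provider> providers = new ArrayList<>();
    for (String scheme : Arrays.asList("hdfs", "s3", "ftp")) {
      impls.add(() -> new RecordingFs(scheme, eagerLog));
      providers.add(new Provider(Arrays.asList(scheme), () -> new RecordingFs(scheme, providerLog)));
    }
    findByInstantiatingAll("hdfs", impls);
    findViaProviders("hdfs", providers);
    return Arrays.asList(eagerLog, providerLog);
  }

  public static void main(String[] args) {
    System.out.println(compare()); // prints [[hdfs, s3, ftp], [hdfs]]
  }
}
```

In the eager approach all three stand-in file systems are built to answer one lookup; with providers only the requested one is. With a real ServiceLoader the provider objects themselves would still be instantiated, so the saving depends on the providers staying lightweight.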

          Alejandro Abdelnur added a comment -

          Regarding HADOOP-7549 being too limited, where you can only load one filesystem per jar: this is not true. It loads as many file system classes as are defined in the service file.

          Luke Lu added a comment -

          "Still, how do we enable their testing? Regarding the service loading: HADOOP-7549"

          Ah, didn't see that one. The service loading mechanism in HADOOP-7549 is too limited: you can only load one filesystem per jar. The FileSystemsProvider interface mentioned here is a lot more versatile.

          Alejandro Abdelnur added a comment -

          I'm OK with a single module with all FS client implementations. Still, how do we enable their testing?

          Regarding the service loading: HADOOP-7549

          Luke Lu added a comment -

          "we'll be bundling FS clients, but we have to test them."

          Yes. I need to clarify that the fs implementations I mentioned are the hadoop FileSystem implementations of these fs clients. Typically there are only a couple of files per client implementation, which is too small a granularity for separate modules. Maybe we should call the module hadoop-fs-clients, for bundled fs clients.

          In addition, I propose that we adopt a ServiceLoader interface, say FileSystemsProvider:

          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;

          interface FileSystemsProvider {
            /** @return a list of supported file system schemes */
            List<String> getSupportedFileSystems();

            /** @return a FileSystemFactory instance */
            FileSystemFactory getFileSystemFactory();
          }

          interface FileSystemFactory {
            /** @return a FileSystem for a given scheme */
            FileSystem getFileSystem(String scheme, Configuration conf);
          }

          The advantage of this approach is that people experimenting with new filesystems can just drop their new fs jars in the classpath without having to modify any hadoop code.
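
How a caller might resolve a scheme through such providers can be sketched as follows. This is a self-contained illustration, not Hadoop code: the `Sketch*` interfaces mirror the proposal above but use a stand-in file system type and a `Map` instead of `Configuration`, and `DemoFileSystemsProvider` with its "demo" scheme is made up. The lookup takes any `Iterable` of providers, so a `ServiceLoader` over the provider interface could be passed in where service files exist on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Stand-in for org.apache.hadoop.fs.FileSystem (simplified for this sketch). */
interface SketchFileSystem {
  String describe();
}

/** Mirrors the proposed FileSystemFactory, with a Map standing in for Configuration. */
interface SketchFileSystemFactory {
  SketchFileSystem getFileSystem(String scheme, Map<String, String> conf);
}

/** Mirrors the proposed FileSystemsProvider. */
interface SketchFileSystemsProvider {
  List<String> getSupportedFileSystems();

  SketchFileSystemFactory getFileSystemFactory();
}

/** Hypothetical provider for a made-up "demo" scheme. */
class DemoFileSystemsProvider implements SketchFileSystemsProvider {
  @Override
  public List<String> getSupportedFileSystems() {
    return Collections.singletonList("demo");
  }

  @Override
  public SketchFileSystemFactory getFileSystemFactory() {
    // The file system object is only built when a caller asks the factory for it.
    return (scheme, conf) -> () -> scheme + " file system, " + conf.size() + " settings";
  }
}

class ProviderLookupSketch {
  /** Asks each provider whether it supports the scheme; creates only the match. */
  static SketchFileSystem open(String scheme, Map<String, String> conf,
      Iterable<SketchFileSystemsProvider> providers) {
    for (SketchFileSystemsProvider provider : providers) {
      if (provider.getSupportedFileSystems().contains(scheme)) {
        return provider.getFileSystemFactory().getFileSystem(scheme, conf);
      }
    }
    throw new IllegalArgumentException("No provider supports scheme: " + scheme);
  }

  public static void main(String[] args) {
    // With service files on the classpath, this list could instead be
    // java.util.ServiceLoader.load(SketchFileSystemsProvider.class), which is Iterable.
    List<SketchFileSystemsProvider> providers =
        Arrays.<SketchFileSystemsProvider>asList(new DemoFileSystemsProvider());
    Map<String, String> conf = Collections.<String, String>emptyMap();
    System.out.println(open("demo", conf, providers).describe());
  }
}
```

Under this design a new file system jar would only need to ship its provider class and a service file naming it; the lookup code above would not change.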

          Alejandro Abdelnur added a comment -

          Moving FS impls out of common is a good idea.

          But moving all of them to a single module is not a good idea. It is just moving the hadoop-common problem somewhere else.

          An FS impl normally comes in two parts: the FS impl and the FS client. We'll be bundling FS clients, but we have to test them.

          So we have to see how to modularize that.

          Milind Bhandarkar added a comment -

          I like this idea. +1

          Luke Lu made changes -
          Field: Description
          Original Value: Much of the hadoop-common dependencies is from the fs implementations. We more fs implementations on the way (ceph, lafs etc). I propose that we move all the fs implementations to a hadoop-fs module under hadoop-common-project.
          New Value: Much of the hadoop-common dependencies is from the fs implementations. We have more fs implementations on the way (ceph, lafs etc). I propose that we move all the fs implementations to a hadoop-fs module under hadoop-common-project.
          Luke Lu created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Luke Lu
            • Votes:
              0
              Watchers:
              13
