Uploaded image for project: 'IMPALA'
  2. IMPALA-10266

Replace instanceof *FileSystem with FS scheme checks



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.0.0
    • Component/s: Frontend
    • Labels:
    • Epic Color:


      In the Impala code we have checks in various places about which filesystem implementation we are using. E.g in the frontend, many of these checks are here - https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java.

      In the frontend, some of these checks are done using the instanceof operator with subclasses of org.apache.hadoop.fs.FileSystem. E.g.

        public static boolean supportsStorageIds(FileSystem fs) {
          // Common case.
          if (isDistributedFileSystem(fs)) return true;
          // Blacklist FileSystems that are known to not to include storage UUIDs.
          return !(fs instanceof S3AFileSystem || fs instanceof LocalFileSystem ||
              fs instanceof AzureBlobFileSystem || fs instanceof SecureAzureBlobFileSystem ||
              fs instanceof AdlFileSystem);

      We also identify filesystem based on the scheme, e.g. s3a in a URL like s3a://path/

          private static final Map<String, FsType> SCHEME_TO_FS_MAPPING =
              ImmutableMap.<String, FsType>builder()
                  .put("abfs", ADLS)
                  .put("abfss", ADLS)
                  .put("adl", ADLS)
                  .put("file", LOCAL)
                  .put("hdfs", HDFS)
                  .put("s3a", S3)
                  .put("o3fs", OZONE)
                  .put("alluxio", ALLUXIO)

      The proposal is to replace all instanceof use with checks based on the scheme, which we can get from the FileSystem - https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#getScheme--

      Checking the java class and the scheme are not exactly equivalent because there are some cases where a new scheme is handled by a known class (or subclass of that class) - that's what happened with Alluxio with IMPALA-10087 where we accidentally supported it for a bit until we broke it. But since IMPALA-6050 we need to check both the scheme and the class, so it would be better at this point to just standardise on the scheme AFAICT.

      In future we could conceivably then remove some of this hardcoded logic and consolidate the information about filesystem capabilities into one place.




            • Assignee:
              rizaon Riza Suminto
              tarmstrong Tim Armstrong
            • Votes:
              0 Vote for this issue
              3 Start watching this issue


              • Created: