IMPALA-7265

Cache remote file handles


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: Impala 3.2.0
    • Component/s: Backend
    • Labels:
      None
    • Docs Text:
      This change introduced a new parameter, cache_remote_file_handles, which modifies the behavior of the file handle cache. I think some pieces of documentation will need updates:
      http://impala.apache.org/docs/build/html/topics/impala_scalability.html (section: "Scalability Considerations for NameNode Traffic with File Handle Caching")

      Description

      The file handle cache currently does not cache remote file handles. As a result, clusters with many remote reads can overload the NameNode, because each remote read has to open a new file handle. Impala should be able to cache remote file handles.

      There are some open questions about remote file handles and whether they behave differently from local file handles. In particular:

      1. Is there any resource constraint on the number of remote file handles open? (e.g. do they maintain a network connection?)
      2. Are there any semantic differences in how remote file handles behave when files are deleted, overwritten, or appended?
      3. Are there any extra failure cases for remote file handles? (e.g. if a machine goes down or a remote file handle is left open for an extended period of time)

      The form of caching will depend on the answers, but at the very least, it should be possible to cache a remote file handle at the level of a query, so that reads of the multiple columns in a Parquet file can share file handles.
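
To make the idea concrete, here is a minimal sketch of a handle cache that lets several column reads of one file reuse a single open handle. It is not Impala's implementation. The class `FileHandleCache`, the `opener` callable, `FakeHandle`, the example path, and the choice to key on (path, mtime) and to expire idle handles are all assumptions made for illustration. Keying on mtime is one possible way to avoid reusing a handle for an overwritten file (question 2), and the idle timeout is one possible guard against handles left open too long (question 3).

```python
import time
from collections import OrderedDict


class FileHandleCache:
    """Toy LRU cache of idle file handles, keyed by (path, mtime)."""

    def __init__(self, opener, capacity=2, max_idle_secs=60.0):
        self._opener = opener            # hypothetical callable: path -> handle with .close()
        self._capacity = capacity        # max number of idle handles kept
        self._max_idle_secs = max_idle_secs
        self._idle = OrderedDict()       # (path, mtime) -> (handle, release time)
        self.opens = 0                   # how many times the opener was called

    def acquire(self, path, mtime):
        """Return a handle for exclusive use, reusing an idle one if possible."""
        entry = self._idle.pop((path, mtime), None)
        if entry is not None:
            handle, released_at = entry
            if time.monotonic() - released_at <= self._max_idle_secs:
                return handle
            handle.close()               # idle for too long: discard and reopen
        self.opens += 1
        return self._opener(path)

    def release(self, path, mtime, handle):
        """Return a handle to the cache, evicting the oldest idle handle if full."""
        key = (path, mtime)
        if key in self._idle:
            handle.close()               # sketch keeps at most one idle handle per key
            return
        self._idle[key] = (handle, time.monotonic())
        while len(self._idle) > self._capacity:
            _, (oldest, _) = self._idle.popitem(last=False)
            oldest.close()


class FakeHandle:
    """Stand-in for a real remote file handle (illustration only)."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


cache = FileHandleCache(opener=FakeHandle)
path = "/warehouse/t/part-0.parq"        # hypothetical file
for column in ("a", "b", "c"):           # three column reads, one after another
    handle = cache.acquire(path, mtime=100)
    cache.release(path, 100, handle)
print(cache.opens)                       # 1: the handle was opened once and reused
```

A read of the same path with a different mtime misses the cache and opens a fresh handle, which is the behavior one would want after an overwrite. Concurrent reads of the same file would each need their own handle in this sketch, since a handle is handed out exclusively.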

              People

              • Assignee:
                Joe McDonnell (joemcdonnell)
              • Reporter:
                Joe McDonnell (joemcdonnell)
              • Votes:
                0
              • Watchers:
                6
