Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2.4
    • Fix Version/s: Impala 2.10.0
    • Component/s: Backend
    • Labels:

      Description

      The current Parquet scanner implementation treats each column in a row group as a "scan range". When reading a scan range, Impala issues a fopen RPC to the NameNode (NN), so it issues one RPC per column per row group. The NN can process fopen RPCs only at a limited rate, and this can become a limiting factor for query performance.

      Fundamentally, there is no need to issue a fopen for each column. Impala should issue at most one fopen for each row group.
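
To make the difference concrete, here is a minimal sketch of the arithmetic only. It is not Impala code: the function names and the example figures (50 row groups, 20 selected columns) are assumptions chosen for illustration.

```python
def fopen_rpcs_per_column(num_row_groups: int, num_columns: int) -> int:
    """Behaviour described above: one fopen per column per row group."""
    return num_row_groups * num_columns


def fopen_rpcs_per_row_group(num_row_groups: int) -> int:
    """Upper bound proposed above: at most one fopen per row group."""
    return num_row_groups


# Hypothetical scan: 50 row groups, 20 columns selected by the query.
current = fopen_rpcs_per_column(50, 20)
proposed = fopen_rpcs_per_row_group(50)
print(current, proposed)  # prints "1000 50"
```

Under these assumed numbers the NameNode would see 20 times fewer fopen RPCs, since the saving scales with the number of columns read.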

      The current workaround of using the file handle cache is not practical because of the large memory footprint (1 KB) of each cached file handle. A file handle cannot be shared by concurrent readers, so if 10 queries read the same file at the same time, 10 file handles must be cached.
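
The effect of non-shareable handles can be shown with a toy model. This is a sketch under stated assumptions, not Impala's file handle cache: the class `ExclusiveHandleCache`, its methods, and the file path are all hypothetical, and a "handle" is just a tuple standing in for a real HDFS handle.

```python
from collections import defaultdict


class ExclusiveHandleCache:
    """Toy cache in which a handle is owned by one reader at a time."""

    def __init__(self):
        self._idle = defaultdict(list)  # path -> handles not currently in use
        self.opens = 0                  # number of simulated fopen calls

    def acquire(self, path):
        if self._idle[path]:
            return self._idle[path].pop()  # reuse an idle handle
        self.opens += 1                    # no idle handle: simulated fopen
        return (path, self.opens)

    def release(self, handle):
        # Return the handle to the idle pool for its path.
        self._idle[handle[0]].append(handle)

    def cached_handles(self, path):
        return len(self._idle[path])


path = "/hypothetical/table/file.parq"  # made-up path for illustration
cache = ExclusiveHandleCache()

# Ten concurrent readers of the same file each need their own handle.
handles = [cache.acquire(path) for _ in range(10)]
for h in handles:
    cache.release(h)

print(cache.opens, cache.cached_handles(path))  # prints "10 10"
```

With the 1 KB per-handle figure quoted above, those ten cached handles for a single file would occupy roughly 10 KB in this model.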

              People

              • Assignee:
                joemcdonnell Joe McDonnell
                Reporter:
                alan@cloudera.com Alan Choi
              • Votes:
                0
                Watchers:
                8
