IMPALA-5383

Fix PARQUET_FILE_SIZE option for ADLS

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:
      None

      Description

      The PARQUET_FILE_SIZE query option doesn't work with ADLS because AdlFileSystem has no notion of block sizes. Impala depends on the filesystem remembering the block size, which is then used as the target Parquet file size (this is done for HDFS so that the Parquet file size and the block size match even if PARQUET_FILE_SIZE isn't a valid block size).

      We should special-case ADLS, just as we do for S3, to bypass the FileSystem block size and instead use the requested PARQUET_FILE_SIZE as the output partition's block_size (and consequently the Parquet file target size) here:

      HdfsTableSink::CreateNewTmpFile()
        if (IsS3APath(output_partition->current_file_name.c_str())) {
          // On S3A, the file cannot be stat'ed until after it's closed, and even so, the block
          // size reported will be just the filesystem default. So, remember the requested
          // block size.
          output_partition->block_size = block_size;
        } else {
          // HDFS may choose to override the block size that we've recommended, so for non-S3
          // files, we get the block size by stat-ing the file.
          hdfsFileInfo* info = hdfsGetPathInfo(output_partition->hdfs_connection,
              output_partition->current_file_name.c_str());
          if (info == nullptr) {
            return Status(GetHdfsErrorMsg("Failed to get info on temporary HDFS file: ",
                output_partition->current_file_name));
          }
          output_partition->block_size = info->mBlockSize;
          hdfsFreeFileInfo(info, 1);
        }
      

      After this is fixed, we can re-enable test_insert_parquet_verify_size().
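
A minimal, self-contained sketch of the proposed special case, reduced to the block-size decision only. It is not the actual patch: `IsS3APathSketch`, `IsAdlsPathSketch`, and `ChooseBlockSize` are hypothetical names introduced here for illustration, and the `s3a://` and `adl://` prefix checks are assumptions about how such paths might be recognized.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a prefix check; Impala's real path helpers may differ.
static bool HasPrefix(const std::string& path, const std::string& prefix) {
  return path.compare(0, prefix.size(), prefix) == 0;
}

// Assumed URI schemes, for illustration only.
bool IsS3APathSketch(const std::string& path) { return HasPrefix(path, "s3a://"); }
bool IsAdlsPathSketch(const std::string& path) { return HasPrefix(path, "adl://"); }

// Returns the block size to record for the output partition.
//   requested_block_size:   the size derived from PARQUET_FILE_SIZE.
//   fs_reported_block_size: the value that stat-ing the file would report
//                           (mBlockSize in the excerpt above).
int64_t ChooseBlockSize(const std::string& file_name, int64_t requested_block_size,
    int64_t fs_reported_block_size) {
  assert(requested_block_size > 0);
  if (IsS3APathSketch(file_name) || IsAdlsPathSketch(file_name)) {
    // These filesystems do not report a meaningful per-file block size,
    // so remember the requested one instead.
    return requested_block_size;
  }
  // HDFS may override the recommended block size, so trust what it reports.
  return fs_reported_block_size;
}
```

With this shape, an `adl://` path would get the requested size as its block size, while an `hdfs://` path would keep whatever the filesystem reports, matching the behavior the description asks for.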

        Activity

        sailesh Sailesh Mukil added a comment - Yes, closing it. Commit in: https://github.com/apache/incubator-impala/commit/1f34a9e7034cb1b068dbcaba94d3f01295995fee
        srus@cloudera.com Silvius Rus added a comment -

        Sailesh Mukil, is this fixed?

          People

          • Assignee:
            sailesh Sailesh Mukil
            Reporter:
            dhecht Dan Hecht
          • Votes:
            0
            Watchers:
            3
