Apache Hudi / HUDI-2003

Auto Compute Compression ratio for input data to output parquet/orc file size


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Writer Core

      Description

Context:

Submitted a Spark job to read 3-4B ORC records and write them out in Hudi format. The table below summarizes the runs carried out with different write options.

Config | Number of files created | Size of each file
PARQUET_FILE_MAX_BYTES=DEFAULT | 30K | 21MB
PARQUET_FILE_MAX_BYTES=1GB | 3700 | 178MB
PARQUET_FILE_MAX_BYTES=1GB + COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=1100000 | same as above | same as above
PARQUET_FILE_MAX_BYTES=1GB + BULKINSERT_PARALLELISM=100 | same as above | same as above
PARQUET_FILE_MAX_BYTES=4GB | 1600 | 675MB
PARQUET_FILE_MAX_BYTES=6GB | 669 | 1012MB

Based on these runs, the compression ratio estimate appears to be off: in every run the actual file size comes out at roughly one sixth of the configured maximum (21MB against the 120MB default, 178MB against 1GB, 675MB against 4GB, 1012MB against 6GB), so the writer's file sizing is not adapting to how well this dataset compresses.
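For concreteness, a minimal sketch of the kind of bulk_insert write these runs correspond to, with the sizing-related configs spelled out explicitly. The table name and paths are placeholders, not values from the runs above; the string keys are the standard Hudi config names behind the constants in the table.

{code:scala}
// Minimal sketch of a bulk_insert write with the sizing configs made explicit.
// Table name and paths are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-bulk-insert").getOrCreate()

// ~3-4B input records in ORC (placeholder path)
val df = spark.read.format("orc").load("/data/input/orc")

df.write.format("hudi").
  option("hoodie.table.name", "test_table").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.shuffle.parallelism", "100").            // BULKINSERT_PARALLELISM
  option("hoodie.parquet.max.file.size", String.valueOf(1L << 30)).  // PARQUET_FILE_MAX_BYTES = 1GB
  // The knob this issue is about: a static, hand-tuned estimate (default 0.1)
  // that the writer consults when sizing output files.
  option("hoodie.parquet.compression.ratio", "0.1").
  mode(SaveMode.Overwrite).
  save("/data/hudi/test_table")
{code}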

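And a rough illustration of what "auto compute" could mean here (purely a sketch of the idea, not Hudi's actual internals or API): measure the real on-disk bytes of files already written against the uncompressed bytes that produced them, and feed the measured ratio back into the size estimator instead of trusting a static config value.

{code:scala}
// Illustrative sketch only, not Hudi code: adapt the compression-ratio
// estimate from observed files instead of a fixed config value.
object AdaptiveCompressionRatio {
  // Seed with the current static default of hoodie.parquet.compression.ratio.
  @volatile private var ratio: Double = 0.1

  /** Estimated on-disk size for a given amount of uncompressed, buffered data. */
  def estimateOnDiskBytes(uncompressedBytes: Long): Long =
    math.round(uncompressedBytes * ratio)

  /** Called after a file is closed: refine the ratio from what actually hit disk. */
  def observe(uncompressedBytes: Long, onDiskBytes: Long): Unit =
    if (uncompressedBytes > 0) {
      val measured = onDiskBytes.toDouble / uncompressedBytes
      ratio = 0.5 * ratio + 0.5 * measured // smooth updates to avoid jitter
    }
}
{code}

With something along these lines, the first file of a job might still come out mis-sized, but subsequent files would converge toward the configured maximum regardless of how compressible the dataset is.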

            People

• Assignee: Unassigned
• Reporter: Vinay (vinaypatil18)
• Votes: 0
• Watchers: 1
