
[HUDI-2003] Auto-compute the compression ratio from input data to output parquet/ORC file size



    Description

Context:

Submitted a Spark job to read 3-4B ORC records and write them out in Hudi format. The table below summarizes the runs carried out with different options.

       

CONFIG                                                                      Number of Files Created  Size of Each File
PARQUET_FILE_MAX_BYTES=DEFAULT                                              30K                      21MB
PARQUET_FILE_MAX_BYTES=1GB                                                  3700                     178MB
PARQUET_FILE_MAX_BYTES=1GB + COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=1100000  Same as before           Same as before
PARQUET_FILE_MAX_BYTES=1GB + BULKINSERT_PARALLELISM=100                     Same as before           Same as before
PARQUET_FILE_MAX_BYTES=4GB                                                  1600                     675MB
PARQUET_FILE_MAX_BYTES=6GB                                                  669                      1012MB

Based on these runs, the output files land at roughly 1/6 of the configured max file size (1GB -> 178MB, 4GB -> 675MB, 6GB -> 1012MB), which suggests the compression ratio assumed when sizing files is off.
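
Until Hudi computes this automatically, a measured ratio can be fed back through hoodie.parquet.compression.ratio, the storage config Hudi consults when sizing new parquet files (default 0.1). A minimal Scala sketch of that workaround; the paths, table/field names, and observed byte counts below are assumptions for illustration, not values from this run:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

object TunedCompressionRatio {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-compression-ratio").getOrCreate()

    // Source ORC data (hypothetical path).
    val df = spark.read.orc("s3://bucket/input/orc/")

    // Byte counts observed from an earlier run (assumed numbers): row data
    // handed to the writer vs. parquet bytes actually produced.
    val observedInputBytes: Long   = 768L << 30 // ~768 GB of input rows
    val observedParquetBytes: Long = 130L << 30 // ~130 GB of parquet output

    // Hudi sizes new files from an assumed on-disk/in-memory ratio; passing a
    // measured ratio keeps file sizes close to the configured max.
    val measuredRatio = observedParquetBytes.toDouble / observedInputBytes

    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")                     // assumed name
      .option("hoodie.datasource.write.recordkey.field", "id")     // assumed field
      .option("hoodie.datasource.write.precombine.field", "ts")    // assumed field
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.parquet.max.file.size", (1L << 30).toString) // 1 GB target
      .option("hoodie.parquet.compression.ratio", measuredRatio.toString)
      .mode(SaveMode.Overwrite)
      .save("s3://bucket/output/hudi_table/")
  }
}
{code}

The auto-compute ask here is essentially for the writer to derive measuredRatio from its own observed input/output sizes instead of requiring it as a static config.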

       

       


People

  Assignee: Unassigned
  Reporter: Vinay (vinaypatil18)
  Watchers: 2 (Forward Xu, Shiyan Xu)
  Votes: 0
