Apache Hudi / HUDI-6950

Query should list only the matched partitions to avoid driver OOM caused by the large number of files in the table

Details

    Description

      Currently, querying a table partitioned on multiple columns can easily cause an OOM on the driver.
      For example:
      CREATE TABLE hudi_test.tmp_hudi_test_1 (
        id STRING,
        name STRING,
        dt BIGINT,
        day STRING COMMENT 'date partition',
        hour INT COMMENT 'hour partition'
      ) USING hudi
      OPTIONS (
        'hoodie.datasource.write.hive_style_partitioning' 'false',
        'hoodie.datasource.meta.sync.enable' 'false',
        'hoodie.datasource.hive_sync.enable' 'false'
      )
      TBLPROPERTIES (
        'primaryKey' = 'id',
        'type' = 'mor',
        'preCombineField' = 'dt',
        'hoodie.index.type' = 'BUCKET',
        'hoodie.bucket.index.hash.field' = 'id',
        'hoodie.bucket.index.num.buckets' = 512
      )
      PARTITIONED BY (day, hour);

      Running the query

      select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'

      pulls a very large number of FileStatus objects into the driver, and the driver then runs out of memory (for example, when the partition day='2023-10-17' holds hundreds of billions of records).
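The idea in the issue title (list only the partitions the query matches, rather than every file in the table) can be illustrated with a minimal, self-contained sketch. This is not Hudi's implementation: the function names, the in-memory fake table, and the file names are all hypothetical, and the only detail taken from the report is the non-hive-style partition path layout day/hour.

```python
from typing import Callable, Dict, List


def prune_partitions(partition_paths: List[str], day: str) -> List[str]:
    """Keep partitions whose first path segment equals `day`.

    Assumes non-hive-style paths of the form 'day/hour', as in the report.
    """
    return [p for p in partition_paths if p.split("/", 1)[0] == day]


def list_files_for_day(
    partition_paths: List[str],
    list_partition: Callable[[str], List[str]],
    day: str,
) -> List[str]:
    """List files only for partitions that survive pruning."""
    files: List[str] = []
    for partition in prune_partitions(partition_paths, day):
        files.extend(f"{partition}/{name}" for name in list_partition(partition))
    return files


def build_fake_table(days: List[str], hours: int, buckets: int) -> Dict[str, List[str]]:
    """Hypothetical stand-in for a file system: partition path -> file names."""
    return {
        f"{day}/{hour}": [f"{bucket:08d}.parquet" for bucket in range(buckets)]
        for day in days
        for hour in range(hours)
    }


fake_table = build_fake_table(["2023-10-16", "2023-10-17", "2023-10-18"], hours=24, buckets=4)
listing_calls: List[str] = []


def counting_lister(partition: str) -> List[str]:
    """Record each listing call so the amount of listing work is visible."""
    listing_calls.append(partition)
    return fake_table[partition]


matched = list_files_for_day(list(fake_table), counting_lister, "2023-10-17")
print(len(fake_table), len(listing_calls), len(matched))  # prints: 72 24 96
```

In this toy setup only 24 of the 72 partitions are listed, so the number of file entries held in memory scales with the matched day instead of the whole table.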

      The commit in question is 7c79ebee1ff1c9a0f5117252cb12fa2f03ac4b24 from master.

      Attachments

        1. oom_stages.jpg (178 kB, xy)
        2. oom_before_sparkui.jpg (145 kB, xy)
        3. hang_then_oom.jpg (165 kB, xy)
        4. fix_stages.jpg (181 kB, xy)
        5. dump_files.jpg (122 kB, xy)
        6. before_fix_dump_filestatus.jpg (137 kB, xy)

            People

              xuzifu xy
              xuzifu xy
              Votes: 0
              Watchers: 3