  1. ORC
  2. ORC-1172

Add a row count limit config for one stripe

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.8.0
    • Component/s: Java
    • Labels: None

    Description

      For query engines like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split.
      In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size".
      But across different kinds of tables, the row count is difficult to control with that config alone.

      • For a table with many columns (e.g. 100 columns), 64 MB may contain 5,000 rows.
      • For a table with few columns (e.g. 5 columns), 64 MB may contain 100,000 rows.
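
      To make the two cases concrete, here is a rough back-of-the-envelope sketch in Java. The class and method names are hypothetical, and the arithmetic ignores compression and encoding, so treat the results as order-of-magnitude estimates only.

```java
// Hypothetical helper for illustration only; not part of the ORC API.
final class StripeRowEstimate {
    private StripeRowEstimate() {}

    // Average bytes per row implied by a stripe size and a row count
    // (integer division, so the result is rounded down).
    static long avgRowBytes(long stripeBytes, long rows) {
        if (stripeBytes <= 0 || rows <= 0) {
            throw new IllegalArgumentException("stripeBytes and rows must be positive");
        }
        return stripeBytes / rows;
    }

    // Approximate number of rows that fit in one stripe for a given
    // average row width, assuming a purely size-based stripe limit.
    static long rowsPerStripe(long stripeBytes, long avgRowBytes) {
        if (stripeBytes <= 0 || avgRowBytes <= 0) {
            throw new IllegalArgumentException("stripeBytes and avgRowBytes must be positive");
        }
        return stripeBytes / avgRowBytes;
    }
}
```

      With a 64 MB stripe, avgRowBytes gives about 13 KB per row for the 5,000-row case and 671 bytes per row for the 100,000-row case, which is why a size-only limit yields such different row counts.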

      For Presto, a normal OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, a single split has to process all of them, and query performance may become too low.

      So, besides the config "orc.stripe.size", we need another config such as "orc.stripe.row.count" to control the row count of one stripe.
      A similar config has been introduced in cuDF (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261
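
      As a minimal sketch of how the two limits could work together, the Java class below closes a stripe when either the byte limit or the row limit is reached. This is not the ORC writer's actual implementation: the class name, the use of java.util.Properties, the 64 MB default, and treating a missing row limit as unlimited are all assumptions made for illustration. Only the two config keys come from this issue.

```java
import java.util.Properties;

// Hypothetical policy object for illustration only; not part of the ORC API.
final class StripeFlushPolicy {
    private final long maxStripeBytes;
    private final long maxStripeRows;

    StripeFlushPolicy(long maxStripeBytes, long maxStripeRows) {
        if (maxStripeBytes <= 0 || maxStripeRows <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxStripeBytes = maxStripeBytes;
        this.maxStripeRows = maxStripeRows;
    }

    // Reads both limits from a Properties object.
    // Assumed defaults: 64 MB for the size limit, no effective row limit.
    static StripeFlushPolicy fromProperties(Properties props) {
        long bytes = Long.parseLong(
            props.getProperty("orc.stripe.size", String.valueOf(64L * 1024 * 1024)));
        long rows = Long.parseLong(
            props.getProperty("orc.stripe.row.count", String.valueOf(Long.MAX_VALUE)));
        return new StripeFlushPolicy(bytes, rows);
    }

    // The stripe is closed as soon as either limit is reached.
    boolean shouldFlush(long bufferedBytes, long bufferedRows) {
        return bufferedBytes >= maxStripeBytes || bufferedRows >= maxStripeRows;
    }
}
```

      With "orc.stripe.row.count" set to 10000, for example, a narrow table would start a new stripe after 10,000 rows even though the buffered data is far below the size limit, while a wide table would still be bounded by "orc.stripe.size".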

          People

            Assignee: wesleydeng (wesleydeng_tencent)
            Reporter: wesleydeng (wesleydeng_tencent)