Apache Arrow / ARROW-15583

[C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors

Details

    Description

      In Substrait, a function is referred to by a "fully qualified name", which consists of a URI and a function name. For example, the "add" function is something like https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add. To avoid serializing these long names multiple times in the plan, the producer should pick an anchor value (an int32 in protobuf) and use it everywhere, with a single lookup table at the top level of the plan.

      To avoid map lookups, the Arrow C++ consumer currently assumes that this lookup table is small enough to be stored in a vector indexed by anchor value:

      {
        "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add",
        "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract"
      }
      

      However, this implicitly assumes that a plan uses the numbers 0, 1, 2, ..., N-1 to create N anchors. Nothing prevents a producer from using whatever numbers it wants (e.g. a pointer value). If the producer uses a very large anchor value, the C++ Substrait consumer will create a lookup table that is mostly blank entries, which could waste a lot of memory.
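A rough way to see the cost, as a minimal sketch rather than the actual Arrow code: if the table is a vector indexed directly by anchor, its length is driven by the largest anchor, not by the number of anchors. The helper names `DenseTableSlots` and `DenseTableBytes` are hypothetical, and anchors are treated as unsigned 32-bit values for indexing, which is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): the number of slots a vector
// indexed directly by anchor must hold so that every anchor is a valid index.
// Assumption: anchors are treated as unsigned 32-bit values.
std::size_t DenseTableSlots(const std::vector<std::uint32_t>& anchors) {
  if (anchors.empty()) return 0;
  const std::uint32_t max_anchor =
      *std::max_element(anchors.begin(), anchors.end());
  // The table length depends on the largest anchor, not on how many exist.
  return static_cast<std::size_t>(max_anchor) + 1;
}

// Lower bound on the table's memory: counts only the slot objects themselves
// (one std::string per slot) and ignores any heap storage for the names.
std::size_t DenseTableBytes(const std::vector<std::uint32_t>& anchors) {
  return DenseTableSlots(anchors) * sizeof(std::string);
}
```

With anchors 0 and 1 the table needs two slots. With just two anchors, 7 and 1000000, it needs 1000001 slots, nearly all of them blank.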

      We could either request that the Substrait spec enforce small anchors, or change the extension set handling in the C++ consumer to use an unordered_map.
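The second option could look roughly like the sketch below. `AnchorTable` is a hypothetical stand-in for the consumer's extension set, not the real Arrow class, and its method names are assumptions made for illustration. Memory grows with the number of registered anchors, regardless of how large the anchor values are.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the consumer's extension set (not the Arrow API).
// Maps an anchor to its fully qualified function name.
class AnchorTable {
 public:
  // Registers a name under an anchor. Returns false, leaving the table
  // unchanged, if the anchor is already registered.
  bool Add(std::uint32_t anchor, std::string qualified_name) {
    return names_.emplace(anchor, std::move(qualified_name)).second;
  }

  // Returns a pointer to the registered name, or nullptr for an unknown anchor.
  const std::string* Find(std::uint32_t anchor) const {
    auto it = names_.find(anchor);
    return it == names_.end() ? nullptr : &it->second;
  }

  // Number of registered anchors.
  std::size_t size() const { return names_.size(); }

 private:
  std::unordered_map<std::uint32_t, std::string> names_;
};
```

Registering anchors 7 and 4000000000 stores exactly two entries. The trade-off is a hash lookup per reference instead of a direct index, which is the cost the vector approach was trying to avoid.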

            People

              sanjibansg Sanjiban Sengupta
              westonpace Weston Pace


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 50m
