  1. IMPALA
  2. IMPALA-12871 Implement front end using Calcite
  3. IMPALA-12961

Use a Map instead of an ArrayList for Expr in HDFS RelNode



    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved


      This came up in code review in ImpalaHdfsScanRel:

      "For wide tables where we are only needing a few columns projected, we will end up with a long list with mostly Nulls. A LinkedHashMap (preserves Insertion order) where the key is position and value is the SlotRef would be better suited despite the cpu cost of hashing. In general, in a query planner, memory is the most precious commodity since the plan search space can be large, so anything we can do to reduce memory footprint would be preferred."

      One counterargument: the list is used in other RelNodes, and it seems more natural. For instance, the Project RelNode will have a RexInputRef RexNode such as "$2", which refers to its input by position, so an array indexed by position fits naturally. Every other RelNode works this way; only the ScanNode would differ.
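      The two layouts under discussion can be sketched as follows. This is only an illustration, not the ImpalaHdfsScanRel code: the class and method names are hypothetical, and plain strings stand in for the Expr/SlotRef objects. It shows that a positional reference like "$2" resolves with a direct index in the list layout and with a key lookup in the map layout, while the map stores only the projected columns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; strings stand in for Expr/SlotRef objects.
final class SparseColumnExprs {
  private SparseColumnExprs() {}

  // Sparse layout: only projected columns are stored, keyed by column
  // position. LinkedHashMap keeps the insertion order of the entries.
  static Map<Integer, String> asPositionalMap(int[] positions, String[] exprs) {
    Map<Integer, String> byPosition = new LinkedHashMap<>();
    for (int i = 0; i < positions.length; i++) {
      byPosition.put(positions[i], exprs[i]);
    }
    return byPosition;
  }

  // Dense layout: one slot per table column, null where the column
  // is not projected.
  static List<String> asPositionalList(int numColumns, Map<Integer, String> projected) {
    List<String> exprs =
        new ArrayList<>(Collections.<String>nCopies(numColumns, null));
    for (Map.Entry<Integer, String> e : projected.entrySet()) {
      exprs.set(e.getKey(), e.getValue());
    }
    return exprs;
  }

  // Resolving an input reference such as "$2" in each layout.
  static String resolveInList(List<String> exprs, int inputRef) {
    return exprs.get(inputRef);
  }

  static String resolveInMap(Map<Integer, String> exprs, int inputRef) {
    return exprs.get(inputRef);
  }
}
```

      For a 500-column table with two projected columns, the list holds 500 slots (498 of them null) and the map holds two entries; both resolve position 2 to the same expression.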

      To add to the counterargument: take a worst-case scenario of a query that has 10 tables with 500 columns apiece. If we are allocating 8-byte pointers, we would need 10 * 500 * 8 = 40,000 bytes to hold this information. While reducing the memory footprint is important, reducing it by 40,000 bytes really isn't going to make an impact. Even if we take into account that multiple queries would be running simultaneously, this is a very short-lived code path. So should we go with the more natural approach rather than the less memory-intensive approach?
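      The estimate above can be reproduced with a small calculation. This is a back-of-the-envelope sketch only: the class and method names are hypothetical, and it counts just the references held by the list, ignoring object headers and array over-allocation.

```java
// Hypothetical helper for the worst-case estimate in the description.
final class PointerFootprint {
  private PointerFootprint() {}

  // Bytes needed for one reference slot per column across all tables.
  static long denseListBytes(int tables, int columnsPerTable, int bytesPerReference) {
    return (long) tables * columnsPerTable * bytesPerReference;
  }
}
```

      With the figures from the description, denseListBytes(10, 500, 8) gives 40,000 bytes. A map would not be free either, since each stored entry carries its own per-entry overhead.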




            Unassigned Unassigned
            scarlin Steve Carlin

