Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-5525

Inconsistent, unhelpful semantics for batch, field schema comparison

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 1.10.0
    • 1.12.0
    • None
    • None

    Description

      Drill provides two classes to define schema at execution time:

      • MaterializedField - describes one column (or a collection of columns for a map)
      • BatchSchema - describes the set of columns that make up a row

      Each class provides an isEqual() method. However it seems these methods may not be used much because the contain a number of flaws and contradictions.

      • MaterializedField has a comment that says it compares only names; but then it turns around and compares the object field-by-field using Protobuf:
          // DRILL-1872: Compute equals only on key. See also the comment
          // in MapVector$MapTransferPair
      
          return this.name.equalsIgnoreCase(other.name) &&
                  Objects.equals(this.type, other.type);
      
      • MaterializedField ends up doing a physical comparison of fields rather than a logical comparison. Varchar columns are defined, at the logical level as Varchar. But, at the physical level, a Varchar is a two-part column: it has a child that represents the $offsets$ vector. As a result, a comparison between logical and physical schema returns false, even if both is just a Varchar.
      • BatchSchema ends up being inconsistent because of the above confusion. It first (seemingly) compares names, then tries to compare types. But, because the MaterializedField method actually compares all fields, the type comparison does nothing.
          if (!fields.equals(other.fields)) { // Compare all fields
            return false;
          }
          for (int i = 0; i < fields.size(); i++) { // So this loop is a no-op
            MajorType t1 = fields.get(i).getType();
            MajorType t2 = other.fields.get(i).getType();
            if (t1 == null) {
              if (t2 != null) {
                return false;
              }
            } else {
              if (!majorTypeEqual(t1, t2)) {
                return false;
              }
            }
          }
      

      The result is that one is not quite sure what the two methods are supposed to compare. Is MaterializedField supposed to:

      • Do a physical compare (existing behavior)?
      • Do a name-only compare (as the comment suggests)?
      • Do a logical comparison (as unit tests would want)?

      And for BatchSchema, should it:

      • Do a physical compare (existing behavior)?
      • Do a type-aware comparison (as in the loop that does nothing)?

      Further note that neither of the methods do anything special for map fields: they end up being compared as part of the protobuf comparison. But, we shold probably apply the same rules to map fields as we apply to top-level fields (as expressed in the second loop above.)

      This ticket requests:

      • Document current uses.
      • Determine the semantics of the isEqual() method.
      • Create additional methods as needed: isPhysicallyEqual(), isLogicallyEqual(), hasSameNames(), or whatever.

      Not that this issue is classic: it seems that "equal" is well defined concept, but as the Greek philosopher Heraclitus suggested with his famous quote that "you can never step into the same river twice", the concept of "sameness" is fluid and ad-hoc. It is up to us to define the semantics for equality, and sometimes we need more than one concept.

      Attachments

        Issue Links

          Activity

            People

              paul-rogers Paul Rogers
              paul-rogers Paul Rogers
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: