Pig / PIG-3409

org.apache.pig.data.DefaultTuple hashCode performance issue

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None

    Description

      I've run into a serious performance issue.
      Please see the VisualVM screenshot.

      Here is hashCode implementation from the class:

          @Override
          public int hashCode() {
              int hash = 17;
              for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
                  Object o = it.next();
                  if (o != null) {
                      hash = 31 * hash + o.hashCode();
                  }
              }
              return hash;
          }
      

      I don't see any reason to iterate over the whole tuple, aggregate the hash value, and return it on every call to hashCode().

      I can fix it if it's possible for me to take part in the dev process. I'm new to it.

      The idea for any join:
      If we have a plan, we know for sure which relations will be joined.
      That means we can precalculate the hashCode values.
      The difference is m+n hashCode calculations instead of m*n (current implementation).
      I think this should bring a significant performance boost.
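
      A minimal sketch of the precalculation idea, assuming the hash can be cached per tuple and invalidated when a field is replaced. CachedHashTuple, its set method and the recomputations counter are hypothetical names used only for illustration; this is not Pig's DefaultTuple and not a proposed patch. The hash formula is the one quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Pig's DefaultTuple.
class CachedHashTuple {
    private final List<Object> fields;
    private int cachedHash;     // last computed hash value
    private boolean hashValid;  // false until computed, and after any set()

    // Counts full passes over the fields so the effect of caching is visible.
    int recomputations = 0;

    CachedHashTuple(Object... values) {
        this.fields = new ArrayList<>(Arrays.asList(values));
    }

    void set(int index, Object value) {
        fields.set(index, value);
        hashValid = false; // a mutable tuple must drop its cached hash
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Same formula as the quoted implementation, computed once.
            int hash = 17;
            for (Object o : fields) {
                if (o != null) {
                    hash = 31 * hash + o.hashCode();
                }
            }
            cachedHash = hash;
            hashValid = true;
            recomputations++;
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CachedHashTuple
                && fields.equals(((CachedHashTuple) other).fields);
    }
}
```

      With this shape, repeated hashCode() calls on an unchanged tuple walk the fields only once. The cache goes stale if a field object is mutated in place rather than replaced through set, so a real change would have to account for every way a tuple can be modified.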

          People

            Assignee: Unassigned
            Reporter: Sergey (serega_sheypak)
            Votes: 0
            Watchers: 5

              Time Tracking

                Estimated: 3h
                Remaining: 3h
                Logged: Not Specified