PARQUET-1808: SimpleGroup.toString() uses String += and so has poor performance

    Description

      This method in SimpleGroup uses `+=` for String concatenation, which is a known performance problem in Java: each `+=` builds a new String by copying everything accumulated so far, so the total cost grows quadratically with the number of strings appended.

      https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50
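To illustrate the general pattern, here is a minimal, self-contained sketch that builds the same output once with `+=` and once with `StringBuilder`. The class `ConcatDemo`, its method names, and the 10,000-element input are assumptions made for illustration; none of this is taken from parquet-mr, and the timings it prints will vary by machine and JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of parquet-mr.
final class ConcatDemo {

  // Each += allocates a new String and copies everything accumulated so far,
  // so the total work grows quadratically with the number of parts.
  static String joinWithPlusEquals(List<String> parts) {
    String result = "";
    for (String part : parts) {
      result += part + "\n";
    }
    return result;
  }

  // StringBuilder appends into a growable buffer, so the total work is
  // (amortized) linear in the length of the output.
  static String joinWithBuilder(List<String> parts) {
    StringBuilder result = new StringBuilder();
    for (String part : parts) {
      result.append(part).append('\n');
    }
    return result.toString();
  }

  public static void main(String[] args) {
    // Illustrative input size; chosen to keep the slow path short.
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      parts.add("element: customer-" + i);
    }

    long t0 = System.nanoTime();
    String slow = joinWithPlusEquals(parts);
    long t1 = System.nanoTime();
    String fast = joinWithBuilder(parts);
    long t2 = System.nanoTime();

    System.out.println("same output: " + slow.equals(fast));
    System.out.printf("+=            : %d ms%n", (t1 - t0) / 1_000_000);
    System.out.printf("StringBuilder : %d ms%n", (t2 - t1) / 1_000_000);
  }
}
```

Both methods return identical strings; only the amount of copying differs.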

      We ran into a performance problem whereby a single column in a Parquet file was defined as a group:

          optional group customer_ids (LIST) {
            repeated group list {
              optional binary element (STRING);
            }
          }

      and had over 31,000 values. Reading this single column took over 8 minutes because of the time spent in the `toString()` method. Using a different implementation based on `StringBuffer`, like this:

       // Accumulate output in one mutable buffer instead of creating a new String per +=.
       StringBuffer result = new StringBuffer();
       int i = 0;
       for (Type field : schema.getFields()) {
         String name = field.getName();
         List<Object> values = data[i];
         ++i;
         if (values != null && !values.isEmpty()) {
           for (Object value : values) {
             result.append(indent);
             result.append(name);
             if (value == null) {
               result.append(": NULL\n");
             } else if (value instanceof Group) {
               // Nested group: recurse with a deeper indent.
               result.append("\n");
               result.append(betterToString((SimpleGroup) value, indent + " "));
             } else {
               result.append(": ");
               result.append(value.toString());
               result.append("\n");
             }
           }
         }
       }
       return result.toString();

      reduced that time to less than 500 milliseconds. 
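One possible refinement, sketched here under stated assumptions rather than as a proposed patch: pass a single `StringBuilder` down through the recursion so nested groups append directly to it instead of each returning an intermediate String. `ToyGroup` below is a hypothetical stand-in for a Parquet group (a map from field name to repeated values), not the real `SimpleGroup` API, and the two-space indent step is an arbitrary choice. `StringBuilder` is used instead of `StringBuffer` because the buffer is confined to one thread and needs no synchronization.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Parquet group: field name -> repeated values.
// A value is a nested ToyGroup, null, or any other object.
final class ToyGroup {
  private final Map<String, List<Object>> fields = new LinkedHashMap<>();

  // Adds one value under the given field name and returns this for chaining.
  ToyGroup add(String name, Object value) {
    fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    return this;
  }

  @Override
  public String toString() {
    StringBuilder out = new StringBuilder();
    appendTo(out, "");
    return out.toString();
  }

  // Appends this group to the caller's builder; nested groups reuse the
  // same builder, so no intermediate Strings are created per group.
  private void appendTo(StringBuilder out, String indent) {
    for (Map.Entry<String, List<Object>> field : fields.entrySet()) {
      for (Object value : field.getValue()) {
        out.append(indent).append(field.getKey());
        if (value == null) {
          out.append(": NULL\n");
        } else if (value instanceof ToyGroup) {
          out.append('\n');
          ((ToyGroup) value).appendTo(out, indent + "  ");
        } else {
          out.append(": ").append(value).append('\n');
        }
      }
    }
  }
}
```

For example, `new ToyGroup().add("customer_ids", new ToyGroup().add("list", new ToyGroup().add("element", "a"))).toString()` yields three lines, `customer_ids`, `list`, and `element: a`, each indented two spaces deeper than the previous one.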

      The existing implementation exhibits a well-known Java string concatenation performance issue and should be fixed.

      This was a significant problem for us, but we were able to work around it, so I am marking this issue as "Minor".

            People

              kornsanz Shankar Koirala
              tiddman Randy Tidd