Uploaded image for project: 'Commons Lang'
  1. Commons Lang
  2. LANG-1606

StringUtils.countMatches returns incorrect value while handling intersecting substrings

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.11
    • Fix Version/s: 3.12.0
    • Component/s: lang.*
    • Labels:
      None

      Description

      Steps to reproduce:

      1. Call the method like that:

      int count = StringUtils.countMatches("abaabaababaab", "aba");
      

      Actual result: the value of count variable equals 3
      Expected result: the value of count variable equals 4

      The substrings are highlighted in red:
      abaabaababaab
      abaabaababaab
      abaabaababaab
      abaabaababaab

      Method returns incorrect value because of this code:

      while ((idx = CharSequenceUtils.indexOf(str, sub, idx)) != INDEX_NOT_FOUND) {
          count++;
          idx += sub.length();
      }
      

      This looks like a greedy algorithm - but increasing the idx variable by the length of substring could lead to the problems like in example:

      Let's say that idx = 6, so we try to find a substring in the highlighted suffix:
      abaabaababaab

      We found the substring, so idx now becomes idx + 3 = 9. So now this suffix will be used for searching substring in it:
      abaabaababaab
      But because of increasing the value of idx by 3 we won't find the substring (abaabaababaab) which intersects with the already found substring on the last step.

      Basically, this method will work incorrectly with any substrings that intersect with each other.

      There is also a unit test with incorrect expected value:

      assertEquals(4,
           StringUtils.countMatches("oooooooooooo", "ooo"));
      

      If this behavior (counting substrings that do not intersect) is intended, please update the JavaDoc to reflect it. Right now it looks like that:

      Counts how many times the substring appears in the larger string.
      

      Link for the PR: https://github.com/apache/commons-lang/pull/615

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              GaliFFun Rustem Galiev
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 40m
                40m