Commons Text / TEXT-228

StringTokenizer performance degradation when parsing large lines


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 1.10.0
    • Fix Version/s: 1.11.0
    • None
    • Environment: Linux

    Description

      After recently upgrading from Apache Commons Text 1.9 to 1.10.0, we've noticed that our system "hangs" (or, more likely, takes an excessively long time) when splitting large lines (100MB+ in size) with StringTokenizer.

       

      Mitigation: Revert to Apache Commons Text 1.9

       

      Scala version:

       

      > scala -version
      Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
      

       

      Java version:

       

      > java -version 
      openjdk version "1.8.0_382"
      OpenJDK Runtime Environment (build 1.8.0_382-b05)
      OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
      

       

       

      Reproduction Steps:

      1. Generate a sample large file
      echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
      dd if=/dev/zero bs=100MB count=1 >> largefile
      sed -i -e "s/\x0/0/g" largefile
      echo -n "\0" >> largefile
      
      2. Set up reproduce.scala
      import org.apache.commons.text.StringTokenizer
      // Read the file; it consists of a single very long line
      val lines = scala.io.Source.fromFile("./largefile").getLines.toList
      // Tokenize that line with the default StringTokenizer settings
      val st: StringTokenizer = new StringTokenizer(lines(0))
      val res = st.getTokenArray()
      
      3. Download Apache Commons Jars
      wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
      
      wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
      
      4. Run the program with a 10-second timeout (1.10.0 seems to hang for >1 minute)
      > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
      timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s user 0.83s system 121% cpu 2.818 total
       
      > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
      timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s user 0.00s system 0% cpu 10.002 total
      

      As shown above, 1.9 finishes in ~3 seconds, whereas 1.10.0 is killed by the 10-second timeout. I have not measured how long 1.10.0 takes to finish, but it appears to run for >1 minute.
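
      For anyone re-running the reproduction, the input file from the first step can also be produced by a small shell function that takes the filler length as a parameter. This is only a sketch under a few assumptions: make_largefile is a hypothetical helper name, head -c is available (it is in GNU and BSD userlands), and the final echo in the original steps was meant to append a single NUL byte, which depends on the shell's echo.

```shell
#!/bin/sh
# make_largefile FILE COUNT  (hypothetical helper, not part of the original report)
# Writes two quoted tokens, then COUNT '0' characters, then one NUL byte.
make_largefile() {
  file=$1
  count=$2
  {
    # Same prefix as the original steps: two quoted tokens, each followed by a space.
    printf '%s' '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" '
    # COUNT zero bytes from /dev/zero, translated to the ASCII digit '0'.
    head -c "$count" /dev/zero | tr '\0' '0'
    # Trailing NUL byte (assumption: what the final echo was meant to append).
    printf '\0'
  } > "$file"
}

# Example: 100,000,000 filler characters, matching dd's bs=100MB count=1.
# make_largefile largefile 100000000
```

      Passing smaller counts (for example 1000000 or 10000000) should give inputs that finish within the timeout, which may help show how the run time grows with line length on each version.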


            People

              Assignee: Unassigned
              Reporter: Zack Hable (hablz)
              Votes: 0
              Watchers: 3
