Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 1.9, 1.10.0
- None
- Linux
Description
After upgrading from Apache Commons Text 1.9 to 1.10.0, we noticed that our system "hangs" (or at least takes an excessively long time) when splitting large lines (100 MB or more) with StringTokenizer.
Mitigation: Revert to Apache Commons Text 1.9
Scala version:
> scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
Java version:
> java -version
openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
Reproduction Steps:
- Generate a sample large file
echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
dd if=/dev/zero bs=100MB count=1 >> largefile
sed -ie "s/\x0/0/g" largefile
echo -n "\0" >> largefile
- Set up reproduce.scala
import org.apache.commons.text.StringTokenizer
val lines = scala.io.Source.fromFile("./largefile").getLines.toList
val st: StringTokenizer = new StringTokenizer(lines(0))
val res = st.getTokenArray()
- Download Apache Commons Jars
wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
- Run the program with a 10-second timeout (1.10.0 seems to hang for more than 1 minute)
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s user 0.83s system 121% cpu 2.818 total
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s user 0.00s system 0% cpu 10.002 total
As the output above shows, 1.9 finishes in about 3 seconds, whereas 1.10.0 is killed by the 10-second timeout. I have not measured exactly how long 1.10.0 takes to complete, but it runs for more than 1 minute.
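Since the exact 1.10.0 runtime is unknown, one way to narrow it down might be to repeat the run on smaller inputs and watch how the time grows. The following is only a sketch: `make_sample` and `sweep` are hypothetical helpers that are not part of the original steps, the 120-second timeout is an arbitrary choice, and the trailing `printf '\0'` assumes the original `echo -n "\0"` was run in a shell (such as zsh) where it writes a single NUL byte.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the original report): write a one-line
# sample file made of the quoted prefix, size_mb * 1,000,000 '0' characters,
# and a trailing NUL byte.
make_sample() {
  local size_mb="$1" out="$2"
  printf '%s' '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > "$out"
  head -c "$((size_mb * 1000000))" /dev/zero | tr '\0' '0' >> "$out"
  printf '\0' >> "$out"
}

# Hypothetical sweep: time reproduce.scala against one jar for growing inputs.
# Assumes scala, timeout, reproduce.scala and the downloaded jar are all
# available in the current directory, as in the steps above.
sweep() {
  local jar="$1"; shift
  local mb
  for mb in "$@"; do
    make_sample "$mb" largefile
    echo "== ${jar}, ${mb} MB =="
    # Report a timeout or failure instead of aborting the whole sweep.
    time timeout 120 scala -J-Xmx2g -cp "$jar" reproduce.scala \
      || echo "timed out or failed at ${mb} MB"
  done
}
```

For example, running `sweep commons-text-1.10.0.jar 1 2 4 8` and then the same sizes with `commons-text-1.9.jar` would indicate whether the 1.10.0 time grows faster than linearly with the line length.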