Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- None
- None
- New
Description
Customers complained about high CPU usage on an Elasticsearch cluster in production. We noticed that a few search requests had been stuck for a long time:
% curl -s localhost:9200/_cat/tasks?v
indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 AmMLzDQ4RrOJievRDeGFZw:569204 direct 1645195007282 14:36:47 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 emjWc5bUTG6lgnCGLulq-Q:502074 direct 1645195037259 14:37:17 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 emjWc5bUTG6lgnCGLulq-Q:583269 direct 1645201316981 16:21:56 4.5h
Flame graphs indicated that most of the CPU time was being spent in the getMinCompetitiveScore method of MaxScoreSumPropagator. Live JVM debugging then showed that org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound was being invoked around 4 million times per second.
The same debugging session gave the following parameter values:
minScoreSum = 3.5541441
minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 3.554144322872162
returnObj scoreSumUpperBound = 3.5541444
Math.ulp(minScoreSum) = 2.3841858E-7
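The values above can be related to each other with a few lines of Java. This is only a sketch: it assumes minScoreSum is a float (which is consistent with the reported ulp of 2.3841858E-7, the float spacing between 2 and 4), and the class and method names are made up for illustration. It checks that narrowing the double argument 3.554144322872162 to float yields the float immediately above minScoreSum.

```java
class UlpCheck {
    // Values copied from the debugging session; the float type of
    // MIN_SCORE_SUM is an assumption based on the reported ulp.
    static final float MIN_SCORE_SUM = 3.5541441f;
    static final double UPPER_BOUND_ARG = 3.554144322872162;

    /** Narrows the double argument to float with a plain cast. */
    static float narrowed() {
        return (float) UPPER_BOUND_ARG;
    }

    public static void main(String[] args) {
        System.out.println("ulp(minScoreSum)    = " + Math.ulp(MIN_SCORE_SUM));
        System.out.println("(float) argument    = " + narrowed());
        System.out.println("nextUp(minScoreSum) = " + Math.nextUp(MIN_SCORE_SUM));
        // True when the narrowed argument sits exactly one float step above minScoreSum.
        System.out.println("one ulp above?        " + (narrowed() == Math.nextUp(MIN_SCORE_SUM)));
    }
}
```

If the last line prints true, the reported return value 3.5541444 is simply the next representable float after 3.5541441, so the loop condition holds by the smallest possible margin.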
Example code snippet:
double sumOfOtherMaxScores = 3.554144322872162;
double minScoreSum = 3.5541441;
float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
while (scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
    minScore -= Math.ulp(minScoreSum);
    System.out.printf("%.20f, %.100f\n", minScore, Math.ulp(minScoreSum));
}
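To experiment with the snippet without hanging a JVM, the loop can be wrapped with an iteration cap. The sketch below is not the Lucene implementation: scoreSumUpperBound here is a hypothetical stand-in that only narrows the sum to float (the real method may also apply an error factor), and the class name, method names, and CAP are invented for this example. It runs the loop once with minScoreSum typed as double, as in the snippet, and once typed as float.

```java
class ConvergenceProbe {
    // Safety cap so a non-converging loop still terminates (assumption, not in the snippet).
    static final int CAP = 1_000;

    // Hypothetical stand-in for MaxScoreSumPropagator.scoreSumUpperBound:
    // it only narrows the double sum to float.
    static float scoreSumUpperBound(double sum) {
        return (float) sum;
    }

    /** Loop from the snippet with minScoreSum as a double; returns iterations used, or CAP. */
    static int iterationsWithDouble(double minScoreSum, double sumOfOtherMaxScores) {
        float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
        int iters = 0;
        while (iters < CAP && scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
            minScore -= Math.ulp(minScoreSum); // ulp of a double, narrowed back to float
            iters++;
        }
        return iters;
    }

    /** Same loop with minScoreSum as a float; returns iterations used, or CAP. */
    static int iterationsWithFloat(float minScoreSum, double sumOfOtherMaxScores) {
        float minScore = (float) (minScoreSum - sumOfOtherMaxScores);
        int iters = 0;
        while (iters < CAP && scoreSumUpperBound(minScore + sumOfOtherMaxScores) > minScoreSum) {
            minScore -= Math.ulp(minScoreSum); // ulp of a float
            iters++;
        }
        return iters;
    }

    public static void main(String[] args) {
        double sumOfOtherMaxScores = 3.554144322872162;
        System.out.println("double minScoreSum: " + iterationsWithDouble(3.5541441, sumOfOtherMaxScores));
        System.out.println("float  minScoreSum: " + iterationsWithFloat(3.5541441f, sumOfOtherMaxScores));
    }
}
```

Under these assumptions the double-typed variant runs into the cap, because Math.ulp of a double near 3.55 is about 4.4E-16 and subtracting it from a float of magnitude 2E-7 leaves the float unchanged, while the float-typed variant exits without iterating. This only characterizes the snippet with a cast-only stand-in; it does not by itself identify what kept the production loop from converging.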