Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.3
Description
jsoup is causing two problems in our encoding detection module, namely:
https://github.com/jhy/jsoup/issues/1251 (which caused ANY23-441)
and
https://github.com/jhy/jsoup/issues/1250 (which will make our encoding detector blow up whenever it detects, e.g., UTF-16).
The latter issue is more serious than the former because it is likely to trigger much more often.
A pull request that fixes the first issue is open in jsoup, but unfortunately Jonathan Hedley (the creator of jsoup) has not been active over the past few months, so I doubt it will be merged anytime soon.
I propose that we temporarily repackage a couple of jsoup classes in our encoding detection module and apply some quick fixes to them. Once the jsoup library is updated, we can potentially remove the repackaged classes again.
A side benefit: this will allow us to implement a streaming approach to encoding detection, instead of our current strategy of building the entire DOM just to extract the plaintext (which is really overkill in terms of memory usage).
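To illustrate what "streaming" could mean here, below is a minimal sketch that pulls text out of markup one character at a time, so memory use stays constant regardless of document size. It is not jsoup's API and not the fix proposed above. The names StreamingTextSketch, TextSink, streamText and extractText are hypothetical, and the tag handling is deliberately naive (no comments, script/style blocks, or entities).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: stream text content out of HTML without building a DOM.
final class StreamingTextSketch {

    /** Hypothetical callback that receives each text character as it is read. */
    interface TextSink {
        void accept(char c);
    }

    /**
     * Reads markup character by character and forwards only the characters
     * that lie outside of '<' ... '>' to the sink. Nothing is buffered here.
     */
    static void streamText(Reader in, TextSink sink) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // entering a tag: stop emitting
            } else if (c == '>' && inTag) {
                inTag = false;           // leaving a tag: resume emitting
            } else if (!inTag) {
                sink.accept((char) c);   // plain text: hand it to the consumer
            }
        }
    }

    /** Convenience wrapper for demos; collects the streamed text into a String. */
    static String extractText(String html) {
        StringBuilder sb = new StringBuilder();
        try {
            streamText(new StringReader(html), ch -> sb.append(ch));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hi <b>there</b></p>"));
    }
}
```

In a real detector the sink would presumably update character statistics incrementally rather than collect a String; the extractText wrapper exists only to make the sketch easy to try out.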