
Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.4
    • Component/s: None
    • Labels: None

    Description

      jsoup is causing two problems in our encoding detection module, namely:

      https://github.com/jhy/jsoup/issues/1251 (which caused ANY23-441)

      and

      https://github.com/jhy/jsoup/issues/1250 (which is going to make our encoding detector blow up anytime we're detecting, e.g., UTF-16).

      The latter issue is more serious than the former because the errors it causes are likely to occur far more often.

      There is an open pull request in jsoup that fixes the first issue, but unfortunately Jonathan Hedley (the creator of jsoup) has not been active over the past few months, and I doubt it will get merged anytime soon.

      I propose that we temporarily repackage a couple of jsoup classes in our encoding detection module and add some quick fixes. Once the jsoup library is updated, we can potentially remove the repackaged classes again.

      A bonus: this will let us implement a streaming approach to encoding detection, replacing our current strategy of building the entire DOM just to extract the plain text (which is overkill in terms of memory usage).
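
To illustrate the streaming idea, here is a minimal Java sketch. It is not jsoup's API and not the actual Any23 implementation: `StreamingTextSketch` and `extractText` are hypothetical names, and the tag handling is deliberately naive (comments, scripts, and character entities are not treated specially). It only shows how plain text can be pulled from a character stream without materializing a DOM, with memory bounded by `maxChars`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingTextSketch {

    /**
     * Hypothetical helper, for illustration only: collects characters that
     * appear outside of '<' ... '>' pairs, stopping once maxChars characters
     * of text have been gathered or the stream ends.
     */
    public static String extractText(Reader in, int maxChars) throws IOException {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        int c;
        while (text.length() < maxChars && (c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;          // start skipping markup
            } else if (c == '>' && insideTag) {
                insideTag = false;         // markup ended, resume collecting
            } else if (!insideTag) {
                text.append((char) c);     // plain text between tags
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello, <b>world</b>!</p></body></html>";
        // Prints: Hello, world!
        System.out.println(extractText(new StringReader(html), 1024));
    }
}
```

Because only a bounded sample of text is kept, a detector built this way could stop reading as soon as it has enough characters to make a decision, instead of parsing the whole document first.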

            People

              Assignee: Hans Brende (hansbrende)
              Reporter: Hans Brende (hansbrende)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m