Flume
  1. Flume
  2. FLUME-2215

ResettableFileInputStream can't support ucs-4 character

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: v1.5.0
    • Fix Version/s: v1.7.0
    • Component/s: None
    • Labels:

      Description

      ResettableFileInputStream.java:readChar() not handle ucs-4 character. it need 2 charBuf. it cause an unexpected termination。
      a temporary solution:
      if (res.isOverflow() && !charBuf.hasRemaining()){
      logger.warn("decoder ucs-4 at postion: {}" , buf.position());
      tmpBuf.clear();
      res = decoder.decode(buf, tmpBuf, isEndOfInput);
      incrPosition( buf.position() - start, false);
      return '?';
      }

      1. 0001-FLUME-2215-Fixes-reading-surrogate-based-chars.patch
        7 kB
        Patrick Sodre
      2. FLUME-2215-0.patch
        26 kB
        Alexandre Dutra
      3. FLUME-2215-0-README.txt
        4 kB
        Alexandre Dutra
      4. FLUME-2215-1.patch
        205 kB
        Alexandre Dutra
      5. FLUME-2215-1-README.txt
        5 kB
        Alexandre Dutra

        Issue Links

          Activity

          Hide
          Alexandre Dutra added a comment - - edited

          I brought this bug to the dev list a few days ago, see here.

          I'm attaching a patch that solves the problem by decoding the surrogate pair in a single pass.

          The only drawback I could think of would be if the agent crashes while the reader is in the middle of a surrogate pair and an attempt is made to resume reading from last recorded position. Should this case happen, the low surrogate would be lost and the consumer would end up with a corrupted char stream (high surrogate alone).

          Show
          Alexandre Dutra added a comment - - edited I brought this bug to the dev list a few days ago, see here . I'm attaching a patch that solves the problem by decoding the surrogate pair in a single pass. The only drawback I could think of would be if the agent crashes while the reader is in the middle of a surrogate pair and an attempt is made to resume reading from last recorded position. Should this case happen, the low surrogate would be lost and the consumer would end up with a corrupted char stream (high surrogate alone).
          Hide
          Alexandre Dutra added a comment -

          I am attaching two patches against the current trunk that fix the bug.

          1. FLUME-2215-0.patch solves the issue with minimal changes;
          2. FLUME-2215-1.patch solves the issue in a more ambitious way.

          Note to reviewers:

          1. Please DO NOT apply both patches simultaneously, they are intended to be applied alternatively;
          2. Please read files FLUME-2215-0-README.txt and FLUME-2215-1-README.txt for guidelines and review instructions for both patches.
          Show
          Alexandre Dutra added a comment - I am attaching two patches against the current trunk that fix the bug. FLUME-2215 -0.patch solves the issue with minimal changes; FLUME-2215 -1.patch solves the issue in a more ambitious way. Note to reviewers: Please DO NOT apply both patches simultaneously, they are intended to be applied alternatively; Please read files FLUME-2215 -0-README.txt and FLUME-2215 -1-README.txt for guidelines and review instructions for both patches.
          Hide
          Santiago M. Mola added a comment -

          I'm also interested in getting this done. I would propose to change the API to allow a more consistent approach: readChar() should return char[] with either one char or a surrogate pair. This would ensure that full characters are always read and would prevent clunky behaviour with mark/reset. A complementary method readCodePoint() might return an int.

          Show
          Santiago M. Mola added a comment - I'm also interested in getting this done. I would propose to change the API to allow a more consistent approach: readChar() should return char[] with either one char or a surrogate pair. This would ensure that full characters are always read and would prevent clunky behaviour with mark/reset. A complementary method readCodePoint() might return an int.
          Hide
          Santiago M. Mola added a comment -

          I would like to see something like Alexandre Dutra's second approach implemented. The character decoding logic should be separated from the byte reading, just as it is separated in the Java standard library. While the first approach works, it forces us to solve the problem at every level over and over. For example, I have a (highly hacky) ResettableDecompressInputStream implementation that takes a ResettableFileInputStream and decompresses it. After applying the first patch, I still have to manage character decoding on ResettableDecompressInputStream.

          Show
          Santiago M. Mola added a comment - I would like to see something like Alexandre Dutra 's second approach implemented. The character decoding logic should be separated from the byte reading, just as it is separated in the Java standard library. While the first approach works, it forces us to solve the problem at every level over and over. For example, I have a (highly hacky) ResettableDecompressInputStream implementation that takes a ResettableFileInputStream and decompresses it. After applying the first patch, I still have to manage character decoding on ResettableDecompressInputStream.
          Hide
          Santiago M. Mola added a comment -

          I'm using the second patch without any problem so far. In combination with decodeErrorPolicy=replace (or ignore) it seems to solve all Unicode parsing problems.

          Show
          Santiago M. Mola added a comment - I'm using the second patch without any problem so far. In combination with decodeErrorPolicy=replace (or ignore) it seems to solve all Unicode parsing problems.
          Hide
          Santiago M. Mola added a comment -

          I have updated the patch by Alexandre Dutra and opened a Review Board request for others to review it. It has been working for me for a couple of months without any problem so far.

          Show
          Santiago M. Mola added a comment - I have updated the patch by Alexandre Dutra and opened a Review Board request for others to review it. It has been working for me for a couple of months without any problem so far.
          Hide
          Hari Shreedharan added a comment -

          Santiago M. Mola - I don't see the patch here or the link to the RB. Can you post them here?

          Show
          Hari Shreedharan added a comment - Santiago M. Mola - I don't see the patch here or the link to the RB. Can you post them here?
          Show
          Santiago M. Mola added a comment - Hari Shreedharan https://reviews.apache.org/r/27378/diff/
          Hide
          Hari Shreedharan added a comment -

          This seems to do a lot of renames etc. Are those required, the patch is too huge to review.

          Show
          Hari Shreedharan added a comment - This seems to do a lot of renames etc. Are those required, the patch is too huge to review.
          Hide
          Santiago M. Mola added a comment -

          Hari Shreedharan Yes. The patch rewrites a substantial amount of code. Alternatively, we can go with the first patch proposed by Alexandre Dutra, which is much more limited in scope.

          Show
          Santiago M. Mola added a comment - Hari Shreedharan Yes. The patch rewrites a substantial amount of code. Alternatively, we can go with the first patch proposed by Alexandre Dutra , which is much more limited in scope.
          Hide
          Santiago M. Mola added a comment -

          Can someone review the smaller patch? It's here: https://reviews.apache.org/r/16516/

          Hari Shreedharan?

          Show
          Santiago M. Mola added a comment - Can someone review the smaller patch? It's here: https://reviews.apache.org/r/16516/ Hari Shreedharan ?
          Hide
          Hari Shreedharan added a comment -

          I will take a look at it sometime this week

          Show
          Hari Shreedharan added a comment - I will take a look at it sometime this week
          Hide
          Patrick Sodre added a comment -

          I had also started working on this bug independently and have a slightly smaller fix. Would you mind waiting a little longer before committing Santiago's patch?

          Thanks!

          Show
          Patrick Sodre added a comment - I had also started working on this bug independently and have a slightly smaller fix. Would you mind waiting a little longer before committing Santiago's patch? Thanks!
          Hide
          Hari Shreedharan added a comment -

          Sure.

          Show
          Hari Shreedharan added a comment - Sure.
          Hide
          Jameel Al-Aziz added a comment -

          Just wanted to mention that I've tried the most current patch (backporting it to 1.5.2), and while it solves the UCS-4 problem, it also introduces a BufferUnderflowException occasionally. I'll work on trying to find a reproducible test case and fix the patch.

          java.nio.BufferUnderflowException
          at java.nio.Buffer.nextGetIndex(Buffer.java:500)
          at java.nio.HeapCharBuffer.get(HeapCharBuffer.java:135)
          at org.apache.flume.serialization.ResettableFileInputStream.readChar(ResettableFileInputStream.java:225)

          Show
          Jameel Al-Aziz added a comment - Just wanted to mention that I've tried the most current patch (backporting it to 1.5.2), and while it solves the UCS-4 problem, it also introduces a BufferUnderflowException occasionally. I'll work on trying to find a reproducible test case and fix the patch. java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:500) at java.nio.HeapCharBuffer.get(HeapCharBuffer.java:135) at org.apache.flume.serialization.ResettableFileInputStream.readChar(ResettableFileInputStream.java:225)
          Hide
          Patrick Sodre added a comment -

          Jameel,

          Since there were two different solutions submitted to the same problem, can you tell me which one failed the test? Mola's or mine ?

          Thank you,
          Patrick

          Show
          Patrick Sodre added a comment - Jameel, Since there were two different solutions submitted to the same problem, can you tell me which one failed the test? Mola's or mine ? Thank you, Patrick
          Hide
          Jameel Al-Aziz added a comment -

          Patrick Sodre, your patch seems to be failing with the BufferUnderflowException. The branch I'm building is available here: https://github.com/6si/flume/tree/flume-2215.

          I will try to get you a reproducible test case today or tomorrow!

          Show
          Jameel Al-Aziz added a comment - Patrick Sodre , your patch seems to be failing with the BufferUnderflowException. The branch I'm building is available here: https://github.com/6si/flume/tree/flume-2215 . I will try to get you a reproducible test case today or tomorrow!
          Hide
          Jameel Al-Aziz added a comment -

          Patrick Sodre, I haven't managed to get an isolated test case where the BufferUnderflow gets triggered. I can trigger it reliably given a full file, but when isolating the line where the exception occurs, it seems to work.

          That being said, I believe I found an issue with the patches submitted so far. If you have malformed input where one character is a high surrogate, but the next character is not the complementing lower surrogate, you get a case where we have two characters in the buffer that are not part of a surrogate pair. This seems to work, except that the patches "past" the the second character. Therefore, an error occurring between the first and second getChar() won't recover from the correct position.

          It also appears that this problem is further complicated if you choose to ignore malformed input (although, I'm still testing the edge cases here).

          Show
          Jameel Al-Aziz added a comment - Patrick Sodre , I haven't managed to get an isolated test case where the BufferUnderflow gets triggered. I can trigger it reliably given a full file, but when isolating the line where the exception occurs, it seems to work. That being said, I believe I found an issue with the patches submitted so far. If you have malformed input where one character is a high surrogate, but the next character is not the complementing lower surrogate, you get a case where we have two characters in the buffer that are not part of a surrogate pair. This seems to work, except that the patches "past" the the second character. Therefore, an error occurring between the first and second getChar() won't recover from the correct position. It also appears that this problem is further complicated if you choose to ignore malformed input (although, I'm still testing the edge cases here).
          Hide
          Patrick Sodre added a comment -

          @jalaziz, okay. I'll keep a look for anything that fails on my environment as well.

          Show
          Patrick Sodre added a comment - @jalaziz, okay. I'll keep a look for anything that fails on my environment as well.
          Hide
          Jeffrey Theobald added a comment - - edited

          Hi there,

          This bug doesn't just affect ucs-4. It also affects utf-8 characters that are four bytes long, (and only four bytes long, two and three bytes don't seem to be a problem). This might become a more significant issue since there are a bunch of emoji that are four bytes long in UTF-8 that are probably going to be increasingly common in text data.

          Specifically ingesting a line like:

          "two bytes: § three bytes: ⚂ four bytes: 😏. This text will never be read"
          

          from a spooling directory will cause a premature EOF at the emoticon, so the rest of the line and the rest of the file will be lost.

          Applying patch FLUME-2215-3.patch to commit 619e78fe68658db242808a18f41ee5137b127748 created a build that fixed this issue in my case. But this patch seems to be a long way out of date from the current trunk.

          EDIT: Patch downloaded from here with the 'download diff' link in the top right: https://reviews.apache.org/r/27378/diff/

          Show
          Jeffrey Theobald added a comment - - edited Hi there, This bug doesn't just affect ucs-4. It also affects utf-8 characters that are four bytes long, (and only four bytes long, two and three bytes don't seem to be a problem). This might become a more significant issue since there are a bunch of emoji that are four bytes long in UTF-8 that are probably going to be increasingly common in text data. Specifically ingesting a line like: "two bytes: § three bytes: ⚂ four bytes: 😏. This text will never be read" from a spooling directory will cause a premature EOF at the emoticon, so the rest of the line and the rest of the file will be lost. Applying patch FLUME-2215 -3.patch to commit 619e78fe68658db242808a18f41ee5137b127748 created a build that fixed this issue in my case. But this patch seems to be a long way out of date from the current trunk. EDIT: Patch downloaded from here with the 'download diff' link in the top right: https://reviews.apache.org/r/27378/diff/
          Hide
          Johny Rufus added a comment -

          Jeffrey Theobald, can you try and confirm with the smaller patch @ https://reviews.apache.org/r/16516/
          I reviewed the patch, and tried it, it takes care of the utf-8 4 bytes decoding issue.

          Show
          Johny Rufus added a comment - Jeffrey Theobald , can you try and confirm with the smaller patch @ https://reviews.apache.org/r/16516/ I reviewed the patch, and tried it, it takes care of the utf-8 4 bytes decoding issue.
          Hide
          Hari Shreedharan added a comment -

          +1. I am committing the first patch. For a substantial rewrite, lets file a new jira - add a design doc with an extensible interface and use that, rather than doing it in patches directly.

          Show
          Hari Shreedharan added a comment - +1. I am committing the first patch. For a substantial rewrite, lets file a new jira - add a design doc with an extensible interface and use that, rather than doing it in patches directly.
          Hide
          ASF subversion and git services added a comment -

          Commit 344e0accae5675fd3d14b8414531528607865aae in flume's branch refs/heads/trunk from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=344e0ac ]

          FLUME-2215. ResettableFileInputStream can't support ucs-4 character

          (Alexandre Dutra via Hari)

          Show
          ASF subversion and git services added a comment - Commit 344e0accae5675fd3d14b8414531528607865aae in flume's branch refs/heads/trunk from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=344e0ac ] FLUME-2215 . ResettableFileInputStream can't support ucs-4 character (Alexandre Dutra via Hari)
          Hide
          ASF subversion and git services added a comment -

          Commit 4b8b9d631dde73620ac03f03eb48bcee55378a62 in flume's branch refs/heads/flume-1.7 from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=4b8b9d6 ]

          FLUME-2215. ResettableFileInputStream can't support ucs-4 character

          (Alexandre Dutra via Hari)

          Show
          ASF subversion and git services added a comment - Commit 4b8b9d631dde73620ac03f03eb48bcee55378a62 in flume's branch refs/heads/flume-1.7 from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=4b8b9d6 ] FLUME-2215 . ResettableFileInputStream can't support ucs-4 character (Alexandre Dutra via Hari)
          Hide
          Hari Shreedharan added a comment -

          Committed! Thanks Alexandre Dutra!

          Show
          Hari Shreedharan added a comment - Committed! Thanks Alexandre Dutra !
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Flume-trunk-hbase-1 #105 (See https://builds.apache.org/job/Flume-trunk-hbase-1/105/)
          FLUME-2215. ResettableFileInputStream can't support ucs-4 character (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=344e0accae5675fd3d14b8414531528607865aae)

          • flume-ng-core/src/test/java/org/apache/flume/serialization/TestResettableFileInputStream.java
          • flume-ng-core/src/main/java/org/apache/flume/serialization/ResettableFileInputStream.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Flume-trunk-hbase-1 #105 (See https://builds.apache.org/job/Flume-trunk-hbase-1/105/ ) FLUME-2215 . ResettableFileInputStream can't support ucs-4 character (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=344e0accae5675fd3d14b8414531528607865aae ) flume-ng-core/src/test/java/org/apache/flume/serialization/TestResettableFileInputStream.java flume-ng-core/src/main/java/org/apache/flume/serialization/ResettableFileInputStream.java

            People

            • Assignee:
              Alexandre Dutra
              Reporter:
              syntony liu
            • Votes:
              4 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development