  1. Pig
  2. PIG-3373

XMLLoader returns non-matching nodes when a tag name spans through the block boundary


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: site
    • Fix Version/s: 0.13.0
    • Component/s: piggybank
    • Patch Info: Patch Available
    • I added a new patch that fixes this bug. It turns out that the bug occurs only when the input file is .bz2-compressed and the non-matching tag spans two file splits in the compressed file. Because the compression is virtually non-deterministic, it is almost impossible to hand-craft an input that triggers the bug, so I included a random generator that produces the test case.
      Since I don't like relying on a randomly generated file to expose a bug (it is non-deterministic by definition), I also attached the test file for reference.
      The fix is the same as in the previous patch, but this time the test fails without it.
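
      To make the testing approach concrete, here is a hedged sketch of a seeded generator. It is written in Python rather than the Java used by Pig and is not taken from the attached patches. It writes a .bz2 file that mixes matching `<event>` nodes with near-miss `<eventually>` nodes and returns the number of nodes a correct loader should report. The function names, node layout, and parameters are all assumptions for illustration. Fixing the seed makes the file reproducible, although nothing here guarantees that a near-miss tag actually lands on a split boundary.

```python
import bz2
import random
import re
from pathlib import Path


def write_test_file(path: Path, node_count: int, seed: int) -> int:
    """Write a bz2-compressed XML file and return the expected <event> count.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    lines = ["<events>"]
    expected = 0
    for i in range(node_count):
        # Random-length payload so tag positions vary from node to node.
        payload = "".join(rng.choice("abcdefghij") for _ in range(rng.randint(5, 200)))
        if rng.random() < 0.5:
            lines.append(f'<event id="{i}">{payload}</event>')
            expected += 1
        else:
            # Near-miss tag: shares the "event" prefix but must not match.
            lines.append(f'<eventually id="{i}">{payload}</eventually>')
    lines.append("</events>")
    path.write_bytes(bz2.compress("\n".join(lines).encode("utf-8")))
    return expected


def count_events(path: Path) -> int:
    """Reference count computed on the whole decompressed file (no splits)."""
    text = bz2.decompress(path.read_bytes()).decode("utf-8")
    # Require whitespace or '>' right after the name, so <eventually> and
    # the <events> root element are not counted.
    return len(re.findall(r"<event[\s>]", text))
```

      A test could then run the loader over the generated file with a small split size and compare its record count with the value returned by `write_test_file`.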

    Description

      When a node's start tag spans two blocks, the node is returned even if its tag name does not match the requested type.
      Example: for the following input file:

      <event id="3423">
      <ev
      -------- BLOCK BOUNDARY
      entually id="dfasd">

      XMLLoader with tag type 'event' should return only the first node, but it actually returns both.
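
      The expected behaviour can be illustrated with a minimal sketch, in Python rather than the Java of piggybank's XMLLoader and not taken from any of the attached patches. The hypothetical `find_start_tags` scans input that arrives in arbitrary chunks (standing in for blocks or splits) and accepts a start tag only when the character right after the tag name is whitespace, `>`, or `/`. An incomplete tag at the end of a chunk is carried over rather than judged on a partial name. The sketch assumes no `>` inside attribute values and ignores comments and CDATA.

```python
from typing import Iterable, List

# Characters that may follow a tag name inside a start tag.
TAG_NAME_TERMINATORS = set(" \t\r\n>/")


def find_start_tags(chunks: Iterable[str], tag: str) -> List[str]:
    """Return start tags named exactly `tag`, scanning across chunk boundaries.

    Hypothetical helper for illustration; `chunks` stands in for file blocks.
    """
    prefix = "<" + tag
    matches: List[str] = []
    buffer = ""  # unconsumed text carried over from earlier chunks

    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("<")
            if start == -1:
                buffer = ""  # no tag starts in the text seen so far
                break
            end = buffer.find(">", start)
            if end == -1:
                # The tag is incomplete: keep it and wait for the next chunk
                # instead of deciding on a partial name.
                buffer = buffer[start:]
                break
            candidate = buffer[start:end + 1]
            after_name = candidate[len(prefix):len(prefix) + 1]
            # A bare prefix match is not enough: "<eventually" starts with
            # "<event", so the next character must terminate the name.
            if candidate.startswith(prefix) and after_name in TAG_NAME_TERMINATORS:
                matches.append(candidate)
            buffer = buffer[end + 1:]
    return matches


# The input from the description, split at the block boundary.
chunks = ['<event id="3423">\n<ev', 'entually id="dfasd">\n']
print(find_start_tags(chunks, "event"))  # ['<event id="3423">']
```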

      Attachments

        1. bad-file.xml.bz2
          209 kB
          Ahmed Eldawy
        2. PIG3373_1.patch
          4 kB
          Ahmed Eldawy
        3. PIG3373_2.patch
          3 kB
          Ahmed Eldawy
        4. PIG3373_3.patch
          7 kB
          Ahmed Eldawy
        5. PIG3373.patch
          4 kB
          Ahmed Eldawy
        6. test-file-2.xml.bz2
          119 kB
          Ahmed Eldawy


          People

            Assignee: Ahmed Eldawy (aseldawy)
            Reporter: Ahmed Eldawy (aseldawy)
            Votes: 0
            Watchers: 4
