Pig / PIG-1842

Improve Scalability of the XMLLoader for large datasets such as wikipedia

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0, 0.8.0, 0.9.0
    • Fix Version/s: 0.8.1
    • Component/s: impl
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia dataset. Each mapper reads the entire XML file, resulting in extremely slow run times.

      Viraj

      1. TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
        40 kB
        Alan Gates
      2. PIG-1842_2.patch
        16 kB
        Vivek Padmanabhan
      3. PIG-1842_1.patch
        19 kB
        Vivek Padmanabhan

        Activity

        Vivek Padmanabhan added a comment -

        Attaching an initial patch.

        Vivek Padmanabhan added a comment -

        The patch addresses the following issues:
        a) Marking the loader as splittable, except for gz formats
        b) Changing XMLLoader to read its split rather than the entire file
        c) Handling split/record boundary scenarios
        d) Using CBZip2InputStream to handle bzip2 files
        e) Improving the logic of collectTag (i.e., skipping unnecessary reads for the end tag if no start tag is found)

        Manual tests for scalability and functional verification were done for the patch.
        Using the latest Wikipedia dump in bz2 format (10861606 pages; 6.5 GB as bz2), the new loader completed within 3 minutes, while the older version took more than 35 minutes for a simple load, filter null, store script.

        Alan Gates added a comment -

        The patch does not apply cleanly against the trunk. Can you regenerate the patch against the latest trunk?

        Vivek Padmanabhan added a comment -

        Attaching the patch again

        Alan Gates added a comment -

        From reviewing the code it is not clear to me how this splits the XML file. Let's say we have an XML file that looks like:

        <a>
            <b>
                <c>
                </c>
                <c1>
                </c1>
            </b>
        </a>
        <a1>
            <b1>
            </b1>
            <b2>
            </b2>
        </a1>
        

        and the split falls on line "</c1>".

        How far will split 1 read? It seems like it has to read to "</a>" or else the map processing split one will not be able to process this as a coherent document. Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point.

        How does split 2 know where to start? I don't see any code that is telling split 2 to fast forward to the point where split 1 ends.

        All the tests pass just fine.

        Vivek Padmanabhan added a comment -

        Hi Alan,
        Below is how I have handled these cases.

        Note:
        The XMLLoader treats everything from a beginning tag to its end tag as one record, just as a line record reader searches for a newline character.
        Split start and end locations are provided by the default FileInputFormat.

        The steps, described simply:

        * The loader collects the start and end tags and creates a record out of them (XMLLoaderBufferedPositionedInputStream.collectTag).
        * For the begin tag:
          * Read until the tag is found in this block.
          * If the tag is not found and the split end has been reached, there is no record in this split (return an empty array).
          * If a partial tag has been found in the current split, continue reading the rest of the file beyond the split end location, even though the split end has been reached (handled by the condition in the while loop).
        * For the end tag:
          * Read until the end tag is found, even if the split end location is reached.

        >>How far will split 1 read? It seems like it has to read to "</a>" or else the map processing split one will not be able to process this as a coherent document.
        >>Yet from the setting of maxBytesReadable on line 132 it looks to me like it won't read past the end point.

        The other condition (matchBuf.size() > 0) keeps the read going.

        In this case, let's say my tag identifier is <a>. The loader will read until the split end while searching for the beginning tag.
        For the end tag, it reads the rest of the file starting from the last read position. If the split end is reached in between,
        it checks whether it has found a match or a partial match. If not, it proceeds with reading until it finds an end tag.
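
        The steps above can be sketched roughly as follows. This is an illustration in Python, not the patch's Java code: the names records_in_split and load_with_splits are hypothetical, the whole file is held in memory, bytes.find stands in for the byte-by-byte matching done with matchBuf, and only exact tags without attributes or nesting are assumed.

```python
from typing import List


def records_in_split(data: bytes, start: int, end: int, tag: str) -> List[bytes]:
    """Return the records whose begin tag starts inside [start, end).

    Hypothetical sketch: a begin tag is only accepted if its first byte
    lies before the split end, but once accepted, the begin tag and the
    search for the end tag may both run past the split end.
    """
    open_tag = f"<{tag}>".encode()
    close_tag = f"</{tag}>".encode()
    records = []
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        # No begin tag starting before the split end: nothing more to read here.
        if begin == -1 or begin >= end:
            break
        # The end tag is searched for without regard to the split end.
        finish = data.find(close_tag, begin + len(open_tag))
        if finish == -1:
            break  # unterminated record; assumed to be dropped in this sketch
        finish += len(close_tag)
        records.append(data[begin:finish])
        pos = finish
    return records


def load_with_splits(data: bytes, split_size: int, tag: str) -> List[bytes]:
    """Process the file as consecutive fixed-size splits and gather all records."""
    records = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        records.extend(records_in_split(data, start, end, tag))
    return records
```

        Under these assumptions, a split that starts in the middle of a record simply scans forward to the next begin tag, so the partial record is skipped there and is emitted only by the split in which its begin tag starts. Each record is therefore produced exactly once, whatever the split size.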

        Vivek Padmanabhan added a comment -

        I have done manual tests for the split boundary conditions. Please suggest whether, and how, I can do the same with unit tests.

        Alan Gates added a comment -

        I have checked the patch into trunk. I applied it to the 0.8 branch, but got errors in the unit tests. I will attach the results of the 0.8 test run.

        Vivek Padmanabhan added a comment -

        The errors occur because PIG-1839 (XMLLoader will always add an extra empty tuple even if no tags are matched), which corrects these test cases, was not applied to the 0.8 branch.

        Alan Gates added a comment -

        Patch 2 checked into 0.8 branch.


          People

          • Assignee:
            Vivek Padmanabhan
            Reporter:
            Viraj Bhat
          • Votes:
            0
            Watchers:
            1
