PIG-1561

XMLLoader in Piggybank does not support bz2 or gzip compressed XML files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0, 0.8.0
    • Fix Version/s: 0.8.1
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I have a simple Pig script that uses XMLLoader from a freshly built Piggybank to load a gzip-compressed XML file:

      register piggybank.jar;
      A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
      B = limit A 1;
      dump B;
      --store B into '/user/viraj/handlegz' using PigStorage();
      

      This returns an empty tuple:

      ()
      

      If you supply the uncompressed XML file instead, you get the expected element:

      (<property>
          <name>mapred.capacity-scheduler.queue.my.capacity</name>
          <value>10</value>
          <description>Percentage of the number of slots in the cluster that are
            guaranteed to be available for jobs in this queue.
          </description>    
        </property>)
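The symptom above is consistent with a loader that scans the raw compressed bytes for the `property` tag, although the issue does not state the root cause. As a minimal sketch under that assumption, the following Python snippet (not Piggybank code; `extract_elements` and the sample XML are hypothetical) shows that a naive tag scan finds nothing in gzip-compressed bytes but finds the elements once the data is decompressed first.

```python
import gzip
import re

# Hypothetical sample, loosely modeled on the capacity-scheduler file above.
XML = b"""<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.my.capacity</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.my.user-limit</name>
    <value>100</value>
  </property>
</configuration>
"""


def extract_elements(data: bytes, tag: str) -> list:
    """Return every <tag>...</tag> block in data (naive, no nesting support)."""
    name = re.escape(tag.encode())
    pattern = re.compile(rb"<" + name + rb"[\s>].*?</" + name + rb">", re.DOTALL)
    return pattern.findall(data)


compressed = gzip.compress(XML)

# Scanning the compressed bytes directly: the tag text is not present as-is.
raw_matches = extract_elements(compressed, "property")

# Decompressing first restores the original XML, so the scan succeeds.
decoded_matches = extract_elements(gzip.decompress(compressed), "property")

print(len(raw_matches), len(decoded_matches))
```

In a Hadoop loader the equivalent step would typically be to wrap the input stream with the codec that matches the file extension before scanning for tags; whether PIG-1561-1.patch does exactly this is not shown in this issue.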
      
      1. PIG-1561-1.patch
        6 kB
        Vivek Padmanabhan

        Activity

        Transition                 Time In Source Status   Execution Times   Last Executer       Last Execution Date
        Open -> In Progress        143d 4h 55m             1                 Vivek Padmanabhan   14/Jan/11 05:46
        In Progress -> Resolved    5d 19h 31m              1                 Daniel Dai          20/Jan/11 01:17
        Resolved -> Closed         94d 22h 44m             1                 Daniel Dai          25/Apr/11 01:02
        Daniel Dai made changes -
        Fix Version/s 0.8.1 [ 12316393 ]
        Fix Version/s 0.8.0 [ 12314562 ]
        Daniel Dai made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Daniel Dai added a comment -

        Also committed to the 0.8 branch.
        Daniel Dai made changes -
        Fix Version/s 0.8.0 [ 12314562 ]
        Fix Version/s 0.9.0 [ 12315191 ]
        Daniel Dai made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Daniel Dai added a comment -

        All tests pass. Patch committed to trunk. Thanks Vivek!
        Daniel Dai made changes -
        Fix Version/s 0.9.0 [ 12315191 ]
        Vivek Padmanabhan added a comment -

        In the current XML loader, XMLLoaderBufferedPositionedInputStream reads the entire XML file without considering the split start and end locations.
        Hence, if an XML file is larger than the block size, the MapReduce job runs multiple mappers, but the loader in every mapper loads the entire XML file.
        For example, with a 256 MB XML file and a 128 MB block size there will be two mappers, and because of the loader both will read the entire file regardless of the split boundaries. This is functionally wrong, which is why I marked the loader as unsplittable.
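To illustrate the behavior described in this comment, here is a small Python simulation (not the Piggybank implementation; `read_records` and the sample data are hypothetical). It compares two mappers that each read the whole file with a split-aware reader. The split-aware variant assumes a common convention: a record belongs to the split in which its opening tag starts, and the reader may run past the split end to finish that record.

```python
OPEN_TAG, CLOSE_TAG = b"<property>", b"</property>"


def read_records(data: bytes, start: int, end: int, honor_split: bool) -> list:
    """Return the <property> records one simulated mapper would emit.

    honor_split=False mimics the behavior described above: the split
    boundaries are ignored and the whole file is read.
    honor_split=True emits only records whose opening tag starts in [start, end).
    """
    pos = start if honor_split else 0
    limit = end if honor_split else len(data)
    records = []
    while True:
        begin = data.find(OPEN_TAG, pos)
        if begin == -1 or begin >= limit:
            break  # no further record starts inside this split
        finish = data.find(CLOSE_TAG, begin)
        if finish == -1:
            break  # truncated record, nothing more to emit
        finish += len(CLOSE_TAG)
        records.append(data[begin:finish])  # may extend past `end`
        pos = finish
    return records


# Hypothetical file with six records, cut into two equal "splits".
data = (
    b"<configuration>"
    + b"".join(
        b"<property><name>k%d</name><value>%d</value></property>" % (i, i)
        for i in range(6)
    )
    + b"</configuration>"
)
mid = len(data) // 2
splits = [(0, mid), (mid, len(data))]

naive = [r for s, e in splits for r in read_records(data, s, e, honor_split=False)]
aware = [r for s, e in splits for r in read_records(data, s, e, honor_split=True)]

print(len(naive), len(aware))
```

With two splits, the boundary-ignoring reader emits every record twice, while the split-aware reader emits each record exactly once. Marking the loader unsplittable avoids the duplication in a different way, by ensuring only one mapper ever reads a given file.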
        Daniel Dai added a comment -

        Patch looks good. The only concern is that we mark it unsplittable. Did you find out why we cannot split? Is neither bz2 nor gz splittable?
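On the question of splittability: a gzip stream has to be decoded from its beginning, so a mapper cannot start reading at an arbitrary split offset. The Python sketch below (an illustration only, with a hypothetical helper `try_decompress_from`) shows decompression succeeding from offset 0 and failing from the middle of the same gzip data. bzip2 is block-oriented and can be split in principle, but whether a given Hadoop or Pig version does so is outside what this issue states.

```python
import gzip
import zlib

# Hypothetical payload: a repeated record, large enough to compress well.
payload = b"<property><name>k</name><value>1</value></property>" * 200
compressed = gzip.compress(payload)


def try_decompress_from(data: bytes, offset: int):
    """Try to gunzip data starting at offset; return None if that fails."""
    try:
        return gzip.decompress(data[offset:])
    except (OSError, EOFError, zlib.error):
        # Missing gzip header or corrupt deflate data: cannot start here.
        return None


from_start = try_decompress_from(compressed, 0)
from_middle = try_decompress_from(compressed, len(compressed) // 2)

print(from_start == payload, from_middle is None)
```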
        Vivek Padmanabhan made changes -
        Attachment PIG-1561-1.patch [ 12468366 ]
        Vivek Padmanabhan added a comment -

        Attaching an initial patch for the issue. Please review.
        Vivek Padmanabhan made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Viraj Bhat made changes -
        Field Original Value New Value
        Assignee Vivek Padmanabhan [ vivekp ]
        Affects Version/s 0.8.0 [ 12314562 ]
        Viraj Bhat created issue -

          People

          • Assignee:
            Vivek Padmanabhan
            Reporter:
            Viraj Bhat