Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-930

merge join should handle compressed bz2 sorted files

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      There are two issues - POLoad which is used to read the right side input does not handle bz2 files right now. This needs to be fixed.
      Further inn the index map job we bindTo(startOfBlockOffSet) (this will internally discard first tuple if offset > 0). Then we do the following:

      While(tuple survives pipeline) {
      
        Pos =  getPosition()
      
        getNext() 
      
        run the tuple  through pipeline in the right side which could have filter
      
      }
      
      Emit(key, pos, filename).
      

      Then in the map job which does the join, we bindTo(pos > 0 ? pos 1 : pos) (we do pos -1 because bindTo will discard first tuple for pos> 0). Then we do getNext()
      Now in bz2 compressed files, getPosition() returns a position which is not really accurate. The problem is it could be a position in the middle of a compressed bz2 block. Then when we use that position to bindTo() in the final map job, the code would first hunt for a bz2 block header thus skipping the whole current bz2 block.

      Attachments

        Activity

          People

            Unassigned Unassigned
            pkamath Pradeep Kamath
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated: