  1. DataFu
  2. DATAFU-129

New macro - dedup


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0

    Description

      A macro that dedups (de-duplicates) a table based on one or more keys and an ordering field (typically a date-updated field), keeping the most recent record for each key.

      One thing to consider: the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank, which I've added to the test dependencies so that the test can run. Although anyone using Pig typically has PiggyBank on the classpath, this might not always be true. Do we have an alternative (maybe adding it to the jarjar)?

      The macro's definition looks as follows:

      DEFINE dedup(relation, row_key, order_field) returns out {

      relation - relation to dedup
      row_key - field(s) for group by
      order_field - the field for ordering (to find the most recent record)
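
      To make the intended semantics concrete, here is a minimal Python sketch of what the parameters above describe: group rows by the key field(s) and keep the row with the largest ordering value in each group. This only illustrates the behavior, not the macro's Pig implementation (which relies on ExtremalTupleByNthField). The dict-based rows and the tie-breaking rule (first row seen wins) are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]


def dedup(rows: Iterable[Row], row_key: Sequence[str], order_field: str) -> List[Row]:
    """Keep one row per key: the one with the largest order_field value.

    Illustrative only. Assumes rows are dicts and that order_field values
    are comparable; on ties, the first row encountered is kept.
    """
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        # Build the grouping key from one or more fields.
        key = tuple(row[field] for field in row_key)
        current = latest.get(key)
        # Replace the stored row only if this one is strictly more recent.
        if current is None or row[order_field] > current[order_field]:
            latest[key] = row
    return list(latest.values())
```

      For example, given two rows with id 1 and update times 10 and 20, calling dedup(rows, ["id"], "updated") would keep only the row with update time 20.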

      Attachments

        1. DATAFU-129.patch
          5 kB
          Eyal Allweil
        2. DATAFU-129-2.patch
          13 kB
          Eyal Allweil
        3. DATAFU-129-bad.patch
          8 kB
          Eyal Allweil

          People

            Assignee: Eyal Allweil
            Reporter: Eyal Allweil
            Votes: 0
            Watchers: 2
