SystemDS / SYSTEMDS-3178

Builtin for tuple deduplication

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Identify and remove duplicate tuples in the data. Finding duplicates requires computing pairwise similarity, which is expensive. Apply clustering or blocking techniques to divide the data into partitions, and then compute pairwise similarity only within each partition.
      The builtin could be named dedup(), with parameters such as matrix dataset, string similarityMeasure (euclidean, manhattan, cosine, etc.), and boolean returnDuplicates (if TRUE, return only the duplicate rows; if FALSE, return the original dataset without duplicate rows).
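
      To make the proposal concrete, here is a minimal sketch of the blocking idea in plain Python rather than DML. It is not the SystemDS implementation. The `threshold` and `block_key` parameters, the default blocking rule (floor of the first column), and the choice to keep the first row of each duplicate group are all assumptions made for illustration; only `dataset`, `similarityMeasure`, and `returnDuplicates` come from the description above.

```python
import math
from collections import defaultdict


def _distance(a, b, measure):
    """Distance between two equal-length numeric rows under the named measure."""
    if measure == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if measure == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    if measure == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            # Assumption: two zero rows are identical, a zero row is far from any other.
            return 0.0 if norm_a == norm_b else 1.0
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError(f"unsupported similarity measure: {measure}")


def dedup(dataset, similarity_measure="euclidean", return_duplicates=False,
          threshold=1e-9, block_key=None):
    """Sketch of dedup(): block the rows, then compare pairs only inside a block.

    `threshold` and `block_key` are hypothetical parameters, not part of the proposal.
    """
    if block_key is None:
        # Hypothetical default blocking rule: bucket rows by the floor of column 0.
        block_key = lambda row: math.floor(row[0])

    # Step 1: partition row indices into blocks.
    blocks = defaultdict(list)
    for idx, row in enumerate(dataset):
        blocks[block_key(row)].append(idx)

    # Step 2: within each block, flag a row as a duplicate if it lies within
    # `threshold` of a row already kept from that block.
    duplicates = set()
    for indices in blocks.values():
        kept = []
        for i in indices:
            if any(_distance(dataset[i], dataset[k], similarity_measure) <= threshold
                   for k in kept):
                duplicates.add(i)
            else:
                kept.append(i)

    # Step 3: return either the duplicate rows or the deduplicated dataset.
    if return_duplicates:
        return [dataset[i] for i in sorted(duplicates)]
    return [row for i, row in enumerate(dataset) if i not in duplicates]


data = [[1.0, 2.0], [1.0, 2.0], [5.0, 5.0], [1.0, 2.0], [5.0, 6.0]]
unique_rows = dedup(data)
duplicate_rows = dedup(data, return_duplicates=True)
```

      Blocking trades recall for speed: rows that land in different blocks are never compared, so near-duplicates that straddle a block boundary are missed. A real builtin would need a blocking or clustering scheme chosen with that trade-off in mind.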

          People

            Assignee: Unassigned
            Reporter: Shafaq Siddiqi (ssiddiqi)
