Apache Apex Malhar / APEXMALHAR-2309

TimeBasedDedupOperator marks new tuples as duplicates if expired tuples exist


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.6.0
    • Component/s: None
    • Labels: None

    Description

      The deduper marks valid tuples outside the expiry window as duplicates.

      Consider the following configuration (number of buckets = 1):

        <property>
          <name>dt.application.DedupTestApp.operator.Deduper.prop.expireBefore</name>
          <value>10</value>
        </property>
        <property>
          <name>dt.application.DedupTestApp.operator.Deduper.prop.bucketSpan</name>
          <value>10</value>
        </property>
      

      The input data is:

      "10",1474614305000,"Test"
      "11",1474614315000,"Test"
      "10",1474614325000,"Test"
      

      The third tuple is valid because it falls outside the expiry window. It is nevertheless marked as a duplicate because the first tuple, although expired, is still present in Bucket.flash.

      The issue occurs when the expiry duration is shorter than the checkpointing interval.
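
The behaviour described above can be reproduced with a small stand-alone model. This is only a sketch and does not use the Malhar API: the class `DedupExpirySketch`, its `Tuple` type, and the `checkExpiryOnLookup` flag are hypothetical names, the `HashMap` merely stands in for the in-memory bucket contents, and `expireBefore` is assumed to be in seconds (the sample timestamps are 10 seconds apart). The guard shown is one possible remedy, not necessarily the change that shipped in 3.6.0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupExpirySketch {
    // Assumption: expireBefore is expressed in seconds.
    static final long EXPIRE_BEFORE_SECONDS = 10;

    /** Hypothetical tuple: a dedup key plus an event time in epoch milliseconds. */
    static final class Tuple {
        final String key;
        final long timeMillis;

        Tuple(String key, long timeMillis) {
            this.key = key;
            this.timeMillis = timeMillis;
        }
    }

    /**
     * Returns one flag per input tuple: true if it was reported as a duplicate.
     * The map is never purged, which models a run in which no checkpoint
     * (and therefore no cleanup of in-memory state) happens between tuples.
     */
    static List<Boolean> run(List<Tuple> tuples, boolean checkExpiryOnLookup) {
        Map<String, Long> inMemory = new HashMap<>();
        List<Boolean> duplicateFlags = new ArrayList<>();
        long latestMillis = Long.MIN_VALUE;

        for (Tuple t : tuples) {
            // The expiry boundary trails the latest event time seen so far.
            latestMillis = Math.max(latestMillis, t.timeMillis);
            long expiryBoundary = latestMillis - EXPIRE_BEFORE_SECONDS * 1000;

            Long seenAt = inMemory.get(t.key);
            // Without the guard, any stored entry counts as a match,
            // even one that is older than the expiry boundary.
            boolean duplicate = seenAt != null
                && (!checkExpiryOnLookup || seenAt >= expiryBoundary);

            duplicateFlags.add(duplicate);
            if (!duplicate) {
                inMemory.put(t.key, t.timeMillis);
            }
        }
        return duplicateFlags;
    }

    public static void main(String[] args) {
        List<Tuple> input = Arrays.asList(
            new Tuple("10", 1474614305000L),
            new Tuple("11", 1474614315000L),
            new Tuple("10", 1474614325000L));

        // prints "without expiry check: [false, false, true]"
        System.out.println("without expiry check: " + run(input, false));
        // prints "with expiry check:    [false, false, false]"
        System.out.println("with expiry check:    " + run(input, true));
    }
}
```

In this model the third tuple arrives 20 seconds after the first, so the expiry boundary (1474614315000) is already past the first tuple's timestamp. Ignoring stale entries at lookup time avoids depending on when the in-memory state is actually cleaned up.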

            People

              francisf Francis Fernandes