  1. Spark
  2. SPARK-6061

File source DStream cannot include old files whose timestamps are earlier than the system time

    Details

      Description

      The file source DStream (StreamingContext.fileStream) has a parameter named "newFilesOnly" that controls whether old files are included. It worked fine in 1.1.0 but is broken in 1.2.1: older files are always ignored, no matter what value is set.

      Here is simple code to reproduce the issue:
      https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
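
For readers who cannot open the gist, here is a minimal sketch of the kind of call involved. It is not the contents of the gist: the directory path, application name, master URL, and batch interval are placeholders chosen for illustration, and running it requires Spark Streaming on the classpath.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OldFilesSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration, not taken from the report.
    val conf = new SparkConf().setMaster("local[2]").setAppName("OldFilesSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // newFilesOnly = false asks the stream to also pick up files that
    // already exist in the directory when the stream starts.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "/tmp/input",          // hypothetical directory containing older files
      (path: Path) => true,  // accept every file name
      newFilesOnly = false
    ).map(_._2.toString)

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the behaviour described in this report, files already present in the directory would be expected to show up in the output on 1.1.0 but not on 1.2.1.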

      The reason is that "modTimeIgnoreThreshold" in FileInputDStream::findNewFiles is set to a time close to the system time (the Spark Streaming clock time), so files older than this time are ignored.
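
The effect described above can be illustrated with a small, Spark-free model. This is a sketch under assumptions, not a quote of the Spark source: the object and function names, the `rememberWindowMs` parameter, and the use of `max` to combine the two candidate thresholds are all hypothetical and serve only to show how a threshold pinned near the clock time overrides `newFilesOnly = false`.

```scala
object ThresholdModel {
  // Threshold one would expect from the flag alone:
  // 0 (accept everything) when old files are wanted.
  def initialThreshold(newFilesOnly: Boolean, startTimeMs: Long): Long =
    if (newFilesOnly) startTimeMs else 0L

  // Assumed 1.2.1-style behaviour: the threshold never falls further
  // behind the clock than a short remember window (hypothetical parameter).
  def effectiveThreshold(newFilesOnly: Boolean, startTimeMs: Long,
                         nowMs: Long, rememberWindowMs: Long): Long =
    math.max(initialThreshold(newFilesOnly, startTimeMs), nowMs - rememberWindowMs)

  // A file is picked up only if it was modified after the threshold.
  def isSelected(modTimeMs: Long, thresholdMs: Long): Boolean =
    modTimeMs > thresholdMs

  def main(args: Array[String]): Unit = {
    val hourMs = 3600000L
    val now = 10 * hourMs
    val oldFileModTime = now - hourMs // file written one hour ago

    val expected = initialThreshold(newFilesOnly = false, startTimeMs = now)
    val actual = effectiveThreshold(newFilesOnly = false, startTimeMs = now,
                                    nowMs = now, rememberWindowMs = 60000L)

    println(s"flag-only threshold selects old file: ${isSelected(oldFileModTime, expected)}")
    println(s"clock-based threshold selects old file: ${isSelected(oldFileModTime, actual)}")
  }
}
```

In this model the hour-old file passes the flag-only threshold but is rejected once the threshold tracks the clock, which matches the symptom reported here.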

              People

              • Assignee:
                Unassigned
                Reporter:
                jhu Jack Hu
              • Votes:
                1
                Watchers:
                5

                  Time Tracking

                  Estimated: 1m
                  Remaining: 1m
                  Logged: Not Specified