  1. Spark
  2. SPARK-6061

File source DStream cannot include old files whose timestamps are before the system time


Details

    Description

      The file source DStream (StreamingContext.fileStream) has a parameter named "newFilesOnly" that controls whether old files are included. It worked fine in 1.1.0 but is broken in 1.2.1: older files are always ignored, no matter what value is set.

      Here is simple code to reproduce the issue:
      https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb

      The reason is that "modTimeIgnoreThreshold" in FileInputDStream::findNewFiles is set to a time close to the system time (the Spark Streaming clock time), so files older than this time are ignored.
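
To illustrate the behavior described above, here is a minimal, self-contained model of the threshold logic. It is not the Spark source. The function names, the parameters (`start_time_ms`, `batch_time_ms`, `remember_window_ms`), and the exact formula (the larger of an initial threshold and the trailing edge of a remember window) are assumptions made for illustration. The sketch only shows why a threshold pinned near the clock time makes "newFilesOnly" ineffective.

```python
def mod_time_ignore_threshold(new_files_only, start_time_ms, batch_time_ms, remember_window_ms):
    """Model of how the ignore threshold might be derived (assumed formula)."""
    # Assumption: newFilesOnly only influences the initial threshold.
    initial_threshold = start_time_ms if new_files_only else 0
    # Assumption: the threshold is also bounded below by the trailing edge of a
    # remember window, which is always close to the current clock time.
    return max(initial_threshold, batch_time_ms - remember_window_ms)


def select_files(mod_times_ms, threshold_ms):
    """Keep only files modified strictly after the threshold."""
    return [t for t in mod_times_ms if t > threshold_ms]


# Hypothetical timeline, in milliseconds.
start_time_ms = 1_000_000       # streaming context started
batch_time_ms = 1_002_000       # current batch time
remember_window_ms = 60_000     # assumed remember window
old_file = 100_000              # modified long before the context started
recent_file = 1_001_000         # modified after the context started

for new_files_only in (True, False):
    threshold = mod_time_ignore_threshold(
        new_files_only, start_time_ms, batch_time_ms, remember_window_ms
    )
    picked = select_files([old_file, recent_file], threshold)
    print(new_files_only, threshold, picked)
```

In this model the old file is dropped for both values of `new_files_only`, because the second argument to `max` stays near the clock time regardless of the flag. This matches the symptom reported here.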

            People

              Assignee: Unassigned
              Reporter: Jack Hu (jhu)
              Votes: 1
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: 1m
                  Remaining Estimate: 1m
                  Time Spent: Not Specified