  1. Flink
  2. FLINK-3655

Allow comma-separated or multiple directories to be specified for FileInputFormat

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: Core

      Description

      Allow comma-separated or multiple directories to be specified for FileInputFormat so that a DataSource will process the directories sequentially.

      env.readFile("/data/2016/01/01//,/data/2016/01/02//,/data/2016/01/03//")

      in Scala

      env.readFile(paths: Seq[String])
      or
      env.readFile(path: String, otherPaths: String*)

      Wildcard support would be a bonus.

          Activity

          fhueske Fabian Hueske added a comment -

          I don't think there's a particular reason why the PR hasn't been merged.
          Nobody picked it up and it disappeared from the radar in the list of stale PRs :-/

          I'll try to have a look in the next few days.

          soniclavier Vishnu Viswanath added a comment -

          I was looking for this feature. Why wasn't this ever merged?

          githubbot ASF GitHub Bot added a comment -

          Github user gna-phetsarath commented on the pull request:

          https://github.com/apache/flink/pull/1990#issuecomment-221890183

          You are correct, the majority of the changes were in the "generate splits" method and "statistics" methods which included changes to subclasses that used the file path directly. Not as extensive as it appears.

          Also, additional tests were added.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the pull request:

          https://github.com/apache/flink/pull/1990#issuecomment-221866397

          Thanks for opening that contribution.

          Can you sum up the changes you made? That would make the review easier.
          The changes look quite extensive. My gut feeling would be that it should not require so many changes, ideally only an additional loop in the "generate splits" method, and possibly in the "statistics" method.
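
The "additional loop" idea can be sketched without any Flink classes. The block below is only an illustration under stated assumptions: `SplitLoopSketch` and `createSplits` are hypothetical names, `java.nio.file.Path` stands in for Flink's own path type, and each regular file is treated as one split, which is simpler than what a real input format does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not Flink code: shows one outer loop over several input directories.
public class SplitLoopSketch {

    // Returns one "split" (here simply a file path) per regular file, directory by directory.
    public static List<Path> createSplits(List<Path> inputDirs) throws IOException {
        List<Path> splits = new ArrayList<>();
        for (Path dir : inputDirs) { // the additional loop over all configured paths
            try (Stream<Path> entries = Files.list(dir)) {
                splits.addAll(entries
                        .filter(Files::isRegularFile)
                        .sorted() // stable order within a directory
                        .collect(Collectors.toList()));
            }
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        Path day1 = Files.createTempDirectory("day1");
        Path day2 = Files.createTempDirectory("day2");
        Files.createFile(day1.resolve("part-0"));
        Files.createFile(day2.resolve("part-0"));
        // Directories are processed in the order they were given.
        for (Path split : createSplits(Arrays.asList(day1, day2))) {
            System.out.println(split);
        }
    }
}
```

Because the directories are visited in list order, the files of the first directory come before those of the second, which matches the "process the directories sequentially" wording in the description.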

          Gna Phetsarath Gna Phetsarath added a comment -

          There's a pull request for this: https://github.com/apache/flink/pull/1990

          Gna Phetsarath Gna Phetsarath added a comment -

          What's the progress on this ticket, Tian, Li?

          tianli Tian, Li added a comment -

          Thanks, I will do the path list first and use "readFile(FileInputFormat<X> inputFormat, String... filePaths)".

          tianli Tian, Li added a comment -

          Will support wildcards

          Gna Phetsarath Gna Phetsarath added a comment -

          Will you be doing wildcards as well, or should we put that in another ticket?

          mxm Maximilian Michels added a comment - - edited

          Sounds good. It is important to maintain backwards compatibility.

          I'm not sure about the "comma-separated Path string". File names may contain commas. So we might skip that for now and do the path list first.

          I think we could also use readFile(FileInputFormat<X> inputFormat, String... filePaths), which receives the file paths as a String[] filePaths array.
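
The comma concern can be shown with a small, Flink-independent sketch. The class and method names below (`PathArgsSketch`, `splitOnCommas`, `fromVarargs`) and the sample paths are hypothetical and exist only for illustration; they are not part of any Flink API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Flink code: compares comma splitting with a varargs signature.
public class PathArgsSketch {

    // Naive comma splitting: a file name that contains a comma is cut in two.
    public static List<String> splitOnCommas(String paths) {
        return Arrays.asList(paths.split(","));
    }

    // Varargs: each argument is exactly one path, so no delimiter is needed.
    public static List<String> fromVarargs(String... filePaths) {
        return Arrays.asList(filePaths);
    }

    public static void main(String[] args) {
        String tricky = "/data/report,final.csv"; // a single file whose name contains a comma
        System.out.println(splitOnCommas(tricky));                // two broken fragments
        System.out.println(fromVarargs(tricky, "/data/other.csv")); // two intact paths
    }
}
```

With the varargs form the caller decides where one path ends and the next begins, so no escaping rule for commas has to be defined.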

          tianli Tian, Li added a comment - - edited

          I think we may need to use "List<Path> filePaths" instead of "Path filePath" in FileInputFormat.
          In this way, we should also
          1. modify current implementations to support multiple input paths
          2. add functions like setFilePaths, getFilePaths to FileInputFormat, and support comma-separated Path string in ExecutionEnvironment
          3. for backward compatibility, let FileInputFormat.setFilePath set the inputPaths to a one-element list
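
A minimal sketch of points 2 and 3, assuming nothing about the real FileInputFormat beyond the method names mentioned above. `MultiPathFormatSketch` is a hypothetical stand-in class, and `java.nio.file.Path` is used in place of Flink's own Path type so the example stays self-contained.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an input format that holds several paths instead of one.
public class MultiPathFormatSketch {

    private List<Path> filePaths = new ArrayList<>();

    // New setter: accepts any non-empty list of input paths.
    public void setFilePaths(List<Path> paths) {
        if (paths == null || paths.isEmpty()) {
            throw new IllegalArgumentException("At least one input path is required");
        }
        this.filePaths = new ArrayList<>(paths); // defensive copy
    }

    public List<Path> getFilePaths() {
        return Collections.unmodifiableList(filePaths);
    }

    // Existing single-path setter kept for backward compatibility:
    // it stores the path as a one-element list.
    public void setFilePath(Path path) {
        setFilePaths(Collections.singletonList(path));
    }

    public static void main(String[] args) {
        MultiPathFormatSketch format = new MultiPathFormatSketch();
        format.setFilePath(Paths.get("/data/2016/01/01"));
        System.out.println(format.getFilePaths());
        format.setFilePaths(Arrays.asList(Paths.get("/data/2016/01/01"), Paths.get("/data/2016/01/02")));
        System.out.println(format.getFilePaths());
    }
}
```

Callers that only ever used the single-path setter keep working unchanged, while new callers can pass the full list.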

          mxm Maximilian Michels added a comment -

          Hi! Great. Feel free to open a PR. The PR should include tests. Also, could you briefly describe how you want to integrate the feature into the existing code?

          tianli Tian, Li added a comment -

          Hi. I would like to contribute to this issue. Thanks.

          rmetzger Robert Metzger added a comment -

          Thank you for opening a JIRA for this feature request.
          I think it's a good idea and it shouldn't be too difficult to implement.


            People

            • Assignee: Unassigned
            • Reporter: Gna Phetsarath
            • Votes: 1
            • Watchers: 12
