FLINK-29617

Starting the SourceCoordinator of an HDFS FileSource takes too much time when the JobMaster starts


Details

    Description

      Scenario:

      One of our users runs a Flink batch job to compact the small files produced in one day. Flink version: 1.15
      He split the pipeline into 24 parts, one per hour, so the job has 24 sources.
       
      I found that starting the SourceCoordinator of each HDFS file source takes too much time when the JobMaster starts,
       
       as shown in the attached screenshots.
       

       

      Root Cause:

      After investigating, I found the root cause:

      1. AbstractFileSource calls enumerateSplits when createEnumerator is invoked, so all splits are enumerated while the SourceCoordinator is starting.
      2. NonSplittingRecursiveEnumerator has to fetch the block locations of every file, which is a heavy I/O operation.
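
      To illustrate the cost pattern described above, here is a minimal, self-contained Java model. It is not Flink's code: `LocationLookup`, `Split`, the file paths, and the figure of 1000 files per source are assumptions made up for illustration. It only shows that eager enumeration issues one blocking location lookup per file, on the calling thread, for every source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of eager split enumeration; NOT Flink's actual implementation. */
class EagerEnumerationSketch {

    /** Hypothetical stand-in for a file system call that returns block locations. */
    interface LocationLookup {
        String[] hostsFor(String path); // on HDFS this would be a remote round trip
    }

    /** Hypothetical split: a file path plus the hosts that store its blocks. */
    static final class Split {
        final String path;
        final String[] hosts;

        Split(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Builds one split per file and looks up its hosts synchronously. */
    static List<Split> enumerateSplits(List<String> files, LocationLookup lookup) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            splits.add(new Split(file, lookup.hostsFor(file))); // blocks the caller
        }
        return splits;
    }

    /** Counts how many lookups are issued when every source enumerates eagerly. */
    static int countLookups(int sources, int filesPerSource) {
        AtomicInteger calls = new AtomicInteger();
        LocationLookup counting = path -> {
            calls.incrementAndGet();
            return new String[] {"host-1"};
        };
        for (int s = 0; s < sources; s++) {
            List<String> files = new ArrayList<>();
            for (int f = 0; f < filesPerSource; f++) {
                files.add("/data/hour=" + s + "/part-" + f); // made-up path layout
            }
            enumerateSplits(files, counting);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        // 24 sources (one per hour), assuming 1000 small files each.
        System.out.println(countLookups(24, 1000)); // prints 24000
    }
}
```

      In this model the number of blocking lookups grows with the total number of files across all sources, which matches the observation that many small files make the start-up slow.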

       


       

      Suggestion

      1. Add an option to FileSource to disable the location fetcher.
      2. Move the location fetcher into the IOExecutor, so that it no longer blocks the start of the SourceCoordinator.
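
      The two suggestions could be sketched roughly as follows. This is a plain Java model under stated assumptions, not a patch against Flink: the `fetchLocations` flag, the method names, and the use of `CompletableFuture` with a generic `ExecutorService` are all hypothetical and do not correspond to existing FileSource options or APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Sketch of both suggestions; all names here are hypothetical. */
class LocationFetchSketch {

    static final String[] NO_HOSTS = new String[0];

    /** Suggestion 1: skip the lookup entirely when the (hypothetical) flag is false. */
    static List<String[]> resolveHosts(
            List<String> files, Function<String, String[]> lookup, boolean fetchLocations) {
        List<String[]> hosts = new ArrayList<>();
        for (String file : files) {
            hosts.add(fetchLocations ? lookup.apply(file) : NO_HOSTS);
        }
        return hosts;
    }

    /** Suggestion 2: run the lookups on an I/O executor so the caller returns at once. */
    static CompletableFuture<List<String[]>> resolveHostsAsync(
            List<String> files, Function<String, String[]> lookup, ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(() -> resolveHosts(files, lookup, true), ioExecutor);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("/data/hour=00/part-0", "/data/hour=00/part-1");
        Function<String, String[]> lookup = path -> new String[] {"host-1"}; // fake lookup
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        try {
            // The future is returned immediately; the lookups run on the executor.
            CompletableFuture<List<String[]>> pending = resolveHostsAsync(files, lookup, ioExecutor);
            System.out.println(pending.join().size()); // prints 2
            // With the flag disabled, no lookup happens and every split has no hosts.
            System.out.println(resolveHosts(files, lookup, false).get(0).length); // prints 0
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

      Disabling the lookup trades locality-aware split assignment for a faster start, while the asynchronous variant keeps locality information but needs the enumerator to handle splits that become available later.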

      Attachments

        1. image-2022-10-13-19-02-18-555.png
          1.15 MB
          LI Mingkun
        2. image-2022-10-13-19-02-29-620.png
          193 kB
          LI Mingkun
        3. image-2022-10-13-19-02-35-422.png
          1.00 MB
          LI Mingkun

        Activity

          People

            luoyuxia luoyuxia
            dangshazi LI Mingkun