Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-2517

Performance issue: SimpleDateFormat constructor takes 30% of HDFSEventSink.process()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0.1
    • Fix Version/s: 1.6.0
    • Component/s: Sinks+Sources
    • Labels:
    • Environment:

      linux i686
      java version "1.7.0_45"

      Description

      I started investigating why HDFS sink has so bad throughput in v 1.5.0.0. It seems to be better in 1.6.0.0 (current trunk).

      PseudoTx channel was filling up, because HDFS Sink could not write as fast as data coming from source.

      Profiling from jconsole revealed that 30% of the time spent in HDFSEventSink.process() method is taken by constructing SimpleDateFormat objects. SimpleDateFormat object is notoriously a heavy and time consuming object to create. It is also not thread-safe.

      It is used in HDFS Sink to calculate the path that contains date-time wildcards. I will provide a patch to cache SimpleDateFormat objects for thread. With this patch, the PseudoTx channel I used for testing was not constantly filling up, and throughput was much better.

      1. flume_2517.patch
        2 kB
        Pal Konyves
      2. flume_2517.png
        660 kB
        Pal Konyves

        Activity

        Hide
        pkonyves Pal Konyves added a comment -

        Patch attached with cached SimpleDateFormat instances.
        Cleanup of HashMap is intentionally left out, since there isn't a scenario of an always changing format string. format strings are static and permanent per configuration.

        Show
        pkonyves Pal Konyves added a comment - Patch attached with cached SimpleDateFormat instances. Cleanup of HashMap is intentionally left out, since there isn't a scenario of an always changing format string. format strings are static and permanent per configuration.
        Hide
        pkonyves Pal Konyves added a comment -

        jconsole and profiling output added before and after patch

        Show
        pkonyves Pal Konyves added a comment - jconsole and profiling output added before and after patch
        Hide
        hshreedharan Hari Shreedharan added a comment -

        Why does the map need to be thread-local?

        Show
        hshreedharan Hari Shreedharan added a comment - Why does the map need to be thread-local?
        Hide
        pkonyves Pal Konyves added a comment - - edited

        because as far as i know, SimpleDateFormat is not thread-safe. it stores intermediate values during formatting. Storing them in a ThreadLocal, fixes the concurrent access issue, and it is faster than using synchronized during using the SimpleDateFormat object

        Show
        pkonyves Pal Konyves added a comment - - edited because as far as i know, SimpleDateFormat is not thread-safe. it stores intermediate values during formatting. Storing them in a ThreadLocal, fixes the concurrent access issue, and it is faster than using synchronized during using the SimpleDateFormat object
        Hide
        hshreedharan Hari Shreedharan added a comment -

        So you are caching data formats per thread? I don't think there is a solution - seems like the one you wrote this might be the only solution. Let me think about it

        Show
        hshreedharan Hari Shreedharan added a comment - So you are caching data formats per thread? I don't think there is a solution - seems like the one you wrote this might be the only solution. Let me think about it
        Hide
        pkonyves Pal Konyves added a comment - - edited

        I think the other option is below, but then we also need to synchronize access to the Map for thread-safety and visibility, maybe use a ConcurrentMap. The problem with this is, that if many threads access the SimpleDateFormat object, the mutex can be slow, although, I didn't make measurements, how heavy the concurrent access is.

        Map<String, SimpleDateFormat> cache;
        SimpleDateFormat sdf = cache.get(formatString);
        
        synchronized(sdf) {
          sdf.format(formatString);
        }
        

        ... I don't want to stick to the thread-local solution, I just would like to avoid creating SimpleDateFormat object for every event.

        Show
        pkonyves Pal Konyves added a comment - - edited I think the other option is below, but then we also need to synchronize access to the Map for thread-safety and visibility, maybe use a ConcurrentMap. The problem with this is, that if many threads access the SimpleDateFormat object, the mutex can be slow, although, I didn't make measurements, how heavy the concurrent access is. Map< String , SimpleDateFormat> cache; SimpleDateFormat sdf = cache.get(formatString); synchronized (sdf) { sdf.format(formatString); } ... I don't want to stick to the thread-local solution, I just would like to avoid creating SimpleDateFormat object for every event.
        Hide
        hshreedharan Hari Shreedharan added a comment -

        If we keep a map per sink, we don't need the synchronization, correct? Each sink only processes one event at a time, so that should be fine. Having one thread per map should also work around the synchronization issues.

        Show
        hshreedharan Hari Shreedharan added a comment - If we keep a map per sink, we don't need the synchronization, correct? Each sink only processes one event at a time, so that should be fine. Having one thread per map should also work around the synchronization issues.
        Hide
        pkonyves Pal Konyves added a comment -

        I didn't look into the code in that much detail that uses the BucketPath class. Because one HDFSSink can have multiple writer threads, I assumed, the problematic method can be called from different threads. So it might not make sense to make it thread-safe in this situation.

        On the other hand, if someone in the future wants to call BucketPath#replaceShorthand from multiple threads, it might malfunction. So I think we better make it thread-safe already.

        Here they suggest the same solutions for using SimpleDateFormat: http://stackoverflow.com/questions/10411944/java-text-simpledateformat-not-thread-safe

        Show
        pkonyves Pal Konyves added a comment - I didn't look into the code in that much detail that uses the BucketPath class. Because one HDFSSink can have multiple writer threads, I assumed, the problematic method can be called from different threads. So it might not make sense to make it thread-safe in this situation. On the other hand, if someone in the future wants to call BucketPath#replaceShorthand from multiple threads, it might malfunction. So I think we better make it thread-safe already. Here they suggest the same solutions for using SimpleDateFormat: http://stackoverflow.com/questions/10411944/java-text-simpledateformat-not-thread-safe
        Hide
        hshreedharan Hari Shreedharan added a comment -

        +1. This looks good to me. Let me run the tests

        Show
        hshreedharan Hari Shreedharan added a comment - +1. This looks good to me. Let me run the tests
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 77d56e95ead7a04499aa83d1a78fcfbd957b20c7 in flume's branch refs/heads/trunk from Hari Shreedharan
        [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=77d56e9 ]

        FLUME-2517. Cache SimpleDataFormat objects in bucketwriter for better performance.

        (Pal Konyves via Hari)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 77d56e95ead7a04499aa83d1a78fcfbd957b20c7 in flume's branch refs/heads/trunk from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=77d56e9 ] FLUME-2517 . Cache SimpleDataFormat objects in bucketwriter for better performance. (Pal Konyves via Hari)
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 0c1ff2fec6c3b815d54f056c629c8744736cccbb in flume's branch refs/heads/flume-1.6 from Hari Shreedharan
        [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=0c1ff2f ]

        FLUME-2517. Cache SimpleDataFormat objects in bucketwriter for better performance.

        (Pal Konyves via Hari)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 0c1ff2fec6c3b815d54f056c629c8744736cccbb in flume's branch refs/heads/flume-1.6 from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=0c1ff2f ] FLUME-2517 . Cache SimpleDataFormat objects in bucketwriter for better performance. (Pal Konyves via Hari)
        Hide
        hshreedharan Hari Shreedharan added a comment -

        Committed! Thanks Pal!

        Show
        hshreedharan Hari Shreedharan added a comment - Committed! Thanks Pal!
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in flume-trunk #681 (See https://builds.apache.org/job/flume-trunk/681/)
        FLUME-2517. Cache SimpleDataFormat objects in bucketwriter for better performance. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=77d56e95ead7a04499aa83d1a78fcfbd957b20c7)

        • flume-ng-core/src/main/java/org/apache/flume/formatter/output/BucketPath.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in flume-trunk #681 (See https://builds.apache.org/job/flume-trunk/681/ ) FLUME-2517 . Cache SimpleDataFormat objects in bucketwriter for better performance. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=77d56e95ead7a04499aa83d1a78fcfbd957b20c7 ) flume-ng-core/src/main/java/org/apache/flume/formatter/output/BucketPath.java
        Hide
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Flume-trunk-hbase-98 #40 (See https://builds.apache.org/job/Flume-trunk-hbase-98/40/)
        FLUME-2517. Cache SimpleDataFormat objects in bucketwriter for better performance. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=77d56e95ead7a04499aa83d1a78fcfbd957b20c7)

        • flume-ng-core/src/main/java/org/apache/flume/formatter/output/BucketPath.java
        Show
        hudson Hudson added a comment - UNSTABLE: Integrated in Flume-trunk-hbase-98 #40 (See https://builds.apache.org/job/Flume-trunk-hbase-98/40/ ) FLUME-2517 . Cache SimpleDataFormat objects in bucketwriter for better performance. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=77d56e95ead7a04499aa83d1a78fcfbd957b20c7 ) flume-ng-core/src/main/java/org/apache/flume/formatter/output/BucketPath.java
        Hide
        pkonyves Pal Konyves added a comment -

        thanks

        Show
        pkonyves Pal Konyves added a comment - thanks
        Hide
        ralph.goers@dslextreme.com Ralph Goers added a comment -
        Show
        ralph.goers@dslextreme.com Ralph Goers added a comment - Wouldn't it be simpler to use Apache Commons Lang's FastDateFormat? http://commons.apache.org/proper/commons-lang/javadocs/api-release/org/apache/commons/lang3/time/FastDateFormat.html
        Hide
        hshreedharan Hari Shreedharan added a comment -

        I don;t have much experience with that? Is it a drop in replacement, and is the performance much better.

        Show
        hshreedharan Hari Shreedharan added a comment - I don;t have much experience with that? Is it a drop in replacement, and is the performance much better.
        Hide
        ralph.goers@dslextreme.com Ralph Goers added a comment -

        It is mostly compatible and it is thread-safe. See the Javadoc for details. The problem this issue is addressing is the contention on locking the SimpleDateFormat, which using FastDateFormat would solve.

        Note that it also has a DateParser for parsing dates.

        http://daniel.mitterdorfer.name/articles/2014/benchmarking-digging-deeper/ has a performance comparison of SimpleDateFormat, FastDateFormat and a Thread-scoped SimpleDateFormat.

        Show
        ralph.goers@dslextreme.com Ralph Goers added a comment - It is mostly compatible and it is thread-safe. See the Javadoc for details. The problem this issue is addressing is the contention on locking the SimpleDateFormat, which using FastDateFormat would solve. Note that it also has a DateParser for parsing dates. http://daniel.mitterdorfer.name/articles/2014/benchmarking-digging-deeper/ has a performance comparison of SimpleDateFormat, FastDateFormat and a Thread-scoped SimpleDateFormat.
        Hide
        pkonyves Pal Konyves added a comment -

        I was not aware of this. I think FastDateFormat is also a good solution, if apache commons-lang is already on the dependency list.

        Show
        pkonyves Pal Konyves added a comment - I was not aware of this. I think FastDateFormat is also a good solution, if apache commons-lang is already on the dependency list.

          People

          • Assignee:
            pkonyves Pal Konyves
            Reporter:
            pkonyves Pal Konyves
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development