Tika / TIKA-780

Optimize loading of the media type registry

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels:
      None

      Description

      Parsing our fairly large media type registry takes quite a while (hundreds of milliseconds), which can be a problem for some applications. There are many ways in which we could optimize the loading of the type registry.

        Activity

        Jukka Zitting created issue -
        Jukka Zitting added a comment -

        With various refactorings I was able to significantly speed up the following benchmark:

        long a = System.nanoTime();
        new Tika();  // a..b: initial loading of the default configuration
        long b = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            new Tika();  // b..c: a hundred more default instances
        }
        long c = System.nanoTime();

        The average time between a and b (i.e. initial loading of the default configuration) is down from 655ms to 377ms on my computer. It looks like any further improvements would probably require precompiling the tika-mimetypes.xml file to another format to avoid the XML parsing overhead. That's a topic for another issue.

        And because the default media type registry is now memoized at first load, the average time for creating a hundred more default Tika instances went down from 4277ms to just 43ms!
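The memoization described above can be illustrated with a minimal sketch. This is not Tika's actual code: `Registry`, `DefaultRegistryHolder`, `getDefault()` and the `LOADS` counter are hypothetical names, and the small in-memory map stands in for the expensive parse of tika-mimetypes.xml. The point is only that the costly load runs once and every later caller reuses the same instance.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a parsed media type registry (not a Tika class).
final class Registry {
    final Map<String, String> typesByExtension;

    Registry(Map<String, String> typesByExtension) {
        this.typesByExtension = typesByExtension;
    }
}

// Hypothetical holder that memoizes the default registry at first load.
final class DefaultRegistryHolder {
    // Counts how often the expensive load runs; for demonstration only.
    static final AtomicInteger LOADS = new AtomicInteger();

    private static Registry cached;

    // Stand-in for the expensive step (assumed here to be the XML parse).
    private static Registry load() {
        LOADS.incrementAndGet();
        return new Registry(Map.of(
                "txt", "text/plain",
                "pdf", "application/pdf"));
    }

    // Loads on the first call, then returns the cached instance.
    static synchronized Registry getDefault() {
        if (cached == null) {
            cached = load();
        }
        return cached;
    }
}

final class MemoizedRegistryDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            DefaultRegistryHolder.getDefault();
        }
        // The load ran once, no matter how many callers asked for the registry.
        System.out.println("loads = " + DefaultRegistryHolder.LOADS.get());
    }
}
```

A synchronized accessor is the simplest thread-safe choice for a sketch like this. Sharing one instance this way presumably only works if the registry is not modified after loading, which is an assumption of the sketch rather than something stated in the issue.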

        Jukka Zitting made changes -
        Field           Original Value   New Value
        Status          Open [ 1 ]       Resolved [ 5 ]
        Assignee                         Jukka Zitting [ jukkaz ]
        Fix Version/s                    1.1 [ 12318849 ]
        Resolution                       Fixed [ 1 ]

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jukka Zitting
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: