Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      IndexReader.document() on a stored field is rather slow. I did a simple multi-threaded test and profiled it:

      40+% of the time is spent getting the offset from the index file
      30+% of the time is spent reading the count (i.e. the number of fields to load)

      I ran it on my laptop where the disk isn't that great, but there still seems to be much room for improvement, e.g. loading the field index file into memory (for a 5M doc index, the extra memory footprint is 20MB, peanuts compared to the other stuff being loaded).
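      As a rough sketch of that idea, assuming the .fdx file is essentially a table of 8-byte offsets, one per document (a real implementation would also need to handle any format header, and the class and method names here are hypothetical, not existing Lucene API):

          // Hypothetical sketch: preload the stored-fields index (.fdx) into a long[]
          // so that looking up a document's offset becomes an array access instead of
          // a disk seek. Assumes one Uint64 offset per doc; a real implementation
          // would also skip/validate any format header at the start of the file.
          import java.io.IOException;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.IndexInput;

          public class InMemoryFieldIndex {
            private final long[] offsets;

            public InMemoryFieldIndex(Directory dir, String fdxName, int numDocs) throws IOException {
              offsets = new long[numDocs];
              IndexInput in = dir.openInput(fdxName);
              try {
                for (int doc = 0; doc < numDocs; doc++) {
                  offsets[doc] = in.readLong(); // position of this doc's fields in the .fdt file
                }
              } finally {
                in.close();
              }
            }

            // Offset into the .fdt file where this document's stored fields begin.
            public long offset(int doc) {
              return offsets[doc];
            }
          }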

      On a related note, are there plans to have custom segments as part of the flexible indexing feature?

        Activity

        John Wang added a comment -

        Hi Mike:

        Sorry for the late reply. We have written something for this purpose:

        http://snaprojects.jira.com/wiki/display/KRTI/Krati+Performance+Evaluation

        Thanks

        -John

        Michael McCandless added a comment -

        The very simple store mechanism we have written outside of lucene has a gain of >85x, yes, 8500%, over lucene stored fields.

        John can you describe the approach here?

        Robert Muir added a comment -

        On modern machines (at least the machines we are using, e.g. a MacBook Pro)

        It's not really subjective, or based on modern machines. You are talking about 5M documents; some indexes have a lot more documents, and 4 bytes/doc in RAM adds up to a lot!
        For the case of using Lucene as a search engine library, this memory could be better spent on other things.
        I don't think this is subjective, because it's a search engine library, not a document store.

        Furthermore, with the question at hand, even if we do use the Directory implementation Uwe suggested, it is not optimal.

        But it is easy, and it takes away your disk seek. The "in-memory seek, read/parse" is, as you say, peanuts in comparison.

        John Wang added a comment -

        I still think 4 bytes/doc is too much (it's too much wasted RAM for virtually no gain)

        That depends on the application. On modern machines (at least the machines we are using, e.g. a MacBook Pro) we can afford it. I am not sure I agree with "virtually no gain" if you look at the numbers I posted. IMHO, the gain is significant.

        I hate to get into a subjective argument on this, though.

        I don't understand why you need something like a custom segment file to do this, why can't you just simply use Directory to load this particular file into memory for your use case?

        Having a custom segment allows me to avoid getting into this subjective argument about what is too much memory or what the gain is, since it just depends on my application, right?

        Furthermore, with the question at hand, even if we do use the Directory implementation Uwe suggested, it is not optimal. For my use case, the cost of the seek/read for the count on the data file is very wasteful. Also, even for getting the position, I can just do a random access into an array compared to an in-memory seek, read/parse.

        The very simple store mechanism we have written outside of Lucene has a gain of >85x, yes, 8500%, over Lucene stored fields. We would like, however, to take advantage of some of the good stuff already in Lucene, e.g. the merge mechanism (which is very nicely done), delete handling, etc.

        Robert Muir added a comment -

        as stated earlier, assuming we are not storing 2GB of data per doc, you don't need to keep a long per doc.

        Right, you stated this, but even if your 'store long into an int' trick works, I still think 4 bytes/doc is too much (it's too much wasted RAM for virtually no gain).

        I don't understand why you need something like a custom segment file to do this; why can't you simply use Directory to load this particular file into memory for your use case?

        John Wang added a comment -

        Sorry, I meant LUCENE-1914

        John Wang added a comment -

        I do not understand, I think the fdx index is the raw offset into fdt for some doc, and must remain a long if you have more than 2GB total across all docs.

        As stated earlier, assuming we are not storing 2GB of data per doc, you don't need to keep a long per doc. There are many ways of representing this without paying much of a performance penalty. Off the top of my head, this would work:

        Since positions are always positive, you can use the sign bit to indicate whether MAX_INT has been exceeded; if so, add MAX_INT to the masked bits. You get away with an int per doc.

        I am sure there are tons of other neat ideas for this that the Mikes or Yonik can come up with.
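        (A minimal sketch of that unsigned-int trick, assuming the total stored-fields file stays under 4GB so every offset fits in 32 bits; the class and method names are made up for illustration.)

            // Hypothetical helpers illustrating the "int per doc" idea above: treat the stored
            // int as unsigned, so the sign bit carries the 2^31 part of the offset. Only valid
            // while every offset fits in 32 bits (i.e. the total .fdt size stays under 4GB).
            class PackedOffset {
              static int pack(long offset) {
                if (offset < 0 || offset > 0xFFFFFFFFL) {
                  throw new IllegalArgumentException("offset out of 32-bit range: " + offset);
                }
                return (int) offset; // offsets above MAX_INT come back as negative ints
              }

              static long unpack(int packed) {
                return packed & 0xFFFFFFFFL; // negative ints map back to MAX_INT + masked bits
              }
            }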

        John, do you have a specific use case where this is the bottleneck, or are you just looking for places to optimize in general?

        Hi Yonik, I understand this may not be a common use case. I am trying to use Lucene as a store solution, e.g. supporting just get()/put() operations as a content store. We wrote something simple in house and I compared it against Lucene, and the difference was dramatic. After profiling, it just seems this is an area with lots of room for improvement (posted earlier).

        Reasons:
        1) Our current setup is that the content is stored outside of the search cluster. It just seems being able to fetch the data for rendering/highlighting within our search cluster would be good.
        2) If the index contains the original data, changing the indexing schema, e.g. reindexing, can be done within each partition/node. Getting data from our authoritative datastore is expensive.

        Perhaps LUCENE-1912 is the right way to go rather than "fixing" stored fields. If you also agree, I can just dup this over.

        Thanks

        -John

        Robert Muir added a comment -

        Robert, we can get away with 4 bytes per doc assuming we are not storing 2GB of data per doc

        I do not understand, I think the fdx index is the raw offset into fdt for some doc, and must remain a long if you have more than 2GB total across all docs.

        Yonik Seeley added a comment -

        The thing about stored fields is that it's normally not inner-loop stuff. The index may be 100M documents, but the average application pages through hits a handful at a time. And when loading stored fields gets really slow, it tends to be the OS cache misses due to the index being large. We should still optimize it if we can of course (some apps do access many fields at once), but I agree with Robert that a direct in-memory stored field index probably wouldn't be a good default.

        John, do you have a specific use case where this is the bottleneck, or are you just looking for places to optimize in general?

        John Wang added a comment -

        Thanks Uwe for the pointer. Will check that out!

        Robert, we can get away with 4 bytes per doc assuming we are not storing 2GB of data per doc. This memory would be less than the data structure that needs to be held in memory for just one field cache entry for sorting. I understand it is always better to use less memory, but sometimes we do have to make trade-off decisions.
        But you are right, different applications have different needs/requirements, so having support for custom segments would be a good thing, e.g. LUCENE-1914.

        Uwe Schindler added a comment -

        FileSwitchDirectory comes to mind. Just delegate the *.fdx extension to a RAMDirectory. On instantiation of the directory, create the copy while wrapping with FileSwitchDirectory.
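        (Roughly, a sketch of that idea against the 3.0-era store API; exact constructor arguments and method names may differ between versions, so treat this as illustrative rather than a drop-in implementation.)

            // Rough sketch of Uwe's suggestion: keep *.fdx files in RAM, everything else on disk.
            import java.io.File;
            import java.io.IOException;
            import java.util.Collections;
            import org.apache.lucene.store.Directory;
            import org.apache.lucene.store.FSDirectory;
            import org.apache.lucene.store.FileSwitchDirectory;
            import org.apache.lucene.store.IndexInput;
            import org.apache.lucene.store.IndexOutput;
            import org.apache.lucene.store.RAMDirectory;

            public class FdxInRamDirectory {
              public static Directory open(File path) throws IOException {
                Directory fsDir = FSDirectory.open(path);
                RAMDirectory ramDir = new RAMDirectory();

                // Copy the existing stored-fields index files (*.fdx) into RAM up front.
                for (String file : fsDir.listAll()) {
                  if (file.endsWith(".fdx")) {
                    IndexInput in = fsDir.openInput(file);
                    IndexOutput out = ramDir.createOutput(file);
                    byte[] buffer = new byte[4096];
                    long remaining = in.length();
                    while (remaining > 0) {
                      int chunk = (int) Math.min(buffer.length, remaining);
                      in.readBytes(buffer, 0, chunk);
                      out.writeBytes(buffer, 0, chunk);
                      remaining -= chunk;
                    }
                    out.close();
                    in.close();
                  }
                }

                // Route the "fdx" extension to the RAMDirectory, everything else to the FSDirectory;
                // the final flag closes the wrapped directories when this one is closed.
                return new FileSwitchDirectory(Collections.singleton("fdx"), ramDir, fsDir, true);
              }
            }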

        Robert Muir added a comment -

        John, couldn't you simply write your own Directory if you want to put the fdx in RAM? I am not sure about 'peanuts'; some people may not want to pay 8 bytes/doc or whatever it is for this stored field offset, when the memory could be better used for other purposes.


          People

          • Assignee:
            Unassigned
          • Reporter:
            John Wang
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
              Updated:
