Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Lucene.Net Core
    • Labels: None

      Description

      Adding binary values to a field is an expensive operation, as the whole binary data must be loaded into memory and then written to the index. Adding the ability to use a stream instead of a byte array could not only speed up the indexing process, but reduce the memory footprint as well.

      Java Lucene has the ability to use a TextReader to both analyze and store text in the index. Lucene.NET lacks the ability to store string data in the index via streams. This should be a feature added to Lucene.NET as well. My thought is to add another Field constructor, Field(string name, System.IO.Stream stream, System.Text.Encoding encoding), that will allow the text to be analyzed and stored in the index.
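      A rough sketch of how the proposed overload might be used; the Field(string, Stream, Encoding) constructor below is the proposal itself, not an existing API, and the file name is only a placeholder:

          // Hypothetical usage of the proposed constructor: the file is streamed
          // into the index instead of being loaded into memory as one large string.
          using (var stream = System.IO.File.OpenRead("extracted-text.txt"))
          {
              var doc = new Lucene.Net.Documents.Document();
              doc.Add(new Lucene.Net.Documents.Field("content", stream, System.Text.Encoding.UTF8));
              writer.AddDocument(doc);   // writer: an already-open IndexWriter
          }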

      Comments about this approach are greatly appreciated.

      Attachments

      1. StreamValues.patch
        31 kB
        Christopher Currens

        Activity

        Digy added a comment -

        Maybe something like this

        doc.Add(new Field("name",-----
        doc.Add(new Field("metadata",-----
        doc.Add(new Field("content",part1-----
        doc.Add(new Field("content",part2-----
        ....
        doc.Add(new Field("content",partN-----

        DIGY
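
        A slightly more complete sketch of that idea, with the chunking left to a hypothetical ReadChunks() helper and placeholder values for the other fields:

            // One logical document indexed as several same-named "content" fields
            // (Digy's sketch above), with each chunk of text added as its own Field.
            var doc = new Document();
            doc.Add(new Field("name", "report.xls", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("metadata", "author: ...", Field.Store.YES, Field.Index.ANALYZED));
            foreach (string chunk in ReadChunks())   // hypothetical helper yielding pieces of the text
            {
                doc.Add(new Field("content", chunk, Field.Store.NO, Field.Index.ANALYZED));
            }
            writer.AddDocument(doc);   // writer: an already-open IndexWriter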

        Digy added a comment -

        OK, a file system indexer is a good example.
        Have you tested your code? I see code like the following in
        AbstractField.cs

                    if (fieldsData is System.IO.Stream)
                    {
                        var streamReader = new System.IO.BinaryReader((System.IO.Stream)fieldsData);
                        byte[] data = new byte[streamReader.BaseStream.Length]; // <----
                        streamReader.Read(data, 0, data.Length);
        
                        return data;
                    }
        

        or
        Field.cs

                    if (fieldsData is Lucene.Net.Util.InternalStreamReader)
                    {
                        using (var reader = ((Lucene.Net.Util.InternalStreamReader)fieldsData))
                        {
                            reader.DiscardBufferedData();
                            reader.BaseStream.Seek(0, SeekOrigin.Begin);
                            fieldsData = reader.ReadToEnd();  // <----
                        }
                    }
        

        Wouldn't these be costly in terms of memory usage?

        DIGY

        PS: I still think there should be other solutions (outside of Lucene.Net) to handle this type of problem.
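
        For contrast, a minimal sketch of reading a stream in fixed-size blocks rather than allocating one buffer the size of the whole stream; the method name and block size are made up for illustration:

            // Memory use stays bounded by the block size instead of the stream length.
            static void CopyInBlocks(System.IO.Stream source, System.IO.Stream destination)
            {
                byte[] block = new byte[64 * 1024];   // arbitrary 64 KB block
                int read;
                while ((read = source.Read(block, 0, block.Length)) > 0)
                {
                    destination.Write(block, 0, read);
                }
            }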

        Christopher Currens added a comment -

        That's a valid question. I think it's most common (but not limited to) when Lucene is used to index file systems. As an example, the text extracted from some xls files can be (shudder) in the hundreds of MB. When accuracy is needed in a search, MaxFieldLength.Unlimited becomes important, as we don't want silent truncation of search terms. The idea of streaming, as I said before, was more about managing program memory, especially when multiple indexes are read/written at the same time, than about the ability to index a large file. Granted, there are other ways to solve the problem, like what you sort of suggested: breaking a larger file into smaller chunks. However, not all data is divisible the way a book is, so it's not an ideal solution, especially if you're storing file metadata along with full text.

        Digy added a comment -

        Maybe this is a stupid question, but what is the reason to index a very large doc?
        If I indexed a whole book as a single document, it would appear in almost every kind of search's result set.
        search "computer" --> this book.
        search "sport" --> this book.
        search "politics" --> this book.

        DIGY

        Christopher Currens added a comment -

        Also, SimpleFSDirectory doesn't really support stream indexing as much as I would hope. The issue is that SimpleFSDirectory creates a RAMOutputStream that it uses before it's flushed to disk. The PerDoc class keeps the entire thing in memory before flushing to disk. I'm assuming it does this so indexes aren't corrupted.

        It seems a good idea may be to create a new Directory implementation with a special IndexOutput that buffers to disk when a certain limit is hit, to prevent OOM exceptions when indexing huge amounts of data. However, I'm not sure that falls within the scope of Lucene.Net...maybe contrib? I have some ideas on how to do this without leaving behind any artifacts, like temp files. It seems the easiest way would be using MemoryMappedFile, as the GC frees the file, even under early termination of the program. Unfortunately, that's a .NET 4-only class.
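
        A rough illustration of the buffering idea only; the class, threshold, and method names below are made up and this is not Lucene.Net API. The sketch keeps writes in memory until a limit is reached, then spills everything to a temp file that deletes itself when closed:

            using System.IO;

            // Sketch: an in-memory buffer that spills to a self-deleting temp file
            // (FileOptions.DeleteOnClose) once a size threshold is crossed.
            class SpillBuffer : System.IDisposable
            {
                private const long Threshold = 16 * 1024 * 1024;   // arbitrary 16 MB limit
                private Stream _buffer = new MemoryStream();

                public void Write(byte[] data, int offset, int count)
                {
                    if (_buffer is MemoryStream && _buffer.Length + count > Threshold)
                    {
                        var file = new FileStream(Path.GetTempFileName(), FileMode.Create,
                            FileAccess.ReadWrite, FileShare.None, 64 * 1024, FileOptions.DeleteOnClose);
                        _buffer.Position = 0;
                        byte[] block = new byte[64 * 1024];
                        int read;
                        while ((read = _buffer.Read(block, 0, block.Length)) > 0)
                            file.Write(block, 0, read);          // move existing bytes to the file
                        _buffer.Dispose();
                        _buffer = file;                          // all further writes go to disk
                    }
                    _buffer.Write(data, offset, count);
                }

                public void Dispose()
                {
                    _buffer.Dispose();   // closing the FileStream deletes the temp file
                }
            }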

        Christopher Currens added a comment -

        Adding another patch that adds streaming capabilities for both binary and string values.

        String values are still input as streams; however, the encoding must be supplied for the value to be treated as a string, otherwise it will be treated as binary.
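
        Hypothetical usage under that rule; the exact constructor shapes below are only a guess at what the attached patch adds, not released API:

            // Same Stream type for both; the Encoding argument is what makes the
            // value a string field instead of a binary field (hypothetical overloads).
            doc.Add(new Field("body", textStream, System.Text.Encoding.UTF8));   // analyzed/stored as text
            doc.Add(new Field("original", rawStream));                           // no encoding: binary value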

        Troy Howard added a comment -

        Stream.CanSeek should be enough. We can throw a runtime exception if someone tries to store an indexed field with a Stream value whose CanSeek is false. This is basically the same experience as now (a runtime exception if you pass a TextReader value for an indexed field).
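
        A minimal sketch of that check; the method name and message are illustrative only:

            // Reject non-seekable streams up front; this mirrors the runtime exception
            // thrown today when a TextReader value is used where the field must also be stored.
            private static void EnsureSeekable(System.IO.Stream stream)
            {
                if (!stream.CanSeek)
                {
                    throw new System.ArgumentException(
                        "A stored, indexed field with a Stream value requires a seekable stream (CanSeek must be true).");
                }
            }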

        Troy Howard added a comment -

        Chris's goal here is to prevent large blobs from being placed in memory either as binary data or as string data. This is to prevent OOM exceptions on very large documents. Using Stream semantics, you can avoid this.

        The limitation that TextReader values cannot be stored is due to the TextReader type being forward-only, which comes from how Encodings work, not from some fundamental mismatch with Lucene's business rules. There is no reason you should not be able to provide a resettable Stream and an Encoding, perform the same operations, and reset the stream between the tokenization and value-storage stages.

        The only issue would be multi-threading: if tokenization and value storage were happening at the same time, they could not operate against the same stream.
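
        A sketch of the two-pass idea, assuming stream is a seekable Stream and encoding its text Encoding; Tokenize and StoreBytes are hypothetical stand-ins for the analysis and storage stages:

            // Pass 1: tokenize the text through a reader over the stream.
            var reader = new System.IO.StreamReader(stream, encoding);
            Tokenize(reader);                                // hypothetical analysis pass

            // Pass 2: rewind the same stream and store its raw bytes.
            stream.Seek(0, System.IO.SeekOrigin.Begin);
            StoreBytes(stream);                              // hypothetical storage pass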

        Christopher Currens added a comment -

        Good call. I think I was confusing storing the whole field with storing the term vectors, which lucene.net can do.

        I still think that, at the very least, being able to store binary values via a stream is a necessary addition to Lucene.Net. Making strings streamable is less of an issue, to me at least. However, I can see the benefit when indexing large items, which is really all this is attempting to solve. There are speed and memory issues created by being forced to load large quantities of data into memory to perform any sort of indexing operation on them. This may not be a large use case for some people, but anyone trying to write a multi-threaded indexing system would certainly enjoy the benefits of a lower memory footprint and the speed increase.

        Robert Jordan added a comment -

        BTW, Java Lucene does not have the ability to tokenize AND store from a reader:

        http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/document/Field.html

        IMO, this doesn't belong in Lucene.Net. The Lucene core simply cannot tokenize and store in one pass, so it's up to the application to deal with this issue.

        Christopher Currens added a comment -

        This patch allows StreamValue to be used with binary data.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Christopher Currens
          • Votes:
            0
          • Watchers:
            0
