Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None
    1. InStream.bp
      6 kB
      Marvin Humphrey
    2. InStream.c
      12 kB
      Marvin Humphrey
    3. InStream.pm
      2 kB
      Marvin Humphrey
    4. OutStream.pm
      2 kB
      Marvin Humphrey
    5. OutStream.bp
      4 kB
      Marvin Humphrey
    6. OutStream.c
      8 kB
      Marvin Humphrey
    7. MockFileHandle.bp
      1 kB
      Marvin Humphrey
    8. MockFileHandle.c
      2 kB
      Marvin Humphrey
    9. TestUtils.bp
      2 kB
      Marvin Humphrey
    10. TestUtils.c
      2 kB
      Marvin Humphrey
    11. TestInStream.bp
      0.8 kB
      Marvin Humphrey
    12. TestInStream.c
      8 kB
      Marvin Humphrey
    13. 052-instream.t
      0.1 kB
      Marvin Humphrey
    14. 054-io_primitives.t
      0.1 kB
      Marvin Humphrey
    15. TestIOPrimitives.bp
      0.8 kB
      Marvin Humphrey
    16. TestIOPrimitives.c
      11 kB
      Marvin Humphrey
    17. TestIOChunks.bp
      0.8 kB
      Marvin Humphrey
    18. TestIOChunks.c
      3 kB
      Marvin Humphrey
    19. 055-io_chunks.t
      0.1 kB
      Marvin Humphrey
    20. 101-simple_io.t
      4 kB
      Marvin Humphrey

      Activity

      Marvin Humphrey added a comment -

      InStream and OutStream are roughly analogous to Lucene's IndexInput and
      IndexOutput classes, but there are some differences.

      Under Lucy, alternate "file" treatments are implemented in FileHandle
      subclasses such as RAMFileHandle and FSFileHandle. InStream and OutStream are
      not final, but only so that they can be extended with new methods. Under
      Lucene, in contrast, alternate file treatments are achieved by subclassing
      IndexInput and IndexOutput directly.

      Additionally, InStream and OutStream are always buffered. This allows us to
      inline some functionality that would otherwise have to be implemented in terms
      of abstract methods like IndexInput.readByte() and IndexOutput.writeByte().

      From Lucene's IndexInput.java (note readByte() in loop):

      public int readVInt() throws IOException {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
          b = readByte();
          i |= (b & 0x7F) << shift;
        }
        return i;
      }
      

      From Lucy's InStream.c (note static inline function SI_read_u8() in loop):

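      u32_t
      InStream_read_c32(InStream *self)
      {
          u32_t retval = 0;
          /* Accumulate 7 bits per byte, most significant group first. */
          while (1) {
              const u8_t ubyte = SI_read_u8(self);
              retval = (retval << 7) | (ubyte & 0x7f);
              /* A clear high bit marks the final byte. */
              if ((ubyte & 0x80) == 0) { break; }
          }
          return retval;
      }

      /* Inlined into callers; only a refill leaves the fast path. */
      static INLINE u8_t
      SI_read_u8(InStream *self)
      {
          if (self->buf >= self->limit) { S_refill(self); }
          return (u8_t)*self->buf++;
      }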
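
      Note that the two loops also differ in byte order: InStream_read_c32
      consumes the most significant 7-bit group first, while Lucene's readVInt
      consumes the least significant group first. The following standalone sketch
      illustrates the Lucy ordering. It is an illustration only: encode_c32 is a
      hypothetical encoder inferred from the decoder above (it is not taken from
      OutStream.c), and decode_c32 reads from a raw pointer instead of an InStream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder (an assumption, not Lucy's code): writes `value` as
 * 7-bit groups, most significant group first, setting the high bit on every
 * byte except the last. Returns the number of bytes written (1 to 5). */
static size_t
encode_c32(uint32_t value, uint8_t *dest)
{
    uint8_t tmp[5];
    size_t  n = 0;
    assert(dest != NULL);
    /* Collect the groups least significant first... */
    do {
        tmp[n++] = (uint8_t)(value & 0x7f);
        value >>= 7;
    } while (value != 0);
    /* ...then emit them in reverse, flagging all but the final byte. */
    for (size_t i = 0; i < n; i++) {
        const uint8_t group = tmp[n - 1 - i];
        dest[i] = (i + 1 < n) ? (uint8_t)(group | 0x80) : group;
    }
    return n;
}

/* Same loop as InStream_read_c32, but reading from a raw pointer.
 * Advances *src past the bytes consumed. */
static uint32_t
decode_c32(const uint8_t **src)
{
    uint32_t retval = 0;
    while (1) {
        const uint8_t ubyte = *(*src)++;
        retval = (retval << 7) | (ubyte & 0x7f);
        if ((ubyte & 0x80) == 0) { break; }
    }
    return retval;
}
```

      Under this scheme the value 300 is stored as the bytes 0x82 0x2c, whereas
      Lucene's VInt format stores it as 0xac 0x02.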
      

      The fact that OutStream is buffered means an extra memory copy (Lucene has
      this too). Theoretically, it would be nice if we could write to the system
      buffer directly, but that requires extending the file first – see
      http://www.linuxquestions.org/questions/programming-9/mmap-tutorial-cc-511265/#post2549203.
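
      To illustrate the "extend the file first" requirement, here is a minimal
      POSIX sketch. It is not Lucy code: the function name write_via_mmap and its
      error handling are assumptions made for the example. The file has to be
      grown with ftruncate() before the new region can be written through a
      mapping, because touching mapped pages that lie wholly beyond end-of-file
      raises SIGBUS.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (an assumption for illustration): replace the contents
 * of `path` with `len` bytes from `data`, written through a shared mapping.
 * Returns 0 on success, -1 on failure. */
static int
write_via_mmap(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);
    if (len == 0) { return -1; }  /* mmap rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) { return -1; }

    /* Step 1: extend the file so that the pages we map are backed by it. */
    if (ftruncate(fd, (off_t)len) == -1) { close(fd); return -1; }

    /* Step 2: map the newly created region for writing. */
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    /* Step 3: "write" by copying straight into the mapping. */
    memcpy(map, data, len);

    int status = 0;
    if (munmap(map, len) == -1) { status = -1; }
    if (close(fd) == -1)        { status = -1; }
    return status;
}
```

      The final length must be known, or over-allocated and trimmed later, before
      any byte is written, which is awkward for a stream that grows
      incrementally.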

      The fact that InStream is buffered introduces no extra cost, because there is
      no copy: for InStreams which wrap FSFileHandles, the buffer is sourced from a
      memory-mapping operation (mmap for Unixen, MapViewOfFile under Windows).
      Multiple InStream objects may share the same underlying FileHandle, since they
      do not rely on or update the FileHandle's file position or other state
      (excluding refcount).
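
      The sharing arrangement can be sketched as follows for POSIX systems. This
      is a simplified illustration, not Lucy's implementation: Mapping and Cursor
      are hypothetical stand-ins for FileHandle and InStream, and the whole file
      is mapped at once rather than in windows. Each cursor keeps its own
      position inside the shared read-only mapping, so no file position is
      consulted or updated and no bytes are copied.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for a file handle: one read-only mapping. */
typedef struct {
    const uint8_t *base;
    size_t         len;
} Mapping;

/* Hypothetical stand-in for an InStream: a private position in the mapping. */
typedef struct {
    const uint8_t *buf;
    const uint8_t *limit;
} Cursor;

/* Map the whole of `path` read-only. Returns 0 on success, -1 on failure. */
static int
mapping_open(Mapping *m, const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd == -1) { return -1; }
    if (fstat(fd, &st) == -1 || st.st_size <= 0) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) { return -1; }
    m->base = p;
    m->len  = (size_t)st.st_size;
    return 0;
}

static void
mapping_close(Mapping *m)
{
    munmap((void *)m->base, m->len);
}

/* Create an independent cursor starting at `offset`. */
static Cursor
cursor_at(const Mapping *m, size_t offset)
{
    assert(offset <= m->len);
    Cursor c = { m->base + offset, m->base + m->len };
    return c;
}

/* Return the next byte, or -1 at the end of the mapping. */
static int
cursor_read_u8(Cursor *c)
{
    if (c->buf >= c->limit) { return -1; }
    return *c->buf++;
}
```

      Two cursors created with cursor_at() on the same Mapping can be advanced in
      any interleaving without affecting each other.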

      At present, no support is provided for systems which do not support memory
      mapping. Previous experiments included a fallback which read data into a
      malloc'd buffer, and it would be possible to reintroduce that functionality if
      we have to. For now, though, it's simpler to leave it out.
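
      For reference, a fallback of the kind described might look roughly like the
      following. This is a hedged sketch, not the earlier experimental code: the
      function name read_window is hypothetical, and it uses standard C stdio so
      that it does not depend on memory mapping. Unlike the mmap path, it costs
      one copy per refill.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fallback (an assumption for illustration): copy up to `len`
 * bytes starting at `offset` into a freshly malloc'd buffer. Stores the
 * number of bytes actually read in *got. Returns NULL if nothing could be
 * read; otherwise the caller must free() the result. */
static unsigned char *
read_window(const char *path, long offset, size_t len, size_t *got)
{
    assert(path != NULL && got != NULL);
    *got = 0;
    if (len == 0) { return NULL; }

    FILE *fh = fopen(path, "rb");
    if (fh == NULL) { return NULL; }

    unsigned char *buf = malloc(len);
    if (buf == NULL || fseek(fh, offset, SEEK_SET) != 0) {
        free(buf);
        fclose(fh);
        return NULL;
    }

    /* A short count is fine: the window may extend past end-of-file. */
    *got = fread(buf, 1, len, fh);
    if (*got == 0) {
        free(buf);
        buf = NULL;
    }
    fclose(fh);
    return buf;
}
```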

      Marvin Humphrey added a comment -

      Committed as r830909.


        People

        • Assignee:
          Marvin Humphrey
          Reporter:
          Marvin Humphrey
        • Votes:
          0
          Watchers:
          0
