Index: src/site/src/documentation/content/xdocs/fileformats.xml
===================================================================
--- src/site/src/documentation/content/xdocs/fileformats.xml (revision 483366)
+++ src/site/src/documentation/content/xdocs/fileformats.xml (working copy)
@@ -926,7 +926,8 @@
Compound Files

Starting with Lucene 1.4 the compound file format became default. This
- is simply a container for all files described in the next section.
+ is simply a container for all files described in the next section
+ (except for the .del file).

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount
@@ -1511,14 +1512,25 @@

Deleted Documents

The .del file is
- optional, and only exists when a segment contains deletions:
+ optional, and only exists when a segment contains deletions.

- Deletions
+ Although it is per-segment, this file is kept outside the compound segment file.
+
+ Pre-2.1:
+ Deletions (.del) --> ByteCount,BitCount,Bits

- ByteSize,BitCount -->
+ 2.1 and above:
+ Deletions (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
+
+ Format,ByteCount,BitCount --> Uint32

@@ -1527,6 +1539,23 @@
ByteCount

+
+ DGaps --> <DGap,NonzeroByte>^NonzeroBytesCount
+
+ DGap --> VInt
+
+ NonzeroByte --> Byte
+
+ Format is optional. A value of -1 indicates DGaps. A non-negative value in its place indicates Bits, and that Format is omitted.
+
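The optional Format field can be detected by inspecting the first Uint32, as the paragraph above describes. The following is a minimal Python sketch of that check, not the Lucene reader. It assumes each Uint32 is stored as four bytes with the high-order byte first and that -1 is the signed reading of the first value; `read_del_header` is a hypothetical helper name.

```python
import io
import struct


def read_del_header(stream):
    """Return (uses_dgaps, byte_count, bit_count) from a .del byte stream."""
    # Assumption: Uint32 values are big-endian; the first one is read as
    # signed so that the Format marker shows up as -1.
    (first,) = struct.unpack(">i", stream.read(4))
    if first == -1:
        # Format is present: ByteCount and BitCount follow, then DGaps.
        byte_count, bit_count = struct.unpack(">II", stream.read(8))
        return True, byte_count, bit_count
    # Format is absent: the value just read is ByteCount, then BitCount, then Bits.
    (bit_count,) = struct.unpack(">I", stream.read(4))
    return False, first, bit_count


# A 2.1-style header announcing DGaps, and a header with no Format field.
with_format = struct.pack(">iII", -1, 1000, 3)
without_format = struct.pack(">II", 2, 1) + bytes([0x00, 0x02])
print(read_del_header(io.BytesIO(with_format)))
print(read_del_header(io.BytesIO(without_format)))
```

The sketch leaves the stream positioned at the start of DGaps or Bits, so the payload can be read next by whichever decoder applies.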

ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1.
@@ -1544,6 +1573,20 @@
Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as deleted.
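The two-byte example can be checked with a short Python sketch. The bit order (document n maps to byte n // 8, counted from the low-order bit) is inferred from the example itself. The helper names `byte_count_for` and `is_deleted` are hypothetical, and (SegSize/8)+1 is read as integer division.

```python
def byte_count_for(seg_size):
    # "Typically (SegSize/8)+1", interpreted here with integer division.
    return seg_size // 8 + 1


def is_deleted(bits, doc):
    # Document `doc` lives in byte doc // 8, at bit doc % 8 from the low-order end.
    return ((bits[doc // 8] >> (doc % 8)) & 1) == 1


bits = bytes([0x00, 0x02])
deleted = [doc for doc in range(len(bits) * 8) if is_deleted(bits, doc)]
print(deleted)  # [9]
```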

+
+ DGaps represents sparse bit vectors more efficiently than Bits.
+ It consists of the gaps (DGaps) between the indexes of successive nonzero bytes in Bits,
+ each followed by the nonzero byte itself. The number of nonzero bytes
+ in Bits (NonzeroBytesCount) is not stored.
+
+ For example,
+ if there are 8000 bits and only bits 10, 12, and 32 are set,
+ DGaps would be used:
+
+ (VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1
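The example sequence can be reproduced with a small Python sketch of the gap encoding. This is an illustration under stated assumptions, not the Lucene implementation: gaps and bytes are kept as plain integers rather than serialized VInt and Byte values, the first gap is measured from index 0 (which matches the example), and `encode_dgaps` and `decode_dgaps` are hypothetical names. Because NonzeroBytesCount is not stored, a real reader needs its own stopping rule. The decoder here simply consumes the list it is given.

```python
def encode_dgaps(bits):
    """Return [gap, byte, gap, byte, ...] for the nonzero bytes of `bits`."""
    pairs, last = [], 0
    for index, value in enumerate(bits):
        if value != 0:
            pairs += [index - last, value]  # gap from the previous nonzero byte
            last = index
    return pairs


def decode_dgaps(pairs, byte_count):
    """Rebuild the dense Bits array from gap/byte pairs."""
    bits = bytearray(byte_count)
    index = 0
    for gap, value in zip(pairs[0::2], pairs[1::2]):
        index += gap
        bits[index] = value
    return bytes(bits)


bits = bytearray(1000)  # 8000 bits
for doc in (10, 12, 32):
    bits[doc // 8] |= 1 << (doc % 8)

pairs = encode_dgaps(bits)
print(pairs)  # [1, 20, 3, 1]
```

Bits 10 and 12 both fall in byte 1 (0x04 | 0x10 = 20), and bit 32 is the low-order bit of byte 4, which gives a gap of 3 from byte 1.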