|
Currently SequenceFile-s may contain Metadata. In my applications I never found use for this, because this Metadata needs to be written out right after the file is opened, and cannot be updated later. A much better model for my needs would be to write the Metadata record right before closing the file (in the "tail" section above), so that it can be updated until the end, with e.g. record count.
I would propose that we move to:
block 1 block 2 ... index tail where block is compressed, and contain: key/value1: key len (vint), key, value len (vint), value key/value 2 ... The index would be compressed and contain: block 1: offset, first record idx block 2: offset, first record idx block 3: offset, first record idx: ... and the tail would look like: key class name value class name index kind (none, keys, keys+bloom filter) format version offset of tail offset of index Then extensions of this format would put more indexes between the last block and the start of the index. So for example, the first key of each block: first key of block 1: key len (vint), key first key of block 2 ... offset of start key index Another reasonable extension of the key index would be a bloom filter of the keys: bloom filter serialization offset of bloom filter index start
The issue is most critical for compressed sequence files, but it would make sense to make the compression optional. I would not support value compression.
yes
that probably makes sense, although my desire is to make applications not read the header and only read the tail, which has the meta data including the index.
It is the row id of the first row of that block. So that you can support seek to a given row number, which is useful if you have a bunch of files that correspond to different columns in a big table. You would make splits that look like rows 1000-2000 and you can map that across multiple files. magic (4 bytes) block 0 .. n index tail and the tail looks like: key class name
value class name
index kind (none, keys, keys+bloom filter)
compression kind (none, default, lzo)
format version
offset of tail
offset of index
One other possibility would be to represent key/values as:
record length (vint) key value That would work for all writables, because they all record their sizes internally. In theory you could drop the record length, but that would mean that you would have to deserialize all of the keys and values as you skip over records. Thoughts? > Owen O'Malley - 26/Apr/08 09:46 AM
> One other possibility would be to represent key/values as: record length (vint) key value > That would work for all writables, because they all record their sizes internally. In theory you could drop Dropping the record length would seriously slow down random reads unless the index is 'complete', i.e., every key/offset is represented. If the index is sparse like MapFile's, you would only get an approximate location of the desired record and then have to do a lot of work to seek forward to the desired one. I agree that we shouldn't drop the record length. It would be unacceptable when seeking having to deserialize the values, which can be quite slow. I was asking about dropping the key length. I can't see any cases where we need the key length and I was asking if anyone else could see one...
I guess since I don't expect to ever read the front of these, we should put the magic value at the back too.
magic (4 bytes) block 0 .. n index tail magic Will the new code be able to read the old format? I would like to see something like Andrzej is mentioning, the metadata section at the end.
Our use case is specifically the example hi mentions, the record count. Currently keep the count in a shadow file, ie _NAME.counter, using a custom SequenceOutputFormat. The problem with this approach is that we are doubling the number of files in HDFS for SequenceFiles. > Owen: I can't see any cases where we need the key length
I think we use it when sorting. We read raw keys into memory and call raw comparators on them, passing the key length to the raw comparator. The sorting code pre-dates MapReduce, though, from back when we manually partitioned, shuffled, sorted, merged, etc. things in Nutch. Perhaps we no longer need to sort and merge files directly, since folks can use MapReduce for that. Does any application code still use SequenceFile#Sorter? > Owen: my desire is to make applications not read the header and only read the tail So we should have a magic number there too, but, if it doesn't harm things, I'd prefer leaving an (frequently unread) magic number in the header too. Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather something that we expect code to migrate to? I'm guessing the latter. > Alejandro: Our use case is specifically the example he mentions, the record count. I think we can include record-count as a base feature of this file format. But permitting a Map<String,String> of other metadata might also be good. Writers know exactly how many records have been written, so they can write the record count easily. I suggest using MapWritable for the metadata section, so that we can store other data, not just Strings. Record counts are just an example of informaiton that becomes available only when closing a file - other info may include timestamps, or min / max values, etc, etc.
> Doug Cutting - 28/Apr/08 09:27 AM
> Do you expect to make this a drop-in replacement for SequenceFile and MapFile, or rather something that we expect code to migrate to? I'm guessing the latter. I think it will be the latter as well. > So we should have a magic number there too, but, if it doesn't harm things, I'd prefer leaving an (frequently unread) magic number in the header too. > Jim Kellerman - 26/Apr/08 10:19 AM > Each block in this file would be memory loadable, it doesnt really matter (much) if we store key length or not as the total operation is bounded by seek and read. Even going through the index with variable sized keys is linear. Maybe bounding the index to one read block makes sense aswell. The only case this can change is if we have some metadata about the key being fixed size: in which case all the seek-to-keys are O(1) One more thought is an index purely based on record ids (fixed size encoded) that may keep the index skippable/seekable. I like the record count being in the tail. If we are saving the first key of each block, we should probably also save the last key of the file too.
I would think this is just a new class that people migrate too. For most applications, I would expect it to be an easy transition. Srikanth, it is useful to have the magic bytes at the front so that commands like "file" on unix can work. It is just a constant 4 bytes, it doesn't really complicate the format at all. You absolutely do want the record length in the data, because deserialization can be slow. I absolutely don't think there should be any special code for fixed width types. The cost of the variable width types is really small with the vint encoding. Specification of a TFile. Binary block compressed file.
Owen suggested the following:
magic: 4 bytes Adding doug's request as well, I am updating the document and will post it soon, will have a document by the eod. Owen, one concern separating the keys from offsets:
In the rare case that the keys do not fit in memory or reach a limit: We might have to write out both the keys and offsets at that point and start another index "block". Hmm, I think in general that would strongly suggest that you have your block size too small. smile
However, you could write the key index to a side file while you were building it and then copy it into place. The offset, row id table should never be too big for ram. I think for at least the first version we should assume that the index fits in memory.
As a subsequent enhancement, we might permit one to limit the size of the index. Then, when writing, if the index gets too big, we can downsample at that point, discarding every-other entry or somesuch, to keep the index within a certain bound. Similarly, when the file is opened, if the index is larger than available memory, we can downsample further then. We should not depend on keys/values being Writables in TFile, the API should be :
public class TFile<K extends Comparable<K>, V> { public static class Writer<K extends Comparable<K>, V> { void append(K key, V value); Class<K> getKeyClass(); ... } public static class Reader<K extends Comparable<K>, V> { V get(K key, V val) ; boolean next(K key, V val); ... } } Latest draft looks much better!
> We should not depend on keys/values being Writables in TFile. Good point. So the writer's constructor should have Serlializer<K> and Serializer <V> parameters, and the reader Deserializer<K> and Deserializer<V> parameters. This will permit us to, e.g., store Thrift or other objects in a TFile. It would be good to have a way to find the offset of the start of the RO entries. I'd propose that we have:
kindexoffset - the start of the the key index rindexoffset - the start of the row index moffset - the start of the meta data
This proposal looks great. Here's a couple of comments:
In 'Goals', says 'support all kinds of columns'? Do you mean all column data types? Also says 'support seek to key and seek to row'. What is the difference between a key and a row? In the description of the blockidx, says 'Spare index of keys into the datablocks'. Whats this mean? The key that is at the start of each block will be in the block index? And only this? Or will index have entries keys from the middle of blocks in it? Does the metadata value have to be a String? It looks like it doesn't have to be – that I can specify my own keyClass and valClass. For example, I would like to be able to write a bloom filter into the metadata. Its not plain that user can add their own metadata to imeta. You might explicitly state this. Section 3.2 where you describe two different kinds of index is a little confusing (I'm not clear on RO vs. Key as per above). In the Writer API, you state that a null key class is for a keyless column. Whats a null value class imply? Is the Writer API missing metadata writing? Same for reading. Reading talks about rowids but writer does not. Is this intentional? For the reader API, expose methods getting key only without reading value? > So the writer's constructor should have Serlializer<K> and Serializer <V> parameters [ ... ]
On second thought, we should just use a SerializationFactory to construct serializers and deserializers for the key and value classes. > It is useful to be able to get the key/value class names without the class. We should have constants defined metadata keys. So getting the key class should look something like: Class.forName(tfile.getMeta(TFile.KEY_CLASS)); With such constants, we may not need methods like getKeyClass() at all... > Does the metadata value have to be a String? I'd vote for supporting both String and byte[] as metadata values, perhaps with methods like: Writer#setMeta(String key, String value); String Reader#getMeta(String key); > Does RO stand for something, or is it short for "row"?
RO = Rowid Offset >The RO entry values can be more compactly represented as differences from the prior entry. Is this intended? If so, we should state this. > In data blocks, we might use something like <entryLength><keyLength><key><value>. This would permit one to skip entire entries more quickly. The valueLength can be computed as entryLength-keyLength. Do folks think this is worthwhile? I think it should not matter much, but since I heard this request third time, it seems like a good thing to do. > Owen O'Malley - 12/May/08 11:44 AM > getClosest should specify that it means closest after the given key > Doug Cutting - 12/May/08 10:53 AM > does the compression of the key index follow tfile.compressionCodec? > should the ioffset and moffset (or my key, row, and meta offsets) be vints? > I think the append method that takes an input stream should be: void appendRaw(int keyLength, InputStream key, int valueLength, InputStream value) throws IOException; Keys are supposed to be memory loadable, maximum length of 64K. I am not sure if this interface will be / should be used. We may want to keep that for later. > Most of the methods should have "throws IOException" > It is useful to be able to get the key/value class names without the class. I'd replace the getKeyClass and getValueClass with string equivalents: > stack - 12/May/08 12:15 PM > In 'Goals', says 'support all kinds of columns'? Do you mean all column data types? Also says 'support seek to key and seek to row'. What is the difference between a key and a row? > Its not plain that user can add their own metadata to imeta. You might explicitly state this. > In the Writer API, you state that a null key class is for a keyless column. Whats a null value class imply? > Doug Cutting - 12/May/08 12:37 PM Writer#setMeta(String key, String value); Writer#setMetaBytes(String key, byte[] value); String Reader#getMeta(String key); byte[] Reader#getMetaBytes(String key); I thinks it makes sense to add these. (As opposed to having them in the constructor). The getMeta functions are a nobrainer. > The only problem I see is when do we allow the setMeta functions [ ...]
Perhaps you could instead make it so that a value can only be set once? Then, if the ctor sets things like the key and value classes, application code cannot. appendRaw should take:
void appendRaw(byte[] key, int keyOffset, int keyLength, byte[] value, int valueOffset, int valueLength); so that you can pass in subsets of byte arrays. The InputStream appender is a raw appender. So it breaks the abstraction to pass as non-raw key with a raw value. If we support user-specified meta-data, it should be settable until the close method is called. Since you might want to include summary information about the contents of the file. It seems that the append API and the next API for the sequence file are not complementary.
append ( Object key, Object value) next (WritableComparable key, Writable Value) are we planning to change it? Or should I have the same API as the Sequence File for read. Map file uses: WritableComparator.get(keyClass) to get the comparator. I will go with that for getting the comparator if the key is not byte comparable.
> Map file uses: WritableComparator.get(keyClass) to get the comparator.
TFile should be independent of Writable, so that it supports other serialization frameworks, like Thrift. Our generic Serialization framework does not yet include comparators, but it should. We should add a method: RawComparator Serialization#getComparator(); TFile should use 'new SerializationFactory(conf).getSerialization(keyClass)' to get the serialization, and then get the serializer, deserializer and comparator from that. We should add comparators to the Serialization framework (HADOOP-3380) before we implement TFile.
Changed the API
Restructured the blockidx split it into keyidx, roidx. Added offsets to each. Fixed typos restructured the datablocks: entry, keylength, key, value Hopefully this is good. We are starting to work on TFile implementation getting most of the stuff in will use byte compare for now.
Pardon me if I'm being thick but it is still not clear to me now rows and keys relate and why we need two indices especially if "Number of key entries equals number of RO entries, and they have 1-1 correspondence". Is a row made of keys? You might extend the introduction to include explication of how they relate.
Doc. still says meta data keys and values are String. I thought above you allowed that values could be byte arrays? You have setMeta and then setMetaBytes? Why not just name the second method same as first? In your implemenation, please keep in mind that others will most likely want to extend. You might consider implementing TFile as an Interface with an easily subclassable or reuseable implementation? Good stuff That is my fault. Let's say that you have:
block size = 1 mb key size = 100 bytes file size = 5 gb that means that you have: key length vint = 1 byte
row idx vint = 4 bytes
block offset vint = 4 bytes
blocks = 5000
key index = 500kb
row index = 40kb
which makes it very cheap to read the row index if you don't need the key index (ie. a single 64k read of the tail of the file will likely get the whole thing). The whole key index is not huge, but it is much bigger than the row index and may be compressed and therefore more expensive to read. Note that for map/reduce you have two options, either the getInputSplits could read all of the row indexes and plan the splits exactly. Or you could take the current approach and cut blindly and have the maps adjust the boundaries. In either case, you want it to be cheap to read the row indexes, which is much harder if the keys are intermixed.
> You might consider implementing TFile as an Interface with an easily subclassable or reuseable implementation?
TFile should not be an interface, but perhaps we should have an an abstract class for such files. Then we might have a SequenceFile-based implementation, so that it would be easy to have code that can, e.g., both process SequenceFiles and TFiles with a simple way to select one or the other. Sorry I am late to the party. (I just joined the Grid group on 05/12).
A few questions/suggestions:
It would be really nice if TFiles (and SequenceFiles) were accessible by non-Java programs - e.g., pipes or Python over fuse. If the header/metadata were versioned Jute or Thrift records, it would make it much easier to implement bindings for other languages - esp Thrift since it's in like 50 langauges. pete In the imeta section, we need to be careful about using class names to represent the formatting of data in a TFile. In particular, we should support standardized data and compression formats whose existence is independent of any implementing to Java classes. (To Pete's point, this would support cross-language accessibility of the data.)
One could use class names for this purpose. We could claim that the name of some Java codec class is the name of the data format. The problem, is that class file names tend to change (and multiple implementations of the same codec tend to get written), so this is not a good way to go. So perhaps the strings in the imeta section should be names of formats, and the codec classes are found through some registry. On the other hand, for "local" Hadoop installations that wants to make up its own data formats, being able to wire the class name directly into the file and not having to maintain a registry might be convenient. I'm not convinced: I suspect most people will use standard formats we can bake into a standard registry. Still, if we think we need to support both requirements ("universal" names for standardized formats, and class-names for local ones), then perhaps a convention should be defined: a string that starts with the character "#" is the name of a standard format, the codec for which will be found in some registry. Other strings are treated like Java class names and are used to find the codec classes directly. I would suggest splitting the API into two parts a cooked part and raw part.
public class TFile { static public class Writer { public void append(byte[] keyData, int keyOffset, int keyLength, byte[] valueData, int valueOffset, int valueLength) throws IOException; /** * Get an output stream to write the key and value to. Only valid until the next call to append. * The number of bytes written to the stream must equal keyLength + valueLength. */ public OutputStream append(long keyLength, long valueLength) throws IOException; } public static Writer create(Path p, RawComparator comparator) throws IOException; static public class Reader { /** * Get an input stream to read the key from. available() will return the number of bytes left. * Only valid until the next next call on this stream. */ public InputStream nextKeyInputStream() throws IOException; /** * Get and input stream to read the value from. available will return the number of bytes left. * Only valid until the next next call. */ public InputStream nextValueInputStream() throws IOException; } public static Reader open(Path p, RawComparator comparator) throws IOException; } and then we add a level of abstraction on top that handles objects:
public class ObjectFile<KEY_CLASS, VALUE_CLASS> { Owen,
There would be one complication in exposing the append(long keyLength, long valueLength) that we did not discuss earlier. Although it can be handled. If it the key,value is at the beginning of a block we need to copy to a byte array in the key.serialize(outputstream). We can do this by having a keyValueOutputStream(keybytes,valuebytes, outputstream), that captures the first keybytes of data written into a buffer. This needs to be done to generate an index. But it starts getting ugly. I would also suggest ObjectFile should be extending the TFile and it can do all this in a neater fashion without exposing the append(keyLength, valueLength). Additionally to make any of this feasible (You mentioned this earlier, I just want to record it), serializers should also have getSerializedLength(). Srikanth, I don't understand your concern. When the user calls append(long, long), the writer can decide whether to start a new block or not based on the lengths. So as the client calls write(byte[], int, int) on the output stream, it can be written directly to the file stream or codec's ByteBuffer. For codecs like lzo, the write may be broken into multiple calls to handle the required chunking.
And yes, to make this efficient, you need to be able to get the serialized length of the objects. At the beginning of each block. We need to record the first key that goes into the block for making an index entry. A raw write may preclude us from doing that, unless we wrap it around a "BoundedOutputStream" that captures the first n bytes written out.
Attaching first preview of TFile implementation to get feedback. In a few days we will have a Beta release .
Dev Team (alphabetically): Mahadev Konar, Hong Tang, Nathan Wang and Amir Youssefi Initially we used some parts contributed by Venkata Srikanth Kakani. Hong will attach updated spec shortly. Tested with revision: 693321 On the read API the finalRowId() and getRowId(), I assume this is the row count and it is 0..N, with N being the number or rows (records) in the file right?
If that is the case, are you planing to provide an API to get the total number of rows in a directory when all files in the directory are TFiles ? Alejandro,
Thanks for reading above spec. It is changed and evolved dramatically. I will e-mail you a copy of new one and remove old ones. Hong will attach an updated version shortly. Hi Amir,
could you also update the description of this JIRA which currently says only "SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs"; with the design goals of the new format? thanks, pete Util.java serves no purpose. It mostly contains a number of package-private classes that should be independent files. Also much of its code seems redundant:
If these differ in subtle ways from the versions elsewhere in Hadoop then the justification for near-duplication of implementation should be made in a comment. I got tired of reading the patch after that. For a simpler implementation, it seems very complicated. The implementation of TFile is around 3.9k lines of Java. SequenceFile, which is thought to be too complicated, is only 3.1k lines. The TFile specification.
TFile is an immutable <Key, Value> storage format. TFile is meant to replace Sequence File. Comparing with Sequence File, TFile provides the features of block compression, sorted and unsorted keys, block indexing (for sorted TFile only), scan (bulk read), and application-level meta data. The data abstraction and storage representation is language-neutral, which supports different implementation in various languages (although this specification only describes the API in Java).
Why did you write your own variable length integer encoding? is this superior to protocolbuffers encoding?it seems like writing your own encoding/decoding format makes it harder for people to implement this in other languages, no?
> Why did you write your own variable length integer encoding? is this superior to protocolbuffers encoding?it seems like writing your own encoding/decoding format makes it harder for people to implement this in other languages, no?
No, I have not looked at protocol buffer. Does it have a VInt/VLong implementation that can be used independently? As I commented previously, the only variable length integer format I found is the one from WriteableUtils, which has the odd property of only doubling the range of representation when we increase from 1B to 2B. I am open for adopting VInt/VLong encoding schemes that are relatively standardized and provides a good range of coverage for small to medium integers (which is the primary purpose of having variable length integers). They do have varints and are under the apache license. I don't know how they compare compression wise:
http://code.google.com/apis/protocolbuffers/docs/encoding.html#varints I would of course prefer to recommend thrift, but they don't yet have variable sized ints. Just checked the code (CodedInputStream.java). The protocol buffer VInt (or VLong) is pretty interesting. They first transform the integer through ZigZag encoding, which essentially transform the long into leading 000+[n]+[sign]. They then encode the n+1 bits using ceiling((n+1)/7) bytes (in little-endian style). So effectively, 1B can represent -64 to 63, 2B: -8K to 8K-1, 3B: -1M to 1M-1, 4B: -128M to 128M. Comparing to my encoding scheme, I basically traded off some 2B encoding space for expanded 1B coverage. Additionally, protocol buffer's decoding requires you to read byte after byte, while both WritableUtils and my VLong can detect the length of the whole encoding after the first byte.
More of a coding style question, why Utils and TFile are full of static inner classes? Shouldn't be better to have them a separate java files in a subpackage?
The static inner classes for TFile are very tightly related to TFile.
But various static classes in Utils are indeed for quite disparate purposes. They are currently lumped together as some sort of implementation details. We can certainly regroup them into independent modules. Are there any opposing concerns for having too many small modules? > There is little to be shared between the two except for the buffer and count definition.
If you instead extended ByteArrayOutputStream, BoundedByteArrayOutputStream would only need to override the two write() methods, rather than implement all of the methods. > The new VLong format enlarges the range of integers that can be encoded with 2-4 bytes [ ... ] Then it should be named something different. Why support negative values at all if you don't use them? Lucene also defines a VInt format that might be considered. Personally, I'd prefer Hadoop used a single VInt format and I don't think it is worth defining yet another VInt and String format for TFile is wise. > we did not directly use CompressionCodecFactory is because CompressionCodecFactory.getCodec() expects a path Then CompressionCodecFactory should be extended, rather than duplicated. > If you instead extended ByteArrayOutputStream, BoundedByteArrayOutputStream would only need to override the two write() methods, rather than implement all of the methods.
[Hong] This class is optimized for TFile key appending, The whole implementation only consists of the two writes() and needs nothing else from BufferedInputStream. > Then it should be named something different. Why support negative values at all if you don't use them? Lucene also defines a VInt format that might be considered. Personally, I'd prefer Hadoop used a single VInt format and I don't think it is worth defining yet another VInt and String format for TFile is wise. > Then CompressionCodecFactory should be extended, rather than duplicated. Looking at the specification document, I see the major stated goals are (1) language neutrality; (2) extensibility, and (3) compatibility. I assume these are relative to SequenceFile.
Langauge neutrality without an implementation in another language seems a risky claim. SequenceFile's only language dependence is in the naming of key and value classes, but implementations of these classes are not required to process a SequenceFile. SequenceFile, like TFile, lacks implementations in other languages, so I don't yet see a clear advantage there. (2) and (3) are very related. SequenceFile has proven extensible and back-compatible. Many features have been added without breaking back-compatibility. I don't see a qualitative advantage here to the TFile format. Perhaps you should include a section specifically addressing the advantages of TFile over SequenceFile, how they are achieved and how they can be measured. I suspect there may be other unstated goals in TFile. The case for TFile should be clearly made, as it adds a lot of code to Hadoop that must now be supported. If it has demonstrable advantages to SequenceFile and the case can be made that we will be able to retire SequenceFile after it is added, then TFile should go forward. Or if it is significantly simpler than SequenceFile while providing the same features, that might make the case that it will be easier to reimplement in other languages. But if it is equivalently complex and supports more-or-less the same features then it only adds baggage to the project. > The whole implementation only consists of the two writes()
reset(), size(), close() and much of the ctor could be shared with ByteArrayOutputStream. > We will consider refactoring based on your suggestions later. We generally don't want to commit duplicated logic with a promise to clean it up later. If hadoop will now support at least TFiles and SequenceFiles, is there any thought around an Interface that TFile and SequenceFile and other Files or other self describing files or files with user defined metadata would implement?
InputStreams are buffered and also isn't the case you are optimizing for mainly the 2 byte case in which case this doesn' t help? Even if you don't want to pull in serialization from thrift or protocol buffers, wouldn't using RecordIO's libraries make more sense to doug's point? I believe the primary advantages over SequenceFile are:
I agree with Doug's comments on the Util class, which should be broken up into separate classes. I would either stick with Hadoop vints or use protocol buffer vints. If we use protocol buffer vints, they should be named differently. My preference would be to stick with the current Hadoop vints. Since a lot of the focus on TFile has been on making it performant, it would be nice to see a benchmark that uses a 100 byte Text as key and a 5k byte Text as value and benchmark SequenceFile relative to TFile. I'd suggest both with compression none and possibly lzo. Did initial benchmarks with 100B keys and 5KB values. We found some performance improvement opportunities and an LZO issue (
+1 on Util.java break-up.
Initially, we had parts scattered in separate files, then refactored those and put relevant parts into TFile, BCFile and ones which didn't fit well went to Util.java file so we can deal with that mix later. It's time to break up Util.java into a few files and hopefully get rid of some baggage and find a sweet spot. I suspect it may never be as sweet as we want sorry # of columns should have been # of rows, so it is already included. Since the types are binary, there are clearly no columns.
Adding to Owen's comments.
With regard to the questions why we need so many lines for the implementation, I did a quick run-down of the code:
So that brings the actual TFile-specific code to 1200 lines. Among them,
If you really want this as a top-level goal, I really think you could define the TFileHeaders as a struct in something like protocol buffers or Thrift or even RecordIO (which would at least give you C) and then no C++, Python, Perl, ... implementor ever has to worry about data format. This is probably a day or two of work and in the long run should really pay off. If one didn't need to worry about format and ordering of headers, how much easier is it to implement the read side of TFiles which is often the first thing you want.
this is also aided greatly by using thrift/protobuffers/record io since they can parse newer versions of the same struct and just ignore unknown fields Stupid question, but if I am not really using the key field (ie just storing an opaque row), does the TFile incur anymore overhead for all the empty keys then does a SequenceFile?
Yes, I agree that using Protocol Buffer or Thrift for headers (rather, tails) is probably a good way to help achieve language neutrality and compatibility, and even shrink the # of lines. I will take a closer look at them.
Each empty key would take one byte (length=0) to represet. So it does have some overhead. Update with refactoring and performance improvements.
Utils.java is very light now.
memcmp is updated to use Hadoop's WritableComparator.compareBytes(..). We plan to replace VInt, VLong with alternative options soon. A preliminary study on performance comparison between TFile and SequenceFile.
Settings:
private interface KVAppendable { public void append(BytesWritable key, BytesWritable value) throws IOException; public void close() throws IOException; } private interface KVReadable { public byte[] getKey(); public byte[] getValue(); public int getKeyLength(); public int getValueLength(); public boolean next() throws IOException; public void close() throws IOException; } Both interfaces allow for efficient implementations by either TFile or SequenceFile to avoid any object creations or buffer copying to conform to the interface. Some finer details:
Finally, the results:
Things to notice:
The above results are still preliminary. No YourKit profiling is done on TFile side. And results could vary for different settings - different key value lengths, compression ratios, underlying I/O speed, etc. > Used SequenceFile and TFile to implement two common interfaces [ ... ]
This calls out the need, as we add TFile, to have a common API for container files. This would, e.g., reduce code duplication in input and output formats for mapreduce. This is related to HADOOP-4065. I took a look at the new PDF doc (Haven't looked at the patch yet – excuse me if my questions are answered therein)
We would like to leverage TFile in hbase. memcmp-sort of keys won't work for us. Will it be hard to add support for another comparator? On page 8., the awkward-looking for-loop at the head of the page with its isCursorAtEnd and advanceCursor has some justification in practicality, I presume. otherwise why not use the hasNext/next Iterator common (java) idiom? On the performance numbers above, how about adding in test of random accesses into TFiles/SequenceFiles? (If you need a tool to to test random accessing, the MapFilePerformanceEvaluation tool in hbase-test.jar might help: http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/test/org/apache/hadoop/hbase/MapFilePerformanceEvaluation.java?view=markup hbase depends on the MapFile#getClosest (and MapFile#getClosestBefore): i.e. "Finds the record that is the closest match to the specified key." returning either an exact match as TFile#locate does or the next key (before or after dependent on provided parameters). TFile does not expose its index so this facility would be hard to build on TFile even in a subclass.
In Data Model/Operations (page 2) it says: "A TFile is not accessible for read while it is being created." While I can understand why that is reasonable for a sorted TFile since the index is written last, for a TFile that is unsorted (so it behaves as a SequenceFile) it should be possible to open a Reader on the TFile to read up to the last sync, provided you do a getFileStatus to get the "current length". Is this not the case?
If TFile does not support sync, then deprecating SequenceFile (as has been discussed above) would be a problem for HBase. I'd also like to see a comparison between TFile's random access performance vs MapFile's random access performance. Other comments on TFile.java:
+ How about defines for at least the common compression types and comparator name(s) at least for the common case where TFile is used by java? To add support for alternate comparators and for exposing the index at least to subclasses, should we add a patch atop your patch or just wait till whats here gets committed? It looks like I could do an in-memory TFile if I wanted since I provide the stream? Is that so? If so, thats sweet!
Yes, we plan to add this feature later - at least for Java. But we may restrict TFiles with such kind of comparators being accessed by Java only.
Yes, the original consideration behind it is because the Java Iterator interface always park the cursor on the entry that is already read, and you use next() to move the cursor to the next and fetch the result atomically. On the other hand, TFile scanner separates the cursor movement and data access (because we have two ways of moving cursor: advanceCusrosr() and. seek()), so next() does not make sense here. [ Note that the idiom is close to the iterator concept in C++ STL design, you first get a begin and end iterator from a container, then you can do for_each(begin, end, Op). And advanceCursor() corresponds to ++iter, and isCursorAtEnd corresponds to (iter == end). ] In terms why we do not get key and value in one call, this is because we want to allow people to first get the key, then decide whether to read the value or not (considering the application of doing an inner join). But conceivably, we can provide various convenience utility methods to get both key and value in one shot (just as we did for append). bq On the performance numbers above, how about adding in test of random accesses into TFiles/SequenceFiles?
The reason we did not support this is because TFile.Reader.Scanner is intrinsically a forward iterator (borrowing STL C++ concept, that means, it only supports ++, not --). So to support the operation of looking for the key that is the largest key smaller or equal to the search key, you may need to decomress a block twice. We can discuss in detail of your usage case and see if there may be some work-around of it. Note that this is also an issue with the currently implementation of scanner, which requires you to seek twice in the block (first in TFile.Reader.locate(), and then in the constructor of Scanner). This is not an issue in using TFile with MapReduce, because locate() and the construction of Scanner are in different processes (one in JobClient, one in Mapper). But for direct random access to TFile, we need to avoid such a problem (by providing a separate lookup call that returns the key and value stream instead of Location object).
TFile is not meant to be used in such situation. Attempting to read a TFile that is being written out will result in failure because it misses the various signatures and data structures stored at the end of the file. Not getting key+value in the one go is a nice improvement.
-1 on decompressing a block twice if it can be avoided. Regards discussing our use case, our need is plain. Our catalog table keeps a list of table regions (Catalog table is basically made from MapFiles). Regions are denoted by their start and end keys. To update a particular row, client asks the catalog table for the region that contains the key to update. Currently we depend on MapFile#getClosest (with the before flag set) to do this.
What I mean is, currently TFile does not directly support random lookup. The only way for you to perform this is by doing the following:
You may just index the region by the endKey (and keep startKey as an attribute if you want to). Then ceiling() will give you the right region.
Agree.
Multiple blocks, raw size == compressed size.
That is a usage case I have not thought about. A (slightly) less performant way I can recommend is to write your own key appender and value appender classes as filter stream on top of the key/value appending streams returned by TFile, and add customized actions in close() (before/after calling close() on the down stream).
Hmm, not clear how intercepting advanceCursorInBlock may help. Would it be suffice for you to know a seek() call moves forward or backward by how many <key, value> pairs?
I'd say let's wait it gets committed. Also, to keep in line with the original design objective, language-specific customized comparators should have string names like "jclass:path/to/java/package/ClassName", or
Guess so. (We haven't thought about this usage though.) Oh, I see what you are saying about "not directly supporting random lookup". I'd say thats a bit of a hole in TFile, especially if you want to replace MapFile (MapFile.Reader.get(key)).
On "indexing the region by the endKey", pardon me, I'm not sure I follow. Currently index is block-based, not key-based IIUC so can I even make an index that has all keys? Or, can you make an index that is key-based? (Even if I could index all keys, if key/values are small, might make for a big index so might need something like the MapFile interval). When you say attribute, do you mean attribute of the key? Or something else. Thanks for entertaining my questions Hong. Any remarks on my questions in 'stack - 23/Sep/08 02:26 PM'? Thanks. > Hong Tang - 23/Sep/08 04:34 PM
> What I mean is, currently TFile does not directly support random lookup. The only way for you to perform > this is by doing the following: -1 If Tfile does not directly support random lookup then it is hardly a replacement for MapFile. jim,
Tfile as per this jira is not supposed to be a replacement for MapFile. We can create another jira wherein we can implement an interface that allows random fast lookups which would be easy to do. For now our intention is to get Tfile in as per the specification posted
Sorry for the confusion. My comment is in response to your question wrt whether TFile can support something similar to MapFile's getClosest() call. The answer to that question is that we cannot implement such semantics efficiently because the API would require a bidirectional iterator and the underlying decompression stream is not so. My understanding of your usage case is that you currently have a MapFile with key being <region startKey> value may contain <region endKey, ...>. Given a client key, you perform getCloest(before==true) to get to the right region entry in the MapFile. To support the usage case in TFile, you may use <region endKey> as TFile key, and <region startKey, ...> as the value of TFile. Then TFile.Reader.ceiling(clientKey) will get you to the right entry. @Hong. Yeah. You got all my questions. Here's some follow-on comments.
Would be great if you could figure some way of avoiding double-decompress doing a single random access. Yeah on name of alternate comparator, etc. On "indexing the region by the endKey", pardon me. I thought ceiling a Scanner-only method. I see now what you are suggesting and that could almost work (leaving aside need to rewrite a bunch of ornery hbase code to make the change). But what if the following in TFile.Reader: private final MetaBlockTFileDataIndex metaTFileDataIndex; private final MetaBlockTFileMeta metaTFileMeta; ...had protected accessors so were accessible by a subclass? Then a subclass could do its own implementation of ceiling to return the key closest before rather than closest after as currently implemented? To do this, I'd replicate the body of ceiling – would need the accessors so I could do the following call in my subclass: int blkIndex = metaTFileDataIndex.lowerBound(inKey);
and if searchInBlock was also protected instead of private, a subclass could ask it to find the equal to or before rather than equal to or after? > this jira is not supposed to be a replacement for MapFile
That's unfortunate. I thought that was the point of including indexes in the file, rather than in a side file. I wonder whether tfile should start out in contrib until it is more full-featured? Without support for java comparators or random access it is not yet a replacement for SequenceFile. It also doesn't yet have any inputformats, so it cannot be used from mapreduce. Nor does it yet have bindings for other programming languages. So my preference is that, until tfile is proven to be of general utility to Hadoop applications, it should live in contrib. We don't want code in core that's not both widely usable and actually used. Doug Cutting - 24/Sep/08 10:04 AM
>> this jira is not supposed to be a replacement for MapFile > > That's unfortunate. I thought that was the point of including indexes in the file, rather than in a side file. >Owen O'Malley - 26/Apr/08 09:31 AM Ok, I guess my misunderstanding was due to Owen's comment. Doug Cutting - 24/Sep/08 10:04 AM +1 Since TFile is not a replacement for MapFile and does not support reading up to the latest sync while writing, we probably cannot use it for HBase. Thinking a bit more about this, I'm not sure contrib is right either. Since portability and support for mapreduce are stated goals, we should perhaps not commit this at all until it implements those goals. I'd like to see an inputformat, an outputformat and c++ APIs before we commit this. I'd also very much like to see support for comparators and random access. We do not want code committed that's not yet useful--that leads to bloat and sets a bad precedent for the project. If language-independence is a primary goal then it should be addressed up-front. The implementation might also be much simpler if two implementations must be developed and maintained.
I think that a reasonable path to get the committed are to make it useful to map/reduce. That would require, precisely:
1. Support for user comparators. 2. An ObjectFile class that extends (or wraps) TFile and uses the serialization libraries to provide objects. 3. {In,Out}putFormats that use ObjectFile. 4. A testcase that has a map/reduce job that reads and writes ObjectFiles. The other option is holding off until it includes a table framework.
I don't think it needs to include a table framework, but it does need to support random I/O and comparators at a minimum.
To Owen's list I'd suggest adding amenable random-access and access to the block index by subclasses.
@Jim. MapFile is ill-suited to hbase use. It needs to be replaced. Discussions above are about looking at TFile as possible foundation for replacement. For hbase log files, where we need append, we can just keep on with the SequenceFile we use now. @Hong .bq ...is to write your own key appender and value appender classes... This sounds fine for creating auxillary index information. I then suggested advanceCursorInBlock needed to be made accessible so I could then exploit the auxillary index at read time but on study, I see this is the wrong place. Where would you suggest I plugin to exploit my extra index information at Read time? Sorry if this need seems exotic but I think we can get away with casting this need under the 'Extensibility' TFile Design Principal. In our application, keys are row/column/timestamp. If millions of columns in a row and we want to skip to the next row, we can't next-next-next through the keys. It'll be too slow. We need to skip ahead to the new row. Block index won't help in this regard.
Yes, it sounds reasonable to change the various indices in TFile as protected instead of private. Just curiously, would your auxiilary index remember how many records start with the same row-key? So that you may want to take advantage of this to quickly advance? If true, a better way than opening on advanceCursoInBlock() is to provide an advanceCursor I'm all for putting this in contrib. Let's hold the bar relatively low and wait until it proves itself to graduate it elsewhere.
We could record the start of the first duplicate of a block's leading key earlier in the file. This might allow a relatively efficient skip to the start of a block of duplicate keys that cross a block boundary. If you just store the distance in uncompressed bytes, you can probably sort the rest out.
Since this includes a file format, won't that mean we won't then be able to make any changes to the format or if we have to make changes, we will have to version it/deal with "legacy" data right away? yup. There is that danger. Of course if it is in contrib, not making
strong guarantees seems fair. We plan to use this on lots of data, so if we make format changes, we'll need to provide a solution to upgrades. The format should support version checking on day one. 1) Currently there are two versions one for TFile API and other for BCFile API. Version consists of a Major and a Minor.
TFile/BCFile API versions are stored when a new file is written. Later when we read a file, API versions used for reading are compared to versions from file. Currently we only check Major and throw an exception if there is a mismatch in Major. We can add more compatibility logic to this code when necessary. 2) There are some compatibility ideas incorporates in the design (see Page 11 of attached spec for more details). @Stack
We come up the following proposal to address the random access API issue: we would move all seek() related functionalities into Scanner, and let Reader simply become the starting point to create Scanners plus some misc functionalities. To allow applications to create scanners from reader, we also let reader to implement the Scanner interface (internally, it would lazily instantiate a Scanner whenever those Scanner functions are called, and delegate the call to the internal scanner). This has the advantage of simplifying the life of applications that want to access a TFile in single-thread fashion. The following is the detailed API: Scanner.java /** * The TFile Scanner. The Scanner has an implicit cursor, which either points to * a valid row in the scan range, or to a special "end" location. There are * three kinds of operations supported by a Scanner: * <ol> * <li>Query about locations. Locations are represented by row IDs. * <ul> * <li><code>long begin()</code> returns the row ID for the first row. * <li><code>long end()</code> returns the row ID of the last row plus one. * <li><code>long current()</code> returns the current Row ID. * <li><code>boolean atEnd()</code> test whether the cursor is at the end. * </ul> * <li>Move the cursor. * <ul> * <li><code>void seek(long rowID)</code>. Go to the specific row as indicated * by the input row ID. * <li><code>void lowerBound(byte[] key)</code> (and its variant). Go to the * first row whose key is greater than or equal to the input key. * <li><code>void upperBound(byte[] key)</code> (and its variant). Go to the * first row whose key is greater than the input key. * <li><code>boolean seek(byte[] key)</code> (and its variant). Similar to * <code>lowerBound</code>, except that it returns a boolean to indicate whether * we find an exact match of the key. * <li><code>boolean advance()</code> advance to next row. An efficient * implementation of <code>seek(current()+1)</code> that returns a boolean to * indicate whether the cursor actually moves. * <li><code>void rewind()</code>. Equivalent to <code>seek(begin())</code> * </ul> * <li>Retrieve data. * <ul> * <li><code>DataInputStream getKeyStream()</code>. Return the key stream. * <li><code>int getKeylength()</code>. Return the key stream. * <li><code>DataInputStream getValueStream()</code>. Return the value stream. * <li><code>int getValueLength()</code>. Return the key stream. * <li><code>int getKey(byte[] buffer, int offset)</code>. Convenience function, * get key. * <li><code>int getValue(byte[] buffer, int offset)</code>. Convenience * function, get value. * <li><code>void get(BytesWritable key, BytesWritable value)</code>. * Convenience function, get key and value in one shot. * <li><code>DataInputStream find(byte[] key)</code> (and its variants). * Convenience function, same as <code> * if (seek(key)) { * return getValueStream(); * } * return null; * </code> * <li><code>boolean find(byte[] key, BytesWritable value)</code> (and its * variants). Convenience function, same as <code> * if (seek(key)) { * getValue(value); * return true; * } * return false; * </code> * <li><code>boolean isValueLengthKnown()</code>. Test whether we know the value * length. * </ul> * </ol> */ public interface Scanner extends Closeable { /** * Get the beginning row's RowID. * * @return RowID for the first row. */ long begin(); /** * Get the last row's RowID plus 1. * * @return RowID for the last row + 1. */ long end(); /** * Get the current row's RowID. * * @return RowID for the current row. */ long current(); /** * Test whether we reach the end of scan range. * * @return (current()==end()); */ boolean atEnd(); /** * Move the cursor to the first row whose key is greater than or equal to the * input key. Equivalent to <code>lowerBound(key, 0, key.length)</code>. * * @param key * input key. * @throws IOException */ void lowerBound(byte[] key) throws IOException; /** * Move the cursor to the first row whose key is greater than or equal to the * input key. * * @param key * input key. * @param offset * offset into the key buffer. * @param len * length of the key. * @throws IOException */ void lowerBound(byte[] key, int offset, int len) throws IOException; /** * Move the cursor to the first row whose key is greater than the input key. * Equivalent to <code>upperBound(key, 0, key.length)</code>. * * @param key * input key. * @throws IOException */ void upperBound(byte[] key) throws IOException; /** * Move the cursor to the first row whose key is greater than the input key. * * @param key * input key. * @param offset * offset into the key buffer. * @param len * length of the key. * @throws IOException */ void upperBound(byte[] key, int offset, int len) throws IOException; /** * Seek to a particular key. Equivalent to <code>lowerBound(key)</code>, * except that it returns true if we find an exact match. * * @param key * input key. * @return whether we find an exact match. * @throws IOException */ boolean seek(byte[] key) throws IOException; /** * Seek to a particular key. Equivalent to * <code>lowerBound(key, offset, len)</code>, except that it returns true if * we find an exact match. * * @param key * input key. * @param offset * offset into the key buffer. * @param len * length of the key. * @return whether we find an exact match. * @throws IOException */ boolean seek(byte[] key, int offset, int len) throws IOException; /** * Seek to a particular RowID. * * @param rowID * @throws IOException */ void seek(long rowID) throws IOException; /** * Advance the cursor to the next row. An efficient implementation of * <code>seek(current()+1)</code>. And it returns a boolean indicating whether * the cursor actually moves. * * @return whether the cursor actually advances. * @throws IOException */ boolean advance() throws IOException; /** * Move the cursor to the first row. Equivalent to <code>seek(begin())</code>. */ void rewind() throws IOException; /** * Streaming access to the key. <code>atEnd()</code> must be tested false. * * @return The input stream. * @throws IOException */ DataInputStream getKeyStream() throws IOException; /** * Get the length of the key. <code>atEnd()</code> must be tested false. * * @return the length of the key. */ int getKeyLength(); /** * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested * false. Synonymous to <code>getKey(key, 0)</code>. * * @param key * The buffer supplied by user. The length of the buffer must not be * shorter than the key length. * @return The length of the key. * * @throws IOException */ int getKey(byte[] key) throws IOException; /** * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested * false. * * @param key * The buffer supplied by user. * @param offset * The starting offset of the user buffer where we should copy the * key into. Requiring the following condition to hold: * <code>getKeyLength() + offset <= key.length</code>. * @return The length of the key. * * @throws IOException */ int getKey(byte[] key, int offset) throws IOException; /** * Copy the key into user supplied buffer. <code>atEnd()</code> must be tested * false. * * @param key * The BytesWritable buffer supplied by user. * @return The length of the key. * * @throws IOException */ int getKey(BytesWritable key) throws IOException; /** * Streaming access to the value. <code>atEnd()</code> must be tested false. * * @return The input stream. * @throws IOException */ DataInputStream getValueStream() throws IOException; /** * Test whether we know the value length. <code>atEnd()</code> must be tested * false. * * @return Boolean indicate whether we know the length of the value. */ boolean isValueLengthKnown(); /** * Get the length of the value. <code>atEnd()</code> must be tested false. * * @return the length of the value. */ int getValueLength(); /** * Copy the value into user supplied buffer. <code>atEnd()</code> must be * tested false. Synonymous to <code>getValue(value, 0)</code>. * * @param value * The buffer supplied by user. The length of the buffer must not be * shorter than the value length. * @return The length of the value. * * @throws IOException */ int getValue(byte[] value) throws IOException; /** * Copy the value into user supplied buffer. <code>atEnd()</code> must be * tested false. * * @param value * The buffer supplied by user. * @param offset * The starting offset of the user buffer where we should copy the * value into. Requiring the following condition to hold: * <code>valueLength + offset <= value.length</code>. * @return The length of the value. * * @throws IOException */ int getValue(byte[] value, int offset) throws IOException; /** * Copy the value into user supplied BytesWritable buffer. * <code>atEnd()</code> must be tested false. * * @param value * The BytesWritable buffer supplied by user. * @return The length of the value. * * @throws IOException */ int getValue(BytesWritable value) throws IOException; /** * Get both key and value in one shot. Synonymous to * <code>getKey(key); getValue(value)</code> * * @param key * The BytesWritable buffer to hold the key. * @param value * The BytesWritable buffer to hold the value. * @throws IOException */ void get(BytesWritable key, BytesWritable value) throws IOException; /** * Search based on a particular key. Equivalent to * <code>find(key, 0, key.length, value)</code> * * @param key * The input key buffer * @param value * The BytesWritable buffer to hold the value. * @return true we find value for the specific key. * @throws IOException */ boolean find(byte[] key, BytesWritable value) throws IOException; /** * Search based on a particular key. * * @param key * The input key * @param offset * The offset into the key buffer. * @param len * The length of the key. * @param value * The BytesWritable buffer to hold the value. * @return true we find value for the specific key. * @throws IOException */ boolean find(byte[] key, int offset, int len, BytesWritable value) throws IOException; } Reader.java /** * TFile Reader. Typical usage patterns: * <ul> * <li>Single-threaded application: Users may directly use the Reader object. * <li>Multi-threaded application: Create one scanner for each thread through * <code>Reader.createScanner()</code>. To restrict a thread to access only a * region of the rows, use the reader to find the begin and end row RowID and * call <code>createScanner(begin, end)</code>. Note the end RowID is exclusive. * <li>MapReduce: The InputFormat would use Reader to find out the RowIDs * corresponding to split boundaries. Each mapper then uses reader to create a * scanner with the begin and end RowID. * </ul> */ public class Reader implements Scanner { /** * Constructor. * * @param path * Path to the TFile. * @param conf * Configuration object. * @throws IOException */ public Reader(Path path, Configuration conf) throws IOException { // TODO } /** * Constructor * * @param fsin * The seekable input stream containing data of a TFile. * @param length * The input stream length. * @param conf * Configuration object. * @throws IOException */ public Reader(FSDataInputStream fsin, long length, Configuration conf) throws IOException { // TODO: } /** * Is TFile sorted. * * @return whether the TFile is sorted. Key-based search in Scanner requires * the underlying TFile to be sorted. */ public boolean isSorted() { // TODO: return false; } /** * Get an input stream to read the Meta block. * * @param name * Name of the Meta Block. * @return The input stream for the Meta block. */ public DataInputStream getMetaBlock(String name) throws IOException { // TODO: return null; } /** * Get the name of the comparator. * * @return If TFile is not sorted, empty string will be returned. Otherwise, * the name of the comparator is returned. */ public String getComparatorName() { // TODO: return null; } /** * Get the RowID of the beginning row in the first compressed block whose byte * offset in the TFile is greater than or equal to the specified offset. * * @param offset * the user supplied offset. * @return the RowID of the row that fits the specified criteria; or * <code>end()</code> if no such entry exists. */ public long getRowIDNear(long offset) { // TODO: return 0; } /** * Get the total # of rows in TFile. * * @return total # of rows. */ public long getRowCount() { // TODO: return 0; } /** * Create a scanner that scans the whole TFile. Equivalent to * <code>createScanner(begin(), end())</code>. * * @return The scanner object. */ public Scanner createScanner() { // TODO: return null; } /** * Create a scanner that scans a specified portion of TFile. * * @param begin * The beginning RowID (inclusive). * @param end * The end RowID (exclusive). * @return The scanner object. */ public Scanner createScanner(long begin, long end) { // TODO: return null; } } This new Scanner API is still Comparator-free. It's easy to layer lexicographic-by-byte on top of a Comparator-based API, but not vice versa.
@Hong
So, to read a random key, I'd now do: TFile.Reader r = new TFile.Reader.... TFile.Reader.Scanner s = r.createScanner(); ... // Below would body of some 'get' method that took a byte array 'key' for an argument Writable w = new SomeWritable(); return s.find(key, w)? w: null; That'll work. This avoids the double-decompress possibility mentioned above because all seek is done in the Scanner now? Are rowids a new notion? Am I supposed to be able to go between rowid and a Location? Yes, Scanner is stateful: lowerBound() or upperBound() will get to the right point of the compression block, and keep the block steram open.
Sorry forgot to mention about the change from Location to RowID. One school of thoughts is that one cannot really do much with the Location object. So at some time of the point, we may need to expose the fact that you can locate by RowID anyways. At that point, the use of Location would be obsolete. So let's just expose it now to keep the API set tight. With RowID, you can easily implement some auxiliary index to remember the # of rows starting with the same "row key" in HBase, and you can do seek(current()+n) to efficiently skip those rows.
We are working on that as a separate task. The complete task list is as follows:
Expect more comments addressing these issues soon. @Stack,
Some work has been done based on comments we receive after our initial patch release, major feature additions include:
I think it may be a good idea for us to upload another patch and collect feedback. However, I have very limited time working on TFile now so I may not be able to answer questions and address issues promptly. Thanks for update Hong. If you get a minute, yeah, for sure put up new patch. (BTW, I notice this is most watched issue in JIRA database according to this: http://community.cloudera.com/reports/2/issues/
I will attach an updated TFile patch. This will remove hold-up for ObjectFile and it's I/O Format patches which needed new TFile code.
OK, Hung needs to make some changes and will attach a new patch. I will wait for it to attach ObjectFile and I/O Format patches.
Some preliminary random seek performance on TFile.
Settings:
Results:
2. A single-node DFS with 4 disks (JBO), >70% full.
3. A single-node DFS with 4 disks (RAID0), empty
Some observations:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Is this a format just for compressed sequence files, or for all sequence files?
Is this intended as a replacement for MapFile too?
I think some kind of a magic number header at the start files is good to have. That would also permit back-compatibility with SequenceFile in this case.
In the index, what is "first record idx" – is that the key or the ordinal position of the first entry?