HBASE-1249

Rearchitecting of server, client, API, key format, etc for 0.20

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels: None

      Description

      To discuss all the new and potential issues coming out of the change in key format (HBASE-1234): zero-copy reads, client binary protocol, update of API (HBASE-880), server optimizations, etc...

      1. HBASE-1249-GetQuery-v4.pdf
        49 kB
        Jonathan Gray
      2. HBASE-1249-StoreFile-v4.pdf
        11 kB
        Jonathan Gray
      3. HBASE-1249-GetQuery-v3.pdf
        50 kB
        Jonathan Gray
      4. HBASE-1249-Example-v2.pdf
        47 kB
        Jonathan Gray
      5. HBASE-1249-Example-v1.pdf
        47 kB
        Jonathan Gray
      6. HBASE-1249-GetQuery-v2.pdf
        54 kB
        Jonathan Gray
      7. HBASE-1249-GetQuery-v1.pdf
        54 kB
        Jonathan Gray
      8. HBASE-1249-StoreFile-v1.pdf
        34 kB
        Jonathan Gray

        Issue Links

          Activity

          Jonathan Gray added a comment -

          This issue was used for discussion. A new issue will be opened to turn these design documents into user documentation.

          Closing this issue as "no longer valid". This issue is being resolved as part of HBASE-1234 and HBASE-1304.

          stack added a comment -

          Erik. You need to update to TRUNK. Filters were changed so they took byte array, offset and length.

          Erik Holstad added a comment -

          @jgray
          Yeah, it is a little weird, cause when I pasted it, it looked really good in the window, but now it looks really bad.

          @Stack
          Not really sure what is considered to be very different; I do think that the difference is quite big.
          The only methods that stay pretty much the same are reset() and filterColumn, which changes name to filterQualifier since we are trying to get rid of the column notion.
          The filterRowKey could be used as filterRow, but since we only take a byte[] as input we can't use the KeyValue from the scanners but have to do a copy for every check, which seems like
          an unnecessary thing to do, especially since the whole plan on the server side is to avoid copying.
          I'm not dying to rewrite the filters, but have started, cause I don't really see how we could use the old ones in a good way, but if you see a way for us to do that, I'm happy to not do the work.

          Jonathan Gray added a comment -

          I don't believe there's anything antithetical to KV with the current implementation. I'm not sure if we need to redo it or not, but at the least, the current interface needs to be extended to add a column qualifier check.

          stack added a comment -

          Regards Filter interface, it doesn't look that different from what we had before, not different enough to warrant redoing. Is current filter interface antithetical to KV?

          Jonathan Gray added a comment -

          Code looked okay in the e-mail but is unreadable on the issue. Attach as a patch or java file next time? Should probably file another issue for this.

          Can you use an enum instead of ints for return codes?
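          For reference, a return-code enum along the lines being suggested might look something like this (a hypothetical sketch, not committed code):

          // Hypothetical sketch only: an enum in place of the int return codes.
          public enum FilterReturnCode {
            DONE,     // stop looking at entries, the result can be returned now
            SKIP,     // do not include the current entry in the result
            INCLUDE   // include the current entry in the result
          }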

          Erik Holstad added a comment -

          Been working on a new FilterInterface that will work better with the new server side implementation, been thinking something like:

          /**
           * Interface to be used for server side filtering for gets and scans.
           * The cases of wanting to not look at any more entries in the current store
           * and the case of always wanting to look at the filter should be handled with
           * the return codes from the individual filters.
           */
          public interface FilterInterface extends Writable {

            final static int DONE = -1;
            final static int SKIP = 0;
            final static int INCLUDE = 1;

            /**
             * Resets the state of this filter.
             */
            public void reset();

            /**
             * Filters the current row to see if it should be included in the result or
             * not.
             * @param bytes
             * @param rowOffset
             * @param rowLength
             * @return -1 if you are done and can return the result, 0 when not to include
             * in result and 1 if this row should be included.
             */
            public int filterRow(byte [] bytes, int rowOffset, int rowLength);

            /**
             * Filters the current qualifier to see if it should be included in the result
             * or not.
             * @param bytes
             * @param qfOffset
             * @param qfLength
             * @return -1 if you are done and can return the result, 0 when not to include
             * in result and 1 if this qualifier should be included.
             */
            public int filterQualifier(byte [] bytes, int qfOffset, int qfLength);

            /**
             * Filters the current value to see if it should be included in the result or
             * not.
             * @param bytes
             * @param valueOffset
             * @param valueLength
             * @return -1 if you are done and can return the result, 0 when not to include
             * in result and 1 if this value should be included.
             */
            public int filterValue(byte [] bytes, int valueOffset, int valueLength);
          }

          The way I see it working is that while comparing the different parts of the KeyValue you do the matching filter check after every comparison, to be able to early out as soon as possible. An extra method for filterFamily can easily be added in the future if we choose to support multiple families in the same store.
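          To make the intended call pattern concrete, here is a minimal hypothetical sketch of a server-side loop consulting such a filter while walking already-sorted entries. The return codes mirror the proposal above, but the Entry class, the local Filter interface and the apply() method are placeholders, not the actual scanner code.

          // Hypothetical sketch only: shows how DONE / SKIP / INCLUDE can drive an
          // early-out loop, checking each part of the key right after it is reached.
          import java.util.ArrayList;
          import java.util.List;

          public class FilterLoopSketch {

            static final int DONE = -1;
            static final int SKIP = 0;
            static final int INCLUDE = 1;

            // Stand-in for a KeyValue: one byte[] plus offsets/lengths into it.
            static class Entry {
              final byte[] bytes;
              final int rowOffset, rowLength, qfOffset, qfLength, valueOffset, valueLength;
              Entry(byte[] bytes, int ro, int rl, int qo, int ql, int vo, int vl) {
                this.bytes = bytes;
                this.rowOffset = ro; this.rowLength = rl;
                this.qfOffset = qo; this.qfLength = ql;
                this.valueOffset = vo; this.valueLength = vl;
              }
            }

            // Local stand-in for the proposed FilterInterface (minus Writable).
            interface Filter {
              int filterRow(byte[] bytes, int rowOffset, int rowLength);
              int filterQualifier(byte[] bytes, int qfOffset, int qfLength);
              int filterValue(byte[] bytes, int valueOffset, int valueLength);
            }

            static List<Entry> apply(List<Entry> entries, Filter f) {
              List<Entry> result = new ArrayList<Entry>();
              for (Entry e : entries) {
                int code = f.filterRow(e.bytes, e.rowOffset, e.rowLength);
                if (code == DONE) break;      // nothing further can match: early out
                if (code == SKIP) continue;   // this row is not wanted

                code = f.filterQualifier(e.bytes, e.qfOffset, e.qfLength);
                if (code == DONE) break;
                if (code == SKIP) continue;

                code = f.filterValue(e.bytes, e.valueOffset, e.valueLength);
                if (code == DONE) break;
                if (code == INCLUDE) result.add(e);
              }
              return result;
            }
          }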

          Erik Holstad added a comment -

          Just for the people that are still a little bit sceptical that an algorithmic change can make a great improvement in speed, you can run this simple timing test for different m and n and have a look at the result.

          public void timingTester() {
            long start = 0L;
            long stop = 0L;
            int m = 100;
            int n = 100;
            int size = m * n;
            int ret = 0;

            // old: m * n iterations
            start = System.nanoTime();
            for (int i = 0; i < size; i++) {
              ret = (2 * i) - i; // or do something else
            }
            stop = System.nanoTime();
            System.out.println("old timer " + (stop - start));

            // new: m + n iterations
            size = m + n;
            start = System.nanoTime();
            for (int i = 0; i < size; i++) {
              ret = (2 * i) - i;
            }
            stop = System.nanoTime();
            System.out.println("new timer " + (stop - start));
          }

          Erik Holstad added a comment -

          @Stack
          I don't really know exactly where the big improvement comes from compared to the old code, but I don't think that it is that weird that we see a big improvement even though the x factor might not be totally true. And I'm not going to say that I haven't missed anything in the new code, cause that would just be really silly, but I think at least we have some good numbers to work with. But since I don't have a working patch just yet, these numbers should be taken more as guidelines than actual numbers to compare.

          Erik Holstad added a comment -

          With the old code we used to put all the calls for multiple columns or all
          columns for a family into the same get call, getFull, even though
          they are doing very different things. So what I would like to see on the server
          side are 2 different pieces of code that handle the 2 queries differently.

          Let's call these calls getFamilies and getColumns for simplicity, where
          getFamilies is the call you make when you want to get all the columns for a
          family and getColumns when you specify some columns that you want to get.

          In the new code, for the getColumns call I remove the elements from the list
          that already have the number of versions asked for, so the getList becomes
          smaller and smaller.

          For getFamilies the situation is different since you start with an empty getList
          and build it as you go.

          The problem with the getFamilies query is that you have to keep the numbers of
          versions around even though you have gotten them all.

          The old way of doing things was to build a map that had an entry for every
          KeyValue fetched that mapped to the number of versions fetched.

          The new approach is to keep a newGet list where you insert the new things from
          the current storefile and then before going into the next storefile merge these
          2 lists together, the same way as you do with deletes.
          This is just one small detail in the new implementation, but I thought it was
          important to bring it up since it brings some extra complexity on top of the
          otherwise pretty simple getColumns code, which most of the getFamilies code
          is taken from.

          Some timing results:
          These tests were only done on the server side and don't include any time for
          sending data between server and client. I basically created an HRegion in the
          test, inserted data into it and queried it with 2 different calls, the old
          getFull() and the semi-old getFamilies(). The tests were done by inserting
          data at 3 different times with a flush in between, so you end up with data in
          memcache and data in 2 storefiles. The region only has one family and the
          qualifiers are numbers ranging from 1-n, so you have 3 versions of every insert.
          The get query was to get 3 versions of all the data, so you need to go through
          all the storefiles with no way to early out, which is always the case when it
          comes down to a query like this.

          Results(ns):
          n = 10
          new timer 483525
          old timer 2619048

          n = 100
          new timer 3237894
          old timer 15307717

          n = 1000
          new timer 22449940
          old timer 435134261

          These tests don't include any deletes. In the case of deletes in the system I
          think that the difference will be even greater.

          To be clear, the parts of the new code where extra testing and time need to be
          spent are the following areas:
          merging of deletes, merging of Gets and handling of versions in the case of a
          "GetFamilies" call.

          I did find some bugs while running this timing test, so I will keep working
          towards a good patch where most of that stuff is taken care of.

          stack added a comment -

          Thanks for the exposition. I like it for the case of gets of many columns into a table with many deletes.

          I must say though that I'm surprised for the case described above – few entries with two families and no deletes – that the difference is even discernible, never mind that it's 2x. Are you sure it's because of the above described changes and not something else about the new code path? Perhaps something is being skipped or there is a dumb bug in the old code?

          Erik Holstad added a comment -

          @Stack

          1. My sense is that the delete check is not costly, not in the
          scheme of things.

          I would say that it is costly in cases where you have many deletes. I
          also see no downside or big areas where the new design would perform any
          worse. We also need to think about memory allocation.

          2. There is caching of lengths being done in many of those cases,
          right?

          I'm not sure exactly what you mean, but yes there may be. But without it
          being clear, it seems we would want to ensure that this very important (for
          performance) code path is optimally reusing them.

          One of the problems with Java is that you can't return multiple things at
          the same time in an easy way. For me, trying to make the server
          implementation faster, I'd rather have duplication of code than make it
          slower.

          I totally agree that we should try to keep duplication to a minimum, but I'd
          rather focus on speed first and then work on minimizing the code base,
          which might be the wrong approach. When it comes to the part about duplicating
          complex code I totally agree, that makes it more prone to errors and
          should definitely be avoided when possible.

          The code that I duplicate is mostly the compare code for 2 KeyValues, so
          it is pretty straightforward, not like the more complex merge codes which

          Jonathan Gray added a comment -

          (this was written together with erik holstad)

          @Stack

          We don't have any profiling stats or end-to-end numbers at the moment, but we do have some early naive timing tests that compare the 2 implementations, and already for very few inserts, like 10 to 15 entries with no deletes in 2 storefiles + memcache, I saw around a 2x improvement. These tests were only done on the server and I have not tested for bigger tables or from the client, since I only got the rpc to work the other day and we've done some reworks since then, but it looks promising.

          When we looked at the get code and how to make it faster we started to look at what structures were used and how many compares you have to do to get your result. Because to me the way to speed things up is to cut down the total number of compares that you need to do to get your result. Of course the total fetch time is made up of a number of causes, but on the server, outside of HFile/HDFS, where I think most time is spent is comparing entries to see if they should be included in the result or not (this decision is made by looking at the query to see if the entry was asked for, comparing for deletes, checking for ttl, maxVersions requested, timeRange, etc).

          So what I saw when looking at the old code is that for every new entry from the storefile you compare it to all entries in the get set, and then for every match between those you have to check to see if it has been deleted.

          Let k be the number of KeyValues we will look at. Let l be the number of columns asked for by the client.

          Old Way: We compare every entry from the storefile (k) with every entry from the getset (l) which will give you k * l compares in total.

          If the getset is sorted (storefile/memcache always is), then you are just doing a sorted merge between two sorted lists.

          New Way: We do a sorted merge between the entries in the storefile (k) with the entries in the sorted getset (l) which will give you k + l compares in total.

          The difference between k * l and k + l can be significant if you are matching more than one column.
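          As a minimal sketch of the k + l idea (placeholders only, not the actual HBase code; the element type and Comparator stand in for KeyValue and a qualifier comparator): because both lists are sorted, every comparison advances at least one of two cursors, so the walk is a plain sorted merge instead of a nested loop.

          // Hypothetical sketch: sorted-merge intersection of storefile entries with
          // the sorted list of requested columns, in at most k + l comparisons.
          import java.util.ArrayList;
          import java.util.Comparator;
          import java.util.List;

          public class SortedMergeSketch {

            public static <T> List<T> matches(List<T> storeEntries, List<T> getList,
                Comparator<T> cmp) {
              List<T> result = new ArrayList<T>();
              int i = 0, j = 0;
              while (i < storeEntries.size() && j < getList.size()) {
                int c = cmp.compare(storeEntries.get(i), getList.get(j));
                if (c == 0) {
                  result.add(storeEntries.get(i));
                  i++;              // keep j: more versions of this column may follow
                } else if (c < 0) {
                  i++;              // store entry sorts before the next asked-for column
                } else {
                  j++;              // asked-for column not present in this storefile
                }
              }
              return result;
            }
          }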

          Notes:

          • current plan is to build the sortedset behind the scenes on the client in the Get
          • this insertion sort will have a negligible impact on clients and no additional work will be required server-side
          • we will send the sortedset over the wire as a list. in that case, it will not have to be re-sorted.
            (this last optimization is important. right now we do a LOT of serializing a sorted tree, sending it over the wire, and then deserializing it. the deserializing of trees actually rebuilds the entire tree, thus reperforming the sort. using lists (which are sorted) will give us time and memory savings.)

          On the delete part, we have two things to deal with. Checking current KVs against the current DeleteSet, and adding newly seen deletes into the DeleteSet.

          Let k be the number of KeyValues we will look at. Let m be the number of deletes we have already seen. Let n be the number of new deletes we will see.

          Old way of checking current KVs against the current DeleteSet:

          • DeleteSet is a sortedSet, so it has log(m) inserts/gets/deletes/contains operations.
          • For each KV seen (k), we have a log(m) operation to see if it is deleted.
          • This will give you k * log(m) total comparisons.

          Old way of inserting newly seen delete KVs into current DeleteSet:

          • When we see a delete, we insert it to the deleteset in log(m). This has downsides.
          • We're adding deletes which we know will not be relevant in this storefile (increasing m increases subsequent delete checks so we have log(m + n) eventually).
          • We're doing work which we may not potentially ever have to do (merging newly seen delete with previous set). If we are going to early-out of this storefile for some reason, there would never have been a reason to pay anything but constant-time cost for deletes related to older storefiles which will not end up being opened.
          • Trees are bad! Their allocation of lots of small nodes is a disaster for GC. We should use lists everywhere we can.
          • This gives you n * log(m) total comparisons.

          The new approach is similar to how we now handle the get list. Deletes will be stored in two lists: a currentDeleteList and a newDeleteList.

          To start, the memcache has no deletes to look at. You will, however, add any deletes seen to the newDeleteList (in constant time, no significant extra work). Since memcache (and storefiles) are sorted, newDeleteList is sorted.

          When moving to the next storefile, you will merge newDeleteList with currentDeleteList.

          New way of handling deletes:

          • currentDeleteList is a sorted ArrayList of deletes. There is a DeleteFamily special-case that just stores a stamp (which implies its existence).
          • We do a sorted merge of KVs seen (k) with the currentDeleteList (m), for k + m total comparisons.
          • Each seen new delete is a O(1) append to the newDeleteList.
          • At the end of the storefile, you will then need to merge the delete lists, for m + n total comparisons.
          • This will give you (k + m) + (m + n) total comparisons for each storefile.

          The difference in total is: Old = (k + n) * log(m) vs New = (k + n) + 2m

          The algorithmic improvements seem clear. The replacement of Trees/SortedSets with (Sorted)Lists/ArrayLists will also be beneficial for GC/heap usage.

          Additional complexity in things like merging delete lists are not a big deal if we make thorough unit tests (which we have already done in that case).
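          A minimal sketch of the two-list approach described above (the element type, Comparator and names are placeholders; real delete matching also has to deal with delete types and timestamps, which this ignores):

          // Hypothetical sketch of the two-list delete handling: O(1) appends for
          // newly seen deletes, a sorted-merge check against older deletes, and one
          // list merge per storefile transition.
          import java.util.ArrayList;
          import java.util.Comparator;
          import java.util.List;

          public class DeleteTrackerSketch<T> {

            private final Comparator<T> cmp;
            // Deletes carried over from newer storefiles/memcache, kept sorted.
            private List<T> currentDeletes = new ArrayList<T>();
            // Deletes seen while walking the current storefile; appended in order,
            // so the list stays sorted without any extra work.
            private List<T> newDeletes = new ArrayList<T>();
            // Cursor for the sorted-merge style isDeleted() check.
            private int cursor = 0;

            public DeleteTrackerSketch(Comparator<T> cmp) {
              this.cmp = cmp;
            }

            // O(1) append for every delete encountered in the current storefile.
            public void addSeenDelete(T delete) {
              newDeletes.add(delete);
            }

            // Checks the KV against the older deletes only. Because the KVs arrive in
            // sorted order, the cursor only moves forward: k + m compares in total.
            public boolean isDeleted(T kv) {
              while (cursor < currentDeletes.size()
                  && cmp.compare(currentDeletes.get(cursor), kv) < 0) {
                cursor++;
              }
              return cursor < currentDeletes.size()
                  && cmp.compare(currentDeletes.get(cursor), kv) == 0;
            }

            // Called once when moving on to the next (older) storefile: merge the two
            // sorted lists in m + n compares and start over with an empty newDeletes.
            public void nextStoreFile() {
              List<T> merged = new ArrayList<T>(currentDeletes.size() + newDeletes.size());
              int i = 0, j = 0;
              while (i < currentDeletes.size() && j < newDeletes.size()) {
                if (cmp.compare(currentDeletes.get(i), newDeletes.get(j)) <= 0) {
                  merged.add(currentDeletes.get(i++));
                } else {
                  merged.add(newDeletes.get(j++));
                }
              }
              while (i < currentDeletes.size()) merged.add(currentDeletes.get(i++));
              while (j < newDeletes.size()) merged.add(newDeletes.get(j++));
              currentDeletes = merged;
              newDeletes = new ArrayList<T>();
              cursor = 0;
            }
          }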

          stack added a comment -

          @Holstad

          Do you have profiling stats that show us getting gains we'll even be able to notice?

          .bq You have your KeyValue iterator, data length k, your columns asked for, gets length l, and your deleteStructure, deletes length m.

          I don't follow the above and can you expand more on your exposition on why things will be faster? Looks like you've spent some time figuring it.

          1. My sense is that the delete check is not costly, not in the scheme of things.
          2. There is caching of lengths being done in many of those cases, right?
          3. I was going to say that I don't think this is expensive in the scheme of things but just looked at PE profiling and see it taking 2% of CPU... this and getRegion... and this is with the simple PE schema (one family).

          -1 on duplicated code. Let's work on that. Especially when it's complex code like this.

          You talked of a 2x speedup. How'd you make that out?

          Erik Holstad added a comment -

          So I was a little bit confused when it came to the number of compares for the old version yesterday; somehow I thought that the deletes structure was a hashset, but it is a sorted set.
          So the number of compares for getting becomes k*l + l*log(m) and for the new code k+l+m, which, if you just want to put some totally unreasonable test numbers in there just to get a feeling, with k=l=m=8 turns out to be
          8*8 + 8*3 = 64+24 = 88 vs 8+8+8 = 24.

          For the new code, like I said earlier, the most complex part is the merging of the lists and especially the merge of the deletes. On the other hand it is very easy to write unit tests for this to test it in isolation, which has also been done.

          Erik Holstad added a comment -

          @Stack
          + I totally agree that the naming of the compareTo(KeyValue) method is not the best; it should probably be something else, like you suggested, hasEnough or shouldAdd or something along those lines.

          + Why this new code is faster than the old one. There are a couple of things that will make this new code faster than the old one:
          You have your KeyValue iterator, data length k, your columns asked for, gets length l, and your deleteStructure, deletes length m.
          1. The way a get is done today is that for every data entry you compare it to all the columns asked for, and for every match you do a contains check on that data. This means that you will do k*l + l*something for the delete contains check. Since the deletes are stored in a sortedSet, every insert into it takes log(m).
          2. The way the KeyValue object is used in many places is to call getRowLength and then getColumnLength after each other; this means that you have to do all the calculations for the lengths and offsets multiple times.
          3. When you don't have a structure from the client that groups families together, you will have to look up the right store for every KeyValue.

          1. The new code doesn't have any complicated data structures, just ArrayList<KeyValue>, and the only operation done on these is add(KeyValue), so the time complexity compared to the old way is k+l+m, which is far fewer compares.
          2. In the new code I have moved all the compare methods into the actual code; this means that there will be more code that is duplicated, but I think that it is a better approach when we are going for speed. I don't recalculate any lengths or offsets if it is not absolutely necessary.

          But the biggest gain in time comes from not having complicated data structures on the server but rather keeping it simple. Of course there are other things that become more complicated, like merging the lists after every storefile, but there is no way around that as I see it. Don't think it gets much faster than doing a sorted merge.

          stack added a comment -

          @jon

          + If client-side needs to reflect server-implementation, then we're off going by current state of src (Server objective is all byte arrays, client is introducing classes that are carried over to the server and then made into byte arrays; e.g. Family).
          + I'm against two implementations, a 'slimmer' byte-based, and then a object-based one for more complex queries IF they do not resolve to same thing client side (otherwise, as I see it we are multiplying the number of client/server methods and reasoning in server)
          + Family:column has been reined in server-side with comparators that compare the family only portion of column – no new allocations.
          + I'm grand w/ lots of objects client-side but not wanton profligacy – should be elegant design
          + I wish I saw more KV in the current src; seems little going between client and server currently.
          + If we've gone past the current store file, and we have enough versions... that's how we figure when we can early-out, right? What other factor is there? The only one that might complicate things is the GetTop. For that, you need to count versions differently – not per column-family but in total – IIUC. There has to be a really good reason for doing a compare that is like, but just a little different from, what devs are used to.
          + I wonder why bother with NewDeletes? Just add them to the running deletes list? It's a lookup into a sorted set so extra numbers will make a very slight difference, not enough of a difference that it's worth instantiating 3x the lists and running combine as we move from one storefile to the next, I'd guess (Looks like I'm not getting something here in the way the test of whether a cell is deleted works – please help me understand if I'm missing something about how this system works.. the delete system).
          + OK w/ putting off filters to 0.21... using what we have now if we can.

          @Erik

          + Version handling inside the compareTo says to me that the method is not a compareTo at all... that it's a kind of hasEnough test. Might be nice having the version code all up inside the hasEnough method.

          Erik, how do you figure this code makes things faster? It's not plain to me how it does that.

          Erik Holstad added a comment -

          Extra comment on Jonathan's comment:
          About deletes.
          So when you enter a new file you walk down the lists, doing the sorted merge between the data in the store, the gets and the previous deletes. When a delete is found in the current file it is just added to the newDeletes list. Since all data is sorted, the inserts into the newDeletes list will be sorted too. When the current store is done, and if you need to go to the next one, you first merge the oldDeletes with the new ones and use those for checking in the next storefile.

          Same thing is done for the gets, in the case of a GetRow/GetFamilies call.

          Versions are just a different list that sits in the GetXServer. This list is handled differently depending on what kind of call it is. In case of a GetColumns call this list is getting smaller and smaller and for the other cases it gets bigger and bigger. This is because columns are removed from the column list in the case of a GetColumns call, so the versions list has to match.

          Erik Holstad added a comment -

          @Stack
          + TimeRange might be like the CellBounds, just a class with 2 timestamps, an upper and lower bound for the getFetch.
          + About the Family object, I think I need to do some testing, cause I didn't know that an object was that big when sending it over the wire, but we all want the same thing here, to make it as fast as possible, so if that turns out to be the case, we will absolutely change the format. And I guess if you wanted to send extra params along with the family and columns we can put them before. It just seems more complicated to have a non-object approach, but I will have a look at the sizes.
          + The version code is now handled inside of the compareTo(KeyValue), and if we lifted it out of there we would have to add some extra logic outside, nothing major. The biggest reason for not doing this is the fact that the GetRow/GetFamilies calls will have to do this compare too, even though they can never early out, so those calls are going to have to carry some extra weight.
          + GetXServer is different from GetX by the compareTo(KeyValue) method and some other smaller helper methods that are related to that one, but that is the only thing.
          + For now I only have the RowFilterInterface in my code, so that is the same for now.

          Jonathan Gray added a comment -

          + Is the Family object needed? The old array of arrays where columns were compound of family and qualifier would seem to be more compact?

          Yes, it is more compact. And insanely difficult to actually use and reason about. We have had a good amount of back and forth about whether to use more classes or not. This applies to 880 especially. My general argument is (client-side):

          • The HTable API needs to be smaller, it is very wide and it makes things complicated.
          • The HTable API does not reflect server-side implementation.
          • Dealing with family:column and binary is a nightmare. Delimiters generally suck.
          • We can retain a slimmer, direct byte[] based HTable API for ease-of-use, and then use classes for more complex/custom queries.
          • A hierarchical client API that maps to a server-side set of classes mirrors queries to implementation and makes it more clear to the user which things have a cost (like each family you add as a parameter is much different than adding an additional column parameter to a family).
          • Instantiating objects on the client-side for Gets is basically to be considered "free". Compared to the amount of comparators, utility functions, and allocation we currently have to deal with family:column byte[]s, there's no difference in performance on the client-side using nested objects and lists. Performance does/will come from getting rid of server-side instantiation/allocation (zero-copy reads, optimized Gets, not using TreeMap after TreeMap).
          • Methods of all sorts basically being a single serialized object, many times just lists of KeyValues, for RPC in the interest of moving towards a language agnostic client protocol.

          Should bring this conversation to IRC if people want to chime in... we definitely want others' input. This is a big decision, but the API is much more usable like this and server implementations are much easier to reason about.
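          For illustration only, a hierarchical client-side Get along the lines argued for above might look roughly like the sketch below; the class and method names are hypothetical, not the actual API being proposed.

          // Hypothetical sketch: a nested, per-family grouping instead of compound
          // family:qualifier byte[]s. Names are illustrative only.
          import java.util.ArrayList;
          import java.util.List;

          class FamilySketch {
            final byte[] family;
            final List<byte[]> qualifiers = new ArrayList<byte[]>();
            FamilySketch(byte[] family) { this.family = family; }
            FamilySketch addQualifier(byte[] qualifier) {
              // the design above would keep these sorted client-side before
              // serialization; this sketch just collects them
              qualifiers.add(qualifier);
              return this;
            }
          }

          class GetSketch {
            final byte[] row;
            final List<FamilySketch> families = new ArrayList<FamilySketch>();
            int maxVersions = 1;
            GetSketch(byte[] row) { this.row = row; }
            GetSketch addFamily(FamilySketch f) { families.add(f); return this; }
            GetSketch setMaxVersions(int n) { maxVersions = n; return this; }
          }

          class GetUsageExample {
            static GetSketch example() {
              // Each added family is a separate unit of work server-side; qualifiers
              // within a family are just entries in its list.
              return new GetSketch("row1".getBytes())
                  .addFamily(new FamilySketch("info".getBytes())
                      .addQualifier("name".getBytes())
                      .addQualifier("age".getBytes()))
                  .setMaxVersions(3);
            }
          }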

          + How do I read GetXServer? Is that the client-side GetX that has been deserialized Server-side? The GetXServer.compareTo will take into consideration the TimeRange? I think I like it. Why do I need result 2 and 3 out of the compareTo? Whats wrong with the compareTo working like any other comparator returning < 0, 0, or > 0? If 0, we add to the result. If > 0, we've gone past whatever our context, storefile or store, and then in the loop we just move on to the next storefile or store. Shouldn't be compareTo if returning different kind of results.

          No problem with renaming it. It does break the normal convention. It can't be -1, 0, 1 because it has four different things it needs to say. You say, if > 0, we've gone past the storefile or store. That means in every query we will always have to check every storefile. Many of our queries (GetColumns, GetTop at least) have early-outs where they do not need to look into any other storefiles. So this compare needs to be able to say, I'm done, return to the client now.

          + What about deletes? They are orthogonal to this compareTo test? They are a running list that we bring along with our results as we do currently? Looks like you have this thing called NewDeletes that GetX knows about?
          + How does your DeleteSet work? How will it delete with different types (e.g. what do you add to this Set? Deletes? If so, how you going to have the Put something is supposed to Delete match in the comparator? Currently I have a special comparator that ignores types... that won't be good enough if need to consider family, column and plain deletes).

          Best answered by Erik... My understanding is basically we keep deletes to the side. DeleteFamily's are special cased. Since deletes only apply to older storefiles, when reading a storefile we insert deletes to a newDeletes list and actually use the oldDeletes list to check if things are deleted.

          But the way things are processed is neat. It's a sorted merge down the oldDeletes list. So you do not have to check against a bunch of things, or do a (log n) treemap operation.

          + We're changing how filters work?

          I would like to. None of the code is done. I have no issue with delaying that to 0.21, but since things are being ripped apart I thought we might get it in. The biggest change outlined here is adding new language similar to the compareTo above... done, return now. It would allow for efficient limit/offset queries and other such things. As you say, in the current code it is the same. The other thing with filters is that it would be nice to be able to use dynamic classes. So we might just put off filter changes to 0.21 and give them more attention.

          stack added a comment -

          On filters, reading code, it looks like filters would be our old favorite interface, RowFilterInterface.... so I answered my own question.

          stack added a comment -

          Yes, I mean TimeRange (sorry).

          On the Family object, how many times would we have to repeat the family in the old byte [][] of columns before it cost more than creating a Family object with a List? I'm guessing we'd have to have a lot of entries in a Get before it'd cost less using Family objects (in general, just trying to keep down the number of items we deal with). Would the Family be passed over RPC?

          I can early out if compareTo returns > 0 – i.e. we've gone beyond – and we have enough versions? Do I need extra codes?

          Is GetXServer different from GetX on client?

          On filters, the docs suggest we're doing a new type of filter. I was asking if that's true.

          Erik Holstad added a comment -

          + Do you mean TimeRange, or are you really talking about TimeStamp?
          + Versions are handled differently depending on what kind of query it is, but basically there is a separate list that keeps track of how many versions have been fetched, and it lines up with the list of KeyValues fetched.
          + The Family object is not needed, but it makes a nice abstraction over the array of arrays, and in the case of using the family for a get you don't need to store the family name more than once, which is not the case with the old way. A nice thing about having a Family object is also that you don't need to look at the objects themselves to figure out which family they belong to, since the family is stored separately.
          + The GetXServer is a GetX that has been deserialized on the server, but it also includes the compareTo(KeyValue) method and some other get-specific methods that belong on the server side. Yes, the compareTo(KeyValue) method in the GetXServer class checks everything from row/column to ttl, TimeRange, Filter and versions; everything is handled in this method. The reason that there are two different return codes for next-storefile and Done is to be able to early out, so that you don't have to go to the next storefile. But you are right, it should probably be called something other than compareTo since it doesn't have the expected return codes.
          + So in the GetXServer class you have 6 different lists: gets, newGets, deletes, newDeletes, versions and newVersions. Not all of these are used for all types of gets, but basically how it works is that you build the new lists for the current storefile, and after every storefile you merge the two lists into one and put them into the non-new lists.
          + All the compare code that I'm using is currently in the GetXServer class; the only external compare that I'm currently using is Bytes.compare(). The reason that I've done it this way is that I don't want to recalculate the offsets and lengths when doing different compares. I don't have a DeleteSet, just a list of KeyValues that are deletes. The basic layout that we can't escape is that we have 3 lists: one with the KeyValues in the current memcache/storefile (data), one with the things that we are looking for (gets), and one with the deletes. So instead of having sets or sorted maps where these are stored, we just have 3 sorted lists. For every data entry we look in the get list to see if that entry can be found, iterating down the get list until we find that entry or something that is bigger. In the case of a hit in the get list we do the same thing in the delete list. This means that if the lists are k, l, m long, the worst-case lookup is k+l+m (see the sketch below).
          + I haven't done anything with filters yet, I just have a part in the compareTo(KeyValue) where the filter check should sit.
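
          A minimal sketch of that three-sorted-lists pass, with made-up names and a generic comparator standing in for the real KeyValue comparison; it only illustrates how forward-only cursors over data, gets and deletes give the k+l+m worst case, not the actual GetXServer code.

            import java.util.Comparator;
            import java.util.List;

            // Toy illustration: all three lists are sorted with the same comparator,
            // and each is walked with a forward-only cursor, so the whole pass costs
            // at most k + l + m comparisons.
            public class SortedListMatchSketch {
              static <T> void match(List<T> data, List<T> gets, List<T> deletes,
                                    Comparator<T> cmp, List<T> result) {
                int g = 0, d = 0;
                for (T kv : data) {
                  // advance the get list until we reach kv or something bigger
                  while (g < gets.size() && cmp.compare(gets.get(g), kv) < 0) g++;
                  if (g == gets.size()) return;                    // nothing left to ask for
                  if (cmp.compare(gets.get(g), kv) != 0) continue; // kv was not requested
                  // hit in the get list: check the delete list the same way
                  while (d < deletes.size() && cmp.compare(deletes.get(d), kv) < 0) d++;
                  if (d < deletes.size() && cmp.compare(deletes.get(d), kv) == 0) continue; // deleted
                  result.add(kv);
                }
              }
            }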

          Erik Holstad added a comment -

          On the sorting of the types, that is good. So we have "bigger" deletes sorted before smaller ones, so that they will be encountered first.
          When I talked about deleteFamily being stored in a separate place, that is when you are doing a get call, not how they are stored in
          the file. In the file they are stored first in the row, since they have an empty value in the column field.

          When it comes to deletes and timestamps, the deleteFamily and deleteColumn delete all data at that ts and older, but a plain
          delete only deletes the given ts and does not affect other data.

          In memcache, if you first put a delete(ts1) and then after a while put a deleteColumn(ts3), the delete(ts1) is removed and the deleteColumn(ts3) kept. But if they are put in, in the opposite order, deleteColumn(ts1) and then delete(ts3), both of them are kept.

          Another optimization that is made in memcache, and can also be done when doing a minor compaction, is to remove the delete itself when you find that exact timestamp; this applies only to a plain delete, not to the "bigger" deletes.
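
          A toy model of those rules (hypothetical class and method names, not the actual Memcache code), assuming deleteColumn covers the given ts and everything older while a plain delete covers only exactly its ts:

            // Toy model of the rules above; names are made up for illustration.
            public class DeleteRuleSketch {
              enum Kind { DELETE, DELETE_COLUMN }

              // true if a delete placed later ('newer') makes an earlier one ('older') redundant
              static boolean covers(Kind newerKind, long newerTs, long olderTs) {
                if (newerKind == Kind.DELETE_COLUMN) return olderTs <= newerTs; // ts and older
                return olderTs == newerTs;                                      // exact ts only
              }

              public static void main(String[] args) {
                // delete(ts1) then deleteColumn(ts3): the old delete(ts1) can be dropped
                System.out.println(covers(Kind.DELETE_COLUMN, 3, 1)); // true
                // deleteColumn(ts1) then delete(ts3): delete(ts3) covers only ts3, so both are kept
                System.out.println(covers(Kind.DELETE, 3, 1));        // false
              }
            }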

          stack added a comment -

          Looking at the GetQuery-v4:

          + TimeStamp looks like CellBounds in 880 proposal 5 v2, only it's missing versions. How are versions done?
          + Is the Family object needed? The old array of arrays where columns were compound of family and qualifier would seem to be more compact?
          + How do I read GetXServer? Is that the client-side GetX that has been deserialized server-side? The GetXServer.compareTo will take into consideration the TimeRange? I think I like it. Why do I need result 2 and 3 out of the compareTo? What's wrong with the compareTo working like any other comparator, returning < 0, 0, or > 0? If 0, we add to the result. If > 0, we've gone past whatever our context, storefile or store, and then in the loop we just move on to the next storefile or store. It shouldn't be compareTo if it returns different kinds of results.
          + What about deletes? They are orthogonal to this compareTo test? They are a running list that we bring along with our results as we do currently? Looks like you have this thing called NewDeletes that GetX knows about?
          + How does your DeleteSet work? How will it delete with different types (e.g. what do you add to this Set? Deletes? If so, how are you going to have the Put that something is supposed to Delete match in the comparator? Currently I have a special comparator that ignores types... that won't be good enough if we need to consider family, column and plain deletes).
          + We're changing how filters work?

          stack added a comment -

          So v4 of the doc is wrong where it says Puts come before Deletes (that's fine).

          Ok on the DeleteRow.

          Sort from TRUNK KV is:

            /**
             * Key type.
             * Has space for other key types to be added later.  Cannot rely on
             * enum ordinals . They change if item is removed or moved.  Do our own codes.
             */
            public static enum Type {
              Put((byte)4),
              Delete((byte)8),
              DeleteColumn((byte)16),
              DeleteFamily((byte)32),
              // Maximum is used when searching; you look from maximum on down.
              Maximum((byte)255);
          

          I don't follow what you mean by "separate place" for DeleteFamily. That sounds odd.

          What about how deletes and timestamps mix? Does a delete only delete the cell at that ts? Or does it delete that ts and all behind?

          Erik Holstad added a comment -

          Try to answer the questions:
          DeleteRow with square brackets only means that it is a call from the client side that is broken up into multiple deleteFamilies on the server side; the same goes for the deleteRow type.
          The sort order is the same as for the Kv.Comparator in KeyValue; whether deletes are sorted above puts, I have to check.
          Yes, that is exactly what that means. Deletes are taken care of directly in memcache by calling tailMap with the delete KeyValue and removing all the affected puts and deletes, so "smaller" deletes are removed when a "bigger" one comes in afterwards (see the sketch below).
          In the case of a merge, deletes need to be taken care of so that we don't lose the rule that deletes only apply to the files after. So for a minor compaction we save all the deletes, unless a "bigger" one comes in first, and remove all the affected puts for those deletes.
          The DeleteFamily type is moved down through all the files just like the other deletes, but it is stored in a separate place and needs to be checked, for every delete check, to see if it applies to the current KeyValue.
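
          A rough sketch of the tailMap idea above, with hypothetical names (this is not the actual Memcache code): seek to the delete's position in the sorted map and walk forward, removing the entries it covers, then keep the delete itself so it still masks entries in older storefiles.

            import java.util.Iterator;
            import java.util.NavigableMap;

            // Illustration only: assumes the memcache is a NavigableMap sorted with the
            // KeyValue comparator, so the entries a delete covers follow it directly.
            public class MemcacheDeleteSketch {
              interface Coverage<K> { boolean covers(K delete, K entry); }

              static <K, V> void applyDelete(NavigableMap<K, V> memcache, K delete,
                                             V deleteMarker, Coverage<K> coverage) {
                Iterator<K> it = memcache.tailMap(delete, true).keySet().iterator();
                while (it.hasNext()) {
                  K entry = it.next();
                  if (!coverage.covers(delete, entry)) break; // past the covered range
                  it.remove();                                // drop the covered put/delete
                }
                memcache.put(delete, deleteMarker);           // keep the delete for older files
              }
            }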

          stack added a comment -

          Docs are hard to read because there's no context.

          Let me have a go at interpreting them and commenting on them:

          StoreFile-v4:

          + DeleteRow is in square brackets. Whats that mean?
          + Looks like KeyValue is same as we currently have. Is that right?
          + "Deletes apply only to older StoreFiles" means? Deletes in current file effect older files because deletes will have been directly applied in memcache?
          + Will sorting puts before deletes work? How do I delete an item in memcache? I split the memcache TreeMap at the put record though I've been passed a delete? And then iterate over its tail. If a put, remove it, then enter the delete?
          + Says "DeleteColumn/Delete" are sorted in ascending lexic order? Whats that mean? Ain't type a single byte? So ain't it just a case of saying what the code for deletecolumn, deletefamily is? Currently, we have a maximum type. We split the memcache on that and iterate over its tail. Deletes sort before so they override any Puts that are present. I like the idea that we execute on deletes immediately. Could make savings here. But need bit more info on how this'd all work.
          + In "StoreFile Notes", the idea that we deletes only apply to the next file sounds good but what about the ryan rawson point of what happens when a compaction? What do we do? In minor compaction, we are picking up two files from a possible N. We are putting together the edits. The dictum that deletes only apply to the next file is broke.
          + Don't say nothing on how deletes work regards timestamp. Does a deletefamily delete all in a family at the given timestamp? Or is it all in the family at the timestamp and older? What happens if I do a get behind the timestamp? I can find the values that were "deleted"

          More comments to follow.

          Jonathan Gray added a comment -

          Updated version of GetQuery. The GetServer.compareTo implementation shown in the design document is a general design. There are three unique server-side Get implementations, each containing a slightly modified/specialized version of that method.

          Jonathan Gray added a comment -

          Updated StoreFile design. New sort order of row, family, column, timestamp, type.

          Erik Holstad added a comment -

          To clear up some confusion that has come up because the pdfs that were put up have not been updated with the sort order that we are using for the new server: in the pdfs it says that we are sorting row/type/fam/col/ts or something like that, but what we are actually doing is row/fam/col/ts,type, so the regular comparator from KeyValue is used (see the sketch below).
          Will update the pdfs as soon as possible so this is not so confusing.
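
          To make that order concrete, here is a minimal comparator sketch with made-up field accessors (this is not the KeyValue code): row, then family, then column, then timestamp with newer first, then type as the final tie-breaker. The descending type comparison (so the "bigger" delete codes sort first) is an assumption based on the discussion above.

            import java.util.Comparator;

            // Illustration of the row/fam/col/ts,type comparison order described above.
            public class KeyOrderSketch {
              static class Key {
                byte[] row, family, column;
                long ts;
                byte type;
              }

              static final Comparator<Key> ORDER = (a, b) -> {
                int c = compareBytes(a.row, b.row);
                if (c != 0) return c;
                c = compareBytes(a.family, b.family);
                if (c != 0) return c;
                c = compareBytes(a.column, b.column);
                if (c != 0) return c;
                c = Long.compare(b.ts, a.ts);             // newer timestamps sort first
                if (c != 0) return c;
                return (b.type & 0xff) - (a.type & 0xff); // assumed: larger type codes first
              };

              static int compareBytes(byte[] x, byte[] y) {
                int n = Math.min(x.length, y.length);
                for (int i = 0; i < n; i++) {
                  int c = (x[i] & 0xff) - (y[i] & 0xff);
                  if (c != 0) return c;
                }
                return x.length - y.length;
              }
            }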

          stack added a comment -

          I like the idea of NOW as seqid/flushid.

          Jonathan Gray added a comment -

          Right. My idea was just to use NOW as the storefile seqid/flushid (if future timestamps are not allowed, we maintain a property of no versions in a storefile being newer than the seqid/flush stamp). Nothing fancy like getting the newest version of the flush or anything, don't think that's necessary.

          Erik Holstad added a comment -

          One thing that was shown in Jonathan's fine-looking example.pdf is the notion of a timestamp for the storefile, so that we could use it to find out if there is anything in there that might be of interest for the query. I looked at how storefiles are created today and it looks like they have a sequenceid/logCacheFlushId, which to me just looks like an incremented value at the HRegion level, so all storefiles in a store have unique keys. So to me it looks like it would be possible to use a timestamp here instead and use that information to speed up the get queries. The new id would be set to the latest timestamp of the inserts in the flushed memcache or something like that (see the sketch below).
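
          A sketch of how a time-range get could use such a per-storefile stamp (hypothetical names; this builds on the idea above rather than on existing code): walking files from newest to oldest, any file whose newest data is already older than the range's lower bound can be skipped, along with everything behind it, assuming future timestamps are not allowed.

            import java.util.List;

            // Illustration only: 'latestTs' stands in for the proposed storefile id.
            public class TimeRangeSkipSketch {
              static class StoreFileInfo {
                final long latestTs; // newest timestamp flushed into this file
                StoreFileInfo(long latestTs) { this.latestTs = latestTs; }
              }

              // Files ordered newest first; minTs is the lower bound of the TimeRange.
              static int filesWorthReading(List<StoreFileInfo> newestFirst, long minTs) {
                int count = 0;
                for (StoreFileInfo f : newestFirst) {
                  if (f.latestTs < minTs) break; // this file and all older ones can be skipped
                  count++;
                }
                return count;
              }
            }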

          Erik Holstad added a comment -

          @jgray
          I agree with most of the things that you are saying about letting the user set the timestamp in some places but not in others, and I think that it is a good point.
          But the reason that I brought this up is that I would like to keep the code as simple as possible and not have to deal with all these special cases, which DeleteFamily with the ability to set the timestamp in the past brings, not to make it more complicated. But I guess we can do the same as with the settable timestamps for the inserts: put a red warning label on there and try to explain what will actually happen when making a delete like that, because as a new user I think it is hard to fully understand all the details of your actions even though they make perfect sense.
          So I guess that leaves us with number 2, where we have 2 different cases to take care of, one with DeleteFamily and one without.

          Jonathan Gray added a comment -

          Ah, so not setting it and performing deletes on the memcache means that reading a deletefamily means everything in prior storefiles is deleted for that row.

          I guess I just don't agree with that kind of selective restrictions for performance unless we're going to make a conscious and logical design decision. There's a very clear and logical argument for disallowing the manual setting of timestamps. However, this ability is part of the BigTable spec and there are numerous use cases for this (including pset). It closes the door for potential optimizations for those of us who have no need to manually set them, but it's not terrible to allow it as long as they're only in the past.

          The same argument can be applied to this and a bunch of other issues we've been tossing back and forth.

          Let's not make these kinds of decisions without deciding what our requirements are. Either timestamp is a user-settable attribute, or it isn't. I think it should be. Part of the issues with the current API is you can do certain things in one part of the API that aren't supported in the other type. Scanning and versions don't play nice even though we logically can support it. There shouldn't be caveats like, you can insert at any time in the past, but if you want to delete a row, you can only delete every version or particular versions of particular columns, not all versions older than a specified stamp.

          Erik's digging has shown numerous potential optimizations for the future, very good stuff. BUT Let's not alter our requirements or the properties of HBase in significant ways in the name of minor optimization of edge cases.

          If I understand correctly, even with #2, if you do a deleteFamily and specify NOW, it would have the same early-out possibility as with #1. I see a DeleteFamily with a stamp that is newer than the latest stamp in the next storefile. I know all columns are deleted so I do nothing. Enforcing the deletes in memcache means you tuck it away until the next storefile anyways. So the implementation is identical with #2 if used in the way #1 forces you to.

          But you remove the ability of the user to put a past stamp in. And this just adds additional caveats instead of keeping it simple. If a user does a deletefamily with a past stamp, then read queries would need to open additional stores. That's required for correctness of the query; this is not an inefficiency, this is what the user wants to happen if he uses puts and deletes in this way.

          Erik Holstad added a comment -

          So what you gain by going with number 1 instead of number 2 is that you don't need to check any of the storefiles for more than containment of a DeleteFamily insert, so there is no need to check the timestamp for every get and for merging of deletes, since you know that all the deletes that are in that storefile were inserted after the DeleteFamily insert was made. The only time you actually take care of removing stuff in regards to the DeleteFamily call is in memcache. So basically you don't have to do the checks for the DeleteFamily timestamp, because you know that none of the storefiles after the storefile where you encountered the DeleteFamily entry are "deleted".

          Regards Erik

          Jonathan Gray added a comment -

          I don't understand what we gain from #1 versus #2. What do we lose the ability to do by allowing past timestamp to be set for DeleteFamily?

          Also, Jim, you were the one who voiced opposition to disallowing manual setting of timestamps. I've come to agree with that, with the restriction only being on disallowing future stamps (which breaks nearly all optimizations). Is there a particular reason it makes sense to allow inserts to be put in with any timestamp but not a delete?

          Jim Kellerman added a comment -

          I would say, eliminate timestamp as a parameter to all the deleteFamily methods, and the one that finally actually makes the server call supplies HConstants.LATEST_TIMESTAMP. This is essentially option 1.

          Jonathan Gray added a comment -

          I'm a little bit confused about what you're saying. First, I don't think you or I have described any recent changes we've made in the design from the current posted docs. Namely, separation of the deletes after all the puts is not actually a good idea.

          Rather than separating them out and having a KV sort like (row,type,column,ts) it would be (row,column,ts,type). You end up building your delete list as you go but now you can early out in more cases. I will update the pdfs later in the week.

          Erik has also simplified the return codes of Get.compareTo.

          Regarding the DeleteFamily issue... There's some new DeleteSet object now that handles the merging of deletes and containment checking? It should be very simple for it to keep around a single, optional DeleteFamily (really just a timestamp): either it defaults to 0L and is checked every time, or it is set to null and skipped if none is found (we should be able to get away with the overhead of a single == null check when no DeleteFamily is present). It would get set when reading a DeleteFamily, and then it's just a single long check for each timestamp (see the sketch below).

          Regarding #1 above, I don't follow why they can only set it to now? The rule can and should be: you can do anything for now or anything in the past. How would setting something in the past break anything here?

          2. This is what I'm proposing I guess? What's the downside? If no deletefamily, you have one line of code, a single instruction comparison. This is the least complex and seems efficient.

          3. I think you're saying the entire row should be timestamp ordered here? As you know, I'm against that.
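
          A minimal sketch of that single-optional-DeleteFamily idea (hypothetical class and names, not actual code): when no DeleteFamily has been seen the extra cost is one flag test, and when one has been seen it is one long comparison before the normal per-column delete check.

            // Illustration only; the real containment check against merged deletes is
            // stubbed out, since only the DeleteFamily shortcut is being shown.
            public class DeleteSetSketch {
              private boolean hasFamilyDelete = false;
              private long familyDeleteTs = 0L;

              void sawDeleteFamily(long ts) {
                if (!hasFamilyDelete || ts > familyDeleteTs) {
                  hasFamilyDelete = true;
                  familyDeleteTs = ts;
                }
              }

              boolean isDeleted(long ts) {
                if (hasFamilyDelete && ts <= familyDeleteTs) return true; // single long check
                return checkColumnDeletes(ts);                            // normal path
              }

              private boolean checkColumnDeletes(long ts) {
                return false; // stand-in for the merged per-column delete containment check
              }
            }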

          Erik Holstad added a comment -

          When bringing DeleteFamily into the mix, it creates problems with the current layout, because every time you merge the deletes from the previous storefile with the current deletes, and every time you compare those deletes with the current position in the current storefile, you need to compare those timestamps. There are a couple of ways around this as I see it.

          1. Only letting the user set the timestamp for the DeleteFamily to now, where now can be System.currentTimeMillis() or some user generated now. This would mean the current memCache would be cleaned and all other storefiles considered to be deleted, for this row and family. This would mean that you only need to do one check for every storefile to see if there is a DeleteFamily entry in there and you will know that all data in that storefile is ok and that you don't need to look in any more storefiles.

          2. Do the check for whether there is a DeleteFamily in the deletes and have 2 different methods taking care of the cases where you have a DeleteFamily entry and where you don't. The downside of this is that you still have to pay the cost if you do have a DeleteFamily; worst case, you have to do these 2 checks for every entry in every storefile.

          3. Keep the DeleteFamily sorted at the timestamp where it belongs, so that all deletes would be sorted in timestamp order before column. This is a rather big change, because it means that the puts would also have to be sorted this way for it to make any sense. Another advantage of this approach is that earlying out of a query with a timerange "filter" would be more effective.

          I personally like option 3 the most, but I can see people not liking it because it sort of redefines what HBase is, so I think that number 1 is the best option and after that number 2.

          Would love to get some input on this matter to see if there is anything that people might have against number 1. Otherwise I will move on with the process of implementing that option.

          Regards Erik

          Jonathan Gray added a comment -

          Changes Get.compareTo codes and HRegion sample logic.

          Jonathan Gray added a comment -

          Erik has been doing good work on implementing these ideas in code. One thing he realized today was that we decided partway through the design process to handle DeleteSet inside of the Get rather than in the HRegion logic. That further simplifies HRegion code and means the Get.compareTo function now returns one of 4 values rather than 6 total.

          Erik Holstad added a comment -

          I think the idea is that the compareTo sits inside the Get interface and different Gets implement their own methods.
          But we didn't want to keep the compareTo code on the client, so that is why a server-side object is created.
          The idea of having a creation stamp is just to be able to pass the memcache and storefiles faster in case of a TimeRange query,
          but that is an optimization.

          stack added a comment -

          I need some preamble. I can't figure out how to read these docs. Someone needs to hold my hand.

          For example, what's this compareTo method? Is it on an object GetServer (i.e. a Get on the server side)?

          What's this mean: 'RegionUniqueID = Creation Stamp'?

          It looks like another rewrite of the Server?

          Erik Holstad added a comment -

          Jonathan, I guess the confusion is about the way that DeleteFamily is used. As I wrote 12 posts above or so, I thought that the ts
          for DeleteRow and DeleteFamily should be set to now; that is why I thought it was in the wrong place or had the wrong ts. But I guess
          the reason for having it that way was for the sorting, and since we sort by type before ts now, it is no longer needed and we can use that
          ts as in the case of DeleteColumn.

          Jonathan Gray added a comment -

          Added numVersions=MAX to second GetColumns example.

          Jonathan Gray added a comment -

          Erik:

          DeleteFamily in MemCache, should probably have a different ts, to belong there, same thing with stuff in the storefiles.

          Why should DeleteFamily have different timestamps? In the Memcache, it is a DeleteFamily for ts <= 9. This request can come in at any time so could be in the Memcache. Remember, deletes in the current memcache/storefile relate only to older ones (smaller timestamps).

          For the GetColumns(RowA, Fam, [ColA, ColB], ts(20, 0)), I would think that that query would mean give me 1 version of ColA and ColB from the timeperiod 20 - 0.

          I meant all versions. Will update and fix that. You're right, default should probably always be 1.

          And for GetFamily(RowA, Fam, num=1), Couldn't you early out as soon as you encounter the DeleteFamily in MemCache?

          I don't believe so. You encounter the DeleteFamily(ts <= 9), but the ID of the next StoreFile is 21, so you don't know anything about what's in there. You are actually done at this point, for this particular query, but there is no way to know that there are not versions of other columns besides colA and colB in StoreFile1, or even StoreFile2 (id=11, so could also contain undeleted columns for the resultset). For example, a (RowA, Put, Fam, ColC, 19, M) in StoreFile1 would be part of the resultset if it existed.

          Ryan:

          (RowA, Delete, Fam, ColA, 6, X)
          Is the 'X' a typo? This should be a */no value since deletes dont actually carry any value info - the info is in the key itself.

          Yes, that is a typo. If you look at the latest Example pdf, the StoreFiles should be accurate.

          It sounds like we lose the ability to version deletes if the value is in memcache

          I'm not sure I totally understand what you're saying about the removal from memcache affecting that. To expand on your example...

          We do a put of KeyValue: (rowA, fam, put, colA, ts=now=100, X)
          Then we do a delete of KeyValue, either Delete or DeleteColumn:
          DeleteColumn(rowA, fam, deletecolumn, colA, ts<=#)
          Delete(rowA, fam, delete, colA, ts=#)

          The timestamp that the delete goes in with is how you version the delete. Delete is for the lowest unit: single row, single family, single column, single version. DeleteColumn is for a range of versions of a column; my current thought was <= the specified stamp. If you don't specify one, then you will do now and that would delete all versions. Does that make more sense? I wasn't really clear about the 3 different delete types (a sketch of them follows below).
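
          To restate the three granularities as code (illustrative only, not the real client API or server logic):

            // Summary of the coverage rules described above; names are made up.
            public class DeleteTypesSketch {
              enum DeleteType { DELETE, DELETE_COLUMN, DELETE_FAMILY }

              /** Does a delete of the given type and stamp cover a cell at cellTs? */
              static boolean covers(DeleteType type, long deleteTs, long cellTs) {
                switch (type) {
                  case DELETE:        return cellTs == deleteTs; // one exact version of one column
                  case DELETE_COLUMN: return cellTs <= deleteTs; // that version of the column and older
                  case DELETE_FAMILY: return cellTs <= deleteTs; // same rule, but for every column in the family
                  default:            return false;
                }
              }
            }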

          ryan rawson added a comment -

          A few things:

          • in the storefile pdf, you have a row like so:
            RowA, Delete, Fam, ColA, 6, X

          Is the 'X' a typo? This should be a */no value since deletes dont actually carry any value info - the info is in the key itself.

          And as for the delete/memcache thing.... It sounds like we lose the ability to version deletes if the value is in memcache. Eg:

          • insert row A, ColA, value X
          • delete row A, Col A

          in a row, both will hit the memcache (probably). It seems like with your proposal we would process the delete by nuking the insert (aka put) right away, and storing the delete for posterity and to cover up older values in older hfiles.

          or am i missing something?

          Erik Holstad added a comment -

          The pdfs look really good and they make sense to me, just hope that they can help others to understand what we are going for here.
          Just a couple of things that are a little bit unclear to me in the examples:
          DeleteFamily in MemCache, should probably have a different ts, to belong there, same thing with stuff in the storefiles.

          For the GetColumns(RowA, Fam, [ColA, ColB], ts(20, 0)), I would think that that query would mean give me 1 version of
          ColA and ColB from the timeperiod 20 - 0.

          And for GetFamily(RowA, Fam, num=1), Couldn't you early out as soon as you encounter the DeleteFamily in MemCache?

          Jonathan Gray added a comment -

          This contains an example Memcache and two example StoreFiles using previously described keys, comparators, and deletes.

          There are wordy pseudo-code implementations of 5 different queries and their server-side execution logic. It explains how everything in GetQueries would work in practice.

          It's difficult to break this work up, like one person working on deletes, because to do this stuff right it requires a much bigger overall rearchitecture of things. Erik already has a good bit of code written for all this, so he and I are willing to take this stuff on.

          That being said, these are major changes and we need feedback.

          Jonathan Gray added a comment -

          Another thing I just realized. If we are going to push all the specialized server-side implementation of Gets into the GetServer-implementing classes, they must know the IDs (latest stamp) of all the StoreFiles in the current Store. This is known on the server, so it's not a big deal, but I hadn't mentioned it before.

          Jonathan Gray added a comment -

          Also, that next comparison would not even require any comparison against the KeyValue. It would just be a simple check at the start of compareTo to see whether we've overrun the boundaries defined by return codes 3 and 4.

          Jonathan Gray added a comment -

          Finishing my thought... This means that you may have to read an additional key that you otherwise would not have had to. Another option would be to add two additional return codes, but my thought is that six return codes is more than enough as is, and in general (when we'd return 3, not 4) we'll need to continue scanning anyway to read in all the deletes.

          Jonathan Gray added a comment -

          Erik pointed out that return codes 3 and 4 are not explicit about whether you should include the current value or not in the ResultSet.

          This updates the spec so that 3 and 4 mean DO NOT include the current key in the result. This means that you may have to read an additional key that you otherwise would not have had to.

          Also note, another optimization we will have is removing entries from the Get's match list once they have been matched. For example, say we specify that we want the most recent versions of 5 explicit columns. Once we find one of them, it is removed from the match list. Once all are found, and thus removed, the next compareTo call will see that there is nothing left to match on and return a 4. Return code 3 would be used in many other cases; for example, if we're looking for all versions of a single column, then once we scan past that column in the current storefile we return 3 to jump to the next one.
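
          A minimal sketch of that match-list idea, assuming a hypothetical matcher class (the class name, the use of String column names, and the INCLUDE/SKIP constants are illustrative only; just the 3/4 codes follow the spec above):

            import java.util.HashSet;
            import java.util.Set;

            // Illustrative only: tracks the explicit columns still to be matched and
            // returns 3 (next storefile) or 4 (done) once there is nothing left to match.
            public class ExplicitColumnMatcher {
              public static final int INCLUDE = 0;        // add the current KeyValue to the results
              public static final int SKIP = 1;           // not interesting, keep scanning
              public static final int NEXT_STOREFILE = 3; // nothing more here, move to the next file
              public static final int DONE = 4;           // all requested columns found, stop entirely

              private final Set<String> columnsToMatch = new HashSet<String>();

              public ExplicitColumnMatcher(Set<String> requestedColumns) {
                this.columnsToMatch.addAll(requestedColumns);
              }

              // column is the qualifier of the KeyValue currently being scanned.
              public int compareTo(String column, boolean isDelete) {
                if (columnsToMatch.isEmpty()) {
                  return DONE;                            // nothing left to match: return 4
                }
                if (!columnsToMatch.contains(column)) {
                  return SKIP;
                }
                columnsToMatch.remove(column);            // matched (or deleted): drop it from the list
                return isDelete ? SKIP : INCLUDE;
              }
            }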

          Jonathan Gray added a comment -

          This is somewhat related to HBASE-880, but specifically talks about how we will implement Gets on the server-side.

          For each GetX object we have on the client, there will be a corresponding GetXServer object that will contain any necessary state for processing the request. In addition, the GetServer interface will include a specialized compareTo(KeyValue) method with a range of return values [-1, 4] as described in the doc.

          These GetServer implementations will be where all the specialized server-side logic will live that allow the different styles of gets to be done in an optimized way. The bulk of that logic lives in the column/timestamp comparison in the compareTo.
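
          For illustration, the interface might end up looking something like the sketch below (the isDone helper is an assumption; only compareTo(KeyValue) and the [-1, 4] range come from the doc):

            import org.apache.hadoop.hbase.regionserver.KeyValue;

            // Hypothetical sketch of a server-side counterpart to a client GetX object.
            public interface GetServer {
              // Return codes in the range [-1, 4]; the exact meanings follow the
              // GetQuery doc (e.g. 3 = done with this storefile, 4 = done with the query).
              int compareTo(KeyValue kv);

              // Whether the get has been fully satisfied and scanning can stop.
              boolean isDone();
            }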

          More information and examples to follow.

          Jonathan Gray added a comment -

          Describes new KeyValue format, StoreFile format, and the default comparator/ordering of KeyValues within a StoreFile.

          Erik Holstad added a comment -

          Yeah Jonathan, that sounds like a really good approach: we get all the benefits of splitting deletes and puts without having to pay the cost of doing so. Very nice.

          On the whole put/get/delete issue:

          When looking for a value in HBase we have 3 lists that need to be compared to
          each other: the list of data in the storefile (da), the list of keys to look
          for (k), and the list of deletes (de).

          Today, we compare every da with every k and every match with every de, as far as
          I can tell. We get a complexity that looks something like da*k + k*de, which might
          be ok when all those values are small, but if da = k = de = 10 you get 200
          comparisons that you have to do.

          I'm proposing more of a merge approach, where you merge-compare da and de and
          produce a survivor list, and this list is then compared to k. That results in
          de+da + da+k = 40 comparisons in the worst case, which seems like a much better way to go.
          It can even be made more efficient.
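
          A minimal sketch of the merge-compare step over two sorted lists (the generic element type and comparator are placeholders; the real code would operate on KeyValues using the storefile comparator):

            import java.util.ArrayList;
            import java.util.Comparator;
            import java.util.List;

            public class MergeFilter {
              // Single pass over data (da) and deletes (de): roughly da + de comparisons
              // instead of comparing every data entry against every delete.
              public static <T> List<T> survivors(List<T> data, List<T> deletes, Comparator<T> cmp) {
                List<T> out = new ArrayList<T>();
                int d = 0;
                for (T entry : data) {
                  // advance the delete pointer past anything that sorts before this entry
                  while (d < deletes.size() && cmp.compare(deletes.get(d), entry) < 0) {
                    d++;
                  }
                  // keep the entry only if it is not covered by the current delete
                  if (d >= deletes.size() || cmp.compare(deletes.get(d), entry) != 0) {
                    out.add(entry);
                  }
                }
                return out;
              }
            }

          The survivor list would then be merge-compared against k in the same single-pass fashion.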

          Think we should add get types into KeyValue, so we can tell the difference between getting
          a value for a specific ts and getting all values after a specific ts.

          Jonathan Gray added a comment -

          We need to do some testing on that. Scanning through the deletes in the memcache might be pretty fast, regardless. However I think it sounds like a good idea and the basis for some more thoughts.

          And yeah, there should probably be no such thing as a DeleteRow on the server. And this is especially the case with locality groups as you'd need to seek to the start of the row every time before seeking down to your family.

          But in thinking more about memcache deletes... when we flush the memcache, we can guarantee that none of the values being flushed have been deleted (if we do as above, applying deletes to the memcache). So we have a list of deletes that apply to older store files. Then we start a new memcache.

          When we read in the newest storefile, we actually know that we can process it without looking at any deletes except those that are in the new memcache. The deletes in this storefile aren't needed until the second newest is looked at. And at that point we can read them in, in bulk, from the previous storefile that's already been opened. We can even compare the delete stamps against the storefile stamps and the possible query stamps to early out. This is a far cry from how things are now... deletes are interspersed and duplicated everywhere.

          It does seem to make sense to have the deletes sort above where they apply, but then we'd have to check those sections first before reading. Come to think of it, what could make sense is to order them below. The only time we actually have deletes in a storefile is when they need to be applied to the older storefiles. So we can scan these deletes at the end: once we have read past what we wanted (and still need to read additional storefiles), we can scan and seek for deletes pertaining to this row/family/column, if there are any.

          Those deletes are added to the in-memory deleteset for the remaining storefiles.
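
          In rough pseudo-Java, the flow described above might look like the sketch below (every type here is a hypothetical stand-in to show the control flow, not an existing class):

            import java.util.ArrayList;
            import java.util.List;

            public class StoreReadSketch {
              interface Cell {}
              interface DeleteSet { void addAll(List<Cell> deletes); }
              interface Query { byte[] getRow(); boolean isSatisfied(); }
              interface StoreFileReader {
                List<Cell> read(Query query, DeleteSet deletes);
                List<Cell> readDeletesForRow(byte[] row);
              }

              List<Cell> read(Query query, DeleteSet memcacheDeletes,
                  List<StoreFileReader> storeFilesNewestFirst) {
                List<Cell> results = new ArrayList<Cell>();
                DeleteSet deletes = memcacheDeletes;  // memcache deletes apply to every storefile
                for (StoreFileReader file : storeFilesNewestFirst) {
                  results.addAll(file.read(query, deletes));              // filter against known deletes
                  deletes.addAll(file.readDeletesForRow(query.getRow())); // these apply only to older files
                  if (query.isSatisfied()) {
                    break;                                                // early out
                  }
                }
                return results;
              }
            }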

          Any rewriting of files must enforce deletions across them, and files must be sequential in age if not all are combined.

          So, DeleteRow and DeleteFamily would take no time parameters, and would be stored with the time of deletion. Their KeyValue will sort at the end of the row, meaning you need to scan to this spot any time you reach the end of what you're reading from that store's row and need to read the next.

          DeleteColumn would use now by default, or you could specify a stamp and it would delete everything <= that stamp. This could sort at the end of the column, but is there any point? It should probably be at the end of the row; that is where you have to seek to look for a DeleteFamily anyway.

          Delete would be the same thing. Sorted at the end of the row. Just need to get the deleteset and comparators right so they can do the matching well for these different delete types against different cell KeyValues.

          Might make sense to have a DeleteRow in this case, would be less work in the case of locality groups. But not a big deal either way really.

          Erik Holstad added a comment -

          Some more thoughts:
          DeleteRow should not exist on the server side; it should be broken up into
          a DeleteFamily for each of the families in that region.

          Client side
          DeleteRow
          DeleteFamily
          DeleteColumn
          Delete

          Ts for the deletes should be "now" for DeleteRow and DeleteFamily, and an optional one for DeleteColumn that will delete everything after that ts;
          if not specified it will use "now". Delete should always take a ts for the put in question and will only react to exact matches of that ts.

          and at the server we have
          DeleteFamily
          DeleteColumn
          Delete

          I like the idea of having a special ts for the bigger Deletes so you would know what to get from memCache and where to seekTo in the store,
          but maybe an even better option is to have a separate delete set that can be read in if we need to go to the store files for more data.
          So for every delete that comes in, it first deletes everything in memCache that is related to it and is then added to the deleteSet.
          This means that everything in memCache is valid data, and you only need to take deletes into consideration if you are going to the stores.
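
          A tiny sketch of that client-side expansion of DeleteRow into one DeleteFamily per family, stamped with "now" (class names here are invented for illustration, not the current API):

            import java.util.ArrayList;
            import java.util.List;

            public class DeleteRowExpansion {
              // Hypothetical server-side delete: one family of one row, stamped at delete time.
              static class DeleteFamily {
                final byte[] row;
                final byte[] family;
                final long ts;
                DeleteFamily(byte[] row, byte[] family, long ts) {
                  this.row = row;
                  this.family = family;
                  this.ts = ts;
                }
              }

              // A client DeleteRow never reaches the server as-is; it is expanded into
              // a DeleteFamily for each family of the region.
              public static List<DeleteFamily> expandDeleteRow(byte[] row, List<byte[]> families) {
                long now = System.currentTimeMillis();
                List<DeleteFamily> deletes = new ArrayList<DeleteFamily>();
                for (byte[] family : families) {
                  deletes.add(new DeleteFamily(row, family, now));
                }
                return deletes;
              }
            }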

          Erik Holstad added a comment -

          With the new KeyValue format, puts and deletes are put in the same format, KeyValue. We have four delete types:
          Delete
          DeleteColumn
          DeleteFamily
          DeleteRow
          The difference between Delete and DeleteColumn, one would think, is that DeleteColumn deletes all versions while for a Delete
          you specify a version ts to delete.
          But KeyValue only holds one ts, so the way I see it we have 3 options:
          1. Not store the time of the delete and only use the ts in the case of a Delete.

          2. Set the ts to the time of the delete and not support deleting explicit versions, just deleting all versions.

          3. Have an extra field to store the extra ts, so we have two.

          I think that nr 1 is the best way to go, and if we remove all puts in memcache underneath a delete we are not going to lose any functionality.
          I'm not totally against nr 2 either, but it would have made more sense if we didn't allow users to set the timestamp.
          I don't like nr 3 since it requires an extra layer or something like that to keep the extra ts.

          Jim Kellerman added a comment -

          @Erik

          Sorry, just opposed to removing user settable timestamps.

          Erik Holstad added a comment -

          Jim, are you -1 on my proposal of removing from memCache, or just on removing timestamps as settable by the user?

          stack added a comment -

          I should have said: unless we intend to allow fellas to set their own timestamps even though they are out of order, I'm in favor of removing it from the API – tell it like it is rather than how we might like it to be (though, as JK raises, we need to figure out what that means for the pset application).

          stack added a comment -

          > -1 Unless there is another "version" field. We have an application that makes extensive use of timestamp at Powerset.

          We should review our internal app. I don't believe our application guards against timestamps being set non-chronologically.

          On the timestamping issue, we speak with forked tongue. We allow setting it all over our API, but internally we only support chronologically-ordered stamps. If a user enters out-of-order timestamps, results are indeterminate.

          Jim Kellerman added a comment -

          > Erik Holstad added a comment - 10/Mar/09 01:14 PM
          >
          > 2. I think that the setting of the timestamp should not be exposed to the user but, be set by HBase at all times. This limitation
          > simplifies a lot when reasoning about the order across storefiles and client formats, to make it as fast, early outs and simple,
          > list instead of map, as possible.

          -1 Unless there is another "version" field. We have an application that makes extensive use of timestamp at Powerset.

          Erik Holstad added a comment -

          Some more thoughts on puts into memCache.
          I think that when writing a delete to memCache with a specified timestamp, we should just overwrite the value in there and not save both.
          When we get a more complicated delete such as row, family, etc., I propose that after that insert we walk down the tailTree from that node and
          delete all its children. This will result in less flushing of memCache and make things easier when reading.
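
          A rough sketch of that purge against a sorted memcache map, assuming the memcache is a ConcurrentSkipListMap of KeyValues (the covers() check is a placeholder for the real row/family/column/timestamp matching):

            import java.util.Iterator;
            import java.util.Map;
            import java.util.concurrent.ConcurrentSkipListMap;

            import org.apache.hadoop.hbase.regionserver.KeyValue;

            public class MemcacheDeleteSketch {
              // Walk the tail of the map starting at the delete and remove every
              // entry the delete covers; stop at the first entry it does not cover.
              static void applyDelete(ConcurrentSkipListMap<KeyValue, KeyValue> memcache, KeyValue delete) {
                Iterator<Map.Entry<KeyValue, KeyValue>> it =
                    memcache.tailMap(delete).entrySet().iterator();
                while (it.hasNext()) {
                  KeyValue kv = it.next().getKey();
                  if (!covers(delete, kv)) {
                    break;        // past the range the delete applies to
                  }
                  it.remove();    // drop the covered put from the memcache
                }
              }

              // Placeholder: would compare row, family, qualifier and timestamp
              // according to the delete type.
              static boolean covers(KeyValue delete, KeyValue kv) {
                return false;
              }
            }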

          stack added a comment -

          Now I understand. That'd be sweet if we had that.

          Jonathan Gray added a comment -

          This optimization only works if timestamps are set by the server, AND it points out why we should separate the implementation of grabbing explicit columns vs. grabbing all columns in a family.

          Jonathan Gray added a comment -

          (assuming real timestamps)

          Get latest version of columns: [fam:col1, fam:col2, fam:col3, fam:col4]

          Begin reading through Memcache...

          We find col1: we add it to resultset and then remove from list of columns to match

          We find a delete for fam:col2: we remove it from the list of columns to match

          We find col2: doesn't matter, not in list of columns to match

          We find col1: old version, doesn't matter, not in list of cols

          Open up first store file...

          We find col2: old version, still doesn't matter, not in list

          We find col3: add to results, remove from list of cols to match

          etc...

          stack added a comment -

          My thought is that you need to keep a running list of deletes so that the delete in memcache is available when you trip over the cell that is in store file N.

          Jonathan Gray added a comment -

          To be clearer about the API: the above is a pseudo-API that doesn't make sense as-is. In the first method with byte [][] columns, those would have to include families. There's no such thing as a column without a family.

          My thought for a new API is to try to mirror the new key format. Basically, for gets you'd have a row, and then a set of keys that represent what you want to filter by. If it's columns, you'd have keys that carry both the family and the column. If it's families, you'd have the family in the key but an empty column value. Do we support empty column values? If so, we might also make use of the extra type bits to add more info...
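
          A tiny sketch of what such a request shape could look like (class and field names are invented for illustration; a zero-length qualifier stands in for "the whole family"):

            import java.util.ArrayList;
            import java.util.List;

            public class GetRequestSketch {
              static class FamilyQualifier {
                final byte[] family;
                final byte[] qualifier;   // zero-length array means the entire family
                FamilyQualifier(byte[] family, byte[] qualifier) {
                  this.family = family;
                  this.qualifier = qualifier;
                }
                boolean wholeFamily() {
                  return qualifier.length == 0;
                }
              }

              final byte[] row;
              final List<FamilyQualifier> keys = new ArrayList<FamilyQualifier>();

              GetRequestSketch(byte[] row) {
                this.row = row;
              }
            }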

          Jonathan Gray added a comment -

          In the reworking of basically everything, I'd like to propose we change server-side methods to allow optimizations wherever possible and client APIs to more closely reflect implementation.

          A very rough draft to show what i'm talking about:

          getColumnsLatest(byte [] row, byte [][] columns) - only takes columns, no families
          getFamiliesLatest(byte [] row, byte [][] families) - only takes families

          getColumnsVersions(byte [] row, byte [][] columns, int numVersions)

          getColumnsVersionsAfter(byte [] row, byte [][] columns, long afterStamp)
          getColumnsVersionsBefore(byte [] row, byte [][] columns, long beforeStamp)

          getLatest(byte [] row) implementation is the same as getFamiliesLatest() with all families specified.
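
          Collected as a hedged Java sketch (the interface name, the return type, and the exact signatures are placeholders for discussion, not committed API):

            import java.util.List;

            import org.apache.hadoop.hbase.regionserver.KeyValue;

            // Placeholder interface grouping the proposed specialized read paths.
            public interface RegionReads {
              List<KeyValue> getColumnsLatest(byte[] row, byte[][] columns);
              List<KeyValue> getFamiliesLatest(byte[] row, byte[][] families);
              List<KeyValue> getColumnsVersions(byte[] row, byte[][] columns, int numVersions);
              List<KeyValue> getColumnsVersionsAfter(byte[] row, byte[][] columns, long afterStamp);
              List<KeyValue> getColumnsVersionsBefore(byte[] row, byte[][] columns, long beforeStamp);
              List<KeyValue> getLatest(byte[] row);  // same as getFamiliesLatest with all families
            }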

          It's easy to see now that splitting families and columns into two fields will not work at all with the current API. We need a more hierarchical client API, client utilities, something more like BatchUpdate even for reads, ...

          Also, when dealing with versions (or latest), we will not be able to do most of the optimizations if the client can manually specify the timestamp as described above.

          A few reasons to do this. For one, it is more clear to users how things are being implemented. But more importantly, it makes sure we're writing a server-side method for all the different cases for which we can make optimizations. Right now getting explicitly listed columns shares code with getting all columns for explicitly listed families. These two things each contain their own unique possibilities for optimization. There are also different optimizations to be made for deletes and more well-defined read types will make the cell cache easier.

          Jonathan Gray added a comment -

          Comments on above:

          1. Not exactly following you, stack. I think Erik's idea is that when we are doing any kind of get that is explicit with the columns, we can match any deletes we find against the input list of columns. Just removing them from this list, rather than keeping a running list of deletes, should be more efficient. This leads into a new discussion I will post on next.

          2. I'm +1 on Erik's thought. Timestamps should not be set client side. If there is a need to know the timestamp, might be possible to have it returned. But the ability to set the timestamp means we can never optimize when dealing with versions because there could be a newer version of something in an older storefile, thus no ability to early-out ever. If you need more control over the timestamp, then you probably aren't really using versions for versioning and you should model your data differently.

          3. +1

          Not really sure what the opinion is for family + qualifier. I see no reason to store or use the : ever, anywhere. Client APIs should separate family and column as separate args. Is there a good argument for not doing it this way? Our comparators and new key format have no need for it.

          stack added a comment -

          Thanks for the review Erik.

          On 1., I've been thinking something similar – do it in the actual return list. We need to keep the delete around, though, in case we come across a deleted value subsequently. I was thinking that just before we let the thing out of the server we'd remove the deletes, or something like that.

          On 2., that's a big change. We should put it up on the list... something like: remove timestamping till we do it right, because in the meantime it only gives the wrong impression.

          On 3., once we have the server passing the client a Set of KVBBs, yes, we should do this as a convenience for the perverse bit-shifters.

          Yeah, the type has extra space. We should remember that, as you point out.

          Ok on the family/qualifier stuff; let's get it into the patch we commit.

          Erik Holstad added a comment -

          Looks good!
          A couple of things that I've been thinking about are:
          1. For deletes, I think it would make sense not to build the delete set alongside but instead to delete the column from the asked-for column list. So when going through memCache and the stores and coming to a delete, just remove that column from the get list.
          2. I think that the setting of the timestamp should not be exposed to the user but, be set by HBase at all times. This limitation simplifies a lot when reasoning about the order across storefiles and client formats, to make it as fast, early outs and simple, list instead of map, as possible.
          3. The client return type RowResult seems to be a pretty good general-purpose return format, but I think we should have an option to get a List<ByteBuffer> instead, so that when you know what kind of data you have in your system you can use a much simpler format; for example, if you are asking for 1 row, 1 family, all columns and 1 version, you only need a map<qf, value> or a list of tuples.

          Looking at the implementation of KeyValue, there are a couple of things that I want to bring up.
          The Type has some unused bits that can be used in the future for different flags; just something to keep in mind.
          I've added a new constructor family on my branch in KeyValue that deals with separating the column into family and qualifier, so that we don't need to look for the : and then add it back after we are done. If you still want the : for printing purposes or whatnot, you can put it in the family.

          stack added a comment -

          Thanks for opening this Jon.

          I'm currently working on changing the key format, HBASE-1234, as part of a regionserver rewrite that does away with HStoreKey, replacing it with a new org.apache.hadoop.hbase.regionserver.KeyValue data structure that lives inside a ByteBuffer. The new key format is described in HBASE-1234 and its latest manifestation can be found in the github repository here: http://github.com/ryanobjc/hbase/blob/5ed35fb55bd4ba2404ecbc94c6c45d7c8a7162e4/src/java/org/apache/hadoop/hbase/regionserver/KeyValue.java

          Here is from the class comment:

          * Utility for making, comparing and fetching pieces of a hbase KeyValue blob.
          * Blob format is: <keylength> <valuelength> <key> <value>
          * Key is decomposed as: <rowlength> <row> <columnfamilylength> <columnfamily> <columnqualifier> <timestamp> <keytype>
          * Rowlength maximum is Short.MAX_SIZE, column family length maximum is
          * Byte.MAX_SIZE, and column qualifier + value length must be < Integer.MAX_SIZE.
          * The column does not contain the family/qualifier delimiter.
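
          A minimal sketch of writing a blob in that layout, following the field order and sizes from the class comment above (illustrative only, not the actual KeyValue code):

            import java.io.ByteArrayOutputStream;
            import java.io.DataOutputStream;
            import java.io.IOException;

            public class KeyValueLayoutSketch {
              // <keylength> <valuelength> <key> <value>, where key is
              // <rowlength> <row> <familylength> <family> <qualifier> <timestamp> <keytype>
              public static byte[] write(byte[] row, byte[] family, byte[] qualifier,
                  long timestamp, byte type, byte[] value) throws IOException {
                int keyLength = 2 + row.length + 1 + family.length + qualifier.length + 8 + 1;
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(bytes);
                out.writeInt(keyLength);        // keylength
                out.writeInt(value.length);     // valuelength
                out.writeShort(row.length);     // rowlength (max Short.MAX_VALUE)
                out.write(row);
                out.writeByte(family.length);   // columnfamilylength (max Byte.MAX_VALUE)
                out.write(family);
                out.write(qualifier);           // no family/qualifier delimiter stored
                out.writeLong(timestamp);
                out.writeByte(type);            // keytype
                out.write(value);
                out.flush();
                return bytes.toByteArray();
              }
            }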
          

          Here are some notes on what I've learned as part of the rewrite:

          + Turns out we were doing a bunch of expensive column-matching lookup operations – 10%+ of all CPU in a recent seek+scan-1000-rows test – that were not necessary at all. The column match was being done in a store/family context, so a bunch of the column-family parsing, and the fetching from maps of column matchers to find which one to use in a particular column context, was not needed.
          + How deletes work will have to be redone now that we have a richer delete vocabulary. What was there previously was ugly anyway, so there is no harm in a rewrite except for the work of debugging the new implementation.
          + We need to make the ByteBuffer that holds the KV that comes out of the hfile read-only.
          + Will need to redo memcache size calculations (need Ryan and Erik help here).

