HBase

HBASE-1923: Bulk incremental load into an existing table

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.0
    • Fix Version/s: 0.90.0
    • Component/s: Client, regionserver, scripts
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HBASE-48 is about bulk load of a new table; it may be more practical to bulk load against an existing table.

      Attachments

      1. hbase-1923-prelim.txt
        79 kB
        Todd Lipcon
      2. hbase-1923.txt
        145 kB
        Todd Lipcon
      3. hbase-1923.txt
        161 kB
        Todd Lipcon


          Activity

          anty.rao added a comment -

          Flush the table so there is nothing in the memstore, then make it read-only so it doesn't split and create new regions.
          Then we should partition the dataset according to the regions of the table. But when generating HFiles, we should get the latest sequenceId from HBase.
          We need a new version of loadtable.rb that moves the HFiles to the right regions.
          Thanks very much for the comment from stack.

          Todd Lipcon added a comment -

          This has become high priority for us, so I'll be looking into this in the short term.

          Todd Lipcon added a comment -

          Basic Design:

          Changes to HFileOutputFormat:

          Should only need changes during job initialization:

          1. scan all regions of table from .META.
          2. configure TotalOrderPartitioner based on existing region key boundaries
          3. If the number of reducers > number of regions, we could
            (a) recursively split table until this is not true (degenerate case: incremental load into table with one row?)
            (b) simply split keyspace by taking the lexical "halfway" of the region (two HFiles go into one region in load stage)
            (c) add API to regionserver to get estimate of region midpoint (assuming that new data has similar distribution to old data)

          I plan to do either (a) or (b) initially.

          We should provide at least some sample code, if not good utility classes/methods to do this task.
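
          For instance, a minimal sketch of steps 1 and 2, assuming the client API of the time (HTable.getStartKeys() to read the region boundaries; the class and helper name here are made up for illustration):

          import java.util.Arrays;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

          public class RegionSplitPoints {
            // Split points are every region start key except the first (the
            // empty key); TotalOrderPartitioner wants them sorted.
            public static ImmutableBytesWritable[] splitPoints(
                Configuration conf, String tableName) throws Exception {
              HTable table = new HTable(conf, tableName);
              byte[][] startKeys = table.getStartKeys();  // one entry per region
              ImmutableBytesWritable[] points =
                  new ImmutableBytesWritable[startKeys.length - 1];
              for (int i = 1; i < startKeys.length; i++) { // skip the empty first key
                points[i - 1] = new ImmutableBytesWritable(startKeys[i]);
              }
              Arrays.sort(points);
              return points;
            }
          }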

          Job Running

          Should be unaffected

          Data Loader

          Note that the partitions output by the MR job no longer necessarily correspond to the region boundaries (regions could have split or merged). I think the algorithm looks like:

          for each reducer output:
            inspect hfile to find lowest key and highest key
            look up region name/startkey/endkey corresponding to first key in hfile
            if HFile's low<->high is entirely contained within the region's low<->high:
              send RPC to RS: loadIncremental(region name, "/path/to/hfile")
            else:
              # this is the inefficient path, if the region split during the MR job
              On the loading side, manually split the HFile into two physical HFiles
                in a tmp directory
              recurse on the split files
          

          The "inefficient" path should occur in a minority of cases. In the future we can implement this path using reference files that would be cleaned at next compaction. I don't plan to do this in the first pass.

          The above functionality would be implemented in a client side script/program (currently a ruby script, though I will probably just write in Java)
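
          As a hedged illustration of the containment test in the pseudocode above (fitsInRegion and the surrounding class are invented names; an empty end key means the last region in the table):

          import org.apache.hadoop.hbase.HRegionInfo;
          import org.apache.hadoop.hbase.util.Bytes;

          public class RegionFit {
            // True iff [first, last] falls entirely inside the region's key range.
            static boolean fitsInRegion(byte[] first, byte[] last, HRegionInfo region) {
              boolean afterStart = Bytes.compareTo(first, region.getStartKey()) >= 0;
              boolean beforeEnd = region.getEndKey().length == 0   // last region
                  || Bytes.compareTo(last, region.getEndKey()) < 0;
              return afterStart && beforeEnd;
            }
          }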

          RegionServer Side

          Need to implement the "loadIncremental" RPC. This function needs to do the following reasonably simple steps:

          1. ensure that the region being accessed is the same one being hosted (including timestamp, etc)
          2. move the HFile into the correct store directory
          3. briefly lock the storefile list and add the HFile

          Probably need some other interaction with concurrent scanners, etc - will look at this carefully during implementation.
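
          A hypothetical sketch of those steps, purely to pin down the design; getOnlineRegion, familyOf, getHomedir, addStoreFile and the store lock are illustrative names under this design, not committed regionserver APIs:

          // Design sketch only -- names below are made up for illustration.
          public void loadIncremental(byte[] regionName, Path hfilePath) throws IOException {
            HRegion region = getOnlineRegion(regionName);  // step 1: verify we host it
            Store store = region.getStore(familyOf(hfilePath));
            Path dst = new Path(store.getHomedir(), hfilePath.getName());
            fs.rename(hfilePath, dst);                     // step 2: move into store dir
            store.lock.writeLock().lock();                 // step 3: briefly lock the list
            try {
              store.addStoreFile(dst);                     // make the HFile live
            } finally {
              store.lock.writeLock().unlock();
            }
          }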

          Todd Lipcon added a comment -

          oops, forgot to write about case where number of reducers is less than the number of regions. We could either (a) disallow it, or (b) configure the reducer to simply roll to a new HFile at each region boundary that it crosses. The load algorithm runs per-hfile, not per-reducer, so should work fine.

          I will probably choose option (a) initially
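
          For reference, a hedged sketch of what option (b)'s roll logic might look like inside the reducer's record writer (splitPoints, writer and newWriter are assumptions, not the patch's code):

          // Rows arrive sorted, so roll to a fresh HFile each time the row
          // crosses the next region boundary; a while loop handles rows that
          // skip several sparse regions at once.
          while (boundaryIdx < splitPoints.length
              && Bytes.compareTo(rowKey, splitPoints[boundaryIdx]) >= 0) {
            writer.close();        // finish the HFile for the previous region
            writer = newWriter();  // start one for the next region
            boundaryIdx++;
          }
          writer.append(kv);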

          Vidhyashankar Venkataraman added a comment -

          Can you also include support for delete operations along with this bulk-update feature?

          Thank you
          Vidhya

          Todd Lipcon added a comment -

          Since the HFileOutputFormat just writes KeyValues, delete support should fall out "for free". However, the APIs to construct the delete tombstones aren't that obvious, so I think it will be an advanced feature. We should follow up this JIRA with (a) much better documentation on HFileOutputFormat (b) some better wrapper APIs for using it, as KeyValue is pretty low level.
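
          For instance, a hedged example of a delete tombstone written straight to HFileOutputFormat; the row/column names are borrowed from the importtsv example later in this thread, and the commented write line assumes a mapper context:

          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.util.Bytes;

          // A delete is just a KeyValue whose type is a tombstone. Type.Delete
          // masks the single version at exactly this timestamp; DeleteColumn /
          // DeleteFamily mask more broadly.
          KeyValue tombstone = new KeyValue(
              Bytes.toBytes("row1"),
              Bytes.toBytes("f1"),
              Bytes.toBytes("blah"),
              System.currentTimeMillis(),
              KeyValue.Type.Delete);
          // context.write(new ImmutableBytesWritable(tombstone.getRow()), tombstone);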

          Jonathan Gray added a comment -

          KeyValues are "low level" but there are some pretty user-accessible API calls in it that look just like Puts. Shouldn't be too hard to support Put and Delete though, since their underlying structure is a bunch of KeyValues anyway.

          Could make HFOF the same/similar API as TOF. Would make sense once you can load to an existing table.

          Jonathan Gray added a comment -

          Basic design looks good Todd.

          stack added a comment -

          Looks good to me. How's the sampling that the TotalOrderPartitioner needs going to work against HBase? This is one CF only still, right?

          Todd Lipcon added a comment -

          How's the sampling that the TotalOrderPartitioner needs going to work against HBase?

          TotalOrderPartitioner just needs split points, so I'm manually writing out the necessary SequenceFile, instead of using the Sampler interface.
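
          Something like the following, assuming the stock Hadoop SequenceFile and TotalOrderPartitioner APIs; job, fs and sortedStartKeys are placeholders:

          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
          import org.apache.hadoop.io.NullWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

          // TotalOrderPartitioner reads a SequenceFile of sorted split-point
          // keys with NullWritable values; write one key per region boundary.
          FileSystem fs = FileSystem.get(job.getConfiguration());
          Path partitionsPath = new Path("/tmp/partitions_" + System.currentTimeMillis());
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, job.getConfiguration(), partitionsPath,
              ImmutableBytesWritable.class, NullWritable.class);
          try {
            for (ImmutableBytesWritable splitPoint : sortedStartKeys) {
              writer.append(splitPoint, NullWritable.get());
            }
          } finally {
            writer.close();
          }
          TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionsPath);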

          This is one CF only still, right?

          For now, yes. Unless along the way it becomes easy to fix that while I'm in the vicinity. But I think the two things are pretty much orthogonal.

          Todd Lipcon added a comment -

          Attaching a work-in-progress patch in case anyone wants to start looking at this (a number of people sounded interested, figured early review would be best)

          Quick tutorial:

          # Generate some data
          perl -e 'for (1..10000) { print "$_\t$_\n"; }' | hadoop fs -put - mytsv.txt
          perl -e 'for (1..10000) { print "$_\t" . ($_*3) . "\n"; }' | hadoop fs -put - mytsv2.txt
          
          # Create table to hold it
          ./bin/hbase shell
          create 'myfile', 'f1'
          
          # Do a normal MR load
          HADOOP_CLASSPATH=$(cat /tmp/hbase-core-test-classpath.txt) hadoop jar target/hbase-0.21.0-SNAPSHOT.jar importtsv -Dcolumns=f1:blah  myfile mytsv.txt
          
          # scan in the shell if you like
          # Potentially split table if you like
          # Generate incremental from the other file
          HADOOP_CLASSPATH=$(cat /tmp/hbase-core-test-classpath.txt) hadoop jar target/hbase-0.21.0-SNAPSHOT.jar importtsv -Dcolumns=f1:blah  -Duse.hfile=true myfile mytsv2.txt
          
          # Load incremental
          HBASE_CLASSPATH=$HADOOP_CONF_DIR  ./bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hfof myfile
          
          # scan in the shell and see that the data has changed
          

          A fair amount of work remains - have to take care of regions that split, or the case when there are fewer reducers than regions.

          stack added a comment -

          I was going to suggest you add this to the hbase mapreduce driver but you have done so already.

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/
          -----------------------------------------------------------

          Review request for hbase, stack and Jonathan Gray.

          Summary
          -------

          Here's a first patch that implements bulk import into existing tables. This applies on top of HBASE-2586 and HBASE-2588 - I've pushed the series of the three to my github: http://github.com/toddlipcon/hbase/tree/hfof-review

          I have some TODOs left that I want to take care of before this gets committed, but since it's a pretty large patch, I figured I'd get it out for review ASAP.

          The stuff in the hadoopbackport package is essentially copy-paste from Hadoop trunk, so you can ignore that in the review.

          This addresses bug HBASE-1923.
          http://issues.apache.org/jira/browse/HBASE-1923

          Diffs


          pom.xml 0a009cf
          src/main/java/org/apache/hadoop/hbase/HRegionInfo.java 29b0cd6
          src/main/java/org/apache/hadoop/hbase/io/ImmutableBytesWritable.java 0a9ec4b
          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java cf4768f
          src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 4cbe52a
          src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java 3d40695
          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java 9c8e53e
          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/SimpleTotalOrderPartitioner.java af3d588
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/TotalOrderPartitioner.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 287cd48
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java fb65ed1
          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 7de766d
          src/main/java/org/apache/hadoop/hbase/util/Bytes.java a53dafe
          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java ed8709f
          src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java d04ced2
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionInfo.java fcb22fb

          Diff: http://review.hbase.org/r/87/diff

          Testing
          -------

          Primarily unit/functional testing, a bit of pseudo-distributed testing. Plan on doing full system tests before commit as well.

          Thanks,

          Todd

          HBase Review Board added a comment -

          Message from: "Ryan Rawson" <ryanobjc@gmail.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review76
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          <http://review.hbase.org/r/87/#comment396>

          this can't go here; HFile cannot know about KeyValues. It will go into StoreFile

          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          <http://review.hbase.org/r/87/#comment397>

          you might also say the last key in the HFile means the last "KeyValue"

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment398>

          this won't work for an existing table - if the current time millis > the current sequence id in the live server, we could end up losing edits during a log recovery

          • Ryan
          ryan rawson added a comment -

          an extensive discussion outlined why the sequence id thing is critical and is a major blocker:

          23:43 < dj_ryan> before this can go in
          23:44 < dj_ryan> the sequence # stuff is... tricky
          23:44 < dj_ryan> so here's the thing
          23:44 < dj_ryan> if you use any static sequence id you will end up with problems
          23:44 < dj_ryan> if you use System.currentTimeMilli()
          23:44 < tlipcon> yea it needs to be unique
          23:44 < dj_ryan> it isnt that easy
          23:44 < dj_ryan> it has to be smaller than the current largest sequence ID
          23:44 < tlipcon> and less than the lowest unflushed
          23:44 < tlipcon> right
          23:44 < dj_ryan> because otherwise you might bump the sequence id up a bunch
          23:45 < dj_ryan> and if we ever hit a log recovery we'd skip a ton of edits
          23:45 < tlipcon> yea, perhaps unique negative numbers
          23:45 < dj_ryan> well it will go away after a compaction
          23:45 < dj_ryan> so if we use '0'
          23:45 < dj_ryan> that might workish
          23:45 < dj_ryan> you will have to compact between imports
          23:45 < tlipcon> or potentially change the storefile holder to be a
          List<Storefile> rather than a map
          23:45 < tlipcon> we're sort of abusing a map there
          23:45 < dj_ryan> there was a really good reason for doing what we did
          23:46 < tlipcon> yea i think it used to make more sense
          23:46 < tlipcon> when we "early out"ed gets
          23:46 < dj_ryan> so until we know exactly why that is so i'd be hesistant at
          removing it
          23:46 < dj_ryan> so we had that before that
          23:46 < tlipcon> now with gets as scans I don't htink it's so important
          23:46 < dj_ryan> didnt we stack?
          23:46 < dj_ryan> but we still need it
          23:46 < dj_ryan> because META has custom logic to do rowAtOrBefore()
          23:46 < tlipcon> we need to be able to find oldest/newest, but we're talking
          about a min/max over like 10s of elements, hardly slow
          23:46 < dj_ryan> which still does a similar behaviour like the old get, only
          with its own unique code path
          23:48 < dj_ryan> we dont want to go back to the ye olde days of 'you cant drop
          a table then recreate it without flushing and major compacting
          between the drop and the create'
          23:48 < dj_ryan> that was no good
          23:48 < dj_ryan> yeah we arent creating sequence ids for these HFiles are we
          23:48 < dj_ryan> they wont load
          23:49 < dj_ryan> oh isee
          23:49 < dj_ryan> it uses currentTimMillis
          23:49 < dj_ryan> you cant do that for this code
          23:50 < dj_ryan> because we'd end up potentially creating a HFile that has a
          seq id > the live seq ID
          23:50 < dj_ryan> which would be i trouble
          23:50 < tlipcon> yep
          23:51 < tlipcon> I think unique negative numbers are probably the best bet
          23:51 < tlipcon> though it's a hack
          23:51 < dj_ryan> well 0 is probably one of the best choices
          23:51 < tlipcon> or better, changing the container to not be a map
          23:51 < dj_ryan> actually
          23:51 < dj_ryan> we cant do that
          23:51 < tlipcon> 0 doesn't let you do multiple incrementals w/o major compact,
          like you said
          23:51 < dj_ryan> yeah
          23:51 < dj_ryan> but we cant get rid of the ordering
          23:51 < dj_ryan> i think the compaction code also depends on using that as well
          23:52 < St^Ack> we saying that there is no ordering any more?
          23:52 < St^Ack> we can't be saying that, right?
          23:52 < tlipcon> St^Ack: I dunno
          23:52 < St^Ack> no order to sequence files
          23:52 < tlipcon> it would be an enticing thing to say
          23:52 < tlipcon> but we never quite figured out if we can say it
          23:52 < tlipcon> i think the delete semantics question actually factors in here
          23:52 < St^Ack> you need sequenceid replaying logs
          23:53 < St^Ack> and for letting go of hlogs
          23:53 < St^Ack> each edit gets a seqid
          23:53 < St^Ack> they are ever increasing inside HRS
          23:54 < St^Ack> hfile gets written w/ last seqid
          23:54 < St^Ack> in old days for get
          23:54 < St^Ack> we'd order them by seqid
          23:54 < tlipcon> yea, I don't think we're debating whether we need seqid
          23:54 < tlipcon> just whether seqid has to be entirely unique
          23:54 < St^Ack> it does have to be unique
          23:55 < St^Ack> at least currenlty it does
          23:55 < St^Ack> when we open them, we key them by seqid
          23:55 < tlipcon> yea currently cuz of the Map<Long, StoreFile> or whatever we do
          23:55 < tlipcon> but we could change that to a List<Pair<Long, StoreFile>>
          23:56 < St^Ack> compacting code currently addresses them in order
          23:56 < St^Ack> which version to let go
          23:56 < St^Ack> we let go of the oldest
          23:56 < St^Ack> first
          23:57 < jgray> St^Ack: compacting code has notion of storefile ordering?
          23:57 < St^Ack> todd, you are inserting into quiescent table?
          23:57 < St^Ack> jgray: yes
          23:58 < tlipcon> St^Ack: nope, inserting a new storefile into a live table
          23:58 < St^Ack> tlipcon: if you asked RS for a seqid
          23:58 < tlipcon> that's the fun of this patch
          23:58 < St^Ack> that might help
          23:58 < tlipcon> it all happens inside the RS
          23:58 < St^Ack> hmmm....
          23:58 < St^Ack> that won't help
          23:58 < tlipcon> we don't want a new seqid though
          23:59 < tlipcon> we explicitly want an old seqid
          23:59 < St^Ack> because then you could have memsotre flush that had edits
          either side of your new file
          23:59 < jgray> St^Ack: where does it look at storefile order?
          23:59 < St^Ack> jgray: files are ordered in the Store
          23:59 < St^Ack> by seqid
          23:59 < St^Ack> newest to oldest
          00:00 < jgray> yeah, i thought u meant when it does the merge?
          00:00 < jgray> u mean in picking which files to compact?
          00:00 < St^Ack> yeah.... keesp order
          00:00 < jgray> yeah ok
          00:00 < jgray> sorry thought u meant something else
          00:00 < St^Ack> but you saying that because it uses mergesort it dont' matter
          00:00 < St^Ack> tlipcon: thats hard.
          00:00 < jgray> once it picks the files to merge, from then on it doesn't look
          at seqid
          00:01 < jgray> but we do look at them in seqid order when picking files to
          compact
          00:01 < dj_ryan> during the process of compaction yes we dont use the sequence
          ordering
          00:01 < St^Ack> so, if 10 versions and we're to keep 3, we could be keping any
          3?
          00:01 < tlipcon> yea, but I think we all agree current compaction heuristic is
          kind of silly
          00:01 < dj_ryan> well
          00:02 < dj_ryan> so the thing is picking which store files uses the sequence
          00:02 < dj_ryan> because when we do minor compaction its important that we
          compact adjacent files
          00:02 < dj_ryan> and it will be important for future patches
          00:02 < jgray> St^Ack: 10 columns w/ same timestamp?
          00:02 < jgray> the same 10 cols
          00:02 < St^Ack> thats diff topic (smile)
          00:02 < dj_ryan> heh
          00:03 < dj_ryan> i mean how often do we want to actually do do loadFile
          00:03 < tlipcon> dj_ryan: the particular customer that has requested this
          essentially wants to only do loadfile
          00:03 < dj_ryan> well
          00:03 < dj_ryan> you could pick the current sequence id then increment it
          00:03 < jgray> St^Ack: if the columns have unique versions/stamps, then we do
          it right... we merge them in KV order... the issue there is more
          about when they have the same stamp, we would want to know
          storefile stamps to use that to determine order
          00:03 < dj_ryan> thus creating a 1 edit hole in the hlog
          00:04 < dj_ryan> in fact you could even log it
          00:04 < tlipcon> dj_ryan: isn't it the opposite? we want to pick the latest
          flushed store file, and go one less?
          00:04 < dj_ryan> but then the metadata in the HFile wouldnt match
          00:04 < dj_ryan> unless we re-wrote the HFile during load (Doh)
          00:04 < St^Ack> jgray: makes sense
          00:04 < tlipcon> yea that's another pain in the ass
          00:04 < dj_ryan> hmm
          00:04 < dj_ryan> yeah i guess you're right
          00:05 < dj_ryan> again the problem would be the sequence id in the HFile
          00:05 < dj_ryan> right now its being baked in at the map reduce time
          00:05 < dj_ryan> no matter what we pick at regionserver time
          00:05 < tlipcon> right, I think our best bet is to either not put one in there,
          or put in a special BULKLOAD_HFILE signifier of some sort
          00:05 < tlipcon> either in place of or in addition to the seqid
          00:05 < dj_ryan> you'd potentially have an issue during the next region re-open
          00:09 < dj_ryan> 0 might be the best choice at MR time
          00:09 < tlipcon> I think adding a new metadata entry saying it was a bulk load
          is our best bet
          00:09 < tlipcon> then treating them specially where we need to
          00:09 < tlipcon> i hate special casing, but other things seem hackish
          00:09 < St^Ack> can't be 0 because next time he runs nd if its still around...
          clash
          00:09 < dj_ryan> St^Ack: right
          00:10 < dj_ryan> can we call seqid 0 special
          00:10 < tlipcon> I'd rather put no seqid at all
          00:10 < dj_ryan> we will never have a seqid 0 in the wild
          00:10 < tlipcon> calling seqid 0 special is easy to miss
          00:10 < dj_ryan> hmm
          00:10 < tlipcon> whereas if we make it null, we'll get NPEs if we forget to
          account for this case
          00:10 < tlipcon> which I think is preferable
          00:10 < tlipcon> otherwise people will forget about it and code stuff that
          breaks subtly instead of loudly
          00:10 < dj_ryan> hmm
          00:11 < dj_ryan> without a sequence id you cant compact those files to each
          other
          00:11 < dj_ryan> or maybe they can
          00:11 < dj_ryan> so if you grab all the hfiles w/sequence ids
          00:11 < St^Ack> w/o seqid can't load it
          00:11 < dj_ryan> then fit in the ones without it around it
          00:13 < dj_ryan> so when you compact N files together you pick the largest
          sequence id
          00:13 < dj_ryan> then use that as the new sequence id of the file
          00:13 < dj_ryan> (of the output file)
          00:14 < dj_ryan> and that comes from the file's metadata i think

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review90
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          <http://review.hbase.org/r/87/#comment514>

          Oops, left this in.

          • Todd
          Todd Lipcon added a comment -

          Regarding the logseqid issue, I think I like the following solution:

          • we insert no seqid at all in the HFile when we create it
          • instead, we insert a special meta field that specifies that this is a bulk load result file (perhaps give the timestamp or the MR job ID or whatever)
          • change the Map<Long, StoreFile> inside HRegion to be a List<StoreFile>, making necessary changes so that existing functionality still works (maybe need to add logseqid as a member of storefile?)
          • ensure that RS startup handles hfiles correctly when they have no seqid (in terms of log replay)

          The hard part is that last bit - where in the StoreFile list do bulk load files end up? Do we always put them "on top"? I think with our new gets-as-scans implementation this will be fine, but we need to think whether it will cause issues with compaction, etc. Does anyone have some thoughts here?
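
          If it helps the discussion, a hypothetical sketch of the second bullet; the file-info key name is invented here, and only HFile.Writer.appendFileInfo is the real API:

          import java.io.IOException;
          import org.apache.hadoop.hbase.io.hfile.HFile;
          import org.apache.hadoop.hbase.util.Bytes;

          public class BulkLoadMarker {
            // Tag the HFile as a bulk-load product instead of giving it a seqid;
            // the key "BULKLOAD_TIMESTAMP" is made up for illustration. On open,
            // the regionserver would check for this key and treat the file
            // specially rather than parsing a fake sequence number.
            static void markBulkLoad(HFile.Writer writer) throws IOException {
              writer.appendFileInfo(Bytes.toBytes("BULKLOAD_TIMESTAMP"),
                  Bytes.toBytes(System.currentTimeMillis()));
            }
          }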

          HBase Review Board added a comment -

          Message from: stack@duboce.net

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review96
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          <http://review.hbase.org/r/87/#comment554>

          @Ryan: This shouldn't block commit of Todd's patch though, right? We can move it out after your refactoring?

          src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
          <http://review.hbase.org/r/87/#comment555>

          Is this patch for trunk or branch? If branch, this addition might break RPC. It might be ok on the end here but we should test.

          src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java
          <http://review.hbase.org/r/87/#comment556>

          Is this a good description? It's underneath the Import.class which imports by going against the API. You might add a word about it going behind the API, writing hfiles?

          Hmmm... later I see that it can do both, write hfile or write to table

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment557>

          Nice!

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment558>

          Man, we never enable asserts.... throw an exception!

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment559>

          Just asking, the comparator for sure is going to do the right thing here?

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment560>

          Won't javadoc mangle your nice formatting here? Use ul/li or wrap w/ <pre>?

          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java
          <http://review.hbase.org/r/87/#comment561>

          Why two types? Is this legacy? KV has advantage of being able to carry a Delete

          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
          <http://review.hbase.org/r/87/#comment562>

          In KV, there is a parseColumns function that might help here

          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
          <http://review.hbase.org/r/87/#comment563>

          Is this safe? Is there escaping you should be handling here?

          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
          <http://review.hbase.org/r/87/#comment564>

          I suppose in rare circumstances, it could split in here and mess you up. You need a no-split flag per region out in zk or something? We'd need such a thing for snapshotting methinks. But, thats for a later...

          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
          <http://review.hbase.org/r/87/#comment566>

          THis is good

          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
          <http://review.hbase.org/r/87/#comment565>

          Yeah, there is probably other metadata you'd want to bring over beyond blocksize – like whether its compressed? For later.

          src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java
          <http://review.hbase.org/r/87/#comment567>

          Class comment. Whats this thing do?

          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java
          <http://review.hbase.org/r/87/#comment568>

          We need this even though you wrote out partition edges earlier when you read region boundaries?

          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
          <http://review.hbase.org/r/87/#comment569>

          This is old cruddy stuff you are asking about. It looks like I added splitsAndClosesLock to stop either happening during critical periods (HBASE-588) and the splits lock looks now like a half measure from way back in 2007 – HBASE-217. Lets make an issue to clean it up.

          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          <http://review.hbase.org/r/87/#comment570>

          Nice

          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          <http://review.hbase.org/r/87/#comment571>

          There is FSUtils in hbase util

          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
          <http://review.hbase.org/r/87/#comment572>

          What you going to do here? Get the last key from the map of storefiles in current region and make your file be one less than it? Your timestamp is current though so edits in here could overshadow those in another storefile. Thats ok?

          src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          <http://review.hbase.org/r/87/#comment573>

          You were going to remove this

          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          <http://review.hbase.org/r/87/#comment574>

          Good

          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          <http://review.hbase.org/r/87/#comment575>

          Ugh... ok.

          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          <http://review.hbase.org/r/87/#comment576>

          Whats this about?

          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          <http://review.hbase.org/r/87/#comment577>

          Its not removing it itself? Its registered to cleanup on exit.

          src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java
          <http://review.hbase.org/r/87/#comment578>

          How? If nullwritables, don't I just get nulls?

          • stack
          HBase Review Board added a comment -

          Message from: stack@duboce.net

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review97
          -----------------------------------------------------------

          Ship it!

          Looks good to me. I don't have any major issues, so go ahead and commit, addressing any comments of mine you think prompt fixup.

          - stack
          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java, line 264

          > <http://review.hbase.org/r/87/diff/1/?file=610#file610line264>

          >

          > Is this patch for trunk or branch? If branch, this addition might break RPC. It might be ok on the end here but we should test.

          I will be backporting to branch for our customer, but as a major new feature I didn't expect it would be committed to Apache branch. If people want it, though, I'll put the patch up.

          In general, adding a new RPC doesn't cause incompatibility: if a client calls a method that isn't there on the server side, it will just get an exception saying the method doesn't exist (e.g. see my backport of HDFS-630 to Hadoop 20).
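
          For illustration, here is what that failure mode looks like from the client side. This is a minimal sketch assuming the 0.90-era HRegionInterface; the exact method name and signature added by this patch may differ:

          import java.io.IOException;
          import org.apache.hadoop.hbase.ipc.HRegionInterface;

          public class RpcCompatSketch {
            // Returns true if the server supports the new bulk-load RPC.
            static boolean tryBulkLoad(HRegionInterface server, String hfilePath,
                byte[] regionName, byte[] familyName) {
              try {
                server.bulkLoadHFile(hfilePath, regionName, familyName);
                return true;
              } catch (IOException ioe) {
                // An older server surfaces a RemoteException naming the unknown
                // method; the caller can catch that and fall back gracefully.
                return false;
              }
            }
          }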

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java, line 181

          > <http://review.hbase.org/r/87/diff/1/?file=612#file612line181>

          >

          > Man, we never enable asserts.... throw an exception!

          >

          Switched to using Google Preconditions

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java, line 187

          > <http://review.hbase.org/r/87/diff/1/?file=612#file612line187>

          >

          > Just asking, the comparator for sure is going to do the right thing here?

          yea, ImmutableBytesWritable implements Comparable

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java, line 213

          > <http://review.hbase.org/r/87/diff/1/?file=612#file612line213>

          >

          > Won't javadoc mangle your nice formatting here? Use ul/li or wrap w/ <pre>?

          good call, fixed to ul/li

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java, line 234

          > <http://review.hbase.org/r/87/diff/1/?file=612#file612line234>

          >

          > Why two types? Is this legacy? KV has advantage of being able to carry a Delete

          yea, it's for compatibility with TableOutputFormat which can take either. Kind of a pain, maybe we should just get rid of it and only accept KV?

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java, line 73

          > <http://review.hbase.org/r/87/diff/1/?file=613#file613line73>

          >

          > In KV, there is a parseColumns function that might help here

          doesn't seem much more convenient (since I already have String here)
          One question, though: if I want to insert into a family that has no qualifiers, should I be using EMPTY_BYTE_ARRAY when I construct the KV, or passing null?

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java, line 96

          > <http://review.hbase.org/r/87/diff/1/?file=613#file613line96>

          >

          > Is this safe? Is there escaping you should be handling here?

          Plan to address this in the docs for importtsv - it's not a full TSV parser and doesn't support quoting, escaping, etc.
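
          For illustration, the limitation is roughly the following (a toy sketch of naive tab-splitting, not the actual ImportTsv parsing code):

          public class TsvSplitSketch {
            public static void main(String[] args) {
              // The quoted section contains a real tab; a naive split still
              // breaks the field there because quoting is not honored.
              String line = "row1\tsay \"hello\tworld\"\tmore";
              String[] fields = line.split("\t", -1);
              System.out.println(fields.length);  // prints 4, not 3
            }
          }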

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, line 229

          > <http://review.hbase.org/r/87/diff/1/?file=614#file614line229>

          >

          > I suppose in rare circumstances, it could split in here and mess you up. You need a no-split flag per region out in zk or something? We'd need such a thing for snapshotting methinks. But, thats for a later...

          If there's been a split, the new daughter region will detect that the HFile doesn't "fit" and throw WrongRegionException. This then triggers the retry in ServerCallable, which fetches the new region info from .META. and realizes it has to split the HFile before trying again.
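
          A minimal sketch of that flow, with tryLoad and splitStoreFile as stand-in names for the LoadIncrementalHFiles internals:

          // Recursively load an HFile spanning [first, last], splitting it at
          // region boundaries when it no longer fits a single region.
          void load(Path hfile, byte[] first, byte[] last) throws IOException {
            HRegionLocation loc = table.getRegionLocation(first);
            HRegionInfo hri = loc.getRegionInfo();
            if (hri.containsRange(first, last)) {
              // Fits one region. If the region split while the job ran, the
              // server throws WrongRegionException; ServerCallable refreshes
              // the region info from .META. and the retry lands in the else
              // branch below.
              tryLoad(loc, hfile);
            } else {
              byte[] splitKey = hri.getEndKey();
              load(splitStoreFile(hfile, first, splitKey), first, splitKey);
              load(splitStoreFile(hfile, splitKey, last), splitKey, last);
            }
          }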

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java, line 29

          > <http://review.hbase.org/r/87/diff/1/?file=615#file615line29>

          >

          > Class comment. Whats this thing do?

          fixed - copied from KeyValueSortReducer
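
          Roughly, the class explodes each Put back into its KeyValues and emits them in sorted order, since HFiles must be written in key order. A simplified sketch of the idea (not the committed class):

          import java.io.IOException;
          import java.util.List;
          import java.util.TreeSet;
          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
          import org.apache.hadoop.mapreduce.Reducer;

          public class PutSortReducerSketch
              extends Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, KeyValue> {
            @Override
            protected void reduce(ImmutableBytesWritable row, Iterable<Put> puts,
                Context ctx) throws IOException, InterruptedException {
              // Gather every KeyValue from every Put for this row, sorted.
              TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
              for (Put p : puts) {
                for (List<KeyValue> kvs : p.getFamilyMap().values()) {
                  sorted.addAll(kvs);
                }
              }
              for (KeyValue kv : sorted) {
                ctx.write(row, kv);
              }
            }
          }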

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java, line 57

          > <http://review.hbase.org/r/87/diff/1/?file=617#file617line57>

          >

          > We need this even though you wrote out partition edges earlier when you read region boundaries?

          You're right that it's unused, but I figured I'd put it in for convenience - when doing a bulk load into a new table, for example, you may want to use this in order to figure out where to set region splits.
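
          For example, a sketch along these lines could sample an input and write split points for a brand-new table (the paths and sampling parameters are illustrative, and it assumes the hadoopbackport classes keep the Hadoop trunk signatures):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
          import org.apache.hadoop.hbase.mapreduce.hadoopbackport.InputSampler;
          import org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner;

          public class SampleSplitsSketch {
            public static void main(String[] args) throws Exception {
              Job job = new Job(new Configuration(), "sample-for-splits");
              job.setInputFormatClass(TextInputFormat.class);
              FileInputFormat.addInputPath(job, new Path("/input"));
              TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                  new Path("/tmp/partitions"));
              // Sample ~1% of records, up to 10000, from at most 10 splits.
              InputSampler.Sampler<LongWritable, Text> sampler =
                  new InputSampler.RandomSampler<LongWritable, Text>(0.01, 10000, 10);
              InputSampler.writePartitionFile(job, sampler);
              // The keys in /tmp/partitions can then seed the new table's
              // region boundaries before running the bulk load.
            }
          }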

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1897

          > <http://review.hbase.org/r/87/diff/1/?file=619#file619line1897>

          >

          > This is old cruddy stuff you are asking about. It looks like I added splitsAndClosesLock to stop either happening during critical periods (HBASE-588) and the splits lock looks now like a half measure from way back in 2007 – HBASE-217. Lets make an issue to clean it up.

          Which one should I be taking? Do I need both? Which order?

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/Store.java, line 511

          > <http://review.hbase.org/r/87/diff/1/?file=621#file621line511>

          >

          > There is FSUtils in hbase util

          On second thought actually this should be somewhat store-specific, so took out the TODO

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/util/Bytes.java, line 1105

          > <http://review.hbase.org/r/87/diff/1/?file=622#file622line1105>

          >

          > You were going to remove this

          fixed

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/Store.java, line 535

          > <http://review.hbase.org/r/87/diff/1/?file=621#file621line535>

          >

          > What you going to do here? Get the last key from the map of storefiles in current region and make your file be one less than it? Your timestamp is current though so edits in here could overshadow those in another storefile. Thats ok?

          going to make storefiles be a list instead of map... new review coming in a bit

          On 2010-05-28 17:19:22, stack wrote:

          > src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java, line 646

          > <http://review.hbase.org/r/87/diff/1/?file=623#file623line646>

          >

          > Whats this about?

          added a comment - the idea is to restore the local-mode job runner when the cluster shuts down
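
          A minimal sketch of the idea, assuming the utility captured the original setting before starting the mini MR cluster:

          // Remember the job runner before MiniMRCluster rewrites it...
          String oldTracker = conf.get("mapred.job.tracker", "local");
          // ... start the cluster and run the MR-backed tests ...
          // ... then, on shutdown, point job submission back at local mode:
          conf.set("mapred.job.tracker", oldTracker);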

          On 2010-05-28 17:19:22, stack wrote:

          > src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java, line 784

          > <http://review.hbase.org/r/87/diff/1/?file=623#file623line784>

          >

          > Its not removing it itself? Its registered to cleanup on exit.

          I was seeing cases where, in between different tests in the same test case, it wouldn't get cleaned up.

          On 2010-05-28 17:19:22, stack wrote:

          > src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java, line 40

          > <http://review.hbase.org/r/87/diff/1/?file=624#file624line40>

          >

          > How? If nullwritables, don't I just get nulls?

          Changed to:

          /**
           * Input format that creates as many map tasks as configured in
           * mapred.map.tasks, each provided with a single row of
           * NullWritables. This can be useful when trying to write mappers
           * which don't have any real input (eg when the mapper is simply
           * producing random data as output)
           */
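
          For illustration, a hypothetical job wired to this input format might look like the sketch below (RandomDataMapper is a stand-in name, not a class from the patch):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.mapreduce.NMapInputFormat;
          import org.apache.hadoop.mapreduce.Job;

          public class NMapUsageSketch {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              conf.setInt("mapred.map.tasks", 10);  // => 10 map tasks, no real input
              Job job = new Job(conf, "random-writer");
              job.setInputFormatClass(NMapInputFormat.class);
              job.setMapperClass(RandomDataMapper.class);  // stand-in mapper
              job.setNumReduceTasks(0);
              System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
          }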
          - Todd

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review96
          -----------------------------------------------------------

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/
          -----------------------------------------------------------

          (Updated 2010-05-30 16:13:22.509910)

          Review request for hbase, stack and Jonathan Gray.

          Changes
          -------

          Here's an updated diff which addresses Stack's comments, plus reworks the internals of Store to get rid of the currentTimeMillis hack. It also augments the test case a bit to force the bulk-loaded regions to reload.

          It still needs further work and some new unit tests; again, I just want to get early comments since this has become rather invasive.

          Summary
          -------

          Here's a first patch that implements bulk import into existing tables. This applies on top of HBASE-2586 and HBASE-2588 - I've pushed the series of the three to my github: http://github.com/toddlipcon/hbase/tree/hfof-review

          I have some TODOs left that I want to take care of before this gets committed, but since it's a pretty large patch, I figured I'd get it out for review ASAP.

          The stuff in the hadoopbackport package is essentially copy-paste from Hadoop trunk, so you can ignore that in the review.

          This addresses bug HBASE-1923.
          http://issues.apache.org/jira/browse/HBASE-1923

          Diffs (updated)


          pom.xml 0a009cf
          src/main/java/org/apache/hadoop/hbase/HRegionInfo.java 29b0cd6
          src/main/java/org/apache/hadoop/hbase/io/ImmutableBytesWritable.java 0a9ec4b
          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java b912a85
          src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 4cbe52a
          src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java 3d40695
          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java 9c8e53e
          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/SimpleTotalOrderPartitioner.java af3d588
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/TotalOrderPartitioner.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 287cd48
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 78f3223
          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 7de766d
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 80bf09a
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java c8941f1
          src/main/java/org/apache/hadoop/hbase/util/Bytes.java a53dafe
          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java ed8709f
          src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java d04ced2
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java f1566d3
          src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionInfo.java fcb22fb
          src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java 4595e6e

          Diff: http://review.hbase.org/r/87/diff

          Testing
          -------

          Primarily unit/functional testing, plus a bit of pseudo-distributed testing. Plan on doing full system tests before commit as well.

          Thanks,

          Todd

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/
          -----------------------------------------------------------

          (Updated 2010-05-31 00:22:24.084089)

          Review request for hbase, stack and Jonathan Gray.

          Changes
          -------

          Yet another draft, addresses some of the TODOs, adds some more unit tests, etc.

          I think this is nearing commitability.

          Summary
          -------

          Here's a first patch that implements bulk import into existing tables. This applies on top of HBASE-2586 and HBASE-2588 - I've pushed the series of the three to my github: http://github.com/toddlipcon/hbase/tree/hfof-review

          I have some TODOs left that I want to take care of before this gets committed, but since it's a pretty large patch, I figured I'd get it out for review ASAP.

          The stuff in the hadoopbackport package is essentially copy-paste from Hadoop trunk, so you can ignore that in the review.

          This addresses bug HBASE-1923.
          http://issues.apache.org/jira/browse/HBASE-1923

          Diffs (updated)


          pom.xml 0a009cf
          src/main/java/org/apache/hadoop/hbase/HRegionInfo.java 29b0cd6
          src/main/java/org/apache/hadoop/hbase/io/ImmutableBytesWritable.java 0a9ec4b
          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java b912a85
          src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 4cbe52a
          src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java 3d40695
          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java 9c8e53e
          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/SimpleTotalOrderPartitioner.java af3d588
          src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java b332280
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/TotalOrderPartitioner.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 287cd48
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 78f3223
          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 7de766d
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 80bf09a
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java c8941f1
          src/main/java/org/apache/hadoop/hbase/util/Bytes.java a53dafe
          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java ed8709f
          src/test/java/org/apache/hadoop/hbase/PerformanceEvaluation.java d76c75e
          src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java d04ced2
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java f1566d3
          src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionInfo.java fcb22fb
          src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java 4595e6e
          src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 2e4c7df
          src/test/java/org/apache/hadoop/hbase/util/TestBytes.java c7361cb

          Diff: http://review.hbase.org/r/87/diff

          Testing
          -------

          Primarily unit/functional testing, plus a bit of pseudo-distributed testing. Plan on doing full system tests before commit as well.

          Thanks,

          Todd

          Todd Lipcon added a comment -

          Attaching latest patch. This needs to go on top of HBASE-2588

          HBase Review Board added a comment -

          Message from: stack@duboce.net

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java, line 234

          > <http://review.hbase.org/r/87/diff/1/?file=612#file612line234>

          >

          > Why two types? Is this legacy? KV has advantage of being able to carry a Delete

          Todd Lipcon wrote:

          yea, it's for compatibility with TableOutputFormat which can take either. Kind of a pain, maybe we should just get rid of it and only accept KV?

          Doing both is a pain, yes. If Delete and Put had a common ancestor/interface we could use that, but it ain't there yet. File an issue to undo the Put option?

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java, line 73

          > <http://review.hbase.org/r/87/diff/1/?file=613#file613line73>

          >

          > In KV, there is a parseColumns function that might help here

          Todd Lipcon wrote:

          doesn't seem much more convenient (since I already have String here)

          One question, though: if I want to insert into a family that has no qualifiers, should I be using EMPTY_BYTE_ARRAY when I construct the KV, or passing null?

          It looks like either works; nulls are handled properly.
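
          For illustration, the two constructions in question side by side (a minimal sketch using the standard row/family/qualifier/timestamp/value constructor):

          import org.apache.hadoop.hbase.HConstants;
          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.util.Bytes;

          public class KvQualifierSketch {
            public static void main(String[] args) {
              byte[] row = Bytes.toBytes("row1");
              byte[] fam = Bytes.toBytes("f");
              byte[] val = Bytes.toBytes("v");
              // Family with no qualifier: per the reply above, an empty byte
              // array and null are handled the same way.
              KeyValue kvEmpty = new KeyValue(row, fam, HConstants.EMPTY_BYTE_ARRAY, 1L, val);
              KeyValue kvNull = new KeyValue(row, fam, null, 1L, val);
              System.out.println(kvEmpty + " vs " + kvNull);
            }
          }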

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java, line 96

          > <http://review.hbase.org/r/87/diff/1/?file=613#file613line96>

          >

          > Is this safe? Is there escaping you should be handling here?

          Todd Lipcon wrote:

          Plan to address this in docs for importtsv - it's not a good TSV parser that supports quoting, etc.

          ok

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, line 229

          > <http://review.hbase.org/r/87/diff/1/?file=614#file614line229>

          >

          > I suppose in rare circumstances, it could split in here and mess you up. You need a no-split flag per region out in zk or something? We'd need such a thing for snapshotting methinks. But, thats for a later...

          Todd Lipcon wrote:

          if there's been a split, the new daughter region will detect that the HFile doesn't "fit" and throw WrongRegionException. This then triggers the retry on ServerCallable, which fetches the new region info from meta, and realizes it has to split the hfile before trying again.

          Nice.

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java, line 57

          > <http://review.hbase.org/r/87/diff/1/?file=617#file617line57>

          >

          > We need this even though you wrote out partition edges earlier when you read region boundaries?

          Todd Lipcon wrote:

          You're right that it's unused, but I figured I'd put it in for convenience - when doing a bulk load into a new table, for example, you may want to use this in order to figure out where to set region splits.

          Ok. Sounds good.

          On 2010-05-28 17:19:22, stack wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1897

          > <http://review.hbase.org/r/87/diff/1/?file=619#file619line1897>

          >

          > This is old cruddy stuff you are asking about. It looks like I added splitsAndClosesLock to stop either happening during critical periods (HBASE-588) and the splits lock looks now like a half measure from way back in 2007 – HBASE-217. Lets make an issue to clean it up.

          Todd Lipcon wrote:

          Which one should I be taking? Do I need both? Which order?

          Seems like it depends on what you need. I like your issue on rethinking these locks. Sounds like we might need to break the splitsAndClosesLock apart into separate splitting and closing locks.

          - stack

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review96
          -----------------------------------------------------------

          HBase Review Board added a comment -

          Message from: "Todd Lipcon" <todd@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/
          -----------------------------------------------------------

          (Updated 2010-06-01 00:33:30.030753)

          Review request for hbase, stack and Jonathan Gray.

          Changes
          -------

          One more day of hacking on this: adds some docs, cleans up the command-line parameters, and fixes a couple of bugs identified in cluster testing.

          Calling this one final, I swear

          I don't know how to build the Forrest docs in trunk, but the new XML file does pass xmllint.

          Summary
          -------

          Here's a first patch that implements bulk import into existing tables. This applies on top of HBASE-2586 and HBASE-2588 - I've pushed the series of the three to my github: http://github.com/toddlipcon/hbase/tree/hfof-review

          I have some TODOs left that I want to take care of before this gets committed, but since it's a pretty large patch, I figured I'd get it out for review ASAP.

          The stuff in the hadoopbackport package is essentially copy-paste from Hadoop trunk, so you can ignore that in the review.

          This addresses bug HBASE-1923.
          http://issues.apache.org/jira/browse/HBASE-1923

          Diffs (updated)


          pom.xml 0a009cf
          src/docs/src/documentation/content/xdocs/bulk-loads.xml PRE-CREATION
          src/docs/src/documentation/content/xdocs/site.xml 0d644f5
          src/main/java/org/apache/hadoop/hbase/HRegionInfo.java 29b0cd6
          src/main/java/org/apache/hadoop/hbase/io/ImmutableBytesWritable.java 0a9ec4b
          src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java b912a85
          src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 4cbe52a
          src/main/java/org/apache/hadoop/hbase/mapreduce/Driver.java 3d40695
          src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java 9c8e53e
          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/SimpleTotalOrderPartitioner.java af3d588
          src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java 07d7911
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/InputSampler.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/mapreduce/hadoopbackport/TotalOrderPartitioner.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 287cd48
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 78f3223
          src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 7de766d
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 80bf09a
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java c8941f1
          src/main/java/org/apache/hadoop/hbase/util/Bytes.java a53dafe
          src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java ed8709f
          src/test/java/org/apache/hadoop/hbase/mapreduce/NMapInputFormat.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java d04ced2
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java f1566d3
          src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionInfo.java fcb22fb
          src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java 4595e6e
          src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 2e4c7df
          src/test/java/org/apache/hadoop/hbase/util/TestBytes.java c7361cb

          Diff: http://review.hbase.org/r/87/diff

          Testing
          -------

          Primarily unit/functional testing, plus a bit of pseudo-distributed testing. Plan on doing full system tests before commit as well.

          Thanks,

          Todd

          HBase Review Board added a comment -

          Message from: stack@duboce.net

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review108
          -----------------------------------------------------------

          Ship it!

          This is a killer feature. Ship it. There is one nitpick below, but it can be done on commit (if it's legit).

          src/main/java/org/apache/hadoop/hbase/io/ImmutableBytesWritable.java
          <http://review.hbase.org/r/87/#comment658>

          duh

          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
          <http://review.hbase.org/r/87/#comment659>

          Where is the javadoc noting that the TSV parser doesn't do escaping?

          src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
          <http://review.hbase.org/r/87/#comment660>

          Good

          src/docs/src/documentation/content/xdocs/bulk-loads.xml
          <http://review.hbase.org/r/87/#comment655>

          This is excellent. We do have this though already: http://hbase.apache.org/docs/r0.20.4/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk

          We should scrap the above if we're going to go w/ this or update the former to include this.

          src/docs/src/documentation/content/xdocs/bulk-loads.xml
          <http://review.hbase.org/r/87/#comment656>

          method_s_

          src/docs/src/documentation/content/xdocs/bulk-loads.xml
          <http://review.hbase.org/r/87/#comment657>

          This is a killer feature

          - stack
          HBase Review Board added a comment -

          Message from: stack@duboce.net

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/87/#review112
          -----------------------------------------------------------

          Ship it!

          Reviewed the changes between v2 and v3. It looks like they were lost - some failed communication with the server. The storefile loading stuff is good. I like the new Comparator. Oh, one thing: we have an MD5Hash class now; maybe your md5'ing could go there.

          src/main/java/org/apache/hadoop/hbase/util/Bytes.java
          <http://review.hbase.org/r/87/#comment666>

          Ignore

          - stack
          Todd Lipcon added a comment -

          This should be good to go, just blocked on HBASE-2588 getting committed.

          Todd Lipcon added a comment -

          Attaching final patch, about to commit.

          Todd Lipcon added a comment -

          Spoke to Ryan offline and he's OK reviewing this later this week post-commit.

          Committed to trunk. There's some discussion on the dev list about putting this in branch-20, but will hold off on that until there is some consensus.

          Thanks to all who helped review!

          Lars Francke added a comment -

          Can loadtable.rb be removed? As far as I understand it, your LoadIncrementalHFiles replaces it completely? If this is the case I'd suggest adding some documentation to that effect. The class name leads one to believe that it only loads incremental imports, while the javadoc seems to be more generic.

          Todd Lipcon added a comment -

          Yea, I think it can be removed - though assuming it still works, we should probably keep it around for one release in case anyone has scripting against it. But we should add a deprecation warning to its output. What do you think?

          Lars Francke added a comment -

          I haven't tried it after your patch, but it should probably work in non-incremental mode. I guess that might change after HBASE-1861. Thanks for the clarification.

          stack added a comment -

          IMO, remove it now. The likelihood of anyone scripting against it is low and I'm pretty sure it pales compared to its replacement.


            People

              Assignee: Todd Lipcon
              Reporter: anty.rao
              Votes: 1
              Watchers: 10