Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.1, 0.90.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      HBase needs tools to facilitate bulk upload and possibly dumping. Going via the current APIs, particularly when the dataset is large and cell content is small, uploads can take a long time even with many concurrent clients.

      PNUTS folks talked of need for a different API to manage bulk upload/dump.

      Another notion would be to have the bulk loader tools write regions directly into HDFS.

      1. 48.patch
        15 kB
        stack
      2. 48-v2.patch
        17 kB
        stack
      3. loadtable.rb
        4 kB
        stack
      4. 48-v3.patch
        21 kB
        stack
      5. 48-v4.patch
        23 kB
        stack
      6. 48-v5.patch
        17 kB
        stack
      7. HBASE-48-v6-branch.patch
        22 kB
        Jonathan Gray
      8. 48-v7.patch
        27 kB
        stack


          Activity

          stack added a comment -

          Committed last patch to trunk and branch (jgray reviewed and tested). Opening new issue for multi-family bulk import.

          stack added a comment -

          This version adds doc to the mapreduce package-info explaining how bulk import works.

          Jonathan Gray added a comment -

          v6 patch works as advertised.

           Just imported 200k rows with an average of 500 columns each, for 100M total KVs (24 regions). The MR job ran in under 2 minutes on a 4-node cluster of 2-core/2GB/250GB nodes. The Ruby script takes 3-4 seconds, and then about 30 seconds for the cluster to assign out the regions.

          I'm ready to commit this to trunk and branch though we need some docs. Will open separate JIRA for multi-family support.

          Jonathan Gray added a comment -

          Patch that applies cleanly to branch.

           Includes two modifications to the Ruby script: it ignores _log directories (preventing the multiple-family error) and copies hfiles into the proper region directories after creating them.

          Running final test now.

          Jonathan Gray added a comment -

          I've got this working. Running final tests on branch and trunk and will post patch.

          Jonathan Gray added a comment -

          The MR job is working tremendously well for me. I'm able to almost instantly saturate my entire cluster during an upload and it remains saturated until the end. Full CPU usage, lots of io-wait, so I'm disk io-bound as I should be.

          I did a few runs of a job which imported between 1M and 10M rows, each row containing a random number of columns from 1 to 1000. In the end, I imported between 500M and 5B KeyValues.

           On a 5-node cluster of 2-core/2GB/250GB nodes, I could import 1M rows / 500M keys in 7.5 minutes (2.2k rows/sec, 1.1M keys/sec).

           On a 10-node cluster of 4-core/4GB/500GB nodes, I could do the same import in 2.5 minutes. On this larger cluster I also ran the same job but with 10M rows / 5B keys in 25 minutes (6.6k rows/sec, 3.3M keys/sec).

          Previously running HTable-based imports on these clusters, I was seeing between 100k and 200k keys/sec, so this represents a 5-15X speed improvement. In addition, the imports finish without any problem (I would have killed the little cluster running these imports through HBase).

          I think there is a bug with the ruby script though. It worked sometimes, but other times it ended up hosing the cluster until I restarted. Things worked fine after restart.

          Still digging...

          Jonathan Gray added a comment -

           The v5 patch does not include the Ruby script. The posted Ruby script is incomplete; the working one was in the v4 patch.

          Will attach new patch shortly.

          stack added a comment -

           Updated patch. It now rolls the hfile at a row boundary.

           Regarding TotalOrderPartitioner, there is no such facility in the new mapreduce package. That said, it shouldn't be too hard to make a partitioner of our own. Here is the default hash partitioner:

             /** Use {@link Object#hashCode()} to partition. */
             public int getPartition(K key, V value, int numReduceTasks) {
               return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
             }

           We could take a start and end key as inputs and then divide the key space into numReduceTasks partitions using our BigDecimal key math?
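
           For illustration, here is a minimal sketch of what such a partitioner might look like. This is an editor's sketch, not the committed code: it assumes fixed-width row keys (like the zero-padded PerformanceEvaluation keys used in the test below), uses BigInteger in place of the BigDecimal math mentioned above, and the loadtable.start.row/loadtable.end.row property names are made up.

             import java.math.BigInteger;
             import org.apache.hadoop.conf.Configurable;
             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.hbase.KeyValue;
             import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
             import org.apache.hadoop.hbase.util.Bytes;
             import org.apache.hadoop.mapreduce.Partitioner;

             /** Sketch: divide a known, fixed-width key range evenly over reducers. */
             public class KeyRangePartitioner
                 extends Partitioner<ImmutableBytesWritable, KeyValue>
                 implements Configurable {
               private Configuration conf;
               private BigInteger start, range;

               public void setConf(Configuration c) {
                 this.conf = c;
                 // Hypothetical properties: the job supplies the start and end row keys.
                 BigInteger s = new BigInteger(1, Bytes.toBytes(c.get("loadtable.start.row")));
                 BigInteger e = new BigInteger(1, Bytes.toBytes(c.get("loadtable.end.row")));
                 this.start = s;
                 this.range = e.subtract(s).add(BigInteger.ONE);
               }

               public Configuration getConf() {
                 return this.conf;
               }

               @Override
               public int getPartition(ImmutableBytesWritable row, KeyValue kv,
                   int numReduceTasks) {
                 // Map the row key proportionally onto [0, numReduceTasks). Partitioning
                 // on the whole row key keeps all of a row's KeyValues in one partition,
                 // so a row is never split across reducers.
                 BigInteger k = new BigInteger(1, row.get());
                 int p = k.subtract(start)
                     .multiply(BigInteger.valueOf(numReduceTasks))
                     .divide(range).intValue();
                 return Math.max(0, Math.min(p, numReduceTasks - 1));
               }
             }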

          stack added a comment -

           Oh, and 3., doing multiple families shouldn't be hard; just rotate all the files when one exceeds the maximum (on a row boundary).

          stack added a comment -

          Two things.

           1. Manish Shah points out that we need to ensure a total ordering of all keys (see pg. 218 and on of the Hadoop book). We need to supply a partitioner that does total ordering AND that ensures we don't partition across rows (thanks for this Manish).

          2. Current patch does not guarantee rows do not span storefiles.

          stack added a comment -

           I deleted my last comment. It duplicated stuff said earlier in this issue a good while ago.

           I changed the title of this issue to be only about bulk upload. The bulk dump work is going on elsewhere: e.g. HBASE-1684.

           On earlier comments about splitting the table so there's at least a region per regionserver, that ain't hard to do now. You can do it via the UI – force a split – or just write a little script to add a table and an initial region range (for example, see the script in the attached patch).

           I think the criterion for closing this issue is the commit of some set of tools that allow writing hfiles either into new tables or into extant tables.

          stack added a comment -

          This patch seems to basically work. I took files made by the TestHFileInputFormat test and passed them to the script as follows:

           $ ./bin/hbase org.jruby.Main bin/loadtable.rb xyz /tmp/testWritingPEData/

          The script expects hbase to be running.

           It ran through the list of hfiles, reading each one's meta info and last key. It then sorted the hfiles by end key. It makes an HTableDescriptor and an HColumnDescriptor with defaults (if you want other than the defaults, alter the table after the upload). It then takes the sorted files, moves each into place, and adds a row to .META. It doesn't take long.

          The meta scanner runs after the upload and deploys the regions.

          Done.

          I'll not work on this anymore, not till someone else wants to try it.

          stack added a comment -

           More fixup. Adds the script to the patch. It turns out multiple families is a bit more complicated; that can be done later if wanted.

          stack added a comment -

           Start of a script that will move hfiles into place under hbase.rootdir and then update the catalog table. Not finished yet.

          stack added a comment -

           Has HFileOutputFormat handle multiple families. The reducer can pass any number of families and HFO will write hfiles into family-named subdirs.

          stack added a comment -

           Here is a patch to add two classes to MapReduce: KeyValueSortReducer and HFileOutputFormat. This patch also adds a small test class that runs an MR job with a custom mapper and inputformat. The inputformat produces PerformanceEvaluation-type keys and values (keys are a zero-padded long and values are a random 1k of bytes). The mapper takes this input and outputs the key as the row, then makes a KeyValue of the row, a defined column, and the value.

           KeyValueSortReducer takes as input an ImmutableBytesWritable as key/row. It then pulls on the Iterator to read in all of the passed KeyValues, sorts them, and then starts outputting the sorted key/row+KeyValue pairs.

           HFileOutputFormat takes ImmutableBytesWritable and KeyValue. On setup, it reads configuration for stuff like the blocksize and compression to use. It then writes hfiles smaller than hbase.hregion.max.filesize.

           Next I'll work on a script that takes an HTableDescriptor and some other parameters and then lays the output of this MR job out properly in HDFS, with an hfile per region, making the proper insertions into the catalog tables.
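
           For reference, a minimal sketch of the KeyValueSortReducer idea just described, against the new mapreduce API (an editor's sketch; the class in the attached patch may differ in detail):

             import java.io.IOException;
             import java.util.TreeSet;
             import org.apache.hadoop.hbase.KeyValue;
             import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
             import org.apache.hadoop.mapreduce.Reducer;

             public class KeyValueSortReducer
                 extends Reducer<ImmutableBytesWritable, KeyValue,
                                 ImmutableBytesWritable, KeyValue> {
               protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs,
                   Context context) throws IOException, InterruptedException {
                 // Pull all KeyValues passed for this row into a sorted set.
                 TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
                 for (KeyValue kv : kvs) {
                   // Copy: the framework reuses the KeyValue instance it hands us.
                   sorted.add(kv.clone());
                 }
                 // Emit the row's KeyValues in sorted order for HFileOutputFormat.
                 for (KeyValue kv : sorted) {
                   context.write(row, kv);
                 }
               }
             }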

          stack added a comment -

           Let's test, Billy, and make sure that doesn't happen.

          Billy Pearson added a comment -

           Been thinking on this one: what happens if a region gets too many map files written to it and they're too large to compact, and we OOME for lack of memory to compact the files? Or will this not happen?

          stack added a comment -

          We do. Went in as part of Andrew Purtell's HBASE-62 work. There's even a unit test!

          Billy Pearson added a comment -

           A read-only option would simplify things a lot. Do we have this yet? I remember reading something about adding it in the past.

          stack added a comment -

          Thinking more on this issue, in particular on Billy's suggestion above ('Billy Pearson - 06/Feb/08 01:07 PM'), bulk uploading by writing store files ain't hard:

           For a new table (as per Bryan above), it's particularly easy. Do something like:

           + Create the table in hbase.
           + Mark it read-only, or even disabled.
           + Start the mapreduce job. In its configuration it would go to the master to read the table description.
           + The map reads whatever the input is, using whatever formatter, and outputs HStoreKey as key and cell content as value.
           + The job would use the fancy new TableFileReduce. Each reducer would write a region. It'd know its start and end keys – they'd be the first and last it sees. It could output these somewhere so a tail task could find them. The file outputter would also need to do sequenceids of some form.
           + When the job was done, the tail task would insert the regions into meta using MetaUtils.
           + Enable the table.
           + If regions are lop-sided, hbase will do the fixup.

           If the table already exists:

           + Mark the table read-only (ensure this prevents splits and that it means the memcache is flushed).
           + Start a mapreduce job that reads from the master the table schema and its regions (and the master's current time so we don't write older records).
           + Map as above.
           + Reduce as above, only insert a smarter partitioner, one that respects region boundaries and makes a reducer per current region.
           + Enable hbase and let it fix up, by splitting etc., where the storefiles written were too big.

          It don't seem hard at all to do.
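
           For illustration, here is how the new-table flow above could be wired up as a job driver, using the KeyValueSortReducer and HFileOutputFormat that the later patches on this issue add. An editor's sketch, not the committed code; it assumes the input is sequence files of row key + KeyValue pairs, so the default identity Mapper suffices.

             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.hbase.HBaseConfiguration;
             import org.apache.hadoop.hbase.KeyValue;
             import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
             import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
             import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
             import org.apache.hadoop.mapreduce.Job;
             import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
             import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
             import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

             public class BulkLoadDriver {
               public static void main(String[] args) throws Exception {
                 Job job = new Job(new HBaseConfiguration(), "bulkload");
                 job.setJarByClass(BulkLoadDriver.class);
                 // Input: sequence files of <row, KeyValue>; the default identity
                 // Mapper passes them to the shuffle, which sorts/partitions by row.
                 job.setInputFormatClass(SequenceFileInputFormat.class);
                 job.setOutputKeyClass(ImmutableBytesWritable.class);
                 job.setOutputValueClass(KeyValue.class);
                 // Reduce: sort each row's KeyValues and write hfiles, ready to be
                 // moved into place under hbase.rootdir by a follow-on script.
                 job.setReducerClass(KeyValueSortReducer.class);
                 job.setOutputFormatClass(HFileOutputFormat.class);
                 FileInputFormat.addInputPath(job, new Path(args[0]));
                 FileOutputFormat.setOutputPath(job, new Path(args[1]));
                 System.exit(job.waitForCompletion(true) ? 0 : 1);
               }
             }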

          Jean-Daniel Cryans added a comment -

           Something HBase should have is a BatchUpdate that takes multiple row keys. A simple version of it would be doing many BatchUpdates like we already have, but in an iteration (a sketch of that follows the list). An enhanced version would instead do something like this when there are only a few regions:

           • Sort the row keys
           • Sample some rows to get an average row size
           • Using the existing region(s), the row keys to insert, and the average row size, figure out how the splits would be done
           • Insert the missing rows that would be the new lows and highs
           • Force the desired splits
           • Insert the remaining data
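
           A minimal sketch of the simple iterated version (an editor's illustration using the BatchUpdate client API of the era; the method, table handle, and parameters are made up for the example):

             import java.io.IOException;
             import org.apache.hadoop.hbase.client.HTable;
             import org.apache.hadoop.hbase.io.BatchUpdate;

             public class MultiRowUpdate {
               /** Commit one BatchUpdate per row, in an iteration. */
               public static void insertAll(HTable table, byte[][] rows,
                   byte[] column, byte[][] values) throws IOException {
                 for (int i = 0; i < rows.length; i++) {
                   BatchUpdate bu = new BatchUpdate(rows[i]);
                   bu.put(column, values[i]);
                   // One round trip per row; the enhanced version above would
                   // sort, sample, and pre-split before inserting instead.
                   table.commit(bu);
                 }
               }
             }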
          stack added a comment -

           If a new table, make a region per reducer (configure many reducers if the table is big). The framework will have done the sorting (lexicographically, if that's our key compare function) for us (we might have to add to the framework to ensure we don't split a key in the middle of a row).

          If a table that already exists, would be reducer per existing region and yeah, there'll be a splitting and compacting price to pay.

           To see the difference in speed going via the API versus writing directly to mapfiles, see the primitive PerformanceEvaluation and compare the numbers for writing mapfiles directly rather than going through the API.

          Bryan Duxbury added a comment -

          In theory, writing directly to HDFS would be the fastest way to import data. However, the tricky part in my mind is that you need all the partitions not just to be sorted internally but sorted amongst each other. This means that the partitioning function you use has to be able to sort lexically as well. Without knowing what the data looks like ahead of time, how can you know how to efficiently partition the data into regions?

           This doesn't account for trying to import a lot of data into a new table. In that case, it'd be quite futile to write tons of data into the existing regions' range, because the existing regions would just become enormous, and then all you're really doing is putting off the speed hit until the split/compact stage.

          What is it that actually holds back the speed of imports? The API mechanics and nothing else? The number of region servers participating in the import? The speed of the underlying disk? Do we even have a sense of what would be a good speed for bulk imports in the first place? I think this issue needs better definition before we can say what we should do.

          stack added a comment -

           Yes. Going behind the API would be the faster way to load hbase. It'd be dangerous to do against a live hbase. Should we write something like TableOutputFormatter, except it writes region files directly into HDFS? It'd make a region per reducer instance? It'd know how to write keys, etc. properly and what location in HDFS to place the files.

          Billy Pearson added a comment -

           Wouldn't the best way to do this be a map that formats and sorts the data per column family, then a reduce that writes mapfiles directly to the region's columns?

           That would skip the API and speed up the loading of the data, and it would not matter so much whether we had 1 region or not, since all we would be doing is adding a mapfile to HDFS. Of course, the map would have to know whether there is 1 region or 1000 and split the data correctly, but even if each map only produces a few lines of data per column family, the compactor will come along sooner or later and clean up and split where needed.

           So if we add 100 map files to one column, I would assume it would slow reads down a little bit, having to sort through all the map files while scanning, but that would be a temporary speed problem.

          Chad Walters added a comment -

          Eventually that might be true but merging is currently a manually-triggered operation. Also, unless a more intelligent heuristic were in place, a small region would count against a whole region server until it was merged, which would slow down the loading.

          Bryan Duxbury added a comment -

          Actually you wouldn't have to be too concerned with the distribution of splits early on, because even if some of the regions ended up being abnormally small, they would eventually be merged with neighboring regions, no?

          Chad Walters added a comment -

           I like the idea of lots of splits early on, when the number of regions is less than the number of region servers. You want to make sure the splits are made at points that are relatively well-distributed, of course, so don't make it so small that you split without a representative sampling. This would be a good general-purpose solution that doesn't create a new API. Then the bulk upload simply looks like partitioning the data set and uploading via Map-Reduce, perhaps with batched inserts. Do you really think this would be dog slow?

           If that is not fast enough, I suppose we could have a mapfile uploader. This would require the dataset to be prepared properly, which could be a bit fidgety (sorting, properly splitting across columns, etc.).

          Bryan Duxbury added a comment -

          A really cool feature for bulk loading would be artificially lowering the split size so that splits occur really often, at least until there are as many regions as there are regionservers. That way, the load operation could have a lot more parallelism early on.
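
           As a hedged illustration of the idea (not a committed feature): the split threshold is the hbase.hregion.max.filesize setting, which could be lowered in hbase-site.xml for the duration of a load, for example:

             <!-- Illustrative only: split regions sooner during a bulk load. -->
             <property>
               <name>hbase.hregion.max.filesize</name>
               <value>67108864</value><!-- 64MB, down from the 256MB default -->
             </property>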

          stack added a comment -

           The bulk uploader needs to be able to tolerate myriad data input types. Data will likely need massaging, and ultimately, if we're writing HRegion content directly into HDFS rather than going against the hbase API – preferred, since doing bulk uploads against the hbase API will be dog slow – then it has to be sorted. Using mapreduce would make sense.

           Look too at using PIG, because it has a few LOAD implementations – from files on local disk or HDFS – and some facility for doing transforms while moving tuples around. We would need to write a special STORE operator that wrote the data out, sorted, as HRegions directly into HDFS (this would be different from PIG-6, which is about writing into hbase via the API).

           Also, chatting with Jim, this is a pretty important issue. It is the first issue folks run into when they start to get serious about hbase.


            People

             • Assignee:
               Unassigned
             • Reporter:
               stack
             • Votes:
               5
             • Watchers:
               8
