Accumulo

Need utility for exporting and importing tables

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: None
    • Labels: None

      Description

      Need a utility to export and import tables. A use case would be: export a table on cluster A, distcp to cluster B, import.

      Attachments

      1. ACCUMULO-456-4.patch
        2 kB
        Christopher Tubbs
      2. ACCUMULO-456-3.txt
        88 kB
        Keith Turner
      3. ACCUMULO-456-2.txt
        87 kB
        Keith Turner
      4. ACCUMULO-456-1.txt
        77 kB
        Keith Turner


          Activity

          Keith Turner added a comment -

          The following steps would partially accomplish this.

          • Clone table
          • Disable bulk imports
          • Compact table
          • Take table offline
          • Wait for Accumulo GC to clean up files in table dir
          • Distcp files
          • Bulk import files on remote cluster

          There are two reasons to compact.

          • Tablets can reference files that extend beyond tablet range and contain stale data.
          • Tablets can point to files in other table dirs as a result of cloning.

          The above procedure does not copy the existing table configuration stored in ZooKeeper or the table's split points.
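
          A minimal sketch of this manual procedure against the 1.4-era Java client API (instance names, credentials, table names, and paths are all placeholders; the GC wait and distcp happen outside the client API and are left as comments):

            import java.util.Collections;
            import org.apache.accumulo.core.client.Connector;
            import org.apache.accumulo.core.client.ZooKeeperInstance;

            public class ManualExport {
              public static void main(String[] args) throws Exception {
                Connector conn = new ZooKeeperInstance("instA", "zkhostA")
                    .getConnector("root", "secret".getBytes());

                // clone so the source table stays usable, then compact so every
                // file is self-contained (no data beyond the tablet range, no
                // references into other table dirs left over from the clone)
                conn.tableOperations().clone("mytable", "mytable_export", true,
                    Collections.<String,String>emptyMap(), Collections.<String>emptySet());
                // ("disable bulk imports" from the list above is an administrative
                // step; there is no single client API call for it here)
                conn.tableOperations().compact("mytable_export", null, null, true, true);
                conn.tableOperations().offline("mytable_export");

                // wait for the Accumulo GC to clean up files in the table dir, then:
                //   hadoop distcp hdfs://clusterA/accumulo/tables/<id> hdfs://clusterB/tmp/import

                // on cluster B, bulk import the copied files into a pre-created table
                Connector connB = new ZooKeeperInstance("instB", "zkhostB")
                    .getConnector("root", "secret".getBytes());
                connB.tableOperations().importDirectory("mytable",
                    "/tmp/import/files", "/tmp/import/failures", false);
              }
            }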

          Keith Turner added a comment -

          Could have an export command that creates special files along with a list of files to distcp. The create table command could have an option to read this information.
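
          For illustration, how that design might look from the client API. This is hypothetical at the time of this comment; the method names mirror the exportTable/importTable operations that eventually shipped in 1.5, and all names and paths are placeholders:

            import org.apache.accumulo.core.client.Connector;
            import org.apache.accumulo.core.client.ZooKeeperInstance;
            import org.apache.accumulo.core.client.security.tokens.PasswordToken;

            public class ExportImport {
              public static void main(String[] args) throws Exception {
                Connector conn = new ZooKeeperInstance("instA", "zkhostA")
                    .getConnector("root", new PasswordToken("secret"));

                // the table must be offline so its set of files cannot change mid-export
                conn.tableOperations().offline("table1");
                conn.tableOperations().exportTable("table1", "/exports/table1");

                // the export dir now holds table metadata plus distcp.txt, the
                // list of files to copy:
                //   hadoop distcp -f hdfs://clusterA/exports/table1/distcp.txt hdfs://clusterB/exports/table1

                // on cluster B, the import reads that information back in
                Connector connB = new ZooKeeperInstance("instB", "zkhostB")
                    .getConnector("root", new PasswordToken("secret"));
                connB.tableOperations().importTable("table1_copy", "/exports/table1");
              }
            }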

          jv added a comment -

          Two things:
          1. We may need to adjust bulk import to handle importing a hierarchy vs. importing a directory of files.
          2. (Adam gets credit for this.) It may be more prudent to nix the compaction+distcp and instead run a map reduce with identity mappers, no reducers, using AccumuloOutputFormat pointing to the destination HDFS. A sketch of such a mapper follows.
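
          A hedged sketch of that mapper (the AccumuloInputFormat/AccumuloOutputFormat job configuration is omitted because its API differs across versions; only the per-record logic is shown):

            import java.io.IOException;
            import org.apache.accumulo.core.data.Key;
            import org.apache.accumulo.core.data.Mutation;
            import org.apache.accumulo.core.data.Value;
            import org.apache.accumulo.core.security.ColumnVisibility;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.Mapper;

            // "Identity" mapper: re-emit each Key/Value read from the source table
            // (via AccumuloInputFormat) as a Mutation for AccumuloOutputFormat,
            // which is configured to write to the destination instance.
            public class CopyTableMapper extends Mapper<Key,Value,Text,Mutation> {
              @Override
              protected void map(Key key, Value value, Context context)
                  throws IOException, InterruptedException {
                Mutation m = new Mutation(key.getRow());
                m.put(key.getColumnFamily(), key.getColumnQualifier(),
                    new ColumnVisibility(key.getColumnVisibility().toString()),
                    key.getTimestamp(), value);
                context.write(null, m); // null table name -> the configured default table
              }
            }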

          Keith Turner added a comment -

          The procedure I pointed out earlier w/ compacting the table is something a user could do now w/ existing code. For future code changes, I think generalizing the chop compaction used by merge would be a good thing to do. That way only the files that need to be compacted are compacted, which minimizes the amount of decompression, deserialization, serialization, and compression that needs to be done.

          I think chop+distcp is a good way to go. distcp is a well-tested tool that copies bytewise and does not decompress, etc. The identity map reduce operation suggested above would be more efficient when all files need to be chopped, but I am not sure that will be the usual case. When only a small number of files need to be chopped, the identity map reduce will result in a lot more CPU load than chop+distcp. I suppose the ultimate optimization is a map reduce job that copies bytewise when no chop is needed and does the chop as part of the job when needed, but that would be a fairly complex bit of code that may not get the testing it needs.

          Making bulk import handle multiple dirs would be a nice convenience feature for users. At the moment it's fairly easy to work around w/ one hadoop command for anyone trying to do this w/ the current system.

            hadoop fs -mv <table dir>/*/*.rf <bulk import dir>
          
          jv added a comment -

          Be careful with that command. You can have name collisions, and you don't want to inadvertently drop data like this. It shouldn't be a problem in 1.4 with the new naming scheme, but if you still have files from 1.3 with its naming convention, you need to be careful. Or major compact first in 1.4.
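
          One way to sidestep the collision risk is to keep the tablet directory name in the destination file name instead of flattening everything into one directory. A sketch using the Hadoop FileSystem API (the paths are placeholders):

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class SafeFlatten {
              public static void main(String[] args) throws IOException {
                FileSystem fs = FileSystem.get(new Configuration());
                Path tableDir = new Path("/accumulo/tables/42"); // source table dir (placeholder)
                Path bulkDir = new Path("/tmp/bulk");            // bulk import dir (placeholder)
                for (FileStatus tablet : fs.listStatus(tableDir)) {
                  if (!tablet.isDir())
                    continue; // only descend into tablet directories
                  for (FileStatus file : fs.listStatus(tablet.getPath())) {
                    // prefix with the tablet dir name so t-0001/F00000a.rf and
                    // t-0002/F00000a.rf cannot clobber one another
                    Path dest = new Path(bulkDir,
                        tablet.getPath().getName() + "_" + file.getPath().getName());
                    if (!fs.rename(file.getPath(), dest))
                      throw new IOException("failed to move " + file.getPath());
                  }
                }
              }
            }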

          Dave Marion added a comment -

          In addition to the other necessary items (split points, table config, etc.) you may need to copy jars out of the classpath for custom iterators, or at least throw a warning.

          David Medinets added a comment -

          Does it make sense to build on top of an existing utility like Sqoop? I don't know how Sqoop could be expanded to consider split points and iterators. Maybe there is something else?

          Josh Elser added a comment -

          Given that this is an Accumulo-to-Accumulo transfer, I don't believe Sqoop makes any sense. My understanding of Sqoop is that it is intended for moving data between unstructured and structured stores (e.g. HDFS to relational). All of the information for the data already in an Accumulo instance is present in the source instance; it's just a matter of making sure all of that necessary information is actually transferred to the destination instance.

          Keith Turner added a comment -

          An implementation of import/export table

          Keith Turner added a comment -

          In addition to the other necessary items (split points, table config, etc.) you may need to copy jars out of the classpath for custom iterators, or at least throw a warning.

          Ah yes, I could log an INFO message when a table with custom iterators is imported. Doing more automatically starts to get tricky (jars from system A may conflict in some way with existing jars on system B).
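
          A minimal sketch of that INFO-level check; the "table.iterator." prefix is how iterator settings appear in table configuration, and the connector and table name are placeholders:

            import java.util.Map.Entry;
            import org.apache.accumulo.core.client.Connector;
            import org.apache.log4j.Logger;

            public class IteratorCheck {
              private static final Logger log = Logger.getLogger(IteratorCheck.class);

              // scan the imported table's properties and flag any iterator settings,
              // since the iterator classes themselves are not carried over by an import
              static void warnAboutCustomIterators(Connector conn, String table) throws Exception {
                for (Entry<String,String> prop : conn.tableOperations().getProperties(table)) {
                  if (prop.getKey().startsWith("table.iterator."))
                    log.info("Imported table " + table + " configures iterator "
                        + prop.getValue() + "; make sure the class is on this instance's classpath");
                }
              }
            }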

          Keith Turner added a comment -

          Would also need to consider the case when an imported table has constraints or a balancer configured.

          Keith Turner added a comment -

          The patch depends on a table being in the offline state. It relies on the user to keep the table offline while the distcp runs. I was thinking about a new table state, call it FROZEN. A frozen table can be read, but never written to. A FROZEN table cannot transition to ONLINE; it can only transition to DELETED. So a user could clone a table and then transition it to FROZEN instead of OFFLINE. I like this because it makes the user's intent for a table very clear on a system with many users. I don't like that it adds complexity to the system. If someone brings a table online while the distcp is running, the distcp may fail, which is probably not too big of a deal.
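
          For illustration only, the proposed rule as a transition check on a hypothetical table state enum (FROZEN does not exist in Accumulo; this just encodes the constraint described above):

            enum TableState {
              ONLINE, OFFLINE, FROZEN, DELETED;

              // a FROZEN table is read-only forever: it may be deleted,
              // but can never be brought back ONLINE
              boolean canTransitionTo(TableState next) {
                if (this == FROZEN)
                  return next == DELETED;
                return true; // other transitions unchanged in this sketch
              }
            }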

          Christopher Tubbs added a comment -

          Added ACCUMULO-456-4.patch to check for write-ahead logs before exporting.
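
          The gist of such a check, sketched against the 1.4-era metadata layout. This is not the patch's actual code, and the connector and table id are placeholders; the idea is that an exported table must have no write-ahead log entries left in !METADATA, since its files would not yet reflect logged writes:

            import org.apache.accumulo.core.client.Connector;
            import org.apache.accumulo.core.client.Scanner;
            import org.apache.accumulo.core.data.Range;
            import org.apache.accumulo.core.security.Authorizations;
            import org.apache.hadoop.io.Text;

            public class WalCheck {
              // true if any tablet of the given table still references a write-ahead log
              static boolean hasWalEntries(Connector conn, String tableId) throws Exception {
                Scanner s = conn.createScanner("!METADATA", new Authorizations());
                // metadata rows for a table run from "<id>;" through its default tablet "<id><"
                s.setRange(new Range(new Text(tableId + ";"), new Text(tableId + "<")));
                s.fetchColumnFamily(new Text("log")); // WAL entries live under the "log" family
                return s.iterator().hasNext();
              }
            }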

          Eric Newton added a comment -

          SVN#1381570 patch applied.

          Billie Rinaldi added a comment -

          Chris, I've added you to our contributor role in JIRA. Thanks for the patches!


            People

             • Assignee: Keith Turner
             • Reporter: Keith Turner
             • Votes: 0
             • Watchers: 7
