HCATALOG-333

Changes to simplify the external data reader/writer API

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This JIRA proposes some simplifications to the data reader/writer API created in HCATALOG-287.

        Issue Links

          Activity

          Thejas M Nair created issue -
          Thejas M Nair made changes - Fix Version/s: 0.5 [ 12320147 ]
          Thejas M Nair added a comment -

          Illustrating the proposed interface with an example of a class that reads (DataReader) and one that writes (DataWriter):

public class DataReader {

	public static void main(String[] args) throws HCatException {

		// This config contains all the configuration that the master node
		// wants to provide to HCatalog, such as the metastore URI.
		Map<String, String> config = new HashMap<String, String>();

		// This piece of code runs on the master node and gets the necessary context.
		ReaderContext context = runsInMaster(config);

		// Possible future additions:
		// - logic to determine the number of slaves based on the total size
		//   associated with the ReaderContext, or on the number of unique
		//   partition-key values, using functions that might be added later.

		// For now, use a constant number of slaves.
		int NUM_SLAVES = 10;
		// In future, ReaderContext might also have functions that specify how
		// to group the data across slaves (e.g. by partition key).

		// If HCatalog is unable to split the read across the specified number
		// of slaves, it will return fewer contexts.
		List<SlaveReaderContext> slaveContexts = context.getSlaveContexts(NUM_SLAVES);

		// The master node will serialize each slave reader context and make it
		// available at the slaves.
		for (SlaveReaderContext slaveContext : slaveContexts) {
			// Slaves do the actual work.
			runsInSlave(slaveContext);
		}
	}

	private static ReaderContext runsInMaster(Map<String, String> config) throws HCatException {
		// The builder always needs a table name.
		ReaderContext.Builder builder = new ReaderContext.Builder(config);
		ReaderContext cntxt = builder.withDb("default").withTable("myTbl").withFilter("date < '2012'").build();
		return cntxt;
	}

	private static void runsInSlave(SlaveReaderContext slaveContext) throws HCatException {

		HCatReader reader = DataTransferFactory.getHCatReader(slaveContext);
		Iterator<HCatRecord> itr = reader.read();
		while (itr.hasNext()) {
			System.out.println(itr.next());
		}
	}
}
          
public class DataWriter {

	public static void main(String[] args) throws HCatException {

		// This config contains all the configuration that the master node
		// wants to provide to HCatalog, such as the metastore URI.
		Map<String, String> config = new HashMap<String, String>();

		// This piece of code runs on the master node and gets the necessary context.
		WriterContext cntxt = runsInMaster(config);

		// The master node will serialize each slave writer context and make it
		// available at the slaves.
		int numOfSlaves = Integer.parseInt(args[0]);
		// There is no obvious use case yet for a separate SlaveWriterContext,
		// but it keeps the writing interface consistent with the reading
		// interface and future-proof.
		List<SlaveWriterContext> slaveContexts = cntxt.getSlaveWriterContexts(numOfSlaves);
		for (SlaveWriterContext slaveContext : slaveContexts) {
			// Slaves do the actual work.
			runsInSlave(slaveContext);
		}

		// Then the master commits if everything went well.
		commit(config, true, cntxt);
	}

	private static WriterContext runsInMaster(Map<String, String> config) throws HCatException {
		WriterContext.Builder builder = new WriterContext.Builder(config);
		WriterContext info = builder.withDb("mydb").withTable("myTbl").build();
		return info;
	}

	private static void runsInSlave(SlaveWriterContext slaveContext) throws HCatException {

		HCatWriter writer = DataTransferFactory.getHCatWriter(slaveContext);
		writer.write(new HCatRecordItr());
	}

	private static void commit(Map<String, String> config, boolean status, WriterContext cntxt) throws HCatException {

		WriteEntity.Builder builder = new WriteEntity.Builder();
		WriteEntity entity = builder.withTable("myTbl").build();
		HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
		if (status) {
			writer.commit(cntxt);
		} else {
			writer.abort(cntxt);
		}
	}

	// Stub iterator that would supply the records to write.
	private static class HCatRecordItr implements Iterator<HCatRecord> {

		@Override
		public boolean hasNext() {
			// Stub: no records.
			return false;
		}

		@Override
		public HCatRecord next() {
			// Stub: no records.
			return null;
		}

		@Override
		public void remove() {
			// Not supported.
		}
	}
}
          
          
          
          
          Thejas M Nair added a comment - edited

          The motivations and changes made to the API (illustrated above):

          1. Simplify the interface by reducing the number of classes exposed in the user API and removing some extra steps: the ReadEntity and WriteEntity classes were removed.

          2. In the original user API, some functions in HCatWriter and HCatReader were to be used only on the master, and others on the slaves. The entire class functionality is now available on the slaves, since the user no longer calls HCatWriter and HCatReader functions on the master.

          3. The reading interface exposed InputSplit as the unit of work. Users might want more information than InputSplit provides, a different number of InputSplits, or different ways of combining InputSplits (e.g. dividing inputs by partition values). InputSplit is replaced with SlaveReaderContext as the way to divide the input. SlaveReaderContext can potentially add more functions in future that help in getting metadata about the unit of work.

          4. In future, for reading, the user should be able to specify different ways of splitting the data across slaves (e.g. by partition column value), and should also be able to use custom logic for splitting data across slaves. The use of SlaveReaderContext will help with this.

          5. The writer API has a similar change, using SlaveWriterContext.
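          The reader-side shape of this proposal can be sketched as minimal, self-contained Java skeletons inferred purely from the usage in the examples above. These are not the real HCatalog classes: all method bodies, and the toy assumption that the table resolves to 4 splittable units, are hypothetical placeholders for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical skeletons inferred from the usage in the examples above;
// NOT the actual HCatalog implementation.
public class Main {

    // A unit of read work handed to one slave (the proposed replacement for
    // InputSplit); serializable so the master can ship it to the slaves.
    static class SlaveReaderContext implements java.io.Serializable {
        final int id;
        SlaveReaderContext(int id) { this.id = id; }
    }

    // Master-side read context, produced by a builder.
    static class ReaderContext {
        private final int totalUnits;
        ReaderContext(int totalUnits) { this.totalUnits = totalUnits; }

        // Returns at most numSlaves contexts; fewer when the data cannot be
        // split that finely, matching the comment in the example above.
        List<SlaveReaderContext> getSlaveContexts(int numSlaves) {
            int n = Math.min(numSlaves, totalUnits);
            List<SlaveReaderContext> out = new ArrayList<SlaveReaderContext>();
            for (int i = 0; i < n; i++) {
                out.add(new SlaveReaderContext(i));
            }
            return out;
        }

        static class Builder {
            private final Map<String, String> config;
            private String db;
            private String table;
            private String filter;

            Builder(Map<String, String> config) { this.config = config; }
            Builder withDb(String db) { this.db = db; return this; }
            Builder withTable(String table) { this.table = table; return this; }
            Builder withFilter(String filter) { this.filter = filter; return this; }

            ReaderContext build() {
                // Toy stand-in: pretend the (db, table, filter) selection
                // resolves to 4 splittable units.
                return new ReaderContext(4);
            }
        }
    }

    public static void main(String[] args) {
        ReaderContext ctx = new ReaderContext.Builder(new HashMap<String, String>())
                .withDb("default").withTable("myTbl").withFilter("date < '2012'").build();
        // Asking for 10 slaves yields only as many contexts as there are units.
        System.out.println(ctx.getSlaveContexts(10).size());
    }
}
```

          Point 3 above is visible here: because the slave-side handle is a class the API owns (SlaveReaderContext) rather than a raw InputSplit, metadata accessors can be added to it later without changing the master-to-slave handoff.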

          Thejas M Nair made changes - Link: This issue is related to HCATALOG-287
          Thejas M Nair made changes - Fix Version/s: 0.5 [ 12320147 ]

            People

            • Assignee: Unassigned
            • Reporter: Thejas M Nair
            • Votes: 0
            • Watchers: 0
