17:36 <jmhsieh> Here's the story I had in mind after talking with carl (guy who is doing the next hive release)
17:37 <jmhsieh> we use the collectorsink with the escaping to write out events to the different directories.
17:37 <jmhsieh> they can be gzip files – that shouldn't matter too much.
17:37 <jmhsieh> Hive needs a thing called a SerDe (Serializer / Deserializer) to understand data.
17:38 <jmhsieh> Ideally flume would write to a format that hive understands already, or we'd hook in a new format.
17:38 <jmhsieh> Instead of loading each record one by one, we bulk load data into hive. This is done by pointing hive to particular directories in hdfs (which flume has nicely laid out!).
17:39 <jmhsieh> the directory structure reflects the partitioning used in hive.
17:39 <jmhsieh> a new hive query runs, and since the data is there, it uses the new data.
17:40 <jmhsieh> The catches are – there needs to be something to notify hive that there is a new partition/dir to load when doing queries.
17:40 <aphadke> (i'll chime in once u r done - but mostly our thoughts match)
17:40 <jmhsieh> and there needs to be a table defined that knows how to read the data (this is where the SerDe comes in), and that explains how data is partitioned
17:41 <jmhsieh> The hooks in HiveNotifying* do the first part – they fire when flume knows it needs to notify hive, and are currently just a stub.
17:41 <jmhsieh> the latter part – defining the table is probably something to do manually and document in the docs.
17:42 <aphadke> 1 - defining table should be manual and mentioned in docs.
17:43 <hammer> good to see you guys moving forward
17:43 <aphadke> hammer:
17:43 <aphadke> 2 - directory structure for FLUME is fantastic for HIVE…. no worries there
17:44 <aphadke> 3 - HiveNotifying* knows that the file has been written, which is good….
17:44 <aphadke> take a look at http://svn.apache.org/viewvc/hadoop/hive/trunk/service/src/test/org/apache/hadoop/hive/service/TestHiveServer.java?view=markup
! abecc [~email@example.com] has joined #flume
17:45 <aphadke> the thrift API allows us to load data inside hive.. i am not exactly sure the benefits of SerDe as against using the thrift api…. the api should essentially do the SerDe thing for us
17:46 <aphadke> its prolly not a good idea for FLUME to change the log data, HIVE allows regex while creating the table… so the raw log file should just load in the table as per the regex and again, the thrift API should take care of that.
17:46 <jmhsieh> aphadke: ah! I didn't know what the hive side looked like. the thrift stuff starts to make sense now..
17:47 <jmhsieh> aphadke: yeah, I assumed that flume would be writing out raw logs
17:47 <aphadke> so essentially, we need to add thrift API to HiveNotifying*, read the gzip'ed log files and load them based on date format + partition
17:48 <jmhsieh> aphadke: prolly need to talk to carl or one of the hive guys to get pointers to exactly how adding a partition works. I think if you use dates in the path of flume's output dirs, it is just a matter of letting hive know about the structure.
17:49 <aphadke> jmhsieh: hql syntax would be :
17:49 <aphadke> LOAD DATA INPATH '<somepath>' INTO TABLE test_table PARTITION (ds='2010-06-17');
17:49 <jmhsieh> aphadke: I'd prefer if the thrift thing implemented the HiveDirCreatedHandler interface
17:49 <jmhsieh> aphadke: hql stuff looks simple enough
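[Editor's note: a minimal sketch of how a notifier might turn a dated flume output directory into the LOAD DATA statement quoted above. The helper name load_statement, the trailing YYYY-MM-DD directory convention, and the default partition key "ds" are assumptions for illustration; none of this is an existing Flume or Hive API, and it only builds the statement text without sending it to hive.]

```python
import re

# Assumption: the output directory ends in a YYYY-MM-DD component,
# e.g. /user/aphadke/someLogs/2010-07-07
DATE_DIR = re.compile(r"(\d{4}-\d{2}-\d{2})/?$")


def load_statement(path, table, partition_key="ds"):
    """Build an HQL LOAD DATA statement for one dated directory.

    Hypothetical helper: it only formats the statement string.
    """
    match = DATE_DIR.search(path)
    if match is None:
        raise ValueError(f"no trailing YYYY-MM-DD directory in {path!r}")
    if "'" in path:
        # Refuse paths that would break out of the quoted HQL literal.
        raise ValueError(f"path contains a single quote: {path!r}")
    return (
        f"LOAD DATA INPATH '{path}' INTO TABLE {table} "
        f"PARTITION ({partition_key}='{match.group(1)}');"
    )


if __name__ == "__main__":
    print(load_statement("/user/aphadke/someLogs/2010-07-07", "test_table"))
```

[The partition value is taken from the path rather than the wall clock, so a directory that is loaded late still lands in the partition matching its name.]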
17:50 <aphadke> jmhsieh: afaik, thrift reads the file from HDFS, creates the directory structure inside hive and moves file from HDFS to hive directory structure
17:51 <jmhsieh> aphadke: I think the hive stuff actually uses the data inplace (without moving it) – but I'm not completely sure about this.
17:51 <aphadke> it definitely moves it..
17:51 <jmhsieh> ok
17:52 <aphadke> i.e. data from /user/aphadke/someLogs/2010-07-07 is moved to /user/hive/warehouse/<table_name>/partition/ etc.
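[Editor's note: a small sketch of where a loaded partition would end up, assuming hive's default warehouse root of /user/hive/warehouse and its usual key=value naming for partition directories. The function name warehouse_partition_dir is hypothetical, and a real deployment may configure a different warehouse location.]

```python
import posixpath


def warehouse_partition_dir(table, partition, root="/user/hive/warehouse"):
    """Return the directory a managed-table partition would occupy.

    Assumptions: default warehouse root and key=value partition
    directories, one path level per partition column, in the given order.
    """
    parts = [f"{key}={value}" for key, value in partition.items()]
    return posixpath.join(root, table, *parts)


if __name__ == "__main__":
    # Source directory from the chat: /user/aphadke/someLogs/2010-07-07
    print(warehouse_partition_dir("test_table", {"ds": "2010-07-07"}))
```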