|
Note that this contains some installation-specific stuff to start up a Thrift hdfs server that I will have to abstract away before public release.
Thanks, Venky. Instead of parsing the log file to get the port, you can pass it as the first argument to HadoopThriftServer (see hdfs_py_venky.diff
I don't know how to abstract away the basedir stuff, though - keep me posted, since I'm interested in doing something similar. I have started to implement reading on a per-block basis:
string readBlock(1:FileHandle handle, 2:BlocLocation block)
throws (1: IOException ouch)
I have first added two more fields to BlockLocation: struct BlockLocation {
1: list<string> hosts,
2: list<string> names,
3: i64 offset,
4: i64 length,
5: i64 id, /* bock id */
6: i64 genStamp /* block generation timestamp */
}
I then provide two implementations of getFileBlockLocations(): One for Does it sound like a reasonable plan? Thanks Carlos. Your proposed change to BlockLocation sounds good.
>Instead of parsing the log file to get the port, you can pass it as the first argument to HadoopThriftServer (see hdfs_py_venky.diff. A value of '0' will make HadoopThriftServer pick the first one available, which is probably what you want in your hadoopThriftServer Python class. In the usual case, we want the HadoopThriftServer to bind to any available port. This allows running multiple HadoopThriftServer on the same machine. Also, the python client needs to know the port that its associated instance of the HadoopThriftServer is using. That's the reason why the python client parses the log file to find the port number. Are you saying that there is another way to achieve the same functionality? Patch HADOOP-4707.diff removes the need for an extra Thrift server
process by creating Thrift server instances on the namenode and datanode processes. It also allows direct reads and writes to datanode instances. The Thrift namenode server instance implements the Thrift service The Thrift datanode server instance implements the Thrift service In order to read data from a file, Thrift clients request a list Data writes are implemented in a similar way: Thrift Namenode.addBlock() must be called on files opened by Namenode.create() Since Datanode.readBlock() and Datanode.writeBlock() expect data blobs The following entries in hdfs-default.xml define the locations for
The following entries limit the number and lifetime of Thrift server
Thrift namenode and datanode servers try to obtain the identity of Perl and Python higher-level APIs for read and write functionality are $ ant -Dcompile.thrift=true test-thrift You may also need to pass the location of Thrift's Python library $ ant -Dcompile.thrift=true \ Perl and Python tests for chmod() functionality show that write Perl and Python high-level APIs call Namenode.addBlock() on each I have used Thrift version 20080411-r743112 from Thrift's Hi Carlos, This looks awesome! Lots of cool stuff. I have not looked at the details, but here are some of my initial comments:
1. The idea of making the Namenode expose a Thrift interface is a great idea. This removes the need for a separate daemon to map Thrift methods to ClientProcotol methods. 2. Is there a way to "layer" the Thrift layer (DatanodeThriftServer and NameNodeThrift server) around the org.apache.hadoop.hdfs package (instead of making them part of org.apache.hadoop.hdfs). Can these be part of org.apache.hadoop.thriftfs and reside in the src/contrib/thriftfs directory? This essentially means that the Thrift server is a "layer" then encapsulates the hadoop Namenode/DataNode. This could be in its own library called libhdfsthrift.jar (or something like that) 3. There could be new source files in the base hdfs package that allows plugging in of multiple protocol stacks (the first one being Thrift). This code could use reflection (and configuration settings) to figure out if the libhdfsthrift.jar is in the classpath, and if so, then use those methods from that jar to initialize the Thrift server. The reason I propose the above is because it does not force every Hadoop install to use Thrift. It keeps the base Namenode/Datanode code clean and elegent. It also allows plugging in other protocol stacks to expose the Namenode/datanode functionality. 4. It is possible that the Thrift implementation might need to use some (currently) package-private methods in the Datanode/Namenode, but we can work on making them public if need be. 5. Allowing the Thrift interface to read file contents is easy. However, writing to blocks is more difficult, especially because the DFSClient.java is a heavy-weight piece of code and participates heavily in ensuring correct recovery from write-pipeline failures, allows "appending" to existing files, ensuring all blocks of a file are equal size, etc.etc. Do you have an application that will need to write to HDFS files using this Thrift interface? 6. Your unit tests are nice. It is imperative for us to detect incompatible changes to base HDFS APIs earlier rather than later. Thanks for your feedback, Dhruba!
Yes, no problem with that. I chose to work in src/hdfs mainly to avoid conflicts with your code under src/contrib/thriftfs.
Yes, that sounds good. Would it make sense to open a new JIRA issue to implement the plugin functionality? Otherwise the attachment list to this issue might end up quite crowded.
OK, so no real reason for hosting code under org.hadoop.hdfs.server.{datanode,namenode}, then.
Reading is by far more important to me than writing, and I could just do the writes using Java. If the current implementations of append(), create(), addBlock() and writeBlock() were show-stoppers for getting the stuff committed sooner, I can focus first on addressing your points, and then figure out later what to do with them. What would you recommend?
Being able to test the Thrift code (and nothing else) by doing a: $ ant -Dtestcase=TestThriftServer test-core was very helpful during development. Will it be possible to do a similar thing when the Thrift code moves back to src/contrib/thriftfs? > Would it make sense to open a new JIRA issue to implement the plugin functionality?
Yes, please open a new JIRA to implement "export namenode/datanode functionality through a pluggable RPC layer". The component for this jira will be "dfs". You can submit the code for this new JIRA, get it vetted by the HDFS community and then commit to svn. After that is done, then we can work on getting this one (HADOOP-4707) commiited. You might need a separate unit test associated with this new JIRA. > Reading is by far more important to me than writing, and I could just do the writes using Java. > was very helpful during development. Will it be possible to do a similar thing when the Thrift code moves back Yes, If you move the TestThriftServer code to src/contrib/thriftfs/test/org/apache/hadoop/thriftfs, that should be all that is required.
I just filed Now that
I'm going to reshape HADOOP-4707.diff Patch HADOOP-4707.patch
and datanodes using HDFS service plugins. Both plugins implement the Thrift services defined in src/contrib/thriftfs/if/hdfs.thrift. In order to read data from a file, Thrift clients request a list Both plugins add thriftfs-default.xml and
The following properties limit the number and lifetime of Thrift server
Since Datanode.readBlock() expects data blobs On Dhruba's suggestion, I've removed all write-related methods for now. Thrift namenode and datanode servers try to obtain the identity of I've removed the Perl and Python high-level APIs from this patch in I've updated Thrift's libthrift.jar A few comments on this patch:
1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode. Calling getDatanodeReport then triggers an NPE in ThriftUtils.toThrift It seems to me that we need to add some hooks to DataNode and/or NameNode that allow the plugins to register callbacks on certain events. Specifically for this case, the DataNode needs to re-register its thrift port with the NameNode when it reconnects. Specifically, I think the DataNode needs a hook in DataNode.register() that calls through to plugins. Doing this in a generalized way on the 2) I think the datanode hostnames need to be canonicalized somehow when inserting into the thriftPorts map. On a pseudodistributed cluster, I'm seeing getDatanodeReport fail to find the thriftPort since the DN is registering under the name 127.0.1.1, but then being looked up as 127.0.0.1 for whatever reason. I'll look into a solution for this. 3) Lastly, I think the "Unknown Thrift port for Datanode" NPE is unnecessarily strict. I'd prefer for it to return a -1 or a 0 to indicate that the DN thrift server isn't running. This would require some extra checks elsewhere in the code before trying to contact a non-existent thrift server, but it enables getDatanodeReport to work even without the thrift plugin on the DNs. > 1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode.
Is there a way for the Datanode plugin to try re-register its port with the NameNode-plugin when it encounters an error? >> 1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode.
> Is there a way for the Datanode plugin to try re-register its port with the NameNode-plugin when it encounters an error? The problem is that the error is encountered on the NameNode. The flow is that the client is asking the NameNode for info, and the NameNode needs to send the client the host:port for the relevant DN. There's no way for the NN plugin to contact the DN plugin to tell it to re-register.
Yep. A simpler option might be to spawn a thread on the datanode which calls Namenode.datanodeUp() every so often. Any preference?
I came across that same issue when writing the test suite. The "canonical" name, as far as the thriftPorts map is concerned, is org.apache.hadoop.thriftfs.DatanodePlugin.datanode.DataNode.dnRegistration.getName(). On my real 6-node test cluster, that value is the same as the value of org.apache.hadoop.hdfs.protocol.DatanodeInfo.name for every DatanodeInfo instance, so everything works. On a MiniDFSCluster cluster, however, it is not — just as you found out in your case, Todd (classloader issues, perharps?). My workaround for the test suite was to set the property slave.host.name to the expected value.
Yep, makes sense. I propose setting the port to -1, and also doing something like this: public static Block toThrift(LocatedBlock block, String path, Map<String, Integer> thriftPorts) { if (block == null) { return new Block(); } List<DatanodeInfo> nodes = new ArrayList<DatanodeInfo>(); for (org.apache.hadoop.hdfs.protocol.DatanodeInfo n: block.getLocations()) { DatanodeInfo node = toThrift(n, thriftPorts); if (node.thriftPort != -1) { nodes.add(toThrift(n, thriftPorts)); } } // [...] } That way we return to the client the (possibly empty) list of all block locations accessible by the Thrift interface.
That is definitely simpler but seems ugly to me. I know heartbeats are the norm elsewhere in Hadoop, but I'd prefer not to introduce more.
I think I've figured out the issue here: In ThriftUtils.createNamenodeClient, it uses dfs.thrift.address to connect to the namenode. This same configuration variable is used in NamenodePlugin.getAddress(). So, with the default configuration of 0.0.0.0:9090, the NN plugin binds to all local interfaces, and the datanodes attempt to connect to whatever IP is first in the local wildcard. It seems to me that the correct behaviour would be:
I'll work on hacking these up and see where I get. Did some more hacking around in the code and came to some conclusions:
In order to guarantee that ThriftUtils.toThrift works on a DatanodeInfo properly, we need to ensure that the host name sent to NamenodePlugin.datanodeUp() is the same name as is in the DatanodeInfo. These DatanodeInfo names come from the DatanodeRegistration that gets sent from DN->NN at DN startup. Unfortunately, with the current setup this is broken for the following reason:
My proposed solution is: 1) Add a hook to the Datanode Plugin interface to allow the DN plugin to defer the datanodeUp call until after the DN has registered with the NN. This puts the dnRegistration member in a stable state that matches the DatanodeInfo stored on the NN side. The necessary plugin infrastructure for this is in HADOOP-5640, newly created. 2) This step isn't strictly necessary, but I'd like to modify the datanodeUp and datanodeDown functions to take three parameters (name, storage id, and thrift port) rather than the current two (name and thrift port). This allows those functions to construct DatanodeID objects and then maintain a Map<DatanodeID, Integer> thriftPorts array instead. The advantages here are:
Alternatively, rather than taking three arguments, we could introduce a DatanodeID thrift struct with name and storageID members. Thanks for your comments, Todd!
Sounds good to me.
It wouldn't hurt to pass the storageID as well.
I'd rather stick to datanodeUp(name, storageID, thiftPort)} — one class less to generate, and smaller Thrift IDL file.
I agree that HADOOP-5640 is a cleaner path. Minor corrections:
Attached is a patch with some further improvements:
Thanks for HADOOP-4707-55c046a.txt
I just noticed a small improvement to DatanodePlugin.ThriftHandler.readBlock: If we implement it like this, we might save quite a few memory copies: public BlockData readBlock(Block block, long offset, int length) throws IOException, TException { LOG.debug("readBlock(" + block.blockId + "," + offset + "," + length // [..] // Allocate read buffer on ret directly, so that no extra memory copy is done // if we read all bytes ret.data = new byte[length]; int n = reader.read(ret.data, 0, length); if (n == -1) { throw new EOFException("EOF reading " + length + " bytes at offset " + offset + " from " + block); } LOG.debug("readBlock(" + block.blockId + ", " + offset + ", " + length + "): Read " + n + " bytes"); if (n < length) { byte[] buf = new byte[n]; System.arraycopy(ret.data, 0, buf, 0, n); ret.data = buf; } ret.length = n; // [..] return ret; } Nice fix. Here's a diff against my previous patch, and I'll upload a new patch against trunk as well with a couple more fixes.
diff --git a/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java b/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/D index 6929a7b..14b2e32 100644 --- a/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java +++ b/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java @@ -100,8 +100,19 @@ public class DatanodePlugin } LOG.debug("readBlock(" + block.blockId + ", " + offset + ", " + length + "): Read " + n + " bytes"); - ret.data = new byte[n]; - System.arraycopy(buf, 0, ret.data, 0, n); + + if (n == length) { + // If we read exactly the same number of bytes that was asked for, + // we can simply return the buffer directly + ret.data = buf; + } else { + assert n < length; + // If we read fewer bytes than they asked for, we need to write + // back a smaller byte array. With the appropriate thrift hook + // we could avoid this copy, too. + ret.data = new byte[n]; + System.arraycopy(buf, 0, ret.data, 0, n); + } ret.length = n; summer.update(ret.data); This fixes the test cases, and makes one slightly incompatible change. DatanodeInfo.name was previously just the hostname, but is now the hostname:port (datanodeInfo.getName()) as described in the thrift interface docs. This was required to get tests to pass for me.
Some more notes after using this API for a bit:
Regarding the authentication issue, I would propose a somewhat major change. We should add a struct called CallContext that looks something like this: struct CallContext {
1:string user
2:list<string> groups
}
We should then add this as a parameter to every RPC, at least in the NameNode service. Since Thrift uses id numbers for parameters, this will safely end up as null for any callers that are using the old interface. We should then add something to ThriftUtils along the lines of: CallContext completeCallContext(CallContext ctx) {
if (null == ctx) {
ctx = new CallContext();
... // get call context using identd as it does currently;
}
return ctx
}
With this code, any existing clients preserve their identd behavior with no issues. If people are concerned with the lack of security here, I would make the following points:
Yep, sounds very good.
I would prefer not to follow that route and leave the method signatures untouched, for the following reasons:
If peope are really concerned about authentication, perhaps it would be better to add some kind of transport-level authentication support for Thrift itself (like TLS- or SSL-enabled Thrift transports, for instance). Then on the server side we could either start a org.apache.thrift.transport.TServerSocket or a org.apache.thrift.transport.TSecureServerSocket, depending on some configuration value, and then everyone would be catered for.
Not to be a pain, but unfortunately it's completely inadequate for my particular requirements
This is somewhat true. We can use the new-ish "thrift -gen html" to generate html documentation, though, which at least is slightly nicer to look at. I don't think adding a single common parameter to every call really clutters things that much, though, since the additional mental burden is "O(1)". Once you understand what the parameter means on one call, you understand it everywhere
Yes, it's slightly less readable, but on the other hand I assume that anyone using this service will be wrapping the thrift client in some kind of fliesystem facade anyway. The change then is confined to a single module. The fact that existing C++ code will need changes doesn't concern me too much since (a) it's a very clear compilation-time error, not something subtle and hard to find, and (b) this hasn't been committed yet, so now is the one time we should feel especially free to break backwards compat!
This doesn't really solve my issue. I don't care about authentication from an access control perspective. I care more about the ability to assume different user roles. One primary use case for this Thrift interface, as I see it, is so that web applications written in non-Java languages can access HDFS more easily. Web applications typically run as some user like www-data or nobody, which is problematic if they really need to be assuming the identity of the end user (tlipcon) who has provided some authentication via a login form, etc. The other option I considered was to add a su(...) RPC and use a specialized TTransport on the server side to hold the current UGI, but this breaks a lot of layer encapsulation properties. Namely, the TTransport cannot automatically reconnect to the server, since the UGI will be lost after the reconnect.
Not a pain at all — just a different set of requirements
Good point: I already have wrapper code for Perl, Python and I'm starting with the C++ stuff. I could handle the user parameter in the method wrappers without exposing it on the method wrappers' signature. No objections from my side, then. Assigning this to Todd (with his agreement), since I won't have much time to work on this for the coming weeks.
Made some more iterations on this. As this is growing somewhat large (especially with the diff including all the generated code), I've also put a branch up on github for those wanting to follow along in a more easily parseable fashion:
http://github.com/toddlipcon/hadoop/commits/hadoop-4707 I'll also upload an up to date patch momentarily. Summary of changes since last upload:
I'd also like to propose moving this code into contrib/thrift rather than contrib/thriftfs. The motivation for this is that I will soon be working on JobTracker and TaskTracker thrift plugins, and there's a lot of common code I want to share. Putting that code in a module called "thriftfs" seems incorrect. Additionally, contrib/thriftfs currently is the home for two entirely separate projects - this, and the earlier "proxy style" thrift wrapper of the DFS client. Since these two projects share no code, it makes no sense for them to be in the same contrib dir. If no one has any objections, I'll do this move in the next day or two. Todd: Any chance you could try out my patch for THRIFT-394 and include the C# bindings in the patch for this?
Michael: Carlos and I decided offline that we probably shouldn't be including the language bindings in the patch at all, but rather expect users to generate them themselves. Can you try using thrift -gen csharp if/hdfs.thrift and see how that goes?
> "proxy style" thrift wrapper of the DFS client.
This code should be made obselete (and deleted) when this patch goes into trunk. It would be nice to keep the name of the file system related classes (from this patch)to be org.apache.hadoop.thriftfs though.
Sure thing. You're fine with renaming the directory to src/contrib/thrift though? > You're fine with renaming the directory to src/contrib/thrift though?
Sure. The only caveat is that the convention is to match the package name with the directory name. So, if you have the package name is org.apache.hadoop.thriftfs, then it might be convenient to keep the directory name same as what we got now.
How about a package structure like:
I think this makes the most sense to me at this point. Renaming contrib/thriftfs to contrib/thrift is a pain, but I would hate to see a mapred dir inside a "thriftfs" package or contrib project. I'd like to stop work on this JIRA with the most recent patch (31c331) and perform the above package refactoring in a new JIRA.
For those that would like a more granular review, the commits are available on github: http://github.com/toddlipcon/hadoop/tree/hadoop-4707 The diff includes all of these commits compared against HADOOP-5640, which is required by this issue. I'll also upload the newest libthrift.jar that I've been testing with (slightly newer than Carlos's). This needs to be put in src/contrib/thrift/lib/. md5sum should be fc6596d8c9b8964ac60cd80c277b9f92 How about a package structure that is like this:
org.apache.hadoop.thriftfs
Where does the common code go? In the most recent patch, the NN and DN services extend a common HadoopService which exposes common elements like JVM stats, VersionInfo, and Metrics. I'd certainly want to share this code between mr and fs. There's also some common code for starting up a Thrift server, handling UserGroupInformation, etc. I just opened HADOOP-5840 to continue this discussion. The old patch fell out of date because of the new AccessToken work in
I also took the opportunity to remove all of the generated code aside from gen-java as we decided above. As before, this patch is relative to and blocked by HADOOP-5640 (plugin infrastructure with hooks on important actions) Marking as patch available, though Hudson is likely to fail building it since it requires the new libthrift.jar, and the patch is gzipped since it was large. Tests pass for me. Since the only modification to non-contrib is the addition of DatanodeID.getLongString(), and this is currently blocking HADOOP-5840 (thrift contrib reorg) and through that, HADOOP-5703 (jobtracker/mapred thrift), I'd really appreciate if someone could take a look at this (and HADOOP-5640 which it depends on) ASAP.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408978/hadoop-4707-867f26.txt.gz against trunk revision 779106. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/412/console This message is automatically generated. Thsi patch allows the hdfs-thrift server to run inside the NN. This is a good thing. However, some administrators might like to keep running the hdfs-thrift server separately from the namenode (as it currently stands in trunk)... the reason being that the NN is already a big memory hog and allowing the hdfs-thrift server to share the same address space could introduce new resource bottlenecks in the NN that could impact the whole cluster stability. Do you agree?
Hey Dhruba,
It's true that this introduces code that runs in the NN. However, the design of this new code maps one-for-one onto the communication design of HDFS itself - no data goes through the NN thrift server. The additional load on the NN is simply the metadata RPCs that the clients are sending, and of course the sockets serving the Thrift connections. Given that the existing (external) Thrift service just fowards the metadata RPCs to the NameNode, there's no additional load there. The external server probably does demultiplex them down to fewer RPC connections, and Thrift is probably less CPU-efficient than the hand-coded Writable-based RPCs that Hadoop uses internally. As you mentioned, and correlated with my experience, the NN resource that is in short supply is memory more often than CPU. Given that, I don't think the additional load from the Thrift SerDe is going to end up being significant. If we expect that there will be a high number of concurrent clients to the NN Thrift service, switching out the TThreadPoolServer for something based on the new-ish TNonBlockingServer would decrease memory consumption by reducing the number of threads. I can look into that if you like. However, I do agree that some ops people will have philosophical objections to putting more code in the NameNode. This makes sense, and might be a good reason to keep the external Thrift gateway alive. However, I haven't touched its code, and don't want to commit to maintain it implicitly by adding new code next to it Thanks @Todd, thanks for the info.
I am kinda suspicious about the memory requirements for the Thrift server. It is possible that I might like to to run it separately from the NN for starters. For example, our NN is always kinds maxed out on memory and it would be a definite no-no to make the thridt server run inside it. On the other hand, some admins might be ok runnign these two pieces two together. is it possible to enhance this patch so that the thrift-hdfs server can be run separately from the NN? is it possible to enhance this patch so that the thrift-hdfs server can be run separately from the NN?
Woops, sorry, didn't realize I hadn't updated the JIRA a couple weeks ago.
With the current feature set it's possible to do so, since the NN plugin only accesses the NameNode by making calls on it that are available over RPC. Doing so would probably be a day's work, so I'd like to do it as a separate JIRA later on. Additionally, I'm worried that making this architecture change will tie our hands in the future if we want to make any calls from the NN into the Plugin using hooks like in HADOOP-5640. I think we had talked offline and decided that, as long as the old Thrift server remains in tact, your use case won't be disturbed for now. Is that OK? |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
See https://issues.apache.org/jira/secure/attachment/12394600/hdfs.py