Issue Details (XML | Word | Printable)

Key: HDFS-417
Type: Bug Bug
Status: Patch Available Patch Available
Priority: Minor Minor
Assignee: Todd Lipcon
Reporter: Carlos Valiente
Votes: 1
Watchers: 9
Operations

If you were logged in you would be able to see more operations.
Hadoop HDFS

Improvements to Hadoop Thrift bindings

Created: 21/Nov/08 09:35 PM   Updated: 01/Jul/09 05:27 PM
Return to search
Component/s: contrib/thriftfs
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works all.diff 2008-11-21 09:39 PM Carlos Valiente 1.09 MB
Java Source File Licensed for inclusion in ASF works BlockManager.java 2008-11-25 07:43 PM Carlos Valiente 0.8 kB
File Licensed for inclusion in ASF works build_xml.diff 2008-11-21 09:36 PM Carlos Valiente 0.7 kB
Java Source File Licensed for inclusion in ASF works DefaultBlockManager.java 2008-11-25 07:43 PM Carlos Valiente 2 kB
Java Source File Licensed for inclusion in ASF works DFSBlockManager.java 2008-11-25 07:43 PM Carlos Valiente 3 kB
File Licensed for inclusion in ASF works gen.diff 2008-11-21 09:36 PM Carlos Valiente 1.06 MB
GZip Archive Licensed for inclusion in ASF works hadoop-4707-31c331.patch.gz 2009-05-06 09:23 PM Todd Lipcon 257 kB
Text File Licensed for inclusion in ASF works HADOOP-4707-55c046a.txt 2009-04-17 09:41 PM Todd Lipcon 2.36 MB
Text File Licensed for inclusion in ASF works hadoop-4707-6bc958.txt 2009-04-29 12:56 AM Todd Lipcon 2.33 MB
GZip Archive Licensed for inclusion in ASF works hadoop-4707-867f26.txt.gz 2009-05-26 12:46 AM Todd Lipcon 155 kB
File Licensed for inclusion in ASF works HADOOP-4707.diff 2009-02-10 11:04 PM Carlos Valiente 3.02 MB
Text File Licensed for inclusion in ASF works HADOOP-4707.patch 2009-04-13 10:39 AM Carlos Valiente 2.30 MB
Text File Licensed for inclusion in ASF works HADOOP-4707.patch 2009-04-06 12:48 PM Carlos Valiente 2.30 MB
File Licensed for inclusion in ASF works hadoopfs_thrift.diff 2008-11-21 09:35 PM Carlos Valiente 5 kB
Java Archive File Licensed for inclusion in ASF works hadoopthriftapi.jar 2008-11-21 09:38 PM Carlos Valiente 124 kB
Java Source File Licensed for inclusion in ASF works HadoopThriftServer.java 2008-11-25 07:43 PM Carlos Valiente 20 kB
File HadoopThriftServer_java.diff 2008-11-21 09:37 PM Carlos Valiente 17 kB
File hdfs.py 2008-11-24 09:57 PM Venky Iyer 17 kB
File hdfs_py_venky.diff 2008-11-24 11:35 PM Carlos Valiente 1 kB
Java Archive File Licensed for inclusion in ASF works libthrift.jar 2009-05-14 07:17 PM Todd Lipcon 91 kB
Java Archive File Licensed for inclusion in ASF works libthrift.jar 2009-04-06 12:48 PM Carlos Valiente 168 kB
Java Archive File Licensed for inclusion in ASF works libthrift.jar 2009-02-10 11:04 PM Carlos Valiente 91 kB
Java Archive File Licensed for inclusion in ASF works libthrift.jar 2008-11-21 09:37 PM Carlos Valiente 72 kB
Environment: Tested under Linux x86-64
Issue Links:
Blocker
 
Reference
 


 Description  « Hide
I have made the following changes to hadoopfs.thrift:
  1. Added namespaces for Python, Perl and C++.
  1. Renamed parameters and struct members to camelCase versions to keep them consistent (in particular FileStatus{blockReplication,blockSize} vs FileStatus.{block_replication,blocksize}).
  1. Renamed ThriftHadoopFileSystem to FileSystem. From the perspective of a Perl/Python/C++ user, 1) it is already clear that we're using Thrift, and 2) the fact that we're dealing with Hadoop is already explicit in the namespace. The usage of generated code is more compact and (in my opinion) clearer:

    Perl:
    use HadoopFS;

    my $client = HadoopFS::FileSystemClient->new(..);

    instead of:

    my $client = HadoopFS::ThriftHadoopFileSystemClient->new(..);

    Python:

    from hadoopfs import FileSystem

    client = FileSystem.Client(..)

    instead of

    from hadoopfs import ThriftHadoopFileSystem

    client = ThriftHadoopFileSystem.Client(..)

    (See also the attached diff [^scripts_hdfs_py.diff] for the
    new version of 'scripts/hdfs.py').

    C++:

    hadoopfs::FileSystemClient client(..);

    instead of:

    hadoopfs::ThriftHadoopFileSystemClient client(..);

  1. Renamed ThriftHandle to FileHandle: As in 3, it is clear that we're dealing with a Thrift object, and its purpose (to act as a handle for file operations) is clearer.
  1. Renamed ThriftIOException to IOException, to keep it simpler, and consistent with MalformedInputException.
  1. Added explicit version tags to fields of ThriftHandle/FileHandle, Pathname, MalformedInputException and ThriftIOException/IOException, to improve compatibility of existing clients with future versions of the interface which might add new fields to those objects (like stack traces for the exception types, for instance).

Those changes are reflected in the attachment hadoopfs_thrift.diff.

Changes in generated Java, Python, Perl and C++ code are also attached in gen.diff. They were generated by a Thrift checkout from trunk
(http://svn.apache.org/repos/asf/incubator/thrift/trunk/) as of revision
719697, plus the following Perl-related patches:

The Thrift jar file libthrift.jar built from that Thrift checkout is also attached, since it's needed to run the Java Thrift server.

I have also added a new target to src/contrib/thriftfs/build.xml to build the Java bindings needed for org.apache.hadoop.thriftfs.HadoopThriftServer.java (see attachment build_xml.diff and modified HadoopThriftServer.java to make use of the new bindings (see attachment HadoopThriftServer_java.diff).

The jar file [^lib/hadoopthriftapi.jar] is also included, although it can be regenerated from the stuff under 'gen-java' and the new 'compile-gen' Ant target.

The whole changeset is also included as all.diff.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Venky Iyer added a comment - 24/Nov/08 09:53 PM - edited
These look good to me. I've built a python interface to this, to provide a more pythonic file-object like interface as well as a layer that mimics python's os module. These will need some minor fixes to adjust to the patches here. You're welcome to add these if you feel like it, or I can do so once your patch is committed. See attached:

See https://issues.apache.org/jira/secure/attachment/12394600/hdfs.py


Venky Iyer added a comment - 24/Nov/08 09:57 PM
Note that this contains some installation-specific stuff to start up a Thrift hdfs server that I will have to abstract away before public release.

Carlos Valiente added a comment - 24/Nov/08 11:35 PM
Thanks, Venky. Instead of parsing the log file to get the port, you can pass it as the first argument to HadoopThriftServer (see hdfs_py_venky.diff. A value of '0' will make HadoopThriftServer pick the first one available, which is probably what you want in your hadoopThriftServer Python class.

I don't know how to abstract away the basedir stuff, though - keep me posted, since I'm interested in doing something similar.


Carlos Valiente added a comment - 25/Nov/08 07:42 PM
I have started to implement reading on a per-block basis:
string  readBlock(1:FileHandle handle, 2:BlocLocation block)
              throws (1: IOException ouch)

I have first added two more fields to BlockLocation:

struct BlockLocation {
      1: list<string> hosts,     
      2: list<string> names,     
      3: i64 offset,             
      4: i64 length,             
      5: i64 id,           /* bock id */
      6: i64 genStamp      /* block generation timestamp */
    }

I then provide two implementations of getFileBlockLocations(): One for
HDFS filesystems (which may fill in the values for the extra fields),
and another one for generic filesystems (like local ones), which just
return zeroes.

Does it sound like a reasonable plan?


dhruba borthakur added a comment - 30/Dec/08 08:03 PM
Thanks Carlos. Your proposed change to BlockLocation sounds good.

>Instead of parsing the log file to get the port, you can pass it as the first argument to HadoopThriftServer (see hdfs_py_venky.diff. A value of '0' will make HadoopThriftServer pick the first one available, which is probably what you want in your hadoopThriftServer Python class.

In the usual case, we want the HadoopThriftServer to bind to any available port. This allows running multiple HadoopThriftServer on the same machine. Also, the python client needs to know the port that its associated instance of the HadoopThriftServer is using. That's the reason why the python client parses the log file to find the port number. Are you saying that there is another way to achieve the same functionality?


Carlos Valiente added a comment - 10/Feb/09 11:04 PM
Patch HADOOP-4707.diff removes the need for an extra Thrift server
process by creating Thrift server instances on the namenode and
datanode processes. It also allows direct reads and writes to datanode
instances.

The Thrift namenode server instance implements the Thrift service
'Namenode' defined in src/thrift/hdfs.thrift. It is created by calling
org.apache.hadoop.hdfs.server.namenode.NameNode.startThriftServer()
in org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(),
and acts as a facade to org.apache.hadoop.hdfs.protocol.ClientProtocol.

The Thrift datanode server instance implements the Thrift service
'Datanode', defined as well in src/thrift/hdfs.thrift. It is created
in org.apache.hadoop.hdfs.datanode.DataNode.startDatanode().

In order to read data from a file, Thrift clients request a list
of blocks with Namenode.getBlocks(path, offset, length), and then
call Datanode.readBlock() on the appropriate datanode servers
for each block in the returned list. The Thrift datanode server
instance then opens a local socket to the datanode server via
org.apache.hadoop.hdfs.DFSClient.BlockReader.newBlockReader().

Data writes are implemented in a similar way: Thrift
clients call Namenode.addBlock(path) on the Thrift
namenode server, and then call DataNode.writeBlock() on
as many Thrift Datanode servers as they wish. The Thrift
datanode servers write the block to their local storage using
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(),
and then inform the namenode about the new block using
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceived(),
so that full block replication is eventually achieved when the next
block report is processed by the namenode.

Namenode.addBlock() must be called on files opened by Namenode.create()
or Namenode.append(). This last call might return a Block object
representing the last partial block of an existing file, which is
currently ignored because I don't know how to handle a write to
that block (FSDataset.writeToBlock() complains about that block
being valid).

Since Datanode.readBlock() and Datanode.writeBlock() expect data blobs
of Thrift type 'binary', and that type translates to Java's byte[],
data reads and writes are limited to 2**31 -1 bytes (whereas Hadoop
blocks may be much larger, since their length is measured in longs).

The following entries in hdfs-default.xml define the locations for
the Thrift servers:

  • dfs.thrift.address, set by default to '0.0.0.0:9090'
  • dfs.thrift.datanode.address, set by default to '0.0.0.0:0'.

The following entries limit the number and lifetime of Thrift server
threads:

  • dfs.thrift.threads.min, set by default to 5
  • dfs.thrift.threads.max, set by default to 20,
  • dfs.thrift.timeout, set by default to 60 seconds.

Thrift namenode and datanode servers try to obtain the identity of
the client by calling org.apache.hadoop.net.NetUtils.getRemoteUser(),
which implements the IDENT protocol defined by RFC 1413. If that call
fails, the value returned by security.UnixUserGroupInformation.login()
is used instead.

Perl and Python higher-level APIs for read and write functionality are
also included under src/thrift, together with test suites. Perhaps
it makes more sense to release that code independently from Hadoop,
so that they might be updated more frequently. The Perl and Python
test suites may be run with the following ant target:

$ ant -Dcompile.thrift=true test-thrift

You may also need to pass the location of Thrift's Python library
modules, since Thrift does not install them under Python's lib
directory by default. For instance, if your Python bindings for
Thrift are installed under /opt/thrift/lib/python2.x/site-packages,
then you should do the following:

$ ant -Dcompile.thrift=true \
-Dthrift.pythonpath=/opt/thrift/lib/python2.x/site-packages \
test-thrift

Perl and Python tests for chmod() functionality show that write
operations succeed for read-only files. I guess that's because all
tests are run as HDFS superuser.

Perl and Python high-level APIs call Namenode.addBlock() on each
invocation of Datanode.writeBlock(), so it is possible to create
files with many blocks of smaller size.

I have used Thrift version 20080411-r743112 from Thrift's
Subversion repository, in order to include the resolution of
THRIFT-193 (http://issues.apache.org/jira/browse/THRIFT-193),
which fixes Perl package generation, and THRIFT-249
(https://issues.apache.org/jira/browse/THRIFT-249), which emits Javadoc
comments for Java classes. Also attached is the libthrift.jar from that
build, which is needed at run-time by the Thrift server instances. The
patch also includes Thrift stubs for all languages supported by Thrift,
very much like the stuff under src/contrib/thiftfs.


dhruba borthakur added a comment - 11/Feb/09 08:14 PM
Hi Carlos, This looks awesome! Lots of cool stuff. I have not looked at the details, but here are some of my initial comments:

1. The idea of making the Namenode expose a Thrift interface is a great idea. This removes the need for a separate daemon to map Thrift methods to ClientProcotol methods.

2. Is there a way to "layer" the Thrift layer (DatanodeThriftServer and NameNodeThrift server) around the org.apache.hadoop.hdfs package (instead of making them part of org.apache.hadoop.hdfs). Can these be part of org.apache.hadoop.thriftfs and reside in the src/contrib/thriftfs directory? This essentially means that the Thrift server is a "layer" then encapsulates the hadoop Namenode/DataNode. This could be in its own library called libhdfsthrift.jar (or something like that)

3. There could be new source files in the base hdfs package that allows plugging in of multiple protocol stacks (the first one being Thrift). This code could use reflection (and configuration settings) to figure out if the libhdfsthrift.jar is in the classpath, and if so, then use those methods from that jar to initialize the Thrift server. The reason I propose the above is because it does not force every Hadoop install to use Thrift. It keeps the base Namenode/Datanode code clean and elegent. It also allows plugging in other protocol stacks to expose the Namenode/datanode functionality.

4. It is possible that the Thrift implementation might need to use some (currently) package-private methods in the Datanode/Namenode, but we can work on making them public if need be.

5. Allowing the Thrift interface to read file contents is easy. However, writing to blocks is more difficult, especially because the DFSClient.java is a heavy-weight piece of code and participates heavily in ensuring correct recovery from write-pipeline failures, allows "appending" to existing files, ensuring all blocks of a file are equal size, etc.etc. Do you have an application that will need to write to HDFS files using this Thrift interface?

6. Your unit tests are nice. It is imperative for us to detect incompatible changes to base HDFS APIs earlier rather than later.


Carlos Valiente added a comment - 11/Feb/09 09:11 PM
Thanks for your feedback, Dhruba!

2. Is there a way to "layer" the Thrift layer (DatanodeThriftServer and NameNodeThrift server) around the org.apache.hadoop.hdfs package (instead of making them part of org.apache.hadoop.hdfs). Can these be part of org.apache.hadoop.thriftfs and reside in the src/contrib/thriftfs directory?

Yes, no problem with that. I chose to work in src/hdfs mainly to avoid conflicts with your code under src/contrib/thriftfs.

3. There could be new source files in the base hdfs package that allows plugging in of multiple protocol stacks (the first one being Thrift). This code could use reflection (and configuration settings) to figure out if the libhdfsthrift.jar is in the classpath, and if so, then use those methods from that jar to initialize the Thrift server. The reason I propose the above is because it does not force every Hadoop install to use Thrift. It keeps the base Namenode/Datanode code clean and elegent. It also allows plugging in other protocol stacks to expose the Namenode/datanode functionality.

Yes, that sounds good. Would it make sense to open a new JIRA issue to implement the plugin functionality? Otherwise the attachment list to this issue might end up quite crowded.

4. It is possible that the Thrift implementation might need to use some (currently) package-private methods in the Datanode/Namenode, but we can work on making them public if need be.

OK, so no real reason for hosting code under org.hadoop.hdfs.server.{datanode,namenode}, then.

5. Allowing the Thrift interface to read file contents is easy. However, writing to blocks is more difficult, especially because the DFSClient.java is a heavy-weight piece of code and participates heavily in ensuring correct recovery from write-pipeline failures, allows "appending" to existing files, ensuring all blocks of a file are equal size, etc.etc. Do you have an application that will need to write to HDFS files using this Thrift interface?

Reading is by far more important to me than writing, and I could just do the writes using Java. If the current implementations of append(), create(), addBlock() and writeBlock() were show-stoppers for getting the stuff committed sooner, I can focus first on addressing your points, and then figure out later what to do with them. What would you recommend?

6. Your unit tests are nice. It is imperative for us to detect incompatible changes to base HDFS APIs earlier rather than later.

Being able to test the Thrift code (and nothing else) by doing a:

$ ant -Dtestcase=TestThriftServer test-core

was very helpful during development. Will it be possible to do a similar thing when the Thrift code moves back to src/contrib/thriftfs?


dhruba borthakur added a comment - 11/Feb/09 09:49 PM
> Would it make sense to open a new JIRA issue to implement the plugin functionality?

Yes, please open a new JIRA to implement "export namenode/datanode functionality through a pluggable RPC layer". The component for this jira will be "dfs". You can submit the code for this new JIRA, get it vetted by the HDFS community and then commit to svn. After that is done, then we can work on getting this one (HADOOP-4707) commiited. You might need a separate unit test associated with this new JIRA.

> Reading is by far more important to me than writing, and I could just do the writes using Java.
I like this staged approach. I know that many would like to write to HDFS using Thrift, but with an eye to gettting this committed sooner, I would rather that we take it in steps.

> was very helpful during development. Will it be possible to do a similar thing when the Thrift code moves back

Yes, If you move the TestThriftServer code to src/contrib/thriftfs/test/org/apache/hadoop/thriftfs, that should be all that is required.


Carlos Valiente added a comment - 13/Feb/09 06:44 PM

Yes, please open a new JIRA to implement "export namenode/datanode functionality through a pluggable RPC layer".

I just filed HADOOP-5257, Dhruba.


Michael Greene added a comment - 03/Apr/09 03:35 PM
Now that HADOOP-5257 is in trunk, what does that mean for this issue?

Carlos Valiente added a comment - 03/Apr/09 03:54 PM

Now that HADOOP-5257 is in trunk, what does that mean for this issue?

I'm going to reshape HADOOP-4707.diff into a plugin during the weekend.


Carlos Valiente added a comment - 06/Apr/09 12:48 PM
Patch HADOOP-4707.patch provides Thrift interfaces to HDFS namenodes
and datanodes using HDFS service plugins. Both
plugins implement the Thrift services
defined in src/contrib/thriftfs/if/hdfs.thrift.

In order to read data from a file, Thrift clients request a list
of blocks with Namenode.getBlocks(path, offset, length), and then
call Datanode.readBlock() on the appropriate datanode servers
for each block in the returned list. The Thrift datanode server
instance then opens a local socket to the datanode server via
org.apache.hadoop.hdfs.DFSClient.BlockReader.newBlockReader().

Both plugins add thriftfs-default.xml and
thriftfs-site.xml as configuration resources. The following
properties define the addresses for
the Thrift servers:

  • dfs.thrift.address, set by default to '0.0.0.0:9090'
  • dfs.thrift.datanode.address, set by default to '0.0.0.0:0'.

The following properties limit the number and lifetime of Thrift server
threads:

  • dfs.thrift.threads.min, set by default to 5
  • dfs.thrift.threads.max, set by default to 20,
  • dfs.thrift.timeout, set by default to 60 seconds.

Since Datanode.readBlock() expects data blobs
of Thrift type 'binary', and that type translates to Java's byte[],
data reads are limited to 2**31 -1 bytes (whereas Hadoop
blocks may be much larger, since their length is measured in longs).

On Dhruba's suggestion, I've removed all write-related methods for now.

Thrift namenode and datanode servers try to obtain the identity of
the client by calling
org.apache.hadoop.thriftfs.PluginBase.getRemoteUser(),
which implements the IDENT protocol defined by RFC 1413. If that call
fails, the value returned by
security.UnixUserGroupInformation
is used instead.

I've removed the Perl and Python high-level APIs from this patch in
order to make it simpler. Those APIs are available at
http://code.pepelabs.net/git/?p=hadoop-thrift.git. Perhaps it's better
to keep them separate from Hadoop's code base?

I've updated Thrift's libthrift.jar to a recent Subversion checkout.
It seems that a Thrift release is imminent, so the final JAR (and the
Java code it generates) should not be too different from what's included
here.


Todd Lipcon added a comment - 07/Apr/09 06:12 PM
A few comments on this patch:

1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode. Calling getDatanodeReport then triggers an NPE in ThriftUtils.toThrift

It seems to me that we need to add some hooks to DataNode and/or NameNode that allow the plugins to register callbacks on certain events. Specifically for this case, the DataNode needs to re-register its thrift port with the NameNode when it reconnects.

Specifically, I think the DataNode needs a hook in DataNode.register() that calls through to plugins. Doing this in a generalized way on the HADOOP-5257 plugin interface might be nice - some kind of "hook point" pubsub kind of interface. Opinions solicited

2) I think the datanode hostnames need to be canonicalized somehow when inserting into the thriftPorts map. On a pseudodistributed cluster, I'm seeing getDatanodeReport fail to find the thriftPort since the DN is registering under the name 127.0.1.1, but then being looked up as 127.0.0.1 for whatever reason. I'll look into a solution for this.

3) Lastly, I think the "Unknown Thrift port for Datanode" NPE is unnecessarily strict. I'd prefer for it to return a -1 or a 0 to indicate that the DN thrift server isn't running. This would require some extra checks elsewhere in the code before trying to contact a non-existent thrift server, but it enables getDatanodeReport to work even without the thrift plugin on the DNs.


dhruba borthakur added a comment - 07/Apr/09 06:58 PM
> 1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode.

Is there a way for the Datanode plugin to try re-register its port with the NameNode-plugin when it encounters an error?


Todd Lipcon added a comment - 07/Apr/09 07:08 PM
>> 1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode.

> Is there a way for the Datanode plugin to try re-register its port with the NameNode-plugin when it encounters an error?

The problem is that the error is encountered on the NameNode. The flow is that the client is asking the NameNode for info, and the NameNode needs to send the client the host:port for the relevant DN. There's no way for the NN plugin to contact the DN plugin to tell it to re-register.


Carlos Valiente added a comment - 07/Apr/09 07:54 PM

1) If the NameNode restarts but the DataNodes stay up, the DataNodes don't re-register their Thrift ports with the NameNode. Calling getDatanodeReport then triggers an NPE in ThriftUtils.toThrift
[..]
Specifically, I think the DataNode needs a hook in DataNode.register() that calls through to plugins. Doing this in a generalized way on the HADOOP-5257 plugin interface might be nice - some kind of "hook point" pubsub kind of interface.

Yep. A simpler option might be to spawn a thread on the datanode which calls Namenode.datanodeUp() every so often. Any preference?

2) I think the datanode hostnames need to be canonicalized somehow when inserting into the thriftPorts map. On a pseudodistributed cluster, I'm seeing getDatanodeReport fail to find the thriftPort since the DN is registering under the name 127.0.1.1, but then being looked up as 127.0.0.1 for whatever reason. I'll look into a solution for this.

I came across that same issue when writing the test suite. The "canonical" name, as far as the thriftPorts map is concerned, is org.apache.hadoop.thriftfs.DatanodePlugin.datanode.DataNode.dnRegistration.getName(). On my real 6-node test cluster, that value is the same as the value of org.apache.hadoop.hdfs.protocol.DatanodeInfo.name for every DatanodeInfo instance, so everything works. On a MiniDFSCluster cluster, however, it is not — just as you found out in your case, Todd (classloader issues, perharps?).

My workaround for the test suite was to set the property slave.host.name to the expected value.

3) Lastly, I think the "Unknown Thrift port for Datanode" NPE is unnecessarily strict. I'd prefer for it to return a -1 or a 0 to indicate that the DN thrift server isn't running. This would require some extra checks elsewhere in the code before trying to contact a non-existent thrift server, but it enables getDatanodeReport to work even without the thrift plugin on the DNs.

Yep, makes sense. I propose setting the port to -1, and also doing something like this:

public static Block toThrift(LocatedBlock block, String path,
      Map<String, Integer> thriftPorts) {
    if (block == null) {
      return new Block();
    }

    List<DatanodeInfo> nodes = new ArrayList<DatanodeInfo>();
    for (org.apache.hadoop.hdfs.protocol.DatanodeInfo n:
        block.getLocations()) {

        DatanodeInfo node = toThrift(n, thriftPorts);
        if (node.thriftPort != -1) {
          nodes.add(toThrift(n, thriftPorts));
        }
    }  
    // [...]
  }

That way we return to the client the (possibly empty) list of all block locations accessible by the Thrift interface.


Todd Lipcon added a comment - 08/Apr/09 12:41 AM

Yep. A simpler option might be to spawn a thread on the datanode which calls Namenode.datanodeUp() every so often. Any preference?

That is definitely simpler but seems ugly to me. I know heartbeats are the norm elsewhere in Hadoop, but I'd prefer not to introduce more.

On my real 6-node test cluster, that value is the same as the value of org.apache.hadoop.hdfs.protocol.DatanodeInfo.name for every DatanodeInfo instance, so everything works. On a MiniDFSCluster cluster, however, it is not — just as you found out in your case, Todd (classloader issues, perharps?).

I think I've figured out the issue here:

In ThriftUtils.createNamenodeClient, it uses dfs.thrift.address to connect to the namenode. This same configuration variable is used in NamenodePlugin.getAddress(). So, with the default configuration of 0.0.0.0:9090, the NN plugin binds to all local interfaces, and the datanodes attempt to connect to whatever IP is first in the local wildcard.

It seems to me that the correct behaviour would be:

  • The NamenodePlugin continues to listen on dfs.thrift.address
  • Having a :0 port on dfs.thrift.address seems inadvisable since there's currently no way for the DatanodePlugin to locate the thrift server in that case.
  • The DatanodePlugin looks at dfs.thrift.address. If it is a "wildcard" address (0.0.0.0) it uses only the port portion, and locates the NN host using datanode.getNameNodeAddr()
  • Additionally, it we should inspect the TTransport from the client for the hostname in datanodeUp/datanodeDown rather than taking those as parameters. The hostname registered in the DatanodeRegistration comes from the remote side of the Hadoop RPC socket, so it makes sense to use the same method for getting the hostname on the Thrift side.

I'll work on hacking these up and see where I get.


Todd Lipcon added a comment - 08/Apr/09 07:23 AM
Did some more hacking around in the code and came to some conclusions:

In order to guarantee that ThriftUtils.toThrift works on a DatanodeInfo properly, we need to ensure that the host name sent to NamenodePlugin.datanodeUp() is the same name as is in the DatanodeInfo. These DatanodeInfo names come from the DatanodeRegistration that gets sent from DN->NN at DN startup. Unfortunately, with the current setup this is broken for the following reason:

  • in FSNameSystem.register, the "name" field passed in by the DN is ignored in favor of taking the remote host out of the RPC stack. This host is then written back into the DatanodeRegistration which is then returned to the DN.
  • This means that, when the DN registers itself with the NameNode, its dnRegistration variable is mutated (specifically with regard to the name field)
  • In the current Plugin architecture, the DN thrift plugin has already started before the DN calls 'register". This means that when the DatanodePlugin calls the NamenodePlugin.datanodeUp function, it's passing a different "name" String than will eventually end up registered on the NN.

My proposed solution is:

1) Add a hook to the Datanode Plugin interface to allow the DN plugin to defer the datanodeUp call until after the DN has registered with the NN. This puts the dnRegistration member in a stable state that matches the DatanodeInfo stored on the NN side. The necessary plugin infrastructure for this is in HADOOP-5640, newly created.

2) This step isn't strictly necessary, but I'd like to modify the datanodeUp and datanodeDown functions to take three parameters (name, storage id, and thrift port) rather than the current two (name and thrift port). This allows those functions to construct DatanodeID objects and then maintain a Map<DatanodeID, Integer> thriftPorts array instead. The advantages here are:

  • Consistency with data structures elsewhere in the code
  • Clarity of what the map keys actually are
  • Some small level of "security", in the sense that the Storage ID makes these calls a little harder to spoof.

Alternatively, rather than taking three arguments, we could introduce a DatanodeID thrift struct with name and storageID members.


Carlos Valiente added a comment - 08/Apr/09 07:51 AM
Thanks for your comments, Todd!

1) Add a hook to the Datanode Plugin interface to allow the DN plugin to defer the datanodeUp call until after the DN has registered with the NN. This puts the dnRegistration member in a stable state that matches the DatanodeInfo stored on the NN side. The necessary plugin infrastructure for this is in HADOOP-5640, newly created.

Sounds good to me.

2) This step isn't strictly necessary, but I'd like to modify the datanodeUp and datanodeDown functions to take three parameters (name, storage id, and thrift port) rather than the current two (name and thrift port).

It wouldn't hurt to pass the storageID as well.

Alternatively, rather than taking three arguments, we could introduce a DatanodeID thrift struct with name and storageID members.

I'd rather stick to datanodeUp(name, storageID, thiftPort)} — one class less to generate, and smaller Thrift IDL file.


Carlos Valiente added a comment - 08/Apr/09 08:02 AM

Yep. A simpler option might be to spawn a thread on the datanode which calls Namenode.datanodeUp() every so often. Any preference?

That is definitely simpler but seems ugly to me. I know heartbeats are the norm elsewhere in Hadoop, but I'd prefer not to introduce more.

I agree that HADOOP-5640 is a cleaner path.


Carlos Valiente added a comment - 13/Apr/09 10:39 AM
Minor corrections:
  • Included Todd's suggestion of not throwing a NPE when translating HDFS DatanodeInfo instances to their Thrift equivalents.
  • Ruby module names must be capitalized.
  • Since JUnit 4 is now fetched by Ivy, moved MiniDFSCluster initialization to @BeforeClass method. Execution of TestNamenode goes down from 120 to 10. seconds

Todd Lipcon added a comment - 17/Apr/09 09:41 PM
Attached is a patch with some further improvements:
  • datenodeUp/datanodeDown now take a storage ID, and the thriftPorts maps from DatanodeID instances to ints
  • Uses thrift enums now for DatanodeReportType and DatanodeState
  • Reran thrift with newest trunk thrift for all languages (looked like some languages hadn't been re-run recently)
  • Depends on HADOOP-5640 for:
    • Restarting the namenode while DN is still up now properly triggers datanodeUp messages from DNs after they've re-registered at the NN
  • Few misc fixes:
    • DatanodePlugin 'register' member is now volatile since it's accessed by multiple threads without synchronization
    • typo: THRFIT -> THRIFT
    • PluginBase has been refactored to a containment relationship instead of a subclassing relationship, so hence renamed to ThriftPluginServer
    • When connecting to the NN thrift service, if it's configured as 0.0.0.0, use NameNode.getAddress to figure out what the external IP it's actually listening on is.

Carlos Valiente added a comment - 28/Apr/09 09:38 PM
Thanks for HADOOP-4707-55c046a.txt, Todd — much better than my previous patch.

I just noticed a small improvement to DatanodePlugin.ThriftHandler.readBlock: If we implement it like this, we might save quite a few memory copies:

public BlockData readBlock(Block block, long offset, int length)
        throws IOException, TException {
      LOG.debug("readBlock(" + block.blockId + "," + offset + "," + length

        // [..]

        // Allocate read buffer on ret directly, so that no extra memory copy is done
        //  if we read all bytes
        ret.data = new byte[length];
        int n = reader.read(ret.data, 0, length);
        if (n == -1) {
          throw new EOFException("EOF reading " + length + " bytes at offset "
              + offset + " from " + block);
        }
        LOG.debug("readBlock(" + block.blockId + ", " + offset + ", " + length
            + "): Read " + n + " bytes");
        if (n < length) {
          byte[] buf = new byte[n];
          System.arraycopy(ret.data, 0, buf, 0, n);
          ret.data = buf;
        }
        ret.length = n;

        // [..]

      return ret;
    }

Todd Lipcon added a comment - 29/Apr/09 12:54 AM
Nice fix. Here's a diff against my previous patch, and I'll upload a new patch against trunk as well with a couple more fixes.
diff --git a/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java b/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/D
index 6929a7b..14b2e32 100644
--- a/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java
+++ b/src/contrib/thriftfs/src/java/org/apache/hadoop/thriftfs/DatanodePlugin.java
@@ -100,8 +100,19 @@ public class DatanodePlugin
         }
         LOG.debug("readBlock(" + block.blockId + ", " + offset + ", " + length
             + "): Read " + n + " bytes");
-        ret.data = new byte[n];
-        System.arraycopy(buf, 0, ret.data, 0, n);
+
+        if (n == length) {
+            // If we read exactly the same number of bytes that was asked for,
+            // we can simply return the buffer directly
+            ret.data = buf;
+        } else {
+            assert n < length;
+            // If we read fewer bytes than they asked for, we need to write
+            // back a smaller byte array. With the appropriate thrift hook
+            // we could avoid this copy, too.
+            ret.data = new byte[n];
+            System.arraycopy(buf, 0, ret.data, 0, n);
+        }
         ret.length = n;
 
         summer.update(ret.data);

Todd Lipcon added a comment - 29/Apr/09 12:56 AM
This fixes the test cases, and makes one slightly incompatible change. DatanodeInfo.name was previously just the hostname, but is now the hostname:port (datanodeInfo.getName()) as described in the thrift interface docs. This was required to get tests to pass for me.

Todd Lipcon added a comment - 29/Apr/09 08:38 PM
Some more notes after using this API for a bit:
  • The Block struct was missing the startOffset from LocatedBlock. This made it somewhat useful for actually reading out of the middle of files. I'll submit a patch for this once I aggregate a few more changes.
  • I think we should think a little bit about authentication. Currently the interface uses identd to determine the unix username of the connecting socket, but this has several problems:
    • identd doesn't run on a lot of systems
    • This doesn't really provide any more security than the default hadoop model of "trust that you are who you say you are"
    • This is problematic if you need to be able to spoof other user roles. For the example of a web UI, the authentication would happen on that layer, and even though the thrift connections would be coming from "www-data" the actual access should assume the role of the authenticated end user for all RPCs

Regarding the authentication issue, I would propose a somewhat major change. We should add a struct called CallContext that looks something like this:

struct CallContext {
  1:string user
  2:list<string> groups
}

We should then add this as a parameter to every RPC, at least in the NameNode service. Since Thrift uses id numbers for parameters, this will safely end up as null for any callers that are using the old interface. We should then add something to ThriftUtils along the lines of:

CallContext completeCallContext(CallContext ctx) {
  if (null == ctx) {
    ctx = new CallContext();
    ... // get call context using identd as it does currently;
  }
  return ctx
}

With this code, any existing clients preserve their identd behavior with no issues.

If people are concerned with the lack of security here, I would make the following points:

  • This is the status quo for security in Hadoop - none of the IPC protocols have any kind of strong authentication
  • The current Thrift interfaces don't check UGI at the datanode level anyway
  • identd is easily hackable in the first place
  • using a generic CallContext struct for all calls would easily allow us to extend the struct later with some kind of authorization token to provide strong authentication when such a system is devised elsewhere in Hadoop

Carlos Valiente added a comment - 30/Apr/09 01:10 PM
  • The Block struct was missing the startOffset from LocatedBlock. This made it somewhat useful for actually reading out of the middle of files. I'll submit a patch for this once I aggregate a few more changes.

Yep, sounds very good.

Regarding the authentication issue, I would propose a somewhat major change. We should add a struct called CallContext that looks something like this:

struct CallContext

Unknown macro: { 1}

We should then add this as a parameter to every RPC, at least in the NameNode service.

I would prefer not to follow that route and leave the method signatures untouched, for the following reasons:

  1. The identd approach is far from perfect, definitely, but it is more than enough for my particular requirements.
  2. The Thrift IDL file, which I see as the main documentation source for the exposed service, becomes cluttered.
  3. Although old clients would still be able to call the Thrift services, C++ clients generated from the new IDL source would get the extra argument in the signature, and the user code would be less readable.

If peope are really concerned about authentication, perhaps it would be better to add some kind of transport-level authentication support for Thrift itself (like TLS- or SSL-enabled Thrift transports, for instance). Then on the server side we could either start a org.apache.thrift.transport.TServerSocket or a org.apache.thrift.transport.TSecureServerSocket, depending on some configuration value, and then everyone would be catered for.


Todd Lipcon added a comment - 30/Apr/09 03:20 PM

The identd approach is far from perfect, definitely, but it is more than enough for my particular requirements.

Not to be a pain, but unfortunately it's completely inadequate for my particular requirements I need to be able to assume the identity of other users in the system.

The Thrift IDL file, which I see as the main documentation source for the exposed service, becomes cluttered.

This is somewhat true. We can use the new-ish "thrift -gen html" to generate html documentation, though, which at least is slightly nicer to look at. I don't think adding a single common parameter to every call really clutters things that much, though, since the additional mental burden is "O(1)". Once you understand what the parameter means on one call, you understand it everywhere

Although old clients would still be able to call the Thrift services, C++ clients generated from the new IDL source would get the extra argument in the signature, and the user code would be less readable.

Yes, it's slightly less readable, but on the other hand I assume that anyone using this service will be wrapping the thrift client in some kind of fliesystem facade anyway. The change then is confined to a single module. The fact that existing C++ code will need changes doesn't concern me too much since (a) it's a very clear compilation-time error, not something subtle and hard to find, and (b) this hasn't been committed yet, so now is the one time we should feel especially free to break backwards compat!

Then on the server side we could either start a org.apache.thrift.transport.TServerSocket or a org.apache.thrift.transport.TSecureServerSocket, depending on some configuration value

This doesn't really solve my issue. I don't care about authentication from an access control perspective. I care more about the ability to assume different user roles. One primary use case for this Thrift interface, as I see it, is so that web applications written in non-Java languages can access HDFS more easily. Web applications typically run as some user like www-data or nobody, which is problematic if they really need to be assuming the identity of the end user (tlipcon) who has provided some authentication via a login form, etc.

The other option I considered was to add a su(...) RPC and use a specialized TTransport on the server side to hold the current UGI, but this breaks a lot of layer encapsulation properties. Namely, the TTransport cannot automatically reconnect to the server, since the UGI will be lost after the reconnect.


Carlos Valiente added a comment - 30/Apr/09 04:27 PM

Not to be a pain, but unfortunately it's completely inadequate for my particular requirements I need to be able to assume the identity of other users in the system.

Not a pain at all — just a different set of requirements

The Thrift IDL file, which I see as the main documentation source for the exposed service, becomes cluttered.

[..] Yes, it's slightly less readable, but on the other hand I assume that anyone using this service will be wrapping the thrift client in some kind of fliesystem facade anyway.

Good point: I already have wrapper code for Perl, Python and I'm starting with the C++ stuff. I could handle the user parameter in the method wrappers without exposing it on the method wrappers' signature.

No objections from my side, then.


Carlos Valiente added a comment - 06/May/09 07:13 AM
Assigning this to Todd (with his agreement), since I won't have much time to work on this for the coming weeks.

Todd Lipcon added a comment - 06/May/09 09:21 PM
Made some more iterations on this. As this is growing somewhat large (especially with the diff including all the generated code), I've also put a branch up on github for those wanting to follow along in a more easily parseable fashion:

http://github.com/toddlipcon/hadoop/commits/hadoop-4707

I'll also upload an up to date patch momentarily.

Summary of changes since last upload:

  • Avoid NPE when getBlocks is called with offset past end of file - instead, return no blocks
  • Make stat throw FileNotFoundException for bad paths
  • Add a clazz field to IOException thrift type for the specific subclass.
  • Add thriftfs to contrib test target
  • Refactor thrift contrib into more classes to clean things up a bit
  • HADOOP-4707: Add RequestContext parameter to all RPCs and extract UGI from them
  • HADOOP-4707: Add a basic DFS Health Report call
  • HADOOP-4707: Make services inherit from a base service that provides VersionInfo
  • HADOOP-4707: Expose JVM heap information and a stack trace in base
  • HADOOP-4707: Add access to Metrics in base service

I'd also like to propose moving this code into contrib/thrift rather than contrib/thriftfs. The motivation for this is that I will soon be working on JobTracker and TaskTracker thrift plugins, and there's a lot of common code I want to share. Putting that code in a module called "thriftfs" seems incorrect. Additionally, contrib/thriftfs currently is the home for two entirely separate projects - this, and the earlier "proxy style" thrift wrapper of the DFS client. Since these two projects share no code, it makes no sense for them to be in the same contrib dir.

If no one has any objections, I'll do this move in the next day or two.


Michael Greene added a comment - 06/May/09 10:29 PM
Todd: Any chance you could try out my patch for THRIFT-394 and include the C# bindings in the patch for this?

Todd Lipcon added a comment - 06/May/09 10:33 PM
Michael: Carlos and I decided offline that we probably shouldn't be including the language bindings in the patch at all, but rather expect users to generate them themselves. Can you try using thrift -gen csharp if/hdfs.thrift and see how that goes?

dhruba borthakur added a comment - 07/May/09 12:58 AM
> "proxy style" thrift wrapper of the DFS client.

This code should be made obselete (and deleted) when this patch goes into trunk. It would be nice to keep the name of the file system related classes (from this patch)to be org.apache.hadoop.thriftfs though.


Todd Lipcon added a comment - 07/May/09 03:33 AM

It would be nice to keep the name of the file system related classes (from this patch)to be org.apache.hadoop.thriftfs though.

Sure thing. You're fine with renaming the directory to src/contrib/thrift though?


dhruba borthakur added a comment - 07/May/09 03:40 AM
> You're fine with renaming the directory to src/contrib/thrift though?

Sure.

The only caveat is that the convention is to match the package name with the directory name. So, if you have the package name is org.apache.hadoop.thriftfs, then it might be convenient to keep the directory name same as what we got now.


Todd Lipcon added a comment - 14/May/09 07:12 PM

The only caveat is that the convention is to match the package name with the directory name. So, if you have the package name is org.apache.hadoop.thriftfs, then it might be convenient to keep the directory name same as what we got now.

How about a package structure like:

  • org.apache.hadoop.thrift - base service, utility classes
  • org.apache.hadoop.thrift.hdfs - NameNodePlugin, DataNodePlugin, common DFS-related "convert to thrift" functions
  • org.apache.hadoop.thrift.mapred - for JobTracker/TaskTracker thrift interfaces when they come along

I think this makes the most sense to me at this point. Renaming contrib/thriftfs to contrib/thrift is a pain, but I would hate to see a mapred dir inside a "thriftfs" package or contrib project.


Todd Lipcon added a comment - 14/May/09 07:17 PM
I'd like to stop work on this JIRA with the most recent patch (31c331) and perform the above package refactoring in a new JIRA.

For those that would like a more granular review, the commits are available on github:

http://github.com/toddlipcon/hadoop/tree/hadoop-4707

The diff includes all of these commits compared against HADOOP-5640, which is required by this issue. I'll also upload the newest libthrift.jar that I've been testing with (slightly newer than Carlos's). This needs to be put in src/contrib/thrift/lib/. md5sum should be fc6596d8c9b8964ac60cd80c277b9f92


dhruba borthakur added a comment - 14/May/09 07:24 PM
How about a package structure that is like this:

org.apache.hadoop.thriftfs
org.apache.hadoop.thriftmp


Todd Lipcon added a comment - 14/May/09 08:27 PM

How about a package structure that is like this:

org.apache.hadoop.thriftfs
org.apache.hadoop.thriftmp

Where does the common code go? In the most recent patch, the NN and DN services extend a common HadoopService which exposes common elements like JVM stats, VersionInfo, and Metrics. I'd certainly want to share this code between mr and fs. There's also some common code for starting up a Thrift server, handling UserGroupInformation, etc.

I just opened HADOOP-5840 to continue this discussion.


Todd Lipcon added a comment - 26/May/09 12:46 AM
The old patch fell out of date because of the new AccessToken work in HADOOP-4359. The attached patch adds an AccessToken struct to the Thrift side.

I also took the opportunity to remove all of the generated code aside from gen-java as we decided above.

As before, this patch is relative to and blocked by HADOOP-5640 (plugin infrastructure with hooks on important actions)


Todd Lipcon added a comment - 26/May/09 12:50 AM
Marking as patch available, though Hudson is likely to fail building it since it requires the new libthrift.jar, and the patch is gzipped since it was large. Tests pass for me. Since the only modification to non-contrib is the addition of DatanodeID.getLongString(), and this is currently blocking HADOOP-5840 (thrift contrib reorg) and through that, HADOOP-5703 (jobtracker/mapred thrift), I'd really appreciate if someone could take a look at this (and HADOOP-5640 which it depends on) ASAP.

Hadoop QA added a comment - 27/May/09 05:58 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408978/hadoop-4707-867f26.txt.gz
against trunk revision 779106.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/412/console

This message is automatically generated.


dhruba borthakur added a comment - 06/Jun/09 07:18 AM
Thsi patch allows the hdfs-thrift server to run inside the NN. This is a good thing. However, some administrators might like to keep running the hdfs-thrift server separately from the namenode (as it currently stands in trunk)... the reason being that the NN is already a big memory hog and allowing the hdfs-thrift server to share the same address space could introduce new resource bottlenecks in the NN that could impact the whole cluster stability. Do you agree?

Todd Lipcon added a comment - 06/Jun/09 06:28 PM
Hey Dhruba,

It's true that this introduces code that runs in the NN. However, the design of this new code maps one-for-one onto the communication design of HDFS itself - no data goes through the NN thrift server. The additional load on the NN is simply the metadata RPCs that the clients are sending, and of course the sockets serving the Thrift connections. Given that the existing (external) Thrift service just fowards the metadata RPCs to the NameNode, there's no additional load there. The external server probably does demultiplex them down to fewer RPC connections, and Thrift is probably less CPU-efficient than the hand-coded Writable-based RPCs that Hadoop uses internally.

As you mentioned, and correlated with my experience, the NN resource that is in short supply is memory more often than CPU. Given that, I don't think the additional load from the Thrift SerDe is going to end up being significant.

If we expect that there will be a high number of concurrent clients to the NN Thrift service, switching out the TThreadPoolServer for something based on the new-ish TNonBlockingServer would decrease memory consumption by reducing the number of threads. I can look into that if you like.

However, I do agree that some ops people will have philosophical objections to putting more code in the NameNode. This makes sense, and might be a good reason to keep the external Thrift gateway alive. However, I haven't touched its code, and don't want to commit to maintain it implicitly by adding new code next to it

Thanks
-Todd


dhruba borthakur added a comment - 06/Jun/09 11:40 PM
@Todd, thanks for the info.

I am kinda suspicious about the memory requirements for the Thrift server. It is possible that I might like to to run it separately from the NN for starters. For example, our NN is always kinds maxed out on memory and it would be a definite no-no to make the thridt server run inside it. On the other hand, some admins might be ok runnign these two pieces two together.

is it possible to enhance this patch so that the thrift-hdfs server can be run separately from the NN?


dhruba borthakur added a comment - 01/Jul/09 05:13 PM
is it possible to enhance this patch so that the thrift-hdfs server can be run separately from the NN?

Todd Lipcon added a comment - 01/Jul/09 05:27 PM
Woops, sorry, didn't realize I hadn't updated the JIRA a couple weeks ago.

With the current feature set it's possible to do so, since the NN plugin only accesses the NameNode by making calls on it that are available over RPC. Doing so would probably be a day's work, so I'd like to do it as a separate JIRA later on. Additionally, I'm worried that making this architecture change will tie our hands in the future if we want to make any calls from the NN into the Plugin using hooks like in HADOOP-5640.

I think we had talked offline and decided that, as long as the old Thrift server remains in tact, your use case won't be disturbed for now. Is that OK?