I'd like to propose an alternative to 'real' hardlinks: "reference-counted soft links", or all the hardness you really need in a distributed FS.
In this implementation of "hard" links, I would propose that wherever the file is created is considered the "owner" of that file. Initially, when created, the file has a reference count of 1 in the local namespace. If you want another hardlink to the file in the same namespace, you talk to the NN and request another handle to that file, which implicitly increments the reference count on the file. The reference count could be stored in memory (and journaled) or written as part of the file metadata (more on that later, but let's ignore that for the moment).
Suppose instead that you are in a separate namespace and want a hardlink to the file in the original namespace. Then you would make a request to your NN (NNa) for a hardlink. Since NNa doesn't own the file you want to reference, it makes a hardlink request to the NN which originally created the file, the file 'owner' (or NNb). NNb then says "Cool, I've got your request" and increments the ref-count for the file. NNa can then grant your request and give you a link to that file. There are two failure cases here:
1) NNb goes down, in which case you can just keep around the reference requests and batch them when NNb comes back up.
2) NNa goes down mid-request - if NNb doesn't receive an ACK back from NNa for the granted request, it can then disregard that request and decrement the count for that hardlink again.
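The request flow and both failure cases above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class names `OwnerNN` and `ProxyNN`, the `pending` queue, and the `rollback_link` call are hypothetical and not part of any existing NameNode API, and the "network" is just a method call that raises when the owner is marked down.

```python
from dataclasses import dataclass, field


@dataclass
class OwnerNN:
    """Hypothetical owner namenode (NNb) holding ref-counts for its files."""
    ref_counts: dict = field(default_factory=dict)
    up: bool = True

    def create(self, path):
        # The creating namespace holds the first reference.
        self.ref_counts[path] = 1

    def request_link(self, path):
        if not self.up:
            raise ConnectionError("owner NN unreachable")
        self.ref_counts[path] += 1
        return True

    def rollback_link(self, path):
        # Failure case 2: the grant was never ACKed, so undo the increment.
        self.ref_counts[path] -= 1


@dataclass
class ProxyNN:
    """Hypothetical requesting namenode (NNa)."""
    owner: OwnerNN
    links: dict = field(default_factory=dict)    # local link name -> owner path
    pending: list = field(default_factory=list)  # requests queued while owner is down

    def hardlink(self, link_name, path):
        try:
            self.owner.request_link(path)
        except ConnectionError:
            # Failure case 1: keep the request around and batch it later.
            self.pending.append((link_name, path))
            return False
        self.links[link_name] = path
        return True

    def flush_pending(self):
        # Replay queued requests once the owner is reachable again.
        queued, self.pending = self.pending, []
        for link_name, path in queued:
            self.hardlink(link_name, path)
```

In this sketch a link only becomes visible in NNa's namespace after the owner has counted it, so the owner's count can overstate the number of live links (until a rollback) but should never understate it.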
Deleting the hardlink then follows a similar process. You issue a request to the owner NN, either directly from the client if you are deleting a link in the current namespace or through a proxy NN to the original namenode. It then decrements the reference count on the file and allows the deletion of the link. If the reference count ever hits 0, then the NN also deletes the file since there are no valid references to that file.
This implies, though, that the file will not be visible in the namespace that created it once all the local hardlinks to it are removed, even while other namespaces still reference it. It essentially becomes a 'hidden' inode. We could, in the future, also work out a mechanism to transfer the hidden inode to a NN that has valid references to it (maybe via a gossip-style protocol), but that would be out of the current scope.
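To make the delete path and the 'hidden' inode case concrete, here is a minimal sketch of the owner-side bookkeeping. The names (`OwnerNamespace`, `is_hidden`, the `local_names` set) are made up for illustration, and I'm assuming the owner tracks which of the references are names in its own namespace.

```python
class OwnerNamespace:
    """Hypothetical owner-side bookkeeping: one ref-count per inode."""

    def __init__(self):
        self.inodes = {}   # inode id -> {"refs": int, "local_names": set}
        self.next_id = 0

    def create(self, name):
        inode = self.next_id
        self.next_id += 1
        self.inodes[inode] = {"refs": 1, "local_names": {name}}
        return inode

    def link(self, inode, local_name=None):
        # local_name is None when the link lives in another NN's namespace.
        entry = self.inodes[inode]
        entry["refs"] += 1
        if local_name is not None:
            entry["local_names"].add(local_name)

    def unlink(self, inode, local_name=None):
        entry = self.inodes[inode]
        entry["refs"] -= 1
        if local_name is not None:
            entry["local_names"].discard(local_name)
        if entry["refs"] == 0:
            # No valid references remain, so the file itself goes away.
            del self.inodes[inode]

    def is_hidden(self, inode):
        # Still referenced elsewhere, but no name left in the owning namespace.
        return inode in self.inodes and not self.inodes[inode]["local_names"]
```

For example, creating a file, linking it from a remote namespace, and then removing the original name leaves an inode with one reference and no local name, which is the hidden case described above.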
There are some implications for this model. If the owner NN manages the ref-count in memory and that NN goes down, its whole namespace becomes inaccessible, and nobody can create new hardlinks to any of the files (inodes) that it owns. However, the owner NN going down doesn't preclude the other NNs from serving the file from their own 'soft' inodes.
Alternatively, the NN could hold a lock on a hardlinked file, with the ref-counts and ownership info in the file metadata. This might introduce some overhead when creating new hardlinks (you need to reopen and modify the block, or periodically write a new block with the new information - the latter actually opens a route to do ref-count management via appends to a file-ref file), but has the added advantage that if the owner NN crashed, an alternative NN could come and claim ownership of that file. This is similar to doing Paxos-style leader election for a given hardlinked file, combined with leader leases. However, ownership is very unlikely to fluctuate much, as the leader can just reclaim the leader token via appends to the file-owner file, with periodic rewrites to minimize file size.
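A rough sketch of the lease idea, assuming the owner file is an append-only list of (owner, lease expiry) records where the last record wins. `OwnerExtent`, `claim`, and `compact` are hypothetical names, time is passed in explicitly to keep the example deterministic, and this deliberately skips the hard part (making sure two NNs can't append conflicting claims at the same time).

```python
from dataclasses import dataclass


@dataclass
class OwnerRecord:
    owner: str
    lease_expires: float


class OwnerExtent:
    """Hypothetical append-only owner file; the last record is authoritative."""

    def __init__(self, owner, now, lease_secs):
        self.lease_secs = lease_secs
        self.records = [OwnerRecord(owner, now + lease_secs)]

    def current(self):
        return self.records[-1]

    def claim(self, nn, now):
        cur = self.current()
        # The current owner may always renew; anyone else must wait for expiry.
        if cur.owner != nn and now < cur.lease_expires:
            return False
        self.records.append(OwnerRecord(nn, now + self.lease_secs))
        return True

    def compact(self):
        # Periodic rewrite: keep only the authoritative record.
        self.records = [self.current()]
```

With a 10-second lease starting at t=0, a claim by another NN at t=5 is refused, a renewal by the owner at t=5 extends the lease to t=15, and a claim by another NN at t=20 succeeds because the lease has lapsed.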
The on-disk representation of the extreme version I'm proposing is then this: the full file is actually composed of three pieces, the actual data plus two metadata files, or "extents" (to add a new word/definition): (1) the actual data, (2) an external-reference extent: each time a reference is made to the file a new count is appended, and the extent can be periodically compacted to a single value, (3) an owner extent with the current NN owner and the lease time on the file, dictating who controls overall deletion of the file (since ref counts are done via the external-ref file). This means (2) and (3) are hidden inodes, only accessible to the namenode. We can minimize overhead on these file extents by ensuring a single writer via messaging to the owner NN (as specified by the owner file), though this is not strictly necessary.
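The external-reference extent in (2) amounts to an append-only log of count changes that gets folded into a single value now and then. A minimal in-memory sketch, where the `RefExtent` name and the use of +1/-1 deltas are my own assumptions about what "a new count is appended" would look like:

```python
class RefExtent:
    """Hypothetical external-reference extent: an append-only log of deltas."""

    def __init__(self):
        self.deltas = [1]  # file creation counts as the first reference

    def append(self, delta):
        # +1 for a new hardlink, -1 for a removed one.
        self.deltas.append(delta)

    def count(self):
        return sum(self.deltas)

    def compact(self):
        # Rewrite the extent as a single value to keep it small.
        self.deltas = [self.count()]
```

Compaction never changes the result of `count()`, only the number of entries that have to be read to compute it.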
Further, (1) could become a hidden inode if all the local namespace references are removed, but it could eventually be transferred over to another NN shard (namespace) to keep overhead at a minimum, though (again), this is not a strict necessity.
The design retains the NN view of files as directory entries, just entries with a little bit of metadata. The metadata could be in memory or part of the file and periodically modified, but that’s more implementation detail than anything (as mentioned above).