This document discusses directory write leases in MagFS, a globally distributed file system. It introduces the concept of directory write leases, which allow clients to cache and execute namespace-modifying operations locally to improve performance over high-latency networks. Evaluation results show that directory write leases enable workloads to complete much faster with increasing network latency compared to synchronous approaches.
20. File Write Lease
• Gives authority over a single file
• Exclusive, single-writer
• Client can cache file modifications locally
• Must flush dirty data on lease break

Dir Write Lease
• Gives authority over a single directory (not the subtree!)
• Exclusive, single-writer
• Client can cache namespace-modifying ops in that directory
• Must replay directory modifications on lease break
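The file-vs-directory comparison above can be sketched in code. This is a minimal, hypothetical sketch (the class and method names are not MagFS's actual API): the key difference is what the client must do when the server breaks the lease, flushing dirty file data versus replaying cached namespace ops.

```python
from enum import Enum, auto

class LeaseType(Enum):
    FILE_WRITE = auto()  # exclusive authority over a single file
    DIR_WRITE = auto()   # exclusive authority over a single directory (not its subtree)

class Client:
    """Hypothetical client-side lease handling, for illustration only."""

    def __init__(self):
        self.dirty_data = {}   # file handle -> locally cached modifications
        self.pending_ops = []  # namespace ops cached under a DWL

    def on_lease_break(self, lease_type):
        if lease_type is LeaseType.FILE_WRITE:
            # Must flush dirty data back to the server before acking the break.
            for handle, data in self.dirty_data.items():
                self.flush(handle, data)
            self.dirty_data.clear()
        elif lease_type is LeaseType.DIR_WRITE:
            # Must replay cached namespace ops (creates/renames/deletes) in order.
            for op in self.pending_ops:
                self.replay(op)
            self.pending_ops.clear()

    def flush(self, handle, data): ...   # stub: synchronous write-back to the server
    def replay(self, op): ...            # stub: synchronous namespace op to the server
```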
21. Lease grant conditions
• Client must request a DWL on the directory
• When to issue?
• Detect a pattern and request a lease upgrade in the background
• Exclusivity: no other client has opens on this directory AND its children
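The "when to issue?" question can be illustrated with a simple heuristic. This is a hypothetical sketch (the threshold, class, and `request_dwl_upgrade` call are assumptions, not MagFS's actual mechanism): once the client sees enough namespace-modifying ops in one directory, it asks the server for an upgrade in the background, and the server only grants it if the exclusivity condition holds.

```python
from collections import defaultdict

# Assumed threshold: after this many namespace ops in one directory,
# the workload likely benefits from a DWL.
UPGRADE_THRESHOLD = 8

class LeaseRequester:
    def __init__(self, server):
        self.server = server
        self.ns_op_count = defaultdict(int)  # dir id -> recent create/delete/rename count

    def record_namespace_op(self, dir_id):
        self.ns_op_count[dir_id] += 1
        if self.ns_op_count[dir_id] == UPGRADE_THRESHOLD:
            # Background upgrade request; the server grants it only if no
            # other client has opens on this directory or its children.
            self.server.request_dwl_upgrade(dir_id)
```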
27. Lease break semantics
• Server must issue a lease break when another client tries to:
  • Open this directory
  • Open anywhere in this sub-tree
  • Rename into this directory
• Client must drain all pending ops on this directory, AND on all children in that directory
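The break conditions above can be written as a single predicate. This is an illustrative sketch, not the server's actual code; paths are assumed to be normalized absolute strings.

```python
def must_break_dwl(lease, op):
    """Hypothetical server-side check for the lease break conditions.

    lease: (holder_client, leased_dir_path)
    op:    (client, kind, path) where kind is "open" or "rename_into"
    """
    holder, leased_dir = lease
    client, kind, path = op
    if client == holder:
        return False  # the lease holder's own ops never break its lease
    if kind == "open" and (path == leased_dir or path.startswith(leased_dir + "/")):
        return True   # open of the directory, or anywhere in its sub-tree
    if kind == "rename_into" and path == leased_dir:
        return True   # rename targeting this directory
    return False
```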
Hi everyone, my name is Deepti Chheda and I’m a Staff Engineer at Maginatics, and this is my colleague Nate. Today we are going to show you how we considerably sped up small-file, metadata-heavy workloads using the concept of Directory Write Leases in the Maginatics Distributed File System, also known as MagFS.
Let me give you a brief overview of the Maginatics Cloud Storage Platform. It’s an enterprise storage platform built on top of object stores: a strongly consistent, geo-distributed platform designed with a strong focus on security and mobility. MCSP is not a storage gateway. And at the heart of this platform lies the Maginatics File System. MagFS uses its own proprietary protocol and hence is not compatible with SMB/NFS. However, MagFS is POSIX compliant, so your applications should run seamlessly against it without any modifications.
Quick look at the architecture of MagFS. We have a clean metadata/data separation: clients read and write data directly to the object stores.
This allows the clients to take advantage of the scalability properties of the underlying object stores.
Clients communicate with the metadata server using our proprietary WAN-optimized protocol.
The server provides one consistent view of the file system at all times by synchronizing access across all clients.
So the previous slide starts looking something like this, where clients are accessing the server across low-bandwidth, high-latency networks.
Most distributed file systems have strong consistency requirements, so clients need to make synchronous calls to the server in order to enforce consistency. Each synchronous op incurs a network round trip. Over WAN latencies this can become quite expensive, leading to poor performance. This makes the file system almost unusable over WAN.
Let’s take a step back and look at how traditional network or distributed file systems try to alleviate this problem? Any guesses?
Leases allow clients to cache files locally for a period of time, providing performance improvements without sacrificing strong consistency guarantees. MagFS employs a similar caching/leasing mechanism. In fact, leases found in SMB2 or delegations found in NFS are good examples of this general concept.
Let’s take an example of SMB leases to get a closer look of how they work
Don’t forget about the NONE Lease State
Some of the most common file system ops can be optimized using leases. But you might notice that the namespace-modifying ops are absent.
Using leases we can optimize file data operations, and some of the most common metadata operations like readdir and stat.
To provide strong consistency guarantees, each of the namespace-modifying ops needs to be a synchronous call to the server.
To visualize the impact of this, take a look at this graph. This workload performs 8000 creates and deletes. At LAN speeds it takes 4 minutes, but over something even remotely remote, like 50 ms of latency, it takes close to an hour! 50 ms latency is what I have when I’m working from home!! In San Francisco!! And for our colleague on the east coast, it would take up to 4 hours!!
So obviously metadata ops are a problem. If we could satisfy these locally, much like data operations, then we could see a significant improvement. The key question is…
Otherwise this would have been a boring talk!
Using Directory Write Leases in MagFS we were able to successfully hide network latencies and significantly improve such workloads. The rest of this talk will focus on how we achieved this.
Let’s dive deeper into the semantics of such a lease state, and how it can provide strong consistency guarantees in a distributed file system like MagFS while still allowing the client to perform all kinds of magic in order to hide the network latency from the application.
MagFS clients are allowed to hold write leases on directories, much like files
Let’s draw up an analogy with file write leases to see what a directory lease would look like.
First of all, the client must explicitly request a DWL. But this is an exclusive lease, and to avoid contention the client must be able to figure out intelligently when to request it. Unlike with file leases, the access mask is not a good indicator. Instead, the client must detect a pattern from the application, e.g. a burst of creates and deletes, which might benefit from holding a DWL, and then upgrade the lease from RH to RWH in the background. Second, the server needs to ensure no other client is accessing this directory or its namespace.
The client might have to vend out “fids” locally, using an inode-number reservation scheme, and maintain a mapping from local fids to remote fids.
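A fid-vending scheme like this could be sketched as follows. This is a hypothetical illustration (the class, the reservation range, and the fix-up flow are assumptions): the server pre-reserves a range of inode numbers, the client hands them out for files created under the DWL, and records the authoritative fid once each cached op is replayed.

```python
class FidTable:
    """Hypothetical local fid vending under a DWL, for illustration only."""

    def __init__(self, reserved_range):
        self.free = iter(reserved_range)  # inode numbers pre-reserved by the server
        self.local_to_remote = {}         # local fid -> server-assigned fid

    def vend(self):
        # Hand out a locally valid fid for a file created under the DWL.
        return next(self.free)

    def bind(self, local_fid, remote_fid):
        # After the cached op is replayed, the server returns the real fid.
        self.local_to_remote[local_fid] = remote_fid

    def resolve(self, fid):
        # Translate a local fid to the remote one, if a binding exists yet.
        return self.local_to_remote.get(fid, fid)
```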
Record each cached “op” in a pending-ops queue. Transient or persistent?
There must be an upper limit on the number of pending ops, to ensure the queue can be drained within the lease break interval.
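A bounded pending-ops queue might look like this. A minimal sketch under assumptions: the cap value is invented, and the fallback behavior when the queue is full (send synchronously) is one plausible design, not necessarily MagFS's.

```python
from collections import deque

# Assumed cap: chosen so the queue can always be drained within the
# server's lease-break acknowledgement window.
MAX_PENDING_OPS = 1000

class PendingOps:
    def __init__(self):
        self.queue = deque()

    def cache(self, op):
        """Cache a namespace op locally; returns False if the queue is full,
        in which case the caller must issue the op synchronously instead."""
        if len(self.queue) >= MAX_PENDING_OPS:
            return False
        self.queue.append(op)
        return True

    def drain(self, send):
        # Replay in FIFO order: later ops may depend on earlier ones
        # (e.g. create "foo" must precede rename "foo" -> "bar").
        while self.queue:
            send(self.queue.popleft())
```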
The client may satisfy creates, renames, and deletes locally for a directory if it holds a DWL on it.
In order to do so, the client must perform all the checks that the server would have: parameter and pathname validations, existence checks, access checks, sharing-violation checks, etc.
The client needs to have previously cached the entire directory enumeration to perform these checks correctly.
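A few of those client-side checks can be sketched for a locally satisfied create. This is an illustrative subset only (the function and the specific validations are assumptions; a real client would also do access and sharing-violation checks), and it shows why the full cached enumeration is required: the existence check is only authoritative if the client holds the complete listing.

```python
import errno

def check_local_create(cached_entries, name):
    """Hypothetical client-side validation for a create cached under a DWL.

    cached_entries: the complete cached enumeration of the directory.
    Returns 0 on success, or an errno value mirroring the server's checks.
    """
    if not name or "/" in name or name in (".", ".."):
        return errno.EINVAL          # pathname validation
    if len(name.encode()) > 255:
        return errno.ENAMETOOLONG    # parameter validation
    if name in cached_entries:
        return errno.EEXIST          # existence check against the cached listing
    return 0                         # ok: the create can be satisfied locally
```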
Newly created files and directories automatically assume a DWL, since the new child is not yet visible to any other client and hence there is implicit exclusivity.
Now what happens if user2 tries to access file ‘foo’
That can be a slippery slope!
Let’s define some lease break conditions to ensure consistency in the system. New opens need to traverse the path, breaking any lease on an intermediate directory, because the namespace could have been modified.
This example demonstrates how we can ensure full consistency in the system by following the Directory Write Lease semantics outlined in this talk. However, the true power of this lease can only be realized if the client efficiently maximizes the gains from it. At this point, I’m going to hand off to my colleague Nate, who is going to talk about how the client does exactly that.
The earlier portion of the talk has covered the client’s responsibilities under the Directory Write Leasing mechanism: enforcing the file system consistency semantics and ensuring security, integrity, and in general correct file system behavior, as if the DWL were not in use.
The primary focus of DWL is performance: burst performance and sustained performance.
Talk through burst performance optimization
Talk through how (1) there are dependencies on the pending operations
Mention the fsync support here
Reemphasize the high level bit.
Talk about experimental procedures: draining outstanding pending operations and accounting for it.
Mixed creation + data workload
SMB has better performance at lower latencies
No panacea: WAN is painful
Fastest way to send data around the world is to send no data