Directory Write Leases in MagFS
Deepti Chheda, Staff Engineer @Maginatics 
Nate Rosenblum, Architect @Maginatics 
© mekuria getinet / www.mekuriageti.net
Maginatics Cloud Storage Platform 
✔ Strongly consistent
✔ Geo-distributed
✔ Secure
✔ Mobile-enabled
✔ Layered on object stores
✔ POSIX compliant
✗ Storage Gateway
✗ SMB/NFS compatible
[Figure: Maginatics File System architecture. MagFS Clients, Metadata Servers, and Object Storage (public, on-premises, or hybrid), linked by metadata (*) and data paths.]
*MagFS proprietary WAN-optimized protocol
Problems with geo-distributed file systems
• Enforcing consistency requires synchronous calls
• WAN link latencies make synchronous calls unsuitable for global distribution
How do traditional network file systems alleviate this problem?
Leases / caching: changes are propagated to the server later
Hiding latencies in network file systems
Clients serve reads & writes locally
• Performance improvements
• Strong consistency guarantees
Example: SMB Leases
Read Lease
• File reads
• Directory enumeration
• Metadata & file attributes
• Shared
Write Lease
• File modifications
• Exclusive, single-writer
Handle Lease
• Open handles after the application has closed them
• Shared
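The three lease types differ mainly in whether they can be shared. Below is a minimal sketch of that compatibility rule. It assumes a simplified model in which Read and Handle leases are shared and a Write lease is exclusive. `Lease`, `conflicts`, and the bit values are hypothetical illustrations, not quoted from the SMB specification.

```python
from enum import Flag


class Lease(Flag):
    """Simplified lease state bits (values chosen for illustration)."""
    NONE = 0
    READ = 1    # R: file reads, directory enumeration, attributes (shared)
    HANDLE = 2  # H: keep handles open after the application closes them (shared)
    WRITE = 4   # W: file modifications (exclusive, single-writer)


RH = Lease.READ | Lease.HANDLE
RWH = Lease.READ | Lease.WRITE | Lease.HANDLE


def conflicts(held: Lease, requested: Lease) -> bool:
    """Must another client's `held` lease be broken before granting `requested`?"""
    if held & Lease.WRITE:
        return True                  # an exclusive writer always has to be broken
    if requested & Lease.WRITE:
        return held != Lease.NONE    # exclusivity: nobody else may hold a lease
    return False                     # shared R/H leases coexist


print(conflicts(RH, RH))           # False
print(conflicts(RWH, Lease.READ))  # True
print(conflicts(Lease.READ, RWH))  # True
```

In this model, "broken" means the holder must give up at least its Write state. Real SMB breaks downgrade a lease (for example RWH to RH) rather than always revoking it outright.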
Strong Consistency
SMB Valid Leases
             Read Lease   Handle Lease   Write Lease
File         ✔            ✔              ✔
Directory    ✔            ✔              ✗
Common FS operations optimized
             Read Lease          Handle Lease          Write Lease
File         read(), stat()      open(), close()       write()
Directory    readdir(), stat()   opendir(), close()    ✗
Namespace-modifying operations?
create(), mkdir(), rename(), unlink(), rmdir(), chmod()
Synchronous ops => each incurs a network RTT!
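Each synchronous namespace operation waits for one round trip. A back-of-envelope model is therefore: total time ≈ number of ops × RTT, ignoring server time and bandwidth. The operation count below is hypothetical. It only shows how the cost scales with RTT and is not the workload size behind the chart.

```python
def sync_wall_time_s(num_ops: int, rtt_ms: int) -> float:
    """Lower bound on wall time when every operation blocks for one network RTT."""
    return num_ops * rtt_ms / 1000


def hours_minutes(seconds: float) -> str:
    """Format seconds as H:MM, the unit used on the chart's time axis."""
    minutes = int(seconds // 60)
    return f"{minutes // 60}:{minutes % 60:02d}"


NUM_OPS = 100_000  # hypothetical: e.g. 50,000 creates + 50,000 deletes

for rtt in (5, 50, 100, 150):
    t = sync_wall_time_s(NUM_OPS, rtt)
    print(f"RTT {rtt:>3} ms -> {t:>8.0f} s ({hours_minutes(t)})")
```

The linear dependence on RTT is the point. At a fixed operation count, tripling the RTT triples the time, which is why delegating these operations to clients is attractive.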
[Chart: Create and delete workload. Time in hours (0:00 to 4:19) vs. network RTT (5, 50, 100, 150 msec); series: SMB]
Can we safely delegate namespace-modifying operations to clients?
Directory Write Leases in MagFS
[Chart: Create & delete workload. Time in hours (0:00 to 4:19) vs. network RTT (5, 50, 100, 150 msec); series: SMB, MagFS]
Directory Write Leases (DWL): Semantics
MagFS Lease states
             Read Lease   Handle Lease   Write Lease
File         ✔            ✔              ✔
Directory    ✔            ✔              ✔
File Write Lease
• Gives authority over a single file
• Exclusive, single-writer
• Client can cache file modifications locally
• Must flush dirty data on lease break
Dir Write Lease
• Gives authority over a single directory (not subtree!)
• Exclusive, single-writer
• Client can cache namespace-modifying ops in that directory
• Must replay directory modifications on lease break
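A file write lease flushes dirty data on a break. A directory write lease instead has to replay a history of operations. Below is a minimal sketch of the client side. It assumes the client keeps an ordered log of namespace-modifying operations while it holds the DWL and replays that log, in order, when the lease is broken. `DirWriteLease`, `NamespaceOp`, and the `send` callback are hypothetical names, not MagFS APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class NamespaceOp:
    kind: str                 # e.g. "create", "mkdir", "rename", "unlink"
    args: Tuple[str, ...]     # e.g. ("foo", "quux") for a rename


@dataclass
class DirWriteLease:
    """Client-side state for one directory covered by a directory write lease."""
    directory: str
    send: Callable[[NamespaceOp], None]   # synchronous call to the metadata server
    log: List[NamespaceOp] = field(default_factory=list)
    held: bool = True

    def apply(self, op: NamespaceOp) -> None:
        if self.held:
            self.log.append(op)   # authority is local: complete now, propagate later
        else:
            self.send(op)         # no lease: pay the round trip immediately

    def on_lease_break(self) -> None:
        # Replay cached modifications in their original order, then give up
        # authority; a real client would acknowledge the break only after this.
        for op in self.log:
            self.send(op)
        self.log.clear()
        self.held = False


sent: List[NamespaceOp] = []
lease = DirWriteLease("/home/user1", sent.append)
lease.apply(NamespaceOp("create", ("bar",)))
lease.apply(NamespaceOp("rename", ("foo", "quux")))
print(len(sent))                  # 0: nothing has crossed the WAN yet
lease.on_lease_break()
print([op.kind for op in sent])   # ['create', 'rename']
```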
Lease grant conditions
• Client must request a DWL on the directory
• When to issue?
  • Detect the pattern and request a lease upgrade in the background
• Exclusivity
  • No other client has opens on this directory or on any of its children
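The exclusivity condition can be sketched as a server-side check. This assumes the server tracks open handles as (client, path) pairs. It also reads "children" as every path in the directory's subtree, which is consistent with the subtree-wide break triggers but is an assumption. `can_grant_dwl` and `in_subtree` are hypothetical names.

```python
from posixpath import normpath
from typing import Iterable, Tuple


def in_subtree(path: str, directory: str) -> bool:
    """True if `path` is `directory` itself or lies anywhere beneath it."""
    path, directory = normpath(path), normpath(directory)
    return path == directory or path.startswith(directory.rstrip("/") + "/")


def can_grant_dwl(requester: str, directory: str,
                  opens: Iterable[Tuple[str, str]]) -> bool:
    """Grant a DWL only if no *other* client has an open on the directory
    or anywhere below it."""
    return not any(
        client != requester and in_subtree(path, directory)
        for client, path in opens
    )


opens = [("client1", "/home/user1"), ("client2", "/home/user2/file")]
print(can_grant_dwl("client1", "/home/user1", opens))  # True
print(can_grant_dwl("client1", "/home", opens))        # False: client2 has an open below /home
```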
[Figure: Home directory use-case, in four build steps. home/ contains user1/ and user2/; user2/ holds file; user1/ holds foo and bar; baz then appears under bar, and foo is renamed to quux.]
Directory Write Leases in MagFS
Lease break semantics
• Server must issue a lease break when another client tries to:
  • Open this directory
  • Open anywhere in this subtree
  • Rename into this directory
• Client must drain all pending ops on this directory AND on all of its children
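The three break triggers can be expressed as a predicate over another client's request. The sketch below assumes requests carry an operation name and absolute paths, and it models only the triggers listed on the slide. `Request`, `under`, and `needs_break` are hypothetical names.

```python
from dataclasses import dataclass
from posixpath import dirname, normpath
from typing import Optional


@dataclass(frozen=True)
class Request:
    client: str
    op: str                     # "open" or "rename" (other ops are not modelled)
    path: str                   # target of an open, source of a rename
    dest: Optional[str] = None  # destination path of a rename


def under(path: str, directory: str) -> bool:
    """True if `path` is `directory` or anywhere in its subtree."""
    path, directory = normpath(path), normpath(directory)
    return path == directory or path.startswith(directory.rstrip("/") + "/")


def needs_break(lease_dir: str, holder: str, req: Request) -> bool:
    """Must the server break `holder`'s DWL on `lease_dir` before serving `req`?"""
    if req.client == holder:
        return False                          # the leaseholder never breaks its own lease
    if req.op == "open":
        return under(req.path, lease_dir)     # this directory or anywhere below it
    if req.op == "rename" and req.dest is not None:
        return normpath(dirname(req.dest)) == normpath(lease_dir)  # into this directory
    return False


print(needs_break("/home/user1", "client1", Request("client2", "open", "/home/user1/bar/baz")))
print(needs_break("/home/user1", "client1", Request("client2", "open", "/home/user2/file")))
```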
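Directory Write Lease break (participants: Client 1, Server, Client 2)
1. Client 1: Open(user1) → Server returns Handle (RWH)
2. Client 1 completes locally, keeping them as pending ops: Create(user1, bar) = P1, Create(bar, baz) = P2, Rename(foo, quux) = P3
3. Client 2: Open(user1)
4. Server → Client 1: Lease Break (RWH->RH)
5. Client 1 → Server: P1 P2 P3 + ACK
6. Server → Client 2: Handle (RH)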
Client support for Directory Write Leases
Client responsibilities
• File system consistency semantics
• Security, integrity, correct behavior of local file system operations
• Performance
Lifetime of a DWL
pattern detection / trigger → write-behind → opportunistic replay
full queue or lease break → forced replay
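The lifecycle can be sketched as a small state machine. The slides say only that the client detects a pattern, writes behind, replays opportunistically, and is forced to replay when the queue fills or the lease is broken. Several details here are hypothetical: the trigger heuristic (ask for a DWL after `TRIGGER_OPS` synchronous namespace ops), the queue capacity, the immediate lease grant, and the `DwlLifecycle` name.

```python
from collections import deque
from typing import Callable, Deque

TRIGGER_OPS = 3      # hypothetical: synchronous ops seen before requesting a DWL
QUEUE_CAPACITY = 4   # hypothetical: max write-behind ops before forced replay


class DwlLifecycle:
    def __init__(self, replay: Callable[[str], None]) -> None:
        self.replay = replay           # sends one op to the server (one RTT)
        self.sync_ops = 0              # namespace ops sent synchronously so far
        self.leased = False
        self.queue: Deque[str] = deque()

    def namespace_op(self, op: str) -> None:
        if not self.leased:
            self.replay(op)            # synchronous path
            self.sync_ops += 1
            if self.sync_ops >= TRIGGER_OPS:
                self.leased = True     # pattern detected; grant assumed immediate
            return
        if len(self.queue) >= QUEUE_CAPACITY:
            self.drain()               # full queue: forced replay
        self.queue.append(op)          # write-behind: completes locally

    def idle(self, budget: int = 1) -> None:
        # Opportunistic replay: trickle a few ops out while the link is idle.
        for _ in range(min(budget, len(self.queue))):
            self.replay(self.queue.popleft())

    def drain(self) -> None:
        while self.queue:
            self.replay(self.queue.popleft())

    def lease_break(self) -> None:
        self.drain()                   # forced replay before acknowledging
        self.leased = False
        self.sync_ops = 0
```

A forced replay on a full queue stalls the application, which is the "full queue" limit of burst performance discussed next.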
Limits of burst performance
• Full queue
• In-queue dependencies
Operational dependencies 
$ mkdir foo 
$ touch foo/bar{1,2,3} 
$ mkdir foo/baz 
$ rm foo/bar1 
$ mv foo quux
Exploiting parallelism 
$ mkdir foo 
$ touch foo/bar{1,2,3} 
$ mkdir foo/baz 
$ rm foo/bar1 
$ mv foo quux
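One way to find this parallelism is to build a dependency graph and group it into waves. The sketch assumes a deliberately conservative rule that is not necessarily the one MagFS uses: two operations conflict when they touch the same path or one path is an ancestor of the other. Each operation depends on every earlier conflicting operation, and operations whose dependencies are all satisfied can be replayed together in the same "wave". The brace expansion `bar{1,2,3}` is written out as three ops; `related`, `dependencies`, and `waves` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Op = Tuple[str, Tuple[str, ...]]   # (command, paths it touches)


def related(a: str, b: str) -> bool:
    """Same path, or one is an ancestor directory of the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def dependencies(ops: Sequence[Op]) -> Dict[int, List[int]]:
    """Op i depends on every earlier op j that touches a related path."""
    return {
        i: [j for j in range(i)
            if any(related(p, q) for p in paths for q in ops[j][1])]
        for i, (_, paths) in enumerate(ops)
    }


def waves(ops: Sequence[Op]) -> List[List[str]]:
    """Group ops into levels; ops in the same level are mutually independent."""
    deps = dependencies(ops)
    level: List[int] = []
    for i in range(len(ops)):
        # Dependencies always point to earlier ops, so one forward pass suffices.
        level.append(1 + max((level[j] for j in deps[i]), default=-1))
    out: List[List[str]] = [[] for _ in range(max(level, default=-1) + 1)]
    for (cmd, _), lvl in zip(ops, level):
        out[lvl].append(cmd)
    return out


ops: List[Op] = [
    ("mkdir foo", ("foo",)),
    ("touch foo/bar1", ("foo/bar1",)),
    ("touch foo/bar2", ("foo/bar2",)),
    ("touch foo/bar3", ("foo/bar3",)),
    ("mkdir foo/baz", ("foo/baz",)),
    ("rm foo/bar1", ("foo/bar1",)),
    ("mv foo quux", ("foo", "quux")),
]

for n, wave in enumerate(waves(ops)):
    print(n, wave)
```

Under this rule the seven commands collapse into four waves, with the three touches and the nested mkdir replayable concurrently. The final `mv` depends on everything beneath `foo`.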
Replaying operations faster
• Minimize uncommitted operations: operations reported complete but at risk
[Timeline: t0, tk, tn; application-visible completion vs. durable / committed operations]
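The quantity being minimized can be made concrete. An operation is at risk from the moment the application sees it complete until it is durably committed on the server. Below is a minimal sketch assuming per-operation timestamps in arbitrary time units. The field names `completed` and `committed` and the two timelines are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OpTimes:
    completed: float   # application-visible completion (local write-behind)
    committed: float   # durable commit on the server after replay


def at_risk(ops: Sequence[OpTimes], t: float) -> int:
    """Operations reported complete by time t but not yet committed."""
    return sum(1 for op in ops if op.completed <= t < op.committed)


def exposure(ops: Sequence[OpTimes]) -> float:
    """Sum of per-op at-risk windows; faster replay shrinks this."""
    return sum(op.committed - op.completed for op in ops)


# Hypothetical timeline: three ops complete locally at t = 0, 1, 2.
serial = [OpTimes(0, 10), OpTimes(1, 20), OpTimes(2, 30)]    # replayed one RTT at a time
parallel = [OpTimes(0, 10), OpTimes(1, 10), OpTimes(2, 10)]  # replayed concurrently
print(at_risk(serial, 15), at_risk(parallel, 15))   # 2 0
print(exposure(serial), exposure(parallel))         # 57 27
```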
Results
[Chart: duration (s), 0 to 150, vs. network latency (ms), 0 to 50; series: sync, async]
Finding workload parallelism 
# Populate a subtree
#
for i in {1..10}; do
    mkdir p${i}
    for j in {1..100}; do
        touch p${i}/f${j}
    done
done
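Counting round trips for this loop shows why it parallelizes well. The sketch assumes a dependency rule where an op depends on earlier ops on the same path or an ancestor. Under that rule the 10 mkdirs are mutually independent and each touch depends only on its own mkdir, so the critical path is two operations long. It compares sequential replay with an idealized replay under unbounded parallelism; a real replay would be limited by queue depth and server concurrency. `RTT_MS` and the function names are hypothetical.

```python
from typing import List, Tuple

Op = Tuple[str, str]   # (command, path)


def populate_ops(dirs: int = 10, files: int = 100) -> List[Op]:
    """The shell loop above, expanded into the ops it issues, in order."""
    ops: List[Op] = []
    for i in range(1, dirs + 1):
        ops.append(("mkdir", f"p{i}"))
        for j in range(1, files + 1):
            ops.append(("touch", f"p{i}/f{j}"))
    return ops


def related(a: str, b: str) -> bool:
    """Same path, or one is an ancestor directory of the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def critical_path_length(ops: List[Op]) -> int:
    """Longest chain of ops that must be replayed one after another."""
    depth: List[int] = []
    for i, (_, path) in enumerate(ops):
        earlier = [depth[j] for j in range(i) if related(path, ops[j][1])]
        depth.append(1 + max(earlier, default=0))
    return max(depth, default=0)


RTT_MS = 50  # hypothetical WAN round-trip time

ops = populate_ops()
chain = critical_path_length(ops)
print(f"{len(ops)} ops replayed one by one: {len(ops) * RTT_MS / 1000} s")
print(f"critical path of {chain} ops: {chain * RTT_MS / 1000} s (unbounded parallelism)")
```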
Mixed data + metadata: extracting archives
[Chart: duration (s), 0 to 400, vs. network latency (ms), 0 to 50; series: sync, async]
target: openssl-1.0.1i.tar.gz
• Combines namespace mutation + data operations
• Intractable over WAN for even modest archives
[Chart, same experiment with SMB added: duration (s), 0 to 600, vs. network latency (ms), 0 to 50; series: sync, async, smb]
Multi-phase workloads (building OpenSSL)
[Chart: duration (s), 0 to 1000, for each phase (untar, config, build, clean) at latencies of 0.5, 5, 10, and 50; series: async, sync]
Directory write leases in MagFS 
• WAN-optimized file system for the global enterprise
• Directory write leases delegate namespace responsibility to clients
[Chart: duration (s), 0 to 400, vs. network latency (ms), 0 to 50; series: sync, async]
• Leasing helps performance scale with increasing network latency
Try MagFS at http://maginatics.com
Backup
Simple extension: compounding
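The slide gives only the title, so the meaning of compounding here is an assumption. This sketch reads it as packing several independent queued operations into a single request, so that one round trip carries many of them. With `n` ops in a wave and at most `k` per compound request, replaying the wave takes ⌈n/k⌉ round trips instead of `n`. `compound` and the batch size are hypothetical. Order within each batch is preserved, but only ops from the same dependency wave should be compounded freely.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def compound(ops: Iterable[T], max_per_request: int) -> Iterator[List[T]]:
    """Yield batches of at most `max_per_request` ops, one batch per request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be >= 1")
    it = iter(ops)
    while batch := list(islice(it, max_per_request)):
        yield batch


wave = [f"touch p1/f{j}" for j in range(1, 101)]    # 100 independent ops
requests = list(compound(wave, 16))
print(len(requests), "round trips instead of", len(wave))   # 7 round trips instead of 100
```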
Advanced optimization: cancellation
• Collapsing redundant operations: mv foo bar ; mv bar baz
• Operation cancellation: touch foo ; rm foo
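Both rewrites can be applied to the write-behind queue before replay, but only when they do not change the outcome. For example, `touch foo ; rm foo` can be dropped only if `foo` did not already exist. Likewise, `mv foo bar ; mv bar baz` becomes `mv foo baz` only if the first `mv` did not overwrite an existing `bar`. The sketch assumes a simple op encoding, with `("create", path)` standing for a `touch` that creates a new file. It handles only the two slide patterns when the ops are adjacent. `optimize` and the encoding are hypothetical.

```python
from typing import List, Set, Tuple

Op = Tuple[str, ...]   # ("create", path) | ("rm", path) | ("mv", src, dst)


def optimize(queue: List[Op], preexisting: Set[str]) -> List[Op]:
    """Cancel and collapse adjacent redundant ops before replay."""
    live = set(preexisting)   # names that exist at this point of the original sequence
    out: List[Op] = []        # rewritten queue
    fresh: List[bool] = []    # fresh[i]: out[i] produced a name that did not exist before
    for op in queue:
        kind = op[0]
        if kind == "create":
            was_new = op[1] not in live
            live.add(op[1])
        elif kind == "mv":
            was_new = op[2] not in live
            live.discard(op[1])
            live.add(op[2])
        else:  # "rm"
            was_new = False
            live.discard(op[1])
        if out and fresh[-1]:
            prev = out[-1]
            # Cancellation: create X ; rm X -> nothing (X was new).
            if prev[0] == "create" and kind == "rm" and prev[1] == op[1]:
                out.pop()
                fresh.pop()
                continue
            # Collapsing: mv A B ; mv B C -> mv A C (B was new).
            if prev[0] == "mv" and kind == "mv" and prev[2] == op[1]:
                out[-1] = ("mv", prev[1], op[2])
                fresh[-1] = was_new
                continue
        out.append(op)
        fresh.append(was_new)
    return out


print(optimize([("mv", "foo", "bar"), ("mv", "bar", "baz")], {"foo"}))
print(optimize([("create", "foo"), ("rm", "foo")], set()))
```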
Simple dependency graph
Extracted OpenSSL dependency graph
* tiny fraction of the dependency graph

Lecture 4                              .pdfLecture 4                              .pdf
Lecture 4 .pdf
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
 
Litature Review: Research Paper work for Engineering
Litature Review: Research Paper work for EngineeringLitature Review: Research Paper work for Engineering
Litature Review: Research Paper work for Engineering
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
 
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratoryدليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
 
Basic Principle of Electrochemical Sensor
Basic Principle of  Electrochemical SensorBasic Principle of  Electrochemical Sensor
Basic Principle of Electrochemical Sensor
 
Test of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptxTest of Significance of Large Samples for Mean = µ.pptx
Test of Significance of Large Samples for Mean = µ.pptx
 

Directory Write Leases in MagFS

  • 1. Directory Write Leases in MagFS. Deepti Chheda, Staff Engineer @Maginatics; Nate Rosenblum, Architect @Maginatics. (Image © mekuria getinet / www.mekuriageti.net)
  • 2. Maginatics Cloud Storage Platform: strongly consistent; geo-distributed; secure; mobile-enabled; layered on object stores; POSIX compliant. Not a storage gateway; not SMB/NFS compatible.
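  • 3. [Architecture diagram, Maginatics File System: MagFS clients exchange metadata with the metadata servers over the MagFS proprietary WAN-optimized protocol and read/write data directly to object storage (public, on-premises, or hybrid).]
  • 4. [Diagram: the same layout with a single metadata server, MagFS clients, and object storage (public, on-premises, or hybrid), with the clients now reaching the server across WAN links.]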
  • 5. Problems with geo-distributed file systems: enforcing consistency requires synchronous calls; WAN link latencies make each call expensive; the result is unsuitable for global distribution.
  • 6. How do traditional network file systems alleviate this problem?
  • 7. Hiding latencies in network file systems: leases / caching. Clients serve reads & writes locally, and changes are later propagated to the server. Benefits: performance improvements and strong consistency guarantees.
  • 8. Example: SMB leases. Read lease: file reads, directory enumeration, metadata & file attributes; shared. Write lease: file modifications; exclusive, single-writer. Handle lease: keeps open handles after the application has closed them; shared.
  • 10. SMB valid leases: File: Read ✔, Handle ✔, Write ✔. Directory: Read ✔, Handle ✔, Write ✗.
  • 11. Common FS operations optimized: File: read(), stat() (Read lease); open(), close() (Handle lease); write() (Write lease). Directory: readdir(), stat() (Read lease); opendir(), close() (Handle lease).
  • 12. Namespace-modifying operations? create(), mkdir(), rename(), unlink(), rmdir(), chmod(): synchronous ops, so each one incurs a network RTT!
  • 13. [Chart: create and delete workload over SMB; time in hours (0:00 to 4:19) vs. network RTT (5, 50, 100, 150 msec).]
  • 14. Can we safely delegate namespace modifying operations to clients?
  • 16. [Chart: create & delete workload, SMB vs. MagFS; time in hours (0:00 to 4:19) vs. network RTT (5, 50, 100, 150 msec).]
  • 19. MagFS lease states: File: Read ✔, Handle ✔, Write ✔. Directory: Read ✔, Handle ✔, Write ✔.
  • 20. File write lease: gives authority over a single file; exclusive, single-writer; the client can cache file modifications locally; it must flush dirty data on lease break. Directory write lease: gives authority over a single directory (not a subtree!); exclusive, single-writer; the client can cache namespace-modifying ops in that directory; it must replay directory modifications on lease break.
  • 21. Lease grant conditions: the client must request a DWL on the directory. When to issue? Detect the pattern and request a lease upgrade in the background. Exclusivity: no other client has opens on this directory AND its children.
  • 22-25. [Diagram sequence, home directory use-case: a tree containing home, user1, foo, bar, user2, and file (slides 22-23); baz is added (slide 24); foo is renamed to quux (slide 25).]
  • 27. Lease break semantics: the server must issue a lease break when another client tries to open this directory, open anywhere in this subtree, or rename into this directory. The client must drain all pending ops on this directory AND on all children in that directory.
  • 28. [Sequence diagram, directory write lease break: Client 1 opens user1 and receives Handle (RWH), then queues P1 Create(user1, bar), P2 Create(bar, baz), and P3 Rename(foo, quux) locally. Client 2 sends Open(user1); the server sends Client 1 a Lease Break (RWH->RH); Client 1 replays P1, P2, P3 + ACK; Client 2 then receives Handle (RH).]
  • 29. [client transition slide] Client support for Directory Write Leases
  • 30. Client responsibilities: file system consistency semantics; security, integrity, and correct behavior of local file system operations; performance.
  • 31. Lifetime of a DWL: pattern detection / trigger; write-behind; opportunistic replay; full queue; lease break; forced replay.
  • 32. Limits of burst performance: full queue; in-queue dependencies.
  • 33. Operational dependencies: $ mkdir foo; $ touch foo/bar{1,2,3}; $ mkdir foo/baz; $ rm foo/bar1; $ mv foo quux
  • 34. Exploiting parallelism: $ mkdir foo; $ touch foo/bar{1,2,3}; $ mkdir foo/baz; $ rm foo/bar1; $ mv foo quux
  • 36. Minimize uncommitted operations: operations reported complete but still at risk. [Timeline t0, tk, tn contrasting application-visible completion with durable / committed operations.]
  • 38. Finding workload parallelism. [Plot: duration (s, 0-150) vs. network latency (ms, 0-50), sync vs. async.] Workload, populating a subtree: for i in {1..10}; do mkdir p${i}; for j in {1..100}; do touch p${i}/f${j}; done; done
  • 39. Mixed data + metadata: extracting archives. [Plot: duration (s, 0-400) vs. network latency (ms, 0-50), sync vs. async; target: openssl-1.0.1i.tar.gz.] Combines namespace mutation + data operations; intractable over WAN for even modest archives.
  • 40. Mixed data + metadata: extracting archives. [Plot: duration (s, 0-600) vs. network latency (ms, 0-50), sync vs. async vs. SMB; target: openssl-1.0.1i.tar.gz.] Combines namespace mutation + data operations; intractable over WAN for even modest archives.
  • 41. Multi-phase workloads (building OpenSSL). [Plot: duration (s, 0-1000) of the untar, config, build, and clean phases at latencies of 0.5, 5, 10, and 50, async vs. sync.]
  • 42. Directory write leases in MagFS: a WAN-optimized file system for the global enterprise. Directory write leases delegate namespace responsibility to clients. Leasing helps performance scale with latency. [Plot: duration (s, 0-400) vs. network latency (ms, 0-50), sync vs. async.]
  • 43. Try MagFS at http://maginatics.com
  • 46. Advanced optimization: cancellation. Collapsing a redundant operation: mv foo bar ; mv bar baz. Operation cancellation: touch foo ; rm foo.
  • 48. [Figure: dependency graph for extracting OpenSSL; only a tiny fraction of the graph is shown.]
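Slides 10 and 19 contrast which lease states are valid on files and directories in SMB and MagFS, and slide 28 shows a directory lease breaking from RWH to RH. A minimal Python sketch, assuming leases are modeled as combinable R/W/H flags; the names `Lease`, `VALID`, `can_hold`, and `break_write` are hypothetical and not taken from MagFS.

```python
from enum import Flag


class Lease(Flag):
    NONE = 0  # no caching at all (the NONE lease state)
    R = 1     # read caching, shared
    W = 2     # write caching, exclusive single writer
    H = 4     # handle caching, shared


RH = Lease.R | Lease.H
RWH = Lease.R | Lease.W | Lease.H

# Lease bits each protocol permits per object kind (slides 10 and 19).
VALID = {
    ("smb", "file"): RWH,
    ("smb", "directory"): RH,       # SMB offers no directory write lease
    ("magfs", "file"): RWH,
    ("magfs", "directory"): RWH,    # MagFS adds the directory write lease (DWL)
}


def can_hold(protocol: str, kind: str, state: Lease) -> bool:
    """True if every bit of `state` is permitted for this protocol and object kind."""
    return (state & VALID[(protocol, kind)]) == state


def break_write(state: Lease) -> Lease:
    """Drop write caching on a lease break, e.g. RWH -> RH as on slide 28."""
    return state & ~Lease.W
```

Modeling the lease as a bit set makes the break a pure downgrade: the holder keeps read and handle caching and only gives up the exclusive write bit.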
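Slide 21's exclusivity condition says a DWL is granted only when no other client has opens on the directory or its children. A DWL covers a single directory, but any open below it still walks through it. This Python sketch assumes a hypothetical server-side table `opens` that maps open paths to the ids of clients holding them; `in_subtree` and `can_grant_dwl` are illustrative names only.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def in_subtree(path: str, directory: str) -> bool:
    """True if `path` is `directory` itself or lies anywhere beneath it."""
    p, d = PurePosixPath(path), PurePosixPath(directory)
    return p == d or d in p.parents


def can_grant_dwl(opens: dict[str, set[str]], directory: str, requester: str) -> bool:
    """Grant a DWL only if no *other* client has an open on `directory`
    or on anything beneath it.

    `opens` is a hypothetical server-side table: open path -> client ids.
    """
    for path, clients in opens.items():
        if in_subtree(path, directory) and clients - {requester}:
            return False
    return True
```

Comparing path components instead of string prefixes keeps `/home/user10` from being mistaken for a child of `/home/user1`.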
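Slide 27 lists when the server must break a DWL: another client opens the directory, opens anything beneath it, or renames into it. Below is a server-side sketch, assuming a hypothetical `dwl_holders` table mapping directory paths to the holding client. As an illustrative choice, it treats a rename's source path like an open, since the source must also be reached by path traversal.

```python
from pathlib import PurePosixPath


def _traversed(path) -> list:
    """The path itself plus every ancestor directory a lookup walks through."""
    p = PurePosixPath(path)
    return [p, *p.parents]


def leases_to_break(dwl_holders: dict, client: str, op: str, path: str, dest=None) -> set:
    """Directories whose DWL (held by someone else) must break before `op` proceeds.

    dwl_holders: hypothetical table, directory path -> client id holding its DWL.
    op: "open" or "rename"; for "rename", `dest` is the destination path.
    """
    if op == "open":
        # Opening a directory, or anything beneath it, traverses that directory.
        touched = _traversed(path)
    elif op == "rename":
        # Source is reached by traversal; the destination's parent is the
        # directory being renamed *into*, and its ancestors are traversed too.
        touched = _traversed(path) + _traversed(PurePosixPath(dest).parent)
    else:
        raise ValueError(f"unsupported op: {op}")
    return {str(d) for d in touched if dwl_holders.get(str(d)) not in (None, client)}
```

A client never breaks its own lease, which is why the holder's own opens under user1 leave the DWL in place.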
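On a break, slides 27 and 28 have the client drain pending ops on the directory and on all its children (P1, P2, P3), then ACK so the server can downgrade to RH. Here is a client-side sketch, assuming pending ops are dicts with a `paths` list and `send` is a hypothetical callable that replays one op. The design choice is conservative: it replays the whole queue prefix up to the last affected op, in FIFO order, so any earlier op that an affected op depends on reaches the server first.

```python
from pathlib import PurePosixPath


def touches(paths, directory) -> bool:
    """True if any path is `directory` itself or lies beneath it."""
    d = PurePosixPath(directory)
    return any(PurePosixPath(p) == d or d in PurePosixPath(p).parents for p in paths)


def handle_lease_break(pending, directory, send):
    """Drain pending ops for `directory` and its children, then acknowledge.

    Returns (ops still queued, acknowledgement tuple).
    """
    last = max((i for i, op in enumerate(pending) if touches(op["paths"], directory)),
               default=-1)
    for op in pending[: last + 1]:
        send(op)  # FIFO replay keeps in-queue ordering intact
    return pending[last + 1:], ("ACK", directory, "RH")
```

Replaying only the affected ops and skipping earlier unrelated ones would be cheaper. However, that could send an op before something it depends on, such as an mkdir elsewhere that a later rename moves into the leased directory.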
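Slide 31 names the phases of a DWL's life after the trigger: write-behind, opportunistic replay, a full queue, and forced replay on a lease break. A minimal sketch of that queue, assuming a hypothetical `send` callable that replays one op and a hypothetical `max_pending` cap. In this sketch a full queue forces a synchronous replay of the oldest op before admitting a new one.

```python
from collections import deque


class WriteBehindQueue:
    """Hypothetical client-side pending-ops queue for one DWL."""

    def __init__(self, send, max_pending: int):
        self.send = send                # replays one op to the server
        self.max_pending = max_pending  # cap on ops reported done but not yet sent
        self.pending = deque()

    def submit(self, op):
        """Write-behind: record the op and return to the application at once."""
        if len(self.pending) >= self.max_pending:
            self.send(self.pending.popleft())  # full queue: forced replay of the oldest op
        self.pending.append(op)

    def opportunistic_replay(self, budget: int):
        """Background drain of up to `budget` ops while the lease is still held."""
        for _ in range(min(budget, len(self.pending))):
            self.send(self.pending.popleft())

    def on_lease_break(self):
        """Forced replay: every queued op must reach the server before the ACK."""
        while self.pending:
            self.send(self.pending.popleft())
```

The full-queue path is where burst performance stops: once the cap is hit, each new op waits for one round trip again (slide 32).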
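Slide 33's commands are not independent: the creates need `foo` to exist, `rm foo/bar1` needs its create, and `mv foo quux` must follow everything inside `foo`. The sketch below uses an assumed, conservative rule: two ops conflict if any of their paths are equal or one is an ancestor of the other. It ignores cases such as hard links, and it is not necessarily how MagFS computes dependencies.

```python
from pathlib import PurePosixPath


def related(a: str, b: str) -> bool:
    """Paths conflict if they are equal or one is an ancestor of the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def dependencies(ops):
    """For each op, the indices of earlier ops it must wait for."""
    deps = []
    for i, (_, paths) in enumerate(ops):
        deps.append([j for j in range(i)
                     if any(related(p, q) for p in paths for q in ops[j][1])])
    return deps


# Slide 33's commands, with the brace expansion written out as three creates.
OPS = [
    ("mkdir foo", ["foo"]),
    ("touch foo/bar1", ["foo/bar1"]),
    ("touch foo/bar2", ["foo/bar2"]),
    ("touch foo/bar3", ["foo/bar3"]),
    ("mkdir foo/baz", ["foo/baz"]),
    ("rm foo/bar1", ["foo/bar1"]),
    ("mv foo quux", ["foo", "quux"]),
]
```

With this rule the three creates and `mkdir foo/baz` depend only on `mkdir foo`, `rm foo/bar1` depends on `mkdir foo` and `touch foo/bar1`, and `mv foo quux` depends on all six earlier ops.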
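Slide 34 exploits the parallelism that the dependencies leave open. One simple way to do that, assumed here rather than taken from the talk, is to group ops into "waves": each op's dependencies all lie in earlier waves, so the ops within one wave can be replayed concurrently. The dependency lists below are written out by hand for slide 34's commands using a path-overlap rule. `send` is a hypothetical callable that replays one op.

```python
from concurrent.futures import ThreadPoolExecutor


def waves(deps):
    """Group op indices so that every op's dependencies sit in earlier waves."""
    level = []
    for d in deps:  # deps[i] only lists indices < i
        level.append(1 + max((level[j] for j in d), default=-1))
    out = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        out[lv].append(i)
    return out


def replay_in_waves(ops, deps, send, workers: int = 8):
    """Replay each wave concurrently; finish a wave before starting the next."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wave in waves(deps):
            list(pool.map(send, (ops[i] for i in wave)))  # list() waits and re-raises errors


# mkdir foo; touch foo/bar1..3; mkdir foo/baz; rm foo/bar1; mv foo quux
DEPS = [[], [0], [0], [0], [0], [0, 1], [0, 1, 2, 3, 4, 5]]
```

For these seven ops the waves are [0], [1, 2, 3, 4], [5], [6]. Under the one-round-trip-per-wave assumption, replay drops from seven round trips to four.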
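Slide 36 separates the moment an op is reported complete to the application from the moment it is committed at the server. The ops in between are "at risk". A small bookkeeping sketch, assuming ops carry ids and the queue is transient (in memory). `CommitTracker` is a hypothetical name.

```python
class CommitTracker:
    """Tracks ops the application saw complete locally but the server has not committed."""

    def __init__(self):
        self.completed = []     # ids in application-visible completion order
        self.committed = set()  # ids the server has made durable

    def complete_locally(self, op_id):
        self.completed.append(op_id)

    def commit(self, op_id):
        self.committed.add(op_id)

    def at_risk(self):
        """Ops reported complete but not yet durable; with a transient queue,
        these would be lost if the client failed now."""
        return [op for op in self.completed if op not in self.committed]
```

Keeping `at_risk()` short is the goal the slide names. Opportunistic replay shrinks the window while the lease is still held.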
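Slide 38's script issues 10 mkdirs and 10 × 100 touches, 1,010 namespace operations in total. Assume exactly one round trip per operation; that is an assumption, since a real create may need more than one server exchange. Then a fully synchronous run cannot finish faster than the operation count times the RTT, and that floor grows linearly with latency.

```python
def sync_lower_bound_s(dirs: int, files_per_dir: int, rtt_ms: float) -> float:
    """Lower bound, in seconds, on a fully synchronous run of slide 38's script:
    one mkdir per directory plus one touch per file, one RTT each (assumed)."""
    ops = dirs + dirs * files_per_dir
    return ops * rtt_ms / 1000.0
```

At 50 ms this gives 1,010 × 0.05 s = 50.5 s of pure waiting. Write-behind under a DWL is meant to take that term off the application's critical path.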
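Slide 46's two optimizations can be applied to the pending queue before replay. `mv foo bar ; mv bar baz` collapses to `mv foo baz`, and `touch foo ; rm foo` cancels out. The sketch below merges only adjacent ops. It assumes each queued `create` made a brand-new name and each rename's destination did not already exist, which the client can check against its cached enumeration. If `bar` had existed, the first rename would have replaced it, and collapsing would lose that effect. Side effects such as directory timestamps are ignored. The tuple encoding of ops is hypothetical.

```python
def optimize(queue):
    """Collapse adjacent rename chains and cancel adjacent create/remove pairs."""
    out = []
    for op in queue:
        prev = out[-1] if out else None
        if prev and prev[0] == "create" and op == ("remove", prev[1]):
            out.pop()  # touch foo ; rm foo -> nothing to send
            continue
        if prev and prev[0] == "rename" and op[0] == "rename" and op[1] == prev[2]:
            out.pop()  # mv foo bar ; mv bar baz -> mv foo baz
            if prev[1] != op[2]:  # mv a b ; mv b a cancels entirely
                out.append(("rename", prev[1], op[2]))
            continue
        out.append(op)
    return out
```

Ops that are never sent cost no round trips, which is the point of the "send no data" remark in the notes.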

Editor's Notes

  1. Hi everyone, my name is Deepti Chheda and I’m a Staff Engineer at Maginatics, and this is my colleague Nate. Today we are going to show you how we considerably sped up small-file, metadata-heavy workloads using Directory Write Leases in the Maginatics Distributed File System, also known as MagFS.
  2. Let me give you a brief overview of the Maginatics Cloud Storage Platform (MCSP). It’s an enterprise storage platform built on top of object stores: a strongly consistent, geo-distributed platform designed with a strong focus on security and mobility. MCSP is not a storage gateway. At the heart of this platform lies the Maginatics File System, MagFS, which uses its own proprietary protocol and hence is not compatible with SMB/NFS. However, MagFS is POSIX compliant, so your applications should run seamlessly against MagFS without needing any modifications.
  3. A quick look at the architecture of MagFS. We have a clean metadata/data separation: clients read and write data directly to the object stores, which lets them take advantage of the scalability of the underlying object stores. Clients communicate with the metadata server using our proprietary WAN-optimized protocol. The server provides one consistent view of the file system at all times by synchronizing access across all clients.
  4. In practice, the previous slide looks more like this, with clients accessing the server across low-bandwidth, high-latency networks.
  5. Most distributed file systems have strong consistency requirements, so clients need to make synchronous calls to the server to enforce consistency. Each synchronous op incurs a network round trip. At WAN latencies this becomes expensive and leads to poor performance, which makes the file system almost unusable over a WAN.
  6. Let’s take a step back: how do traditional network or distributed file systems try to alleviate this problem? Any guesses?
  7. Leases allow clients to cache files locally for a period of time. … Leases provide performance improvements without sacrificing strong consistency guarantees. MagFS employs a similar caching/leasing mechanism; in fact, leases in SMB2 and delegations in NFS are good examples of this general concept.
  8. Let’s take SMB leases as an example to get a closer look at how they work.
  9. Don’t forget about the NONE Lease State
  10. Some of the most common FS ops can be optimized using leases: file data operations, and some of the most common metadata operations like readdir and stat. But you might notice that the namespace-modifying ops are absent.
  11. To provide strong consistency guarantees, each of these ops needs to be a synchronous call to the server.
  12. To visualize the impact, take a look at this graph. This workload performs 8000 creates and deletes. At LAN speeds it takes 4 minutes, but at even a modest 50 msec of latency it takes close to an hour! 50 msec is the latency I have when I’m working from home, in San Francisco! And for our colleague on the east coast it would take up to 4 hours!
  13. So metadata ops are clearly a problem. If we could satisfy them locally, much like data operations, we could see a significant improvement. The key question is…
  14. http://imgc.allpostersimages.com/images/P-473-488-90/57/5788/5ABOG00Z/posters/yes-we-can-rosie-the-riveter.jpg Otherwise this would have been a boring talk!
  15. Using Directory Write Leases in MagFS we were able to successfully hide network latencies and significantly improve such workloads
  16. The rest of this talk will focus on how we achieved this. http://copeco.com/blog/wp-content/uploads/2012/11/folder-lock-logo.jpg
  17. Let’s dive deeper into the semantics of such a lease state: how it can provide strong consistency guarantees in a distributed file system like MagFS while still allowing the client to perform all kinds of magic to hide the network latency from the application. http://dspace.mit.edu/bitstream/handle/1721.1/36365/24-973Spring-2003/NR/rdonlyres/Global/7/7C1F3EE3-0A12-4EE5-B4E2-76FB01AC52BD/0/CHP_Semantics1.jpg
  18. MagFS clients are allowed to hold write leases on directories, much like files
  19. Let’s draw an analogy with file write leases to see what a directory lease would look like.
  20. First of all, the client must explicitly request a DWL. Because this is an exclusive lease, the client must intelligently figure out when to request it, to avoid contention. Unlike with file leases, the access mask is not a good indicator. Instead, the client must detect a pattern in the application’s behavior (e.g., creates and deletes) that might benefit from holding a DWL, and upgrade the lease from RH to RWH in the background. Second, the server needs to ensure that no other client is accessing this directory or namespace. http://www.urbanathlete.tv/wp-content/uploads/2013/04/12/working-out-for-the-weekend-68/key.jpg
  21. The client might have to vend out “fids” using an inode-number reservation scheme, and might need to maintain a local-fid-to-remote-fid mapping. Each cached “op” is recorded in a pending-ops queue. Should that queue be transient or persistent? There must be an upper limit on the maximum number of pending ops to ensure the queue can be drained within the lease break interval.
  22. The client may satisfy creates, renames, and deletes locally for a directory if it holds a DWL on it.
  23. To do so, the client must perform all the checks the server would have performed: parameter and pathname validation, existence checks, access checks, sharing-violation checks, etc. The client needs to have previously cached the entire directory enumeration to perform these checks correctly.
  24. Newly created files and directories automatically assume a DWL, since the new child is not yet visible to any other client and hence is implicitly exclusive.
  25. Now, what happens if user2 tries to access file ‘foo’?
  26. That can be a slippery slope! http://www.officesafety.co.uk/attachments/products/917/2t6yp_product_image_medium.jpg
  27. Let’s define some lease break conditions to ensure consistency in the system. New opens need to traverse the path, breaking any lease on an intermediate directory, because the namespace could have been modified.
  28. This example demonstrates how we can ensure full consistency in the system by following the Directory Write Lease semantics outlined in this talk. However, the client can realize the true power of this lease only if it efficiently maximizes the gains from it. At this point, I’m going to hand off to my colleague Nate, who is going to talk about how the client
  29. The earlier portion of the talk covered the client’s responsibilities under the Directory Write Leasing mechanism: enforcing the file system consistency semantics and ensuring security, integrity, and in general correct file system behavior, “as if DWL were not in use”. The primary focus of DWL is performance, both burst performance and sustained performance.
  30. Talk through burst performance optimization
  31. Talk through how (1) there are dependencies on the pending operations
  32. Mention the fsync support here
  33. Re-emphasize the high-level bit. Talk about the experimental procedure: draining outstanding pending operations and accounting for it.
  34. Mixed creation + data workload. XXX: need to determine the linear effect; the shape will be good for this.
  35. SMB has better performance at lower latencies
  36. No panacea: WAN is painful
  37. http://www.sxc.hu/photo/1009933
  38. Fastest way to send data around the world is to send no data
  39. Probably will not get to this… it could happen!
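Note 20 says the access mask is a poor signal for directories. Instead, the client should spot a pattern such as a run of creates and deletes, then request the RH -> RWH upgrade in the background. One assumed way to do that is to count namespace-modifying ops per directory in a sliding time window. The threshold, the window, and the `request_upgrade` callback are made-up knobs for illustration, not MagFS parameters.

```python
from collections import defaultdict, deque


class DwlTrigger:
    """Requests a background RH -> RWH upgrade once a directory sees a burst
    of namespace-modifying ops. Threshold and window are hypothetical."""

    NAMESPACE_OPS = {"create", "mkdir", "rename", "unlink", "rmdir"}

    def __init__(self, request_upgrade, threshold: int = 5, window_s: float = 1.0):
        self.request_upgrade = request_upgrade  # e.g. schedules an async upgrade request
        self.threshold = threshold
        self.window_s = window_s
        self.recent = defaultdict(deque)        # directory -> timestamps of recent ops
        self.requested = set()                  # directories already asked for

    def observe(self, directory: str, op: str, now: float) -> None:
        if op not in self.NAMESPACE_OPS or directory in self.requested:
            return
        times = self.recent[directory]
        times.append(now)
        while times and now - times[0] > self.window_s:
            times.popleft()                     # forget ops outside the window
        if len(times) >= self.threshold:
            self.requested.add(directory)
            self.request_upgrade(directory)
```

Requesting only after a burst, and only once per directory, is meant to limit contention on what is an exclusive lease.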
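Note 21 raises two bookkeeping issues: handing out file ids ("fids") for locally created objects, and capping the pending-ops queue so it drains within the lease break interval. This sketch assumes the server has reserved a block of inode numbers for the client. It also assumes that a replay wave of `parallel` ops costs one RTT. `FidReservation`, `max_pending_ops`, and every number used below are hypothetical.

```python
class FidReservation:
    """A block of inode numbers the server has (hypothetically) reserved for this client."""

    def __init__(self, start: int, count: int):
        self.next, self.end = start, start + count
        self.local_to_remote = {}  # filled in as replayed creates are acknowledged

    def vend(self) -> int:
        """Hand out the next reserved fid for a locally created object."""
        if self.next >= self.end:
            raise RuntimeError("reservation exhausted: stop caching or reserve more")
        fid, self.next = self.next, self.next + 1
        return fid

    def record_remote(self, local_fid: int, remote_fid: int) -> None:
        """Remember the server's id when it differs from the locally vended one."""
        self.local_to_remote[local_fid] = remote_fid


def max_pending_ops(break_interval_ms: int, rtt_ms: int, parallel: int) -> int:
    """Largest queue that can drain before the break interval expires,
    assuming each replay wave of `parallel` ops costs one RTT."""
    return (break_interval_ms // rtt_ms) * parallel
```

For example, a 30,000 ms break interval at 150 ms RTT with 8-way parallel replay allows (30000 // 150) × 8 = 1600 pending ops. Integer milliseconds keep the floor division exact.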
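Notes 22 to 24 say a DWL holder may satisfy a create locally only after performing the checks the server would perform, using a fully cached directory enumeration. The new child then implicitly falls under the lease. A sketch under stated assumptions: `CachedDirectory` is a hypothetical structure, and a single `writable` flag stands in for real access checks. Sharing-violation checks are omitted. Real systems differ on the exact errno for some invalid names.

```python
import errno


class CachedDirectory:
    """Client-side view of one directory held under a DWL (hypothetical structure)."""

    NAME_MAX = 255  # typical POSIX name limit; an assumption here

    def __init__(self, entries, fully_enumerated: bool, writable: bool):
        self.entries = set(entries)
        self.fully_enumerated = fully_enumerated
        self.writable = writable      # stands in for a real access check
        self.implicit_dwl = set()     # children created under this lease

    def local_create(self, name: str):
        """Return 0 on local success, an errno on local failure,
        or None to fall back to the server when the cache cannot decide."""
        if not self.fully_enumerated:
            return None               # existence checks need the full listing
        if name in ("", ".", "..") or "/" in name or "\0" in name:
            return errno.EINVAL       # sketch's choice of errno for bad names
        if len(name.encode()) > self.NAME_MAX:
            return errno.ENAMETOOLONG
        if name in self.entries:
            return errno.EEXIST
        if not self.writable:
            return errno.EACCES
        self.entries.add(name)
        self.implicit_dwl.add(name)   # invisible to other clients, so implicitly exclusive
        return 0
```

Returning None instead of guessing keeps the client honest: without the full enumeration it cannot prove a name is absent, so the create goes to the server.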