Damon Edwards, co-founder of Rundeck, presents at Interop ITX in Las Vegas on May 3, 2018.
See a Demo of Rundeck Enterprise:
https://www.rundeck.com/see-demo
--or--
Download Rundeck Open Source here:
https://rundeck.com/open-source
Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
GitHub: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: https://www.linkedin.com/company/rundeck-inc
6. Deployment doesn’t make us money. Operations does.
Deployment isn’t the goal. But we treat it like it is.
8. Deployment doesn’t make us money. Operations does.
Deployment isn’t the goal. But we treat it like it is.
Operations rarely gets to transform. Time to start.
16. 12:00pm
Bridge Call (SREs, Foo SRE, Biz Manager): “It’s a problem with the Foo service”
“Can you fix it?” “No.”
NOC (Bob): Update Ticket, + add Foo Lead Dev
Ticket Context Wagon: NOC (Bob), Biz Manager, Foo SRE
18. Foo Lead Dev (Karen): “I’m going to need more log files”
Update Ticket: + add SysAdmin Team
Chat: “Can someone with access to Foo Service in Prod01 help me with ticket #42516?”
SysAdmin (Lee), Ticket: “logs attached”
Foo Lead Dev (Karen), Ticket: “no the other ones”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee)
19. 2:00pm
Foo Lead Dev (Karen), Logs:
-Who restarted these services? (and why?)
-They didn’t use the correct environment variables!
-This entire service pool needs to be restarted!
Update Ticket
NOC (Bob): Update Ticket, + add Middleware Team
“Middleware, please urgent restart this entire app pool with the correct environment variable”
20. 2:30pm
NOC (Bob), Middleware Manager (Melissa): “No way. It’s the middle of the day! You need business approval.”
NOC (Bob): Update Ticket, + add SVP for Line of Business
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager
21. 5:00pm
Update Ticket: + add SVP for Line of Business
SVP (Susan), Chief of Staff, Tech VP, Tech VP: “Customer impact?”
Update Ticket: “Restart approved”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP
22. 5:00pm
Middleware Manager (Melissa): “Who knows these production services the best?”
Middleware, Middleware (Scott): “Ellen!”
Ellen (to Europe office)
Middleware (Scott): SharePoint, “Trial and error.doc”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott)
23. 6:00pm
Middleware (Scott): SharePoint, “Trial and error.doc”
Middleware (Scott): Bar Service, 10 min
Middleware (Scott): Bar Service, Waiting for Acme Service
Acme startup failed
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott)
27. 6:45pm
Middleware (Scott): Update Ticket, + add Bar SRE
-Bar app startup timed out. Error says can’t connect to Acme service.
-I looked at Acme but it seems to be running
-Is this error message correct? Why can’t Bar connect?
Bar SRE (Linda), Middleware (Scott): “The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.”
Update Ticket: + add Network SRE Team
-URGENT: Network connection issue between Bar and Acme
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda)
28. 6:45pm
Bar SRE (Linda), Middleware (Scott): “The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.”
Update Ticket: + add Network SRE Team
-URGENT: Network connection issue between Bar and Acme
Business Managers: “Customers are calling. What is going on?”
Bar Lead Dev (Liu): “I can comment out the test… But the CD pipeline only goes to QA ENV!”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda)
29. Network Dir (Carlos), Middleware (Scott), Middleware Manager (Melissa): “Carlos, I need a favor. Can you escalate?”
Business Managers: “Customers are calling. What is going on?”
Last week..
Net SREs, VP, VP: “Priority!” “Different Incident!” “It’s the network!”
Business Managers: “Your network is broken!”
Business Managers, Network VPs: “We are already working on it!”
30. 8:00pm
Network SRE (Hari): “The firewall is blocking the traffic. You’ll have to take it up with the Firewall Team”
Open Firewall Ticket: + add Firewall Team
-URGENT: Firewall is blocking connection between Bar and Acme
Paging on-call… Open bridge…
Firewall Engineer (Freddie): “Can’t be the firewall, it hasn’t changed since last Thursday.”
Middleware (Scott): “No, it’s the firewall.”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob)
31. 8:00pm
Firewall Engineer (Freddie): “Can’t be the firewall, it hasn’t changed since last Thursday.”
Middleware (Scott): “No, it’s the firewall.”
Firewall Engineer (Freddie): “There was a rule change last Thursday that would stop Bar from talking to Acme.”
Middleware (Scott): “Can you change it back?”
Firewall Engineer (Freddie): “Sure we make changes on Thursday…”
Chief of Staff: “SVP and VPs are livid… this was supposed to be a safe change!! Freddie, we’ve got customers calling.”
Firewall Engineer (Freddie): Update Firewall Ticket
32. 9:00pm
Firewall Engineer (Freddie): Update Firewall Ticket, + add NetSec
ESCALATE: Emergency production firewall rule change review
Paging on-call…
NetSec (Nicole): “This is production so I’ll have to get others on the Network CAB…”
Chief of Staff, Firewall (Freddie), Middleware (Scott), Middleware Manager, VP, VP, Bar Lead Dev: “Customer outage!” “… I’ll call SVP Susan”
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP
33. 9:30pm
NetSec (Nicole): Update Firewall Ticket
APPROVE: Emergency firewall rule change
Chief of Staff, Firewall (Freddie), Middleware (Scott), Middleware Manager, VP, VP, Bar Lead Dev: “Customer outage!” “… I’ll call SVP Susan”
Firewall (Freddie), Net L2 (Bob), Middleware (Scott): Firewall change, Restart Bar
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole)
36. 11:30pm
Ticket: “APIs OK”
Middleware (Scott): Update Ticket, “Services restarted OK”
NOC: “Lights are green… I guess it is fixed.”
NOC (Bob): Close Ticket. Zzz
Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole), Cust. Engmt. (Varsha)
38. NOC: “Lights are green… I guess it is fixed.”
NOC (Bob): Close Ticket. Zzz
Next Day
SVP (Susan), VPs, Dirs: “Whose fault is this?! Why are we so bad at change? What additional processes and approvals are you adding to never let this happen again?!”
40. Executive Team: “We’ve invested in Cloud, Agile, DevOps, Containers… Why does everything still take too long and cost too much?”
Our transformation has largely ignored Ops
47. Waiting, Defects, Motion/Manual, Task Switching, Partially Done, Extra Process
56. Where are decisions made? Who can take action?
1° → escalate → 2° → escalate → 3° → escalate → 4°
57. All work is contextual
John Allspaw, DevOps Enterprise Summit San Francisco 2017
58. All work is contextual
rm -rf $PATHNAME
John Allspaw, DevOps Enterprise Summit San Francisco 2017
59. All work is contextual
rm -rf $PATHNAME
Is this dangerous?
John Allspaw, DevOps Enterprise Summit San Francisco 2017
64. All work is contextual
rm -rf $PATHNAME
Is this dangerous? Answer is always “it depends”
John Allspaw, DevOps Enterprise Summit San Francisco 2017
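The “it depends” point can be made concrete with a small sketch. This is not from the talk: the function name `assess_removal`, the `SCRATCH_ROOTS` list, and the verdict strings are all hypothetical, and the code only classifies a path without deleting anything. It shows that the same command gets a different answer depending on what `$PATHNAME` holds, and that most cases still need someone with context.

```python
from pathlib import PurePosixPath

# Hypothetical policy: directories assumed to hold only disposable files.
SCRATCH_ROOTS = (PurePosixPath("/tmp"), PurePosixPath("/var/tmp"))


def assess_removal(pathname: str) -> str:
    """Classify an `rm -rf $PATHNAME` request by the context of its target.

    Nothing is deleted here; the function only returns a verdict string.
    """
    if not pathname.strip():
        # An unset or empty variable means the command has no real target.
        return "refuse: PATHNAME is empty"
    path = PurePosixPath(pathname)
    if not path.is_absolute():
        # The outcome depends on the working directory at run time.
        return "it depends: relative path, result depends on the working directory"
    if path == PurePosixPath("/"):
        return "refuse: target is the filesystem root"
    if any(root in path.parents for root in SCRATCH_ROOTS):
        return "likely safe: target is inside a scratch directory"
    # Everything else needs a person who knows what lives at that path.
    return "it depends: needs someone who knows what lives there"


if __name__ == "__main__":
    for candidate in ["", "/", "build", "/tmp/build-123", "/srv/foo/data"]:
        print(repr(candidate), "->", assess_removal(candidate))
```

Even in this toy version, only the trivial cases get a firm answer; the rest fall through to “it depends”, which is the case where the person closest to the work has the context to decide.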
65. Where are decisions made? Who can take action?
1° → escalate → 2° → escalate → 3° → escalate → 4°
Context: They have all of the context
But, decisions are made here?
66. “Shift Left” the ability to take action
Push the ability to take action this direction
1° → escalate → 2° → escalate → 3° → escalate → 4°
77. What gets in the way?
Silos
Queues
Toil
[Diagrams: Silo A (“I need X”) and Silo B (“I do X”), each with its own Backlog, Information, Priorities, and Tools; Requests for X; a Ticket Queue between Silo A and Silo B]
80. Silos
Silo A (“I need X”): Backlog, Information, Priorities, Tools
Requests for X
Silo B (“I do X”): Backlog, Information, Priorities, Tools
81. Silos cause disconnects and mismatches
Silo A (“I need X”): Backlog, Information, Priorities, Tools
Requests for X
Silo B (“I do X”): Backlog, Information, Priorities, Tools
Each silo has its own: Context, Process, Tooling, Capacity
82. Function A
Function B
Function C
Becomes siloed labor pools of functional specialists
Requests fulfilled by semi-
manual or manual effort
Primary management focus is
on protecting team capacity
83. How do we cover for our silos’ disconnects and mismatches?
Silo A, Silo B
84. How do we cover for our silos’ disconnects and mismatches?
Silo A, Ticket Queue, Silo B
85. We all know how well that works
Silo A, Ticket Queue, ??, Silo B
86. Ticket queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
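The cycle-time effect can be illustrated with Little's Law, a standard queueing result that is not shown on the slide: over the long run, in a stable queue, average time in the queue equals average number of items in the queue divided by average throughput. The sketch below is a minimal illustration under that assumption; the function name and the ticket counts are hypothetical, not figures from the talk.

```python
def average_wait_days(tickets_in_queue: float, tickets_closed_per_day: float) -> float:
    """Little's Law rearranged: W = L / lambda.

    tickets_in_queue       -- long-run average number of open tickets (L)
    tickets_closed_per_day -- long-run average completion rate (lambda)
    Returns the average number of days a ticket spends in the queue (W).
    """
    if tickets_closed_per_day <= 0:
        raise ValueError("throughput must be positive")
    return tickets_in_queue / tickets_closed_per_day


if __name__ == "__main__":
    # Hypothetical team that closes 15 tickets a day.
    for backlog in (30, 120, 240):
        days = average_wait_days(backlog, 15)
        print(f"{backlog} open tickets -> about {days:.1f} days average wait")
```

With throughput fixed, doubling the backlog doubles the average wait, which is one way to read “Longer Cycle Time” in the list above.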
88. Ticket queues become “snowflake makers”
Silo A, Ticket Queue, ??, Silo B
Snowflakes (each unique, technically acceptable but unreproducible and brittle)
89. Silos + Ticket Queues = Locked into a broken system
Unreproducible Snowflakes
More Outages
Longer Lead Times / Unpredictability
Siloed Labor Pools
Ticket Queues
Diminished Labor Capacity
Brittle Environments
Errors
Error-Prone Manual Tasks
Interrupt project work
Information Mismatches / Miscommunication
Silo-Specific Optimization
Handoffs With Capacity, Pace, Priority Mismatches
“Managing to the Queue”
Working out of Context
Management Strategy: “Align by Functional Capability”
91. Excessive toil prevents fixing the system
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau
Google
92. Excessive Toil prevents fixing the system
Toil at manageable percentage of capacity: Toil | Engineering Work (Reduce toil, Improve the business)
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): Toil | E.W. (No capacity to reduce toil, No capacity to improve business)
95. Obvious: Get rid of as many silos as possible
Old Silo A Old Silo B Old Silo C Old Silo D
96. Obvious: Get rid of as many silos as possible
Old Silo A, Old Silo B, Old Silo C, Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
“Horizontal” shared responsibility is key feature!
97. But what about the cross-cutting concerns?
Old Silo A, Old Silo B, Old Silo C, Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Specialist Capabilities (x3)
98. But what about the cross-cutting concerns?
Old Silo A, Old Silo B, Old Silo C, Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Specialist Capabilities (x3)
Ticket Queue (x3)
99. But what about the cross-cutting concerns?
Old Silo A, Old Silo B, Old Silo C, Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Specialist Capabilities (x3)
Ticket Queue (x6)
100. Operations as a Service: Turn handoffs into self-service
Cross-Functional Product Team 1, Ops (embedded)
Cross-Functional Product Team 2, Ops (embedded)
Cross-Functional Product Team n, Ops (embedded)
Operations as a Service: On Demand (x4), Ops (builds & operates)
Ops Capability (x3): SRE, Dev, or Specialist
101. Operations as a Service: Works with any org model
Development Team 1
Development Team 2
Development Team n
Ops/SRE Team
Operations as a Service: On Demand (x4), Ops (builds & operates)
Ops Capability (x3): SRE, Dev, or Specialist
105. Use tickets only for what they are good for
1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general purpose work management system!
Ticket System
106. Security or compliance “in the way”?
Cross-Functional Product Team 1, Ops (embedded)
Cross-Functional Product Team 2, Ops (embedded)
Cross-Functional Product Team n, Ops (embedded)
Operations as a Service: On Demand (x4), Ops (builds & operates)
Ops Capability (x3): SRE, Dev, or Specialist
Build in Security Here
Build in Compliance Here
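One way to picture security and compliance being built into the self-service layer, rather than enforced through a ticket handoff, is sketched below. It is an illustrative toy, not Rundeck's API or the speaker's implementation: the class names, role names, and the `restart-app-pool` operation are hypothetical, and the restart action is a stand-in that only returns a string.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class SelfServiceOperation:
    """A pre-approved procedure that specialists publish for others to run."""
    name: str
    allowed_roles: frozenset
    action: Callable[[str], str]


@dataclass
class OperationsService:
    """Hypothetical on-demand layer: access control plus an audit trail."""
    operations: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, op: SelfServiceOperation) -> None:
        self.operations[op.name] = op

    def run(self, op_name: str, user: str, role: str, target: str) -> str:
        op = self.operations[op_name]
        allowed = role in op.allowed_roles
        # Compliance: every attempt is recorded, whether or not it is allowed.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "operation": op_name,
            "target": target,
            "allowed": allowed,
        })
        # Security: the policy is checked in the service, not by a human gatekeeper.
        if not allowed:
            raise PermissionError(f"{role} may not run {op_name}")
        return op.action(target)


if __name__ == "__main__":
    service = OperationsService()
    service.register(SelfServiceOperation(
        name="restart-app-pool",
        allowed_roles=frozenset({"developer", "sre"}),
        action=lambda target: f"restarted {target}",  # stand-in for the real procedure
    ))
    print(service.run("restart-app-pool", "karen", "developer", "foo-prod01"))
    print(len(service.audit_log), "audit entries")
```

In this shape the specialists define the procedure and the policy once, and the person with the context can act on demand while the audit record is produced as a side effect.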
111. Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
Bonus: Use Service Level Objectives, Error Budgets, and other lessons from SRE
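The steps above can be sketched in a few lines. This is a minimal illustration, not a method given in the talk: the team names and hours are made up, the 50% default limit is an assumption (it echoes the goal described in Google's SRE book, but each organization would set its own), and the error budget formula is the common “allowed failures minus observed failures” reading of an SLO.

```python
from dataclasses import dataclass


@dataclass
class TeamToil:
    """Toil tracked for one team over one period (hypothetical data shape)."""
    team: str
    toil_hours: float
    total_hours: float

    @property
    def toil_fraction(self) -> float:
        return self.toil_hours / self.total_hours


def teams_over_limit(teams, limit=0.5):
    """Step 2 and 3: apply a toil limit and list teams that need funding first."""
    return [t.team for t in teams if t.toil_fraction > limit]


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Bonus: failures still allowed under the SLO (negative means overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    return allowed_failures - failed_requests


if __name__ == "__main__":
    # Step 1: track toil levels for each team (made-up numbers).
    tracked = [
        TeamToil("middleware", toil_hours=30, total_hours=40),
        TeamToil("foo-sre", toil_hours=12, total_hours=40),
    ]
    for t in tracked:
        print(f"{t.team}: {t.toil_fraction:.0%} toil")
    print("over limit:", teams_over_limit(tracked))
    remaining = error_budget_remaining(0.999, 1_000_000, 300)
    print(f"error budget remaining: about {remaining:.0f} failed requests")
```

The useful part is less the arithmetic than the habit: once toil is measured per team against an explicit limit, funding decisions for toil reduction have something concrete to point at.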
112. Recap
Don’t forget about Ops.
Challenge conventional wisdom.
Leverage the Operations as a Service design pattern
“Shift-Left” control and decision making.
Understand the cost of silos and ticket-driven request queues
Focus on removing silos and queues
Learn from SRE: Reduce toil to create capacity to change