Hey everybody! I'm Jon Boutelle, cofounder of SlideShare. SlideShare is THE site for sharing presentations online. It's the best way to share PowerPoint, PDF, and OpenOffice documents. It works kinda like YouTube: you upload the document, we convert it to Flash, and you can embed it back into your blog or social networking site. SlideShare uses pretty much every Amazon web service that is available. All the slideshows and all the original documents are saved on Amazon S3, and only on S3. I'm going to talk a little bit today about WHY.
And it wasn't because S3 is cheap on a per-gigabyte basis. We didn't even think twice about the price before deciding to use it. Why was that? Was it because we're a venture-backed startup that has money to burn and is used to paying lots of money for stuff? Au contraire. We're a scrappy startup, so we're incredibly thrifty. We're so cheap that my entire company uses gtalk instead of telephones for voice communication. So cheap that the first 3 servers we rented from our hosting provider were out-of-spec Celeron boxes. So cheap that our first office was a one-room deal that was super-cheap because it didn't have any windows. A paradox? We cared a lot about MONEY. We didn't care so much about PRICE. The specific price of Amazon services was pretty far down the list of priorities in our decision making. We cared a lot more about WHEN and under WHAT CONDITIONS we would have to pay. We cared about pushing a complex engineering problem onto an outsourced vendor so that we wouldn't have to deal with it. But the actual price wasn't very important to us.
When I heard about S3, I knew that we had to try it to see if it would work for us. Why was I so interested? A cursory inspection of their business model told me that we would have to pay only minimal bills until we launched and started gaining traction with users. Since there's no up-front cost and you only pay for the amount of the service you use, during the test phase your S3 bills will be minimal: 20 or 30 bucks, like a phone bill. This was a couple of months before launch. So assuming we launched in two months, and got traction in another two, I was only saving for four months. Why was I so excited about that? Well, you guys know what happens when you assume, don't you? Anybody? "You make an ass out of you and me." Right. Assuming you're going to launch in two months, assuming you're going to get traction in another two is dangerous. Because most IT projects fail, and most businesses fail.

FAIL

I don't know exactly what the failure rate is for consumer web businesses, but I wouldn't be surprised if it was worse than starting a restaurant: and restaurants fail about 80% of the time. The restaurant in this photo is the fourth one in this location in three years. The problem is essentially the same one: you can hang up your shingle, but customers won't necessarily come in the door. So failure is the norm, not the exception.

COPING WITH POTENTIAL FAILURE

Now failing sucks. It's depressing to talk about. But if you're in an industry where it happens so damn frequently, then you can't just ignore it. So you want to spend as little money as possible before you find out whether people are going to come in the door and order food. S3 let me try my idea out in the cheapest way possible: with "no money down". If no one had uploaded anything to SlideShare, if no one had visited my restaurant, then my S3 bill would have been pretty close to zero. You only pay a real amount of money if you actually get users.
That's because there was no upfront cost to start using S3, and the ongoing cost is directly tied to actual use.

REDUCING

Reducing the cost of failure sounds like a depressing advantage. But it's not: the effect on the individual entrepreneur is inspiring. It makes us braver: more willing to try ideas, less dependent on outside money. I had no idea whether SlideShare was going to be a success before I launched, but AWS helped reduce the cost of trying to the point where I could afford it *personally*. And that's the key. If the cost of failure is brought low enough, then individual hackers can afford to start up businesses on a couple of credit cards. No VCs or even angels are needed. That's awesome. This has been happening with web-based businesses already. It used to take 5 million dollars to build a web app and see if people would use it. Open source and Moore's Law have brought that price way, way down. Amazon web services are the next leap forward in terms of reducing startup costs for web businesses. When we launched, we had 4 servers. Three of them were out-of-spec Celerons. The fourth was an out-of-spec P4 (that one was the database server). We prepaid for all of them and got them for about $5,000 for the year. Throwing in load balancing services and a firewall took the price up to about $8,000. That's definitely a small enough infrastructure to fit on one credit card.
Now of course Murphy's Law is in effect, so you plan your best for failure, and then success comes and smacks you in the ass. The day our site went out of private beta, we were lucky enough to be TechCrunched. Everything you've heard about TechCrunching is true: it's a huge wave of traffic hitting your site all at once. Michael had interviewed us the night before, and was particularly nervous about putting a SlideShare embed into his site. He was all "You have no idea of the amount of traffic I get. The first time I covered YouTube, I embedded a video and it brought their site down." I told him that we were using Amazon for hosting and that it wouldn't be a big deal. And it wasn't. We had lots of problems on that first day, but serving up the slideshows was not one of them. Ten times as much traffic was hitting Amazon's servers as was hitting ours. Michael's embed was being served directly from S3, and wasn't our problem at all. In fact, all of our servers could have crashed and his embed would still work. That's the beauty of a solution that scales automatically, without you even having to think about it.
As a site's traffic increases, it starts to need additional resources. A lot of people think "big deal, add some more hard drives and some more servers". But making a cluster that will serve up a large volume of binary content, and will save a massive and ever-expanding collection of media, is not a simple job. Everything takes time, takes resources, and comes with its own technical risk. A sudden spike in demand like the TechCrunching we experienced might end up being impossible to plan for, and if you did plan for it you might be left with excess infrastructure that you don't need. Plus, every change to your system is a chance for things to break. Murphy's Law applies! Anything that can break will break. Using S3 means that you can just ignore the problem of building a scalable infrastructure for saving and serving binary files to your users: that's Amazon's job. This means you can focus on listening to your users, fixing bugs, and adding features, instead of dealing with infrastructure headaches. This is really important, because you want to grow the site's community as fast as possible before the competition shows up. Anyway, I don't know how you value the ability to offload a problem like this onto a reliable vendor. I'm a techie, not an MBA. But I know it's worth a lot. Think about it: even assuming money and time are no object, what are the odds that you're going to solve this problem better than Amazon did?
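For flavor, here's roughly what "serve it directly from S3" looks like in code. This is a minimal sketch using boto3, the current Python SDK (SlideShare predates it, so this is not our actual code); the bucket name and key are hypothetical, and the public URL assumes the bucket policy allows anonymous reads:

```python
def public_url(bucket, key, region="us-east-1"):
    """Build the public HTTPS URL for an object, assuming the bucket
    allows anonymous reads. This URL is what you'd hand to an embed."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def upload_slideshow(bucket, key, path):
    """Upload a converted slideshow to S3 and return its public URL.
    Needs AWS credentials in the environment to actually run."""
    import boto3  # imported here so the URL helper above works without the SDK
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)
    return public_url(bucket, key)
```

Once the file is up, serving it is entirely Amazon's problem: the embed points at the S3 URL, not at your servers.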
You pay for Amazon services after you use them, by credit card. This isn't because they're extra nice or anything: since you are paying for use, they need to measure your use before they can charge you. But it's very different from traditional dedicated hosting, where you have to pay for the month of usage in advance. Amazon bills your credit card at the end of each month for the services you used that month. By happy coincidence, this is usually right after your credit card payment was due. So your credit card payment for the Amazon services isn't due until the end of the NEXT month. That's an average of 45 days after you used the service. Most ways of monetizing your site have some latency as well. So you probably don't get to keep that 45 days of float. But you can use it to make sure you get paid before you have to pay Amazon. For example, let's say you have AdSense ads on your site. Google will pay you by automatic bank transfer on the 22nd of the month, for the previous month's ad impressions. That means that the money you made from last month will arrive in your bank account before your credit card payment for Amazon is due! And THAT means that a business where the only cost was Amazon services, and the only revenue was from Google AdSense, would be inherently cash-flow positive. That's assuming you can make at least 20 cents worth of ad revenue for each gig of content you serve up. Whatever your monetization strategy, as long as you can get the money to your bank account on average within 45 days of delivering services, your business will be inherently cash-flow positive. Again, this means you are less likely to need VC money. Money is oxygen for a business. If you run out of money and credit for even a few weeks, you're dead. So cash flow matters a lot more than price to your average startup.
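To make the 20-cents-per-gig claim concrete, here's a toy calculation. The $0.20/GB transfer price is an assumption standing in for early S3-era bandwidth pricing, and the revenue numbers are made up; the point is just that the sign of the result, not the timing, is what you need to worry about, because AdSense pays you before the Amazon card payment is due:

```python
# Assumed price per GB served (illustrative, roughly early-S3-era bandwidth)
TRANSFER_PRICE = 0.20


def breakeven_revenue_per_gb(transfer_price=TRANSFER_PRICE):
    """Minimum ad revenue per GB served for AWS costs to be covered."""
    return transfer_price


def monthly_cash_position(gb_served, revenue_per_gb, transfer_price=TRANSFER_PRICE):
    """Net cash for one month of serving content. Because AdSense lands in
    your account (22nd of the next month) before the AWS credit card payment
    is due (end of the next month), a positive result here means the
    business is cash-flow positive, not just profitable on paper."""
    return gb_served * (revenue_per_gb - transfer_price)
```

So at 25 cents of revenue per gig served, every gig adds a nickel to the bank account before Amazon ever gets paid.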
My main point today is that price per gigabyte is NOT what you should care about when you're making the decision of whether or not to use Amazon Web Services. It's quite likely that you can get a cheaper price on bandwidth if you shop around. You have to look at the time-to-market advantages of NOT building your own solution. You have to look at the ongoing benefit you'll get from reducing the complexity of your own system. And you have to look at the fact that Amazon will only cost you money if your site takes off, and that the bill is due after you've received your revenue, rather than before. This combination of cash flow properties and time savings makes Amazon web services a secret weapon for early-stage self-funded startups. It means you can delay taking VC money until you have a proven concept and a real user base, or possibly avoid it altogether!
- The Hadoop fiasco.
- Random EC2 machines that you don't know whether they're important or not.
- You NEED to have a handle on your daily spend. Operations needs to subscribe to a daily email from Amazon and respond to anomalies. Hourly thinking is powerful, but it can get you in trouble if you leave things running for days!
We all have experienced that sinking feeling when we're planning on decommissioning a server. You're never really 100% sure that the server isn't being used by some process until you shut it down, and then wait a month to see if anyone complains. This happens with regular servers as well, but it's even more likely in the cloud. In a regular data center, the ponderous nature of the procurement process means that machines tend to be labeled with their intended uses. But in a cloud computing situation, it's very possible to end up with a machine where you're not sure what it does, who instantiated it, or if it's still needed.
Ops strictly controls production. Dev and staging have oversight from ops, but QA and engineering operate them cooperatively and have permissions to create new machines. A wiki for each account is maintained with up-to-date info about what instances are currently up and what they are used for. The free-fire zone is anything-goes, but it could get deleted by an admin at any minute. You need to have the free-fire zone to really take advantage of the agility that you get from the cloud. When a developer gets a random idea and wants to test if it scales, she should be able to test it out. Get into a rhythm: if it's not a problem, check it on Fridays; if it is a problem, have a script shut everything down at 10PM every night. Don't forget to link via consolidated billing!
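That nightly shutdown script can be tiny. Here's a minimal sketch using boto3 (the current Python AWS SDK, not necessarily what you'd have used at the time); the idea of protecting machines tagged `env=production` and stopping everything else is my assumption about how you'd mark the free-fire machines, not a prescribed scheme:

```python
def stoppable_instance_ids(reservations, protect_tag=("env", "production")):
    """Pick the ids of running instances that are NOT protected.
    `reservations` is the 'Reservations' list from an EC2
    describe_instances response; kept pure so it's easy to test."""
    ids = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            if inst.get("State", {}).get("Name") != "running":
                continue  # already stopped or terminating
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get(protect_tag[0]) == protect_tag[1]:
                continue  # production machine: leave it alone
            ids.append(inst["InstanceId"])
    return ids


def nightly_shutdown():
    """Run this from cron at 10PM. Needs AWS credentials in the environment."""
    import boto3  # local import so the pure helper above is usable without the SDK
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances()
    ids = stoppable_instance_ids(resp["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
```

The pure filtering function is the part worth getting right: it's what stands between "save money overnight" and "stop the database".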
It's not enough to have cloud infrastructure. You have to be able to reliably spin up a new copy of any machine that's part of your infrastructure. One way to do that is by saving it as what Amazon calls an "AMI". But this isn't good enough. What's in the AMI? It's a binary blob. Changes aren't auditable. AMIs don't enforce a configuration either: if a dev or ops person logs on to one of your instances and changes something about the configuration after it has spun up, then your system will be in an inconsistent state. The only sane way to manage a big cluster of machines is with system automation of some kind. There are several alternatives out there, all of them free and open source. We use Puppet for controlling the configuration of all our EC2 instances.
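As an illustration of what that enforcement looks like, here's a hypothetical Puppet manifest fragment. The package, service, and file names are made up for the example, not our actual config. On every Puppet run the machine converges back to this declared state, so a hand-edit on one instance gets reverted instead of silently diverging:

```puppet
# Every node with this class gets the same enforced state.
package { 'imagemagick':
  ensure => installed,
}

service { 'converter':
  ensure => running,
  enable => true,
}

file { '/etc/converter.conf':
  ensure => file,
  source => 'puppet:///modules/converter/converter.conf',
  notify => Service['converter'],  # restart the service when the config changes
}
```

Because the manifest lives in version control, configuration changes are auditable in a way an AMI never is.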
How do you scale up and down automatically? We deal with very spiky load. We need to be able to scale up rapidly, then scale back down when demand is over. Scaling up is easy. We keep track of work that we have dispatched on our home cluster. We look at the derivative of the queue size. In other words: is the amount of work we're waiting for increasing? If it is, we spawn new workers. Scaling down was a bit trickier for us to figure out. We finally came up with the solution of suicide workers. A suicide worker is just like a regular EC2 server doing a job for us, but it is programmed to shut itself down in 1 hour and 50 minutes. So in response to a massive spike in uploads, our system starts spawning new workers and keeps doing it until the queue stops growing. The workers then bleed off slowly. If they bleed off and there's still load, then the derivative will go up again and new workers will be spawned. It's a relatively simple system with only a few moving parts, and it works well for us. I understand that Amazon has some technology they've developed that also addresses this issue. [NEED GRAPHIC OF SPIKE AND RESPONSE HERE]
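The suicide-worker scheme can be sketched in a few lines. This is an illustrative reconstruction, not our production code: the batch size and the idea of sampling the queue periodically are assumptions, while the 1h50m lifetime and the derivative-of-queue-size rule come straight from the description above:

```python
import time

# Each worker kills itself 1 hour 50 minutes after it starts.
LIFETIME_SECONDS = 110 * 60


def workers_to_spawn(queue_sizes, batch=2):
    """Scale-up rule: look at the derivative of the queue size.
    `queue_sizes` is a list of recent samples, oldest first. If the
    backlog is still growing, spawn a batch of new workers; otherwise
    spawn none and let existing workers bleed off on their timers."""
    if len(queue_sizes) < 2:
        return 0
    derivative = queue_sizes[-1] - queue_sizes[-2]
    return batch if derivative > 0 else 0


class SuicideWorker:
    """Records its birth time and reports when it should shut itself down."""

    def __init__(self, now=None):
        self.born = time.time() if now is None else now

    def should_shut_down(self, now=None):
        now = time.time() if now is None else now
        return now - self.born >= LIFETIME_SECONDS
```

Scale-down needs no central decision at all: every worker carries its own timer, and sustained load simply keeps triggering the scale-up rule faster than workers expire.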
Reserved Instances guarantee that you'll have access to the infrastructure you need. In exchange, you pay an up-front cost, sacrificing some of the benefits of pay-by-the-drink pricing. Spot Instances are ones you bid for on the Amazon spot market. You're explicitly not guaranteed them, BUT they are on average super-cheap. Regular instances are the default: a fixed price per hour, and getting them is usually not a problem. My analysis shows that Reserved Instances don't make much sense economically. In other words, the savings probably don't justify the up-front payment. But they make sense from a risk-mitigation perspective. Knowing you have a contract that guarantees you access to a certain number of machines gives peace of mind. A very conservative business would pay for reserved instances for all the machines they need. A very aggressive business would buy everything on the spot market.
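A quick way to sanity-check that reserved-versus-on-demand analysis is to compute the break-even hours of usage per term. The prices in the example are made-up placeholders, not real Amazon rates; plug in the current ones and compare the result to how many hours you actually expect to run the machine:

```python
def breakeven_hours(upfront, reserved_hourly, on_demand_hourly):
    """Hours of use at which a reserved instance's total cost (up-front
    fee plus discounted hourly rate) matches pure on-demand pricing.
    If you'll run the machine for fewer hours than this, on-demand wins."""
    saved_per_hour = on_demand_hourly - reserved_hourly
    if saved_per_hour <= 0:
        raise ValueError("reserved rate must be cheaper than on-demand")
    return upfront / saved_per_hour
```

For example, with a hypothetical $350 up-front fee, $0.03/hour reserved and $0.10/hour on-demand, you'd need about 5,000 hours (roughly 7 months of continuous use) just to break even, which is why the case for reserving is more about guaranteed capacity than about savings.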