A Collection of Posts About the Amazon Cloud Outage
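Recently, the biggest story on the Internet has probably been the outage of Amazon's AWS, which still had not fully recovered after several days. The whole Internet is talking about it; the Internet is very unhappy, and the consequences could be serious. Perhaps because the incident had no impact on China, there are not many Chinese-language articles about it. For one, see the Hexun article 《伤不起!亚马逊史前最大宕机事件的启示》.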
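Someone overseas has collected the posts related to this incident. They are quite good posts and articles, especially the ones on lessons learned, which I found very instructive, so I am passing the list along. The source of this list is here.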
Contents
Experiences of Individual Companies, Good and Bad
- How Heroku Survived the Amazon Outage on the Heroku status page
- How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
- How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
- How Bizo survived the Great AWS Outage of 2011 relatively unscathed… by Someone at Bizo
- Joe Stump’s explanation of how SimpleGeo survived
- How Netflix Survived the Outage
- Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering’s Blog (Hacker News thread)
- On reddit’s outage
- What caused the Quora problems/outage in April 2011?
- Recovering from Amazon cloud outage by Drew Engelson of PBS.
- PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being. From Comment
Amazon Web Services Discussion Forum
Some experienced people shared many quite good accounts of their experiences with the outage.
- Amazon Web Services Discussion Forum
- Cost-effective backup plan from now on?
- Life of our patients is at stake – I am desperately asking you to contact
- Why did the EBS, RDS, Cloudformation, Cloudwatch and Beanstalk all fail?
- Moved all resources off of AWS
- Any success stories?
- Is the mass exodus from East going to cause demand problems in the West?
- Finally back online after about 71 hours
- Amazon EC2 features vs windows azure
- Aren’t Availability Zones supposed to be “insulated from failures”?
- What a lot of people aren’t realizing about the downtime:
- ELB CNAME
- Availability Zones were used in a misleading manner
- Tip: How to recover your instance
- Crying in Forum Gets Results, Silver-level AWS Premium Support Doesn’t
- Well-worth reading: “design for failure” cloud deployment strategy
- New best practice
- Don’t bother with Premium Support
- Best practices for multi-region redundancy
- “Postmortum”
- Learning from this case
- Amazon, still no instructions what to do?
- Anyone else prepared for an all-nighter?
- Is Jeff Bezos going to give a public statement?
- Rackspace, GoGrid, StormonDemand and Others
- Jeff Barr, Werner Vogels and other AWS persons – where have you been???
- After you guys fix EBS do I have do anything on my side?
- Need Help!!! Lives of people and billions in revenue are at risk now!!!
- I’ve Got A Suspicion
- Farewell EC2, Farewell
There were also many many instances of support and help in the log.
Summaries
- Amazon EC2 outage: summary and lessons learned by RightScale
- AWS outage timeline & downtimes by recovery strategy by Eric Kidd
- The Aftermath of Amazon’s Cloud Outage by Rich Miller
Position: It's the Users' Fault
- So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
- The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
- The AWS Outage: The Cloud’s Shining Moment by George Reese (Hacker News discussion)
- Failing to Plan is Planning to Fail by Ted Theodoropoulos
- Get a life and build redundancy/resiliency in your apps on the Cloud Computing group
Position: It's Amazon's Fault
- Stop Blaming the Customers – the Fault is on Amazon Web Services by Klint Finley
- AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
- Amazon Web Services are down – Huge Hacker News thread
Lessons and Takeaways
- People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
- Basic scalability principles to avert downtime by Ronald Bradford
- Amazon crash reveals ‘cloud’ computing actually based on data centers by Kevin Fogarty
- Seven lessons to learn from Amazon’s outage By Phil Wainewright
- The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
- Some thoughts on outages by Till Klampaeckel
- Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
- How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
- Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
- Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
- Single Points of Failure by Mat
- Coping with Cloud Downtime with Puppet
- Amazon Outage Concerns Are Overblown by Tim Crawford
- Where There Are Clouds, It Sometimes Rains by Clay Loveless
- Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
- Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
- Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
Vendors Are Angry
- Amazon Outage Proves Value of Riak’s Vision by Basho
- Magical Block Store: When Abstractions Fail Us by Mark Mayo of Joyent (Hacker News discussion)
- On Cascading Failures and Amazon’s Elastic Block Store by Jason
- An unofficial EC2 outage postmortem – the sky is not falling from CloudHarmony
All articles on this blog are mirrored from Coolshell (Coolshell.cn); copyright of all content belongs to the original authors.