Congratulations to me, as I’ve got a new job, and I’m in a new team here at the ‘soft. Specifically, I’m in Azure, in the Internet of Things space, working on a Thing. I can’t talk about the Thing. Some day I will talk about the Thing. But not now.
This means I’m back on a live product (or a product that will be a live product, it’s all very complicated) and that means I am on a Live Site team and I’m pretty happy about that. I enjoy the Live Site process because it’s basically enforcing a culture of learning from mistakes.
What is Live Site Practice
Generally speaking, Live Site means that your site is… live. Meaning when something goes wrong (and there are varying levels, from wrong to Wrong to WRONG to WRONG!!!) you have a person responsible for fixing it, you have expectations of how quickly it gets fixed, and you put a plan in place to make sure it never happens again, plus monitoring to catch it when it inevitably does. Live Site incidents can be singular (this one experience happened this one time) or multitudinous (cascading incidents, parallel problems, etc.) or chronic (a liberal application of the philosophy of Live Site could categorize a series of data breaches or questionable data sharing practices by a given company, for example, as a very large Live Site Incident).
Measuring the Live Site Response
There are four major ways to measure the response to a Live Site Incident. These are: Time to Detect (how long it took you to figure out something is wrong from the time something actually went wrong), Time to Engage (how long it took you to start trying to fix it from the time it was detected), Time to Mitigate (how long before the customer stopped having the negative experience), and Time to Resolve (how long before the actual problem was fixed).
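For the code-minded, here is a minimal sketch of those four measurements as plain timestamp arithmetic. The `Incident` class, its field names, and the sample outage are all made up for illustration. I'm also assuming that Time to Mitigate is counted from when the customer first felt the problem and that Time to Resolve is counted from when the problem actually began; teams draw those lines differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one incident. Field names are illustrative, not a standard."""
    started: datetime         # when something actually went wrong
    impact_started: datetime  # when the customer first felt it
    detected: datetime        # when you figured out something was wrong
    engaged: datetime         # when you started trying to fix it
    mitigated: datetime       # when the customer stopped having the bad experience
    resolved: datetime        # when the actual problem was fixed

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_engage(self) -> timedelta:
        return self.engaged - self.detected

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from the start of customer impact.
        return self.mitigated - self.impact_started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from when the problem actually began.
        return self.resolved - self.started


# A hypothetical outage, invented purely to exercise the class.
outage = Incident(
    started=datetime(2024, 1, 1, 9, 0),
    impact_started=datetime(2024, 1, 1, 9, 5),
    detected=datetime(2024, 1, 1, 9, 20),
    engaged=datetime(2024, 1, 1, 9, 30),
    mitigated=datetime(2024, 1, 1, 10, 5),
    resolved=datetime(2024, 1, 2, 9, 0),
)

print("Time to Detect:  ", outage.time_to_detect)
print("Time to Engage:  ", outage.time_to_engage)
print("Time to Mitigate:", outage.time_to_mitigate)
print("Time to Resolve: ", outage.time_to_resolve)
```

Keeping the raw timestamps and deriving the durations means you can argue about where each line should be drawn without re-entering any data.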
General prudence means I don’t illustrate this with an Actual Thing From Work because I like my job and I want to keep it, so I’ll use a recent personal experience to illustrate.
At about 8am on August 25th I went to the gym and my contacts clouded over. It was annoying so when I got home I took them out and put them in a fresh solution/case and went about my day in glasses. At night we had friends to dinner so I wore my contacts with no trouble. At about 9am on August 26th I went to the gym and my contacts clouded over. It wasn’t horrible, just annoying, and so when I got home I took them out and put them in a fresh solution/fresh case and ran around with my glasses. No problem.
- By 4pm that afternoon my eyes were itching. Because we’d had smoke issues lately coming in from the Canada and Eastern Washington fires, I figured my eyes had got irritated from that, and put some drops in.
- By 5pm my eyes were uncontrollably watering and itchy.
- By 8pm I had to stop watching Aliens, one of my very favorite movies, because the following hurt: opening my eyes, closing my eyes, and having my eyes closed. Thinking that eye irritations usually resolve themselves with a good night’s sleep (hello, morning eye crud) I went to bed (yes, at 8pm). The software equivalent of this is turning the machine off and turning it on again.
- By 10:30pm I woke from a dead sleep feeling like someone was stabbing me in my eyeballs and asked my husband to drive me to the ER.
- By 11pm they had put numbing drops in my eyes. Ensuing investigation showed my corneas had all kinds of pitting all over them and possibly dual infection in both eyes.
- By 12:30am they discharged me with a Percocet (to help me sleep and ignore the pain), antibiotics (for my eyes) and an instruction to see an eye doctor the next day.
- By 10:30am the next day the eye doctor confirmed the infection, noted some abrasions, and said I’d self-heal in about five days.
Time to Detect
This one is tricky, because on one hand you can say I “detected” it at 9am when my contacts clouded over… but on which day? Nothing hurt, I wasn’t inconvenienced, and I carried on with my day. So I’ll say I detected it at 4pm on the Sunday. But it’s likely the problem actually started at 9am on the Saturday, so my Time to Detect was 31 hours.
Time to Engage
Again, it’s not a clear line (and I’ll point out these things are hashed over in the Live Site world a lot as well). I started “engaging” with eye drops at 4pm. I didn’t request professional help though until 10:30pm when it got really bad. I’m calling it 6.5 hours (4pm-10:30pm).
Time to Mitigate
Mitigation is all about the customer’s perspective: how long it took from the time the problem started actually happening (and the customer was inconvenienced) to the time it got fixed as far as the customer could tell. For me, that’s from 4pm (eyes itching) to 11pm when I got my first numbing drops. Seven hours. If you want to be really specific, my eyes had mostly stopped hurting by the next day, *without* numbing drops, so a more conservative mitigation time would be from 4pm Sunday to 10:30am Monday – 18.5 hours.
Time to Resolve
Resolution is about the actual problem being fixed (perspective or otherwise). In this case, five days from Monday the 27th, or September 1st. Time to Resolve: a little over six days. As part of resolution I had to throw out all open saline/lens solution containers, contact lenses, etc. As a “customer” of this experience I also took the added step of “re-architecting” my framework: I went and got a different brand of contact lenses (that change out more frequently), and started wearing my glasses more often.
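If you'd like to check my math, here is a small sketch that redoes the arithmetic for the first three measurements. The year is an assumption (I only gave month and day; 2018 is a year in which August 25th was a Saturday), and the variable names are just labels for the moments in the timeline. I've left Time to Resolve out because the story doesn't pin down exact start and end times for it.

```python
from datetime import datetime

YEAR = 2018  # assumption: a year in which August 25th fell on a Saturday

# Moments taken from the timeline in this post.
problem_started = datetime(YEAR, 8, 25, 9, 0)    # Saturday 9am, first cloudy lenses
detected = datetime(YEAR, 8, 26, 16, 0)          # Sunday 4pm, eyes itching
asked_for_help = datetime(YEAR, 8, 26, 22, 30)   # Sunday 10:30pm, off to the ER
numbing_drops = datetime(YEAR, 8, 26, 23, 0)     # Sunday 11pm, first relief
eye_doctor = datetime(YEAR, 8, 27, 10, 30)       # Monday 10:30am, pain mostly gone


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two moments, in hours."""
    return (end - start).total_seconds() / 3600


time_to_detect = hours_between(problem_started, detected)
time_to_engage = hours_between(detected, asked_for_help)
time_to_mitigate = hours_between(detected, numbing_drops)
time_to_mitigate_conservative = hours_between(detected, eye_doctor)

print(f"Time to Detect:   {time_to_detect} hours")                  # 31.0
print(f"Time to Engage:   {time_to_engage} hours")                  # 6.5
print(f"Time to Mitigate: {time_to_mitigate} hours")                # 7.0
print(f"  (conservative): {time_to_mitigate_conservative} hours")   # 18.5
```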
Measuring the Impact
The Emergency Room is not cheap, although by comparative standards I got off easy. My bill, after insurance, was roughly $700 (not including the follow-up eye doctor visits, new contact lenses, replaced makeup, etc.). The bill sent to the insurance company was roughly 3 times that amount.
Money isn’t everything, and time is more precious: I lost about 4 hours’ sleep, I lost 6 hours’ quality time with my husband and a favorite movie. I lost another 2 hours or so to the ER and another 2 to/from the eye doctor.
I had to work from home on that Monday, and that meant even though it was my last week with my old team they didn’t have me right there to help with my transition; that’s 4 people impacted. My husband had to take time from his evening and next day to take me to appointments, which he was super supportive of and insisted upon, but it also meant he couldn’t do whatever it is he should have been doing during those hours. Rarely is it just one customer who is impacted in Live Site.
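Adding up the numbers above gives a rough sense of scale. This is only a back-of-the-envelope sketch: the dollar figures are the approximate ones I quoted, the labels are mine, and I'm assuming the blocks of lost time don't overlap.

```python
# Approximate figures quoted in this post; names are illustrative.
out_of_pocket_usd = 700    # my bill after insurance, "roughly $700"
insurer_multiple = 3       # the insurer was billed "roughly 3 times that amount"
billed_to_insurer_usd = out_of_pocket_usd * insurer_multiple

# Assumption: these blocks of time do not overlap.
hours_lost = {
    "sleep": 4,
    "evening with husband and a favorite movie": 6,
    "emergency room": 2,
    "to/from the eye doctor": 2,
}
total_hours_lost = sum(hours_lost.values())

print(f"Billed to insurer: roughly ${billed_to_insurer_usd}")
print(f"My own time lost:  roughly {total_hours_lost} hours")
```

None of this counts my husband's time or my old team's, which is exactly the point about it rarely being just one customer.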
Yes, post-mortem means “after death”, and no one died. In the Live Site world, no one dies. (Well, we hope no one dies.) The Post Mortem is when you look over what happened and how, figure out how to keep it from happening again, and figure out how to detect it if it does.
What Happened – also known as the Root Cause Analysis
Root Cause Analysis (RCA) is the review of what instigated the problem. In this case, what happened was that I somehow (?) either got smoke between my contact lenses and cornea, creating a corneal abrasion that then led to dual infection, OR I got an infection, which led to corneal abrasion. The experts weren’t really worried about which came first, and if I had wanted to spend lab money to dig into which came first, I don’t know that they would have been able to figure that out. It is, in fact, a moot point. If it was smoke from the environment, that’s how that could have happened. Or it could be infection from saline solution, eye rubbing, random bacteria, etcetera. It could have been from contact lens over-use. If they had been able to tell me the root cause definitively, that would have been great, because it would impact my next two steps, but rarely do you get a clean root cause.
How to Keep it from Happening Again
As we read up above, I trashed all of my eye-based items (including, incidentally, my mascara, every one of my eyeliners, etc.). I washed all of my makeup brushes and sterilized them. I got a new brand of contact lens that is changed out more frequently. I got new glasses and wear them more often than I used to. This may be overkill, but it is everything I can do to ensure I don’t have to miss one of my favorite movies.
How to Detect if it Happens Again
In this case, my first clue was my contact lenses clouding over on the Saturday. At that point I should have quit wearing contacts for a few days and thrown those lenses out instead of trying to disinfect them. My second detect point was the second day of clouding lenses. Those two combined should have sent me to urgent care or an eye doctor, which would certainly have been more cost-effective than the ER. Uncontrollable eye watering, foggy lenses, and/or gritty pain when opening, closing, or having closed eyes are all reasons to see a professional right away.
You’ll notice in most of this I’ve not beaten myself up about being stupid, making poor choices, etc. That’s because it wouldn’t help (either me or the situation) and it’s entirely beside the point. I can’t go and change what happened, so the best practice is to learn from it and ensure others do, too. *That* is what I like about Live Site. If your Live Site culture feels like a giant finger-pointing exercise, then it isn’t being implemented properly, and it’s time to do some Root Cause Analysis.