MDD Monitoring Driven Development
From the Rubber Chicken to MDD
@jtf's "presentation"
- James Shore's Rubber Chicken
  - a physical token you had to hold to commit (push) to main (it was SVN back then), and you ran the build/tests before committing
  - you had to use a separate physical machine (solving the "it works on my machine" problem)
- CI
  - can run more stuff now (fast tests, slow tests) - but a separate build for deploy
- pipelines with artifact passing
- promoting to test/prod
- CD - blue/green deploys, rolling back based on KPIs: CI + monitoring now controls production (a rough sketch follows this list)
  - if any step fails, the change is automatically reverted
  - if it made it to prod but business metrics are down:
    - don't revert the code
    - take it out of the production cluster to investigate
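A minimal sketch of that last step, assuming a hypothetical metrics client and cluster API (names like metrics_client.value and cluster.remove_from_rotation are illustrative, not from any specific tool): watch a business KPI after the rollout and quarantine the new instances instead of reverting the code.

```python
# Post-deploy gate: if a business KPI drops after a rollout, don't revert the
# code -- pull the new instances out of the production cluster to investigate.
import time

KPI = "checkout_conversion_rate"   # hypothetical business metric
DROP_THRESHOLD = 0.10              # flag a drop of more than 10% vs. baseline
OBSERVATION_SECONDS = 600          # watch the KPI for 10 minutes after rollout


def post_deploy_gate(metrics_client, cluster, new_instances):
    baseline = metrics_client.value(KPI)          # KPI just before the rollout
    deadline = time.time() + OBSERVATION_SECONDS
    while time.time() < deadline:
        current = metrics_client.value(KPI)
        if current < baseline * (1 - DROP_THRESHOLD):
            # Business metric is down: quarantine the new instances, don't revert.
            cluster.remove_from_rotation(new_instances)
            return False  # tell the pipeline that promotion failed
        time.sleep(30)
    return True  # KPI stayed healthy, keep the new instances in rotation
```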
State of the monitoring (first)
- metrics used in monitoring are often not specific (a high-level business metric is down, so it "must have been this change")
- just like adding tests after writing the code is hard, so is adding monitoring/metrics after the fact
- who tried monitoring first?
  - zsoldosp - a checklist item in the issue template, but it didn't apply to enough issues, so it kind of got ignored on that project
  - PJ / Intent Media
    - monitoring can stop a deploy/rollout
    - stopped doing acceptance tests in favor of monitoring
  - aparker / TIM - failure analysis: we built it, and now that we know how it works, let's figure out (see the sketch after this list)
    - how could it fail?
    - what impact would it have?
    - how would we know? (from customers?)
    - is it worth adding anything? (a metric, an alert)
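A sketch of capturing those failure-analysis answers as a lightweight record, so every feature ships with an explicit answer to "how would we know?". The field names are ours and the example reuses the FAQ story from later in these notes; nothing here comes from a particular tool.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    feature: str
    how_it_could_fail: str
    impact: str
    how_we_would_know: str       # ideally not "a customer calls us"
    worth_a_metric: bool
    worth_an_alert: bool


faq_page_down = FailureMode(
    feature="FAQ page",
    how_it_could_fail="CDN misconfiguration returns 404 for /faq",
    impact="support call volume goes back up",
    how_we_would_know="drop in FAQ pageviews, rise in support calls",
    worth_a_metric=True,
    worth_an_alert=False,  # a next-business-day channel is enough
)
```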
Alerting
- how many alerts should we create? (see the sketch below)
  - high level? e.g. number of failed API requests?
  - more specific? e.g. we know after debugging that it failed because the middleware failed - should we monitor the middleware?
- metrics vs. monitoring
  - monitoring triggers someone to look at it
  - metrics - kind of like classic ops: collect data, don't attach alerts, just eyeball it ("that looks like an unusual shape, let's investigate")
- who should we call? (e.g. if we only have high-level metrics, who should the alerts wake up?)
- (pagerduty.com)
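A sketch of that high-level vs. specific trade-off as two alert rules; check_rate stands in for whatever metrics backend you query, and the thresholds are purely illustrative.

```python
def evaluate_alerts(check_rate):
    alerts = []

    # High-level alert: covers many failure modes, but says little about the
    # cause -- someone still has to dig.
    if check_rate("api.requests.failed") > 0.05:   # >5% of requests failing
        alerts.append("page: failed API requests above 5%")

    # Specific alert: we learned from a past incident that the middleware can
    # fail silently, so we watch it directly. More precise, but one more rule
    # to maintain.
    if check_rate("middleware.errors") > 0:
        alerts.append("notify: middleware reporting errors")

    return alerts
```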
"Failure Friday" practice
- during work hours!
- "we think this component is redundant, so let's shut it off and see the team (and the system) recover" (a rough sketch follows)
- important: do it when you expect the exercise to be successful
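A minimal game-day sketch under the assumptions above; list_redundant_instances and stop_instance are hypothetical helpers on whatever cluster API you have.

```python
import datetime
import random


def failure_friday(cluster):
    now = datetime.datetime.now()
    # Only during Friday work hours, when people are around to watch and recover.
    if now.weekday() != 4 or not (10 <= now.hour < 16):
        raise RuntimeError("Run game days during Friday work hours only")

    candidates = cluster.list_redundant_instances()
    victim = random.choice(candidates)
    print(f"Failure Friday: stopping {victim}; watch the dashboards!")
    cluster.stop_instance(victim)
```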
Feature validation / AB testing
not the same as monitoring
Alert thresholds
- it's not always binary (on/off)
- "normal" is not the same as yesterday / last week / last year (a weekday-aware check is sketched after this list)
  - seasonality - e.g. Black Friday, but it can be different for each industry, and you kind of know it: "Mondays are usually about this many pageloads"
  - event-driven - e.g. if you publish tips, it depends on what happens in the world
- factor this into
  - what can we measure
  - what should we alert on (i.e. wake people up) - some things can wait until the next business day, so use different channels
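A sketch of a non-binary, seasonality-aware threshold: instead of one fixed number, compare today's pageloads to the same weekday in previous weeks. get_pageloads(date) is a hypothetical lookup against your metrics store.

```python
import datetime
import statistics


def is_unusual(today, get_pageloads, weeks_back=4, tolerance=0.3):
    baseline = [
        get_pageloads(today - datetime.timedelta(weeks=w))
        for w in range(1, weeks_back + 1)
    ]
    expected = statistics.mean(baseline)  # "Mondays are usually about this many pageloads"
    actual = get_pageloads(today)
    # Outside +/- 30% of the weekday norm counts as unusual; whether that wakes
    # someone up or goes to a next-business-day channel is a separate decision.
    return abs(actual - expected) > tolerance * expected
```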
Improving Alerts
- make them actionable
  - link to a wiki page or runbook describing how to fix the problem
  - write it for your future self who gets alerted at 2am at a party, not with your present knowledge of the context of the feature you just implemented
- metrics you don't use are inventory, and thus not useful
(question: are there any logging frameworks that would only flush logs when an exception occurs, but then at DEBUG level?)
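Python's standard library can approximate this: a minimal sketch using logging.handlers.MemoryHandler, which buffers records (including DEBUG) and flushes them to a target handler once a record at flushLevel or above arrives. Note it also flushes when the buffer fills, so it is "mostly on errors" rather than strictly only on exceptions.

```python
import logging
import logging.handlers

target = logging.FileHandler("app.log")
buffering = logging.handlers.MemoryHandler(
    capacity=1000,                 # flush anyway once 1000 records are buffered
    flushLevel=logging.ERROR,      # ...or as soon as an ERROR-level record arrives
    target=target,
)

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)     # keep DEBUG detail in the buffer
logger.addHandler(buffering)

logger.debug("detail that is only interesting if something goes wrong")
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("boom")       # ERROR-level record triggers the flush
```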
- should we alert on causes (disk full) or symptoms (users can't log in)? (symptoms seem more useful; some tools let you declare dependencies, i.e. if this is down, these others will be down too, so don't alert on those)
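A sketch of that dependency idea with a hand-rolled check registry; the check names and structure are illustrative, not from a specific monitoring tool.

```python
CHECKS = {
    "database": {"depends_on": []},
    "login-service": {"depends_on": ["database"]},
    "user-cant-login": {"depends_on": ["login-service"]},
}


def checks_to_page_for(failing):
    """Return only the failing checks whose dependencies are all healthy."""
    failing = set(failing)
    return [
        name
        for name in failing
        if not any(dep in failing for dep in CHECKS[name]["depends_on"])
    ]


# If the database is down, everything downstream fails too, but only the
# database alert should wake someone up:
print(checks_to_page_for(["database", "login-service", "user-cant-login"]))
# -> ['database']
```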
Workshop on MDD - 2 minutes to dropped jaws
Story: given that our support lines are currently overwhelmed, if we added an FAQ about the issue, support calls would drop back to manageable levels.
what can we measure?
- number of FAQ views
- number of support calls
- ask support reps to ask whether the caller read the FAQ, and feed that back into the system?
- instead of a "was this helpful?" yes/no prompt, maybe we could offer "yes" / "call support (link/phone number)" (talk to UX before trying this at home :-))
=> the way you think about validation/measurement changes the product (see the instrumentation sketch below)
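A sketch of instrumenting the FAQ story: count FAQ views, support calls, and which option people pick on the feedback prompt, so "calls drop back to manageable levels" becomes something we can verify. The in-memory counter and event names are purely illustrative.

```python
from collections import Counter

events = Counter()


def record(event):
    events[event] += 1


# Emitted by the FAQ page, the phone system, and the feedback widget:
record("faq.viewed")
record("faq.feedback.helpful")
record("faq.feedback.call_support")   # the "call support" link instead of "no"
record("support.call.received")
record("support.call.had_read_faq")   # reps ask the caller and feed it back

calls_per_faq_view = events["support.call.received"] / max(events["faq.viewed"], 1)
print(f"support calls per FAQ view: {calls_per_faq_view:.2f}")
```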
Monitoring Embedded into Business
- the SRE handbook only focuses on the tech side
- if decision makers use monitoring data, it is important to the business, so there is no need to justify why you monitor
Links
- My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
- Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/
- Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/
Good questions to ask
- What does this data mean?
- If we are not watching it -> delete it?
- Should we try "Failure Friday"?
- Should we use "Daily Red"?
- Is this indicator fast enough (leading or lagging indicator) to react to?