What this Job Entails:
The Business Analyst IV will provide solutions that help attain business outcomes. The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24×7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions.
This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.
Your Roles and Responsibilities:
1) Alert Rationalization & Prioritization (Core)
Establish and maintain a department-wide alert rationalization framework that evaluates alerts for:
Perform regular reviews of new and existing alerts to ensure alert quality, correct routing, and alignment with operational coverage.
Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high-impact degradation.
2) Standards, Policies, and Guardrails
Define and enforce alerting standards including:
Create a standardized Alert Design Checklist and approval workflow (e.g., “Definition of Done” for alert onboarding).
Partner with tool/platform owners to ensure standards are embedded in monitoring tooling (templates, required fields, automated validation).
3) Routing Decisions to 24×7 Eyes-on-Glass
Act as gatekeeper (or lead the governance process) for determining which alerts should:
Ensure routing aligns with:
4) Runbook / Response Instruction Cataloging (Knowledge System)
Establish a consistent approach to cataloging response instructions for every actionable alert, including:
Own the runbook template and ensure runbooks are versioned, maintained, and reviewed on a defined cadence.
Partner with service owners to ensure runbooks stay current as systems change.
5) Reporting & Operational Outcomes
Define and publish KPIs that demonstrate alerting health and operational performance, such as:
Facilitate governance forums (weekly/monthly) with service owners and engineering leads to review alert quality and backlog.
6) Cross-Functional Enablement
Coach service teams on best practices: SLIs/SLOs, alert thresholds, dependency monitoring, and incident correlation.
Drive adoption of observability patterns (golden signals, health indicators, multi-signal alerting).
Support major incident learning by feeding post-incident insights back into improved alerts and runbooks.
7) Expected Deliverables in the First 45 Days:
Alerting standards (severity model, metadata, naming, routing policy) published and adopted
Intake and approval workflow established for new/changed alerts
Top 20 noisiest services rationalized (deduplication, suppression, threshold tuning) with measurable noise reduction
Runbook template launched; minimum runbook coverage targets set (e.g., 80% of paged alerts)
Central alert catalog created (ownership + routing + runbook link + last review date)
Required Qualifications/Skills:
5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management
Demonstrated success reducing noise and improving actionability across enterprise alerting ecosystems
Experience with common monitoring/observability tools (e.g., Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Management, or similar)
Strong understanding of:
Excellent stakeholder management and ability to drive standards across teams
Preferred Qualifications:
Physical Demand & Work Environment:
What can Astreya offer you?
Salary Range
$98,040.00 – $154,800.00 USD (Salary)
Astreya offers comprehensive benefits to all Regular, Full-Time Employees, including:
Medical provided through UHC (PPO, HSA, Surest options); Medical provided through Kaiser (HMO option only) for California employees only
Dental provided through UHC
Nationwide Vision provided by UHC
Flexible Spending Account for Health & Dependent Care
Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
Continuing Education and Professional Development via various integrated platforms, e.g., Udemy and Coursera
Corporate Wellness Program provided by Goomi Group
Employee Assistance Program
Wellness Days
401(k) Plan
Basic and Supplemental Life Insurance
Short-Term & Long-Term Disability
Critical Illness, Critical Hospital, and Voluntary Accident Insurance
Tuition Reimbursement (available 6 months after start date, capped)
Paid Time Off (accrued and prorated, maximum of 120 hours annually)
Paid Holidays
Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law