Your product is currently deployed in three Google Cloud Platform (GCP) zones with your users divided between the zones.
You can fail over from one zone to another, but it causes a 10-minute service disruption for the affected users. You typically
experience a database failure once per quarter and can detect it within five minutes. You are cataloging the reliability risks of
a new real-time chat feature for your product. You catalog the following information for each risk:
Mean Time to Detect (MTTD) in minutes
Mean Time to Repair (MTTR) in minutes
Mean Time Between Failures (MTBF) in days
User Impact Percentage
The chat feature requires a new database system that takes twice as long to successfully fail over between zones. You want
to account for the risk of the new database failing in one zone. What would be the values for the risk of database failover with
the new system?
C
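The four catalog values for this risk can be derived from the scenario with simple arithmetic. The sketch below is illustrative only: the `Risk` class and `database_failover_risk` function are hypothetical names, a quarter is approximated as 90 days, and users are assumed to be split evenly across the three zones.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    mttd_minutes: float     # Mean Time to Detect
    mttr_minutes: float     # Mean Time to Repair
    mtbf_days: float        # Mean Time Between Failures
    user_impact_pct: float  # share of users affected


def database_failover_risk(detect_minutes, old_failover_minutes,
                           slowdown_factor, days_between_failures,
                           zone_count):
    """Derive the risk catalog entry for a single-zone database failure."""
    return Risk(
        mttd_minutes=detect_minutes,
        # The new database takes slowdown_factor times as long to fail over.
        mttr_minutes=old_failover_minutes * slowdown_factor,
        mtbf_days=days_between_failures,
        # Assumption: users are divided evenly between the zones.
        user_impact_pct=100 / zone_count,
    )


# Scenario values: detect in 5 minutes, 10-minute failover today,
# twice as slow with the new system, one failure per quarter
# (assumed to be about 90 days), three zones.
risk = database_failover_risk(5, 10, 2, 90, 3)
print(risk.mttd_minutes, risk.mttr_minutes, risk.mtbf_days,
      round(risk.user_impact_pct))
```

Under these assumptions, only the repair time changes relative to the existing database: detection time, failure frequency, and the share of users in one zone stay the same.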
You are responsible for the reliability of a high-volume enterprise application. A large number of users report that an
important subset of the application's functionality, a data-intensive reporting feature, is consistently failing with an HTTP
500 error. When you investigate your application's dashboards, you notice a strong correlation between the failures and a
metric that represents the size of an internal queue used for generating reports. You trace the failures to a reporting backend
that is experiencing high I/O wait times. You quickly fix the issue by resizing the backend's persistent disk (PD). Now you
need to create an availability Service Level Indicator (SLI) for the report generation feature. How would you define it?
C
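In general SRE practice, an availability SLI is often expressed as the proportion of requests that are served successfully. The sketch below shows that ratio for report generation requests. It is an illustration under assumptions, not a statement of the answer option: the `availability_sli` function and the sample status codes are hypothetical, and treating any status below 500 as a success is an assumption.

```python
def availability_sli(status_codes):
    """Return the fraction of requests that succeeded, or None if there were none."""
    total = len(status_codes)
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    # Assumption: server errors (HTTP 5xx) are the only failures counted.
    good = sum(1 for code in status_codes if code < 500)
    return good / total


# Hypothetical sample: four report generation requests, one HTTP 500.
sample = [200, 200, 500, 200]
print(availability_sli(sample))  # 0.75
```

A ratio of good events to total events ties the indicator to what users experience, rather than to an internal symptom such as queue size or I/O wait time.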
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some
personally identifiable information (PII) is leaking into certain log entry fields. All PII entries begin with the text userinfo. You
want to capture these log entries in a secure location for later review and prevent them from leaking to Stackdriver Logging.
What should you do?
A
You are part of an organization that follows SRE practices and principles. You are taking over the management of a new
service from the Development Team, and you conduct a Production Readiness Review (PRR). After the PRR analysis
phase, you determine that the service cannot currently meet its Service Level Objectives (SLOs). You want to ensure that
the service can meet its SLOs in production. What should you do next?
B
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that
wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to
set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
A
Your team uses Cloud Build for all CI/CD pipelines. You want to use the kubectl builder for Cloud Build to deploy new
images to Google Kubernetes Engine (GKE). You need to authenticate to GKE while minimizing development effort. What
should you do?
C
Your company follows Site Reliability Engineering practices. You are the Incident Commander for a new, customer-impacting
incident. You need to immediately assign two incident management roles to assist you in an effective incident response.
What roles should you assign? (Choose two.)
A E
You are running a real-time gaming application on Compute Engine that has a production and testing environment. Each
environment has its own Virtual Private Cloud (VPC) network. The application frontend and backend servers are located
on different subnets in the environment's VPC. You suspect there is a malicious process communicating intermittently in
your production frontend servers. You want to ensure that network traffic is captured for analysis. What should you do?
D
You are developing a strategy for monitoring your Google Cloud Platform (GCP) projects in production using Stackdriver
Workspaces. One of the requirements is to be able to quickly identify and react to production environment issues without
false alerts from development and staging projects. You want to ensure that you adhere to the principle of least privilege
when providing relevant team members with access to Stackdriver Workspaces. What should you do?
C
You deploy a new release of an internal application during a weekend maintenance window when there is minimal user
traffic. After the window ends, you learn that one of the new features isn't working as expected in the production
environment. After an extended outage, you roll back the new release and deploy a fix. You want to modify your release
process to reduce the mean time to recovery so you can avoid extended outages in the future. What should you do?
(Choose two.)
A C
Your company follows Site Reliability Engineering practices. You are the person in charge of Communications for a large,
ongoing incident affecting your customer-facing applications. There is still no estimated time for a resolution of the outage.
You are receiving emails from internal stakeholders who want updates on the outage, as well as emails from customers who
want to know what is happening. You want to efficiently provide updates to everyone affected by the outage. What should
you do?
C
You created a Stackdriver chart for CPU utilization in a dashboard within your workspace project. You want to share the
chart with your Site Reliability Engineering (SRE) team only. You want to ensure you follow the principle of least privilege.
What should you do?
A
Your team of Infrastructure DevOps Engineers is growing, and you are starting to use Terraform to manage infrastructure.
You need a way to implement code versioning and to share code with other team members. What should you do?
A
Explanation:
Reference: https://www.terraform.io/docs/cloud/guides/recommended-practices/part3.3.html
You have a CI/CD pipeline that uses Cloud Build to build new Docker images and push them to Docker Hub. You use Git for
code versioning. After making a change in the Cloud Build YAML configuration, you notice that no new artifacts are being
built by the pipeline. You need to resolve the issue following Site Reliability Engineering practices. What should you do?
B
Your application runs on Google Cloud Platform (GCP). You need to implement Jenkins for deploying application releases to
GCP. You want to streamline the release process, lower operational toil, and keep user data secure. What should you do?
D
Explanation:
References: https://plugins.jenkins.io/google-compute-engine/
MTTD: 5 MTTR: 20 MTBF: 90 Impact: 33%