Skip to main content

Running your own Azure Proactive Resiliency Assessment

Β· 11 min read

The Azure Proactive Resiliency Library is a curated collection of best practices, guidance, and recommendations designed to improve the resiliency of applications and services running in Azure. Built on the Resiliency pillar of the Well-Architected Framework, this catalog provides valuable insights to ensure your workloads remain robust and reliable.

The library also includes automation capabilities (using Azure Graph queries) that allow you to collect data, analyze it, and generate detailed Word and PowerPoint reports. These reports, part of the Well-Architected Resiliency Assessment workshop, provide visibility into the resiliency of your Azure workloads. This toolset, often used by Microsoft Cloud Solution architects, can be leveraged to identify areas for improvement in your own Azure estate, following Resiliency well-architected principles.

In this article, we will walk through the process of running the scripts to collect, analyze, and report on resiliency data for your workloads using the Azure Proactive Resiliency Library to help you identify and address potential reliability issues in your Azure environment.

🌐 Overview​

Along with recommendations to increase Azure resource resiliency, supplementing the guidance on the Well-Architected Framework, the proactive resiliency library includes sample Azure Resource Graph queries, to determine the current state of your resources.

Azure Proactive Resiliency Library - Use Premium tier for critical production workloads

An improvement to this automation is the ability to run queries to collect, analyze, and create a Word and PowerPoint report for your Azure workloads; these reports are part of a Workshop delivery initiative - Well-Architected Reliability Assessment, usually driven by Microsoft Cloud Solution architects, we can leverage the same toolsets, to help increase visibility across the state of the reliability of our own Azure estate. This is not a set-and-forget tool but a starting point to help you understand the current state of your resources to help you make informed decisions on how to improve the reliability of your Azure workloads and an avenue to highlight Reliability improvements. This process works best when scoped to a specific workload vs an entire environment.

This article is not meant as a substitute for the Microsoft-delivered Resiliency workshop but mainly to help you start your resiliency journey. The information in this article is based on the Azure Proactive Resiliency Library v2 and the scripts provided in the public repository. Please contact your local Microsoft contact for information about their Resiliency workshop and the full delivery.

Azure Proactive Resiliency Library - Assessment

So, why must we consider reliability and resiliency in our Azure workloads?

The answer is simple: ensure that your applications and services are available and reliable, even when failures occur. This is important for business continuity and ensuring that your customers can access your services when needed.

Azure Proactive Resiliency Library - Importance of realibility

When managing resources and workloads in Azure, ensuring reliability is paramount to avoid unexpected failures and downtime. But why exactly do bad things happen, even when there seem to be numerous safeguards in place? Let's delve into reliability and why it's essential for your Azure environment by exploring the "Swiss Cheese Model" of accident causation, illustrated in the image above.

The Swiss Cheese Model, developed by James Reason, is a popular way to understand how errors and failures occur in complex systems. Each defense layerβ€”represented as Swiss cheese slicesβ€”has holes or weaknesses. When these holes align across multiple layers, an accident can occur. Let's break down this model and see how it applies to Azure reliability.

  • Institutional Level: Institutions may need more procedures and regulatory narrowness at the highest level. In the context of Azure, this could mean inadequate governance policies or a lack of compliance with best practices. These gaps can create vulnerabilities that might be exploited, leading to system failures.
  • Organizational Level: Organizations often face mixed messages and production pressures. This could translate to conflicting priorities for Azure between rapid deployment and maintaining robust security and reliability. Pressure to deliver quickly might lead to overlooked reliability checks, creating potential points of failure.
  • Professional Level: Issues such as responsibility shifting can arise at the professional level. In an Azure environment, this might occur when there is a lack of clear ownership over different architecture components, leading to neglected maintenance and updates.
  • Team Level: Teams might suffer from inadequate training. In Azure, this could mean insufficient knowledge about Azure services, leading to suboptimal configuration and resource management. Proper training ensures that teams can effectively manage and troubleshoot their environments.
  • Individual Level: Individual contributors might be distracted or make errors. In Azure, this could involve misconfigurations or overlooking critical logs and alerts. Encouraging a culture of attention to detail and thoroughness can help mitigate these risks.
  • Technical Level: Finally, technical issues such as clumsy technology, coding bugs, and deferred maintenance can directly impact reliability; in Azure, this might involve outdated software versions, unpatched vulnerabilities, or inefficient resource configurations.

By understanding the layers of defense and potential triggers for failures, you can take proactive steps to enhance the reliability of your Azure resources and workloads, ensuring a robust, secure, and efficient cloud environment.

The Azure Proactive Resiliency library and the assessments we will run through will concentrate on the technical level to ensure that your Azure resources are configured correctly and are resilient to failures; however, the other layers of defense are also essential to consider and are a shared responsibility between Microsoft and the customer.

Azure Proactive Resiliency Library - shared responsibility

So - let us get started with the Azure Proactive Resiliency Library and the assessment report.

πŸ“Š Create Resiliency Reports​

The Assessment report is a 3 stage process.

  1. collect the data
  2. analyse the data
  3. create the report

Azure Proactive Resiliency Library - Assessment report

You can do both the data collection in an Azure CloudShell, which is useful for running this in customer environments; however, to create and analyze the data, you must run the script from a computer with Word and Excel installed.

It is worth noting that not all services will be supported by the automated resource graph queries, so you will need to check any services that are not supported manually, and when looking at running this for Production workloads, remember to consider the business context around the resiliency requirements.

πŸ“Š Collect the data​

The Collector PowerShell script is designed to collect data from the Azure environment to help identify potential issues and areas for improvement using the Azure Resource Graph queries within this repository.

ParameterDescriptionExample
TenantIDOptional: tenant to be used.12345678-1234-1234-1234-123456789012
SubscriptionIdsRequired (or SubscriptionsFile): Specifies Subscription(s) to be included in the analysis.12345678-1234-1234-1234-123456789012, 12345678-1234-1234-1234-123456789012
SubscriptionsFileOptional (or SubscriptionIds): Specifies the file with the list to be analyzed (one subscription per line).subscriptions.txt
RunbookFileOptional: Specifies the file with the runbook (selectors & checks) for use.runbook.json
ResourceGroupsOptional: Specifies Resource Group(s) to be included in the analysis.ResourceGroup1,ResourceGroup2
DebugOptional: Writes debugging information for the script during the execution.TRUE
  1. Open an Azure Cloud Shell session
  2. Type in:
wget https://github.com/Azure/Azure-Proactive-Resiliency-Library-v2/raw/main/tools/1_wara_collector.ps1
.\1_wara_collector.ps1 -SubscriptionIds 'YOURSUBID'
  1. Once the script has been completed, you will have a JSON file with the data collected and can proceed to the next step.

1_wara_collector.ps1

πŸ“ˆ Analyse the data​

The Data Analyzer PowerShell script is the second script in the Azure Proactive Resiliency Library (APRL) tooling suite. This tool summarizes the collected data and provides actionable insights into the health and resiliency of the Azure environment.

If you used CloudShell in the previous step, you must download the JSON file to your local machine to run the Data Analyzer script.

(The Data Analyzer script must be run from a Windows Machine with Excel installed.)

WARA_File download

Once the file is downloaded, you can run the Data Analyzer script to create the ExcelPlan.

  1. Open a PowerShell 7 terminal
  2. Download the Data Analyzer script:
https://github.com/Azure/Azure-Proactive-Resiliency-Library-v2/raw/main/tools/2_wara_data_analyzer.ps1
# Define the URL of the file to be downloaded
$url = "https://github.com/Azure/Azure-Proactive-Resiliency-Library-v2/raw/main/tools/2_wara_data_analyzer.ps1"
# Define the path where the file will be saved
$outputPath = "d:\APRL\"
# Use Invoke-WebRequest to download the file
Invoke-WebRequest -Uri $url -OutFile $outputPath

2_wara_data_analyzer.ps1 download

Once downloaded, we can run it against the JSON file from the Data Collector script.

$JSONFile = 'WARA_File_2024-07-06_08_48.json'
.\2_wara_data_analyzer.ps1 -JSONFile $JSONFile

Run Data Analyzer

You should now have an Excel Action Plan with the data from the JSON file, including the recommendations and the current state of the resources.

Excel Action Plan

Worksheet NameDescription
RecommendationsYou will find all Recommendations, their category, impact, description, learn more links, and much more. Columns A and B count the number of Azure Resources associated with RecommendationID.
ImpactedResourcesYou will find a list of Azure Resources associated with a RecommendationID. These Azure Resources do not follow Microsoft's best practices for Reliability.
PivotTableYou will find a couple of pivot tables used to automatically create the charts.
ChartsYou will find three charts that will be used in the Executive Summary PPTx.
ResourceTypesYou will find a list of all ResourceTypes the customer uses, the number of Resources deployed for each one, and if there are Recommendations for the ResourceType in APRL.

There will be resources that are not supported by the Resource Graph queries, so you will need to check these manually. Could you remove or add any recommendations based on your analysis before generating reports?

πŸ“ Create the report​

The Report Generator PowerShell script is the final script in the Azure Proactive Resiliency Library (APRL) tooling suite. Its goal is to generate a Word and PowerPoint report based on the data collected and analyzed by the Collector and Data Analyzer scripts.

Now that we have our main Excel Action plan with the data, we can run the Report Generator script to create the Word and PowerPoint reports.

The previous script, would have downloaded the git repository for the Azure Proactive Library to your local device, so you can run the Report Generator script directly from that, the script needs to be ran.

This script requires specific Microsoft PowerPoint and Word templates:

  • Mandatory - Executive Summary presentation - Template.pptx
  • Optional - Assessment Report - Template.docx

These are stored in the Tools folder of the Azure Proactive Library repository.

The 3_wara_report_generator.ps1 script has the following parameters:

ParameterDescriptionExample Demo Parameters
ExcelFileMandatory: WARA Excel file generated by β€˜2_wara_data_analyzer.ps1’ script and customized.C:\path\to\WARA_Analysis.xlsx
CustomerNameOptional: specifies the Name of the Customer to be added to the PPTx and DOCx files.Contoso Ltd
HeavyOptional: Run the script at a lower pace to handle heavy environments.TRUE
WorkloadNameOptional: specifies the Name of the Workload of the analyses to be added to the PPTx and DOCx files.E-commerce Platform
PPTTemplateFileOptional: specifies the PPTx template file for use as source. If not specified, the script will look for the file in the same path as the script.C:\path\to\Template.pptx
WordTemplateFileOptional: specifies the DOCx template file for use as source. If not specified, the script will look for the file in the same path as the script.C:\path\to\Template.docx
DebuggingOptional: Writes a piece of Debugging information to a log file.TRUE
  1. Open a PowerShell 7 terminal
  2. Run the following script (adjust for your needs, it will look in the same folder that the script is located in for the word and excel templates):
$Excellocation = 'D:\APRL\WARA Action Plan 2024-07-06_21_22.xlsx'
$WorkLoadName = 'eCommerce'
$CustomerName = 'Contoso Ltd'
cd D:\APRL\Azure-Proactive-Resiliency-Library\tools\
.\3_wara_reports_generator.ps1 -ExcelFile $Excellocation -WorkLoadName $WorkLoadName -CustomerName $CustomerName

3_wara_reports_generator.ps1

The script will generate a Word and PowerPoint report with the data from the Excel Action Plan and the recommendations for the Azure resources automatically included. You can then add additional information and recommendations based on the analysis of the data and the business context around the resources.

PowerPoint Report

More detail can then be added to the Word document.

Word Report

πŸ“– References:​