May 20, 2023

Managing Resource Drift in Azure with Terraform: Tracking and Resolving Configuration Drift for Reliable Infrastructure

Infrastructure as Code

This post walks through the process of detecting and managing resource drift in Azure with Terraform. It also covers some best practices and guidelines for putting it into practice.

Introduction

Building a reliable and consistent infrastructure is a fundamental requirement for any organisation that wants to succeed in the cloud. Cloud infrastructure management involves deploying and configuring resources such as virtual machines, storage accounts, and networking components. These resources are increasingly defined and managed through infrastructure-as-code tools such as Terraform.

Despite the benefits of using infrastructure-as-code, cloud infrastructure management is not without its challenges. One of the most common problems that arise is resource drift. Resource drift happens when the actual state of a resource in the cloud differs from its defined or expected state. This can occur due to several reasons, such as manual changes made to the resource, unintended updates caused by third-party tools, or even bugs in the infrastructure code. Resource drift can lead to a variety of issues such as security vulnerabilities, unexpected downtime, and in extreme cases, complete infrastructure failure. Therefore, tracking and resolving configuration drift is essential to ensure reliable and consistent infrastructure.

Tracking resources with Terraform state

Terraform state is a crucial component of the infrastructure-as-code process. Terraform uses the state file, a JSON document, to keep track of the infrastructure resources it manages. It records the last known state of every managed resource, including information such as the resource ID, the resource type, and any associated metadata.

When Terraform creates a plan, it by default first refreshes the state by querying Azure for the real settings of each tracked resource, and then compares the result with the desired state defined in the Terraform configuration. If there is a difference between the two, Terraform proposes to create, update, or delete resources as necessary, and carries out those actions when the plan is applied. In other words, the state file is Terraform's record of which real-world resources it manages and what they looked like the last time it checked.

This is what gives Terraform a mechanism to detect and manage resource drift. Because the state maps every resource block in the configuration to a real Azure resource, a difference between the recorded, actual, and desired settings shows up in the next plan. It is therefore crucial to understand Terraform state and its role in maintaining a reliable and consistent infrastructure.

Tackling drift in resources managed by Terraform

When you’ve identified drift in cloud resources that are managed by Terraform, there are two main ways to tackle it. One way is to reapply your configuration, so that Terraform overwrites the changed settings with the values defined in the configuration files. This approach fits when unwanted changes have been made outside of your Terraform configuration, or when the drift is minor and the configuration still describes the state you want.
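As a minimal sketch of reapplying the configuration, assume a resource group whose environment tag was edited in the Azure portal. The resource name, location, and tag values are hypothetical and only serve as an illustration.

```hcl
# Hypothetical example: names, location, and tag values are assumptions.
resource "azurerm_resource_group" "app" {
  name     = "rg-app-prod"
  location = "westeurope"

  tags = {
    environment = "prod" # Desired value; suppose the portal now shows "test".
  }
}

# Reapplying the configuration:
#   terraform plan    # refreshes the state and should propose an in-place update of the tag
#   terraform apply   # writes the configured value back to Azure
```

Reviewing the plan before applying is the important step here, since it shows exactly which out-of-band changes will be reverted.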

Another way to tackle drift is to refresh the state and update the configuration. This involves refreshing the Terraform state and then updating the Terraform configuration files to match the current state of the resources. This approach is useful when the drift is significant or intentional, and you want the configuration files to reflect the actual state of the resources in the cloud. To refresh the state, you can use the terraform refresh command, which updates the state file with the current state of the resources in the cloud. Note that recent Terraform versions deprecate this command in favour of terraform plan -refresh-only and terraform apply -refresh-only, which show the detected changes and ask for confirmation before writing them to the state. Once the state has been refreshed, you can update the configuration files to match the current state of the resources. Finally, plan and apply the updated configuration; if configuration and infrastructure agree, Terraform reports that no changes are needed.
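The following sketch illustrates this second approach under stated assumptions: a storage account whose replication type was switched from LRS to GRS outside of Terraform, and a team that decided to keep that change. All names and values are hypothetical. The commands in the comments use the -refresh-only mode that recent Terraform releases offer as the replacement for the standalone terraform refresh command.

```hcl
# Hypothetical example: all names and values are assumptions.
resource "azurerm_storage_account" "logs" {
  name                     = "stlogsexample001"
  resource_group_name      = "rg-app-prod"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "GRS" # Was "LRS"; changed outside Terraform and kept.
}

# Suggested sequence:
#   terraform plan -refresh-only    # review what changed in Azure
#   terraform apply -refresh-only   # record those changes in the state file
#   (edit the configuration as shown above)
#   terraform plan                  # should propose no changes once configuration and state agree
```

Updating the configuration rather than the infrastructure is a deliberate choice: it makes the accepted change visible in version control instead of leaving it as silent drift.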

When dealing with mission-critical workloads, it is essential to have fine-grained control over the lifecycle of resources managed by Terraform. One powerful option in such scenarios is the prevent_destroy lifecycle argument. When prevent_destroy is set to true for a resource, Terraform rejects with an error any plan that would destroy that resource. Be aware that the protection only holds while the resource block, including the lifecycle setting, remains in the configuration: if the block is removed entirely, the setting disappears with it and Terraform will plan to destroy the resource. Within that limit, the option acts as a safeguard against accidental deletion, for example when a change made while resolving drift would force a resource to be replaced. This level of protection is beneficial for mission-critical workloads, but could also be considered for other workloads when there is a broader need to avoid potential disruptions.
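A minimal sketch of such a safeguard could look as follows. The storage account and its settings are hypothetical; only the lifecycle block is the point of the example.

```hcl
# Hypothetical example: the resource and its values are assumptions.
resource "azurerm_storage_account" "critical" {
  name                     = "stcriticalexample01"
  resource_group_name      = "rg-app-prod"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "GRS"

  lifecycle {
    # Terraform returns an error for any plan that would destroy this resource,
    # including forced replacements and `terraform destroy`.
    # The guard only applies while this block stays in the configuration.
    prevent_destroy = true
  }
}
```

For protection that does not depend on the Terraform configuration at all, a safeguard on the Azure side would be needed in addition; that is outside the scope of this sketch.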

Tracking resources created outside of Terraform

While Terraform is a powerful tool for infrastructure-as-code management, there may be scenarios where engineers decide to work outside of Terraform. This can happen for several reasons, such as the need to make immediate changes to the infrastructure due to an urgent issue or the use of third-party tools that require manual configuration.

When resources are created outside of Terraform, it can be particularly challenging to track and manage them. If these resources do not exist in the Terraform configuration files or Terraform state file, Terraform is not aware of them by default. This can lead to resource drift, where the actual state of the resource in the infrastructure differs from its defined or expected state. To mitigate the risk of resource drift, organisations can consider using alternative approaches to manage resources that are created outside of Terraform.

One approach is to use Azure Monitor and the Azure Activity Log to monitor changes made to the infrastructure outside of Terraform. The Activity Log records subscription-level control-plane operations, such as creating, updating, or deleting a resource, and can be used to detect changes made to the infrastructure configuration. It does not record read operations. To access and analyse the Activity Log, you can use Azure Monitor, which allows you to query, visualise, and alert on activity log data. If you send the Activity Log to a Log Analytics workspace, you can leverage the query capabilities of Azure Monitor Logs (formerly known as Azure Log Analytics) to search for specific events, filter data based on criteria, and set up alerts to notify you when changes occur.

While this approach does not provide a mechanism for managing resources outside of Terraform, it can be used to detect changes and ensure that they are reflected in the Terraform configuration.
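As a sketch of how the Activity Log could be routed to a Log Analytics workspace for querying, the configuration below creates a subscription-level diagnostic setting. It assumes a recent azurerm provider version that supports the enabled_log block; the workspace name, resource group, and retention period are hypothetical.

```hcl
# Hypothetical example: names, location, and retention are assumptions.
data "azurerm_subscription" "current" {}

resource "azurerm_log_analytics_workspace" "drift" {
  name                = "log-drift-example"
  resource_group_name = "rg-monitoring"
  location            = "westeurope"
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Sends the Administrative category (create, update, and delete operations)
# of the subscription's Activity Log to the workspace.
resource "azurerm_monitor_diagnostic_setting" "activity_log" {
  name                       = "activity-log-to-workspace"
  target_resource_id         = data.azurerm_subscription.current.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.drift.id

  enabled_log {
    category = "Administrative"
  }
}
```

Once the data arrives in the workspace, it can be queried from the AzureActivity table, for example to filter on callers other than the identity that your deployment pipeline uses.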

Another, bolder approach is to implement specific access controls. By restricting resource creation privileges with access controls in Azure Resource Manager and Azure Active Directory, you reduce the likelihood of resources being created outside of Terraform. An efficient way to do this is a workflow in which a Git repository serves as the single source of truth for Terraform code, and only a managed identity or service principal is privileged to apply the code and manage Azure resources. This managed identity or service principal should be used as the authentication method for a deployment pipeline, such as Azure DevOps or GitHub Actions, that integrates with the Git repository. Developers and engineers should be given the ability to make changes to the Terraform code in the Git repository. When changes are pushed to the repository, the deployment pipeline can be triggered automatically, ensuring that the deployment is performed by the privileged managed identity or service principal. Developers and engineers should still have read access to Azure with their personal identities, so they can review the cloud resources. However, their personal identities should not be allowed to create, update, or remove cloud resources directly.

By following this approach, developers and engineers can update the Terraform configuration in the Git repository, but only the managed identity or service principal has permission to deploy and manage Azure resources. This enforces security, governance, and centralised control over the resource provisioning process. It ensures adherence to the desired workflow and prevents direct deployments from personal identities, and with them a major source of resource drift.
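The role split described above can itself be expressed in Terraform. The sketch below is one possible setup, assuming subscription-wide scope and the built-in Contributor and Reader roles; both input variables are hypothetical and would need to be supplied with real object IDs.

```hcl
# Hypothetical inputs: object IDs are assumptions supplied by the caller.
variable "pipeline_principal_id" {
  description = "Object ID of the managed identity or service principal used by the deployment pipeline."
  type        = string
}

variable "engineers_group_id" {
  description = "Object ID of the directory group that contains the developers and engineers."
  type        = string
}

data "azurerm_subscription" "current" {}

# The pipeline identity may create, update, and delete resources.
resource "azurerm_role_assignment" "pipeline_contributor" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"
  principal_id         = var.pipeline_principal_id
}

# Personal identities can only review resources.
resource "azurerm_role_assignment" "engineers_reader" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Reader"
  principal_id         = var.engineers_group_id
}
```

Note that the identity applying this configuration needs permission to create role assignments itself, which the Contributor role does not include.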

It is wise to conduct regular scans or audits of your cloud environment to identify resources that are not managed by Terraform. Use tools like cloud provider APIs, command-line interfaces (CLIs), or third-party discovery tools to detect and list these resources. Consistent tagging and naming conventions in your Terraform configuration make such scans far easier. Regularly review the tags or names of existing resources and compare them against your Terraform configuration. For example, you could add a resource tag ManagedBy with the value Terraform to all your Terraform-managed resources. Any untagged or differently tagged resource may then have been created outside of Terraform.
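One way to apply such a tag consistently is to define it once and merge it into every resource. This is a minimal sketch; the local value name, the resource, and the extra environment tag are hypothetical.

```hcl
# Hypothetical example: names and the additional tag are assumptions.
locals {
  common_tags = {
    ManagedBy = "Terraform"
  }
}

resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "westeurope"

  # Every managed resource merges the shared tags with its own.
  tags = merge(local.common_tags, { environment = "prod" })
}
```

Keeping the tag in a single local value means an audit only has to look for one key and one value, and a typo in an individual resource cannot silently break the convention.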

Establishing a clear process also helps prevent resource drift. When you define and communicate a process within your team or organisation that makes Terraform the primary tool for provisioning and managing infrastructure resources, you reduce the chance that resources are created, modified, or deleted outside of Terraform. The same goes for proper education and training. Comprehensive education and training on Terraform ensures that your team members understand its benefits, workflows, and best practices. This encourages adherence to Terraform and discourages manual resource creation.

Tackling resource drift outside of Terraform

When you’ve identified resources that are not managed by Terraform, you can import them. This involves importing each resource into the Terraform state and creating a Terraform configuration that describes the existing resource. Once that is done, the resource can be managed like any other resource.

To import a resource into Terraform, you can use the terraform import command. This command takes two arguments: the resource address and the resource ID. The resource address is the resource type and name as they appear in your configuration, such as azurerm_virtual_machine.example (where example is a name you choose), and the resource ID is the unique identifier that Azure assigned to the resource when it was created. However, terraform import only brings resources into the state; it does not generate configuration. You have to write a resource configuration block for each resource yourself, which tells Terraform where to map the imported object, and the command expects that block to exist before you run it. Keep the block in place afterwards as well: a resource that is in the state but no longer in the configuration will be planned for destruction the next time you run terraform apply. As an alternative to writing the configuration by hand, you can use a tool that brings existing Azure resources under Terraform’s management, like Azure Export for Terraform. Afterwards, you can modify the resource through the Terraform configuration files, apply changes to it, and track its state using the Terraform state file.
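As an illustration, the sketch below imports an existing resource group. It uses the declarative import block, which is available in Terraform 1.5 and later and therefore postdates the workflow described above; the equivalent CLI command is shown in the comment. The resource name, the all-zero subscription ID, and the resource group name are placeholders, not real values.

```hcl
# Hypothetical example: the subscription ID and names are placeholders.
# Requires Terraform 1.5 or later for the import block.
import {
  to = azurerm_resource_group.legacy
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-legacy"
}

# The configuration must describe the existing resource; adjust the values
# until `terraform plan` shows the import without unexpected changes.
resource "azurerm_resource_group" "legacy" {
  name     = "rg-legacy"
  location = "westeurope"
}

# Equivalent CLI command (resource address first, then resource ID):
#   terraform import azurerm_resource_group.legacy /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-legacy
```

An advantage of the import block is that the import becomes part of a reviewed plan instead of an immediate change to the state.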

Finally, if certain resources cannot be incorporated into Terraform for any reason, document them explicitly within your Terraform configuration files. Use data source blocks or comments to indicate their existence and provide relevant information for future reference. For these resources it might be wise to use the ManagedBy tagging strategy as well, for example with Manual as its value. The tag records that these resources are deliberately not managed by Terraform, and lets you filter them out in your hunt for resource drift.
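A data source is one way to document such a resource in code, because it lets Terraform read the resource without ever managing it. In this sketch the resource group name and the stated reason are hypothetical.

```hcl
# Not managed by Terraform: created manually for a vendor appliance (hypothetical reason).
# Tagged ManagedBy = "Manual" in Azure so that audits can filter it out.
data "azurerm_resource_group" "vendor_appliance" {
  name = "rg-vendor-appliance"
}

# The data source can still be referenced by managed resources or outputs.
output "vendor_appliance_location" {
  value = data.azurerm_resource_group.vendor_appliance.location
}
```

A side effect of this pattern is that a plan fails when the documented resource no longer exists, which gives an early signal that the manual resource has changed.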

Closing words

In this article, we explored the importance of detecting and managing resource drift in Azure with Terraform. We learned about the role of Terraform state in tracking and resolving configuration drift, which is a crucial step in maintaining reliable and consistent infrastructure. We also discussed the challenges of identifying resources that are created outside of Terraform and how to tackle them.

Thank you for taking the time to go through this post and making it to the end. Stay tuned, because we’ll keep providing more content on topics like these in the future.

Author: Rolf Schutten

Posted on: May 20, 2023