Kai Siren's Blog

3 Cloud Standoff: IAM Across AWS, GCP, and Azure

@coilysiren — Mon, 13 Apr 2026 00:00:00 GMT

The hardest part of moving between clouds isn't the compute or the networking. It's figuring out what "identity" even means in a given ecosystem. Maybe your company acquired something that runs on GCP. Maybe the data science team got a budget for Azure OpenAI. Maybe you're just tired of AWS. Either way, you have to relearn IAM from scratch, and the vocabulary is a minefield.

This post maps the three clouds against each other. I'll walk through each cloud's identity fundamentals, then line them up side by side, and finally call out the places where the mental models genuinely diverge (as opposed to just using different words for the same thing).

This post assumes you already understand IAM in at least one cloud. If you don't, start with the AWS IAM docs. They're the most thorough and the concepts transfer reasonably well.

The fundamentals, by cloud

AWS

AWS has the oldest and most-copied identity model, so it's a reasonable starting point.

IAM User. A long-lived identity, usually with an access key ID and secret access key. Meant for humans before SSO existed, and for machines that live outside AWS. Today, best practice is to have as few of these as possible.
IAM Role. An identity with no long-lived credentials. You assume a role via STS (sts:AssumeRole), which hands you back temporary credentials (typically valid for one hour). Roles have a trust policy that says who's allowed to assume them.
IAM Group. A bag of users, used only for attaching policies. Groups are not identities; you can't "log in as" a group.
Instance Profile. A wrapper around a role that lets an EC2 instance use it. The EC2 metadata service (IMDS) hands out temporary credentials based on the attached instance profile.
IAM Policy. The permission document itself. Attached to users, groups, roles, or (for some services) to resources directly.
IAM Identity Center (formerly AWS SSO). The modern way to give humans access. It federates from an external IdP and vends short-lived role sessions.

The basic AWS idea: roles are portable identities that nothing owns. An EC2 instance, a Lambda function, a pod in EKS, and a user from another AWS account can all assume the same role.
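
To make the role, trust policy, and instance profile pieces concrete, here is a minimal Terraform sketch. It assumes the hashicorp/aws provider is already configured; every name in it (example-app, example-bucket) is hypothetical.

```hcl
# Trust policy: who is allowed to assume the role. Here, the EC2 service.
data "aws_iam_policy_document" "trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

# The role itself: an identity with no long-lived credentials.
resource "aws_iam_role" "app" {
  name               = "example-app" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.trust.json
}

# Permission policy: what the role can do once assumed.
resource "aws_iam_role_policy" "read_bucket" {
  name = "read-example-bucket"
  role = aws_iam_role.app.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::example-bucket/*" # hypothetical bucket
    }]
  })
}

# The wrapper that lets an EC2 instance use the role via IMDS.
resource "aws_iam_instance_profile" "app" {
  name = "example-app"
  role = aws_iam_role.app.name
}
```

Note that nothing in the role says which instance will use it. Adding another principal to the trust policy is all it takes for something else to wear the same role.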

GCP

GCP's identity model looks superficially similar to AWS, but the center of gravity is in a totally different place.

Google Account / Workspace User. A human identity, managed entirely outside your GCP project (in Google Workspace or Cloud Identity). You can't create one "inside" a project the way you create an IAM user in AWS.
Service Account. GCP's machine identity. This is the one that trips people up, because a service account is simultaneously:
1. An identity (it has an email like my-sa@my-project.iam.gserviceaccount.com and can be granted roles on resources), and
2. A resource (you can be granted roles on the service account itself, such as roles/iam.serviceAccountTokenCreator to impersonate it).
Google Group. A bag of users. Same idea as AWS IAM groups, but managed in Workspace rather than in GCP itself.
IAM Role. In GCP, a "role" is not an identity. It's a collection of permissions (what AWS would call a managed policy). Roles come in three flavors: basic (Owner/Editor/Viewer, avoid these), predefined (per-service, curated by Google), and custom.
IAM Binding / Allow Policy. The grant itself. On each resource, there's an "allow policy" that says "principal X has role Y on this resource." This is the opposite direction from AWS, where policies are usually attached to the principal.
Workload Identity Federation. Lets external identities (like a GitHub Actions OIDC token or an AWS IAM role) act as a GCP service account without a long-lived JSON key.

The basic GCP idea: permissions are granted on resources, not on principals. You walk up to a resource and say "who can do what here." That's the opposite of AWS, where you walk up to a principal and ask "what can this identity do."
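
The "identity and resource at once" nature of service accounts is easier to see in configuration. This is a minimal Terraform sketch, assuming a configured hashicorp/google provider; the bucket name and the user email are hypothetical.

```hcl
resource "google_service_account" "app" {
  account_id   = "example-app" # hypothetical; becomes example-app@<project>.iam.gserviceaccount.com
  display_name = "Example app"
}

# The service account as an IDENTITY: the binding lives on the bucket,
# and the service account is the member being granted the role.
resource "google_storage_bucket_iam_member" "app_reads_bucket" {
  bucket = "example-bucket" # hypothetical bucket
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.app.email}"
}

# The service account as a RESOURCE: the binding lives on the service
# account, and a human is the member being granted the role.
resource "google_service_account_iam_member" "alice_impersonates_app" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "user:alice@example.com" # hypothetical user
}
```

Both grants are attached to the thing being accessed (the bucket, then the service account), which is the resource-first direction described above.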

Azure

Azure has the most confusing vocabulary of the three, largely because the identity layer (Entra ID, formerly Azure AD) and the resource layer (Azure Resource Manager) were duct-taped together later.

User. A human identity in an Entra ID tenant. Managed at the tenant level, not the subscription level.
App Registration. The definition of an application in Entra ID. It lives in the tenant where the app was created ("home tenant") and describes the app's identity, permissions, and secrets. This is just a blueprint; it's not the thing that actually signs in.
Service Principal. The instance of an app registration in a given tenant. The service principal is what actually gets assigned permissions and signs in. If you see the Azure portal distinguish between "App registrations" and "Enterprise applications," that's this split: app registrations are definitions, enterprise applications are service principals.
Managed Identity. A service principal whose credentials are managed entirely by Azure. You never see a secret. Comes in two flavors: system-assigned (tied 1:1 to the lifecycle of a single Azure resource) and user-assigned (a standalone resource you can attach to multiple things).
Azure RBAC Role. Like GCP, this is a permission set, not an identity. Comes as built-in roles (curated by Microsoft) or custom roles.
Role Assignment. The three-tuple of (principal, role, scope) that actually grants access. Scope can be management group, subscription, resource group, or individual resource.

The basic Azure idea: identity and resource access are two separate systems stitched together by role assignments. The identity lives in Entra ID; the thing it's allowed to do lives in Azure RBAC; the role assignment is the bridge.
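
The (principal, role, scope) three-tuple maps almost one-to-one onto a single Terraform resource. A minimal sketch, assuming the hashicorp/azurerm provider at version ~> 3; the resource names are hypothetical.

```hcl
resource "azurerm_resource_group" "example" {
  name     = "example" # hypothetical
  location = "westus"
}

# The identity. It is backed by a service principal in Entra ID,
# but Azure manages the credential.
resource "azurerm_user_assigned_identity" "app" {
  name                = "example-app"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

# The bridge between the two systems: principal + role + scope.
resource "azurerm_role_assignment" "app_reads_rg" {
  principal_id         = azurerm_user_assigned_identity.app.principal_id
  role_definition_name = "Reader"
  scope                = azurerm_resource_group.example.id
}
```

The scope here is a resource group, but the same argument accepts a management group, subscription, or individual resource ID.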

The side-by-side view

Here's the side-by-side mapping. Rows are concepts; cells are the closest equivalent in each cloud.

Concept	AWS	GCP	Azure
Human identity	IAM User or Identity Center user	Google / Workspace account	Entra ID user
Grouping of humans	IAM Group	Google Group	Entra ID security group
Long-lived machine identity	IAM User with access keys	Service Account with JSON key	Service Principal with client secret or certificate
Credential-less workload identity	IAM Role (via instance profile, IRSA, etc.)	Service Account attached to a workload	Managed Identity
Permission set / definition	Managed or inline IAM policy	IAM Role (predefined or custom)	Azure RBAC role definition
Permission grant	Policy attached to principal (or resource)	IAM binding on a resource	Role assignment `(principal, role, scope)`
"Assume another identity"	`sts:AssumeRole`	Service account impersonation	No direct equivalent (see below)
Federation from external IdP	IAM Identity Center, SAML, OIDC provider	Workload Identity Federation	Entra ID external federation / B2B
Temporary credentials source	STS	Metadata server (OAuth2 access tokens)	IMDS (bearer tokens)
Kubernetes workload identity	IRSA or EKS Pod Identity	GKE Workload Identity	Entra Workload Identity
Cross-boundary access	Role trust policy allowing external account	Grant role to principal from another project	Guest users or multi-tenant app registrations

Where the mental models actually diverge

The table above makes things look tidier than they are. If you're going to get tripped up, it'll be on one of the following.

1. Which direction do permissions flow?

This is the biggest one.

AWS: Policies mostly attach to the principal. "This user / role can do X, Y, Z across these resources." Resource-based policies (S3 bucket policies, KMS key policies, etc.) exist as an escape hatch, but the default direction is principal-first.
GCP: Bindings attach to the resource. Every resource has an allow policy listing which principals have which roles. You almost never "look up what a service account can do" as a top-level operation. You look up resources and see who's on them.
Azure: Role assignments are a three-tuple stored in the subscription. You can query them by principal, by role, or by scope with roughly equal effort. In practice, most teams think about them resource-first, like GCP.

Why this matters: your mental debugging process is different. In AWS, "why can't this thing do X?" usually starts with aws iam get-role-policy. In GCP, it usually starts with gcloud projects get-iam-policy on the target resource. In Azure, it's az role assignment list --assignee <principal> or --scope <resource>.

2. Is the machine identity an "it" or a "thing"?

AWS IAM roles are abstract. Nothing owns them. Multiple EC2 instances, Lambdas, and EKS pods can all assume the same role simultaneously, and the role doesn't know or care.
GCP service accounts are real resources. They live in a project, they have an email, you can grant IAM roles on them (like roles/iam.serviceAccountUser to let someone attach the SA to a VM), and you can impersonate them if you have the right permission. They feel more like "a user that happens to be a robot" than AWS's "a hat anyone can wear."
Azure managed identities sit in the middle. System-assigned managed identities are tied to the lifecycle of exactly one Azure resource. Delete the VM, the identity is gone. User-assigned managed identities are standalone Azure resources that you can attach to multiple things, which makes them closer to AWS roles.

If you're coming from AWS and expect to "just assume the role from wherever," GCP and Azure will feel more constrained. If you're coming from GCP and expect your identity to have an email you can grant things to, AWS roles will feel invisible in a way that's hard to get used to.

3. Where does "a machine's identity" actually come from at runtime?

All three clouds have a metadata endpoint that hands out short-lived credentials to workloads running on their compute. The mechanism is remarkably similar across all three:

AWS: IMDSv2 at 169.254.169.254, returns STS credentials for the attached instance profile.
GCP: Metadata server at metadata.google.internal (also 169.254.169.254), returns OAuth2 access tokens for the attached service account.
Azure: IMDS at 169.254.169.254, returns bearer tokens for the attached managed identity.

The tokens are all short-lived: roughly an hour on GCP, and anywhere from a few hours up to a day on AWS and Azure, depending on the service. They all require the SDK (or your own code) to refresh before expiry. They all fail in subtle ways if the workload's network egress to 169.254.169.254 is blocked, which is a fun thing to remember the first time you put a service mesh in front of a pod.

4. Kubernetes workload identity is a zoo

If you run Kubernetes on any of these, you'll eventually need to map a Kubernetes service account (KSA) to a cloud identity, so that pods can call cloud APIs without a mounted secret.

AWS has two options that both work. IRSA uses an OIDC provider on the EKS cluster; pods get STS tokens via a projected service account token. EKS Pod Identity is the newer mechanism that removes the OIDC provider setup. IRSA is still more widely deployed.
GCP has Workload Identity, which maps a KSA to a GCP service account via an annotation. Under the hood it's also OIDC-based, but you don't manage the provider.
Azure has Entra Workload Identity, which replaced the deprecated AAD Pod Identity. Like the others, it's OIDC-based. AKS publishes an issuer, and the federated credential on an Entra app or managed identity trusts tokens from that issuer bound to a specific KSA.

All three are "the same idea" (projected service account tokens + OIDC federation), but each has its own setup tax. Expect to do some real reading the first time you touch a new one.
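
As one illustration of the setup tax, here is roughly what the GKE side looks like in Terraform. This is a sketch assuming a configured hashicorp/google provider and a cluster with Workload Identity already enabled; the variable, namespace, and KSA names are hypothetical.

```hcl
variable "project_id" {
  type = string # hypothetical input: the project that hosts the GKE cluster
}

resource "google_service_account" "gke_app" {
  account_id = "example-gke-app" # hypothetical
}

# Let one specific Kubernetes service account act as the GCP service account.
# The member string encodes the cluster's workload pool, the namespace,
# and the KSA name.
resource "google_service_account_iam_member" "ksa_binding" {
  service_account_id = google_service_account.gke_app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[example-namespace/example-ksa]"
}
```

The Kubernetes half is the annotation mentioned above: the KSA carries `iam.gke.io/gcp-service-account` set to the service account's email. IRSA and Entra Workload Identity need an equivalent pair of cloud-side trust and cluster-side annotation, with different names for each piece.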

5. Long-lived credentials: dangerous in different ways

Every cloud has a "here be dragons" credential type:

AWS IAM user access keys are the oldest of these, and the worst-behaved. They live forever, they're often committed to git, and they're the reason gitleaks exists. Modern AWS guidance is: don't create them at all. Use Identity Center for humans and roles for machines.
GCP service account JSON keys are the same problem in a different wrapper. They look innocuous (just a JSON file!) but they're long-lived bearer tokens that bypass every "no external access" control you have. GCP now disables key creation by default at the org level, and for good reason. Prefer impersonation or Workload Identity Federation.
Azure service principal client secrets have the same failure mode. Prefer certificate-based auth or (ideally) managed identities, which never expose a credential in the first place.

The common thread: any time you're holding a credential that doesn't have an expiry measured in hours, you're holding a bomb.

A worked example: "give this CI job read access to one bucket"

To make the differences concrete, here's the same task expressed three ways.

AWS: create an IAM role with a policy granting s3:GetObject on arn:aws:s3:::my-bucket/*, and a trust policy that allows your CI's OIDC provider (e.g. token.actions.githubusercontent.com) to assume it for a specific repo/branch. The CI job calls sts:AssumeRoleWithWebIdentity and gets temporary credentials.
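
A minimal Terraform sketch of the AWS version, assuming GitHub Actions as the CI system and an IAM OIDC provider for it that already exists in the account. The variable, repo, and role names are hypothetical.

```hcl
variable "github_oidc_provider_arn" {
  type = string # hypothetical input: ARN of an existing IAM OIDC provider for GitHub
}

# Trust policy: only tokens from one repo and branch may assume the role.
data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/example-repo:ref:refs/heads/main"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "ci" {
  name               = "example-ci-read"
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}

# Permission policy: read objects in one bucket, nothing else.
resource "aws_iam_role_policy" "ci_read" {
  name = "read-my-bucket"
  role = aws_iam_role.ci.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::my-bucket/*"
    }]
  })
}
```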

GCP: create a Workload Identity Pool and Provider for your CI's OIDC issuer. Grant the external identity (e.g. a specific GitHub repo) the roles/storage.objectViewer role directly on the bucket, or have it impersonate a service account that has the role. The CI job exchanges its OIDC token for a GCP access token at the STS endpoint.
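
A minimal Terraform sketch of the GCP version, using the direct-grant option rather than service account impersonation. It assumes GitHub Actions as the CI system and a configured hashicorp/google provider; the pool, repo, and bucket names are hypothetical.

```hcl
resource "google_iam_workload_identity_pool" "ci" {
  workload_identity_pool_id = "example-ci" # hypothetical
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.ci.workload_identity_pool_id
  workload_identity_pool_provider_id = "github"

  # Map claims from the GitHub OIDC token onto attributes GCP can match on.
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Reject tokens from any other repository at the provider level.
  attribute_condition = "assertion.repository == \"example-org/example-repo\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# The grant lives on the bucket; the member is the external identity.
resource "google_storage_bucket_iam_member" "ci_reads_bucket" {
  bucket = "my-bucket"
  role   = "roles/storage.objectViewer"
  member = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.ci.name}/attribute.repository/example-org/example-repo"
}
```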

Azure: create an app registration (or user-assigned managed identity) with a federated credential trusting your CI's OIDC issuer for a specific repo/branch. Give the resulting service principal the Storage Blob Data Reader role, scoped to the storage account or container. The CI job exchanges its OIDC token for an Azure access token.
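
A minimal Terraform sketch of the Azure version, using the user-assigned managed identity option. It assumes GitHub Actions as the CI system and the hashicorp/azurerm provider at version ~> 3; the variable, repo, and resource names are hypothetical.

```hcl
variable "storage_account_id" {
  type = string # hypothetical input: resource ID of the storage account to read
}

resource "azurerm_resource_group" "ci" {
  name     = "example-ci" # hypothetical
  location = "westus"
}

resource "azurerm_user_assigned_identity" "ci" {
  name                = "example-ci"
  location            = azurerm_resource_group.ci.location
  resource_group_name = azurerm_resource_group.ci.name
}

# Federation trust: tokens from one repo and branch may act as the identity.
resource "azurerm_federated_identity_credential" "github" {
  name                = "github-main"
  resource_group_name = azurerm_resource_group.ci.name
  parent_id           = azurerm_user_assigned_identity.ci.id
  issuer              = "https://token.actions.githubusercontent.com"
  subject             = "repo:example-org/example-repo:ref:refs/heads/main"
  audience            = ["api://AzureADTokenExchange"]
}

# The role assignment: principal + role + scope.
resource "azurerm_role_assignment" "ci_reads_blobs" {
  principal_id         = azurerm_user_assigned_identity.ci.principal_id
  role_definition_name = "Storage Blob Data Reader"
  scope                = var.storage_account_id
}
```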

Notice the shape is identical: OIDC issuer, federation trust, role grant, scoped resource. The words and the order of operations are all different. This is why "I know AWS IAM, how hard can GCP be?" is a trap. The concepts transfer, but the muscle memory doesn't.

Where to go next

A side-by-side view only gets you to the point of being able to read signs. To actually work in a second cloud, go read the primary docs for its identity system end to end:

AWS: IAM User Guide
GCP: IAM overview
Azure: Azure RBAC overview and Entra ID fundamentals. You need both, because they're genuinely two separate systems.

And when you get stuck debugging a permission error in a cloud you don't live in full-time, remember which direction that cloud flows permissions. Half the "why isn't this working" problems across clouds come from looking at the wrong end of the grant.

This post was drafted autonomously by Claude (Opus 4.6), then edited by me for tone and voice. The technical claims are mine to stand behind. If you find an error, let me know.

Deploying Azure OpenAI via Terraform

@coilysiren — Wed, 11 Oct 2023 00:00:00 GMT

This post is aimed at reasonably experienced engineers who are deploying Azure OpenAI for their day job. It assumes you're familiar with:

cloud deployments in general
terraform specifically

As such, this post functions primarily as a reference for the terraform configuration you would need. If you are looking for a more basic understanding of how terraform works, then HashiCorp has some great tutorials for you!

Prerequisites

In order to get started here, you need to have already done a few things:

Set up terraform locally with the state backend that most fits your needs.
Signed up for an Azure account.
- Circa October 2023: a simple personal Azure account will not work here. The Azure OpenAI signup process requires that your Azure account have enterprise support. I would not recommend creating a totally new Azure account as a part of this process.
Created a subscription within Azure. This post assumes you are using a paid subscription, which will need to be set up by someone in your company with billing permissions.
Installed the Azure CLI, and logged into it.
Been granted an Azure role like the Contributor role that allows you to perform the relevant API actions within Azure. While it would be ideal to mention the fine-grained access control you need to perform these actions, that is out of scope for this blog post. Someone with the Owner role on your subscription should be able to grant you the Contributor role.

The above steps will likely require the assistance of your finance and IT teams. Feel free to come back to this post once you've finished coordinating with them!

After setting up all of the above, you should have all the fundamentals you need to deploy things! Let's go...

Security Preface

...okay wait.

Before I mention the terraform itself, I must give an important caveat. The terraform configuration described here is in its least secure configuration. Specifically, it's in its least secure configuration with regards to network security. If you are following this configuration as-is, then you should only be doing so as a prototype. Essentially, you are deploying this to prove to your stakeholders, "yes I have the skills required to deploy Azure OpenAI via Terraform". You must then follow up by starting work on the network security improvements.

Having said that. Let's go...

Terraform Configuration

...deploy this thing! This blog post presents the configuration as two terraform files. This is for display simplicity, and you really shouldn't be just stuffing everything into two adjacent files like this. The files have the following folder structure:

# folder structure

terraform/main.tf
terraform/modules/azure-openai/main.tf

And here are the files:

# file: terraform/main.tf

# Set required versions.
#
# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3"
    }
  }
}

# Configure your Azure authentication. At my job we use multiple providers with
# the `alias` key to configure prod vs non-prod resources.
#
# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
provider "azurerm" {
  features {}
}

# Here we init our module which contains all of our actual resources. The keys
# here (eg. stage, vnet_cidr, etc) are all defined in our next file.
#
# docs: https://developer.hashicorp.com/terraform/language/modules
module "azure_openai" {
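  # Local module paths must start with "./" or "../", otherwise terraform
  # treats the source as a registry address.
  source = "./modules/azure-openai"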

  # eg. prod, staging, dev, etc...
  stage = "prod"

  # Pick somewhere close to you, or close to your customers.
  # Use the following command to get locations:
  #
  #   $ az account list-locations -o table
  location = "westus"

  # The SKU is the pricing tier. I haven't been able to find a page or command
  # that provides a flat list of all the available SKUs. This page describes some
  # of them, though:
  #
  # https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity
  cognitive_sku = "S0"

  # This configuration is for our network architecture. A complete description
  # of what these numbers mean, and how to set them, is beyond the scope of this
  # post. But I will try my best to describe it in brief!
  #
  # A "vnet" is a virtual network group. This network group needs an address,
  # similar to a street address. The "cidr" is that address, represented as a range.
  # A "subnet" or "sub network" is simply a sub group of the broader vnet.
  # A vnet is like a house. The vnet cidr is the address to that house. The
  # subnets are individual rooms within that house. The subnet cidr is the address
  # for each room. The cidrs are all ranges, so the subnet cidrs are ranges contained
  # within the vnet range. You can use a website like https://cidr.xyz/ to confirm this.
  #
  # ...This was a lot. It deserves its own post...!
  vnet_cidr    = "10.0.0.0/19"
  subnet0_cidr = "10.0.0.0/24"
  subnet1_cidr = "10.0.1.0/24"
  subnet2_cidr = "10.0.10.0/24"
  subnet3_cidr = "10.0.11.0/24"
}
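
The claim in the comments above, that each subnet cidr is a range contained within the vnet range, can also be checked without leaving terraform. This is a small sketch using the built-in cidrsubnet function; the locals block is illustrative only and is not part of the configuration in this post.

```hcl
locals {
  vnet_cidr = "10.0.0.0/19"

  # A /19 plus 5 extra prefix bits gives /24 subnets. The third argument
  # picks which /24 inside the vnet range you get back.
  subnet0_cidr = cidrsubnet(local.vnet_cidr, 5, 0)  # "10.0.0.0/24"
  subnet1_cidr = cidrsubnet(local.vnet_cidr, 5, 1)  # "10.0.1.0/24"
  subnet2_cidr = cidrsubnet(local.vnet_cidr, 5, 10) # "10.0.10.0/24"
  subnet3_cidr = cidrsubnet(local.vnet_cidr, 5, 11) # "10.0.11.0/24"
}
```

Because cidrsubnet derives each subnet from the vnet range, anything it returns is guaranteed to sit inside that range. You can try the expressions one at a time in `terraform console`.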

# file: terraform/modules/azure-openai/main.tf

# Set required versions. The module probably doesn't need to do this
# when the parent context is already doing it. But it's here anyway, can't hurt.
#
# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3"
    }
  }
}

# These variables are the inputs for our module. Their context is best understood
# by looking at where they are used inside this module.
#
# docs: https://developer.hashicorp.com/terraform/language/values/variables
variable "stage"          { type = string }
variable "location"       { type = string }
variable "cognitive_sku"  { type = string }
variable "vnet_cidr"      { type = string }
variable "subnet0_cidr"   { type = string }
variable "subnet1_cidr"   { type = string }
variable "subnet2_cidr"   { type = string }
variable "subnet3_cidr"   { type = string }

###########
# PREFACE #
###########

# Past this point, documentation is mostly non-existent on my part.
# This is primarily due to the wall clock time I had available to write this post.
# All of these resources do deserve documentation to some extent!

####################
# SHARED RESOURCES #
####################

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/resource_group
resource "azurerm_resource_group" "default" {
  name     = "azure-openai-${var.stage}"
  location = var.location
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/virtual_network
resource "azurerm_virtual_network" "default" {
  name                = "azure-openai-${var.stage}"
  address_space       = [var.vnet_cidr]
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/route_table
resource "azurerm_route_table" "default" {
  name                = "azure-openai-${var.stage}"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/route
resource "azurerm_route" "local" {
  name                = "local"
  resource_group_name = azurerm_resource_group.default.name
  route_table_name    = azurerm_route_table.default.name
  address_prefix      = var.vnet_cidr
  next_hop_type       = "VnetLocal"
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/route
resource "azurerm_route" "internet" {
  name                = "internet"
  resource_group_name = azurerm_resource_group.default.name
  route_table_name    = azurerm_route_table.default.name
  address_prefix      = "0.0.0.0/0"
  next_hop_type       = "Internet"
}

####################
# PUBLIC RESOURCES #
####################

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = azurerm_resource_group.default.name
  virtual_network_name = azurerm_virtual_network.default.name
  address_prefixes     = [var.subnet2_cidr]
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
resource "azurerm_subnet" "subnet3" {
  name                 = "subnet3"
  resource_group_name  = azurerm_resource_group.default.name
  virtual_network_name = azurerm_virtual_network.default.name
  address_prefixes     = [var.subnet3_cidr]
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_route_table_association
resource "azurerm_subnet_route_table_association" "subnet2" {
  subnet_id      = azurerm_subnet.subnet2.id
  route_table_id = azurerm_route_table.default.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_route_table_association
resource "azurerm_subnet_route_table_association" "subnet3" {
  subnet_id      = azurerm_subnet.subnet3.id
  route_table_id = azurerm_route_table.default.id
}

#####################
# PRIVATE RESOURCES #
#####################

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
resource "azurerm_subnet" "subnet0" {
  name                                          = "subnet0"
  resource_group_name                           = azurerm_resource_group.default.name
  virtual_network_name                          = azurerm_virtual_network.default.name
  address_prefixes                              = [var.subnet0_cidr]
  service_endpoints                             = ["Microsoft.CognitiveServices"]
  private_link_service_network_policies_enabled = true
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet
resource "azurerm_subnet" "subnet1" {
  name                                          = "subnet1"
  resource_group_name                           = azurerm_resource_group.default.name
  virtual_network_name                          = azurerm_virtual_network.default.name
  address_prefixes                              = [var.subnet1_cidr]
  service_endpoints                             = ["Microsoft.CognitiveServices"]
  private_link_service_network_policies_enabled = true
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_route_table_association
resource "azurerm_subnet_route_table_association" "subnet0" {
  subnet_id      = azurerm_subnet.subnet0.id
  route_table_id = azurerm_route_table.default.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_route_table_association
resource "azurerm_subnet_route_table_association" "subnet1" {
  subnet_id      = azurerm_subnet.subnet1.id
  route_table_id = azurerm_route_table.default.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/public_ip
resource "azurerm_public_ip" "nat0" {
  name                = "azure-openai-${var.stage}-nat0"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/nat_gateway
resource "azurerm_nat_gateway" "nat0" {
  name                = "azure-openai-${var.stage}-nat0"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  sku_name            = "Standard"
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/nat_gateway_public_ip_association
resource "azurerm_nat_gateway_public_ip_association" "nat0" {
  nat_gateway_id       = azurerm_nat_gateway.nat0.id
  public_ip_address_id = azurerm_public_ip.nat0.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_nat_gateway_association
resource "azurerm_subnet_nat_gateway_association" "nat0" {
  subnet_id      = azurerm_subnet.subnet0.id
  nat_gateway_id = azurerm_nat_gateway.nat0.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/public_ip
resource "azurerm_public_ip" "nat1" {
  name                = "azure-openai-${var.stage}-nat1"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/nat_gateway
resource "azurerm_nat_gateway" "nat1" {
  name                = "azure-openai-${var.stage}-nat1"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  sku_name            = "Standard"
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/nat_gateway_public_ip_association
resource "azurerm_nat_gateway_public_ip_association" "nat1" {
  nat_gateway_id       = azurerm_nat_gateway.nat1.id
  public_ip_address_id = azurerm_public_ip.nat1.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_nat_gateway_association
resource "azurerm_subnet_nat_gateway_association" "nat1" {
  subnet_id      = azurerm_subnet.subnet1.id
  nat_gateway_id = azurerm_nat_gateway.nat1.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_dns_zone
resource "azurerm_private_dns_zone" "openai" {
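  # Private endpoint DNS records only resolve when the zone carries the
  # service's exact privatelink name, so there is no stage prefix here.
  name                = "privatelink.openai.azure.com"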
  resource_group_name = azurerm_resource_group.default.name
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_dns_zone_virtual_network_link
resource "azurerm_private_dns_zone_virtual_network_link" "openai" {
  name                  = "azure-openai-${var.stage}"
  resource_group_name   = azurerm_resource_group.default.name
  private_dns_zone_name = azurerm_private_dns_zone.openai.name
  virtual_network_id    = azurerm_virtual_network.default.id
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint
resource "azurerm_private_endpoint" "private0" {
  name                = "azure-openai-${var.stage}-private0"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  subnet_id           = azurerm_subnet.subnet0.id

  private_service_connection {
    name                           = "azure-openai-${var.stage}-private0"
    private_connection_resource_id = azurerm_cognitive_account.private.id
    subresource_names              = ["account"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.openai.id]
  }
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint
resource "azurerm_private_endpoint" "private1" {
  name                = "azure-openai-${var.stage}-private1"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  subnet_id           = azurerm_subnet.subnet1.id

  private_service_connection {
    name                           = "azure-openai-${var.stage}-private1"
    private_connection_resource_id = azurerm_cognitive_account.private.id
    subresource_names              = ["account"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.openai.id]
  }
}

#######################
# COGNITIVE RESOURCES #
#######################

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/cognitive_account
resource "azurerm_cognitive_account" "private" {
  name                               = "azure-openai-${var.stage}"
  location                           = azurerm_resource_group.default.location
  resource_group_name                = azurerm_resource_group.default.name
  kind                               = "OpenAI"
  sku_name                           = var.cognitive_sku
  outbound_network_access_restricted = false
  local_auth_enabled                 = true

  public_network_access_enabled = true
  # public_network_access_enabled is a misleading setting name. It is best understood in
  # connection with the network_acls.defeault_action setting. Here's effect of various
  # combinations of these two settings:
  #
  # public_network_access_enabled: true, default_action: Allow
  #   - Accessible anyhere from the internet. THIS IS DANGEROUS.
  #
  # public_network_access_enabled: true, default_action: Deny
  #   - Accessible from the specified IP ranges in `ip_rules`.
  #
  # public_network_access_enabled: false, default_action: Allow
  #   - Adding `ip_rules` has no effect, they're fake news.
  #
  # public_network_access_enabled: false, default_action: Deny
  #   - Same as above.
  #
  # At a high level, this setting only needs to be `true` when you haven't yet
  # set up private network ingress points from your primary network group (vnet, VPC, VPN, etc)
  # to this new vnet. Explaining that in detail is beyond the scope of this post!

  custom_subdomain_name = "azure-openai-${var.stage}-v1"
  # custom_subdomain_name has a version incrementor because it's a global resource,
  # and doesn't always get deleted immediately when the resource is destroyed.
  # The error message you get will look something like:
  #
  # > The subdomain name ... is not available as it's already used by a resource
  #
  # When I get that message, I simply increment the version.

  network_acls {
    default_action = "Deny"
    ##########################################
    # !!! IMPORTANT SECURITY ACTION ITEM !!! #
    ##########################################
    # This is a "prototype" configuration where you just grab your IP address
    # via `$ curl http://ifconfig.me` and stick it right in this file. This is bad
    # security practice and is only fit for proving to your stakeholders that
    # you are skilled enough to deploy Azure OpenAI via terraform.
    #
    # What you want to do is set up a peering connection from your main network groups
    # to the new network groups created by this terraform file. You will need
    # to set up the peering connection to support network requests from humans
    # (via a VPN or similar) and network requests from automated services
    # (via another network group, like a different vnet).
    #
    # Once you do that, you can set `public_network_access_enabled = false`,
    # because the network requests will be coming from private IPs routed through
    # your peering connection. Then you will set these ip_rules to private cidrs.
    ip_rules = [
      "255.255.255.255", # <= !!! your IP address goes here !!!
    ]
    virtual_network_rules {
      subnet_id = azurerm_subnet.subnet0.id
    }
    virtual_network_rules {
      subnet_id = azurerm_subnet.subnet1.id
    }
  }
}

# docs: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/cognitive_deployment
resource "azurerm_cognitive_deployment" "private" {
  name                 = "azure-openai-${var.stage}"
  cognitive_account_id = azurerm_cognitive_account.private.id
  model {
    format  = "OpenAI"
    name    = "gpt-35-turbo"
    version = "0301"
  }

  scale {
    type = "Standard"
  }
}

All of that! Should be deploy-able with a terraform init && terraform apply without requiring much additional configuration. With, of course, the significant exception of putting your IP address into the ip_rules.
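The comment table on public_network_access_enabled can be read as a small truth table. Here's a minimal Python sketch of it, purely as a reading aid. The function name `effective_ingress` and the labels it returns are made up for this post; they don't correspond to anything in the azurerm provider or the Azure API.

```python
def effective_ingress(public_network_access_enabled: bool, default_action: str) -> str:
    """Summarize the setting combinations described in the terraform comments.

    Hypothetical helper: the returned labels are for illustration only,
    they are not values that Azure reports anywhere.
    """
    if default_action not in ("Allow", "Deny"):
        raise ValueError(f"unexpected default_action: {default_action!r}")
    if not public_network_access_enabled:
        # per the comments, ip_rules have no effect in either of these cases
        return "ip-rules-ignored"
    if default_action == "Allow":
        return "open-to-internet"  # the dangerous combination
    return "ip-rules-only"


for enabled in (True, False):
    for action in ("Allow", "Deny"):
        print(enabled, action, "->", effective_ingress(enabled, action))
```

The configuration in this post is the `True` / `"Deny"` row, which is why the ip_rules entry matters so much.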

I don't expect this example will work quite so easily if you are working with existing architecture. If you're looking for another example terraform configuration to compare against your architecture, I would recommend this GitHub repo from Azure:

https://github.com/Azure-Samples/azure-openai-terraform-deployment-sample

Good luck! And do follow up on those network security improvements, dear reader.

On Permissions Models for Cloud Platform Providers

@coilysiren — Sun, 12 Jan 2020 00:00:00 GMT

I was recently reflecting on the fact that AWS IAM has the best permissions model on the market, and wondering why I think most other platforms have such a dramatically inferior setup. I had never put much thought into this prior, so my first conclusion was "wow everyone but Amazon is bad at this!".

Except that's not actually how the world works 🙂 In reality there are concrete tradeoffs and decisions that lead to products being in a certain state. And in situations like this? A well designed foundation will get you 90% of the way there. With that context I imagined: how would I design a platform such that future people working on it (eg. without my direct assistance) would replicate the success and effectiveness of IAM? This post is me taking a shot at that.

Guiding Principles

To start, there are a few principles we want to establish. The principles establish the direction and essence of our foundation, and all of the large design choices will flow from the principles. Some of the principles are lifted verbatim from AWS, which is to be expected since AWS is the clear leader in its realm. Here they are:

all services must interact exclusively through public APIs
all service endpoints and clients must be created and validated with code generation
all service operations must take fully qualified unique resource identifiers (RSIDs) as inputs
all service operations must only use the resources passed in as RSIDs during their operation
every unique independent resource within a service must be able to be specified via an RSID
if a resource is a child of another resource, that must be encoded into the RSID
every resource is a child of an "organization", which generally represents a company
every unique independent action that can be taken on a service must have a unique action identifier (ACID)
ACIDs must follow the Create Read Update Delete (CRUD) naming scheme, but may also include additional verbs such as "List"
IAM is the only service that gets any degree of special treatment
the code generation must wrap every service operation with the same IAM checks
the platform must be capable of running itself
service usage can be simplified for customers by creating aliases that group operations, but never by compromising on the functionality of the operations themselves
everything defaults to allowing no access
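Two of these principles ("the code generation must wrap every service operation with the same IAM checks" and "everything defaults to allowing no access") can be sketched in a few lines of Python. This is a toy under stated assumptions, not a design: the `GRANTS` table, the `iam_checked` decorator, and the identifier strings are all hypothetical placeholders, and a real platform would generate the wrapper rather than write it by hand.

```python
import functools

# Hypothetical grant table: (user, action id, resource id) tuples that are allowed.
# Anything not listed here is denied, which is the "defaults to no access" principle.
GRANTS = {
    ("alice", "store:read-item", "org/acme/item/1"),
}


class AccessDenied(Exception):
    """Raised when the shared IAM check rejects a call."""


def iam_checked(action_id):
    """Wrap a service operation so the same IAM check always runs first."""
    def decorator(operation):
        @functools.wraps(operation)
        def wrapper(user, resource_id, *args, **kwargs):
            if (user, action_id, resource_id) not in GRANTS:
                raise AccessDenied(f"{user} may not {action_id} on {resource_id}")
            return operation(user, resource_id, *args, **kwargs)
        return wrapper
    return decorator


@iam_checked("store:read-item")
def read_item(user, resource_id):
    # the operation only touches the resource it was handed
    return f"contents of {resource_id}"


print(read_item("alice", "org/acme/item/1"))
try:
    read_item("bob", "org/acme/item/1")
except AccessDenied as err:
    print("denied:", err)
```

The point of the decorator shape is that the operation body never gets a chance to skip the check.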

An example case

With the principles above, I want to detail an example case. Say our cloud platform (cloudOps) just launched, and we want to provide a database service (kittenDB) to our clients. Our platform itself also runs on that same database service! How do we go about setting that up? We start by defining all the RSIDs and ACIDs for both our permissions service (IAM) and our database service (kittenDB), both for our use and for use by our clients.

Service IAM has 3 resources: "users", "policies", and "organizations". Policies can be granted to users and organizations. The complete layout of ids is like so:

RSID::IAM::{{ organization }}::/organization/*
RSID::IAM::{{ organization }}::/user/*
RSID::IAM::{{ organization }}::/policy/*
RSID::IAM::{{ organization }}::/policy/{{ policy }}/user/*
RSID::IAM::{{ organization }}::/policy/{{ policy }}/organization/*

ACID::IAM::create-organization
ACID::IAM::list-organization
ACID::IAM::read-organization
ACID::IAM::update-organization
ACID::IAM::delete-organization

ACID::IAM::create-user
ACID::IAM::list-user
ACID::IAM::read-user
ACID::IAM::update-user
ACID::IAM::delete-user

ACID::IAM::create-policy
ACID::IAM::list-policy
ACID::IAM::read-policy
ACID::IAM::update-policy
ACID::IAM::delete-policy

ACID::IAM::attach-policy-to-user
ACID::IAM::attach-policy-to-organization

Service kittenDB has 2 resources: "maps" and "entries", in addition to the database itself. Entries are contained within maps. The complete layout of ids is like so:

RSID::KITTENDB::{{ organization }}::/database/*
RSID::KITTENDB::{{ organization }}::/map/*
RSID::KITTENDB::{{ organization }}::/map/{{ map }}/entry/*

ACID::KITTENDB::create-database
ACID::KITTENDB::read-database
ACID::KITTENDB::update-database
ACID::KITTENDB::delete-database

ACID::KITTENDB::create-map
ACID::KITTENDB::read-map
ACID::KITTENDB::update-map
ACID::KITTENDB::delete-map

ACID::KITTENDB::create-entry-in-map
ACID::KITTENDB::read-entry-in-map
ACID::KITTENDB::update-entry-in-map
ACID::KITTENDB::delete-entry-in-map
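To make the "children are encoded into the RSID" idea concrete, here's a small Python sketch that splits an RSID into its parts and walks up to the parent resource. The `RSID` tuple and the `parse_rsid` / `parent` helpers are hypothetical names, and the sketch assumes paths always alternate between a resource type and a resource name, as in the layouts above.

```python
from typing import NamedTuple, Optional


class RSID(NamedTuple):
    service: str
    organization: str
    path: str


def parse_rsid(raw: str) -> RSID:
    """Split 'RSID::SERVICE::organization::/path' into its parts."""
    prefix, service, organization, path = raw.split("::", 3)
    if prefix != "RSID" or not path.startswith("/"):
        raise ValueError(f"not a fully qualified RSID: {raw!r}")
    return RSID(service, organization, path)


def parent(rsid: RSID) -> Optional[RSID]:
    """Return the enclosing resource, or None for a top-level resource."""
    segments = rsid.path.strip("/").split("/")
    if len(segments) <= 2:
        return None
    # drop the trailing "<type>/<name>" pair to get the parent's path
    return rsid._replace(path="/" + "/".join(segments[:-2]))


entry = parse_rsid("RSID::KITTENDB::acme::/map/pets/entry/tabby")
print(entry)
print(parent(entry))
```

Here `acme`, `pets`, and `tabby` are invented names; the entry's parent comes out as the `/map/pets` resource in the same organization.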

Platform Bootstrapping

The very very first thing we need to do is allow the platform to manage itself. And in the beginning... there is no platform! So someone at cloudOps needs to run some commands to spin up the very first kittenDB instance. We'll assume this is some employee on the physical cloudDB server running make build after doing git clone kittenDB ... or whatever.

From there we have to "hand load" a few policies into the database. Our goal here is to do all of the manual direct-to-database work required in order to let our users utilize the platform. It's important to recall here that cloudOps is using kittenDB as its database, and every action requires a database call, so for even very mundane actions like "organization creation" we need to start defining access policies. We'll do this with a single instance "global policy" that applies to everyone. It'll look like this:

# in case you're unacquainted, this is cloudformation syntax!

GlobalPolicy:
    Name: "global-policy"
    Type: "IAM::Policy"
    Properties:
        Rules:
            - Name: "allow-creating-organizations"
              Resources:
                # for any organization name
                - "IAM::*::/organization/*"
              Actions:
                # allow seeing if an organization of that name exists
                - "IAM::list-organization"
                # allow creating an organization with that name
                - "IAM::create-organization"

This would translate to API calls that, in a CLI for example, would look like this:

cloud-ops iam create-organization --name "my-new-organization"
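Behind that CLI call, something has to decide whether the global policy permits it. Here's a minimal sketch of that decision in Python, assuming `*` in a policy behaves like a shell-style wildcard (the post doesn't pin down the matching rules, so that's my assumption). The dict layout, the `is_allowed` function, and the exact shape of the organization's resource id are all hypothetical.

```python
from fnmatch import fnmatchcase

# The global policy from above, transcribed into plain Python data.
GLOBAL_POLICY = [
    {
        "name": "allow-creating-organizations",
        "resources": ["IAM::*::/organization/*"],
        "actions": ["IAM::list-organization", "IAM::create-organization"],
    },
]


def is_allowed(rules, resource, action):
    """Return True only if a single rule matches both the resource and the action."""
    for rule in rules:
        resource_ok = any(fnmatchcase(resource, p) for p in rule["resources"])
        action_ok = any(fnmatchcase(action, p) for p in rule["actions"])
        if resource_ok and action_ok:
            return True
    return False  # nothing matched, so default to no access


# Assumed resource id for the organization being created.
resource = "IAM::my-new-organization::/organization/my-new-organization"
print(is_allowed(GLOBAL_POLICY, resource, "IAM::create-organization"))  # True
print(is_allowed(GLOBAL_POLICY, resource, "IAM::delete-organization"))  # False
```

Note there's no "deny" rule type here; access is denied simply because nothing grants it.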

More Policy Definitions

The last section described the most basic action for the platform: organization creation. From there we have a few more "personas" we need to consider. The personas represent either a distinct state for a given user, or a "machine user" created for a specific purpose. At any rate, they all need policies!

# this is for an "admin" user for the organization
OrgAdmin:
    Name: "org-admin"
    Type: "IAM::Policy"
    Properties:
        Rules:
            - Name: "allow-managing-my-organization"
              Resources:
                # for anything within my organization
                - "IAM::my-new-organization::*"
              Actions:
                # allow me to do anything
                - "IAM::*"
            - Name: "allow-database-access"
              Resources:
                # for a database within my organization
                - "KITTENDB::my-new-organization::*"
              Actions:
                # allow "database" level actions such as creating and deleting
                # the database, but not reading its contents
                - "KITTENDB::*database*"

# this is for an "operator" user for the organization, such as an engineer
OrgOperator:
    Name: "org-operator"
    Type: "IAM::Policy"
    Properties:
        Rules:
            - Name: "allow-database-access"
              Resources:
                # for a database within my organization
                - "KITTENDB::my-new-organization::*"
              Actions:
                # allow me to see anything I might need to for debugging
                - "KITTENDB::*list*"
                - "KITTENDB::*read*"

# these are for two distinct types of "machine user"

MachineUserReadAccess:
    Name: "machine-read-access"
    Type: "IAM::Policy"
    Properties:
        Rules:
            - Name: "database-read"
              Resources:
                # for a database within my organization
                - "KITTENDB::my-new-organization::*"
              Actions:
                # allow read actions
                - "KITTENDB::*read*"
                - "KITTENDB::*list*"

MachineUserWriteAccess:
    Name: "machine-write-access"
    Type: "IAM::Policy"
    Properties:
        Rules:
            - Name: "database-read"
              Resources:
                # for a database within my organization
                - "KITTENDB::my-new-organization::*"
              Actions:
                # allow read actions
                - "KITTENDB::*read*"
                - "KITTENDB::*list*"
            - Name: "database-contents-write"
              Resources:
                # for the contents of my organization's database
                - "KITTENDB::my-new-organization::/map/*"
              Actions:
                # allow all actions
                - "KITTENDB::*"

These policies would be pre-populated in the user's new organization, so they could assign them to people as needed. So you would create your new organization, and be presented with a list of policies like:

org-admin
org-operator
machine-read-access
machine-write-access
etc etc
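A quick way to sanity check wildcard action patterns like the ones in these policies is to expand them against the full list of ACIDs. The sketch below does that for kittenDB, again assuming shell-style wildcard matching; the `expand` helper is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Rebuild the kittenDB action list from the ACID layout earlier in the post.
RESOURCES = ("database", "map", "entry-in-map")
VERBS = ("create", "read", "update", "delete")
KITTENDB_ACTIONS = [f"KITTENDB::{verb}-{res}" for res in RESOURCES for verb in VERBS]


def expand(pattern):
    """List every known action that a policy pattern would match."""
    return [action for action in KITTENDB_ACTIONS if fnmatchcase(action, pattern)]


print(expand("KITTENDB::*database*"))
print(expand("KITTENDB::*read*"))
print(expand("KITTENDB::*list*"))
```

With the ACIDs exactly as listed earlier, `KITTENDB::*list*` expands to nothing, since kittenDB has no list actions defined yet. The pattern is harmless, and it would start matching as soon as list actions get added.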

The policies would be editable for whatever specific purpose the user requires. A fairly common one would likely be giving read / write access only to certain paths:

Rules:
    - Name: "read-shared-paths"
      Resources:
        # for all shared paths
        - "KITTENDB::my-new-organization::/map/shared/*"
      Actions:
        # allow read actions
        - "KITTENDB::*read*"
        - "KITTENDB::*list*"
    - Name: "write-my-shared-paths"
      Resources:
        # for my shared paths
        - "KITTENDB::my-new-organization::/map/shared/my-paths/*"
      Actions:
        # allow all actions
        - "KITTENDB::*"

Implementation Challenges

This is an "overview" level description of how one would set up a system like this, but there are a great many things that represent implementation challenges. Including:

given the auto-generated stub kittenDB API endpoints, how do you implement the actual logic?
given how "hot" the code and data paths will be for the IAM access systems, how do you keep performance high?
how do you resolve intersecting / conflicting access policies, ideally in such a way that users can debug them?
how do you consistently ensure that every single API endpoint effectively communicates to the user when there's a permissions issue?

Those are all "tactical" level challenges that will remain relevant for the entire course of the business. You'll need people dedicated to solving them, and also people working on expanding cloudOps out horizontally to support more types of data stores. That said! All of that ongoing work would happen on top of the existing strong fundamental permissions model, and you would be well suited to provide high flexibility access control for any client's business needs. You could also look into expanding the permissions model for more fine-grained cases 👀 such as attribute and tag based conditions.

Golang PR Field Notes (Part 3)

@coilysiren — Sun, 29 Dec 2019 00:00:00 GMT

if you're looking for part 2, it's back this way <===

When we last saw our protagonist

The last time we were here, we took the time to identify some concepts we may need to understand, in order to do our work. That list of concepts is very long, and the actual relevance of the individual concepts varies fairly dramatically.

Today, we're going to build up the knowledge we don't have, and start identifying and defining our "foundational concepts". Those are the things we absolutely need to understand in order to do any well grounded work in this space.

The foundational concepts

Judging from how they were mentioned and the frequency of mentions, here's what I think I need to understand first. I've included links to the place where they appear in the golang issue tracker, links to various places on the internet that contain definitions or helpful information, and finally my personal re-definition of the term.

high level programming concepts

code compilation (wikipedia)
- In a general sense, code compilation is any process of turning one programming language into another. Its most common specific usage is in turning a high level language (like python or go) into a low level language (like assembly or machine code) to create an executable.
compiler optimization (wikipedia)
- compiler optimization is the practice of having the compiler transform code so that the resulting program does better on any of the following attributes: execution time, code output size, memory requirements, and power consumption.
assembly code (wikipedia)
- assembly code is a human-readable notation for the machine code that computers actually know how to run, with each assembly instruction mapping closely to a machine instruction. Many compilers work by turning some source code into assembly, which then gets assembled into machine code. It's possible, but very annoying, to write assembly code directly.
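As one concrete example of a compiler optimization, here's a toy "constant folding" pass in Python. It's a sketch of the general idea only (precompute arithmetic on constants so the program doesn't have to do it at runtime) and says nothing about how go's compiler implements it. It uses Python's standard `ast` module, needs Python 3.9+ for `ast.unparse`, and the `ConstantFolder` / `fold` names are mine.

```python
import ast


class ConstantFolder(ast.NodeTransformer):
    """Replace `<number> + <number>` and `<number> * <number>` with their value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        left, right = node.left, node.right
        both_numbers = (
            isinstance(left, ast.Constant)
            and isinstance(right, ast.Constant)
            and isinstance(left.value, (int, float))
            and isinstance(right.value, (int, float))
        )
        if both_numbers and isinstance(node.op, ast.Add):
            return ast.copy_location(ast.Constant(left.value + right.value), node)
        if both_numbers and isinstance(node.op, ast.Mult):
            return ast.copy_location(ast.Constant(left.value * right.value), node)
        return node  # anything else is left untouched


def fold(source: str) -> str:
    """Parse an expression, fold its constants, and turn it back into source."""
    tree = ConstantFolder().visit(ast.parse(source, mode="eval"))
    return ast.unparse(ast.fix_missing_locations(tree))


print(fold("x + 2 * 3 * 4"))  # x + 24
```

The multiplication gets done once at "compile time", and only the addition involving `x` is left for runtime.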

computer processor concepts

inter process communication (IPC) (github) (wikipedia)
- by "default" processes share no memory and cannot directly communicate with each other. IPC is a set of techniques one can use to share data between processes.
semaphore (github) (wikipedia)
- a semaphore is a construct that is used to control access to shared data across multiple concurrent processes. Semaphores tend to show up alongside other IPC tools like UNIX pipes, but they do a different job: a pipe moves data between processes, while a semaphore is a counter that limits how many processes can use something at once. In general, semaphores become required for certain kinds of complex work across multiple processes.
instruction cache (icache) (github) (wikipedia)
- a cache is a place where you store things so that you can access them more quickly. An icache is a cache for instructions to be run by the CPU. Instructions in this context are the machine code operations that source code gets compiled into. CPUs may have some mechanism for moving instructions into / out of the icache, but that's a subject for another day.
registers (github) (wikipedia)
- processor registers are a tiny, very fast type of storage built into the CPU. They are the top of the pyramid in the memory hierarchy, which is to say that they are the fastest memory location available to the CPU. Registers are very very small, their maximum size typically being measured in bits. The amount of memory available to a register (~32 bits) can be contrasted with the memory available to RAM (~8GB, so ~10^9 larger) and the memory available to hard drives (~1TB, so ~10^11 larger)
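To make the semaphore definition a bit more tangible, here's a small Python sketch. It uses threads rather than separate processes to keep the example short (that's my simplification), but the idea is the same: the semaphore is a counter that only lets a fixed number of workers in at once. The numbers (2 slots, 6 workers) are arbitrary.

```python
import threading
import time

slots = threading.Semaphore(2)  # at most 2 workers inside at once
lock = threading.Lock()         # protects the two counters below
active = 0
peak = 0


def worker():
    global active, peak
    with slots:                 # blocks while both slots are taken
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)        # pretend to use the shared resource
        with lock:
            active -= 1


threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrent workers:", peak)
```

However the threads get scheduled, `peak` can never exceed the 2 slots the semaphore was created with.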

general compilation concepts

bounds checks (github 1, 2) (wikipedia)
- bounds checking is a process via which a program can check that a variable meets some conditions before it is used. One common type of bounds checking is range checking, where a variable is checked to see if it fits into a given type, like a u8 (unsigned 8 bit int) or i16 (signed 16 bit int). Another common type of bounds checking is index checking, where an index to an array is checked to see if it actually fits in that array.
compiled function nodes / abstract syntax trees (ASTs) (github) (wikipedia)
- ASTs are an abstract representation of some source code, generally created by some kind of parser or compiler. ASTs represent certain elements as nodes in the tree, such as functions, if statements, and comparisons. The term compiled function node can refer to either the literal individual node in an AST representing the function, or the individual function node and all of its children in the tree.
inlining functions (github) (wikipedia)
- this concept primarily relates to compiler optimization. For some parent function that calls some child function, the child function can either be represented as a separate function "node" - or it can be inlined into the parent function in a manner functionally equivalent to copy pasting. The dynamics of when something should vs should not be inlined are a matter of long and frequent discussion within golang, in particular in this issue.
intrinsic functions (github) (wikipedia)
- intrinsic functions are special and fancy functions within a language, which the compiler handles itself so that they can use some extra tricks that leverage more low level processor / memory resources. They're sometimes written directly in assembly, and are often architecture specific (eg. specific to AMD processors).
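The two kinds of bounds check described above (range checking and index checking) look roughly like this when written out by hand. This is a Python sketch with hypothetical helper names; in a compiled language like go the compiler inserts the index check for you, and an out of range index panics at runtime.

```python
def fits_u8(value: int) -> bool:
    """Range check: does the value fit in an unsigned 8 bit int?"""
    return 0 <= value <= 255


def fits_i16(value: int) -> bool:
    """Range check: does the value fit in a signed 16 bit int?"""
    return -32768 <= value <= 32767


def checked_get(items, index):
    """Index check: refuse any index that doesn't fit in the list."""
    if not 0 <= index < len(items):
        raise IndexError(f"index {index} out of range for length {len(items)}")
    return items[index]


print(fits_u8(255), fits_u8(256))  # True False
print(checked_get(["a", "b", "c"], 2))  # c
```

Python does its own index checking too, but it treats negative indexes as "count from the end". The hand-written `checked_get` rejects them, which is closer to how go behaves.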

compilation concepts with large impact on golang

static single assignment (SSA) (github 1, 2, 3) (godoc) (wikipedia)
- SSA is a compiler design mechanism, wherein each variable is required to be assigned once and only once. In SSA, if a new value is assigned to a variable then a new version of that variable is created. SSA is primarily used for compiler optimization, since the guarantee of "a variable's value will never change" can enable a variety of compiler optimization algorithms.
- golang has a large SSA package that does various activities related to SSA.
binary files / executables (github 1 2 3) (wikipedia)
- An executable is a file that can be run by a machine in order to do some task. Executable files contrast most sharply with data files, which do not contain task instructions and instead contain information. Many executables are binary files meant to be run by a machine directly, although the term executable can also be applied to the source code files of scripting languages (like python, and unlike go).
- In golang care has been taken to make the generation of executable binary files both fast and reliable.
*.o object files (github 1, 2) (googlesource) (wikipedia 1, 2)
- object files are the output of a compiler, and their contents are usually machine code. Object files usually contain symbol and relocation information that a linker can use to fill in code from other places.
- Go's compiler produces object files as one of the compiler outputs.
*.a files (github 1, 2, 3) (stackoverflow) (googlesource) (wikipedia)
- Go's *.a files are package archive files, originally in the ar archive format. There was at some point a proposal to change them to the more standard zip file format. Archive files contain compiled package code, and additionally some debugging information.
file linking / static vs dynamic linking (github 1, 2, 3) (wikipedia) (stackoverflow 1, 2) (reddit)
- linking is the process by which an object file can request that other code be inserted into it. dynamic linking is a linking process where the linking is not truly resolved until runtime, creating a dependency on external files (like DLLs). static linking is a linking process where any links are resolved at compile time, creating a fully self-contained executable.
- static linking is the better choice in like 99% of all cases.
- golang's toolchain uses static linking by default, although dynamic linking can come into play in some cases (for example when cgo pulls in C libraries).
export data (github 1, 2, 3) (godoc 1, 2, 3)
- the literal reference for "the data that is exported from a package".
- its usage within golang is golang specific, because there are standard library tools (linked above) in golang built around managing export data. Those tools are responsible for the format and content of the export data.
- that said, the term itself is fairly generic since "exported data" can mean an entire universe of things.
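The "new version of that variable" part of SSA is easier to see with a worked example. Below is a toy Python renamer for straight-line code only. It's a sketch of the renaming idea and nothing more: it ignores branches and loops (which real SSA handles with extra machinery), it assumes simple `name = expression` statements, and `to_ssa` is a made-up name rather than anything from go's SSA package.

```python
import re


def to_ssa(statements):
    """Rename variables in straight-line `name = expr` statements into SSA form."""
    version = {}  # variable name -> latest version number
    out = []

    def latest(match):
        name = match.group(0)
        # reads refer to the newest existing version of a variable
        return f"{name}{version[name]}" if name in version else name

    for stmt in statements:
        target, expr = (part.strip() for part in stmt.split("=", 1))
        expr = re.sub(r"[A-Za-z_]\w*", latest, expr)
        # every assignment creates a brand new version of the target
        version[target] = version.get(target, 0) + 1
        out.append(f"{target}{version[target]} = {expr}")
    return out


for line in to_ssa(["x = 1", "x = x + 1", "y = x * 2", "x = y"]):
    print(line)
# x1 = 1
# x2 = x1 + 1
# y1 = x2 * 2
# x3 = y1
```

After renaming, each name on the left appears exactly once, so an optimizer can treat `x2` as a fixed value rather than tracking every place `x` might have changed.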

general computer science concepts

runes (github) (stackoverflow) (blog.golang.org)
- runes are integer values (int32 in go) that each identify a particular unicode code point.
the stack vs the heap (github) (stackoverflow)
- the stack is an area of memory that uses last-in-first-out access patterns, and is used for rapid access to small data. The term stack can be remembered by thinking of someone quickly stacking a pile of plates as they're being washed.
- the heap is an area of memory where access is done in an ad-hoc manner, and is used for longer term access to large data. In general, high level scripting languages (ex. ruby and python) allocate most / all of their memory on the heap, for the sake of simplicity at the cost of performance. The term heap can be remembered by thinking of a large pile (eg. a "heap") of objects scattered about randomly.
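Going back to runes for a second: the "integer per code point" idea is easy to poke at in Python, where `ord` gives you the code point for a character. This is an analogy rather than go code. In go, `len` on a string counts bytes while ranging over the string yields runes, and the same bytes versus code points split shows up below.

```python
# "héllo, 世界" written with escapes so the exact code points are unambiguous
text = "h\u00e9llo, \u4e16\u754c"

code_points = [ord(ch) for ch in text]   # one integer per unicode code point
byte_length = len(text.encode("utf-8"))  # how many bytes the UTF-8 encoding takes

print(len(code_points))      # 9
print(byte_length)           # 14
print(hex(ord("\u4e16")))    # 0x4e16
```

Nine code points take fourteen bytes here, because the accented letter needs two bytes in UTF-8 and each of the two CJK characters needs three.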

golang general concepts

goroutines (github 1, 2) (gobyexample) (golang-book)
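- goroutines are not threads 😅. They're go's concurrency model: lightweight functions that run concurrently, scheduled by the go runtime onto a smaller number of OS threads. goroutines are to the go runtime as processes are to an operating system.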
buffering and channels (github) (gobyexample) (stackoverflow) (medium) (tour.golang.org)
- channels are like IPC except for goroutines. They allow communication and synchronization between goroutines. buffering allows channels to act asynchronously by creating little pools (buffers) that each side can interact with without blocking.
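Here's a loose Python analogy for a buffered channel, using a bounded `queue.Queue` shared between two threads. It's only an analogy, since Python threads and queues are not goroutines and channels, but the blocking behavior is similar: the sender only waits when the buffer is full, and the receiver only waits when it's empty. The `DONE` sentinel is my stand-in for closing a channel.

```python
import queue
import threading

channel = queue.Queue(maxsize=2)  # a buffer of 2, loosely like make(chan int, 2) in go
DONE = object()                   # sentinel standing in for closing the channel
received = []


def producer():
    for n in range(5):
        channel.put(n)            # blocks only while the buffer is full
    channel.put(DONE)


def consumer():
    while True:
        item = channel.get()      # blocks only while the buffer is empty
        if item is DONE:
            break
        received.append(item)


workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(received)  # [0, 1, 2, 3, 4]
```

Setting `maxsize=1` forces the two sides into closer lockstep, which is the direction an unbuffered channel pushes things.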

Wow! What have we learned?

...everything 😆. Interestingly, a lot (most?) of these concepts were concepts that I already had some awareness of, I just wasn't able to define them. Now empowered with these definitions, when I read issues like https://github.com/golang/go/issues/15752 I have a much much stronger understanding of what's being described.

Up next

Now that I know a ton more about this space, I can more accurately judge whether or not a particular github issue is a good fit for me to work on. There are several types of issues that I now know aren't quite a good fit for a first-timer, for example this issue about improving inlining. In my next post I'll set up a more targeted framework for determining how approachable the various issues are, and (🤞🏽ideally🤞🏽) pick one to start working on.

Golang PR Field Notes (Part 2)

@coilysiren — Thu, 26 Dec 2019 00:00:00 GMT

if you're looking for part 1, it's back this way <===

Starting up again

Something I under-estimated, is how amazingly useful having my notes from yesterday would be. Having written so much down gives me a much stronger launching pad for furthering my knowledge, particularly when the alternative is having to remember everything.

To kick off the day, here's the list of questions I need to answer:

what does it mean to build a dependent package, and when do we know that a dependent package needs to be rebuilt?
which other types of code change end up changing the export data?
what are object files?
how do we know when object files are stale?
how do we change the export data when object files are stale?
how do executables relate to needing to rebuild dependent packages?
what does it mean to re-link executables?

I need to be able to answer all those questions before I can confidently make any movement here, but I actually don't want to start on that yet. What I want to do first is do more requirements gathering, because the issue we're looking at has links to other issues that may contain relevant information.

Looking through related issues

Related issues often contain context that's required to fully understand the origin issue. When working in very large repos with potentially complex histories, I read all the issues out two steps from the origin issue. So if I've started on issue #101 which mentions #102 which mentions #103 which mentions #104, I'll read all of #101, #102, and #103 (ie. and not #104). This is essentially a triage mechanism; in a repo as large as golang I could probably spend an entire calendar year just reading issues.
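The "two steps out" rule is really just a breadth-first walk with a depth limit. Here's a minimal Python sketch of it, using a made-up mention graph that mirrors the #101 through #104 example. The `MENTIONS` dict and the `issues_to_read` function are hypothetical; I did this by hand, not with a script.

```python
from collections import deque

# Hypothetical mention graph: issue number -> issues it mentions.
MENTIONS = {101: [102], 102: [103], 103: [104], 104: []}


def issues_to_read(origin, mentions, max_steps=2):
    """Breadth-first walk of the mention graph, stopping max_steps from the origin."""
    seen = {origin}
    order = [origin]
    frontier = deque([(origin, 0)])
    while frontier:
        issue, steps = frontier.popleft()
        if steps == max_steps:
            continue  # far enough out, don't follow this issue's mentions
        for mentioned in mentions.get(issue, []):
            if mentioned not in seen:
                seen.add(mentioned)
                order.append(mentioned)
                frontier.append((mentioned, steps + 1))
    return order


print(issues_to_read(101, MENTIONS))  # [101, 102, 103]
```

The `seen` set matters in a real tracker, where issues mention each other in circles.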

So here are all the issues mentioned up to 2 steps out, in order of their creation.

There's a lot of them ^^ some of the linked issues are near / exact matches of my search criteria from yesterday. This is a good sign, it tells me that even if I can't solve my origin issue, that I'll be able to use the context I'm gaining to look into a nearby issue.

Concept discovery

Now what? Well, we need to look through the issues for a few things: red flags, yellow flags, potential PRs that might solve the origin issue, and concepts I don't understand. The unknown concepts are going to be what requires the most work here; I'll need to list them out and gain a concrete understanding of them all before I can do any "real work". Scanning through the issues, here are those concepts:

what is the difference between $GOROOT/pkg, $GOPATH/pkg? via #4719
what is *.a cache? via #4719
what does "lazy cleaning" mean? via #4719
what are *.a binaries? via #4719
what are binary-only packages? via #4719
what is the difference between go build and go install? via #4719
how could go's build system work without a pkg directory? via #4719
what are object files, what does it mean to link object files? via #4719
what is gb? via #4719 and #14271
what is cgo, what does it mean to be cgo-enabled? via #4719
what does compilation specifically mean for go? via #4719
how do binaries relate to source code? via #4719

Short pause. Certain people are referenced in these issues a TON. I'm getting the impression that programming languages are only really written by a very small number of people (less than a dozen) who have a lot of support infrastructure (of hundreds or thousands). At any rate, back to concepts.

how does go compilation differ from cgo compilation? via #9887
does golang support compilation to multiple different target languages, eg. C, C++, and Fortran? if so, why? via #9887
what does the go tool "scheduling work" mean? via #8893
what does topological mean, and how could it lead to suboptimal parallelization? via #8893
what is critical path scheduling? via #8893
why is cgo compilation heavy? via #8893

Another aside -- I think I just spotted a remnant of the point at which golang switched to compiling itself. A quick google and I now know that go was originally written in C. Back to concepts!

what is asm compilation? via #8893 and #17566
what is SSA? via #8893 and #15736 and #17566
how do you avoid recompiling a package every time you run tests? via #11193
what does it mean to compile versus to link? via #14271
what is cmd/go? via #14271
what is bazel? via #14271

Did you know -- if you're spending hours reading GitHub issues in the winter, you can concurrently be applying lotion to your dry skin? It's true! Anyways.

what are go:cgo_import_dynamic directives? via #15681
what is the relationship between golang and C? what is the relationship between c compilations and the linker? via #15681
how do .a files relate to .o files? via #15681
what are .c files? via #15681
does golang use gcc? via #15681
what is the difference between export data and machine code? via #15734
what are inlined functions? via #15734 and #17566
what does it mean to walk a function? via #15734
what is a semaphore? via #15734
what is a VFS? via #15734
how do processes communicate with each other? via #15734
how does export data relate to .a files? via #15734
what is the relationship between export data and objects? via #15734
what is cmd/go? via #15734

Something I've noticed while writing this is that the amount of time I can spend reading before encountering a new unknown concept is increasing. I'm also gaining some understanding of the unknown concepts simply via seeing them mentioned in context, eg. without needing to look up their definitions. That said, the next step will definitely be to look up definitions ^^. Alright, back to concepts.

how are go's cmds organized, eg. cmd/go vs cmd/compile, etc? via #15736
what is typechecking? via #15756
what is escape analysis? via #15756
what steps are required for compilation? via #15756
what does it mean to "rebuild even when target is up to date" eg. what is the target of builds? via #15799
what is yyerrorn? via #15913
what is a gc node? via #17566
what are OKEY and OAPPEND nodes? via #17566
what does it mean to inline during compilation? via #17566
what are bounds checks? via #17566 and #25862
what does "generating code" mean in the context of compilation? via #17566
what would be the benefit of inlining performance critical functions? via #17566
what is the cost of inlining a function? via #17566
what is a pragma? via #17566
what is closure conversion? via #17566
what does it mean to export a function? via #17566
what is zero-cost control flow abstraction? via #17566
what is an intrinsic? via #17566

I just went through an entire (long) issue without noticing any new concepts! Exciting.

what is the binary export format? via #20070
what is a package archive? via #20579
what are registers? via #25999
what does it mean to spill a register? via #25999
what is the stack, and how does it relate to optimization? via #25999
what does it mean to preempt a goroutine with signals? via #25999
what are runes? via #27148
what is a goroutine? via #27345
what is an unbuffered channel? via #27345
what is BCE? via #28314
what is morestack? via #29067
what is a branch target buffer, what are backwards versus forward branches? via #29067
what is icache? via #29067

Wow that was exhausting

So here's what we did:

dropped into the golang issue tracker with a specific learning goal in mind
found a specific issue, and walked through all the related issues for concepts that need to be learned

And here's what we will do next:

define and understand most (not all) of the concepts, focusing on hotspots
return to the original issue, and determine if we have enough information to create a fix (likely yes, 90% confidence)
start working on a fix

That'll come later! For today, I'm headed back to video games.

Up next

===> part 3 is this way

Golang PR Field Notes (Part 1)

@coilysiren — Wed, 25 Dec 2019 00:00:00 GMT

Identifying an issue

I started with the following search:

open issues
that need a fix
that don't have a CL (ie. a changelist)
that aren't planned to be fixed by the go team
about toolspeed

From this url. There were a few considerations that went into that search. In my experience with open source, and also under the guidance of go's contributing guide, I determined that "open issues that will take a fix, but no fix is planned" is the best path for contribution. They essentially say "we will take a fix that looks like this, if anyone drops in to create it". The golang issue tracker separates "backlog" from "unplanned", which I learned about here => https://github.com/golang/go/issues/34376. I assume that a backlog issue may have someone on the golang team drop in and sweep it out from under me, whereas unplanned issues are fair game essentially indefinitely. I've had a variant of that happen to me before, where I created a PR for django and then a maintainer decided to take some of my work and start their own PR. So, I'm painfully allergic to that happening again here.

The focus on toolspeed is inspired by my learning goals, I go into that briefly on twitter here => https://twitter.com/coilysiren/status/1209628089346490368. There's another label that fits my learning goal, the "performance" label. Here's both the toolspeed and performance searches, mostly for my future reference:

Short aside on labeling

If you can't tell from my methodology here, accurately labeling issues is critically important for new contributors. When I first drop into a repo, the labels are the first thing I look at. There's a lot of new information to ingest when entering a repo, and often the labels are the best "front door".

This makes accurately labeling issues a fairly high priority concern for repository maintainers, in my opinion. I was very happy to see a cpython proposal by Mariatta on creating a "bug triage" role. Essentially the entire job of that role is to apply labels. I would totally apply to that role, if contributing to cpython looked like it would be helpful for my career. But it's looking like golang is going to be where my interests lie, so that's why you're reading a post about golang.

Exploring the issue

So, with the previously mentioned search as my guide, I went for the first issue that looked the most approachable. "Approachable" here is a subjective definition, it translates roughly to "do I feel capable of doing this". The issue I picked was:

https://github.com/golang/go/issues/15752

cmd/go: only rebuild dependent packages when export data has changed

From the title, I determined that the issue relies on core understanding of two concepts:

dependent packages (which I understand)
export data (which is unknown to me)

This prompts our focus to revolve around the "export data" concept. There is a particular dependency in this issue on "the export data changing" which is a pre-requisite for making any enhancements based on that fact. That is, in order to create any logic that relies on diffs in the export data, we have to be able to fully trust that diffs in the export data are sufficiently reliable to make decisions. We can state that fact without first even having an understanding of what "export data" is.

The export data reliability point comes up in the issue's description, specifically in this line

Not sure how often it happens that code changes don't impact export data (or how cheap that is to detect), but when it does happen, that could save a bunch of computation.

From this context I'm able to determine that "export data" means something like "metadata about a function". I assume that the export data contains some basic information about a function or package, such as last edited time / file size / function signatures / etc.

There's one part here that's particularly important to emphasize

Not sure how often it happens that code changes don't impact export data

So the golang team's current understanding is that it's unknown if the export data is currently reliable enough to make decisions based on it. At this point, I'm starting to reframe the issue into two separate tasks:

only rebuild dependent packages when export data has changed (the original task)
change the export data for a dependent package whenever that package needs to be rebuilt (a derived task from the above)

This surfaces another question, specifically: what does it mean to build a dependent package, and when do we know that a dependent package needs to be rebuilt? Being able to answer that question is a pre-requisite to writing code for the conditional rebuilding of dependent packages. This creates a set of subtasks, specifically

change the export data during (case 1)
change the export data during (case 2)
etc...

Further comments give more information about the export data concept

the plan would be to do something like check the SHA of the export data. If that hasn't changed, then the downstream compilation won't change.

...

Yeah, I was just thinking that you'd store the SHA1 of the export data of all imported packages in the .a file and only recompile if any of them had changed. No sense trying to be too sophisticated.

I don't know often this would be useful. Certainly sometimes (especially in a "add more and more debug prints" cycle)

Emphasis mine. So part of the goal is to avoid rebuilding a dependent package if you've made trivial internal changes to it, like for example adding some debugging prints.
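The idea in those comments can be sketched in a few lines of Go. This is a minimal sketch of the check as the commenters describe it, not how cmd/go actually implements anything. The helper names exportHash and needsRecompile, the import path example.com/dep, and the byte strings standing in for real export data are all hypothetical.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// exportHash returns the hex SHA1 of a package's export data.
// The input bytes are a placeholder for whatever the compiler really emits.
func exportHash(exportData []byte) string {
	sum := sha1.Sum(exportData)
	return hex.EncodeToString(sum[:])
}

// needsRecompile compares the hashes recorded at the last build (keyed by
// import path) against the current ones. Any added, removed, or changed
// import means the downstream package has to be recompiled.
func needsRecompile(recorded, current map[string]string) bool {
	if len(recorded) != len(current) {
		return true
	}
	for path, hash := range current {
		if old, ok := recorded[path]; !ok || old != hash {
			return true
		}
	}
	return false
}

func main() {
	// Hashes stored alongside the downstream package at the last build.
	recorded := map[string]string{
		"example.com/dep": exportHash([]byte("func Add(a, b int) int")),
	}
	// A debug print was added inside Add: the export data is unchanged.
	afterDebugPrint := map[string]string{
		"example.com/dep": exportHash([]byte("func Add(a, b int) int")),
	}
	// The signature of Add changed: the export data is different.
	afterAPIChange := map[string]string{
		"example.com/dep": exportHash([]byte("func Add(a, b, c int) int")),
	}
	fmt.Println(needsRecompile(recorded, afterDebugPrint)) // false
	fmt.Println(needsRecompile(recorded, afterAPIChange))  // true
}
```

The sketch uses SHA1 only because the quoted comment does. Any stable hash would serve the same purpose here.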

Further down the thread I'm seeing several people mention that they're working on some portion of this (1, 2) which gives me some pause. But those comments are from years ago, so I believe I'm in the clear.

Another comment by the author further reinforces the direction I believe this should be approached from

Is this now just (“just”) a matter of only hashing the export data when determining whether object files are stale?

specifically my note about "change the export data during..." above. This comment surfaces some new questions, specifically:

what are object files?
how do we know when object files are stale?
how do we change the export data when object files are stale?

The final comment in this issue is a wildcard for me, it says

I think it's a little bit more than that because you still need to re-link any executables?

This statement is hard for me to parse, and raises several questions:

how do executables relate to needing to rebuild dependent packages?
what does it mean to re-link executables?

I know that executables are a type of asset produced by a build process. Specifically I believe that go build ... produces executables. It was my understanding that go build produced a single statically linked executable, but I now believe that understanding is incorrect. My new understanding is that go build ... possibly creates multiple executables, perhaps one for each package? If you were conditionally rebuilding executables based on export data, then you would end up with multiple executables where some of them are outdated and need to be pointed to new links? I'll be sure to investigate this.

Next steps

So far we've identified the following:

there is a problem I can solve here
the desired solution has some performance impact
I don't currently possess all the knowledge I need to solve the problem, but the amount of knowledge that I need to gain feels reasonable
there is some active work on this issue, but not so much active work that my contributions would be invalidated

In summary, there's work to do! My next step is to double down on investigating this issue, and gain any surrounding context that I might need.

Up next

===> part 2 is this way

The Maintainer - Aspect of Code Janitor

@coilysiren — Sat, 27 Jul 2019 00:00:00 GMT

The code janitor is a servant leader of a codebase. They use their skills primarily to increase the software engineering velocity of the team as a whole / increase the stability of the code. Their roles are as follows:

Encourage continually better code practices in your coworkers. If they mention writing code they haven't pushed to a remote branch, get them to push that branch! If they push code without tests, get them to write tests! Continually iterate on best practices for branching and test coverage. Good branching enables good review and helps avoid duplication of work. The benefits of good tests are a subject for another post ^^. "Better code practices" is a continuously moving target, so the code janitor must also be tuned into a technical community to stay up to date on them.

Proactively review => merge the pull requests of coworkers / community members. Coworkers' PRs should all be reviewed at least every 24 hours. Community members' PRs should all be reviewed at least every 7 days. If you're approaching a time limit and aren't sure what feedback to give, really push yourself to surface why a pull request shouldn't be merged. Continuous proactive reviews are a powerful tool for surfacing people's blockers.

As a special case for reviews and merges, the code janitor should maintain a requirements bot. The benefits of a requirements bot are described in another post. The timelines for requirements updates should be relative to the semver range of the update, and the type of update. For example, security patches should be merged within 24 hours.

Tangentially related to maintaining your current set of requirements is watching for competitor libraries. When your team's use of a tool is non-standard / non-ideal, it's often a good target for being replaced by a competitor. For the majority of your libraries this won't be an option, but there are some cases where it's easily possible. Switching my team's code formatter and test framework are two examples of this. A good code janitor develops a sense for where / when these switches can happen, and is keyed into the communities where these libraries are created.

Watch how your coworkers are interacting with the linter config. Ask your coworkers how they are interacting with the linter config! Be very receptive to changing the linter config over time, as personal styles shift. Definitely change linter config whenever you get a new coworker of at least median skill level; they know what works well for them. The gold standard is config that minimizes the time programmers spend thinking about formatting, while also completely preventing the possibility of Formatting Wars.

Driving refactor plans, such as:

renames
merging projects / splitting projects apart
open sourcing all of / parts of a project

Identifying technical debt, and surfacing the relative priority of reducing that debt to team managers. Issues of technical debt can often be hard to communicate, so part of this work should involve configuring tools that assist in objective code analysis. This is very related to driving refactor plans.

There's more to say here, but this post is long and the day is short. I hope this helps you become a better project maintainer!

Heroku + Django Pipeline + Sass

@coilysiren — Thu, 30 Mar 2017 00:00:00 GMT

What you have

A currently existing django project, hosted on heroku, with a bunch of css

What you want

to have your css compiled from sass
for that compilation to be done with ruby, during the heroku build

How you get it

heroku setup

# bash

heroku buildpacks:add heroku/ruby --index 1 --app $APP_NAME

django pipeline setup

# config/settings/base.py

# snipped to only the relevant bit, see django pipeline's docs for the rest
PIPELINE = {
    'COMPILERS': (
        'pipeline.compilers.sass.SASSCompiler',
    ),
}

sass setup

# Gemfile

source "https://rubygems.org"
ruby '2.3.3'

gem "sass"
gem "susy"

Ok but that didn't work

or at least it didn't for me, I got this (formatted for readability) error

pipeline.exceptions.CompilerError: <filter object at 0x1337> exit code 1
b"/tmp/DIR/vendor/ruby-2.3.3/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require':
    cannot load such file -- bundler/setup (LoadError)
    from /tmp/DIR/vendor/ruby-2.3.3/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
    from /tmp/DIR/vendor/bundle/bin/sass:15:in `<main>"

From that message, it occurred to me that maybe python wasn't calling the right sass. Which inspired the following "fix"

How you get it, but for real this time

Install your gems to whatever ruby / sass the python buildpack is calling

# bin/pre_compile
gem install bundler
bundle install

(pre_compile is a hook for the python buildpack)

Does this mean heroku buildpacks:add heroku/ruby --index 1 is unused? Someone investigate this and let me know.

Get this fixed for good!

One of the devs of the projects below will know what's up

But if you want a fix for your project, the pre_compile script will work fine