GCP Setup Guide for TFDrift-Falco¶
This guide walks you through setting up TFDrift-Falco for Google Cloud Platform (GCP) drift detection.
Table of Contents¶
- Quick Start (5 Minutes) ⭐ Start Here!
- Prerequisites
- Architecture Overview
- Step 1: Enable GCP Audit Logs
- Step 2: Configure Pub/Sub for Audit Logs
- Step 3: Install and Configure Falco
- Step 4: Configure TFDrift-Falco
- Step 5: Verify Setup
- Troubleshooting
- Best Practices
- Complete Examples
- Advanced Configuration
Quick Start (5 Minutes)¶
Want to try TFDrift-Falco with GCP right now? This automated script sets everything up for you.
One-Command Setup¶
# Download and run the setup script
curl -fsSL https://raw.githubusercontent.com/higakikeita/tfdrift-falco/main/scripts/gcp-quick-start.sh | bash
Manual Quick Start¶
If you prefer to run commands manually:
# 1. Set your project
export PROJECT_ID="your-gcp-project-id"
gcloud config set project $PROJECT_ID
# 2. Enable required APIs (30 seconds)
gcloud services enable logging.googleapis.com pubsub.googleapis.com compute.googleapis.com
# 3. Create Pub/Sub infrastructure (30 seconds)
gcloud pubsub topics create tfdrift-audit-logs
gcloud logging sinks create tfdrift-sink \
pubsub.googleapis.com/projects/$PROJECT_ID/topics/tfdrift-audit-logs \
--log-filter='protoPayload.serviceName="compute.googleapis.com"'
SINK_SA=$(gcloud logging sinks describe tfdrift-sink --format="value(writerIdentity)")
gcloud pubsub topics add-iam-policy-binding tfdrift-audit-logs \
--member="$SINK_SA" --role="roles/pubsub.publisher"
gcloud pubsub subscriptions create tfdrift-falco-sub \
--topic=tfdrift-audit-logs
# 4. Create service account for Falco (30 seconds)
gcloud iam service-accounts create tfdrift-falco \
--display-name="TFDrift Falco"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"
mkdir -p ~/tfdrift-config
gcloud iam service-accounts keys create ~/tfdrift-config/gcp-key.json \
--iam-account=tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com
# 5. Run Falco with Docker (1 minute)
cat > ~/tfdrift-config/falco.yaml <<EOF
engine:
  kind: modern_ebpf
plugins:
  - name: gcpaudit
    library_path: /usr/share/falco/plugins/libgcpaudit.so
    init_config:
      project_id: "$PROJECT_ID"
      subscription: "tfdrift-falco-sub"
load_plugins: [gcpaudit]
json_output: true
grpc:
  enabled: true
  bind_address: "0.0.0.0:5060"
  threadiness: 8
grpc_output:
  enabled: true
EOF
docker run -d --name falco \
-p 5060:5060 \
-v ~/tfdrift-config:/etc/falco \
-e GOOGLE_APPLICATION_CREDENTIALS=/etc/falco/gcp-key.json \
falcosecurity/falco:latest \
-c /etc/falco/falco.yaml
# 6. Create TFDrift-Falco config (30 seconds)
cat > ~/tfdrift-config/config-gcp.yaml <<EOF
providers:
  gcp:
    enabled: true
    projects:
      - "$PROJECT_ID"
    state:
      backend: "local"
      local_path: "./terraform.tfstate"
falco:
  enabled: true
  hostname: "localhost"
  port: 5060
drift_rules:
  - name: "GCE Instance Change"
    resource_types:
      - "google_compute_instance"
    watched_attributes:
      - "metadata"
      - "labels"
    severity: "high"
notifications:
  slack:
    enabled: false
logging:
  level: "info"
EOF
echo "✅ Setup complete!"
echo ""
echo "Next steps:"
echo "1. Create a test Terraform resource:"
echo " terraform init && terraform apply"
echo ""
echo "2. Run TFDrift-Falco:"
echo " tfdrift --config ~/tfdrift-config/config-gcp.yaml"
echo ""
echo "3. Make a manual change in GCP Console to trigger drift detection"
echo " Example: gcloud compute instances add-metadata INSTANCE_NAME --metadata=test=value"
What This Sets Up¶
- ✅ GCP Audit Logs → Pub/Sub pipeline
- ✅ Falco with gcpaudit plugin (Docker)
- ✅ TFDrift-Falco configuration
- ✅ Service account with minimal permissions
Test It¶
# 1. Create a simple test resource with Terraform
cat > main.tf <<EOF
resource "google_compute_network" "test" {
name = "tfdrift-test-network"
auto_create_subnetworks = false
}
EOF
terraform init
terraform apply -auto-approve
# 2. Run TFDrift-Falco
tfdrift --config ~/tfdrift-config/config-gcp.yaml &
# 3. Make a manual change
gcloud compute networks update tfdrift-test-network \
--description="Manual change - should trigger drift"
# You should see a drift detection alert!
Clean Up¶
# Remove test resources
terraform destroy -auto-approve
# Stop Falco
docker stop falco && docker rm falco
# Delete GCP resources
gcloud pubsub subscriptions delete tfdrift-falco-sub
gcloud pubsub topics delete tfdrift-audit-logs
gcloud logging sinks delete tfdrift-sink
gcloud iam service-accounts delete tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com
Prerequisites¶
Required Tools¶
- GCP Project with appropriate permissions
- Terraform 1.0+ managing GCP resources
- Falco 0.35+ (to be installed)
- Docker 20.10+ (optional, recommended for Falco)
- GCP CLI (gcloud) installed and configured
Pre-flight Checklist¶
Run these commands to verify your environment is ready:
# Check gcloud is installed and authenticated
gcloud --version
gcloud auth list
# Verify you have an active project
export PROJECT_ID=$(gcloud config get-value project)
echo "Current project: $PROJECT_ID"
# Check required APIs are enabled
gcloud services list --enabled | grep -E "(logging|pubsub|compute)"
# Enable required APIs if not already enabled
gcloud services enable \
logging.googleapis.com \
pubsub.googleapis.com \
compute.googleapis.com \
storage-api.googleapis.com
# Check Terraform is installed
terraform version
# Check Docker is running (if using Docker for Falco)
docker ps
# Verify you have sufficient permissions
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:user:$(gcloud config get-value account)" \
--format="table(bindings.role)"
Expected Result: All commands should complete successfully without errors.
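As a minimal sketch, the tool checks above can be wrapped in a small helper that reports which command-line tools are missing before you start. The function name missing_tools is hypothetical and not part of TFDrift-Falco; it only checks that each tool is on PATH, not that it is authenticated or the right version.

```shell
#!/usr/bin/env bash
# Print the name of every listed tool that is not found on PATH (one per line).
# missing_tools is an illustrative helper, not something shipped with TFDrift-Falco.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage (commented out so the snippet has no side effects when sourced):
# missing="$(missing_tools gcloud terraform docker)"
# if [ -n "$missing" ]; then echo "Missing tools: $missing" >&2; fi
```

An empty result means every listed tool was found; the authentication and permission checks above still need to be run separately.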
Required GCP Permissions¶
Your account needs these IAM roles:
- roles/logging.admin - Create log sinks
- roles/pubsub.admin - Create Pub/Sub topics and subscriptions
- roles/iam.serviceAccountAdmin - Create service accounts
- roles/storage.objectViewer - Read Terraform state from GCS (if using GCS backend)
Verify permissions:
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:user:$(gcloud config get-value account)" \
--format="table(bindings.role)"
Estimated Time¶
- Total setup time: 20-30 minutes
- Step 1 (Audit Logs): 5 minutes
- Step 2 (Pub/Sub): 5 minutes
- Step 3 (Falco): 10 minutes
- Step 4 (TFDrift-Falco): 5 minutes
- Step 5 (Verification): 5 minutes
Architecture Overview¶
┌─────────────────┐
│ GCP Resources │
│ (Terraform) │
└────────┬────────┘
│
│ Manual Changes
│ (Console/CLI)
▼
┌─────────────────┐
│ GCP Audit Logs │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Cloud Pub/Sub │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Falco │
│ (gcpaudit) │
└────────┬────────┘
│ gRPC
▼
┌─────────────────┐
│ TFDrift-Falco │
│ + GCS Backend │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Notifications │
│ (Slack/etc) │
└─────────────────┘
Step 1: Enable GCP Audit Logs¶
⏱️ Estimated time: 5 minutes
1.1 Enable Admin Activity Logs (Enabled by Default)¶
Admin Activity audit logs are enabled by default and cannot be disabled.
✅ Verification:
# Check if Admin Activity logs are flowing
gcloud logging read "protoPayload.serviceName=compute.googleapis.com" \
--limit=5 \
--format=json
Expected output: You should see recent audit log entries.
1.2 Enable Data Access Logs (Optional but Recommended)¶
For comprehensive drift detection, enable Data Access logs:
⚠️ Warning: Data Access logs can increase your Cloud Logging costs. Start with Admin Activity logs only for testing, then enable Data Access logs for production monitoring.
# Create audit config
cat > audit-config.yaml <<EOF
auditConfigs:
- auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_READ
  - logType: DATA_WRITE
  service: compute.googleapis.com
- auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_WRITE
  service: storage.googleapis.com
- auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_WRITE
  service: sqladmin.googleapis.com
EOF
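# Apply audit config to project
# Caution: set-iam-policy replaces the entire project IAM policy, so merge the
# auditConfigs section into the current policy instead of applying
# audit-config.yaml on its own (which would drop existing role bindings).
gcloud projects get-iam-policy $PROJECT_ID --format=yaml > policy.yaml
# Copy the auditConfigs section from audit-config.yaml into policy.yaml, then:
gcloud projects set-iam-policy $PROJECT_ID policy.yaml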
1.3 Verify Audit Logs¶
# List recent audit logs
gcloud logging read "protoPayload.serviceName=compute.googleapis.com" \
--limit 10 \
--format json
Step 2: Configure Pub/Sub for Audit Logs¶
⏱️ Estimated time: 5 minutes
2.1 Create Pub/Sub Topic¶
# Set project (replace with your actual project ID)
export PROJECT_ID="your-gcp-project-id"
gcloud config set project $PROJECT_ID
# Create topic for audit logs
gcloud pubsub topics create tfdrift-audit-logs
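✅ Verification: Run gcloud pubsub topics describe tfdrift-audit-logs to confirm the topic exists.
Expected output: The topic's full resource name, projects/YOUR_PROJECT_ID/topics/tfdrift-audit-logs.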
2.2 Create Log Sink¶
Route audit logs to Pub/Sub:
# Create log sink
gcloud logging sinks create tfdrift-sink \
pubsub.googleapis.com/projects/$PROJECT_ID/topics/tfdrift-audit-logs \
--log-filter='
protoPayload.serviceName="compute.googleapis.com" OR
protoPayload.serviceName="storage.googleapis.com" OR
protoPayload.serviceName="sqladmin.googleapis.com" OR
protoPayload.serviceName="container.googleapis.com"
'
# Get sink service account
SINK_SA=$(gcloud logging sinks describe tfdrift-sink --format="value(writerIdentity)")
echo "Sink Service Account: $SINK_SA"
✅ Verification:
# Verify sink was created and is active
gcloud logging sinks describe tfdrift-sink
# Check the filter
gcloud logging sinks describe tfdrift-sink --format="value(filter)"
Expected output:
Created [tfdrift-sink].
writerIdentity: serviceAccount:service-XXXX@gcp-sa-logging.iam.gserviceaccount.com
destination: pubsub.googleapis.com/projects/YOUR_PROJECT_ID/topics/tfdrift-audit-logs
💡 Tip: The writerIdentity is automatically created by Google Cloud and will be used to publish messages to Pub/Sub.
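The sink filter is a chain of serviceName clauses joined with OR, so it can be generated from a list of service prefixes rather than typed by hand. The sketch below assumes a hypothetical helper named build_sink_filter; it only builds the filter string and does not call gcloud itself.

```shell
#!/usr/bin/env bash
# Build a Cloud Logging filter that matches audit logs from the given services.
# Each argument is the service prefix, e.g. "compute" for compute.googleapis.com.
# build_sink_filter is an illustrative helper name, not part of any tool.
build_sink_filter() {
  local filter="" svc
  for svc in "$@"; do
    if [ -n "$filter" ]; then
      filter="$filter OR "
    fi
    filter="${filter}protoPayload.serviceName=\"${svc}.googleapis.com\""
  done
  printf '%s\n' "$filter"
}

# Example usage (commented out so the snippet has no side effects when sourced):
# gcloud logging sinks update tfdrift-sink \
#   --log-filter="$(build_sink_filter compute storage sqladmin container)"
```

Generating the filter this way keeps the service list in one place when you later add or remove monitored services.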
2.3 Grant Permissions to Sink¶
# Grant publish permission to sink service account
gcloud pubsub topics add-iam-policy-binding tfdrift-audit-logs \
--member="$SINK_SA" \
--role="roles/pubsub.publisher"
✅ Verification: Run gcloud pubsub topics get-iam-policy tfdrift-audit-logs.
Expected output: You should see the sink service account with the roles/pubsub.publisher role.
2.4 Create Subscription for Falco¶
# Create pull subscription
gcloud pubsub subscriptions create tfdrift-falco-sub \
--topic=tfdrift-audit-logs \
--ack-deadline=60
Step 3: Install and Configure Falco¶
3.1 Install Falco with Docker (Recommended)¶
# Pull Falco image
docker pull falcosecurity/falco:latest
# Create Falco config directory
mkdir -p ~/falco-config
3.2 Create Falco Configuration¶
cat > ~/falco-config/falco.yaml <<EOF
# Falco Configuration for GCP Audit Logs
# Enable gRPC output
grpc:
  enabled: true
  bind_address: "0.0.0.0:5060"
  threadiness: 8
grpc_output:
  enabled: true
# Disable kernel module (not needed for cloud audit logs)
engine:
  kind: modern_ebpf
  modern_ebpf:
    cpus_for_each_buffer: 2
# Load GCP audit plugin
plugins:
  - name: gcpaudit
    library_path: /usr/share/falco/plugins/libgcpaudit.so
    init_config:
      project_id: "$PROJECT_ID"
      subscription: "tfdrift-falco-sub"
    open_params: ""
# Load rules for GCP
load_plugins: [gcpaudit]
# Output configuration
json_output: true
json_include_output_property: true
EOF
3.3 Create GCP Credentials Secret¶
# Create service account for Falco
gcloud iam service-accounts create tfdrift-falco \
--display-name="TFDrift Falco Service Account"
# Grant Pub/Sub subscriber role
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"
# Download key
gcloud iam service-accounts keys create ~/falco-config/gcp-key.json \
--iam-account=tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com
3.4 Run Falco¶
# Run Falco with GCP plugin
docker run -d \
--name falco \
-p 5060:5060 \
-v ~/falco-config:/etc/falco \
-e GOOGLE_APPLICATION_CREDENTIALS=/etc/falco/gcp-key.json \
falcosecurity/falco:latest \
-c /etc/falco/falco.yaml
3.5 Verify Falco is Running¶
# Check Falco logs
docker logs falco
# You should see:
# "Falco initialized with GCP Audit Log plugin"
# "gRPC server listening on 0.0.0.0:5060"
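Before starting TFDrift-Falco, it can help to wait until the Falco gRPC port accepts connections. The following is a rough sketch under stated assumptions: wait_for_port is a hypothetical helper, it relies on bash's /dev/tcp feature (so it needs bash rather than plain sh), and a successful TCP connection only shows the port is open, not that the gRPC service is healthy.

```shell
#!/usr/bin/env bash
# Return 0 once HOST:PORT accepts a TCP connection, or 1 after ATTEMPTS tries.
# Arguments: host, port, attempts (default 10). Sleeps 1 second between tries.
# wait_for_port is an illustrative helper, not part of Falco or TFDrift-Falco.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # Opening /dev/tcp/HOST/PORT in a subshell attempts a TCP connection (bash only).
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}

# Example usage (commented out so the snippet has no side effects when sourced):
# wait_for_port localhost 5060 30 || echo "Falco gRPC port not reachable" >&2
```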
Step 4: Configure TFDrift-Falco¶
4.1 Create GCS Bucket for Terraform State (If Using GCS Backend)¶
# Create bucket
gsutil mb -p $PROJECT_ID -l us-central1 gs://tfdrift-terraform-state
# Enable versioning
gsutil versioning set on gs://tfdrift-terraform-state
4.2 Create TFDrift-Falco Configuration¶
cat > config-gcp.yaml <<EOF
# TFDrift-Falco GCP Configuration
providers:
  aws:
    enabled: false
  gcp:
    enabled: true
    projects:
      - "$PROJECT_ID"
    state:
      backend: "gcs"
      gcs_bucket: "tfdrift-terraform-state"
      gcs_prefix: "terraform.tfstate"
falco:
  enabled: true
  hostname: "localhost" # or Falco container IP
  port: 5060
drift_rules:
  - name: "GCE Instance Configuration Change"
    resource_types:
      - "google_compute_instance"
    watched_attributes:
      - "metadata"
      - "labels"
      - "tags"
      - "machine_type"
      - "service_account"
    severity: "high"
  - name: "Firewall Rule Modification"
    resource_types:
      - "google_compute_firewall"
    watched_attributes:
      - "allow"
      - "deny"
      - "source_ranges"
      - "target_tags"
    severity: "critical"
  - name: "Cloud Storage Security Settings"
    resource_types:
      - "google_storage_bucket"
    watched_attributes:
      - "encryption"
      - "public_access_prevention"
      - "uniform_bucket_level_access"
    severity: "critical"
  - name: "Cloud SQL Instance Change"
    resource_types:
      - "google_sql_database_instance"
    watched_attributes:
      - "settings"
      - "database_version"
      - "deletion_protection"
    severity: "high"
  - name: "IAM Policy Change"
    resource_types:
      - "google_project_iam_binding"
      - "google_project_iam_member"
    watched_attributes:
      - "role"
      - "members"
    severity: "critical"
notifications:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#gcp-security-alerts"
  webhook:
    enabled: false
    url: "https://your-siem.example.com/webhook"
logging:
  level: "info"
  format: "json"
EOF
4.3 Run TFDrift-Falco¶
Option A: Docker¶
docker run -d \
--name tfdrift-falco \
--network host \
-v $(pwd)/config-gcp.yaml:/config/config.yaml:ro \
-e GOOGLE_APPLICATION_CREDENTIALS=/config/gcp-key.json \
-v ~/falco-config/gcp-key.json:/config/gcp-key.json:ro \
ghcr.io/higakikeita/tfdrift-falco:latest \
--config /config/config.yaml
Option B: Binary¶
# Set GCP credentials
export GOOGLE_APPLICATION_CREDENTIALS=~/falco-config/gcp-key.json
# Run TFDrift-Falco
./tfdrift --config config-gcp.yaml
Step 5: Verify Setup¶
5.1 Test Drift Detection¶
Manually modify a Terraform-managed resource:
# Example: Add metadata to a GCE instance
gcloud compute instances add-metadata INSTANCE_NAME \
--zone=us-central1-a \
--metadata=test-key=test-value
5.2 Check TFDrift-Falco Logs¶
# Docker
docker logs -f tfdrift-falco
# You should see:
# "Drift Detected: google_compute_instance.INSTANCE_NAME"
# "Changed: metadata.test-key = null → test-value"
5.3 Check Slack Notification¶
You should receive a Slack alert with:
- 🚨 Resource: google_compute_instance.INSTANCE_NAME
- Changed attribute: metadata.test-key
- User: your-email@example.com
- Project: your-gcp-project-id
Troubleshooting¶
Issue 1: Falco Not Receiving Audit Logs¶
Symptoms:
- Falco starts but no events appear
- docker logs falco shows no audit log entries
Solutions:
# 1. Verify Pub/Sub subscription
gcloud pubsub subscriptions describe tfdrift-falco-sub
# 2. Check messages in subscription
gcloud pubsub subscriptions pull tfdrift-falco-sub --limit=5
# 3. Verify log sink
gcloud logging sinks describe tfdrift-sink
# 4. Check service account permissions
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:tfdrift-falco@*"
Issue 2: TFDrift-Falco Cannot Connect to Falco¶
Symptoms:
- Error: "failed to connect to Falco"
- gRPC connection refused
Solutions:
# 1. Check Falco gRPC port
docker exec falco netstat -tlnp | grep 5060
# 2. Test gRPC connection
grpcurl -plaintext localhost:5060 list
# 3. Check firewall rules (if using remote Falco)
gcloud compute firewall-rules create allow-falco-grpc \
--allow=tcp:5060 \
--source-ranges=YOUR_CLIENT_IP/32
Issue 3: GCS Backend Authentication Fails¶
Symptoms:
- Error: "failed to read object from GCS"
- Permission denied
Solutions:
# 1. Grant the service account the Storage Object Viewer role if it is missing
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# 2. Test GCS access manually
gsutil ls gs://tfdrift-terraform-state/
# 3. Verify GOOGLE_APPLICATION_CREDENTIALS is set
echo $GOOGLE_APPLICATION_CREDENTIALS
Issue 4: Events Not Matching Terraform Resources¶
Symptoms:
- Audit logs received but no drift detected
- "Resource not found in Terraform state"
Solutions:
# 1. Verify Terraform state path
gsutil ls gs://tfdrift-terraform-state/terraform.tfstate
# 2. Check resource naming in Terraform vs GCP
terraform show -json | jq '.values.root_module.resources[] | {type, name}'
# 3. Enable debug logging
# In config.yaml:
logging:
  level: "debug"
Issue 5: High Volume of Irrelevant Events¶
Symptoms:
- Too many events being processed
- Performance degradation
Solutions:
Update log sink filter to be more specific:
gcloud logging sinks update tfdrift-sink \
--log-filter='
protoPayload.serviceName="compute.googleapis.com" AND
protoPayload.methodName=~"compute\.(instances|firewalls|networks)\.(insert|delete|update|patch|set.*)"
'
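Before updating the sink, you can check locally which method names a narrower pattern would keep. This is a sketch only: method_matches is a hypothetical helper, and it uses grep -E to approximate the regular expression from the filter above. Cloud Logging evaluates =~ with RE2 syntax, which agrees with POSIX extended regular expressions for this particular pattern.

```shell
#!/usr/bin/env bash
# Pattern copied from the methodName clause of the sink filter above.
METHOD_PATTERN='compute\.(instances|firewalls|networks)\.(insert|delete|update|patch|set.*)'

# Return 0 if the given audit log methodName would pass the pattern, 1 otherwise.
# method_matches is an illustrative helper and does not contact GCP.
method_matches() {
  printf '%s\n' "$1" | grep -Eq "$METHOD_PATTERN"
}

# Example usage (commented out so the snippet has no side effects when sourced):
# method_matches "v1.compute.instances.setMetadata" && echo "kept by filter"
# method_matches "v1.compute.instances.list" || echo "dropped by filter"
```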
Issue 6: Falco Container Crashes or Restarts¶
Symptoms:
- Docker container exits immediately after starting
- Error: error opening device /dev/host/proc
- Container restart loop
Error Messages:
Solutions:
# 1. Check container logs for exact error
docker logs falco --tail 50
# 2. Verify plugin exists in container
docker exec falco ls -la /usr/share/falco/plugins/
# 3. Use correct Falco version with gcpaudit plugin (0.37.0+)
docker pull falcosecurity/falco:latest
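# 4. Check configuration file syntax
# (--dry-run loads the configuration and rules without processing events;
# --validate expects a rules file as its argument)
docker run --rm -v ~/tfdrift-config:/config \
falcosecurity/falco:latest \
-c /config/falco.yaml --dry-run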
# 5. Verify GOOGLE_APPLICATION_CREDENTIALS path (expanded inside the container, not on the host)
docker exec falco sh -c 'ls -la "$GOOGLE_APPLICATION_CREDENTIALS"'
Issue 7: Permission Denied on Pub/Sub Subscription¶
Symptoms:
- Error: PERMISSION_DENIED: User not authorized to perform this action
- Falco cannot pull messages from subscription
Error Messages:
ERROR Failed to pull messages: rpc error: code = PermissionDenied
desc = User not authorized to perform this action.
Solutions:
# 1. Verify service account has Pub/Sub Subscriber role
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.role:roles/pubsub.subscriber"
# 2. Grant required permission
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"
# 3. Test subscription access with service account
gcloud pubsub subscriptions pull tfdrift-falco-sub \
--limit=1 \
--impersonate-service-account=tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com
# 4. Verify key file has correct format
cat ~/tfdrift-config/gcp-key.json | jq .
# Should show valid JSON with private_key, client_email, etc.
Issue 8: Log Sink Service Account Missing Permissions¶
Symptoms:
- Audit logs generated but not appearing in Pub/Sub
- Log sink exists but messages not delivered
Error Messages:
The caller does not have permission to publish to topic
projects/PROJECT_ID/topics/tfdrift-audit-logs
Solutions:
# 1. Get the log sink service account (writer identity)
SINK_SA=$(gcloud logging sinks describe tfdrift-sink \
--project=$PROJECT_ID \
--format="value(writerIdentity)")
echo "Sink Service Account: $SINK_SA"
# 2. Grant Publisher role to the sink service account
gcloud pubsub topics add-iam-policy-binding tfdrift-audit-logs \
--member="$SINK_SA" \
--role="roles/pubsub.publisher" \
--project=$PROJECT_ID
# 3. Verify the binding
gcloud pubsub topics get-iam-policy tfdrift-audit-logs \
--project=$PROJECT_ID
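# 4. Trigger a test event and check delivery
# Note: list calls are Data Access (ADMIN_READ) events, so this only produces
# an audit log entry if Data Access logs are enabled (see Step 1.2)
gcloud compute instances list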
sleep 30 # Wait for log delivery
gcloud pubsub subscriptions pull tfdrift-falco-sub --limit=1
Issue 9: Invalid Configuration File¶
Symptoms:
- TFDrift-Falco fails to start
- Error: failed to load configuration
- YAML parsing errors
Error Messages:
Solutions:
# 1. Validate YAML syntax
yamllint ~/tfdrift-config/config-gcp.yaml
# Or use Python
python3 -c "import yaml; yaml.safe_load(open('config-gcp.yaml'))"
# 2. Check common issues:
# ❌ Wrong: projects as string
providers:
  gcp:
    projects: "my-project" # Wrong
# ✅ Correct: projects as list
providers:
  gcp:
    projects:
      - "my-project"
# ℹ️ Quoting: unquoted plain scalars are valid YAML, so this parses the same as the quoted form
watched_attributes:
  - metadata
# ✅ Preferred for consistency with the rest of this guide:
watched_attributes:
  - "metadata"
  - "labels"
# 3. Use config validation if available
tfdrift --config ~/tfdrift-config/config-gcp.yaml --validate
Issue 10: Drift Not Detected for Specific Resources¶
Symptoms:
- Some resources show drift, others don't
- Manual changes not triggering alerts
- Logs show "resource not found in state"
Solutions:
# 1. Verify resource exists in Terraform state
terraform show -json | jq -r \
'.values.root_module.resources[] | select(.type=="google_compute_instance") | .name'
# 2. Check resource name format matches
# GCP Audit Log format: projects/PROJECT_ID/zones/ZONE/instances/NAME
# Terraform resource address: google_compute_instance.NAME
# 3. Enable debug logging to see resource matching
# In config-gcp.yaml:
logging:
  level: "debug"
  format: "json"
# 4. Check drift rule configuration
# Ensure resource type matches exactly:
drift_rules:
  - name: "GCE Instance Drift"
    resource_types:
      - "google_compute_instance" # Must match Terraform type exactly
# 5. Verify watched attributes exist in Terraform state
terraform show -json | jq \
'.values.root_module.resources[] |
select(.type=="google_compute_instance") |
.values | keys'
Issue 11: Webhook Notifications Not Sending¶
Symptoms:
- Drift detected but no Slack/webhook notification
- No errors in TFDrift-Falco logs
Solutions:
# 1. Test webhook URL manually
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d '{"text":"Test message from TFDrift-Falco"}'
# 2. Check webhook configuration
# In config-gcp.yaml:
notifications:
  slack:
    enabled: true # Must be true
    webhook_url: "https://hooks.slack.com/services/..." # Must be valid URL
  webhook:
    enabled: true
    url: "https://your-webhook.example.com/tfdrift"
    method: "POST"
    headers:
      Content-Type: "application/json"
      Authorization: "Bearer YOUR_TOKEN"
# 3. Enable verbose logging for notifications
logging:
  level: "debug"
# 4. Check for network connectivity issues
curl -v https://hooks.slack.com
Issue 12: Multiple Projects Not Loading State¶
Symptoms:
- Only first project's state loaded
- Resources from other projects not detected
Solutions:
# ❌ Wrong: Single state file for multiple projects
providers:
  gcp:
    projects:
      - "project-1"
      - "project-2"
    state:
      backend: "gcs"
      gcs_bucket: "terraform-state"
      gcs_prefix: "terraform.tfstate" # Same file for all!
# ✅ Correct: Use {PROJECT_ID} placeholder
providers:
  gcp:
    projects:
      - "project-1"
      - "project-2"
    state:
      backend: "gcs"
      gcs_bucket: "terraform-state"
      gcs_prefix: "{PROJECT_ID}/terraform.tfstate" # Different per project
# Verify state files exist for all projects
for project in project-1 project-2; do
gsutil ls gs://terraform-state/$project/terraform.tfstate
done
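To see which object each project should resolve to, the placeholder can be expanded by hand. The sketch below is an assumption about how the {PROJECT_ID} placeholder is substituted (a plain string replacement); state_object_path is a hypothetical helper and is not taken from TFDrift-Falco's code.

```shell
#!/usr/bin/env bash
# Print the gs:// path that a gcs_prefix template resolves to for one project.
# Arguments: bucket name, prefix template containing {PROJECT_ID}, project ID.
# state_object_path is an illustrative helper; the substitution rule is assumed.
state_object_path() {
  local bucket="$1" template="$2" project="$3" object
  # Project IDs contain only lowercase letters, digits and hyphens,
  # so they are safe to use in the sed replacement text.
  object="$(printf '%s\n' "$template" | sed "s|{PROJECT_ID}|${project}|g")"
  printf 'gs://%s/%s\n' "$bucket" "$object"
}

# Example usage (commented out so the snippet has no side effects when sourced):
# for project in project-1 project-2; do
#   gsutil ls "$(state_object_path terraform-state '{PROJECT_ID}/terraform.tfstate' "$project")"
# done
```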
Debug Procedures¶
Step-by-Step Debugging Workflow¶
When TFDrift-Falco is not working as expected, follow this systematic approach:
1. Verify the Complete Pipeline¶
#!/bin/bash
# Debug script - save as debug-tfdrift.sh
PROJECT_ID="your-project-id"
echo "==> 1. Checking GCP Audit Logs"
gcloud logging read "protoPayload.serviceName=compute.googleapis.com" \
--limit=3 \
--format=json \
--project=$PROJECT_ID | jq '.[0].protoPayload.methodName'
echo "==> 2. Checking Log Sink"
gcloud logging sinks describe tfdrift-sink --project=$PROJECT_ID
echo "==> 3. Checking Pub/Sub Topic"
gcloud pubsub topics describe tfdrift-audit-logs --project=$PROJECT_ID
echo "==> 4. Checking Pub/Sub Subscription"
gcloud pubsub subscriptions describe tfdrift-falco-sub --project=$PROJECT_ID
echo "==> 5. Pulling Sample Message"
gcloud pubsub subscriptions pull tfdrift-falco-sub \
--limit=1 \
--format=json \
--project=$PROJECT_ID | jq '.[0].message.data' -r | base64 -d | jq .
echo "==> 6. Checking Falco Container"
docker ps | grep falco
docker logs falco --tail 20
echo "==> 7. Testing Falco gRPC"
grpcurl -plaintext localhost:5060 list
echo "==> 8. Testing TFDrift-Falco Connection"
# Run TFDrift-Falco with debug logging
tfdrift --config config-gcp.yaml --log-level=debug
2. Enable Maximum Verbosity¶
# config-gcp.yaml - Debug configuration
logging:
  level: "debug" # trace, debug, info, warn, error
  format: "json" # json or text
falco:
  enabled: true
  hostname: "localhost"
  port: 5060
  timeout: 30s
  retry:
    max_attempts: 3
    initial_interval: "1s"
# falco.yaml - Debug configuration
json_output: true
json_include_output_property: true
log_level: debug # Add this for Falco debug logs
grpc:
  enabled: true
  bind_address: "0.0.0.0:5060"
  threadiness: 8
grpc_output:
  enabled: true
3. Test Each Component Independently¶
# Test 1: Trigger a known GCP event
echo "Creating test compute instance..."
gcloud compute instances create tfdrift-test-instance \
--zone=us-central1-a \
--machine-type=e2-micro \
--project=$PROJECT_ID
# Wait for audit log (30 seconds typical)
sleep 35
# Test 2: Check if audit log was created
gcloud logging read \
'protoPayload.methodName="v1.compute.instances.insert" AND
protoPayload.resourceName:"instances/tfdrift-test-instance"' \
--limit=1 \
--format=json \
--project=$PROJECT_ID
# Test 3: Check if it reached Pub/Sub
gcloud pubsub subscriptions pull tfdrift-falco-sub \
--limit=5 \
--format=json \
--project=$PROJECT_ID
# Test 4: Check Falco received it
docker logs falco --tail 50 | grep "compute.instances.insert"
# Cleanup
gcloud compute instances delete tfdrift-test-instance \
--zone=us-central1-a \
--quiet \
--project=$PROJECT_ID
4. Isolate Network Issues¶
# Test gRPC connectivity from TFDrift-Falco perspective
grpcurl -plaintext localhost:5060 list
# If using remote Falco, test from client machine
grpcurl -plaintext FALCO_HOST:5060 list
# Check Docker network
docker network inspect bridge
# Check port bindings
docker port falco
# Should show: 5060/tcp -> 0.0.0.0:5060
# Test with telnet
telnet localhost 5060
Log Analysis Guide¶
Understanding GCP Audit Log Structure¶
GCP Audit Logs delivered via Falco gcpaudit plugin have this structure:
{
"protoPayload": {
"serviceName": "compute.googleapis.com",
"methodName": "v1.compute.instances.setMetadata",
"resourceName": "projects/123456789/zones/us-central1-a/instances/my-instance",
"authenticationInfo": {
"principalEmail": "user@example.com"
},
"request": {
"metadata": {
"items": [
{
"key": "ssh-keys",
"value": "user:ssh-rsa AAAA..."
}
]
}
},
"response": {
"operationType": "setMetadata"
}
},
"timestamp": "2025-12-17T10:30:45.123456Z",
"severity": "NOTICE"
}
Key Fields for Drift Detection¶
| Field | Purpose | Example |
|---|---|---|
| serviceName | Which GCP service | compute.googleapis.com |
| methodName | What action | v1.compute.instances.setMetadata |
| resourceName | Which resource | projects/.../instances/my-instance |
| principalEmail | Who made the change | user@example.com |
| request | What changed | New metadata values |
| timestamp | When | ISO 8601 timestamp |
Falco gRPC Output Format¶
When Falco forwards events to TFDrift-Falco via gRPC:
{
"output": "GCP Audit Log Event",
"priority": "Notice",
"rule": "GCP Audit Log",
"time": "2025-12-17T10:30:45.123456Z",
"output_fields": {
"gcp.serviceName": "compute.googleapis.com",
"gcp.methodName": "v1.compute.instances.setMetadata",
"gcp.resourceName": "projects/123456789/zones/us-central1-a/instances/my-instance",
"gcp.principalEmail": "user@example.com",
"gcp.projectId": "my-project-123"
},
"source": "gcpaudit"
}
TFDrift-Falco Event Processing¶
TFDrift-Falco processes events through these stages:
1. Receive gRPC event from Falco
↓
2. Parse GCP-specific fields (gcp.serviceName, gcp.methodName, etc.)
↓
3. Map methodName to Terraform resource type
compute.instances.setMetadata → google_compute_instance
↓
4. Extract resource identifier from resourceName
projects/.../instances/my-instance → my-instance
↓
5. Load Terraform state for project
↓
6. Find matching resource in state
google_compute_instance.my_instance
↓
7. Compare changed attributes with drift rules
↓
8. Generate drift alert if mismatch detected
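Stages 3 and 4 above can be illustrated with two small shell functions. This is a sketch under stated assumptions, not TFDrift-Falco's actual implementation: the function names are hypothetical, and the mapping covers only a few Compute Engine methods for illustration.

```shell
#!/usr/bin/env bash
# Stage 3 (illustrative subset): map an audit log methodName to a Terraform
# resource type. Returns 1 when the method is not in this small example table.
terraform_type_for_method() {
  case "$1" in
    *compute.instances.*) echo "google_compute_instance" ;;
    *compute.firewalls.*) echo "google_compute_firewall" ;;
    *compute.networks.*)  echo "google_compute_network" ;;
    *) return 1 ;;
  esac
}

# Stage 4: take the last path segment of resourceName as the resource identifier.
resource_short_name() {
  printf '%s\n' "${1##*/}"
}

# Example usage (commented out so the snippet has no side effects when sourced):
# terraform_type_for_method "v1.compute.instances.setMetadata"
# resource_short_name "projects/123456789/zones/us-central1-a/instances/my-instance"
```

The identifier extracted in stage 4 is the GCP resource name, which then has to be matched against a resource in the Terraform state in stage 6.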
Reading TFDrift-Falco Logs¶
Normal Operation:
INFO[2025-12-17T10:30:00Z] Starting TFDrift-Falco v0.5.0
INFO[2025-12-17T10:30:01Z] Connected to Falco gRPC at localhost:5060
INFO[2025-12-17T10:30:01Z] Loaded Terraform state for project my-project-123 (45 resources)
INFO[2025-12-17T10:30:45Z] Received event: compute.instances.setMetadata (user@example.com)
INFO[2025-12-17T10:30:45Z] Mapped to resource: google_compute_instance.my_instance
WARN[2025-12-17T10:30:45Z] Drift detected: metadata changed on google_compute_instance.my_instance
INFO[2025-12-17T10:30:45Z] Sent Slack notification
Debug Mode (--log-level=debug):
DEBUG[2025-12-17T10:30:45Z] Raw Falco event: {output_fields:{gcp.methodName:v1.compute.instances.setMetadata}}
DEBUG[2025-12-17T10:30:45Z] Parsed GCP event: service=compute.googleapis.com method=setMetadata resource=my-instance
DEBUG[2025-12-17T10:30:45Z] Resource mapper: compute.instances → google_compute_instance
DEBUG[2025-12-17T10:30:45Z] State lookup: google_compute_instance.my_instance found
DEBUG[2025-12-17T10:30:45Z] Comparing attributes: [metadata, labels, tags]
DEBUG[2025-12-17T10:30:45Z] Attribute 'metadata' differs: state={...} event={...}
DEBUG[2025-12-17T10:30:45Z] Drift rule matched: GCE Instance Configuration Change (severity: high)
DEBUG[2025-12-17T10:30:45Z] Calling webhook: https://hooks.slack.com/services/...
DEBUG[2025-12-17T10:30:46Z] Webhook response: 200 OK
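When the log is long, a short filter can list just the resources that drifted. The sketch below assumes the text log format shown in the samples above (a WARN line ending in "on RESOURCE_ADDRESS"); it will not match if logging.format is set to "json". The function name drifted_resources is hypothetical.

```shell
#!/usr/bin/env bash
# Read TFDrift-Falco text logs on stdin and print each drifted resource address once.
# Assumes lines shaped like:
#   WARN[<timestamp>] Drift detected: <attribute> changed on <resource address>
# drifted_resources is an illustrative helper, not part of TFDrift-Falco.
drifted_resources() {
  sed -n 's/^WARN\[[^]]*] Drift detected: .* on \([^ ]*\)$/\1/p' | sort -u
}

# Example usage (commented out so the snippet has no side effects when sourced):
# docker logs tfdrift-falco 2>&1 | drifted_resources
```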
Common Log Patterns and Meanings¶
Pattern: "Resource not found in Terraform state"
Meaning: The GCP resource exists and was modified, but it's not managed by Terraform (or state file path is wrong)
Pattern: "Failed to connect to Falco"
Meaning: TFDrift-Falco cannot reach Falco on the configured hostname:port
Pattern: "Failed to load Terraform state"
Meaning: State file path in configuration doesn't match actual GCS object path
Pattern: "No drift rules matched"
Meaning: Event received, but no drift rules configured for this resource type
Advanced Configuration¶
Multi-Project Setup¶
Monitor multiple GCP projects:
providers:
  gcp:
    enabled: true
    projects:
      - "project-1"
      - "project-2"
      - "project-3"
    state:
      backend: "gcs"
      gcs_bucket: "tfdrift-terraform-state"
      gcs_prefix: "project-{PROJECT_ID}/terraform.tfstate"
Custom Falco Rules¶
Create custom rules for specific GCP events:
# falco-custom-rules.yaml
- rule: Terraform Managed GCE Instance Modified
  desc: Detect modifications to Terraform-managed GCE instances
  condition: >
    gcp.methodName in (compute.instances.setMetadata,
    compute.instances.setLabels,
    compute.instances.setTags) and
    not gcp.authenticationInfo.principalEmail startswith "terraform-"
  output: >
    GCE instance modified outside Terraform
    (user=%gcp.authenticationInfo.principalEmail
    instance=%gcp.resource.name
    method=%gcp.methodName)
  priority: WARNING
  tags: [gcp, terraform, drift]
Regional Deployment¶
Deploy TFDrift-Falco per region:
# config-us-central1.yaml
providers:
  gcp:
    enabled: true
    projects:
      - "my-project"
    state:
      backend: "gcs"
      gcs_bucket: "tfdrift-state-us-central1"
      gcs_prefix: "terraform.tfstate"
# Separate config for each region
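One way to keep the per-region files consistent is to generate them from a single template. The sketch below is illustrative only: render_region_config is a hypothetical helper, the tfdrift-state-REGION bucket naming follows the us-central1 example above, and the key nesting shown (state under gcp) is an assumption.

```shell
#!/usr/bin/env bash
# Print a per-region TFDrift-Falco config to stdout.
# Arguments: project ID, region. The bucket name pattern and the YAML nesting
# are assumptions for illustration; adjust them to match your real config.
render_region_config() {
  local project="$1" region="$2"
  cat <<EOF
# config-${region}.yaml
providers:
  gcp:
    enabled: true
    projects:
      - "${project}"
    state:
      backend: "gcs"
      gcs_bucket: "tfdrift-state-${region}"
      gcs_prefix: "terraform.tfstate"
EOF
}

# Example usage (commented out so the snippet has no side effects when sourced):
# for region in us-central1 europe-west1; do
#   render_region_config my-project "$region" > "config-${region}.yaml"
# done
```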
Integration with SIEM¶
Send events to your SIEM:
notifications:
  webhook:
    enabled: true
    url: "https://splunk.example.com/services/collector"
    headers:
      Authorization: "Splunk YOUR_HEC_TOKEN"
      Content-Type: "application/json"
Complete Examples¶
This section provides production-ready Terraform configurations with corresponding TFDrift-Falco setups.
Example 1: Basic GCE Instance with Networking¶
Scenario: Monitor a simple compute instance for configuration drift.
Terraform Configuration (main.tf):
# Provider configuration
terraform {
required_version = ">= 1.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
backend "gcs" {
bucket = "my-terraform-state"
prefix = "prod/compute"
}
}
provider "google" {
project = var.project_id
region = var.region
}
# Variables
variable "project_id" {
description = "GCP Project ID"
type = string
}
variable "region" {
description = "GCP Region"
type = string
default = "us-central1"
}
variable "zone" {
description = "GCP Zone"
type = string
default = "us-central1-a"
}
# VPC Network
resource "google_compute_network" "main" {
name = "tfdrift-demo-network"
auto_create_subnetworks = false
description = "Network managed by Terraform"
}
# Subnet
resource "google_compute_subnetwork" "main" {
name = "tfdrift-demo-subnet"
ip_cidr_range = "10.0.1.0/24"
region = var.region
network = google_compute_network.main.id
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
}
}
# Firewall - Allow SSH
resource "google_compute_firewall" "allow_ssh" {
name = "tfdrift-demo-allow-ssh"
network = google_compute_network.main.name
allow {
protocol = "tcp"
ports = ["22"]
}
source_ranges = ["35.235.240.0/20"] # IAP ranges
target_tags = ["ssh-enabled"]
description = "Allow SSH via IAP"
}
# Compute Instance
resource "google_compute_instance" "web" {
name = "tfdrift-demo-web-server"
machine_type = "e2-medium"
zone = var.zone
tags = ["ssh-enabled", "web-server"]
labels = {
environment = "production"
managed_by = "terraform"
app = "web"
}
boot_disk {
initialize_params {
image = "debian-cloud/debian-11"
size = 20
type = "pd-standard"
}
}
network_interface {
subnetwork = google_compute_subnetwork.main.id
access_config {
// Ephemeral public IP
}
}
metadata = {
enable-oslogin = "TRUE"
startup-script = <<-EOF
#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
systemctl enable nginx
EOF
}
service_account {
email = google_service_account.instance_sa.email
scopes = ["cloud-platform"]
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
}
}
# Service Account for Instance
resource "google_service_account" "instance_sa" {
account_id = "tfdrift-demo-instance-sa"
display_name = "TFDrift Demo Instance Service Account"
description = "Service account for demo web server"
}
# Outputs
output "instance_name" {
value = google_compute_instance.web.name
}
output "instance_external_ip" {
value = google_compute_instance.web.network_interface[0].access_config[0].nat_ip
}
output "network_name" {
value = google_compute_network.main.name
}
TFDrift-Falco Configuration (config-demo.yaml):
providers:
gcp:
enabled: true
projects:
- "my-project-123"
state:
backend: "gcs"
gcs_bucket: "my-terraform-state"
gcs_prefix: "prod/compute/default.tfstate" # Terraform's GCS backend stores the default workspace at <prefix>/default.tfstate
falco:
enabled: true
hostname: "localhost"
port: 5060
timeout: 30s
drift_rules:
# Monitor instance configuration changes
- name: "GCE Instance Configuration Drift"
resource_types:
- "google_compute_instance"
watched_attributes:
- "metadata"
- "labels"
- "tags"
- "machine_type"
severity: "high"
# Critical: Monitor firewall rules
- name: "Firewall Rule Modification"
resource_types:
- "google_compute_firewall"
watched_attributes:
- "allow"
- "deny"
- "source_ranges"
- "target_tags"
severity: "critical"
# Monitor network changes
- name: "Network Configuration Change"
resource_types:
- "google_compute_network"
- "google_compute_subnetwork"
watched_attributes:
- "auto_create_subnetworks"
- "ip_cidr_range"
severity: "high"
notifications:
slack:
enabled: true
webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
logging:
level: "info"
format: "text"
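Terraform's GCS backend writes each workspace's state to `<prefix>/<workspace>.tfstate`, so with the backend block above the object is `prod/compute/default.tfstate`. Assuming `gcs_prefix` is interpreted as a full object path, that is the value it needs to point at. A small helper (the name `gcs_state_object` is hypothetical) makes the mapping explicit:

```python
def gcs_state_object(prefix, workspace="default"):
    """Return the object name Terraform's GCS backend uses for a workspace."""
    prefix = prefix.strip("/")
    name = f"{workspace}.tfstate"
    return f"{prefix}/{name}" if prefix else name


print(gcs_state_object("prod/compute"))
print(gcs_state_object("prod/compute", workspace="staging"))
```

Listing the bucket with `gsutil ls gs://my-terraform-state/prod/compute/` is a quick way to confirm the object name before starting TFDrift-Falco.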
Deployment Steps:
# 1. Apply Terraform configuration
terraform init
terraform apply -var="project_id=my-project-123"
# 2. Start TFDrift-Falco
tfdrift --config config-demo.yaml
# 3. Trigger drift by making manual changes
gcloud compute instances add-labels tfdrift-demo-web-server \
--zone=us-central1-a \
--labels=manual_change=true
# Expected: Drift alert in Slack within 30-60 seconds
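Conceptually, the label change in step 3 becomes an alert because the live value of a watched attribute no longer matches the value recorded in Terraform state. TFDrift-Falco's actual comparison logic is not shown in this guide; the following is only a minimal sketch of the idea, with a hypothetical helper name.

```python
def diff_attributes(state_attrs, live_attrs, watched):
    """Return watched attributes whose live value differs from the state value."""
    changes = {}
    for attr in watched:
        expected, actual = state_attrs.get(attr), live_attrs.get(attr)
        if expected != actual:
            changes[attr] = {"expected": expected, "actual": actual}
    return changes


state = {
    "machine_type": "e2-medium",
    "labels": {"environment": "production", "managed_by": "terraform", "app": "web"},
}
live = {
    "machine_type": "e2-medium",
    "labels": {
        "environment": "production",
        "managed_by": "terraform",
        "app": "web",
        "manual_change": "true",
    },
}
drift = diff_attributes(state, live, ["metadata", "labels", "tags", "machine_type"])
print(sorted(drift))
```

Only `labels` is reported here, because `machine_type` is unchanged and the other watched attributes are absent from both sides.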
Example 2: Multi-Tier Web Application¶
Scenario: Complete web application with load balancer, managed instance group, and Cloud SQL.
Terraform Configuration (main.tf):
terraform {
required_version = ">= 1.0"
backend "gcs" {
bucket = "my-terraform-state"
prefix = "prod/webapp"
}
}
provider "google" {
project = var.project_id
region = var.region
}
variable "project_id" {
type = string
}
variable "region" {
type = string
default = "us-central1"
}
# Network
resource "google_compute_network" "webapp" {
name = "webapp-network"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "webapp" {
name = "webapp-subnet"
ip_cidr_range = "10.0.0.0/24"
region = var.region
network = google_compute_network.webapp.id
}
# Firewall Rules
resource "google_compute_firewall" "allow_lb" {
name = "webapp-allow-lb"
network = google_compute_network.webapp.name
allow {
protocol = "tcp"
ports = ["80", "443"]
}
source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
target_tags = ["web-backend"]
}
# Instance Template
resource "google_compute_instance_template" "webapp" {
name_prefix = "webapp-template-"
machine_type = "e2-medium"
tags = ["web-backend"]
labels = {
environment = "production"
tier = "web"
managed_by = "terraform"
}
disk {
source_image = "debian-cloud/debian-11"
auto_delete = true
boot = true
disk_size_gb = 20
}
network_interface {
subnetwork = google_compute_subnetwork.webapp.id
}
metadata = {
startup-script = templatefile("${path.module}/startup.sh", {
db_host = google_sql_database_instance.main.private_ip_address
db_name = google_sql_database.webapp.name
db_user = google_sql_user.webapp.name
db_password = random_password.db_password.result
})
}
service_account {
email = google_service_account.webapp.email
scopes = ["cloud-platform"]
}
lifecycle {
create_before_destroy = true
}
}
# Managed Instance Group
resource "google_compute_region_instance_group_manager" "webapp" {
name = "webapp-mig"
region = var.region
base_instance_name = "webapp-instance"
version {
instance_template = google_compute_instance_template.webapp.id
}
target_size = 3
named_port {
name = "http"
port = 80
}
auto_healing_policies {
health_check = google_compute_health_check.webapp.id
initial_delay_sec = 300
}
}
# Health Check
resource "google_compute_health_check" "webapp" {
name = "webapp-health-check"
http_health_check {
port = 80
request_path = "/health"
}
check_interval_sec = 10
timeout_sec = 5
healthy_threshold = 2
unhealthy_threshold = 3
}
# Backend Service
resource "google_compute_backend_service" "webapp" {
name = "webapp-backend"
protocol = "HTTP"
port_name = "http"
timeout_sec = 30
enable_cdn = true
health_checks = [google_compute_health_check.webapp.id]
load_balancing_scheme = "EXTERNAL"
backend {
group = google_compute_region_instance_group_manager.webapp.instance_group
balancing_mode = "UTILIZATION"
capacity_scaler = 1.0
}
log_config {
enable = true
sample_rate = 1.0
}
}
# URL Map
resource "google_compute_url_map" "webapp" {
name = "webapp-url-map"
default_service = google_compute_backend_service.webapp.id
}
# HTTP Proxy
resource "google_compute_target_http_proxy" "webapp" {
name = "webapp-http-proxy"
url_map = google_compute_url_map.webapp.id
}
# Global Forwarding Rule
resource "google_compute_global_forwarding_rule" "webapp" {
name = "webapp-forwarding-rule"
target = google_compute_target_http_proxy.webapp.id
port_range = "80"
}
# Cloud SQL Instance
resource "google_sql_database_instance" "main" {
name = "webapp-db-instance"
database_version = "POSTGRES_14"
region = var.region
settings {
tier = "db-f1-micro"
availability_type = "REGIONAL"
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.webapp.id
}
maintenance_window {
day = 7 # Sunday
hour = 3
}
database_flags {
name = "log_connections"
value = "on"
}
}
deletion_protection = true
}
resource "google_sql_database" "webapp" {
name = "webapp"
instance = google_sql_database_instance.main.name
}
resource "google_sql_user" "webapp" {
name = "webapp_user"
instance = google_sql_database_instance.main.name
password = random_password.db_password.result
}
resource "random_password" "db_password" {
length = 16
special = true
}
# Service Account
resource "google_service_account" "webapp" {
account_id = "webapp-instance-sa"
display_name = "WebApp Instance Service Account"
}
# IAM Bindings
resource "google_project_iam_member" "webapp_sql_client" {
project = var.project_id
role = "roles/cloudsql.client"
member = "serviceAccount:${google_service_account.webapp.email}"
}
# Outputs
output "load_balancer_ip" {
value = google_compute_global_forwarding_rule.webapp.ip_address
}
output "db_instance_connection" {
value = google_sql_database_instance.main.connection_name
sensitive = true
}
TFDrift-Falco Configuration (config-webapp.yaml):
providers:
gcp:
enabled: true
projects:
- "my-project-123"
state:
backend: "gcs"
gcs_bucket: "my-terraform-state"
gcs_prefix: "prod/webapp/default.tfstate" # Terraform's GCS backend stores the default workspace at <prefix>/default.tfstate
falco:
enabled: true
hostname: "localhost"
port: 5060
drift_rules:
# Critical: Database configuration
- name: "Cloud SQL Configuration Change"
resource_types:
- "google_sql_database_instance"
watched_attributes:
- "settings"
- "database_version"
- "deletion_protection"
severity: "critical"
# Critical: IAM changes
- name: "IAM Binding Modification"
resource_types:
- "google_project_iam_member"
- "google_service_account_iam_binding"
watched_attributes:
- "role"
- "members"
severity: "critical"
# High: Load balancer configuration
- name: "Load Balancer Configuration Change"
resource_types:
- "google_compute_backend_service"
- "google_compute_url_map"
- "google_compute_target_http_proxy"
watched_attributes:
- "backend"
- "health_checks"
- "enable_cdn"
- "timeout_sec"
severity: "high"
# High: Instance template changes
- name: "Instance Template Modification"
resource_types:
- "google_compute_instance_template"
watched_attributes:
- "machine_type"
- "disk"
- "metadata"
- "service_account"
severity: "high"
# Medium: MIG scaling
- name: "MIG Target Size Change"
resource_types:
- "google_compute_region_instance_group_manager"
watched_attributes:
- "target_size"
- "version"
severity: "medium"
notifications:
slack:
enabled: true
webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
webhook:
enabled: true
url: "https://monitoring.example.com/webhooks/tfdrift"
method: "POST"
headers:
Content-Type: "application/json"
Authorization: "Bearer YOUR_API_TOKEN"
logging:
level: "info"
format: "json"
Test Drift Detection:
# 1. Deploy infrastructure
terraform apply -var="project_id=my-project-123"
# 2. Start TFDrift-Falco
tfdrift --config config-webapp.yaml
# 3. Trigger various drift scenarios
# Scenario A: Modify database backup settings (CRITICAL)
gcloud sql instances patch webapp-db-instance \
--backup-start-time=04:00
# Scenario B: Change MIG target size (MEDIUM)
gcloud compute instance-groups managed resize webapp-mig \
  --region=us-central1 \
  --size=5
# Scenario C: Modify backend service timeout (HIGH)
gcloud compute backend-services update webapp-backend \
--global \
--timeout=60
# Expected: Different severity alerts in Slack
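To reason about which severity a given change should produce, the rules can be treated as a lookup from resource type and changed attribute to severity. This sketch assumes that a rule applies only when both the resource type and the attribute are listed; that matching behaviour is an assumption, and `classify` is a hypothetical helper.

```python
RULES = [
    {
        "name": "Cloud SQL Configuration Change",
        "resource_types": ["google_sql_database_instance"],
        "watched_attributes": ["settings", "database_version", "deletion_protection"],
        "severity": "critical",
    },
    {
        "name": "MIG Target Size Change",
        "resource_types": ["google_compute_region_instance_group_manager"],
        "watched_attributes": ["target_size", "version"],
        "severity": "medium",
    },
]


def classify(resource_type, attribute, rules=RULES):
    """Return the severity of the first matching rule, or None if nothing matches."""
    for rule in rules:
        if resource_type in rule["resource_types"] and attribute in rule["watched_attributes"]:
            return rule["severity"]
    return None


print(classify("google_sql_database_instance", "settings"))
print(classify("google_compute_region_instance_group_manager", "target_size"))
print(classify("google_compute_health_check", "timeout_sec"))
```

A `None` result corresponds to the "No drift rules matched" log pattern: the event arrives, but no rule covers that resource type and attribute.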
Example 3: GKE Cluster with Monitoring¶
Terraform Configuration (gke-cluster.tf):
terraform {
backend "gcs" {
bucket = "my-terraform-state"
prefix = "prod/gke"
}
}
provider "google" {
project = var.project_id
region = var.region
}
variable "project_id" {
type = string
}
variable "region" {
type = string
default = "us-central1"
}
variable "cluster_name" {
type = string
default = "prod-gke-cluster"
}
# VPC for GKE
resource "google_compute_network" "gke" {
name = "gke-network"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "gke" {
name = "gke-subnet"
ip_cidr_range = "10.0.0.0/20"
region = var.region
network = google_compute_network.gke.id
secondary_ip_range {
range_name = "pods"
ip_cidr_range = "10.4.0.0/14"
}
secondary_ip_range {
range_name = "services"
ip_cidr_range = "10.8.0.0/20"
}
}
# GKE Cluster
resource "google_container_cluster" "primary" {
name = var.cluster_name
location = var.region
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
network = google_compute_network.gke.name
subnetwork = google_compute_subnetwork.gke.name
# IP allocation for VPC-native cluster
ip_allocation_policy {
cluster_secondary_range_name = "pods"
services_secondary_range_name = "services"
}
# Enable Workload Identity
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Master authorized networks
master_authorized_networks_config {
cidr_blocks {
cidr_block = "10.0.0.0/8"
display_name = "Internal"
}
}
# Monitoring and logging
logging_service = "logging.googleapis.com/kubernetes"
monitoring_service = "monitoring.googleapis.com/kubernetes"
# Addons
addons_config {
http_load_balancing {
disabled = false
}
horizontal_pod_autoscaling {
disabled = false
}
network_policy_config {
disabled = false
}
}
# Network policy
network_policy {
enabled = true
}
# Maintenance window
maintenance_policy {
daily_maintenance_window {
start_time = "03:00"
}
}
# Binary authorization
binary_authorization {
evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
}
}
# Node Pool
resource "google_container_node_pool" "primary_nodes" {
name = "primary-node-pool"
location = var.region
cluster = google_container_cluster.primary.name
node_count = 3
node_config {
machine_type = "e2-medium"
labels = {
environment = "production"
managed_by = "terraform"
}
tags = ["gke-node", "prod"]
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
service_account = google_service_account.gke_nodes.email
workload_metadata_config {
mode = "GKE_METADATA"
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
}
autoscaling {
min_node_count = 2
max_node_count = 10
}
management {
auto_repair = true
auto_upgrade = true
}
}
# Service Account for GKE nodes
resource "google_service_account" "gke_nodes" {
account_id = "gke-node-sa"
display_name = "GKE Node Service Account"
}
# IAM bindings for nodes
resource "google_project_iam_member" "gke_node_sa_logging" {
project = var.project_id
role = "roles/logging.logWriter"
member = "serviceAccount:${google_service_account.gke_nodes.email}"
}
resource "google_project_iam_member" "gke_node_sa_monitoring" {
project = var.project_id
role = "roles/monitoring.metricWriter"
member = "serviceAccount:${google_service_account.gke_nodes.email}"
}
# Outputs
output "cluster_name" {
value = google_container_cluster.primary.name
}
output "cluster_endpoint" {
value = google_container_cluster.primary.endpoint
sensitive = true
}
output "cluster_ca_certificate" {
value = google_container_cluster.primary.master_auth[0].cluster_ca_certificate
sensitive = true
}
TFDrift-Falco Configuration (config-gke.yaml):
providers:
gcp:
enabled: true
projects:
- "my-project-123"
state:
backend: "gcs"
gcs_bucket: "my-terraform-state"
gcs_prefix: "prod/gke/default.tfstate" # Terraform's GCS backend stores the default workspace at <prefix>/default.tfstate
falco:
enabled: true
hostname: "localhost"
port: 5060
drift_rules:
# Critical: GKE cluster configuration
- name: "GKE Cluster Configuration Change"
resource_types:
- "google_container_cluster"
watched_attributes:
- "master_authorized_networks_config"
- "workload_identity_config"
- "binary_authorization"
- "network_policy"
severity: "critical"
# High: Node pool configuration
- name: "GKE Node Pool Modification"
resource_types:
- "google_container_node_pool"
watched_attributes:
- "node_config"
- "autoscaling"
- "management"
severity: "high"
# Medium: Node pool scaling
- name: "Node Pool Size Change"
resource_types:
- "google_container_node_pool"
watched_attributes:
- "node_count"
severity: "medium"
notifications:
slack:
enabled: true
webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
logging:
level: "info"
format: "json"
Example 4: Production Multi-Project Setup¶
Directory Structure:
terraform/
├── environments/
│ ├── prod/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ └── staging/
│ ├── main.tf
│ ├── variables.tf
│ └── terraform.tfvars
├── modules/
│ ├── compute/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── networking/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── security/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
└── tfdrift-config/
├── config-prod.yaml
└── config-staging.yaml
Module Example (modules/compute/main.tf):
resource "google_compute_instance" "this" {
name = var.instance_name
machine_type = var.machine_type
zone = var.zone
tags = var.tags
labels = var.labels
boot_disk {
initialize_params {
image = var.boot_disk_image
size = var.boot_disk_size
}
}
network_interface {
subnetwork = var.subnetwork
dynamic "access_config" {
for_each = var.enable_external_ip ? [1] : []
content {
nat_ip = var.external_ip
}
}
}
metadata = var.metadata
service_account {
email = var.service_account_email
scopes = var.service_account_scopes
}
}
Environment Configuration (environments/prod/main.tf):
terraform {
backend "gcs" {
bucket = "company-terraform-state"
prefix = "prod"
}
}
module "networking" {
source = "../../modules/networking"
project_id = var.project_id
region = var.region
network_name = "prod-network"
subnets = [
{
name = "prod-subnet-web"
ip_cidr_range = "10.0.1.0/24"
region = "us-central1"
},
{
name = "prod-subnet-db"
ip_cidr_range = "10.0.2.0/24"
region = "us-central1"
}
]
}
module "web_servers" {
source = "../../modules/compute"
count = 3
instance_name = "prod-web-${count.index}"
machine_type = "n2-standard-2"
zone = "us-central1-a"
subnetwork = module.networking.subnet_ids["prod-subnet-web"]
tags = ["web", "prod"]
labels = {
environment = "production"
tier = "web"
managed_by = "terraform"
}
service_account_email = module.security.web_sa_email
service_account_scopes = ["cloud-platform"]
}
module "security" {
source = "../../modules/security"
project_id = var.project_id
}
TFDrift-Falco Production Config (tfdrift-config/config-prod.yaml):
providers:
gcp:
enabled: true
projects:
- "company-prod-123"
- "company-prod-456"
state:
backend: "gcs"
gcs_bucket: "company-terraform-state"
gcs_prefix: "{PROJECT_ID}/terraform.tfstate"
falco:
enabled: true
hostname: "falco.internal.company.com"
port: 5060
tls:
enabled: true
ca_cert: "/etc/tfdrift/certs/ca.crt"
client_cert: "/etc/tfdrift/certs/client.crt"
client_key: "/etc/tfdrift/certs/client.key"
drift_rules:
# Production-critical rules
- name: "Production IAM Changes"
resource_types:
- "google_project_iam_member"
- "google_project_iam_binding"
- "google_service_account_iam_binding"
watched_attributes:
- "role"
- "members"
severity: "critical"
- name: "Production Database Changes"
resource_types:
- "google_sql_database_instance"
watched_attributes:
- "settings"
- "deletion_protection"
- "database_version"
severity: "critical"
- name: "Production Network Security"
resource_types:
- "google_compute_firewall"
- "google_compute_security_policy"
watched_attributes:
- "allow"
- "deny"
- "source_ranges"
severity: "critical"
- name: "Production Compute Changes"
resource_types:
- "google_compute_instance"
- "google_compute_instance_template"
watched_attributes:
- "machine_type"
- "metadata"
- "labels"
- "service_account"
severity: "high"
- name: "Production GKE Cluster Changes"
resource_types:
- "google_container_cluster"
- "google_container_node_pool"
watched_attributes:
- "master_authorized_networks_config"
- "node_config"
- "autoscaling"
severity: "high"
notifications:
slack:
enabled: true
webhook_url: "https://hooks.slack.com/services/T00/B00/XX"
channel: "#prod-alerts"
username: "TFDrift-Falco [PROD]"
webhook:
enabled: true
url: "https://monitoring.company.com/api/v1/alerts/tfdrift"
method: "POST"
timeout: "10s"
retry:
max_attempts: 3
initial_interval: "2s"
max_interval: "10s"
headers:
Content-Type: "application/json"
Authorization: "Bearer ${WEBHOOK_API_TOKEN}"
X-Environment: "production"
pagerduty:
enabled: true
integration_key: "${PAGERDUTY_INTEGRATION_KEY}"
severity_mapping:
critical: "critical"
high: "error"
medium: "warning"
low: "info"
logging:
level: "info"
format: "json"
output: "stdout"
filtering:
# Ignore read-only operations
exclude_events:
- "*.list"
- "*.get"
- "*.describe"
# Focus on specific services
include_services:
- "compute.googleapis.com"
- "container.googleapis.com"
- "sqladmin.googleapis.com"
- "iam.googleapis.com"
- "storage.googleapis.com"
Deployment Script (deploy-prod.sh):
#!/bin/bash
# Run from the terraform/ directory shown in the directory structure above
set -euo pipefail
ENVIRONMENT="prod"
PROJECT_ID="company-prod-123"
echo "==> Deploying ${ENVIRONMENT} infrastructure..."
cd "environments/${ENVIRONMENT}"
# Initialize Terraform
terraform init
# Plan
terraform plan -var="project_id=${PROJECT_ID}" -out=tfplan
# Apply with approval
read -r -p "Apply changes? (yes/no): " APPLY
if [ "$APPLY" = "yes" ]; then
  terraform apply tfplan
  echo "✓ Terraform applied successfully"
  # Start TFDrift-Falco in the background
  echo "==> Starting TFDrift-Falco..."
  tfdrift --config "../../tfdrift-config/config-${ENVIRONMENT}.yaml" &
  TFDRIFT_PID=$!
  echo "✓ TFDrift-Falco started (PID: $TFDRIFT_PID)"
  # Save PID (writing to /var/run typically requires root)
  echo "$TFDRIFT_PID" > /var/run/tfdrift.pid
else
  echo "Deployment cancelled"
fi
Production Best Practices¶
This section covers best practices for running TFDrift-Falco in production environments.
1. Infrastructure Design¶
State Management¶
DO:
- Use remote state backends (GCS) with versioning enabled
- Implement state locking to prevent concurrent modifications
- Organize state files by environment and project
- Use workspace separation for different environments
# ✅ Good: Separate state files per project
providers:
  gcp:
    projects:
      - "company-prod-123"
      - "company-prod-456"
    state:
      backend: "gcs"
      gcs_bucket: "company-terraform-state"
      gcs_prefix: "{PROJECT_ID}/prod/terraform.tfstate"
DON'T:
- Use local state files in production
- Share state files across unrelated resources
- Store state files in public buckets
# ❌ Bad: Single state for all projects
state:
  backend: "gcs"
  gcs_prefix: "all-projects.tfstate" # Too broad!
Multi-Project Organization¶
Recommended Structure:
terraform/
├── shared-services/ # Shared infrastructure
│ ├── networking/
│ ├── security/
│ └── monitoring/
├── environments/
│ ├── prod/ # Production environment
│ │ ├── project-a/
│ │ └── project-b/
│ ├── staging/ # Staging environment
│ └── dev/ # Development environment
└── modules/ # Reusable modules
├── compute/
├── database/
└── networking/
TFDrift Configuration per Environment:
# prod/tfdrift-config.yaml
providers:
  gcp:
    projects:
      - "company-prod-project-a"
      - "company-prod-project-b"
    state:
      backend: "gcs"
      gcs_bucket: "company-terraform-state"
      gcs_prefix: "{PROJECT_ID}/prod/terraform.tfstate"
drift_rules:
  - name: "Critical Production Changes"
    resource_types:
      - "google_sql_database_instance"
      - "google_compute_firewall"
      - "google_project_iam_*"
    severity: "critical"
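The `google_project_iam_*` entry relies on wildcard matching of resource types. This guide does not spell out the exact matching semantics, so the sketch below assumes shell-style globbing, which Python's standard `fnmatch` module implements. It is a convenient way to check which Terraform resource types a pattern list would cover.

```python
from fnmatch import fnmatchcase

PATTERNS = [
    "google_sql_database_instance",
    "google_compute_firewall",
    "google_project_iam_*",
]


def matches_any(resource_type, patterns=PATTERNS):
    """Return True if the resource type matches at least one glob pattern."""
    return any(fnmatchcase(resource_type, pattern) for pattern in patterns)


for candidate in (
    "google_project_iam_member",
    "google_project_iam_binding",
    "google_project_service",
):
    print(candidate, matches_any(candidate))
```

Under glob semantics the pattern covers `google_project_iam_member` and `google_project_iam_binding` but not `google_project_service`, so unrelated project-level resources stay out of the critical rule.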
Resource Naming Conventions¶
Consistent Naming Pattern:
{environment}-{project}-{service}-{resource_type}-{identifier}
Examples:
- prod-webapp-lb-frontend
- staging-api-db-primary
- prod-shared-network-main
In Terraform:
locals {
environment = "prod"
project = "webapp"
naming_prefix = "${local.environment}-${local.project}"
}
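resource "google_compute_instance" "web" {
  count = 3 # count must be set for count.index to be valid
  name  = "${local.naming_prefix}-web-${count.index + 1}"
  labels = {
    environment = local.environment
    project     = local.project
    managed_by  = "terraform"
    cost_center = var.cost_center
  }
  # machine_type, zone, boot_disk, and network_interface omitted for brevity
}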
2. Configuration Management¶
Environment Separation¶
Use separate configurations for each environment:
# Directory structure
configs/
├── dev.yaml # Development settings
├── staging.yaml # Staging settings
├── prod.yaml # Production settings
└── shared.yaml # Shared base config
Example: Development Config
# configs/dev.yaml
providers:
  gcp:
    enabled: true
    projects:
      - "company-dev-123"
    state:
      backend: "local" # OK for dev
      local_path: "./terraform.tfstate"
drift_rules:
  - name: "Dev Compute Changes"
    resource_types:
      - "google_compute_instance"
    severity: "medium" # Lower severity for dev
logging:
  level: "debug" # More verbose in dev
notifications:
  slack:
    enabled: false # Don't spam Slack in dev
Example: Production Config
# configs/prod.yaml
providers:
gcp:
enabled: true
projects:
- "company-prod-123"
- "company-prod-456"
state:
backend: "gcs" # Remote state required
gcs_bucket: "company-terraform-state-prod"
gcs_prefix: "{PROJECT_ID}/terraform.tfstate"
drift_rules:
- name: "Critical Production IAM Changes"
resource_types:
- "google_project_iam_*"
- "google_service_account_iam_*"
severity: "critical"
- name: "Production Database Changes"
resource_types:
- "google_sql_database_instance"
severity: "critical"
logging:
level: "info" # Production logging
format: "json"
output: "stdout"
notifications:
slack:
enabled: true
webhook_url: "${SLACK_WEBHOOK_PROD}"
channel: "#prod-alerts"
pagerduty:
enabled: true
integration_key: "${PAGERDUTY_KEY_PROD}"
Secret Management¶
DO:
- Use environment variables for secrets
- Integrate with secret managers (Secret Manager, Vault)
- Rotate credentials regularly
- Never commit secrets to version control
# ✅ Good: Environment variables
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."
export PAGERDUTY_INTEGRATION_KEY="xxx"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
# Run with secrets from environment
tfdrift --config config-prod.yaml
# ✅ Good: Reference environment variables in config
notifications:
  slack:
    webhook_url: "${SLACK_WEBHOOK_URL}"
  pagerduty:
    integration_key: "${PAGERDUTY_INTEGRATION_KEY}"
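A quick pre-flight check can confirm that every `${VAR}` reference in a config value resolves before the service starts. The sketch below is one possible way to do that and is not TFDrift-Falco's own substitution code; `expand_env` is a hypothetical helper that fails loudly on a missing variable instead of silently leaving the placeholder in place.

```python
import re

_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env):
    """Replace ${NAME} references in value using env, raising if one is unset."""

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR_PATTERN.sub(substitute, value)


sample_env = {"SLACK_WEBHOOK_URL": "https://hooks.example.invalid/services/T00/B00/XX"}
print(expand_env("${SLACK_WEBHOOK_URL}", sample_env))
```

In a real check, `os.environ` would be passed as `env`; a plain dictionary is used here so the example runs without touching the process environment.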
DON'T:
# ❌ Bad: Hardcoded secrets
notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/T00/B00/actual_secret_here"
Integration with GCP Secret Manager:
# Store secret
echo -n "https://hooks.slack.com/services/..." | \
gcloud secrets create tfdrift-slack-webhook \
--data-file=- \
--replication-policy="automatic"
# Retrieve secret at runtime
export SLACK_WEBHOOK_URL=$(gcloud secrets versions access latest \
--secret="tfdrift-slack-webhook")
# Run TFDrift
tfdrift --config config-prod.yaml
Configuration Validation¶
Pre-deployment validation script:
#!/bin/bash
# validate-config.sh
# Exit on the first failed check so a failure is never reported as success
set -euo pipefail
CONFIG_FILE="${1:?Usage: validate-config.sh <config-file>}"
echo "==> Validating TFDrift configuration..."
# 1. Check YAML syntax
yamllint "$CONFIG_FILE"
# 2. Check required fields (the path is passed as an argument, not interpolated into the Python source)
python3 - "$CONFIG_FILE" << 'EOF'
import sys
import yaml

with open(sys.argv[1]) as f:
    config = yaml.safe_load(f) or {}

gcp = (config.get('providers') or {}).get('gcp')
# Check providers
if not gcp:
    print("❌ Missing providers.gcp configuration")
    sys.exit(1)
# Check projects
if not gcp.get('projects'):
    print("❌ No projects specified")
    sys.exit(1)
# Check state backend
if not (gcp.get('state') or {}).get('backend'):
    print("❌ No state backend specified")
    sys.exit(1)
# Check drift rules
if not config.get('drift_rules'):
    print("⚠️ Warning: No drift rules defined")
print("✓ Configuration is valid")
EOF
# 3. Validate environment variables
required_vars=("SLACK_WEBHOOK_URL" "GOOGLE_APPLICATION_CREDENTIALS")
for var in "${required_vars[@]}"; do
  if [ -z "${!var:-}" ]; then
    echo "❌ Missing required environment variable: $var"
    exit 1
  fi
done
echo "✓ All validations passed"
3. Monitoring & Alerting¶
Alert Routing Strategy¶
Severity-based routing:
# Route alerts based on severity
notifications:
  # Critical: Page on-call engineer
  pagerduty:
    enabled: true
    integration_key: "${PAGERDUTY_KEY}"
    severity_mapping:
      critical: "critical" # Triggers page
      high: "error" # Creates incident
      medium: "warning" # Creates incident (low priority)
      low: "info" # Notification only
  # High/Critical: Post to Slack immediately
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_PROD}"
    channel: "#prod-alerts"
    severity_filter: ["critical", "high"]
  # All events: Send to SIEM
  webhook:
    enabled: true
    url: "https://siem.company.com/api/events"
    severity_filter: ["critical", "high", "medium", "low"]
  # Critical only: Email leadership
  email:
    enabled: true
    smtp_server: "smtp.company.com"
    to: ["oncall@company.com", "security@company.com"]
    severity_filter: ["critical"]
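The routing above can be summarised as a function from severity to the set of channels that receive the alert. The sketch below assumes that a channel without a `severity_filter` (PagerDuty here) receives every severity and relies on `severity_mapping` to decide how loudly to notify; that behaviour is an assumption, not something this guide confirms.

```python
CHANNEL_FILTERS = {
    "slack": {"critical", "high"},
    "webhook": {"critical", "high", "medium", "low"},
    "email": {"critical"},
    "pagerduty": None,  # no severity_filter configured: assumed to receive everything
}

PAGERDUTY_SEVERITY = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
}


def route(severity):
    """Return the sorted list of channels that would receive this severity."""
    return sorted(
        channel
        for channel, allowed in CHANNEL_FILTERS.items()
        if allowed is None or severity in allowed
    )


for level in ("critical", "high", "medium", "low"):
    print(level, route(level), "pagerduty severity:", PAGERDUTY_SEVERITY[level])
```

Walking each severity through the table like this is a cheap way to spot gaps, for example a level that reaches no human-facing channel at all.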
Severity Level Guidelines¶
Define clear severity criteria:
| Severity | When to Use | Response Time | Examples |
|---|---|---|---|
| Critical | Security-impacting changes, data loss risk | Immediate (< 5 min) | IAM changes, firewall rules, database deletion protection |
| High | Service-impacting changes, compliance violations | < 30 minutes | Instance type changes, network config, encryption settings |
| Medium | Non-critical config changes, performance impact | < 4 hours | Labels, tags, non-critical metadata |
| Low | Informational, tracking purposes | Next business day | Read-only attribute changes |
Configure in drift rules:
drift_rules:
  # Critical: Zero tolerance
  - name: "IAM Permission Changes"
    resource_types:
      - "google_project_iam_*"
      - "google_service_account_iam_*"
    watched_attributes:
      - "role"
      - "members"
    severity: "critical"
    alert_immediately: true
  # High: Important but not emergency
  - name: "Database Configuration"
    resource_types:
      - "google_sql_database_instance"
    watched_attributes:
      - "settings"
      - "database_version"
    severity: "high"
    alert_immediately: true
  # Medium: Track and review
  - name: "Instance Metadata"
    resource_types:
      - "google_compute_instance"
    watched_attributes:
      - "metadata"
      - "labels"
    severity: "medium"
    alert_immediately: false
    batch_alerts: true
    batch_interval: "15m"
  # Low: Informational
  - name: "Resource Tags"
    resource_types:
      - "google_compute_*"
    watched_attributes:
      - "tags"
    severity: "low"
    alert_immediately: false
    batch_alerts: true
    batch_interval: "1h"
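Batching trades latency for noise reduction: events that fall into the same interval are delivered as one notification. How TFDrift-Falco implements `batch_interval` is not described here; the sketch below shows one common approach, fixed windows aligned to the Unix epoch, with hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_interval(text):
    """Parse strings such as '15m' or '1h' into a timedelta."""
    return timedelta(**{_UNITS[text[-1]]: int(text[:-1])})


def batch_key(timestamp, interval):
    """Return the index of the fixed-size window that contains timestamp."""
    return int(timestamp.timestamp()) // int(interval.total_seconds())


interval = parse_interval("15m")
first = datetime(2025, 12, 17, 10, 31, tzinfo=timezone.utc)
second = datetime(2025, 12, 17, 10, 44, tzinfo=timezone.utc)
third = datetime(2025, 12, 17, 10, 46, tzinfo=timezone.utc)

# Events with the same key would be grouped into one notification.
print(batch_key(first, interval) == batch_key(second, interval))
print(batch_key(first, interval) == batch_key(third, interval))
```

With a 15-minute interval, the 10:31 and 10:44 events share a window, while the 10:46 event starts the next one.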
Monitoring TFDrift-Falco Itself¶
Health Check Endpoint:
# Query the health check endpoint
curl http://localhost:8080/health
# Expected response:
{
"status": "healthy",
"version": "v0.5.0",
"uptime_seconds": 3600,
"last_event_received": "2025-12-17T10:30:45Z",
"falco_connection": "connected",
"state_backend": "gcs",
"projects_monitored": 2
}
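An HTTP 200 alone does not show that events are still flowing, so a monitor can also inspect the response body. The sketch below works on the fields shown in the example response; the 30-minute staleness threshold and the helper name `evaluate_health` are illustrative assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

SAMPLE_RESPONSE = """
{
  "status": "healthy",
  "version": "v0.5.0",
  "uptime_seconds": 3600,
  "last_event_received": "2025-12-17T10:30:45Z",
  "falco_connection": "connected",
  "state_backend": "gcs",
  "projects_monitored": 2
}
"""


def evaluate_health(payload, now, max_event_age=timedelta(minutes=30)):
    """Return a list of human-readable problems found in a health payload."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")
    if payload.get("falco_connection") != "connected":
        problems.append("Falco connection is down")
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11, so normalise it.
    last_event = datetime.fromisoformat(
        payload["last_event_received"].replace("Z", "+00:00")
    )
    if now - last_event > max_event_age:
        problems.append(f"no events received since {last_event.isoformat()}")
    return problems


payload = json.loads(SAMPLE_RESPONSE)
print(evaluate_health(payload, datetime(2025, 12, 17, 10, 40, tzinfo=timezone.utc)))
print(evaluate_health(payload, datetime(2025, 12, 17, 12, 0, tzinfo=timezone.utc)))
```

A quiet project can legitimately go a long time without audit events, so the threshold should be tuned to the expected event rate rather than copied as-is.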
Monitoring Script:
#!/bin/bash
# monitor-tfdrift.sh
HEALTH_URL="http://localhost:8080/health"
ALERT_WEBHOOK="https://monitoring.company.com/alert"
while true; do
  # curl reports status 000 when the connection itself fails or times out
  RESPONSE=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$HEALTH_URL")
  if [ "$RESPONSE" != "200" ]; then
    # Alert: TFDrift is down
    curl -s -X POST "$ALERT_WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d '{
        "severity": "critical",
        "service": "tfdrift-falco",
        "message": "TFDrift health check failed",
        "status_code": "'"$RESPONSE"'"
      }'
  fi
  sleep 60
done
Log Monitoring:
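# Monitor for errors in logs
# ALERT_WEBHOOK must be set, for example to the URL used in monitor-tfdrift.sh
# --line-buffered makes grep emit each match immediately instead of buffering it in the pipe
journalctl -u tfdrift-falco -f | grep --line-buffered -E "(ERROR|FATAL)" | while IFS= read -r line; do
  # Alert on errors
  curl -s -X POST "$ALERT_WEBHOOK" \
    -d "TFDrift Error: $line"
done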
4. Operational Excellence¶
Change Management¶
Terraform Change Workflow:
1. Developer creates Terraform change
↓
2. CI/CD runs terraform plan
↓
3. Pull request review
↓
4. Approved → terraform apply
↓
5. TFDrift-Falco monitors for manual changes
↓
6. Alert if drift detected within 1 hour
Handling Drift Alerts:
- Immediate Response (< 5 minutes):
  - Acknowledge alert
  - Check if change was authorized
  - If unauthorized, investigate and remediate
- Investigation:
  - Who made the change? (check principalEmail)
  - What was changed? (review diff)
  - Why was manual change made? (was it emergency?)
- Remediation Options:
  - Option A: Revert manual change
  - Option B: Update Terraform state
  - Option C: Accept drift temporarily
- Post-Incident:
  - Document incident
  - Update runbooks
  - Review access controls
  - Consider automation improvements
Incident Response Template:
# Drift Alert Incident Report
**Date:** 2025-12-17
**Severity:** Critical
**Resource:** google_compute_firewall.prod_allow_ssh
**Change Detected:** source_ranges modified
## Timeline
- 10:30:00 - Manual change made via Console
- 10:30:45 - TFDrift alert fired
- 10:31:00 - On-call engineer acknowledged
- 10:35:00 - Change identified as unauthorized
- 10:40:00 - Terraform reapplied, change reverted
- 10:45:00 - Incident resolved
## Root Cause
Engineer made emergency change via Console without following change management process.
## Resolution
1. Reverted unauthorized firewall rule change
2. Reminded team of change management policy
3. Updated documentation
## Action Items
- [ ] Add pre-commit hook to validate Terraform
- [ ] Send change management reminder to team
- [ ] Review firewall rule IAM permissions
Documentation Standards¶
Maintain comprehensive documentation:
docs/
├── runbooks/
│   ├── drift-response.md         # How to respond to drift alerts
│   ├── emergency-procedures.md   # Emergency access procedures
│   └── escalation.md             # Escalation paths
├── architecture/
│   ├── infrastructure-overview.md
│   ├── network-topology.md
│   └── security-controls.md
└── operations/
    ├── deployment-process.md
    ├── monitoring-guide.md
    └── troubleshooting.md
Runbook Example:
# Drift Alert Response Runbook
## Critical IAM Change Alert
**Alert:** `google_project_iam_member` drift detected
### Step 1: Immediate Assessment (< 2 minutes)
1. Check alert details:
- Who made the change? (principalEmail)
- What role was granted/revoked?
- Which project?
2. Verify if authorized:
- Check change management tickets
- Contact user if possible
- Check recent approvals
### Step 2: Containment (< 5 minutes)
If unauthorized:
```bash
# Revoke unauthorized permission immediately
gcloud projects remove-iam-policy-binding PROJECT_ID \
  --member="user:suspicious@example.com" \
  --role="roles/editor"
```
### Step 3: Investigation (< 30 minutes)
- Review GCP Audit Logs
- Check for related changes
- Interview user if needed
- Document findings
### Step 4: Remediation
Choose appropriate action:
- Revert change via Terraform
- Update Terraform to match (if authorized)
- Escalate to security team
### Step 5: Follow-up
- Update incident tracker
- Notify stakeholders
- Schedule post-mortem if needed
5. Performance & Scalability¶
Event Filtering¶
Filter events at multiple levels:
Level 1: GCP Log Sink (earliest, most efficient)
# Only forward compute and IAM events
gcloud logging sinks update tfdrift-sink \
  --log-filter='
    (protoPayload.serviceName="compute.googleapis.com" OR
     protoPayload.serviceName="iam.googleapis.com")
    AND protoPayload.methodName!~"\.get$|\.list$"
  '
Level 2: Falco Rules (plugin level)
# falco-rules.yaml
- rule: Relevant GCP Changes
  condition: >
    gcp.serviceName in (compute.googleapis.com, iam.googleapis.com, sqladmin.googleapis.com) and
    gcp.methodName matches "(insert|update|patch|delete|set.*)" and
    not gcp.methodName matches "(list|get|describe)"
  output: Relevant GCP change detected
  priority: NOTICE
Level 3: TFDrift Configuration (application level)
# config.yaml
filtering:
  # Ignore read-only operations
  exclude_events:
    - "*.list"
    - "*.get"
    - "*.describe"
    - "*.testIamPermissions"
  # Focus on write operations
  include_methods:
    - "*.insert"
    - "*.update"
    - "*.patch"
    - "*.delete"
    - "*.set*"
  # Only monitor specific services
  include_services:
    - "compute.googleapis.com"
    - "container.googleapis.com"
    - "sqladmin.googleapis.com"
    - "iam.googleapis.com"
  # Ignore automated service accounts
  exclude_principals:
    - "serviceAccount:*@cloudbuild.gserviceaccount.com"
    - "serviceAccount:*@cloudservices.gserviceaccount.com"
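To sanity-check which method names the include and exclude patterns above would let through, the same globs can be tried in a shell `case` statement. This is only a sketch of glob semantics: `is_monitored_method` is a hypothetical helper, checking exclusions first is an assumption, and TFDrift-Falco's own matcher may behave differently.

```bash
# is_monitored_method METHOD
# Returns 0 when METHOD passes the patterns, 1 otherwise.
# Assumption: exclusions win over inclusions.
is_monitored_method() {
  case "$1" in
    *.list|*.get|*.describe|*.testIamPermissions) return 1 ;;
  esac
  case "$1" in
    *.insert|*.update|*.patch|*.delete|*.set*) return 0 ;;
  esac
  return 1
}

# Usage (not executed here):
#   is_monitored_method "v1.compute.firewalls.patch" && echo "monitored"
```

Shell globs are case-sensitive and `*.set*` requires a literal dot, so a method logged as `SetIamPolicy` would not match here. It is worth confirming how the real matcher treats such names.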
Resource Limits¶
Set appropriate limits:
# config.yaml
performance:
  # Maximum events to process per second
  max_events_per_second: 100
  # Maximum concurrent state file loads
  max_concurrent_state_loads: 5
  # Event queue size
  event_queue_size: 10000
  # Worker pool size
  worker_threads: 10
  # Timeout for state backend operations
  state_backend_timeout: "30s"
  # Memory limits
  max_memory_mb: 2048
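The queue size and processing rate together bound how long a burst can be absorbed. The arithmetic below is a rough sketch that assumes events above `max_events_per_second` are queued rather than dropped, which the configuration above does not state. `queue_headroom_seconds` is a hypothetical helper.

```bash
# queue_headroom_seconds QUEUE_SIZE INCOMING_PER_SEC PROCESSED_PER_SEC
# Prints how many whole seconds a burst can last before the queue is full,
# or "unbounded" when processing keeps up with the incoming rate.
queue_headroom_seconds() {
  qhs_backlog_rate=$(( $2 - $3 ))
  if [ "$qhs_backlog_rate" -le 0 ]; then
    echo "unbounded"
  else
    echo $(( $1 / qhs_backlog_rate ))
  fi
}

# Example: event_queue_size=10000, max_events_per_second=100, burst of 600/s.
# The backlog grows by 500 events per second, so the queue fills in 20 seconds.
queue_headroom_seconds 10000 600 100
```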
Docker resource limits:
# docker-compose.yml
services:
  tfdrift-falco:
    image: ghcr.io/higakikeita/tfdrift-falco:v0.5.0
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
Load Balancing¶
For high-volume environments, run multiple instances:
# docker-compose.yml
services:
  tfdrift-falco-1:
    image: ghcr.io/higakikeita/tfdrift-falco:v0.5.0
    environment:
      - INSTANCE_ID=1
      - PROJECT_FILTER=project-a,project-b
    configs:
      - config-instance-1.yaml
  tfdrift-falco-2:
    image: ghcr.io/higakikeita/tfdrift-falco:v0.5.0
    environment:
      - INSTANCE_ID=2
      - PROJECT_FILTER=project-c,project-d
    configs:
      - config-instance-2.yaml
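When projects are split across instances like this, every project should be owned by exactly one instance. The sketch below looks up the owner of a project from the two `PROJECT_FILTER` values above; the variable and function names are hypothetical and the lists are copied by hand, not read from the compose file.

```bash
# Assumed mapping, copied from the PROJECT_FILTER values above.
INSTANCE_1_PROJECTS="project-a,project-b"
INSTANCE_2_PROJECTS="project-c,project-d"

# in_csv NEEDLE CSV
# Returns 0 when NEEDLE is exactly one of the comma-separated items in CSV.
in_csv() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# instance_for_project PROJECT
# Prints the ID of the instance that owns PROJECT, or "none".
instance_for_project() {
  if in_csv "$1" "$INSTANCE_1_PROJECTS"; then
    echo 1
  elif in_csv "$1" "$INSTANCE_2_PROJECTS"; then
    echo 2
  else
    echo none
  fi
}
```

A result of `none` for a project that should be monitored means its events are not being checked by any instance.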
6. Security¶
IAM Best Practices¶
Principle of Least Privilege:
# ✅ Good: Minimal permissions
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer" # Read-only for state
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber" # Only subscriber, not admin
# ❌ Bad: Overly broad permissions
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/editor" # Too broad!
Custom Role for TFDrift:
# Create custom role with minimal permissions
gcloud iam roles create tfdriftFalcoRole \
--project=$PROJECT_ID \
--title="TFDrift Falco Custom Role" \
--description="Minimal permissions for TFDrift-Falco" \
--permissions="\
storage.objects.get,\
storage.objects.list,\
pubsub.subscriptions.consume,\
pubsub.subscriptions.get,\
logging.logEntries.list" \
--stage=GA
# Assign custom role
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:tfdrift-falco@$PROJECT_ID.iam.gserviceaccount.com" \
--role="projects/$PROJECT_ID/roles/tfdriftFalcoRole"
Network Security¶
Restrict Falco gRPC access:
# Only allow TFDrift host to connect to Falco
# --source-ranges is the TFDrift-Falco host IP
# (a comment cannot follow a line-continuation backslash, so it lives here)
gcloud compute firewall-rules create allow-tfdrift-to-falco \
  --network=prod-network \
  --allow=tcp:5060 \
  --source-ranges=10.0.1.10/32 \
  --target-tags=falco-server \
  --description="Allow TFDrift to connect to Falco gRPC"
Use TLS for gRPC:
# config.yaml
falco:
  enabled: true
  hostname: "falco.internal.company.com"
  port: 5060
  tls:
    enabled: true
    ca_cert: "/etc/tfdrift/certs/ca.crt"
    client_cert: "/etc/tfdrift/certs/client.crt"
    client_key: "/etc/tfdrift/certs/client.key"
    verify_server: true
Audit Logging¶
Enable audit logs for TFDrift itself:
# config.yaml
audit:
  enabled: true
  log_file: "/var/log/tfdrift/audit.log"
  log_format: "json"
  log_events:
    - "drift_detected"
    - "state_loaded"
    - "notification_sent"
    - "config_changed"
    - "startup"
    - "shutdown"
Sample audit log entry:
{
"timestamp": "2025-12-17T10:30:45Z",
"event_type": "drift_detected",
"severity": "critical",
"user": "user@example.com",
"resource": "google_compute_firewall.prod_allow_ssh",
"project": "company-prod-123",
"changes": {
"attribute": "source_ranges",
"old_value": ["10.0.0.0/8"],
"new_value": ["0.0.0.0/0"]
},
"action_taken": "alert_sent",
"alert_channels": ["slack", "pagerduty"]
}
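Once audit logging is on, a quick count of events by type helps spot unusual activity, such as a spike in `drift_detected`. The sketch below counts matching lines with `grep`. `count_audit_events` is a hypothetical helper, and it assumes each entry's `event_type` key sits on a single line, as in the sample above.

```bash
# count_audit_events LOG_FILE EVENT_TYPE
# Prints the number of lines in LOG_FILE that carry the given event_type.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_audit_events() {
  grep -c "\"event_type\": *\"$2\"" "$1" || true
}

# Usage (not executed here):
#   count_audit_events /var/log/tfdrift/audit.log drift_detected
```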
7. Cost Optimization¶
Log Retention Policies¶
Set appropriate retention:
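# Retention is a property of log buckets, not sinks, so set it on the bucket
# that stores the logs (the _Default bucket is used here as an example).
# Production: 90 days (run against the production project)
gcloud logging buckets update _Default \
  --location=global \
  --retention-days=90
# Development: 7 days (run against the development project)
gcloud logging buckets update _Default \
  --location=global \
  --retention-days=7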
Cost comparison:
| Retention | Daily Logs (GB) | Monthly Cost (est.) |
|-----------|-----------------|---------------------|
| 7 days    | 10 GB           | $5                  |
| 30 days   | 10 GB           | $20                 |
| 90 days   | 10 GB           | $60                 |
| 365 days  | 10 GB           | $240                |
Storage Optimization¶
Enable lifecycle policies on state bucket:
# Delete old state versions after 30 days
cat > lifecycle.json <<EOF
{
"lifecycle": {
"rule": [
{
"action": {
"type": "Delete"
},
"condition": {
"age": 30,
"numNewerVersions": 5
}
}
]
}
}
EOF
gsutil lifecycle set lifecycle.json gs://company-terraform-state
Resource Efficiency¶
Right-size Falco deployment:
# Small environment (< 10 projects, < 1000 events/day)
falco:
  resources:
    cpu: "500m"
    memory: "512Mi"
# Medium environment (10-50 projects, 1000-10000 events/day)
falco:
  resources:
    cpu: "1000m"
    memory: "1Gi"
# Large environment (50+ projects, 10000+ events/day)
falco:
  resources:
    cpu: "2000m"
    memory: "2Gi"
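The three tiers above can be turned into a small lookup. This is a sketch under one assumption the guide does not spell out: when project count and event volume point to different tiers, the larger tier wins. Boundary values (exactly 50 projects or 10000 events/day) are treated as medium here. `falco_tier` is a hypothetical helper.

```bash
# falco_tier PROJECTS EVENTS_PER_DAY
# Prints "small", "medium", or "large" following the sizing comments above.
falco_tier() {
  if [ "$1" -gt 50 ] || [ "$2" -gt 10000 ]; then
    echo large
  elif [ "$1" -ge 10 ] || [ "$2" -ge 1000 ]; then
    echo medium
  else
    echo small
  fi
}

# Usage (not executed here):
#   falco_tier 20 500
```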
8. Disaster Recovery¶
Backup Strategy¶
Backup TFDrift configuration:
#!/bin/bash
# backup-tfdrift-config.sh
# Stop on the first failed copy so a partial backup is not reported as complete
set -euo pipefail
BACKUP_BUCKET="gs://company-backups/tfdrift"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Backup configuration files
gsutil cp -r /etc/tfdrift/config.yaml \
  "${BACKUP_BUCKET}/config_${TIMESTAMP}.yaml"
# Backup service account keys
gsutil cp -r /etc/tfdrift/keys/*.json \
  "${BACKUP_BUCKET}/keys/${TIMESTAMP}/"
# Backup Falco configuration
gsutil cp -r /etc/falco/*.yaml \
  "${BACKUP_BUCKET}/falco/${TIMESTAMP}/"
echo "✓ Backup completed in ${BACKUP_BUCKET} with timestamp ${TIMESTAMP}"
Recovery Procedures¶
Quick recovery script:
#!/bin/bash
# recover-tfdrift.sh
# Stop on the first failed restore so services are not restarted with partial config
set -euo pipefail
BACKUP_BUCKET="gs://company-backups/tfdrift"
# Fail with a usage message when no backup timestamp is given
BACKUP_DATE="${1:?Usage: recover-tfdrift.sh BACKUP_DATE (format: 20251217_103000)}"
echo "==> Recovering TFDrift from backup: $BACKUP_DATE"
# Restore configuration
gsutil cp "${BACKUP_BUCKET}/config_${BACKUP_DATE}.yaml" \
  /etc/tfdrift/config.yaml
# Restore keys
gsutil cp -r "${BACKUP_BUCKET}/keys/${BACKUP_DATE}/*" \
  /etc/tfdrift/keys/
# Restore Falco config
gsutil cp -r "${BACKUP_BUCKET}/falco/${BACKUP_DATE}/*" \
  /etc/falco/
# Restart services
docker-compose restart falco
docker-compose restart tfdrift-falco
echo "✓ Recovery completed"
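The recovery script needs a backup timestamp. Because the `YYYYMMDD_HHMMSS` format sorts lexicographically in chronological order, the most recent backup can be picked with a plain sort. `latest_backup` is a hypothetical helper, and the listing command in the comment is an assumption about how the bucket is laid out.

```bash
# latest_backup
# Reads candidate timestamps (YYYYMMDD_HHMMSS), one per line, on stdin and
# prints the most recent. Lines that do not match the format are ignored.
latest_backup() {
  grep -E '^[0-9]{8}_[0-9]{6}$' | sort | tail -n 1
}

# Usage (not executed here; assumes keys/<timestamp>/ prefixes exist):
#   gsutil ls "${BACKUP_BUCKET}/keys/" | awk -F/ '{print $(NF-1)}' | latest_backup
```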
Performance Tuning¶
Falco Configuration¶
# falco.yaml
grpc:
  threadiness: 16 # Increase for high volume
plugins:
  - name: gcpaudit
    init_config:
      # Batch size for Pub/Sub pulls
      max_messages: 100
      # Subscription timeout
      timeout: 60s
TFDrift-Falco Configuration¶
# Filter irrelevant events at TFDrift level
drift_rules:
  - name: "High Priority Only"
    resource_types:
      - "google_compute_firewall"
      - "google_project_iam_binding"
    watched_attributes:
      - "*"
    severity: "critical"
Security Best Practices¶
- Least Privilege: Grant minimal IAM roles
- Service Account Keys: Rotate regularly (every 90 days)
- Network Security: Restrict Falco gRPC access
- Audit Logs Retention: Configure log retention policies
- Encryption: Enable encryption at rest for GCS state bucket
# Enable customer-managed encryption (sets the bucket's default Cloud KMS key)
gsutil kms encryption \
  -k projects/$PROJECT_ID/locations/global/keyRings/tfdrift/cryptoKeys/state \
  gs://tfdrift-terraform-state
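The 90-day key rotation guideline in the list above can be checked with simple date arithmetic. The sketch below works on Unix timestamps so it stays portable. `key_age_days` and `needs_rotation` are hypothetical helpers, and converting a key's creation time to epoch seconds is left to the caller.

```bash
ROTATION_DAYS=90   # from the guideline above

# key_age_days CREATED_EPOCH NOW_EPOCH
# Prints the number of whole days between two Unix timestamps.
key_age_days() {
  echo $(( ($2 - $1) / 86400 ))
}

# needs_rotation CREATED_EPOCH NOW_EPOCH
# Returns 0 when the key is at least ROTATION_DAYS old, 1 otherwise.
needs_rotation() {
  [ "$(key_age_days "$1" "$2")" -ge "$ROTATION_DAYS" ]
}

# Usage (not executed here):
#   needs_rotation "$created_epoch" "$(date +%s)" && echo "rotate this key"
```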
Next Steps¶
Support¶
- Issues: https://github.com/higakikeita/tfdrift-falco/issues
- Discussions: https://github.com/higakikeita/tfdrift-falco/discussions
- Documentation: https://tfdrift-falco.readthedocs.io
Last Updated: 2025-12-17
TFDrift-Falco Version: v0.5.0+