TFDrift-Falco Best Practices
This document describes best practices for running TFDrift-Falco in production.
📋 Table of Contents
- Production Deployment
- Security
- Operational Excellence
- Configuration Management
- Monitoring & Observability
- Performance Tuning
- Troubleshooting
Production Deployment
High Availability Setup
Recommended setup: Active-Passive with Health Checks
# Kubernetes Deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tfdrift-falco
spec:
  replicas: 2 # active-passive pair
  strategy:
    type: Recreate # stops old pods before starting new ones, so two versions never run side by side
  selector:
    matchLabels:
      app: tfdrift-falco
  template:
    metadata:
      labels:
        app: tfdrift-falco
    spec:
      containers:
        - name: tfdrift-falco
          image: ghcr.io/higakikeita/tfdrift-falco:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            exec:
              command: ["/bin/sh", "-c", "pgrep -f tfdrift"]
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["/bin/sh", "-c", "nc -z localhost 5060"]
            initialDelaySeconds: 10
            periodSeconds: 5
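Why Active-Passive?
- TFDrift-Falco is a stateful subscriber that connects to the Falco gRPC stream.
- If multiple instances process events at the same time, duplicate notifications are sent.
- The Kubernetes leader election pattern is recommended (planned for a future release). Until it is available, keep in mind that a Deployment with replicas: 2 starts both pods at once.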
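The leader election idea can be illustrated with a lease: only the instance holding an unexpired lease processes events, and a standby takes over once the lease lapses. The following is a minimal in-memory sketch, not TFDrift-Falco code. The `Lease` class, the `try_acquire` function, and the 15-second TTL are hypothetical, and a real deployment would keep the lease in a shared store rather than in process memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in for a shared lease record (hypothetical)."""
    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire(lease: Lease, candidate: str, now: float, ttl: float = 15.0) -> bool:
    """Return True if `candidate` is the leader after this call."""
    is_free = lease.holder is None
    is_expired = now >= lease.expires_at
    if lease.holder == candidate or is_free or is_expired:
        # Renew our own lease, or take over a free or expired one.
        lease.holder = candidate
        lease.expires_at = now + ttl
        return True
    # Another instance holds a valid lease: stay passive.
    return False
```

The active pod would call `try_acquire` on a timer shorter than the TTL, and the passive pod would keep calling it without processing events until it returns True.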
Multi-Region Deployment
Scenario: monitoring us-east-1 and ap-northeast-1
# Region 1: us-east-1
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-us-east-1"
      s3_key: "prod/terraform.tfstate"
falco:
  hostname: "falco-us-east-1.internal"
  port: 5060
notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/..."
    channel: "#drift-us-east-1"

# Region 2: ap-northeast-1
providers:
  aws:
    enabled: true
    regions:
      - ap-northeast-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-ap-northeast-1"
      s3_key: "prod/terraform.tfstate"
falco:
  hostname: "falco-ap-northeast-1.internal"
  port: 5060
notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/..."
    channel: "#drift-ap-northeast-1"
Key points:
- Run an independent TFDrift-Falco instance in each region
- Process CloudTrail logs per region as well
- Keep Terraform state separate per region
Resource Sizing
Minimum (small workload, <50 CloudTrail events/min):
Recommended (medium workload, 50-500 events/min):
Large (large workload, >500 events/min):
Security
IAM Permissions (Principle of Least Privilege)
Example least-privilege IAM policy (AWS):
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "TerraformStateReadOnly",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-terraform-state",
"arn:aws:s3:::my-terraform-state/*"
]
},
{
"Sid": "KMSDecryptForState",
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:DescribeKey"
],
"Resource": "arn:aws:kms:us-east-1:123456789012:key/abc-123-def-456"
}
]
}
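One way to keep a policy like this from growing broader over time is to check it automatically before it is applied. The sketch below assumes that the four read-only actions shown in the policy are the complete allow-list. `READ_ONLY_ACTIONS` and `unexpected_actions` are hypothetical names and are not part of TFDrift-Falco or any AWS SDK.

```python
import json
from typing import List

# Assumption: the only actions the policy above is meant to grant.
READ_ONLY_ACTIONS = {"s3:GetObject", "s3:ListBucket", "kms:Decrypt", "kms:DescribeKey"}


def unexpected_actions(policy_json: str) -> List[str]:
    """Return allowed actions that fall outside the read-only allow-list."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    found = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single action string
            actions = [actions]
        found.update(a for a in actions if a not in READ_ONLY_ACTIONS)
    return sorted(found)
```

An empty result means the policy grants nothing beyond the allow-list; a wildcard such as `s3:*` is reported because it is not an exact match.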
Example GCP service account permissions:
# Read-only access to Terraform state
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:tfdrift@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
# Read Cloud Audit Logs (via Pub/Sub)
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:tfdrift@my-project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"
Network Security
Use mTLS for the Falco gRPC connection:
falco:
  enabled: true
  hostname: "falco.secure.internal"
  port: 5060
  cert_file: "/etc/tfdrift/certs/client.crt"
  key_file: "/etc/tfdrift/certs/client.key"
  ca_root_file: "/etc/tfdrift/certs/ca.crt"
Kubernetes NetworkPolicy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tfdrift-falco-network-policy
spec:
  podSelector:
    matchLabels:
      app: tfdrift-falco
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: prometheus # metrics scraping
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: falco # connection to Falco gRPC
      ports:
        - protocol: TCP
          port: 5060
    - to: # Slack/webhook notifications; ipBlock is needed because namespaceSelector only matches in-cluster pods
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # With egress restricted, DNS (port 53) usually needs its own rule so hostnames resolve
Secrets Management
❌ Bad: plaintext in the configuration file:
notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX"
✅ Good: use Kubernetes Secrets:
# Create the Kubernetes Secret
kubectl create secret generic tfdrift-secrets \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/..."
# Deployment
env:
  - name: SLACK_WEBHOOK_URL
    valueFrom:
      secretKeyRef:
        name: tfdrift-secrets
        key: slack-webhook-url
✅ Better: AWS Secrets Manager / GCP Secret Manager:
notifications:
  slack:
    webhook_url_from_aws_secret: "prod/tfdrift/slack-webhook"
    # or
    webhook_url_from_gcp_secret: "projects/123/secrets/tfdrift-slack-webhook"
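When the webhook URL arrives through an environment variable such as `SLACK_WEBHOOK_URL` from the Kubernetes Secret example, the consuming process should fail fast if it is missing and should never log it in full. The sketch below shows that pattern in general terms. `load_webhook_url` and `redact` are hypothetical helpers, not TFDrift-Falco functions.

```python
import os
from urllib.parse import urlparse


def load_webhook_url(env_var: str = "SLACK_WEBHOOK_URL") -> str:
    """Read the webhook URL from the environment and validate its shape."""
    url = os.environ.get(env_var, "").strip()
    if not url:
        raise RuntimeError(f"{env_var} is not set")
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise RuntimeError(f"{env_var} must be an https URL")
    return url


def redact(url: str) -> str:
    """Keep only scheme and host so the secret path never reaches the logs."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}/<redacted>"
```

Logging `redact(url)` instead of `url` keeps the token portion of the webhook out of log aggregation systems.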
Operational Excellence
Log Retention and Rotation
Recommended: forward structured (JSON) logs to an external logging system.
Fluent Bit integration example:
# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush 5
        Daemon Off
        Log_Level info
    [INPUT]
        Name tail
        Path /var/log/tfdrift/*.log
        Parser json
        Tag tfdrift.*
        Refresh_Interval 5
    [OUTPUT]
        Name es
        Match *
        Host elasticsearch.logging.svc.cluster.local
        Port 9200
        Index tfdrift-logs
Backup and Disaster Recovery
Terraform State Backup:
#!/bin/bash
# backup-terraform-state.sh: periodic Terraform state snapshot script
set -euo pipefail
DATE=$(date +%Y%m%d-%H%M%S)
BUCKET="my-terraform-state"
KEY="prod/terraform.tfstate"
BACKUP_BUCKET="my-terraform-state-backup"
# Download from S3
aws s3 cp "s3://${BUCKET}/${KEY}" /tmp/terraform.tfstate
# Upload to the backup bucket (versioning enabled)
aws s3 cp /tmp/terraform.tfstate "s3://${BACKUP_BUCKET}/${KEY}.${DATE}"
# Delete backups older than 90 days (date -d requires GNU date)
olderThan=$(date --date="90 days ago" +%s)
# --recursive lists full object keys, including those under the prod/ prefix
aws s3 ls "s3://${BACKUP_BUCKET}/" --recursive | while read -r day time _size fileName; do
  createDate=$(date -d "${day} ${time}" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    aws s3 rm "s3://${BACKUP_BUCKET}/${fileName}"
  fi
done
Cron schedule (run daily at 3 AM): use a crontab entry with the schedule 0 3 * * *.
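The 90-day retention rule can also be expressed as a pure function, which makes it easy to test before anything is deleted. The sketch below is an alternative to the shell loop and differs from it in one respect: it reads the timestamp suffix that the script appends to each key rather than the S3 last-modified time. `expired_backups` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

STAMP_FORMAT = "%Y%m%d-%H%M%S"  # same layout as `date +%Y%m%d-%H%M%S` in the script


def expired_backups(keys: Iterable[str], now: datetime, max_age_days: int = 90) -> List[str]:
    """Return backup keys whose timestamp suffix is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for key in keys:
        # Backups are named <KEY>.<timestamp>; take the part after the last dot.
        _, _, stamp = key.rpartition(".")
        try:
            taken_at = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # not produced by the backup script; leave it alone
        if taken_at < cutoff:
            expired.append(key)
    return expired
```

Keys that do not carry a recognizable timestamp are skipped, so unrelated objects in the backup bucket are never selected for deletion.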
Upgrade Procedures
Zero-Downtime Upgrade (Kubernetes):
# 1. Deploy the new version to the staging environment
kubectl set image deployment/tfdrift-falco \
  tfdrift-falco=ghcr.io/higakikeita/tfdrift-falco:v0.6.0 \
  -n staging
# 2. Verify behavior in staging
kubectl logs -f deployment/tfdrift-falco -n staging
# 3. Roll out the update to production
kubectl set image deployment/tfdrift-falco \
  tfdrift-falco=ghcr.io/higakikeita/tfdrift-falco:v0.6.0 \
  -n production
# 4. Watch the rollout status
kubectl rollout status deployment/tfdrift-falco -n production
# 5. Roll back if there is a problem
kubectl rollout undo deployment/tfdrift-falco -n production
Configuration Management
Drift Rule Design Patterns
Pattern 1: Critical Resources Only
drift_rules:
  # IAM (most critical)
  - name: "IAM Critical Changes"
    resource_types:
      - "aws_iam_role"
      - "aws_iam_policy"
      - "aws_iam_user"
    watched_attributes:
      - "policy"
      - "assume_role_policy"
      - "inline_policy"
    severity: "critical"
  # Security groups (important)
  - name: "Security Group Changes"
    resource_types:
      - "aws_security_group"
      - "aws_security_group_rule"
    watched_attributes:
      - "ingress"
      - "egress"
      - "cidr_blocks"
    severity: "critical"
Pattern 2: Environment-Specific Rules
drift_rules:
  # Production: detect every change
  - name: "Production - All Changes"
    resource_types:
      - "*" # all resources
    environment: "production"
    severity: "high"
  # Staging: critical resources only
  - name: "Staging - Critical Only"
    resource_types:
      - "aws_iam_*"
      - "aws_security_group*"
      - "aws_kms_*"
    environment: "staging"
    severity: "medium"
  # Development: IAM only
  - name: "Dev - IAM Only"
    resource_types:
      - "aws_iam_*"
    environment: "development"
    severity: "low"
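To see how wildcard patterns such as `aws_iam_*` select resource types per environment, the sketch below evaluates the Pattern 2 rules with shell-style glob matching. That TFDrift-Falco matches patterns this way is an assumption made for illustration. `RULES` mirrors the configuration above and `match_rule` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

# Mirrors the Pattern 2 configuration above.
RULES: List[Dict] = [
    {"name": "Production - All Changes", "resource_types": ["*"],
     "environment": "production", "severity": "high"},
    {"name": "Staging - Critical Only",
     "resource_types": ["aws_iam_*", "aws_security_group*", "aws_kms_*"],
     "environment": "staging", "severity": "medium"},
    {"name": "Dev - IAM Only", "resource_types": ["aws_iam_*"],
     "environment": "development", "severity": "low"},
]


def match_rule(resource_type: str, environment: str,
               rules: List[Dict] = RULES) -> Optional[Dict]:
    """Return the first rule for `environment` whose patterns match, else None."""
    for rule in rules:
        if rule["environment"] != environment:
            continue
        # fnmatchcase is case-sensitive glob matching (assumed semantics).
        if any(fnmatchcase(resource_type, p) for p in rule["resource_types"]):
            return rule
    return None
```

Under these assumptions, a change to `aws_instance` raises a high-severity event in production but matches no rule in staging or development.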
Multi-Account Strategy
Example account layout:
- Production Account (123456789012)
- Staging Account (234567890123)
- Development Account (345678901234)
Recommended deployment: a separate TFDrift-Falco instance in each account
# production-config.yaml
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-production"
      s3_key: "terraform.tfstate"
notifications:
  slack:
    webhook_url_from_env: "SLACK_WEBHOOK_PROD"
    channel: "#security-alerts-prod"

# staging-config.yaml
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-staging"
      s3_key: "terraform.tfstate"
notifications:
  slack:
    webhook_url_from_env: "SLACK_WEBHOOK_STAGING"
    channel: "#security-alerts-staging"
Terraform State Backend Setup
S3 backend (recommended settings):
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
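encrypt = true # encryption is required
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abc-123"
dynamodb_table = "terraform-state-lock" # state locking
# Versioning is required; enable it on the bucket (aws_s3_bucket_versioning), because the s3 backend has no versioning argument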
}
}
S3 bucket configuration:
resource "aws_s3_bucket" "terraform_state" {
bucket = "my-terraform-state"
lifecycle {
prevent_destroy = true # guards against accidental deletion
}
tags = {
Name = "Terraform State"
Environment = "Production"
ManagedBy = "Terraform"
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled" # enables versioning
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.terraform_state.arn
}
}
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
Monitoring & Observability
Prometheus Metrics
TFDrift-Falco exposes the following metrics at the /metrics endpoint:
# Number of drift events detected
tfdrift_events_total{severity="critical"} 5
tfdrift_events_total{severity="high"} 23
# Events by resource type
tfdrift_events_by_type{type="aws_instance"} 12
tfdrift_events_by_type{type="aws_iam_role"} 8
# Detection latency (seconds)
tfdrift_detection_latency_seconds{quantile="0.5"} 0.8
tfdrift_detection_latency_seconds{quantile="0.95"} 2.3
tfdrift_detection_latency_seconds{quantile="0.99"} 5.1
# Falco connection state
tfdrift_falco_connected 1
# Time of the last Terraform state sync (Unix timestamp)
tfdrift_state_last_sync_timestamp 1705312345
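For a quick check without a full Prometheus setup, the exposition text can be parsed directly. The sketch below is a deliberately small parser that handles simple lines like the ones above, not the complete exposition format (for example, label values containing a closing brace are not supported). `parse_metrics` and `health_problems` are hypothetical helpers, and the 1800-second staleness threshold is an assumption.

```python
import re
from typing import Dict, List

# name, optional {labels}, value, optional integer timestamp
SAMPLE_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_metrics(text: str) -> Dict[str, float]:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


def health_problems(samples: Dict[str, float], now: float,
                    max_state_age: float = 1800.0) -> List[str]:
    """Report a lost Falco connection and a stale Terraform state."""
    problems = []
    if samples.get("tfdrift_falco_connected") != 1.0:
        problems.append("falco disconnected")
    last_sync = samples.get("tfdrift_state_last_sync_timestamp")
    if last_sync is None or now - last_sync > max_state_age:
        problems.append("terraform state stale")
    return problems
```

Passing the body of a `/metrics` response to `parse_metrics` and the current Unix time to `health_problems` gives a list that is empty when both checks pass.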
Grafana Alerting
Example alert rules:
# grafana-alerts.yaml
groups:
  - name: tfdrift-alerts
    interval: 1m
    rules:
      # Notify immediately when critical drift occurs
      - alert: CriticalDriftDetected
        expr: increase(tfdrift_events_total{severity="critical"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Critical drift detected in {{ $labels.environment }}"
          description: "{{ $value }} critical drift events detected in the last 5 minutes"
      # Warn when detection latency exceeds 10 seconds
      - alert: HighDetectionLatency
        expr: tfdrift_detection_latency_seconds{quantile="0.95"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High drift detection latency"
          description: "P95 latency is {{ $value }}s (threshold: 10s)"
      # Alert when the Falco connection has been down for 1 minute
      - alert: FalcoConnectionLost
        expr: tfdrift_falco_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Lost connection to Falco"
          description: "TFDrift-Falco cannot connect to Falco gRPC endpoint"
      # State has not been synced for 30 minutes or more
      - alert: TerraformStateStale
        expr: (time() - tfdrift_state_last_sync_timestamp) > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Terraform state not synced for 30+ minutes"
          description: "Time since last sync: {{ $value | humanizeDuration }}"
Health Checks
Kubernetes Liveness & Readiness Probes:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
Health check endpoints:
- GET /healthz - basic health check (is the process alive?)
- GET /ready - readiness check (Falco connected, state loaded)
- GET /metrics - Prometheus metrics
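The difference between the liveness and readiness endpoints comes down to which conditions each one checks. The sketch below models that decision as a pure function. The specific status codes (200, 503, 404) and the `probe_status` name are assumptions for illustration and are not taken from the TFDrift-Falco source.

```python
def probe_status(path: str, falco_connected: bool, state_loaded: bool) -> int:
    """Return the HTTP status a probe endpoint would answer with."""
    if path == "/healthz":
        # Liveness: answering at all means the process is alive.
        return 200
    if path == "/ready":
        # Readiness: both dependencies must be in place before taking work.
        return 200 if falco_connected and state_loaded else 503
    return 404
```

With this split, a lost Falco connection makes the pod not ready without causing Kubernetes to restart it, since the liveness check still succeeds.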
Performance Tuning
Terraform State Refresh Interval
Default: every 5 minutes
providers:
  aws:
    state:
      backend: "s3"
      s3_bucket: "my-terraform-state"
      s3_key: "terraform.tfstate"
      refresh_interval: "5m" # default
Recommended settings:
- Small environments (<100 resources): refresh_interval: "5m"
- Medium environments (100-500 resources): refresh_interval: "10m"
- Large environments (>500 resources): refresh_interval: "15m"
Reason: each state load calls the S3 API, which becomes noticeable overhead in large environments.
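To pick a tier, the number of managed resources can be counted from the state file itself. The sketch below assumes the Terraform state version 4 JSON layout, in which a top-level `resources` list holds entries with a `mode` and an `instances` list. Both function names are hypothetical, and the thresholds simply restate the recommendations above.

```python
import json


def count_managed_resources(state_json: str) -> int:
    """Count resource instances with mode 'managed' (data sources are skipped)."""
    state = json.loads(state_json)
    return sum(
        len(resource.get("instances", []))
        for resource in state.get("resources", [])
        if resource.get("mode") == "managed"
    )


def recommended_refresh_interval(resource_count: int) -> str:
    """Map a resource count to the refresh_interval tiers listed above."""
    if resource_count < 0:
        raise ValueError("resource_count must not be negative")
    if resource_count < 100:
        return "5m"
    if resource_count <= 500:
        return "10m"
    return "15m"
```

Feeding the contents of a downloaded `terraform.tfstate` to `count_managed_resources` and the result to `recommended_refresh_interval` yields the value to place in the configuration.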
Event Filtering
Filter out unnecessary events:
# Pre-filter with a Falco rule
- rule: Terraform Relevant CloudTrail Events
  desc: Only process events relevant to Terraform drift detection
  condition: >
    ct.name in (RunInstances, TerminateInstances, ModifyInstanceAttribute, ...)
    and not ct.user startswith "AWSServiceRole"
  output: "Terraform-relevant event detected"
  priority: INFO
Filter on the TFDrift-Falco side as well:
drift_rules:
  - name: "EC2 Specific Attributes"
    resource_types:
      - "aws_instance"
    watched_attributes:
      - "instance_type"
      - "disable_api_termination"
      # Attributes that do not matter are not watched (for example, changes to tags.Name are ignored)
    severity: "high"
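The effect of `watched_attributes` can be pictured as a diff restricted to a short list of keys. The sketch below illustrates that idea on flat attribute dictionaries. It is not the TFDrift-Falco diff implementation, and `watched_changes` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Tuple


def watched_changes(expected: Dict[str, Any], actual: Dict[str, Any],
                    watched: Iterable[str]) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (expected, actual)} for watched attributes that differ."""
    changes: Dict[str, Tuple[Any, Any]] = {}
    for attribute in watched:
        before = expected.get(attribute)
        after = actual.get(attribute)
        if before != after:
            changes[attribute] = (before, after)
    return changes
```

Because only the listed keys are compared, a tag rename produces no entry while an `instance_type` change does, which is the behavior the rule above aims for.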
Concurrency Settings
# config.yaml
performance:
  # Number of events processed in parallel (default: 10)
  event_worker_count: 10
  # Parallelism for Terraform state parsing (default: 5)
  state_parser_workers: 5
  # Parallelism for sending notifications (default: 3)
  notifier_workers: 3
Recommended settings:
- Small workload (<50 events/min): event_worker_count: 5
- Medium workload (50-200 events/min): event_worker_count: 10
- Large workload (>200 events/min): event_worker_count: 20
Troubleshooting
Common Issues
Issue 1: "Cannot connect to Falco gRPC endpoint"
Symptoms:
Causes and fixes:
- Falco is not running
- gRPC is not enabled
- Network connectivity problem
Issue 2: "Terraform state file not found"
Symptoms:
Causes and fixes:
- The S3 bucket or key is wrong
- Insufficient IAM permissions
- Missing KMS decrypt permission
Issue 3: "Too many drift alerts (False Positives)"
Symptoms:
- A flood of alerts is sent to Slack
- Changes that are not actually drift are also detected
Causes and fixes:
- watched_attributes is too broad
# ❌ Bad
drift_rules:
  - name: "All EC2 Changes"
    resource_types:
      - "aws_instance"
    watched_attributes:
      - "*" # watches every attribute (including tag changes)
# ✅ Good
drift_rules:
  - name: "Critical EC2 Changes"
    resource_types:
      - "aws_instance"
    watched_attributes:
      - "instance_type"
      - "disable_api_termination"
      - "security_groups"
      # tag changes are excluded
- Resources not managed by Terraform are being detected
- Automatic changes made by Auto Scaling are being detected
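Issue 4: "High memory usage"
Symptoms:
- Memory usage exceeds 1 GB
- Pods restart after being killed by the OOM killer
Causes and fixes:
- The Terraform state is very large (1000+ resources)
- The event queue is backing up
- Increase the resource limits
Issue 5: "Detection latency is high (>10 seconds)"
Symptoms:
- More than 10 seconds pass between a CloudTrail event and the alert being sent
Causes and fixes:
- Terraform state loading is slow
  → Access S3 through a VPC endpoint (reduces latency)
- Diff computation is expensive
- Notification delivery is slow (webhook timeout)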
Debug Logging
Enabling debug logs:
Debugging specific components only:
# Controlled through environment variables
export TFDRIFT_LOG_LEVEL=debug
export TFDRIFT_LOG_COMPONENTS="falco,detector" # Falco and Detector only
tfdrift --config config.yaml
Performance Profiling
CPU Profiling:
# Enable pprof
tfdrift --config config.yaml --pprof-port 6060 &
# Collect a profile (30 seconds); the URL is quoted so the shell does not interpret the ?
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
Memory Profiling:
# Fetch a heap profile
curl -o heap.out http://localhost:6060/debug/pprof/heap
# Analyze it
go tool pprof heap.out
Appendix
Checklist: Production Readiness
Before deploying to production, confirm the following items:
- High Availability: deploy two or more replicas (Active-Passive setup)
- Resource Limits: set CPU and memory limits/requests
- IAM Permissions: IAM policies follow the principle of least privilege
- Secrets Management: webhook URLs and credentials are managed in Secrets or a secret manager
- Network Security: mTLS enabled, NetworkPolicy configured
- Monitoring: Prometheus metrics collected, Grafana dashboards built
- Alerting: alert rules configured for Critical/High severity
- Logging: logs forwarded to an external system in JSON format
- Backup: regular Terraform state backups
- Testing: behavior verified in a staging environment
- Drift Rules: rules tuned to the environment (false positives minimized)
- Documentation: runbooks and a troubleshooting guide written
References
- AWS CloudTrail User Guide
- Falco Documentation
- Terraform Backend Configuration
- Kubernetes Best Practices
- Prometheus Best Practices
Next steps: see Extending TFDrift-Falco to learn how to add custom rules and notification channels.