
TFDrift-Falco Best Practices

This document introduces best practices for operating TFDrift-Falco in production environments.

📋 Table of Contents

  1. Production Deployment
  2. Security
  3. Operational Excellence
  4. Configuration Management
  5. Monitoring & Observability
  6. Performance Tuning
  7. Troubleshooting

Production Deployment

High Availability Setup

Recommended setup: Active-Passive with Health Checks

# kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tfdrift-falco
spec:
  replicas: 2  # active-passive configuration
  strategy:
    type: Recreate  # prevent two instances from running at once
  template:
    spec:
      containers:
      - name: tfdrift-falco
        image: ghcr.io/higakikeita/tfdrift-falco:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          exec:
            command: ["/bin/sh", "-c", "pgrep -f tfdrift"]
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command: ["/bin/sh", "-c", "nc -z localhost 5060"]
          initialDelaySeconds: 10
          periodSeconds: 5

Why Active-Passive?

  • TFDrift-Falco is a stateful subscriber that connects to the Falco gRPC stream
  • If multiple instances process events simultaneously, duplicate notifications are sent
  • Using the Kubernetes LeaderElection pattern is recommended (planned for a future release)

Multi-Region Deployment

Scenario: monitoring us-east-1 and ap-northeast-1

# Region 1: us-east-1
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-us-east-1"
      s3_key: "prod/terraform.tfstate"

falco:
  hostname: "falco-us-east-1.internal"
  port: 5060

notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/..."
    channel: "#drift-us-east-1"
# Region 2: ap-northeast-1
providers:
  aws:
    enabled: true
    regions:
      - ap-northeast-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-ap-northeast-1"
      s3_key: "prod/terraform.tfstate"

falco:
  hostname: "falco-ap-northeast-1.internal"
  port: 5060

notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/..."
    channel: "#drift-ap-northeast-1"

Key points:

  • Run an independent TFDrift-Falco instance per region
  • Process CloudTrail logs per region as well
  • Keep Terraform state separate per region
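One way to run this is a separate instance per region, each pointed at its own config file. A minimal launch sketch, assuming per-region config file names (the paths below are illustrative, not required by TFDrift-Falco):

# On the instance responsible for us-east-1
tfdrift --config /etc/tfdrift/config-us-east-1.yaml

# On the instance responsible for ap-northeast-1
tfdrift --config /etc/tfdrift/config-ap-northeast-1.yaml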

Resource Sizing

Minimal configuration (Small workload, <50 CloudTrail events/min):

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "250m"

Recommended configuration (Medium workload, 50-500 events/min):

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Large configuration (Large workload, >500 events/min):

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"


Security

IAM Permissions (Principle of Least Privilege)

Example least-privilege IAM policy (AWS):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateReadOnly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-terraform-state",
        "arn:aws:s3:::my-terraform-state/*"
      ]
    },
    {
      "Sid": "KMSDecryptForState",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc-123-def-456"
    }
  ]
}
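To use this policy, create it and attach it to the identity TFDrift-Falco runs as. A hedged sketch using the standard AWS CLI; the policy file name and role name below are assumptions for illustration:

# Create the policy from the JSON document above
aws iam create-policy \
  --policy-name TFDriftFalcoReadOnly \
  --policy-document file://tfdrift-iam-policy.json

# Attach it to the role used by TFDrift-Falco (e.g., via IRSA or an instance profile)
aws iam attach-role-policy \
  --role-name tfdrift-falco \
  --policy-arn arn:aws:iam::123456789012:policy/TFDriftFalcoReadOnly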

Example GCP service account permissions:

# Read-only access to Terraform state
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:tfdrift@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Read Cloud Audit Logs (via Pub/Sub)
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:tfdrift@my-project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"

Network Security

Use mTLS for the Falco gRPC connection:

falco:
  enabled: true
  hostname: "falco.secure.internal"
  port: 5060
  cert_file: "/etc/tfdrift/certs/client.crt"
  key_file: "/etc/tfdrift/certs/client.key"
  ca_root_file: "/etc/tfdrift/certs/ca.crt"
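The client certificate and key referenced above must be signed by the same CA that Falco trusts. A minimal openssl sketch, assuming you already have the CA files (ca.crt / ca.key) used for Falco's gRPC certificates:

# Generate a client key and certificate signing request
openssl genrsa -out client.key 4096
openssl req -new -key client.key -subj "/CN=tfdrift-falco" -out client.csr

# Sign the CSR with the Falco CA (valid for one year)
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -out client.crt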

Kubernetes NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tfdrift-falco-network-policy
spec:
  podSelector:
    matchLabels:
      app: tfdrift-falco
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: prometheus  # for metrics scraping
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: falco  # connection to Falco gRPC
    ports:
    - protocol: TCP
      port: 5060
  - to:  # for Slack/webhook notifications
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

Secrets Management

❌ Bad - plaintext in the configuration file:

notifications:
  slack:
    webhook_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX"

✅ Good - use Kubernetes Secrets:

# config.yaml
notifications:
  slack:
    webhook_url_from_env: "SLACK_WEBHOOK_URL"
# Create the Kubernetes Secret
kubectl create secret generic tfdrift-secrets \
  --from-literal=slack-webhook-url="https://hooks.slack.com/services/..."
# Deployment
env:
- name: SLACK_WEBHOOK_URL
  valueFrom:
    secretKeyRef:
      name: tfdrift-secrets
      key: slack-webhook-url

✅ Even better - AWS Secrets Manager / GCP Secret Manager:

notifications:
  slack:
    webhook_url_from_aws_secret: "prod/tfdrift/slack-webhook"
    # or
    webhook_url_from_gcp_secret: "projects/123/secrets/tfdrift-slack-webhook"
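Either way, the secret itself must exist before TFDrift-Falco can read it. A sketch of creating it with the standard cloud CLIs (the secret names match the config above):

# AWS Secrets Manager
aws secretsmanager create-secret \
  --name prod/tfdrift/slack-webhook \
  --secret-string "https://hooks.slack.com/services/..."

# GCP Secret Manager
echo -n "https://hooks.slack.com/services/..." | \
  gcloud secrets create tfdrift-slack-webhook --data-file=-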

Operational Excellence

Log Retention and Rotation

Recommended: forward structured (JSON) logs to an external logging system

# config.yaml
logging:
  level: "info"
  format: "json"  # 本番環境ではJSON推奨

Fluent Bit integration example:

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/tfdrift/*.log
        Parser            json
        Tag               tfdrift.*
        Refresh_Interval  5

    [OUTPUT]
        Name  es
        Match *
        Host  elasticsearch.logging.svc.cluster.local
        Port  9200
        Index tfdrift-logs
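Once Fluent Bit is shipping logs, a quick way to confirm documents are landing in the tfdrift-logs index (index name taken from the OUTPUT block above) is the Elasticsearch _count API:

curl -s "http://elasticsearch.logging.svc.cluster.local:9200/tfdrift-logs/_count?pretty"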

Backup and Disaster Recovery

Terraform State Backup:

#!/bin/bash
# backup-terraform-state.sh
# Periodic Terraform state snapshot script

DATE=$(date +%Y%m%d-%H%M%S)
BUCKET="my-terraform-state"
KEY="prod/terraform.tfstate"
BACKUP_BUCKET="my-terraform-state-backup"

# Download the current state from S3
aws s3 cp s3://${BUCKET}/${KEY} /tmp/terraform.tfstate

# Upload to the backup bucket (versioning enabled)
aws s3 cp /tmp/terraform.tfstate s3://${BACKUP_BUCKET}/${KEY}.${DATE}

# Delete backups older than 90 days
# (--recursive is needed because the snapshots live under a key prefix)
aws s3 ls s3://${BACKUP_BUCKET}/ --recursive | while read -r line; do
  createDate=$(echo $line | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date --date="90 days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo $line | awk '{print $4}')
    aws s3 rm s3://${BACKUP_BUCKET}/$fileName
  fi
done

Cron entry (runs daily at 3 AM):

0 3 * * * /usr/local/bin/backup-terraform-state.sh >> /var/log/backup.log 2>&1
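Restoring from a snapshot is the reverse copy. A sketch, assuming the bucket and key layout from the script above (the timestamp suffix is illustrative):

# List available snapshots
aws s3 ls s3://my-terraform-state-backup/prod/ --recursive

# Copy a chosen snapshot back to the primary state bucket
aws s3 cp \
  s3://my-terraform-state-backup/prod/terraform.tfstate.20250101-030000 \
  s3://my-terraform-state/prod/terraform.tfstate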

Upgrade Procedures

Zero-Downtime Upgrade (Kubernetes):

# 1. Validate the new version in a staging environment
kubectl set image deployment/tfdrift-falco \
  tfdrift-falco=ghcr.io/higakikeita/tfdrift-falco:v0.6.0 \
  -n staging

# 2. Confirm it behaves correctly in staging
kubectl logs -f deployment/tfdrift-falco -n staging

# 3. Roll the update out to production
kubectl set image deployment/tfdrift-falco \
  tfdrift-falco=ghcr.io/higakikeita/tfdrift-falco:v0.6.0 \
  -n production

# 4. Watch the rollout status
kubectl rollout status deployment/tfdrift-falco -n production

# 5. Roll back if anything goes wrong
kubectl rollout undo deployment/tfdrift-falco -n production

Configuration Management

Drift Rule Design Patterns

Pattern 1: Critical Resources Only

drift_rules:
  # IAM-related (highest priority)
  - name: "IAM Critical Changes"
    resource_types:
      - "aws_iam_role"
      - "aws_iam_policy"
      - "aws_iam_user"
    watched_attributes:
      - "policy"
      - "assume_role_policy"
      - "inline_policy"
    severity: "critical"

  # Security groups (important)
  - name: "Security Group Changes"
    resource_types:
      - "aws_security_group"
      - "aws_security_group_rule"
    watched_attributes:
      - "ingress"
      - "egress"
      - "cidr_blocks"
    severity: "critical"

Pattern 2: Environment-Specific Rules

drift_rules:
  # Production: detect all changes
  - name: "Production - All Changes"
    resource_types:
      - "*"  # 全てのリソース
    environment: "production"
    severity: "high"

  # Staging: critical resources only
  - name: "Staging - Critical Only"
    resource_types:
      - "aws_iam_*"
      - "aws_security_group*"
      - "aws_kms_*"
    environment: "staging"
    severity: "medium"

  # Development: IAM only
  - name: "Dev - IAM Only"
    resource_types:
      - "aws_iam_*"
    environment: "development"
    severity: "low"

Multi-Account Strategy

Example account layout:

  • Production Account (123456789012)
  • Staging Account (234567890123)
  • Development Account (345678901234)

Recommended deployment: a separate TFDrift-Falco instance for each account

# production-config.yaml
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-production"
      s3_key: "terraform.tfstate"

notifications:
  slack:
    webhook_url_from_env: "SLACK_WEBHOOK_PROD"
    channel: "#security-alerts-prod"
# staging-config.yaml
providers:
  aws:
    enabled: true
    regions:
      - us-east-1
    state:
      backend: "s3"
      s3_bucket: "terraform-state-staging"
      s3_key: "terraform.tfstate"

notifications:
  slack:
    webhook_url_from_env: "SLACK_WEBHOOK_STAGING"
    channel: "#security-alerts-staging"

Terraform State Backend Setup

S3 backend (recommended settings):

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true  # encryption required
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abc-123"
    dynamodb_table = "terraform-state-lock"  # state locking

    # Note: the s3 backend has no "versioning" argument; versioning is
    # enabled on the bucket itself (see aws_s3_bucket_versioning below).
  }
}

S3 bucket configuration:

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"

  lifecycle {
    prevent_destroy = true  # protect against accidental deletion
  }

  tags = {
    Name        = "Terraform State"
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"  # バージョニング有効化
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Monitoring & Observability

Prometheus Metrics

TFDrift-Falco exposes the following metrics (on the /metrics endpoint):

# Number of drift events detected
tfdrift_events_total{severity="critical"} 5
tfdrift_events_total{severity="high"} 23

# Events by resource type
tfdrift_events_by_type{type="aws_instance"} 12
tfdrift_events_by_type{type="aws_iam_role"} 8

# Detection latency (seconds)
tfdrift_detection_latency_seconds{quantile="0.5"} 0.8
tfdrift_detection_latency_seconds{quantile="0.95"} 2.3
tfdrift_detection_latency_seconds{quantile="0.99"} 5.1

# Falco connection status
tfdrift_falco_connected 1

# Last Terraform state sync time (Unix timestamp)
tfdrift_state_last_sync_timestamp 1705312345
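A quick way to confirm the metrics are being exported is to scrape the endpoint directly (assuming the metrics port 8080 used in the probe and NetworkPolicy examples above):

kubectl port-forward deployment/tfdrift-falco 8080:8080 &
curl -s http://localhost:8080/metrics | grep '^tfdrift_'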

Grafana Alerting

Example alert rules:

# grafana-alerts.yaml
groups:
  - name: tfdrift-alerts
    interval: 1m
    rules:
      # Notify immediately when critical drift occurs
      - alert: CriticalDriftDetected
        expr: increase(tfdrift_events_total{severity="critical"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Critical drift detected in {{ $labels.environment }}"
          description: "{{ $value }} critical drift events detected in the last 5 minutes"

      # Warn when detection latency exceeds 10 seconds
      - alert: HighDetectionLatency
        expr: tfdrift_detection_latency_seconds{quantile="0.95"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High drift detection latency"
          description: "P95 latency is {{ $value }}s (threshold: 10s)"

      # Alert immediately if the Falco connection is lost
      - alert: FalcoConnectionLost
        expr: tfdrift_falco_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Lost connection to Falco"
          description: "TFDrift-Falco cannot connect to Falco gRPC endpoint"

      # State has not been synced for 30+ minutes
      - alert: TerraformStateStale
        expr: (time() - tfdrift_state_last_sync_timestamp) > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Terraform state not synced for 30+ minutes"
          description: "Last sync: {{ $value | humanizeDuration }}"

Health Checks

Kubernetes Liveness & Readiness Probes:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Health check endpoints (a quick manual check with curl follows below):

  • GET /healthz - basic health check (is the process alive?)
  • GET /ready - readiness check (Falco connected, state loaded)
  • GET /metrics - Prometheus metrics
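A quick manual check of these endpoints (assuming access to port 8080, for example via kubectl port-forward as in the metrics example):

curl -f http://localhost:8080/healthz   # liveness: process is alive
curl -f http://localhost:8080/ready     # readiness: fails until Falco is connected and state is loaded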

Performance Tuning

Terraform State Refresh Interval

Default: every 5 minutes

providers:
  aws:
    state:
      backend: "s3"
      s3_bucket: "my-terraform-state"
      s3_key: "terraform.tfstate"
      refresh_interval: "5m"  # デフォルト

Recommended settings:

  • Small environments (<100 resources): refresh_interval: "5m"
  • Medium environments (100-500 resources): refresh_interval: "10m"
  • Large environments (>500 resources): refresh_interval: "15m"

Why: each state refresh makes S3 API calls, which becomes overhead in large environments.

Event Filtering

Filter out events you do not need:

# Pre-filter with a Falco rule
- rule: Terraform Relevant CloudTrail Events
  desc: Only process events relevant to Terraform drift detection
  condition: >
    ct.name in (RunInstances, TerminateInstances, ModifyInstanceAttribute, ...)
    and not ct.user startswith "AWSServiceRole"
  output: "Terraform-relevant event detected"
  priority: INFO

Filter on the TFDrift-Falco side as well:

drift_rules:
  - name: "EC2 Specific Attributes"
    resource_types:
      - "aws_instance"
    watched_attributes:
      - "instance_type"
      - "disable_api_termination"
      # Do not watch unneeded attributes (e.g., ignore changes to tags.Name)
    severity: "high"

Concurrency Settings

# config.yaml
performance:
  # Number of events processed in parallel (default: 10)
  event_worker_count: 10

  # Parallelism for Terraform state parsing (default: 5)
  state_parser_workers: 5

  # Parallelism for sending notifications (default: 3)
  notifier_workers: 3

Recommended settings:

  • Small workload (<50 events/min): event_worker_count: 5
  • Medium workload (50-200 events/min): event_worker_count: 10
  • Large workload (>200 events/min): event_worker_count: 20


Troubleshooting

Common Issues

Issue 1: "Cannot connect to Falco gRPC endpoint"

Symptom:

ERROR Failed to connect to Falco gRPC endpoint: connection refused

Causes and fixes:

  1. Falco is not running

    # Check whether Falco is running
    systemctl status falco
    
    # Start it if it is not running
    systemctl start falco
    

  2. gRPC is not enabled

    # /etc/falco/falco.yaml
    grpc:
      enabled: true
      bind_address: "0.0.0.0:5060"
      threadiness: 8
    

  3. Network connectivity problems

    # Verify connectivity to the Falco endpoint
    nc -zv localhost 5060
    
    # On Kubernetes, check the Service DNS name
    nslookup falco.default.svc.cluster.local
    

Issue 2: "Terraform state file not found"

Symptom:

ERROR Failed to load Terraform state: NoSuchKey

Causes and fixes:

  1. The S3 bucket or key is wrong

    # Check that the file exists in S3
    aws s3 ls s3://my-terraform-state/prod/terraform.tfstate
    

  2. Insufficient IAM permissions

    # Inspect the IAM policy
    aws iam get-policy-version --policy-arn arn:aws:iam::123456789012:policy/TFDriftPolicy --version-id v1
    

  3. Missing KMS decryption permissions

    # Verify access to the KMS key
    aws kms describe-key --key-id arn:aws:kms:us-east-1:123456789012:key/abc-123
    

Issue 3: "Too many drift alerts (False Positives)"

Symptoms:

  • A large number of alerts are sent to Slack
  • Changes that are not actually drift are also being detected

Causes and fixes:

  1. watched_attributes is too broad

    # ❌ Bad
    drift_rules:
      - name: "All EC2 Changes"
        resource_types:
          - "aws_instance"
        watched_attributes:
          - "*"  # 全ての属性を監視(Tagsの変更も含む)
    
    # ✅ Good
    drift_rules:
      - name: "Critical EC2 Changes"
        resource_types:
          - "aws_instance"
        watched_attributes:
          - "instance_type"
          - "disable_api_termination"
          - "security_groups"
        # Tag changes are excluded
    

  2. Resources not managed by Terraform are being detected

    # Only watch resources that are present in the Terraform state
    drift_rules:
      - name: "IAM Changes"
        resource_types:
          - "aws_iam_role"
        terraform_managed_only: true  # Terraform-managed resources only
    

  3. Automatic changes made by Auto Scaling are being detected

    # Detect only user-initiated changes (exclude AWS service roles)
    drift_rules:
      - name: "Manual EC2 Changes"
        resource_types:
          - "aws_instance"
        exclude_users:
          - "AWSServiceRoleForAutoScaling"
          - "AWSServiceRoleForECS"
    

Issue 4: "High memory usage"

Symptoms:

  • Memory usage exceeds 1 GB
  • Pods are restarted by the OOM killer

Causes and fixes:

  1. The Terraform state is huge (1000+ resources)

    # Lower the state refresh frequency
    providers:
      aws:
        state:
          refresh_interval: "15m"  # 5m → 15m
    

  2. The event queue is backing up

    # Increase the worker count
    performance:
      event_worker_count: 20  # 10 → 20
    

  3. Increase the resource limits

    resources:
      limits:
        memory: "1Gi"  # 512Mi → 1Gi
        cpu: "1000m"
    

Issue 5: "Detection latency is high (>10 seconds)"

Symptom:

  • More than 10 seconds elapse between the CloudTrail event and the alert being sent

Causes and fixes:

  1. Terraform state loading is slow

    # Measure how long the state download takes
    time aws s3 cp s3://my-terraform-state/terraform.tfstate - | wc -l
    

→ Connect to S3 via a VPC endpoint (reduces latency)

  2. Diff computation is heavy

    # Watch fewer attributes
    drift_rules:
      - name: "IAM Changes"
        resource_types:
          - "aws_iam_role"
        watched_attributes:
          - "assume_role_policy"  # 主要な属性のみ
    

  3. Notification delivery is slow (webhook timeouts)

    notifications:
      webhook:
        timeout: 5s  # 10s → 5s
        max_retries: 2  # 5 → 2
    

Debug Logging

Enabling debug logs:

# config.yaml
logging:
  level: "debug"  # info → debug
  format: "text"  # 人間が読みやすい形式

Debug only specific components:

# Controlled via environment variables
export TFDRIFT_LOG_LEVEL=debug
export TFDRIFT_LOG_COMPONENTS="falco,detector"  # Falco and detector components only

tfdrift --config config.yaml

Performance Profiling

CPU Profiling:

# Enable pprof
tfdrift --config config.yaml --pprof-port 6060 &

# Run profiling (for 30 seconds)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

Memory Profiling:

# Capture a heap profile
curl -o heap.out http://localhost:6060/debug/pprof/heap

# Analyze it
go tool pprof heap.out

Appendix

Checklist: Production Readiness

Before deploying to production, confirm the following items:

  • High Availability: deploy two or more replicas (Active-Passive configuration)
  • Resource Limits: set CPU and memory requests/limits
  • IAM Permissions: IAM policies that follow the principle of least privilege
  • Secrets Management: manage webhook URLs and credentials with Secrets / Secret Manager
  • Network Security: mTLS enabled, NetworkPolicy configured
  • Monitoring: Prometheus metrics collected, Grafana dashboards built
  • Alerting: alert rules configured for Critical/High severities
  • Logging: JSON logs forwarded to an external system
  • Backup: regular Terraform state backups
  • Testing: behavior verified in a staging environment
  • Drift Rules: rules tuned to the environment (minimize false positives)
  • Documentation: runbooks and troubleshooting guides written

Next step: learn how to add custom rules and notification channels in Extending TFDrift-Falco.