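Cloud Run High-Availability Configuration Guide

📋 Overview

This guide describes how to configure Cloud Run for high availability in order to meet requirement PLT-NFR-004 (99.9% availability).

📖 Related document: PLT-NFR-004 99.9% Availability Implementation Guide (overall architecture and monitoring system)

🔧 Improvements to the Current Configuration

1. Per-Environment Terragrunt Configuration (Done)

Each environment has settings tuned to its purpose:

Dev environment (cost optimization first)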

# cloudrun-deploy/dta-wide-api/cloudrun/dev/terragrunt.hcl
scaling = {
  min_instance_count = "0"  # Scale to zero to cut cost
  max_instance_count = "20" # Reasonable upper bound
}

resources = {
  limits = {
    cpu    = "1"     # Conserve CPU
    memory = "1Gi"   # Conserve memory
  }
}

annotations = {
  "autoscaling.knative.dev/minScale" = "0"
  "autoscaling.knative.dev/maxScale" = "20"
  "run.googleapis.com/cpu-throttling" = "true"  # CPU is allocated only while handling requests, which lowers cost
}

Stage environment (for beta testers)

scaling = {
  min_instance_count = "0"  # 30 beta testers; scale to zero to keep cost low
  max_instance_count = "5"  # Enough for a small beta test
}

resources = {
  limits = {
    cpu    = "1"     # Adequate CPU for beta testing
    memory = "1Gi"   # Enough memory for small-scale testing
  }
}

annotations = {
  "autoscaling.knative.dev/minScale" = "0"
  "autoscaling.knative.dev/maxScale" = "5"
  "run.googleapis.com/cpu-throttling" = "true"  # Cost takes priority in the beta environment
}

Prod environment (99.9% availability)

scaling = {
  min_instance_count = "2"   # Safe minimum for about 100 concurrent users
  max_instance_count = "15"  # Headroom for peak time and growth
}

resources = {
  limits = {
    cpu    = "2"     # Stable performance
    memory = "2Gi"   # Adequate memory
  }
}

annotations = {
  "autoscaling.knative.dev/minScale" = "2"
  "autoscaling.knative.dev/maxScale" = "15"
  "run.googleapis.com/cpu-throttling" = "false"  # CPU always allocated; performance takes priority
}
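
Each environment above states its instance bounds twice, once under scaling and once in the Knative annotations, so the two can drift apart during edits. The following is a minimal TypeScript sketch of a consistency check. The `EnvConfig` shape and the `findScalingMismatches` helper are hypothetical names used for illustration and are not part of gops-terraform-module.

```typescript
// Hypothetical shape mirroring the scaling and annotations blocks in terragrunt.hcl.
interface EnvConfig {
  scaling: { min_instance_count: string; max_instance_count: string };
  annotations: Record<string, string>;
}

// Returns one message per annotation that disagrees with the scaling block.
function findScalingMismatches(name: string, cfg: EnvConfig): string[] {
  const pairs: Array<[string, string]> = [
    ["autoscaling.knative.dev/minScale", cfg.scaling.min_instance_count],
    ["autoscaling.knative.dev/maxScale", cfg.scaling.max_instance_count],
  ];
  const problems: string[] = [];
  for (const [key, expected] of pairs) {
    const actual: string | undefined = cfg.annotations[key];
    if (actual !== expected) {
      problems.push(`${name}: ${key} is ${actual ?? "unset"}, expected ${expected}`);
    }
  }
  return problems;
}

// Values copied from the Prod settings in this guide.
const prod: EnvConfig = {
  scaling: { min_instance_count: "2", max_instance_count: "15" },
  annotations: {
    "autoscaling.knative.dev/minScale": "2",
    "autoscaling.knative.dev/maxScale": "15",
    "run.googleapis.com/cpu-throttling": "false",
  },
};

console.log(findScalingMismatches("prod", prod)); // prints [] when the two blocks agree
```

A check like this could run in CI before `terragrunt plan`, assuming the HCL values are first exported to JSON by some other step.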

2. Health Check Configuration (To Be Added)

We still need to verify whether the gops-terraform-module currently in use supports health check settings.

2.1 Application-Level Health Check

First, the application must implement health check endpoints:

// src/health/health.controller.ts
import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';
import { ApiTags, ApiOperation, ApiResponse } from '@nestjs/swagger';

@ApiTags('Health')
@Controller('health')
export class HealthController {

  // Served at GET /health
  @Get()
  @ApiOperation({ summary: 'Health check endpoint' })
  @ApiResponse({ status: 200, description: 'Service is healthy' })
  @ApiResponse({ status: 503, description: 'Service is unhealthy' })
  async healthCheck() {
    return {
      status: 'ok',
      timestamp: new Date().toISOString(),
      service: 'dta-wide-api',
      version: process.env.npm_package_version || 'unknown'
    };
  }

  // Served at GET /health/ready because of the controller prefix
  @Get('ready')
  @ApiOperation({ summary: 'Readiness check endpoint' })
  @ApiResponse({ status: 200, description: 'Service is ready' })
  @ApiResponse({ status: 503, description: 'Service is not ready' })
  async readinessCheck() {
    // Check dependencies such as the database and Redis here
    try {
      // e.g. verify the database and Redis connections
      // await this.databaseService.ping();
      // await this.redisService.ping();

      return {
        status: 'ready',
        timestamp: new Date().toISOString(),
        dependencies: {
          database: 'ok',
          redis: 'ok'
        }
      };
    } catch (error) {
      throw new ServiceUnavailableException('Service dependencies not ready');
    }
  }
}
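
The readiness handler above leaves the dependency pings commented out. One way to fill them in is sketched below in framework-free TypeScript: every ping runs in parallel under a timeout, and the result reports a status per dependency. `Ping`, `withTimeout`, and `checkDependencies` are hypothetical helpers, and the 2000 ms default timeout is an assumption rather than a value from this guide.

```typescript
// A ping resolves when the dependency answers and rejects (or hangs) otherwise.
type Ping = () => Promise<void>;
type DependencyStatus = "ok" | "failed";

interface ReadinessReport {
  ready: boolean;
  dependencies: Record<string, DependencyStatus>;
}

// Rejects if the ping has not settled within timeoutMs.
function withTimeout(ping: Ping, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
    Promise.resolve()
      .then(ping)
      .then(
        () => { clearTimeout(timer); resolve(); },
        (err) => { clearTimeout(timer); reject(err); },
      );
  });
}

// Runs every ping in parallel and records a per-dependency status.
async function checkDependencies(
  pings: Record<string, Ping>,
  timeoutMs = 2000,
): Promise<ReadinessReport> {
  const names = Object.keys(pings);
  const statuses = await Promise.all(
    names.map((name) =>
      withTimeout(pings[name], timeoutMs).then(
        (): DependencyStatus => "ok",
        (): DependencyStatus => "failed",
      ),
    ),
  );
  const dependencies: Record<string, DependencyStatus> = {};
  names.forEach((name, i) => { dependencies[name] = statuses[i]; });
  return { ready: statuses.every((s) => s === "ok"), dependencies };
}
```

Inside `readinessCheck`, the handler could call `checkDependencies` with the real database and Redis pings and throw `ServiceUnavailableException` whenever `ready` is false, so that the probe sees a 503.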

2.2 Health Check Configuration in the Terraform Module

If gops-terraform-module supports health checks, configure them as follows:

# Settings to add to terragrunt.hcl (only if the module supports them)
containers = [
  {
    name  = "dta-wide-api"
    image = "eu.gcr.io/dta-cloud-de-dev/dta-wide-api:dta-wide-api-${local.container_image}"

    # Health check settings
    startup_probe = {
      http_get = {
        path = "/health"
        port = 8080
      }
      initial_delay_seconds = 30
      timeout_seconds       = 5
      period_seconds        = 10
      failure_threshold     = 3
    }

    liveness_probe = {
      http_get = {
        path = "/health"
        port = 8080
      }
      initial_delay_seconds = 30
      timeout_seconds       = 5
      period_seconds        = 10
      failure_threshold     = 3
    }

    # Readiness probe support may depend on the Cloud Run and module versions; verify before enabling
    readiness_probe = {
      http_get = {
        path = "/health/ready"  # @Controller('health') + @Get('ready')
        port = 8080
      }
      initial_delay_seconds = 5
      timeout_seconds       = 3
      period_seconds        = 5
      failure_threshold     = 2
    }

    # Existing settings...
  }
]
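
The probe values above determine how long a container may take to start and how quickly a hung instance is noticed. The TypeScript sketch below works through that arithmetic. It is a rough estimate that ignores the per-request timeout, and the exact counting rules are an assumption that should be checked against the platform documentation. `ProbeTiming` and both helper functions are hypothetical names.

```typescript
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time a container has to pass its startup probe:
// the initial delay plus failureThreshold probing periods.
function startupBudgetSeconds(p: ProbeTiming): number {
  return p.initialDelaySeconds + p.failureThreshold * p.periodSeconds;
}

// Approximate time for a running but unhealthy container to be detected:
// failureThreshold consecutive failed probes, one per period.
function detectionWindowSeconds(p: ProbeTiming): number {
  return p.failureThreshold * p.periodSeconds;
}

// Values taken from the probe settings in this guide.
const startup: ProbeTiming = { initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3 };
const liveness: ProbeTiming = { initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3 };

console.log(startupBudgetSeconds(startup));    // 60
console.log(detectionWindowSeconds(liveness)); // 30
```

Under these assumptions the application has about 60 seconds to start, and a hung instance is detected after roughly 30 seconds, which is worth comparing against the monthly downtime allowed by the availability target.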

2.3 Configuring Directly with gcloud (Alternative)

If the module does not support health checks, they can be configured with gcloud after deployment:

#!/bin/bash
# scripts/configure-health-checks.sh
set -euo pipefail

PROJECT_ID="dta-cloud-de-dev"
SERVICE_NAME="dta-wide-api"
REGION="europe-west3"

# Configure the startup and liveness probes.
# Probe flags take comma-separated KEY=VALUE pairs. Their availability depends on the
# gcloud version (older releases may need the beta component), so confirm with
# `gcloud run services update --help` before relying on this script.
gcloud run services update "$SERVICE_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --startup-probe=httpGet.path=/health,httpGet.port=8080,initialDelaySeconds=30,timeoutSeconds=5,periodSeconds=10,failureThreshold=3 \
  --liveness-probe=httpGet.path=/health,httpGet.port=8080,initialDelaySeconds=30,timeoutSeconds=5,periodSeconds=10,failureThreshold=3

echo "✅ Health checks configured for $SERVICE_NAME"

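🚀 Deployment and Verification

1. Step-by-Step Deployment

# 1. Apply the Terragrunt configuration
cd cloudrun-deploy/dta-wide-api/cloudrun/dev
terragrunt plan
terragrunt apply
cd -  # Return to the directory that contains scripts/

# 2. Configure health checks (only if the module does not support them)
./scripts/configure-health-checks.sh

# 3. Validate the deployment
./scripts/validate-ha-deployment.sh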

2. Validation Script

#!/bin/bash
# scripts/validate-ha-deployment.sh
set -euo pipefail

PROJECT_ID="dta-cloud-de-dev"
SERVICE_NAME="dta-wide-api"
REGION="europe-west3"

echo "🔍 Validating Cloud Run HA deployment..."

# 1. Check the service status
echo "📋 Service Status:"
gcloud run services describe "$SERVICE_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --format="table(status.conditions[0].type,status.conditions[0].status,status.traffic[0].percent)"

# 2. Inspect the latest ready revision
#    (live instance counts are reported by Cloud Monitoring, not by this command)
echo ""
echo "🏃 Latest Ready Revision:"
gcloud run services describe "$SERVICE_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --format="value(status.latestReadyRevisionName)" | \
  xargs -I {} gcloud run revisions describe {} \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --format="table(metadata.name,status.observedGeneration,spec.containerConcurrency)"

# 3. Test the health check endpoints
echo ""
echo "🏥 Health Check Test:"
SERVICE_URL=$(gcloud run services describe "$SERVICE_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --format="value(status.url)")

curl -s "$SERVICE_URL/health" | jq '.'
curl -s "$SERVICE_URL/health/ready" | jq '.'

# 4. Simple load test: 10 parallel requests, printing each response time
echo ""
echo "🔥 Simple Load Test:"
for i in {1..10}; do
  curl -s -o /dev/null -w "Response time: %{time_total}s\n" "$SERVICE_URL/health" &
done
wait

echo "✅ HA deployment validation completed"
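
The simple load test prints one response time per request, which is easier to judge once summarized. The sketch below computes nearest-rank percentiles over a list of response times. The `percentile` helper is hypothetical and the sample values are made up for illustration; they are not measurements from this service.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative response times in seconds (made-up values).
const responseTimes = [0.12, 0.10, 0.35, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12, 0.90];

console.log(`p50=${percentile(responseTimes, 50)}s p95=${percentile(responseTimes, 95)}s`);
```

With only ten samples the 95th percentile is simply the slowest request, so a larger sample is needed before drawing conclusions about tail latency.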

📊 Monitoring Setup

Monitoring the availability of the Cloud Run service requires a comprehensive monitoring system.

📖 Detailed monitoring configuration: PLT-NFR-004 99.9% Availability Implementation Guide

BigQuery-based metric collection
Real-time alerting policies
SLO/SLA dashboards
Terraform monitoring module

Quick Checks

# Check the Cloud Run service status
gcloud run services describe dta-wide-api \
  --project=dta-cloud-de-dev \
  --region=europe-west3 \
  --format="table(status.conditions[0].type,status.conditions[0].status)"

# Look up the latest ready revision (instance counts are reported by Cloud Monitoring)
gcloud run services describe dta-wide-api \
  --project=dta-cloud-de-dev \
  --region=europe-west3 \
  --format="value(status.latestReadyRevisionName)"

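🔄 Per-Environment Rollout Strategy

1. Purpose and Configuration Direction per Environment

Environment	Primary purpose	User base	Configuration direction	Availability target
Dev	Development and testing	Dev team (~10)	Scale-to-zero, developer convenience first	95%
Stage	Beta testing	Beta testers (~30)	Scale-to-zero, small-scale test environment	98%
Prod	Production service	Registered users (10,000)	Sized for real traffic, 99.9% availability	99.9%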
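
To make the availability targets in the table concrete, the sketch below converts each percentage into the downtime it permits over a 30-day month. The function name and the 30-day window are assumptions chosen for illustration.

```typescript
// Downtime allowed by an availability target over a window of `days` days.
function allowedDowntimeMinutes(availabilityPercent: number, days = 30): number {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}

for (const target of [95, 98, 99.9]) {
  const minutes = allowedDowntimeMinutes(target);
  console.log(`${target}%: ${minutes.toFixed(1)} min (${(minutes / 60).toFixed(1)} h) per 30 days`);
}
```

A 99.9% target leaves about 43 minutes of downtime per 30-day month, whereas 98% allows about 14.4 hours and 95% about 36 hours. This is why only Prod keeps warm instances.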

2. Guide to Creating Per-Environment Configuration Files

Stage environment configuration

# Create cloudrun-deploy/dta-wide-api/cloudrun/stage/terragrunt.hcl
mkdir -p cloudrun-deploy/dta-wide-api/cloudrun/stage
cp cloudrun-deploy/dta-wide-api/cloudrun/dev/terragrunt.hcl \
   cloudrun-deploy/dta-wide-api/cloudrun/stage/terragrunt.hcl

# Then adjust the copy for the Stage environment

# cloudrun-deploy/dta-wide-api/cloudrun/stage/terragrunt.hcl
locals {
  project = "dta-cloud-de-stage"
  # ... other settings
}

scaling = {
  min_instance_count = "0"  # 30 beta testers; scale to zero to keep cost low
  max_instance_count = "5"  # Enough for a small beta test
}

resources = {
  limits = {
    cpu    = "1"     # Adequate CPU for beta testing
    memory = "1Gi"   # Enough memory for small-scale testing
  }
}

annotations = {
  "autoscaling.knative.dev/minScale" = "0"
  "autoscaling.knative.dev/maxScale" = "5"
  "run.googleapis.com/cpu-throttling" = "true"  # Cost takes priority in the beta environment
}

Prod environment configuration

# cloudrun-deploy/dta-wide-api/cloudrun/prod/terragrunt.hcl
locals {
  project = "dta-cloud-de-prod"
  # ... other settings
}

scaling = {
  min_instance_count = "2"   # Safe minimum for about 100 concurrent users
  max_instance_count = "15"  # Headroom for peak time and growth
}

resources = {
  limits = {
    cpu    = "2"     # Stable performance
    memory = "2Gi"   # Adequate memory
  }
}

max_instance_request_concurrency = "80"  # Concurrent requests per instance, sized for expected traffic

annotations = {
  "autoscaling.knative.dev/minScale" = "2"
  "autoscaling.knative.dev/maxScale" = "15"
  "run.googleapis.com/cpu-throttling" = "false"  # CPU always allocated; performance takes priority
}

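📈 Performance Benchmarks

1. Performance Figures per Environment

Environment	Min instances	Max instances	CPU/Memory	Concurrency	Expected RPS	Intended users	Availability target
Dev	0 (scale-to-zero)	20	1 vCPU / 1 GiB	80	1,600	Dev team (~10)	95%
Stage	0 (scale-to-zero)	5	1 vCPU / 1 GiB	80	400	Beta testers (~30)	98%
Prod	2	15	2 vCPU / 2 GiB	80	2,400	Registered users (10,000)	99.9%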
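
The expected RPS column can be reproduced with a simple capacity model: the maximum number of in-flight requests (max instances times concurrency) divided by the average request latency. The sketch below applies that model. The latency values are assumptions, since the guide does not state them: 1 s reproduces the Dev and Stage figures, and 0.5 s reproduces the Prod figure.

```typescript
// Upper bound on sustained requests per second for a fully scaled-out service,
// assuming every instance runs at its configured concurrency.
function maxRps(maxInstances: number, concurrency: number, avgLatencySeconds: number): number {
  if (avgLatencySeconds <= 0) throw new Error("latency must be positive");
  return (maxInstances * concurrency) / avgLatencySeconds;
}

console.log(maxRps(20, 80, 1));   // Dev:   1600
console.log(maxRps(5, 80, 1));    // Stage: 400
console.log(maxRps(15, 80, 1));   // Prod at an assumed 1 s latency:   1200
console.log(maxRps(15, 80, 0.5)); // Prod at an assumed 0.5 s latency: 2400
```

Because the result scales inversely with latency, these figures are best treated as ceilings to confirm with a load test rather than as measured throughput.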

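💡 Traffic Analysis per Environment

Dev environment

Users: dev team (~10)
Usage pattern: development and testing, irregular use
Rationale: with scale-to-zero there is no compute cost while the service is idle

Stage environment

Users: beta testers (~30)
Concurrent users: 5-10 (up to roughly 30% of testers at once)
Expected API calls: 10-50 RPS
Rationale: sufficient for small-scale beta testing; scale-to-zero keeps cost low

Prod environment

Total users: 10,000 (medical device software)
Concurrent users: 100 (1% of users at once)
Expected API calls: 100-200 RPS (normal), 300-600 RPS (peak)
Safety margin: 4x headroom over the 600 RPS peak, so up to 2,400 RPS can be handled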
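
The Prod traffic figures can be turned into an instance count with Little's law (in-flight requests = arrival rate × latency). The sketch below does this for the normal and peak rates listed above. The 0.5 s average latency is an assumption for illustration, and `requiredInstances` is a hypothetical helper.

```typescript
// Instances needed to serve `rps` requests per second when each instance
// handles up to `concurrency` requests at once.
function requiredInstances(rps: number, avgLatencySeconds: number, concurrency: number): number {
  const inFlight = rps * avgLatencySeconds; // Little's law: L = lambda * W
  return Math.max(1, Math.ceil(inFlight / concurrency));
}

const assumedLatencySeconds = 0.5; // assumption, not a measured value

console.log(requiredInstances(200, assumedLatencySeconds, 80)); // normal load: 2
console.log(requiredInstances(600, assumedLatencySeconds, 80)); // peak load:   4
```

Under this assumption, the minimum of 2 instances covers normal load and the maximum of 15 leaves nearly 4x headroom over the 4 instances needed at peak.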

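2. Cost Optimization Effect

Dev environment (scale-to-zero)

# Estimated monthly cost for Dev
# (single quotes keep the shell from expanding "$2", "$1", and so on inside the amounts)
echo "Dev Environment Cost (Scale-to-zero):"
echo "Usage pattern: 5 days/week, 8 hours/day (160 hours/month)"
echo 'Monthly cost: ~$20-50/month'
echo "Cold start: 1-3 s delay on the first request (acceptable in development)"

Stage environment (scale-to-zero for beta testers)

# Estimated monthly cost for Stage
echo "Stage Environment Cost (Scale-to-zero):"
echo "Users: 30 beta testers"
echo "Usage pattern: irregular, active only during beta test periods"
echo 'Monthly cost: ~$10-30/month'
echo "Cold start: acceptable in the beta test environment"

Prod environment (production service)

# Estimated monthly cost for Prod
echo "Prod Environment Cost:"
echo "Min instances: 2 * 2 vCPU * 24h * 30d = 2,880 vCPU-hours"
echo "Memory: 2 * 2 GiB * 24h * 30d = 2,880 GiB-hours"
echo 'Monthly cost: ~$150-250/month'
echo "Targets 99.9% availability"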
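
The baseline usage behind the Prod estimate can be recomputed as follows. This is a sketch only: unit prices are passed in as parameters rather than hard-coded because rates vary by region and billing mode, and `UnitPrices`, `baselineUsage`, and `baselineCost` are hypothetical names.

```typescript
// Unit prices must be supplied by the caller from the published price list.
interface UnitPrices {
  perVcpuHour: number;
  perGiBHour: number;
}

interface BaselineUsage {
  vcpuHours: number;
  gibHours: number;
}

// Usage generated by the always-on minimum instances over one month.
function baselineUsage(
  minInstances: number,
  vcpuPerInstance: number,
  gibPerInstance: number,
  hoursPerMonth = 24 * 30,
): BaselineUsage {
  return {
    vcpuHours: minInstances * vcpuPerInstance * hoursPerMonth,
    gibHours: minInstances * gibPerInstance * hoursPerMonth,
  };
}

// Cost of the baseline usage at the supplied unit prices.
function baselineCost(usage: BaselineUsage, prices: UnitPrices): number {
  return usage.vcpuHours * prices.perVcpuHour + usage.gibHours * prices.perGiBHour;
}

// Prod: 2 minimum instances with 2 vCPU and 2 GiB each.
const prodUsage = baselineUsage(2, 2, 2);
console.log(prodUsage); // { vcpuHours: 2880, gibHours: 2880 }
```

The result covers only the two always-on instances. Instances added by autoscaling, request charges, and network egress would come on top of it.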

Overall Cost Summary

echo "💰 Monthly cost per environment:"
echo 'Dev   : ~$20-50/month   (dev team of 10)'
echo 'Stage : ~$10-30/month   (30 beta testers)'
echo 'Prod  : ~$150-250/month (10,000 registered users)'
echo ""
echo 'Total : ~$180-330/month'

3. Cold Start Mitigation (Dev and Stage)

Ways to reduce the impact of cold starts caused by scale-to-zero in Dev and Stage:

# Send a warm-up request before starting work
curl https://dta-wide-api-dev.weltcorp.com/health
curl https://dta-wide-api-stage.weltcorp.com/health

# Alternatively, use a simple warm-up script (saved as its own file):

#!/bin/bash
# scripts/warmup-environments.sh
echo "🔥 Warming up development environments..."

# Warm up the Dev environment
echo "Warming up DEV environment..."
for i in {1..3}; do
  curl -s "https://dta-wide-api-dev.weltcorp.com/health" > /dev/null
  echo "Dev warmup request $i sent"
  sleep 1
done

# Warm up the Stage environment (before a beta test begins)
echo "Warming up STAGE environment..."
for i in {1..2}; do
  curl -s "https://dta-wide-api-stage.weltcorp.com/health" > /dev/null
  echo "Stage warmup request $i sent"
  sleep 1
done

echo "✅ Dev and Stage environments ready!"

Preparing the Stage Environment Before a Beta Test

#!/bin/bash
# scripts/prepare-beta-test.sh
# Warm up the Stage environment before a beta test begins
echo "🧪 Preparing Stage environment for beta testing..."

# Poll the health endpoint until it answers 200 (the first request triggers a cold start)
for attempt in 1 2 3 4 5; do
  STAGE_HEALTH=$(curl -s -o /dev/null -w "%{http_code}" https://dta-wide-api-stage.weltcorp.com/health)
  if [ "$STAGE_HEALTH" = "200" ]; then
    echo "✅ Stage environment is ready for beta testing"
    exit 0
  fi
  echo "⚠️ Stage environment is starting up... (attempt $attempt, HTTP $STAGE_HEALTH)"
  sleep 3
done

echo "❌ Stage environment did not become healthy in time"
exit 1
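
The same wait-until-ready logic can also live in application tooling. Below is a minimal TypeScript sketch of a bounded retry loop. The `StatusProbe` type and `waitUntilHealthy` function are hypothetical, and the probe is injected so that the loop does not depend on any particular HTTP client.

```typescript
// A probe resolves to the HTTP status code returned by the health endpoint.
type StatusProbe = () => Promise<number>;

// Calls the probe until it returns 200 or maxAttempts is exhausted.
// Returns true as soon as the service reports healthy, false otherwise.
async function waitUntilHealthy(
  probe: StatusProbe,
  maxAttempts = 5,
  delayMs = 1000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // A network error counts as a failed attempt.
    }
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

On a runtime with a global `fetch`, the probe could be `() => fetch("https://dta-wide-api-stage.weltcorp.com/health").then((r) => r.status)`. Unlike a fixed sleep, the loop reports failure when the service never becomes healthy.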

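🚦 Next Steps

1. Apply the Cloud Run Configuration

cd cloudrun-deploy/dta-wide-api/cloudrun/dev
terragrunt plan   # Review the changes
terragrunt apply  # Apply them

2. Implement Health Checks

Implement the /health and /health/ready endpoints in the application
Verify whether gops-terraform-module supports health checks

3. Integrate Comprehensive Monitoring

📖 Building the availability monitoring system

Deploy the Terraform monitoring module
Set up the SLO dashboard
Configure alerting policies

4. Gradual Expansion

Stage environment: apply advanced settings
Prod environment: configure for maximum stability

With this guide, you can build a robust Cloud Run configuration capable of meeting the 99.9% availability target.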
