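PLT-NFR-008 Metrics Collection Implementation Guide

A comprehensive metrics collection system based on the OpenTelemetry and Prometheus standards

📋 Table of Contents

Overview
Architecture Design
Implementation Solution
OpenTelemetry Integration
Prometheus Metrics
Monitoring System
Deployment Guide
Operating Procedures
Troubleshooting
Performance Metrics
Next Steps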

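1. Overview

1.1 PLT-NFR-008 Requirement

The system must provide comprehensive metrics collection that conforms to the OpenTelemetry and Prometheus standards.

Business impact (DTx platform)

Performance optimization: real-time performance monitoring improves the user experience
Operational efficiency: automated metrics collection reduces operating costs
Regulatory compliance: performance tracking and auditing for medical device software
Data-driven decisions: system optimization based on metrics

1.2 Technical Goals

Goal	Measure	Target
Metrics collection coverage	Metrics collected from all core services	100%
Collection latency	Time from metric creation to storage	< 30 s
Collection reliability	Metric loss rate	< 0.1%
System overhead	Impact on application performance	< 2%
Cost efficiency	Cost optimized per environment	Dev: $30, Stage: $80, Prod: $400

1.3 Implementation Scope

Environment	Collection target	Metric count	Retention	Cost optimization
Dev	Core metrics	~50	3 days	90% reduction
Stage	Extended metrics	~150	14 days	70% reduction
Prod	All metrics	~500	90 days	Full collection

1.4 Related Documents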

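2. Architecture Design

2.1 Metrics Collection Principles

Industry standards compliance

OpenTelemetry: the CNCF-standard observability framework
Prometheus: the standard for metrics collection and time-series storage
OTel semantic conventions: consistent metric naming and attributes
Cloud native: optimized for cloud-native environments

Supported metric types

// Metric types used in this guide
enum MetricType {
  COUNTER = 'counter',      // Cumulative value (request count, error count)
  GAUGE = 'gauge',          // Current value (CPU utilization, memory)
  HISTOGRAM = 'histogram',  // Distribution (response time, request size)
  SUMMARY = 'summary'       // Quantiles (latency percentiles); a Prometheus type, not an OpenTelemetry API instrument
}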
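
To make the difference between these types concrete, the following is a minimal in-memory sketch of counter, gauge, and histogram semantics. The class names (SketchCounter, SketchGauge, SketchHistogram) are hypothetical and exist only for illustration; they are not the OpenTelemetry or prom-client API.

```typescript
// Counter: a cumulative value that only ever increases.
class SketchCounter {
  total = 0;
  add(value: number): void {
    if (value < 0) throw new Error('a counter only goes up');
    this.total += value;
  }
}

// Gauge: the most recent value wins.
class SketchGauge {
  current = 0;
  set(value: number): void {
    this.current = value;
  }
}

// Histogram: counts observations per bucket and keeps a running sum.
class SketchHistogram {
  readonly counts: number[] = [];
  sum = 0;

  constructor(readonly bounds: number[]) {
    // One bucket per upper bound, plus one overflow bucket.
    for (let i = 0; i <= bounds.length; i++) this.counts.push(0);
  }

  record(value: number): void {
    let index = this.bounds.length; // overflow bucket by default
    for (let i = 0; i < this.bounds.length; i++) {
      if (value <= this.bounds[i]) {
        index = i;
        break;
      }
    }
    this.counts[index] += 1;
    this.sum += value;
  }
}

const requests = new SketchCounter();
requests.add(1);
requests.add(1);

const memoryMb = new SketchGauge();
memoryMb.set(512);
memoryMb.set(480);

const latencyMs = new SketchHistogram([100, 500]);
[42, 180, 950].forEach((v) => latencyMs.record(v));
```

After these calls the counter holds the running total, the gauge holds only the last value, and the histogram holds one observation in each of its three buckets.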

2.2 System Architecture

2.3 Data Flow

2.3.1 OpenTelemetry flow

Application Code
  ↓ (SDK auto-instrumentation)
OpenTelemetry SDK
  ↓ (OTLP protocol)
OpenTelemetry Collector
  ↓ (batching, filtering)
Pub/Sub Topic (otel-metrics)
  ↓ (schema validation)
BigQuery Table (otel_metrics)
  ↓ (analysis queries)
Monitoring Dashboard

2.3.2 Prometheus flow

Application /metrics Endpoint
  ↓ (HTTP scraping)
Prometheus Scraper
  ↓ (metric parsing)
Pub/Sub Topic (prometheus-metrics)
  ↓ (JSON conversion)
BigQuery Table (prometheus_metrics)
  ↓ (time-series analysis)
Analytics & Alerts
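
The parsing and JSON conversion steps of the Prometheus flow can be sketched roughly as follows. This is a simplified illustration, not the scraper's actual implementation: the PrometheusSample shape and the parseSampleLine function are hypothetical, the real BigQuery schema is not shown in this guide, and the parser ignores timestamps and escaped characters in label values.

```typescript
// Hypothetical JSON record shape for one parsed sample.
interface PrometheusSample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parses one sample line such as: http_requests_total{method="GET",code="200"} 1027
// Simplification: label values must not contain commas or escaped quotes.
function parseSampleLine(line: string): PrometheusSample | null {
  const trimmed = line.trim();
  // Skip blank lines and "# HELP" / "# TYPE" comment lines.
  if (trimmed === '' || trimmed.charAt(0) === '#') return null;

  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/.exec(trimmed);
  if (!match) return null;

  const labels: Record<string, string> = {};
  const rawLabels = match[2] ?? '';
  for (const pair of rawLabels.split(',')) {
    const labelMatch = /^\s*([a-zA-Z_][a-zA-Z0-9_]*)="(.*)"\s*$/.exec(pair);
    if (labelMatch) labels[labelMatch[1]] = labelMatch[2];
  }

  return { name: match[1], labels, value: Number(match[3]) };
}

const sample = parseSampleLine(
  'dta_wide_http_requests_total{method="GET",status_code="200"} 1027',
);
const sampleAsJson = JSON.stringify(sample);
```

The resulting JSON string is the kind of payload that could be published to the prometheus-metrics topic before it is loaded into BigQuery.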

2.4 Core Components

OpenTelemetry Collector configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  # Per-environment sampling is not configured here: the probabilistic_sampler
  # processor supports traces and logs only and cannot run in a metrics pipeline.
  
  # Resource detection
  resourcedetection:
    detectors: [gcp, env]
    
  # Metric transformation
  metricstransform:
    transforms:
      - include: ".*"
        match_type: regexp
        action: update
        operations:
          - action: add_label
            new_label: environment
            new_value: ${ENVIRONMENT}

exporters:
  # Google Cloud Monitoring
  googlecloud:
    project: ${PROJECT_ID}
    
  # Send to BigQuery via Pub/Sub (the topic must be a fully qualified resource name)
  googlecloudpubsub:
    project: ${PROJECT_ID}
    topic: projects/${PROJECT_ID}/topics/otel-metrics-${ENVIRONMENT}
    
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resourcedetection, metricstransform, batch]
      exporters: [googlecloud, googlecloudpubsub]
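
The batch processor's two settings, send_batch_size and timeout, interact as "flush when the batch is full, or when the oldest item has waited long enough". The sketch below illustrates that idea only; BatchBuffer is a hypothetical class and does not reflect the Collector's internal implementation.

```typescript
// Illustrative batching buffer: flushes on size or on timeout, whichever comes first.
class BatchBuffer<T> {
  private items: T[] = [];
  private firstItemAt: number | null = null;

  constructor(
    private readonly sendBatchSize: number,
    private readonly timeoutMs: number,
    private readonly flushFn: (batch: T[]) => void,
  ) {}

  // nowMs is passed in explicitly so the behaviour is deterministic.
  add(item: T, nowMs: number): void {
    if (this.firstItemAt === null) this.firstItemAt = nowMs;
    this.items.push(item);
    if (this.items.length >= this.sendBatchSize) this.flush();
  }

  // Called periodically; flushes a partial batch once the timeout has elapsed.
  tick(nowMs: number): void {
    if (this.firstItemAt !== null && nowMs - this.firstItemAt >= this.timeoutMs) {
      this.flush();
    }
  }

  private flush(): void {
    if (this.items.length === 0) return;
    this.flushFn(this.items);
    this.items = [];
    this.firstItemAt = null;
  }
}

const flushedBatches: number[][] = [];
const buffer = new BatchBuffer<number>(3, 1000, (batch) => {
  flushedBatches.push(batch);
});

buffer.add(1, 0);
buffer.add(2, 10);
buffer.add(3, 20);  // batch size reached, so [1, 2, 3] is flushed
buffer.add(4, 30);
buffer.tick(500);   // timeout not reached yet, nothing is flushed
buffer.tick(1030);  // 1000 ms since item 4 arrived, so [4] is flushed
```

A short timeout keeps end-to-end latency low at the cost of more, smaller export requests; a larger batch size does the opposite.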

3. Implementation Solution

3.1 Using the Terraform module

Basic deployment (Dev environment)

# terragrunt/dev/metrics-collection/terragrunt.hcl
terraform {
  source = "../../../../infrastructure/terraform/modules/metrics-collection"
}

include "root" {
  path = find_in_parent_folders()
}

inputs = {
  project_id  = "dta-cloud-de-dev"
  environment = "dev"
  region      = "europe-west3"
  
  # Cost optimization settings for the Dev environment
  custom_metrics_config = {
    retention_days_override      = 3      # Retain for 3 days
    sampling_rate_override       = 0.1    # 10% sampling
    scrape_interval_override     = "30s"  # 30-second interval
    enable_detailed_metrics      = false  # Basic metrics only
    enable_cost_optimization     = true   # Enable cost optimization
  }
  
  # Prometheus scrape targets
  prometheus_config = {
    scrape_targets = [
      {
        job_name = "dta-wide-api-dev"
        targets  = ["dta-wide-api-dev.run.app"]
        metrics_path = "/metrics"
        scrape_interval = "30s"
        scheme = "https"
        labels = {
          service = "api"
          env     = "dev"
        }
      }
    ]
  }
  
  notification_channels = [
    dependency.monitoring.outputs.notification_channel_email_id,
    dependency.monitoring.outputs.notification_channel_slack_id
  ]
}

dependency "monitoring" {
  config_path = "../monitoring"
}

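3.2 Per-environment configuration

Setting	Dev	Stage	Prod
Retention	3 days	14 days	90 days
Sampling rate	10%	30%	100%
Collection interval	30 s	15 s	10 s
Detailed metrics	Disabled	Enabled	Enabled
Cross-region backup	Disabled	Disabled	Enabled
Alert thresholds	Lenient	Moderate	Strict

3.3 Cost optimization strategy

Estimated cost per environment

# Run the cost calculator
./scripts/metrics-cost-calculator.sh

# Example output:
# ==========================================
# Metrics collection system cost analysis
# ==========================================
# 
# Dev:    $25/month  (90% optimized)
# Stage:  $65/month  (70% optimized) 
# Prod:   $350/month (full collection)
# 
# Total estimated cost: $440/month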
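
One way to keep these per-environment values in a single place in application code is a typed lookup table. The sketch below copies the numbers from the table above; the type, field, and function names (MetricsEnvConfig, METRICS_ENV_CONFIG, resolveMetricsConfig) are hypothetical, and rejecting unknown environment names is an assumed design choice.

```typescript
type Environment = 'dev' | 'stage' | 'prod';

interface MetricsEnvConfig {
  retentionDays: number;
  samplingRate: number; // fraction between 0 and 1
  collectionIntervalSeconds: number;
  detailedMetrics: boolean;
  crossRegionBackup: boolean;
}

// Values taken from the per-environment configuration table.
const METRICS_ENV_CONFIG: Record<Environment, MetricsEnvConfig> = {
  dev: {
    retentionDays: 3,
    samplingRate: 0.1,
    collectionIntervalSeconds: 30,
    detailedMetrics: false,
    crossRegionBackup: false,
  },
  stage: {
    retentionDays: 14,
    samplingRate: 0.3,
    collectionIntervalSeconds: 15,
    detailedMetrics: true,
    crossRegionBackup: false,
  },
  prod: {
    retentionDays: 90,
    samplingRate: 1.0,
    collectionIntervalSeconds: 10,
    detailedMetrics: true,
    crossRegionBackup: true,
  },
};

// Falls back to 'dev' when the variable is unset, and rejects unknown names.
function resolveMetricsConfig(deployEnv: string | undefined): MetricsEnvConfig {
  const env = deployEnv ?? 'dev';
  if (env === 'dev' || env === 'stage' || env === 'prod') {
    return METRICS_ENV_CONFIG[env];
  }
  throw new Error(`Unknown environment: ${env}`);
}

const prodConfig = resolveMetricsConfig('prod');
const defaultConfig = resolveMetricsConfig(undefined);
```

Failing fast on an unknown environment name avoids silently applying Dev retention or sampling settings to a misconfigured deployment.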

4. OpenTelemetry Integration

4.1 NestJS application integration

OpenTelemetry SDK setup

// src/telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Per-environment settings
const isProduction = process.env.NODE_ENV === 'production';
const environment = process.env.DEPLOY_ENV || 'dev';

// Metric exporter configuration
const metricExporter = new OTLPMetricExporter({
  url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 
       `https://otel-collector-${environment}.europe-west3.run.app/v1/metrics`,
  headers: {
    'Content-Type': 'application/json',
  }
});

// Metric reader configuration
const metricReader = new PeriodicExportingMetricReader({
  exporter: metricExporter,
  exportIntervalMillis: isProduction ? 10000 : 30000, // Export interval: 10 s in production, 30 s elsewhere
});

// Initialize the OpenTelemetry SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'dta-wide-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: environment,
    [SemanticResourceAttributes.SERVICE_INSTANCE_ID]: process.env.HOSTNAME || 'local',
  }),
  
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument HTTP requests
      '@opentelemetry/instrumentation-http': {
        enabled: true,
        ignoreIncomingRequestHook: (req) => {
          // Exclude health check and metrics scrape requests
          const url = req.url ?? '';
          return url.includes('/health') || url.includes('/metrics');
        }
      },
      
      // Auto-instrument Express
      '@opentelemetry/instrumentation-express': {
        enabled: true
      },
      
      // Auto-instrument the database (PostgreSQL)
      '@opentelemetry/instrumentation-pg': {
        enabled: true
      },
      
      // Auto-instrument Redis
      '@opentelemetry/instrumentation-redis': {
        enabled: true
      },
      
      // Disable unneeded instrumentation (cost reduction)
      '@opentelemetry/instrumentation-fs': {
        enabled: false
      }
    })
  ],
  
  metricReader,
});

// Start the SDK
export function initializeTelemetry(): void {
  sdk.start();
  console.log('OpenTelemetry initialized successfully');
}

// Clean up on application shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry terminated'))
    .catch((error) => console.error('Error terminating OpenTelemetry', error))
    .finally(() => process.exit(0));
});

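Main application integration

// src/main.ts
import { initializeTelemetry } from './telemetry';

// Initialize OpenTelemetry before anything else so that
// auto-instrumentation can patch modules as they are loaded
initializeTelemetry();

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app/app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  
  // Existing bootstrap code...
  // Placeholder default; keep the project's existing port setting if it has one
  const port = process.env.PORT ?? 3000;
  
  await app.listen(port);
  console.log(`🚀 Application is running on: http://localhost:${port}/v1`);
}

bootstrap();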

4.2 Custom metrics implementation

Collecting business metrics

// src/metrics/business-metrics.service.ts
import { Injectable } from '@nestjs/common';
import { metrics } from '@opentelemetry/api';

@Injectable()
export class BusinessMetricsService {
  private readonly meter = metrics.getMeter('dta-wide-business-metrics', '1.0.0');
  
  // User metrics
  private readonly activeUsersGauge = this.meter.createUpDownCounter('dta_users_active_total', {
    description: 'Number of currently active users',
    unit: '1'
  });
  
  private readonly userSessionsCounter = this.meter.createCounter('dta_user_sessions_total', {
    description: 'Total number of user sessions',
    unit: '1'
  });
  
  // Sleep log metrics
  private readonly sleepLogsCounter = this.meter.createCounter('dta_sleep_logs_created_total', {
    description: 'Total number of sleep logs created',
    unit: '1'
  });
  
  private readonly sleepLogDurationHistogram = this.meter.createHistogram('dta_sleep_log_duration_hours', {
    description: 'Distribution of sleep duration',
    unit: 'h'
  });
  
  // Questionnaire metrics
  private readonly questionnaireCompletedCounter = this.meter.createCounter('dta_questionnaires_completed_total', {
    description: 'Total number of completed questionnaires',
    unit: '1'
  });
  
  // API performance metrics
  private readonly apiRequestDuration = this.meter.createHistogram('dta_api_request_duration_ms', {
    description: 'API request processing time',
    unit: 'ms'
  });
  
  // Metric recording methods
  recordActiveUser(increment: number = 1, attributes: Record<string, string> = {}): void {
    this.activeUsersGauge.add(increment, attributes);
  }
  
  recordUserSession(attributes: Record<string, string> = {}): void {
    this.userSessionsCounter.add(1, {
      environment: process.env.DEPLOY_ENV || 'dev',
      ...attributes
    });
  }
  
  recordSleepLogCreated(durationHours: number, attributes: Record<string, string> = {}): void {
    this.sleepLogsCounter.add(1, attributes);
    this.sleepLogDurationHistogram.record(durationHours, attributes);
  }
  
  recordQuestionnaireCompleted(questionnaireType: string, attributes: Record<string, string> = {}): void {
    this.questionnaireCompletedCounter.add(1, {
      questionnaire_type: questionnaireType,
      ...attributes
    });
  }
  
  recordApiRequest(durationMs: number, method: string, route: string, statusCode: number): void {
    this.apiRequestDuration.record(durationMs, {
      method,
      route,
      status_code: statusCode.toString(),
      environment: process.env.DEPLOY_ENV || 'dev'
    });
  }
}

Extending with a metrics decorator

// src/decorators/otel-measure.decorator.ts
import { Injectable } from '@nestjs/common';
import { metrics, Attributes } from '@opentelemetry/api';

interface OTelMeasureOptions {
  metricName: string;
  description: string;
  unit?: string;
  metricType?: 'counter' | 'histogram' | 'gauge';
  attributes?: Record<string, string>;
}

export function OTelMeasure(options: OTelMeasureOptions) {
  return function (target: any, propertyKey: string, descriptor: PropertyDescriptor) {
    const originalMethod = descriptor.value;
    const meter = metrics.getMeter('dta-wide-custom-metrics', '1.0.0');
    
    // Create the instrument for the metric type and wrap it in one typed record function
    let record: (durationMs: number, failed: boolean, attributes: Attributes) => void;
    switch (options.metricType || 'histogram') {
      case 'counter': {
        const counter = meter.createCounter(options.metricName, {
          description: options.description,
          unit: options.unit || '1'
        });
        record = (_durationMs, _failed, attributes) => counter.add(1, attributes);
        break;
      }
      case 'gauge': {
        const upDownCounter = meter.createUpDownCounter(options.metricName, {
          description: options.description,
          unit: options.unit || '1'
        });
        record = (_durationMs, failed, attributes) => upDownCounter.add(failed ? -1 : 1, attributes);
        break;
      }
      case 'histogram':
      default: {
        const histogram = meter.createHistogram(options.metricName, {
          description: options.description,
          unit: options.unit || 'ms'
        });
        record = (durationMs, _failed, attributes) => histogram.record(durationMs, attributes);
        break;
      }
    }
    
    // The wrapper is async, so a decorated method always returns a Promise
    descriptor.value = async function (this: any, ...args: any[]) {
      const start = Date.now();
      let failed = false;
      
      try {
        return await originalMethod.apply(this, args);
      } catch (err) {
        failed = true;
        throw err;
      } finally {
        // Record the outcome whether the call succeeded or threw
        record(Date.now() - start, failed, {
          method: propertyKey,
          success: failed ? 'false' : 'true',
          environment: process.env.DEPLOY_ENV || 'dev',
          ...options.attributes
        });
      }
    };
    
    return descriptor;
  };
}

// Usage example (CreateSleepLogDto and SleepLogResponseDto are the application's own DTOs)
@Injectable()
export class SleepLogService {
  
  @OTelMeasure({
    metricName: 'dta_sleep_log_creation_duration',
    description: 'Time taken to create a sleep log',
    unit: 'ms',
    metricType: 'histogram',
    attributes: { operation: 'create' }
  })
  async createSleepLog(userId: string, sleepData: CreateSleepLogDto): Promise<SleepLogResponseDto> {
    // Business logic goes here and produces `result`
    return result;
  }
}

5. Prometheus Metrics

5.1 Prometheus client setup

NestJS Prometheus integration

// src/metrics/prometheus.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { register, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

@Injectable()
export class PrometheusService implements OnModuleInit {
  
  onModuleInit() {
    // Enable collection of default system metrics
    collectDefaultMetrics({
      register,
      prefix: 'dta_wide_',
      gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
    });
  }
  
  // HTTP request metrics
  readonly httpRequestsTotal = new Counter({
    name: 'dta_wide_http_requests_total',
    help: 'Total number of HTTP requests',
    labelNames: ['method', 'route', 'status_code', 'environment'],
    registers: [register]
  });
  
  readonly httpRequestDuration = new Histogram({
    name: 'dta_wide_http_request_duration_seconds',
    help: 'HTTP request processing time (seconds)',
    labelNames: ['method', 'route', 'status_code'],
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    registers: [register]
  });
  
  // Database metrics
  readonly dbConnectionsActive = new Gauge({
    name: 'dta_wide_db_connections_active',
    help: 'Number of active database connections',
    labelNames: ['database', 'pool'],
    registers: [register]
  });
  
  readonly dbQueryDuration = new Histogram({
    name: 'dta_wide_db_query_duration_seconds',
    help: 'Database query execution time (seconds)',
    labelNames: ['operation', 'table'],
    buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
    registers: [register]
  });
  
  // Cache metrics
  readonly cacheHitsTotal = new Counter({
    name: 'dta_wide_cache_hits_total',
    help: 'Total number of cache hits',
    labelNames: ['cache_name', 'key_pattern'],
    registers: [register]
  });
  
  readonly cacheMissesTotal = new Counter({
    name: 'dta_wide_cache_misses_total',
    help: 'Total number of cache misses',
    labelNames: ['cache_name', 'key_pattern'],
    registers: [register]
  });
  
  // Business metrics
  readonly activeUsersGauge = new Gauge({
    name: 'dta_wide_users_active_current',
    help: 'Number of currently active users',
    labelNames: ['user_type', 'region'],
    registers: [register]
  });
  
  readonly backgroundJobsTotal = new Counter({
    name: 'dta_wide_background_jobs_total',
    help: 'Total number of background jobs',
    labelNames: ['job_type', 'status'],
    registers: [register]
  });
  
  // Helper methods for recording metrics (durations are passed in milliseconds)
  recordHttpRequest(method: string, route: string, statusCode: number, duration: number): void {
    const labels = {
      method,
      route,
      status_code: statusCode.toString(),
      environment: process.env.DEPLOY_ENV || 'dev'
    };
    
    this.httpRequestsTotal.inc(labels);
    this.httpRequestDuration.observe({ method, route, status_code: statusCode.toString() }, duration / 1000);
  }
  
  recordDbQuery(operation: string, table: string, duration: number): void {
    this.dbQueryDuration.observe({ operation, table }, duration / 1000);
  }
  
  recordCacheOperation(cacheName: string, keyPattern: string, hit: boolean): void {
    const labels = { cache_name: cacheName, key_pattern: keyPattern };
    if (hit) {
      this.cacheHitsTotal.inc(labels);
    } else {
      this.cacheMissesTotal.inc(labels);
    }
  }
  
  updateActiveUsers(count: number, userType: string = 'general', region: string = 'eu'): void {
    this.activeUsersGauge.set({ user_type: userType, region }, count);
  }
  
  recordBackgroundJob(jobType: string, status: 'success' | 'failure'): void {
    this.backgroundJobsTotal.inc({ job_type: jobType, status });
  }
  
  // Return the metrics in Prometheus text format
  async getMetrics(): Promise<string> {
    return register.metrics();
  }
}
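
The bucket lists above are upper bounds, and Prometheus exposes histogram buckets cumulatively: each le bucket counts every observation less than or equal to its bound, with a final +Inf bucket that counts all observations. The following sketch illustrates that bookkeeping; cumulativeBuckets is a hypothetical helper, not part of prom-client.

```typescript
interface BucketCount {
  le: string;
  count: number;
}

// Computes cumulative bucket counts the way a Prometheus histogram reports them.
function cumulativeBuckets(bounds: number[], observations: number[]): BucketCount[] {
  const sorted = bounds.slice().sort((a, b) => a - b);
  const buckets: BucketCount[] = sorted.map((bound) => ({
    le: String(bound),
    count: observations.filter((value) => value <= bound).length,
  }));
  // The +Inf bucket always equals the total number of observations.
  buckets.push({ le: '+Inf', count: observations.length });
  return buckets;
}

// Durations arrive in milliseconds and are converted to seconds,
// mirroring the duration / 1000 conversion in recordHttpRequest.
const durationsSeconds = [4, 30, 240, 1200].map((ms) => ms / 1000);
const exampleBuckets = cumulativeBuckets([0.005, 0.05, 0.5, 1], durationsSeconds);
```

Because buckets are cumulative, a slow request (1.2 s here) only shows up in the +Inf bucket, which is why the largest bound should sit above the latency you actually care about.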

5.2 HTTP middleware integration

Express middleware

// src/middleware/prometheus.middleware.ts
import { Injectable, NestMiddleware } from '@nestjs/common';
import { Request, Response, NextFunction } from 'express';
import { PrometheusService } from '../metrics/prometheus.service';

@Injectable()
export class PrometheusMiddleware implements NestMiddleware {
  
  constructor(private readonly prometheusService: PrometheusService) {}
  
  use(req: Request, res: Response, next: NextFunction) {
    const start = Date.now();
    
    // Record metrics once the response has finished
    res.on('finish', () => {
      const duration = Date.now() - start;
      const route = req.route?.path || req.path;
      
      this.prometheusService.recordHttpRequest(
        req.method,
        route,
        res.statusCode,
        duration
      );
    });
    
    next();
  }
}

Metrics endpoint

// src/controllers/metrics.controller.ts
import { Controller, Get, Header } from '@nestjs/common';
import { PrometheusService } from '../metrics/prometheus.service';

@Controller('metrics')
export class MetricsController {
  
  constructor(private readonly prometheusService: PrometheusService) {}
  
  @Get()
  @Header('Content-Type', 'text/plain')
  async getMetrics(): Promise<string> {
    return this.prometheusService.getMetrics();
  }
}

5.3 Application module integration

// src/app/app.module.ts
import { Module, MiddlewareConsumer, NestModule } from '@nestjs/common';
import { PrometheusService } from '../metrics/prometheus.service';
import { PrometheusMiddleware } from '../middleware/prometheus.middleware';
import { MetricsController } from '../controllers/metrics.controller';
import { BusinessMetricsService } from '../metrics/business-metrics.service';

@Module({
  providers: [
    PrometheusService,
    BusinessMetricsService,
  ],
  controllers: [
    MetricsController,
  ],
  exports: [
    PrometheusService,
    BusinessMetricsService,
  ]
})
export class MetricsModule {}

@Module({
  imports: [
    MetricsModule,
    // Other modules...
  ],
})
export class AppModule implements NestModule {
  configure(consumer: MiddlewareConsumer) {
    consumer
      .apply(PrometheusMiddleware)
      .forRoutes('*'); // Apply the metrics middleware to all routes
  }
}

6. Monitoring System

6.1 SLI/SLO definitions

Metrics collection SLI

-- Metrics collection success rate SLI
-- Example BigQuery query
SELECT
  TIMESTAMP_TRUNC(timestamp, HOUR) as time_window,
  COUNT(*) as total_metrics_received,
  COUNTIF(processing_error IS NULL) as successful_metrics,
  SAFE_DIVIDE(
    COUNTIF(processing_error IS NULL),
    COUNT(*)
  ) * 100 as success_rate_percent
FROM `{PROJECT_ID}.metrics.otel_metrics`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY time_window
ORDER BY time_window DESC;

Metrics latency SLI

-- Metrics collection latency SLI
-- APPROX_QUANTILES is an aggregate function, so it works with GROUP BY time_window
SELECT
  TIMESTAMP_TRUNC(timestamp, HOUR) as time_window,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(
      storage_timestamp, 
      metric_timestamp, 
      SECOND
    ), 100
  )[OFFSET(95)] as p95_latency_seconds,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(
      storage_timestamp, 
      metric_timestamp, 
      SECOND
    ), 100
  )[OFFSET(50)] as p50_latency_seconds
FROM `{PROJECT_ID}.metrics.otel_metrics`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND storage_timestamp IS NOT NULL
  AND metric_timestamp IS NOT NULL
GROUP BY time_window
ORDER BY time_window DESC;
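
For intuition about what the latency SLI measures, the sketch below computes P50 and P95 from a small set of latency samples and compares P95 with the 30-second target from section 1.2. It uses the nearest-rank definition, which is an assumption made for simplicity; BigQuery's quantile functions may return slightly different values because they approximate or interpolate. The sample values are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample with at least p percent
// of all samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const LATENCY_SLO_SECONDS = 30; // collection latency target from section 1.2

// Illustrative storage_timestamp - metric_timestamp differences, in seconds.
const latenciesSeconds = [2, 3, 3, 4, 5, 5, 6, 8, 12, 41];

const p50LatencySeconds = percentile(latenciesSeconds, 50);
const p95LatencySeconds = percentile(latenciesSeconds, 95);
const latencySloMet = p95LatencySeconds < LATENCY_SLO_SECONDS;
```

In this made-up sample the median looks healthy while a single slow point pushes P95 over the target, which is the reason the SLI tracks a high percentile rather than the average.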

6.2 Alert Policies

Metrics collection outage alert

# Cloud Monitoring Alert Policy
# The display name is kept in Korean because operational scripts filter on it;
# it reads "Metrics collection system outage (environment: {ENVIRONMENT})"
displayName: "메트릭 수집 시스템 장애 (환경: {ENVIRONMENT})"
documentation:
  content: |
    The metrics collection system is experiencing an outage.
    
    Troubleshooting guide:
    1. Check the status of the OpenTelemetry Collector
    2. Check the status of the Pub/Sub topics and subscriptions
    3. Check the application logs for metric export errors
    4. Check network connectivity
    
    Escalation: page the infrastructure team if not resolved within 15 minutes

conditions:
  # Note: each threshold filter also needs a metric.type clause; this guide does not name the metric
  - displayName: "OpenTelemetry metrics collection stopped"
    conditionThreshold:
      filter: 'resource.type="pubsub_topic" AND resource.labels.topic_id="otel-metrics-{ENVIRONMENT}"'
      comparison: COMPARISON_LT
      thresholdValue: 10  # Intended: alert below 10 per minute (note that ALIGN_RATE reports a per-second rate)
      duration: "300s"    # Sustained for 5 minutes
      aggregations:
        - alignmentPeriod: "60s"
          perSeriesAligner: ALIGN_RATE
          crossSeriesReducer: REDUCE_SUM
          
  - displayName: "BigQuery storage error rate spike"
    conditionThreshold:
      filter: 'resource.type="pubsub_subscription" AND resource.labels.subscription_id="otel-metrics-bigquery-{ENVIRONMENT}"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05  # Error rate above 5%
      duration: "180s"
      aggregations:
        - alignmentPeriod: "60s"
          perSeriesAligner: ALIGN_RATE

alertStrategy:
  autoClose: "604800s"  # Auto-close after 7 days
  
notificationChannels:
  - "projects/{PROJECT_ID}/notificationChannels/{EMAIL_CHANNEL_ID}"
  - "projects/{PROJECT_ID}/notificationChannels/{SLACK_CHANNEL_ID}"

6.3 Dashboard Configuration

Metrics collection overview dashboard

{
  "displayName": "Metrics Collection System Overview ({ENVIRONMENT})",
  "mosaicLayout": {
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Metric ingestion rate (per hour)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "resource.type=\"pubsub_topic\" AND resource.labels.topic_id=\"otel-metrics-{ENVIRONMENT}\"",
                    "aggregation": {
                      "alignmentPeriod": "3600s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM"
                    }
                  }
                },
                "plotType": "LINE",
                "targetAxis": "Y1"
              }
            ],
            "yAxis": {
              "label": "Metrics/hour",
              "scale": "LINEAR"
            }
          }
        }
      },
      {
        "width": 6,
        "height": 4,
        "xPos": 6,
        "widget": {
          "title": "Distribution by metric type",
          "pieChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilterRatio": {
                    "numeratorFilter": "resource.type=\"bigquery_table\" AND resource.labels.table_id=\"otel_metrics\"",
                    "denominatorFilter": "resource.type=\"bigquery_table\""
                  }
                }
              }
            ]
          }
        }
      },
      {
        "width": 12,
        "height": 4,
        "yPos": 4,
        "widget": {
          "title": "Metric latency (P95)",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"custom.googleapis.com/metrics/collection_latency\"",
                    "aggregation": {
                      "alignmentPeriod": "300s",
                      "perSeriesAligner": "ALIGN_PERCENTILE_95"
                    }
                  }
                },
                "plotType": "LINE"
              }
            ],
            "yAxis": {
              "label": "Latency (seconds)",
              "scale": "LINEAR"
            },
            "thresholds": [
              {
                "value": 30.0,
                "color": "RED",
                "direction": "ABOVE",
                "label": "SLO Threshold (30s)"
              }
            ]
          }
        }
      }
    ]
  }
}

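7. Detailed Implementation Steps

7.1 Overview: roadmap for building the metrics collection system

A six-step plan for building the OpenTelemetry/Prometheus metrics collection system.

Understanding the two collection modes

🔄 Push (OTLP): the application actively sends metrics to the Collector

Advantages: near real-time delivery, network efficiency, centralized processing
Use cases: transaction metrics, business events

📡 Pull (Prometheus): the Collector periodically scrapes metrics from the application

Advantages: standards compatibility, ease of debugging, independent operation
Use cases: system metrics, health checks, performance indicators

🎯 Important: in this design, both modes go through the OpenTelemetry Collector.

The Collector handles data transformation, normalization, fan-out to multiple outputs, and batching
The application only needs to produce metrics (separation of concerns)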
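
The difference between the two modes comes down to who decides when data moves. The sketch below contrasts them over one shared in-memory registry; MetricRegistry, handleScrape, and pushMetrics are hypothetical names, and the payload formats are simplified stand-ins for the Prometheus text format and OTLP.

```typescript
// Shared in-memory metric state used by both modes.
class MetricRegistry {
  private readonly values: Record<string, number> = {};

  increment(name: string, by: number = 1): void {
    this.values[name] = (this.values[name] ?? 0) + by;
  }

  snapshot(): Record<string, number> {
    const copy: Record<string, number> = {};
    for (const name of Object.keys(this.values)) copy[name] = this.values[name];
    return copy;
  }
}

// Pull: the application only exposes its current state;
// the scraper decides when to call this (for example on GET /metrics).
function handleScrape(registry: MetricRegistry): string {
  const snapshot = registry.snapshot();
  return Object.keys(snapshot)
    .sort()
    .map((name) => `${name} ${snapshot[name]}`)
    .join('\n');
}

// Push: the application decides when to send and hands the payload to a transport.
function pushMetrics(registry: MetricRegistry, send: (payload: string) => void): void {
  send(JSON.stringify(registry.snapshot()));
}

const registry = new MetricRegistry();
registry.increment('dta_user_sessions_total');
registry.increment('dta_user_sessions_total');
registry.increment('dta_sleep_logs_created_total');

const scrapeBody = handleScrape(registry);
const pushedPayloads: string[] = [];
pushMetrics(registry, (payload) => {
  pushedPayloads.push(payload);
});
```

In both cases the application code that increments metrics is identical; only the delivery path differs, which is what allows the Collector to normalize both streams downstream.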

7.2 Step 1: Configure and deploy the OpenTelemetry Collector

1.1 Create the Collector configuration file

# Create a dedicated directory for the Collector
mkdir -p apps/dta-wide-otel-collector
cd apps/dta-wide-otel-collector

Create otel-collector-config.yaml:

# OpenTelemetry Collector configuration
receivers:
  # Receive metrics pushed by applications over OTLP
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318  # Endpoint the application SDK connects to
        
  # Pull Prometheus-format metrics from the applications
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dta-wide-api'
          scrape_interval: 30s
          static_configs:
            - targets: ['dta-wide-api-dev.europe-west3.run.app']  # Replace with the actual Cloud Run URL
          metrics_path: '/metrics'
          scheme: https
          
        - job_name: 'dta-wide-agent-qa'
          scrape_interval: 30s
          static_configs:
            - targets: ['dta-wide-agent-qa-dev.europe-west3.run.app']
          metrics_path: '/metrics'
          scheme: https
          
        - job_name: 'dta-wide-mcp'
          scrape_interval: 30s
          static_configs:
            - targets: ['dta-wide-mcp-dev.europe-west3.run.app']
          metrics_path: '/metrics'
          scheme: https

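processors:
  # Limit memory usage (listed first in the pipeline so it can apply back-pressure early)
  memory_limiter:
    check_interval: 1s
    limit_mib: 200
    spike_limit_mib: 50
    
  # Add environment information to metrics
  resource:
    attributes:
      - key: cloud.platform
        value: gcp
        action: insert
      - key: environment
        value: ${ENVIRONMENT}  # Injected from an environment variable
        action: insert

  # Batch metrics (performance optimization); runs last in the pipeline
  batch:
    timeout: 30s
    send_batch_size: 1000

exporters:
  # Send metrics to Google Cloud Monitoring in real time
  googlecloud:
    project: ${PROJECT_ID}
    
  # Long-term storage in BigQuery via Pub/Sub
  # The topic must be a fully qualified resource name; the exporter has no encoding option
  googlecloudpubsub:
    project: ${PROJECT_ID}
    topic: projects/${PROJECT_ID}/topics/otel-metrics-${ENVIRONMENT}

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [googlecloud, googlecloudpubsub]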

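1.2 Build the Collector Docker image

Create the Dockerfile:

# Use the official OpenTelemetry Collector (contrib) image
FROM otel/opentelemetry-collector-contrib:0.99.0

# Copy the configuration file
COPY otel-collector-config.yaml /etc/otelcol-contrib/config.yaml

# Expose ports (Dockerfile comments must be on their own line)
# OTLP HTTP (used by the applications)
EXPOSE 4318
# The Collector's own metrics in Prometheus format
EXPOSE 8888
# Health check (served only when the health_check extension is enabled in the config)
EXPOSE 13133

# Run the Collector
CMD ["--config=/etc/otelcol-contrib/config.yaml"]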

1.3 Run the deployment script

# Run the deployment script (starting from the repository root)
cd apps/dta-wide-otel-collector
chmod +x deploy.sh
./deploy.sh dev

# Deploy the Cloud Run service
cd ../../infrastructure/terragrunt/dev/otel-collector
terragrunt init
terragrunt apply

# Verify the deployment
COLLECTOR_URL=$(terragrunt output -raw service_url)
curl "$COLLECTOR_URL/health"

7.3 Step 2: Deploy the metrics collection infrastructure with Terraform

2.1 Deploy the Terraform module

# Deploy the metrics infrastructure for the Dev environment
cd infrastructure/terragrunt/dev/metrics-collection

terragrunt init
terragrunt plan
terragrunt apply
terragrunt output

2.2 Verify the deployment

# Check the BigQuery dataset
bq ls --project_id=dta-cloud-de-dev metrics

# Check the Pub/Sub topics
gcloud pubsub topics list --filter="name:otel-metrics"

# Test the connection to the Collector
COLLECTOR_URL=$(terragrunt output -raw collector_service_url)
curl -X POST "$COLLECTOR_URL/v1/metrics" \
  -H "Content-Type: application/json" \
  -d '{"resourceMetrics": []}'

7.4 Step 3: Integrate the OpenTelemetry SDK into the application

3.1 Install packages and set up telemetry

cd apps/dta-wide-api

# Install the OpenTelemetry packages imported by src/telemetry.ts and the metrics services
npm install \
  @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/semantic-conventions \
  @opentelemetry/resources

3.2 Implement OpenTelemetry initialization

Create src/telemetry.ts (see section 4.1 for the full code)

Modify src/main.ts:

// ⚠️ Important: initialize OpenTelemetry before anything else
import { initializeTelemetry } from './telemetry';
initializeTelemetry();

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app/app.module';
// ... remaining bootstrap code

7.5 Step 4: Configure the Prometheus metrics endpoint

4.1 Install and implement the Prometheus client

npm install prom-client

4.2 Key implementation files

src/metrics/prometheus.service.ts: Prometheus metrics service
src/controllers/metrics.controller.ts: /metrics endpoint
src/middleware/prometheus.middleware.ts: automatic HTTP request tracking
src/app/app.module.ts: module integration

7.6 Step 5: Environment variables and service wiring

5.1 Set Cloud Run environment variables

Add to the Terragrunt configuration:

env_vars = {
  OTEL_EXPORTER_OTLP_METRICS_ENDPOINT = "https://otel-collector-dev.europe-west3.run.app/v1/metrics"
  OTEL_SERVICE_NAME = "dta-wide-api"
  DEPLOY_ENV = "dev"
}

5.2 Set service account permissions

# Grant the Pub/Sub publisher role to the Collector service account
# (repeat for the application service account if it also needs the role)
gcloud projects add-iam-policy-binding dta-cloud-de-dev \
  --member="serviceAccount:otel-collector-dev@dta-cloud-de-dev.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"

7.7 Step 6: Verification and monitoring checks

6.1 Final verification checklist

# 1. Check the Prometheus metrics endpoint
curl -s "https://dta-wide-api-dev.europe-west3.run.app/metrics" | head -20

# 2. Check the data in BigQuery
bq query --use_legacy_sql=false --project_id=dta-cloud-de-dev '
SELECT service_name, metric_name, COUNT(*) as count
FROM `dta-cloud-de-dev.metrics.otel_metrics` 
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY service_name, metric_name'

# 3. Check the dashboard URL
cd infrastructure/terragrunt/dev/metrics-collection
terragrunt output dashboard_url

6.2 Troubleshooting

If metrics are not being collected:

# Check the Collector logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=otel-collector-dev" --limit=20

# Check network connectivity
curl -v "https://otel-collector-dev.europe-west3.run.app/health"

8. Prerequisites and Permission Setup

8.1 Required permissions

8. Operating Procedures

8.1 Daily checklist

Metrics collection status check

#!/bin/bash
# scripts/daily-metrics-check.sh

echo "=== Daily metrics collection system check ==="
echo "Check time: $(date)"
echo

# 1. Check Pub/Sub topic status
echo "📊 Pub/Sub topic status"
for env in dev stage prod; do
  echo "  - $env: $(gcloud pubsub topics describe otel-metrics-$env --format='value(name)' 2>/dev/null || echo 'NOT_FOUND')"
done
echo

# 2. Check recent metrics in BigQuery
echo "📈 Recent BigQuery metrics (last hour)"
bq query --use_legacy_sql=false --format=table '
SELECT 
  environment,
  COUNT(*) as metric_count,
  COUNT(DISTINCT metric_name) as unique_metrics,
  MAX(timestamp) as latest_timestamp
FROM `'$PROJECT_ID'.metrics.otel_metrics`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY environment
ORDER BY environment'
echo

# 3. Check alert policy status
echo "🚨 Active alert policies"
# The filter matches the Korean display names of the alert policies (메트릭 = "metrics")
gcloud alpha monitoring policies list \
  --filter="displayName:메트릭" \
  --format="table(displayName,enabled,conditions.len():label=CONDITIONS)"
echo

# 4. Cost estimate (a dry run reports the bytes the query would scan)
echo "💰 Estimated daily cost"
echo "  - BigQuery: $(bq query --use_legacy_sql=false --dry_run '
SELECT COUNT(*) FROM `'$PROJECT_ID'.metrics.otel_metrics` 
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)' 2>&1 | grep -oE '[0-9]+ bytes' | awk '{print $1}') bytes processed"

echo "✅ Daily check complete"

8.2 Weekly maintenance

Performance analysis and optimization

-- weekly-metrics-analysis.sql
-- Weekly metrics performance analysis

-- 1. Identify high-cardinality metrics
WITH metric_cardinality AS (
  SELECT 
    metric_name,
    service_name,
    COUNT(DISTINCT CONCAT(
      ARRAY_TO_STRING(ARRAY(
        SELECT CONCAT(attr.key, '=', attr.value) 
        FROM UNNEST(attributes) as attr
      ), ',')
    )) as cardinality,
    COUNT(*) as total_points,
    SAFE_DIVIDE(COUNT(*), COUNT(DISTINCT CONCAT(
      ARRAY_TO_STRING(ARRAY(
        SELECT CONCAT(attr.key, '=', attr.value) 
        FROM UNNEST(attributes) as attr
      ), ',')
    ))) as avg_points_per_series
  FROM `{PROJECT_ID}.metrics.otel_metrics`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  GROUP BY metric_name, service_name
)
SELECT 
  metric_name,
  service_name,
  cardinality,
  total_points,
  avg_points_per_series,
  CASE 
    WHEN cardinality > 1000 THEN 'HIGH_CARDINALITY'
    WHEN cardinality > 100 THEN 'MEDIUM_CARDINALITY'
    ELSE 'LOW_CARDINALITY'
  END as cardinality_level
FROM metric_cardinality
ORDER BY cardinality DESC, total_points DESC
LIMIT 20;

-- 2. Metrics collection trend analysis
SELECT
  DATE(timestamp) as date,
  environment,
  COUNT(*) as daily_metrics,
  COUNT(DISTINCT metric_name) as unique_metrics,
  COUNT(DISTINCT service_name) as active_services,
  AVG(CASE WHEN metric_type = 'histogram' THEN metric_value END) as avg_histogram_value,
  STDDEV(CASE WHEN metric_type = 'gauge' THEN metric_value END) as gauge_stddev
FROM `{PROJECT_ID}.metrics.otel_metrics`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY DATE(timestamp), environment
ORDER BY date DESC, environment;

-- 3. Analyze error patterns
SELECT
  DATE(timestamp) as date,
  service_name,
  metric_name,
  COUNT(*) as error_count,
  STRING_AGG(DISTINCT error_message, '; ' LIMIT 3) as sample_errors
FROM `{PROJECT_ID}.metrics.otel_metrics`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND error_message IS NOT NULL
GROUP BY DATE(timestamp), service_name, metric_name
HAVING error_count > 10
ORDER BY date DESC, error_count DESC;
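
The cardinality check in query 1 can also be reproduced in application code, for example to inspect a metric before it ships. The following is a minimal sketch, not the pipeline's actual implementation: `DataPoint`, `seriesKey`, `cardinalityByMetric`, and `classifyCardinality` are hypothetical names, and the thresholds (1000 and 100) are copied from the CASE expression in the query.

```typescript
// Hypothetical shape of one data point; fields mirror the columns used in the query.
interface DataPoint {
  metricName: string;
  attributes: Record<string, string>;
}

// Build a canonical series key: attribute keys are sorted so that
// {a, b} and {b, a} are counted as the same time series.
function seriesKey(attributes: Record<string, string>): string {
  return Object.keys(attributes)
    .sort()
    .map((key) => `${key}=${attributes[key]}`)
    .join(',');
}

// Count distinct attribute combinations (time series) per metric name.
function cardinalityByMetric(points: DataPoint[]): Record<string, number> {
  const seriesPerMetric: Record<string, Set<string>> = {};
  for (const point of points) {
    let series = seriesPerMetric[point.metricName];
    if (!series) {
      series = new Set<string>();
      seriesPerMetric[point.metricName] = series;
    }
    series.add(seriesKey(point.attributes));
  }
  const result: Record<string, number> = {};
  for (const name of Object.keys(seriesPerMetric)) {
    result[name] = seriesPerMetric[name].size;
  }
  return result;
}

// Same thresholds as the weekly analysis query.
function classifyCardinality(cardinality: number): string {
  if (cardinality > 1000) return 'HIGH_CARDINALITY';
  if (cardinality > 100) return 'MEDIUM_CARDINALITY';
  return 'LOW_CARDINALITY';
}
```

Sorting the keys before joining them is a deliberate choice: without it, two points that carry the same labels in a different order would be counted as separate series.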

8.3 Monthly Review

Cost Optimization Review

#!/bin/bash
# scripts/monthly-cost-review.sh

echo "=== Metrics system monthly cost review ==="
echo "Review period: $(date -d '1 month ago' '+%Y-%m') ~ $(date '+%Y-%m')"
echo

# BigQuery storage cost analysis
# __TABLES__ exposes table_id and row_count; 0.02 USD per GB-month is an approximate rate.
echo "📊 BigQuery storage analysis"
bq query --use_legacy_sql=false '
SELECT
  table_id as table_name,
  ROUND(size_bytes / 1024 / 1024 / 1024, 2) as size_gb,
  ROUND(size_bytes / 1024 / 1024 / 1024 * 0.02, 2) as estimated_storage_cost_usd,
  row_count as num_rows,
  ROUND(SAFE_DIVIDE(row_count, size_bytes / 1024 / 1024), 0) as rows_per_mb
FROM `'$PROJECT_ID'.metrics.__TABLES__`
ORDER BY size_bytes DESC'

# Query cost analysis (assumes roughly 5 USD per TiB scanned)
# SELECT jobs write to anonymous result tables, so filter on the tables they read.
echo
echo "💰 BigQuery query cost estimate"
bq query --use_legacy_sql=false '
SELECT
  TIMESTAMP_TRUNC(creation_time, DAY) as query_date,
  COUNT(*) as query_count,
  ROUND(SUM(total_bytes_processed) / 1024 / 1024 / 1024, 2) as total_gb_processed,
  ROUND(SUM(total_bytes_processed) / 1024 / 1024 / 1024 * 5 / 1024, 2) as estimated_query_cost_usd
FROM `region-europe-west3`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = "QUERY"
  AND statement_type = "SELECT"
  AND EXISTS (
    SELECT 1 FROM UNNEST(referenced_tables) AS t
    WHERE t.table_id IN ("otel_metrics", "prometheus_metrics")
  )
GROUP BY TIMESTAMP_TRUNC(creation_time, DAY)
ORDER BY query_date DESC
LIMIT 30'

echo
echo "🎯 Optimization recommendations:"
echo "  1. Optimize high-cardinality metric labels"
echo "  2. Stop collecting unused metrics"
echo "  3. Adjust the sampling rate in the dev environment"
echo "  4. Review the archive policy for old data"
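
The arithmetic behind the two cost columns can be checked in isolation. The sketch below assumes the same rates the script uses (0.02 USD per GiB-month of storage and 5 USD per TiB scanned); these are taken from the script, not from a current price list, and the function names are hypothetical.

```typescript
// Rates copied from the review script; treat them as assumptions, not current list prices.
const STORAGE_USD_PER_GIB_MONTH = 0.02;
const QUERY_USD_PER_TIB = 5;

const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Round to a fixed number of decimal places, like SQL ROUND(value, digits).
function roundTo(value: number, digits: number): number {
  const factor = Math.pow(10, digits);
  return Math.round(value * factor) / factor;
}

// Mirrors: size_bytes / 1024 / 1024 / 1024 * 0.02
function estimateStorageCostUsd(sizeBytes: number): number {
  return roundTo((sizeBytes / BYTES_PER_GIB) * STORAGE_USD_PER_GIB_MONTH, 2);
}

// Mirrors: total_bytes_processed / 1024 / 1024 / 1024 * 5 / 1024
function estimateQueryCostUsd(bytesProcessed: number): number {
  const tib = bytesProcessed / BYTES_PER_GIB / 1024;
  return roundTo(tib * QUERY_USD_PER_TIB, 2);
}
```

Under these assumed rates, a 500 GiB table costs about 10 USD per month to store, and scanning 2 TiB costs about 10 USD.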

9. Troubleshooting

9.1 Common Issues

Issue 1: Metrics are not being collected

Symptoms:

No metrics appear on the dashboard
The BigQuery table is empty

Diagnosis:

# 1. Check the application metrics endpoint
curl -s https://dta-wide-api-dev.run.app/metrics | head -20

# 2. Check the OpenTelemetry Collector logs
gcloud logging read "resource.type=cloud_run_revision AND 
  resource.labels.service_name=otel-collector-dev AND
  severity>=WARNING" --limit=50

# 3. Check the Pub/Sub message flow
gcloud pubsub topics describe otel-metrics-dev
gcloud pubsub subscriptions describe otel-metrics-bigquery-dev

# 4. Check BigQuery ingestion status via the most recent jobs
#    (bq takes --project_id and --format as global flags, before the command)
bq --project_id=$PROJECT_ID --format=prettyjson ls -j --max_results=5

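Solutions:

SDK configuration: confirm that the OpenTelemetry SDK is initialized correctly
Network connectivity: confirm that the Collector endpoint is reachable
Service account permissions: confirm permission to publish metrics
Firewall rules: confirm that internal VPC traffic is allowed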

Issue 2: High cost

Symptoms:

BigQuery costs are higher than expected
Metric cardinality is excessive

Diagnosis:

-- Identify the metrics that drive cost
SELECT 
  metric_name,
  service_name,
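  -- Sort attributes by key so the same label set always yields the same series key
  COUNT(DISTINCT CONCAT(
    ARRAY_TO_STRING(ARRAY(
      SELECT CONCAT(attr.key, '=', attr.value) 
      FROM UNNEST(attributes) as attr
      ORDER BY attr.key
    ), ',')
  )) as unique_series,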
  COUNT(*) as total_points,
  ROUND(COUNT(*) * 0.000002, 4) as estimated_cost_usd_per_day
FROM `{PROJECT_ID}.metrics.otel_metrics`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY metric_name, service_name
ORDER BY total_points DESC
LIMIT 10;

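Solutions:

Adjust sampling: lower the sampling rate in the dev and stage environments
Optimize labels: remove high-cardinality labels
Filter metrics: stop collecting unnecessary metrics
Shorten retention: set an appropriate retention period for each environment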
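
Label optimization can also happen in the application, before a measurement is recorded. The sketch below is illustrative only: `sanitizeAttributes` and `HIGH_CARDINALITY_KEYS` are hypothetical names, and the two blocked keys are the ones the Collector configuration in section 9.2 deletes.

```typescript
// Attribute keys assumed to be high-cardinality; adjust the list per service.
const HIGH_CARDINALITY_KEYS: string[] = ['http.user_agent', 'user.id'];

// Return a copy of the attributes without the blocked keys.
// The input object is left untouched so callers can still log the full set.
function sanitizeAttributes(
  attributes: Record<string, string>,
  blockedKeys: string[] = HIGH_CARDINALITY_KEYS,
): Record<string, string> {
  const cleaned: Record<string, string> = {};
  for (const key of Object.keys(attributes)) {
    if (blockedKeys.indexOf(key) === -1) {
      cleaned[key] = attributes[key];
    }
  }
  return cleaned;
}
```

The cleaned object would then be passed as the attributes argument when recording a measurement, so that each unique user or user agent no longer creates its own time series.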

9.2 Performance Optimization

OpenTelemetry Collector Optimization

# otel-collector-optimized.yaml
processors:
  # Metric batching
  batch:
    timeout: 5s
    send_batch_size: 2048
    send_batch_max_size: 4096
  
  # Memory limit
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s
  
  # Metric filtering
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_bucket$"  # exclude histogram buckets (cost savings)
          - ".*debug.*"   # exclude debug metrics
  
  # Attribute cleanup
  attributes:
    actions:
      - key: "http.user_agent"
        action: delete  # remove high-cardinality attribute
      - key: "user.id"
        action: delete  # remove personally identifiable information

service:
  # memory_ballast must also be declared under a top-level `extensions:` block.
  # Recent Collector releases deprecate it; GOMEMLIMIT is the usual replacement.
  extensions: [memory_ballast]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter, attributes, batch]
      exporters: [googlepubsub]
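
Before deploying an exclude list, it can help to check which metric names the patterns would actually drop. The sketch below does this with the two patterns from the filter processor above. It assumes JavaScript regular expressions behave like the Collector's Go regular expressions for patterns this simple; `isExcluded` is a hypothetical helper.

```typescript
// Patterns copied from the filter processor's exclude list.
const EXCLUDE_PATTERNS: RegExp[] = [/.*_bucket$/, /.*debug.*/];

// A metric is dropped if any exclude pattern matches its name.
function isExcluded(metricName: string, patterns: RegExp[] = EXCLUDE_PATTERNS): boolean {
  return patterns.some((pattern) => pattern.test(metricName));
}
```

For example, `http_request_duration_bucket` and `app_debug_gc_runs` would be excluded, while `http_requests_total` and `bucket_size_bytes` would pass through.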

Application-Level Optimization

// Metric sampling optimization
const shouldSampleMetric = (metricName: string, environment: string): boolean => {
  // Per-environment sampling strategy
  const samplingRates: Record<string, number> = {
    dev: 0.1,    // 10%
    stage: 0.3,  // 30%
    prod: 1.0    // 100%
  };
  
  // Critical metrics are always collected
  const criticalMetrics = [
    'http_requests_total',
    'http_request_duration',
    'db_query_duration',
    'cache_hits_total'
  ];
  
  if (criticalMetrics.includes(metricName)) {
    return true;
  }
  
  // Unknown environments fall back to the dev rate
  const rate = samplingRates[environment] ?? 0.1;
  return Math.random() < rate;
};

// Batched metric export
class MetricsBatcher {
  private batch: any[] = [];
  private readonly batchSize = 100;
  private readonly flushInterval = 30000; // 30 seconds
  
  constructor(private exporter: any) {
    setInterval(() => this.flush(), this.flushInterval);
  }
  
  add(metric: any): void {
    this.batch.push(metric);
    
    if (this.batch.length >= this.batchSize) {
      this.flush();
    }
  }
  
  private flush(): void {
    if (this.batch.length === 0) return;
    
    // Detach the pending batch before exporting so that metrics added
    // during the export are neither dropped nor sent twice.
    const pending = this.batch;
    this.batch = [];
    
    this.exporter.export(pending)
      .catch((err: unknown) => console.error('Failed to export metrics batch:', err));
  }
}
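
Because the sampling decision depends on `Math.random()`, it is hard to unit-test as written. One option, sketched below under that assumption, is a variant that takes the random source as a parameter. `shouldSample` and its `random` parameter are hypothetical; the rates and critical metric names are copied from the snippet above.

```typescript
// Rates and critical metric names copied from the sampling snippet.
const SAMPLING_RATES: Record<string, number> = { dev: 0.1, stage: 0.3, prod: 1.0 };
const CRITICAL_METRICS: string[] = [
  'http_requests_total',
  'http_request_duration',
  'db_query_duration',
  'cache_hits_total',
];

// `random` is injected so that tests can supply a fixed value
// instead of relying on Math.random().
function shouldSample(
  metricName: string,
  environment: string,
  random: () => number = Math.random,
): boolean {
  // Critical metrics bypass sampling entirely.
  if (CRITICAL_METRICS.indexOf(metricName) !== -1) return true;

  // Unknown environments fall back to the dev rate (10%).
  const rate = SAMPLING_RATES[environment];
  return random() < (rate === undefined ? 0.1 : rate);
}
```

With a fixed source such as `() => 0.05`, a non-critical metric in `dev` is kept; with `() => 0.5` it is dropped, while a critical metric is kept in both cases.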

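10. Performance Metrics

10.1 Key KPIs

Metric Collection Performance

Metric	Dev	Stage	Prod	Measurement method
Collection success rate	> 95%	> 98%	> 99.5%	BigQuery ingestion rate
Collection latency (P95)	< 60 s	< 45 s	< 30 s	Timestamp difference
Metric throughput	100/min	500/min	2000/min	Pub/Sub message rate
System overhead	< 5%	< 3%	< 2%	CPU/memory increase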
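
The first two KPIs reduce to simple arithmetic over collected samples. The sketch below shows one way to compute them; it uses the nearest-rank percentile definition, which is an assumption (the guide does not state which percentile method the dashboards use), and both function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('cannot compute a percentile of an empty sample');
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of emitted data points that arrived in BigQuery, in percent.
// Assumption: when nothing was emitted, nothing was lost, so report 100.
function successRatePercent(emitted: number, received: number): number {
  if (emitted === 0) return 100;
  return (received * 100) / emitted;
}
```

Here the latency samples would be the per-point differences between the metric timestamp and its arrival time, in seconds.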

Cost Efficiency

Environment	Target monthly cost	Actual cost	Cost per metric	Optimization rate
Dev	$25	-	$0.0005	90%
Stage	$65	-	$0.0003	70%
Prod	$350	-	$0.0002	Baseline

10.2 Performance Monitoring Queries

Real-Time Performance Dashboard

-- Real-time monitoring of metric collection performance
WITH recent_metrics AS (
  SELECT
    TIMESTAMP_TRUNC(timestamp, MINUTE) as minute,
    environment,
    COUNT(*) as metrics_count,
    COUNT(DISTINCT service_name) as active_services,
    AVG(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, SECOND)) as avg_age_seconds
  FROM `{PROJECT_ID}.metrics.otel_metrics`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  GROUP BY minute, environment
),
performance_metrics AS (
  SELECT
    minute,
    environment,
    metrics_count,
    active_services,
    avg_age_seconds,
    LAG(metrics_count) OVER (PARTITION BY environment ORDER BY minute) as prev_metrics_count,
    -- Test the larger threshold first; otherwise CRITICAL is unreachable
    CASE 
      WHEN avg_age_seconds > 60 THEN 'CRITICAL'
      WHEN avg_age_seconds > 30 THEN 'SLOW'
      ELSE 'NORMAL'
    END as latency_status
  FROM recent_metrics
)
SELECT
  minute,
  environment,
  metrics_count,
  active_services,
  ROUND(avg_age_seconds, 1) as avg_latency_seconds,
  latency_status,
  ROUND(SAFE_DIVIDE(metrics_count - prev_metrics_count, prev_metrics_count) * 100, 1) as growth_rate_percent
FROM performance_metrics
ORDER BY minute DESC, environment;
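
The two derived columns, `latency_status` and `growth_rate_percent`, can be expressed as small pure functions, which makes the threshold logic easy to test. This is a sketch with hypothetical function names; the 30 and 60 second thresholds come from the query, evaluated from the highest down.

```typescript
type LatencyStatus = 'NORMAL' | 'SLOW' | 'CRITICAL';

// Check the larger threshold first: any value above 60 is also above 30,
// so the order of the comparisons decides which label wins.
function classifyLatency(avgAgeSeconds: number): LatencyStatus {
  if (avgAgeSeconds > 60) return 'CRITICAL';
  if (avgAgeSeconds > 30) return 'SLOW';
  return 'NORMAL';
}

// Mirrors ROUND(SAFE_DIVIDE(current - previous, previous) * 100, 1):
// returns null when there is no previous value or it is zero.
function growthRatePercent(current: number, previous: number | null): number | null {
  if (previous === null || previous === 0) return null;
  return Math.round(((current - previous) / previous) * 1000) / 10;
}
```

The null result corresponds to the first minute of each environment, where `LAG` has no previous row to compare against.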

Cost Trend Analysis

-- Daily cost trend and projection
WITH daily_costs AS (
  SELECT
    DATE(timestamp) as date,
    environment,
    COUNT(*) as daily_metrics,
    -- Estimated BigQuery storage cost (rough approximation)
    ROUND(COUNT(*) * 0.00002, 2) as estimated_storage_cost_usd,
    -- Estimated query cost (based on average scan volume)
    ROUND(COUNT(*) * 0.000001, 2) as estimated_query_cost_usd
  FROM `{PROJECT_ID}.metrics.otel_metrics`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY date, environment
),
cost_trends AS (
  SELECT
    date,
    environment,
    daily_metrics,
    estimated_storage_cost_usd + estimated_query_cost_usd as total_daily_cost_usd,
    AVG(estimated_storage_cost_usd + estimated_query_cost_usd) 
      OVER (PARTITION BY environment ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as avg_7day_cost,
    -- Projected monthly cost
    (estimated_storage_cost_usd + estimated_query_cost_usd) * 30 as projected_monthly_cost
  FROM daily_costs
)
SELECT
  date,
  environment,
  daily_metrics,
  ROUND(total_daily_cost_usd, 2) as daily_cost_usd,
  ROUND(avg_7day_cost, 2) as avg_7day_cost_usd,
  ROUND(projected_monthly_cost, 2) as projected_monthly_cost_usd,
  CASE
    WHEN projected_monthly_cost > 500 THEN 'HIGH_COST'
    WHEN projected_monthly_cost > 100 THEN 'MEDIUM_COST'
    ELSE 'LOW_COST'
  END as cost_category
FROM cost_trends
ORDER BY date DESC, environment;
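
The window average and the projection in this query are simple enough to verify by hand. The sketch below reimplements them under the assumption that the input is one environment's daily costs sorted by date ascending; the function names are hypothetical, and the 100 and 500 USD cut-offs are copied from the query.

```typescript
// Trailing average over the current day and up to (window - 1) preceding days,
// like ROWS BETWEEN 6 PRECEDING AND CURRENT ROW for window = 7.
// Early rows average over fewer days, as the SQL window does.
function trailingAverage(dailyCosts: number[], window: number = 7): number[] {
  return dailyCosts.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = dailyCosts.slice(start, i + 1);
    const sum = slice.reduce((acc, value) => acc + value, 0);
    return sum / slice.length;
  });
}

// The query projects a month as 30 times a single day's cost.
function projectMonthlyCost(dailyCostUsd: number): number {
  return dailyCostUsd * 30;
}

// Same cut-offs as the CASE expression in the query.
function costCategory(projectedMonthlyUsd: number): string {
  if (projectedMonthlyUsd > 500) return 'HIGH_COST';
  if (projectedMonthlyUsd > 100) return 'MEDIUM_COST';
  return 'LOW_COST';
}
```

Because the projection multiplies one day's cost by 30, a single spike day is enough to move an environment into a higher category; comparing it with the 7-day average gives a steadier signal.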

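11. Next Steps

11.1 Short-Term Improvements (1-2 weeks)

Complete application integration
- Apply the OpenTelemetry SDK to all services
- Implement Prometheus metrics endpoints
- Add custom business metrics
Strengthen monitoring
- Fine-tune SLO-based alerting policies
- Improve dashboard visualizations
- Automatically detect abnormal patterns
Performance optimization
- Optimize metric cardinality
- Fine-tune the sampling strategy
- Improve batch processing efficiency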

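11.2 Mid-Term Plan (1-2 months)

Advanced analytics
- Grafana integration and advanced visualization
- Metric-based anomaly detection
- Predictive performance analysis
Strengthen regulatory compliance
- Address medical device software audit requirements
- Verify data sovereignty compliance
- Automate retention policies
Cost optimization
- Intelligent metric filtering
- Adaptive sampling
- Automated storage tiering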

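11.3 Long-Term Vision (3-6 months)

AI/ML integration
- Metric-based performance prediction
- Automated anomaly detection
- Intelligent alerting system
Multi-cloud support
- Integration with other cloud metric systems
- Hybrid monitoring
- Cloud-neutral metric standards
Full automation
- Self-healing metrics system
- Dynamic resource allocation
- Unattended operation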

11.4 Extensibility

Integration with other observability tools

📚 Related Documents

Document version: 1.0.0
Last updated: 2025-08-13
Approved by: DTA-Wide Infrastructure Team
Next review due: 2025-09-13

Change History

Version	Date	Author	Changes
0.1.0	2025-08-13	bok@weltcorp.com	Initial draft
