SOL CoD BackTest System: Migration Plan for an Asynchronous Pattern on Cloud Run

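📋 Document Overview

Date written: 2025-09-13
Target system: SOL Chain of Debate BackTest system
Environment: Google Cloud Run (serverless)
Purpose: Introduce an asynchronous pattern so that SOL CoD simulations taking 31 seconds or longer are processed reliably on Cloud Run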

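🚨 Cloud Run Environment Analysis and Constraints

1. Cloud Run Serverless Characteristics

Cloud Run environment constraints:
- Request timeout: up to 3600 seconds (1 hour)
- CPU allocation: by default, CPU is available only while a request is being processed
- Instance lifecycle: instances are shut down automatically when there are no requests
- Memory: up to 32 GiB
- Concurrency: up to 1000 concurrent requests per instance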

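2. Current State of the SOL CoD System

Current performance figures (as of Phase 6):
- SOL prediction with 4 expert agents: 5-7 seconds
- 5 expert agents (with melatonin): 10-second target
- Backtesting simulation (83 days of data): expected to take 31 seconds or more
- Gemini API calls: 2-3 seconds per agent
- Data processing: 1-2 seconds for PostgreSQL/Prisma reads and writes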
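
The figures above can be combined into a rough bound for a run in which every day calls every agent strictly one after another. The sketch below is only back-of-the-envelope arithmetic: the function name, the input shape, and the assumption that database time is paid once per day are illustrative and not part of the existing system. It shows how strongly the total runtime depends on whether agent calls are parallelized, batched, or cached.

```typescript
// Rough sequential-runtime estimate built only from the figures listed above.
interface RuntimeEstimateInput {
  days: number;                // number of backtest days
  agents: number;              // expert agents called per day
  secondsPerAgentCall: number; // Gemini latency per agent call
  secondsPerDayIo: number;     // PostgreSQL/Prisma time per day (assumed: paid once per day)
}

function estimateSequentialSeconds(input: RuntimeEstimateInput): number {
  const perDay = input.agents * input.secondsPerAgentCall + input.secondsPerDayIo;
  return input.days * perDay;
}

// Lower and upper ends of the ranges quoted above (2-3 s per call, 1-2 s of I/O).
const best = estimateSequentialSeconds({ days: 83, agents: 5, secondsPerAgentCall: 2, secondsPerDayIo: 1 });
const worst = estimateSequentialSeconds({ days: 83, agents: 5, secondsPerAgentCall: 3, secondsPerDayIo: 2 });
console.log(`fully sequential estimate: ${best}s to ${worst}s`);
```

Under these assumptions a fully sequential run lands in the range of minutes, which is one more reason to return a job ID immediately and report progress asynchronously.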

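3. Key Challenges

🔴 Main problems:
- Keeping the instance alive during long-running jobs
- Minimizing user wait time (UX improvement)
- A recovery mechanism for interrupted jobs
- Real-time progress feedback
- Efficient use of resources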

🎯 Analysis of Technical Options

Option 1: Cloud Run + Redis Queue + SSE (Recommended)

Architecture pattern:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Web UI    │ ←→ │ CloudRun    │ ←→ │   Redis     │
│   (SSE)     │    │  (Async)    │    │  (Queue)    │
└─────────────┘    └─────────────┘    └─────────────┘
                            ↕
                    ┌─────────────┐
                    │ PostgreSQL  │
                    │ (Prisma)    │
                    └─────────────┘

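Pros:
✅ Jobs can be recovered after a Cloud Run instance shuts down and restarts
✅ Real-time progress streaming over SSE
✅ Supports job prioritization and scheduling
✅ Horizontally scalable (multiple instances can consume the queue)
✅ Fault isolation and retry logic are easy to implement

Cons:
⚠️ Increased dependency on Redis
⚠️ Complexity of managing SSE connections
⚠️ Additional infrastructure cost

Option 2: Cloud Run + PostgreSQL Jobs + Polling

Architecture pattern: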
┌─────────────┐    ┌─────────────┐    
│   Web UI    │ ←→ │ CloudRun    │    
│ (Polling)   │    │ (Stateless) │    
└─────────────┘    └─────────────┘    
                            ↕
                    ┌─────────────┐
                    │ PostgreSQL  │
                    │(Job Store)  │
                    └─────────────┘

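Pros:
✅ Simple structure (no additional infrastructure required)
✅ Builds on PostgreSQL's ACID guarantees and high availability
✅ Automatic backup and recovery
✅ Uses Cloud Run's autoscaling

Cons:
⚠️ Limited real-time behavior (depends on the polling interval)
⚠️ More unnecessary API calls
⚠️ Battery drain (continuous polling on mobile devices)

Option 3: Cloud Run + Cloud Tasks + SSE (Best Scalability)

Architecture pattern: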
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Web UI    │ ←→ │ CloudRun    │ ←→ │Cloud Tasks  │
│   (SSE)     │    │   (API)     │    │   (Queue)   │
└─────────────┘    └─────────────┘    └─────────────┘
                            ↕                   ↕
                    ┌─────────────┐    ┌─────────────┐
                    │ PostgreSQL  │    │ CloudRun    │
                    │ (Prisma)    │    │ (Workers)   │
                    └─────────────┘    └─────────────┘

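Pros:
✅ Fully managed queue service
✅ Built-in delayed execution and retry logic
✅ Highest level of scalability
✅ Cost optimization (pay per use)

Cons:
⚠️ Complex architecture
⚠️ Cloud Tasks learning curve
⚠️ More complex debugging and monitoring

Option 4: Hybrid Approach (Immediate Response + Status Polling)

Architecture pattern: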
┌─────────────┐    ┌─────────────┐
│   Web UI    │ ←→ │ CloudRun    │
│             │    │             │
│ ┌─────────┐ │    │ ┌─────────┐ │
│ │Quick UI │ │    │ │Async BG │ │
│ │Response │ │    │ │Process  │ │
│ └─────────┘ │    │ └─────────┘ │
└─────────────┘    └─────────────┘
                            ↕
                    ┌─────────────┐
                    │ PostgreSQL  │
                    │(Job + State)│
                    └─────────────┘

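Pros:
✅ Immediate feedback to the user (within 1 second)
✅ Makes maximum use of the existing Cloud Run setup
✅ Allows gradual migration
✅ Simple to implement

Cons:
⚠️ Limited real-time behavior
⚠️ Requires an instance retention strategy

🏆 Selected Solution: Option 1 (Cloud Run + Redis + SSE)

🎯 Rationale

**Real-time feedback**: Immediate feedback for clinicians is important in SOL backtesting
**Reliability**: Redis Queue offers proven stability and recovery capabilities
**Scalability**: Can support more complex analysis jobs in the future
**User experience**: Real-time SSE updates provide a strong UX
**Cost efficiency**: Redis Memorystore is already in use

📐 Detailed Implementation Plan

1. Redis Job Queue Structure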

// Job queue data structure
interface BacktestJob {
  jobId: string;
  userId: string;
  sessionId: string;
  status: 'queued' | 'processing' | 'completed' | 'failed' | 'cancelled';
  progress: {
    currentDay: number;
    totalDays: number;
    currentAgent: string;
    percentage: number;
  };
  parameters: {
    startDay: number;
    endDay: number;
    agentConfig: string[];
  };
  result?: BacktestResult;
  error?: string;
  createdAt: Date;
  startedAt?: Date;
  completedAt?: Date;
  estimatedDuration: number;
}

// Redis key layout
const REDIS_KEYS = {
  JOB_QUEUE: 'sol-backtest:queue',
  JOB_STATUS: 'sol-backtest:status:{jobId}',
  JOB_PROGRESS: 'sol-backtest:progress:{jobId}',
  JOB_RESULT: 'sol-backtest:result:{jobId}',
  USER_SESSIONS: 'sol-backtest:user:{userId}:sessions',
  ACTIVE_WORKERS: 'sol-backtest:workers',
};
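
Several of the keys above are templates containing `{jobId}` or `{userId}` placeholders, so they need to be filled in before use. A minimal sketch of such a helper follows; `resolveKey` is a hypothetical name introduced here for illustration, not an existing function of the project.

```typescript
// Fills "{name}" placeholders in a key template such as 'sol-backtest:status:{jobId}'.
// Throws if a placeholder has no value, so a malformed key never reaches Redis.
function resolveKey(template: string, params: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, name: string) => {
    const value = params[name];
    if (value === undefined) {
      throw new Error(`Missing value for placeholder "${name}" in "${template}"`);
    }
    return value;
  });
}

const statusKey = resolveKey('sol-backtest:status:{jobId}', { jobId: 'job-123' });
const sessionsKey = resolveKey('sol-backtest:user:{userId}:sessions', { userId: 'u-42' });
console.log(statusKey, sessionsKey);
```

Failing fast on a missing placeholder avoids silently writing to a literal key such as `sol-backtest:status:{jobId}`.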

2. Cloud Run Instance Retention Strategy

# Cloud Run deployment configuration
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sol-backtest-service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"      # keep at least one instance
        autoscaling.knative.dev/maxScale: "10"
        run.googleapis.com/cpu-throttling: "false" # CPU always allocated (revision-level annotation)
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 100                    # 100 requests per instance
      timeoutSeconds: 3600                         # 1-hour request timeout
      containers:
      - image: gcr.io/project/sol-backtest:latest
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: REDIS_URL
          value: "redis://redis-instance:6379"
        - name: KEEP_ALIVE_INTERVAL
          value: "50000"  # keep-alive every 50 seconds

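3. Background Job Worker Implementation

// Background job worker service
// RedisService, SOLCoDWorkflowService, BacktestProgressService and
// SSENotificationService are this project's own providers.
import { Injectable, Logger } from '@nestjs/common';

@Injectable()
export class BacktestWorkerService {
  private readonly logger = new Logger(BacktestWorkerService.name);
  private isProcessing = false;
  private keepAliveInterval?: NodeJS.Timeout;

  constructor(
    private readonly redisService: RedisService,
    private readonly solPredictionService: SOLCoDWorkflowService,
    private readonly progressService: BacktestProgressService,
    private readonly sseService: SSENotificationService
  ) {}

  async startWorker(): Promise<void> {
    this.logger.log('🚀 BackTest Worker started');
    this.setupKeepAlive();

    while (true) {
      try {
        await this.processNextJob();
        await this.sleep(1000); // wait 1 second between queue checks
      } catch (error) {
        this.logger.error('Worker error:', error);
        await this.sleep(5000); // back off for 5 seconds after an error
      }
    }
  }

  private setupKeepAlive(): void {
    // Periodic ping while a job is running. An outbound Redis ping keeps the
    // Redis connection warm but is not an inbound request, so the instance
    // itself is retained by minScale and the cpu-throttling setting.
    this.keepAliveInterval = setInterval(async () => {
      if (this.isProcessing) {
        this.logger.debug('💓 Keep-alive ping during job processing');
        await this.redisService.ping();
      }
    }, 50000); // ping every 50 seconds
  }

  // Resolves after the given number of milliseconds.
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }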

  private async processNextJob(): Promise<void> {
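    // Note: LPOP removes the job from the queue, so a job that is in flight when
    // the instance stops must be recovered from its stored status, not from the queue.
    const jobData = await this.redisService.lpop(REDIS_KEYS.JOB_QUEUE);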
    if (!jobData) return;

    const job: BacktestJob = JSON.parse(jobData);
    this.isProcessing = true;

    try {
      await this.updateJobStatus(job.jobId, 'processing');
      await this.runBacktestSimulation(job);
      await this.updateJobStatus(job.jobId, 'completed');
    } catch (error) {
      this.logger.error(`Job ${job.jobId} failed:`, error);
      const message = error instanceof Error ? error.message : String(error);
      await this.updateJobStatus(job.jobId, 'failed', message);
    } finally {
      this.isProcessing = false;
    }
  }

  private async runBacktestSimulation(job: BacktestJob): Promise<void> {
    const { sessionId, parameters, userId } = job;
    const totalDays = parameters.endDay - parameters.startDay + 1;

    for (let currentDay = parameters.startDay; currentDay <= parameters.endDay; currentDay++) {
      // Update progress
      const progress = {
        currentDay,
        totalDays,
        currentAgent: 'preparing',
        percentage: Math.round((currentDay - parameters.startDay) / totalDays * 100)
      };
      
      await this.updateProgress(job.jobId, progress);
      await this.sseService.sendProgressUpdate(userId, job.jobId, progress);

      // Run the analysis for each expert agent
      for (const agentType of parameters.agentConfig) {
        progress.currentAgent = agentType;
        await this.updateProgress(job.jobId, progress);
        await this.sseService.sendProgressUpdate(userId, job.jobId, progress);

        // Run the SOL prediction
        const prediction = await this.solPredictionService.predictSOL({
          userId,
          dayIndex: currentDay,
          systemCurrentTime: new Date()
        });

        // Save the result (to PostgreSQL through Prisma)
        await this.saveBacktestResult(sessionId, currentDay, agentType, prediction);
        
        // Send the intermediate result to the user
        await this.sseService.sendIntermediateResult(userId, job.jobId, {
          day: currentDay,
          agent: agentType,
          prediction: prediction.predictedSOL
        });
      }
    }

    // Compute and save the final result
    const finalResult = await this.calculateFinalResults(sessionId);
    await this.saveJobResult(job.jobId, finalResult);
    
    // Send the completion notification
    await this.sseService.sendCompletionNotification(userId, job.jobId, finalResult);
  }
}
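
For the recovery goal, a common alternative to a plain pop is a "reliable queue": a claimed job stays in a processing list until it is acknowledged, and stale claims are put back. The sketch below models that idea purely in memory so that it can run on its own; `ReliableQueueModel` and its methods are hypothetical names, and the timeout values are assumptions. In Redis the claim step is usually done atomically with `LMOVE` (or `RPOPLPUSH` on older servers) instead of `LPOP`.

```typescript
// In-memory model of a reliable queue: a claimed job remains tracked until it
// is acknowledged, so a crashed worker cannot silently lose it.
class ReliableQueueModel {
  private queue: string[] = [];
  private processing: Record<string, number> = {}; // jobId -> claim time (ms)

  enqueue(jobId: string): void {
    this.queue.push(jobId);
  }

  // Claim the oldest job and remember when it was claimed.
  claim(now: number): string | undefined {
    const jobId = this.queue.shift();
    if (jobId !== undefined) this.processing[jobId] = now;
    return jobId;
  }

  // Acknowledge a finished job so that it is never retried.
  ack(jobId: string): void {
    delete this.processing[jobId];
  }

  // Move jobs whose claim is older than timeoutMs back to the front of the queue.
  requeueStale(now: number, timeoutMs: number): string[] {
    const stale = Object.keys(this.processing).filter(
      jobId => now - this.processing[jobId] > timeoutMs
    );
    for (const jobId of stale) {
      delete this.processing[jobId];
      this.queue.unshift(jobId);
    }
    return stale;
  }
}

const model = new ReliableQueueModel();
model.enqueue('job-1');
const claimed = model.claim(0);                   // a worker takes job-1, then stops
const recovered = model.requeueStale(60000, 30000); // a later sweep finds the stale claim
const reclaimed = model.claim(60000);             // another worker picks it up again
console.log(claimed, recovered, reclaimed);
```

The sweep interval and the stale timeout would need to exceed the longest expected job duration, otherwise a healthy job could be started twice.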

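4. Real-Time Communication with SSE (Server-Sent Events)

// SSE controller
import { Controller, MessageEvent, Param, Query, Sse } from '@nestjs/common';
import { Observable } from 'rxjs';

@Controller('backtest')
export class BacktestSSEController {
  constructor(private readonly sseService: SSENotificationService) {}

  // @Sse registers the GET route itself, so no separate @Get is needed.
  @Sse('stream/:userId')
  streamProgress(
    @Param('userId') userId: string,
    @Query('jobId') jobId?: string
  ): Observable<MessageEvent> {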
    
    return new Observable(observer => {
      const subscription = this.sseService.subscribeToUserEvents(
        userId, 
        jobId,
        (event) => {
          observer.next({
            type: event.type,
            data: JSON.stringify(event.data),
            id: event.id,
            retry: 5000
          } as MessageEvent);
        }
      );

      // Clean up when the connection is closed
      return () => {
        subscription.unsubscribe();
      };
    });
  }
}

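// SSE service
import { Injectable } from '@nestjs/common';
import { Subject, Subscription } from 'rxjs';
import { filter } from 'rxjs/operators';

@Injectable()
export class SSENotificationService {
  // Per-instance registry: events reach a client only if the worker and the
  // SSE connection live in the same Cloud Run instance.
  private readonly eventSubjects = new Map<string, Subject<SSEEvent>>();

  // Sketch of the subscription used by the controller: creates the user's
  // subject on first use and optionally filters events by jobId.
  subscribeToUserEvents(
    userId: string,
    jobId: string | undefined,
    listener: (event: SSEEvent) => void
  ): Subscription {
    let subject = this.eventSubjects.get(userId);
    if (!subject) {
      subject = new Subject<SSEEvent>();
      this.eventSubjects.set(userId, subject);
    }
    return subject
      .pipe(filter(event => !jobId || event.data.jobId === jobId))
      .subscribe(listener);
  }

  // Sketch of the method called by the worker; the 'intermediate' event type
  // and the payload shape are assumptions.
  async sendIntermediateResult(
    userId: string,
    jobId: string,
    result: { day: number; agent: string; prediction: number }
  ): Promise<void> {
    this.emitToUser(userId, {
      type: 'intermediate',
      id: `${jobId}-${result.day}-${result.agent}`,
      data: { jobId, result, timestamp: new Date().toISOString() }
    });
  }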

  async sendProgressUpdate(userId: string, jobId: string, progress: BacktestProgress): Promise<void> {
    const event: SSEEvent = {
      type: 'progress',
      id: `${jobId}-${Date.now()}`,
      data: {
        jobId,
        progress,
        timestamp: new Date().toISOString()
      }
    };

    this.emitToUser(userId, event);
  }

  async sendCompletionNotification(userId: string, jobId: string, result: BacktestResult): Promise<void> {
    const event: SSEEvent = {
      type: 'completed',
      id: `${jobId}-complete`,
      data: {
        jobId,
        result,
        downloadUrl: `/api/backtest/results/${jobId}/download`,
        timestamp: new Date().toISOString()
      }
    };

    this.emitToUser(userId, event);
  }

  private emitToUser(userId: string, event: SSEEvent): void {
    const subject = this.eventSubjects.get(userId);
    if (subject) {
      subject.next(event);
    }
  }
}
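
To make clear what the browser actually receives, the sketch below serializes one event into the `text/event-stream` wire format. NestJS performs this step itself for `@Sse` handlers, so the function is purely illustrative; `WireEvent` and `formatSseFrame` are hypothetical names.

```typescript
// Shape of one event before it is written to the stream.
interface WireEvent {
  type?: string;
  id?: string;
  retry?: number;
  data: unknown;
}

// Serializes one event into text/event-stream format: one "field: value" line
// per field, terminated by a blank line.
function formatSseFrame(event: WireEvent): string {
  const lines: string[] = [];
  if (event.type) lines.push(`event: ${event.type}`);
  if (event.id) lines.push(`id: ${event.id}`);
  if (event.retry !== undefined) lines.push(`retry: ${event.retry}`);
  // A payload containing newlines must be split into one "data:" line per row.
  const payload = typeof event.data === 'string' ? event.data : JSON.stringify(event.data);
  for (const row of payload.split('\n')) lines.push(`data: ${row}`);
  return lines.join('\n') + '\n\n';
}

const progressFrame = formatSseFrame({
  type: 'progress',
  id: 'job-1-1',
  retry: 5000,
  data: { percentage: 50 },
});
console.log(progressFrame);
```

The `event:` field is what `addEventListener('progress', ...)` matches on the client, and `retry:` sets the reconnection delay that `EventSource` uses after a dropped connection.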

5. Frontend SSE Client

// simulation.js (improved)
class BacktestSimulationUI {
  constructor() {
    this.eventSource = null;
    this.currentJobId = null;
  }

  async startSimulation(userId, parameters) {
    try {
      // Request the simulation job
      const response = await fetch('/api/backtest/start', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ userId, parameters })
      });
      if (!response.ok) {
        throw new Error(`Failed to start simulation (HTTP ${response.status})`);
      }

      const { jobId } = await response.json();
      this.currentJobId = jobId;

      // Open the SSE connection
      this.startSSEConnection(userId, jobId);

      // Update the UI
      this.showProgressUI(jobId);

    } catch (error) {
      console.error('Failed to start simulation:', error);
      this.showErrorMessage(error.message);
    }
  }

  startSSEConnection(userId, jobId) {
    const url = `/api/backtest/stream/${userId}?jobId=${jobId}`;
    this.eventSource = new EventSource(url);

    this.eventSource.addEventListener('progress', (event) => {
      const data = JSON.parse(event.data);
      this.updateProgressUI(data.progress);
    });

    this.eventSource.addEventListener('completed', (event) => {
      const data = JSON.parse(event.data);
      // downloadUrl is sent next to result in the event payload, so merge it in
      this.showCompletionResults({ ...data.result, downloadUrl: data.downloadUrl });
      this.eventSource.close();
    });

    this.eventSource.addEventListener('error', (event) => {
      console.error('SSE connection error:', event);
      this.showErrorMessage('The real-time update connection was lost.');
    });
  }

  updateProgressUI(progress) {
    document.getElementById('progress-bar').style.width = `${progress.percentage}%`;
    document.getElementById('current-day').textContent = `${progress.currentDay}/${progress.totalDays}`;
    document.getElementById('current-agent').textContent = progress.currentAgent;
    
    // Update the live chart
    this.updateProgressChart(progress);
  }

  showCompletionResults(result) {
    // Show the final results in the UI
    document.getElementById('simulation-status').className = 'completed';
    document.getElementById('final-accuracy').textContent = `${result.accuracy}%`;
    
    // Show the results download link
    const downloadBtn = document.getElementById('download-results');
    downloadBtn.href = result.downloadUrl;
    downloadBtn.style.display = 'block';
  }

  cancelSimulation() {
    if (this.currentJobId) {
      fetch(`/api/backtest/cancel/${this.currentJobId}`, { method: 'POST' });
      
      if (this.eventSource) {
        this.eventSource.close();
      }
      
      this.resetUI();
    }
  }
}

// Create the global instance
const backtestUI = new BacktestSimulationUI();

6. Cloud Run Environment Variables

# Environment variables
export REDIS_URL="redis://10.x.x.x:6379"
export SSE_HEARTBEAT_INTERVAL=30000
export JOB_TIMEOUT=3600000
export KEEP_ALIVE_INTERVAL=50000
export MAX_CONCURRENT_JOBS=5
export JOB_RETRY_ATTEMPTS=3
export DATABASE_URL="postgresql://user:password@host:5432/sol_prediction_db"

# Cloud Run deploy command
gcloud run deploy sol-backtest-service \
  --image gcr.io/your-project/sol-backtest:latest \
  --platform managed \
  --region europe-west3 \
  --allow-unauthenticated \
  --memory 4Gi \
  --cpu 2 \
  --timeout 3600s \
  --concurrency 100 \
  --min-instances 1 \
  --max-instances 10 \
  --no-cpu-throttling \
  --execution-environment gen2 \
  --set-env-vars="REDIS_URL=${REDIS_URL},KEEP_ALIVE_INTERVAL=${KEEP_ALIVE_INTERVAL}"

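📊 Performance and Cost Analysis

1. Expected Performance

Response time:
  - Simulation start: < 1 second (immediate response)
  - Progress updates: real time (< 100 ms latency)
  - Total completion time: 25-35 seconds (with 5 agents)

Throughput:
  - Concurrent backtesting jobs: up to 10
  - SSE connections per user: no application-level limit
  - Simulations per day: 500+

Availability:
  - Cloud Run autoscaling: 99.9%
  - Redis availability: 99.95%
  - Overall system SLA: 99.8%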
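
The overall figure can be checked against the two component figures. Assuming the request path depends on both Cloud Run and Redis in series and that their failures are independent (both are assumptions; PostgreSQL is left out because no availability figure is given for it), the combined availability is approximately:

```latex
A_{\text{system}} \approx A_{\text{Cloud Run}} \times A_{\text{Redis}}
                  = 0.999 \times 0.9995
                  \approx 0.9985
```

Under these assumptions the product is about 99.85%, which is consistent with the 99.8% overall target.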

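2. Estimated Cost (Monthly)

Cloud Run:
  - Baseline instance (min-instances: 1): ~$30
  - Additional instances (request-driven): ~$50
  - Network: ~$10

Redis Memorystore:
  - Base capacity (2 GB): ~$40
  - Additional capacity (if needed): ~$20

Estimated total: ~$150/month (vs. ~$100/month today)
What the extra cost buys: real-time feedback and improved reliability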

🧪 Test and Verification Plan

1. Unit Tests

describe('BacktestWorkerService', () => {
  it('should process jobs from Redis queue', async () => {
    // Test that jobs are processed from the Redis queue
  });

  it('should handle job failures gracefully', async () => {
    // Test error handling when a job fails
  });

  it('should maintain CloudRun instance during processing', async () => {
    // Test that the instance is kept during a long-running job
  });
});

describe('SSENotificationService', () => {
  it('should send real-time progress updates', async () => {
    // Test real-time updates over SSE
  });

  it('should handle connection drops gracefully', async () => {
    // Test reconnection after an SSE connection drop
  });
});

2. Integration Tests

describe('Backtest Async Flow E2E', () => {
  it('should complete full 83-day simulation', async () => {
    const response = await request(app)
      .post('/api/backtest/start')
      .send({ userId: 'test-user', parameters: { startDay: 1, endDay: 83 } });

    expect(response.body.jobId).toBeDefined();
    
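    // Connect over SSE and monitor progress
    const progressEvents = await monitorSSEProgress(response.body.jobId);
    // The worker emits several progress events per day, so count distinct days
    // (assumes each collected event carries the progress payload)
    const days = new Set(progressEvents.map(e => e.progress.currentDay));
    expect(days.size).toBe(83); // progress reported for all 83 days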
  });

  it('should recover from CloudRun instance restart', async () => {
    // Test the instance restart scenario
  });

  it('should handle concurrent simulations', async () => {
    // Test handling of multiple concurrent simulations
  });
});

3. Load Tests

# Load test with k6
k6 run --duration 10m --vus 50 backtest-load-test.js

# Test scenario:
# - 50 virtual users start backtests concurrently
# - Sustained load for 10 minutes
# - Verify that SSE connections stay open
# - Monitor Redis memory usage

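📈 Monitoring and Alerting

1. Cloud Run Monitoring

Metrics:
- CPU utilization: alert above 80%
- Memory utilization: alert above 70%
- Response time: alert above 60 seconds
- Error rate: alert above 5%
- Instance count: alert on change

Log analysis:
- Collect job start/completion logs
- Analyze error patterns
- Identify performance bottlenecks

2. Redis Monitoring

Redis metrics:
- Memory utilization: alert above 80%
- Connections: alert above 100
- Queue length: alert above 50
- Command latency: alert above 10 ms

Data cleanup:
- Delete completed jobs after 7 days
- Delete failed jobs after 30 days
- Delete progress data after 1 day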
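
The cleanup policy above maps naturally onto key expiry. The sketch below converts the three retention periods into TTL values in seconds, which is the unit a Redis `EXPIRE` call takes; `RetentionKind` and `retentionTtlSeconds` are hypothetical names used only for illustration.

```typescript
type RetentionKind = 'completed' | 'failed' | 'progress';

const SECONDS_PER_DAY = 24 * 60 * 60;

// Retention periods from the cleanup policy above, in days.
const RETENTION_DAYS: Record<RetentionKind, number> = {
  completed: 7,
  failed: 30,
  progress: 1,
};

// TTL in seconds for the given kind of data.
function retentionTtlSeconds(kind: RetentionKind): number {
  return RETENTION_DAYS[kind] * SECONDS_PER_DAY;
}

console.log(retentionTtlSeconds('completed'), retentionTtlSeconds('failed'), retentionTtlSeconds('progress'));
```

Setting the TTL when the key is written means no separate cleanup job is needed for the Redis side.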

3. User Experience Monitoring

// Collects user experience metrics
class UXMetricsCollector {
  trackSimulationStart(userId: string, jobId: string) {
    // Record the simulation start time
    this.recordEvent('simulation_started', { userId, jobId });
  }

  trackProgressUpdate(userId: string, jobId: string, latency: number) {
    // Record the SSE update latency
    this.recordMetric('sse_update_latency', latency, { userId, jobId });
  }

  trackCompletion(userId: string, jobId: string, totalDuration: number) {
    // Record the total completion time
    this.recordMetric('simulation_duration', totalDuration, { userId, jobId });
  }
}

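🚀 Phased Implementation Roadmap

Phase 1: Core Infrastructure (3-4 hours)

✅ Tasks:
1. Design the Redis job queue schema (1 hour)
2. Implement the basic structure of BacktestWorkerService (1 hour)
3. Implement the Cloud Run keep-alive mechanism (1 hour)
4. Basic error handling and logging (1 hour)

🎯 Completion criteria:
- A basic worker pulls jobs from Redis and processes them
- The Cloud Run instance is not shut down while a job is running
- Job status is recorded accurately in Redis

Phase 2: SSE Real-Time Communication (2-3 hours)

✅ Tasks:
1. Implement SSENotificationService (1 hour)
2. Implement BacktestSSEController (1 hour)
3. Develop the frontend SSE client (1 hour)

🎯 Completion criteria:
- Real-time progress is visible in the web browser
- Automatic reconnection when the SSE connection drops
- Progress UI updates without noticeable delay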

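Phase 3: Job Recovery and Reliability (2-3 hours)

✅ Tasks:
1. Recovery logic for interrupted jobs (1 hour)
2. Retry mechanism (1 hour)
3. Job cancellation (1 hour)

🎯 Completion criteria:
- Jobs are recovered automatically after an instance restart
- Failed jobs are retried automatically
- Users can cancel a job that is in progress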
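
The retry criterion can be expressed as a small decision function. The sketch below uses the limit of 3 from `JOB_RETRY_ATTEMPTS`; `decideAfterFailure` and the `RetryDecision` type are hypothetical, and the exact meaning of "attempt" (here: the number of failed runs so far, including the current one) is an assumption.

```typescript
type RetryDecision = 'queued' | 'failed';

// Decides what happens to a job after a failed run.
// failedAttempts counts failed runs so far, including the one that just failed.
function decideAfterFailure(failedAttempts: number, maxAttempts: number = 3): RetryDecision {
  return failedAttempts < maxAttempts ? 'queued' : 'failed';
}

console.log(decideAfterFailure(1), decideAfterFailure(2), decideAfterFailure(3));
```

A job that is put back in the queue would keep its attempt counter in its stored status so that the limit survives an instance restart.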

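Phase 4: Performance Optimization and Testing (3-4 hours)

✅ Tasks:
1. Optimize parallel processing (1 hour)
2. Optimize memory usage (1 hour)
3. Write end-to-end tests (1 hour)
4. Run load tests (1 hour)

🎯 Completion criteria:
- An 83-day backtest completes within 30 seconds
- 5 concurrent simulations can be processed
- All tests pass

Phase 5: Monitoring and Production Readiness (2 hours)

✅ Tasks:
1. Set up Cloud Run monitoring (1 hour)
2. Build the alerting system (1 hour)

🎯 Completion criteria:
- Alerts are configured for all key metrics
- System status can be checked in real time on a dashboard
- Ready for production deployment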

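📋 Risks and Mitigations

1. Technical Risks

🔴 Risk: Unexpected termination of a Cloud Run instance
📋 Mitigation:
- Keep the instance alive with the keep-alive mechanism
- Persist job state to Redis continuously
- Automatic recovery logic for interrupted jobs

🔴 Risk: Redis disconnection or outage
📋 Mitigation:
- Implement Redis reconnection with retries
- Use PostgreSQL backups and replication
- Apply the Circuit Breaker pattern

🔴 Risk: Unstable SSE connections
📋 Mitigation:
- Automatic reconnection
- Polling as a fallback option
- Connection-state monitoring and alerts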
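
For the polling fallback, the interval between status requests matters: polling too often adds load, polling too rarely feels unresponsive. One possible approach is a capped exponential backoff, sketched below. `nextPollDelayMs` is a hypothetical helper, and the 1-second base and 15-second cap are assumed values, not figures from this plan.

```typescript
// Delay before the next status poll: doubles with each attempt, capped at maxMs.
function nextPollDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 15000): number {
  const delay = baseMs * Math.pow(2, Math.max(0, attempt));
  return Math.min(delay, maxMs);
}

// Delays for the first six polls after falling back from SSE.
const delays = [0, 1, 2, 3, 4, 5].map(attempt => nextPollDelayMs(attempt));
console.log(delays);
```

Resetting the attempt counter whenever the job reports new progress keeps updates frequent while the job is active.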

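2. Performance Risks

🔴 Risk: Performance degradation from multiple concurrent jobs
📋 Mitigation:
- Introduce a priority system for the job queue
- Limit the number of concurrently running jobs (5)
- Monitor resource usage

🔴 Risk: Out-of-memory (OOM) failures
📋 Mitigation:
- Increase Cloud Run memory to 4 GB
- Track memory usage per job
- Reject new jobs when a memory threshold is reached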
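
The two limits above (at most 5 concurrent jobs, and rejecting work near the memory limit) can be combined into a single admission check. In the sketch below `canAcceptJob` and `WorkerLoad` are hypothetical names, and the 80% memory threshold is an assumed value because the plan does not state one.

```typescript
interface WorkerLoad {
  activeJobs: number;
  memoryUsedBytes: number;
  memoryLimitBytes: number;
}

const MAX_CONCURRENT_JOBS = 5;   // limit stated in this plan
const MEMORY_REJECT_RATIO = 0.8; // assumed threshold, not taken from the plan

// Returns true only if both the job-count limit and the memory limit allow one more job.
function canAcceptJob(load: WorkerLoad): boolean {
  const memoryRatio = load.memoryUsedBytes / load.memoryLimitBytes;
  return load.activeJobs < MAX_CONCURRENT_JOBS && memoryRatio < MEMORY_REJECT_RATIO;
}

const GIB = 1024 * 1024 * 1024;
console.log(canAcceptJob({ activeJobs: 2, memoryUsedBytes: 1 * GIB, memoryLimitBytes: 4 * GIB }));
```

In a NestJS service the used-memory figure could come from `process.memoryUsage().rss`, with the limit taken from the configured 4 GiB.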

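3. User Experience Risks

🔴 Risk: Longer user wait times
📋 Mitigation:
- Return the job ID immediately (within 1 second)
- Real-time progress updates
- Show the estimated completion time

🔴 Risk: Loss of job results
📋 Mitigation:
- Persist results in PostgreSQL (using Prisma ORM)
- Download links remain valid for 7 days
- Offer an email notification option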

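📈 Success Criteria and KPIs

1. Performance KPIs

Response time:
  - Simulation start response: < 1 second ✅
  - 83-day backtest completion: < 35 seconds ✅
  - SSE update latency: < 100 ms ✅

Availability:
  - System uptime: > 99.8% ✅
  - Job completion rate: > 98% ✅
  - Data loss rate: < 0.1% ✅

Scalability:
  - Concurrent simulations: at least 5 ✅
  - Peak-hour throughput: 50 per hour ✅

2. User Experience KPIs

Satisfaction:
  - Satisfaction with real-time feedback: > 90% ✅
  - Overall UX satisfaction: > 85% ✅
  - Effectiveness of completion notifications: > 95% ✅

Efficiency:
  - Job cancellation rate: < 10% ✅
  - Re-run rate: < 5% ✅
  - Result download success rate: > 98% ✅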

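🔄 Future Directions

1. Short-Term Improvements (1-2 months)

🎯 Performance optimization:
- Accelerate AI inference with GPU instances
- Introduce a caching layer to minimize repeated computation
- Reduce computation through smart sampling

🎯 User experience:
- Integrate mobile app push notifications
- Provide tools for comparing and analyzing results
- Support custom simulation parameters

2. Mid- to Long-Term Expansion (3-6 months)

🎯 Architecture evolution:
- Move to Kubernetes for finer-grained scaling
- Introduce stream processing for real-time analysis
- Integrate with the ML pipeline to improve prediction accuracy

🎯 Multi-analysis support:
- Extend backtesting to other health indicators
- Comparative analysis across multiple users
- Dashboard and reporting system for clinicians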

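📝 Conclusion

Migrating the SOL CoD BackTest system to an asynchronous pattern on Cloud Run is feasible with the Redis Queue + SSE combination and offers the following key benefits:

✅ Main Goals

Real-time feedback: latency < 100 ms over SSE
Reliability: Cloud Run instance retention and a job recovery mechanism
Scalability: ability to process multiple simulations concurrently
User experience: immediate response plus real-time progress updates

📊 Expected Impact

Processing time: 31+ seconds → 25-35 seconds (more stable than the synchronous path)
User satisfaction: 90%+ (real-time feedback)
System reliability: 99.8% uptime
Clinician confidence: stable analysis without interrupted jobs

This plan builds on the SOL CoD system's current 95% completion and presents a practical, verifiable solution that makes the most of Cloud Run's serverless nature while ensuring the reliability of long-running jobs.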
