[Azure] 5xx errors after reducing AGW instances: is it the AGW or the backend?

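After lowering the AGW (Application Gateway) Minimum instance count to cut costs, 5xx errors suddenly increased.

Was the instance reduction the cause, or did an existing backend problem just become visible? The status code alone can't tell the two apart, so these notes walk through isolating the cause with the Access Log and metrics.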

Instance <-> Capacity Unit conversion

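1 instance = about 10 Capacity Units (CU)
Capacity covered by N instances = N * 10 CU
1 Capacity Unit = 1 Compute Unit, or 2,500 persistent connections, or 2.22 Mbps of throughput. The CU count is driven by whichever of the three has the highest usage (the example below needs about 5 CU):
- By compute -> 3 CU needed (arbitrary value)
- By connections -> 12,550 / 2,500 = 5.02, about 5 CU needed
- By throughput -> 4.44 Mbps / 2.22 = 2 CU needed
Minimum instance count is only a **floor (lower bound)**. When traffic grows, the gateway autoscales above it. However, scale-out takes about 3 to 5 minutes.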
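The "largest of the three dimensions" rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function and constant names are made up for this post, and the inputs are the example values from the list above.

```python
# Per-CU limits as stated above (1 CU = 1 Compute Unit, 2,500 connections, 2.22 Mbps).
CONNECTIONS_PER_CU = 2500
THROUGHPUT_MBPS_PER_CU = 2.22


def required_capacity_units(compute_units, persistent_connections, throughput_mbps):
    """Return the CU demand: the largest of the three per-dimension demands."""
    by_compute = compute_units
    by_connections = persistent_connections / CONNECTIONS_PER_CU
    by_throughput = throughput_mbps / THROUGHPUT_MBPS_PER_CU
    return max(by_compute, by_connections, by_throughput)


# Example values from the list above: connections dominate.
cu = required_capacity_units(3, 12_550, 4.44)
print(round(cu, 2))  # 5.02
```

Connections are the binding dimension here, so compute and throughput could grow quite a bit before they change the result.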

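How to size the minimum instance count

In Metrics, check the maximum (Max) of Current Capacity Units (or Compute Units) over the last month
Divide that peak by 10 and round up → the minimum number of instances needed
Add about +1 of headroom to cover the scale-out delay (3 to 5 minutes) and traffic variability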

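Reducing the instances

Minimum instance count before the change = 6
Peak Capacity Units over the last month = about 16.6
Instances needed = 16.6 / 10 = 1.66, rounded up to 2 (= 20 CU covered)
With +1 of headroom, Minimum instance count = 3 (= 30 CU) -> about 1.8x the peak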
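The sizing rule (peak CU divided by 10, rounded up, plus headroom) is easy to get wrong by one, so here is a small Python sketch of it. The function name and the `headroom_instances` parameter are hypothetical. The 10 CU per instance figure is the approximate value used in this post.

```python
import math

CU_PER_INSTANCE = 10  # approximate value used in this post


def min_instances_for_peak(peak_cu, headroom_instances=1):
    """Round the peak up to whole instances, then add headroom for scale-out delay."""
    return math.ceil(peak_cu / CU_PER_INSTANCE) + headroom_instances


peak = 16.6  # last month's peak from the example above
n = min_instances_for_peak(peak)
covered_cu = n * CU_PER_INSTANCE
print(n, covered_cu, round(covered_cu / peak, 2))  # 3 30 1.81
```

Passing `headroom_instances=0` gives the bare minimum of 2 instances, which leaves only 20 CU against a 16.6 CU peak.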

-------------------------------------------------------

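AGW's variable charge (CU) is billed on the larger of two values: the capacity actually consumed, and the minimum capacity reserved in advance (Min).

So lowering the Min setting removes the charge for reserved capacity that sat unused during idle hours, which lowers the bill. The gateway's fixed hourly charge is separate and is not affected by this setting.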
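A rough Python sketch of that "larger of the two" rule, to show where the saving comes from. The hourly usage numbers are invented for illustration, the helper name is hypothetical, and the result is in CU-hours rather than money because no prices are assumed here.

```python
CU_PER_INSTANCE = 10  # approximate value used in this post


def billable_cu_hours(hourly_used_cu, min_instances):
    """Sum, per hour, the larger of consumed CU and CU reserved by the minimum."""
    reserved_cu = min_instances * CU_PER_INSTANCE
    return sum(max(used, reserved_cu) for used in hourly_used_cu)


# Hypothetical usage for four hours: three quiet hours and one burst above 30 CU.
usage = [2.0, 1.5, 16.6, 35.0]
before = billable_cu_hours(usage, min_instances=6)  # 60 + 60 + 60 + 60
after = billable_cu_hours(usage, min_instances=3)   # 30 + 30 + 30 + 35
print(before, after)  # 240 125.0
```

The burst hour is billed on actual consumption in the second case, so the saving comes entirely from the quiet hours.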

-------------------------------------------------------

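After the change, the Access Log showed a spike in 5xx errors.

If capacity still has headroom over the peak after the reduction, the reduction is unlikely to be the cause of the 5xx errors.

The following Access Log fields show whether a 5xx came from the AGW side or the backend side.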

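1. serverRouted_s (whether the request was routed to a backend)

Has a value = AGW forwarded the request to a backend → AGW did its job
Empty or - = the request never reached a backend → AGW side or a backend health problem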

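2. error_info_s (error cause code)

ErrorInfo value	Meaning	Cause
ERRORINFO_NO_ERROR	No error on the AGW side; the backend response is passed through as-is	Backend app
ERRORINFO_UPSTREAM_CLOSED_CONNECTION	The backend closed the connection	Backend (keep-alive/timeout mismatch, app crash)
ERRORINFO_UPSTREAM_CONNECT_TIMEOUT	Timed out connecting to the backend	Backend down/network
ERRORINFO_UPSTREAM_TIMED_OUT	Timed out waiting for the backend response	Backend latency
ERRORINFO_UPSTREAM_NO_LIVE	No live backend	All backend health probes failed

A 500 logged with "ERRORINFO_NO_ERROR" is an error thrown by the backend app itself and has nothing to do with AGW.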

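Quick reading by status code (AI-generated)

500 + NO_ERROR → backend app error. Check the backend application logs
502 + UPSTREAM_CLOSED_CONNECTION → the backend closed the connection. Suspect a keep-alive timeout mismatch
502 + CONNECT_TIMEOUT/FAILED → backend down or network
503 + not routed → AGW side. Suspect no available backend or overload (← practically the only signal that the instance reduction could be the cause)
504 → backend timeout (backend latency)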
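For spot-checking a handful of log rows by hand, the quick-reading list above can be written as a small Python function. This is a sketch, not a replacement for the KQL further down: the function name and return strings are made up, and anything the list does not cover falls through to "unclassified".

```python
from typing import Optional

CONNECT_ERRORS = ("ERRORINFO_UPSTREAM_CONNECT_TIMEOUT", "ERRORINFO_UPSTREAM_CONNECT_FAILED")


def classify_5xx(status: int, error_info: str, server_routed: Optional[str]) -> str:
    """Map (httpStatus, error_info, serverRouted) to a likely source, per the list above."""
    # serverRouted is treated as "not routed" when it is missing, empty, or "-".
    routed = bool(server_routed) and server_routed != "-"
    if status == 500 and error_info == "ERRORINFO_NO_ERROR":
        return "backend: application error"
    if status == 502 and error_info == "ERRORINFO_UPSTREAM_CLOSED_CONNECTION":
        return "backend: closed the connection"
    if status == 502 and error_info in CONNECT_ERRORS:
        return "backend: down or network"
    if status == 503 and not routed:
        return "agw: no available backend or overload"
    if status == 504:
        return "backend: timeout"
    return "unclassified"


print(classify_5xx(503, "", "-"))  # agw: no available backend or overload
```

Only the 503 branch points at the gateway, which matches the note above that it is practically the only signal tied to the instance reduction.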

Reference links

learn.microsoft.com/en-us/azure/application-gateway/http-response-codes
learn.microsoft.com/en-us/azure/application-gateway/monitor-application-gateway-reference

KQL: classifying 5xx causes

let Lookback = 24h;
AzureDiagnostics
| where TimeGenerated > ago(Lookback)
| where Category == "ApplicationGatewayAccessLog"
| extend
    HttpStatusCode      = toint(column_ifexists("httpStatus_d", int(null))),
    BackendStatusCode   = toint(column_ifexists("serverStatus_d", int(null))),
    TotalTimeSec        = todouble(column_ifexists("timeTaken_d", real(null))),
    BackendLatencySec   = todouble(column_ifexists("serverResponseLatency_d", real(null))),
    ErrorInfo           = tostring(column_ifexists("error_info_s", "")),
    RequestUriNorm      = tostring(column_ifexists("requestUri_s", "")),
    ListenerNorm        = tostring(column_ifexists("listenerName_s", "")),
    HostNorm            = tostring(column_ifexists("host_s", "")),
    BackendPoolName     = tostring(column_ifexists("backendPoolName_s", "")),
    BackendServer       = tostring(column_ifexists("serverRouted_s", "")),
    BackendServerPort   = tostring(column_ifexists("backendServerPort_s", "")),
    BackendServerIP     = tostring(column_ifexists("backendServerIP_s", ""))
| where HttpStatusCode >= 500 or BackendStatusCode >= 500
| extend RoutedToBackend = isnotempty(BackendServer) and BackendServer != "-"
| extend ErrorSource = case(
    // 502: the connection to the backend failed or was dropped
    HttpStatusCode == 502 and ErrorInfo == "ERRORINFO_UPSTREAM_CLOSED_CONNECTION",
        "Backend 502: connection closed (suspect keep-alive/timeout mismatch)",
    HttpStatusCode == 502 and ErrorInfo in ("ERRORINFO_UPSTREAM_CONNECT_TIMEOUT", "ERRORINFO_UPSTREAM_CONNECT_FAILED"),
        "Backend 502: connection failed (suspect backend down/network)",
    HttpStatusCode == 502 and ErrorInfo == "ERRORINFO_UPSTREAM_NO_LIVE",
        "AGW 502: no available backend (all Unhealthy)",
    HttpStatusCode == 502,
        strcat("Backend 502: ", ErrorInfo),
    HttpStatusCode == 504,
        strcat("Backend 504: timeout (backend latency) ", ErrorInfo),
    HttpStatusCode == 500 and ErrorInfo == "ERRORINFO_NO_ERROR",
        "Backend 500 (returned directly by the app)",
    // Checked before the 503 branches so a 503 returned by the backend itself is not blamed on AGW
    BackendStatusCode >= 500,
        strcat("Backend ", tostring(BackendStatusCode), " (app response)"),
    HttpStatusCode == 503 and not(RoutedToBackend),
        "AGW 503: no available backend (suspect all Unhealthy/overload)",
    HttpStatusCode == 503,
        "AGW 503: suspect transient overload",
    HttpStatusCode >= 500 and not(RoutedToBackend),
        strcat("AGW-generated 5xx (never reached the backend) ", ErrorInfo),
    strcat("Other 5xx: ", ErrorInfo)
)
// The label prefix ("AGW" or "Backend") set above decides the likely side
| extend Likely = case(
    ErrorSource startswith "AGW",     "AGW side suspected",
    ErrorSource startswith "Backend", "Backend side cause",
    "Needs review"
)
| project
    TimeGenerated, Likely, ErrorSource,
    HttpStatusCode, BackendStatusCode, ErrorInfo, RoutedToBackend,
    TotalTimeSec, BackendLatencySec,
    ListenerNorm, HostNorm, RequestUriNorm,
    BackendPoolName, BackendServer, BackendServerPort, BackendServerIP
| order by TimeGenerated desc, TotalTimeSec desc