Implement Lambda runtime init error reporting #103
Draft
joe4dev wants to merge 13 commits into
…s API

Ports the supervisor and events API from PR #41 to enable proper error reporting when a Lambda runtime process exits unexpectedly (e.g. sys.exit() or missing wrapper script), instead of LocalStack timing out with a generic error.

- Add LocalStackSupervisor: wraps ProcessSupervisor, detects unexpected runtime-* process exits and emits SendFault(RuntimeExit) events
- Add LocalStackEventsAPI: wraps StandaloneEventsAPI, overrides SendFault to forward errors to LocalStack via SendStatus(error, ...)
- Wire both into SandboxBuilder via SetEventsAPI / SetSupervisor
- Refactor NewCustomInteropServer to accept a pre-created *LocalStackAdapter shared with the events API
- Improve SendInitErrorResponse: properly deserialises the payload, includes RequestId, and sends asynchronously (non-blocking)

Enables test_lambda_runtime_exit and test_lambda_runtime_wrapper_not_found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use *string for the RequestId field in ErrorResponse so that an empty string is serialized (not omitted by omitempty), while nil (used for fault events) stays omitted. Fixes test_lambda_runtime_error snapshot mismatch where requestId: "" was expected but absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
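The pointer-plus-omitempty behaviour this commit relies on can be checked in isolation. The sketch below is illustrative only: `errorResponse` and `encode` are hypothetical stand-ins, and only the `RequestId` field and its tag mirror what the commit describes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorResponse is a hypothetical, trimmed-down stand-in for the real
// ErrorResponse type; only the RequestId tag mirrors the commit above.
type errorResponse struct {
	ErrorMessage string  `json:"errorMessage"`
	RequestId    *string `json:"requestId,omitempty"`
}

// encode marshals r to a JSON string.
func encode(r errorResponse) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	return string(b)
}

func main() {
	empty := ""
	// Pointer to "": the field is present with an empty value.
	fmt.Println(encode(errorResponse{ErrorMessage: "boom", RequestId: &empty}))
	// prints {"errorMessage":"boom","requestId":""}

	// nil pointer: omitempty drops the field entirely.
	fmt.Println(encode(errorResponse{ErrorMessage: "boom"}))
	// prints {"errorMessage":"boom"}
}
```

With a plain `string` field and `omitempty`, both calls would drop `requestId`, which is the snapshot mismatch the commit fixes.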
- Move Init Duration after Max Memory Used in REPORT line (matches AWS)
- Add Status: timeout to REPORT line on invoke timeout
- Fix timeout error message format to "RequestId: <id> Error: Task timed out after N.00 seconds"
- Add ErrorType: "Sandbox.Timedout" to timeout error response
- Track init start time and emit Init Duration on first non-retry invocation
- Add is-init-retry field to InvokeRequest to suppress Init Duration on retry invokes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
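For reference, the timeout message format quoted in this commit can be produced with a single format string. This is a sketch under assumptions: `timeoutError`, its JSON field names, and `newTimeoutError` are hypothetical; only the message shape and the `Sandbox.Timedout` error type come from the commit text.

```go
package main

import "fmt"

// timeoutError is a hypothetical payload shape; the JSON field names are assumed.
type timeoutError struct {
	ErrorMessage string `json:"errorMessage"`
	ErrorType    string `json:"errorType"`
}

// newTimeoutError renders the message with two decimals, so a 3-second
// timeout is reported as "3.00 seconds".
func newTimeoutError(requestID string, timeoutSeconds float64) timeoutError {
	return timeoutError{
		ErrorMessage: fmt.Sprintf("RequestId: %s Error: Task timed out after %.2f seconds", requestID, timeoutSeconds),
		ErrorType:    "Sandbox.Timedout",
	}
}

func main() {
	e := newTimeoutError("req-1", 3)
	fmt.Println(e.ErrorType + ": " + e.ErrorMessage)
	// prints Sandbox.Timedout: RequestId: req-1 Error: Task timed out after 3.00 seconds
}
```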
- Close response bodies in SendStatus/SendLogs/SendResult so idle connections are released instead of leaked.
- Use errors.New instead of fmt.Errorf with no format arguments.
- Document the single-invoke assumption behind the unsynchronized initStart/warmStart fields.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve the fault request ID in the events API: prefer an explicit ID, then the current invoke ID so a mid-invocation runtime crash reports the actual request, and only synthesize a UUID as a fallback for init-phase faults where no invocation has been dispatched yet. Previously the supervisor always passed a random UUID, masking the real invoke ID.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
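The resolution order described in this commit (explicit ID, then current invoke ID, then a synthesized UUID) might look roughly like the sketch below. `resolveFaultRequestID` and `newUUID` are hypothetical names, and the UUID is generated with the standard library here rather than whatever the PR actually uses.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random version-4 UUID string built from crypto/rand.
func newUUID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// resolveFaultRequestID prefers an explicit ID, then the ID of the invocation
// in flight, and only synthesizes a UUID when neither is known (init phase).
func resolveFaultRequestID(explicitID, currentInvokeID string) string {
	if explicitID != "" {
		return explicitID
	}
	if currentInvokeID != "" {
		return currentInvokeID
	}
	return newUUID()
}

func main() {
	fmt.Println(resolveFaultRequestID("explicit-id", "invoke-id")) // prints explicit-id
	fmt.Println(resolveFaultRequestID("", "invoke-id"))            // prints invoke-id
	fmt.Println(len(resolveFaultRequestID("", "")))                // prints 36
}
```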
…pidcore

Move the Lambda init-phase timeout retry into the RIE and replace the custom supervisor with rapidcore's existing init-failure machinery.

- rapidcore: add AwaitInitializedWithDetails (structured init outcome) and AwaitInitializedWithTimeout (timer-aware; does NOT consume the init-failures channel on timeout, so the invoke path's Reserve() can still drive suppressed init). Refactor awaitInitialized into interpretInitFailure.
- main.go: on init-phase timeout (LOCALSTACK_INIT_PHASE_TIMEOUT, default 10s), emit INIT_REPORT, signal ready, and reset the in-progress init so the first invoke re-runs it (suppressed init) under the function timeout. Genuine init failures are reported via SendInitError.
- custom_interop.go: SendInitError crash-path fallback with an initErrorForwarded dedup guard, formatted as AWS's "RequestId: <id> Error: <msg>"; ReportInitTimeout + initTimedOut-driven Init Duration suppression.
- Remove the custom LocalStackSupervisor/LocalStackEventsAPI; drop unused IsInitRetry from lsapi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
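The "timer-aware, non-consuming" wait can be illustrated with a plain `select`. This is a sketch of the general technique, not rapidcore's code: `initOutcome`, `awaitInitWithTimeout`, and `demo` are hypothetical, and a buffered channel stands in for the init-failures channel.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// initOutcome is a hypothetical stand-in for the init result delivered on the channel.
type initOutcome struct{ Err error }

// awaitInitWithTimeout returns the init outcome if one arrives in time.
// On timeout nothing is received from outcomes, so a later reader still sees it.
func awaitInitWithTimeout(outcomes <-chan initOutcome, timeout time.Duration) (timedOut bool, err error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case o := <-outcomes:
		return false, o.Err
	case <-timer.C:
		return true, nil
	}
}

// demo times out once, then shows that an outcome published afterwards
// is still observable by the next waiter.
func demo() (firstTimedOut bool, laterErr error) {
	outcomes := make(chan initOutcome, 1)
	firstTimedOut, _ = awaitInitWithTimeout(outcomes, 10*time.Millisecond)
	outcomes <- initOutcome{Err: errors.New("init failed")}
	_, laterErr = awaitInitWithTimeout(outcomes, time.Second)
	return firstTimedOut, laterErr
}

func main() {
	timedOut, err := demo()
	fmt.Println(timedOut, err) // prints true init failed
}
```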
The synthetic START line was written eagerly when LocalStack dispatched /invoke, i.e. before the suppressed init re-runs the function's static code. AWS emits START upon the Invoke event reaching the runtime, which rapidcore sequences after any inline (suppressed) init (doInvoke -> sendInvokeStartLogEvent).

Emit START from a minimal LocalStackEventsAPI.SendInvokeStart override (riding rapidcore's correctly-placed invoke-start event) and drop the eager write in the /invoke handler. Correct for warm, cold, and suppressed-init invocations; fixes test_lambda_timeout_init_phase against the unmodified AWS snapshot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t it

Follow the established LocalStack-env strategy: capture the value in InitLsOpts (before UnsetLsEnvs runs) and add it to the UnsetLsEnvs list. Previously it was read inline with os.Getenv after UnsetLsEnvs, so the variable was never unset and leaked into the function's environment (forwarded via os.Environ in InitHandler), breaking AWS parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
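The capture-then-unset ordering is easy to get wrong, so here is a minimal sketch of it. The lowercase `initLsOpts` and `unsetLsEnvs` are simplified, hypothetical stand-ins for the functions named in the commit, `lsOpts` is an assumed one-field struct, and the value `"10"` is just a placeholder.

```go
package main

import (
	"fmt"
	"os"
)

const initPhaseTimeoutEnv = "LOCALSTACK_INIT_PHASE_TIMEOUT"

// lsOpts is a hypothetical, one-field stand-in for the options struct.
type lsOpts struct{ InitPhaseTimeout string }

// initLsOpts captures LocalStack-specific variables while they are still set.
func initLsOpts() lsOpts {
	return lsOpts{InitPhaseTimeout: os.Getenv(initPhaseTimeoutEnv)}
}

// unsetLsEnvs removes LocalStack-specific variables so the function's
// process does not inherit them through os.Environ.
func unsetLsEnvs() {
	for _, key := range []string{initPhaseTimeoutEnv} {
		_ = os.Unsetenv(key)
	}
}

// captureThenUnset runs both steps in the required order and reports
// the captured value and whether the variable is still visible afterwards.
func captureThenUnset(value string) (captured string, leaked bool) {
	_ = os.Setenv(initPhaseTimeoutEnv, value)
	opts := initLsOpts()
	unsetLsEnvs()
	_, leaked = os.LookupEnv(initPhaseTimeoutEnv)
	return opts.InitPhaseTimeout, leaked
}

func main() {
	captured, leaked := captureThenUnset("10")
	fmt.Println(captured, leaked) // prints 10 false
}
```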
…ckTrace)

SendInitErrorResponse round-tripped the runtime's /init/error payload through a typed struct whose stackTrace used omitempty, dropping an empty-but-present "stackTrace": [] (as AWS emits for Runtime.HandlerNotFound). Decode into a map and only inject requestId, forwarding the runtime's fields verbatim. Fixes test_lambda_handler_not_found.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
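The difference between the two round-trips can be reproduced with `encoding/json` alone. In this sketch, `typedError`, `viaStruct`, and `viaMap` are hypothetical helpers and the error message text is made up; the point is only that the struct path loses the empty `stackTrace` while the map path keeps it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typedError shows the lossy variant: omitempty drops a zero-length slice.
type typedError struct {
	ErrorMessage string   `json:"errorMessage"`
	ErrorType    string   `json:"errorType"`
	StackTrace   []string `json:"stackTrace,omitempty"`
}

// viaStruct round-trips the payload through the typed struct.
func viaStruct(payload []byte) (string, error) {
	var e typedError
	if err := json.Unmarshal(payload, &e); err != nil {
		return "", err
	}
	out, err := json.Marshal(e)
	return string(out), err
}

// viaMap keeps every field the runtime sent and only injects requestId.
func viaMap(payload []byte, requestID string) (string, error) {
	var m map[string]any
	if err := json.Unmarshal(payload, &m); err != nil {
		return "", err
	}
	m["requestId"] = requestID
	out, err := json.Marshal(m)
	return string(out), err
}

func main() {
	payload := []byte(`{"errorMessage":"handler not found","errorType":"Runtime.HandlerNotFound","stackTrace":[]}`)
	lossy, _ := viaStruct(payload)
	kept, _ := viaMap(payload, "req-1")
	fmt.Println(lossy) // stackTrace is missing
	fmt.Println(kept)  // stackTrace is still [] and requestId is added
}
```

Note that marshalling a map emits keys in sorted order, so field order may differ from the runtime's original payload even though the content is preserved.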
AWS folds a failed cold-start init into the first invocation (suppressed init), reporting it as a failed invoke with the full INIT_REPORT(phase=invoke)/START/END/REPORT envelope rather than a separate init error. Match this for on-demand:

- main.go: on init failure for on-demand, signal ready and keep the process alive instead of SendInitError+exit, so the first invoke surfaces the cached init error.
- custom_interop: skip /status/error forwarding for on-demand (cache only, so the invoke carries the error); emit INIT_REPORT Phase:invoke Status:error Error Type before START and add Status/Error Type to the REPORT, using rapidcore's scrubbed fatal error type.

PC/SnapStart/Managed Instances keep the provisioning-time model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The runtime emits multi-line records (e.g. an unhandled-init traceback) as a single log frame with internal newlines replaced by bare carriage returns. AWS renders these back as line feeds, so convert bare CR to LF in the assembled log output while preserving genuine CRLF endings (which AWS keeps, e.g. the LAMBDA_WARNING line).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…elper to its own file

The init-phase timeout / suppressed-init support previously edited the vendored upstream internal/lambda/rapidcore/server.go (exported InitCompletionResponse, extracted interpretInitFailure, added AwaitInitializedWithDetails and AwaitInitializedWithTimeout). Modifying vendored upstream files causes rebase conflicts when syncing with aws-lambda-runtime-interface-emulator.

Revert server.go to byte-identical upstream and move the only load-bearing addition (AwaitInitializedWithTimeout, plus a local InitCompletionResponse and a duplicated interpretInitFailure) into a new same-package file server_localstack.go. It must stay in package rapidcore because it needs the unexported getInitFailuresChan()/Release()/setRuntimeState() helpers, but as a standalone file it never conflicts on rebase. AwaitInitializedWithDetails was unused by cmd/localstack and is dropped.

No behavior change: RIE go test ./... and TestLambdaErrors both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Commit 9b54e66 rewrote every bare carriage return in the captured runtime output to a line feed, intending to render multi-line init tracebacks across lines. But AWS keeps a bare CR inside a single CloudWatch log event (records split on LF only), and LocalStack's log ingestion likewise splits on "\n". The conversion therefore wrongly split any record containing a bare CR into multiple events, breaking the AWS-validated TestCloudwatchLogs::test_multi_line_prints (a user `print("a\rb")` was emitted as two events "a" and "b" instead of one event "a\rb"). Emit the runtime output verbatim.

Verified: test_multi_line_prints and the full TestLambdaErrors suite (incl. the runtime-exit/segfault error-reporting tests) are both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
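To make the LF-only splitting rule concrete, the sketch below models log ingestion as a split on "\n". `splitLogEvents` is a hypothetical helper, not LocalStack's ingestion code; it only demonstrates why a bare CR must survive inside a single event.

```go
package main

import (
	"fmt"
	"strings"
)

// splitLogEvents models ingestion that splits records on LF only.
// A bare CR is ordinary content and stays inside its event.
func splitLogEvents(output string) []string {
	return strings.Split(strings.TrimSuffix(output, "\n"), "\n")
}

func main() {
	events := splitLogEvents("a\rb\nc\n")
	fmt.Printf("%d events: %q\n", len(events), events)
	// prints 2 events: ["a\rb" "c"]
}
```

Rewriting the CR to LF before this step would turn the first record into two events, which is the regression described above.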
Summary
When a Lambda runtime exits unexpectedly or throws an error during initialization, LocalStack previously received no callback and would wait until the environment timeout. This PR adds two complementary error-reporting paths so LocalStack immediately receives a structured `ErrorResponse` instead of timing out, and also adds RIE-side support for the init-phase timeout retry protocol.

Changes
cmd/localstack/events.go (new)

`LocalStackEventsAPI` wraps `telemetry.StandaloneEventsAPI` and overrides `SendInvokeStart` to emit the synthetic `START RequestId: ... Version: ...` log line at the AWS-faithful point: after any inline suppressed init (if init timed out or failed on-demand) and before the runtime handles the invocation. rapidcore calls `SendInvokeStart` at the right moment in `doInvoke` (`internal/lambda/rapid/handlers.go`), so this produces the correct AWS log ordering without any custom sequencing.

internal/lambda/rapidcore/server_localstack.go (new)

Contains LocalStack-specific additions to the rapidcore `Server` that must live in package `rapidcore` (to access unexported channels and state helpers) but are kept out of `server.go` (which is vendored from upstream and must stay byte-identical to avoid rebase conflicts):

- `InitCompletionResponse`: carries the structured init failure cause (error type and message) so callers can report it to LocalStack
- `interpretInitFailure`: maps an `InitFailure` to the sentinel error and structured cause (mirrors the upstream `awaitInitialized` body without modifying it)
- `AwaitInitializedWithTimeout`: behaves like `AwaitInitialized` but returns early if init does not complete within the given timeout; on timeout it returns `timedOut=true` without consuming the init-failures channel, so a subsequent invoke's `Reserve()`/`awaitInitialized()` can still observe the init outcome and trigger the suppressed init

cmd/localstack/custom_interop.go (modified)

New fields on `CustomInteropServer`:

- `initStart time.Time`: set in `Init()`, used to compute Init Duration
- `warmStart bool`: flipped on the first invoke; Init Duration is only emitted on the first invocation
- `initTimedOut atomic.Bool`: set by `ReportInitTimeout`; suppresses Init Duration on the folded-invoke REPORT (the timed-out init was already reported separately)
- `initErrorForwarded atomic.Bool`: set in `SendInitErrorResponse` before forwarding; prevents `SendInitError` from sending a duplicate error status for the same failed init
- `initErrorType atomic.Value`: stores rapidcore's scrubbed fatal error type (e.g. `Runtime.Unknown`) when init failed; used to render `INIT_REPORT(phase=invoke)` and REPORT `Status`/`Error Type` for the on-demand folded-into-invoke path
- `onDemand bool`: `true` for on-demand functions; controls the init-failure reporting model

New methods:

- `ReportInitTimeout()`: emits `INIT_REPORT Init Duration: ... Phase: init Status: timeout` and sets `initTimedOut`
- `SendInitError(errType, errMsg)`: crash-path fallback that reports a structured init failure to LocalStack when the runtime died without calling `/init/error` (e.g. crash, `sys.exit`, invalid entrypoint); no-op if `SendInitErrorResponse` already forwarded the error

On-demand vs PC/SnapStart/Managed Instances split in `SendInitErrorResponse`:

AWS handles init failures differently depending on the function type:

- On-demand: `SendInitErrorResponse` records the error type in `initErrorType` but does not call `/status/error`; the function signals ready so LocalStack dispatches the first invoke, which surfaces the error with the full `INIT_REPORT(phase=invoke)/START/END/REPORT` log envelope.
- PC/SnapStart/Managed Instances: the provisioning-time model is kept and the error is forwarded to LocalStack via `/status/error`.

`SendInitErrorResponse` decodes the runtime's payload into a `map[string]any` (not a typed struct) to preserve all fields verbatim, in particular an empty-but-present `"stackTrace": []` (e.g. `Runtime.HandlerNotFound`) that `omitempty` would drop on re-marshal. It then injects `requestId` and POSTs to LocalStack.

Other `custom_interop.go` changes:

- `NewCustomInteropServer` accepts a pre-created `*LocalStackAdapter` (shared with the events API in `main.go`)
- Invoke timeouts report `"RequestId: <id> Error: Task timed out after X seconds"` with `ErrorType: "Sandbox.Timedout"` and `Status: timeout` in the REPORT line
- Response bodies are closed after each adapter call (`SendStatus`, `SendLogs`, `SendResult`)

cmd/localstack/main.go (modified)

- Creates the `LocalStackAdapter` upfront and passes it to both `NewLocalStackEventsAPI` and `NewCustomInteropServer`
- Wires `lsEventsAPI` into the sandbox via `SetEventsAPI`
- Replaces `AwaitInitialized()` with `AwaitInitializedWithTimeout()` using `LOCALSTACK_INIT_PHASE_TIMEOUT` (default 10 s, overridden by LocalStack for PC/SnapStart/MI via `execution_environment.py` in localstack-pro)
- On init-phase timeout: calls `ReportInitTimeout()` and launches an async `Reset("initTimeout")` so rapidcore re-runs the init as a suppressed init on the first invocation
- On init failure for on-demand (`ErrInitDoneFailed`): signals ready and lets the first invoke surface the cached error
- On other init failures: calls `SendInitError` (no-op if `SendInitErrorResponse` already ran) and exits

cmd/localstack/awsutil.go (modified)

- `PrintEndReports` gains a `status string` parameter
- Init Duration now follows `Max Memory Used` in the REPORT line, matching the AWS field order
- `status` (e.g. `"Status: timeout"` or `"Status: error\tError Type: Runtime.Unknown"`) is appended after Init Duration

cmd/localstack/logs.go (modified)

Documents why bare CR characters in captured runtime output are intentionally not rewritten to LF: AWS splits CloudWatch log events on LF only, so `print("a\rb")` must stay as one event `"a\rb"`. Converting CR to LF here would wrongly split such records (see `test_multi_line_prints`).

internal/lsapi/types.go (modified)

`ErrorResponse.RequestId` changed from `string` to `*string` with `omitempty`: a pointer-to-empty-string serializes as `""` (required for init errors, which always include a `requestId` field), while `nil` is omitted (used when no request ID is known).

Tests
Covered by the integration tests in localstack/localstack-pro#7293:

- `test_lambda_runtime_error`
- `test_lambda_runtime_exit`: `sys.exit()` called during init
- `test_lambda_runtime_wrapper_not_found`: missing `AWS_LAMBDA_EXEC_WRAPPER` script
- `test_lambda_handler_not_found`
- `test_lambda_timeout_init_phase`

Related
Depends on #101
Closes DEVX-1