Implement Lambda runtime init error reporting #103

Draft
joe4dev wants to merge 13 commits into localstack from devx-1-implement-lambda-runtime-init-error-reporting

joe4dev commented Jun 3, 2026

Member

Summary

When a Lambda runtime exits unexpectedly or throws an error during initialization, LocalStack previously received no callback and would wait until the environment timeout. This PR adds two complementary error-reporting paths so LocalStack immediately receives a structured ErrorResponse instead of timing out, and also adds RIE-side support for the init-phase timeout retry protocol.

Changes

cmd/localstack/events.go — new

LocalStackEventsAPI wraps telemetry.StandaloneEventsAPI and overrides SendInvokeStart to emit the synthetic START RequestId: ... Version: ... log line at the AWS-faithful point: after any inline suppressed-init (if init timed out or failed on-demand) and before the runtime handles the invocation. rapidcore calls SendInvokeStart at the right moment in doInvoke (internal/lambda/rapid/handlers.go), so this produces the correct AWS log ordering without any custom sequencing.

internal/lambda/rapidcore/server_localstack.go — new

Contains LocalStack-specific additions to the rapidcore Server that must live in package rapidcore (to access unexported channels and state helpers) but are kept out of server.go (which is vendored from upstream and must stay byte-identical to avoid rebase conflicts):

  • InitCompletionResponse — carries the structured init failure cause (error type and message) so callers can report it to LocalStack
  • interpretInitFailure — maps an InitFailure to the sentinel error and structured cause (mirrors the upstream awaitInitialized body without modifying it)
  • AwaitInitializedWithTimeout — behaves like AwaitInitialized but returns early if init does not complete within the given timeout; on timeout returns timedOut=true without consuming the init-failures channel, so a subsequent invoke's Reserve()/awaitInitialized() can still observe the init outcome and trigger the suppressed init

cmd/localstack/custom_interop.go — modified

New fields on CustomInteropServer:

  • initStart time.Time — set in Init(), used to compute Init Duration
  • warmStart bool — flipped on the first invoke; Init Duration is only emitted on the first invocation
  • initTimedOut atomic.Bool — set by ReportInitTimeout; suppresses Init Duration on the folded-invoke REPORT (the timed-out init was already reported separately)
  • initErrorForwarded atomic.Bool — set in SendInitErrorResponse before forwarding; prevents SendInitError from sending a duplicate error status for the same failed init
  • initErrorType atomic.Value — stores rapidcore's scrubbed fatal error type (e.g. Runtime.Unknown) when init failed; used to render INIT_REPORT(phase=invoke) and REPORT Status/Error Type for the on-demand folded-into-invoke path
  • onDemand bool — true for on-demand functions; controls the init-failure reporting model

New methods:

  • ReportInitTimeout() — emits INIT_REPORT Init Duration: ... Phase: init Status: timeout and sets initTimedOut
  • SendInitError(errType, errMsg) — crash-path fallback: reports a structured init failure to LocalStack when the runtime died without calling /init/error (e.g. crash, sys.exit, invalid entrypoint); no-op if SendInitErrorResponse already forwarded the error
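
The dedup guard between the two reporting paths might look roughly like the following. The `reporter` type and its methods are hypothetical reductions, and using `CompareAndSwap` in the fallback is an assumption made for this sketch, not necessarily what the PR does.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reporter is a hypothetical reduction of CustomInteropServer to the dedup guard.
// The sent slice is only for demonstration and is not safe for concurrent use.
type reporter struct {
	forwarded atomic.Bool
	sent      []string
}

// forwardRuntimeError models the /init/error path: mark first, then forward.
func (r *reporter) forwardRuntimeError(msg string) {
	r.forwarded.Store(true)
	r.sent = append(r.sent, msg)
}

// sendFallback models the crash path. It reports only if nothing was forwarded
// yet, and returns whether it actually sent anything.
func (r *reporter) sendFallback(msg string) bool {
	if !r.forwarded.CompareAndSwap(false, true) {
		return false
	}
	r.sent = append(r.sent, msg)
	return true
}

func main() {
	// Runtime reported its own error first: the fallback becomes a no-op.
	reported := &reporter{}
	reported.forwardRuntimeError("error reported by runtime")
	sent := reported.sendFallback("runtime exited")
	fmt.Println(sent, len(reported.sent))

	// Runtime crashed silently: the fallback is the only report.
	crashed := &reporter{}
	sent = crashed.sendFallback("runtime exited")
	fmt.Println(sent, len(crashed.sent))
}
```

Setting the flag before forwarding, as the description notes, means a fallback racing with an in-flight forward still backs off.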

On-demand vs PC/SnapStart/Managed Instances split in SendInitErrorResponse:

AWS handles init failures differently depending on the function type:

  • On-demand: a failed cold-start init is folded into the first invocation. SendInitErrorResponse records the error type in initErrorType but does not call /status/error; the function signals ready so LocalStack dispatches the first invoke, which surfaces the error with the full INIT_REPORT(phase=invoke)/START/END/REPORT log envelope.
  • PC/SnapStart/MI: init failures are reported at provisioning time via /status/error. SendInitErrorResponse decodes the runtime's payload into a map[string]any (not a typed struct) to preserve all fields verbatim — in particular an empty-but-present "stackTrace": [] (e.g. Runtime.HandlerNotFound) that omitempty would drop on re-marshal — then injects requestId and POSTs to LocalStack.
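
The difference between round-tripping through a typed struct and through a map can be reproduced with `encoding/json` alone. The `typedError` struct and the sample payload below are illustrative assumptions, not the types used in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typedError is a hypothetical typed payload; omitempty drops an empty slice.
type typedError struct {
	ErrorType    string   `json:"errorType"`
	ErrorMessage string   `json:"errorMessage"`
	StackTrace   []string `json:"stackTrace,omitempty"`
}

// viaStruct re-marshals through the typed struct and loses "stackTrace": [].
func viaStruct(payload []byte) (string, error) {
	var e typedError
	if err := json.Unmarshal(payload, &e); err != nil {
		return "", err
	}
	out, err := json.Marshal(e)
	return string(out), err
}

// viaMap keeps every field as sent and only injects requestId.
func viaMap(payload []byte, requestID string) (string, error) {
	var m map[string]any
	if err := json.Unmarshal(payload, &m); err != nil {
		return "", err
	}
	m["requestId"] = requestID
	out, err := json.Marshal(m)
	return string(out), err
}

func main() {
	payload := []byte(`{"errorType":"Runtime.HandlerNotFound","errorMessage":"handler missing","stackTrace":[]}`)
	s, _ := viaStruct(payload)
	m, _ := viaMap(payload, "")
	fmt.Println(s) // stackTrace is gone
	fmt.Println(m) // stackTrace is still an empty array
}
```

Note that `json.Marshal` emits map keys in sorted order, so the map route preserves fields and values but not the original key order.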

Other custom_interop.go changes:

  • NewCustomInteropServer accepts a pre-created *LocalStackAdapter (shared with the events API in main.go)
  • Timeout error message format: "RequestId: <id> Error: Task timed out after X seconds" with ErrorType: "Sandbox.Timedout" and Status: timeout in the REPORT line
  • All HTTP response bodies are now closed (fixes connection leaks on SendStatus, SendLogs, SendResult)

cmd/localstack/main.go — modified

  • Creates LocalStackAdapter upfront and passes it to both NewLocalStackEventsAPI and NewCustomInteropServer
  • Wires lsEventsAPI into the sandbox via SetEventsAPI
  • Replaces AwaitInitialized() with AwaitInitializedWithTimeout() using LOCALSTACK_INIT_PHASE_TIMEOUT (default 10 s, overridden by LocalStack for PC/SnapStart/MI via execution_environment.py in localstack-pro)
  • On timeout: calls ReportInitTimeout() and launches an async Reset("initTimeout") so rapidcore re-runs the init as a suppressed init on the first invocation
  • On on-demand init failure (ErrInitDoneFailed): signals ready and lets the first invoke surface the cached error
  • On PC/SnapStart/MI failure or unexpected error: calls SendInitError (no-op if SendInitErrorResponse already ran) and exits

cmd/localstack/awsutil.go — modified

  • PrintEndReports gains a status string parameter
  • Init Duration is placed after Max Memory Used in the REPORT line, matching the AWS field order
  • status (e.g. "Status: timeout" or "Status: error\tError Type: Runtime.Unknown") is appended after Init Duration

cmd/localstack/logs.go — modified

Documents why bare CR characters in captured runtime output are intentionally not rewritten to LF: AWS splits CloudWatch log events on LF only, so print("a\rb") must stay as one event "a\rb". Converting CR to LF here would wrongly split such records (see test_multi_line_prints).
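
A small illustration of why the CR must survive: if events are split on LF only, a record containing a bare CR stays a single event. `splitEvents` is a hypothetical helper, not code from logs.go.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEvents splits captured output into log events on LF only.
// Bare carriage returns are left inside the event they belong to.
func splitEvents(output string) []string {
	trimmed := strings.TrimSuffix(output, "\n")
	if trimmed == "" {
		return nil
	}
	return strings.Split(trimmed, "\n")
}

func main() {
	// print("a\rb") followed by a second print yields two events, not three.
	for _, ev := range splitEvents("a\rb\nsecond line\n") {
		fmt.Printf("%q\n", ev)
	}
}
```

Rewriting the CR to LF before this split would turn the first record into two events, "a" and "b", which is the regression the comment in logs.go warns against.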

internal/lsapi/types.go — modified

ErrorResponse.RequestId changed from string to *string with omitempty: a pointer-to-empty-string serializes as "" (required for init errors, which always include a requestId field), while nil is omitted (used when no request ID is known).
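
The `*string` plus `omitempty` behaviour is easy to verify in isolation. The struct below is a hypothetical, reduced version of `ErrorResponse` with only two fields, and the JSON field names are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorResponse is a reduced, hypothetical version of lsapi.ErrorResponse.
type errorResponse struct {
	ErrorMessage string  `json:"errorMessage"`
	RequestId    *string `json:"requestId,omitempty"`
}

// encode marshals the response to a JSON string.
func encode(r errorResponse) (string, error) {
	out, err := json.Marshal(r)
	return string(out), err
}

func main() {
	empty := ""

	// nil pointer: the field is omitted entirely.
	withoutID, _ := encode(errorResponse{ErrorMessage: "boom"})
	fmt.Println(withoutID)

	// Pointer to an empty string: the field is present with value "".
	withEmptyID, _ := encode(errorResponse{ErrorMessage: "boom", RequestId: &empty})
	fmt.Println(withEmptyID)
}
```

With a plain `string` field, `omitempty` could not distinguish "no request ID known" from "request ID is the empty string"; the pointer makes that distinction explicit.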

Tests

Covered by the integration tests in localstack/localstack-pro#7293:

| Scenario | Test |
|---|---|
| Exception raised during module import | test_lambda_runtime_error |
| sys.exit() called during init | test_lambda_runtime_exit |
| Missing AWS_LAMBDA_EXEC_WRAPPER script | test_lambda_runtime_wrapper_not_found |
| Handler function does not exist in module | test_lambda_handler_not_found |
| Init phase exceeds 10 s → transparent retry with function timeout | test_lambda_timeout_init_phase |

Related

Depends on #101
Closes DEVX-1

Base automatically changed from localstack-api-compat-test to localstack June 9, 2026 07:13
@joe4dev joe4dev force-pushed the devx-1-implement-lambda-runtime-init-error-reporting branch from 348ee93 to ff646e5 Compare June 9, 2026 07:34
@joe4dev joe4dev changed the base branch from localstack to main June 9, 2026 07:37
@joe4dev joe4dev changed the base branch from main to localstack June 9, 2026 07:37
@joe4dev joe4dev force-pushed the devx-1-implement-lambda-runtime-init-error-reporting branch 2 times, most recently from 3ec2cde to 8137f8b Compare June 9, 2026 11:28
joe4dev and others added 9 commits June 9, 2026 14:13
…s API

Ports the supervisor and events API from PR #41 to enable proper error
reporting when a Lambda runtime process exits unexpectedly (e.g. sys.exit()
or missing wrapper script), instead of LocalStack timing out with a generic
error.

- Add LocalStackSupervisor: wraps ProcessSupervisor, detects unexpected
  runtime-* process exits and emits SendFault(RuntimeExit) events
- Add LocalStackEventsAPI: wraps StandaloneEventsAPI, overrides SendFault
  to forward errors to LocalStack via SendStatus(error, ...)
- Wire both into SandboxBuilder via SetEventsAPI / SetSupervisor
- Refactor NewCustomInteropServer to accept a pre-created *LocalStackAdapter
  shared with the events API
- Improve SendInitErrorResponse: properly deserialises the payload, includes
  RequestId, and sends asynchronously (non-blocking)

Enables test_lambda_runtime_exit and test_lambda_runtime_wrapper_not_found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use *string for the RequestId field in ErrorResponse so that an empty
string is serialized (not omitted by omitempty), while nil — used for
fault events — stays omitted. Fixes test_lambda_runtime_error snapshot
mismatch where requestId: "" was expected but absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move Init Duration after Max Memory Used in REPORT line (matches AWS)
- Add Status: timeout to REPORT line on invoke timeout
- Fix timeout error message format to "RequestId: <id> Error: Task timed out after N.00 seconds"
- Add ErrorType: "Sandbox.Timedout" to timeout error response
- Track init start time and emit Init Duration on first non-retry invocation
- Add is-init-retry field to InvokeRequest to suppress Init Duration on retry invokes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Close response bodies in SendStatus/SendLogs/SendResult so idle
  connections are released instead of leaked.
- Use errors.New instead of fmt.Errorf with no format arguments.
- Document the single-invoke assumption behind the unsynchronized
  initStart/warmStart fields.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve the fault request ID in the events API: prefer an explicit ID,
then the current invoke ID so a mid-invocation runtime crash reports the
actual request, and only synthesize a UUID as a fallback for init-phase
faults where no invocation has been dispatched yet. Previously the
supervisor always passed a random UUID, masking the real invoke ID.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pidcore

Move the Lambda init-phase timeout retry into the RIE and replace the custom
supervisor with rapidcore's existing init-failure machinery.

- rapidcore: add AwaitInitializedWithDetails (structured init outcome) and
  AwaitInitializedWithTimeout (timer-aware; does NOT consume the init-failures
  channel on timeout, so the invoke path's Reserve() can still drive suppressed
  init). Refactor awaitInitialized into interpretInitFailure.
- main.go: on init-phase timeout (LOCALSTACK_INIT_PHASE_TIMEOUT, default 10s),
  emit INIT_REPORT, signal ready, and reset the in-progress init so the first
  invoke re-runs it (suppressed init) under the function timeout. Genuine init
  failures are reported via SendInitError.
- custom_interop.go: SendInitError crash-path fallback with an initErrorForwarded
  dedup guard, formatted as AWS's "RequestId: <id> Error: <msg>"; ReportInitTimeout
  + initTimedOut-driven Init Duration suppression.
- Remove the custom LocalStackSupervisor/LocalStackEventsAPI; drop unused
  IsInitRetry from lsapi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The synthetic START line was written eagerly when LocalStack dispatched /invoke,
i.e. before the suppressed init re-runs the function's static code. AWS emits
START upon the Invoke event reaching the runtime, which rapidcore sequences after
any inline (suppressed) init (doInvoke -> sendInvokeStartLogEvent).

Emit START from a minimal LocalStackEventsAPI.SendInvokeStart override (riding
rapidcore's correctly-placed invoke-start event) and drop the eager write in the
/invoke handler. Correct for warm, cold, and suppressed-init invocations; fixes
test_lambda_timeout_init_phase against the unmodified AWS snapshot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t it

Follow the established LocalStack-env strategy: capture the value in InitLsOpts
(before UnsetLsEnvs runs) and add it to the UnsetLsEnvs list. Previously it was
read inline with os.Getenv after UnsetLsEnvs, so the variable was never unset and
leaked into the function's environment (forwarded via os.Environ in InitHandler),
breaking AWS parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckTrace)

SendInitErrorResponse round-tripped the runtime's /init/error payload through a
typed struct whose stackTrace used omitempty, dropping an empty-but-present
"stackTrace": [] (as AWS emits for Runtime.HandlerNotFound). Decode into a map and
only inject requestId, forwarding the runtime's fields verbatim.

Fixes test_lambda_handler_not_found.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joe4dev joe4dev force-pushed the devx-1-implement-lambda-runtime-init-error-reporting branch from 8137f8b to c7f0e6f Compare June 9, 2026 12:19
joe4dev and others added 4 commits June 9, 2026 14:59
AWS folds a failed cold-start init into the first invocation (suppressed init),
reporting it as a failed invoke with the full INIT_REPORT(phase=invoke)/START/END/
REPORT envelope rather than a separate init error. Match this for on-demand:

- main.go: on init failure for on-demand, signal ready and keep the process alive
  instead of SendInitError+exit, so the first invoke surfaces the cached init error.
- custom_interop: skip /status/error forwarding for on-demand (cache only, so the
  invoke carries the error); emit INIT_REPORT Phase:invoke Status:error Error Type
  before START and add Status/Error Type to the REPORT, using rapidcore's scrubbed
  fatal error type. PC/SnapStart/Managed Instances keep the provisioning-time model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The runtime emits multi-line records (e.g. an unhandled-init traceback) as a single
log frame with internal newlines replaced by bare carriage returns. AWS renders
these back as line feeds, so convert bare CR to LF in the assembled log output while
preserving genuine CRLF endings (which AWS keeps, e.g. the LAMBDA_WARNING line).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…elper to its own file

The init-phase timeout / suppressed-init support previously edited the vendored
upstream internal/lambda/rapidcore/server.go (exported InitCompletionResponse,
extracted interpretInitFailure, added AwaitInitializedWithDetails and
AwaitInitializedWithTimeout). Modifying vendored upstream files causes rebase
conflicts when syncing with aws-lambda-runtime-interface-emulator.

Revert server.go to byte-identical upstream and move the only load-bearing
addition (AwaitInitializedWithTimeout, plus a local InitCompletionResponse and a
duplicated interpretInitFailure) into a new same-package file
server_localstack.go. It must stay in package rapidcore because it needs the
unexported getInitFailuresChan()/Release()/setRuntimeState() helpers, but as a
standalone file it never conflicts on rebase. AwaitInitializedWithDetails was
unused by cmd/localstack and is dropped.

No behavior change: RIE go test ./... and TestLambdaErrors both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Commit 9b54e66 rewrote every bare carriage return in the captured runtime output to
a line feed, intending to render multi-line init tracebacks across lines. But AWS
keeps a bare CR inside a single CloudWatch log event (records split on LF only), and
LocalStack's log ingestion likewise splits on "\n". The conversion therefore wrongly
split any record containing a bare CR into multiple events, breaking the AWS-validated
TestCloudwatchLogs::test_multi_line_prints (a user `print("a\rb")` was emitted as two
events "a" and "b" instead of one event "a\rb").

Emit the runtime output verbatim. Verified: test_multi_line_prints and the full
TestLambdaErrors suite (incl. the runtime-exit/segfault error-reporting tests) are
both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>