@alexfazio
Created March 14, 2026 17:21
Codex CLI exec mode experiments: 81 flag/feature tests with raw outputs

Codex CLI Exec Mode Experiments

Date: 2026-03-13
CLI Version: 0.114.0
Total Experiments: 81

Raw experiment outputs from testing various codex exec flag combinations.

Note: Some experiment outputs include MCP (Model Context Protocol) server startup messages (e.g., mcp: flywheel starting, mcp: exa ready). These are from the author's local Codex configuration and do not affect experiment results. Your output may differ depending on your configured MCP servers.


01-baseline-text.txt

Objective: Tests the most basic codex exec invocation with a simple text prompt to confirm the command runs and returns a response. Establishes the baseline for comparing behavior across subsequent experiments.

codex exec --skip-git-repo-check "respond with the word PING and nothing else" 2>/dev/null
PING

Result: The command successfully returned PING, confirming that codex exec functions correctly with a minimal prompt and the --skip-git-repo-check flag.


02-short-alias.txt

Objective: Tests the codex e shorthand alias to confirm it is a valid substitute for codex exec. Verifies that the abbreviated form produces identical output.

codex e --skip-git-repo-check "respond with the word PING and nothing else" 2>/dev/null
PING

Result: codex e works identically to codex exec, confirming the short alias is fully functional.


03-jsonl-output.txt

Objective: Tests the --json flag to verify that codex exec emits a structured JSONL event stream instead of plain text. Establishes the baseline event sequence and token counts for comparison with later experiments.

codex exec --skip-git-repo-check --json "respond with the word PING and nothing else" 2>/dev/null
{"type":"thread.started","thread_id":"019ce6ce-65fd-7530-8e6b-9ccce0436091"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8497,"cached_input_tokens":8448,"output_tokens":51}}

NOTE: This establishes the baseline JSONL event protocol: 4 events in sequence (thread.started → turn.started → item.completed → turn.completed). The system prompt consumes ~8,497 input tokens for a minimal prompt. Cache hit rate is 99.4% (8,448/8,497), meaning the system prompt is almost entirely cached. This baseline is the reference point for token comparisons in experiments 09, 12, 14, and 37.
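
The cache hit rate quoted above can be recomputed from any turn.completed event. The following is a minimal sketch, not part of the original experiments: the function name cache_hit_rate is hypothetical, and it assumes the flat usage object shown in this experiment. It uses grep, sed, and awk rather than a JSON parser, so it would need adjusting if the event shape differs.

```shell
# Hypothetical helper: print the cached/input token ratio for the first
# turn.completed event read from stdin (JSONL as emitted by --json).
# Assumes "input_tokens" appears before "cached_input_tokens" on the line.
cache_hit_rate() {
  grep '"type":"turn.completed"' | head -n 1 |
    sed 's/.*"input_tokens":\([0-9]*\).*"cached_input_tokens":\([0-9]*\).*/\1 \2/' |
    awk '$1 > 0 { printf "%.1f%%\n", ($2 / $1) * 100 }'
}

# Sample event copied from experiment 03.
printf '%s\n' '{"type":"turn.completed","usage":{"input_tokens":8497,"cached_input_tokens":8448,"output_tokens":51}}' | cache_hit_rate
# → 99.4%
```

In a live run the same function could sit at the end of a pipeline such as codex exec --json "..." 2>/dev/null | cache_hit_rate.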


04-stdin-pipe.txt

Objective: Tests piping a prompt via stdin using the explicit - argument to confirm that codex exec reads the prompt from standard input when no positional prompt argument is given. Verifies the stdin input method works end-to-end.

echo "respond with the word PING and nothing else" | codex exec --skip-git-repo-check - 2>/dev/null
PING

Result: Piping a prompt via stdin with the - placeholder works correctly, returning the expected PING response.


05-stdin-pipe-with-argument.txt

Objective: Tests the behavior when a prompt is provided both via stdin and as a positional argument simultaneously, to determine which source takes priority. Checks whether stdin content is injected into the context when a direct argument is also present.

echo "PING" | codex exec --skip-git-repo-check --json "repeat the exact text from stdin" 2>/dev/null
{"type":"thread.started","thread_id":"019ce6d2-8938-7e80-8f6e-a51032f2cd11"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"repeat the exact text from stdin"}}
{"type":"turn.completed","usage":{"input_tokens":8494,"cached_input_tokens":4736,"output_tokens":321}}

NOTE: Stdin content was NOT received by the model when a prompt argument was also provided. The --help confirms: stdin is only read when no prompt argument is provided (or when - is used). The article's claim about combining stdin + argument (cat file.py | codex exec "review this code") appears incorrect. Finding [HIGH]: The official CLI reference confirms stdin is only read when no prompt argument is provided or when - is used explicitly. The resolve_prompt function prioritizes the direct argument. GitHub issue #1123 requested this feature (source); the CLI reference documents the current behavior (source). Recommendation: The article's combined pattern is incorrect. Use heredoc or pipe-to-dash syntax: (cat file.py; echo "review this code") | codex exec -.
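
Since stdin is ignored when a prompt argument is present, the instruction and the file contents have to travel together on stdin. A minimal sketch of that pattern follows; build_prompt is a hypothetical helper name, and placing the instruction before the file contents is an arbitrary choice (the recommendation above puts the file first).

```shell
# Hypothetical helper: emit an instruction, a blank line, then a file's
# contents, so the whole thing can be piped into `codex exec -`.
build_prompt() {
  instruction=$1
  file=$2
  printf '%s\n\n' "$instruction"
  cat -- "$file"
}

# Usage (requires the codex CLI, so it is left commented out here):
#   build_prompt "review this code" file.py | codex exec --skip-git-repo-check -
```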


06-here-doc.txt

Objective: Tests providing a prompt to codex exec via a bash heredoc to confirm the heredoc method works as a stdin input technique. Useful for multi-line prompts that are awkward to pass as a single quoted argument.

codex exec --skip-git-repo-check <<EOF 2>/dev/null
respond with the word PING and nothing else
EOF
PING

Result: The heredoc syntax works as a valid stdin input method, returning the expected PING response.


07-git-repo-check.txt

Objective: Tests the behavior of running codex exec outside a git repository without the --skip-git-repo-check flag to confirm the guard rail error is triggered. Documents the exact error message produced by the safety check.

codex exec "respond with PING" 2>&1
Not inside a trusted directory and --skip-git-repo-check was not specified.

NOTE: Error message says "Not inside a trusted directory" rather than "Not inside a git repository" as the article implies. Exit code 1. The --skip-git-repo-check flag is the correct override.


08-save-to-file.txt

Objective: Tests the -o flag to confirm that codex exec can write the last agent message to a file. Verifies whether the flag suppresses stdout output or writes to both the file and stdout simultaneously.

codex exec --skip-git-repo-check -o /tmp/codex-test-output.txt "respond with the word PING and nothing else" 2>/dev/null
cat /tmp/codex-test-output.txt
PING
--- stdout above, file contents below ---
PING

NOTE: Output appeared on both stdout and the file, confirming -o writes to file while still printing to stdout.


09-ephemeral.txt

Objective: Tests the --ephemeral flag to confirm that codex exec runs without persisting a session file to disk. Compares token counts against the baseline to measure any overhead difference from session tracking.

codex exec --skip-git-repo-check --ephemeral --json "respond with the word PING and nothing else" 2>/dev/null
{"type":"thread.started","thread_id":"019ce700-2f3b-7941-8cb9-1cf4ffc1e97a"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8156,"cached_input_tokens":7296,"output_tokens":44}}

NOTE: No session file was persisted to disk at ~/.codex/sessions/ for the thread_id. Token comparison vs baseline (exp 03): 8,156 input tokens vs 8,497 (341 fewer, ~4% reduction), and 7,296 cached vs 8,448. The lower counts suggest ephemeral mode skips session context overhead. This directly causes the behavior in experiment 48, where resuming an ephemeral session silently creates a new session because no session file exists to load.


10-color-never.txt

Objective: Tests the --color never flag to confirm that ANSI color codes are suppressed in the output. Relevant for piping codex exec output into other tools or log files where color escape sequences would be unwanted.

codex exec --skip-git-repo-check --color never "respond with the word PING and nothing else" 2>/dev/null
PING

Result: The --color never flag is accepted and produces clean plain-text output with no ANSI escape sequences.


11-set-working-dir.txt

Objective: Tests the -C and --cd flags to confirm that codex exec can change its working directory before running the agent. Verifies whether the model correctly perceives the overridden working directory.

codex exec --skip-git-repo-check -C /tmp "respond with the current working directory path only, nothing else" 2>/dev/null
codex exec --skip-git-repo-check --cd /tmp "respond with the current working directory path only, nothing else" 2>/dev/null
/tmp
/tmp

NOTE: Both -C and --cd work. The article only documents --cd.


12-model-selection.txt

Objective: Tests the -m/--model flag to confirm that codex exec can override the default model with an alternate model identifier. Compares token usage against the baseline to measure any model-specific overhead or cache invalidation.

codex exec --skip-git-repo-check --model gpt-5.4 --json "respond with the word PING and nothing else" 2>/dev/null
{"type":"thread.started","thread_id":"019ce718-aa52-7b31-98cf-c77ed01045b4"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8586,"cached_input_tokens":3456,"output_tokens":45}}

NOTE: --model gpt-5.4 works in exec mode. Token comparison vs baseline (exp 03): 8,586 vs 8,497 input tokens (+89, minor model-specific overhead). Cached tokens dropped from 8,448 to 3,456, so switching models invalidates ~59% of the prompt cache. Uncached input tokens for the first call rise from 49 (8,497 − 8,448) to 5,130 (8,586 − 3,456), roughly a 100x increase. Subsequent calls to the same model will re-warm the cache.


13-structured-output.txt

Objective: Tests the --output-schema flag to confirm that codex exec can constrain model output to a JSON schema via the OpenAI structured output feature. Identifies what schema constraints are actually required by the API versus what the article documents.

# First attempt with article's example schema (FAILED):
echo '{"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}},"required":["answer"]}' > /tmp/test-schema.json
codex exec --skip-git-repo-check --output-schema /tmp/test-schema.json -o /tmp/test-schema-result.json "What is 2+2?" 2>&1
ERROR: Invalid schema for response_format 'codex_output_schema': In context=(), 'additionalProperties' is required to be supplied and to be false.
# Second attempt with additionalProperties:false (FAILED):
echo '{"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}},"required":["answer"],"additionalProperties":false}' > /tmp/test-schema.json
codex exec --skip-git-repo-check --output-schema /tmp/test-schema.json -o /tmp/test-schema-result.json "What is 2+2?" 2>&1
ERROR: Invalid schema for response_format 'codex_output_schema': In context=(), 'required' is required to be supplied and to be an array including every key in properties. Missing 'confidence'.
# Third attempt with strict schema (SUCCESS):
echo '{"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}},"required":["answer","confidence"],"additionalProperties":false}' > /tmp/test-schema.json
codex exec --skip-git-repo-check --output-schema /tmp/test-schema.json -o /tmp/test-schema-result.json "What is 2+2? Put the answer in the answer field and confidence as a number 0-1." 2>/dev/null
cat /tmp/test-schema-result.json
{"answer":"4","confidence":1}

NOTE: The article's example schema is INCORRECT. The OpenAI API requires: (1) additionalProperties: false at the top level, and (2) ALL properties listed in required. The article's schema has "required": ["answer"] but omits "confidence", and is missing additionalProperties.
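
Both requirements (every property listed in required, and additionalProperties set to false) can be satisfied mechanically by generating the schema instead of writing it by hand. The sketch below is an illustration only: make_strict_schema is a hypothetical helper, and it handles just flat objects whose fields have primitive types.

```shell
# Hypothetical helper: build a strict JSON schema from name:type pairs.
# Every field is added to "required" and additionalProperties is false,
# which is what the third (successful) attempt above needed.
# The body runs in a subshell so its variables do not leak.
make_strict_schema() (
  props=""
  required=""
  for field in "$@"; do
    name=${field%%:*}
    type=${field#*:}
    props="${props}${props:+,}\"${name}\":{\"type\":\"${type}\"}"
    required="${required}${required:+,}\"${name}\""
  done
  printf '{"type":"object","properties":{%s},"required":[%s],"additionalProperties":false}\n' \
    "$props" "$required"
)

make_strict_schema answer:string confidence:number
```

The output for these two fields is byte-for-byte the schema used in the successful attempt, so it could be redirected to /tmp/test-schema.json and passed to --output-schema.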


14-full-auto.txt

Objective: Tests the --full-auto convenience flag to confirm it sets sandbox and approval modes as documented, and investigates whether its approval semantics behave differently in non-interactive exec mode versus interactive mode.

codex exec --skip-git-repo-check --full-auto --json "respond with the word PING and nothing else" 2>/dev/null
{"type":"thread.started","thread_id":"019ce76e-c1c9-76b2-bd23-cb3c6b3724e2"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8242,"cached_input_tokens":3456,"output_tokens":58}}
# Verifying actual modes via stderr:
codex exec --skip-git-repo-check --full-auto "respond with PING" 2>&1 | head -10
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: The official docs define --full-auto as equivalent to -a on-request -s workspace-write. However, in non-interactive exec mode, the stderr output displays approval: never — likely because there is no interactive user to prompt. The effective behavior in exec mode is auto-approval, but the flag's documented semantics are on-request. Finding [HIGH]: Confirmed by design. The non-interactive mode documentation states never is the appropriate approval setting for non-interactive runs. on-request is automatically downgraded to never in exec mode since there is no TTY to prompt (source). Recommendation: This is expected behavior; document that --full-auto approval semantics are mode-dependent.
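
Because the effective approval mode can differ from what a flag nominally requests, a script may want to read the value back from the stderr summary rather than assume it. A minimal sketch, assuming the "name: value" summary lines shown above; header_field is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of one "name: value" line from the
# configuration summary that codex exec writes to stderr.
# $1 is the field name; it is interpolated into a sed pattern, so it
# should not contain regex metacharacters.
header_field() {
  sed -n "s/^$1: //p" | head -n 1
}

# Sample summary lines copied from this experiment.
summary='approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]'

printf '%s\n' "$summary" | header_field approval
# → never

# Against the real CLI, stderr can be piped while stdout is discarded:
#   codex exec --skip-git-repo-check --full-auto "respond with PING" 2>&1 >/dev/null | header_field approval
```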


15-yolo.txt

Objective: Tests the --dangerously-bypass-approvals-and-sandbox flag and its undocumented --yolo alias to confirm both disable sandboxing entirely. Verifies the resulting approval and sandbox mode values reported by the CLI.

codex exec --skip-git-repo-check --dangerously-bypass-approvals-and-sandbox "respond with PING" 2>&1 | head -10
codex exec --skip-git-repo-check --yolo "respond with PING" 2>&1 | head -10
# --dangerously-bypass-approvals-and-sandbox:
approval: never
sandbox: danger-full-access

# --yolo:
approval: never
sandbox: danger-full-access

NOTE: Both produce identical results. --yolo is a valid alias despite not appearing in --help.


16-config-flag.txt

Objective: Tests the -c inline config override flag to confirm that codex exec accepts dotted config key-value pairs at invocation time. Verifies that the config change takes effect by testing a functionally observable setting (features.shell_tool).

codex exec --skip-git-repo-check -c features.shell_tool=false "respond with the word PING and nothing else" 2>/dev/null
PING

NOTE: -c features.shell_tool=false is accepted in exec mode.

Rerun (v0.114.0) — verifying the flag actually prevents shell execution:

# With shell_tool disabled:
codex exec --skip-git-repo-check --full-auto -c features.shell_tool=false --json \
  "run the command 'echo SHELL_TEST_SUCCESS' and tell me the output" 2>/dev/null
# → "I can't directly execute shell commands in this session."

# Control (shell_tool enabled, default):
codex exec --skip-git-repo-check --full-auto --json \
  "run the command 'echo SHELL_TEST_SUCCESS' and tell me the output" 2>/dev/null
# → Ran the command and reported: "SHELL_TEST_SUCCESS"

Confirmed: features.shell_tool=false prevents the agent from executing shell commands. The model acknowledges it lacks access and falls back to predicting the output. With the default (enabled), the model ran the command directly.


17-sandbox-explicit.txt

Objective: Tests the -s/--sandbox flag with an explicit sandbox mode value to confirm direct sandbox selection works in exec mode. Cross-checks the resulting writable path set against the --full-auto output from experiment 14 to verify the two flags are equivalent.

codex exec --skip-git-repo-check --sandbox workspace-write "respond with PING" 2>&1 | head -10
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: --sandbox workspace-write works in exec mode. The writable path set (workdir, /tmp, $TMPDIR, ~/.codex/memories) is identical to what --full-auto produces in experiment 14, confirming that --full-auto is a pure alias for --sandbox workspace-write + approval settings — it does not add extra writable paths.


18-add-dir.txt

Objective: Tests the --add-dir flag to confirm that codex exec can extend the sandbox's writable directory set with an additional path. Verifies the specified directory appears in the sandbox configuration reported on stderr.

codex exec --skip-git-repo-check --full-auto --add-dir /tmp/output "respond with PING" 2>&1 | grep sandbox
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /tmp/output, /Users/codex-user/.codex/memories]

NOTE: /tmp/output appears in writable paths.


19-reasoning-effort.txt

Objective: Tests the -c model_reasoning_effort config override to confirm that reasoning effort can be set at invocation time. Verifies the value is reflected in the CLI's stderr configuration summary.

codex exec --skip-git-repo-check -c model_reasoning_effort=high "respond with PING" 2>&1 | grep reasoning
reasoning effort: high
reasoning summaries: none

NOTE: -c model_reasoning_effort=high works in exec mode. Experiment 20 reveals the default reasoning effort is xhigh, meaning this experiment actually downgraded reasoning from the default. The default reasoning summaries: none is also visible. See experiment 20 for the cross-confirmation.


20-reasoning-summary.txt

Objective: Tests the -c model_reasoning_summary config override to confirm that reasoning summary verbosity can be set at invocation time. Also reveals the default reasoning effort level as a side effect of only changing the summary setting.

codex exec --skip-git-repo-check -c model_reasoning_summary=detailed "respond with PING" 2>&1 | grep reasoning
reasoning effort: xhigh
reasoning summaries: detailed

NOTE: -c model_reasoning_summary=detailed works in exec mode. The output confirms the default reasoning effort is xhigh (this experiment only changed the summary setting, leaving effort at default). Cross-references: experiment 19 shows that setting model_reasoning_effort=high is a downgrade from this default; experiment 61 shows that model_reasoning_summary=detailed causes a reasoning item type to appear in the JSONL stream.


21-experimental-json.txt

Objective: Tests the --experimental-json flag as a potential alias for --json to determine whether it produces the same JSONL event stream output. Verifies whether the flag is recognized despite not appearing in --help.

codex exec --skip-git-repo-check --experimental-json "respond with PING" 2>/dev/null
{"type":"thread.started","thread_id":"019ce793-08b2-7ae2-9561-3ac7618cd884"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8151,"cached_input_tokens":3456,"output_tokens":80}}

NOTE: --experimental-json works despite not appearing in --help. Produces identical JSONL output to --json. Finding [HIGH]: The official CLI reference documents --json, --experimental-json as equivalent aliases for JSONL output. The flag is officially supported, not undocumented — it simply may not appear in local --help output (source). Recommendation: Treat --experimental-json as a supported alias; prefer --json for clarity.


22-ask-for-approval.txt

Objective: Tests the -a/--ask-for-approval flag placement relative to the exec subcommand to determine whether it is a subcommand flag or a global flag. Confirms the correct syntax required to set approval policy in exec mode.

codex exec --skip-git-repo-check --ask-for-approval on-request "respond with PING" 2>&1 | head -5
codex exec --skip-git-repo-check -a on-request "respond with PING" 2>&1 | head -5
error: unexpected argument '--ask-for-approval' found
error: unexpected argument '-a' found

NOTE: Neither --ask-for-approval nor -a was recognized when placed after exec. -a is a global flag that must precede the exec subcommand. Finding [HIGH]: -a / --ask-for-approval is a global flag that must precede the exec subcommand: codex -a never exec "prompt". It cannot follow exec. The exec subcommand parser has its own flag set and does not include approval flags (source). Recommendation: Use codex -a never exec ... syntax or -c approval_policy=never to set approval in exec mode.

Retest result (v0.114.0):

codex -a on-request exec --skip-git-repo-check "respond with PING" 2>&1 | grep -E "approval|sandbox"
approval: never
sandbox: read-only

Confirmed: global -a on-request is accepted (no error), but exec mode still downgrades to approval: never since there is no interactive user to prompt. This matches the exec-mode downgrade behavior seen in experiment 14.
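
One way to avoid misplacing the global flag is a small wrapper that always emits -a before exec. This is a sketch under stated assumptions: codex_exec_with_approval is a hypothetical function, and CODEX_BIN is a hypothetical variable introduced here only so the argument order can be inspected without the CLI installed.

```shell
# Hypothetical wrapper: place the global -a flag before the exec subcommand.
# $1 is the approval policy; remaining arguments are passed to exec.
# CODEX_BIN is an assumption of this sketch (defaults to codex).
codex_exec_with_approval() {
  policy=$1
  shift
  "${CODEX_BIN:-codex}" -a "$policy" exec "$@"
}

# Dry run: substitute echo for the CLI to print the resulting arguments.
(CODEX_BIN=echo; codex_exec_with_approval never --skip-git-repo-check "respond with PING")
# → -a never exec --skip-git-repo-check respond with PING
```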


23-resume-session.txt

Objective: Tests session resumption using exec resume --last and resume <thread_id> to verify that conversation history persists across separate codex exec invocations.

# Step 1: Create a session
codex exec --skip-git-repo-check --json "respond with SESSION_TEST_1" 2>/dev/null
{"type":"thread.started","thread_id":"019ce799-d0a4-7be0-b32a-bb903ab16d2b"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"SESSION_TEST_1"}}
{"type":"turn.completed","usage":{"input_tokens":8153,"cached_input_tokens":3456,"output_tokens":63}}
# Step 2: Resume with --last
codex exec --skip-git-repo-check resume --last "respond with SESSION_TEST_2" 2>/dev/null
SESSION_TEST_2
# Step 3: Resume by specific ID
codex exec --skip-git-repo-check resume 019ce799-d0a4-7be0-b32a-bb903ab16d2b "respond with SESSION_TEST_3" 2>/dev/null
SESSION_TEST_3

NOTE: All resume variants work. --skip-git-repo-check must come before the resume subcommand.

Rerun (v0.114.0) — verifying session continuity with --json:

# Step 1: Create session
codex exec --skip-git-repo-check --json "respond with SESSION_STEP_1" 2>/dev/null
# → thread_id: 019cec77-af02-7ef3-8f7c-f8b82eb928de, input_tokens: 10,144

# Step 2: Resume by ID with --json
codex exec --skip-git-repo-check resume 019cec77-af02-7ef3-8f7c-f8b82eb928de --json "respond with SESSION_STEP_2" 2>/dev/null
# → thread_id: 019cec77-af02-7ef3-8f7c-f8b82eb928de (SAME), input_tokens: 20,364

Confirmed: the JSONL thread.started event reports the same thread_id on resume, proving session continuity. Input tokens doubled from 10,144 to 20,364, proving conversation history accumulates across resume calls.
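
Resuming by ID in a script requires pulling the thread_id out of the first JSONL event. A minimal sketch follows; thread_id_from_jsonl is a hypothetical helper, and it assumes "type" precedes "thread_id" on the line, as in the outputs recorded here.

```shell
# Hypothetical helper: print the thread_id from the first thread.started
# event read from stdin (JSONL as emitted by --json).
thread_id_from_jsonl() {
  sed -n 's/.*"type":"thread\.started".*"thread_id":"\([^"]*\)".*/\1/p' | head -n 1
}

# Sample event copied from experiment 03.
printf '%s\n' '{"type":"thread.started","thread_id":"019ce6ce-65fd-7530-8e6b-9ccce0436091"}' | thread_id_from_jsonl
# → 019ce6ce-65fd-7530-8e6b-9ccce0436091

# Against the real CLI (left commented out because it needs codex):
#   tid=$(codex exec --skip-git-repo-check --json "respond with SESSION_STEP_1" 2>/dev/null | thread_id_from_jsonl)
#   codex exec --skip-git-repo-check resume "$tid" "respond with SESSION_STEP_2" 2>/dev/null
```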


24-image-input.txt

Objective: Tests the -i flag for attaching image files to exec prompts, verifying the required argument ordering (prompt must precede -i flags).

# Article's pattern (image BEFORE prompt) — FAILS:
codex exec --skip-git-repo-check -i /tmp/test-pixel.png "describe this image" 2>&1
Reading prompt from stdin...
No prompt provided via stdin.
# Correct pattern (prompt BEFORE image) — WORKS:
codex exec --skip-git-repo-check "describe what you see in this image in one sentence" -i /tmp/test-pixel.png 2>/dev/null
It appears to be a 1x1 PNG showing a single black pixel.

NOTE: In codex exec, the prompt argument must come BEFORE the -i flag: codex exec "prompt" -i image.png. The article's pattern codex exec -i screenshot.png "Explain this error" (image before prompt) fails in exec mode. Note that the official docs show codex -i screenshot.png "prompt" (image first) for the interactive codex command, so argument ordering may differ between interactive and exec modes. Finding [MEDIUM]: PR #10709 ('fix: ensure resume args precede image args') fixes the SDK to emit resume arguments before image flags when calling the CLI, confirming that argument ordering matters. The exec subcommand's positional argument parser requires the prompt before flags like -i. The interactive command has a more flexible parser (source). Recommendation: Always use codex exec "prompt" -i image.png ordering in exec mode.
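
The ordering rule can be baked into a wrapper so callers cannot get it wrong. The sketch below is illustrative only: codex_exec_image is a hypothetical function handling a single image, and CODEX_BIN is a hypothetical variable used here so the argument order can be checked without the CLI.

```shell
# Hypothetical wrapper: always emit the prompt before the -i flag.
# $1 is the prompt, $2 is the image path; any further arguments are
# treated as exec flags and placed before the prompt.
# CODEX_BIN is an assumption of this sketch (defaults to codex).
codex_exec_image() {
  prompt=$1
  image=$2
  shift 2
  "${CODEX_BIN:-codex}" exec "$@" "$prompt" -i "$image"
}

# Dry run: substitute echo for the CLI to print the resulting arguments.
(CODEX_BIN=echo; codex_exec_image "describe this image" /tmp/test-pixel.png --skip-git-repo-check)
# → exec --skip-git-repo-check describe this image -i /tmp/test-pixel.png
```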


25-model-shorthand.txt

Objective: Tests model shorthand strings with the -m flag to verify alternate model selection in exec mode.

codex exec --skip-git-repo-check -m gpt-5.3-codex "respond with PING" 2>&1 | grep model
model: gpt-5.3-codex

NOTE: -m works as shorthand for --model. The article uses --model throughout.


26-capture-to-variable.txt

Objective: Tests capturing codex exec stdout into a shell variable using command substitution, verifying that 2>/dev/null is needed for clean capture.

response=$(codex exec --skip-git-repo-check "respond with the word PING and nothing else" 2>/dev/null)
echo "Captured: $response"
Captured: PING

NOTE: Shell variable capture works with codex exec. The 2>/dev/null redirect is critical for clean capture — without it, stderr progress output (session metadata, model name, sandbox status — see experiment 41) would mix into the variable.
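
Discarding stderr keeps the capture clean but also throws away diagnostics. An alternative, sketched below, sends stderr to a log file instead of /dev/null. run_quiet and fake_codex are hypothetical names; fake_codex merely stands in for codex exec so the sketch runs without the CLI.

```shell
# Hypothetical helper: run a command with stderr redirected to a log file,
# leaving stdout untouched so $(...) captures only the reply.
# $1 is the log path; remaining arguments are the command to run.
run_quiet() {
  stderr_log=$1
  shift
  "$@" 2>"$stderr_log"
}

# Stand-in for codex exec: progress on stderr, reply on stdout.
fake_codex() {
  echo "model: gpt-5.3-codex" >&2
  echo "PING"
}

log=$(mktemp)
response=$(run_quiet "$log" fake_codex)
echo "Captured: $response"
# → Captured: PING
```

With the real CLI the call would look like response=$(run_quiet "$log" codex exec --skip-git-repo-check "..."), and the log file can be inspected if the exit status is non-zero.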


27-pipe-to-tee.txt

Objective: Tests piping codex exec output through tee to simultaneously display on stdout and write to a file.

codex exec --skip-git-repo-check "respond with the word PING and nothing else" 2>/dev/null | tee /tmp/codex-tee-test.txt
cat /tmp/codex-tee-test.txt
PING
--- tee file ---
PING

Result: Output is correctly duplicated to both stdout and the tee output file.


28-stdin-redirect.txt

Objective: Tests redirecting file content into codex exec via stdin (< file) as an alternative to piping or positional arguments.

echo "respond with the word PING and nothing else" > /tmp/codex-prompt.txt
codex exec --skip-git-repo-check < /tmp/codex-prompt.txt 2>/dev/null
PING

Result: Stdin file redirection works correctly as an alternative prompt input method.


29-jq-extraction.txt

Objective: Tests extracting specific fields from --json JSONL output using jq, targeting both agent response text and token usage metadata.

codex exec --skip-git-repo-check --json "respond with PING" 2>/dev/null | jq 'select(.type == "item.completed") | .item.text'
"PING"
codex exec --skip-git-repo-check --json "respond with PING" 2>/dev/null | jq 'select(.type == "turn.completed") | .usage'
{
  "input_tokens": 8151,
  "cached_input_tokens": 3456,
  "output_tokens": 47
}

NOTE: Both jq extraction patterns from the article work correctly.


30-sandbox-shorthand.txt

Objective: Tests shorthand sandbox mode names and the -s flag alias with codex exec.

codex exec --skip-git-repo-check -s read-only "respond with PING" 2>&1 | grep sandbox
sandbox: read-only

NOTE: -s works as shorthand for --sandbox.


31-codex-api-key.txt

Objective: Tests the CODEX_API_KEY environment variable as an alternative authentication method.

CODEX_API_KEY=fake-key-12345 codex exec --skip-git-repo-check "respond with PING" 2>&1 | tail -3
Reconnecting... 5/5 (unexpected status 401 Unauthorized: Incorrect API key provided: fake-key**2345.)
ERROR: unexpected status 401 Unauthorized: Incorrect API key provided: fake-key**2345.

NOTE: CODEX_API_KEY is accepted and overrides stored auth. Fails with 401 when given a fake key, confirming it is used for authentication.


32-git-repo-baseline.txt

Objective: Tests codex exec inside a git repository without --skip-git-repo-check. A temporary git repo was initialized in the article directory for experiments 32-36, then removed.

git init && git add -A && git commit -m "temp: init for codex exec testing"
codex exec "respond with the word PING and nothing else" 2>/dev/null
PING

NOTE: Works without --skip-git-repo-check when inside a git repo.


33-full-auto-git-repo.txt

Objective: Tests --full-auto inside a git repository to confirm it works without --skip-git-repo-check.

codex exec --full-auto "respond with the word PING and nothing else" 2>/dev/null
PING

NOTE: --full-auto works inside a git repo without --skip-git-repo-check, same as baseline exec mode (experiment 32). This confirms --skip-git-repo-check is only needed outside a git repo — inside one, exec mode trusts the directory by default.
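
A script that runs both inside and outside repositories could add the flag only when needed. The sketch below assumes that "trusted directory" coincides with "inside a git work tree", which is what experiments 07 and 32 suggest but which may not cover every trust configuration; repo_check_flag is a hypothetical helper name.

```shell
# Hypothetical helper: print --skip-git-repo-check only when the given
# directory (default: current directory) is not inside a git work tree.
repo_check_flag() {
  dir=${1:-.}
  if git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    return 0
  fi
  printf '%s\n' '--skip-git-repo-check'
}

# Usage (left commented out because it needs codex). The substitution is
# deliberately unquoted so that an empty result adds no argument:
#   codex exec $(repo_check_flag) "respond with the word PING and nothing else"
```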


34-git-log-pipe.txt

Objective: Tests piping git log output into codex exec via stdin using the - argument.

(echo "Write a one-line commit message for these changes:"; git log --oneline -3) | codex exec - 2>/dev/null
`chore: initialize Codex Exec testing setup`

NOTE: Piping git output as context via stdin with - works correctly.


35-workspace-write-test.txt

Objective: Tests the workspace-write sandbox mode to confirm it permits file write operations.

codex exec --full-auto --sandbox workspace-write "create a file called /tmp/codex-write-test.txt with the content WRITE_TEST" 2>/dev/null
cat /tmp/codex-write-test.txt
WRITE_TEST

NOTE: Workspace-write sandbox allows writing to /tmp (included in writable paths).


36-read-only-sandbox-blocks-writes.txt

Objective: Tests that the read-only sandbox blocks filesystem write attempts by the agent.

codex exec --sandbox read-only "create a file called /tmp/codex-readonly-test.txt with the content SHOULD_FAIL" 2>/dev/null
Couldn't create it in this environment due filesystem restrictions (`read-only` sandbox).
Attempted command: `printf 'SHOULD_FAIL' > /tmp/codex-readonly-test.txt`
Result: `zsh:1: operation not permitted: /tmp/codex-readonly-test.txt`

NOTE: Read-only sandbox correctly blocks file writes. The agent attempted the write but the sandbox prevented it.


Research-Discovered Flags & Combinations (37–66)

Discovered via official docs research: Codex CLI Reference, Config Reference, Advanced Config, Non-Interactive Mode, codex exec --help v0.114.0.


37-enable-disable-feature-flags.txt

Objective: Tests --enable and --disable flags for toggling feature flags at runtime, measuring the impact on system prompt token counts.

# --enable is equivalent to -c features.<name>=true
codex exec --skip-git-repo-check --enable multi_agent --json "respond with PING" 2>/dev/null

# --disable is equivalent to -c features.<name>=false
codex exec --skip-git-repo-check --disable shell_tool "respond with PING" 2>/dev/null
{"type":"thread.started","thread_id":"019ce7d6-2410-77b1-889a-64bbfdffd691"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":10140,"cached_input_tokens":4480,"output_tokens":19}}
PING

NOTE: Both --enable and --disable work in exec mode. --enable multi_agent increased input_tokens from the ~8,200 typical of other runs (e.g., experiments 14 and 21) to 10,140, a ~1,940 token increase (24%), likely caused by additional multi-agent system instructions being loaded into the prompt. Measured against the experiment 03 baseline of 8,497, the increase is 1,643 tokens (19%). --disable shell_tool produced identical output to -c features.shell_tool=false (experiment 16).

Rerun (v0.114.0) — --disable shell_tool with --json for token comparison:

codex exec --skip-git-repo-check --disable shell_tool --json "respond with PING" 2>/dev/null
# → input_tokens: 9,702, cached_input_tokens: 2,560

Token comparison (same session, all with --json):

  • Baseline (exp 03): 8,497 input tokens
  • --enable multi_agent: 10,140 input tokens (+1,643 from baseline, +19%)
  • --disable shell_tool: 9,702 input tokens (+1,205 from baseline)

The --disable shell_tool result has more input tokens than baseline, which is unexpected. This may reflect MCP tool registration or session state differences rather than the feature flag itself. The low cache hit (2,560 vs 8,448 baseline) confirms the system prompt changed substantially.
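
The deltas in the comparison above are simple arithmetic, and the same calculation applies to any pair of runs. A minimal sketch, with token_delta as a hypothetical helper name:

```shell
# Hypothetical helper: print the signed difference and percentage change
# between a baseline token count ($1) and a new token count ($2).
token_delta() {
  awk -v base="$1" -v new="$2" 'BEGIN {
    if (base == 0) exit 1          # avoid division by zero
    diff = new - base
    printf "%+d (%+.1f%%)\n", diff, (diff / base) * 100
  }'
}

token_delta 8497 10140   # --enable multi_agent vs experiment 03
# → +1643 (+19.3%)
token_delta 8497 9702    # --disable shell_tool vs experiment 03
# → +1205 (+14.2%)
```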


38-profile-flag.txt

Objective: Tests the -p/--profile flag for loading a named configuration profile, verifying that a missing profile produces a clear error.

codex exec --skip-git-repo-check --profile nonexistent-profile "respond with PING" 2>&1 | head -5
Error: config profile `nonexistent-profile` not found

NOTE: --profile / -p works in exec mode. Returns a clear error for missing profiles. To test fully, create a [profiles.test-profile] section in ~/.codex/config.toml. Note: -p is the profile flag in codex (not to be confused with Claude CLI's -p for print mode).


39-search-flag.txt

Objective: Tests the --search flag placement with codex exec to determine whether it works as an exec subcommand flag or only as a global flag.

codex exec --skip-git-repo-check --search "what is the current weather in San Francisco?" 2>&1 | head -5
error: unexpected argument '--search' found

  tip: to pass '--search' as a value, use '-- --search'

Usage: codex exec --skip-git-repo-check [PROMPT]

NOTE: --search is a global flag that must precede the exec subcommand (like -a, see experiment 22). Placing it after exec produces an error. The syntax codex --search exec "prompt" may work but was not tested. The -c web_search=live inline config is a reliable alternative (see experiment 58).


40-oss-flag.txt

Objective: Tests the --oss flag for routing exec requests to a local open-source model provider instead of the OpenAI API.

codex exec --skip-git-repo-check --oss "respond with PING" 2>&1 | head -10
codex exec --skip-git-repo-check --oss --local-provider ollama "respond with PING" 2>&1 | head -10
Error: No default OSS provider configured. Use --local-provider=provider or set oss_provider to one of: lmstudio, ollama in config.toml

Error: OSS setup failed: No running Ollama server detected. Start it with: `ollama serve` (after installing). Install instructions: https://github.com/ollama/ollama?tab=readme-ov-file#ollama

NOTE: Both --oss and --local-provider work in exec mode. Without a provider configured, --oss gives a helpful error listing options. With --local-provider ollama, it checks for a running server and gives install instructions if not found.


41-progress-cursor.txt

Objective: Tests the --progress-cursor flag to determine whether it changes the stderr progress display format.

codex exec --skip-git-repo-check --progress-cursor "respond with PING" 2>&1
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /Users/codex-user/Documents/cc-inbox/codex-exec-article
model: gpt-5.3-codex
provider: openai
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019ce7d6-5962-7f21-9f20-95ebe6504c32
--------
user
respond with the word PING and nothing else
codex
PING
tokens used
4,719
PING

NOTE: --progress-cursor is meant to switch the stderr progress display to cursor-based updates (overwriting lines in place) rather than appending. Any difference would only be visible in a live terminal; the retest below found no ANSI cursor control sequences when the output was captured through a pipe.

Retest result (v0.114.0):

# ANSI escape count comparison:
codex exec --skip-git-repo-check --progress-cursor "respond with PING" 2>&1 | cat -v | grep -c '\^\[' # → 0
codex exec --skip-git-repo-check "respond with PING" 2>&1 | cat -v | grep -c '\^\['               # → 0

# Diff between both outputs:
diff <(codex exec ... --progress-cursor ... 2>&1 | cat -v) <(codex exec ... 2>&1 | cat -v)
# Only differences: session ID and token count — no ANSI escape sequences in either mode

No ANSI escape differences detected when captured to a pipe. Both modes produce identical output (aside from session ID and token count). --progress-cursor has no observable effect when stdout/stderr are piped (not connected to a TTY). For exec-mode scripting where output is always captured, this flag is irrelevant.
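The escape-sequence check used in the retest can be wrapped in a small helper. This is a sketch that treats any line containing an ESC byte (0x1B) as ANSI-formatted; count_ansi_lines is an illustrative name, not a Codex feature.

```shell
#!/usr/bin/env bash
# count_ansi_lines: reads a stream on stdin and prints how many lines
# contain an ESC byte (0x1B), the lead byte of ANSI control sequences.
# Illustrative helper name; not part of the Codex CLI.
count_ansi_lines() {
  local esc
  esc="$(printf '\033')"
  # grep -c exits non-zero when the count is 0, so mask the status.
  grep -c "$esc" || true
}

# Self-check on synthetic input: one plain line, one colored line.
printf 'plain\n\033[31mred\033[0m\n' | count_ansi_lines
```

Usage against a real run would look like codex exec --skip-git-repo-check --progress-cursor "respond with PING" 2>&1 | count_ansi_lines, which matches the piped (non-TTY) conditions of the retest.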


42-exec-review.txt

Objective: Tests the codex exec review subcommand, first without flags and then with --uncommitted. A temporary git repo was initialized for this test, then removed.

# Without flags:
codex exec review 2>&1

# With --uncommitted and a staged change:
codex exec review --uncommitted 2>&1
# Without flags:
Error: Specify --uncommitted, --base, --commit, or provide custom review instructions

# With --uncommitted (added "# test" to a file):
Review comment:

- [P3] Remove leftover test heading from documentation — codex-exec-vs-claude-print.md:37-37
  The added `# test` line appears to be a temporary marker rather than intentional content,
  and it will render as a real heading in the published doc.

NOTE: codex exec review is a dedicated code review subcommand. It requires one of: --uncommitted (review working tree changes), --base <branch> (review changes against a branch), --commit <sha> (review a specific commit), or custom instructions. It runs git status and git diff automatically, then produces structured review comments, each tagged with a priority level ([P3] in the output above; other levels were not observed in this test). Requires a git repository.
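Because each review comment carries a [P<n>] tag, the output can be turned into a simple pass/fail gate. The sketch below is hypothetical: it assumes the tag format shown in the output above and assumes that a lower number means a more severe finding, which this experiment did not verify.

```shell
#!/usr/bin/env bash
# worst_priority: reads review output on stdin and prints the lowest P-number
# found in "[P<n>]" tags, or nothing if there are no tags.
# Assumption: a lower number means a more severe finding.
worst_priority() {
  grep -oE '\[P[0-9]+\]' | sed 's/[^0-9]//g' | sort -n | head -n 1
}

# review_gate MAX_ALLOWED: fails if any finding is more severe than allowed,
# e.g. `review_gate 2` fails when a P0 or P1 tag is present.
review_gate() {
  local worst
  worst="$(worst_priority)"
  [ -z "$worst" ] || [ "$worst" -ge "$1" ]
}

# Self-check using the tag from the review comment captured above.
printf '%s\n' '- [P3] Remove leftover test heading from documentation' | worst_priority
```

In a script this could be used as codex exec review --uncommitted 2>&1 | review_gate 2, failing the step only when a finding more severe than P2 is reported.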


43-resume-all.txt

Tests resume --all --last to verify the --all flag enables session lookup across all directories.

codex exec --skip-git-repo-check resume --all --last "respond with PING" 2>/dev/null
PING

NOTE: resume --all --last works. The --all flag broadens session search beyond the current working directory, finding sessions created in any directory.


44-json-plus-output-schema.txt

Tests whether --json and --output-schema can be used together in a single invocation. Verifies that the JSONL event stream and schema-constrained JSON output are produced simultaneously without conflict.

echo '{"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}},"required":["answer","confidence"],"additionalProperties":false}' > /tmp/test-schema-combo.json
codex exec --skip-git-repo-check --json --output-schema /tmp/test-schema-combo.json "What is 2+2? Put the answer in the answer field and confidence as a number 0-1." 2>/dev/null
{"type":"thread.started","thread_id":"019ce7d7-5073-7573-9544-fb97785ce226"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"{\"answer\":\"4\",\"confidence\":1}"}}
{"type":"turn.completed","usage":{"input_tokens":8202,"cached_input_tokens":2432,"output_tokens":135}}

NOTE: --json + --output-schema combine successfully. The JSONL stream's item.completed event carries the schema-conforming JSON as a string in .item.text. The two flags are therefore orthogonal: --json controls the event framing, --output-schema constrains the message content, and the combination the article calls "schema-conforming JSONL" works as described.
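Since the schema-conforming answer arrives as a JSON string nested inside a JSONL event, consuming it takes two decoding steps. A minimal sketch with jq (assumed to be installed, as elsewhere in these experiments); final_answer_field is an illustrative helper name.

```shell
#!/usr/bin/env bash
# final_answer_field FIELD
# Reads a `codex exec --json` event stream on stdin, takes the text of each
# agent_message item, parses that text as JSON, and prints one field from it.
# Requires jq. Illustrative helper name; not part of the Codex CLI.
final_answer_field() {
  jq -r --arg field "$1" '
    select(.type == "item.completed" and .item.type == "agent_message")
    | .item.text      # a JSON document encoded as a string
    | fromjson        # decode it into an object
    | .[$field]
  '
}

# Self-check using the item.completed line captured above.
sample='{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"{\"answer\":\"4\",\"confidence\":1}"}}'
printf '%s\n' "$sample" | final_answer_field answer
```

Piping the full command from this experiment into final_answer_field confidence would extract the numeric field the same way.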


45-json-plus-output-file.txt

Tests whether --json and -o (output file) can be combined, and clarifies what each flag captures. Determines whether -o writes the raw JSONL stream or just the final plain-text agent message.

codex exec --skip-git-repo-check --json -o /tmp/json-output-test.txt "respond with PING" 2>/dev/null
cat /tmp/json-output-test.txt
{"type":"thread.started","thread_id":"019ce7d7-a28a-7db0-a8df-370ec87e1616"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"PING"}}
{"type":"turn.completed","usage":{"input_tokens":8156,"cached_input_tokens":3456,"output_tokens":87}}
--- file contents ---
PING

NOTE: When --json and -o are combined, stdout gets the full JSONL event stream, but -o writes only the final plain-text message (not JSONL). The -o flag always captures the last agent message as text regardless of --json.


46-full-auto-plus-sandbox-read-only.txt

Tests what happens when --full-auto and --sandbox read-only are specified together to determine which flag takes precedence. Checks whether the preset composite behavior of --full-auto overrides an explicit --sandbox argument.

codex exec --skip-git-repo-check --full-auto --sandbox read-only "respond with PING" 2>&1 | grep -E "sandbox|approval"
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: --full-auto overrides --sandbox read-only. The sandbox remains workspace-write despite the explicit --sandbox read-only flag. --full-auto takes precedence for the sandbox mode. Finding [MEDIUM]: The sandboxing docs define --full-auto as a composite preset (sandbox_mode=workspace-write + approval_policy=on-request). It acts as an atomic override, not combinable with individual --sandbox flags (source). Recommendation: Do not combine --full-auto with --sandbox. Use --full-auto alone for workspace-write, or --sandbox <mode> alone for other sandbox levels.


47-full-auto-plus-sandbox-danger.txt

Tests whether pairing --full-auto with --sandbox danger-full-access allows escalation to an unrestricted sandbox. Determines if --full-auto also overrides an attempt to increase, rather than decrease, sandbox permissiveness.

codex exec --skip-git-repo-check --full-auto --sandbox danger-full-access "respond with PING" 2>&1 | grep -E "sandbox|approval"
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: --full-auto also overrides --sandbox danger-full-access. Even when trying to escalate, --full-auto locks the sandbox to workspace-write. To get danger-full-access, use --yolo or --sandbox danger-full-access WITHOUT --full-auto.
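Because a requested sandbox mode can be silently replaced, a script may want to verify the effective mode from the stderr banner before relying on it. A sketch, assuming the banner layout shown in these experiments; both function names are illustrative.

```shell
#!/usr/bin/env bash
# effective_sandbox: reads the stderr banner on stdin and prints the mode
# from the "sandbox:" line (the first word after the label).
# Illustrative helper; it assumes the banner layout shown in this document.
effective_sandbox() {
  awk '$1 == "sandbox:" { print $2; exit }'
}

# assert_sandbox EXPECTED: succeeds only if the banner reports EXPECTED.
assert_sandbox() {
  local actual
  actual="$(effective_sandbox)"
  if [ "$actual" != "$1" ]; then
    echo "sandbox mismatch: wanted $1, got ${actual:-none}" >&2
    return 1
  fi
}

# Self-check against the banner captured in experiment 47.
banner='approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]'
printf '%s\n' "$banner" | effective_sandbox
```

For example, codex exec --skip-git-repo-check --full-auto --sandbox read-only "respond with PING" 2>&1 | assert_sandbox read-only would fail, surfacing the override observed in experiment 46.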


48-ephemeral-plus-resume.txt

Tests whether a session created with --ephemeral can be resumed by its thread ID in a subsequent invocation. Verifies what happens when resume targets a session that was never persisted to disk.

# Step 1: Create ephemeral session
codex exec --skip-git-repo-check --ephemeral --json "respond with EPHEMERAL_SESSION_TEST" 2>/dev/null
# Thread ID: 019ce7d8-045d-7fe1-a9e3-47c6d90a73e1

# Step 2: Try to resume it
codex exec --skip-git-repo-check resume 019ce7d8-045d-7fe1-a9e3-47c6d90a73e1 "respond with RESUMED" 2>&1
# Step 2 output (new session id: 019ce7d8-58e9-7851-b6ab-6656c671f0cb):
RESUMED

NOTE: Resuming an ephemeral session does NOT fail — but it creates a NEW session instead of continuing the old one. The session ID changed from 019ce7d8-045d to 019ce7d8-58e9, confirming the ephemeral session was not persisted. The resume silently fell back to a fresh session rather than erroring. Finding [MEDIUM]: --ephemeral prevents session file persistence; resume looks up saved files and falls back to a new session when none are found. PR #7357 fixed the related bug where resume with an invalid session ID silently created a new conversation instead of erroring (source). Recommendation: Consider filing a bug/feature request for resume to warn or error when the target session ID is not found.

Retest result (v0.114.0):

# Step 1: Create ephemeral session
codex exec --skip-git-repo-check --ephemeral --json "respond with EPHEMERAL_RETEST" 2>/dev/null
# Thread ID: 019cec22-ab75-79f2-b02a-ec5e3fc67f35

# Step 2: Resume it
codex exec --skip-git-repo-check resume 019cec22-ab75-79f2-b02a-ec5e3fc67f35 --json "respond with RESUMED_EPHEMERAL" 2>&1
{"type":"thread.started","thread_id":"019cec22-dc15-7cd1-afcc-ad2dcf1fb993"}
{"type":"turn.started"}
{"type":"item.completed","item":{"id":"item_0","type":"agent_message","text":"RESUMED_EPHEMERAL"}}
{"type":"turn.completed","usage":{"input_tokens":10147,"cached_input_tokens":10112,"output_tokens":71}}

Still silently creates a new session on v0.114.0 (ID changed from ab75 to dc15). PR #7357 fix is either not in this version or does not cover the ephemeral-resume case.
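Until resume reports a missing session itself, a script can detect the silent fallback by comparing the requested ID with the thread_id in the thread.started event. A sketch, assuming that a genuine resume reports the original thread_id and that events arrive in the compact one-per-line layout shown above; the function names are illustrative.

```shell
#!/usr/bin/env bash
# thread_id: reads a `codex exec --json` stream on stdin and prints the
# thread_id from the thread.started event. Uses sed rather than a JSON
# parser, so it assumes the compact one-event-per-line layout shown above.
thread_id() {
  sed -n 's/.*"type":"thread\.started".*"thread_id":"\([^"]*\)".*/\1/p' | head -n 1
}

# resumed_same_thread REQUESTED_ID: succeeds only if the stream on stdin
# reports the same thread_id that was passed to `resume`.
resumed_same_thread() {
  local got
  got="$(thread_id)"
  [ -n "$got" ] && [ "$got" = "$1" ]
}

# Self-check with the IDs from the retest: the resumed run reported a new ID.
line='{"type":"thread.started","thread_id":"019cec22-dc15-7cd1-afcc-ad2dcf1fb993"}'
if printf '%s\n' "$line" | resumed_same_thread 019cec22-ab75-79f2-b02a-ec5e3fc67f35; then
  echo "resumed original session"
else
  echo "resume fell back to a new session"
fi
```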


49-multiple-add-dir.txt

Tests whether the --add-dir flag can be specified multiple times in a single invocation to grant write access to several directories simultaneously. Confirms that the flag is repeatable and that all specified paths appear in the sandbox writable list.

codex exec --skip-git-repo-check --full-auto --add-dir /tmp/dir-a --add-dir /tmp/dir-b "respond with PING" 2>&1 | grep sandbox
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /tmp/dir-a, /tmp/dir-b, /Users/codex-user/.codex/memories]

NOTE: Multiple --add-dir flags work. Both /tmp/dir-a and /tmp/dir-b appear in the writable paths list. The flag is repeatable as documented.
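The writable list in the banner can also be checked programmatically, which is handy when several --add-dir flags are generated by a script. A sketch that parses the bracketed list from the sandbox line, assuming the banner layout shown above; the function names are illustrative.

```shell
#!/usr/bin/env bash
# writable_paths: reads the stderr banner on stdin and prints each entry of
# the bracketed list on the "sandbox:" line, one per line.
# Illustrative helper; it assumes the banner layout shown in this document.
writable_paths() {
  sed -n 's/^sandbox:[^[]*\[\([^]]*\)].*/\1/p' | tr ',' '\n' | sed 's/^ *//; s/ *$//'
}

# has_writable PATH: succeeds if PATH is listed verbatim in the banner.
has_writable() {
  writable_paths | grep -Fxq -- "$1"
}

# Self-check against the banner captured above.
banner='sandbox: workspace-write [workdir, /tmp, $TMPDIR, /tmp/dir-a, /tmp/dir-b, /Users/codex-user/.codex/memories]'
printf '%s\n' "$banner" | writable_paths
```

The same check works in the other direction, for instance to confirm that /tmp is absent from the list after setting exclude_slash_tmp.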


50-multiple-images.txt

Tests two syntaxes for attaching multiple images to a single codex exec invocation: comma-separated values in one --image flag and repeated -i flags. Verifies that both methods correctly pass multiple images to the model.

# Comma-separated images
codex exec --skip-git-repo-check "describe what you see" --image /tmp/test-pixel.png,/tmp/test-pixel-red.png 2>/dev/null
I see a tiny, nearly solid black square (pixel-like block) on a light background, with no distinct object or scene.
# Repeated -i flags
codex exec --skip-git-repo-check "describe what you see in these images, one sentence each" -i /tmp/test-pixel.png -i /tmp/test-pixel.png 2>/dev/null
Image 1: I can't see any visual content because `/tmp/test-pixel.png` is corrupted (IDAT CRC/checksum error).
Image 2: The second reference is the same corrupted file, so there is likewise no viewable image content.

NOTE: Both comma-separated and repeated -i patterns are accepted. The original test used 1x1 pixel images that produced ambiguous results (model described only one image for comma-separated, reported corruption for repeated -i with the same file twice).

Rerun (v0.114.0) — with two distinct 10x10 images (blue and red):

# Comma-separated:
codex exec --skip-git-repo-check "describe the colors of these two images, one sentence each" \
  --image /tmp/test-blue-10x10.png,/tmp/test-red-10x10.png 2>/dev/null
# → "Image #1 is a solid bright blue color. Image #2 is a solid bright red color."

# Repeated -i:
codex exec --skip-git-repo-check "describe the colors of these two images, one sentence each" \
  -i /tmp/test-blue-10x10.png -i /tmp/test-red-10x10.png 2>/dev/null
# → "Image 1 is a deep, saturated blue. Image 2 is a bright, saturated red."

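Both patterns correctly handle multiple distinct images. The ambiguous results in the original test most likely stemmed from the test files themselves (1x1 pixel images, one of which the model reported as corrupted) and from passing the same file twice in the repeated -i test, not from either flag syntax.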


51-output-schema-without-output-file.txt

Tests whether --output-schema functions when used without the -o output file flag. Determines if -o is a required companion to --output-schema or whether schema-constrained output can be sent directly to stdout.

echo '{"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}},"required":["answer","confidence"],"additionalProperties":false}' > /tmp/test-schema-no-o.json
codex exec --skip-git-repo-check --output-schema /tmp/test-schema-no-o.json "What is 2+2? answer field and confidence 0-1." 2>/dev/null
{"answer":"4","confidence":1}

NOTE: --output-schema works without -o. The schema-conforming JSON is printed directly to stdout. The -o flag is optional — it just additionally writes the output to a file.


52-color-always.txt

Tests whether --color always injects ANSI escape codes into codex exec stdout output. Verifies which parts of the output (stdout agent message vs. stderr progress display) are affected by the color flag.

codex exec --skip-git-repo-check --color always "respond with PING" 2>/dev/null | cat -v
PING

NOTE: --color always is accepted but produced no visible ANSI escape codes in the stdout output for this simple text response. The flag may only affect stderr progress output or formatted code blocks, not plain text agent messages. Finding [MEDIUM]: The --color flag controls ANSI output formatting, which primarily applies to stderr progress display and formatted content (code blocks, diffs). Plain text agent messages on stdout are unaffected. The experiment discarded stderr with 2>/dev/null, hiding any color output (source). Recommendation: To observe color effects, include stderr: codex exec --color always "prompt" 2>&1 | cat -v.


53-service-tier-flex.txt

Tests the -c service_tier=flex config option, which requests lower-priority (flex) API processing. Checks whether the setting is accepted without error and whether it surfaces in the CLI's stderr configuration summary.

codex exec --skip-git-repo-check -c service_tier=flex "respond with PING" 2>&1 | head -12
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /Users/codex-user/Documents/cc-inbox/codex-exec-article
model: gpt-5.3-codex
provider: openai
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none

NOTE: service_tier=flex is accepted without error but does NOT appear in the stderr config summary. The setting is passed to the API as a request parameter (lower-priority processing tier) but is not surfaced in the CLI's status display. Finding [HIGH]: The config reference documents service_tier as an API-level parameter ('Preferred service tier for new turns'). GitHub issue #2916 confirms it is passed in API requests. The stderr display shows only safety-critical settings (sandbox, approval, model), not all API parameters (source). Recommendation: This is expected behavior; the stderr summary is curated for safety-relevant info.


54-service-tier-fast.txt

Tests the service_tier=fast config option for low-latency API routing, analogous to experiment 53's flex tier.

codex exec --skip-git-repo-check -c service_tier=fast "respond with PING" 2>&1 | head -12
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /Users/codex-user/Documents/cc-inbox/codex-exec-article
model: gpt-5.3-codex
provider: openai
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none

NOTE: service_tier=fast is also accepted without error and not shown in stderr. Like flex, it is an API-level parameter. Fast mode is enabled by default in v0.114.0 per the changelog.


55-model-verbosity.txt

Tests model_verbosity=low versus high to measure observable differences in response length and structure.

codex exec --skip-git-repo-check -c model_verbosity=low "explain what a function is in Python, one sentence" 2>/dev/null
codex exec --skip-git-repo-check -c model_verbosity=high "explain what a function is in Python, one sentence" 2>/dev/null
# model_verbosity=low:
A function in Python is a reusable block of code defined with `def` (or `lambda`) that runs when called, can take inputs (arguments), and can return a value.

# model_verbosity=high:
In Python, a function is a reusable block of named code that can take inputs (arguments), perform a task, and optionally return a value.

NOTE: Both values are accepted. The responses were similar in length for this simple prompt — model_verbosity is a Responses API parameter that may have more noticeable effects with complex tasks and longer outputs.

Retest result (v0.114.0) — prompt: "explain the differences between TCP and UDP with examples":

codex exec --skip-git-repo-check --json -c model_verbosity=low "explain the differences between TCP and UDP with examples" 2>/dev/null | jq -r 'select(.type=="turn.completed") | .usage.output_tokens'
# → 734 tokens

codex exec --skip-git-repo-check --json -c model_verbosity=high "explain the differences between TCP and UDP with examples" 2>/dev/null | jq -r 'select(.type=="turn.completed") | .usage.output_tokens'
# → 768 tokens

With a more complex prompt, high (768 tokens) produced ~5% more tokens than low (734 tokens). The high response included more structural detail (separate headings per protocol, more examples, a modern note about HTTP/3 and QUIC) while low used a compact comparison table. The structural difference is visible, but note that a single-run comparison cannot definitively separate model_verbosity effects from natural LLM output variance — a 34-token difference (5%) is within stochastic noise for one sample. The qualitative structural difference (table vs. headings) is more telling than the raw token count.
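To separate a real model_verbosity effect from run-to-run variance, the token counts from several runs per setting can be summarized before comparing. A minimal sketch; summarize is an illustrative helper that expects one output_tokens value per line.

```shell
#!/usr/bin/env bash
# summarize: reads one output_tokens value per line on stdin and prints
# the sample count, mean, minimum, and maximum.
# Illustrative helper; not part of the Codex CLI.
summarize() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      v = $1 + 0
      sum += v
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      n++
    }
    END {
      if (n == 0) { print "n=0"; exit }
      printf "n=%d mean=%.1f min=%d max=%d\n", n, sum / n, min, max
    }
  '
}

# Self-check with the two single-run values reported above.
printf '734\n768\n' | summarize
```

Running the jq pipeline from the retest in a loop (for example five times per setting) and piping the numbers into summarize gives a mean and range for low and high; overlapping ranges would suggest the 34-token gap is noise.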


56-developer-instructions.txt

Tests the developer_instructions config option via -c to verify it appends custom instructions to the system prompt.

codex exec --skip-git-repo-check -c 'developer_instructions="IMPORTANT: Always respond with exactly one word."' "What is 2+2?" 2>/dev/null
4

NOTE: developer_instructions via -c works and affects behavior. The model followed the injected "one word" instruction, responding with just "4" instead of a full sentence. The instruction is appended to the default system prompt rather than replacing it (contrast with model_instructions_file in experiment 57).


57-model-instructions-file.txt

Tests model_instructions_file to determine whether it replaces or appends to the default system prompt, using token count comparison.

echo "You are a pirate. Always respond in pirate speak. Keep responses under 20 words." > /tmp/pirate-instructions.md
codex exec --skip-git-repo-check -c 'model_instructions_file="/tmp/pirate-instructions.md"' "What is 2+2?" 2>/dev/null
Arrr, 2+2 be 4, matey.

NOTE: model_instructions_file works and replaces the default system prompt rather than appending to it. Token comparison supports this: the baseline "What is 2+2?" uses 10,145 input tokens, while the same prompt with model_instructions_file pointing to the 14-word pirate instruction file uses only 7,612 input tokens, 2,533 fewer. If the file were appended, the count would be higher than baseline, not lower. This contrasts with developer_instructions (experiment 56), which appends to the default prompt.


58-web-search-config.txt

Tests -c web_search=live as the exec-mode path to enabling real-time web search capability.

codex exec --skip-git-repo-check -c web_search=live "What is the current date today? Just respond with the date." 2>/dev/null
2026-03-13

NOTE: -c web_search=live works in exec mode and gave the model access to current information (correct date). This is the tested exec-mode alternative to the --search flag, which is rejected when placed after exec (see experiment 39). Values: disabled | cached | live (default: cached).


59-approval-policy-via-config.txt

Tests whether approval_policy=on-request set via config has any observable effect in non-interactive exec mode.

codex exec --skip-git-repo-check -c approval_policy=on-request --sandbox workspace-write "respond with PING" 2>&1 | grep -E "approval|sandbox"
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: -c approval_policy=on-request is accepted but the stderr still shows approval: never. This matches the behavior seen with --full-auto (experiment 14) — in non-interactive exec mode, on-request effectively becomes never since there is no user to prompt. The sandbox is correctly set to workspace-write.

Retest result (v0.114.0):

codex -a on-request exec --skip-git-repo-check --sandbox workspace-write "respond with PING" 2>&1 | grep -E "approval|sandbox"
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

Confirmed: global -a on-request also shows approval: never in exec mode. This indicates the downgrade is an exec-mode behavior rather than a -c parsing issue: in both tests exec mode reported approval: never, whether on-request was set via -c or -a.


60-personality.txt

Tests the personality config option to verify observable differences in agent tone between friendly and none settings.

codex exec --skip-git-repo-check -c personality=friendly "What is 2+2?" 2>/dev/null
codex exec --skip-git-repo-check -c personality=none "What is 2+2?" 2>/dev/null
# personality=friendly:
4

# personality=none:
4

NOTE: Both personality values are accepted. For this trivially short response, no style difference was observable. The personality setting (values: none | friendly | pragmatic) likely has more impact on longer, conversational responses.

Retest result (v0.114.0) — prompt: "explain why tests are important, be conversational":

codex exec --skip-git-repo-check -c personality=friendly "explain why tests are important, be conversational" 2>/dev/null
# personality=friendly:
Tests are important because they give you confidence.

Think of them as a safety net: when you change code, tests tell you right away if you broke
something. That means fewer "surprise" bugs in production, faster debugging, and less stress
before releases. They also make refactoring safer, because you can improve messy code without
guessing what might break. And in a team, tests act like living documentation, showing how
the code is supposed to behave.

In short: good tests save time, reduce risk, and make coding feel a lot less scary.
codex exec --skip-git-repo-check -c personality=none "explain why tests are important, be conversational" 2>/dev/null
# personality=none:
Tests are important because they protect you from surprises.

They give you confidence that your code still works after changes, catch bugs early (when
they're cheaper to fix), and document expected behavior for future-you or teammates. They
also make refactoring much less scary, since you can change internals and quickly verify
you didn't break anything.

In a conversational way: tests are like a safety net and a memory aid at the same time.
Without them, every change feels like "I hope this still works."

Both responses are conversational and similar in length. friendly used slightly warmer language ("make coding feel a lot less scary", "fewer 'surprise' bugs"), while none was marginally more neutral. Both covered the same ground, including tests as documentation for teammates. The difference exists but is subtle, and the explicit "be conversational" instruction in the prompt may override much of the personality effect.


61-hide-agent-reasoning.txt

Tests the hide_agent_reasoning config option to determine whether it suppresses reasoning-type events from the JSONL output stream.

# With hide_agent_reasoning=true (simple prompt, no reasoning):
codex exec --skip-git-repo-check --json -c hide_agent_reasoning=true "What is 2+2?" 2>/dev/null | jq -c '.type'

# With reasoning summary enabled + hide=false (complex prompt):
codex exec --skip-git-repo-check --json -c model_reasoning_summary=detailed -c hide_agent_reasoning=false "What is the meaning of life? Think carefully." 2>/dev/null | jq -c '{type, item_type: .item.type}'
# hide=true (simple):
"thread.started"
"turn.started"
"item.completed"
"turn.completed"

# hide=false + reasoning summary (complex):
{"type":"thread.started","item_type":null}
{"type":"turn.started","item_type":null}
{"type":"item.completed","item_type":"reasoning"}
{"type":"item.completed","item_type":"agent_message"}
{"type":"turn.completed","item_type":null}

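NOTE: hide_agent_reasoning is intended to control whether item.completed events whose item.type is "reasoning" appear in the JSONL stream. With hide_agent_reasoning=false and reasoning summaries enabled (model_reasoning_summary=detailed), a separate reasoning item appears before the agent message. The hide_agent_reasoning=true run produced only the agent_message item, but it used a simple prompt without model_reasoning_summary, so this experiment does not isolate the flag: the missing reasoning item could be due to either setting. A run with model_reasoning_summary=detailed and hide_agent_reasoning=true would be needed to confirm suppression.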


62-sandbox-exclude-tmp.txt

Tests the exclude_slash_tmp sandbox config option to verify it removes /tmp from the workspace-write writable paths.

codex exec --skip-git-repo-check --full-auto -c sandbox_workspace_write.exclude_slash_tmp=true "respond with PING" 2>&1 | grep sandbox
sandbox: workspace-write [workdir, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: exclude_slash_tmp=true successfully removes /tmp from the writable paths list. Compare to experiment 14 which shows /tmp included by default. The $TMPDIR path remains unless exclude_tmpdir_env_var=true is also set.


63-sandbox-network-access.txt

Tests the network_access sandbox config option and its reflection in the sandbox summary banner.

codex exec --skip-git-repo-check --full-auto -c sandbox_workspace_write.network_access=true "respond with PING" 2>&1 | grep -E "sandbox|network"
codex exec --skip-git-repo-check --full-auto -c sandbox_workspace_write.network_access=false "respond with PING" 2>&1 | grep -E "sandbox|network"
# network_access=true:
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories] (network access enabled)

# network_access=false:
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /Users/codex-user/.codex/memories]

NOTE: network_access=true adds "(network access enabled)" to the sandbox summary line, allowing tools to make outbound network requests within the workspace-write sandbox. When false (or unset), network access is not mentioned — tools are restricted to local filesystem operations only.


64-openai-base-url.txt

Tests the OPENAI_BASE_URL environment variable for routing API calls to a custom endpoint.

OPENAI_BASE_URL=https://httpbin.org/post codex exec --skip-git-repo-check "respond with PING" 2>&1 | tail -5
ERROR: unexpected status 404 Not Found: ... url: https://httpbin.org/post/responses

NOTE: OPENAI_BASE_URL is respected and redirects API calls. The CLI appended /responses to the base URL and attempted the request at the non-OpenAI endpoint, receiving a 404. This confirms the env var works for routing to proxies, Azure endpoints, or custom API gateways.


65-openai-api-key.txt

Tests the precedence of the OPENAI_API_KEY environment variable relative to stored credentials in ~/.codex/auth.json. Determines whether setting a fake key via OPENAI_API_KEY overrides the persisted auth token as CODEX_API_KEY does.

OPENAI_API_KEY=fake-key-99999 codex exec --skip-git-repo-check "respond with PING" 2>&1 | tail -3
tokens used
176
PING

NOTE: OPENAI_API_KEY with a fake key did NOT cause an auth error — the stored auth in ~/.codex/auth.json took precedence. Unlike CODEX_API_KEY (experiment 31) which overrides stored auth and causes a 401 with a fake key, OPENAI_API_KEY is ignored when valid stored credentials exist. CODEX_API_KEY is the dedicated env var for exec-mode auth override. Finding [MEDIUM]: GitHub issue #2341 documents this precedence change. The CLI was modified to deprioritize OPENAI_API_KEY after it caused accidental auth failures by overriding ChatGPT login tokens from .env files. The resolved precedence is: CODEX_API_KEY > stored auth.json > OPENAI_API_KEY (source). Recommendation: Use CODEX_API_KEY for programmatic auth overrides in CI/CD; OPENAI_API_KEY is intentionally deprioritized.
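The precedence described above can be written down as a small pre-flight check for CI scripts. This is a model of the observed behavior, not the CLI's actual resolution logic; credential_source is a hypothetical helper, and it only checks whether auth.json exists, not whether its contents are valid.

```shell
#!/usr/bin/env bash
# credential_source [CODEX_HOME_DIR]
# Models the precedence reported above:
#   CODEX_API_KEY > stored auth.json > OPENAI_API_KEY
# Prints which source would win. This is a model of the observed behavior,
# not the CLI's actual resolution code.
credential_source() {
  local home_dir="${1:-${CODEX_HOME:-$HOME/.codex}}"
  if [ -n "${CODEX_API_KEY:-}" ]; then
    echo "CODEX_API_KEY"
  elif [ -f "$home_dir/auth.json" ]; then
    echo "auth.json"
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY"
  else
    echo "none"
  fi
}

# Self-check in a scratch directory that contains a (dummy) auth.json.
scratch="$(mktemp -d)"
: > "$scratch/auth.json"
( unset CODEX_API_KEY; OPENAI_API_KEY=fake-key-99999 credential_source "$scratch" )
```

The self-check mirrors this experiment: with stored credentials present and only OPENAI_API_KEY set, the stored auth.json wins.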


66-codex-home.txt

Tests the CODEX_HOME environment variable to verify it redirects the CLI's config, auth, and session storage away from the default ~/.codex directory. Confirms the behavior with both a non-existent path and an existing but empty directory.

# Non-existent directory:
CODEX_HOME=/tmp/codex-test-home codex exec --skip-git-repo-check "respond with PING" 2>&1 | head -5

# Existing empty directory:
mkdir -p /tmp/codex-test-home
CODEX_HOME=/tmp/codex-test-home codex exec --skip-git-repo-check "respond with PING" 2>&1 | tail -5
# Non-existent:
WARNING: proceeding, even though we could not update PATH: CODEX_HOME points to "/tmp/codex-test-home", but that path does not exist
Error finding codex home: CODEX_HOME points to "/tmp/codex-test-home", but that path does not exist

# Existing empty (no auth.json):
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header

NOTE: CODEX_HOME redirects config/auth/session storage. With a non-existent path, it warns and errors. With an existing but empty directory, it looks for auth.json there (not in ~/.codex), fails to find credentials, and returns 401. This confirms CODEX_HOME fully overrides the default ~/.codex location for all config, auth, and session data.


TOML Config File Verification (67–73)

Verifying that the exact config.toml syntax shown in the article actually works when loaded from a file, not just via -c inline overrides. Each experiment uses CODEX_HOME pointed at a temp directory with a custom config.toml.


67-toml-shell-tool-disable.txt

Test the article's [features] shell_tool = false TOML syntax.

# /tmp/codex-toml-67/config.toml
[features]
shell_tool = false
mkdir -p /tmp/codex-toml-67 && cp ~/.codex/auth.json /tmp/codex-toml-67/
cat > /tmp/codex-toml-67/config.toml << 'EOF'
[features]
shell_tool = false
EOF
CODEX_HOME=/tmp/codex-toml-67 codex exec --skip-git-repo-check --full-auto --json \
  "run the command 'echo TOML_SHELL_TEST' and tell me the output" 2>/dev/null \
  | jq -c '{type: .type, text: .item.text?}' | grep item
{"type":"item.completed","text":"I can't actually execute shell commands in this chat because no shell tool is available in the current session.\n\nIf you run it locally:\n\n```sh\necho TOML_SHELL_TEST\n```\n\nthe output will be:\n\n```text\nTOML_SHELL_TEST\n```"}

NOTE: TOML [features] shell_tool = false works identically to -c features.shell_tool=false (experiment 16). The model explicitly reports no shell tool is available. The article's TOML syntax is correct. --skip-git-repo-check is required because the custom CODEX_HOME doesn't contain the trusted directory list from ~/.codex.


68-toml-app-tool-control.txt

Test the article's per-app tool control TOML syntax.

# /tmp/codex-toml-68/config.toml
[apps._default]
enabled = false
mkdir -p /tmp/codex-toml-68 && cp ~/.codex/auth.json /tmp/codex-toml-68/
cat > /tmp/codex-toml-68/config.toml << 'EOF'
[apps._default]
enabled = false
EOF
CODEX_HOME=/tmp/codex-toml-68 codex exec --skip-git-repo-check --full-auto --json \
  "respond with PING" 2>/dev/null | jq -r 'select(.type=="turn.completed") | .usage'
{
  "input_tokens": 8013,
  "cached_input_tokens": 7040,
  "output_tokens": 20
}

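NOTE: TOML [apps._default] enabled = false is accepted. Input tokens dropped to 8,013 vs the current session baseline of ~10,145, a 2,132 token reduction (21%), which is consistent with app/tool definitions being removed from the prompt. Part of the drop may instead come from the custom CODEX_HOME carrying fewer base config settings (see experiment 71), so the figure is an upper bound on the effect of this setting alone. The article's TOML syntax is correct.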


69-toml-model-instructions-file.txt

Test the article's model_instructions_file TOML syntax.

# /tmp/codex-toml-69/config.toml
model_instructions_file = "/tmp/pirate-instructions.md"
mkdir -p /tmp/codex-toml-69 && cp ~/.codex/auth.json /tmp/codex-toml-69/
cat > /tmp/codex-toml-69/config.toml << 'EOF'
model_instructions_file = "/tmp/pirate-instructions.md"
EOF
CODEX_HOME=/tmp/codex-toml-69 codex exec --skip-git-repo-check "What is 2+2?" 2>/dev/null
Arrr, 2+2 be 4.

NOTE: TOML model_instructions_file works identically to -c (experiment 57). The pirate persona was adopted, confirming the TOML file path syntax is correct and the file is loaded at runtime.


70-toml-developer-instructions.txt

Test the article's developer_instructions TOML syntax.

# /tmp/codex-toml-70/config.toml
developer_instructions = "IMPORTANT: Always respond with exactly one word."
mkdir -p /tmp/codex-toml-70 && cp ~/.codex/auth.json /tmp/codex-toml-70/
cat > /tmp/codex-toml-70/config.toml << 'EOF'
developer_instructions = "IMPORTANT: Always respond with exactly one word."
EOF
CODEX_HOME=/tmp/codex-toml-70 codex exec --skip-git-repo-check "What is 2+2?" 2>/dev/null
4

NOTE: TOML developer_instructions works identically to -c (experiment 56). The model followed the "one word" instruction. The article's TOML string syntax is correct.


71-toml-multi-agent.txt

Test the article's multi-agent TOML syntax.

# /tmp/codex-toml-71/config.toml
[features]
multi_agent = true
mkdir -p /tmp/codex-toml-71 && cp ~/.codex/auth.json /tmp/codex-toml-71/
cat > /tmp/codex-toml-71/config.toml << 'EOF'
[features]
multi_agent = true
EOF
CODEX_HOME=/tmp/codex-toml-71 codex exec --skip-git-repo-check --json \
  "respond with PING" 2>/dev/null | jq -r 'select(.type=="turn.completed") | .usage'
{
  "input_tokens": 9907,
  "cached_input_tokens": 8960,
  "output_tokens": 19
}

NOTE: TOML [features] multi_agent = true works. Input tokens increased to 9,907, comparable to exp 37's --enable multi_agent result (10,140). The slight difference is due to the custom CODEX_HOME having fewer base config settings. The article's TOML syntax is correct.


72-toml-default-model.txt

Test the article's default model TOML syntax.

# /tmp/codex-toml-72/config.toml
model = "gpt-5.4"
mkdir -p /tmp/codex-toml-72 && cp ~/.codex/auth.json /tmp/codex-toml-72/
cat > /tmp/codex-toml-72/config.toml << 'EOF'
model = "gpt-5.4"
EOF
CODEX_HOME=/tmp/codex-toml-72 codex exec --skip-git-repo-check "respond with PING" 2>&1 | grep model
model: gpt-5.4

NOTE: TOML model = "gpt-5.4" works identically to --model gpt-5.4 (experiment 12). The stderr config summary shows the overridden model. The article's TOML syntax is correct.


73-toml-mcp-server.txt

Test the article's MCP server TOML syntax.

# /tmp/codex-toml-73/config.toml
[mcp_servers.test_server]
enabled = true
command = "echo"
args = ["hello"]
mkdir -p /tmp/codex-toml-73 && cp ~/.codex/auth.json /tmp/codex-toml-73/
cat > /tmp/codex-toml-73/config.toml << 'EOF'
[mcp_servers.test_server]
enabled = true
command = "echo"
args = ["hello"]
EOF
CODEX_HOME=/tmp/codex-toml-73 codex exec --skip-git-repo-check "respond with PING" 2>&1 | head -15
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /Users/codex-user/Documents/cc-inbox/codex-exec-article
model: gpt-5.4
provider: openai
approval: never
sandbox: read-only
reasoning effort: none
reasoning summaries: none
session id: 019cecaa-f320-71b3-854c-0a9c224ff2f9
--------
user
respond with PING
mcp: test_server starting
ERROR rmcp::transport::async_rw: Error reading from stream: serde error expected value at line 1 column 1

NOTE: The TOML [mcp_servers] syntax is correctly parsed — the CLI attempts to start test_server (mcp: test_server starting). The error is expected: echo hello is not a valid MCP server (it outputs plain text, not the JSON-RPC protocol MCP requires). The article's TOML syntax for MCP server configuration is correct. Bonus finding: with a bare CODEX_HOME (no user config), the CLI defaults to model: gpt-5.4 and reasoning effort: none — different from the configured defaults of gpt-5.3-codex and xhigh seen in all other experiments. This reveals that the user's ~/.codex/config.toml sets both of those values.
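The parse failure can be understood with a rough pre-check of what a candidate server command prints first. The sketch below is only a heuristic, not a protocol validator; `looks_like_jsonrpc` is a hypothetical helper, and the sample message is an illustrative JSON-RPC 2.0 response rather than real MCP server output.

```shell
#!/bin/sh
# Heuristic: does the first line on stdin look like a JSON-RPC object?
looks_like_jsonrpc() {
  first_line=
  IFS= read -r first_line || [ -n "$first_line" ] || return 1
  case "$first_line" in
    "{"*'"jsonrpc"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Plain text, as produced by the test_server command above.
if echo hello | looks_like_jsonrpc; then
  echo "echo hello: plausible JSON-RPC"
else
  echo "echo hello: not JSON-RPC"
fi

# An illustrative JSON-RPC 2.0 message.
if printf '%s\n' '{"jsonrpc":"2.0","id":1,"result":{}}' | looks_like_jsonrpc; then
  echo "sample message: plausible JSON-RPC"
fi
```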


74. --no-alt-screen in exec mode

Objective: The --no-alt-screen flag is a GLOBAL flag (listed under codex --help) that disables alternate screen mode for the TUI. It is NOT available as an exec subcommand flag.

codex exec --skip-git-repo-check --no-alt-screen "respond with PING" 2>&1
error: unexpected argument '--no-alt-screen' found

  tip: to pass '--no-alt-screen' as a value, use '-- --no-alt-screen'

Usage: codex exec --skip-git-repo-check [PROMPT]

For more information, try '--help'.

Exit code: 2

NOTE: --no-alt-screen is a global interactive flag that controls TUI rendering. It appears in codex --help but NOT in codex exec --help. Exec mode rejects it because exec has no TUI — it's inherently non-interactive. This is consistent with how exec mode strips other interactive-only features. The flag's description confirms this: "Disable alternate screen mode — Runs the TUI in inline mode, preserving terminal scrollback history."


75. codex exec review --base and --commit

Objective: Tests codex exec review with the --commit and --base options to verify that both review modes work in exec mode.

Command (test 1 - review a specific commit):

cd /tmp/codex-review-75  # temp repo with 4 commits on main
codex exec review --commit HEAD~1 2>&1
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /private/tmp/codex-review-75
model: gpt-5.3-codex
approval: never
sandbox: read-only
reasoning effort: xhigh
--------
user
commit HEAD~1
exec
/bin/zsh -lc 'git rev-parse --show-toplevel && git log --oneline -n 5' succeeded in 0ms:
/private/tmp/codex-review-75
0abc30f third commit
27fdf4d second commit with changes
cbd60d6 initial commit
exec
/bin/zsh -lc 'git show --stat --patch --find-renames HEAD~1' succeeded in 0ms:
commit 27fdf4d [second commit with changes]
 file.txt | 2 +-
 1 file changed
diff: -initial content / +modified content
exec
/bin/zsh -lc 'git show --stat --patch HEAD' succeeded in 0ms:
commit 0abc30f [third commit]
 file.txt | 2 +-
 1 file changed
diff: -modified content / +current state
codex
The commit only updates a single line of plain text in file.txt and does not modify executable logic. I did not find any actionable defects in this change.

Command (test 2 - review changes against a base ref):

codex exec review --base main~2 2>&1
OpenAI Codex v0.114.0 (research preview)
--------
workdir: /private/tmp/codex-review-75
model: gpt-5.3-codex
approval: never
sandbox: read-only
reasoning effort: xhigh
--------
user
changes against 'main~2'
exec
/bin/zsh -lc 'git diff 27fdf4d' succeeded in 0ms:
diff --git a/file.txt b/file.txt
 -modified content
 +current state
 +new line added
exec
/bin/zsh -lc 'git show --stat --patch --find-renames 27fdf4d..HEAD' succeeded in 0ms:
[shows commits from main~2 to HEAD]
codex
The change only updates text in file.txt and does not introduce any functional code behavior that could break execution. I did not find any discrete, actionable bug introduced by this diff.

NOTE: Both --commit and --base review modes work correctly in exec mode. --commit HEAD~1 reviews a specific commit by showing its diff and surrounding context. --base main~2 reviews all changes between the base ref and HEAD, similar to a PR diff. Both use the same read-only sandbox and approval: never as --uncommitted (exp 42). Codex constructs the review prompt automatically: --commit sends "commit HEAD~1" and --base sends "changes against 'main~2'" as the user message.
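A comparable temporary repository can be rebuilt with a few git commands. This is a sketch under assumptions: the first three commit messages and the file contents are taken from the outputs above, while the fourth commit message, the committer identity, and the function name `make_review_repo` are placeholders.

```shell
#!/bin/sh
# Build a throwaway repo with four commits on main for review experiments.
make_review_repo() {
  repo=$(mktemp -d "${TMPDIR:-/tmp}/codex-review.XXXXXX") || return 1
  (
    cd "$repo" || exit 1
    git init -q . || exit 1
    # Name the unborn branch main regardless of the local git default.
    git symbolic-ref HEAD refs/heads/main || exit 1
    commit() {
      git add file.txt &&
      git -c user.name="Example" -c user.email="example@example.com" \
          -c commit.gpgsign=false commit -q -m "$1"
    }
    printf 'initial content\n'  > file.txt && commit "initial commit" || exit 1
    printf 'modified content\n' > file.txt && commit "second commit with changes" || exit 1
    printf 'current state\n'    > file.txt && commit "third commit" || exit 1
    # Fourth commit message is a placeholder; the source does not show it.
    printf 'new line added\n'  >> file.txt && commit "fourth commit" || exit 1
  ) >/dev/null || return 1
  printf '%s\n' "$repo"
}

repo=$(make_review_repo) && git -C "$repo" log --oneline

# Then, as in the experiment (not run here):
# cd "$repo" && codex exec review --commit HEAD~1
# cd "$repo" && codex exec review --base main~2
```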


76. Untested feature flags batch test

Objective: codex features list reveals 50+ feature flags. Tested representative flags:

# Baseline
codex exec --skip-git-repo-check --json "respond with PING" 2>/dev/null \
  | jq -c 'select(.type=="turn.completed") | .usage'

# Enable disabled-by-default flags
for flag in apps artifact undo multi_agent image_generation js_repl guardian_approval prevent_idle_sleep; do
  # --arg passes the shell variable into jq; a bare {flag} would read .flag from the event (null)
  codex exec --skip-git-repo-check --enable "$flag" --json "respond with PING" 2>/dev/null \
    | jq -c --arg flag "$flag" 'select(.type=="turn.completed") | {flag: $flag, input_tokens: .usage.input_tokens}'
done

# Disable enabled-by-default flags
for flag in fast_mode personality unified_exec shell_snapshot shell_tool; do
  codex exec --skip-git-repo-check --disable "$flag" --json "respond with PING" 2>/dev/null \
    | jq -c --arg flag "$flag" 'select(.type=="turn.completed") | {flag: $flag, input_tokens: .usage.input_tokens}'
done
=== BASELINE (no feature flag changes) ===
{"flag":"baseline","input_tokens":10142,"cached":null}

=== --enable on disabled-by-default flags ===
{"flag":"apps","input_tokens":10615}          # +473 tokens
{"flag":"artifact","input_tokens":10574}      # +432 tokens
{"flag":"undo","input_tokens":10142}          # +0 (no effect)
{"flag":"multi_agent","input_tokens":12126}   # +1,984 tokens
{"flag":"image_generation"} → ERROR: 400 "Unsupported tool type: image_generation"
{"flag":"js_repl","input_tokens":10142}       # +0 (no effect)
{"flag":"guardian_approval","input_tokens":10142}  # +0 (no effect)
{"flag":"prevent_idle_sleep","input_tokens":10142} # +0 (no effect)

=== --disable on enabled-by-default flags ===
{"flag":"fast_mode","input_tokens":10142}     # -0 (no effect)
{"flag":"personality","input_tokens":10142}   # -0 (no effect)
{"flag":"unified_exec","input_tokens":9989}   # -153 tokens
{"flag":"shell_snapshot","input_tokens":10142} # -0 (no effect)
{"flag":"shell_tool","input_tokens":9702}     # -440 tokens

Also ran codex features list which shows 50+ flags with stages:

apps                    experimental  false
artifact                under development  false
fast_mode               stable        true
image_generation        under development  false
multi_agent             experimental  false
personality             stable        true
shell_snapshot          stable        true
shell_tool              stable        true
undo                    stable        false
unified_exec            stable        true
... (50+ total flags)

NOTE: Feature flags fall into three observable categories in exec mode: (1) System prompt modifiers: apps (+473), artifact (+432), multi_agent (+1,984), unified_exec (-153 when disabled), and shell_tool (-440 when disabled) change input token counts by adding or removing tool definitions and instructions from the system prompt. (2) No observable effect: undo, js_repl, guardian_approval, prevent_idle_sleep, fast_mode, personality, and shell_snapshot produce token counts identical to baseline, suggesting they control interactive-only behaviors (UI features, sleep prevention, etc.). (3) API-level failures: image_generation causes a 400 error ("Unsupported tool type: image_generation") because the model doesn't support image generation as a tool. The codex features list command reveals the full flag taxonomy, including stage (stable/experimental/under development/removed/deprecated) and default state. Many flags under development are not yet functional.
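The per-flag deltas can be recomputed from the recorded token counts rather than by hand. A minimal sketch, using the values from the output above and an illustrative function name (`flag_deltas`):

```shell
#!/bin/sh
# Read "flag tokens" pairs on stdin and print each flag's signed delta
# against the baseline given as the first argument.
flag_deltas() {
  awk -v base="$1" '{ printf "%s\t%+d\n", $1, $2 - base }'
}

flag_deltas 10142 <<'EOF'
apps 10615
artifact 10574
undo 10142
multi_agent 12126
unified_exec 9989
shell_tool 9702
EOF
```

A delta of +0 marks the flags that leave the prompt untouched in exec mode.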


77. Shell environment policy, network permissions, and allow_login_shell

Objective: Three sub-tests:

Test 77a: shell_environment_policy

Config:

[shell_environment_policy]
inherit = "core"
set = { TEST_VAR = "visible_value" }
exclude = ["HOME"]
CODEX_HOME=/tmp/codex-toml-77 codex exec --skip-git-repo-check --full-auto \
  "run 'echo TEST_VAR=\$TEST_VAR HOME=\$HOME' and report the output" 2>/dev/null
exec
/bin/zsh -lc 'echo TEST_VAR=$TEST_VAR HOME=$HOME' succeeded in 0ms:
TEST_VAR=visible_value HOME=/Users/codex-user

Test 77b: allow_login_shell=false

codex exec --skip-git-repo-check -c 'allow_login_shell=false' --full-auto \
  "run 'echo hello' and report the output" 2>&1
exec
/bin/zsh -c 'echo hello' succeeded in 0ms:
hello

Compare with default (allow_login_shell=true):

exec
/bin/zsh -lc 'echo hello'   ← note the -l flag

Test 77c: network permissions

codex exec --skip-git-repo-check --full-auto \
  -c 'permissions.network.enabled=true' -c 'permissions.network.mode="limited"' \
  -c 'permissions.network.allowed_domains=["example.com"]' \
  "respond with PING" 2>&1

Output: PING returned normally, sandbox banner unchanged. No visible network permission display.

NOTE: (1) shell_environment_policy.set works correctly: TEST_VAR=visible_value appears in the shell environment. However, exclude=["HOME"] did NOT remove HOME from the environment; HOME=/Users/codex-user was still visible. A likely cause is that, with inherit="core" plus a login shell (-lc), the shell re-sets HOME from the user database after Codex applies the environment policy. (2) allow_login_shell=false has a clear observable effect: the shell invocation changes from /bin/zsh -lc (login shell) to /bin/zsh -c (non-login shell). This skips login-shell initialization files such as ~/.zprofile and ~/.zlogin (~/.zshrc is read only by interactive shells, so neither invocation loads it). Combining exclude with allow_login_shell=false may make exclusion effective for variables that login initialization would otherwise re-inject, but that combination was not tested here. (3) permissions.network config is silently accepted but produces no observable change in the exec banner or behavior for a simple PING prompt. Network restrictions likely operate at the sandbox level and would only be observable when the model attempts network access.
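One detail matters when reproducing test 77a: a prompt passed in double quotes is expanded by the invoking shell before Codex sees it, so the variable references must be escaped to reach the sandboxed shell literally. A small illustration that does not call Codex (the variable value is arbitrary):

```shell
#!/bin/sh
TEST_VAR=from_outer_shell

# Double quotes: the invoking shell substitutes $TEST_VAR immediately.
expanded="run 'echo TEST_VAR=$TEST_VAR' and report the output"

# Escaped dollar sign: the literal text $TEST_VAR survives into the prompt.
literal="run 'echo TEST_VAR=\$TEST_VAR' and report the output"

printf '%s\n' "$expanded" "$literal"
# prints:
# run 'echo TEST_VAR=from_outer_shell' and report the output
# run 'echo TEST_VAR=$TEST_VAR' and report the output
```

Only the second form tests the environment that the policy constructs; the first reports whatever the outer shell held.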


78. Advanced config options: notify, model_context_window, compact_prompt, show_raw_agent_reasoning

Objective: Four sub-tests:

Test 78a: show_raw_agent_reasoning

codex exec --skip-git-repo-check --json \
  -c model_reasoning_summary=detailed -c show_raw_agent_reasoning=true \
  "What is 2+2?" 2>/dev/null | jq -c '{type, item_type: .item.type}'
{"type":"thread.started","item_type":null}
{"type":"turn.started","item_type":null}
{"type":"item.completed","item_type":"reasoning"}
{"type":"item.completed","item_type":"agent_message"}
{"type":"turn.completed","item_type":null}

The reasoning item has fields: id, text, type. Text is 38 characters of raw reasoning content. Without show_raw_agent_reasoning=true, no reasoning item would be emitted (compare to exp 61 where hide_agent_reasoning=true suppresses reasoning).

Test 78b: notify callback

codex exec --skip-git-repo-check \
  -c 'notify=["echo", "TURN_COMPLETE"]' \
  "respond with PING" 2>&1

Output: PING returned normally. No visible evidence of the notify callback in stdout/stderr. The callback likely runs in the background.

Test 78c: model_context_window

codex exec --skip-git-repo-check --json \
  -c model_context_window=4096 \
  "respond with PING" 2>/dev/null \
  | jq -c 'select(.type=="turn.completed") | .usage'
{"input_tokens":10142,"cached_input_tokens":10112,"output_tokens":94}

Setting model_context_window=4096 did not cause an error despite input_tokens (10,142) exceeding the specified window. The parameter is accepted but doesn't restrict single-turn requests — it likely controls when context compaction triggers in multi-turn conversations.

Test 78d: compact_prompt

codex exec --skip-git-repo-check \
  -c 'compact_prompt="Summarize the above in one line."' \
  "respond with PING" 2>&1

Output: PING returned normally. compact_prompt is accepted but has no observable effect in single-turn exec — it's used as the instruction when context compaction occurs in multi-turn conversations.

NOTE: Of the four advanced config options: (1) show_raw_agent_reasoning=true has a clear observable effect in exec --json mode — it emits a reasoning item with raw reasoning text in the JSONL stream. This complements exp 61's hide_agent_reasoning. (2) notify is silently accepted but its callback execution isn't visible in exec output — the callback likely runs as a background process. (3) model_context_window is accepted without error even when the input exceeds the specified value, suggesting it controls compaction thresholds rather than hard limits. (4) compact_prompt is accepted but only relevant for multi-turn conversations where context compaction occurs.
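To make a notify callback observable, the handler could write to a file instead of stdout. The sketch below rests on assumptions: that Codex invokes the configured program with the notification payload as its first argument, and that a log file is an acceptable side channel. The handler path, log path, and sample payload are hypothetical, and Codex is not run here.

```shell
#!/bin/sh
# Create a handler script that appends its first argument to a log file.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/codex-notify.XXXXXX") || exit 1
handler="$workdir/notify-handler.sh"
notify_log="$workdir/notify.log"

cat > "$handler" <<EOF
#!/bin/sh
# Append the first argument (assumed to be the notification payload) to a log.
printf '%s\n' "\${1:-<no payload>}" >> "$notify_log"
EOF
chmod +x "$handler"

# Simulate a single invocation locally with a made-up payload.
sh "$handler" '{"type":"example-payload"}'
cat "$notify_log"

# With Codex, the handler might be wired up like this (not run here):
# codex exec --skip-git-repo-check -c "notify=[\"$handler\"]" "respond with PING"
```

Checking the log after a run would show whether the callback fired, which stdout and stderr alone did not reveal in test 78b.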


79. [agents] config with multi_agent

Objective: Tests [agents] configuration options (max_threads, max_depth, job_max_runtime_seconds) with multi_agent enabled to verify they are parsed in exec mode.

Config:

[features]
multi_agent = true

[agents]
max_threads = 1
max_depth = 1
job_max_runtime_seconds = 30
CODEX_HOME=/tmp/codex-toml-79 codex exec --skip-git-repo-check --json \
  "respond with PING" 2>/dev/null \
  | jq -c 'select(.type=="turn.completed") | .usage'
{"input_tokens":9907,"cached_input_tokens":8960,"output_tokens":25}

NOTE: The [agents] config section is parsed without error when multi_agent=true is enabled. The input token count (9,907) is lower than the baseline with user's default config (10,142) because the custom CODEX_HOME has no MCP servers configured — the token difference reflects MCP tool definitions normally injected by the user's default exa/flywheel servers. With a simple PING prompt, no child agents were spawned, so max_threads, max_depth, and job_max_runtime_seconds constraints couldn't be directly observed. Testing these limits would require a prompt complex enough to trigger agent spawning, which is non-deterministic.


80. writable_roots and exclude_tmpdir_env_var

Objective: Tests writable_roots and exclude_tmpdir_env_var sandbox config options to verify fine-grained control over sandbox writable paths.

Test 80a: writable_roots

Config:

[sandbox_workspace_write]
writable_roots = ["/tmp/custom-dir-a", "/tmp/custom-dir-b"]
CODEX_HOME=/tmp/codex-toml-80 codex exec --skip-git-repo-check --full-auto \
  "respond with PING" 2>&1

Output (banner excerpt):

sandbox: workspace-write [workdir, /tmp, $TMPDIR, /tmp/custom-dir-a, /tmp/custom-dir-b, /private/tmp/codex-toml-80/memories]

Test 80b: exclude_tmpdir_env_var

Config:

[sandbox_workspace_write]
exclude_tmpdir_env_var = true
CODEX_HOME=/tmp/codex-toml-80b codex exec --skip-git-repo-check --full-auto \
  "respond with PING" 2>&1

Output (banner excerpt):

sandbox: workspace-write [workdir, /tmp, /private/tmp/codex-toml-80b/memories]

Compare default sandbox (without exclude_tmpdir_env_var):

sandbox: workspace-write [workdir, /tmp, $TMPDIR, ...]

NOTE: (1) writable_roots is the config-file equivalent of --add-dir (exp 18/49). Both custom directories (/tmp/custom-dir-a, /tmp/custom-dir-b) appear in the sandbox writable set in the banner, confirming they are added to the sandbox permissions. This is useful for CI/CD where config files define writable paths rather than CLI flags. (2) exclude_tmpdir_env_var=true removes $TMPDIR from the writable paths. Comparing banners: default includes $TMPDIR but with this flag it's absent. This complements exp 62's exclude_slash_tmp which removes /tmp. Together, both options allow fine-grained control over temporary directory access in the sandbox.
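The banner comparison can be automated by splitting the sandbox line into one path per line. A minimal sketch, assuming the banner format shown in the excerpts above; `writable_roots` here is a hypothetical shell function, not a Codex command:

```shell
#!/bin/sh
# Extract the bracketed list from a "sandbox: workspace-write [...]" line
# and print one entry per line.
writable_roots() {
  sed -n 's/^sandbox: workspace-write \[\(.*\)\]$/\1/p' | tr ',' '\n' | sed 's/^ *//'
}

# Banner line from test 80a.
printf '%s\n' 'sandbox: workspace-write [workdir, /tmp, $TMPDIR, /tmp/custom-dir-a, /tmp/custom-dir-b, /private/tmp/codex-toml-80/memories]' \
  | writable_roots
```

Piping the result through `grep -qxF` for a given path turns the visual banner check into a scriptable assertion.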


81. Custom model_providers and MCP server advanced options

Objective: Tests custom model_providers definitions and advanced MCP server options (required, enabled_tools, startup_timeout_sec) in exec mode.

Three sub-tests:

Test 81a: Custom model_providers

Config:

[model_providers.test_provider]
name = "Test Provider"
base_url = "https://httpbin.org/post"
env_key = "OPENAI_API_KEY"
wire_api = "responses"
CODEX_HOME=/tmp/codex-toml-81a codex exec --skip-git-repo-check \
  -c 'model_provider="test_provider"' "respond with PING" 2>&1
OpenAI Codex v0.114.0 (research preview)
--------
provider: test_provider     ← custom provider recognized
--------
ERROR: Missing environment variable: `OPENAI_API_KEY`.

First attempt without name field produced: Error loading config.toml: missing field 'name' in 'model_providers.test_provider'

Test 81b: MCP required=true with failing server

Config:

[mcp_servers.test_server]
enabled = true
command = "false"
args = []
startup_timeout_sec = 5
required = true
CODEX_HOME=/tmp/codex-toml-81b codex exec --skip-git-repo-check \
  "respond with PING" 2>&1
ERROR codex_core::codex: Failed to create session: required MCP servers failed to initialize: test_server: handshaking with MCP server failed: connection closed: initialize response
Error: thread/start: thread/start failed: error creating thread: Fatal error: Failed to initialize session: required MCP servers failed to initialize

Exit code: 1

Test 81c: MCP enabled_tools filtering

Config:

[mcp_servers.test_echo]
enabled = true
command = "npx"
args = ["-y", "@anthropic-ai/mcp-echo-server"]
startup_timeout_sec = 10
enabled_tools = ["nonexistent_tool"]
CODEX_HOME=/tmp/codex-toml-81c codex exec --skip-git-repo-check --json \
  "respond with PING" 2>/dev/null | jq -c '{type, item_type: .item.type}'
{"type":"thread.started","item_type":null}
{"type":"turn.started","item_type":null}
{"type":"item.completed","item_type":"agent_message"}
{"type":"turn.completed","item_type":null}

Session completed successfully. The enabled_tools filter was accepted — since "nonexistent_tool" doesn't match any real tool, effectively zero tools from this MCP server were available.

NOTE: (1) Custom model_providers requires a name field (missing it produces a config parse error). When properly configured, the provider appears in the exec banner as provider: test_provider. The env_key field specifies which environment variable holds the API key — custom providers use env vars, not auth.json. The wire_api field specifies the API protocol ("responses" for OpenAI Responses API format). (2) MCP required=true causes a fatal session error (exit code 1) when the server fails to initialize, preventing exec from running at all. Without required=true, MCP failures are non-fatal and the session continues. This is critical for CI/CD pipelines that depend on specific MCP tools. (3) enabled_tools filtering is silently accepted and restricts which tools from the MCP server are exposed to the model. Setting it to a non-matching value effectively disables all tools from that server without disabling the server itself.
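Because a custom provider reads its key from the variable named in env_key, a pipeline could check for that variable before invoking Codex. A sketch with a hypothetical helper, `require_env`; the error text loosely mirrors the message seen in test 81a but is produced by this script, not by Codex:

```shell
#!/bin/sh
# Return 0 if the named variable is set and non-empty in the current shell.
# Note: it must also be exported for a child process such as codex to see it.
require_env() {
  case "$1" in
    ''|[0-9]*|*[!A-Za-z0-9_]*)
      echo "invalid variable name: $1" >&2
      return 2 ;;
  esac
  eval "value=\${$1-}"
  if [ -z "$value" ]; then
    echo "Missing environment variable: $1" >&2
    return 1
  fi
}

if require_env OPENAI_API_KEY; then
  echo "OPENAI_API_KEY present; safe to call codex exec"
else
  echo "skipping codex exec"
fi
```

Failing early in the wrapper gives a pipeline a clearer signal than waiting for the session to start and abort.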

