Safari Browser Control

Control the user's real Safari browser on macOS using AppleScript and screencapture. Read pages, click elements, type text, take screenshots, navigate tabs —...

SkillRank score: 7.4/10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

safari-browser-control operates the user's real Safari browser on macOS via AppleScript and screencapture, enabling full browser automation with actual login state and cookies. covers reading pages, clicking elements, typing forms, taking screenshots, navigating tabs, and executing arbitrary javascript without extensions.

structure: 8.0
trigger phrases: 8.0
procedure: 8.0
edge cases: 6.0
documentation: 7.0

original SKILL.md from clawhub

---
name: safari-browser-control
description: Control the user's real Safari browser on macOS using AppleScript and screencapture. Read pages, click elements, type text, take screenshots, navigate tabs — all through the user's actual browser session with their cookies and logins. Zero dependencies, pure macOS native. Triggers on keywords like "safari", "browser", "web page", "open tab", "screenshot the page", "read this site", "browse", "click on", "fill in the form".
---

# Safari Browser Control

Operate the user's real Safari browser on macOS via AppleScript (`osascript`) and `screencapture`. This provides full access to the user's actual browser session — including login state, cookies, and open tabs — without any extensions or additional software.

Unlike Playwright or headless browsers, this skill controls **your real Safari** — same cookies, same logins, same tabs. Zero install, pure macOS native.

## Prerequisites

Before first use, verify two settings are enabled. Run this check at the start of every session:

```bash
osascript -e 'tell application "Safari" to get name of front window' 2>&1
```

If this fails, instruct the user to enable:
1. **System Settings > Privacy & Security > Automation** — grant terminal app permission to control Safari
2. **Safari > Settings > Advanced** — enable "Show features for web developers", then **Develop menu > Allow JavaScript from Apple Events**

## Core Capabilities

### 1. List All Open Tabs

```bash
osascript -e '
tell application "Safari"
  set output to ""
  repeat with w from 1 to (count of windows)
    repeat with t from 1 to (count of tabs of window w)
      set tabName to name of tab t of window w
      set tabURL to URL of tab t of window w
      set output to output & "W" & w & "T" & t & " | " & tabName & " | " & tabURL & linefeed
    end repeat
  end repeat
  return output
end tell'
```
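
The `W<window>T<tab> | name | URL` listing is straightforward to post-process in the shell. The following is a minimal sketch, not part of the original skill: `find_tab_by_url` is a hypothetical helper name, and it assumes the listing arrives on stdin exactly in the format printed above. Because tab names may themselves contain ` | `, it reads the id from the first field and the URL from the last.

```bash
# find_tab_by_url PATTERN
# Reads the "W#T# | name | URL" listing on stdin and prints the W#T# id of the
# first tab whose URL contains PATTERN. Returns 1 if nothing matches.
# Hypothetical helper; the name and interface are illustrative only.
find_tab_by_url() {
  PAT="$1" awk -F' [|] ' '
    # $1 is the W#T# id, $NF is the URL; a plain substring test avoids regex surprises
    NF >= 3 && index($NF, ENVIRON["PAT"]) > 0 { print $1; found = 1; exit }
    END { exit (found ? 0 : 1) }
  '
}
```

Typical use would be to pipe the `osascript` listing command into it, for example `... | find_tab_by_url github.com`, and then act on the returned id.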

### 2. Read Page Content

Read the full text content of the current tab:

```bash
osascript -e '
tell application "Safari"
  do JavaScript "document.body.innerText" in current tab of front window
end tell'
```

Read structured content (title, URL, meta description, headings):

```bash
osascript -e '
tell application "Safari"
  do JavaScript "JSON.stringify({
    title: document.title,
    url: location.href,
    description: document.querySelector(\"meta[name=description]\")?.content || \"\",
    h1: [...document.querySelectorAll(\"h1\")].map(e => e.textContent).join(\" | \"),
    h2: [...document.querySelectorAll(\"h2\")].map(e => e.textContent).join(\" | \")
  })" in current tab of front window
end tell'
```

Read a simplified DOM (similar to Chrome ACP's `browser_read`):

```bash
osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      const walk = (node, depth) => {
        let result = \"\";
        for (const child of node.childNodes) {
          if (child.nodeType === 3) {
            const text = child.textContent.trim();
            if (text) result += text + \"\\n\";
          } else if (child.nodeType === 1) {
            const tag = child.tagName.toLowerCase();
            if ([\"script\",\"style\",\"noscript\",\"svg\"].includes(tag)) continue;
            const style = getComputedStyle(child);
            if (style.display === \"none\" || style.visibility === \"hidden\") continue;
            if ([\"h1\",\"h2\",\"h3\",\"h4\",\"h5\",\"h6\"].includes(tag))
              result += \"#\".repeat(parseInt(tag[1])) + \" \";
            if (tag === \"a\") result += \"[\";
            if (tag === \"img\") result += \"[Image: \" + (child.alt || \"\") + \"]\\n\";
            else if (tag === \"input\") result += \"[Input \" + child.type + \": \" + (child.value || child.placeholder || \"\") + \"]\\n\";
            else if (tag === \"button\") result += \"[Button: \" + child.textContent.trim() + \"]\\n\";
            else result += walk(child, depth + 1);
            if (tag === \"a\") result += \"](\" + child.href + \")\\n\";
            if ([\"p\",\"div\",\"li\",\"tr\",\"br\",\"h1\",\"h2\",\"h3\",\"h4\",\"h5\",\"h6\"].includes(tag))
              result += \"\\n\";
          }
        }
        return result;
      };
      return walk(document.body, 0).substring(0, 50000);
    })()
  " in current tab of front window
end tell'
```

### 3. Execute JavaScript

Run arbitrary JavaScript in the page context and get the return value:

```bash
osascript -e '
tell application "Safari"
  do JavaScript "YOUR_JS_CODE_HERE" in current tab of front window
end tell'
```

For multi-line scripts, use a heredoc:

```bash
osascript << 'APPLESCRIPT'
tell application "Safari"
  do JavaScript "
    (function() {
      // Multi-line JS here
      return 'result';
    })()
  " in current tab of front window
end tell
APPLESCRIPT
```
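
Both forms above embed the JavaScript inside an AppleScript string literal, so every double quote in the JS has to be escaped. One way to sidestep that, sketched here under the assumption that `osascript` forwards trailing command-line arguments to an `on run argv` handler (as its man page describes), is to pass the JavaScript as an argument. `safari_js` is a hypothetical wrapper name, not something the skill defines.

```bash
# safari_js JS_SOURCE
# Runs JS_SOURCE in the current tab of the front Safari window and prints the result.
# The JavaScript travels as an argument to the AppleScript run handler, so quotes
# inside it need no AppleScript-level escaping. Hypothetical wrapper name.
safari_js() {
  osascript \
    -e 'on run argv' \
    -e 'tell application "Safari" to do JavaScript (item 1 of argv) in current tab of front window' \
    -e 'end run' \
    "$1"
}
```

With this in place, a call such as `safari_js 'document.querySelector("h1").textContent'` needs only ordinary shell quoting. JavaScript that begins with a dash might be mistaken for an option, so wrapping it in parentheses is a cautious habit.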

### 4. Screenshot

Two approaches are available. Auto-detect which to use at session start:

```bash
# The helper (compiled in the next block) succeeds when it finds an on-screen Safari window;
# the skill treats that as the signal that background screenshots can be used
/tmp/safari_wid 2>/dev/null && echo "BACKGROUND_SCREENSHOT=true" || echo "BACKGROUND_SCREENSHOT=false"
```

#### Background Screenshot (requires Screen Recording permission)

If the user has granted Screen Recording permission to the terminal app, use `screencapture -l` to capture Safari **without activating it**:

```bash
# Compile the helper once per session (if not already compiled)
if [ ! -f /tmp/safari_wid ]; then
cat > /tmp/safari_wid.swift << 'SWIFT'
import CoreGraphics
import Foundation
let options: CGWindowListOption = [.optionOnScreenOnly, .excludeDesktopElements]
guard let windowList = CGWindowListCopyWindowInfo(options, kCGNullWindowID) as? [[String: Any]] else { exit(1) }
for window in windowList {
    guard let owner = window[kCGWindowOwnerName as String] as? String,
          owner == "Safari",
          let layer = window[kCGWindowLayer as String] as? Int,
          layer == 0,
          let wid = window[kCGWindowNumber as String] as? Int else { continue }
    print(wid)
    exit(0)
}
exit(1)
SWIFT
swiftc /tmp/safari_wid.swift -o /tmp/safari_wid
fi

# Capture Safari window in background (no activation needed)
WID=$(/tmp/safari_wid)
screencapture -l "$WID" -o -x /tmp/safari_screenshot.png
```

To enable this, instruct the user: **System Settings > Privacy & Security > Screen Recording** — grant permission to the terminal app (Terminal / iTerm / Warp).

#### Foreground Screenshot (no extra permissions needed)

If Screen Recording is not granted, fall back to region-based capture. This briefly activates Safari (~0.5s), then switches back:

```bash
# Remember current frontmost app
FRONT_APP=$(osascript -e 'tell application "System Events" to get name of first process whose frontmost is true')

# Activate Safari and capture its window region
osascript -e 'tell application "Safari" to activate'
sleep 0.3
BOUNDS=$(osascript -e '
tell application "System Events"
  tell process "Safari"
    -- Safari may expose a thin toolbar as window 1; find the largest window
    set bestW to 0
    set bestBounds to ""
    repeat with i from 1 to (count of windows)
      set {x, y} to position of window i
      set {w, h} to size of window i
      if w * h > bestW then
        set bestW to w * h
        set bestBounds to (x as text) & "," & (y as text) & "," & (w as text) & "," & (h as text)
      end if
    end repeat
    return bestBounds
  end tell
end tell')
screencapture -x -R "$BOUNDS" /tmp/safari_screenshot.png

# Switch back to the previous app
osascript -e "tell application \"$FRONT_APP\" to activate"
```
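
If Safari has no open window, the AppleScript above returns an empty string and `screencapture -R` receives a malformed region. A small guard can catch that before capturing. This is a sketch only: `valid_bounds` is a hypothetical helper, and the assumption that x and y may be negative (displays positioned left of or above the main one) while width and height must be positive is mine, not the skill's.

```bash
# valid_bounds STRING
# Succeeds only if STRING looks like "x,y,w,h" with integer fields and no spaces.
# x and y are allowed to be negative; w and h must be positive.
# Hypothetical helper, not part of the original skill.
valid_bounds() {
  printf '%s' "$1" | grep -Eq '^-?[0-9]+,-?[0-9]+,[1-9][0-9]*,[1-9][0-9]*$'
}
```

It could be used as `valid_bounds "$BOUNDS" && screencapture -x -R "$BOUNDS" /tmp/safari_screenshot.png`, reporting an error otherwise.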

After capturing with either method, read the screenshot to see what's on screen:

```
Use the Read tool on /tmp/safari_screenshot.png to view the captured image.
```

### 5. Navigate

Open a URL in the current tab:

```bash
osascript -e '
tell application "Safari"
  set URL of current tab of front window to "https://example.com"
end tell'
```

Open a URL in a new tab:

```bash
osascript -e '
tell application "Safari"
  tell front window
    set newTab to make new tab with properties {URL:"https://example.com"}
    set current tab to newTab
  end tell
end tell'
```

Open a URL in a new window:

```bash
osascript -e 'tell application "Safari" to make new document with properties {URL:"https://example.com"}'
```

### 6. Click Elements

Click using JavaScript (preferred — works with SPAs and reactive frameworks):

```bash
osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      // Wrapped in a function so that const does not leak into the page and fail on a second run
      const el = document.querySelector(\"button.submit\");
      if (!el) return \"element not found\";
      el.dispatchEvent(new MouseEvent(\"click\", {bubbles: true, cancelable: true}));
      return \"clicked\";
    })()
  " in current tab of front window
end tell'
```

**Important**: Use `dispatchEvent(new MouseEvent(..., {bubbles: true}))` instead of `.click()` for React/Vue/Angular compatibility. Native `.click()` may bypass synthetic event handlers.
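
To reuse the click pattern with different selectors, the JavaScript can be generated rather than hand-edited. The sketch below is an illustration under stated assumptions: `js_string_escape` and `click_js` are hypothetical names, and the escaping covers only backslashes and double quotes, which is enough for typical single-line CSS selectors but not for selectors containing newlines.

```bash
# js_string_escape TEXT
# Escapes backslashes and double quotes so TEXT can sit inside a JS "..." literal.
js_string_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# click_js SELECTOR
# Prints a self-contained snippet that dispatches a bubbling click on SELECTOR
# and evaluates to "clicked" or "element not found". Hypothetical helper.
click_js() {
  local sel
  sel=$(js_string_escape "$1")
  printf '(function(){var el=document.querySelector("%s");if(!el)return "element not found";el.dispatchEvent(new MouseEvent("click",{bubbles:true,cancelable:true}));return "clicked";})()' "$sel"
}
```

The generated text still has to reach Safari. One option that avoids a second layer of escaping is to hand it to an AppleScript run handler: `osascript -e 'on run argv' -e 'tell application "Safari" to do JavaScript (item 1 of argv) in current tab of front window' -e 'end run' "$(click_js 'button.submit')"`.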

### 7. Type and Fill Forms

Set input values via JavaScript:

```bash
osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      // Wrapped in a function so that const does not leak into the page and fail on a second run
      const input = document.querySelector(\"input[name=search]\");
      if (!input) return \"element not found\";
      const nativeSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, \"value\").set;
      nativeSetter.call(input, \"search text\");
      input.dispatchEvent(new Event(\"input\", {bubbles: true}));
      input.dispatchEvent(new Event(\"change\", {bubbles: true}));
      return \"filled\";
    })()
  " in current tab of front window
end tell'
```

**Important**: For React-controlled inputs, use the native setter + `dispatchEvent` pattern shown above. Directly setting `.value` will not trigger React's state update.

Type via System Events (simulates real keyboard — useful when JS injection is blocked):

```bash
osascript -e '
tell application "Safari" to activate
delay 0.3
tell application "System Events"
  keystroke "hello world"
end tell'
```

Press special keys:

```bash
osascript -e '
tell application "System Events"
  key code 36  -- Enter/Return
  key code 48  -- Tab
  key code 51  -- Delete/Backspace
  keystroke "a" using command down  -- Cmd+A (select all)
  keystroke "c" using command down  -- Cmd+C (copy)
end tell'
```

### 8. Scroll

```bash
# Scroll down 500px
osascript -e 'tell application "Safari" to do JavaScript "window.scrollBy(0, 500)" in current tab of front window'

# Scroll to top
osascript -e 'tell application "Safari" to do JavaScript "window.scrollTo(0, 0)" in current tab of front window'

# Scroll to bottom
osascript -e 'tell application "Safari" to do JavaScript "window.scrollTo(0, document.body.scrollHeight)" in current tab of front window'

# Scroll element into view
osascript -e 'tell application "Safari" to do JavaScript "document.querySelector(\"#target\").scrollIntoView({behavior: \"smooth\"})" in current tab of front window'
```

### 9. Switch Tabs

```bash
# Switch to tab 2 in the front window
osascript -e 'tell application "Safari" to set current tab of front window to tab 2 of front window'

# Switch to a tab by URL match
osascript -e '
tell application "Safari"
  repeat with t from 1 to (count of tabs of front window)
    if URL of tab t of front window contains "github.com" then
      set current tab of front window to tab t of front window
      exit repeat
    end if
  end repeat
end tell'
```

### 10. Wait for Page Load

```bash
osascript -e '
tell application "Safari"
  -- Wait until page finishes loading (max 10 seconds)
  repeat 20 times
    set readyState to do JavaScript "document.readyState" in current tab of front window
    if readyState is "complete" then exit repeat
    delay 0.5
  end repeat
end tell'
```
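
The AppleScript loop above exits silently whether or not the page became ready. When the caller needs to know which happened, the polling can live in the shell instead. This is a rough sketch: `wait_until` and `page_ready` are hypothetical helpers, and the 20 attempts at 0.5 seconds in the usage note simply mirror the loop above.

```bash
# wait_until ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# page_ready: succeeds when the front tab reports document.readyState "complete".
page_ready() {
  [ "$(osascript -e 'tell application "Safari" to do JavaScript "document.readyState" in current tab of front window' 2>/dev/null)" = "complete" ]
}
```

A call like `wait_until 20 0.5 page_ready && echo ready || echo timeout` then gives an explicit result. Note that `readyState` says nothing about content a single-page app loads afterwards, so a selector-based check may be a better `COMMAND` on such sites.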

## Workflow: Browsing with Screenshot Feedback Loop

For tasks that require visual confirmation, use the screenshot loop:

1. Perform action (navigate, click, scroll, etc.)
2. Wait for page load if needed
3. Take screenshot (background or foreground) → Read the image to see result
4. Decide next action based on what is visible

## Operating on Specific Tabs

To operate on a tab other than the current one, use `tab N of window M` syntax:

```bash
# Read content of tab 3 in window 1
osascript -e 'tell application "Safari" to do JavaScript "document.title" in tab 3 of window 1'

# Execute JS in a specific tab
osascript -e 'tell application "Safari" to do JavaScript "document.body.innerText.substring(0, 1000)" in tab 2 of front window'
```
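
The `W#T#` ids printed by the tab listing map directly onto this `tab N of window M` syntax. A minimal sketch of that conversion follows; `tab_target` is a hypothetical helper, and it assumes ids in exactly the `W<window>T<tab>` form with 1-based indices.

```bash
# tab_target ID
# Converts an id such as "W2T3" into the AppleScript reference "tab 3 of window 2".
# Prints an error and returns 1 for anything that is not W<number>T<number>.
# Hypothetical helper, not part of the original skill.
tab_target() {
  local ref
  ref=$(printf '%s\n' "$1" | sed -n -E 's/^W([1-9][0-9]*)T([1-9][0-9]*)$/tab \2 of window \1/p')
  if [ -n "$ref" ]; then
    printf '%s\n' "$ref"
  else
    echo "tab_target: expected an id like W1T2, got: $1" >&2
    return 1
  fi
}
```

For example, `osascript -e "tell application \"Safari\" to do JavaScript \"document.title\" in $(tab_target W1T3)"` reads the title of tab 3 in window 1 without switching to it.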

Note: Background screenshots capture the entire Safari window (whichever tab is active). To screenshot a specific tab, first switch to it via AppleScript.

## Limitations

- **macOS only** — AppleScript and screencapture are macOS-specific
- **Cannot intercept network requests** — only page content and JS execution
- **Cannot access cross-origin iframes** — browser security applies
- **Private browsing windows** — AppleScript cannot control private windows
- **System Events keystroke is "blind"** — it types into whatever is focused; ensure Safari is frontmost before using


added explicit intent, inputs (permissions checklist, external connections), comprehensive 12-step procedure with input/output contracts for each step, decision logic for permission failures and edge cases (private windows, js failures, keystroke blocking), structured output contract with file locations and formats, and outcome signals for success validation.

Safari Browser Control

intent

control your actual safari browser on macos via applescript and screencapture, giving you full access to your real browser session (login state, cookies, open tabs) without extensions or headless browsers. use this when you need to interact with sites that require authentication, multi-step flows, or visual confirmation. triggers on keywords like "safari", "browser", "web page", "open tab", "screenshot the page", "read this site", "browse", "click on", "fill in the form".

inputs

system requirements:

macOS (10.13+). applescript and screencapture are native.
safari running (not required to be frontmost).
terminal app with automation permission granted.

permissions (checked at session start):

system settings > privacy & security > automation: grant your terminal app (terminal, iterm, warp) permission to control safari. test with:

osascript -e 'tell application "Safari" to get name of front window' 2>&1

if this fails, enable the permission and retry.

safari > settings > advanced: enable "show features for web developers", then develop menu > allow javascript from apple events. verify with a js execution command.
optional: system settings > privacy & security > screen recording: grant the terminal app permission for background screenshots (no activation needed). if not granted, fall back to the foreground screenshot (briefly activates safari).

external connections:

none. purely local applescript and screencapture.

session context:

current frontmost app name (used to restore focus after foreground screenshots).
list of open safari windows and tabs (detected via applescript).
safari window geometry (bounds) for region-based screenshots.

procedure

step 1: detect permissions and capabilities at session start

input: none. output: environment variables set: AUTOMATION_OK (true/false), SCREEN_RECORDING_OK (true/false), BACKGROUND_SCREENSHOT_AVAILABLE (true/false).

run:

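# test automation permission
# macos typically reports a refused request as "Not authorized to send Apple events" (error -1743)
if osascript -e 'tell application "Safari" to get name of front window' 2>&1 | grep -Eqi 'not authori[sz]ed|not allowed|-1743'; then
  export AUTOMATION_OK=false
else
  export AUTOMATION_OK=true
fi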

# test screen recording (compile swift helper if not already done)
if [ ! -f /tmp/safari_wid ]; then
  cat > /tmp/safari_wid.swift << 'SWIFT'
import CoreGraphics
import Foundation
let options: CGWindowListOption = [.optionOnScreenOnly, .excludeDesktopElements]
guard let windowList = CGWindowListCopyWindowInfo(options, kCGNullWindowID) as? [[String: Any]] else { exit(1) }
for window in windowList {
    guard let owner = window[kCGWindowOwnerName as String] as? String,
          owner == "Safari",
          let layer = window[kCGWindowLayer as String] as? Int,
          layer == 0,
          let wid = window[kCGWindowNumber as String] as? Int else { continue }
    print(wid)
    exit(0)
}
exit(1)
SWIFT
  swiftc /tmp/safari_wid.swift -o /tmp/safari_wid 2>/dev/null || true
fi

if /tmp/safari_wid 2>/dev/null >/dev/null; then
  export BACKGROUND_SCREENSHOT_AVAILABLE=true
  export SCREEN_RECORDING_OK=true
else
  export BACKGROUND_SCREENSHOT_AVAILABLE=false
  export SCREEN_RECORDING_OK=false
fi

if AUTOMATION_OK is false, instruct user to grant automation permission and retry.

step 2: list all open tabs and windows

input: none. output: formatted text (W#T# | tab name | url), one per line.

osascript -e '
tell application "Safari"
  set output to ""
  repeat with w from 1 to (count of windows)
    repeat with t from 1 to (count of tabs of window w)
      set tabName to name of tab t of window w
      set tabURL to URL of tab t of window w
      set output to output & "W" & w & "T" & t & " | " & tabName & " | " & tabURL & linefeed
    end repeat
  end repeat
  return output
end tell'

step 3: read page content (current tab)

input: javascript extraction method (options: body text, structured metadata, simplified dom). output: plain text or json string.

option a: body text only

osascript -e '
tell application "Safari"
  do JavaScript "document.body.innerText" in current tab of front window
end tell'

option b: structured metadata (title, url, description, headings)

osascript -e '
tell application "Safari"
  do JavaScript "JSON.stringify({
    title: document.title,
    url: location.href,
    description: document.querySelector(\"meta[name=description]\")?.content || \"\",
    h1: [...document.querySelectorAll(\"h1\")].map(e => e.textContent).join(\" | \"),
    h2: [...document.querySelectorAll(\"h2\")].map(e => e.textContent).join(\" | \")
  })" in current tab of front window
end tell'

option c: simplified dom (markdown-like structure, max 50kb)

osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      const walk = (node, depth) => {
        let result = \"\";
        for (const child of node.childNodes) {
          if (child.nodeType === 3) {
            const text = child.textContent.trim();
            if (text) result += text + \"\\n\";
          } else if (child.nodeType === 1) {
            const tag = child.tagName.toLowerCase();
            if ([\"script\",\"style\",\"noscript\",\"svg\"].includes(tag)) continue;
            const style = getComputedStyle(child);
            if (style.display === \"none\" || style.visibility === \"hidden\") continue;
            if ([\"h1\",\"h2\",\"h3\",\"h4\",\"h5\",\"h6\"].includes(tag))
              result += \"#\".repeat(parseInt(tag[1])) + \" \";
            if (tag === \"a\") result += \"[\";
            if (tag === \"img\") result += \"[Image: \" + (child.alt || \"\") + \"]\\n\";
            else if (tag === \"input\") result += \"[Input \" + child.type + \": \" + (child.value || child.placeholder || \"\") + \"]\\n\";
            else if (tag === \"button\") result += \"[Button: \" + child.textContent.trim() + \"]\\n\";
            else result += walk(child, depth + 1);
            if (tag === \"a\") result += \"](\" + child.href + \")\\n\";
            if ([\"p\",\"div\",\"li\",\"tr\",\"br\",\"h1\",\"h2\",\"h3\",\"h4\",\"h5\",\"h6\"].includes(tag))
              result += \"\\n\";
          }
        }
        return result;
      };
      return walk(document.body, 0).substring(0, 50000);
    })()
  " in current tab of front window
end tell'

all options output plain text (utf-8). if js execution times out or fails, the error message is printed to stderr.

step 4: execute arbitrary javascript in current tab

input: javascript code (single-line string or multi-line via heredoc). output: return value of the js (as string, or "error" message on failure).

single-line:

osascript -e '
tell application "Safari"
  do JavaScript "YOUR_JS_CODE_HERE" in current tab of front window
end tell'

multi-line (heredoc):

osascript << 'APPLESCRIPT'
tell application "Safari"
  do JavaScript "
    (function() {
      // your multi-line code here; put whatever should be returned into result
      const result = { title: document.title };
      return JSON.stringify(result);
    })()
  " in current tab of front window
end tell
APPLESCRIPT

step 5: take a screenshot

input: screenshot method (auto-detected from BACKGROUND_SCREENSHOT_AVAILABLE). output: png file at /tmp/safari_screenshot.png.

if background screenshot available (no activation):

WID=$(/tmp/safari_wid)
if [ -n "$WID" ]; then
  screencapture -l "$WID" -o -x /tmp/safari_screenshot.png
else
  echo "safari not found in window list" >&2
fi

if background screenshot not available (fallback to foreground):

# remember current frontmost app
FRONT_APP=$(osascript -e 'tell application "System Events" to get name of first process whose frontmost is true')

# activate safari and get window bounds
osascript -e 'tell application "Safari" to activate'
sleep 0.3

BOUNDS=$(osascript -e '
tell application "System Events"
  tell process "Safari"
    set bestW to 0
    set bestBounds to ""
    repeat with i from 1 to (count of windows)
      set {x, y} to position of window i
      set {w, h} to size of window i
      if w * h > bestW then
        set bestW to w * h
        set bestBounds to (x as text) & "," & (y as text) & "," & (w as text) & "," & (h as text)
      end if
    end repeat
    return bestBounds
  end tell
end tell')

# capture the region
screencapture -x -R "$BOUNDS" /tmp/safari_screenshot.png

# restore previous app
osascript -e "tell application \"$FRONT_APP\" to activate"

after taking the screenshot, use your read tool on /tmp/safari_screenshot.png to view it.

step 6: navigate to a url

input: url string, target (current tab, new tab, or new window). output: none (url is set).

current tab:

osascript -e '
tell application "Safari"
  set URL of current tab of front window to "https://example.com"
end tell'

new tab (focus to it):

osascript -e '
tell application "Safari"
  tell front window
    set newTab to make new tab with properties {URL:"https://example.com"}
    set current tab to newTab
  end tell
end tell'

new window:

osascript -e 'tell application "Safari" to make new document with properties {URL:"https://example.com"}'

after navigating, wait for page load (step 11).

step 7: click an element

input: css selector string. output: "clicked" or "element not found".

use javascript dispatch (preferred for react/vue/angular):

osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      // wrapped in a function so that const does not leak into the page and fail on a second run
      const el = document.querySelector(\"button.submit\");
      if (!el) return \"element not found\";
      el.dispatchEvent(new MouseEvent(\"click\", {bubbles: true, cancelable: true}));
      return \"clicked\";
    })()
  " in current tab of front window
end tell'

critical: use dispatchEvent(new MouseEvent(..., {bubbles: true})) instead of .click() for react/vue/angular compatibility. native .click() may bypass synthetic event handlers.

step 8: fill form inputs

input: css selector for input, text value. output: "filled" or "element not found".

for react-controlled inputs (required):

osascript -e '
tell application "Safari"
  do JavaScript "
    (function() {
      // wrapped in a function so that const does not leak into the page and fail on a second run
      const input = document.querySelector(\"input[name=search]\");
      if (!input) return \"element not found\";
      const nativeSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, \"value\").set;
      nativeSetter.call(input, \"search text\");
      input.dispatchEvent(new Event(\"input\", {bubbles: true}));
      input.dispatchEvent(new Event(\"change\", {bubbles: true}));
      return \"filled\";
    })()
  " in current tab of front window
end tell'

for keyboard input (when js injection is blocked or for real typing):

osascript -e 'tell application "Safari" to activate'
sleep 0.3
osascript -e '
tell application "System Events"
  keystroke "hello world"
end tell'

special keys (applescript key codes):

36: enter/return
48: tab
51: delete/backspace
cmd+a: keystroke "a" using command down
cmd+c (copy): keystroke "c" using command down

step 9: scroll

input: direction and distance (pixels), or scroll target (top, bottom, element). output: none (page scrolled).

# scroll down 500px
osascript -e 'tell application "Safari" to do JavaScript "window.scrollBy(0, 500)" in current tab of front window'

# scroll to top
osascript -e 'tell application "Safari" to do JavaScript "window.scrollTo(0, 0)" in current tab of front window'

# scroll to bottom
osascript -e 'tell application "Safari" to do JavaScript "window.scrollTo(0, document.body.scrollHeight)" in current tab of front window'

# scroll element into view (smooth)
osascript -e 'tell application "Safari" to do JavaScript "document.querySelector(\"#target\").scrollIntoView({behavior: \"smooth\"})" in current tab of front window'

step 10: switch tabs

input: tab index (1-based) or url substring to match. output: none (tab switched).

by index:

osascript -e 'tell application "Safari" to set current tab of front window to tab 2 of front window'

by url match:

osascript -e '
tell application "Safari"
  repeat with t from 1 to (count of tabs of front window)
    if URL of tab t of front window contains "github.com" then
      set current tab of front window to tab t of front window
      exit repeat
    end if
  end repeat
end tell'

step 11: wait for page load

input: max timeout (seconds). output: "ready" or "timeout".

osascript -e '
tell application "Safari"
  set readyState to ""
  repeat 20 times
    set readyState to do JavaScript "document.readyState" in current tab of front window
    if readyState is "complete" then
      return "ready"
    end if
    delay 0.5
  end repeat
  return "timeout"
end tell'

timeout default is 10 seconds (20 * 0.5). adjust delays for longer waits.

step 12: operate on a specific tab (not current)

input: window index (1-based), tab index (1-based). output: depends on operation (js return value, etc.).

to read a specific tab without switching:

osascript -e 'tell application "Safari" to do JavaScript "document.title" in tab 3 of window 1'

to execute js in a non-current tab:

osascript -e 'tell application "Safari" to do JavaScript "document.body.innerText.substring(0, 1000)" in tab 2 of front window'

note: background screenshots always capture the currently active (visible) tab. to screenshot a specific tab, switch to it first (step 10).

decision points

if automation permission not granted:

stop and instruct user: "system settings > privacy & security > automation: grant your terminal app permission to control safari."
re-run step 1 after user grants permission.

if screen recording permission not granted:

use foreground screenshot (step 5 fallback). briefly activates safari (~0.5s), then restores previous app. visual flicker is normal.
optionally prompt user: "system settings > privacy & security > screen recording: grant permission to your terminal app to avoid focus-switching."

if js execution fails (timeout or syntax error):

check js for quotes, escaping, syntax validity.
if js is complex, break into smaller steps and test each in isolation.
if the js is correct but still fails, the page may be in a private window (not supported) or js injection is blocked (rare; fall back to keyboard input and screenshots).
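
the branching above can be sketched as a small classifier over the text osascript prints on failure. this is an illustration only: `classify_osascript_error` is a hypothetical name, and the matched phrases are assumptions based on commonly seen messages (error -1743 for a refused automation request, a mention of "Allow JavaScript from Apple Events" when that setting is off); exact wording may differ between macos and safari versions.

```bash
# classify_osascript_error STDERR_TEXT
# Maps osascript error text to a coarse category for the decision points above.
# The patterns are assumptions; message wording varies across macOS/Safari versions.
# Hypothetical helper, not part of the original skill.
classify_osascript_error() {
  case "$1" in
    "")                                                echo "none" ;;
    *"-1743"*|*"Not authorized to send Apple events"*) echo "automation-permission" ;;
    *"Allow JavaScript from Apple Events"*)            echo "js-from-apple-events-disabled" ;;
    *"syntax error"*)                                  echo "applescript-syntax" ;;
    *)                                                 echo "unknown" ;;
  esac
}
```

a caller could capture stderr from a failing command and branch on the category, treating "unknown" as the cue to fall back to screenshots and keyboard input.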

if element not found (selector returns null):

re-read the page content (step 3) to verify current dom.
take a screenshot (step 5) to see visual state.
adjust selector or wait for dynamic content to load (step 11).

if a tab is in a private window:

applescript cannot control private windows. instruct user to use a non-private window.

if keystroke input is unresponsive:

ensure safari is frontmost (activate it before keystroke).
verify the input field is focused (use js to .focus() if needed).
if still blocked, the site may have input event filtering. try js-based fill instead (step 8).

if a multi-step interaction is failing (e.g. fill form > click submit > wait > read result):

insert screenshots and reads between steps to confirm state.
use the screenshot feedback loop (below) to diagnose.

output contract

step 1 (permissions): environment variables exported to current shell session. success: AUTOMATION_OK=true and screenshot method detected.

step 2 (list tabs): multiline text, format W#T# | name | url. one tab per line. empty if no tabs open.

step 3 (read content): plain text (utf-8) for body text or simplified dom. json string for structured metadata. max 50kb for dom read. special characters preserved (newlines, unicode).

step 4 (execute js): return value of js expression (string, json, or error message). null becomes "null" string. syntax errors print to stderr.

step 5 (screenshot): png file at /tmp/safari_screenshot.png. image dimensions match safari window (or specified region). file size typically 100kb-500kb depending on page complexity.

step 6 (navigate): none (side effect: url changed). verify with step 3 (read title/url metadata) or step 5 (screenshot).

step 7 (click): return string "clicked" or "element not found". no actual visual feedback without a screenshot.

step 8 (fill): return string "filled" or "element not found". value is set in the input (verify with step 3 if needed).

step 9 (scroll): none. verify with screenshot or read (page offset changes but no confirmation message).

step 10 (switch tab): none. current tab changed. verify with step 2 (tab listing shows current) or step 3 (read new tab content).

step 11 (wait for page load): return string "ready" or "timeout". "ready" means document.readyState is "complete".

step 12 (operate on specific tab): same as the operation (js return, etc.). tab does not need to be active/visible.

outcome signal

permission check passed: step 1 exports AUTOMATION_OK=true to shell.
page read successfully: step 3 returns text/json with body content or metadata (non-empty, no applescript error).
js executed: step 4 returns a value without error. can verify by reading a known property (e.g. document.title returns page title).
screenshot taken: file exists at /tmp/safari_screenshot.png. use your read tool to view it and confirm content.
navigation worked: step 6 completes without error. step 3 or 5 shows new page content.
click worked: step 7 returns "clicked". page changes (step 5 screenshot) or form state changes (step 3 read) confirm.
form filled: step 8 returns "filled". input value is set (step 3 reads the input field value or placeholder).
page is ready: step 11 returns "ready". subsequent js/click/read commands execute cleanly.
tab switched: step 10 completes. step 2 or step 3 confirms you are reading the new tab content.
feedback loop complete: screenshot shows result of action, confirming intent (e.g. button changed color, form was submitted, navigation happened).

working example: browse and screenshot feedback loop

typical workflow for a task that requires visual confirmation:

# 1. check permissions (step 1)
osascript -e 'tell application "Safari" to get name of front window' 2>&1

# 2. navigate to a site (step 6)
osascript -e 'tell application "Safari" to set URL of current tab of front window to "https://example.com"'

# 3. wait for load (step 11)
osascript -e '
tell application "Safari"
  repeat 20 times
    set readyState to do JavaScript "document.readyState" in current tab of front window
    if readyState is "complete" then exit repeat
    delay 0.5
  end repeat
end tell'

# 4. take screenshot (step 5)
# osascript prints the bounds list as "x, y, w, h"; tr strips the spaces so -R receives "x,y,w,h"
/tmp/safari_wid >/dev/null 2>&1 && \
  screencapture -l "$(/tmp/safari_wid)" -o -x /tmp/safari_screenshot.png || \
  (osascript -e 'tell application "Safari" to activate'; sleep 0.3; \
   screencapture -x -R "$(osascript -e 'tell application "System Events" to tell process "Safari" to return (position of window 1) & (size of window 1)' | tr -d ' ')" /tmp/safari_screenshot.png)

# 5. read the screenshot with your read tool to see result
# (use your built-in read capability on /tmp/safari_screenshot.png)

# 6. based on visual result, decide next action (scroll, click, fill, etc.)
# (repeat from step 3 or 4 as needed)