Markitdown-Skill-for-non-multimodal-agent

Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx...

SkillRank score: 8.5 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-06-15

markitdown converts non-text documents (pdf, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to markdown for text-only llm agents. optional ocr layer reads scanned pdfs and images via openai-compatible vision models when an api key is configured.

structure: 9.0
trigger phrases: 9.0
procedure: 9.0
edge cases: 8.0
documentation: 8.0

strengths

original SKILL.md from clawhub

---
name: markitdown
description: "Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx/xls), EPUB, HTML, CSV, JSON, XML, or ZIP — and you need its CONTENT to read, summarize, quote, or store it. The model can't open the file itself, so convert it to Markdown first with the markitdown MCP server (local, free, no API key). An OPTIONAL OCR layer reads scanned PDFs and images via an OpenAI-compatible vision model, but ONLY when a key is configured. Skip for files the agent can already read as plain text, and for plain images when no OCR key is set."
version: 0.1.0
homepage: https://github.com/Self-made-Orange/markitdown
metadata:
  openclaw:
    emoji: 📄
    requires:
      anyBins: [uvx, markitdown-mcp]
    envVars:
      - name: OPENAI_API_KEY
        required: false
        description: "OPTIONAL. Turns on the OCR layer (reading images and scanned PDFs) via the markitdown CLI. Costs API calls per image. Leave unset for free, text-only document conversion. Any OpenAI-compatible endpoint works; set OPENAI_BASE_URL to point elsewhere."
      - name: MARKITDOWN_OCR_MODEL
        required: false
        description: "Vision model used by the OCR layer. Default: gpt-4o-mini (cheapest vision-capable). Only read when OPENAI_API_KEY is set."
---

# markitdown — let a text-only agent read documents

A **non-multimodal** OpenClaw agent has no eyes: its backend is a plain-text API, so it cannot open a PDF / Word / Excel / PowerPoint attachment at all. This skill turns those files into Markdown the model *can* read.

Two layers, and you almost always only need the first:

- **Free layer — DEFAULT.** The `markitdown` MCP server converts text-bearing documents (PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to Markdown **locally**. No API key. No per-call cost. This handles the vast majority of attachments, because most documents store real text.
- **OCR layer — OPT-IN.** Scanned PDFs (photographed pages), standalone image files, and images embedded *inside* documents contain no extractable text — the only way to read them is to have a vision model look. This layer is **OFF unless `OPENAI_API_KEY` is set**, and it **bills per image**.

> The cost line is simple: pulling existing text out of a file is free; asking a model to *look at a picture* costs money.

## When to use

A document attachment arrives (Slack file, email attachment, a path the user gives you) whose extension or MIME is one of:

`pdf · docx · pptx · xlsx · xls · epub · html · htm · csv · json · xml · zip · md`
…and you need its **content** (to answer about it, summarize it, quote it, or save it).

For plain images (`png · jpg · jpeg · gif · webp · tiff`): only useful **if the OCR layer is on**. With no key, a text-only agent simply cannot read an image — say so rather than guessing.

Do **not** use this for a file the agent can already read as plain text in the prompt.

---

## Setup (operator, one time)

### Free layer — the MCP server

Run the server over stdio with no install using `uvx`:

```bash
uvx markitdown-mcp
```

Then register it in your MCP client config:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "uvx",
      "args": ["markitdown-mcp"]
    }
  }
}
```

This exposes one tool: **`convert_to_markdown(uri)`**, where `uri` is any `http:`, `https:`, `file:`, or `data:` URI. That is the whole free layer.

> The MCP server runs with the privileges of its process and can read any file that user can read. Keep it bound to local/stdio use only.
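
For orientation, the request an MCP client sends when it invokes this tool looks roughly like the message below. This is a minimal sketch of the JSON-RPC `tools/call` shape used by MCP; the helper name `build_convert_request` and the example path are illustrative only, and in practice your MCP client library builds and sends this for you.

```python
import json


def build_convert_request(uri: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC message that calls convert_to_markdown (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "convert_to_markdown",
            "arguments": {"uri": uri},  # the tool's single argument
        },
    }


request = build_convert_request("file:///tmp/report.pdf")
print(json.dumps(request, indent=2))
```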

### OCR layer — the CLI (optional)

The MCP server **cannot OCR** — it never wires up a vision client, so even with plugins enabled it silently returns text-only output. OCR runs through the **CLI** instead. Install the fork (which ships the `markitdown-ocr` plugin) plus an OpenAI client:

```bash
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]"
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr"
pip install openai
```

Then export a key (any OpenAI-compatible endpoint works):

```bash
export OPENAI_API_KEY=sk-...
export MARKITDOWN_OCR_MODEL=gpt-4o-mini # cheapest vision model; optional
```

With no `OPENAI_API_KEY`, the plugin still loads but OCR is skipped — you fall back to the free converter automatically. So the OCR layer is genuinely zero-cost until someone opts in.
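
An agent wrapper can mirror that opt-in rule before it ever considers OCR. The following is a minimal sketch, not code from the fork: `ocr_model` is a hypothetical helper, and treating an empty `OPENAI_API_KEY` as unset is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_MODEL = "gpt-4o-mini"  # default named in this skill


def ocr_model(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the vision model to use, or None when the OCR layer is off."""
    if not env.get("OPENAI_API_KEY"):
        return None  # no key: stay on the free converter
    # MARKITDOWN_OCR_MODEL is only consulted once a key is present
    return env.get("MARKITDOWN_OCR_MODEL") or DEFAULT_OCR_MODEL
```

A `None` result means the OCR step should be skipped entirely rather than attempted and allowed to fail.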

---

## Flow (per attachment)

1. **Get the absolute path.** The downloaded attachment's absolute path is already provided by the runtime (e.g. `MediaPaths`). Build a `file://<absolute-path>` URI.
   - ⚠️ Convert **in the same turn the file arrives**: downloads live in a temp dir and may be GC'd next turn.

2. **Free convert (always try first).** Call `convert_to_markdown("file://<abspath>")` on the `markitdown` MCP server. For normal documents you are done — read or store the Markdown.

3. **Decide if OCR is needed.** OCR only matters when:
   - the file is a **standalone image**, or
   - the free conversion came back **empty / whitespace-only / a few stray characters** (a tell-tale of a *scanned* PDF, whose pages are images, not text).

   If neither is true, stop. Don't spend a vision call on a document that already gave you text.

4. **OCR (only if needed AND `OPENAI_API_KEY` is set).** Shell out to the CLI:

   ```bash
   markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<out>.md"
   ```

   Or, with no global install, one-shot via `uvx`:

   ```bash
   uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
     --with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
     --with openai \
     markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}"
   ```

   OCR-extracted text is wrapped inline as `*[Image OCR] … [End OCR]*`, interleaved in reading order, so document structure is preserved.

   **If `OPENAI_API_KEY` is NOT set** and the content is image-only: do not pretend. Tell the user the file is image-based and reading it needs the optional OCR layer (an OpenAI-compatible key), and stop.

5. **Use or persist.** One-off question → read the Markdown and answer; no need to save. Worth keeping → write it to your knowledge store with provenance (original filename, source, date).
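
Step 1 above, building the `file://` URI, can be sketched as follows. This is an illustration under stated assumptions: `to_file_uri` is a hypothetical helper, paths are POSIX-style, and percent-encoding characters such as spaces is assumed to be accepted by the server, as it is for standard file URIs.

```python
from urllib.parse import quote


def to_file_uri(abs_path: str) -> str:
    """Turn an absolute POSIX path into a file:// URI (hypothetical helper)."""
    if not abs_path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    # quote() keeps "/" and percent-encodes spaces and other unsafe characters
    return "file://" + quote(abs_path)


print(to_file_uri("/tmp/report.pdf"))     # file:///tmp/report.pdf
print(to_file_uri("/tmp/Q3 report.pdf"))  # file:///tmp/Q3%20report.pdf
```

Rejecting relative paths early avoids handing the server a URI that resolves against an unexpected working directory.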

---

## Cost & model notes

- **Free layer:** $0. Local text extraction, no network model.
- **OCR layer:** one vision API call per image (and one per page for fully scanned PDFs, rendered at 300 DPI). With `gpt-4o-mini` this is roughly a fraction of a cent per image — cheap, but not zero, and it scales with image count. Pick a small vision model unless you need fidelity.
- The OCR layer is the reason this fork exists: it gives a text-only agent a way to "see" images, on demand, without making the whole agent multimodal.
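
To make the scaling concrete, the call count can be estimated before running OCR. This is a rough sketch only: both function names are hypothetical, and the per-call price is a caller-supplied placeholder, not a quoted rate for any model.

```python
def estimate_ocr_calls(scanned_pages: int = 0, images: int = 0) -> int:
    """One vision call per scanned page plus one per standalone or embedded image."""
    return scanned_pages + images


def estimate_cost_usd(calls: int, price_per_call_usd: float) -> float:
    """Multiply the call count by a price you look up yourself (placeholder input)."""
    return calls * price_per_call_usd


calls = estimate_ocr_calls(scanned_pages=12, images=3)
print(calls)  # 15
```

Logging this number alongside the chosen model keeps the bill predictable when a large scanned PDF arrives.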

## Gotchas

- **MCP ≠ OCR.** Do not set `MARKITDOWN_ENABLE_PLUGINS=true` on the server expecting OCR — the server passes no `llm_client`, so it silently skips OCR. OCR is CLI-only.
- **Path access.** Both the `file://` input and any output path must be inside the server/agent's allowed root, or the call is blocked.
- **Encrypted / corrupt files** can fail conversion. Report the failure plainly; for PDFs you can retry with a dedicated PDF tool if available.
- **Don't OCR what already has text.** Step 3's check exists to avoid burning vision calls on ordinary documents.

## Supported formats

Free (local): PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, EPUB, ZIP (iterates contents), plus text formats.
OCR-enhanced (key required): scanned PDFs, standalone images, and images embedded in PDF/DOCX/PPTX/XLSX.

---

Built on [microsoft/markitdown](https://github.com/microsoft/markitdown); OCR layer from the [Self-made-Orange/markitdown](https://github.com/Self-made-Orange/markitdown) fork (`packages/markitdown-ocr`).


separated free and OCR layers into explicit inputs and decision points, added edge cases (timeouts, rate limits, auth expiry, temp file cleanup, encrypted files), structured procedure into six numbered steps with clear inputs/outputs, detailed output contract with success and failure cases, and defined outcome signals for user confirmation.

markitdown skill for non-multimodal agent

Item: Markitdown-Skill-for-non-multimodal-agent
Rating: 8.5
Author: Implexa

intent

a text-only LLM backend cannot read file attachments directly. this skill converts documents (PDF, docx, pptx, xlsx, xls, epub, html, csv, json, xml, zip) to Markdown so your agent can read, summarize, quote, or store the content. a free local layer handles text-bearing documents; an optional OCR layer (requires OPENAI_API_KEY) reads scanned PDFs and standalone images via a vision model. use this whenever an attachment arrives and you need its content extracted.

inputs

required

file path or URI: absolute filesystem path to the attachment (e.g., /tmp/document.pdf), or an http/https/file/data URI. the runtime typically provides this via MediaPaths or equivalent after download.
markitdown MCP server: running on stdio via uvx markitdown-mcp (registered in your MCP client config). exposes tool convert_to_markdown(uri).

optional

OPENAI_API_KEY: enables OCR for scanned PDFs and images. if unset, OCR is skipped and you fall back to text-only conversion. any OpenAI-compatible endpoint works.
MARKITDOWN_OCR_MODEL: vision model for OCR (default: gpt-4o-mini). only used if OPENAI_API_KEY is set.
OPENAI_BASE_URL: endpoint override for non-OpenAI vision providers. only used if OPENAI_API_KEY is set.

edge cases

network timeout: markitdown CLI may hang on very large files or slow vision API calls. consider a timeout wrapper (e.g., 60 seconds for OCR).
rate limits: OpenAI vision API has per-minute limits. batch large document sets or add retry logic with exponential backoff.
auth expiry: OPENAI_API_KEY may rotate. handle 401 gracefully and instruct the user to refresh credentials.
temp file cleanup: downloaded files live in a temp directory and may be garbage-collected between turns. convert in the same turn the file arrives.
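
the timeout and backoff advice above can be sketched in Python as follows. this is a minimal sketch, not part of markitdown: `retry_with_backoff` and `run_ocr` are hypothetical helpers, the 60-second timeout comes from the example above, and the attempt count and base delay are assumed values. by default only timeouts are retried, since an auth failure (401) will not fix itself.

```python
import subprocess
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 3,        # assumed value
    base_delay: float = 1.0,  # assumed value, in seconds
    retryable: Tuple[Type[BaseException], ...] = (subprocess.TimeoutExpired,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); after a retryable failure wait base_delay * 2**n, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def run_ocr(argv: list[str], timeout_s: float = 60.0) -> str:
    """Run an OCR command line with a hard timeout and return its stdout."""
    done = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout_s, check=True)
    return done.stdout
```

a caller would wrap the CLI call as `retry_with_backoff(lambda: run_ocr(argv))` and report any exception that escapes to the user.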

procedure

1. obtain the file path. receive the attachment's absolute path from the runtime (e.g., MediaPaths, direct user input). construct a file://<absolute-path> URI (example: file:///tmp/report.pdf).
   - input: absolute filesystem path
   - output: file:// URI
2. attempt free text conversion. call the markitdown MCP server tool convert_to_markdown("file://<abspath>"). this extracts text from PDFs, Word documents, Excel sheets, HTML, CSV, JSON, XML, EPUB, ZIP, and other text-bearing formats locally, at zero cost.
   - input: file:// URI
   - output: Markdown string (may be empty or whitespace-only if the file is image-only or heavily corrupted)
3. check if OCR is needed (see decision points below). determine whether the free conversion succeeded or whether OCR is required for image-based or scanned content.
   - input: conversion result, file extension
   - output: boolean flag (OCR needed yes/no), reason string
4. if OCR is needed and OPENAI_API_KEY is set, run the OCR CLI. shell out to the markitdown CLI with plugins and a vision model:

       markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<output>.md"

   or via uvx for one-shot execution:

       uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
           --with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
           --with openai \
           markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<output>.md"

   - input: abspath, OPENAI_API_KEY, MARKITDOWN_OCR_MODEL, OPENAI_BASE_URL (optional)
   - output: Markdown file (or stdout); OCR-extracted text is wrapped as *[Image OCR] ... [End OCR]* and interleaved in reading order
5. handle OCR unavailable. if OCR is needed but OPENAI_API_KEY is not set, inform the user that the file is image-based and requires the optional OCR layer (an OpenAI-compatible API key). do not guess or hallucinate text. stop.
   - input: failed free conversion, no API key
   - output: user-facing error message
6. use or persist the result. for a one-off question, read the Markdown and answer directly. if the content is worth keeping, write it to your knowledge store with provenance (original filename, source URL, date converted).
   - input: Markdown string or file path
   - output: knowledge store entry or inline response
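
when the OCR command is launched from code rather than typed into a shell, building it as an argument list avoids quoting problems with paths that contain spaces. a minimal sketch, assuming only the flags shown in the command above; `build_ocr_argv` is a hypothetical helper name.

```python
def build_ocr_argv(abs_path: str, out_path: str, model: str) -> list[str]:
    """Assemble the markitdown OCR command as an argv list (no shell involved)."""
    return [
        "markitdown", abs_path,
        "--use-plugins",
        "--llm-client", "openai",
        "--llm-model", model,
        "-o", out_path,
    ]


argv = build_ocr_argv("/tmp/scan 1.pdf", "/tmp/scan 1.md", "gpt-4o-mini")
print(argv)
```

the resulting list can be passed directly to `subprocess.run`, which hands each element to the program unchanged.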

decision points

if the file is a plain text format (txt, md, csv readable as plain text): skip this skill entirely. the agent can consume the file directly. only use markitdown if the attachment is binary (PDF, docx, xlsx) or requires parsing (HTML, JSON, XML).
if the free conversion returns non-empty, readable Markdown: stop after step 2. do not call OCR. the document already has extractable text.
if the free conversion returns empty, whitespace-only, or a few stray characters: this signals a scanned PDF (pages are images, not text) or a standalone image. proceed to OCR (step 4) if OPENAI_API_KEY is set.
if the file extension is a known image format (png, jpg, jpeg, gif, webp, tiff) and OCR is needed: only proceed if OPENAI_API_KEY is set. otherwise, inform the user that image reading requires the optional OCR layer.
if OPENAI_API_KEY is set but the call to the vision model fails (401, rate limit, timeout, or other error): report the error plainly. do not retry OCR without explicit user confirmation or backoff logic. suggest checking credentials or waiting before retrying.
if the file is encrypted, password-protected, or corrupt: the conversion will fail. report the failure without attempting OCR. suggest manual inspection or a dedicated tool for that file type (e.g., a PDF extraction tool for PDFs).
if the file:// input path or the output path is outside the allowed root: the call is blocked by the sandbox. report that the file location is not accessible.
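
the core of these decision points (text found, OCR needed, OCR unavailable) can be sketched as a small function. this is an illustration, not the skill's implementation: the names `Action`, `looks_image_only`, and `next_action` are hypothetical, and the 20-character threshold for "a few stray characters" is an assumed value you would tune.

```python
import os
from enum import Enum

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff"}


class Action(Enum):
    DONE = "use the free conversion"
    RUN_OCR = "run the OCR CLI"
    EXPLAIN_NO_OCR = "tell the user OCR is not configured"


def looks_image_only(markdown: str, min_chars: int = 20) -> bool:
    """True when the result is empty, whitespace-only, or a few stray characters."""
    # count non-whitespace characters only; min_chars is an assumed threshold
    return len("".join(markdown.split())) < min_chars


def next_action(filename: str, markdown: str, has_key: bool) -> Action:
    """Pick the next step from the file extension and the free conversion result."""
    is_image = os.path.splitext(filename)[1].lower() in IMAGE_EXTS
    if not is_image and not looks_image_only(markdown):
        return Action.DONE  # the document already gave us text
    return Action.RUN_OCR if has_key else Action.EXPLAIN_NO_OCR
```

a fixed character threshold will also flag genuinely tiny documents, so a stricter variant might apply the emptiness check only to PDFs.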

output contract

success case

free conversion only: Markdown string containing extracted text, code, tables, and structure from the original document. save to a .md file or return inline, with provenance metadata (source filename, conversion timestamp).
with OCR: Markdown string with both text-layer content and vision-extracted text marked as *[Image OCR] ... [End OCR]* blocks, interleaved in reading order.
format: valid Markdown (UTF-8, no binary). tables rendered as Markdown pipes. lists and hierarchy preserved. inline images in the original may be skipped or marked as [Image: description].
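
downstream code sometimes needs to tell vision-extracted text apart from the text layer, for example to count processed images. a minimal sketch, assuming the markers appear literally as `*[Image OCR]` and `[End OCR]*` as described above; the exact whitespace around them is an assumption, and `split_ocr` is a hypothetical helper.

```python
import re

# non-greedy match so adjacent OCR blocks are captured separately
OCR_BLOCK = re.compile(r"\*\[Image OCR\](.*?)\[End OCR\]\*", re.DOTALL)


def split_ocr(markdown: str) -> tuple[str, list[str]]:
    """Return (markdown with OCR blocks removed, list of OCR texts)."""
    ocr_texts = [block.strip() for block in OCR_BLOCK.findall(markdown)]
    text_only = OCR_BLOCK.sub("", markdown)
    return text_only, ocr_texts


sample = "# Title\n\nIntro.\n\n*[Image OCR]\nInvoice 42\n[End OCR]*\n\nOutro."
text_only, ocr_texts = split_ocr(sample)
print(len(ocr_texts))  # 1
```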

failure cases

empty result: return the empty string or a message "no extractable text found". do not invent content.
no OCR key: return a message like "this file is image-based and requires the optional OCR layer. set OPENAI_API_KEY to enable vision-based reading."
API error: return the error message and HTTP status (e.g., "401 Unauthorized: check OPENAI_API_KEY").
file not found or inaccessible: return "file not found or not readable at <path>".
timeout: return "conversion timed out after <N> seconds; try a smaller file or check network connectivity".
unsupported format: return "format not supported; markitdown handles PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip".

outcome signal

user can read and act on the document content: the agent answers questions about it, quotes specific sections, or stores it in the knowledge base.
Markdown is inline and immediately usable: the agent does not report "I need more time to process this" or loop waiting for async conversion.
provenance is recorded: if saved, the original filename, source, and conversion date are logged alongside the Markdown.
cost is transparent: if OCR was used, log which vision model and how many images were processed (for billing awareness).
clear error messaging: if OCR is needed but unavailable, or the file is corrupt, the user sees a direct explanation and next steps, not a silent failure.

credits: built on microsoft/markitdown. OCR layer from Self-made-Orange/markitdown fork (packages/markitdown-ocr). original skill author not declared in source.
