---
name: corporate-doc-builder
version: 1.0.0
description: Generate polished .docx documents by injecting Markdown content into an existing Word template, preserving the template's cover page, TOC, fonts, headers, and footers. Use when the user has a .docx template plus reference materials (documents, spreadsheets, slides, or source code) and wants a production-ready Word deliverable. Covers the full pipeline from template analysis through chapter drafting to python-docx injection.
metadata:
  openclaw:
    requires:
      bins:
        - python3
        - npx
    install:
      - kind: uv
        packages:
          - python-docx
          - Pillow
          - openpyxl
---
# Corporate Doc Builder
## Overview
Turning a corporate .docx template plus scattered source materials into a finished document is error-prone. Models routinely break template styles, exceed token limits, hallucinate TOC entries, and produce images that overlap with text.
This skill codifies a battle-tested 6-stage pipeline that avoids these traps:
```
1. Template Analysis -> Extract TOC, styles, placeholders
2. Spec -> Write a design spec for the document
3. Plan -> Break work into per-chapter tasks (optional for simple docs)
4. Research -> Summarize source materials into reusable notes
5. Authoring -> Write each chapter as an independent Markdown file
6. Injection -> Render diagrams + inject Markdown into the template
```
**Core principle:** Template styles are preserved via `python-docx` copy-and-inject, never via `pandoc` whole-file conversion. All diagrams use Mermaid.
## When to Trigger
Activate this skill when ANY of the following apply:
- The user asks to "write a document based on a template" or "generate a report from a template"
- A task mentions both a `.docx` template path AND source/reference materials
- The user requires "cover page / TOC / fonts / headers / footers must match the template"
- The user asks to produce a corporate design document (outline design, detailed design, database design, interface design, architecture specification, technical white paper, etc.)
- The task involves `.docx` template + `.xlsx` feature lists + `.pptx` architecture diagrams or similar mixed enterprise assets
Do NOT activate when:
- The output is a plain Markdown, README, or blog post with no template
- The user just wants to locally edit an existing `.docx` (use the `docx` skill instead)
- No fixed template is involved
## The 6-Stage Pipeline
Each stage has a pre-flight checklist and exit criteria. **Do not skip stages or run them in parallel.**
---
### Stage 1: Template Analysis
**Goal:** Understand the template structure and agree on a TOC mapping with the user.
> **Companion skill:** If `superpowers:brainstorming` is available, invoke it at the start of this stage to systematically explore user intent and requirements before committing to a TOC mapping.
#### Pre-flight Checklist
| Item | Action |
|------|--------|
| Output directory | `ls` to verify it exists; fix typos before proceeding |
| Source material paths | `ls` each path to confirm accessibility |
| Template file | `ls -la <template>.docx` to confirm it exists and is not locked |
| Historical drafts | If prior output exists, read the first chapter to verify it belongs to THIS document |
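
The path checks in this table can also be scripted. The following is a minimal sketch, assuming a hypothetical `preflight` helper and caller-supplied paths. It only reports missing paths and does not detect whether Word currently holds a lock on the template.

```python
from pathlib import Path


def preflight(output_dir, source_paths, template_path):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not Path(output_dir).is_dir():
        problems.append(f"output directory missing: {output_dir}")
    for src in source_paths:
        if not Path(src).exists():
            problems.append(f"source path not accessible: {src}")
    template = Path(template_path)
    if not template.is_file():
        problems.append(f"template missing: {template_path}")
    elif template.suffix.lower() != ".docx":
        problems.append(f"template is not a .docx file: {template_path}")
    return problems
```

An empty return value means it is safe to continue; otherwise fix each reported path before moving on.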
#### Extract Template TOC
```python
from docx import Document

doc = Document(template_path)
for p in doc.paragraphs:
    if p.style.name.startswith("Heading") or p.style.name.startswith("toc"):
        print(p.style.name, p.text)
```
Some templates use custom styles (e.g., `CJ1`, `CJ2`) instead of standard `Heading` styles. Scan all paragraph styles and identify which ones act as headings.
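
One way to spot heading-like custom styles is to group paragraphs by style and look at how each style is used. The sketch below assumes a hypothetical `style_usage` helper that accepts any objects shaped like python-docx paragraphs (with `.style.name` and `.text`). The "every paragraph is short" heuristic and the 60-character threshold are assumptions, so the result still needs a manual look.

```python
from collections import defaultdict


def style_usage(paragraphs, max_heading_len=60):
    """Group non-empty paragraph texts by style name and flag likely heading styles.

    A style is flagged when every paragraph using it is short. This is a
    heuristic (assumption), not a definitive test.
    """
    texts_by_style = defaultdict(list)
    for p in paragraphs:
        text = p.text.strip()
        if text:
            texts_by_style[p.style.name].append(text)
    report = {}
    for name, texts in texts_by_style.items():
        report[name] = {
            "count": len(texts),
            "sample": texts[0],
            "likely_heading": all(len(t) <= max_heading_len for t in texts),
        }
    return report
```

With python-docx installed, it could be called as `style_usage(Document(template_path).paragraphs)`.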
#### Extract Template Images and Tables
Templates often contain placeholder images and tables. Extract them to plan which chapters need diagrams or data tables:
```python
# Count images
print(f"Images: {len(doc.inline_shapes)}")
# Count tables
print(f"Tables: {len(doc.tables)}")
```
#### TOC Mapping Rules
Templates often say "keep titles consistent." The real meaning is:
- **Top-level chapter titles** (1, 2, 3, ...): Keep them exactly as the template defines.
- **Sub-section titles** (1.1, 1.1.1, ...): Rewrite them to match the actual product/project. Do NOT copy the template's placeholder examples.
- **Style consistency**: Match the template's tone (imperative verbs, clause-style statements, etc.)
- **Placeholder text**: Replace ALL placeholder words (e.g., "XXX System", "Oracle Database", "SOA Architecture") with the actual technology stack and business domain.
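
Leftover placeholders in a draft TOC can be caught mechanically before the mapping goes to the user. This is a minimal sketch with a hypothetical `find_placeholders` helper. The default pattern (runs of three or more `X`) and the extra terms passed by the caller are assumptions that should be adapted to the template at hand.

```python
import re

# Placeholder patterns are assumptions: runs of three or more X's, plus any
# template-specific example terms supplied by the caller.
DEFAULT_PATTERN = r"X{3,}"


def find_placeholders(lines, extra_terms=()):
    """Return (line_number, matched_text) for every leftover placeholder."""
    parts = [DEFAULT_PATTERN] + [re.escape(t) for t in extra_terms]
    pattern = re.compile("|".join(parts))
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for m in pattern.finditer(line):
            hits.append((lineno, m.group(0)))
    return hits
```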
#### Exit Criteria
- TOC mapping table reviewed and confirmed by the user
- Work approach decided (write from scratch / reuse prior drafts / partial reuse)
- Output directory, source material whitelist, and module scope are all explicit
---
### Stage 2: Spec
**Goal:** Write a design spec that anchors all subsequent work.
Write to `<output>/spec/<YYYY-MM-DD>-<topic>-spec.md`. Include at minimum:
1. Goal and scope
2. Source material constraints (whitelist of allowed paths)
3. Workflow overview
4. Complete TOC (user-confirmed)
5. Writing style baseline (language, depth, terminology)
6. Token budget protection strategy
7. Deliverables list
8. Confirmed key decisions
9. Open items
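
The spec file name follows a fixed pattern, so it can be built rather than typed. This is a sketch assuming a hypothetical `spec_path` helper. The ASCII slug rule for `<topic>` is an assumption, and a non-ASCII topic would need a different rule.

```python
import re
from datetime import date
from pathlib import Path


def spec_path(output_dir, topic, on=None):
    """Build <output>/spec/<YYYY-MM-DD>-<topic>-spec.md with a slugified topic."""
    day = on or date.today()
    # Lowercase the topic and collapse every non-alphanumeric run into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return Path(output_dir) / "spec" / f"{day.isoformat()}-{slug}-spec.md"
```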
**Self-check** before submitting for review: scan for leftover placeholders, internal inconsistencies, scope creep, and ambiguity.
#### Exit Criteria
- User has reviewed and approved the spec
---
### Stage 3: Plan (Optional)
**Goal:** Break the work into per-chapter tasks for complex documents.
Skip this stage for simple documents (fewer than 5 chapters). For larger documents, write to `<output>/plans/<YYYY-MM-DD>-<topic>-plan.md` with tasks grouped into:
- **Research phase**: 2-3 tasks (source code analysis, reference doc summary, feature mapping)
- **Authoring phase**: One task per chapter
- **Injection phase**: 2 tasks (Mermaid rendering, docx injection)
Each task should have bite-sized steps (2-5 minutes each). **Per-chapter independent delivery + independent review** is the key token budget protection mechanism.
---
### Stage 4: Research
**Goal:** Extract and summarize source materials into reusable research notes.
Suggested output files in `<output>/research/`:
| File | Content |
|------|---------|
| `code-architecture.md` | Top-level module structure, key packages, tech stack, critical data flows |
| `reference-docs-summary.md` | Heading outline + key table/figure index for each reference document |
| `feature-mapping.md` | Feature list (from xlsx/pptx) mapped to target TOC chapters |
| `<topic>-inventory.md` | Domain-specific inventory (e.g., interface list, data model list, API catalog) |
#### Summarization Principle
Reference documents are **fact anchors**, not **content sources**. Extract headings, table titles, and key data. **Never copy full text** into research notes.
#### Source Traceability Rule
**Every TOC entry must have a traceable source** (source code path, reference document section, or feature list row). If a TOC entry has no source, delete it from the TOC rather than drafting content without evidence.
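
The rule can be applied mechanically once each TOC entry carries its sources. The sketch below assumes a hypothetical entry shape (a dict with `title` and `sources`) and a hypothetical `split_by_traceability` helper. It only partitions the entries, and the decision to delete still belongs to the user review.

```python
def split_by_traceability(toc_entries):
    """Partition TOC entries into (kept, dropped) titles by whether a source is recorded.

    Each entry is a dict with a "title" and an optional "sources" list; this
    shape is a hypothetical convention for the sketch.
    """
    kept, dropped = [], []
    for entry in toc_entries:
        # Blank strings do not count as evidence.
        sources = [s for s in entry.get("sources", []) if s and s.strip()]
        (kept if sources else dropped).append(entry["title"])
    return kept, dropped
```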
#### Extraction Snippets
```python
# Extract headings from .docx
from docx import Document

doc = Document(docx_path)
for p in doc.paragraphs:
    if p.style.name.startswith("Heading"):
        print(p.style.name, p.text)

# Extract structured data from .xlsx
import openpyxl

wb = openpyxl.load_workbook(xlsx_path)
for sh in wb.sheetnames:
    for row in wb[sh].iter_rows(values_only=True):
        print(row)

# Bulk-extract embedded images from .docx (shell command, not Python):
# unzip -j <path>.docx 'word/media/*' -d ./extracted_imgs/
```
#### Exit Criteria
- All research notes delivered and reviewed by the user
- Every TOC entry has a source annotation
---
### Stage 5: Authoring
**Goal:** Write each chapter as an independent Markdown file.
#### File Layout
```
<output>/<doc>_md/
  ch01_<topic>.md
  ch02_<topic>.md
  ch03_<topic>_p1.md   # Split large chapters into parts
  ch03_<topic>_p2.md
  ...
  chNN_<topic>.md
  appendix_a.md
  full_draft.md        # Final concatenation
```
#### Why Per-Chapter Files
- Keeps each request within token limits
- Enables per-chapter user review; problems surface early
- Rewriting one chapter does not affect others
#### Mermaid Diagrams
> **Companion skill:** If `claude-mermaid:mermaid-diagrams` is available, invoke it before writing Mermaid blocks. It provides syntax best practices, diagram type selection, and live preview tools that produce significantly higher-quality diagrams.
- Use fenced ` ```mermaid ` code blocks in Markdown
- Do not embed image placeholders; actual images are generated during injection
- Complex diagrams (deployment, sequence) should be individually numbered for easy replacement
- Do not hardcode colors or themes in Mermaid source; handle theming during rendering
- `sequenceDiagram` does NOT support `style` directives; avoid them
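
Some of these guidelines can be checked automatically before review. The following is a rough sketch, assuming a hypothetical `lint_mermaid` helper. It recognizes only plain triple-backtick fences and simple hex colors, so it will miss other ways of hardcoding a theme.

```python
import re

FENCE = "`" * 3  # built indirectly so this sample can itself sit inside a fenced block
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}\b|#[0-9a-fA-F]{3}\b")


def lint_mermaid(markdown_text):
    """Return (diagram_number, message) warnings for Mermaid blocks in a Markdown string."""
    warnings = []
    diagram_no = 0
    in_block = seen_type = is_sequence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if not in_block:
            if stripped == FENCE + "mermaid":
                in_block, seen_type, is_sequence = True, False, False
                diagram_no += 1
            continue
        if stripped == FENCE:
            in_block = False
            continue
        if stripped.startswith("%%{"):
            warnings.append((diagram_no, "inline init directive"))
            continue
        if not stripped or stripped.startswith("%%"):
            continue  # blank line or Mermaid comment
        if not seen_type:
            # The first meaningful line of a block names the diagram type.
            seen_type = True
            is_sequence = stripped.startswith("sequenceDiagram")
            continue
        if is_sequence and re.match(r"style\s", stripped):
            warnings.append((diagram_no, "style directive inside sequenceDiagram"))
        if HEX_COLOR.search(stripped):
            warnings.append((diagram_no, "hardcoded color"))
    return warnings
```

Diagram numbers count Mermaid blocks in document order, which lines up with the per-diagram numbering recommended above.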
#### Merge
```bash
cat ch01_*.md ch02_*.md ... chNN_*.md appendix_*.md > full_draft.md
```
After merging, review once for: TOC continuity, chapter numbering consistency, and Mermaid block count.
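
Where a shell glob is inconvenient (for example on Windows), the merge and the Mermaid block count can be done in Python. This is a sketch, assuming a hypothetical `merge_chapters` helper and the two-digit `chNN_` naming from the file layout above. It is not part of the bundled scripts.

```python
from pathlib import Path

FENCE = "`" * 3  # built indirectly so this sample can itself sit inside a fenced block


def merge_chapters(md_dir, out_name="full_draft.md"):
    """Concatenate ch*.md then appendix_*.md in name order; return (path, mermaid_count)."""
    md_dir = Path(md_dir)
    chapters = sorted(md_dir.glob("ch[0-9][0-9]_*.md"))
    appendices = sorted(md_dir.glob("appendix_*.md"))
    parts = [p.read_text(encoding="utf-8").rstrip("\n") for p in chapters + appendices]
    merged = "\n\n".join(parts) + "\n"
    out_path = md_dir / out_name
    out_path.write_text(merged, encoding="utf-8")
    # Count opening Mermaid fences so the number can be compared with the rendered PNGs.
    mermaid_count = sum(
        1 for line in merged.splitlines() if line.strip() == FENCE + "mermaid"
    )
    return out_path, mermaid_count
```

The returned count gives a number to compare against the rendered `diagram_N.png` files in Stage 6.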
#### Exit Criteria
- All chapters reviewed and approved by the user
- `full_draft.md` created with correct chapter order
---
### Stage 6: Injection
**Goal:** Render Mermaid diagrams to PNG, then inject Markdown into the template to produce the final `.docx`.
#### Step 1: Render Mermaid to PNG
```bash
python scripts/render_mermaid.py <full_draft.md> <images_dir>
```
This extracts all ` ```mermaid ` blocks and renders each to `diagram_1.png`, `diagram_2.png`, etc.
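
For orientation, the extraction step might look roughly like the sketch below. It illustrates the idea and is not the code of `scripts/render_mermaid.py`. The `extract_mermaid_blocks` name is hypothetical, and only plain triple-backtick fences are recognized.

```python
FENCE = "`" * 3  # built indirectly so this sample can itself sit inside a fenced block


def extract_mermaid_blocks(markdown_text):
    """Return [(png_name, mermaid_source), ...] in document order."""
    blocks = []
    current = None  # None while outside a Mermaid block
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped == FENCE + "mermaid":
                current = []
        elif stripped == FENCE:
            blocks.append((f"diagram_{len(blocks) + 1}.png", "\n".join(current)))
            current = None
        else:
            current.append(line)
    return blocks
```

Because numbering follows document order, inserting or removing a diagram upstream shifts every later file name, so re-render all diagrams after any change to `full_draft.md`.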
#### Step 2: Inject into Template
```bash
python scripts/inject_docx.py \
--md-dir <markdown_dir> \
--template <template.docx> \
--output <output.docx> \
--chapters ch01.md ch02.md ... appendix_a.md
```
The script: copies the template, clears body content after the TOC, injects Markdown as styled paragraphs, embeds Mermaid PNGs, and forces TOC field refresh.
#### Pre-Injection Template Style Audit
**This is critical.** Before running injection, check the template's paragraph styles for issues that will corrupt the output:
```python
from docx import Document

WNS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

doc = Document(template_path)

# Issue 1: fixed (EXACTLY) line spacing on Normal clips inline images
normal = doc.styles['Normal']
pf = normal.paragraph_format
print(f"Normal: line_spacing_rule={pf.line_spacing_rule}, line_spacing={pf.line_spacing}")

# Issue 2: style-level auto-numbering on headings causes double numbering
for style in doc.styles:
    if style.name and style.name.startswith("Heading"):
        pPr = style.element.find(f'{WNS}pPr')
        if pPr is not None:
            numPr = pPr.find(f'{WNS}numPr')
            if numPr is not None:
                print(f"WARNING: {style.name} has numPr (auto-numbering)")
```
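Check for these three issues and apply the fixes. The snippet above detects the first two; the third depends on the injection script rather than on the template: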
| Issue | Symptom | Fix |
|-------|---------|-----|
| Normal style has `line_spacing_rule = EXACTLY` | Images are clipped to line height and overlap with text | Override image paragraphs with `line_spacing_rule = SINGLE` |
| Heading styles have `numPr` elements | Double numbering: "1.1.1 2.1.1 Title" | Strip `numPr` from all Heading styles before injection |
| Non-Mermaid code blocks ignored | JSON/SQL/pseudocode blocks are blank in .docx | Render code blocks as shaded monospace paragraphs |
See the [Template Style Pitfalls](#template-style-pitfalls) section for details.
#### Exit Criteria
- `.docx` opens correctly in Word/LibreOffice
- Cover page, TOC, headers, footers match the template
- All images display correctly with no text overlap
- All code blocks are rendered as monospace shaded paragraphs
- TOC updates correctly when refreshed (Ctrl+A, F9 in Word)
---
## Template Style Pitfalls
These issues were discovered across 4 production document generations. They are not specific to one template and can affect any `.docx` template injection workflow.
### Pitfall 1: Image Clipping from Fixed Line Spacing
**Root cause:** Many corporate templates set the `Normal` paragraph style to `line_spacing_rule = EXACTLY` with a fixed height (e.g., 26pt). When an image is inserted into a paragraph inheriting this style, the paragraph height is locked to 26pt regardless of image size. The image overflows and overlaps subsequent text.
**Fix:** Explicitly set `line_spacing_rule = SINGLE` on every image paragraph:
```python
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_LINE_SPACING
from docx.shared import Cm, Pt

def add_image(doc, img_path, max_w_cm=14.0, max_h_cm=12.0):
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    pf = p.paragraph_format
    pf.space_before = Pt(6)
    pf.space_after = Pt(6)
    pf.line_spacing_rule = WD_LINE_SPACING.SINGLE  # Override EXACTLY
    run = p.add_run()
    # image_size_cm is a helper defined separately; it returns (width, height)
    # in cm, scaled to fit inside the max_w_cm x max_h_cm box.
    w_cm, h_cm = image_size_cm(img_path, max_w_cm, max_h_cm)
    run.add_picture(img_path, width=Cm(w_cm), height=Cm(h_cm))
```
Also cap `max_h_cm` at 12 (not 18) to prevent a single image from filling the entire page.
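
The `image_size_cm` helper called in `add_image` is not shown in this document. One possible version is sketched below, under stated assumptions: it reads the pixel size from the PNG header using only the standard library, and it assumes the PNG was rendered at 96 dpi. The bundled script lists Pillow as a requirement and may compute the size differently.

```python
import struct

CM_PER_INCH = 2.54
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_size_cm(img_path, max_w_cm, max_h_cm, dpi=96):
    """Return (width_cm, height_cm) for a PNG, scaled down to fit the given box.

    Pixel dimensions come from the PNG IHDR chunk. The 96 dpi default is an
    assumption about how the PNG was rendered. Images are never scaled up.
    """
    with open(img_path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {img_path}")
    # IHDR stores width and height as big-endian 32-bit integers.
    width_px, height_px = struct.unpack(">II", header[16:24])
    w_cm = width_px / dpi * CM_PER_INCH
    h_cm = height_px / dpi * CM_PER_INCH
    scale = min(max_w_cm / w_cm, max_h_cm / h_cm, 1.0)
    return w_cm * scale, h_cm * scale
```

Using one scale factor for both dimensions keeps the aspect ratio, so a wide diagram is limited by `max_w_cm` and a tall one by `max_h_cm`.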
### Pitfall 2: Double Numbering from Heading numPr
**Root cause:** Some templates configure Heading styles with `numPr` (automatic numbering at the style level). When the Markdown heading text already contains manual numbering (e.g., "2.1.1 System Architecture"), the output shows "1.1.1 2.1.1 System Architecture" - the style's auto-number prepended to the manual number.
**Fix:** Strip `numPr` from all Heading styles before injecting content:
```python
WNS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def strip_heading_auto_numbering(doc):
    for style in doc.styles:
        if style.name and style.name.startswith("Heading"):
            pPr = style.element.find(f'{WNS}pPr')
            if pPr is not None:
                numPr = pPr.find(f'{WNS}numPr')
                if numPr is not None:
                    pPr.remove(numPr)
```
### Pitfall 3: Missing Code Blocks
**Root cause:** Injection scripts that only handle Mermaid fenced blocks often skip other code blocks (JSON, SQL, pseudocode, curl examples), leaving blank spaces in the output.
**Fix:** Collect non-Mermaid code block lines and render them as shaded monospace paragraphs:
```python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def add_code_block(doc, lines):
    code_text = "\n".join(lines)
    p = doc.add_paragraph()
    # Light-gray paragraph shading via a w:shd element
    pPr = p._element.get_or_add_pPr()
    shd = OxmlElement('w:shd')
    shd.set(qn('w:val'), 'clear')
    shd.set(qn('w:color'), 'auto')
    shd.set(qn('w:fill'), 'F2F2F2')
    pPr.append(shd)
    # Monospace run carrying the whole block
    run = p.add_run(code_text)
    run.font.name = "Consolas"
    run.font.size = Pt(9)
    run.font.color.rgb = RGBColor(0x33, 0x33, 0x33)
```
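
Collecting the code block lines requires telling Mermaid fences apart from every other fence. This is a minimal sketch of that split, assuming a hypothetical `split_segments` helper and plain triple-backtick fences. It is not the parser used by `scripts/inject_docx.py`.

```python
FENCE = "`" * 3  # built indirectly so this sample can itself sit inside a fenced block


def split_segments(markdown_text):
    """Split Markdown into ("text" | "mermaid" | "code", lines) segments in order."""
    segments = []
    kind, buf = "text", []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if kind == "text" and stripped.startswith(FENCE):
            if buf:
                segments.append(("text", buf))
            lang = stripped[len(FENCE):].strip().lower()
            kind, buf = ("mermaid" if lang == "mermaid" else "code"), []
        elif kind != "text" and stripped == FENCE:
            segments.append((kind, buf))
            kind, buf = "text", []
        else:
            buf.append(line)
    if buf:
        # Trailing text, or an unclosed fence whose lines are kept rather than dropped.
        segments.append((kind, buf))
    return segments
```

Each `"code"` segment's line list can then be passed to a renderer such as `add_code_block(doc, lines)`, while `"mermaid"` segments map to the rendered PNGs.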
### Pitfall 4: Template Placeholder Text Leaking
**Root cause:** Template headers, footers, and cover pages contain placeholder text ("XXX Project", "XXXX System"). If not replaced, the output ships with the wrong project name.
**Fix:** Scan and replace header/footer text during the injection step:
```python
for section in doc.sections:
    # Placeholders can sit in either the header or the footer
    for part in (section.header, section.footer):
        for para in part.paragraphs:
            for run in para.runs:
                if "XXX" in run.text:
                    run.text = run.text.replace("XXX", actual_project_name)
```
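
Word often splits a visible phrase across several runs, in which case a per-run check never sees the full placeholder. The sketch below, built around a hypothetical `replace_in_runs` helper, handles that case by falling back to the joined paragraph text. It accepts any objects with a writable `.text`, such as python-docx runs, and it assumes the replacement text does not itself contain the placeholder.

```python
def replace_in_runs(runs, old, new):
    """Replace `old` with `new` even when `old` spans several runs.

    Returns True when a replacement was made. Assumes `new` does not contain `old`.
    """
    runs = list(runs)
    if old not in "".join(r.text for r in runs):
        return False
    # First pass: the cheap case where the placeholder sits inside a single run.
    for r in runs:
        if old in r.text:
            r.text = r.text.replace(old, new)
    joined = "".join(r.text for r in runs)
    if old in joined:
        # The placeholder straddles run boundaries: collapse the text into the
        # first run. Later runs lose their text, so mixed formatting is flattened.
        runs[0].text = joined.replace(old, new)
        for r in runs[1:]:
            r.text = ""
    return True
```

It would be called once per paragraph, for example `replace_in_runs(para.runs, "XXX", actual_project_name)`. The formatting trade-off in the fallback branch is usually acceptable for headers and footers, which tend to use a single character format.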
---
## Token Budget Protection
LLM context windows have hard limits. These strategies prevent token overflow during document generation:
| Strategy | Stage |
|----------|-------|
| Per-chapter independent Markdown files | Authoring |
| Research notes are summaries, not full-text copies | Research |
| Read source code on demand (`ls` + `Read`), never dump entire directories | Research |
| Compress long lists into tables | All stages |
| Per-chapter review checkpoints | Authoring |
| Never load all chapters into a single request | Authoring / Injection |
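
A rough size check can flag chapters that should be split into `_p1` / `_p2` parts before they are sent for review. The sketch below uses a hypothetical `oversized_chapters` helper. The characters-divided-by-four estimate is a crude assumption, not a tokenizer, and the budget value is left to the caller.

```python
def oversized_chapters(chapters, budget_tokens, estimate=lambda text: len(text) // 4):
    """Return [(name, estimated_tokens)] for chapters above the per-request budget.

    `chapters` is an iterable of (name, text) pairs. `estimate` defaults to a
    rough characters/4 guess (assumption); pass a real tokenizer count if available.
    """
    flagged = []
    for name, text in chapters:
        cost = estimate(text)
        if cost > budget_tokens:
            flagged.append((name, cost))
    return flagged
```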
---
## Common Pitfalls
| Symptom | Root Cause | Fix |
|---------|-----------|-----|
| Path typos or directories do not exist | No pre-flight path validation | `ls` every path in the whitelist before starting |
| Wrong draft used as starting point | Did not verify which document a prior draft belongs to | Read the first chapter to confirm the topic |
| Template placeholder text appears in output | Treated "keep titles consistent" as "keep content identical" | Keep top-level titles; rewrite sub-sections for the actual project |
| Chapter organization does not match the product | Organized by code modules instead of user-facing capabilities | Organize by product capability, not by engineering repo structure |
| Token limit errors | Too many chapters loaded at once | Per-chapter files + summarized research |
| pandoc destroys template fonts/headers/footers | Used pandoc instead of python-docx | Always use python-docx template copy + injection |
| Reference doc text copied verbatim into chapters | Treated source material as content rather than fact anchors | Research phase produces summaries only |
| TOC entries have no source evidence | Concept-level headings imported without code/doc backing | Research phase: annotate every entry with a source; delete unsupported entries |
| "1.1.1 2.1.1 Title" double numbering | Template Heading styles have `numPr` auto-numbering | Strip `numPr` before injection |
| JSON/SQL/pseudocode blocks are blank in .docx | Injection script skips non-Mermaid code blocks | Render code blocks as shaded monospace paragraphs |
| Images overlap with text or are clipped | Template Normal style uses `EXACTLY` line spacing | Set image paragraph `line_spacing_rule = SINGLE`; cap `max_h_cm` at 12 |
---
## Reference Implementation
This skill includes ready-to-use Python scripts in the `scripts/` directory:
### `scripts/render_mermaid.py`
Extracts all ` ```mermaid ` blocks from a Markdown file and renders each to `diagram_N.png` using `mmdc` (Mermaid CLI).
```bash
python scripts/render_mermaid.py <markdown_file> <output_image_dir>
```
Requirements: Node.js with `npx`, which fetches `@mermaid-js/mermaid-cli` on first use.
### `scripts/inject_docx.py`
Copies a `.docx` template, clears the body after the TOC, and injects Markdown content as properly styled Word elements.
```bash
python scripts/inject_docx.py \
--md-dir ./output/chapters_md \
--template ./templates/design_spec.docx \
--output ./output/design_spec.docx \
--chapters ch01.md ch02.md ch03.md appendix_a.md \
--header-replace "XXX=My Project Name"
```
Features:
- Heading injection (levels 1-3)
- Markdown table to Word table conversion
- Bold and inline code formatting
- Mermaid PNG image embedding with correct sizing
- Non-Mermaid code block rendering (shaded monospace)
- Heading `numPr` auto-numbering removal
- Image paragraph `SINGLE` line spacing (prevents clipping)
- TOC field auto-refresh on open
- Optional header/footer text replacement
Requirements: `python-docx`, `Pillow`.
### `scripts/puppeteer-config.json`
Disables Chromium sandboxing for `mmdc` in Linux/container environments:
```json
{ "args": ["--no-sandbox"] }
```
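
With this config file in place, a single diagram render could be invoked along the lines below. This is a sketch with a hypothetical `mmdc_command` helper. How `scripts/render_mermaid.py` actually calls `mmdc` is not shown in this document, so treat the exact argument order as an assumption. The optional theme argument reflects the guideline of handling theming at render time rather than in the Mermaid source.

```python
def mmdc_command(input_mmd, output_png, puppeteer_config=None, theme=None):
    """Build the argv list for rendering one diagram with Mermaid CLI via npx."""
    cmd = [
        "npx", "-y", "-p", "@mermaid-js/mermaid-cli", "mmdc",
        "-i", str(input_mmd),
        "-o", str(output_png),
    ]
    if puppeteer_config:
        # mmdc's own -p flag points at the Puppeteer config (e.g. --no-sandbox).
        cmd += ["-p", str(puppeteer_config)]
    if theme:
        cmd += ["-t", theme]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the list separately keeps the command easy to log and inspect when a render fails.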
---
## Companion Skills (Optional Enhancements)
This skill is fully self-contained — it works without any companion skills installed. However, if the following skills are available in your environment, they significantly improve specific stages:
| Skill | Stage | Benefit |
|-------|-------|---------|
| `claude-mermaid:mermaid-diagrams` | Stage 5 (Authoring) | Provides Mermaid syntax best practices, diagram type selection guidance, and live preview/save tools (`mermaid_preview` / `mermaid_save`). Produces higher-quality diagrams than writing Mermaid from scratch. |
| `superpowers:brainstorming` | Stage 1 (Template Analysis) | Structured brainstorming workflow that explores user intent, requirements, and design alternatives before committing to a TOC mapping. Reduces rework. |
| `superpowers:writing-plans` | Stage 3 (Plan) | Structured planning workflow for multi-step implementation tasks. Helps break complex documents into well-scoped per-chapter tasks. |
**How to use them:** If a companion skill is available, invoke it via the Skill tool at the relevant stage. If it is not available, follow the inline guidance in this skill — the core instructions for each stage already cover the essential techniques.
**Example:** During Stage 5, if `claude-mermaid:mermaid-diagrams` is installed, invoke it before writing Mermaid blocks. If not, follow the Mermaid guidelines in the [Authoring](#stage-5-authoring) section directly.
---
## Pre-Flight Checklist
Use this checklist when starting any new document:
- [ ] Verify output directory exists
- [ ] Verify all source material paths are accessible
- [ ] Verify template file exists and is not locked
- [ ] Extract template TOC (headings + toc-styled paragraphs)
- [ ] Extract template images and tables to plan per-chapter visuals
- [ ] **Audit template styles**: check Normal `line_spacing_rule` and Heading `numPr`
- [ ] Confirm TOC mapping with the user (top-level fixed, sub-sections adapted)
- [ ] Write spec and get user approval
- [ ] Complete research with source traceability for every TOC entry
- [ ] Author each chapter as an independent Markdown file
- [ ] Merge into `full_draft.md` and review
- [ ] Render Mermaid diagrams to PNG
- [ ] Run injection script
- [ ] Open output `.docx` and verify: cover page, TOC refresh, image layout, code blocks