clawhub

Brand DNA Extractor

Extract brand identity (colors, typography, visual style, imagery) from any website URL. Scrapes the site, analyzes CSS/images with K-means and VLM, and retu...

SkillRank score: 7.8 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

brand-dna-extractor analyzes website urls to extract structured brand identity profiles including color palettes, typography, and visual style through scraping, k-means clustering, and vision language model analysis.

structure: 9.0
trigger phrases: 7.0
procedure: 9.0
edge cases: 6.0
documentation: 8.0

original SKILL.md from clawhub

---
name: brand-dna-extractor
description: Extract brand identity (colors, typography, visual style, imagery) from any website URL. Scrapes the site, analyzes CSS/images with K-means and VLM, and returns a structured brand profile. Use when you need to understand a brand's visual language before generating on-brand content.
homepage: https://canlah.ai
metadata: {"category": "brand-analysis", "tags": ["brand", "colors", "typography", "visual-style", "scraping", "vlm"], "runtime": "python", "env": ["OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "SUPABASE_URL", "SUPABASE_KEY"]}
---

# Brand DNA Extractor

Extract a structured brand identity profile from any website URL. Analyzes colors, typography, and visual style to produce a reusable brand profile for on-brand content generation.

## Environment Variables

```bash
export OPENAI_API_KEY="your_openai_key"          # for VLM visual analysis (fallback)
export GOOGLE_GENAI_API_KEY="your_gemini_key"    # for VLM visual analysis (primary)
export SUPABASE_URL="your_supabase_url"          # optional: for caching results
export SUPABASE_KEY="your_supabase_key"          # optional: service role key
```

## What It Extracts

| Component | Details |
|-----------|---------|
| **Color palette** | Primary, secondary, accent, background, and text colors — sourced from CSS variables, computed styles, and K-means image clustering |
| **Typography** | Heading and body fonts, weights, sources (Google Fonts, Adobe Fonts, system) |
| **Visual style** | Mood descriptors, photography styles, composition notes, lighting characterization, brand personality, target audience signals |
| **Imagery** | Logo, favicon, hero images, product images, other images — classified and ranked |

## Python Usage

```python
import asyncio
from brand_dna_extractor.extractor import BrandDNAExtractor, extract_brand_dna

# Quick extraction
async def main():
    result = await extract_brand_dna(
        url="https://example.com",
        user_id="optional-user-id",
        force_refresh=False,
    )

    if result.success:
        dna = result.brand_dna
        print(dna.color_palette.dominant_color)       # "#2563EB"
        print(dna.typography.primary_font.family)     # "Inter"
        print(dna.visual_style.moods)                 # ["warm minimalism", "approachable"]
        print(dna.visual_style.brand_personality)     # "Confident and calm..."
    else:
        print(result.error)

asyncio.run(main())

# Full control (wrapped in a coroutine, since `await` is only valid inside `async def`)
async def main_full_control():
    extractor = BrandDNAExtractor(
        vlm_provider="gemini",      # "gemini" (default) or "openai"
        enable_storage=True,        # cache results in Supabase
        enable_embeddings=False,    # skip CLIP embedding generation
    )

    result = await extractor.extract(
        url="https://example.com",
        include_subpages=True,      # also scrape about/product pages
        max_subpages=5,
        force_refresh=False,
    )
    return result

asyncio.run(main_full_control())
```

## 5-Step Extraction Pipeline

### Step 1: Website Scraping

Uses a two-tier scraping strategy:

**Primary — DOM Structure Scraper** (`SimpleScraper`)
- Fast HTTP requests with structured HTML parsing
- Extracts CSS variables, computed styles, stylesheets, JSON-LD data
- Optimized for Shopify stores (reads product JSON-LD)
- Follows `include_subpages` to crawl up to `max_subpages` additional URLs

**Fallback — Playwright Scraper** (`PlaywrightScraper`)
- Activates when simple scraper yields < 3 gallery/product images
- Handles JavaScript-rendered content
- Optional dependency: `pip install playwright && playwright install`
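
The two-tier strategy can be read as a small decision function. The following is a minimal sketch, not the skill's actual code: `choose_scraper` and `playwright_available` are hypothetical helper names, and only the threshold of 3 images and the optional Playwright dependency come from the description above.

```python
import importlib.util

# Threshold stated above: fewer than 3 gallery/product images triggers the fallback.
MIN_IMAGES = 3


def playwright_available() -> bool:
    """Return True if the optional `playwright` package can be imported."""
    return importlib.util.find_spec("playwright") is not None


def choose_scraper(image_count: int, has_playwright: bool) -> str:
    """Decide which scraper's result to use (hypothetical helper)."""
    if image_count >= MIN_IMAGES:
        return "simple"
    # Too few images: use Playwright only when it is actually installed.
    return "playwright" if has_playwright else "simple"


print(choose_scraper(image_count=1, has_playwright=playwright_available()))
```

Passing the availability flag as an argument keeps the decision itself easy to test without installing a browser.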

### Step 2: Image Extraction and Classification

Images are classified into types:

| Type | Description |
|------|-------------|
| `logo` | Site logo (detected by position, alt text, size) |
| `favicon` | Site favicon |
| `hero` | Large above-the-fold banner images |
| `product` | Product photography |
| `lifestyle` | Contextual/lifestyle imagery |
| `other` | Remaining UI images |

Up to 100 images are extracted; the top 30 product images and the top 30 other images are retained.

### Step 3: Color Analysis

Multi-source color extraction and classification:

```
CSS custom properties (--primary-color, --brand-color, etc.)
    +
Computed element styles (headerBackground, ctaBackground, linkColor, etc.)
    +
K-means clustering on logo pixels (3 colors)
    +
K-means clustering on hero/product images (3 colors each, up to 5 images)
    ↓
Deduplicate (Euclidean distance threshold = 30)
    ↓
Classify by lightness/saturation:
  L > 0.9  → background
  L < 0.15 → text
  S > 0.6  → accent
  source=primary → primary
  else     → secondary
```

**`ColorPalette` output:**
```python
palette.dominant_color        # "#2563EB" (hex string)
palette.primary_colors        # List[ColorInfo] (up to 3)
palette.secondary_colors      # List[ColorInfo] (up to 3)
palette.accent_colors         # List[ColorInfo] (up to 2)
palette.background_colors     # List[ColorInfo] (up to 2)
palette.text_colors           # List[ColorInfo] (up to 2)
```

**`ColorInfo` fields:** `hex`, `rgb`, `hsl`, `role`, `source`, `name`, `frequency`, `css_property`
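
The deduplication and classification stages can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, the rules are assumed to apply top to bottom in the order listed, and distance is assumed to be measured in RGB space.

```python
import colorsys
import math


def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Convert '#RRGGBB' to an (r, g, b) tuple of 0-255 integers."""
    h = hex_color.lstrip("#")
    return int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)


def dedupe(colors: list[str], threshold: float = 30.0) -> list[str]:
    """Drop any color closer than `threshold` (RGB Euclidean) to one already kept."""
    kept: list[str] = []
    for color in colors:
        if all(math.dist(hex_to_rgb(color), hex_to_rgb(k)) >= threshold for k in kept):
            kept.append(color)
    return kept


def classify(hex_color: str, source: str = "") -> str:
    """Assign a role using the lightness/saturation rules, checked in listed order."""
    r, g, b = (v / 255 for v in hex_to_rgb(hex_color))
    _, lightness, saturation = colorsys.rgb_to_hls(r, g, b)  # colorsys returns H, L, S
    if lightness > 0.9:
        return "background"
    if lightness < 0.15:
        return "text"
    if saturation > 0.6:
        return "accent"
    if source == "primary":
        return "primary"
    return "secondary"


unique = dedupe(["#2563EB", "#2664EC", "#FFFFFF", "#111111"])
print([(c, classify(c)) for c in unique])
```

Because the lightness checks run first, a saturated but very dark color is labeled `text` rather than `accent`; a different rule order would change that outcome.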

### Step 4: Typography Analysis

Font detection from three sources:

**CSS Computed Fonts**
- Parses `font-family` declarations from computed element styles
- Classifies by role: heading, body, cta, nav
- Identifies system fonts vs custom fonts

**Google Fonts** (detected from stylesheet URLs)
- Parses both old (`/css?family=`) and new (`/css2?family=`) API formats
- Extracts family names and weight variants

**Adobe Fonts / Typekit** (detected from stylesheet URLs)
- Flags usage of `use.typekit.net` or `use.adobe.com`

**`Typography` output:**
```python
typography.primary_font         # FontInfo — main body font
typography.secondary_font       # FontInfo — heading font (if different)
typography.heading_fonts        # List[FontInfo]
typography.body_fonts           # List[FontInfo]
typography.accent_fonts         # List[FontInfo]
typography.google_fonts_urls    # List[str]
typography.detected_from_google_fonts  # bool
typography.detected_from_adobe_fonts   # bool
```

**`FontInfo` fields:** `family`, `weight`, `role`, `source`, `fallbacks`, `url`
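
As an illustration of the Google Fonts step, the sketch below parses both URL formats with `urllib.parse`. It assumes the publicly documented URL shapes (`family=A:400,700|B` for `/css`, repeated `family=A:wght@400;700` parameters for `/css2`); `parse_google_fonts_url` is a hypothetical helper, not the skill's own function.

```python
from urllib.parse import parse_qs, urlparse


def parse_google_fonts_url(url: str) -> dict[str, list[str]]:
    """Map each font family in a Google Fonts stylesheet URL to its weight variants."""
    parsed = urlparse(url)
    if parsed.netloc != "fonts.googleapis.com":
        return {}
    family_params = parse_qs(parsed.query).get("family", [])  # '+' is decoded to a space
    fonts: dict[str, list[str]] = {}
    if parsed.path == "/css2":
        # New format: one family= parameter per family, axes after ':' and values after '@'.
        for spec in family_params:
            name, _, axes = spec.partition(":")
            axis_names, _, values = axes.partition("@")
            axis_list = axis_names.split(",")
            weights: list[str] = []
            if values and "wght" in axis_list:
                index = axis_list.index("wght")
                for axis_tuple in values.split(";"):
                    weight = axis_tuple.split(",")[index]
                    if weight not in weights:
                        weights.append(weight)
            fonts[name] = weights
    else:
        # Old format: families separated by '|', variants separated by ','.
        for param in family_params:
            for spec in param.split("|"):
                name, _, variants = spec.partition(":")
                fonts[name] = [v for v in variants.split(",") if v]
    return fonts


print(parse_google_fonts_url(
    "https://fonts.googleapis.com/css2?family=Inter:wght@400;700&family=Playfair+Display:ital,wght@0,400;1,700"
))
```

Variable-font ranges such as `200..900` are returned as-is in this sketch rather than expanded.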

### Step 5: Visual Style Analysis (VLM)

Up to 5 representative images (prioritized: hero > product > lifestyle) are analyzed by a VLM using a structured creative director prompt.

**Analysis dimensions:**
1. Visual mood and atmosphere (3-5 compound descriptors)
2. Photography/visual style (2-3 technical descriptors)
3. Composition analysis (negative space, focal point, depth)
4. Lighting characterization (quality, direction, color temperature)
5. Texture and material language
6. Dominant subjects
7. Brand personality inference
8. Target audience signals

**VLM provider selection:**
- Default: Gemini (`gemini-3-flash-preview` or env `GEMINI_MODEL`)
- Fallback: OpenAI Vision (env `OPENAI_MODEL`)
- Automatic retry with exponential backoff (3 attempts)

**`VisualStyle` output:**
```python
style.moods                  # List[str] — top 5 by frequency across images
style.photography_styles     # List[str] — top 3
style.composition_notes      # str — aggregated composition analysis
style.lighting_style         # str
style.texture_notes          # str
style.dominant_subjects      # List[str] — top 5
style.brand_personality      # str — 2-3 sentences
style.target_audience_hint   # str — 2-3 sentences
style.confidence_score       # float — 0.0-1.0 (higher with more images analyzed)
style.images_analyzed        # int
```

## `BrandDNA` Object

```python
@dataclass
class BrandDNA:
    url: str
    domain: str
    logo: Optional[ExtractedImage]
    favicon: Optional[ExtractedImage]
    hero_images: List[ExtractedImage]
    product_images: List[ExtractedImage]
    other_images: List[ExtractedImage]
    color_palette: ColorPalette
    typography: Typography
    visual_style: VisualStyle
    id: Optional[str]                 # UUID if stored in database
    style_embedding: Optional[List[float]]  # CLIP embedding if enabled
```

## Caching

When `enable_storage=True` and Supabase credentials are configured, results are automatically cached by domain.

```python
from brand_dna_extractor.extractor import get_brand_dna, get_brand_dna_by_domain

async def cache_examples(extractor, url):
    # Force re-extraction (ignore cache)
    result = await extractor.extract(url, force_refresh=True)

    # Retrieve cached result by domain
    dna_by_domain = await get_brand_dna_by_domain("example.com")

    # Retrieve by stored ID
    dna_by_id = await get_brand_dna("uuid-string")
    return result, dna_by_domain, dna_by_id
```

## Error Handling

Extraction always returns a `BrandDNAResponse` result object, for failures as well as successes:

```python
result = await extractor.extract(url)

result.success         # bool
result.brand_dna       # BrandDNA | None
result.error           # str | None — human-readable error description
result.from_cache      # bool — True if returned from cache
```

Common failure modes:
- Both scrapers fail (site blocks bots, requires login)
- VLM API quota exhausted
- URL is not a public website

## Installation

```bash
pip install aiohttp Pillow numpy scikit-learn openai google-generativeai
# Optional for JS-heavy sites:
pip install playwright && playwright install chromium
```

## Example Output

```python
BrandDNA(
    domain="allbirds.com",
    color_palette=ColorPalette(
        dominant_color="#2B2B2B",
        primary_colors=[ColorInfo(hex="#2B2B2B", role="primary"), ...],
        accent_colors=[ColorInfo(hex="#E8D5C0", role="accent"), ...],
    ),
    typography=Typography(
        primary_font=FontInfo(family="Flanders Sans", weight="400", source="css"),
        detected_from_google_fonts=False,
    ),
    visual_style=VisualStyle(
        moods=["warm minimalism", "earthy authenticity", "understated confidence"],
        photography_styles=["lifestyle documentary", "naturalistic color treatment"],
        brand_personality="Calm and purposeful, with a commitment to sustainability...",
        target_audience_hint="Environmentally conscious millennials and Gen Z...",
        confidence_score=0.83,
        images_analyzed=5,
    ),
)
```

---

## Author

**[Canlah AI](https://canlah.ai)** — Run performance marketing without breaking your brand.

- GitHub: [github.com/PHY041](https://github.com/PHY041)
- All Skills: [clawhub.ai/PHY041](https://clawhub.ai/PHY041)


added explicit decision trees for api failures, caching fallbacks, and scraper selection; documented all external api connections with env var names and retry logic; expanded procedure into 6 numbered steps with input/output signatures; added edge cases for rate limits, auth expiry, and timeout handling; clarified output contract with full json schema for success and failure states.

Brand DNA Extractor

intent

extract a structured brand identity profile from any website url by scraping the site, analyzing css and images with k-means clustering and vision language models, then returning colors, typography, visual mood, and imagery classification. use this when you need to understand a brand's complete visual language before generating on-brand content, building competitive analysis, or auditing visual consistency across web properties.

inputs

required:

url (string): public website url to analyze (no login required; http urls are tried over https first, see decision points)
GOOGLE_GENAI_API_KEY or OPENAI_API_KEY (env vars): api keys for vision language model analysis. gemini is primary, openai vision is fallback.

optional:

SUPABASE_URL and SUPABASE_KEY (env vars): enable caching and persistence of extracted brand profiles. service role key required for write access.
user_id (string): identifier for cache association, improves result lookup.
force_refresh (boolean, default false): ignore cached results and re-extract from url.
include_subpages (boolean, default false): scrape additional pages (about, product pages) for more complete brand picture. increases runtime by 2-3x.
max_subpages (integer, default 5): limit number of additional pages crawled when include_subpages=true.
vlm_provider (string, "gemini" or "openai", default "gemini"): select which vision model to use for visual analysis.
enable_storage (boolean, default true): persist results to supabase if credentials provided.
enable_embeddings (boolean, default false): generate clip embeddings of visual style for similarity search (adds ~10-15 seconds).

external connections:

google generative ai api (gemini-3-flash-preview or via GEMINI_MODEL env): primary vlm for visual analysis
openai api (vision, fallback if gemini exhausted or unavailable)
supabase postgresql (optional): stores brand dna profiles by domain and uuid

procedure

step 1: scrape website structure and assets

inputs: url, include_subpages flag, max_subpages limit

primary method (dom scraper):

issue http get request to url with standard browser user agent
parse html with lxml or beautifulsoup
extract all stylesheets (link rel="stylesheet" and style tags)
extract css custom properties (--primary-color, --brand-color, etc.)
extract computed element styles from heading, nav, cta, body elements
extract all image src and srcset attributes
extract json-ld structured data (product schema, organization schema)
if include_subpages=true, identify internal links (/about, /products, /team, /blog) and repeat scrape on up to max_subpages urls

outputs: raw css properties map, computed styles dict, image urls list (unordered), json-ld objects, collected from all pages

step 2: classify and rank extracted images

inputs: images list from step 1, html dom

classification logic:

logo: image in header region, small aspect ratio, alt text contains "logo", or filename contains "logo"
favicon: image path matches /favicon.ico or in link rel="icon"
hero: image above the fold (first 30% of page), width > 800px, aspect ratio 16:9 or wider
product: image with product json-ld schema, or in product grid/gallery region, or alt text contains product keywords
lifestyle: image with people, contexts, or scenarios (non-product but brand-supporting)
other: remaining images

outputs: ExtractedImage objects with type, url, alt text, inferred role, confidence score. retained after ranking: top 30 product images (by prominence score), top 30 other images, 1 logo, 1 favicon, up to 5 hero images (prioritized by position and size)
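
the rule-based part of the classification logic above can be sketched as follows. this is an illustration under assumptions: `ImageCandidate` and its fields are hypothetical stand-ins for whatever the scraper records (they are not the skill's ExtractedImage), the checks run in the order shown, and the lifestyle and product-keyword rules are left out because they need image content or a keyword list that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class ImageCandidate:
    """Hypothetical record of what the scraper knows about one image."""
    url: str
    alt: str = ""
    width: int = 0
    height: int = 0
    top_fraction: float = 1.0        # vertical position as a fraction of page height
    in_header: bool = False
    is_icon_link: bool = False       # found in <link rel="icon">
    in_product_region: bool = False  # product grid/gallery or product json-ld


def classify_image(img: ImageCandidate) -> str:
    """Return one of: favicon, logo, hero, product, other."""
    filename = img.url.rsplit("/", 1)[-1].lower()
    if img.is_icon_link or filename == "favicon.ico":
        return "favicon"
    if "logo" in img.alt.lower() or "logo" in filename or img.in_header:
        return "logo"
    # Hero: first 30% of the page, wider than 800px, aspect ratio 16:9 or wider.
    is_wide = img.height > 0 and img.width * 9 >= img.height * 16
    if img.top_fraction <= 0.30 and img.width > 800 and is_wide:
        return "hero"
    if img.in_product_region:
        return "product"
    return "other"


banner = ImageCandidate("https://example.com/img/banner.jpg", width=1600, height=900, top_fraction=0.1)
print(classify_image(banner))
```

the aspect-ratio test uses integer arithmetic (`width * 9 >= height * 16`) so that an exact 16:9 image is not rejected by floating-point rounding.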

step 3: extract color palette

inputs: css properties, computed styles, logo image, hero/product images (top 5)

color source aggregation:

parse css custom properties (--primary-color, --brand-color, --accent, --bg, --text)
extract computed colors from heading, link, button, nav background and text
run k-means clustering (k=3) on logo pixel colors (resize logo to 200x200 first)
run k-means clustering (k=3) on each of top 5 hero/product images (sample 1000 pixels per image to avoid memory bloat)

color deduplication:

calculate euclidean distance in rgb space between all extracted colors
merge colors with distance < 30 (keep highest frequency source)
convert all colors to hex, rgb, hsl formats

color classification by role:

lightness > 0.9: background
lightness < 0.15: text
saturation > 0.6: accent
css_property matches "--primary-" or "--brand-": primary
else: secondary

outputs: ColorPalette with dominant_color (most frequent), primary_colors list (up to 3), secondary_colors (up to 3), accent_colors (up to 2), background_colors (up to 2), text_colors (up to 2). each color includes hex, rgb, hsl, role, source, frequency count, css_property name if detected.
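
to make the k-means step concrete, here is a dependency-free sketch of lloyd's algorithm over rgb pixels. the skill's install list includes scikit-learn, which presumably does the clustering in practice; the pure-python version, the farthest-point initialisation, and the name `dominant_colors` are assumptions chosen to keep the example self-contained and deterministic.

```python
import math
import random


def to_hex(color) -> str:
    """Format an (r, g, b) triple of numbers as '#RRGGBB'."""
    return "#{:02X}{:02X}{:02X}".format(*(round(v) for v in color))


def dominant_colors(pixels, k=3, iterations=10, sample_size=1000, seed=0):
    """Cluster RGB pixels into k colors; return (hex, cluster_size) pairs, largest first."""
    rng = random.Random(seed)
    if len(pixels) > sample_size:
        pixels = rng.sample(pixels, sample_size)  # cap work per image, as described above
    # Farthest-point initialisation: start at the first pixel, then repeatedly
    # add the pixel that is farthest from every center chosen so far.
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(max(pixels, key=lambda p: min(math.dist(p, c) for c in centers)))
    clusters = []
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest center.
        clusters = [[] for _ in centers]
        for p in pixels:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        centers = [
            tuple(sum(channel) / len(members) for channel in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    ranked = sorted(range(k), key=lambda i: len(clusters[i]), reverse=True)
    return [(to_hex(centers[i]), len(clusters[i])) for i in ranked]


pixels = [(255, 0, 0)] * 60 + [(0, 0, 255)] * 30 + [(0, 255, 0)] * 10
print(dominant_colors(pixels))  # [('#FF0000', 60), ('#0000FF', 30), ('#00FF00', 10)]
```

cluster sizes come from the last assignment pass and can serve as the frequency count attached to each color.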

step 4: detect typography

inputs: stylesheets, computed styles from dom scraper, extracted css custom properties

font source detection:

a) computed element styles: parse font-family declarations from heading, body, button, nav elements. classify role as heading/body/cta/nav based on element type. identify system fonts (Arial, Helvetica, Georgia, etc.) vs custom font families.

b) google fonts: parse stylesheet urls containing "fonts.googleapis.com" using both old format (/css?family=) and new format (/css2?family=). extract family names and weight/style variants listed in url.

c) adobe fonts/typekit: detect stylesheet urls containing "use.typekit.net" or "use.adobe.com", flag as detected without parsing individual fonts.

outputs: Typography with primary_font (main body), secondary_font (heading if different), heading_fonts list, body_fonts list, accent_fonts list, google_fonts_urls list, flags for detected_from_google_fonts and detected_from_adobe_fonts. each font includes family, weight, role, source, fallbacks string, and url if applicable.

step 5: analyze visual style with vision language model

inputs: top 5 images (hero > product > lifestyle by priority), url for context

vlm prompt structure: send each image to vlm with structured prompt asking for: (1) visual mood/atmosphere (3-5 compound descriptors like "warm minimalism", "tech-forward sophistication"), (2) photography/visual style (2-3 technical descriptors like "lifestyle documentary", "product studio lighting"), (3) composition analysis (negative space usage, focal point, depth cues), (4) lighting characterization (quality, direction, color temperature), (5) texture/material language, (6) dominant subjects/objects, (7) inferred brand personality (2-3 sentences), (8) target audience signals (2-3 sentences).

vlm provider selection:

primary: gemini-3-flash-preview (env GEMINI_MODEL overrides), retry up to 3 times with exponential backoff (0.5s, 1s, 2s)
fallback: openai gpt-4-vision (env OPENAI_MODEL overrides), if gemini exhausted or unavailable

aggregation: collect all responses across up to 5 images. extract moods list and deduplicate by semantic similarity (top 5 by frequency). extract photography_styles (top 3), consolidate composition_notes and lighting_style across images. compute confidence_score as (images_analyzed / 5) * semantic_consistency. if < 3 images analyzed, reduce confidence by 30%.
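
the aggregation arithmetic can be sketched as below. two simplifications are assumptions: semantic deduplication is replaced by exact matching after lowercasing (a swapped-in, simpler technique), and `semantic_consistency` is taken as an input between 0 and 1 because the text does not define how it is computed. "reduce confidence by 30%" is read as multiplying by 0.7.

```python
from collections import Counter


def top_by_frequency(per_image: list[list[str]], n: int) -> list[str]:
    """Return the n most frequent descriptors across all analyzed images."""
    counts = Counter(item.strip().lower() for items in per_image for item in items)
    return [item for item, _ in counts.most_common(n)]


def confidence_score(images_analyzed: int, semantic_consistency: float, max_images: int = 5) -> float:
    """(images_analyzed / 5) * semantic_consistency, penalised when fewer than 3 images."""
    score = (images_analyzed / max_images) * semantic_consistency
    if images_analyzed < 3:
        score *= 0.7  # assumed reading of "reduce confidence by 30%"
    return round(max(0.0, min(1.0, score)), 2)


moods = [
    ["Warm Minimalism", "approachable"],
    ["warm minimalism", "earthy"],
    ["warm minimalism", "approachable"],
]
print(top_by_frequency(moods, 5))   # ['warm minimalism', 'approachable', 'earthy']
print(confidence_score(5, 0.9))     # 0.9
print(confidence_score(2, 1.0))     # 0.28
```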

outputs: VisualStyle with moods list, photography_styles list, composition_notes string, lighting_style string, texture_notes string, dominant_subjects list (top 5), brand_personality string (2-3 sentences), target_audience_hint string (2-3 sentences), confidence_score (0.0-1.0), images_analyzed count.

step 6: cache result (optional)

inputs: completed BrandDNA object, enable_storage flag, supabase credentials

if enable_storage=true and SUPABASE_URL and SUPABASE_KEY present:

generate uuid for brand dna record
serialize BrandDNA to json
optionally generate clip embedding of visual_style text if enable_embeddings=true (adds ~10-15 seconds)
upsert record into supabase table brand_dna_profiles (keyed on domain)
return with id and from_cache=false

outputs: BrandDNA object with populated id field and optional style_embedding list

decision points

if vlm api quota exhausted or rate limited: retry failed image up to 3 times with exponential backoff. if all retries fail, mark that image as failed, continue with remaining images. if all 5 images fail, return result with confidence_score = 0.0 and error message in response.

if playwright is not installed but the simple scraper extracts < 3 product images: log a warning and continue with the available images. do not fail extraction. fall back to the playwright scraper only if it is installed (check for the playwright module on import).

if both scraping methods fail (site blocks bots, requires auth, 404, timeout after 10s): return BrandDNAResponse with success=false, error="unable to scrape <url>: <reason>", brand_dna=null.

if supabase is configured but unreachable: continue extraction without caching. log warning. return result with success=true but note caching failed (do not block extraction on storage failure).

if url is not https: convert http to https. if https fails, retry with http. if both fail, error out.

if user requests include_subpages=true but the domain blocks bot requests: fall back to single-page extraction. do not fail on subpage scrape errors.

if vlm provider not specified: default to gemini. if gemini api key missing, check for openai key and use that. if neither present, error out.

if cached result exists and force_refresh=false: return cached result immediately with from_cache=true, skip all scraping and analysis steps.

if cached result exists but force_refresh=true: ignore cache, re-extract, and overwrite cache with fresh result.
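
two of the decision points above, provider selection and the http/https rule, are simple enough to sketch directly. the helper names are hypothetical, and the behaviour when a provider is requested explicitly but its key is missing is an assumption, since the text only covers the unspecified case.

```python
from typing import Mapping, Optional

# Env var names as listed under inputs.
KEY_FOR_PROVIDER = {"gemini": "GOOGLE_GENAI_API_KEY", "openai": "OPENAI_API_KEY"}


def select_vlm_provider(requested: Optional[str], env: Mapping[str, str]) -> str:
    """Pick a provider: default to gemini, else openai, else raise."""
    if requested is not None:
        if requested not in KEY_FOR_PROVIDER:
            raise ValueError(f"unknown vlm provider: {requested}")
        if not env.get(KEY_FOR_PROVIDER[requested]):
            # Assumption: an explicit request without its key is an error.
            raise RuntimeError(f"{KEY_FOR_PROVIDER[requested]} is not set")
        return requested
    for provider in ("gemini", "openai"):
        if env.get(KEY_FOR_PROVIDER[provider]):
            return provider
    raise RuntimeError("no vlm api key configured")


def candidate_urls(url: str) -> list[str]:
    """Return urls to try in order: https first, then the original http url."""
    if url.startswith("http://"):
        return ["https://" + url[len("http://"):], url]
    return [url]


print(select_vlm_provider(None, {"OPENAI_API_KEY": "sk-placeholder"}))  # openai
print(candidate_urls("http://example.com"))
```

passing the environment as a mapping (for example `os.environ`) keeps the selection logic testable without touching real credentials.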

output contract

success case (BrandDNAResponse with success=true):

{
  "success": true,
  "from_cache": false,  # or true if retrieved from supabase
  "brand_dna": {
    "url": "https://example.com",
    "domain": "example.com",
    "logo": {
      "url": "https://example.com/logo.svg",
      "type": "logo",
      "alt": "Example Logo"
    },
    "favicon": {
      "url": "https://example.com/favicon.ico",
      "type": "favicon"
    },
    "hero_images": [
      {
        "url": "https://...",
        "type": "hero",
        "alt": "Hero banner"
      }
    ],
    "product_images": [ ... ],
    "other_images": [ ... ],
    "color_palette": {
      "dominant_color": "#2B2B2B",
      "primary_colors": [
        {
          "hex": "#2B2B2B",
          "rgb": "rgb(43, 43, 43)",
          "hsl": "hsl(0, 0%, 17%)",
          "role": "primary",
          "source": "css_property",
          "name": "Charcoal",
          "frequency": 12,
          "css_property": "--primary-color"
        }
      ],
      "secondary_colors": [ ... ],
      "accent_colors": [ ... ],
      "background_colors": [ ... ],
      "text_colors": [ ... ]
    },
    "typography": {
      "primary_font": {
        "family": "Inter",
        "weight": "400",
        "role": "body",
        "source": "google_fonts",
        "fallbacks": "sans-serif",
        "url": "https://fonts.googleapis.com/css2?family=Inter"
      },
      "secondary_font": {
        "family": "Playfair Display",
        "weight": "700",
        "role": "heading",
        "source": "google_fonts",
        "fallbacks": "serif",
        "url": "https://fonts.googleapis.com/css2?family=Playfair+Display"
      },
      "heading_fonts": [ ... ],
      "body_fonts": [ ... ],
      "accent_fonts": [ ... ],
      "google_fonts_urls": [ ... ],
      "detected_from_google_fonts": true,
      "detected_from_adobe_fonts": false
    },
    "visual_style": {
      "moods": [
        "warm minimalism",
        "earthy authenticity",
        "understated confidence",
        "natural sophistication",
        "calm spaciousness"
      ],
      "photography_styles": [
        "lifestyle documentary",
        "naturalistic color treatment",
        "soft daylight"
      ],
      "composition_notes": "imagery favors negative space and minimal subjects. focal points are clear and uncluttered. depth created through subtle layering rather than dramatic perspective.",
      "lighting_style": "soft, diffused natural light with warm color temperature (2800-3500K). backlighting and golden hour preferred.",
      "texture_notes": "natural materials emphasized: raw cotton, linen, rubber, cork. textures are tactile but not overstated.",
      "dominant_subjects": [
        "shoes and footwear",
        "natural outdoor environments",
        "human hands and feet",
        "sustainable materials",
        "minimalist product displays"
      ],
      "brand_personality": "calm and purposeful, with genuine commitment to environmental sustainability. approachable and human-centered rather than corporate. speaks with quiet confidence about quality and values.",
      "target_audience_hint": "environmentally conscious millennials and gen z with disposable income. value authenticity, sustainability, and design quality over trend-chasing. active lifestyle but not extreme athletics.",
      "confidence_score": 0.87,
      "images_analyzed": 5
    },
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "style_embedding": [ ... ]  # optional, if enable_embeddings=true
  },
  "error": null
}

failure case (BrandDNAResponse with success=false):

{
  "success": false,
  "from_cache": false,
  "brand_dna": null,
  "error": "unable to scrape https://example.com: connection timeout after 10s"
}

data location: if caching enabled, result stored in supabase table brand_dna_profiles with columns: domain (unique), brand_dna_json (jsonb), id (uuid), created_at, updated_at, style_embedding (vector, if enabled).

outcome signal

user knows the skill worked when:

returned object has success=true and brand_dna is not null
color_palette contains at least 1 dominant_color and 2+ primary_colors extracted from the actual website
typography lists at least 1 font family (primary_font) with source clearly identified (google_fonts, css, adobe_fonts, or system)
visual_style contains moods, photography_styles, brand_personality, and target_audience_hint populated with coherent, brand-relevant descriptors (not generic)
hero_images, product_images, and other_images lists are populated (not all empty) if the website contains images
confidence_score is 0.65 or higher (indicating stable visual analysis across multiple images)
if from_cache=true, result was returned in < 100ms (instant)
if from_cache=false, result completed in 8-45 seconds depending on image count and vlm response time

watch out for:

confidence_score below 0.5: visual analysis may be unreliable due to limited images or api failures
logo and favicon both null: site may not have visible branding assets
primary_colors or secondary_colors empty: color extraction may have failed due to poor css structure or image analysis limits
vlm responses with generic descriptors like "modern", "professional": indicates vlm struggled to analyze images (low image quality, too many ui elements, poor composition)
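
the checks above can be automated against the json-shaped response from the output contract. this is a minimal sketch assuming the response has already been converted to a plain dict; `review_response` and the `GENERIC_DESCRIPTORS` set are hypothetical, with the two generic words taken from the list above.

```python
GENERIC_DESCRIPTORS = {"modern", "professional"}


def review_response(response: dict) -> list[str]:
    """Return warnings for a BrandDNAResponse-shaped dict (empty list means no concerns)."""
    if not response.get("success") or response.get("brand_dna") is None:
        return [f"extraction failed: {response.get('error')}"]
    dna = response["brand_dna"]
    palette = dna.get("color_palette") or {}
    style = dna.get("visual_style") or {}
    warnings = []
    if style.get("confidence_score", 0.0) < 0.5:
        warnings.append("confidence_score below 0.5: visual analysis may be unreliable")
    if dna.get("logo") is None and dna.get("favicon") is None:
        warnings.append("logo and favicon both null")
    if not palette.get("primary_colors") or not palette.get("secondary_colors"):
        warnings.append("primary_colors or secondary_colors empty")
    if any(mood.lower() in GENERIC_DESCRIPTORS for mood in style.get("moods", [])):
        warnings.append("generic mood descriptors")
    return warnings


failed = {"success": False, "from_cache": False, "brand_dna": None,
          "error": "unable to scrape https://example.com: connection timeout after 10s"}
print(review_response(failed))
```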

original author: Canlah AI, run performance marketing without breaking your brand.

github: github.com/PHY041
all skills: clawhub.ai/PHY041