Hologres DDL schema design and table creation expert. Use for generating CREATE TABLE statements, choosing storage formats (column/row/row-column), configuri...
SKILL.md

---
name: hologres-schema-generator
description: |
  Hologres DDL schema design and table creation expert. Use for generating CREATE TABLE statements,
  choosing storage formats (column/row/row-column), configuring table properties (distribution_key,
  clustering_key, bitmap_columns, event_time_column), designing partition tables, selecting data types,
  and optimizing table schemas for different query patterns.
  Triggers: "hologres建表", "hologres create table", "hologres schema", "hologres DDL",
  "distribution_key", "clustering_key", "bitmap_columns", "event_time_column", "列存", "行存",
  "行列共存", "分区表", "hologres table design", "表设计", "hologres数据类型"
---

## Prerequisites

This skill requires **hologres-cli** to be installed first:

```bash
pip install hologres-cli
export HOLOGRES_SKILL=hologres-schema-generator
```

All SQL execution depends on `hologres-cli` commands (`hologres sql run --write`, `hologres table create`).

# Hologres Schema Design & Table Creation

Generate optimized Hologres DDL statements with proper storage format, indexing, and distribution strategies.

## Information Gathering (IMPORTANT)

Before generating DDL, check whether the user has provided enough context. Many Hologres table properties are **immutable after creation** (orientation, distribution_key, clustering_key, event_time_column, primary key, partition column) — getting them wrong means recreating the table.

### Required Information

If ANY of the following are unclear, **ask the user before generating DDL**:

| Dimension | Why Critical | Example Question |
|-----------|-------------|-----------------|
| **Query pattern** | Determines `orientation` (column/row/row-column) | "This table is mainly used for OLAP analytics, KV point lookups, or both?" |
| **Column definitions** | Core schema structure | "Can you list the main columns and their rough types (text, number, timestamp, etc.)?" |
| **Primary key / unique key** | Determines UPSERT capability; immutable | "Does this table need deduplication or updates by a unique key? If so, which column(s)?" |
| **Main query conditions** | Determines distribution_key, clustering_key, bitmap | "What columns are most often used in WHERE, JOIN, or GROUP BY?" |

### Conditionally Required

Ask these only when relevant signals appear:

| Signal | Follow-up Question |
|--------|-------------------|
| Table looks like time-series or log data | "What is the estimated daily data volume? (determines whether to use partitioning)" |
| User mentions JOIN with other tables | "Which column will this table JOIN on, and what is the distribution_key of the other table?" |
| Large data volume or mentions retention | "How long should data be retained? (recommend dynamic partition management for lifecycle)" |
| User mentions both analytics and serving | "What is the read pattern — batch scans, point lookups by ID, or both?" |

### Can Use Defaults (No Need to Ask)

| Property | Default Behavior |
|----------|-----------------|
| `bitmap_columns` | Auto-enabled for TEXT columns; can ALTER later |
| `dictionary_encoding_columns` | Use `:auto` mode; can ALTER later |
| `storage_mode` | Default `hot`; can ALTER later |
| `time_to_live_in_seconds` | **Not recommended.** Deletion time is non-deterministic. Use dynamic partition management instead |

### Gathering Flow

```
User request → Check available info
  |
  +-- Columns + query pattern + key info all clear?
  |     → Generate DDL directly
  |
  +-- Missing critical info?
  |     → Ask 1-3 focused questions (batch them, don't ask one at a time)
  |
  +-- Very vague request (e.g., "help me create a user table")?
        → Ask: 1) main columns  2) query pattern  3) key/dedup needs
```

> **Principle:** Batch questions into a single round. Never ask more than 3-4 questions at once. If the user provides partial info, fill in reasonable defaults for the rest and explain your assumptions in comments.

## Storage Formats

Choose the storage format based on your primary query pattern.

| Format | Syntax | Best For | Primary Key |
|--------|--------|----------|-------------|
| **Column store** | `orientation = 'column'` | OLAP, aggregation, scan-heavy queries | Optional |
| **Row store** | `orientation = 'row'` | Point lookups, high-QPS KV queries | Required |
| **Row-column store** | `orientation = 'row,column'` | Mixed workloads (OLAP + point lookup) | Required |

> **Default recommendation:** Use `row,column` (row-column store) when the workload is unclear. It handles both OLAP and point queries well.

## Quick Start

```sql
-- Column store table (OLAP / analytics)
CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT,
  amount DECIMAL(18,2),
  status TEXT,
  order_time TIMESTAMPTZ NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (order_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'order_id',
  clustering_key = 'order_time:asc',
  event_time_column = 'order_time',
  bitmap_columns = 'status,user_id'
);

-- Row store table (high-QPS point lookup)
CREATE TABLE user_profile (
  user_id BIGINT NOT NULL,
  name TEXT,
  email TEXT,
  tags JSONB,
  updated_at TIMESTAMPTZ,
  PRIMARY KEY (user_id)
)
WITH (
  orientation = 'row',
  distribution_key = 'user_id',
  clustering_key = 'user_id'
);
```

## CREATE TABLE Syntax (V2.1+)

```sql
CREATE TABLE [IF NOT EXISTS] [schema.]table_name (
  column_name data_type [NOT NULL] [DEFAULT expr],
  ...
  [PRIMARY KEY (col1 [, col2, ...])]
)
[PARTITION BY LIST (partition_column)]
WITH (
  orientation = '{column | row | row,column}',
  distribution_key = 'col1[,col2]',
  clustering_key = 'col1[:asc|:desc][,col2[:asc|:desc]]',
  event_time_column = 'col',
  bitmap_columns = 'col1[,col2,...]',
  dictionary_encoding_columns = 'col1[:auto|:on|:off][,...]',
  time_to_live_in_seconds = 'N',          -- NOT recommended, see below
  storage_mode = '{hot | cold}',
  table_group = 'group_name'
);
```

### Legacy Syntax (All Versions)

```sql
BEGIN;
CREATE TABLE table_name (...);
CALL set_table_property('table_name', 'orientation', 'column');
CALL set_table_property('table_name', 'distribution_key', 'col1');
CALL set_table_property('table_name', 'clustering_key', 'col1:asc');
CALL set_table_property('table_name', 'event_time_column', 'col1');
CALL set_table_property('table_name', 'bitmap_columns', 'col1,col2');
COMMIT;
```

## Table Properties Reference

### distribution_key (Distribution Key)

Controls how data is hash-distributed across shards. Critical for JOIN and GROUP BY performance.

| Rule | Description |
|------|-------------|
| Choose high-cardinality columns | Avoid data skew |
| Use JOIN/GROUP BY columns | Enable local computation, avoid shuffle |
| Max 2 columns | More columns reduce distribution effectiveness |
| Must be subset of PK | If table has a primary key |
| Immutable after creation | Cannot ALTER, must recreate table |

```sql
-- Single column
distribution_key = 'user_id'

-- Two columns (join on both)
distribution_key = 'order_id,user_id'
```

### clustering_key (Clustering Key / Sorted Index)

Physically sorts data within files. Accelerates range queries and filters.

| Rule | Description |
|------|-------------|
| Use range-query columns | e.g., timestamp, date columns |
| Max 2-3 columns | More columns dilute sort benefit |
| Put high-selectivity column first | Most-filtered column goes first |
| Specify sort order | `:asc` (default) or `:desc` |
| Column store only | Row store uses PK as sort key |

```sql
clustering_key = 'order_time:asc'
clustering_key = 'ds:asc,order_time:asc'
```

### event_time_column (Segment Key)

Organizes data files by time ranges. Enables file-level pruning for time-range queries.

| Rule | Description |
|------|-------------|
| Use time/date columns | Timestamp of data ingestion or event time |
| At most 1 column | Only one segment key per table |
| Column store only | Not applicable to row store |
| Combine with partition | Partition for coarse pruning, segment key for fine pruning |

```sql
event_time_column = 'event_time'
```

### bitmap_columns (Bitmap Index)

Builds bitmap indexes for fast equality filtering on low-to-medium cardinality columns.

| Rule | Description |
|------|-------------|
| Use filter columns | Columns frequently in WHERE clause |
| Low-medium cardinality | status, type, region — NOT user_id |
| TEXT columns auto-enabled | Default bitmap for text columns |
| Can be added after creation | `ALTER TABLE ... SET (bitmap_columns = ...)` |

```sql
bitmap_columns = 'status,payment_type,region'
```

### dictionary_encoding_columns (Dictionary Encoding)

Compresses text columns by mapping values to integers. Speeds up GROUP BY and aggregations.

| Rule | Description |
|------|-------------|
| Use `:auto` mode | Let Hologres decide based on cardinality |
| Good for low-cardinality text | country, status, category |
| Avoid high-cardinality columns | user_id, order_id — no compression benefit |
| Don't set if unsure | Incorrect setting may hurt performance |

```sql
dictionary_encoding_columns = 'country:auto,status:auto'
```

### time_to_live_in_seconds (TTL) — NOT RECOMMENDED

> **Do NOT use `time_to_live_in_seconds`.** The actual deletion time is non-deterministic — data will be deleted at an arbitrary time after the specified TTL, not at a precise point. This makes it unreliable for data lifecycle management.
>
> **Recommended alternative:** Use **dynamic partition management** — create daily/hourly partitions and drop old partitions on a schedule (via cron or scheduling system). This gives you precise, predictable data lifecycle control.

```sql
-- BAD: TTL-based lifecycle (deletion time unpredictable)
-- time_to_live_in_seconds = '2592000'

-- GOOD: Partition-based lifecycle (precise control)
-- 1. Create table with daily partitions
CREATE TABLE events (
  event_id BIGINT NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, event_id)
) PARTITION BY LIST (ds)
WITH (orientation = 'column', distribution_key = 'event_id');

-- 2. Drop old partitions on schedule (e.g., retain 30 days)
DROP TABLE IF EXISTS events_20251101;  -- drop partition older than 30 days
```

## Primary Key Design

| Storage Format | PK Requirement | Notes |
|---------------|---------------|-------|
| Column store | Optional | Add PK only if needed for UPSERT |
| Row store | **Required** | PK drives the row-store index |
| Row-column store | **Required** | PK serves both point lookup and analytics |

**Rules:**
- Max 32 columns in composite PK
- PK columns must be NOT NULL and UNIQUE
- Prohibited types: `FLOAT`, `DOUBLE`, `NUMERIC`, `ARRAY`, `JSON`, `JSONB`
- PK cannot be altered after creation — must recreate table
- Avoid `SERIAL` as PK — causes table-level locks on write

## Partition Table Design

Use partitions for large tables with time-based or categorical data.

```sql
-- Parent table
CREATE TABLE events (
  event_id BIGINT NOT NULL,
  user_id BIGINT,
  event_type TEXT,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, event_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'event_id',
  clustering_key = 'ds:asc',
  event_time_column = 'ds'
);

-- Child partitions
CREATE TABLE events_20251201 PARTITION OF events FOR VALUES IN ('20251201');
CREATE TABLE events_20251202 PARTITION OF events FOR VALUES IN ('20251202');
```

**Rules:**
- Only `LIST` partitioning is supported
- Partition column must be part of PK (if PK exists)
- Supported partition column types: `TEXT`, `VARCHAR`, `INT`, `DATE` (V1.3.22+)
- Skip daily partitions if daily data < 100M rows — use `event_time_column` instead
- Always filter on partition column in queries for pruning

## Scenario-Based Templates

### 1. High-QPS Point Lookup (KV)

```sql
CREATE TABLE user_kv (
  user_id BIGINT NOT NULL PRIMARY KEY,
  profile JSONB,
  updated_at TIMESTAMPTZ
)
WITH (
  orientation = 'row',
  distribution_key = 'user_id'
);
```

### 2. Prefix Range Scan

```sql
CREATE TABLE order_lines (
  order_id BIGINT NOT NULL,
  line_no INT NOT NULL,
  product_id BIGINT,
  quantity INT,
  PRIMARY KEY (order_id, line_no)
)
WITH (
  orientation = 'row',
  distribution_key = 'order_id',
  clustering_key = 'order_id'
);
```

### 3. Time-Range Analytics (Partitioned)

```sql
CREATE TABLE page_views (
  view_id BIGINT NOT NULL,
  user_id BIGINT,
  page_url TEXT,
  view_time TIMESTAMPTZ,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, view_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'view_id',
  clustering_key = 'view_time:asc',
  event_time_column = 'view_time',
  bitmap_columns = 'user_id,page_url'
);
```

### 4. Dimension Filtering (Non-Time)

```sql
CREATE TABLE product_stats (
  product_id BIGINT NOT NULL PRIMARY KEY,
  category TEXT,
  brand TEXT,
  sales_count BIGINT,
  revenue DECIMAL(18,2)
)
WITH (
  orientation = 'column',
  distribution_key = 'product_id',
  clustering_key = 'category:asc',
  bitmap_columns = 'category,brand'
);
```

### 5. JOIN-Optimized Fact + Dimension

```sql
-- Fact table: distribute by join key
CREATE TABLE fact_orders (
  order_id BIGINT NOT NULL PRIMARY KEY,
  customer_id BIGINT,
  amount DECIMAL(18,2),
  order_date TEXT
)
WITH (
  orientation = 'column',
  distribution_key = 'customer_id'
);

-- Dimension table: same distribution key
CREATE TABLE dim_customers (
  customer_id BIGINT NOT NULL PRIMARY KEY,
  name TEXT,
  region TEXT
)
WITH (
  orientation = 'row,column',
  distribution_key = 'customer_id'
);
-- JOIN on customer_id → local join, no shuffle
```

### 6. Mixed Workload (OLAP + Serving)

```sql
CREATE TABLE realtime_metrics (
  metric_id BIGINT NOT NULL,
  device_id BIGINT NOT NULL,
  value DOUBLE PRECISION,
  ts TIMESTAMPTZ NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, device_id, metric_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'row,column',
  distribution_key = 'device_id',
  clustering_key = 'ts:asc',
  event_time_column = 'ts',
  bitmap_columns = 'device_id'
);
```

## References

| Document | Content |
|----------|---------|
| [data-types.md](references/data-types.md) | Complete data type reference |
| [table-properties.md](references/table-properties.md) | Detailed table property guide with selection flowcharts |
| [partition-guide.md](references/partition-guide.md) | Partition table design and management |

## Best Practices

1. **Choose storage format first** — column for OLAP, row for KV, row-column when unsure
2. **Set `distribution_key` to JOIN/GROUP BY columns** — avoids cross-shard shuffle
3. **Set `clustering_key` for range-query columns** — improves time-range and filter queries
4. **Set `event_time_column` for time-series data** — enables file-level pruning
5. **Use `bitmap_columns` for low-cardinality filter columns** — status, type, region
6. **Partition by date only if daily data > 100M rows** — otherwise use segment key alone
7. **PK must include partition column** — required by Hologres for partition tables
8. **Align `distribution_key` across JOINed tables** — same column enables local join
9. **Use `WITH` syntax (V2.1+)** — cleaner than `CALL set_table_property` in transactions
10. **Avoid `SERIAL` as primary key** — causes table-level write locks
Hologres Schema Generator

SKILL.md

related skills