Metadata design is one of the most important parts of building reliable RAG and semantic search systems. Embeddings help a system understand meaning, but metadata helps the system understand context, ownership, freshness, access, structure, and business rules.
If metadata is designed poorly, retrieval becomes harder to control. You may get semantically similar results that belong to the wrong tenant, outdated documents in a generated answer, product results from the wrong region, or content that the user should not be allowed to see. Good metadata design lets the retrieval layer stop those problems before they reach a generated answer.
This guide first explains metadata design in vendor-neutral terms, then uses Weaviate as a practical implementation example for schema design, vectorization, indexing, ingestion, and filtering.
What Metadata Does in RAG and Semantic Search
Metadata is structured information attached to a document, chunk, product, article, record, or knowledge object. It does not usually contain the main semantic content. Instead, it describes how the content should be organized, filtered, ranked, scoped, or governed.
In RAG and semantic search, metadata usually supports five jobs:
- Filtering results before or during retrieval.
- Protecting access with tenant, role, or permission labels.
- Improving freshness by filtering on dates and update times.
- Improving relevance by narrowing results to the right product, category, region, or source.
- Supporting operations such as audits, debugging, deduplication, and lifecycle management.
The main idea is simple: embeddings represent meaning, while metadata represents retrieval control.
Start with Query Patterns, Not Fields
A common mistake is to add every available field as metadata. That creates clutter and often leads to inconsistent filters. Metadata should be designed from expected query patterns.
Ask questions like:
- Will users filter by product, category, source, or region?
- Does retrieval need tenant or organization boundaries?
- Do answers need only current or approved documents?
- Do users need date ranges, price ranges, ratings, or numeric thresholds?
- Will access depend on user role or permission label?
The answers define the metadata model. Fields that affect retrieval correctness should be stable and indexed. Fields used only for display should not be over-optimized.
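One way to make this concrete is to write the expected query patterns down as data and derive the indexed fields from them. The sketch below is only an illustration: the `QueryPattern` class, the `PATTERNS` list, and every field name in it are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPattern:
    """One expected way users or the application will narrow results."""
    question: str
    field: str
    filter_kind: str  # "exact", "multi_value", "range", or "date"
    affects_correctness: bool  # False for display-only fields


# Hypothetical patterns collected from product requirements.
PATTERNS = [
    QueryPattern("Filter by product?", "product", "exact", True),
    QueryPattern("Tenant boundary?", "tenant_id", "exact", True),
    QueryPattern("Only approved documents?", "status", "exact", True),
    QueryPattern("Date ranges?", "updated_at", "date", True),
    QueryPattern("Show author avatar?", "author_avatar_url", "exact", False),
]


def fields_to_index(patterns):
    """Return the fields that affect retrieval correctness, sorted by name."""
    return sorted({p.field for p in patterns if p.affects_correctness})


print(fields_to_index(PATTERNS))
```

Display-only fields such as the avatar URL stay out of the result, so they can be stored without being treated as filterable metadata.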
Choose Data Types Based on How Fields Will Be Queried
The right data type depends on how the field will be searched or filtered. A field that needs exact matching should not be modeled like natural language. A field that needs range filtering should not be stored as free text.
| Use case | Recommended data type | Design reason |
|---|---|---|
| Natural language content | Text | Used for semantic meaning and keyword search. |
| Tags, categories, skills, roles | Text array | Supports multi-value filtering and contains-style queries. |
| IDs, SKUs, emails | Text with exact tokenization | Preserves the full value as one filterable unit. |
| Product or user names | Text with case-insensitive matching | Avoids casing differences breaking filters. |
| Prices, ratings, scores | Number | Supports greater-than and less-than filters. |
| Dates and timestamps | Date | Supports freshness and time-window filters. |
| Flags such as published or active | Boolean or controlled text | Supports simple eligibility filtering. |
Good metadata design reduces ambiguity. If a field is a number, store it as a number. If it is a date, store it as a date. If it is an exact ID, avoid tokenization that breaks it into pieces.
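As a rough illustration of that rule, the following sketch coerces loosely typed source values into filter-friendly types before they are stored. The function name `normalize_metadata` and the input field names are assumptions made for this example, and the date handling assumes ISO 8601 strings that carry a UTC offset.

```python
from datetime import datetime, timezone


def normalize_metadata(raw):
    """Coerce string-typed source values into types that support filtering."""
    return {
        # Exact-match ID: keep as one string, only trim stray whitespace.
        "article_id": str(raw["article_id"]).strip(),
        # Range filter: store as a number, not as text.
        "price": float(raw["price"]),
        # Date filter: parse an offset-aware ISO 8601 string and convert to UTC.
        "published_at": datetime.fromisoformat(raw["published_at"]).astimezone(timezone.utc),
        # Eligibility flag: map common truthy spellings to a boolean.
        "is_active": str(raw["is_active"]).strip().lower() in {"true", "1", "yes"},
        # Multi-value field: split into a clean, lowercased list.
        "tags": [t.strip().lower() for t in raw["tags"].split(",") if t.strip()],
    }


record = normalize_metadata({
    "article_id": " article-2025-0042 ",
    "price": "19.90",
    "published_at": "2025-03-01T09:30:00+00:00",
    "is_active": "TRUE",
    "tags": "Pricing, planning ,",
})
print(record)
```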
Separate Semantic Content from Filter-Only Metadata
Not every property should influence the embedding. Titles, descriptions, summaries, questions, answers, and body text usually carry semantic meaning. IDs, timestamps, status flags, permission labels, and internal codes usually do not.
This distinction matters because vectorization turns selected text into a semantic representation. If you include fields like article_id, tenant_id, or created_at in the vector source, the embedding can become noisier without improving retrieval quality.
A useful pattern is:
- Use semantic fields for embeddings.
- Use metadata fields for filtering, scoping, governance, and operations.
- Only include categories or tags in embeddings if they add real semantic meaning.
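The pattern above can be sketched as a small helper that splits one record into an embedding source and filter-only metadata. This is a minimal sketch: `SEMANTIC_FIELDS`, `build_embedding_text`, and `filter_metadata` are hypothetical names, and the choice of title, summary, and body as the semantic fields is an assumption.

```python
# Assumption: these three fields carry the semantic content of a record.
SEMANTIC_FIELDS = ("title", "summary", "body")


def build_embedding_text(record, fields=SEMANTIC_FIELDS):
    """Join only the semantic fields; IDs, dates, and flags are left out."""
    parts = [str(record[name]).strip() for name in fields if record.get(name)]
    return "\n".join(parts)


def filter_metadata(record, fields=SEMANTIC_FIELDS):
    """Return everything that is not part of the embedding source."""
    return {key: value for key, value in record.items() if key not in fields}


record = {
    "title": "Q2 Pricing Strategy",
    "summary": "Pricing strategy notes for Q2 planning.",
    "body": "Full document content here...",
    "article_id": "article-2025-0042",
    "tenant_id": "acme",
    "created_at": "2025-03-01T09:30:00+00:00",
}

print(build_embedding_text(record))
print(filter_metadata(record))
```

The text passed to the embedding model never contains the ID, tenant, or timestamp, while those values remain available for filtering.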
Design Metadata for Access Control Early
Access control is difficult to bolt on after a search system is already live. If retrieval needs to respect user permissions, the metadata model should include the fields needed to enforce those permissions.
Common access-control metadata includes:
- tenant_id
- organization_id
- workspace_id
- role
- permission_label
- visibility
These fields should be populated at ingestion time and used as hard filters during retrieval. A RAG system should not retrieve restricted documents and then hope to remove them later.
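To illustrate the difference between hard filtering and after-the-fact cleanup, the sketch below applies the access check before any ranking happens. It works on in-memory dictionaries purely for illustration. In a real system the same conditions would be passed to the search engine as a filter. `UserContext`, `is_visible`, and `retrieve` are hypothetical names, and the rule "same tenant and a matching role" is an assumed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    tenant_id: str
    roles: frozenset


def is_visible(chunk, user):
    """Hard access check: same tenant and a role the user actually holds."""
    return chunk["tenant_id"] == user.tenant_id and chunk["role"] in user.roles


def retrieve(chunks, user, score, limit=3):
    """Filter first, then rank only the chunks the user may see."""
    allowed = [chunk for chunk in chunks if is_visible(chunk, user)]
    return sorted(allowed, key=score, reverse=True)[:limit]


chunks = [
    {"id": 1, "tenant_id": "acme", "role": "manager", "score": 0.90},
    {"id": 2, "tenant_id": "globex", "role": "manager", "score": 0.99},
    {"id": 3, "tenant_id": "acme", "role": "admin", "score": 0.95},
    {"id": 4, "tenant_id": "acme", "role": "employee", "score": 0.50},
]
user = UserContext(tenant_id="acme", roles=frozenset({"manager", "employee"}))

for chunk in retrieve(chunks, user, score=lambda c: c["score"]):
    print(chunk["id"], chunk["score"])
```

The highest-scoring chunk belongs to another tenant, and it never enters the ranked list because the access check runs first.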
Plan for Freshness and Lifecycle States
RAG systems often fail because they retrieve stale or inactive content. Metadata should make document lifecycle explicit.
Useful lifecycle fields include:
- created_at
- updated_at
- published_at
- expires_at
- status, such as draft, published, archived, deleted, or approved
Freshness filters are especially useful for policy documents, product documentation, support content, pricing pages, legal material, and internal procedures.
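A lifecycle rule built on those fields might look like the sketch below. It is only one possible policy: treating `published` and `approved` as the retrievable statuses is an assumption, and `RETRIEVABLE_STATUSES` and `is_retrievable` are hypothetical names.

```python
from datetime import datetime, timezone

# Assumption: only these statuses may appear in generated answers.
RETRIEVABLE_STATUSES = {"published", "approved"}


def is_retrievable(chunk, now):
    """Return True if the chunk is live at `now` according to its metadata."""
    if chunk["status"] not in RETRIEVABLE_STATUSES:
        return False
    published_at = chunk.get("published_at")
    if published_at is not None and published_at > now:
        return False  # scheduled for the future, not live yet
    expires_at = chunk.get("expires_at")
    return expires_at is None or expires_at > now


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
policy = {
    "status": "published",
    "published_at": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "expires_at": datetime(2025, 5, 1, tzinfo=timezone.utc),
}
print(is_retrievable(policy, now))
```

The example policy has the right status but expired a month before `now`, so it is excluded.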
Avoid Cross-Reference Filtering When Direct Metadata Will Work
Cross-references may look clean in a data model, but they can make filtering slower and more complex. If a field is frequently used for retrieval filtering, it is often better to denormalize that value onto the searchable object itself.
| Avoid for frequent filters | Prefer for retrieval |
|---|---|
| Post references User, then filter by user name through the reference. | Store user_name or user_id directly on the post object. |
| Product references Category, then filter through the category relationship. | Store category as a direct array property. |
| Document references Tenant, then filter through the tenant object. | Store tenant_id directly on the document chunk. |
Denormalization is not always elegant, but it is often the right choice for fast, predictable retrieval.
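As a sketch of what denormalization can look like in an ingestion step, the helper below copies frequently filtered values from a parent document onto each chunk. The record shapes, the `denormalize_chunk` name, and the specific fields copied are assumptions made for this example.

```python
def denormalize_chunk(chunk, documents_by_id):
    """Copy frequently filtered parent values directly onto the chunk."""
    parent = documents_by_id[chunk["document_id"]]
    return {
        **chunk,
        "tenant_id": parent["tenant_id"],
        "author_id": parent["author_id"],
        # Copy the list so later edits to the parent do not mutate the chunk.
        "category": list(parent["categories"]),
    }


documents_by_id = {
    "doc-1": {
        "tenant_id": "acme",
        "author_id": "user-7",
        "categories": ["finance", "strategy"],
    },
}
chunk = {"chunk_id": "doc-1#0", "document_id": "doc-1", "body": "Pricing notes..."}

print(denormalize_chunk(chunk, documents_by_id))
```

The trade-off is that a change to the parent, such as a new category, has to be propagated to its chunks by re-running this step.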
Store Metadata at Ingestion Time
Metadata should be created and stored when content enters the system. Waiting until query time to infer metadata is slower, less reliable, and harder to debug.
A strong ingestion pipeline should attach fields such as:
- source system
- content type
- author or owner
- product
- region
- role or permission label
- created and updated dates
- status
- chunk position or parent document ID
The more consistently this metadata is populated, the more reliable filtering becomes.
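A minimal sketch of such a pipeline step is shown below. It rejects documents with missing metadata and copies the shared fields onto every chunk along with the parent ID and chunk position. The `REQUIRED_FIELDS` tuple, the `build_chunks` function, and the fixed-size character chunking are all assumptions chosen to keep the example short.

```python
# Assumption: every document must carry these fields before it is ingested.
REQUIRED_FIELDS = (
    "source_system", "content_type", "owner", "product",
    "region", "role", "created_at", "updated_at", "status",
)


def build_chunks(document, chunk_size=200):
    """Split a document and copy its metadata onto every chunk."""
    missing = [name for name in REQUIRED_FIELDS if document.get(name) is None]
    if missing:
        raise ValueError(f"missing metadata: {', '.join(missing)}")
    shared = {name: document[name] for name in REQUIRED_FIELDS}
    body = document["body"]
    # Naive fixed-size split; real pipelines usually split on structure.
    pieces = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]
    return [
        {
            "parent_document_id": document["id"],
            "chunk_index": index,
            "body": piece,
            **shared,
        }
        for index, piece in enumerate(pieces)
    ]


document = {
    "id": "article-2025-0042",
    "body": "x" * 450,
    "source_system": "wiki",
    "content_type": "article",
    "owner": "finance-team",
    "product": "AnalyticsSuite",
    "region": "APAC",
    "role": "manager",
    "created_at": "2025-03-01T09:30:00+00:00",
    "updated_at": "2025-04-15T12:00:00+00:00",
    "status": "published",
}

for chunk in build_chunks(document):
    print(chunk["chunk_index"], len(chunk["body"]), chunk["region"])
```

Failing fast on missing fields keeps partially labeled chunks out of the index, which is usually easier to debug than a filter that silently matches nothing.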
Practical Implementation Example with Weaviate
The following implementation uses Weaviate as an example. The purpose is not to make the design brand-specific, but to show how the metadata design principles translate into a real schema and retrieval workflow.
Create Properties with the Right Data Types
```python
from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Assumes `client` is an already-connected weaviate.WeaviateClient,
# for example one returned by weaviate.connect_to_local().
client.collections.create(
    name="KnowledgeArticle",
    properties=[
        # Semantic content
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT),
        Property(name="summary", data_type=DataType.TEXT),
        # Multi-value metadata
        Property(
            name="tags",
            data_type=DataType.TEXT_ARRAY,
            tokenization=Tokenization.FIELD,
        ),
        # Exact-match metadata
        Property(
            name="article_id",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
            skip_vectorization=True,
            index_filterable=True,
            index_searchable=False,
        ),
        # Case-insensitive matching
        Property(
            name="product",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,
            skip_vectorization=True,
            index_filterable=True,
        ),
        # Access and region metadata
        Property(
            name="region",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
            skip_vectorization=True,
            index_filterable=True,
            index_searchable=False,
        ),
        Property(
            name="role",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
            skip_vectorization=True,
            index_filterable=True,
            index_searchable=False,
        ),
        # Numeric range filtering
        Property(
            name="priority_score",
            data_type=DataType.NUMBER,
            index_range_filters=True,
            skip_vectorization=True,
        ),
    ],
)
```
This schema separates semantic fields from filter-only metadata. The content fields can support meaning-based retrieval, while IDs, region, role, product, and score fields remain available for filtering.
Control What Gets Vectorized
In many collections, you should explicitly choose the fields that become the vector source. This keeps embeddings focused on content rather than internal metadata.
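```python
from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Alternative definition of the same collection: run this instead of, not
# after, another create call for "KnowledgeArticle", because creating a
# collection that already exists raises an error.
client.collections.create(
    name="KnowledgeArticle",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT_ARRAY),
        Property(
            name="article_id",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
            skip_vectorization=True,
        ),
    ],
    # Only these properties are used as the vector source.
    vector_config=Configure.Vectors.text2vec_weaviate(
        source_properties=["title", "body", "category"]
    ),
)
```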
The vector uses title, body, and category because those fields carry meaning. The article ID is kept for filtering or lookup but excluded from vectorization.
Enable Indexes Based on Query Needs
Indexes should match the actual filters your application will run. If you need filtering by creation time, update time, null state, or property length, enable those capabilities intentionally.
```python
import weaviate.classes as wvc

client.collections.create(
    name="Documents",
    inverted_index_config=wvc.config.Configure.inverted_index(
        index_timestamps=True,       # Filter by creation/update time
        index_null_state=True,       # Filter null/non-null properties
        index_property_length=True,  # Filter by array/string length
    ),
)
```
Do not enable every index by default. Each index can add storage and ingestion overhead. Use the indexes your queries actually require.
Insert Metadata with the Object
Metadata should be present when the object is inserted. That makes filtering reliable later.
```python
collection = client.collections.use("KnowledgeArticle")

# Property names should match the collection schema. `category` assumes the
# collection also defines that property, as in the vectorization example.
collection.data.insert({
    "title": "Q2 Pricing Strategy",
    "body": "Full document content here...",
    "summary": "Pricing strategy notes for Q2 planning.",
    "category": ["finance", "strategy"],
    "tags": ["pricing", "planning", "revenue"],
    "region": "APAC",
    "role": "manager",
    "product": "AnalyticsSuite",
    "priority_score": 0.82,
    "article_id": "article-2025-0042",
})
```
This object now has both semantic content for retrieval and structured metadata for filtering.
Use Metadata During Retrieval
Once metadata is designed and stored correctly, it can be used to scope semantic search queries.
```python
from weaviate.classes.query import Filter, MetadataQuery

collection = client.collections.use("KnowledgeArticle")

response = collection.query.near_text(
    query="pricing strategy",
    limit=10,
    return_metadata=MetadataQuery(distance=True),
    filters=(
        Filter.by_property("region").equal("APAC") &
        Filter.by_property("role").equal("manager") &
        Filter.by_property("product").equal("AnalyticsSuite")
    ),
)

for item in response.objects:
    print(item.properties["title"])
    print(item.metadata.distance)
```
This query retrieves semantically relevant content about pricing strategy, but only from objects whose region is APAC, whose role is manager, and whose product is AnalyticsSuite.
Summary Checklist
| Design decision | Recommendation |
|---|---|
| Data type selection | Match the data type to the query pattern: exact, range, array, date, or text. |
| Tokenization | Use exact tokenization for IDs, lowercase matching for names, and word tokenization for content. |
| Vectorization | Exclude filter-only metadata and include only semantically meaningful fields. |
| Indexes | Enable only the indexes your queries require. |
| Metadata storage | Store all filterable fields at ingestion time. |
| Relationships | Denormalize frequently filtered values instead of relying on cross-reference filtering. |
| Access control | Plan tenant, role, and permission metadata before the system goes live. |
Good metadata design makes RAG and semantic search more accurate, safer, and easier to operate. It lets vector search focus on meaning while structured metadata handles context, permissions, freshness, and business rules.