all-MiniLM-L6-v2 for text (384-dim) and clip-ViT-B-32 for images (512-dim).
Architecture overview
The diagram below shows how a single collection stores two named vector spaces: one for text embeddings and one for image embeddings. At query time, the system embeds the user query into both spaces, prefetches candidates from each, and fuses the ranked lists server-side before returning a single result set.
Environment setup
Run the following command to install the three Python packages this tutorial depends on.
- actian-vectorai-client is the Actian VectorAI Python SDK, providing the async client, named vector support, server-side fusion, and gRPC transport.
- sentence-transformers generates text embeddings using all-MiniLM-L6-v2 and image embeddings using clip-ViT-B-32.
- pillow handles image loading and preprocessing.
Step 1: Import dependencies and configure the client
The block below imports the Actian VectorAI SDK alongside the embedding models, then sets the server address, collection name, and dimensionality constants for both vector spaces. Running it loads both models into memory and prints a confirmation of the active configuration.
Expected output
Running this block initializes the Actian VectorAI client, loads both the all-MiniLM-L6-v2 text model and the clip-ViT-B-32 image model into memory, and echoes the active server address, collection name, and the output dimensionality of each model. No collection is created at this stage; the block only confirms that all dependencies are loaded and the configuration constants are set.
Step 2: Define embedding helpers
Each modality has its own embedding function. CLIP maps both images and text into the same 512-dim space, while all-MiniLM-L6-v2 produces richer text representations in 384 dimensions. Running this block defines five helper functions but produces no output.
| Function | Model | Dim | Purpose |
|---|---|---|---|
| embed_text | all-MiniLM-L6-v2 | 384 | High-quality semantic text matching |
| embed_text_clip | clip-ViT-B-32 | 512 | Cross-modal matching (text ↔ image) |
Step 3: Create a collection with named vectors
Named vectors let you store multiple vector spaces in one collection. Running this block calls get_or_create with a vectors_config dictionary that defines a 384-dim text space and a 512-dim image space, each with its own HNSW parameters.
Instead of passing a single VectorParams, pass a dictionary where each key becomes a named vector space. The snippet below shows the minimal form of that dictionary.
This creates two named vector spaces, one under "text" and one under "image". Each space can have its own:
- Dimensionality: 384 for text, 512 for CLIP.
- Distance metric: Cosine, Dot, or Euclid.
- HNSW config: different m and ef_construct per space.
Expected output
Running create_collection() calls get_or_create with a vectors_config dictionary that registers a 384-dim cosine text space and a 512-dim cosine image space, each with its own HNSW parameters. The printed line confirms that both named vector spaces are active and ready to accept points.
Step 4: Prepare multimodal product data
Each product entry has a text description and a visual description. In production, the image vector would come from actual product photos through embed_image_from_bytes(). This example uses CLIP text embeddings of visual descriptions as stand-ins so you can run the tutorial without downloading image files. Running this block defines the products list and prints the count.
Step 5: Ingest with named vectors
The function below batch-embeds all product descriptions and visual descriptions, then upserts them as named vectors. Each PointStruct carries a dictionary whose keys ("text" and "image") match the named vector spaces defined during collection creation.
Each PointStruct carries both a "text" and an "image" vector. The keys must match the names declared in vectors_config when the collection was created, because each vector is stored in its own HNSW index and searched independently.
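A mismatch between the keys of a point's vector dictionary and the declared spaces is easy to catch before upserting. The sketch below is plain Python and does not call the SDK. EXPECTED_DIMS and validate_named_vectors are hypothetical names introduced for illustration, and the check assumes every point carries both vectors, as the points in this tutorial do.

```python
# Hypothetical pre-upsert check; not part of the Actian VectorAI SDK.
# Dimensions mirror the two named spaces used in this tutorial.
EXPECTED_DIMS = {"text": 384, "image": 512}


def validate_named_vectors(vectors: dict[str, list[float]]) -> None:
    """Raise ValueError if keys or lengths differ from the declared spaces."""
    missing = EXPECTED_DIMS.keys() - vectors.keys()
    unknown = vectors.keys() - EXPECTED_DIMS.keys()
    if missing or unknown:
        raise ValueError(
            f"missing spaces: {sorted(missing)}, unknown spaces: {sorted(unknown)}"
        )
    for name, dim in EXPECTED_DIMS.items():
        if len(vectors[name]) != dim:
            raise ValueError(
                f"{name!r} has {len(vectors[name])} dims, expected {dim}"
            )


# A well-formed point passes silently.
validate_named_vectors({"text": [0.0] * 384, "image": [0.0] * 512})

# A misnamed key is rejected before it reaches the server.
try:
    validate_named_vectors({"txt": [0.0] * 384, "image": [0.0] * 512})
except ValueError as err:
    error_message = str(err)
```

Running a check like this on each vector dictionary before building the points keeps naming and dimensionality mistakes out of the upsert batch.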
Expected output
Running ingest_products() batch-embeds all ten product descriptions using all-MiniLM-L6-v2 (producing 384-dim text vectors) and all visual descriptions using the CLIP text encoder (producing 512-dim image vectors). Each PointStruct is assigned a sequential integer ID and carries both named vectors alongside the full product payload. After upserting, flush persists the collection to disk and get_vector_count confirms the total number of indexed vectors.
Step 6: Search a single vector space
Before fusing results across modalities, it helps to see what each vector space returns on its own. The two functions below search the "text" and "image" spaces independently using the using parameter, then print both ranked lists for the same query.
| Space | What it captures | Strength |
|---|---|---|
| text | Semantic meaning of descriptions | Matches “cold weather” to “Gore-Tex membrane” |
| image | Visual appearance and style | Matches “jacket” to brown leather visual |
Expected output
Both functions embed the query "warm jacket for cold weather" using their respective encoders and search each vector space independently. Comparing the two lists side by side reveals where the two models agree and where they diverge.
Why do Waterproof Hiking Boots rank first in the text space? The product description mentions “Gore-Tex membrane” and “ankle support”—terms that semantically overlap with cold-weather protection. all-MiniLM-L6-v2 captures this association between weatherproof gear and cold-weather queries. The image space correctly ranks the leather jacket first, since CLIP responds to the visual cue “jacket” in the query. This is exactly why fusing both spaces in Step 7 produces better results than either alone.
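The divergence between the two spaces can be reproduced without a server. The sketch below ranks three toy items by cosine similarity in each space separately. The 2-dim vectors, item names, and the rank helper are invented stand-ins for illustration; they are not the tutorial's embeddings or the SDK's search call.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy 2-dim vectors standing in for the 384-dim and 512-dim embeddings.
points = {
    "boots": {"text": [0.9, 0.1], "image": [0.2, 0.8]},
    "jacket": {"text": [0.6, 0.4], "image": [0.9, 0.1]},
    "sandals": {"text": [0.1, 0.9], "image": [0.3, 0.7]},
}
query = {"text": [1.0, 0.0], "image": [1.0, 0.0]}


def rank(space: str) -> list[str]:
    """Brute-force ranking of all points within one named space."""
    scored = [
        (name, cosine(query[space], vectors[space]))
        for name, vectors in points.items()
    ]
    return [name for name, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


text_ranking = rank("text")    # boots first
image_ranking = rank("image")  # jacket first
```

Because each space is ranked only against its own vectors, the same query can put different items on top, which is the situation fusion is meant to reconcile.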
Step 7: Multistage prefetch with server-side fusion
This is the core multimodal search pattern. The function below prefetches 20 candidates from each vector space, then passes both lists to the server for RRF fusion, returning a single merged ranking.
- Prefetch stage 1: search the "text" vector space with an all-MiniLM-L6-v2 embedding and return 20 candidates.
- Prefetch stage 2: search the "image" vector space with a CLIP embedding and return 20 candidates.
- Fusion: the server merges both candidate lists using Reciprocal Rank Fusion, producing a single ranked list.
query={"fusion": Fusion.RRF} tells the server to fuse the prefetch results rather than search directly.
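To see what Reciprocal Rank Fusion computes, the following is a minimal plain-Python sketch. It assumes the common formulation with 1-based ranks and a constant k=60; whether the server uses the same constant is an assumption, and the rrf function and the item names are illustrative only.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each appearance adds 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda p: p[1], reverse=True)


# Toy candidate lists standing in for the two prefetch stages.
text_hits = ["boots", "jacket", "parka"]
image_hits = ["jacket", "parka", "boots"]

fused = rrf([text_hits, image_hits])
fused_names = [name for name, _ in fused]  # jacket, boots, parka
```

Under these assumptions, an item ranked first in both lists scores 2/61 ≈ 0.033, and an item that appears only once at rank 20 scores 1/80 = 0.0125, which is roughly in line with the score range quoted for this step.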
Expected output
The function embeds the query into both vector spaces, issues two prefetch requests, and returns a single ranked result set. RRF assigns each item a score based on its position across both ranked lists, so products that appear highly in both spaces receive the highest fused scores. RRF scores are bounded in the range 0.01–0.033.
Step 8: Client-side weighted fusion
When you need to weight one modality higher than the other (for example, favoring text relevance over visual similarity), you can search each space independently and fuse the results client-side. The function below accepts an alpha parameter that controls the text-to-image weight balance and sweeps it from 1.0 (text only) to 0.0 (image only).
| Aspect | Server-side (query + Fusion) | Client-side (reciprocal_rank_fusion) |
|---|---|---|
| Network calls | 1 (single query) | 2+ (one per vector space) |
| Weight control | No (equal weights) | Yes (weights parameter) |
| Algorithms | RRF | RRF |
| Latency | Lower (server merges internally) | Higher (extra round-trips) |
| Flexibility | Limited to server-supported fusions | Arbitrary post-processing |
Expected output
The code sweeps alpha across five values for the query "comfortable everyday shoes". At alpha=1.0 the fusion result is driven entirely by text-space scores; at alpha=0.0 it is driven entirely by the CLIP image space.
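The effect of alpha can be illustrated with a small weighted variant of RRF. This is a sketch under stated assumptions: it weights each list's 1/(k + rank) contribution by alpha and 1 - alpha with k=60, which may differ from how the SDK's reciprocal_rank_fusion applies its weights. The weighted_rrf function, the alpha values, and the item names are hypothetical.

```python
def weighted_rrf(
    text_hits: list[str], image_hits: list[str], alpha: float, k: int = 60
) -> list[str]:
    """Weighted RRF: alpha scales the text list, 1 - alpha scales the image list."""
    scores: dict[str, float] = {}
    for weight, ranking in ((alpha, text_hits), (1.0 - alpha, image_hits)):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    ordered = sorted(scores.items(), key=lambda p: p[1], reverse=True)
    return [item for item, _ in ordered]


# Toy rankings that disagree completely between the two spaces.
text_hits = ["sneakers", "loafers", "boots"]
image_hits = ["boots", "loafers", "sneakers"]

# Sweep from text-only (1.0) to image-only (0.0).
sweep = {
    alpha: weighted_rrf(text_hits, image_hits, alpha)
    for alpha in (1.0, 0.75, 0.5, 0.25, 0.0)
}
```

At the two endpoints the fused order collapses to one of the input rankings, and intermediate values shift items gradually between them.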
Step 9: Add payload filters to multimodal search
The function below combines multimodal RRF fusion with structured payload filters. It builds a filter from optional category and max_price arguments and passes it to the outer query() call so it applies after the two prefetch stages have been fused.
The filter on the outer query() call applies after fusion. The sequence is:
- Both prefetch stages retrieve 20 candidates each, unfiltered within their space.
- The server fuses the candidate lists.
- The filter removes products that do not match—for example, wrong category or too expensive.
- The top-K from the filtered fused list is returned.
The filter acts as a gate on the already-merged candidate pool. To filter before fusion (for example, to restrict which documents each modality can retrieve), pass filter directly to PrefetchQuery instead.
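The difference between the two filter placements shows up when the prefetch limit is small. The sketch below models both orderings in plain Python on toy data. It describes the intended semantics only and does not call the SDK; the product records, the payload field names, the prefetch limit of 2, and the helper functions are assumptions made for illustration.

```python
# Toy payloads; field names are assumed for this example.
products = {
    1: {"name": "Leather Jacket", "category": "outerwear", "price": 250},
    2: {"name": "Hiking Boots", "category": "footwear", "price": 180},
    3: {"name": "Rain Parka", "category": "outerwear", "price": 90},
    4: {"name": "Trail Sneakers", "category": "footwear", "price": 120},
}


def rrf(ranked_lists, k=60):
    """Plain RRF over lists of point IDs; returns IDs, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return [pid for pid, _ in sorted(scores.items(), key=lambda p: p[1], reverse=True)]


def matches(pid, category=None, max_price=None):
    """Payload predicate equivalent to a category and price filter."""
    payload = products[pid]
    if category is not None and payload["category"] != category:
        return False
    if max_price is not None and payload["price"] > max_price:
        return False
    return True


text_hits = [2, 1, 4, 3]   # full ranking in the text space
image_hits = [1, 2, 3, 4]  # full ranking in the image space
limit = 2                  # prefetch limit per space

# Post-fusion filter: prefetch unfiltered, fuse, then drop non-matching points.
post_fusion = [
    pid
    for pid in rrf([text_hits[:limit], image_hits[:limit]])
    if matches(pid, category="outerwear")
]

# Pre-fusion filter: restrict each space first, then prefetch and fuse.
pre_fusion = rrf([
    [pid for pid in text_hits if matches(pid, category="outerwear")][:limit],
    [pid for pid in image_hits if matches(pid, category="outerwear")][:limit],
])
```

Here the post-fusion variant returns only point 1, because point 3 never entered the unfiltered prefetch lists, while the pre-fusion variant returns points 1 and 3. A post-fusion filter can therefore return fewer than top-K results when matching points rank low in both spaces.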
Expected output
Note: Post-fusion filtering on points.query() with RRF is accepted without error but has no effect on the fused results in VectorAI DB 1.0.0 — the full fused candidate list is returned regardless of the filter. The code is correct and will filter as expected in a future release.
Step 10: Run multiple searches across named vectors
When you have several queries to run, issue them sequentially within a single client connection to minimize connection overhead. The function below accepts a list of query dictionaries and dispatches them in one connection.
A single async with block reuses the same gRPC channel, reducing connection overhead compared to opening a new connection per query.
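The connection-reuse pattern can be sketched with a stand-in client. FakeClient below is hypothetical and only counts how many times a connection is opened; it is not the Actian VectorAI client, and its search method and the query dictionary keys are assumptions for illustration.

```python
import asyncio


class FakeClient:
    """Hypothetical async client that counts opened connections."""

    connections_opened = 0

    async def __aenter__(self):
        FakeClient.connections_opened += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions

    async def search(self, query: str, using: str) -> str:
        await asyncio.sleep(0)  # yield control, as a network call would
        return f"{using}:{query}"


async def run_queries(queries: list[dict]) -> list[str]:
    # One async with block, so every query shares the same connection.
    async with FakeClient() as client:
        return [await client.search(q["query"], q["using"]) for q in queries]


results = asyncio.run(run_queries([
    {"query": "warm jacket", "using": "text"},
    {"query": "brown leather", "using": "image"},
    {"query": "running shoes", "using": "text"},
]))
```

Three queries run, but the connection counter stays at one, which is the saving the single async with block provides.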
Step 11: Retrieve specific vectors from named spaces
By default, search results include payloads but not the vectors themselves. The function below runs the same query twice: once requesting the "text" vector and a subset of payload fields, and once requesting the full payload with no vectors.
| Selector | Effect |
|---|---|
| with_vectors=True | Return all named vectors |
| with_vectors=False | Return no vectors (default) |
| WithVectorsSelector(include=["text"]) | Return only the "text" vector |
| WithPayloadSelector(include=["name"]) | Return only the "name" payload field |
| WithPayloadSelector(exclude=["description"]) | Return all fields except "description" |
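The include and exclude semantics in the table can be modeled on plain dictionaries. The select function below is a hypothetical illustration of those semantics, not the SDK's selector classes, and the sample payload and vectors are invented.

```python
def select(mapping: dict, include=None, exclude=None) -> dict:
    """Keep only `include` keys, or drop `exclude` keys; otherwise copy everything."""
    if include is not None:
        return {k: v for k, v in mapping.items() if k in include}
    if exclude is not None:
        return {k: v for k, v in mapping.items() if k not in exclude}
    return dict(mapping)


# Toy point data for illustration.
payload = {"name": "Leather Jacket", "description": "Brown leather", "price": 250}
vectors = {"text": [0.1, 0.2], "image": [0.3, 0.4]}

name_only = select(payload, include=["name"])
without_description = select(payload, exclude=["description"])
text_vector_only = select(vectors, include=["text"])
```

Requesting only the fields and vectors a caller needs keeps responses small, which matters most when the stored vectors are several hundred dimensions each.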
Step 12: Update a named vector
In a multimodal system, different modalities change at different rates: product images may be re-shot while descriptions stay the same. The function below re-embeds and updates only the "image" vector for a given point by fetching the existing point and re-upserting with the new image vector alongside the unchanged text vector and payload.
- Product descriptions rarely change, so skip re-embedding "text".
- Product images change when new photos are taken, so update only "image".
- Metadata changes with price updates, so use set_payload instead.
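The fetch-then-re-upsert flow can be sketched against an in-memory dictionary. The store layout and the update_named_vector helper are hypothetical stand-ins for the collection and the SDK calls, and the 2-dim vectors are toy values.

```python
# In-memory stand-in for a collection: point ID -> named vectors and payload.
store = {
    1: {
        "vectors": {"text": [0.1, 0.2], "image": [0.3, 0.4]},
        "payload": {"name": "Leather Jacket"},
    },
}


def update_named_vector(store: dict, point_id: int, name: str, new_vector: list) -> None:
    """Replace one named vector while keeping the others and the payload."""
    existing = store[point_id]           # fetch the current point
    vectors = dict(existing["vectors"])  # copy so untouched spaces are preserved
    vectors[name] = new_vector           # swap in the re-embedded vector
    store[point_id] = {                  # re-upsert the whole point
        "vectors": vectors,
        "payload": existing["payload"],
    }


# Only the image vector changes; text vector and payload are carried over.
update_named_vector(store, 1, "image", [0.9, 0.8])
```

Carrying the unchanged vectors through the upsert is what prevents a partial write from dropping the other named space.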
Step 13: Per-space search parameters
Different vector spaces may need different accuracy-latency trade-offs. The function below assigns a lower hnsw_ef to the text space for faster retrieval and a higher hnsw_ef to the image space for more accurate candidate selection, then fuses the results with RRF.
Choose the hnsw_ef value for each vector space based on which modality matters more to your use case.
| Scenario | Text hnsw_ef | Image hnsw_ef |
|---|---|---|
| Text is more important | 256 | 64 |
| Image is more important | 64 | 256 |
| Equal priority | 128 | 128 |
| Accuracy-critical | 512 | 512 |
Step 14: Inspect collection configuration
After ingestion and updates, you can verify that the collection is configured correctly. The function below retrieves the named vector configuration, total vector count, and VDE state and prints them together.
Step 15: Collection cleanup
The function below flushes any pending writes to disk and optionally deletes the collection when you are done experimenting. Uncomment the delete lines to remove the collection entirely.
Patterns summary
The following patterns recap the core multimodal techniques covered in this tutorial. Use them as a quick reference when building your own pipelines.
Pattern 1: Independent space search
Pass using="text" or using="image" to search one named vector space at a time.
Pattern 2: Server-side multimodal fusion
Provide two PrefetchQuery entries and set query={"fusion": Fusion.RRF} to have the server merge the candidate lists.
Pattern 3: Client-side weighted fusion
Search each space independently, then pass both result lists to reciprocal_rank_fusion with a weights list to control the text-to-image balance.
Pattern 4: Post-fusion filter
Pass filter to the outer query() call to gate the fused candidate pool by structured payload conditions.
Pattern 5: Partial vector update
Fetch the existing point, then re-upsert with the updated vector alongside the unchanged vectors and payload.
Actian VectorAI features used
The table below lists every Actian VectorAI feature this tutorial demonstrated, along with the corresponding API call and its purpose.
| Feature | API | Purpose |
|---|---|---|
| Named vectors | vectors_config={"text": VectorParams(...), "image": VectorParams(...)} | Store multiple embedding spaces per collection |
| Named vector search | points.search(using="text") | Search a specific vector space |
| Server-side RRF | query={"fusion": Fusion.RRF} | Rank-based fusion of prefetch results |
| Prefetch | PrefetchQuery(query=..., using=..., limit=...) | Multistage candidate retrieval |
| Client-side RRF | reciprocal_rank_fusion(results, weights=...) | Weighted client-side fusion |
| Payload filtering | FilterBuilder().must(Field(...).eq(...)) | Structured constraints on fusion |
| Selective return | WithPayloadSelector(include=[...]) | Return specific payload fields |
| Vector return | WithVectorsSelector(include=["text"]) | Return specific named vectors |
| Per-space tuning | SearchParams(hnsw_ef=256) in PrefetchQuery | Different accuracy per modality |
| Collection info | collections.get_info() | Inspect named vector configuration |
| VDE operations | vde.flush(), vde.get_vector_count(), vde.get_state() | Administration and persistence |
Next steps
- Optimizing retrieval quality — Tune HNSW, quantization, and search params for accuracy
- Predicate filters — Combine vector search with structured payload constraints
- Similarity search basics — Learn the core retrieval workflow
- Hybrid search patterns — Mix dense and sparse retrieval with fusion