Far World Labs

Modeling with Text Embeddings


I’ve recently been looking at the popular hkunlp/instructor-large embedding model as a tool to sift through data in new ways. The typical approaches in SaaS applications are filtering, search, and tagging. With the development of embeddings and LLMs, a number of new approaches open up. What if we tagged documents and objects with embedding vectors? Where could that take us?

Let’s set aside complex architectures involving vector stores for a moment and assume we’re working with relatively small datasets. We’re interested in exploring the unique capabilities of embeddings in relative isolation. In my own experiments, the Instructor model is able to process a collection of over 5000 documents of varying length in roughly a minute on my personal workstation. This works offline, and the model size and resource consumption are modest.

Visualization of Semantic Spaces

For my documents, it was possible to generate embeddings and visualize them with t-SNE and UMAP using the Embedding Projector on the TensorFlow website. I’ve found that this rudimentary approach makes it possible to browse thousands of documents grouped into neat clusters. Browsing this way takes a lot of time, but clusters can also be retrieved programmatically through simple distance calculations between the vectors. Simon Willison calls this kind of retrieval via embeddings “vibes-based” search.
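As a minimal sketch of that programmatic retrieval, here is cosine-similarity lookup over a handful of toy three-dimensional vectors (the document names and vectors are invented stand-ins for real embeddings):

```python
# "Vibes-based" retrieval sketch: rank documents by cosine similarity to a
# query vector. Real vectors would come from an embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical documents with hand-made vectors for illustration.
docs = {
    "beatles_bio": [0.9, 0.1, 0.0],
    "stones_bio":  [0.8, 0.2, 0.1],
    "tax_guide":   [0.0, 0.1, 0.9],
}

def nearest(query_vec, k=2):
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(nearest([1.0, 0.0, 0.0]))  # the two music documents rank first
```

The same distance calculation that powers retrieval also recovers clusters: documents whose vectors sit near each other form a group without any explicit tagging.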

Instruction Tuned Embeddings

Instruction Fine Tuning

I found the results satisfactory, but the generality meant there was no way to steer the vectors toward specific tasks. Objects would cluster, but often not in a helpful way. An interesting aspect of the Instructor model is that it can transform the vectors it creates via a special “instruction” parameter. The instruction conditions the encoding, shifting the locations of the generated vectors and ultimately giving you clusters and retrievals tuned to your tasks. This is possible without additional training data or rounds of fine-tuning.

The Instructor model is trained to be instructed. Instructions can take the form of various task templates: retrieval, re-ranking, clustering, pair classification, classification, semantic text similarity (STS), and summarization. This instruction capability is now a popular feature of the models found on Hugging Face. It has its origins in Google Research’s FLAN model, which was shown in 2021 to improve zero-shot performance on unseen tasks.

Here are a few examples of instructions you can supply:

  • Represent the finance document for retrieval.
  • Represent the StackExchange question for retrieving duplicate questions.
  • Represent an emotion sentence for classifying the emotions.
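In code, an instruction is paired with each text before encoding. A sketch, assuming the InstructorEmbedding Python package (the finance sentences are illustrative, and the model call is wrapped in a function because it downloads weights on first use):

```python
# Pairing texts with a task instruction for the Instructor model.
from typing import List

def build_inputs(instruction: str, texts: List[str]) -> List[List[str]]:
    """Pair each text with its task instruction, the input shape the model expects."""
    return [[instruction, t] for t in texts]

pairs = build_inputs(
    "Represent the finance document for retrieval:",
    [
        "Q3 revenue rose 12% on strong subscription growth.",
        "The central bank held rates steady amid cooling inflation.",
    ],
)

def embed(pairs):
    """Encode (instruction, text) pairs. Requires the InstructorEmbedding
    package and a model download, so it is deliberately not called here."""
    from InstructorEmbedding import INSTRUCTOR
    model = INSTRUCTOR("hkunlp/instructor-large")
    return model.encode(pairs)  # one vector per (instruction, text) pair
```

Changing only the instruction string, with no retraining, yields a differently shaped vector space for the same texts.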

Applications and Use Cases

Business Intelligence

Fine-grained task-specific embedding vectors enable a lot of interesting use cases.

Let’s say we instruct for retrieval by competitive business strategy. In this case, we might retrieve companies by competitive similarity, looking abstractly at dimensions involving products, markets, pricing strategies, scale, returns, differentiation, leadership styles, and similar competitive characteristics. Objects could then be queried with abstract or concrete descriptions of companies in order to retrieve strategically similar companies.
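A runnable sketch of that query pattern, with invented company names and toy vectors standing in for embeddings produced under a competitive-strategy instruction:

```python
# Rank companies by strategic similarity to a query vector. In practice each
# vector would come from embedding a company description under an instruction
# like "Represent the company description for retrieval by competitive strategy:".
import numpy as np

companies = {
    "AcmeCloud":  np.array([0.9, 0.1, 0.2]),  # premium SaaS, differentiation
    "BudgetHost": np.array([0.1, 0.9, 0.1]),  # low-cost, competing on scale
    "NicheCloud": np.array([0.8, 0.2, 0.3]),  # premium SaaS, niche market
}

def strategically_similar(query_vec, k=2):
    def cos(v):
        return float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
    return sorted(companies, key=lambda c: cos(companies[c]), reverse=True)[:k]

# Query with a vector standing in for "premium differentiated cloud product":
print(strategically_similar(np.array([1.0, 0.0, 0.2])))
```

The query vector itself could come from either an abstract description ("low-cost challenger in a commodity market") or a concrete one (an actual competitor's profile), embedded under the same instruction.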

Multiple Perspectives

Let’s say we also tagged those same objects with a completely different focus, like their technological maturity, sustainability practices, or communication styles. Each lens could enable powerful new clusterings over the same dataset. Vibes-based search just got a whole lot more focused and useful.
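To make the multi-lens idea concrete, here is a toy setup where the same objects carry one vector per instruction lens, and the nearest neighbor changes depending on which lens you ask through (names and vectors are hypothetical):

```python
# Multi-lens tagging: the same companies embedded under two different
# instructions cluster differently per lens. Vectors are toy stand-ins.
import numpy as np

vectors = {
    "strategy": {        # e.g. "Represent ... by competitive strategy:"
        "AcmeCloud":  np.array([0.9, 0.1]),
        "BudgetHost": np.array([0.1, 0.9]),
        "GreenGrid":  np.array([0.8, 0.2]),
    },
    "sustainability": {  # e.g. "Represent ... by sustainability practices:"
        "AcmeCloud":  np.array([0.2, 0.8]),
        "BudgetHost": np.array([0.3, 0.7]),
        "GreenGrid":  np.array([0.9, 0.1]),
    },
}

def nearest(lens, name):
    """The closest other object to `name` under the given lens."""
    vs = vectors[lens]
    others = [n for n in vs if n != name]
    return min(others, key=lambda n: float(np.linalg.norm(vs[n] - vs[name])))

print(nearest("strategy", "AcmeCloud"))        # similar strategy
print(nearest("sustainability", "AcmeCloud"))  # similar practices
```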

In my project, I’m looking at both category-specific tasks like these and embeddings that span broad heterogeneous topics in helpful ways. The broad tunings I’m looking at focus on functional characteristics, non-functional characteristics, place, time, and trend-based vector spaces.

Identity and Union

There are ways of using this approach to join data in novel ways. Imagine you queried the topic of “The Beatles” through the instruction embedding focused on “pop music comparison” and joined those returned artists against a “music genre” instruction embedding. You could find interesting niche music genres related to bands musically similar to The Beatles.

Data Modeling with Embeddings

All of these new techniques come together in the process of application data modeling. Embeddings are relatively lightweight components that can augment traditional modeling in object-oriented applications. Objects can be given descriptive textual properties that enable task-specific behaviors through their associated vector spaces. These descriptive text attributes on each object instance can be processed under an instruction-tuned embedding to generate per-instance, per-descriptive-property vectors, which can then be searched and queried with traditional RDBMS idioms. This idea is contrary to the popular approach of encoding the totality of an object and relying on the embedding to do all of the retrieval work.
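One way to sketch "traditional RDBMS idioms" over per-property vectors: store each vector as JSON in SQLite and register a similarity function, so retrieval becomes a plain ORDER BY. The schema and data are hypothetical, and a production system might use a dedicated vector extension instead.

```python
# Per-property vectors queried with ordinary SQL via a registered function.
import json, math, sqlite3

def cosine_json(a_json, b_json):
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.create_function("cosine", 2, cosine_json)
db.execute("CREATE TABLE company (name TEXT, strategy_vec TEXT)")
db.executemany("INSERT INTO company VALUES (?, ?)", [
    ("AcmeCloud",  json.dumps([0.9, 0.1])),
    ("BudgetHost", json.dumps([0.1, 0.9])),
])

# The query vector would come from embedding a query description.
query = json.dumps([1.0, 0.0])
rows = db.execute(
    "SELECT name FROM company ORDER BY cosine(strategy_vec, ?) DESC", (query,)
).fetchall()
print([r[0] for r in rows])
```

Because each descriptive property gets its own column of vectors, the same row can be ranked by strategy in one query and by some other lens in the next, without re-embedding the whole object.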

Embedded Entities

As objects are defined with these semantic properties, instances become hubs of descriptive data for various tasks. Systems must keep the model structure, the textual data, and the embedding vectors up to date and aligned with those tasks. Put another way, the model-centric approach supports dramatically greater task alignment and gives more control points for improvement.

Semantic Control Flow

We’ve discussed the value of property-based vectors in supporting database queries. They are similarly helpful in program logic. Property-based vector comparisons enable a powerful semantic control flow unlike anything that previously existed in programming. Type systems typically manage this through class comparisons, duck typing, and rigid primitive comparisons at the property level. Vector spaces are a new connective tissue that links object properties via locations on bespoke semantic landscapes.
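A small sketch of what semantic control flow might look like: branching on vector similarity between properties rather than on exact string or type comparisons. The routing function, threshold, and vectors are all invented for illustration.

```python
# Semantic control flow: route a support ticket to whichever queue its topic
# vector is closest to, with a threshold fallback. Toy vectors stand in for
# instruction-tuned property embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route_ticket(ticket_vec, billing_vec, support_vec, threshold=0.8):
    if cosine(ticket_vec, billing_vec) >= threshold:
        return "billing"
    if cosine(ticket_vec, support_vec) >= threshold:
        return "support"
    return "triage"  # nothing similar enough: fall through to a human

billing = [0.9, 0.1]
support = [0.1, 0.9]
print(route_ticket([0.85, 0.2], billing, support))
print(route_ticket([0.5, 0.5], billing, support))
```

The branch condition is fuzzy by design: two properties match when they land near each other in the task's semantic landscape, not when their strings or types are equal.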

In contrast to more rigid graph-based modeling approaches, embedding vectors are a great way to expose a kind of fuzzy “information scent” that applications can use to find and rank objects corresponding with attentional preferences of users. In this way, they could be a powerful component of personalized systems. We can determine objects or qualities in objects that a person is looking for, and use these semantic markers to highlight funnels and tubes that lead to spaces where those objects can be found.

More to Discover

As with my experiments with ChatGPT, I’m looking at this capability in granular ways. Both the tagging and query embedding vectors can be late-bound in this approach and could facilitate any number of novel design patterns and algorithms.