Attention is the component in a transformer that operates across the sequence, creating dependencies among the input texts. You can think of attention as representing the current text (given or to be generated) as a function of the texts that came before it (referred to as causal attention). Depending on how this representation turns out, the MLP receives a different version of the same current text as its input, i.e., different preceding sequences of text lead to different versions of the same current text, and consequently a different mapping function applied by the MLP.
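To make this mental picture concrete, below is a minimal single-head causal attention sketch in NumPy (the dimensions and random inputs are invented purely for illustration). The mask lets each position mix only with the positions before it, so the same current token comes out with a different representation whenever the preceding text changes, and the MLP downstream therefore sees a different input.

```python
import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention; X is (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # pairwise similarities
    mask = np.triu(np.ones_like(scores), k=1)       # 1s strictly above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)   # hide future positions (causal)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the visible prefix
    return weights @ V                              # each row is a function of its prefix

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                              # toy sizes, not real model dimensions
Wq, Wk, Wv = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
current = rng.normal(size=(1, d_model))             # embedding of the "current text"
prefix_a = rng.normal(size=(5, d_model))            # one preceding context
prefix_b = rng.normal(size=(5, d_model))            # a different preceding context
out_a = causal_attention(np.vstack([prefix_a, current]), Wq, Wk, Wv)[-1]
out_b = causal_attention(np.vstack([prefix_b, current]), Wq, Wk, Wv)[-1]
print(np.allclose(out_a, out_b))                    # False: same token, new context, new vector
```

A real transformer stacks many such heads and layers with learned weights, but the dependence on the prefix works the same way.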
Figure 2 illustrates geometrically the mapping that attention defines on a text. A key takeaway here is that having more context (preceding text) can lead to a larger space of existence and consequently a larger set of function mappings. This mental picture has often helped me make sense of several research directions in LLMs, be it few-shot prompting, chain-of-thought, or test-time compute to improve reasoning.
Finally, to prompting! The understanding above essentially boils down to this: when writing system instructions and prompts, you are addressing an attention mechanism that favors specific sequences of text, namely those seen in the training corpus. By writing prompts that resemble text found during pre-training, you get an LLM that responds and follows instructions more accurately.
To become proficient at prompting, the best approach is to practice and experiment with different prompts, as different models integrate their training data differently. However, the underlying corpus, the internet, remains largely the same across models, so skills learned with one model transfer to others: Markdown, HTML tags, and so on. If you actively browse text on the internet, you are probably already in a good spot when it comes to prompting.
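As a toy illustration (the task and wording are entirely made up), compare two phrasings of the same request. The second borrows Markdown conventions, headings and bullets, that pervade the pre-training corpus:

```python
# Same request, two phrasings; both strings are hypothetical examples.
flat = "summarize this changelog keep breaking changes separate and be brief"

markdown = """\
## Task
Summarize the changelog below.

## Requirements
- List breaking changes under their own heading.
- Keep the summary under 100 words.
"""
```

Neither is guaranteed to work better, but the structured version hands the attention mechanism sequence patterns it has seen countless times.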
Open-ended prompting is where you let the model do the work: you simply state what you are trying to achieve, without many constraints. In this approach, it is important to ensure the model generates some context before answering your question. Note that when you use ChatGPT or Claude, the application has already prompted the model with a long context, which means the model is already in a state where it can generate answers. There is no escaping the requirement of context, provided or generated, if you want a smart chatbot.
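Here is a sketch of what that can look like (the task is hypothetical). The prompt states only the goal, and the last line nudges the model to generate its own context before committing to an answer:

```python
# A hypothetical open-ended prompt: one goal, minimal constraints.
open_ended = """\
I want to reduce the cold-start time of a Python CLI tool.
Think out loud about the likely causes first, then suggest what to try.
"""
```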
The benefits of this approach include:
However, open-ended prompts can lead to:
Descriptive prompting is like writing a formal specification document. This approach includes attaching long text files and search-based augmentation of user inputs. Through the description, one explicitly sets the space of possible responses, allowing the model to answer immediately.
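A hypothetical spec-style prompt in this spirit (every detail below is invented) might look like the following; notice how it pins down the response space before the model generates a single token:

```python
# A hypothetical descriptive prompt: the spec constrains the response up front.
descriptive = """\
You are reviewing a Python function for performance issues.

Constraints:
- Python 3.11, standard library only.
- The function is called about a million times per request.

Respond with:
1. The single biggest bottleneck.
2. A one-line fix, as a code snippet.
"""
```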
Advantages:
Drawbacks:
The reality is that effective prompting isn’t about choosing one approach over the other – it’s about finding the right balance for your specific application. Start with one approach, observe the results, and adjust accordingly. Remember that the best prompting strategy is the one that gets you the results you need.
As the technology continues to evolve, we may see even more sophisticated ways to interact with these systems. For now, a mental model of how the different texts in your prompt affect the output you obtain is a powerful tool for consistently delivering results, with current models and future ones.
Footnote: The technical content in this post makes several simplifying assumptions to present an easy-to-understand picture of the transformer and its relationship with prompting. The message, however, should remain the same for the general case without these assumptions.