Figure 1: Illustration of attention mapping an input (corresponding to the word "square") to different partitions in an MLP. The partitions themselves do not change; however, the partition into which the input falls can change, leading to a different mapping. This means that even with a fixed tokenizer size, the number of distinct data points the MLP processes during training increases exponentially once attention is involved.
Attention is the component of the transformer that operates on the sequence, creating dependencies among the input tokens. You can think of attention as representing the current text (given or to be generated) as a function of the text that came before it (referred to as causal attention). Depending on how this representation turns out, the MLP receives a different input: different preceding sequences lead to different versions of the same current text, and consequently a different mapping function applied by the MLP.
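To make this concrete, below is a minimal NumPy sketch of causal attention. It omits learned projections, multiple heads, and positional encodings, and the embeddings are random stand-ins; the only point is that the same token ends up as a different vector, and hence a different input to the MLP, under different prefixes.

```python
import numpy as np

def causal_attention(X):
    """Single-head causal self-attention with identity Q/K/V projections,
    so the mechanics stay visible. X has shape (seq_len, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                 # pairwise similarity of tokens
    mask = np.tril(np.ones_like(scores))          # causal: each token sees only its past
    scores = np.where(mask == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X                            # each row is a mix of its prefix

rng = np.random.default_rng(0)
square = rng.normal(size=(1, 4))                  # stand-in embedding for "square"
ctx_a = rng.normal(size=(2, 4))                   # one preceding context
ctx_b = rng.normal(size=(2, 4))                   # a different preceding context

out_a = causal_attention(np.vstack([ctx_a, square]))[-1]
out_b = causal_attention(np.vstack([ctx_b, square]))[-1]
print(np.allclose(out_a, out_b))                  # False: same token, different MLP input
```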
Figure 2 illustrates geometrically the mapping defined by attention on a text. A key takeaway is that more context (preceding text) can place the representation in a larger space and consequently yield a larger set of function mappings. This mental picture has often helped me make sense of several lines of research on LLMs, be it few-shot prompting, chain-of-thought, or test-time compute to improve reasoning.
Prompting
Finally, to prompting! The understanding above essentially boils down to this: when writing system instructions and prompts, you are addressing an attention mechanism that favors the specific sequences of text it was trained on, i.e., the training corpus. By writing prompts that resemble text found during pre-training, you make the LLM respond and follow instructions more accurately.
To become proficient at prompting, the best approach is to practice and experiment with different prompts, as different models integrate their training data differently. However, the underlying corpus, text from the internet, remains largely the same, so skills learned with one model transfer to others, e.g., Markdown, HTML tags, and so on. If you actively read text on the internet, you are likely already in a good spot when it comes to prompting.
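As a rough illustration (the task and wording here are invented), a prompt that leans on familiar internet structure might look like this:

```python
# Hypothetical prompt that borrows structure the model has seen all over the
# internet (Markdown headings, bullet lists) instead of one wall of text.
prompt = """# Task
Summarize the customer review below in two sentences.

## Review
The headphones arrived quickly, but the left ear cup crackles at high volume.

## Output format
- Sentence 1: overall sentiment
- Sentence 2: the main complaint
"""
print(prompt)
```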
The Open-ended Approach
Open-ended prompting is where you let the model do the work: you simply state what you are trying to achieve, without many constraints. In this approach, it is important to ensure the model generates some context before answering your question. Note that when you use ChatGPT or Claude, the application has already been prompted with a long context, so the model is already in a state where it can generate answers. There is no escaping the requirement of context, provided or generated, if you want a smart chatbot.
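A made-up sketch of an open-ended prompt that nudges the model to generate context before answering:

```python
# Open-ended: state the goal, then explicitly ask the model to build up its
# own context (causes, assumptions) before committing to an answer.
open_ended = (
    "I want to reduce the cold-start latency of my serverless API. "
    "First think through the likely causes, then suggest fixes."
)
print(open_ended)
```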
The benefits of this approach include:
Lower barrier to entry for newcomers.
Easily transferable to other models.
Opportunity for creative suggestions. The generation is constrained by the space of text the model generates rather than your text.
However, open-ended prompts can lead to:
Inefficiency for use-cases where speed is crucial.
Multiple follow-ups. If the model does not generate enough context, it may fall short in its answer and require multiple turns before it accumulates enough context to respond correctly.
Hallucinated or unstructured responses. The initial text generated sets the stage for the rest of the response. Consequently, this approach is often unpredictable.
The Descriptive Approach
Descriptive prompting is like writing a formal specification document. This approach includes attaching long text files and search-based augmentation of user inputs. Through the description, you explicitly set the space of responses, allowing the model to answer immediately.
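A made-up sketch of a descriptive prompt written like a small spec; the product name, rules, and schema are illustrative, not a recommended template:

```python
# Descriptive: pin down role, inputs, rules, and output schema up front,
# like a small spec. "AcmeDB" and the field names are made up.
descriptive = """You are a support assistant for AcmeDB.

Context:
{retrieved_docs}

Rules:
- Answer only from the context above; otherwise reply "I don't know."
- Keep answers under 80 words.
- Return JSON: {{"answer": "...", "sources": ["..."]}}
"""
print(descriptive.format(retrieved_docs="<insert retrieved passages here>"))
```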
Advantages:
More precise and consistent outputs. Depending on the size of the context, the model is heavily constrained to the vocabulary in the context.
Fewer iterations needed. There is usually less variability in the responses generated.
Better for use-cases where speed to answer is critical.
Ideally, you want to reach this kind of prompt when building customer-facing applications.
Drawbacks:
Responses can feel mechanical or rigid. Moreover, undesirable behaviors baked into the generation can take a lot of effort to undo.
Needs upfront investment. Demands careful planning and multiple refinements to get right.
Requires expertise. Transferring the prompt to another model can take additional effort.
Prompting in a Spectrum
The reality is that effective prompting isn't about choosing one approach over the other; it's about finding the right balance for your specific application. Start with one approach, observe the results, and adjust accordingly. Remember that the best prompting strategy is the one that gets you the results you need.
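For instance, a prompt in the middle of the spectrum might state an open goal while pinning down only the output shape; an invented sketch:

```python
# In-between: an open goal with just enough constraints to shape the output.
# Tighten or relax the constraints as you observe the model's behavior.
hybrid = (
    "Brainstorm names for a note-taking app aimed at students. "
    "Explore freely, but return exactly five options, one per line, "
    "each under three words."
)
print(hybrid)
```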
As the technology continues to evolve, we may see even more sophisticated ways to interact with these systems. For now, building a mental model of how the different texts in your prompt affect the output is a powerful tool for consistently delivering results, with current models and future ones.
Footnote: The technical content in this post makes several simplifying assumptions to present an easy-to-understand picture of transformers and their relationship to prompting. However, the message should remain the same for the general case without these assumptions.