This experiment explores several scenarios for fine-tuning prompts with structured generation. We will test how the order of the elements in the prompt affects model performance; the elements we consider are the model's reasoning and its final answer.
We will evaluate the following prompt orders:
This is the most natural order: the model generates its reasoning before the final answer, so it has the most information available at the moment it picks an answer. Since decoding is autoregressive, the answer tokens are conditioned on the reasoning tokens the model has already written.
This is our system message. It tells the model to answer in a JSON with the reasoning first and the final answer last.
{'content': 'Answer the Question and include your reasoning and the final answer in a json like: {"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.', 'role': 'system'}
Forcing a JSON format during fine-tuning will improve our structured generation results, as the model gets used to responding in that "space". Next is our user message, where we can see the question and the answer choices.
{'content': 'Question: What can genetic material have?\nAnswer Choices: (a) Resistance (b) Mutations (c) Clorophyll (d) Nucleotide (e) Symmetry (f) Allow growth (g) Contamination (h) Warmth', 'role': 'user'}
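To make this concrete, here is a minimal sketch of how these messages and the expected JSON could be expressed in Python (assuming Pydantic v2; the class name and the assistant completion are illustrative and not taken from the original code or dataset):

```python
from pydantic import BaseModel

# Schema for the reasoning-first order: "reasoning" is declared before
# "final_answer", so the answer comes after the reasoning in the output.
class ReasoningFirst(BaseModel):
    reasoning: str
    final_answer: str

system_msg = {
    "role": "system",
    "content": (
        "Answer the Question and include your reasoning and the final answer in a json like: "
        '{"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.'
    ),
}
user_msg = {
    "role": "user",
    "content": (
        "Question: What can genetic material have?\n"
        "Answer Choices: (a) Resistance (b) Mutations (c) Clorophyll (d) Nucleotide "
        "(e) Symmetry (f) Allow growth (g) Contamination (h) Warmth"
    ),
}
# Illustrative assistant target for fine-tuning (not from the original dataset).
assistant_msg = {
    "role": "assistant",
    "content": ReasoningFirst(
        reasoning="Genetic material such as DNA can undergo mutations.",
        final_answer="b",
    ).model_dump_json(),
}

messages = [system_msg, user_msg, assistant_msg]
```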
A more awkward order, placing the reasoning after the final answer. The answer arrives sooner, and generation could stop right after it to save tokens, but it assumes the model can "know" the answer internally before writing out any reasoning. We are skeptical of this setup, but it is worth testing.
{'content': 'Answer the Question and include your Final Answer and the Reasoning in a json like: {"final_answer": <letter corresponding to the answer>, "reasoning": <reasoning about the answer>}.', 'role': 'system'}
{'content': 'Question: What can genetic material have?\nAnswer Choices: (a) Resistance (b) Mutations (c) Clorophyll (d) Nucleotide (e) Symmetry (f) Allow growth (g) Contamination (h) Warmth', 'role': 'user'}
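The only structural change for this order is the field order in the target JSON; a minimal sketch of the corresponding schema (class name is illustrative):

```python
from pydantic import BaseModel

# Answer-first order: "final_answer" is declared before "reasoning",
# so the answer letter is emitted before any reasoning tokens.
class AnswerFirst(BaseModel):
    final_answer: str
    reasoning: str
```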
This serves as a fine-tuning control: the model is trained to output only the final answer, with no reasoning.
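For this control the target completion contains only the answer letter; an illustrative example (assuming Pydantic v2):

```python
from pydantic import BaseModel

# No-reasoning control: the output schema has a single field.
class AnswerOnly(BaseModel):
    final_answer: str

# Illustrative target completion for the question above.
AnswerOnly(final_answer="b").model_dump_json()  # '{"final_answer":"b"}'
```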
A control with no fine-tuning at all, for comparison.
Structured generation ensures a consistent response format, which is crucial for reliable fine-tuning. Our initial experiments ran into inconsistent responses, and constraining the output format solves this.
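As a rough illustration of how constrained decoding enforces the format at inference time, here is a sketch using the outlines library; the exact API may differ between versions, and the model name is only an example:

```python
import outlines
from pydantic import BaseModel

class ReasoningFirst(BaseModel):
    reasoning: str
    final_answer: str

# Build a generator whose output is guaranteed to match the JSON schema.
# (API names follow outlines ~0.x; newer releases may expose a different interface.)
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, ReasoningFirst)

prompt = (
    "Answer the Question and include your reasoning and the final answer in a json.\n"
    "Question: What can genetic material have?\n"
    "Answer Choices: (a) Resistance (b) Mutations (c) Clorophyll (d) Nucleotide "
    "(e) Symmetry (f) Allow growth (g) Contamination (h) Warmth"
)
result = generator(prompt)   # parsed into a ReasoningFirst instance
print(result.final_answer)
```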