Prompt Engineering Techniques


Selecting the Appropriate Prompting Technique

Zero-shot, Few-shot, Chain-of-Thought, and Self-Consistency are distinct prompt design methodologies tailored for specific problem sets. Zero-shot relies on direct instruction without examples, while Few-shot leverages demonstrations to establish patterns. Chain-of-Thought (CoT) enhances accuracy for complex, multi-step tasks by exposing intermediate reasoning, and Self-Consistency refines CoT by sampling multiple paths and selecting the most frequent answer.

Selection should be based on the problem domain—classification, pattern recognition, or multi-level inference—rather than technical novelty. This article outlines the practical application of these four methods and discusses how to layer them effectively for production use.


Establishing Evaluation Metrics

A common pitfall is iterating on prompts without a stable evaluation baseline. At a minimum, track accuracy, format compliance, latency, and token cost under identical conditions.

Recommended Evaluation Metrics
- taskAccuracy: Percentage of correct answers or graded scores.
- formatPassRate: Compliance with JSON, XML, or specific label formats.
- p95LatencyMs: 95th percentile response time.
- avgTokens: Mean input and output token count.

While Chain-of-Thought often yields higher accuracy, its latency and cost trade-offs might make Few-shot a more pragmatic choice for high-throughput systems.
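The metrics above can be computed with a few lines of plain Python; the sketch below mirrors the metric names from the list (the model-call and grading loop that would produce `predictions`, `outputs`, and `latencies_ms` is omitted):

```python
import json
import statistics

def task_accuracy(predictions, labels):
    """Exact-match accuracy against gold labels (taskAccuracy)."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def format_pass_rate(outputs):
    """Fraction of outputs that parse as valid JSON (formatPassRate)."""
    passed = 0
    for out in outputs:
        try:
            json.loads(out)
            passed += 1
        except json.JSONDecodeError:
            pass
    return passed / len(outputs)

def p95_latency_ms(latencies_ms):
    """Nearest-rank 95th percentile of response times (p95LatencyMs)."""
    ordered = sorted(latencies_ms)
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[idx]

def avg_tokens(token_counts):
    """Mean token count per request (avgTokens)."""
    return statistics.mean(token_counts)
```

Keeping these four functions fixed while you iterate on the prompt is what makes before/after comparisons meaningful.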


Zero-shot Prompting

Zero-shot is the baseline approach where the model receives instructions without explicit examples. Modern models, refined through instruction tuning, handle simple classification and summarization tasks reliably using this method.

Prompt:

Classify the following text as neutral, negative, or positive.
Text: "I think the upcoming vacation will be fine."
Sentiment:

Output:

neutral

Zero-shot is the ideal starting point for prototyping due to its low complexity and minimal token overhead. Its effectiveness stems from the model’s pre-existing alignment with human instructions.

Best Use Cases

  • Simple classification with well-defined labels.
  • Standard tasks like summarization or translation.
  • Initial feasibility testing.

Constraints

  • Accuracy degrades in domain-specific tasks with specialized terminology.
  • Format compliance can be inconsistent compared to Few-shot.

If Zero-shot falters, introduce examples (Few-shot) before upgrading the model or complicating the logic.
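A minimal harness around the sentiment prompt above might look like the following sketch; `normalize_label` guards against the format drift noted under Constraints, and the actual model call is omitted (any client would do):

```python
ALLOWED_LABELS = {"neutral", "negative", "positive"}

def build_zero_shot_prompt(text):
    """Render the classification instruction with no demonstrations."""
    return (
        "Classify the following text as neutral, negative, or positive.\n"
        f'Text: "{text}"\n'
        "Sentiment:"
    )

def normalize_label(raw_completion):
    """Lowercase, strip, and validate the model's raw completion.

    Zero-shot outputs sometimes arrive as 'Neutral.' or with trailing
    whitespace; normalizing before validation keeps the pipeline strict
    without rejecting trivially malformed answers.
    """
    label = raw_completion.strip().lower().rstrip(".")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {raw_completion!r}")
    return label
```

If the `ValueError` rate (the inverse of formatPassRate) is high, that is the signal to move to Few-shot rather than loosen the validation.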


Few-shot Prompting

Few-shot prompting provides input-output demonstrations to guide the model toward a specific pattern or tone. It is particularly effective for tasks where criteria are subtle or difficult to articulate through instructions alone.

Prompt:

Input: "That's really cool!" // Sentiment: positive
Input: "This is bad!" // Sentiment: negative
Input: "Wow, that movie was amazing!" // Sentiment: positive
Input: "What a terrible show!" // Sentiment:
Output: negative

Research indicates that the consistency of the demonstration format is often as critical as the accuracy of the labels themselves. Maintaining a stable pattern ensures the model aligns with the desired output structure.

Best Use Cases

  • Subtle classification (e.g., assessing customer complaint severity).
  • Enforcing specific brand tones or complex output formats.
  • Extraction tasks involving domain-specific dictionaries.

Constraints

  • Increased demonstrations lead to higher token costs and latency.
  • Poorly selected examples can degrade performance relative to Zero-shot.
  • Subject to prompt length limits (context window).

In production, prioritize a small set (3–5) of high-quality examples covering representative and edge cases over a large volume of repetitive demonstrations.
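Because format consistency matters as much as label accuracy, it is worth generating the demonstration block programmatically rather than by hand; a sketch that reproduces the pattern used above:

```python
def build_few_shot_prompt(examples, query):
    """Assemble demonstrations in one stable format, then append the query.

    `examples` is a list of (text, label) pairs. Emitting every line in
    the identical 'Input: "..." // Sentiment: ...' shape is what anchors
    the model to the pattern; the final line leaves the label blank for
    the model to complete.
    """
    lines = [f'Input: "{text}" // Sentiment: {label}' for text, label in examples]
    lines.append(f'Input: "{query}" // Sentiment:')
    return "\n".join(lines)
```

Storing the examples as data also makes it easy to swap in curated edge cases without touching the prompt template.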


Chain-of-Thought (CoT) Prompting

CoT improves performance on complex reasoning tasks by inducing the model to generate intermediate steps. This is effective for arithmetic, symbolic logic, and multi-step common sense reasoning.

Standard CoT (Few-shot):

Q: The sum of the odd numbers in this group is even: 4, 8, 9, 15, 12, 2, 1.
A: The odd numbers are 9, 15, and 1. Their sum is 25, which is odd. The statement is false.

Q: The sum of the odd numbers in this group is even: 15, 32, 5, 13, 82, 7, 1.
A:

Zero-shot CoT: Simply adding “Let’s think step by step” can trigger reasoning paths without requiring manual examples.

Prompt:

I bought 10 apples. I gave 2 to my neighbor and 2 to the repairman.
Then I bought 5 more and ate one. How many apples do I have left?
Let's think step by step. Provide the final count on the last line.

Output:

1. Started with 10 apples.
2. Gave away 2+2=4 apples, leaving 6.
3. Bought 5 more, totaling 11.
4. Ate 1, leaving 10.
Final Answer: 10
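Because the prompt asks for the final count on the last line, the answer can be parsed mechanically instead of trusting the free-form reasoning text; a sketch:

```python
import re

def extract_final_answer(completion):
    """Pull the integer from the 'Final Answer:' line the prompt requested.

    Scans lines from the bottom up and returns None when the model
    ignored the format, so the caller can retry or fall back instead of
    silently accepting unvalidated reasoning.
    """
    for line in reversed(completion.strip().splitlines()):
        match = re.match(r"\s*Final Answer:\s*(-?\d+)", line)
        if match:
            return int(match.group(1))
    return None
```

This kind of parser is also the building block Self-Consistency needs, since majority voting requires comparable final answers.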

Best Use Cases

  • Arithmetic and logical reasoning.
  • Multi-step decision-making pipelines.
  • Tasks where the rationale is as important as the final answer.

Constraints

  • Overkill for simple extraction or classification, leading to unnecessary latency.
  • Effectiveness correlates strongly with model size; smaller models may produce “hallucinated” reasoning.
  • Requires validation of the final answer, as the model may provide plausible but incorrect steps.


Self-Consistency

Self-Consistency extends CoT by sampling multiple reasoning paths and selecting the most consistent final answer via majority vote. This mitigates the risk of a single “wrong turn” in a model’s reasoning chain.

Question:
When I was 6, my sister was half my age. If I am 70 now, how old is my sister?

Path 1 Result: 67
Path 2 Result: 67
Path 3 Result: 35

Final Selection (Majority): 67

By solving the problem multiple times, you average out the stochastic variation in the model's output.
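The aggregation step itself is simple; below is a sketch with an explicit first-seen tie-break (the multiple model invocations that produce `answers` are omitted):

```python
from collections import Counter

def majority_vote(answers):
    """Select the most frequent final answer across sampled reasoning paths.

    Ties are broken by preferring the answer that appeared first in the
    sample order, a simple deterministic rule; a production system might
    instead draw additional samples until the tie resolves.
    """
    counts = Counter(answers)
    best_count = max(counts.values())
    for answer in answers:  # first-seen order breaks ties deterministically
        if counts[answer] == best_count:
            return answer
```

Note that the aggregation rule is part of the technique's contract: without a defined tie-break, two runs over the same samples could disagree.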

Best Use Cases

  • High-stakes reasoning where accuracy is critical.
  • Logic/Math tasks where correct answers can be verified through redundancy.
  • Mitigating instability in complex CoT prompts.

Constraints

  • Significantly higher cost and latency due to multiple invocations.
  • Requires clear aggregation and tie-breaking rules.
  • Does not guarantee correctness if the model has a systematic bias toward a specific incorrect answer.


Conclusion

Zero-shot, Few-shot, CoT, and Self-Consistency are not competing techniques but a progression. The most cost-effective strategy is to baseline with Zero-shot, optimize with Few-shot, and reserve CoT and Self-Consistency for complex reasoning bottlenecks.

The next article explores Tree of Thoughts for deeper search and Retrieval Augmented Generation (RAG) for external knowledge integration.
