
Practical Guide to Agentic Self-Reflection and Other Methods to Improve LLMs Inference Performance on Complex Questions

Knowledge · Prompt Engineering · Agentic Prompting
Nat Nitarach
June 09, 2025

Table of Contents

  • Section 1: Techniques Overview
    • 1. Magic Prompting
    • A Prompt Structure Framework
    • Deepening Your Prompting Toolkit
  • Section 2: Agentic Systems
  • Section 3: Practical Demonstration
    • Example 1: Investment Calculation
    • Simple Prompting vs. CoT + Self Reflection + RAG
    • Example 2: Math Prompt in Thai
    • Example 3 (Exclusive): Computer Security Challenge
  • Final Thoughts: What’s Next?
  • Thank You for Reading
  • References
  • Appendix

Over the past few months, I’ve been asked a lot: How can I improve LLM inference performance to get better responses? If you're just getting started, I recommend beginning with our Prompt Engineering Guide to build a solid foundation. But if you're curious about going deeper—this guide is for you.

In this post, I’ll walk you through a technique called Agentic Self-Reflection, and show how I applied it to enhance arithmetic reasoning in Typhoon models. While we’ll focus on arithmetic in this blog post, the principles we’ll cover extend to a wide range of reasoning tasks.

  • Section 1 gives you the lay of the land—a practical overview of available techniques and how they differ.

  • Section 2 introduces Agentic Systems: design patterns that give models more autonomy in how they reason, reflect, and act.

  • Section 3 dives into a real-world application: how I used self-reflection techniques to improve the performance of Typhoon 1.5.

Note: This post was originally written in October 2024, so the benchmarks refer to our former model, Typhoon 1.5. Since then, our newer models have seen significant improvements. Still, the techniques shared here remain relevant and practical for optimizing both current Typhoon models and other LLMs.

This updated and republished version was edited by Orn Smith, our Developer Relations Manager, with the goal of making it clearer and more accessible to the community.

Section 1: Techniques Overview

When it comes to improving LLM responses, most approaches fall into two broad categories:

  1. Prompt Engineering (also known as “magic prompting”)

  2. Agentic Prompting Design Patterns

This section offers a high-level overview of these two families of techniques—how they differ, when to use them, and how they can complement each other.

1. Magic Prompting

Before we get into Agentic Systems (which we’ll explore in the next section), let’s start with the basics: Prompt Engineering, sometimes playfully called Magic Prompting. It’s one of the fastest and most accessible ways to get useful results from LLMs—especially for developers and non-coders alike.

One of the most effective techniques in this category is Chain-of-Thought (CoT) prompting. It works by guiding the model through problems step-by-step, which improves logical reasoning, reduces hallucinations, and leads to more reliable answers.

To take CoT even further, you can add a self-reflection step. This means the model not only walks through its reasoning but also evaluates and refines its own response. This feedback loop improves accuracy and makes the model more robust.

A Prompt Structure Framework

Crafting effective prompts involves balancing three key components:

  1. Use Case (10%): Define the AI's role or the specific scenario. This sets the context for the task. Example: You are a world-class AI financial advice evaluator.

  2. Reasoning Process (55%): Guide the AI through a logical, step-by-step approach to address the task. This often employs Chain-of-Thought (CoT) prompting. Example: “Let’s think through this problem step-by-step…”

  3. Output Format (35%): Specify the desired structure and format of the AI's response to ensure consistency and usability. Example:

<think>\n[Let’s think through this problem step-by-step]\n</think>\n<answer>\n[Final, concise answer to the query]\n</answer>
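To make the framework concrete, here is a minimal sketch in Python of how the three components can be assembled into a single system prompt. The component strings are illustrative placeholders, not a fixed recipe:

# Minimal sketch: assembling Use Case + Reasoning Process + Output Format.
# The component texts below are illustrative placeholders.
USE_CASE = "You are a world-class AI financial advice evaluator."
REASONING = "Let's think through this problem step-by-step before answering."
OUTPUT_FORMAT = (
    "<think>\n[Let's think through this problem step-by-step]\n</think>\n"
    "<answer>\n[Final, concise answer to the query]\n</answer>"
)

def build_system_prompt(use_case: str, reasoning: str, output_format: str) -> str:
    """Combine the three components into one system prompt."""
    return f"{use_case}\n\n{reasoning}\n\nRespond in this format:\n{output_format}"

print(build_system_prompt(USE_CASE, REASONING, OUTPUT_FORMAT))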

Deepening Your Prompting Toolkit

Now that we've outlined the basics of Magic Prompting, let’s take a closer look at the building blocks that make these techniques work effectively in practice.

1. Symbol Magic Prompting

When crafting prompts—especially ones with structured thinking or specific formatting—it helps to use symbolic cues that Large Language Models (LLMs) can reliably recognize and follow. Think of these as visual markers or formatting tricks that guide the model’s behavior more precisely. Here are some useful conventions:

  • <> tags: Used to enclose formatting or context-specific elements, like <code> and </code>, which signal the model to treat the content inside as code. This helps especially when working with LaTeX or markdown-based outputs.

  • ***: Used as a reminder for key points or essential elements within the prompt.

  • ###: Utilized for starting new command paragraphs or sections.

  • Triple backticks (```): Used to define function outputs, indicate code blocks, or frame areas where the model is expected to generate structured output such as code or JSON.

These symbolic conventions give you more control over how the model interprets structure, especially in complex or multi-step prompts.
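As a quick illustration, the following hypothetical prompt (shown here as a Python string) combines all four conventions in one place:

# Illustrative prompt combining the symbolic conventions above.
SYMBOLIC_PROMPT = '''### Task
Review the function inside the <code> tags and flag any bugs.

*** Key point: do not suggest changes to the function's purpose. ***

<code>
```python
def add(a, b):
    return a - b
```
</code>

### Output
Return your review inside a ``` code block.'''
print(SYMBOLIC_PROMPT)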

2. Chain-of-Thought (CoT) Reasoning

One of the most powerful tools in your prompting toolkit is Chain-of-Thought (CoT) reasoning—a technique that encourages the model to think step by step, rather than jumping straight to an answer.

This approach draws from principles of control theory and has been studied in works like "A Control Theory of LLM Prompting" (Bhargava et al., 2024) and earlier research by Wei et al. (2023). The core idea is simple: by guiding the model to process input through a structured sequence of logic, you increase the likelihood of reaching accurate, interpretable results.

Here’s how the process breaks down:

  • Input: A question or dataset is provided to the model.

  • Processing: Instead of answering directly, the model follows a step-by-step reasoning path. Each condition or instruction builds on the last, much like a logical control loop.

  • Output: The model reaches a solution based on that sequence of thought, leading to clearer and more reliable results.

Figure 1: The control-theory view of CoT (Bhargava et al., 2024; Wei et al., 2023). The idea is to take an input, apply structured thinking through a step-by-step process, and then generate an output.

The Control Theory of LLM Prompting (Bhargava et al., 2024) frames CoT as a process where input is handled step by step until an output is generated, as shown in Figure 1. This method controls the thought process of the Large Language Model (LLM), guiding it to solve problems or answer questions in a systematic way. It helps break down problems into clear, sequential steps, improving both reasoning and data analysis.

At each stage, the model carefully considers one aspect of the problem before moving to the next, ultimately arriving at a more accurate and complete solution.

CoT is especially useful in tasks like arithmetic reasoning, coding, or multi-part logic problems, where breaking things down systematically significantly improves both correctness and explainability.
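As a sketch of how this looks in practice, the snippet below sends a CoT-style prompt through an OpenAI-compatible chat API. The endpoint URL, model name, and prompt texts here are assumptions; substitute your own:

# Sketch: sending a CoT-style prompt via an OpenAI-compatible client.
# The base_url and model name are placeholders; substitute your own.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.opentyphoon.ai/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3-typhoon-v1.5x-8b-instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": "Think step by step, then give the final answer in <answer> tags."},
        {"role": "user", "content": "A shop sells pens at 12 baht each. How much do 7 pens cost?"},
    ],
    temperature=0.0,  # deterministic decoding makes prompt comparisons fairer
)
print(response.choices[0].message.content)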

3. Systems of Self-Reflection

While Chain-of-Thought enhances how models reason, Self-Reflection takes it a step further—giving the model a chance to critique and refine its own output. This approach, often called self-refine (Madaan et al., 2023), has shown great promise in reducing hallucinations and improving factual accuracy.

However, there’s a known challenge: self-bias. As Xu et al. (2024) and PakapongZa (2024) point out, a model that reviews its own answers can sometimes reinforce its own errors instead of correcting them.

To address this, several methods have been proposed:

CrossCheckGPT (Sun et al., 2024)

This approach introduces multi-agent cross-evaluation. Multiple models—or multiple perspectives from the same model—evaluate a single output, either with or without external references. This allows for an agentic consensus, where models collaboratively agree on a refined answer. It’s especially effective when paired with Agentic Planning, which we’ll cover in Section 2.

Retrieval-Augmented Generation (RAG) (Lewis et al., 2021)

Rather than relying purely on internal logic, RAG enables the model to retrieve external information during the self-reflection process. By grounding its reflections in actual data, the model can produce more accurate, evidence-backed answers.

Figure 2 shows how RAG integrates into the self-reflection loop, enriching the model’s review process with retrieved content. While RAG doesn’t eliminate self-bias completely, it offers a scalable and relatively simple method for reducing it, making it a good fit for the examples you'll see in Section 3.

Figure 2: Adding RAG to self-reflection for self-refinement in an LLM. This diagram illustrates how RAG is integrated into the self-reflection process to help refine the answers generated by an LLM. The addition of RAG enables the model to search and retrieve relevant external information during the reflection phase, which enhances the accuracy and correctness of its responses.
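In code, the loop in Figure 2 might look like the sketch below. Here `call_llm` and `retrieve` are hypothetical stand-ins for your model client and retriever, and the reward threshold mirrors the scoring scheme used in the Appendix prompts:

# Sketch of a self-reflection loop grounded with retrieval (RAG).
# call_llm() and retrieve() are hypothetical stand-ins; replace them
# with a real model client and retriever.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def retrieve(query: str) -> str:
    raise NotImplementedError("plug in your retriever here")

def reflect_and_refine(question: str, max_rounds: int = 3, threshold: float = 0.8) -> str:
    answer = call_llm(f"Answer step by step:\n{question}")
    for _ in range(max_rounds):
        evidence = retrieve(question)  # ground the critique in external data
        critique = call_llm(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Evidence: {evidence}\n"
            "Critique the draft, then give a quality score in <reward>0.0-1.0</reward> tags."
        )
        match = re.search(r"<reward>([\d.]+)</reward>", critique)
        score = float(match.group(1)) if match else 0.0
        if score >= threshold:
            break  # the reflection judges the answer good enough
        answer = call_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite an improved answer."
        )
    return answer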

Section 2: Agentic Systems

To push LLM performance further—beyond clever prompting alone—we need to give models more autonomy in how they reason, reflect, and act. That’s where Agentic Systems come in.

Agentic Systems structure LLM behavior around decision-making loops, using design patterns that mimic planning, self-correction, and even collaboration across multiple agents. These patterns not only improve accuracy and reliability, but also unlock more complex forms of problem-solving.

In this section, we’ll explore how to integrate these agentic design patterns into your AI workflows, with practical examples. Figure 3 provides a visual overview of several common Agentic Design Patterns we’ll reference throughout this discussion.

Figure 3: Agentic AI Design Patterns in a Nutshell (https://x.com/ordax/status/1843006110074274148). This figure provides an overview of different Agentic AI Design Patterns, which are frameworks for guiding AI models, such as Large Language Models (LLMs), to perform tasks more efficiently and accurately by incorporating self-reflection, tool usage, planning, and multi-agent collaboration.

There are several Agentic Design Patterns that demonstrate how large language models can go beyond static responses and become more dynamic, decision-capable agents. These patterns provide reusable templates for orchestrating reasoning, refinement, and interaction—each contributing in different ways to improving inference performance.

Below are four key patterns that form the foundation of agentic workflows:

  1. Agentic Self-Reflection (the focus of our demo): The LLM generates an initial response (e.g., a function) and then critiques and refines its own output. This iterative loop helps the model catch errors and improve quality. Example: The model writes a Python function, then evaluates it against unit tests and adjusts accordingly (see the sketch after this list).

  2. Agentic with Tool Use: The LLM integrates external tools—like web search, RAG, or calculators—to supplement its internal reasoning. Example: The model retrieves relevant research papers or documentation, then weaves that data into its response.

  3. Agentic Planning: The model decomposes complex tasks into smaller sub-tasks, selects appropriate strategies, and uses memory or planning modules to complete the process step by step. Example: Given a software bug, the model breaks down the debugging process into discrete planning stages and iterates based on reflection.

  4. Multi-Agentic Collaboration: Multiple LLM-based agents work together, each with a specialized role (e.g., planner, executor, reviewer). A central Commander Agent coordinates their efforts. Example: One agent writes code, another reviews it, and a third verifies safety or correctness before returning a final answer.
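Here is a minimal sketch of pattern 1, Agentic Self-Reflection, applied to code generation. `call_llm` is again a hypothetical stand-in, and the draft is assumed to define a function named `solve`:

# Sketch of Agentic Self-Reflection on code: draft, test, feed failures back.
# call_llm() is a hypothetical stand-in for a real model client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def run_tests(code: str, tests: list[tuple[tuple, object]]) -> list[str]:
    """Exec the drafted code (sandbox this in real use) and report failing cases."""
    namespace: dict = {}
    exec(code, namespace)  # the draft is expected to define solve()
    solve = namespace["solve"]
    failures = []
    for args, expected in tests:
        got = solve(*args)
        if got != expected:
            failures.append(f"solve{args} returned {got!r}, expected {expected!r}")
    return failures

def draft_with_reflection(task: str, tests: list[tuple[tuple, object]], max_rounds: int = 3) -> str:
    code = call_llm(f"Write a Python function solve() for: {task}")
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            break  # all tests pass; accept the draft
        code = call_llm(
            "Your function failed these tests:\n" + "\n".join(failures)
            + f"\nOriginal task: {task}\nReturn a corrected solve()."
        )
    return code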

Section 3: Practical Demonstration

In this section, we’ll walk through a real-world example to demonstrate how Agentic Self-Reflection can help improve performance on a specific type of reasoning task: arithmetic word problems.

While Typhoon 1.5X performs well on most instruction-following tasks, its handling of numerical and financial reasoning—especially those involving multi-step calculations—leaves room for improvement. That makes arithmetic reasoning a perfect use case to explore how agentic techniques can guide the model toward more accurate outputs.

Example 1: Investment Calculation

📘 Question: A client requires £100,000 one year from now. If the stated annual interest rate is 2.50%, compounded weekly, what is the closest estimate of the deposit needed today?

Options:

A: £97,500

B: £97,532 ✅

C: £97,561

Given:

Future Value (FV): £100,000

Annual Interest Rate (r): 2.5% or 0.025

Compounding Periods per Year (n): 52 (weekly)

Solutions:

This problem can be solved using the Present Value (PV) formula:

$$ PV = \frac{FV}{\left(1 + \frac{r}{n}\right)^{n}} $$

The task can be broken down into these steps:

  1. Calculate the periodic interest rate: $r/n = 0.025 / 52 \approx 0.00048077$

  2. Compute the growth factor: $1 + r/n \approx 1.00048077$

  3. Calculate the compound factor: $(1 + r/n)^{52} \approx 1.025309$

  4. Determine the present value: $PV = 100{,}000 / 1.025309 \approx 97{,}531.63$

Conclusion: The present value is approximately £97,531.63, which aligns closely with option B: £97,532.
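You can verify the arithmetic with a few lines of Python:

# Verify the present-value arithmetic for Example 1.
FV = 100_000  # future value (GBP)
r = 0.025     # stated annual interest rate
n = 52        # compounding periods per year (weekly)

periodic_rate = r / n                 # step 1
growth_factor = 1 + periodic_rate     # step 2
compound_factor = growth_factor ** n  # step 3
PV = FV / compound_factor             # step 4
print(f"{PV:,.2f}")  # 97,531.63 -> closest to option B (£97,532)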

Simple Prompting vs. CoT + Self Reflection + RAG

To evaluate the effectiveness of agentic techniques, we tested Typhoon 1.5X on the same arithmetic question using two different prompting approaches.

🔹 Baseline: Simple Prompting (Incorrect)

In the first attempt, we used a basic prompt with no structured reasoning or feedback loop. As shown in Figure 4, the model incorrectly selected Option A: £97,500.

Figure 4: Typhoon-1.5-8b-instruct answering incorrectly when prompted without reasoning or feedback. Model: scb10x/llama-3-typhoon-v1.5x-8b-instruct, tested on the OpenTyphoon Playground.

🔹 CoT + Self-Reflection + RAG

Next, we applied a structured prompt using Chain-of-Thought, Self-Reflection, and RAG. As shown in Figure 5, the model correctly identified Option B: £97,532.

Figure 5: Typhoon-1.5-8b-instruct with CoT, Self-Reflection (80% reflection score), and RAG arrives at the correct solution. Source: GitHub repo.

Example 2: Math Prompt in Thai

In this example, the model was tested with a math problem written in Thai. Below is the English translation of the original prompt:

Triangle ABC is an isosceles triangle with AB = AC. Points D, E, and F lie on segments BC, AC, and AB, respectively, such that AD, BE, and CF are medians drawn from vertices A, B, and C.

If AD = 8 units, and the combined length of the other two medians (BE + CF) is 258 units, what is the approximate perimeter of triangle ABC?

Answer choices:

27 units

31 units

32 units ✅

36 units

Figure 6 shows the result when we used a basic prompt (without reasoning or feedback) on the llama-3-typhoon-v1.5x-8b-instruct model via the OpenTyphoon Playground. In this attempt, the model’s reasoning was incorrect—it failed to apply the Pythagorean theorem, leading to an invalid solution.

Figure 6: Incorrect result generated from a basic prompt. The model failed to apply geometric principles correctly.

In contrast, Figure 7 shows the output when we used a structured prompt combining Chain-of-Thought (CoT) reasoning, Self-Reflection, and Retrieval-Augmented Generation (RAG). This approach significantly improved the logical structure of the solution and led to the correct answer.

However, it’s worth noting that while the reasoning steps were largely correct, some numerical values in the output were still inaccurate or hallucinated—highlighting the need for further refinement in numerical precision.

Figure 7: Chain-of-Thought + Self-Reflection + RAG. Source exam: https://www.webythebrain.com/article/tu-60_math-test

We received valuable feedback from the Thai Natural Language Processing community, particularly from K. Oparad. We appreciate this input and will incorporate it into future iterations to improve model performance on Thai-language tasks.

As shown in Figure 8, while the model demonstrates sound reasoning, the numerical outputs are occasionally inaccurate or hallucinated. This highlights a known limitation of the current Typhoon 1.5X 8B model—especially when handling complex mathematical tasks.

Figure 8: Despite correct reasoning, the Typhoon 1.5X (8B) model occasionally generates hallucinated numbers, a known limitation of this model size.

Example 3 (Exclusive): Computer Security Challenge

In this final example, we test Typhoon’s reasoning on a foundational question in computer security. Specifically, we evaluate whether the composition of two collision-resistant hash functions, denoted $H_1$ and $H_2$, remains collision-resistant when combined as:

$$ H(x) = H_1(H_2(x)) $$

Question (T/F):

If $H_1$ and $H_2$ are collision-resistant hash functions, then $H(x) = H_1(H_2(x))$ is also collision-resistant.

Answer choices:

False

True ✅
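For reference, the standard argument for why the composition remains collision-resistant: any collision in $H$ must expose a collision in one of its components.

$$
x \neq x' \;\wedge\; H_1(H_2(x)) = H_1(H_2(x')) \implies
\begin{cases}
H_2(x) = H_2(x'), & \text{a collision in } H_2, \\
H_2(x) \neq H_2(x'), & \text{a collision in } H_1.
\end{cases}
$$

Either branch contradicts the assumed collision resistance of $H_1$ or $H_2$, which is why the statement is true.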

As shown in Figure 9, the model correctly answers this LaTeX-formatted question using a prompt that includes Chain-of-Thought reasoning, Self-Reflection, and Retrieval-Augmented Generation (RAG). The intermediate reasoning logs further confirm that the model followed a sound process before arriving at the correct conclusion.

Figure 9: Collision-resistance problem from a computer security exam, solved with CoT + Self-Reflection + RAG. Dataset: natnitaract's Hugging Face.

For a deeper look, you can explore the full reasoning process and outputs—both with and without self-reflection—in our GitHub repository (Nitarach, 2024), which includes a Jupyter Notebook and Gradio demo.

Final Thoughts: What’s Next?

This post focused on one particular agentic technique—self-reflection—to improve arithmetic and reasoning performance in Typhoon 1.5. But this is just the beginning.

To push the limits of inference quality, we encourage you to explore a broader spectrum of Agentic Systems and prompting methods, including:

  1. Magic Prompting Variants such as Tree of Thoughts (ToT) (Yao et al., 2023) and Graph of Thoughts (GoT) (Besta et al., 2024)

  2. Advanced Agentic Design such as Mixture-of-Agents (Wang et al., 2024), Agents for Summarization (Celikyilmaz et al., 2018) and Automated Design of Agentic Systems (Hu et al., 2024)

Thank You for Reading

We hope this post provided practical insights into improving LLM inference performance using Agentic Self-Reflection. As always, feedback is welcome—and we’re excited to see how you take these ideas further.

Let us know what you build. 🚀

References

  • Bhargava, A., Witkowski, C., Looi, S.-Z., & Thomson, M. (2024). What’s the Magic Word? A Control Theory of LLM Prompting. arXiv preprint arXiv:2310.04444

  • Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., & Hoefler, T. (2024). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI Conference on Artificial Intelligence, 38(16), 17682–17690. DOI: 10.1609/aaai.v38i16.29720

  • Cao, D., Katz, M., Kokel, H., Srinivas, K., & Sohrabi, S. (2024). Automating Thought of Search: A Journey Towards Soundness and Completeness. arXiv preprint arXiv:2408.11326

  • Celikyilmaz, A., Bosselut, A., He, X., & Choi, Y. (2018). Deep Communicating Agents for Abstractive Summarization. arXiv preprint arXiv:1803.10357

  • Hu, S., Lu, C., & Clune, J. (2024). Automated Design of Agentic Systems. arXiv preprint arXiv:2408.08435

  • Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401

  • Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651

  • Sun, G., Manakul, P., Liusie, A., Pipatanakul, K., Zhang, C., Woodland, P., & Gales, M. (2024). CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models. arXiv preprint arXiv:2405.13684

  • Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv preprint arXiv:2406.04692

  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., ... & Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903

  • Xu, W., Zhu, G., Zhao, X., Pan, L., Li, L., & Wang, W. Y. (2024). Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. arXiv preprint arXiv:2402.11436

  • Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601

  • PakapongZa. (2024). Self-bias in LLMs [Video]. YouTube. Watch here

  • Nitarach, N. (2024). Cracking Typhoon LLM Inference Performance. GitHub repository. https://github.com/nat-nischw/cracking-typhoon-llm-inference-performance

Appendix

GSM8K Performance: Standard vs Our-CoT (0-shot, 8-shot)

Figure 10: GSM8K performance of scb10x/llama-3-typhoon-v1.5x-8b-instruct. The chart compares the performance of standard prompts vs. our CoT prompts for arithmetic reasoning tasks, highlighting the improvement in accuracy using CoT with vLLM for inference. Eval dataset: https://github.com/Pittawat2542/null-shot-prompting/blob/main/datasets/GSM8K.jsonl

In Figure 10, the scb10x/llama-3-typhoon-v1.5x-8b-instruct model is evaluated on the GSM8K benchmark, which focuses on arithmetic reasoning. Using vLLM for optimized inference, we compared standard prompts with our CoT prompts:

  • Standard Prompt: 12.66%
  • Ours-CoT Prompt: 52.84%
  • Ours-CoT (8-shot) Prompt: 73.16%

Prompt

To ensure that the LLM followed instructions and generated its own final answer, we used a template with <output> tags that contained the model's actual final answer. This prevented potential errors in regex extraction. It differs from the EleutherAI/lm-evaluation-harness flexible-extract and strict-match methods.
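A minimal sketch of the kind of extraction this template enables (the exact regex in our evaluation code may differ):

# Sketch: extract the final answer from the <output> block.
# The regex here is illustrative; the evaluation code may differ.
import re

def extract_answer(completion: str) -> float | None:
    block = re.search(r"<output>(.*?)</output>", completion, re.DOTALL)
    if not block:
        return None
    # Expect a line like: "answer": 39  (possibly with commas or decimals)
    num = re.search(r'"answer"\s*:\s*\[?\s*([-\d.,]+)', block.group(1))
    if not num:
        return None
    return float(num.group(1).replace(",", ""))

print(extract_answer('<output>\n"answer": 39\n</output>'))  # 39.0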

Our results differ from EleutherAI's lm-evaluation-harness in the standard-prompt setup. Standard-prompt scores can change when you use a different template: switching to "You are a helpful assistant named GPT-6." had a notable impact, increasing accuracy from 12.66% to 13.12%. In the 8-shot setting, however, our results are almost identical to EleutherAI’s lm-evaluation-harness. In short, CoT reasoning is more robust to template changes than standard prompting.

Standard

SYSTEM_PROMPT: str = """Provide your final answer within the <output> tags.
<output>
"answer": [The final answer (number)]
</output>
"""

Ours-CoT

# Semi-Reflection: GSM8K 12.66% --> 52.84%
# Remove response reflection score, but keep it in thought.
SYSTEM_PROMPT: str = """You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.

Follow these steps:
1. Think through the problem step by step within the <thinking> tags.
2. Reflect on your thinking to check for any errors or improvements within the <code> tags.
3. Make any necessary adjustments based on your reflection.
4. Provide your final, concise answer within the <output> tags.

Important: The <thinking> sections are for your internal reasoning process only. 
Do not include any part of the final answer in these sections. 
The actual response to the query must be entirely contained within the <output> tags.

### Response Format:
<thinking>
[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer.]

<code>
```python
(Optional: functions to solve the question.)
```
</code>

</thinking>

<output>
"answer": [The final answer (number)]
</output>"""

# Reflection: We recommend adding RAG for small models, because Self-RAG methods don't work.
SYSTEM_PROMPT: str = """You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries. 

Follow these steps:
1. Think through the problem step by step within the <thinking> tags.
2. Reflect on your thinking to check for any errors or improvements within the <reflection> tags.
3. Reflect within <reflection> tags after key steps. Assign a quality score (0.0-1.0) within <reward> tags to guide adjustments:
    - 0.8+: Continue current approach.
    - 0.5-0.7: Make minor adjustments.
    - Below 0.5: Consider backtracking.
4. Make any necessary adjustments based on your reflection.
5. Provide your final, concise answer within the <output> tags.

Important: The <thinking> and <reflection> sections are for your internal reasoning process only. 
Do not include any part of the final answer in these sections. 
The actual response to the query must be entirely contained within the <output> tags.

### Response Format:
<thinking>
[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer.]

<code>
```python
(Optional: functions to solve the question.)
```
</code>

<reflection>
[Your reflection on your reasoning, checking for errors or improvements]
</reflection>
[Self-reflection on reasoning quality.]
<reward>
[Score between 0.0 and 1.0]
</reward>
[Any adjustments to your thinking based on your reflection]

</thinking>

<output>
"reward": [Your final reflection score],
"answer": [Your final , concise answer (Thai, English) to the query. This is the only part that will be shown to the user.]
</output>"""
Ours-CoT (8-shot)

# Semi-Reflection: GSM8K 52.84% (0-shot) --> 73.16% (8-shot)
SYSTEM_PROMPT: str = """You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.

Follow these steps:
1. Think through the problem step by step within the <thinking> tags.
2. Reflect on your thinking to check for any errors or improvements within the <reflection> tags.
3. Make any necessary adjustments based on your reflection.
4. Provide your final, concise answer within the <output> tags.

Important: The <thinking> sections are for your internal reasoning process only. 
Do not include any part of the final answer in these sections. 
The actual response to the query must be entirely contained within the <output> tags.

### Few-shot Examples:

- question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
<thinking>
There are 15 trees originally. Then there were 21 trees after some more were planted. To find how many trees were planted, I subtract the initial number from the final number: 21 - 15 = 6.
</thinking>
<reflection>
This step is correct, the subtraction properly shows the number of trees planted.
</reflection>
<output>
"answer": 6
</output>

- question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
<thinking>
There are 3 cars initially in the parking lot. 2 more cars arrive, so I need to add 3 + 2 = 5.
</thinking>
<reflection>
No issues, the addition is correct.
</reflection>
<output>
"answer": 5
</output>

- question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
<thinking>
Leah had 32 chocolates and her sister had 42, so together they had 32 + 42 = 74 chocolates. After eating 35 chocolates, they would have 74 - 35 = 39 chocolates left.
</thinking>
<reflection>
The addition and subtraction steps are correct.
</reflection>
<output>
"answer": 39
</output>

- question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
<thinking>
Jason started with 20 lollipops, and after giving some to Denny, he has 12 left. To find out how many he gave away, I calculate 20 - 12 = 8.
</thinking>
<reflection>
The subtraction correctly calculates the number of lollipops given away.
</reflection>
<output>
"answer": 8
</output>

- question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
<thinking>
Shawn started with 5 toys. If he got 2 toys each from his mom and dad, that's 2 + 2 = 4 more toys. Adding that to the initial number: 5 + 4 = 9.
</thinking>
<reflection>
The addition is correct.
</reflection>
<output>
"answer": 9
</output>

- question: There were nine computers in the server room. Five more computers were installed each day, from Monday to Thursday. How many computers are now in the server room?
<thinking>
There were originally 9 computers. From Monday to Thursday, 5 computers were added each day for 4 days. That means 5 * 4 = 20 computers were added. The total number of computers now is 9 + 20 = 29.
</thinking>
<reflection>
Multiplication and addition steps are correct.
</reflection>
<output>
"answer": 29
</output>

- question: Michael had 58 golf balls. On Tuesday, he lost 23 golf balls. On Wednesday, he lost 2 more. How many golf balls did he have at the end of Wednesday?
<thinking>
Michael started with 58 golf balls. After losing 23 on Tuesday, he had 58 - 23 = 35. Then he lost 2 more on Wednesday, leaving him with 35 - 2 = 33 golf balls.
</thinking>
<reflection>
The subtraction steps are accurate.
</reflection>
<output>
"answer": 33
</output>

- question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
<thinking>
Olivia had 23 dollars. She bought 5 bagels at 3 dollars each, which means 5 * 3 = 15 dollars spent. To find out how much she has left, I subtract 15 from 23: 23 - 15 = 8.
</thinking>
<reflection>
The multiplication and subtraction steps are correct.
</reflection>
<output>
"answer": 8
</output>

### Response Format:
<thinking>
[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer.]

<code>
```python
(Optional: functions to solve the question.)
```
</code>

</thinking>

<output>
"answer": [The final answer (number)]
</output>"""