Fine-tuning data templates: please help

#32
by Cagatayd - opened

I am very confused about how to format the data I give the model for fine-tuning. I have been watching YouTube videos, and everyone uses a different format.

For example, for Llama 3.1, I have seen the following:

1- Some format the data this way:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{model_answer}<|eot_id|>

2- Some use <|im_start|>assistant in both the input and the output when instruction tuning, but <|im_start|> does not even appear on Llama's page.

Link : https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1

3- Some use this format:
[INST] instruction context [/INST]

4- Some use:
messages = [
    {"role": "system", "content": "..........."},
    {"role": "user", "content": "............"},
]

5- Unsloth uses this format:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
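
To make format 5 concrete, here is a minimal self-contained sketch of the same function. The snippet above does not show alpaca_prompt or the tokenizer, so the template text and the EOS string below are stand-ins assumed for illustration, not values taken from the Unsloth notebook.

```python
# Assumed Alpaca-style template; the snippet above does not show alpaca_prompt.
ALPACA_PROMPT = (
    "### Instruction:\n{}\n\n"
    "### Input:\n{}\n\n"
    "### Response:\n{}"
)

# Placeholder string; in a real run this would be tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"


def formatting_prompts_func(examples):
    """Turn a batch of column lists into a {"text": [...]} batch."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Append EOS so the model learns where a response ends.
        texts.append(ALPACA_PROMPT.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}


# Hypothetical one-row batch, shaped like the columns the function expects.
batch = {
    "instruction": ["Translate to French."],
    "input": ["Good morning"],
    "output": ["Bonjour"],
}
formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

The function takes and returns column-wise batches, which is the shape a batched dataset map would typically pass in.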

Every video I watch on YouTube gives the data in a different format.

Whether the data is given as instructions, as Q-A pairs, or as plain text, which format is easiest for the model to understand, and which formats are right for Llama 3.1 base and instruct?

How do I find out which input data template to apply for fine-tuning in each case?
I) Instruction-based
II) Q-A
III) Plain text

This is for Llama 3.1 base and Llama 3.1 instruct. Please help me find the right template.

Additionally, how can I format research papers and books for fine-tuning?

Thank you so much in advance

These are different chat templates for instruction fine-tuned models.
This model is a base model, which doesn't include a chat template. If you are looking for the chat template of its instruction fine-tuned version, you can check the tokenizer_config file.
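
As a rough illustration of what a chat template does, the sketch below renders a messages list (format 4 in the question) into the header-token layout of format 1. The function name render_llama3_chat is hypothetical, and the blank line after each header follows Meta's prompt-format page. The template shipped with the instruct tokenizer may add more than this (for example default system text), so in practice calling tokenizer.apply_chat_template from transformers on the instruct tokenizer is the safer route.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Render role/content dicts into a Llama 3 style prompt string.

    Simplified, hand-rolled sketch; not the template from tokenizer_config.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Hypothetical training example with a full assistant answer.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
rendered = render_llama3_chat(messages)
print(rendered)
```

For training data the assistant answer is included and add_generation_prompt stays False; for inference the last assistant message is left out and the flag is set to True.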

Thanks a lot, I got all my answers.