Mismatched tokenizer and LLM model

#22
by XibinBayesZhou - opened

Hi there,
I'm using your model and trying to decode the LLM's output with the provided tokenizer. It seems the LLM can emit token IDs larger than tokenizer.vocab_size, which causes a tokenizer.decode error.

After checking the vocab_size in both the model and the tokenizer, the problem is the difference between the vocab_size in config.json (51200, from this line) and the one in tokenizer_config.json (at most 50295, from this line).

The difference in vocab_size that I see while debugging also indicates this issue.
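
For reference, here is a minimal sketch of that check. It assumes microsoft/phi-2 and a recent transformers release (older versions may need trust_remote_code=True), and the exact counts can differ between versions.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

print(config.vocab_size)  # 51200, the embedding size from config.json
print(len(tokenizer))     # about 50295: the base vocab plus the added tokens
# Any generated token id >= len(tokenizer) has no entry in the tokenizer,
# which is what makes tokenizer.decode fail.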

How can I solve this? Can it be avoided by setting some arguments, or should I modify the config file?

Thank you for taking the time to read this. Looking forward to your advice!

This is curious, as it can't be explained with added tokens. The base CodeGenTokenizer has more than 51200 tokens. Perhaps the 51200 in the model config is outdated.

It's present in the Azure repo, for the latest v2, as well.

@wassname Thank you for the information. Do you mean there's a repo that points this issue out? Could you give me a link to that? Thank you very much!

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# Pad the tokenizer with dummy tokens so it covers the model's 51200-entry vocabulary.
tokenizer.add_tokens([f'<SPL_{i}>' for i in range(0, 943)])  # returns 943, the number of tokens added

Adding new tokens like this and then using the extended tokenizer avoids the decode error.
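
To sanity-check that workaround, here is a small sketch along the same lines; the <SPL_i> names are just placeholders, and the number of dummy tokens is computed from the configs rather than hard-coded.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Add dummy tokens until the tokenizer covers every id the model can emit.
n_missing = config.vocab_size - len(tokenizer)
tokenizer.add_tokens([f'<SPL_{i}>' for i in range(n_missing)])

assert len(tokenizer) >= config.vocab_size
print(tokenizer.decode([config.vocab_size - 1]))  # decodes instead of raising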

Oh, there is the Hugging Face repo and the Azure one, but they both have the same discrepancy.

Microsoft org

Could you please provide the script that is generating those token IDs?

We ended up setting the vocabulary size to 51200 just to accommodate any new tokens that we might need in the future. You can follow @Deepakvictor's answer and it should fix the issue.

As far as I know, no token IDs of 50295 or above should be generated, because those embeddings were not trained. However, depending on the generation parameters, they could still appear (with low probability).
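
Another option, if you would rather not modify the tokenizer, is to block those untrained ids at generation time, for example with the bad_words_ids argument of generate. A rough sketch, assuming microsoft/phi-2 and taking the boundary as len(tokenizer), as in the numbers above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Ids from len(tokenizer) up to vocab_size - 1 are untrained and undecodable,
# so forbid them during sampling.
untrained_ids = [[i] for i in range(len(tokenizer), model.config.vocab_size)]

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    bad_words_ids=untrained_ids,
    pad_token_id=tokenizer.eos_token_id,  # phi-2's tokenizer has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))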

gugarosa changed discussion status to closed
gugarosa changed discussion status to open
gugarosa changed discussion status to closed
