Mismatched tokenizer and LLM model

#22
by XibinBayesZhou - opened

Hi there,
I'm using your model and trying to decode the LLM's output with the provided tokenizer. It seems the LLM can emit token IDs larger than tokenizer.vocab_size, which causes a tokenizer.decode error.

After checking the vocab_size in both the model and the tokenizer, the problem is the difference between the vocab_size in config.json (51200, from this line) and the one in tokenizer_config.json (at most 50295, from this line).

The difference in vocab_size that I see while debugging also indicates this issue.
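
For reference, here is a minimal sketch of that check. It assumes microsoft/phi-2 and a recent transformers release (older versions may need trust_remote_code=True), and the exact counts can differ between versions.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

print(config.vocab_size)  # 51200, the embedding size from config.json
print(len(tokenizer))     # about 50295: the base vocab plus the added tokens
# Any generated token id >= len(tokenizer) has no entry in the tokenizer,
# which is what makes tokenizer.decode fail.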

How can I solve this? Can it be avoided by setting some arguments, or should I modify the config file?

Thank you for taking the time to read this. Looking forward to your advice!

This is curious, as it can't be explained with added tokens. The base CodeGenTokenizer has more than 51200 tokens. Perhaps the 51200 in the model config is outdated.

It's present in the Azure repo, for the latest v2, as well.

@wassname Thank you for the information. Do you mean there's a repo that points this issue out? Could you give me a link to that? Thank you very much!

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# Pad the tokenizer with dummy tokens so it covers the model's 51200-entry vocabulary.
tokenizer.add_tokens([f'<SPL_{i}>' for i in range(0, 943)])  # returns 943, the number of tokens added

Adding new tokens like this and then using the extended tokenizer avoids the decode error.
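
To sanity-check that workaround, here is a small sketch along the same lines; the <SPL_i> names are just placeholders, and the number of dummy tokens is computed from the configs rather than hard-coded.

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Add dummy tokens until the tokenizer covers every id the model can emit.
n_missing = config.vocab_size - len(tokenizer)
tokenizer.add_tokens([f'<SPL_{i}>' for i in range(n_missing)])

assert len(tokenizer) >= config.vocab_size
print(tokenizer.decode([config.vocab_size - 1]))  # decodes instead of raising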

Oh, there is the Hugging Face repo and the Azure one, but they both have the same discrepancy.

Microsoft org

Could you please provide the script that is generating those token IDs?

We ended up setting the vocabulary size to 51200 just to accommodate any new tokens that we might need in the future. You can follow @Deepakvictor's answer and it should fix the issue.

As far as I know, no token IDs of 50295 or above should be generated, because those embeddings were not trained. However, depending on the generation parameters, they could still appear (with low probability).
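
Another option, if you would rather not modify the tokenizer, is to block those untrained ids at generation time, for example with the bad_words_ids argument of generate. A rough sketch, assuming microsoft/phi-2 and taking the boundary as len(tokenizer), as in the numbers above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Ids from len(tokenizer) up to vocab_size - 1 are untrained and undecodable,
# so forbid them during sampling.
untrained_ids = [[i] for i in range(len(tokenizer), model.config.vocab_size)]

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    bad_words_ids=untrained_ids,
    pad_token_id=tokenizer.eos_token_id,  # phi-2's tokenizer has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))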

gugarosa changed discussion status to closed
gugarosa changed discussion status to open
gugarosa changed discussion status to closed
