Model Has Some Coherence. But only uses single-letter tokens?

#2
by MartialTerran - opened

The sample prompt for this model generated one coherent sentence, then later failed to generate another. On examining your vocab and tokenizer files, it seems the only useful vocabulary tokens in this model are limited to the letters of the alphabet? The config file says `"vocab_size": 341`. Is that correct? What is the purpose of using only single-letter tokenization? Another line in the code says you use the Hugging Face `AutoTokenizer`, so maybe you are actually using a different tokenization scheme beyond the 341-entry vocab?
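
For what it's worth, the actual scheme can be checked by loading the tokenizer directly. A minimal sketch (the model id below is a placeholder, not this repo's real id):

```python
from transformers import AutoTokenizer

# Load whatever tokenizer the repo actually ships and inspect it.
# "owner/model-id" is a placeholder; substitute this repo's id.
tok = AutoTokenizer.from_pretrained("owner/model-id")

print(tok.vocab_size)             # does this really report 341?
print(tok.tokenize("hello"))      # ['h', 'e', 'l', 'l', 'o'] if it's character-level
print(tok("hello")["input_ids"])  # the raw ids the model sees
```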

Owner

You are quite correct: the tokenizer only covers the standard lower-case letters plus a 'shift' key token, so it doesn't understand upper-case characters directly.

I did it that way because I couldn't work out any other way to encode things easily while still staying a valid tokenizer.

The bulk of the tokenizer is the hex values and special start/stop tokens.
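
Roughly, the layout could look like the sketch below. The token names and grouping here are illustrative assumptions, not the actual vocab file, and the counts shown don't fully add up to 341, so the real file carries some extra entries:

```python
# Hypothetical reconstruction of a character-level vocab along these lines:
# a few special tokens, a shift marker, the lowercase alphabet, and
# 256 hex byte values. All token names below are assumptions.
special = ["<pad>", "<s>", "</s>", "<unk>", "<shift>"]
letters = [chr(c) for c in range(ord("a"), ord("z") + 1)]
hex_bytes = [f"<0x{b:02X}>" for b in range(256)]

vocab = {tok: i for i, tok in enumerate(special + letters + hex_bytes)}
print(len(vocab))  # 287 here; the real file has 341, so it holds additional entries
```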

It's an experiment in forcing the model to work out all words from scratch, to see what happens.

So far it's trained on the TinyStories dataset and a similar one I made for longer samples, which isn't released yet.

Try starting with only lower-case letters, as the auto casifier/decasifier isn't working.

...
Yet.
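
For the curious, here's roughly what a casifier/decasifier pair could look like under the shift-token scheme: a capital letter is encoded as the shift marker followed by its lowercase form. This is a sketch only; the `<shift>` token name is an assumption, not the shipped implementation:

```python
SHIFT = "<shift>"

def casify(text: str) -> str:
    """Lower-case the text, marking each original capital with a shift token."""
    return "".join(SHIFT + ch.lower() if ch.isupper() else ch for ch in text)

def decasify(text: str) -> str:
    """Invert casify: upper-case the letter following each shift token."""
    out, i = [], 0
    while i < len(text):
        if text.startswith(SHIFT, i):
            i += len(SHIFT)
            if i < len(text):
                out.append(text[i].upper())
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(casify("Hello World"))            # <shift>hello <shift>world
print(decasify(casify("Hello World")))  # Hello World
```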
