Question about the name.. why is it 2b?

#36 opened by sh0416

I counted the number of parameters and got 3_030_460_416, which is 3.03 billion as far as I can tell.

Does gemma-2b mean the Gemma architecture with 2 billion parameters, or is there some other meaning in the name?

from transformers import AutoModelForCausalLM

# Sum the element count of every tensor in the checkpoint's state_dict
sum(x.numel() for x in AutoModelForCausalLM.from_pretrained("google/gemma-2b").state_dict().values())
# 3030460416

Did you exclude the vocabulary embedding weight and the LM head weight?
If so, that would make sense, since those two weights account for about 1B parameters.
There is also a fairness issue in counting parameters, as Gemma's vocabulary is about 5 times larger than that of comparable models.
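As a rough check on that 1B figure (assuming Gemma's vocabulary size of 256000 and hidden size of 2048, the same values used in the calculation below), the two matrices work out to roughly one billion parameters:

vocab_size, hidden_size = 256000, 2048
per_matrix = vocab_size * hidden_size  # 524,288,000 parameters each
print(2 * per_matrix)                  # 1,048,576,000, i.e. ~1B for embedding + LM head combined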

I've done the math manually and automatically and, if my math is right, it adds up to about 2.5B parameters. I think you accounted for the embedding parameters twice in your calculation; be mindful that the input and output embeddings are tied (shared parameters). If you discount the output embedding count: 3030460416 - 256000*2048 = 2506172416.

Here's my calculation:

Manually:

embedding_params = 256000*2048  # input/output embeddings are tied, so count them once
attention_params = (2048*256*1)*8 + (2048*256*2)*1 + (256*8)*2048  # no biases; 8 query heads, 1 key-value head (K and V), head_dim 256; Q, K/V and output projections
layer_norm_params = (2048)*2  # pre- and post-attention normalization per layer, learned scale but no bias
feedforward_params = 2048*16384*2 + 16384*2048  # gate and up projections plus down projection, no biases
num_transformer_layers = 18
transformer_params = num_transformer_layers * (attention_params + layer_norm_params + feedforward_params)
last_layer_norm_params = 2048  # final normalization before the output head
total_params = embedding_params + transformer_params + last_layer_norm_params
print(total_params / 1e9)
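For reference, if the shapes above are right, the intermediate numbers are 9,437,184 attention + 4,096 norm + 100,663,296 feed-forward parameters per layer, i.e. 110,104,576 per layer; times 18 layers gives 1,981,882,368, and adding the 524,288,000 embedding parameters plus the final norm (2,048) yields 2,506,172,416.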

Automatically:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# named_parameters() deduplicates shared tensors, so the tied embedding is counted once
num_params = 0
for _, param in model.named_parameters():
    if param.requires_grad:
        num_params += param.numel()
print(num_params / 1e9)

The expected output for both cases is 2.506172416.
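As a quick sanity check of the weight tying (a minimal sketch, assuming the google/gemma-2b checkpoint loads as above), you can confirm that the input embedding and the LM head refer to the same underlying tensor:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
emb = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight
print(emb.data_ptr() == head.data_ptr())  # expected True if the weights are shared
print(tuple(emb.shape))                   # (256000, 2048)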
