How to calculate GPQA score?

#928
by JJaeuk - opened

Hello, I've been trying to reproduce leaderboard results for meta-llama/Meta-Llama-3-8B.

I noticed that the GPQA score listed on the leaderboard is 7.38:
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

However, when I checked the model details, GPQA scores range from 0.25 to 0.34:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json#L152-L171

Could you clarify how the 7.38 score is derived from these individual scores?

Thank you!

I've also run into this scoring issue; I got a much higher score than the one shown on the leaderboard. Looking for help!

Open LLM Leaderboard org

Hi @JJaeuk ,

Thank you for the question!
That's right, the normalised GPQA score for meta-llama/Meta-Llama-3-8B is 7.38. If you look at the GPQA Raw column, the raw score is 0.31, which we then normalise. You can find more info on normalisation in our documentation here:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization

Or in our V2 blogpost:
https://huggingface.co/spaces/open-llm-leaderboard/blog

If you have any questions, I'm happy to help you understand the normalisation!

(Screenshot attached: Screenshot 2024-09-17 at 14.57.25.png)

The issue with your answer is that, with a lower bound of 0.25, the normalized result R should be R = 100.0 * (0.31 - 0.25) / 0.75 = 8.00%, which is different from 7.38.
So you haven't really answered the question unless you provide a better explanation of how you normalized.

Regarding performance, I notice that different versions of lm_eval give different score ranges for the same model, and I guess that depends only on the lm-evaluation-harness implementation.

Open LLM Leaderboard org

Here is the exact normalisation function for the GPQA score:

import json

import numpy as np

# Normalization function: clip at the lower bound (the random baseline) and rescale to 0-100
def normalize_within_range(value, lower_bound=0, higher_bound=1):
    return (np.clip(value - lower_bound, 0, None)) / (higher_bound - lower_bound) * 100

# Load the results file linked below
with open("results_2024-06-16T19-10-04.926831.json") as f:
    data = json.load(f)

# Normalize the GPQA score (the random baseline for the 4-choice GPQA is 0.25)
gpqa_raw_score = data['results']['leaderboard_gpqa']['acc_norm,none']
gpqa_score = normalize_within_range(gpqa_raw_score, 0.25, 1.0)

print((gpqa_raw_score, gpqa_score))

Output: (0.3053691275167785, 7.38255033557047)

You can try to apply it to the recent results file:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json

In your calculation you used a raw score of 0.31, but that's a rounded value. The actual raw score is 0.3053691275167785. You can also find it in the Contents dataset here:
https://huggingface.co/datasets/open-llm-leaderboard/contents

As for lm_eval, for V2 we use our fork:
https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess

You can find more info in the reproducibility section:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility

I think it should be clear enough now, so let me close this discussion. Please feel free to open a new one in case of any questions!

alozowski changed discussion status to closed
