- A good benchmark is very hard to find: we don’t want one that can be pushed to 99% within a month, and we also don’t want one on which every frontier model scores 0 for the next 5 years.
- A good Gist Token Metric is really a comparison of how much larger the bridge loss is than the continuation loss; if the ratio of bridge loss to continuation loss is large, the model is terrible:
- bridge loss: the loss on the 1st token after the thought vector. There is no prior text token for the usual next-token prediction to condition on, so this token has to rely entirely on the thought vector.
- continuation loss: the loss on the rest of the tokens after the thought vector. Each of these is predicted with at least one preceding text token available, so they are much easier.
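
As a rough illustration of the comparison described above, here is a minimal sketch. It assumes the losses are per-token cross-entropy (negative log-probability) and that the continuation loss is averaged over the remaining tokens; neither detail is stated in the notes. The function name `gist_loss_ratio` and its input format are hypothetical.

```python
import math

def gist_loss_ratio(token_probs):
    """Compare bridge loss to continuation loss.

    token_probs: probabilities the model assigned to each correct token
    after the thought vector, in order. The first entry is the bridge token.
    Assumption: loss is -log(p), and continuation loss is a mean.
    """
    if len(token_probs) < 2:
        raise ValueError("need one bridge token and at least one continuation token")

    losses = [-math.log(p) for p in token_probs]

    # Bridge loss: first token after the thought vector, no text context.
    bridge_loss = losses[0]

    # Continuation loss: mean loss over all remaining tokens.
    continuation_loss = sum(losses[1:]) / len(losses[1:])

    # Guard against a perfect continuation (loss of exactly zero).
    if continuation_loss == 0.0:
        ratio = math.inf if bridge_loss > 0.0 else 1.0
    else:
        ratio = bridge_loss / continuation_loss

    return bridge_loss, continuation_loss, ratio


# Example: the bridge token gets probability 0.25, later tokens get 0.5 each.
bridge, continuation, ratio = gist_loss_ratio([0.25, 0.5, 0.5])
print(f"bridge={bridge:.4f} continuation={continuation:.4f} ratio={ratio:.2f}")
```

A ratio near 1 would mean the thought vector alone supports the first token about as well as real text context supports the later ones; a large ratio means the bridge token is where the model struggles.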