[R] "It's not just memorizing the training data" they said: Scalable Extraction of Training Data from (Production) Language Models (arxiv.org)
wojcech@alien.top to Machine Learning@academy.garden · English · 10 months ago · 30 comments
cross-posted to: hackernews@lemmy.smeargle.fans, hackernews@derp.foo, machinelearning@lemmit.online
cegras@alien.top · 10 months ago
What is the size of ChatGPT or the biggest LLMs compared to the dataset? (Not being rhetorical, genuinely curious)
zalperst@alien.top · 10 months ago
Trillions of tokens, billions of parameters
StartledWatermelon@alien.top · 10 months ago
GPT-4: 1.76 trillion parameters, about 6.5 trillion tokens in the dataset. The token count could be roughly twice that; the leaks weren't crystal clear, but the above number is more likely.
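A quick back-of-the-envelope sketch in Python, taking the leaked 1.76T-parameter / 6.5T-token figures above at face value (they're unconfirmed), shows the dataset works out to only a few tokens per parameter:

```python
# Back-of-the-envelope comparison of GPT-4's reported size vs. its dataset.
# Both figures are from unconfirmed leaks cited in the comment above.
params = 1.76e12   # reported parameter count
tokens = 6.5e12    # reported training tokens (possibly up to ~2x this)

print(f"tokens per parameter: {tokens / params:.1f}")  # ~3.7

# Rough storage comparison: 16-bit weights vs. ~4 bytes of text per token
# (a common rule-of-thumb assumption, not a measured value).
weight_bytes = params * 2
text_bytes = tokens * 4
print(f"weights are ~{weight_bytes / text_bytes:.2f}x the raw text size")  # ~0.14x
```

Under those assumptions the weights occupy roughly a seventh of the raw training text, so verbatim memorization of the whole corpus is implausible, though memorizing a fraction of it (as the linked paper demonstrates) is not.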