A new research paper, “Self-Rewarding Language Models”, explores a novel approach to LLM training. Instead of relying on a fixed, human-curated dataset, the model generates candidate responses and judges them itself, producing the preference data for its own next round of training, which in principle lets it keep improving beyond the limits of its initial training data[1].
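Conceptually, the training loop in [1] alternates between self-generation, self-judging, and preference optimization. Below is a minimal Python sketch of that loop; the helpers (`generate`, `self_judge`, `dpo_train`) are hypothetical placeholders standing in for the real sampling, LLM-as-a-Judge scoring, and DPO training steps described in the paper.

```python
# Minimal sketch of the self-rewarding loop from [1]. The helper functions
# below are placeholders, not a real implementation.

import random

def generate(model, prompt, n=4):
    # Placeholder: sample n candidate responses from the model.
    return [f"{prompt} -> response {i}" for i in range(n)]

def self_judge(model, prompt, response):
    # Placeholder: the paper prompts the *same* model, LLM-as-a-Judge style,
    # to score its own response on a 0-5 additive rubric.
    return random.uniform(0, 5)

def dpo_train(model, preference_pairs):
    # Placeholder: one round of Direct Preference Optimization (DPO).
    return model

def self_rewarding_iteration(model, prompts):
    pairs = []
    for prompt in prompts:
        candidates = generate(model, prompt)
        ranked = sorted(candidates, key=lambda r: self_judge(model, prompt, r))
        # Lowest- and highest-scored responses form a preference pair.
        pairs.append({"prompt": prompt, "rejected": ranked[0], "chosen": ranked[-1]})
    return dpo_train(model, pairs)

model = "M0"  # stand-in for the seed model
for _ in range(3):  # the paper runs a few such iterations (M1, M2, M3)
    model = self_rewarding_iteration(model, ["Explain DPO briefly."])
```

The notable design choice is that `self_judge` calls the same model being trained, so the quality of the reward signal can improve across iterations along with generation quality; that is the paper's central idea.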
This is another step along the path toward potentially realizing AGI. Data quality has been, and remains, one of the key challenges for LLM technology.
This method reminds me of the approach Microsoft used to train Phi-2, where GPT-3.5 generated synthetic textbook-style data as a one-time, static step. The key difference here is that the model under training is doing the generation (and the judging) itself[2].
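To make that contrast concrete, here is a toy sketch using stand-in string generators rather than real models; the function names and checkpoint labels are illustrative assumptions, not anything from either source.

```python
# Toy contrast between static teacher-generated data and self-generated data.

def teacher_generate(prompt):
    # Phi-2 style: a fixed external teacher (e.g. GPT-3.5) produces data once.
    return f"[teacher] textbook passage on {prompt}"

def student_generate(model, prompt):
    # Self-rewarding style [1]: the model being trained produces the data.
    return f"[{model}] candidate answer for {prompt}"

prompts = ["prime factorization"]

# Static pipeline: the dataset is frozen once the teacher has generated it.
static_data = [teacher_generate(p) for p in prompts]

# Self-rewarding pipeline: each improved checkpoint regenerates (and
# re-judges) its own data, so the dataset improves with the model.
for checkpoint in ["M0", "M1", "M2"]:
    iteration_data = [student_generate(checkpoint, p) for p in prompts]
```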
I do wonder if this approach suffers from an echo-chamber effect, where the model amplifies its own biases or errors in the absence of diverse external data. I am also curious how it would fare with a mixture-of-experts model like Mixtral 8x7B.
Perhaps a future version of this work will use an agent-based approach, with observers (perhaps robots) gathering real-world data to enrich the synthetic training data generation.
[1] https://arxiv.org/abs/2401.10020
[2] https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/