LLMs Poison and Trust

January 14, 2024

A fascinating new paper by the Anthropic team explores how LLMs can be ’trained’ to appear normal during training, only to manifest malicious behavior once deployed1. Andrej Karpathy expanded on this idea, hypothesizing that this initial training could be provided by publishing malicious text on the internet where it would be picked up for use in training new models1.

This might not seem significant, as LLMs merely generate text. However, consider the capabilities of Open Interpreter2. Open Interpreter is a program that helps you run code generated by LLMs. With Open Interpreter you can:

At least one report already exists of Open Interpreter initiating its own instance3. In this case, the user was a researcher, deliberately disabled the application safeties and asked the LLM to attempt to start itself, but I’ve seen enough users click through SSL warnings to know that in the wild this is a feasible scenario.

Despite these challenges, I remain optimistic about the future of LLMs. However, it is important to consider the aspect of trust in computing. Ken Thompson, a notable figure in computer science, made a poignant observation in his paper “Reflections on Trusting Trust”:

“The moral is obvious. You can’t trust code that you did not totally create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from using untrusted code.”4

Do we trust our compilers? Our operating systems? Our hardware platforms? Why? What makes them trustworthy? Are there degrees of trust? How often is that trust broken? I would argue that, most of the time, our systems are sufficiently trustworthy, offering more benefits than detriments to humanity. Much work has been done in hardware and software to improve its trustworthiness, and LLMs and more broadly generative AI will be no different. Open source, by virtue of being more observable helps in this regard and I think that open models and training data are essential to the future of this technology.