Case Study: Improving Alexa+’s Core Prompt (2024)


Situation

Our initial work on Alexa+ focused primarily on creating high-quality training data for the model. After the decision to switch to a different state-of-the-art model, responses improved significantly. However, spoken responses were still often overly formal or verbose, in ways better suited to text than to speech. Engineering teams had made some efforts to improve these responses but were not making sufficient progress.

Task

As the lead conversation designer for the Conversational Excellence team, I realized that to address these issues effectively, I needed to understand exactly what was happening in the system that generated the final responses. I understood the basic concepts of system instructions and model responses, but the Alexa+ architecture was complex: it involved several model inference passes and injected instructions from a number of different sources. Understanding it meant working in the evaluation tools that showed raw prompts and responses. Experimenting with changes to the experience meant working in the Java code repositories, because the architecture was designed with a “code-first” mindset.

Action

Once I had access to the evaluation tools and could see the raw prompts, I identified part of the problem: the core system prompt was not written in a way that was easy for a human or an LLM to understand. My first step was to rewrite the components of the prompt that drove final response generation, using clearer, more structured language. I then worked with a member of the engineering team to refine the prompt's language and structure further, minimizing token length and maximizing clarity for the model.

After these changes were implemented, we saw significant improvement but still had issues to address. To optimize the prompt further, the engineering team needed test cases implemented in code, and our team had no engineering resources to contribute. I checked out the code base and wrote code-first test cases that included accurate simulations of beta traffic that had generated problematic responses, along with LLM-based judges that could evaluate whether the problem was present in a response. I ultimately contributed about 25 test cases to the code base.
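
The general shape of such a test case can be sketched in Java, the language of the repositories mentioned above. This is an illustrative sketch only, not the actual Alexa+ test code: every class and method name here (SpokenResponseRegressionSketch, TestCase, Judge, buildJudgePrompt, wordLimitJudge, failingCases) is hypothetical, the sample utterance and responses are invented, and the judge is a simple word-count stand-in for an LLM-based judge so that the example runs without a model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class SpokenResponseRegressionSketch {

    /** One simulated request plus the problem the judge should look for (hypothetical shape). */
    static final class TestCase {
        final String name;
        final String userUtterance;
        final String problemToDetect;

        TestCase(String name, String userUtterance, String problemToDetect) {
            this.name = name;
            this.userUtterance = userUtterance;
            this.problemToDetect = problemToDetect;
        }
    }

    /** A judge reports whether the named problem is present in a response. */
    interface Judge {
        boolean problemPresent(TestCase testCase, String response);
    }

    /** Builds the text a model-backed judge might be given; the wording is an assumption. */
    static String buildJudgePrompt(TestCase testCase, String response) {
        return String.join("\n",
            "You are evaluating a response that will be spoken aloud.",
            "User said: " + testCase.userUtterance,
            "Response: " + response,
            "Problem to detect: " + testCase.problemToDetect,
            "Answer YES if the problem is present, otherwise NO.");
    }

    /** Stand-in judge: flags any response longer than maxWords. Not an LLM. */
    static Judge wordLimitJudge(int maxWords) {
        return (testCase, response) -> {
            String trimmed = response.trim();
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            return words > maxWords;
        };
    }

    /** Runs every case through the generator and returns the names of cases the judge flags. */
    static List<String> failingCases(List<TestCase> cases,
                                     Function<String, String> generator,
                                     Judge judge) {
        List<String> failures = new ArrayList<>();
        for (TestCase c : cases) {
            String response = generator.apply(c.userUtterance);
            if (judge.problemPresent(c, response)) {
                failures.add(c.name);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        TestCase timer = new TestCase(
            "timer-confirmation",
            "set a timer for ten minutes",
            "The response is too wordy or formal for speech.");
        List<TestCase> cases = Collections.singletonList(timer);

        // Invented responses standing in for output under two prompt versions.
        Function<String, String> verbosePrompt = utterance ->
            "Certainly. I have successfully created a timer with a duration of ten minutes, "
            + "and it will notify you once the time has fully elapsed.";
        Function<String, String> concisePrompt = utterance -> "Ten minutes, starting now.";

        Judge judge = wordLimitJudge(12);
        System.out.println("Flagged (verbose prompt): " + failingCases(cases, verbosePrompt, judge));
        System.out.println("Flagged (concise prompt): " + failingCases(cases, concisePrompt, judge));
        System.out.println(buildJudgePrompt(timer, concisePrompt.apply(timer.userUtterance)));
    }
}
```

In a real suite, the Judge implementation would send the text from buildJudgePrompt to a model and parse its answer, and the generator would call the system under test. Keeping both behind small interfaces lets the same test cases run unchanged as the prompt evolves, which is what makes them useful for regression testing.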

Result

The engineering team used these test cases to resolve many of the prompt issues we had observed, and the test cases became part of the standard regression-test suite. I also used this effort to justify and estimate additional engineering headcount for the Conversational Excellence team.

The principal product manager for this effort said, “Adam sets the bar for what teams should expect from a conversation designer in the LLM era. His ability to translate complex conversational design principles into actionable guidance enabled us to deliver a much more engaging and natural Alexa that delights customers daily.”