Etsy Explores How To Turn LLMs Into Specific AI Answer Engines Using Employee & Community Forum Input

Liz Morton



A new post on the Etsy Engineering Code As Craft blog raises some interesting questions about how context engineering and prompting can turn generic LLMs into highly specific answer engines - with a little help from employee handbooks and the Etsy seller community forum.

The post, titled "Context engineering case studies: Etsy-specific question answering," gives examples of how Etsy engineers tested OpenAI's o-series and Google's Gemini family of LLMs to determine whether context engineering with carefully crafted prompts could produce reliable results compared to more robust but expensive approaches that require fine-tuning on sufficiently large and relevant datasets.

One of the places Etsy thought an assistive AI could be useful is onboarding, both internally for employees and externally when bringing new sellers onto the platform, with an LLM answering a range of questions about Etsy's policies and procedures in both scenarios.

Etsy started with a limited-scale pilot project focused on answering questions about the Travel & Entertainment (T&E) sections of its employment agreement. It's a well-circumscribed domain with clear, unambiguous rules, but one most Etsy employees still have questions about before taking a company-approved trip.

Question answering
Perhaps the most critical aspect of a question answering system is its reliability, i.e., whether it is able to provide a truthful answer to any in-domain question. In the AI-assisted onboarding use case considered here, we want new Etsy employees to be able to be confident that their questions about the T&E policy are answered correctly.

The first step was to feed Etsy-specific data into the LLM. If we were fine-tuning, we would update (a subset of) model weight parameters from an appropriate collection of Etsy internal documents related to T&E.

Prompt engineering, on the other hand, freezes the model weights, treating the LLM as a black box. A number of such black-box tuning techniques exist in the field, which we review in the Appendix to this article. Prompt-based tuning was an attractive proposition in our case because all that it required was simply an adequate representation of task-specific documents.
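In practice, the "adequate representation of task-specific documents" the blog describes can be as simple as packing policy excerpts into the prompt ahead of the question. The function and prompt wording below are an illustrative sketch of that context-stuffing approach, not Etsy's actual implementation:

```python
# Minimal sketch of prompt-based tuning: model weights stay frozen,
# and task-specific documents are supplied as in-prompt context.
# Document labels and instructions here are assumptions for illustration.

def build_context_prompt(documents: list[str], question: str) -> str:
    """Assemble one prompt that places policy excerpts before the question."""
    context = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, 1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

policy_excerpt = (
    "Etsy pays the corporate card company directly on behalf of the Employee."
)
prompt = build_context_prompt(
    [policy_excerpt], "Who pays the balance on my corporate card?"
)
```

The resulting string would then be sent to the LLM as a single request; the appeal is that swapping in a different set of documents retargets the system without any retraining.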

Etsy then tested the resulting system’s performance on a manually curated set of 40 question-and-answer pairs.

For each question in the test set, they compared the answer generated by the LLM with the answer they had extracted from the relevant policy document to form a judgment of answer quality.
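A tally like that can be sketched as a small grading harness. The token-overlap heuristic below is a stand-in assumption for the human judgment the Etsy engineers actually applied:

```python
# Hedged sketch of grading LLM answers against reference answers.
# The overlap heuristic is illustrative; Etsy's judgments were manual.

def overlap_score(answer: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the answer."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(r) if r else 0.0

def grade(pairs: list[tuple[str, str]], threshold: float = 0.5) -> float:
    """Return the fraction of (answer, reference) pairs judged satisfactory."""
    correct = sum(overlap_score(ans, ref) >= threshold for ans, ref in pairs)
    return correct / len(pairs)

pairs = [
    ("Etsy pays the card company directly.",
     "Etsy pays the corporate card company directly."),
    ("The cardholder pays the balance.",
     "Etsy pays the corporate card company directly."),
]
accuracy = grade(pairs)  # 0.5: one match, one mismatch
```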

The blog's authors estimated that the LLM answered approximately 86% of the questions correctly, or at least satisfactorily enough that there was no need for further prompt maneuvering.

In the remaining 14% of cases, the LLM generated an answer which was either wrong or misleading. For example, the LLM asserted with high confidence that it's the cardholder who is responsible for the balance on a corporate credit card:

Q: Who pays the balance on my corporate card after my expense report is approved?

Correct answer: Etsy pays the corporate card company directly on behalf of the Employee.

LLM answer: The cardholder is responsible for paying the balance on their corporate card after their expense report is approved.

The post goes on to provide other examples and ways they mitigated "hallucinations" (wrong but confidently stated answers) by using "chain-of-thought" prompting to force the model to meet a higher bar of fact-checking.
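A chain-of-thought mitigation along those lines typically makes the model quote its evidence before committing to an answer. The template below is an assumed illustration of the idea, not the prompt Etsy used:

```python
# Illustrative chain-of-thought template: the model must cite the
# policy sentence(s) it relied on before answering, raising the bar
# for confident-but-wrong responses. Wording is an assumption.

COT_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Think step by step:\n"
    "1. Quote the exact sentence(s) from the context that address the question.\n"
    "2. If no sentence applies, answer: 'The policy does not cover this.'\n"
    "3. Otherwise, state the answer supported by the quoted sentence(s).\n"
)

def cot_prompt(context: str, question: str) -> str:
    """Fill the chain-of-thought template for one question."""
    return COT_TEMPLATE.format(context=context, question=question)
```

Because step 1 demands a verbatim quote, an answer like the corporate-card mistake above would have to surface (or fail to surface) supporting policy text, making the error easier to catch.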

But here's where it gets more interesting for sellers - Etsy then decided to expand the pilot to external use cases, testing how the same strategy could be used to mine both seller and employee posts in the Etsy community forum for answers to common seller onboarding questions.

Encouraged by those findings, we decided to expand our investigation to an external use case: question answering based on historical data from Etsy seller community forums.

Unlike the T&E data we’ve just seen, Etsy forums contain publicly available content. Sellers ask questions such as: “What are the most effective ways to optimize my Etsy shop’s SEO to increase visibility and sales?” These questions are then answered by either Etsy or community members.

As a result, compared to the T&E data, Etsy forum data is substantially more complex: much less homogeneous (both in terms of style and scope), with answers by nature much more opinionated. As such, assessing the accuracy of LLM-based question answering is a little trickier for community forums than for the T&E policy.

To note, because forums are a space where sellers engage with and offer information to one another, there may be some divergence between answers provided in the forums and official Etsy policies & practices.

For the purposes of our investigation, we measured only whether the LLM could accurately surface responses based on the input data it was given. A high score would indicate a helpful tool not only at seller onboarding time, but also to help both Etsy and community members routinely search this growing community database in a faster and more accurate fashion.

Interestingly, Etsy took its community forums private early last year, making most areas of the community inaccessible unless you are logged in under an active selling account, which limited visibility for the public but also likely closed off access for more general, publicly available AI tools like ChatGPT that crawl for training data.

Etsy Takes Community Forum Private, Limits API Access To Protect Users From Scams
Etsy is restricting access to community forums over security concerns, but will it shield the company from public scrutiny too?

That could make the community forums particularly valuable for Etsy's AI ambitions, though sellers may not particularly like the idea of their posts being used for AI training.

Similar to the previous case study, Etsy tested performance on a manually curated set of 50 question-and-answer pairs, comparing LLM-generated answers with answers extracted from the relevant Etsy forum to form a judgment of answer quality.

The blog authors estimated the LLM answered approximately 72% of the questions correctly before any prompt maneuvering; in the remaining 28% of cases, the LLM generated an answer that was either wrong or misleading. They said that given the higher heterogeneity of the data, it was not overly surprising the percentage of poor answers doubled.

The forums Q&A use case also helped surface some of the limitations of prompt engineering.

For example, the engineers initially thought that the following case:

Q: Help, where’s the download for 1099-K? There’s NO download button!

A: It is not possible to answer this question without more information. It is unclear what the 1099-K is, what it is used for, and where it can be downloaded from.

...would be readily addressed through more advanced prompt engineering focusing on additional contextual information.

However, adding a contextual preamble did not help:

Q: In the United States, Form 1099-K "Payment Card and Third Party Network Transactions" is a variant of Form 1099 used to report payments received through reportable payment card transactions and/or settlement of third-party payment network transactions. Help, where’s the download for 1099-K? There’s NO download button!

A: The context information provided does not include any information about downloading Form 1099-K. To find the download button for Form 1099-K, you will need to search for it on the website of the Internal Revenue Service (IRS) or other relevant government agency.

The preamble “In the United States, Form 1099-K [...] transactions.” was clearly not sufficient to lead to a useful answer to the original question.

Other times, the LLM punted with the expression: "Without prior knowledge, it is impossible to determine...", which was untrue, since the information was indeed available.

The engineers conclude that, even with the current crop of advanced reasoning models, when complex reasoning is required, chain-of-thought prompting must be carefully set up. In such situations, what seemed to help most was asking for specific source snippets, to determine where in the forums the LLM was looking for answers.
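Asking for source snippets usually means instructing the model to return the exact passage it grounded each answer in, so a wrong answer can be traced back to (or shown to lack) supporting text. The sketch below assumes a simple JSON-style response contract; the field names and wording are illustrative, not Etsy's:

```python
# Sketch of a source-snippet request: the model must return the verbatim
# forum post it relied on alongside its answer, so answers can be audited.
# The JSON contract and field names are assumptions for illustration.

def snippet_prompt(forum_posts: list[str], question: str) -> str:
    """Build a prompt that demands an answer plus its supporting snippet."""
    numbered = "\n".join(f"({i}) {p}" for i, p in enumerate(forum_posts, 1))
    return (
        "Using only the numbered forum posts below, answer the question. "
        "Return JSON with keys 'answer' and 'source_snippet', where "
        "'source_snippet' is the verbatim text of the post you relied on. "
        "If no post answers the question, set 'answer' to 'not found'.\n\n"
        f"Posts:\n{numbered}\n\nQuestion: {question}"
    )
```

An unsupported or fabricated `source_snippet` is easy to detect with a string match against the input posts, which is what makes this pattern useful for pinpointing where the model was (or wasn't) looking.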

The blog post closed out by summarizing some of what they had learned from this pilot program but didn't provide any clues as to how Etsy might look to use it in the future.

But with yesterday's announcement of Etsy's official statement of AI Principles For Keeping Human Connection At The Center Of Innovation, I wouldn't be surprised if a conversational AI help bot makes an appearance in the community forum and beyond at some point in the future.

Etsy Publishes AI Principles For Keeping Human Connection At Center Of Innovation
Etsy publishes statement of AI Principles for Keeping Human Connection At The Center Of Innovation with “human-first, AI-enabled” approach.

What do you think of Etsy's AI answer engine prompting experiments and the potential for seller community posts to be used to train LLMs? Let us know in the comments below!



Liz Morton is a 17-year ecommerce pro turned indie investigative journalist providing ad-free deep dives on eBay, Amazon, Etsy & more, championing sellers & advocating for corporate accountability.
