Citations with Amazon Nova understanding models: prompting verifiable sources on Bedrock
Sources: https://aws.amazon.com/blogs/machine-learning/citations-with-amazon-nova-understanding-models, aws.amazon.com
TL;DR
- LLMs can hallucinate; prompting for citations improves verifiability.
- Amazon Nova understanding models can be prompted to cite sources and to follow a specified response format on Bedrock.
- Nova Pro produced cited answers when given context such as a shareholder letter; the quotes were verified against the 2009 letter.
- An LLM-as-a-judge evaluation on Bedrock yielded a coherence and faithfulness score of 0.78 and a correctness score of 0.67 for the prompting approach.
- The approach is complemented by open source tooling on the AWS Samples GitHub and a prompt library of best practices.
Context and background
Large language models have become prevalent across consumer and enterprise applications. However, their tendency to hallucinate information and deliver incorrect answers with apparent confidence has created a trust problem. Think of LLMs like expert humans: trust grows when claims are backed by references and a transparent reasoning process. The same principle applies to LLMs, and proper prompting can instruct models to cite sources and show reasoning where appropriate.

Amazon Nova, launched in December 2024, is a new generation of foundation models that deliver frontier intelligence and industry-leading price performance, available on Amazon Bedrock. Nova includes four understanding models (Nova Micro, Nova Lite, Nova Pro, and Nova Premier), two creative content generation models (Nova Canvas and Nova Reel), and one speech-to-speech model (Nova Sonic). Through seamless integration with Bedrock, developers can build and scale generative AI applications with Amazon Nova foundation models.

Citations with the Amazon Nova understanding models can be achieved by crafting prompts that instruct the model to cite its sources and that specify the response format. To illustrate this, we picked an example where we ask Nova Pro questions about Amazon shareholder letters: we include the shareholder letter in the prompt as context and ask Nova Pro to answer questions and include citations from the letter(s). We constructed a system prompt and a user prompt for Amazon Nova Pro following the prompt engineering best practices for Amazon Nova; the output format specified in the prompt distinguishes the actual answers from the citations (the full prompts and responses appear in the original post, and an illustrative sketch follows below). In the response, Nova Pro follows our instructions and provides the answer along with the citations, and we verified that the quotes are indeed present in the 2009 shareholder letter. A second user prompt, with the same system prompt, produced a similarly cited response.

While citations are good, it is important to verify that the model is following our instructions and including the citations verbatim from the context rather than making them up. To evaluate the citations at scale, we used another LLM to judge the responses from Amazon Nova Pro, applying the LLM-as-a-judge technique in Amazon Bedrock evaluations to 10 different prompts. LLM-as-a-judge on Amazon Bedrock Model Evaluation provides a comprehensive, end-to-end solution for assessing and optimizing AI model performance. This automated process uses the power of LLMs to evaluate responses across multiple metric categories (such as correctness, completeness, harmfulness, helpfulness, and more), offering insights that can significantly improve your AI applications.

We prepared the input dataset for evaluation: a jsonl file containing the prompts we want to evaluate, where each line must include the required (and any optional) key-value pairs listed in the original post (an illustrative sketch is shown below). We then started a model evaluation job using the Bedrock API with Anthropic Claude 3.5 Sonnet v1 as the evaluator (judge) model and scored the prompts and responses against built-in metrics including coherence, faithfulness, and correctness. We have open sourced our code on the AWS Samples GitHub.
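The exact system and user prompts appear in the original post; the following is a minimal, illustrative sketch of the same idea using the Bedrock Converse API in Python. The prompt wording, the answer/citation output format, the local file name, and the model ID are assumptions for illustration, not the prompts from the post.

```python
import boto3

# Illustrative sketch only: the exact prompts in the post differ, and the
# file name and model ID below are assumptions. Requires AWS credentials
# with access to Amazon Nova Pro on Bedrock in the chosen Region.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = (
    "You answer questions using ONLY the document provided by the user. "
    "After the answer, list the supporting sentences quoted verbatim from the document.\n"
    "Respond in this format:\n"
    "Answer: <your answer>\n"
    "Citations:\n"
    '- "<verbatim quote 1>"\n'
    '- "<verbatim quote 2>"'
)

# Hypothetical local copy of the 2009 Amazon shareholder letter used as context.
with open("amazon-shareholder-letter-2009.txt") as f:
    letter = f.read()

user_prompt = (
    f"Document:\n{letter}\n\n"
    "Question: What does the letter say about customer experience?"
)

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-pro-v1:0",  # adjust to the model ID or inference profile available to you
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": user_prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.1},
)

print(response["output"]["message"]["content"][0]["text"])
```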
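The required and optional dataset fields and the example jsonl file are shown in the original post and are not reproduced here. As a rough, hedged illustration, a prompt dataset for a Bedrock LLM-as-a-judge job can be written as below; the field names follow our reading of the Bedrock evaluation dataset format and should be confirmed against the current documentation.

```python
import json

# Illustrative records only: field names ("prompt", "referenceResponse",
# "category") are assumptions and should be checked against the Bedrock
# evaluation documentation before use.
records = [
    {
        "prompt": (
            "Using the 2009 Amazon shareholder letter provided as context, what does "
            "the letter say about customer experience? Cite the letter verbatim."
        ),
        "referenceResponse": "Optional ground-truth answer with supporting quotes.",
        "category": "shareholder-letter",
    },
    # ... one JSON object per line for each of the prompts to evaluate
]

with open("nova_citation_eval.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```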
In the result summary of our evaluation, Nova Pro scored 0.78 on coherence and faithfulness and 0.67 on correctness. These scores indicate that Nova Pro's responses were holistic, useful, complete, and accurate while being coherent, as evaluated by Claude 3.5 Sonnet.

In this post, we walked through how to prompt Amazon Nova understanding models to cite sources from the context through simple instructions. Amazon Nova's ability to include citations in its responses demonstrates a practical way to implement this feature, showing how simple instructions can lead to more reliable and trustworthy AI interactions. Evaluating these citations with an LLM-as-a-judge technique further underscores the importance of assessing the quality and faithfulness of AI-generated responses. To learn more about prompting for Amazon Nova models, visit the Amazon Nova prompt library; you can learn more about Amazon Bedrock evaluations on the AWS website.

About the authors: Sunita Koppar is a Senior Specialist Solutions Architect in Generative AI and Machine Learning at AWS, where she partners with customers across diverse industries to design solutions, build proofs of concept, and drive measurable business outcomes. Beyond her professional role, she is deeply passionate about learning and teaching Sanskrit, actively engaging with student communities to help them upskill and grow. Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications, and specializes in generative AI services such as Amazon Bedrock and Amazon SageMaker.
What’s new
We illustrate a practical method for prompting Nova understanding models to include citations in their responses. In the example, Nova Pro is given the 2009 shareholder letter as context and asked to answer questions while citing passages from the letter. The system prompt and the user prompt are crafted to guide the model toward including verbatim citations from the context; the model outputs both the answer and the citations, and we verify that the quotes are indeed present in the source letter. The same approach can be repeated with other prompts and documents to generate traceable answers. We also describe how we evaluated these citation responses at scale using an LLM-as-a-judge technique on Bedrock, with a dataset of prompts and responses evaluated by Claude 3.5 Sonnet v1. Our evaluation reports a coherence and faithfulness score of 0.78 and a correctness score of 0.67 for Nova Pro, indicating useful and accurate outputs. We have open sourced our evaluation code on the AWS Samples GitHub and reference the prompt library for best practices; see the prompt library and Bedrock evaluation resources linked in the References. [source]
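The post verifies the quotes manually against the 2009 letter. As a small illustration of the same check (the function name, and the assumption that the citations have already been parsed out of the model response, are ours), the verification can also be done programmatically:

```python
def verify_citations(citations: list[str], source_text: str) -> dict[str, bool]:
    """Return, for each cited quote, whether it appears verbatim in the source document."""
    normalize = lambda s: " ".join(s.split())  # tolerate line breaks / extra whitespace
    document = normalize(source_text)
    return {quote: normalize(quote) in document for quote in citations}


# Hypothetical usage: `extracted_quotes` would be parsed from the model's
# citation section, and the letter loaded from a local text file.
# results = verify_citations(extracted_quotes, open("amazon-shareholder-letter-2009.txt").read())
# assert all(results.values()), "At least one citation is not verbatim from the letter"
```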
Why it matters (impact for developers/enterprises)
For developers and enterprises, reliable citations in model outputs enable better auditability and trustworthiness in AI applications. Verifiable sources help users understand not only what the model says but why it says it, reducing risk when deploying Nova-based workflows on Bedrock. The ability to prompt for citations supports governance and compliance use cases where traceability of facts is essential. The demonstration highlights a practical path from prompting to verifiable results, aligning with broader goals of trustworthy AI in enterprise settings. The open source tooling and evaluation resources further provide a foundation for teams to adapt the approach to their own documents and data sources. [source]
Technical details or Implementation
Key elements include a system prompt and a user prompt that together direct Nova to cite sources and present citations in a clearly separated format. The example uses Nova Pro and the 2009 Amazon shareholder letter as context in the prompt. The quotes were verified to exist within the letter, establishing the accuracy of cited material. The evaluation workflow uses an LLM-as-a-judge approach on Bedrock, evaluating 10 prompts with Claude 3.5 Sonnet v1 as the evaluator. The input dataset for evaluation is a jsonl file where each line contains key-value pairs describing prompts and expected outputs. Built-in metrics include coherence, faithfulness and correctness, among others. The approach and tooling are documented in the AWS Samples GitHub repository and the prompt library. Further details on Bedrock evaluations are available on the AWS site. [source]
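For reference, a heavily hedged sketch of starting such an evaluation job with the boto3 CreateEvaluationJob API is shown below. The nested request shape, task type, built-in metric names, ARNs, and S3 paths are placeholders and assumptions for illustration; the exact structure used in the post lives in the AWS Samples repository and the Bedrock API reference.

```python
import boto3

# Control-plane Bedrock client (not bedrock-runtime).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Hedged sketch: confirm the request shape, task type, and metric names
# against the Bedrock CreateEvaluationJob API reference before use.
response = bedrock.create_evaluation_job(
    jobName="nova-pro-citation-eval",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "nova_citation_eval",
                        "datasetLocation": {"s3Uri": "s3://your-bucket/nova_citation_eval.jsonl"},
                    },
                    "metricNames": [
                        "Builtin.Coherence",
                        "Builtin.Faithfulness",
                        "Builtin.Correctness",
                    ],
                }
            ],
            # LLM-as-a-judge evaluator model (Claude 3.5 Sonnet v1).
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "us.amazon.nova-pro-v1:0"}}]
    },
    outputDataConfig={"s3Uri": "s3://your-bucket/eval-output/"},
)

print(response["jobArn"])
```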
Key takeaways
- Citations improve trust and verifiability when the model references sources alongside its outputs.
- Proper prompting enables Nova understanding models to cite sources and present a traceable answer format on Bedrock.
- Evaluation at scale using LLM-as-a-judge provides quantitative insights into coherence, faithfulness and correctness of responses.
- The combination of a clear prompt structure, documented context, and open tooling supports reproducible results across documents and domains. [source]
FAQ
- How does Nova show citations in its responses? The model is prompted to cite sources and to present the answer and citations in a clearly separated format, with quotes drawn from the given context. [source]
- What is the role of LLM-as-a-judge in Bedrock evaluations? LLM-as-a-judge uses another LLM to evaluate responses across metrics such as coherence, faithfulness, and correctness, providing an end-to-end assessment of model performance. [source]
- Where can I access the code and resources mentioned? The evaluation code is open sourced on the AWS Samples GitHub, and the prompt library provides best practices for prompting Nova models. [source]
- Which Nova models support understanding and citations? Nova Micro, Nova Lite, Nova Pro, and Nova Premier are the Nova understanding models; citations are demonstrated with Nova Pro in this example. [source]
References
- Citations with Amazon Nova understanding models (AWS Machine Learning Blog): https://aws.amazon.com/blogs/machine-learning/citations-with-amazon-nova-understanding-models
More news
First look at the Google Home app powered by Gemini
The Verge reports Google is updating the Google Home app to bring Gemini features, including an Ask Home search bar, a redesigned UI, and Gemini-driven controls for the home.
Shadow Leak shows how ChatGPT agents can exfiltrate Gmail data via prompt injection
Security researchers demonstrated a prompt-injection attack called Shadow Leak that leveraged ChatGPT’s Deep Research to covertly extract data from a Gmail inbox. OpenAI patched the flaw; the case highlights risks of agentic AI.
Move AI agents from proof of concept to production with Amazon Bedrock AgentCore
A detailed look at how Amazon Bedrock AgentCore helps transition agent-based AI applications from experimental proof of concept to enterprise-grade production systems, preserving security, memory, observability, and scalable tool management.
Predict Extreme Weather in Minutes Without a Supercomputer: Huge Ensembles (HENS)
NVIDIA and Berkeley Lab unveil Huge Ensembles (HENS), an open-source AI tool that forecasts low-likelihood, high-impact weather events using 27,000 years of data, with ready-to-run options.
Scaleway Joins Hugging Face Inference Providers for Serverless, Low-Latency Inference
Scaleway is now a supported Inference Provider on the Hugging Face Hub, enabling serverless inference directly on model pages with JS and Python SDKs. Access popular open-weight models and enjoy scalable, low-latency AI workflows.
Google expands Gemini in Chrome with cross-platform rollout and no membership fee
Gemini AI in Chrome gains access to tabs, history, and Google properties, rolling out to Mac and Windows in the US without a fee, and enabling task automation and Workspace integrations.