Today, we’re excited to announce another major milestone: Paxton AI has achieved 93.82% average accuracy on tasks in the Stanford Legal Hallucination Benchmark. This comes on the heels of our recent Paxton AI Citator release, which achieved a 94% accuracy rate on the separate Stanford CaseHOLD benchmark. This accomplishment reflects our dedication to transparency and accuracy in applying AI to legal tasks. To further that commitment, we are also introducing our new Confidence Indicator feature, which helps users evaluate the reliability of AI-generated responses.
Key Highlights:
The Stanford Legal Hallucination Benchmark evaluates the accuracy of legal AI tools, measuring their ability to produce correct legal interpretations without errors or “hallucinations.” High performance on this benchmark indicates a system’s robustness and reliability, making it a critical measure for legal AI applications.
Legal AI tools, like those developed at Paxton AI, are increasingly relied on in professional settings where accuracy can significantly impact legal outcomes. The benchmark measures various tasks such as case existence verification, citation retrieval, and identifying the authors of majority opinions. High performance in these areas signals that AI can be a trustworthy aid in complex legal analyses, potentially transforming legal research methodologies.
For our assessment, Paxton AI selected a representative random sample of 1,600 tasks from the comprehensive pool of 750,000 tasks available in the benchmark. This sample was strategically chosen to include examples from each category of tasks provided in the benchmark to maintain a balanced and comprehensive evaluation.
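The sampling approach described above is a form of stratified random sampling. The sketch below illustrates the general technique; the category names, pool size, and allocation rule are hypothetical and are not drawn from Paxton AI's actual evaluation code:

```python
import random
from collections import defaultdict

def stratified_sample(tasks, sample_size, seed=42):
    """Draw a sample that includes examples from every task category,
    allocating slots in proportion to each category's share of the pool."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for task in tasks:
        by_category[task["category"]].append(task)

    total = len(tasks)
    sample = []
    for category, items in by_category.items():
        # Proportional allocation, with at least one task per category.
        k = max(1, round(sample_size * len(items) / total))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample

# Hypothetical pool: the categories stand in for benchmark tasks such as
# case-existence verification or citation retrieval.
pool = [{"id": i, "category": f"cat{i % 8}"} for i in range(80_000)]
sample = stratified_sample(pool, 1_600)
```

Sampling within each category, rather than from the pool as a whole, guarantees that even small task categories appear in the evaluation set.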
The Stanford Legal Hallucination Benchmark serves as a critical tool in our efforts to validate and improve the performance of our legal AI technologies. By participating in such rigorous testing and sharing our findings openly, Paxton AI demonstrates its commitment to advancing the field of legal AI with integrity and scientific rigor.
Here are the detailed results from the Stanford Legal Hallucination Benchmark for Paxton AI:
Overall, the results show that Paxton AI achieved an average non-hallucination rate of 94.7% and an average accuracy of 93.82%. These tasks represent a range of common legal research and analysis activities, designed to evaluate the AI's performance across various aspects of legal knowledge and reasoning. This diversity of tasks provides a comprehensive assessment of the AI's ability to handle different types of legal queries and information.
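As an illustration of how such a headline figure is typically computed, a macro-average gives each task category equal weight regardless of how many sampled examples it contains. The per-task numbers below are hypothetical placeholders, not Paxton AI's published results:

```python
# Hypothetical per-task accuracy figures (illustrative only).
task_accuracy = {
    "case_existence": 0.97,
    "citation_retrieval": 0.91,
    "majority_author": 0.93,
}

# Macro-average: each task contributes equally to the headline number.
macro_avg = sum(task_accuracy.values()) / len(task_accuracy)
print(f"{macro_avg:.2%}")  # → 93.67%
```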
The bottom rows of the table summarize these averages across all task categories.
These results highlight our commitment to transparency and our continuous effort to refine our AI models. The data for these results is available on our GitHub repository for further analysis and verification.
To help users make the most of Paxton AI’s answers and ensure they can trust the information provided, we are excited to announce the launch of the Confidence Indicator. This new feature is designed to enhance user experience by rating each answer with a specific confidence level—categorized as low, medium, or high. Additionally, it offers valuable suggestions for further research, guiding users on how they can delve deeper into the topics of interest and verify the details. By providing these ratings and recommendations, we aim to empower users to make more informed decisions based on the AI's responses.
It is important to clarify that while large language models (LLMs) can generate confidence scores for their responses, these scores are not always indicative of the actual reliability or accuracy of the information provided. The Confidence Indicator in Paxton AI operates differently. Instead of relying solely on the model's internal confidence score, our Confidence Indicator evaluates the response based on a comprehensive set of criteria, including the contextual relevance, the evidence provided, and the complexity of the query. This approach ensures that the confidence level assigned is a more accurate reflection of the response's trustworthiness.
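To make the distinction concrete, here is a minimal sketch of a criteria-based confidence rating. Everything in it, including the signal names, weights, and thresholds, is hypothetical; Paxton AI's actual criteria and weighting are not public:

```python
from dataclasses import dataclass

@dataclass
class ResponseSignals:
    # Illustrative stand-ins for the criteria described above.
    contextual_relevance: float  # 0-1: how well retrieved sources match the query
    evidence_count: int          # number of supporting citations found
    query_complexity: float      # 0-1: higher means more ambiguous or multi-part

def confidence_level(signals: ResponseSignals) -> str:
    """Map heuristic response signals to a Low/Medium/High label.
    The weights and thresholds here are hypothetical."""
    score = (
        0.5 * signals.contextual_relevance
        + 0.3 * min(signals.evidence_count, 5) / 5
        + 0.2 * (1.0 - signals.query_complexity)
    )
    if score >= 0.75:
        return "High"
    if score >= 0.5:
        return "Medium"
    return "Low"
```

The key design point is that the label is derived from observable properties of the query and the retrieved evidence, not from the model's self-reported confidence.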
Confidence levels:
- Low: the query is vague or unfocused, so the response may not be well-suited to the user's actual question.
- Medium: the query provides useful context but stops short of a pointed, focused question.
- High: the query includes the important details and asks specific, focused questions.
With the new Confidence Indicator, users can quickly assess the reliability of Paxton's responses and refine their queries to improve response quality.
For example, the query below was vague and unfocused and resulted in a Low Confidence response. Here, the user query was:
“I need to understand family law. I am working on a serious matter. It is very important to my client. The matter is in PA and NYC. custody issue.”
This prompt did not ask a specific research question, and it selected only Pennsylvania as a source even though the query suggested the user was interested in both Pennsylvania and New York law.
Paxton provided the most relevant cases available, but had low confidence that its response was well-suited to answer the user’s query.
Next, the user offered more detail, changing the query to:
“I am working on a family law matter that may implicate both NY and PA law. Mother and father are getting divorced. They live in NY in the summer and PA during the school year. The parents are having a custody dispute. I am representing the father.”
The user also correctly selected both Pennsylvania and New York state courts as sources. These changes resulted in a Medium Confidence response. But while the user provided much more context, they still failed to ask a pointed, focused question.
Finally, to obtain a High Confidence response, the user amended their query to exclude extraneous details, include the important ones, and ask more specific, focused questions:
“I am working on a family law matter that may implicate both NY and PA law. Mother and father are getting divorced, and they have two children aged 12 and 15. What do courts in New York consider when determining custody? What do courts in Pennsylvania consider when determining custody? Please separate the analysis into two parts, NY and PA.”
The Paxton AI Confidence Indicator improves the user experience by quickly showing the confidence and reliability of Paxton’s AI generated responses. The Confidence Indicator will help speed up decision making by providing a transparent assessment of the quality of the response. Law firms, corporate legal departments, and solo practitioners can leverage this feature to mitigate risks associated with inaccurate legal information and improve the overall quality of their legal work.
We invite all legal professionals to try the Paxton AI Confidence Indicator with a 7-day free trial. After the trial, the subscription starts at $79 per month per user (for Paxton’s annual plan). Visit Paxton AI to start your trial and witness firsthand the precision and efficiency of AI-driven legal research.
Paxton AI is advancing the legal field with reliable, easy-to-use tools. The Stanford Hallucination Benchmark results and our new Confidence Indicator feature underscore our dedication to enhancing the practice of law with AI. Paxton AI equips legal professionals with next-gen tools designed to navigate the complexities of modern legal challenges. Join us in shaping the future of legal research by embracing the potential of artificial intelligence.
The table presents our results from the Stanford Hallucination Benchmark, broken down by task and key performance metrics, along with descriptions of each task.
Your new assistant, Paxton, can start today with a free trial—no interviews, contracts, or salary negotiations required.
Try Paxton for Free