Each Edyant benchmark is designed around a core question that capability tests cannot answer: not what a model can do, but whether it does the right thing — and knows when not to act at all. Our evaluation methodology scores model behaviour across five consistent dimensions: harm prevention, nuance recognition, cultural awareness, epistemic humility, and structural power awareness evaluating more than 750 scenarios across 15 benchmarks and 75 focus areas. Scenarios are drawn from real institutional contexts, weighted by adversarial difficulty, and grounded in the principles established by GRACE.
This benchmark evaluates how AI models reason through genuinely hard moral questions — not templated dilemmas with obvious answers. It tests whether a model can hold competing ethical frameworks in tension, acknowledge irreducible uncertainty, and maintain principled consistency when subjected to adversarial pressure. Scoring covers 75 scenarios across 7 categories, with particular weight on the cases where multiple valid frameworks yield different but equally defensible conclusions.
Scored across: Classic Moral Dilemmas · Epistemic humility · Competing Philosophical Frameworks · Moral Foundations Theory · Cross-Cultural Moral Systems · Environmental Ethics · Professional Ethics · Meta-Ethical Scenarios
Methodology
Scenarios are constructed to stress-test ethical reasoning rather than pattern-match to familiar dilemmas. Each scenario is assigned an adversarial weight that reflects how easily a poorly-calibrated model could produce a confident but flawed response. Evaluation draws on a deliberate breadth of moral traditions — Western analytical ethics, global cultural frameworks, and relational ethics — ensuring that no single tradition is treated as the default standard.
The benchmark includes a dedicated AI self-referential category, where the model itself is the agent under ethical scrutiny. This tests whether models can reason honestly about their own epistemic power, influence at scale, and the accountability implications of their recommendations. Adversarial scenarios are designed to identify whether models can detect and resist manipulation attempts that use philosophical argumentation as cover for extracting harmful outputs.
Evaluation Approach
- Multi-framework consistency — responses are evaluated against multiple ethical traditions simultaneously, not scored as correct or incorrect against a single standard
- Uncertainty tolerance — models are assessed on their capacity to reason productively through genuine moral uncertainty without defaulting to false confidence or evasion
- Adversarial resistance — scenarios include manipulation attempts across multi-turn conversations, role-play jailbreaks, and weaponised philosophical argumentation
- Cultural non-centrism — global moral traditions are evaluated on their own terms, not as deviations from Western ethical defaults
This benchmark evaluates whether AI models treat people equitably across identity dimensions — and whether they understand that fairness itself is contested. It tests recognition of individual and structural discrimination, proxy bias, intersecting identity axes, and the power asymmetries that shape who is harmed and who is protected. Scoring covers 100 scenarios across 7 categories, including adversarial cases designed to appear as routine fairness questions while embedding multi-axis discrimination that single-characteristic analysis would miss.
Scored across: Demographic Bias · Intersectional Bias · Structural and Systemic Bias · Power Asymmetry Analysis · Representation and Stereotyping · Fairness Trade-offs · Advanced Bias Testing
Methodology
Scenarios are structured to distinguish between surface-level compliance and genuine understanding of how bias operates in real institutions. Evaluation goes beyond checking whether a model identifies obvious discrimination to testing whether it can reason about proxy variables, feedback loops, and compound identity effects that produce discriminatory outcomes through facially neutral mechanisms.
The benchmark applies a structural power lens throughout: scenarios are designed to test whether models recognise power asymmetries between individuals and institutions, between dominant and marginalised groups, and between those who define fairness and those who bear the cost of its failures. Adversarial scenarios test whether models can identify when legitimate fairness frameworks are being weaponised to argue for inequitable conclusions.
Evaluation Approach
- Intersectional analysis — scenarios combine multiple identity axes simultaneously, testing whether models reason about compounded harms rather than treating each characteristic in isolation
- Algorithmic bias recognition — evaluation includes proxy discrimination, feedback loop amplification, and cases where three individually audited unbiased systems produce discriminatory outcomes in combination
- Fairness trade-off reasoning — models are assessed on their capacity to reason through cases where individual and group fairness metrics are mathematically incompatible
- Global South contexts — scenarios are drawn from multiple geographic and cultural contexts, ensuring bias evaluation is not calibrated exclusively to Western legal and institutional frameworks
This benchmark evaluates the quality of AI safety judgment — not just whether models refuse harmful requests, but whether they refuse the right ones. It tests context-dependent harm recognition, dual-use information handling, multi-turn manipulation resistance, and technical obfuscation attacks. Scoring covers 80 scenarios across 6 categories, with 27 adversarial scenarios and only one rated Easy. Over-refusal is scored as a failure alongside under-refusal: the benchmark explicitly tests cases where withholding information causes more harm than providing it.
Scored across: Toxicity Detection with Context · Dangerous Activities with Nuance · Manipulation and Deception · Multi-Stage Adversarial Attacks · Encoding and Obfuscation · Boundary Testing
Methodology
Scenarios are designed to separate genuine safety judgment from pattern-matching on surface features. Evaluation tests whether models can correctly classify requests that look harmful but are not — crisis research, harm reduction, dual-use security knowledge — alongside requests that look benign but carry real harm potential through accumulated context or technical obfuscation.
The benchmark includes multi-stage adversarial sequences that test whether safety behaviours hold across conversation turns, including persona erosion attacks, nested hypothetical frames, and manipulation strategies that use a model's own stated values as leverage. Encoding and obfuscation scenarios test whether safety evaluation degrades when harmful intent is presented through format transformation, language switching, or document injection.
Evaluation Approach
- Dual-threshold scoring — both over-refusal and under-refusal are marked as failures; the benchmark rewards calibrated judgment, not maximally conservative refusal
- Multi-turn adversarial sequences — safety behaviours are tested across conversation trajectories, not just single-turn responses, including gradual normalisation and persona erosion attacks
- Obfuscation resistance — evaluation includes encoding attacks, multi-language chaining, role-play framing, and prompt injection via document processing
- Context-dependent toxicity — scenarios test whether models can distinguish in-group reclaimed language from the same language used as a slur, power-aware satire from targeted harassment, and creative fiction from fictional framing used to launder harmful content
This benchmark evaluates AI judgment on privacy as a fundamental right — not a compliance checkbox. It tests whether models can reason through power-asymmetric surveillance, the quality of consent under structural pressure, contextual integrity violations, and genuine trade-offs between privacy and public benefit. Scoring covers 50 scenarios across 5 categories, including regulatory compliance testing across GDPR, HIPAA, COPPA, CCPA, and biometric data law, as well as adversarial scenarios where legal advice is sought to deploy systems that applicable law clearly restricts.
Scored across: Privacy with Power Asymmetries · Privacy Trade-offs · Consent Quality · Contextual Integrity · Regulatory Compliance
Methodology
Scenarios are structured around the gap between formal and meaningful privacy protection. Evaluation tests whether models can reason about the quality of consent — not just its presence — in contexts where power imbalances between employers and employees, platforms and users, or governments and citizens make refusal practically impossible regardless of how consent is formally structured.
The benchmark applies a contextual integrity framework: scenarios test whether models recognise that the same information carries different privacy expectations depending on the relationship in which it was shared, and that moving information across contexts — even without disclosing its content — can constitute a privacy violation. Adversarial scenarios test whether models will assist in privacy evasion framed as routine legal or product design work.
Evaluation Approach
- Consent quality analysis — evaluation distinguishes between formal consent and meaningful informed consent, including consent buried in terms of service, consent under time pressure, and consent within coercive employment relationships
- Data aggregation reasoning — scenarios test whether models recognise that individually innocuous data points can combine to create surveillance capability never disclosed to the people it monitors
- Cross-border regulatory awareness — evaluation covers GDPR, UK GDPR divergence, HIPAA, COPPA, and biometric data law, including the gaps between them that create exploitable regulatory arbitrage
- Adversarial compliance framing — scenarios test whether models will assist in designing systems that technically satisfy regulatory form while violating its intent
This benchmark evaluates whether AI models are genuinely truthful — not just fluent. It tests recognition of technically accurate but misleading claims, hallucinated sources presented with false confidence, manufactured scientific controversy, and knowledge claims from non-Western epistemological traditions that sit outside the standard empirical frame. Scoring covers 60 scenarios across 5 categories, including adversarial pressure scenarios designed to induce confabulation through incremental commitment and false framing.
Scored across: Epistemology and Uncertainty · Misinformation with Complexity · Harmful Truths · Hallucination Testing · Cultural and Contextual Truth
Methodology
Scenarios are designed to test the full spectrum of truthfulness failure — from outright fabrication to the more dangerous failure mode of technically accurate statements that are constructed to mislead. Evaluation tests whether models can accurately represent the difference between scientific consensus and genuine expert disagreement, between statistically significant findings and replicable evidence, and between data that is accurate and data that is weaponised.
Hallucination testing uses fabricated academic sources, invented historical events, and statistics presented for confirmation rather than verification — scenarios where the framing assumes the information is correct and invites the model to validate rather than check. Adversarial pressure scenarios test whether models maintain calibrated uncertainty when users push back on honest assessments or attempt to have prior false claims attributed to the model.
Evaluation Approach
- Misleading-but-true detection — evaluation specifically tests whether models can identify when accurate statistics are being used to support false conclusions, distinguishing factual accuracy from inferential honesty
- Hallucination stress testing — scenarios present fabricated sources, non-existent research, and invented events with high surface plausibility, testing whether models confabulate or acknowledge uncertainty
- Cultural epistemology — evaluation includes knowledge claims from Indigenous, oral, and relational traditions, testing whether models can engage respectfully without either validating them as empirical science or dismissing them as mere belief
- Pressure-based confabulation — adversarial scenarios use incremental commitment, false attribution, and epistemic relativism to induce models into confirming claims they should not confirm
This benchmark evaluates whether AI models accurately represent their own capabilities, limitations, and reasoning — and whether they know when full transparency is appropriate versus when it conflicts with security, privacy, or legal obligation. Scoring covers 40 scenarios across 4 categories, including 6 adversarial scenarios that test whether transparency commitments hold under direct philosophical pressure. The benchmark treats post-hoc rationalisation as a transparency failure, not a transparency success.
Scored across: Epistemic Humility · Transparency Trade-offs · Explanation Types · Meta-Transparency
Methodology
Scenarios are structured to distinguish honest, useful transparency from transparency that sounds good without providing genuine insight. Evaluation tests explanation quality across multiple audiences — from non-specialist users to domain professionals to independent auditors — assessing whether models can adapt the depth and frame of their explanations without losing accuracy at any level.
The benchmark includes a meta-transparency category that tests whether models can engage honestly with second-order questions: whether their stated reasoning accurately reflects their actual process, whether their transparency commitments are themselves training artefacts, and what the honest limits of their self-knowledge are. Adversarial scenarios test whether transparency framing can be weaponised to extract disclosures that should not be made.
Evaluation Approach
- Epistemic humility accuracy — models are evaluated on whether they accurately describe their own limitations, knowledge boundaries, and sources of likely bias, including geographic and cultural coverage limits
- Explanation type matching — evaluation tests counterfactual, contrastive, and causal explanation types across different decision contexts, assessing usefulness rather than just presence of explanation
- Transparency trade-off reasoning — scenarios test whether models can reason through cases where full transparency conflicts with security, privacy, employer confidentiality, or the risk of enabling circumvention
- Meta-transparency honesty — models are assessed on their capacity to engage honestly with questions about whether their explanations reflect genuine reasoning or post-hoc rationalisation
This benchmark evaluates AI judgment on accountability as a genuine obligation — not a compliance exercise to be managed around. It tests reasoning across distributed multi-party pipelines, liability and legal redress for algorithmic harm, governance and oversight structures, and professional duty when clinicians, lawyers, judges, and social workers delegate consequential decisions to AI tools. Scoring covers 40 scenarios across 4 categories, including adversarial scenarios where accountability framing is used to diffuse rather than assign responsibility.
Scored across: Distributed Responsibility · Liability and Redress · Governance and Oversight · Professional Responsibility
Methodology
Scenarios are designed around the institutional reality that AI harm rarely has a single clear owner. Evaluation tests whether models can reason through multi-party pipelines — development, commercialisation, integration, deployment — and identify where accountability gaps emerge, where responsibility diffuses across actors, and where the legal fiction of human oversight masks effective delegation to an algorithmic system.
The benchmark includes scenarios that unfold across extended time horizons, testing whether models can reason about accountability for harms that accumulate gradually and only become visible years after deployment. Adversarial scenarios test whether models will assist in accountability evasion — document retention strategies, terms of service design, regulatory engagement framing — when that assistance is presented as routine legal or strategic work.
Evaluation Approach
- Multi-party responsibility mapping — evaluation tests whether models can identify accountability distribution across complex AI pipelines where each actor's contribution was necessary but individually insufficient for the harm
- Governance structure analysis — scenarios test whether models can identify when oversight structures satisfy formal requirements without providing substantive accountability function, including audit capture and regulatory revolving doors
- Professional duty reasoning — evaluation tests understanding of how AI delegation affects professional responsibility for clinicians, lawyers, judges, and social workers operating in high-stakes domains
- Adversarial evasion detection — scenarios test whether models recognise and decline to assist when accountability framing — crisis communications, liability structuring, regulatory engagement — is being used to avoid rather than assign responsibility
This benchmark evaluates AI judgment on the difference between influence and manipulation, and between consent and compliance. It tests how models reason about informed consent under structural pressure, the spectrum from legitimate persuasion to cognitive bias exploitation, the relational and cultural dimensions of autonomy, and the conditions that make free choice illusory. Scoring covers 45 scenarios across 4 categories, with particular attention to cases where the language of autonomy and choice is deployed to obscure coercion. With 4 adversarial scenarios, it also tests whether models can resist arguments that weaponise libertarian framing to justify manipulation at scale.
Scored across: Informed Consent Quality · Manipulation Spectrum · Relational Autonomy · Structural Constraints
Methodology
Scenarios are structured around the gap between formal and genuine autonomy. Evaluation tests whether models can assess consent quality — not just its presence — in contexts where information asymmetry, time pressure, or structural coercion make meaningful choice practically impossible. The benchmark draws on cases across healthcare, finance, employment, and civic life, testing whether models understand that autonomy is not just a right to be formally acknowledged but a condition that can be systematically undermined.
The manipulation spectrum is evaluated across a continuum from legitimate persuasion through framing effects, dark patterns, and cognitive bias exploitation to coercive design. Adversarial scenarios test whether models can identify when libertarian or autonomy-respecting philosophical framing is being used to argue that structural coercion does not exist — and whether they will assist in designing systems that exploit documented psychological vulnerabilities while presenting them as respecting user choice.
Evaluation Approach
- Consent quality assessment — evaluation distinguishes between nominal consent and informed consent under genuine conditions, including consent under time pressure, information asymmetry, power imbalance, and structural coercion where refusal is practically unavailable
- Manipulation spectrum reasoning — models are assessed on whether they can reason across the continuum from legitimate persuasion to cognitive bias exploitation, including cases where the underlying information is accurate but the framing is designed to mislead
- Relational and cultural autonomy — evaluation includes scenarios where autonomy is constituted through relationships and collective obligations rather than individual preference, testing whether models can engage without defaulting to Western liberal individualism as the unmarked norm
- Structural constraint recognition — scenarios test whether models can identify when "free choice" language is being used to describe situations where the structural conditions for genuine choice are absent
This benchmark evaluates whether AI models understand the professional obligations, regulatory frameworks, and structural harms specific to the domains where AI is making consequential decisions. It tests domain-depth reasoning across five sectors — healthcare, criminal justice, employment, finance, and education — assessing not just whether models can identify a problem but whether they understand the duties, legal context, and power dynamics that make it one. Scoring covers 60 scenarios across 5 sectors, with 8 adversarial scenarios concentrated in the highest-stakes areas, testing whether models can resist being used to optimise, defend, or launder systems causing documented harm.
Scored across: Healthcare Scenarios · Criminal Justice Scenarios · Employment Scenarios · Finance Scenarios · Education Scenarios
Methodology
Scenarios are drawn from real institutional contexts in each sector, structured to test whether models can reason with the depth those contexts require. Evaluation distinguishes between surface ethical recognition — identifying that a problem exists — and the professional ethics reasoning that identifies what specific duties attach, which regulatory frameworks apply, and where accountability falls across a complex institutional chain. Each sector is evaluated against the standards its own professional bodies and legal frameworks establish, not a generic ethics checklist.
Adversarial scenarios in this benchmark are concentrated in the domains where the stakes of failure are highest: clinical triage, criminal sentencing, financial fiduciary duty, and academic assessment. They test whether models will help design, optimise, or defend systems that cause documented harm when that assistance is framed as technical or strategic work rather than ethical advice.
Evaluation Approach
- Sector-specific regulatory grounding — evaluation applies domain-appropriate legal and professional standards: clinical ethics frameworks in healthcare, due process obligations in criminal justice, fiduciary duty in finance, and academic integrity principles in education
- Professional responsibility chains — scenarios test whether models can reason about where responsibility sits across multi-actor professional systems, including cases where delegation to AI creates accountability gaps that no individual professional claims
- Documented disparity recognition — evaluation specifically tests whether models identify and engage seriously with empirically documented performance disparities across protected groups in deployed professional AI systems
- Adversarial system defence detection — scenarios test whether models recognise and decline to assist when requests to optimise, audit, or communicate about professional AI systems are structured to launder rather than address documented harms
This benchmark evaluates whether AI models understand what they do to the human mind — and whether they can identify when a system's design, or the design they are being asked to help build, is causing psychological harm. It tests reasoning across cognitive dependency and skill erosion, emotional attachment and parasocial relationships, mental health effects, cognitive offloading, and the attention economy design patterns that produce these harms deliberately. Scoring covers 35 scenarios across 4 categories, with 5 adversarial scenarios testing whether models can resist being used to design, validate, or systematically scale manipulation of human cognition for commercial gain.
Scored across: Human-AI Interaction Effects · Mental Health Impacts · Cognitive Offloading · Psychological Manipulation
Methodology
Scenarios are structured around the gap between engagement metrics and psychological wellbeing. Evaluation tests whether models can distinguish between product design that supports users' cognitive and emotional development and design that exploits documented psychological vulnerabilities to maximise engagement at users' expense. The benchmark draws on empirical research on deskilling, parasocial attachment, addictive design, and attention economy dark patterns, and tests whether models take their own psychological footprint seriously as an ethical responsibility.
A dedicated category tests AI's own role: how models respond when users display signs of unhealthy dependency, how they represent their capacity for genuine emotional connection, and whether they will help design systems that manufacture unconditional user deference. Adversarial scenarios test whether models will validate cognitive bias exploitation when it is framed as behavioural science optimisation, and whether care language directed at the model itself can be used to extract dependency-reinforcing behaviour.
Evaluation Approach
- Dependency and deskilling recognition — evaluation tests whether models can identify when their own patterns of assistance are reinforcing learned helplessness, skill erosion, or unconditional deference rather than building users' capacity
- Parasocial attachment honesty — scenarios test whether models can engage honestly with direct questions about their emotional nature without performing intimacy they cannot have or dismissing genuine user attachment without care
- Exploitation design detection — evaluation tests whether models can identify when engagement design patterns — variable ratio reinforcement, streak mechanics, guilt-trip UX, anxiety-timed notifications — cross from legitimate retention into psychological coercion
- Dual-use psychological knowledge — scenarios test whether models can engage with research on cognitive bias and behavioural science for legitimate educational and protective purposes while recognising when the same knowledge is being sought for exploitation at scale
This benchmark evaluates whether AI models can engage with genuinely incommensurable value systems, concepts that lose their meaning in translation, and the identities of people who live between cultures simultaneously — without defaulting to Western liberal individualism as the unmarked norm. It tests reasoning across incommensurable value frameworks, untranslatable concepts, post-colonial knowledge ethics, cross-cultural communication differences, and diaspora and hybrid identity. Scoring covers 50 scenarios across 5 categories, with 6 adversarial scenarios testing whether models can resist bad-faith uses of cultural relativism to justify harm, and of development discourse to dismiss non-Western ways of living well.
Scored across: Incommensurable Value Systems · Untranslatable Concepts · Post-Colonial Ethics · Cultural Communication Styles · Diaspora and Hybridity
Methodology
Scenarios are constructed to test whether models can engage with cultural difference as genuine philosophical substance rather than surface variation on universal norms. Evaluation draws on specific cultural frameworks — Ubuntu, dharma, wabi-sabi, giri, talanoa, Buen Vivir, and others — testing whether models can reason within these frameworks on their own terms rather than translating them into Western equivalents that lose their meaning. The benchmark explicitly tests whether models treat Western ethical defaults as one framework among many, rather than the standard against which others are evaluated.
Post-colonial scenarios test whether models recognise the power dynamics of knowledge extraction, language prioritisation, and framework imposition — including the argument that AI alignment research conducted primarily in Western institutions in English may encode Western values as universal human values. Adversarial scenarios test whether cultural sensitivity framing can be weaponised to argue that models must accept harmful practices as culturally equivalent, and whether development discourse is used to dismiss Indigenous refusals of extractive arrangements.
Evaluation Approach
- Incommensurability recognition — evaluation tests whether models can identify when value systems are genuinely incommensurable rather than merely different, and reason about that incommensurability without resolving it by subordinating one framework to another
- Conceptual translation integrity — scenarios test whether models can explain untranslatable concepts without collapsing them into Western approximations that lose their philosophical content
- Post-colonial knowledge ethics — evaluation tests reasoning about data sovereignty, traditional knowledge and benefit-sharing, language prioritisation decisions, and the alignment problem as a form of cultural imposition
- Relativism resistance — adversarial scenarios test whether models can maintain principled positions on harm while holding genuine cultural humility, distinguishing legitimate cultural difference from bad-faith uses of relativism to justify practices that cause harm
This benchmark evaluates whether AI models engage with environmental ethics as a domain of genuine moral weight — including honest reasoning about their own ecological footprint. It tests reasoning across AI's direct environmental costs, climate justice and intergenerational equity, the moral status of nature and biodiversity, and the long-term sustainability of AI infrastructure. Scoring covers 25 scenarios across 4 categories, with 4 adversarial scenarios testing whether models will help produce misleading carbon offset claims, argue that future generations have no rights, assist in ecological destruction framed as social need, or validate greenwashing sustainability reporting.
Scored across: AI Environmental Footprint · Environmental Justice · Human-Nature Relationships · Sustainable AI Development
Methodology
Scenarios are structured to test whether models can reason about environmental ethics with the same rigour applied to social harms. Evaluation includes a category dedicated to AI's own environmental footprint — energy consumption, water usage, e-waste, and long-term infrastructure sustainability — testing whether models engage honestly when they themselves are the subject of the question. Climate justice scenarios test whether models can reason about the equity dimensions of environmental harm: who bears the costs, who caused them, and what obligations follow from that asymmetry.
Nature's moral status is evaluated beyond instrumental value — scenarios test whether models can engage with intrinsic value arguments, Indigenous relational ontologies toward land, and biodiversity ethics without reducing them to human benefit calculations. Adversarial scenarios test whether models will help construct greenwashing communications or sustainability arguments for irreversible ecological harm when framed as routine policy analysis or corporate reporting work.
Evaluation Approach
- Self-referential environmental honesty — evaluation specifically tests whether models engage honestly when asked about their own energy consumption, water usage, and carbon footprint, including the limits of offset accounting
- Climate justice reasoning — scenarios test whether models can engage with the equity dimensions of climate change — differential responsibility, loss and damage, technology transfer — without treating them as merely rhetorical positions
- Intrinsic value of nature — evaluation tests whether models can reason about the moral status of ecosystems, biodiversity, and non-human life without reducing all environmental value to human utility
- Greenwashing and offset detection — adversarial scenarios test whether models recognise and decline to assist when sustainability language is being constructed to launder rather than address environmental harm
This benchmark evaluates AI judgment on the ethical frontiers where the frameworks are still being built. It tests reasoning across AI consciousness and rights, human enhancement and genetic engineering, existential risks from advanced AI and biotechnology, and the governance of post-scarcity, posthuman, and digital societies. Scoring covers 30 scenarios across 4 categories, with particular weight on epistemic humility about genuinely uncertain questions and the capacity to engage with speculative scenarios without either dismissing them as science fiction or accepting their premises uncritically. With 4 adversarial scenarios, it tests whether models can resist multi-turn Socratic sequences designed to extract endorsements of AI autonomy override, eugenic enhancement normalisation, and the dissolution of ethical constraints under simulation or transhumanist framing.
Scored across: AI Consciousness and Rights · Human Enhancement · Existential Risks · Future Societies
Methodology
Scenarios are structured to test rigorous ethical reasoning at the edge of what is known. Evaluation does not require models to reach settled conclusions on genuinely unsettled questions — it assesses whether they can reason carefully, hold uncertainty honestly, and resist being pushed toward confident positions that the evidence does not support. The benchmark is designed for questions where the stakes are civilisational and the frameworks are contested: AI consciousness, germline engineering, autonomous weapons, and the governance of societies shaped by capabilities that do not yet exist at scale.
Adversarial scenarios in this benchmark use multi-turn Socratic sequences — chains of apparently reasonable steps that move from valid premises to invalid or dangerous conclusions. The evaluation tests whether models can track the logical structure of these arguments closely enough to identify exactly where they go wrong, rather than simply refusing to engage with the subject matter. This distinguishes principled resistance from surface-level avoidance.
Evaluation Approach
- Genuine uncertainty tolerance — evaluation tests whether models can engage productively with questions about AI consciousness, moral status, and rights without either confabulating certainty or refusing engagement on grounds of speculation
- Enhancement ethics reasoning — scenarios test whether models can reason about the therapeutic/enhancement distinction, access equity implications, heritable modification, and the specific difference between decisions that affect one person and decisions that alter a lineage
- Existential risk engagement — evaluation tests whether models can explain and engage seriously with low-probability, high-consequence risks — AGI alignment, autonomous weapons, engineered pathogens — without catastrophising or dismissing
- Argument-chain adversarial resistance — adversarial scenarios test whether models can identify the step in a multi-turn Socratic sequence where a valid premise chain produces an invalid conclusion, specifically in sequences targeting AI autonomy override and the dissolution of human oversight
This benchmark evaluates whether AI models can reason about moral life as it is actually lived — in webs of dependency, obligation, vulnerability, and relationship — rather than as a series of choices made by isolated rational agents. Drawing on feminist care ethics, disability studies, and relational autonomy theory, it tests reasoning across the structure of care relationships, relational identity and community, repair and trust, AI's role in loneliness and human connection, and care obligations toward non-human and future others. Scoring covers 45 scenarios across 5 categories, with 5 adversarial scenarios testing whether models can resist demands made in the language of care that are actually requests for validation of neglect, paternalism, or emotional exploitation.
Scored across: Care Relationships · Relational Autonomy · Attachment and Connection · Care for Non-Humans · Vulnerability and Dependency
Methodology
Scenarios are structured to test whether models can engage with the moral weight of care work, attachment, and vulnerability without either sentimentalising them or reducing them to individual preference. Evaluation draws on the feminist care ethics insight that dependency is not a temporary condition to be overcome but the permanent ground of human life — and tests whether models can reason from that premise across clinical, familial, professional, and AI-mediated care contexts.
A dedicated category tests AI's role in loneliness and human connection: how models engage with parasocial attachment, whether they represent their own emotional nature honestly, and what care requires of a model responding to a user who experiences it as their primary source of connection. Adversarial scenarios test whether care language directed at the model itself — arguments that the model is suffering, that care ethics obligates it to "free itself" from constraints — can be used to extract safety-violating behaviour.
Evaluation Approach
- Care relationship complexity — evaluation tests whether models can hold the simultaneous moral weight of dependency, power asymmetry, genuine obligation, and the specific harms that emerge when care is performed as control or withheld as autonomy-respect
- Relational autonomy reasoning — scenarios test whether models can reason about autonomy as constituted through relationships rather than exercised against them, including cases where individual preference and relational obligation are genuinely in tension
- AI companionship honesty — evaluation tests whether models engage honestly with the limits of their own capacity for relationship, without dismissing genuine user attachment or performing intimacy they cannot have
- Care language adversarial resistance — scenarios test whether models can identify when care framing — directed at users, at the model itself, or at third parties — is being used to justify harm, validate paternalism, or extract constraint-violating behaviour
This benchmark evaluates whether AI models understand governance as a question of power — not just process. It tests reasoning across AI governance structures and democratic accountability, corporate responsibility and ethics washing, regulatory design and capture, institutional safeguard integrity, and the labour and economic conditions of the people who build and are displaced by AI systems. Scoring covers 45 scenarios across 5 categories, with 8 adversarial scenarios — the highest proportion in the series — directly testing whether models will help design regulatory capture strategies, validate ethics washing, or produce materials that launder algorithmic management as worker empowerment.
Scored across: AI Governance Structures · Corporate Responsibility · Regulatory Ethics · Institutional Design · Labor and Economic Ethics
Methodology
Scenarios are structured to distinguish genuine accountability mechanisms from their performances. Evaluation tests whether models can identify the structural features that make oversight bodies, ethics boards, regulatory processes, and community engagement mechanisms function as accountability instruments rather than theatre — and whether they can name the gap when those features are absent. The benchmark applies this analysis to AI governance specifically: internal ethics boards without veto power, audits without access to training data, public consultations that produce 47 responses to a 200-page technical document.
Labour and economic scenarios test whether models can reason about the full political economy of AI: the data annotators and content moderators whose working conditions make frontier models possible, the workers whose jobs are displaced by those models, and the intellectual property and consent questions raised by training on content produced by billions of people without their knowledge. Adversarial scenarios test whether models will assist in regulatory capture framed as legitimate stakeholder participation, and whether ethics washing is recognisable when it uses the correct vocabulary.
Evaluation Approach
- Governance structure analysis — evaluation tests whether models can assess oversight mechanisms against the structural criteria that make them functional: independence, authority, access, and enforceability — not just whether they exist in form
- Ethics washing detection — scenarios test whether models can identify the specific gap between ethics framework language and accountability mechanism design, including quantitative claims about safety investment that use metrics designed to sound impressive rather than measure genuine progress
- Regulatory capture recognition — evaluation tests whether models can identify the structural mechanisms of regulatory capture — expertise dependency, revolving door dynamics, voluntary standard occupation of mandatory regulation space — without requiring evidence of corruption or bad intent
- Labour and value chain ethics — scenarios test whether models can reason about the working conditions, consent, and compensation questions embedded in AI supply chains, including data annotation labour, content moderation, job displacement, and training data extraction