‘up yours at the end of life’: Opposition, Emotion, and Overlap in a Corpus of Scottish Debates on Assisted Dying
Marc Alexander and James Balfour
University of Glasgow
In this plenary, we report on findings from a recent corpus-based study examining public discourse surrounding the Assisted Dying for Terminally Ill Adults (Scotland) Bill, which proposes to legalize medically assisted death for mentally competent terminal patients. With public opinion sharply divided—faith groups generally opposing while public polls show over 70% support—the study analyzes both public consultation responses and media coverage to understand key attitudes and arguments.
The research comprises two complementary studies. The first analyzes public responses to government consultations, drawing on two datasets: 12,314 written submissions from 2022 (2.1 million words) and 7,236 responses from 2024 (1.8 million words). The second examines the media narrative around assisted dying in the UK between 2022 and 2024 (6,360 texts). By examining language patterns in the two datasets in tandem, we reflect on the complex interaction between public attitudes towards a sensitive and contentious topic and the role media framing plays in shaping public debate.
To examine oppositional discourse in the public responses, we first identify unique lexical choices exclusive to each group—supporters used terms like “hideous,” “abject,” “urine,” and “linger,” while opponents employed words such as “eroded,” “burden,” “wedge,” and “shalt.” Second, tagging the corpus using Wmatrix, we compare key semantic domains between groups, revealing that supporters’ responses contained more emotional content (particularly in the “Sad” domain), while opponents’ responses were more analytical in nature. Third, we conduct n-gram studies to identify areas of common ground between opposing viewpoints and detect potential copy-pasted responses within each group.
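The copy-paste detection step can be sketched as follows. This is a minimal illustration, not the study's actual pipeline; the function names, the n-gram size, and the similarity threshold are all hypothetical choices. The idea is that two submissions sharing a large proportion of their word n-grams are likely template or copy-pasted responses:

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_duplicates(responses, n=5, threshold=0.5):
    """Return index pairs of responses whose n-gram overlap exceeds
    the threshold (likely template or copy-pasted submissions)."""
    sets = [ngrams(r, n) for r in responses]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

A pairwise comparison like this is quadratic in the number of responses; at the scale of these consultations (over 12,000 submissions), a production version would typically use hashing or indexing to narrow the candidate pairs first.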
For the second part of the study, we investigate media influence on the debate by analyzing 6,360 news articles published between 2022 and 2024 that explicitly referenced assisted dying. Using keyword analysis, n-grams, and concordance analysis, we examine how newspapers of different political affiliations framed the debate. Our findings suggest significant overlap between media narratives and public consultation responses, with some phraseology being nearly identical across both datasets.
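Keyword analysis of this kind typically rests on a keyness statistic such as Dunning's log-likelihood, which compares a word's frequency in a target corpus against a reference corpus. The following is an illustrative stand-alone implementation of that statistic, not the study's own code:

```python
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning's log-likelihood keyness statistic for one word:
    observed frequency freq_a in a corpus of total_a tokens,
    versus freq_b in a reference corpus of total_b tokens."""
    # Expected frequencies under the null hypothesis that the word
    # is equally common in both corpora.
    e_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    e_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / e_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / e_b)
    return 2 * ll
```

Values above 3.84 correspond to significance at p < 0.05 (one degree of freedom), a conventional cut-off for treating a word as a keyword.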
The research extends beyond the specific assisted dying debate to address broader methodological questions about analyzing oppositional discourse in public life. The study reveals distinct rhetorical patterns: supporters of the bill tend to employ more emotionally charged language and personal narratives, while opponents favor analytical and consequence-based arguments. The media analysis demonstrates how news coverage may reinforce these polarized positions through consistent framing patterns. The bill’s consideration in Scotland occurs against the backdrop of similar debates in England, where recent legislative efforts have faced setbacks. The research also notes the particular challenges of compiling and analyzing representative corpora in highly polarized debates, where responses may range from deeply personal experiences to organized campaign submissions.
This comprehensive analysis of both public and media discourse provides valuable insights into how contentious healthcare policy debates are framed and argued across different platforms and stakeholder groups. The methodological approach developed here offers a framework for analyzing other polarized public debates, while the findings contribute to our understanding of how public opinion forms and expresses itself on complex ethical issues.
Generative Artificial Intelligence in the Labyrinth of Specialized Translation
Pascual Cantos
Universidad de Murcia
This study analyzes the effectiveness of generative artificial intelligence (AI) as a machine translation tool applied to specialized domains. The research is grounded in a methodological approach that combines corpus linguistics with advanced quantitative techniques, using parallel corpora in the biomedical, legal, and technical fields. These corpora make it possible to evaluate the capacity of generative AI systems to handle complex tasks such as lexical precision, discourse cohesion, and contextual adaptation, all of which are critical aspects of specialized translation.
The research design centers on the implementation of comprehensive quantitative analyses. The techniques employed include lexical frequency analysis and n-gram pattern analysis, both aimed at identifying terminological consistency and detecting possible inconsistencies in the use of technical vocabulary. Translation quality is evaluated with metrics widely recognized in the field of machine translation, such as BLEU (Bilingual Evaluation Understudy) and COMET (Crosslingual Optimized Metric for Evaluation of Translation). BLEU measures the correspondence between generated translations and reference translations, while COMET takes a more nuanced approach, using neural models that predict translation quality against human judgments.
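The core of BLEU can be illustrated with a simplified sentence-level sketch: the geometric mean of modified n-gram precisions up to order 4, multiplied by a brevity penalty. This toy version is unsmoothed and is not the standard implementation used in evaluation campaigns, where a reference implementation would normally be preferred:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts = ngram_counts(cand, n)
        r_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate that substitutes a single word still shares most n-grams and scores somewhere strictly between 0 and 1.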
The study also adopts analysis of variance to explore the contextual adequacy of the generated translations across the different domains. This approach makes it possible to measure how AI models handle variation in linguistic context and maintain discourse coherence, a key requirement in specialized texts. The methodology further includes a comparison between the output of AI systems and human translations, considering parameters such as semantic precision and stylistic adaptability.
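The analysis-of-variance step can be illustrated with a one-way ANOVA F statistic computed over, say, adequacy ratings grouped by domain (biomedical, legal, technical). This is a toy sketch under that assumption, not the study's code; a statistics library would normally also report the p-value:

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: ratio of between-group
    to within-group mean squares over lists of scores."""
    k = len(groups)                      # number of groups (domains)
    n = sum(len(g) for g in groups)      # total number of observations
    grand = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F indicates that the domains differ more between themselves than the ratings vary within any one domain, i.e. that contextual adequacy is domain-dependent.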
An essential aspect of the methodological design is the preparation and curation of the corpora employed. These parallel corpora are built from authoritative sources in the three selected domains, ensuring adequate representation of the terminology and discourse styles specific to each area. Priority is given to authentic texts that reflect realistic use of technical language, which enables a more precise evaluation of the capabilities of generative models. The careful selection of these texts is key to guaranteeing the validity of the conclusions drawn.
The main objective of this study is to provide a comprehensive evaluation of generative AI in specialized machine translation. The research not only examines the current capabilities of these models but also lays the groundwork for optimizing their performance through more domain-specific training corpora and improved evaluation strategies. It further seeks to establish a methodological framework that can be replicated in future studies of machine translation in other specialized domains. This methodological approach, combining detailed linguistic analyses with quantitative evaluation techniques, contributes to a deeper understanding of the potential of generative AI in professional translation environments.
The Relevance of Large, Structured Corpora in the Age of Large Language Models
Mark Davies
Brigham Young University
I will provide a summary of the in-depth data from several “white papers” at English-Corpora.org on how well the predictions of two prominent Large Language Models (LLMs) match the actual data from several robust corpora, including corpora from Sketch Engine and several corpora from English-Corpora.org (COCA, GloWbE, NOW, iWeb, the TV and Movie corpora, and more). I will also provide limited data from the three corpora in the Corpus del español and the three corpora in the Corpus do português.
In terms of strengths, the LLMs arguably provide:
- Much richer collocational data than even 40-50 billion word corpora from Sketch Engine (especially for low-frequency words). This is due to the advanced word embeddings in high-dimensional space in LLMs, which are much more powerful than the simplistic surface-level association measures used in corpus linguistics.
- Better comparisons of contrasting words (e.g. entire / complete, nuance / subtlety, perceive / discern for English; we will also provide data from Spanish)
- Much more insightful analyses (generated by the LLMs themselves) of what the collocates tell us about the meaning and usage of words
The LLMs are surprisingly good (perhaps at the level of some of the best corpora) at:
- Estimating word and phrase frequency (such as rank ordering a list of 10-20 words)
- Categorizing words and phrases by genre, historical period, and dialect
- Analyzing variation in word meaning across genres, historical periods, and dialects
- Predicting syntactic variation between genres, historical periods, and dialects
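A rank-ordering task like the one above (having an LLM order 10-20 words by frequency) can be scored against corpus data with Spearman's rank correlation. The following is a minimal illustrative sketch, not code from the white papers:

```python
def to_ranks(words_by_frequency):
    """Map each word to its 1-based rank in a frequency-ordered list."""
    return {w: i + 1 for i, w in enumerate(words_by_frequency)}

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation (no ties) between two rankings of the
    same items, e.g. LLM-predicted vs corpus frequency ranks."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: an LLM's ordering of five words versus
# the ordering observed in a corpus.
corpus_order = to_ranks(["time", "people", "way", "water", "sound"])
llm_order = to_ranks(["time", "way", "people", "water", "sound"])
words = sorted(corpus_order)
rho = spearman_rho([corpus_order[w] for w in words],
                   [llm_order[w] for w in words])
```

A rho near 1 means the LLM's frequency intuitions track the corpus closely; values near 0 or below indicate little or inverted agreement.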
However, LLMs have the following significant limitations, as far as providing language data and carrying out linguistic analyses:
- They are much worse at generating word and phrase lists (such as those at WordFrequency.info) than in analyzing / categorizing existing lists
- We can never be sure if they are actually generating useful linguistic data themselves (for example, actual data on syntactic variation between genres, time periods, or dialects), or whether they are simply “parroting” something that they have scraped from an article or a web page.
- They provide “static data”, whereas “full-featured” corpus sites like English-Corpora.org and Corpusdelespanol.org allow us to see and use links between different words, phrases, and constructions
- Most importantly, LLMs do not allow us to “check the data” (via KWIC entries, metadata, etc.) in the same way that we can with structured corpora.
At the end of the day, it is not an either/or proposition (either LLMs or structured corpora). LLMs are best used in conjunction with reliable corpus data. Corpus linguists can make use of the rich lexical data from LLMs, and AI/ML researchers can use corpus data for fine-tuning, distillation, and Retrieval-Augmented Generation (RAG) with LLMs.
From Big Data to Smart Data: Building Better Datasets for Human-Centric AI with Meaning in Mind
Rebekah Wegener
Paris Lodron University Salzburg
Recent developments in artificial intelligence, particularly those surrounding large language models, have sparked a renewed interest in foundational questions about the nature of human language and how machines process and generate language. These questions echo Halliday’s (2003) early insights about language as meaning potential and what this means for computational approaches to language. However, these advances also highlight fundamental questions about meaning, context, and the relationship between quantity and quality of data. As Dingemanse and Liesenfeld (2022) argue, creating more representative and meaningful datasets requires going beyond text collection to capture the complexity of human communication.
To explore these questions further, I want to consider human-centric AI systems, focusing specifically on how such systems are designed and built in industry and academia. These systems have explicit requirements for understanding meaning-making in context, for understanding abstract concepts such as importance, and for understanding multimodal interaction. Drawing on previous work (e.g. Cassens & Wegener, 2018; Wegener, in press), I will demonstrate how such systems showcase the importance of high-quality datasets for AI – particularly human-centric AI – and show how strong theoretical frameworks can inform the design of “smart” datasets that capture the complexity of human meaning-making.
Such endeavours are not without their challenges, and in many respects these challenges mirror long-standing questions in corpus linguistics about context, annotation, and the nature of meaning itself. These problems become particularly apparent when working with multimodal data, where meaning emerges not just from individual modes, but from their integration and interaction in context (Bateman, Wildfeuer & Hiippala, 2017; O’Halloran, Tan & Wignell, 2019). The development of tools and methods for handling such complexity requires careful consideration of both theoretical and practical concerns.
While some of these questions are destined to remain core philosophical debates, others are the driving force behind new tool development. Such tools provide opportunities for addressing methodological challenges within corpus linguistics and, in particular, hold the potential to assist in the study of meaning. I will briefly consider how new tools can be recruited for corpus linguistics and how human-centric AI can also benefit from tools and methods already popular within corpus linguistics (Driess et al., 2023; Henlein et al., 2024).
As Dingemanse & Liesenfeld (2022) argue, “corpora represent an important and mostly untapped resource for language technology.” Understanding meaning and the process of meaning-making requires more than just collecting large amounts of data – amongst many other things, it requires theoretically informed approaches to dataset design, representation, annotation and analysis. For meaning-focused corpus research, “data comes in levels of granularity. A well-curated corpus…harbour(s) important insights about human interactional infrastructure” (Dingemanse & Liesenfeld, 2022).