Ideas are like people. They have direct ancestors and share genes with their siblings. Sometimes they are lucky enough to have children; often their influence is more indirect, like that of a benign uncle or aunt sharing an interest with their near relatives. If they are lucky, they can grow old surrounded by ideas they have given birth to or had some influence on; if they are less lucky, they can see their family grow more distant and end up alone or patronised at a distance. Occasionally they have no families, either because they are difficult to relate to or simply because they grew up in a culture where their good qualities are not recognised.

Lexical priming is an idea lucky enough to have a family around it, and is itself the child of many famous parents (and here the analogy between ideas and people breaks down). Ideas from the minds of John Sinclair, Randolph Quirk and Eugene Winter, amongst many others, can all lay claim to its parentage. The idea also has distinguished older siblings, such as Sinclair’s ‘idiom principle’ and Hunston & Francis’s ‘pattern grammar’, from which Lexical Priming learnt and against which it gently reacted (like siblings the world over).

I am immensely fortunate (and humbled) that Lexical Priming has been allowed to be a benign uncle (and in a few cases the parent) to other ideas; this volume shows some of the exciting thinking going on in current corpus linguistics, and as someone at the end of my career, I am glad to have played a modest part in the development of this thinking. Like all ideas, lexical priming will in time grow old and die, replaced by fuller and more satisfying theories that may or may not make use of the idea they replace. This is entirely natural and healthy, but before this happens, I would like to reflect briefly on a couple of implications of the theory that have perhaps had less attention than others.

In essence, the theory says that a person’s repeated exposure to contextualised instances of highly similar phonetic sequences or identical letter sequences results in their being primed to associate those sequences (typically, though not necessarily, words) with the recurrent features of those contexts; this claim is based on extensive psycholinguistic research into priming, well surveyed by Pace‑Sigge in his recent book on Lexical Priming in spoken English usage. The effect of the priming caused by such exposure is that when the primed person uses the word (or other piece of language) in question, s/he typically replicates the recurrent features of the context, thereby ensuring the perpetuation of the association of the word (or whatever) with those features. This, I claim, accounts for the existence of collocation, colligation, semantic association (or semantic preference) and a range of other corpus‑identified features of language. Of course, collocation, colligation and the other features exist independently of the theory, which simply seeks to account for their existence. (A solely social explanation for collocation and the other features will not do, because we enter society and language at the same time and by the same processes.)

The first implication of the psychological explanation of collocation and other features is that there is no rational basis for believing that everybody’s primings are identical. Each person, at least in theory, has their own unique language, which is harmonised with those of other speakers to a considerable degree by education and the media, but which reflects the people they talk with, the places they meet in and the material they choose to read. This seems to support those trends in both sociolinguistic research and work on language change that see social groupings as fluid, local and genre/domain specific.

The second implication is that priming is the mechanism whereby we arrive at our own personal (and incomplete and inconsistent) grammars and semantics. In other words, grammar and semantics are secondary outputs, rationalisations from the data, not inputs into the language system. This does not make them less important; they are powerful generalisations that all of us make, that some of us allow to interact with our primings and that a few of us try to systematise. But it does mean that we have to reject a number of positions that have historically dominated our discussion of these fields of linguistics. It also means that collocations, colligations and semantic associations are the source of our grammatical categories and semantic sets, not just drawing upon them. There are risks of circularity here, but pre‑existing categories are a record of the way some previous speakers have been primed and provide a starting‑point outside the circle.

As the idea of lexical priming starts the process of growing old (it is currently a teenager), perhaps these implications will begin to interest people or perhaps they are dead ends. One thing is certain: there are no obvious dead ends in this volume. It is my hope that the ideas in this book have their own progeny and provoke you, the reader, to have ideas of your own that are themselves fruitful to you and others.

Michael Pace‑Sigge & Katie J. Patterson

University of Eastern Finland

  1. Why this book
    When, in 2003, the first editor of this volume watched Michael Hoey give a presentation on Lexical Priming (LP), it seemed to be a revelation: not because of his presentation style (he had to, after all, struggle with numerous transparencies on an overhead projector) but because the theory presented seemed to have the ability to provide answers to a number of contentious issues. For the second editor, LP also offered a solution to a number of issues that had come up in relation to understanding and recognising metaphor. We were happy calling ourselves linguists. We were less happy to encounter a system of rules and regulations that seemed to be intractable, given the number of exceptions and “other uses” that had to be taken into account. A lot of the grammatical models that the editors had encountered during their years as undergraduate students seemed to be just that: models that appear to be fine in theory yet, when confronted with everyday language use – both written and spoken – prove both ill‑fitting and inconsistent, suitable for some uses and unsuitable for others. The presentation, given at the University of Liverpool Wednesday Seminar series, offered an alternative: a lexically based approach to language research, driven by natural‑language usage.

    The idea for this book goes back to the summer of 2012. The first editor had just started as a lecturer at the University of Eastern Finland and was given the opportunity to design and teach his own option module. The module, which commenced in January 2013, introduced Corpus Linguistics in the first weeks and then, as the weeks progressed, looked at the philosophical questions raised by work with naturally occurring language: semantic prosody and, prominently, the exegesis of lexical priming and its applications. Crucially, the first editor had attended a SiBol CADS¹ conference in Bologna that summer, where Michael Hoey talked about his theory and its application to prose‑stylistic investigations. The eye‑opener, however, was seeing papers by Alan Partington and Tony McEnery. Partington discussed lexical priming in the context of what he called evaluative primings. McEnery also made a strong reference to the theory, linking the deliberate move to break mainstream primings with the kind of discourse found in groups of radicalised young men. If it had not happened before, it became clear at this point that the theory had progressed beyond offering an alternative approach to describing patterns and behaviours in corpus‑linguistic data: it was now being taken up by an increasing number of linguists, working in a variety of fields, to provide an explanation for what drives us in our language use. There have been publications in which one or two main themes of lexical priming have been discussed: creativity and stylistics in Hoey, Mahlberg, Teubert et al. (2006); and Corpus Assisted Discourse Analysis in Partington, Duguid & Taylor (2013). Furthermore, findings previously made have been re‑interpreted in light of the theory, something that, for example, Geoff Thompson undertook in a presentation on patient‑doctor exchanges (Thompson 2012). Yet there existed no single source that took the lexical priming theory and showed what advances had been made since its conception in Hoey’s (2005) seminal book. One of our aims as editors is to showcase the now considerable variety of applications of the lexical priming theory to linguistic study: to explain language occurrence patterns in, for example, the usage of metaphors; for teaching or teaching materials; for languages other than English; for spoken English; and for political discourse. Creating a single collection which demonstrates the theory’s influence and consequent evolution was, then, the primary aim of this volume.

    ¹ Universities of Siena and Bologna Corpus-Assisted Discourse Studies

    doi 10.1075/scl.79.003int © 2017 John Benjamins Publishing Company
  2. Michael Hoey’s theory of lexical priming
    Michael Hoey has been exploring the concept of colligation in detail since the late 1990s and, stemming from this, developed his theory of lexical priming, which was presented in great detail in his 2005 monograph. The idea of lexical priming is not new: in fact, the concept was presented by James Neely in the late 1970s and goes back to the concept of priming first developed by M. Ross Quillian in the 1960s (cf. Pace‑Sigge 2013). What is new is that Hoey takes the theory developed by psycholinguists (theoretically and then under laboratory conditions) and applies it to corpus linguistics (which deals with naturally occurring language). Indeed, Gries (2005) has shown that corpus‑based investigations can lead to results that are almost identical (<95.0 %) to the results psycholinguists have obtained under controlled conditions.

    In answering the question of what lexical priming is, one can, first of all, highlight the fact that it approaches the acquisition, understanding and production of language (either spoken or written) from a lexis‑driven, not a grammar‑driven, stance. Like John Sinclair (1991), Hoey is sceptical of the slot‑and‑filler model. On this view, single words do not have equal value: they prefer or reject the company of others. Lexical priming, then, is a modern, neo‑Firthian theory of language. Hoey proposes that each word, or each set of words, triggers in turn another word or set of words. In essence, a listener will expect a word like ‘nurse’ to be followed by ‘doctor’ or ‘hospital’ or ‘caring’ and so forth. A listener will, however, wonder whether he or she has misheard if the word following ‘nurse’ is ‘bank’ or ‘editor’. A single word can lead a reader to expect a certain range of words to follow because repeated previous experiences have primed the reader to expect a certain term. Where this is not the case, Neely’s research has shown a significant slow‑down in recognition time.² Yet priming works on a far subtler level than mere word collocation. Through constant experience and our own use, words and word combinations tend to appear in certain structures (their colligations); our word usage also reflects the connotations in which these words are deemed appropriate (their semantic associations) and the position and types of text where they are most likely to be employed (their textual colligations). All this is summarised by Michael Hoey as follows (Hoey 2005: 13):

    Priming hypotheses

    Every word is primed for use in discourse as a result of the cumulative effects of an individual’s encounters with the word. If one of the effects of the initial priming is that regular word sequences are constructed, these are also in turn primed. More specifically:

    1. Every word is primed to occur with particular other words; these are its collocates.
    2. Every word is primed to occur with particular semantic sets; these are its semantic associations.
    3. Every word is primed to occur in association with particular pragmatic functions; these are its pragmatic associations.
    4. Every word is primed to occur in (or avoid) certain grammatical positions, and to occur in (or avoid) certain grammatical functions; these are its colligations.
    5. Co‑hyponyms and synonyms differ with respect to their collocations, semantic associations and colligations.
    6. When a word is polysemous, the collocations, semantic associations and colligations of one sense of the word differ from those of its other senses.
    7. Every word is primed for use in one or more grammatical roles; these are its grammatical categories.
    8. Every word is primed to participate in, or avoid, particular types of cohesive relation in a discourse; these are its textual collocations.
    9. Every word is primed to occur in particular semantic relations in the discourse; these are its textual semantic associations.
    10. Every word is primed to occur in, or avoid, certain positions within the discourse; these are its textual colligations.

    Very importantly, all these claims are in the first place constrained by domain and/or genre. They are claims about the way language is acquired and used in specific situations. This is because we prime words or word sequences, as already remarked, in a range of social contexts and the priming, I argue, takes account of who is speaking or writing, what is spoken or written about and what genre is being participated in, though the last of these constraints is probably later in developing than the other two.

    ² One of the editors (M P-S) presents an example of how friends and colleagues act as primes: annoyingly, he keeps finishing other people’s sentences. This can be seen as an example of his impatience. Or, because in 9 out of 10 cases the utterance-completion is fitting, as an example of how primes trigger the brain to complete what is being said before a conversation partner has actually spoken.

    The subtlety of priming can also be seen on a different level: the choices it shapes are not conscious ones. Without a degree of formulaicity, both the production and the comprehension of language would be slowed down markedly. Yet primings can consciously be broken (for example, when making a joke); primings can also be changed, expanded, or re‑assigned (as when a word like ‘Google’ is no longer just a proper noun but also a verb, or when one learns a new language). As such, primings are prevalent but they are neither fixed nor prescriptive: they reflect one’s exposure and language use. Therefore, a word like care will act as a prime for different terms depending on who is talking: a mother might have a stronger inclination to say care [for my] baby than a politician, who might have a preference for care provision [is a crucial element in]. This example highlights possible divergence in collocations, colligations and semantic associations, all depending on the textual context.
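The care example can be made concrete with a small sketch in Python. It simply counts which words follow a node word within a short window in two invented mini-corpora, one standing in for a parent's talk and one for policy talk. The function name, the window size and both token lists are assumptions made purely for illustration; they are not drawn from Hoey's data or from any chapter in this volume.

```python
from collections import Counter


def right_collocates(tokens, node, span=2):
    """Count the words occurring up to `span` positions to the right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Collect the next `span` tokens after each occurrence of the node.
            counts.update(tokens[i + 1 : i + 1 + span])
    return counts


# Invented toy data (hypothetical): two speakers with different exposure.
parent_talk = "i care for my baby and i care for my family".split()
policy_talk = "care provision is crucial and care provision needs funding".split()

if __name__ == "__main__":
    print(right_collocates(parent_talk, "care").most_common(2))
    print(right_collocates(policy_talk, "care").most_common(2))
```

Even on such toy input, the same node word attracts different right-hand neighbours in the two samples, which is the kind of divergence in collocation the paragraph above describes. A real study would of course need sizeable corpora and a proper association measure rather than raw counts.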
  3. Lexical priming: Advances and applications
    In this volume, we want to showcase some of the advances that have been made (not least by Hoey himself), using the lexical priming theory as the starting point for research investigations.

    The book is divided into four parts. The first and largest section looks at Discourse Studies and the ways in which lexical priming can be applied to discourse analysis and critical discourse analysis. The second section focuses on the issue of synonymy and the related language devices of simile and metaphor. The third section is motivated by statistical correlations and how such findings highlight the strength of collocations; the implications this has for the lexical priming theory are then discussed. The book concludes with the fourth and final section, which focuses on the direct application of the lexical priming theory when teaching languages other than English.

    Part I, Discourse Analysis, presents a wide range of approaches. The first chapter, by Michael Hoey, looks at the ideas of cohesion and bonding, and at how texts on the same subject, written over a period of 40 years, show intertextual priming. Baker, McEnery and Hardie also look at the characteristic wordings employed over a period of time, examining the issue of language change. Partington and Duguid, in their chapter, look at a different form of language change: not diachronic change, but what they term “forced primings”. This means that they look at how interested groups (such as spokespeople for a political party) try to create a characteristic form of wording to further their agenda. In doing so, these groups seem to push a chosen wording to become the widely accepted form. The section is rounded off by the chapter by Michael Pace‑Sigge, who examines which lexical items are salient signals of turn‑closure and which tend to be primed as turn‑starters in casual conversations.

    The first contribution of the book is by Michael Hoey, who formulated the Lexical Priming Theory. In his chapter, he brings advances and applications together, travelling back to the future by returning to his original work on text cohesion. Back then, Hoey used the term bonding (cf. Hoey 1991, 1995) and, with a newly created specialist corpus of texts collected over the last forty years, he makes a strong case for how cohesion and lexical priming are intrinsically linked. Hoey points out that Lexical Priming is a theory intended to account for the existence of corpus linguistic phenomena such as collocation and colligation that cannot be explained in terms of logic or theoretical grammars of the generative kind. It is not circular because the explanation offered is not based on the data it seeks to explain but on the work of psycholinguists over a thirty‑year period. Hoey also highlights that the term ‘collocation’ is used in two fields of linguistics – namely
    corpus linguistics and cohesive studies. However, few have made the link between the two strands of investigation. His paper is an exploration of the extent to which priming can be seen as the cause and driving force for cohesion. Using 40 years’ worth of material which reports on the ‘planet’ Pluto, Hoey describes in detail how the strong bonding found in different texts, by different writers, over a span of decades, can be seen as a result of intertextual priming.

    Like Michael Hoey, Baker, McEnery and Hardie look at primings over time. Unlike him, they investigate a much longer time period in order to look at diachronic change. Making use of the vast EEBO corpus, now newly available, the authors focus on the representation of one specific group, the Ottomans, in the seventeenth century, looking at the change of representation of this group over time. With reference to Hoey, they claim that our knowledge of a word ‘includes the fact that it co‑occurs with certain other words in certain kinds of context’. This, according to the authors, leads to the assumption that “time is clearly one important context within which primings may be acquired; through exposure to word primings over time, words are imbued with meaning and a key feature of this process is collocation”.

    Language, however, changes over time, and this must imply that primings change with time too. Hoey (2005: 9) speaks of ‘a drift in priming’. Baker, McEnery and Hardie explore any observable drifting of primings by looking at collocations, the key product of lexical priming, following Hoey (2005: 7–9). In the research undertaken to date, the practical challenges of examining any form of drift have proven difficult to overcome: language change occurs rather slowly, and investigations need to be made on the basis of large amounts of searchable data covering a long time span. The authors make use of the EEBO corpus, which provides a billion words of English data for the seventeenth century. Beyond the explanation for any drift in priming observed, Baker and her co‑authors highlight the crucial fact that such a drift in usage may relate to society as much as to language per se. Consequently, the authors have combined a corpus‑linguistic study with historical insights, looking at the way the Ottomans were referred to in seventeenth‑century British English texts.

    Alan Partington and Alison Duguid present a very different take on the idea of lexical priming, focusing not on subconscious acquisition but on conscious prescription or enforcement of language. Their concern is the concerted effort of those who try to enforce a media discourse that fits the particular purpose of those in power: the spin doctors who repeat words and phrases to such an extent that the language comes to be seen as the accepted norm within that context or situation. In the first of their case studies, they demonstrate how corpus techniques can shed light on how one of the authors of the Declaration of Independence, namely Thomas Jefferson, steers the discourse with messages favourable to the secessionist agenda.
    The second of their case studies focuses on the forced primings projected in a more modern discourse type: how the White House press spokespersons present to the media the participants in the Arab uprisings in 2011. Partington and Duguid examine how far news organisations like CNN and the New York Times have taken on or resisted the forms preferred by the White House.

    The final part of their chapter examines how far forced primings have worked in British political electoral campaigns. Members of the cabinet who answer questions in media outlets are found to have been prepared by media specialists. Answers are pre‑prepared and planned for insertion, a practice often referred to as ‘singing from the same hymn‑sheet’. For this, a press corpus for the years 2013–2015 (2015 being the year of the last British general election) is used. The examination of the data highlights that such a practice can also backfire when the press becomes aware of an organized strategic intention: the forcing is then exposed and at times even ridiculed.

    Michael Pace‑Sigge turns his attention to salient patterns in the most direct and in‑time form of discourse: unplanned spoken communication. To do so, he looks at turn‑taking strategies in spoken communication, an area that has been widely researched since the early 1970s. The key issue confronting a corpus linguist when looking at turn‑taking signals is the fact that these are mostly prosodic, non‑lexical pointers (see Pace‑Sigge, 2015). In the last twenty years there has, nevertheless, been corpus‑based research which has focused on lexical items (see McCarthy 1998; Tao 2003, and others).

    According to Hoey, corpus linguistics has demonstrated that language is structured through collocation, colligation and semantic association. It is his lexical priming theory (2005) that attempts to explain why words fall into these particular patterns by taking recourse to psychological processes in the listener and speaker. A prime is a trigger and, in a conversation, this trigger can be non‑verbal (a prosodic feature such as falling intonation); verbal yet non‑lexical (feedback by way of laughter or back‑channelling or, indeed, pauses); or a lexical word (for example, an exclamation). The latter two can, to a fairly high degree, be observed by looking at orthographically transcribed corpora. Following the tenets that form the basis of Hoey’s theory, some kind of trigger item should be in evidence in spoken conversation, showing a listener that a turn is being given up. Consequently, recognisable turn‑final and turn‑initial lexical items should be found. Michael Pace‑Sigge approaches the issue in two steps. First of all, a corpus of monologues (prepared public speeches) is compared with a corpus of dialogues (casual conversations). This approach reveals the kinds of items (both lexical and not‑so lexical) which are typically found in a conversation but which would be atypical for a monologue. Expanding his investigation, Pace‑Sigge describes highly frequent (sets of) words employed in conversational
    exchanges. There are clear preferences for certain words and groups of words as turn‑starters. At the same time, clear indications of the category of words prevalent in end‑of‑turn use by a large proportion of speakers are also observable. The evidence thus uncovered is used to support the claim that language users appear to be primed in their turn‑taking word choices to follow a structured, recognisable pattern, thus facilitating fluency in their conversation.

    Part II of our book, entitled Similes, Synonymy, and Metaphors, deals with the more creative aspects of language. In 2008, Michael Hoey made the claim that more work needs to be done in relation to lexical priming and creativity. Whilst the theory successfully accounts for the lexical characteristics and patterns of use associated with both spoken and written language within particular domains, little attention has been paid to more ambiguous or creative instances of language, such as word senses and figurative language. Drawing on past research focusing on polysemy (Hoey 2005 and Tsiamita 2009), which showed that two distinct senses of a word or item tend to avoid each other’s primings, each of the three studies by Bawcom, Patterson and Shao applies similar aspects of the theory to other word senses. The findings suggest that, whilst creative language by definition remains less restrictive than other forms of language, the lexical priming theory helps to identify patterns and behaviours in such seemingly creative choices of language use.

    Like Partington and Duguid, Linda Bawcom looks at newspaper texts and how primings are evidenced by particular word choices when describing the same event, even though the news outlets are different and have different editorial set‑ups. Bawcom uses as her starting point earlier corpus‑based research which has identified factors that describe various reasons for the preference of one synonymous lexical item over another. Her paper highlights the versatility of synonyms and their functions. Beyond looking at the mere statistics of usage, she also investigates the psychological reasons why a writer would prefer one particular choice of word over another, something that is referred to in psycholinguistic priming tasks as ‘the frequency effect’.

    Bawcom approaches the category ‘synonyms’ from two different corpus‑based perspectives. As a result, she describes how the choice of a synonym can be dependent on a number of the same factors that influence our choice of any lexical item, such as collocation, colligation, genre and register. She indicates that, based on experiments with lexical decision tasks, it would appear that language users store and retrieve a word along with its associations relevant to the situation and context in which it has been repeatedly encountered before. According to Bawcom, this psychological, subliminal effect can be seen as dovetailing with Hoey’s theory of lexical priming as a possible explanation for observations made during corpus‑based studies.
    Katie Patterson’s chapter presents a discussion of creativity and the problems of applying a theory based on patterns and formulaic behavior to creative and often less restricted types of language. Creativity is often defined as a breaking of particular linguistic norms and conventions and as a result is thought of as a largely free act of expression, but while this may be true to some extent, the expressive effect of that choice of language is diminished if it does not retain meaning for the user. The focus of the research is on the conventions which govern both metaphoric and non‑metaphoric uses of language.

    The paper explores nesting (cf. Hoey 2005) patterns of grew that are specific to its use in metaphoric contexts, and compares these to their absence in non‑metaphoric contexts. The findings go some way to suggesting that, as a metaphor, grew is qualitatively a different lexical item when compared to its non‑metaphoric use(s) (see Patterson, 2016). It is proposed that Hoey’s (2005) Drinking Problem Hypothesis can account for these lexical differences, providing a psychological explanation for what drives us as language users to identify metaphor. By focusing on meaning within a Neo‑Firthian framework, this research aims to re‑focus discussions of metaphor within the wider discourse field, taking into consideration context, pragmatic meaning, the individual’s mental lexicon, and subsequently what role these factors play in interpreting metaphoric meaning.

    The section is completed with Juan Shao’s chapter on synonymy. Hers is one of two papers in this volume to look at the occurrence of lexical priming in a language other than English. The purpose of her study is to explore a number of ways to explain effectively how near‑synonyms in Mandarin Chinese are distinguished. To do so, she draws on a corpus‑based collocational and colligational analysis of Chinese synonyms. A group of words expressing the idea of being “happy” in Mandarin Chinese is investigated in her study: 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn). The results show that these three Chinese words fulfil functions which are widely seen as making them synonyms of each other. However, detailed corpus analysis shows that their areas of use are distinct; they can and should be distinguished based on their occurrence patterns. This is seen by Shao as backing the claims Hoey made with regard to so‑called synonyms in English (see Hoey 2005: 74–80). It furthermore provides a useful reference for teaching Chinese to speakers of other languages.

    Part III is entitled Collocations, Associations and Priming. This section has a strong focus on using statistical and computational tools to highlight the validity of various claims made by the lexical priming theory. Beyond that, the chapters in this section share a common approach: they ask how the primings found for a set of target words can be used to predict their properties (and perhaps even to automate this prediction). These properties include register (Berber Sardinha) and the strength of attraction between a lexical item and a grammatical slot in which it can be found (Cantos and Almela).
    Tony Berber Sardinha uses a different approach, and different corpora, from those in Hoey’s original work on lexical priming, in order to functionalise the lexical priming theory as a fairly good way of predicting text genres. According to Hoey, the theory predicts that individuals are able to store in their minds information about the textual varieties (registers, genres, etc.) in which particular collocations are most typical. In other words, if presented with a particular collocation, individuals should be able to tell the registers that it is generally primed for. This hypothesis presupposes a regular association between collocation and register, so that register differences should be marked by differences in collocational use; that is, different registers should have largely distinct groupings of collocations. In his chapter, Berber Sardinha proceeds to determine whether there is such a regular association between collocation and register, and how far statistical tools allow us to observe the strength of such associations. The study uses Multi‑Dimensional Analysis, an analytical framework for studying register variation through multivariate statistical techniques (see Biber 1988), and draws on the Corpus of Contemporary American English (COCA). Berber Sardinha looks at five equal‑sized sections, each representing a major English register: academic, fiction, news, magazine and ‘spoken’ (television programs). Using the logDice coefficient (see Rychly 2008), the most characteristic collocations of each register were identified and then entered into a Factor Analysis, which yielded the statistical groupings of collocation across the registers.
These factors were interpreted for their semantic preferences (see Hoey 2005: 16–37).

The final chapter in this section retains the focus on statistical probability for collocations, namely by extending the notion of co-collocation to Hoey’s concept of colligational priming. Pascual Cantos and Moisés Almela look at the relationship between particular collocations and syntactic preferences of the noun CAUSE. The existence of dependency relations among different collocations of a word has been shown in previous research (see Cantos & Sánchez 2001, for example). As has been demonstrated by Almela (2014), such dependencies are obtained when the strength of the attraction between a node and one or more of its collocates depends on co-occurrence with a third element: the co-collocate. As an example, the probability that the verb see collocates with eye is increased by the presence of modifiers of a specific semantic type (e.g., the adjective naked) but weakened by the presence of other types of modifiers such as through or behind. The authors propose that the strength of attraction between a lexical item and a grammatical slot can be influenced (strengthened or weakened) by the instantiation of other colligations of the same node in the same syntagmatic environment, and that it is possible to capture these dependencies between colligations by adapting the methodology of co-collocation analysis.
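As a rough illustration of the association score named above in connection with Berber Sardinha's chapter, the sketch below computes logDice as defined by Rychlý (2008): 14 + log2(2 · f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of two words and f_x and f_y are their individual frequencies. The function names and all the frequency counts are hypothetical, chosen only to show how the collocations of one register might be ranked; this is not the procedure used in the chapter itself.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice = 14 + log2(2 * f_xy / (f_x + f_y)) (Rychlý 2008)."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocations(counts: dict) -> list:
    """Sort (node, collocate) pairs from strongest to weakest logDice score."""
    return sorted(counts, key=lambda pair: log_dice(*counts[pair]), reverse=True)


# Hypothetical counts for one register: pair -> (f_xy, f_node, f_collocate).
toy_academic = {
    ("statistically", "significant"): (50, 100, 100),
    ("previous", "research"): (40, 400, 240),
    ("the", "research"): (200, 50000, 240),
}

for pair in rank_collocations(toy_academic):
    print(pair, round(log_dice(*toy_academic[pair]), 2))
```

In this toy data the pair with the highest raw co-occurrence count (the + research) receives the lowest score, because logDice weighs co-occurrence against the individual frequencies of both words.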
Using data from a large English web corpus, Cantos and Almela analyze the association between specific collocations of verbs with CAUSE as object and their impact on the co-occurrence probability of two different types of modifiers, namely, premodifiers and of-headed prepositional postmodifiers.

Part IV, the final part of the book, deals with the direct application of the lexical priming theory, with a particular focus on language learning and teaching. Michael Hoey himself has a strong background in teaching English as a foreign language and it is, therefore, not surprising that applications of the theory have been found that aid this. Jarmo Jantunen (like Juan Shao above) makes clear that the notion of lexical priming can be extended to languages other than English. In fact, students’ attempts to emulate recently learned ideas of how their target language is constructed can often lead to hyper-primings, where students use a form they have adopted, though this would not be a form used by fluent, proficient speakers of this language. The last chapter in this book shows how Stephen Jeaco went to great lengths to address just this problem when teaching his own students. He adopted the premises of the lexical priming theory in full in order to create a computer programme that assists (in his case, Chinese) learners of English towards a learner-appropriate way of making use of naturally-occurring language data.

Jarmo Jantunen uses the International Corpus of Learner Finnish in order to examine Finnish time expressions, in particular the lexical item kello (‘watch, time, o’clock’). Evidence from the corpus shows that this item is markedly overused by learners of Finnish. This chapter is of particular interest as it extends the scope of the lexical priming theory. Jantunen describes how previous studies on phraseology have mostly concentrated on English, a language which has little inflection.
This has meant, however, that morphology has rarely been touched upon in phraseology studies. This chapter aims to provide a holistic phraseological account of kello: one that includes not only its collocates but also its morphological priming, n-grams and semantic associations. The author highlights that the analysis of learner language can reap enormous benefits from an approach that combines lexical words and their morphological variants when phraseology is investigated. Looking at usage patterns for a highly inflected language like Finnish indicates that morphological priming as well as semantic preference plays an important role in the learner language phraseology.

In the final chapter of the book, Stephen Jeaco introduces the design of a concordancer, created and developed firmly in line with the theory of Lexical Priming. Jeaco stresses that the theory brings together a range of linguistic patterns that should be an important focus of language learning and teaching. This means, however, that both teachers and learners are confronted with the fact that there is a new dimension to the language taught which needs to be taken into account. The difficulty is, however, that collocations, colligations and, in particular, semantic
associations are difficult to find and to contextualize when one consults dictionary entries or, indeed, various types of concordancing software. Consequently, learners and teachers will face difficulties finding and presenting information about these primings. In his chapter, Stephen Jeaco introduces the pedagogical rationale for some of the key features of the software he has developed, and presents a first glimpse of how the interface works by showing examples of the search screen interface and the display of concordance lines.

There is further work and research being undertaken that makes use of the theory of lexical priming which we were not able to include in our book. There is, for example, the work by the lexicographer Patrick Hanks, who takes into account what priming means for language production and comprehension: “The lexical sets (…) can in turn be mapped, as colligations, onto syntactic structures. Indeed, they must be so mapped in order to enable speakers to utter meaningful sentences at all – though not through any conscious effort on the part of the speaker” (Hanks 2009). Hanks thus highlights the importance of lexis in forming grammatical structures and, in line with the lexical priming theory’s tenets, notes that the speaker makes no deliberate effort to use such structures in typical online production. The editors would also like to direct readers to work done on priming and textual position (Hoey & O’Donnell 2008; O’Donnell et al. 2012); on lexical priming, formulaic sequences and association patterns (O’Donnell, Römer & Ellis 2013); and to Michaela Mahlberg’s work on the relationship between lexis and text, with her focus on literary texts (Mahlberg 2007).
References

Almela, M. 2014. ‘You shall know a collocation by the company it keeps’: Methodological advances in lexical-constellation analysis. In Investigating Lexis: Vocabulary Teaching, ESP, Lexicography and Lexis Innovation, J.R. Calvo-Ferer & M.A. Campos (eds), 3–26. Newcastle upon Tyne: Cambridge Scholars.

Biber, D. 1988. Variation across Speech and Writing. Cambridge: CUP. doi: 10.1017/CBO9780511621024

Cantos, P. & Sánchez, A. 2001. Lexical constellations: What collocates fail to tell. International Journal of Corpus Linguistics 6(2): 199–228. doi: 10.1075/ijcl.6.2.02can

Gries, S.T. 2005. Syntactic priming: A corpus-based approach. Journal of Psycholinguistic Research 34(4): 365–399. doi: 10.1007/s10936-005-6139-3

Hanks, P. 2009. The linguistic double helix: Norms and exploitations. In After Half a Century of Slavonic Natural Language Processing. Festschrift for Karel Pala, D. Hlaváčková, A. Horák, K. Osolsobĕ & P. Rychlý (eds), 63–80. Brno: Masaryk University.

Hoey, M. 1991. Patterns of Lexis in Text. Oxford: OUP.

Hoey, M. 1995. The lexical nature of intertextuality: A preliminary study. In Organization in Discourse. Proceedings from the Turku Conference [Anglicana Turkuensia 14], B. Wårvik, S.-K. Tanskanen & R. Hiltunen (eds), 73–94. Turku: University of Turku.

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630

Hoey, M. & O’Donnell, M.B. 2008. Lexicography, grammar, and textual position. International Journal of Lexicography 21(3): 293–309. doi: 10.1093/ijl/ecn025

Hoey, M., Mahlberg, M., Stubbs, M. & Teubert, W. (eds) 2006. Text, Discourse and Corpora. London: Continuum.

Mahlberg, M. 2007. Lexical items in discourse: Identifying local textual functions of sustainable development. In Text, Discourse and Corpora. Theory and Analysis, M. Hoey, M. Mahlberg, M. Stubbs & W. Teubert (eds), 191–218. London: Continuum.

McCarthy, M. 1998. Spoken Language and Applied Linguistics. Cambridge: CUP.

O’Donnell, M.B., Römer, U. & Ellis, N.C. 2013. The development of formulaic sequences in first and second language writing: Investigating effects of frequency, association, and native norm. International Journal of Corpus Linguistics 18(1): 83–108. doi: 10.1075/ijcl.18.1.07odo

O’Donnell, M.B., Scott, M., Mahlberg, M. & Hoey, M. 2012. Exploring text-initial words, clusters and concgrams in a newspaper corpus. Corpus Linguistics and Linguistic Theory 8(1): 73–101. doi: 10.1515/cllt-2012-0004

Patterson, K.J. 2016. The analysis of metaphor: To what extent can the theory of lexical priming help our understanding of metaphor usage and comprehension? Journal of Psycholinguistic Research 45(2): 237–258. doi: 10.1007/s10936-014-9343-1

Pace-Sigge, M. 2013. Lexical Priming in Spoken English Usage. Houndmills: Palgrave Macmillan. doi: 10.1057/9781137331908

Pace-Sigge, M. 2015. The Function and Use of TO and OF in Multi-Word Units. Houndmills: Palgrave Macmillan. doi: 10.1057/9781137470317

Partington, A., Duguid, A. & Taylor, C. 2013. Patterns and Meanings in Discourse: Theory and Practice in Corpus-assisted Discourse Studies (CADS) [Studies in Corpus Linguistics 55]. Amsterdam: John Benjamins. doi: 10.1075/scl.55

Rychlý, P. 2008. A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, P. Sojka & A. Horák (eds), 6–9. Brno: Masaryk University.

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

Tao, H. 2003. Turn initiators in spoken English: A corpus-based approach to interaction and grammar. In Corpus Analysis: Language Structure and Language Use [Language and Computers 46], P. Leistyna & C.F. Meyer (eds), 187–207. Amsterdam: Rodopi.

Thompson, G. 2012. From one text to many: Functional grammar and corpus linguistics. Presentation, University of Liverpool Language Seminar Series, 7th March.

Tsiamita, F. 2009. Polysemy and lexical priming: The case of drive. In Exploring the Lexis-Grammar Interface [Studies in Corpus Linguistics 35], U. Römer & R. Schulze (eds), 247–264. Amsterdam: John Benjamins. doi: 10.1075/scl.35.16tsi

Part I
Discourse analysis

Cohesion and coherence in a content-specific corpus
Michael Hoey
University of Liverpool
This paper is a tentative and incomplete exploration into whether there are any grounds for believing that the way cohesion is utilised and recognised is the result of priming. As a necessary consequence of the first goal, it is an investigation into the relationship between two ways of looking at lexis in text, the one associated with cohesion, the other associated with corpus linguistics. It is only perhaps because cohesion studies have taken a back seat in 21st century linguistics that there has not been more attention given to the terminological embarrassment that the term ‘collocation’ is currently used in two apparently quite separate ways in the linguistic literature, to mean in corpus linguistics the semi-arbitrary co-occurrence of two or more words and in cohesive studies the relationship between different lexical items in a text that contributes to the creation of coherence in the text. Such an investigation is also an investigation into the compatibility of the methodologies employed in corpus linguistics and cohesive studies, and this is an investigation in which I have a personal investment, since I have at different times been a student of cohesion and a corpus linguist.
doi 10.1075/scl.79.01hoe
© 2017 John Benjamins Publishing Company

1. Introduction

Lexical priming (Hoey 2004a,b, 2005a, et seq.) is a theory intended to account for the existence of corpus linguistic phenomena such as collocation and colligation that cannot be explained in terms of logic or theoretical grammars of the generative kind. The explanation offered avoids circularity because it is not based on the data it seeks to explain but on the work of psycholinguists over a thirty-year period (see Pace-Sigge 2013, for a detailed account of this work). No other explanations have been offered for the existence of the phenomena under consideration. Expressed in a single sentence, its argument is that each encounter we have with a recurrent piece of language (word, phrase, morpheme, etc.) primes us to recognise and reproduce the encountered piece of language, thereby also ensuring
its continued recurrence. Since the recurrence subconsciously noted may be a replicated grammatical context or recognition of a shared semantics or reference, the reproduction need not be mechanical. Accordingly the theory is not behaviourist and is not challenged by (most kinds of) linguistic creativity (Hoey 2007a,b, in Hoey et al. 2007).¹

The value of a theory, however, lies not just in its explanatory ability but in its ability to generate hypotheses that can be tested, and one such hypothesis is that all recognised linguistic phenomena should be explicable in terms of the same mechanism of acquisition. More precisely, it is hypothesised that textual phenomena such as text structure, turn-taking and cohesion should be derivable from repeated encounters with pieces of language in a manner comparable to the assumed acquisition of collocation through such encounters. In practical terms, this means that such textual phenomena should be detectable through the application of traditional corpus-linguistic techniques. Of course a corpus represents no one’s experience of language, but in so far as a concordance represents a collection of potentially encounterable instances of a piece of language, it may be argued to be analogous in some (and only some) respects to an individual’s experience.

This paper is a tentative and incomplete exploration into whether there are any grounds for believing that the use and recognition of one of the phenomena noted above, cohesion, is the result of priming. As a necessary consequence of the first goal, it is an investigation into the relationship between two ways of looking at lexis in text, the one associated with cohesion, the other associated with corpus linguistics. There is an inevitable but largely ignored terminological embarrassment around the term ‘collocation’. It is currently used in two apparently quite separate ways in the linguistic literature.
In corpus linguistics it means the semi-arbitrary co-occurrence of two or more words. In cohesion studies, on the other hand, it means the relationships between different lexical items in a text that contribute to the creation of coherence in that text (Halliday & Hasan 1976), where cohesion can be defined as “a property of text whereby certain grammatical or lexical features of the sentences of a text connect them to other sentences in the text” (Hoey 1991: 266) and coherence as “a measure of the extent to which the reader or listener finds that [a] text holds together and makes sense as a unity” (ibid: 265–266). An investigation of this kind is also a study of the compatibility of the methodologies employed in corpus linguistics and cohesive studies, and this is an investigation in which I have a personal investment, since I have at different times been a student of cohesion and a corpus linguist. I began this investigation in 2005 and published a preliminary and incomplete paper on the subject that year, on which the first half of this paper draws (Hoey 2005b). Since the paper in question appeared in the proceedings of a Taiwanese conference published for a local ELT audience and has never to my knowledge been cited, and since it promised that the topics raised would be followed up, I feel less ashamed of my self-plagiarism than I otherwise might be. (I am more ashamed that it has taken me over 10 years to return to the topic.)

1. Extreme forms of creativity have not yet been examined, such as those described by Eckler 1997, or as illustrated by some passages of Finnegans Wake (Joyce 1939).

The structure of this paper is as follows. I begin by outlining the heavily lexical approach to the analysis of cohesion presented in Hoey (1991), drawing attention to the problematic features of this approach. I then look at the relationship between the information provided by cohesion in a text and the information provided by a concordance, drawing attention to some similarities. This then leads both to an extension of the notion of cohesion to cover intertextual connections across disparate writers and to the suggestion that lexical bonding in cohesion is a natural priming mechanism, in that as we make note of bonding, we are primed. The particularity of this bonding leads in conclusion to an exploration of the domain specificity of collocation and semantic association.
    2. The corpus
The paper makes use of a tiny corpus of 36 popular science texts, totalling slightly over 30,000 words, that I have created over many years. Inclusion in this corpus required the text to report some investigation of the outer region of the solar system, defined as including Pluto and whatever lies beyond. The earliest text in the corpus dates from March 1976 and the latest text dates from June 2016. Together they trace the changing status of Pluto and report repeated investigations of the Kuiper Belt and of objects orbiting the Sun that lie beyond Pluto. Each text is lightly annotated for textual position and function and each sentence is numbered. For convenience, I shall refer to this corpus as the ‘outer regions’ corpus.

One of the early texts in this corpus, The invisible influence of Planet X, was used in several papers on cohesion; in Hoey (2004b), for example, the possibility was explored that our lexis is primed for us with respect to its (in)capacity for participating in cohesive ties and chains. Because its cohesion was necessarily analysed in close detail, this text is reutilised in this paper. The points it is used to make are however quite different (Hoey 2005b excepted of course). The full text is given in the Appendix.
    3. Cohesion studies versus corpus linguistics
Although there are the usual terminological differences amongst those involved in the study of cohesion, it is broadly true that linguists concerned with cohesion would accept that the texture, or textuality, of a text is in part created by the cohesion that strings the parts of the text together. The view that cohesion is merely an epiphenomenon of coherence created by other means has not held the field, but even if it were accepted, it would still be true that cohesion gives us a window onto the text’s coherence. All the leading figures in the study of cohesion regard cohesion as (at least in part) a semantic property and most give an important place to lexical connections (Halliday & Hasan 1976; Hasan 1984; de Beaugrande & Dressler 1981). I discussed cohesion in Hoey (1983), and particularly in Hoey (1991).

But all these writers, including myself, treat words as if they operated in isolation. Corpus linguistics of course has demonstrated beyond all question that in fact words collocate with other words, such that it becomes impossible to break the combinations up into their component parts. Words do not operate in isolation. Amongst the key figures in such work are Sinclair (1991, 2004); Stubbs (1996, 2001); and Partington (1998). I have also written on collocation and related relations (e.g. Hoey 1993, 1998, and particularly 2005). At a personal level, therefore, I must investigate whether the differences in the methodologies adopted in Hoey (1991) and (2005) and the accompanying differences in the underlying assumptions of these two works can be resolved or whether one of the positions I have adopted must be abandoned or seriously modified. The positions on cohesion and corpus linguistics I consider are accordingly those with which I am associated.
It is my hope, though, that the observations I make about their relationship will be found to be applicable to the relationship between other approaches to cohesion and corpus study (particularly within the Sinclair tradition).
    4. Types of lexical cohesion used as a way of repeating textual material
Most types of cohesion can be regarded as more or less transparent strategies for repeating something (though this is not true of collocation in its cohesive sense, to which we shall be returning later in the paper). Because my own approach to the study of cohesion is concerned almost entirely with cohesive strategies of repetition, and because these are central to the argument of this paper, I shall describe and illustrate each of the major types of cohesive repetition, even though I do this in the knowledge that much of what I say will seem blindingly obvious or wholly familiar. However, if we are not agreed on what counts (or does not count) as repetition for analytical purposes, some of the positions argued for later on will seem untenable or untestable.

In Hoey (1991) I identified, and made subsequent analytical use of, a number of cohesive strategies used by speakers and writers to repeat something in a text. I shall briefly define and illustrate these here, drawing my examples from the text with which we will be most concerned in this paper – ‘The Invisible Influence of Planet X’ (henceforward the ‘Planet X’ text).

The first and most common repetition strategy, at least in non-narrative writing, is simple repetition, where a piece of language that has been used earlier is reused without variation or with only such variation that can be entirely accounted for in terms of some regular grammatical paradigm (e.g. tense, plurality):

(2) The textbooks say there are nine planets in our solar system. (3) The most distant is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona. (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection.

(The numbering in this and in all subsequent examples indicates the position of the sentences in the original text.)

The second type of repetition strategy, complex repetition, is again a case of an earlier piece of language being reused with variation, except this time the variation between the items is not attributable to the operation of a regular paradigm:

(4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection. (5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto.
(7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

These two kinds of repetition strategy can be identified automatically without great difficulty. This is not so with the third strategy, which requires interactants to share a perception that two items in a particular textual context can be regarded as having the same meaning; such items may be regarded as instantial synonyms. Shao (2016) shows that synonymy is psychologically real but that informants vary in the synonyms they recognise and also in respect of the words that they recognise as having synonyms; Bawcom (2011) suggests the term ‘similonym’ would capture better the approximation of meaning found in so-called synonyms. In Hoey (1991) I use the term ‘simple paraphrase’. The following extract exemplifies this kind of repetition strategy.

(5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

The fourth strategy is probably the most thoroughly investigated of the repetition strategies, and is that of pro-forms (pronouns and pro-verbs). Unlike the previous three strategies, the relationship between the item(s) reiterated and the mark of reiteration has no independent existence outside the particular piece of text in which it is used:

(4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection. (5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

The fifth repetition strategy considered here is ellipsis. Like the pro-forms, ellipsis has been much studied, partly because it relies to an even greater extent on the cooperation of the listener/reader. Sinclair (personal communication) objected to this being considered as a kind of repetition on the grounds that it makes little sense to talk of an absence repeating a presence, a view I have some sympathy with.
Nevertheless, it cannot be denied that listeners/readers use the absence to bring into play something previously mentioned in the text, nor that speakers/writers rely on their listeners’/readers’ ability to do this. For this reason, it retains its place in my list of repetition strategies. An instance is the following:

(2) The textbooks say there are nine planets in our solar system. (3) The most distant ^ [planet] is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona.

A sixth strategy that combines the shared perception of simple paraphrase with the text-specificity of pro-forms and ellipsis is that of co-reference:

(5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer
had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

Similar to, and in some circumstances indistinguishable from, co-reference is the seventh strategy included here: the relationship of particular-general, where mention of the general conjures up the particular that preceded it. The previous example illustrates this, but an example that is not also an instance of co-reference is the following:

(6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

There are other repetition strategies but they are less common and, with two important exceptions, do not affect the investigation being reported here. The first exception, and our eighth strategy, is that the use of several members of a closed (and/or precisely defined) set is deemed to involve some repetition. So March and July are deemed to share monthness, and (in Italian) sabato and domenica (Saturday and Sunday) share dayness. The reason for treating such closed set members as repetition is that languages are inconsistent in the degree to which they mark the posited repetition. Thus English marks the dayness of Saturday and Sunday in the shared lexical morpheme day, but Italian does not do so in sabato and domenica (though it does mark the others with the morpheme dì). Similarly, Chinese marks the monthness of all the months of the year with the character 月 (yuè) but there is no shared morpheme that marks monthness in English. Normally, closed sets are uncomplicated and unimportant. However, in a corpus concerned with the outer regions of the solar system, the closed set of planets quite regularly generates repetitions of this kind.
Furthermore, in the life of this corpus, the closed set has become problematized. Many of the later texts in the corpus explicitly engage with the question of whether there is a closed set of planets and what constitutes the criteria for membership of the set. An example of closed set repetition which sidesteps the question is the following, where the precisely defined set of numbers is utilised:

(2) The textbooks say there are nine planets in our solar system. (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection.

Strategy nine is a logical consequence of there being a range of strategies for repeating in a text. If item A is, by means of one of the strategies in this list, repeated by item B, then it follows that if item A is subsequently repeated by item C, item B must be as well. This is exemplified in the following:

(3) The most distant is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona. (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection. (5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.

As we can see (and as we discussed earlier), discovered is repeated by detected (simple paraphrase). Couple this with the fact that detection is also repeated by detected (complex repetition), and logically a repetition must also be acknowledged between discovered and detection.
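The reasoning behind strategy nine amounts to treating repetition as transitive: items joined by any chain of links end up in the same group. A minimal sketch of that bookkeeping follows, assuming each link is recorded as a pair of (sentence number, item) tuples; the function name and the data layout are hypothetical and are not part of the analysis in Hoey (1991).

```python
from collections import defaultdict


def repetition_classes(links):
    """Group items into sets connected, directly or indirectly, by repetition links."""
    parent = {}

    def find(item):
        # Follow parent pointers to the representative of the item's group.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # shorten the path as we go
            item = parent[item]
        return item

    for first, second in links:
        root_a, root_b = find(first), find(second)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    classes = defaultdict(set)
    for item in list(parent):
        classes[find(item)].add(item)
    return list(classes.values())


# Links taken from the examples above, as (sentence number, item) pairs.
links = [
    ((3, "discovered"), (7, "detected")),  # simple paraphrase
    ((4, "detection"), (7, "detected")),   # complex repetition
    ((2, "nine"), (4, "tenth")),           # closed-set repetition
]

for group in repetition_classes(links):
    print(sorted(group))
```

Although no link between discovered and detection is listed, the two items fall into the same group because both are linked to detected.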
    5. The significance of cohesive repetition
There are two ways in which these various types of repetition acquire significance. The first, identified by Hasan (1984), is that the cohesive items form chains, and these chains interact with each other. The cumulative effect of the interactions is to provide in diagrammatic form an encapsulation of key content. Consider the first seven sentences of the ‘Planet X’ text, where sentence 1 is a sub-heading and sentence 2 marks the beginning of the body of the text:

(1) Sixty years ago today, Pluto was discovered, now the hunt is on for an elusive tenth planet.²

(2) The textbooks say there are nine planets in our solar system. (3) The most distant is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona. (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection. (5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars.
2. The punctuation of sentence 1 is as in the original.

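[Figure 1: diagram of interacting lexical chains. Surviving node labels, by sentence number: (1) tenth; (2) tenth planet; (4) tenth planet, detection, astronomers, planet, tenth; (5) Planet X, search; (6) discover, He, Tombaugh; (7) object orbiting the sun, detected, astronomer.]

Figure 1. Lexical chains in the Pluto text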
This diagram (Figure 1) can be interpreted as suggesting that planets (particularly Pluto and Planet X) are searched for and/or discovered by astronomers; it also suggests that there is some issue as regards the count of planets (nine, tenth). Both suggestions hold for the passage analysed and for the whole text.

Hasan’s approach is in my view rewarding and reveals something true about the texts to which it is applied. In its focus on the interaction of chains it also picks up on potential collocations (in the corpus linguistic sense). But the big problem with Hasan’s system is that it becomes unwieldy to use on longer texts. It would have been impossible to represent in paper form the dense web of interactions between chains in the ‘Planet X’ text, though it could of course have been represented computationally.

My own approach (Hoey 1991) makes use of approximately the same cohesive data but focuses on the relation between sentences across the text rather than the relationship between chains running through (parts of) the text. Since I have elsewhere described the approach in considerable detail, I shall try here to provide the bare minimum of explanation in the hope that this will be sufficient to make my argument clear. Whenever one of the repetition strategies described above is used, the repetition is termed a ‘link’. There are three important points to be made about links. The first is that they do not just connect adjacent sentences; in other words, they may connect sentences at some distance from each other. Secondly, links often occur in combination; therefore, a sentence may repeat not just one but a number of lexical items from another sentence. Thirdly, a sentence may have links with a myriad other sentences; put another way, any sentence may be trawling its lexis from a large variety of earlier sentences.

To exemplify these points, consider sentence 57 of the ‘Planet X’ text:

(57) Dr Harrington says astronomers still do not understand the outer regions of our solar system.
        The previous mention in the text of the outer regions of the solar system occurred in sentence 10, illustrating that the repetition created by the links may connect sentences at a considerable distance:(10) But Dr Harrington and other sceptics say Pluto is too small to explain the orbits of the planets in the outer regions of the solar system, such as Uranus and Neptune. (57) Dr Harrington says astronomers still do not understand the outer regions of our solar system.Secondly, the repetitions of outer, regions, solar and system, although identified separately, have their effect in combination; the same is true of the repetitions of Dr Harrington and say/s. Ultimately all six of these repeated items (or seven if we treat Dr as a separate repetition) are working together to connect sentence 57 back to sentence 10.Thirdly, sentence 57 also repeats lexis from other sentences, including sen‑ tences 2 (the first sentence of the main body of the text) and sentence 49. I have highlighted below the repetitions that connect sentence 57 to each of these sentences:(49) Dr Harrington says the most remarkable feature predicted for Planet X is that its orbit is tilted 30 degrees away from the ecliptic, the main plane of the solar system, where all previous searches have concentrated. (57) Dr Harrington says astronomers still do not understand the outer regions of our solar system.(2) The textbooks say there are nine planets in our solar system. (57) Dr Harringtonsays astronomers still do not understand the outer regions of our solar system.Crucially, though, it is not just a question of retrieving earlier matter. It is a ques‑ tion of making sense of earlier matter in a new context. So when sentence 57 retrieves matter from sentence 10, it is so that the later sentence can spell out the implications of the earlier sentence. 
Likewise, when sentence 57 retrieves mate‑ rial from sentence 49, it is so that it can offer implied criticism of earlier searches, and when it retrieves lexis from sentence 2, it is to set up a contrast between what textbooks say and what Dr Harrington says. Try the experiment of substituting He for Dr Harrington in each of the first two pairings and read them aloud, and you will find them no less coherent than any adjacent pairs in the text. Likewise, if you add on the other hand after Dr Harrington in the final pair, you will, I believe, find that the same is true.This is not an isolated phenomenon. It occurs whenever sentences are con‑ nected by an above average level of repetition (typically three repetitions). Such sentences are termed ‘bonded’ in Hoey (1991). A ‘bond’ between two sentences is created whenever the threshold of repetition is met. It has been demonstrated
        that a pair of sentences connected by an above average level of repetition is characteristically capable of being interpreted together and often judged to be coherent, whatever the distance that separates them in a text, as long as the text in question is predominantly non‑narrative (Hoey 1991; Ahmad & Benbrahim 1995).

        We can recognise five levels of (in)coherence that may exist between bonded sentences. All of these levels of (in)coherence are judgements by the reader (or listener) and therefore two readers may differ in respect of the level of coherence they would recognise. Also, given that coherence is ultimately a subjective judgement, its existence in any particular text can never be proved; all that can be done is to attest that some readers judge a text to be coherent. It is of course possible to seek to correlate readers’ judgements with the presence or absence of certain features in a particular text, but there is no way of demonstrating that the judgement was shaped by the presence or absence of these features. Allowing that this is the case, though, it is possible to classify the judgements that readers make into different kinds. Firstly, a reader may perceive a pair of sentences to be fully coherent when read together. Secondly, a reader may perceive that only minor adjustments are needed to the cohesion of a pair of sentences in order for them to be fully coherent (e.g. the replacement of a full name by a pronoun, or vice versa). Thirdly, a reader may not find a pair of sentences fully coherent but readily recognises that the content of the two sentences is closely related over and above the necessarily shared content indicated by the individual repetitions. In such circumstances, it is often the case that it would not take heavy editing to bring out the coherence between them. Fourthly, there are occasionally pairs of bonded sentences that are so alike that the second is to all intents and purposes a replication of the first. These have the maximum shared content, while not strictly being coherent by virtue of their excessive repetitiveness. Finally, we have bonded pairs where the reader perceives no shared content apart from that represented by the repetitions that led to their being bonded in the first place.

        It will be apparent from the wording of the five categories of bonding that coherence is not a property of text but a property assigned to a text by a reader. Readers may on occasion differ in the degree to which they are willing to assign coherence to a text or part of a text. Claims about the coherence of bonded sentences are therefore claims about the willingness (or not) of a reader to read them together as coherent. We can, I hope, agree to accept the five categories just mentioned, but it does not follow that we would agree in how we assigned bonded pairs to these categories.

        With that caveat in mind, I will nevertheless attempt to illustrate four of the five categories according to my own perceptions of their degree of coherence, drawing on the first seven sentences of the text as before. An example of the kind of bond where I perceive coherence without the need to adjust the text is that
        between sentences 1 and 6, connected by five repetitions (emboldened as before), which, when placed together, reads as follows:

        (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet.
        (6) Robert Harrington is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto.

        The following is an example of the second kind of bonded pair. In this pair I perceive coherence but a minor adjustment needs to be made to the cohesion of the pairing to make the coherence possible:

        (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet.
        (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped detection.

        For me it only requires the deletion of But (which serves an entirely local textual function) for this pairing to be fully coherent, with the later sentence serving as an explanation for the ‘hunt’.

        A bonded pair that, for me, illustrates the third kind of pairing is that between sentences 1 and 3, which are connected by three repetitions. (There is co‑reference between sixty years ago today and 18 February 1930, as the ‘Planet X’ text was published on 18 February 1990):

        (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet.
        (3) The most distant planet is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona.

        This pair is, I would argue, not quite coherent. However, it only requires that the Subject and Complement of the independent clause of sentence 3 be reversed (i.e. Pluto is the most distant planet) for the pairing to become almost completely so, with sentence 3 providing extra detail about the discovery (i.e. who made it) and about the planet (as it was at the time the text was written) (i.e. the furthest away):

        (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet.
        (3) Pluto is the most distant planet, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona.

        Finally, this reader cannot regard as coherent the pairing of sentences 3 and 6, despite their meeting the criterion for bonding:

        (3) The most distant planet is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona.
        (6) Robert Harrington is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto.

        Even so, there is heavy overlapping between the first half of sentence 3 and the second half of sentence 6.
        There is no instance of the fourth kind of (in)coherence – a pair of sentences that are in effect identical to each other – in the Planet X text; such pairings occur rarely and normally the sentences concerned are at a considerable distance from each other.

        The extent of bonding in texts of moderate length and the distance on occasion between bonded pairs is perhaps surprising. There is no easy way of indicating all the bonds in the Planet X text, but the following diagram is an attempt to do so. The numbers in bold in the right hand column of the diagram are simply the successive sentences of the text. The numbers in the left hand column are the sentences with which the sentences in the right hand column form bonds; in other words, the sentences in the left hand column are those whose content the writer has reused in his production of the sentences in the right hand column, making use of course of the range of repetition strategies discussed earlier. (I am not suggesting that this process is conscious.)

        As already noted, there is an element of subjectivity in the attribution of coherence to any pair of bonded sentences, but I have indicated my judgements on the diagram. An underlined sentence number in the left hand column indicates that it is my judgement that this sentence forms a coherent bond with the sentence in the right hand column; for simplicity’s sake, this conflates the first two categories of coherence described above. If a sentence number on the left hand side is neither underlined nor italicised, it means that in my judgement the pair of sentences are not fully coherent but that their content is felt to be closely related in a manner not wholly explained by the individual links that led to their being bonded.

        Thus, Figure 2 above shows that sentence 8, for example, bonds with sentences 1, 2, 4, 6 and 7, and indicates by underlining in four of these cases (2, 4, 6 and 7), that the separate pairings of these sentences with sentence 8 are judged to be coherent or nearly so; the lack of underlining in the case of sentence 1 is intended to indicate that its pairing with sentence 8 is judged not to be coherent but that the two sentences taken together are nevertheless felt to contain mutually relevant information. In this case there is complete coherence between the first half of sentence 1 and sentence 8, which is undermined by the different and irrelevant content of the second half of sentence 1:

        (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet.
        (8) For a number of years before Mr Tombaugh’s discovery, the existence of a ninth planet was suspected, because something large was affecting the orbital path of Uranus around the Sun.

        Again the diagram shows in line 6 that sentence 6 bonds with two earlier sentences, sentences 1 and 3, and that though the bonding of sentences 1 and 8 does not result in coherence, the sentences contain mutually relevant content. The
        [Figure 2 diagram not reproduced: a two-column layout in which the right hand column lists the sentences of the text in order (1 to 59) and the left hand column lists the earlier sentences with which each forms a bond; underlining, italics and question marks on the left hand numbers recorded the coherence judgements discussed in the text.]

        Figure 2. Bonding across sentences in the Pluto text
        italicisation of 3 in the left hand column however indicates that in my judgement these sentences are incoherent together and do not share content over and above the links that led to their being bonded:

        (3) The most distant is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona.
        (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto.

        There are only 21 sentences in the text that bond with no earlier sentences (excluding sentence 1, which of course has no opportunity to do so). This means that slightly under two‑thirds of the sentences bond with at least one earlier sentence. (The diagram does not reflect the bonds that a sentence may make with subsequent sentences; if these are included, the number of non‑bonding sentences drops to 18.)

        There are in total 135 bonds amongst the sentences of the Planet X text. Of these I judge 82 (60.7%) to be coherent and a further 39 (28.9%) I judge to contain mutually relevant information. Only 14 (10.4%) are judged to be incoherent, and interestingly four of these are bonds formed by a single sentence, sentence 12. Analyses of other texts of similar length suggest that there is nothing unnatural or exceptional about these proportions, though there is a reasonably wide range of densities of bonding amongst texts, some bonding relatively sparsely and others bonding with even greater density than that found for the Planet X text.

        The working of bonding to create coherence need not be thought of as opaque. Given that the gap in time between reading a sentence at the beginning of a text and reading a later sentence that bonds with it may be a matter of minutes rather than seconds, it cannot simply be a matter of the utilisation of short term memory. Lexical priming theory, however, drawing as already noted on psycholinguistic research into semantic priming and repetition priming, assumes that each sentence is stored in what might, in a crude analogy, be thought of as a mental concordance. Consider again the pair:

        (2) The textbooks say there are nine planets in our solar system.
        (57) Dr Harrington says astronomers still do not understand the outer regions of our solar system.

        In Hoey (1995) I argued that if the assumption of a mental concordance were correct, an encounter with sentence 57 would automatically result in that sentence being added to the concordance for solar, and that since it is a relatively uncommon word it would be added to a relatively uncrowded concordance. (It would also be added to the more crowded concordances for say, our, and system.) The close juxtaposition of the two instances of solar might be argued then to activate the reader’s interpretative mechanisms, immediately allowing him/her to note the collocations of solar with system and of solar system with our, and then the shared framing of X say(s).
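The ‘mental concordance’ analogy can be made concrete with a small sketch. It is only an illustration of the analogy, not a model of how readers store text: the function names `tokenize` and `build_concordance` are hypothetical, and the index simply records, for each word form, the numbers of the sentences in which it has been met.

```python
from collections import defaultdict
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def build_concordance(sentences):
    """Map each word form to the numbers of the sentences containing it."""
    concordance = defaultdict(list)
    for number, text in sentences.items():
        for word in sorted(set(tokenize(text))):
            concordance[word].append(number)
    return concordance

# Sentences 2 and 57 of the 'Planet X' text, as quoted above.
sentences = {
    2: "The textbooks say there are nine planets in our solar system.",
    57: "Dr Harrington says astronomers still do not understand "
        "the outer regions of our solar system.",
}
concordance = build_concordance(sentences)

# Both sentences are filed under 'solar', so meeting sentence 57
# brings sentence 2 back into view.
print(concordance["solar"])
```

Because the sketch indexes word forms without lemmatising them, say and says are filed separately; recognising them as the same item is one of the interpretative steps that the analogy leaves to the reader.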
        The interpretative mechanisms need not be thought of as opaque, either. In Hoey (1991), I argue that the creation of a sense of coherence through bonding was through a set of unoriginal interpretative processes, such as the recognition of lexical equivalence and of co‑reference, the ability to reduce the information load of an expression by activating a superordinate or the trimming of superfluous information, the parallel ability to expand a lexical expression to take account of our ‘dictionary’ or ‘encyclopaedic’ knowledge associated with the wording of the expression and the activation of local discourse knowledge (e.g. pronoun reference). Hoey (1995) provides a detailed application of these posited processes to sentences from the Planet X text.

        I have described the cohesive approach in some detail because the argument that follows depends on three facts – firstly, that the analytical system depends on the identification of links as largely isolated items, with only co‑reference making serious use of clusters of words, secondly, that the cohesive phenomenon of bonding is not an incidental feature of text but central to the construction of coherence, and thirdly, that, despite its text specificity and its dependence on connections perceived between isolated words in defiance of (rightly) received corpus‑linguistic wisdom, it works as a way of exploring the meaningful connections of a text. With this in mind, the next section looks further at the problematic features of the approach from a corpus‑linguistic perspective.
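The link-and-bond procedure summarised in this section can be sketched as follows. This is a deliberately reduced illustration, not the analytical system of Hoey (1991): it counts only simple repetition of word forms, whereas the full approach also counts paraphrase, co‑reference and closed set repetition. The stoplist, the lemma table and the function names are all assumptions made for the example; the threshold of three follows the ‘typically three repetitions’ mentioned above.

```python
import re

# Assumption: a small stoplist of function words that are not counted as links.
STOPWORDS = {"the", "of", "in", "our", "is", "are", "and", "to", "as",
             "but", "do", "not", "there", "such", "too", "other", "still"}
# Assumption: a hand-written lemma table standing in for real lemmatisation.
LEMMAS = {"says": "say", "planets": "planet"}
BOND_THRESHOLD = 3  # "typically three repetitions"

def lexical_items(sentence):
    """Return the set of lemmatised lexical items in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS}

def links(a, b):
    """Items repeated between two sentences (simple repetition only)."""
    return lexical_items(a) & lexical_items(b)

def is_bonded(a, b, threshold=BOND_THRESHOLD):
    """Two sentences bond when their links reach the threshold."""
    return len(links(a, b)) >= threshold

S1 = ("Sixty years ago today Pluto was discovered, now the hunt is on "
      "for an elusive tenth planet.")
S2 = "The textbooks say there are nine planets in our solar system."
S10 = ("But Dr Harrington and other sceptics say Pluto is too small to "
       "explain the orbits of the planets in the outer regions of the "
       "solar system, such as Uranus and Neptune.")
S57 = ("Dr Harrington says astronomers still do not understand the outer "
       "regions of our solar system.")

print(sorted(links(S10, S57)))
print(is_bonded(S2, S57), is_bonded(S1, S57))
```

On these sentences the sketch finds the seven repeated items discussed above for sentences 10 and 57 (counting Dr separately), treats sentences 2 and 57 as bonded, and finds no bond between sentences 1 and 57. Whether a bonded pair is then read as coherent remains a reader's judgement that no count can supply.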
    6. A corpus‑linguistic perspective on the cohesion in the ‘Planet X’ text
      We saw in the last section that cohesion analyses of the kinds developed in the 1980s and 1990s depend upon the traditional assumption that words operate as self‑contained entities that may individually and separately connect with each other through reiteration, paraphrase or other means. Such analyses may work and be revealing, but the underlying assumptions they make are challenged by corpus‑based work on lexis, work to which I have enthusiastically contributed.

      To illustrate the scale and nature of the problem, let us look at the opening words of the first sentence in the ‘Planet X’ text, which functions as a supplement to the headline and stands somewhat apart from the rest of the text. As we have seen in several of the examples above, the words sixty, years, ago and today all contribute to the creation of bonds of the text and are counted separately in the calculation of the number of links necessary to trigger recognition of a bond. But in fact sixty collocates with years: in 401 instances of sixty in a corpus of Guardian news text dating from 1990 to 1995, there are 62 instances of the combination sixty years, making it the second most common lexical collocate of sixty (after per cent). Furthermore, sixty years collocates with ago: of the 62 instances of sixty years, 26 are followed by ago. Turning then to years ago, we find that years ago in turn collocates with today, albeit more weakly than in the previous two instances: there are 104 occurrences of years ago today in my Guardian data (out of 10,727 instances of years ago).

      This is not the end of the interconnections. The word years has a semantic association with NUMBER, of which sixty is an instance, and sixty has a pragmatic association with VAGUENESS; almost, around, more than, nearly, or [as in fifty or sixty, sixty or seventy, sixty or more, sixty or so], some [as in some sixty], and over all turn up in WordSmith’s list of collocates for sixty (Scott 2013). Today is designed to contradict the possibility of interpretation of sixty as vague. The combination X years ago has a strong colligational preference for appearing as an adjunct. On top of this, discovered collocates with both years and ago and has a semantic association with indicators of TIME. So the cohesive items in sentence 1 are all inextricably tied together and can in no sense be regarded as separate entities.

      This needs explanation. The theory of lexical priming (Hoey 2004a, b, 2005a) argues on the basis of both corpus linguistic studies and psychological research (Neely 1977, 1991; Anderson 1983) that we acquire word components, words, and combinations of words through repeated encounters with them as they occur in conversation or written texts. (Pace‑Sigge 2013 substantially broadens the coverage of the psychological research; a brief but useful overview can be found in Harley 2001.) Word components, words, and combinations of words are not however acquired ‘naked’; rather, they are acquired wrapped in the contexts in which they are encountered (local textual contexts, generic contexts, social contexts and so on). The contexts are consequently acquired along with the words. Indeed it may be that the consequence of not reproducing at least some of the contexts with which the word has been primed for us may be unintelligibility (which is not to say that we are unable to say new things – see Hoey 2007a, b).

      So sixty is typically primed for us to occur with years, sixty years ago is primed to occur as Adjunct, years is primed to occur with NUMBER and sixty is primed to occur with VAGUENESS. The word ‘typically’ in the previous sentence is important in that primings in principle may vary from speaker to speaker, given that we each have a unique set of conversational and reading experiences which serve to prime us.
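The collocation counts reported in this section rest on a simple operation: finding every occurrence of a node word and tallying what follows it. The sketch below shows that operation on a three-sentence toy corpus invented purely for illustration; it is not the Guardian data, and the function name `right_collocates` is hypothetical. The two proportions at the end are computed from the figures given in the text (62 of 401 instances of sixty followed by years; 26 of 62 instances of sixty years followed by ago).

```python
from collections import Counter
import re

def right_collocates(texts, node):
    """Count the word immediately following each occurrence of `node`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current == node:
                counts[following] += 1
    return counts

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "Sixty years ago today Pluto was discovered.",
    "Some sixty per cent of readers agreed.",
    "It happened nearly sixty years ago.",
]
counts = right_collocates(toy_corpus, "sixty")
print(counts.most_common())

# Proportions derived from the counts reported in the text.
share_years = round(62 / 401 * 100, 1)  # 'sixty' followed by 'years'
share_ago = round(26 / 62 * 100, 1)     # 'sixty years' followed by 'ago'
print(share_years, share_ago)
```

A concordancer such as WordSmith works with a wider window on both sides of the node and with statistical measures of association; the single right-hand slot used here is a simplification chosen to keep the example short.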
    7. A way forward
      So these are the positions that need to be resolved, and though I have referred these positions to my own work, they can be articulated without reference to my research either into cohesion or in the area of corpus linguistics. The question could be articulated just as readily by reference to the work of Halliday and Hasan (1976) on cohesion and to Sinclair’s work in corpus linguistics (1991, 2004). It is a question about the compatibility of two waves of data‑based research, the first at its peak between the 1970s and the 1990s looking at whole discourses, the second at its peak from the 1990s until today utilising computer corpora.

      To explore the ways that cohesive analysis and corpus linguistic analysis might relate to each other, it may be fruitful to look at two concordances of the same word – planet. The first of these is a concordance of the word planet* as it appears in the ‘outer region’ corpus described in Section 2. There are 662 hits for planet* in this corpus. Looking at this concordance (given in the Appendix), several features identify themselves as characteristics of the word planet*. In the first place, there is a semantic association with NUMBER, both ORDINAL and CARDINAL. So, as noted earlier, nine is a strong collocate, occurring 93 times in conjunction with planet (usually in the combination Planet Nine, the provisional name for the as yet undiscovered but mathematically predicted giant planet at the far reaches of the solar system); it also occurs nine times with planets. Ninth occurs 25 times with planet. Other numbers occurring are eight, which occurs seven times with planets, eighth, which occurs with planet twice, tenth, which occurs fourteen times with planet, and ten, which occurs twice with planets. To these can be added 16 instances of 9, 26 instances of X [=10], eight instances of 10th, two instances of two and single instances of 10, 12, 9th, five, and six. All of these instances are used to count the planets; other uses have been ignored. This means that of 662 instances of planet*, 209 (31.6%) occur with NUMBER (either cardinal or ordinal). Counting the planets has been an important part of popular science reporting in newspapers and on‑line.
      In the same data, we find that there are nine instances of discovered (none followed by that, which would of course have indicated a different use of the word), two instances of discovery and one each of undiscovered and yet-to-be-discovered; three instances of detection and one of detected; six instances of find(ing) and five of found (again, none followed by that), and finally four instances of located. In the 662 lines, therefore, there are 32 occurrences of words associated with FIND. Similarly, there are 10 lexical items associated with SEARCH – five instances of search, three of hunt(er), and two of look for. So, in total, in the 662 lines there are 42 instances (6.3%) of SEARCH/FIND.

      A third (and, it will be argued later, important) association of planet* is with (PLANET) NAME. This might seem obvious (though stating the apparently obvious has never been a mistake, as Stubbs (1996) cleverly points out), but in a corpus I have begun to create on exoplanets (planets in other solar systems), this association does not appear to exist, or rather it exists much more weakly and the names take a very different form. Sample exoplanet names are KIC 6185331 b, EPIC 203826436 d and Kepler‑974 b.
      In the 662 lines for planet* derived from the outer region corpus, there are 176 instances of planet names – Earth, Eris (a minor planet), Jupiter, Mars, Mercury, Neptune, Pluto, Saturn, Uranus, Venus and Xena (the original name given to Eris). All of them, apart from Earth and Xena, are named after gods in some former belief system, and that again will prove important later on. Since the names are often paired, and occasionally listed, it would not be revealing to represent the proportion of lines as a percentage, but it will be obvious that the NAME association is a strong one. Interestingly, there are 15 instances of lexical items used for naming – dubbed, called, calling, and so-called – and all of them are followed by either Planet Nine or Planet X, showing that these collocations function as provisional names. They have not of course been included in the count of names.

      Let’s now look at another concordance of planet*. This time the corpus is a single text – the Planet X text! There are 25 instances of planet* in this corpus/text. We find accompanying these instances two instances of nine, three of tenth and eight of X [=10]. That means that of the 25 instances of planet in the Planet X text, 13 (52%) occur with either a cardinal or ordinal number. In so far as they are all counts of the planets, they are simultaneously a semantic association of planet and part of the cohesion of the text in the form of closed set repetition.

      Again in the single text corpus, we find evidence of semantic associations with SEARCH and FIND. In 25 lines, we have six lexical items of searching – hunt(ing) (three times) and search(ing) (again three times), and nine lexical items of finding – finding, discovery (twice), discovered, detection, located, found, identified and picked out. So we have a total of 15 instances (60%) of SEARCH/FIND. Again, since it is a single text, we not only have evidence of the same semantic association that was found in the larger corpus but evidence also of the cohesion of the text in the form of simple repetition, complex repetition, simple paraphrase, and complex paraphrase.

      Finally, in the 25 lines of the concordance for planet* derived from the single Planet X text, we find the association with NAME. There are eight lines (24%) in the concordance that contain one or more planet names.

      The 15 instances of SEARCH/FIND in the Planet X text form a chain that interacts with another chain that includes planet but also includes instances of NAME – Pluto, Uranus, Jupiter, Venus (the last two of which fall outside the traditional concordance window) – as well as co‑hyponyms of planet – moon, asteroids (which would be included as cohesion in Hasan’s analysis, though not in mine because the terminology does not constitute a closed set). Thus the corpus linguistic findings turn out to relate directly within the text to cohesive analysis of the kind described by Hasan (1984, and in Halliday & Hasan 1985) and myself (Hoey 1991).
      As can be seen, the smaller corpus carries the same semantic associations as the ‘outer region’ corpus, only more strongly. So both the concordance for the whole corpus and the concordance for the single text show that planet has a semantic association with NUMBER, and likewise both concordances show a semantic association with SEARCH/DISCOVERY and with NAME. But there is a vital difference between the two concordances. The first mirrors the experience that a user of the language has when being primed over a period of time by scattered encounters with the word planet or planets in many types of context and in many different texts, albeit indirectly and incompletely. The second, on the other hand, is the experience that a user has on a single occasion when reading a very specific text. The first contributes to the user’s primings, establishing certain patterns of usage for the word in question, while the other contributes to the user’s sense of cohesion, establishing certain patterns of connection within the text.

      Readers of ‘The Invisible Influence of Planet X’ are highly likely to be already primed for planet. They will have encountered the word in a wide range of contexts (including of course many of no relevance to the text in question, such as names – Lonely Planet – and hyperboles – a brain the size of a planet). Consequently, they are likely to have already been primed to associate planet with NUMBER and will not be surprised by the association of planet with tenth in sentence 1 or with nine in sentence 2. They may also already recognise Planet X as a collocation; there are 26 instances of the collocation in my ‘outer region’ corpus. (As early as 1953, there was an amusing Looney Tunes cartoon³ that showed Daffy Duck sent to find Planet X – he passes nine other planets on the way, all with large Roman numerals on them.) For such readers, sentence 1 simply reinforces the priming for planet to occur with an ordinal number. 
For those of course who have not been primed for the collocation or the semantic association, this sentence will begin the process of priming them. For these readers the semantic association will then be reinforced by the presence of nine planets in sentence 2 (as of course it will be for those who are already primed).The first sentence also reinforces for some readers (and begins to create for others) the priming of planet to occur with SEARCH/DISCOVERY, since discov- ered and hunt both belonging to this semantic association. Outside the scope of a traditional concordance line but within the same sentence, the association of the word planet with (planet) NAME is also reinforced (Pluto).If readers find their primings reinforced, there is little more to say. If, how‑ ever, they begin to identify a new association, I suggest that the association will in the first instance be recognised as text‑specific, in other words as temporary
      1. Duck Dodgers in the 24½ Century.
        and confined to this instance of text. Each occurrence of the temporary associa‑ tion will then be cohesive. Mahlberg (2012) demonstrates that Dickens makes use of such temporary associations not only to describe his characters but to make them immediately identifiable, even without mention of their names. Readers are unlikely to ever make such primings productive, such that they start reproducing them in their own writing and speech. They are, though, essential to the enjoy‑ ment of the novel and simultaneously contribute to the cohesion of the novel.An example in the Planet X text, with regard to my own priming, was that of elusive in conjunction with planet. I did not recognise a collocation with elusive nor a semantic association with HARD TO FIND but its presence in the first sen‑ tence of the text temporarily primed me to expect some kind of cohesion with it. This follows in sentence 4: which has somehow escaped detection. Sentence 35 indi‑ rectly picks this up with unseen. As it happens, this semantic association (and col‑ location) is not unique to this text. Elusive occurs four times in my ‘outer region’ corpus (three times with planet and once with Pluto), escaped being detected/ detection occurs twice, undetected occurs three times (twice with planet and once with cosmic object) and hard to detect occurs once (with comets). Looking only at planet*, that accounts for 7 instances out of 662. But the point is that I started in 1990 without the semantic association with HARD TO FIND, and initially identi‑ fied it as a cohesive property of this particular text.In short, as we pick up on cohesion, we reinforce or add to our primings, and as we are primed, we recognise the cohesion of a text. The process of utilising the cohesion of a text to assist in the creation of a sense of coherence for that text and the process of being primed by the collocations and semantic associations etc. encountered in a text are essentially the same. 
One might go one step further. It may be that in the search for coherence in what we hear or read, which is after all both automatic and inevitable, we are making ourselves available to be primed by what we are encountering. In turn, when we ourselves speak or write, it is our primings that enable us to be coherent and fluent.
    8. Collocation and semantic association across texts that create cohesive chain interaction
      If a concordance derived from a single text is (with some important qualifications) a record of part of the cohesion of the text contributing to the coherence of the text, the question must arise: is a concordance derived from a topic‑specific corpus such as the ‘outer region’ corpus also a record of cohesion ACROSS texts contributing to intertextual coherence? I have considered this question before in Hoey (1995, 2005b), but in neither case with reference to an analysable corpus.
In order to investigate this, I identified the first sentence in the earliest text in the ‘outer regions’ corpus, which was:

(TI)(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.⁴

This sentence contains nine lexical words – Pluto, often, regarded, distant, nine, known, planets, solar and system – though often is not a word that contributes much to cohesion and solar system functions as a single item most of the time. I then searched the concordance for Pluto, which contains 313 lines, for evidence of their forming cohesive chains relating to each other in the same way that the words in the sentence relate to each other. My strategy was to look on each line for at least two of these items in the KWIC format, following the categories of repetition described in section X. Where two or more such items were found, I then looked at the complete sentence to see whether any further repetitions could be observed. Out of the 313 lines, some were drawn from the annotation of diagrams and others drew on the same sentence as each other, wherever there were several mentions of Pluto in a sentence. So there were considerably fewer than 313 separate sentences reflected in the concordance. Nevertheless, there were 74 sentences that contributed to at least three lexical chains, and these were drawn from over two‑thirds of the texts in the corpus (26 out of 36). I give below one sentence that meets the criteria from each of the 26 texts. In each case I have taken the earliest sentence in the text to contribute to three chains, except in a small number of cases where there is doubt as to whether the earliest sentence meets the criterion (usually where solar system is being treated as contributing to two chains). Where only one sentence meets the criterion and is doubtful, it has nevertheless been included.
(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(18‑2‑1990 1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet
(11‑7‑1993 21) Discrepancies in the motions of Uranus and Neptune had led some astronomers to conclude there was another body far beyond Pluto, a tenth planet that was exerting its gravitational influence on the outer Solar System.
(27‑5‑1995 24) Pluto, which has a very oval, tilted orbit, is currently at its closest to the Sun – inside the orbit of Neptune.
      4. The TI indicates that the sentence is text-initial and the bracketed material gives the month and year of publication together with the sentence number within the text. The newspaper cutting I have retained does not contain the date within the month.

(8‑12‑1995 24) Some scientists believe there might be a 10th planet beyond the orbit of Pluto.
(9‑12‑1995 1) Sir: What a pity Tom Wilkie ended his otherwise excellent article on spacecraft that have gone to Jupiter and beyond (8 December) with a paragraph about a “tenth planet beyond the orbit of Pluto” that might have been written years ago.
(7‑8‑2002 17) Even the venerable astronomer Patrick Moore said: “Pluto cannot be classified as a bona fide planet.”
(16‑5‑2004 3) Sedna, hailed as the most distant object orbiting the sun, and the biggest heavenly object to be discovered since Pluto, is a lump of icy rock the size of Great Britain, or Italy, that takes 10,500 years to orbit the sun.
(31‑7‑2005 4) The giant lump of rock and ice [Xena/Eris] is larger than the planet Pluto and is now the farthest known object in the solar system.
(1‑8‑2005 3) The body [Xena/Eris] is believed to be about 1,700 miles in diameter, about a quarter the size of the Earth, and about one‑and‑a‑half times the size of Pluto, the ninth and last planet to be discovered, in 1930.
(2‑11‑2005 2) Pluto, discovered as the ninth planet in 1930, was thought to be alone until its moon Charon was spotted in 1978.
(18‑1‑2006 5) But, weather permitting, the New Horizons probe will set off today on a nine‑year journey to the smallest, coldest and least understood planet in the solar system – the “ice‑dwarf” world of Pluto.
(2‑2‑2006 1) School textbooks on astronomy may have to be rewritten following official confirmation that a new object discovered at the edge of the solar system is bigger than the ninth planet, Pluto.
(25‑8‑2006 1) The world’s leading astronomers yesterday voted to reduce the size of the solar system by stripping Pluto of its status as a planet.
(25‑2‑2008 9) They [craft launched into orbit] are *en route* to Pluto, the most distant planet, and are even hoping to land on the shifting sands of a comet.
(8‑8‑2008 10) Should the debate see Pluto regain its planetary crown, we could soon be counting the planets in their dozens and not just the (once) familiar nine.
(7‑1‑2011 1) When astronomer Mike Brown discovered a distant world called Eris, he didn’t realise it would see Pluto kicked out of the Solar System – and his letterbox fill with hate mail.
(13‑2‑2011 2) If you grew up thinking there were nine planets and were shocked when Pluto was demoted five years ago, get ready for another surprise.
(9‑5‑2014 35) For example, recent new discoveries have been made of objects in the outer regions of our Solar System that have sizes comparable with and larger than Pluto.
(19‑1‑2015 2) They are thought to exist beyond the orbits of Neptune, the farthest true planet from the Sun, and the even more distant tiny “dwarf planet” Pluto.
(21‑1‑2016 4) Ever since Pluto was downgraded to a dwarf planet 10 years ago, there have been eight known planets making up the solar system, but astronomers have for years proposed the existence of an extra “Planet X” lurking on the dark, icy fringes of the system.
(5‑4‑2016 24) Although 600 AUs [Astronomical Units] – roughly 15 times the average distance to Pluto – does sound far, Planet Nine could theoretically hide as far away as 1,200 AUs.
(1‑5‑2016 1) For three quarters a century, Pluto was regarded as the ninth planet.
(8‑5‑2016 2) It is hydrostatic and orbits the sun, not another planetary body – which would make it a moon (The great Pluto debate, last week).
(12‑5‑2016 19) For perspective, Pluto orbits the sun at an average distance of 39.5 AU, and completes one lap every 248 years.
(31‑5‑2016 80) He [Michael Brown] is best known for his discovery of Eris, the most massive object found in the solar system in 150 years, and the object which led to the debate and eventual demotion of Pluto from a real planet to a dwarf planet.

Chain reaction(s) of the kind described in Hasan (1984; Halliday & Hasan, 1985) is very apparent here. We have:
nine known planets
tenth planet
tenth planet
10th planet
tenth planet
ninth and last planet
ninth planet
ninth planet
planets in their dozens and not just the (once) familiar nine
nine planets
eight known planets
Planet X
Planet Nine
ninth planet
Most of these interact with a chain of instances of Pluto. Two of the instances of NUMBER + planets interact with a small chain of known. Separately from this, a smaller chain of instances of Pluto interacts with a chain comprising regarded as, classified as, and regarded as (again). A chain of superlatives interacts with solar system:

most distant    in our solar system
most distant    orbiting the Sun
farthest known    in the solar system
smallest, coldest and least understood    in the solar system
the farthest    from the Sun
the most massive … for 150 years    found in the solar system

The chain interactions extend across texts over forty years. Furthermore, of the 26 sentences (chosen, it will be remembered, because they are the first in each text to conform to the criterion of making use of at least three of the lexical words in the first sentence of the oldest text in the corpus), seven are text‑initial sentences and a further four are the second sentence in the text. So the chain interactions across the texts are associated with text beginnings.
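As a rough illustration, the selection criterion used in this section (a sentence qualifies if it shares at least three lexical items with the first sentence of the oldest text) can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually followed: the names `SEED_ITEMS`, `shared_items` and `meets_criterion` are hypothetical, and the matching covers only simple repetition (the word itself, optionally with a plural -s), which is much narrower than the categories of repetition the analysis relies on.

```python
import re

# Lexical items of the seed sentence, with "solar system" treated as one item
# and "often" left out because it contributes little to cohesion.
# (Assumption: items are reduced to lower-case singular forms.)
SEED_ITEMS = ("pluto", "regarded", "distant", "nine", "known", "planet", "solar system")


def shared_items(sentence, items=SEED_ITEMS):
    """Return the seed items that recur in `sentence`.

    Only simple repetition is detected: the item itself, optionally followed
    by a plural -s. Related forms such as "ninth" for "nine" are NOT matched.
    """
    text = sentence.lower()
    return {
        item for item in items
        if re.search(r"\b" + re.escape(item) + r"s?\b", text)
    }


def meets_criterion(sentence, threshold=3):
    """True if the sentence shares at least `threshold` items with the seed."""
    return len(shared_items(sentence)) >= threshold


if __name__ == "__main__":
    example = ("The giant lump of rock and ice is larger than the planet Pluto "
               "and is now the farthest known object in the solar system.")
    print(sorted(shared_items(example)))
    print(meets_criterion(example))
```

Under this crude matching a sentence such as "Pluto, discovered as the ninth planet in 1930, ..." falls below the threshold, because ninth is not matched against nine. That it nevertheless appears in the list above presumably reflects the broader definition of repetition used in the analysis, which a fuller implementation would need to model.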
    9. Content‑specific collocations, semantic associations and cohesive chains
Examination of the chains in the manner just undertaken leads one to notice idiosyncrasies in the wording of texts discussing the outer regions of the solar system. For example, the word known in our first sentence is used in a manner rather different from the most common usages in other types of writing and speech. Out of 45 instances of known in the outer region corpus, six are instances of NUMBER known SOLAR OBJECT and nine are instances of SUPERLATIVE/HIGH COMPARATIVE known SOLAR OBJECT. Examples are:

six known moons
16 previously known objects that orbit beyond Neptune
furthest known planet
second largest known dwarf planet
greater in volume than all known asteroids combined
six most distant known objects in the solar system

In a hundred randomly chosen lines from the Guardian corpus, there was one instance of NUMBER known NOUN and three of SUPERLATIVE/HIGH COMPARATIVE known NOUN. The combinations appear to be a characteristic of the local topic – too specific to be a register, but register‑like in characterising the topic in particular ways.

Again, there are seven uses of hunt in my outer region corpus, all of which interact with instances of solar system object:
The hunt is on for a gas giant up to four times the mass of Jupiter
The hunt is on to find “Planet Nine”
to hunt for exoplanets
this hunt for minor satellites of the earth
the hunt is on for an elusive tenth planet
the great planet hunt
a systematic hunt [ELLIPSIS OF nearby object such as a planet]
It will be noted that three of these take the form of the hunt is on. In 254 instances of hunt in the Guardian 1990 corpus (after proper names have been removed), there are only two instances of this expression.

Here are some more instances of the local collocations/semantic associations that are simultaneously chain interactions. In a sample of a hundred lines of myster* in the Guardian, the words mystery and mysterious occur 26 times (26% obviously) in the combination myster* + NOUN, often in front of words such as thing, way, nature and inquiry. In the ‘outer regions’ corpus, 27 out of 34 occur in the combination myster* + NOUN, and of these 27, 19 occur in the combination myster* planet, and a further five occur in the combination myster* + SOLAR OBJECT, comprising 70.6% of the data. Examples are:
mysterious ninth planet
mystery planet
Mysterious ‘Planet Nine’
mysterious dwarf planet
mystery sphere found in our solar system
mysterious objects of the Kuiper Belt
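The frequency contrasts reported so far in this section can be laid side by side with a few lines of Python. All counts are those given in the text; the helper name `share` and the layout of `COMPARISONS` are illustrative assumptions. The samples differ in size and make‑up, so the percentages are descriptive only and are not offered as a significance test.

```python
def share(count, total):
    """Percentage of `total` accounted for by `count`, to one decimal place."""
    return round(100 * count / total, 1)


# (pattern, (hits, total) in the 'outer regions' corpus,
#           (hits, total) in the Guardian comparison sample)
COMPARISONS = [
    # NUMBER known X (6) plus SUPERLATIVE/HIGH COMPARATIVE known X (9),
    # against 1 + 3 in a hundred Guardian lines
    ("NUMBER or SUPERLATIVE + known + noun", (6 + 9, 45), (1 + 3, 100)),
    ("the hunt is on", (3, 7), (2, 254)),
    ("myster* + NOUN", (27, 34), (26, 100)),
]

for pattern, (topic_hits, topic_total), (ref_hits, ref_total) in COMPARISONS:
    print(f"{pattern}: {share(topic_hits, topic_total)}% "
          f"vs {share(ref_hits, ref_total)}%")

# Share of all myster* lines followed by planet (19) or another solar object (5).
myster_solar = share(19 + 5, 34)
print(f"myster* + planet/SOLAR OBJECT: {myster_solar}%")
```

The last figure reproduces the 70.6% quoted above (24 of the 34 lines).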
      Again, there are 13 instances of downgrad*, all but one of which interact with instances of Pluto. There are also 13 instances of demot*, all of which interact with instances of Pluto. Other expressions with approximately the same sense that occur twice and once respectively, always with instances of Pluto, are relegated and expelled from the club. Examples are:

demotion of Pluto
poor old Pluto demoted
demoted Pluto to the status of dwarf planet
it [Pluto] was downgraded
Pluto, which was officially downgraded
Pluto was downgraded to a dwarf planet
Pluto would have to be … “dwarf planet” status
expel Pluto from the planetary club

Demotion is something that can happen to a person or a football club, as are relegated and expelled, and all three words intuitively carry negative connotations. The same is true of downgrading. Of a hundred random instances of downgrad* from the Guardian 1990 corpus, only three are positive as opposed to 87 that are negative (the remaining 10 being either neutral or indeterminate).

To these negative items might be added: unworthy, jumped-up, and runt. To someone who observes the changes in terminology with an objective eye, all that has happened to Pluto is that it has been given a different scientific classification, and no scientific classification is any better or worse than any other. But within this content‑specific corpus, the language has developed differently.

There are also three uses of family to describe the planets or their moons. We noted earlier the importance of words connected with naming. This naming is either a symptom of a tendency to humanise the planets of our own solar system or a possible cause. In my incipient corpus of the search for exoplanets, no such human terms are used, nor, as already noted, are the names given to exoplanets those of mythical beings.

I conclude that cumulatively these features are compatible with the view that the writers who contribute to a content‑specific corpus prime each other such that certain content‑specific collocations and semantic associations develop that are not apparent in connection with other types of content.
    10. Intertextual bonding

We have one more claim to test. The claim made above in Section 5 and in Hoey (1991) was that where there was an above average level of repetition between two sentences (typically three repetitions) there was a high chance of coherence being recognised between the two sentences. The question then is whether this is also true of sentences from different texts written at different times and published in different places. Let us start by considering some of those sentences that make use of the negative expression of reclassification and at the same time contain at least three repetitions as defined in Section 5:

(25‑8‑2006 14) The IAU’s decision to demote Pluto represents a U‑turn on a proposal put before the scientists last week, which would have seen the solar system expand to 12 planets, with four, including Pluto, being classed as new objects called plutons.
(8‑8‑2008 1) Two years ago, the International Astronomical Union voted to demote Pluto – and became probably the most unpopular group nobody has ever heard of.

The two sentences seem to belong to the second or third category of coherence referred to. Minor adjustments of tense and removal of time references are necessary (adjustments not envisaged when the criteria were articulated for bonded sentences within the same text), and the shared information in the second sentence needs to be subordinated. With these changes, the pair reads as follows:

(25‑8‑2006 14) The IAU’s decision to demote Pluto represented a U‑turn on a proposal put before the scientists, which would have seen the solar system expand to 12 planets, with four, including Pluto, being classed as new objects called plutons.
(8‑8‑2008 1) [When] the International Astronomical Union voted to demote Pluto, [they] became probably the most unpopular group nobody has ever heard of.

Again, consider the following pair:

(7‑1‑2011 7) With it [a relatively simple telescope] he [Mike Brown] redrew our understanding of our star system, demoting Pluto from its exalted planetary status to that of a lowly “dwarf planet”.
(1‑5‑2016 27) Stern rejects the IAU vote that demoted Pluto to the status of dwarf planet.

Here only the addition of But is necessary to create coherence (and the pairing incidentally identifies the key antagonism between Brown and Stern in planetary astronomy).
It will be noticed too that the second sentence follows naturally on from both the previous sentences considered, despite their being 10 and 8 years apart respectively:

(25‑8‑2006 14) The IAU’s decision to demote Pluto represents a U‑turn on a proposal put before the scientists last week, which would have seen the solar system expand to 12 planets, with four, including Pluto, being classed as new objects called plutons.
(1‑5‑2016 27) Stern rejects the IAU vote that demoted Pluto to the status of dwarf planet.

(8‑8‑2008 1) Two years ago, the International Astronomical Union voted to demote Pluto – and became probably the most unpopular group nobody has ever heard of.
(1‑5‑2016 27) Stern rejects the IAU vote that demoted Pluto to the status of dwarf planet.

The addition of but to the first pair and of for example to the second might make them more readable, but essentially the coherence is there to see.

Earlier I listed 25 sentences that bonded with the first sentence of the oldest text in the outer regions corpus. I now list them without discussion in pairs with a judgement about their coherence that you are of course in a position to agree or disagree with. It will be recalled that I classify the (in)coherence of bonded pairs according to whether they are coherent, are coherent with minor modifications, have shared content (over and beyond that presupposed by the cohesion), are identical or seem incoherent. All modifications mentioned are to the second sentence unless otherwise stated. Here then are my judgements:

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(18‑2‑1990 1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet
(COHERENT WITH MINOR MODIFICATIONS – non‑fronting of adjunct; replacement of Pluto with it)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(11‑7‑1993 21) Discrepancies in the motions of Uranus and Neptune had led some astronomers to conclude there was another body far beyond Pluto, a tenth planet that was exerting its gravitational influence on the outer Solar System.
(COHERENT WITH MINOR MODIFICATIONS – had → have; addition of however)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(27‑5‑1995 24) Pluto, which has a very oval, tilted orbit, is currently at its closest to the Sun – inside the orbit of Neptune.
(COHERENT WITH MINOR MODIFICATIONS – addition of however)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(8‑12‑1995 24) Some scientists believe there might be a 10th planet beyond the orbit of Pluto.
(COHERENT WITH MINOR MODIFICATIONS – addition of however)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(9‑12‑1995 1) Sir: What a pity Tom Wilkie ended his otherwise excellent article on spacecraft that have gone to Jupiter and beyond (8 December) with a paragraph about a “tenth planet beyond the orbit of Pluto” that might have been written years ago.
(INCOHERENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(7‑8‑2002 17) Even the venerable astronomer Patrick Moore said: “Pluto cannot be classified as a bona fide planet.”
(COHERENT WITH MINOR MODIFICATIONS – replacement of Even with however)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(16‑5‑2004 3) Sedna, hailed as the most distant object orbiting the sun, and the biggest heavenly object to be discovered since Pluto, is a lump of icy rock the size of Great Britain, or Italy, that takes 10,500 years to orbit the sun.
(SHARED CONTENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(31‑7‑2005 4) The giant lump of rock and ice [Xena/Eris] is larger than the planet Pluto and is now the farthest known object in the solar system.
(COHERENT WITH MINOR MODIFICATIONS – addition of however; replacement of the planet Pluto with it; addition of Xena/Eris in parenthesis after The giant lump of rock and ice)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(1‑8‑2005 3) The body [Xena/Eris] is believed to be about 1,700 miles in diameter, about a quarter the size of the Earth, and about one‑and‑a‑half times the size of Pluto, the ninth and last planet to be discovered, in 1930.
(SHARED CONTENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(2‑11‑2005 2) Pluto, discovered as the ninth planet in 1930, was thought to be alone until its moon Charon was spotted in 1978.
(COHERENT WITH MINOR MODIFICATIONS – addition of also; replacement of Pluto with it and moving it immediately before was)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(18‑1‑2006 5) But, weather permitting, the New Horizons probe will set off today on a nine‑year journey to the smallest, coldest and least understood planet in the solar system – the “ice‑dwarf” world of Pluto.
(COHERENT WITH MINOR MODIFICATIONS – removal of But and reversal of the parenthesis that ends the sentence)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(2‑2‑2006 1) School textbooks on astronomy may have to be rewritten following official confirmation that a new object discovered at the edge of the solar system is bigger than the ninth planet, Pluto.
(COHERENT WITH MINOR MODIFICATIONS – addition of however; replacement of the ninth planet, Pluto with it)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(25‑8‑2006 1) The world’s leading astronomers yesterday voted to reduce the size of the solar system by stripping Pluto of its status as a planet.
(COHERENT WITH MINOR MODIFICATIONS – addition of however; replacement of Pluto with it)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(25‑2‑2008 9) They [craft launched into orbit] are *en route* to Pluto, the most distant planet, and are even hoping to land on the shifting sands of a comet.
(INCOHERENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(8‑8‑2008 10) Should the debate see Pluto regain its planetary crown, we could soon be counting the planets in their dozens and not just the (once) familiar nine.
(SHARED CONTENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(7‑1‑2011 1) When astronomer Mike Brown discovered a distant world called Eris, he didn’t realise it would see Pluto kicked out of the Solar System – and his letterbox fill with hate mail.
(COHERENT WITH MINOR MODIFICATION TO FIRST SENTENCE – is often → was once)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(13‑2‑2011 2) If you grew up thinking there were nine planets and were shocked when Pluto was demoted five years ago, get ready for another surprise.
(COHERENT WITH MINOR MODIFICATION TO FIRST SENTENCE – is often → was once)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(9‑5‑2014 35) For example, recent new discoveries have been made of objects in the outer regions of our Solar System that have sizes comparable with and larger than Pluto.
(INCOHERENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(19‑1‑2015 2) They [At least two planets as big as Earth] are thought to exist beyond the orbits of Neptune, the farthest true planet from the Sun, and the even more distant tiny “dwarf planet” Pluto.
(SHARED CONTENT – close to COHERENT WITH MINOR MODIFICATIONS)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(21‑1‑2016 4) Ever since Pluto was downgraded to a dwarf planet 10 years ago, there have been eight known planets making up the solar system, but astronomers have for years proposed the existence of an extra “Planet X” lurking on the dark, icy fringes of the system.
(COHERENT WITH MINOR MODIFICATIONS – is often → was once in first sentence; replacement of Pluto with it)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(5‑4‑2016 24) Although 600 AUs [Astronomical Units] – roughly 15 times the average distance to Pluto – does sound far, Planet Nine could theoretically hide as far away as 1,200 AUs.
(Arguably COHERENT WITH MINOR MODIFICATIONS – addition of however and Nine → X)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(1‑5‑2016 1) For three quarters a century, Pluto was regarded as the ninth planet.
(IDENTICAL WITH MINOR MODIFICATION TO FIRST SENTENCE – is → was)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(8‑5‑2016 2) It [Pluto] is hydrostatic and orbits the sun, not another planetary body – which would make it a moon (The great Pluto debate, last week).
(Arguably COHERENT)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(12‑5‑2016 19) For perspective, Pluto orbits the sun at an average distance of 39.5 AU, and completes one lap every 248 years.
(COHERENT WITH MINOR MODIFICATION – removal of for perspective; replacement of Pluto with it)

(3‑1976 1) PLUTO is often regarded as the most distant of the nine known planets in our solar system.
(31‑5‑2016 80) He [Michael Brown] is best known for his discovery of Eris, the most massive object found in the solar system in 150 years, and the object which led to the debate and eventual demotion of Pluto from a real planet to a dwarf planet.
(SHARED CONTENT)
My judgements are of course open to dispute, but according to those judgements, of the 25 bonded pairs, 15 are deemed coherent with minor modifications, those modifications falling into three groups – the marking of contrast, a reflection of time change, and the supplying of a pronoun in place of full repetition (or vice versa) – and one is deemed coherent without modifications (64% in total). Three are deemed incoherent (12%). Between these two poles there are five pairs deemed to share content and one where there is so much shared content that they are close to being identical. These proportions suggest that intertextual bonding, as identified through a partly corpus‑linguistic, partly text‑linguistic investigation, has some of the same significance as intratextual bonding, just as Hasan’s chain interaction has some of the same relevance across texts as it has within them.
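The tally behind these proportions can be checked with a short Python sketch. The 25 labels below are transcribed, in order, from the judgements listed above; the label names and variable names are illustrative shorthand, and hedged judgements ("Arguably ...", "close to ...") are counted under the category they are nominally assigned to, which is an assumption consistent with the totals just given.

```python
from collections import Counter

# One label per bonded pair, in the order the pairs are listed.
# minor      = coherent with minor modifications
# coherent   = coherent without modification
# shared     = shared content
# identical  = (near-)identical content
# incoherent = judged incoherent
JUDGEMENTS = """
minor minor minor minor incoherent minor shared minor shared minor
minor minor minor incoherent shared minor minor incoherent shared minor
minor identical coherent minor shared
""".split()

counts = Counter(JUDGEMENTS)
total = len(JUDGEMENTS)


def percent(n):
    """Share of all bonded pairs, as a percentage."""
    return 100 * n / total


coherent_pct = percent(counts["minor"] + counts["coherent"])
incoherent_pct = percent(counts["incoherent"])

print(dict(counts))
print(f"coherent (with or without modification): {coherent_pct:.0f}%")
print(f"incoherent: {incoherent_pct:.0f}%")
```

Keeping the judgements as an explicit list makes it easy for a reader who disagrees with an individual judgement to change the label and see how far the proportions shift.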
    12. Some brief conclusions
    In this paper a number of positions have been arrived at.
    1. A concordance of a single text indicates that cohesion primes a reader/listener for collocation etc. at the same time as it sets up the conditions for coherence.
    2. The identification of bonds between sentences in a text by a reader or listener simultaneously primes the reader or listener with respect to the items that created the bonding.
    3. The identification of chains that interact within a text by a reader or listener simultaneously primes the reader or listener with respect to the items that contribute to the chains.
    4. The identification of collocations and semantic associations in a multi‑text corpus is at the same time the identification of chains that interact across the range of texts and that have a similar significance to those found within texts.
    5. The identification of collocations and semantic associations in a multi‑text corpus is at the same time the identification of bonds between sentences from different texts that have a similar significance to those found within texts.

To these I would add four further comments. The first is that the search for coherence and the process of priming are in fact one and the same. When we build up in our minds a record of the way a word behaves, we are at the same time building up in our minds an awareness of the potential cohesive links in the text. Conversely, when we identify the potential cohesive links in a text, we are at the same time being primed to use and recognise a word as a result of a succession of experiences of the word in a very specific textual context.

The second is that as we search for coherence and are primed for the collocations and semantic associations that create bonds and chains, the priming in question is not only linguistic but encyclopaedic, which ties in very well with Teubert’s argument (2007: 68) that the distinction between lexical and encyclopaedic meaning is untenable. Corpus linguistics is actually the study of the language of knowledge and a full account of the intertextual collocations, semantic associations and so on that contribute to the creation of the conditions for coherence in texts on a topic would be a study of the current state of knowledge of a topic.

The third is that some of the modifications that are necessary to create the conditions for coherence are reflections of change in knowledge, particularly changes in tense and time references. Where there is shared content but no coherence, there may be a crack in the priming (Hoey 2005) visible where the first sentence’s language has been replaced by some aspects of the second sentence’s language. The collocation of Pluto with planet becomes a nested collocation of dwarf with planet and of Pluto with dwarf planet. The referent of the nested combination ninth planet changes from Pluto to an as yet undetected giant planet at the far side of the Kuiper Belt.

Finally, as a sweeping statement, corpus linguistics has in the past been inclined to identify collocations etc. in corpora of some generality, though Stubbs (2016) and Partington and Duguid (2016) show how much can be gained by operating with tighter, narrower, more content‑specific corpora. The study offered here suggests that some collocations and semantic associations operate primarily in texts that are highly content‑specific. This is natural enough if encountering these collocations and semantic associations primes us with the language of that content, and thereby with the content itself. Our mental concordances are also our mental encyclopaedias.
References

Ahmad, K. & Benbrahim, M. 1995. Text summarisation: The role of lexical cohesion analysis. The New Review of Document and Text Management 1: 321–335.
Anderson, J.R. 1983. The Architecture of Cognition. Cambridge MA: Harvard University Press.
Bawcom, L. 2011. What’s in a Name? The Functions of Similonyms and their Lexical Priming for Frequency. PhD dissertation, University of
Beaugrande, R. & Dressler, W. 1981. Introduction to Text Linguistics. London: Longman.
Eckler, R. 1997. Making the Alphabet Dance: Recreational Wordplay. New York NY: St Martin’s Griffin/London: Macmillan.
Halliday, M.A.K. & Hasan, R. 1976. Cohesion in English. London: Longman.
Halliday, M.A.K. & Hasan, R. 1985. Language, Context and Text: A Socio-Semiotic Perspective. Geelong: Deakin University Press/Oxford: OUP.
Harley, T.A. 2001. The Psychology of Language: From Data to Theory, 2nd edn. Hove: The Psychology Press. doi: 10.4324/9780203345979
Hasan, R. 1984. Coherence and cohesive harmony. In Understanding Reading Comprehension, J. Flood (ed.), 181–219. Newark DE: International Reading Association.
Hoey, M. 1983. On the Surface of Discourse. London: George Allen & Unwin.
Hoey, M. 1991. Patterns of Lexis in Text. Oxford: OUP.
Hoey, M. 1993. A common signal in discourse: How the word ‘reason’ is used in texts. In Techniques of Description – Spoken and Written Discourse, J. Sinclair, M. Hoey, & G. Fox (eds), 67–82. London: Routledge. doi: 10.4324/9780203168097
Hoey, M. 1995. The lexical nature of intertextuality: A preliminary study. In Organization in Discourse: Proceedings from the Turku Conference [Anglicana Turkuensia 14], B. Wårvik, S.‑K. Tanskanen, & R. Hiltunen (eds), 73–94. Turku: University of Turku.
Hoey, M. 1998. From concordance to text structure: New uses for computer corpora. In PALC ‘97: Proceedings of the Practical Application of Linguistic Corpora Conference, B. Lewandowska‑Tomaszczyk & P.J. Melia (eds), 1–22. Lodz: Lodz University Press.
Hoey, M. 2004a. The textual priming of lexis. In Corpora and Language Learners [Studies in Corpus Linguistics 17], S. Bernardini, G. Aston, & D. Stewart (eds), 21–41. Amsterdam: John Benjamins. doi: 10.1075/scl.17.03hoe
Hoey, M. 2004b. Lexical priming and the properties of text. In Corpora and Discourse, A. Partington, J. Morley, & L. Haarman (eds). Bern: Peter Lang.
Hoey, M. 2005a. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630
Hoey, M. 2005b. Textuality, Intertextuality and the mental lexicon. In 14th International Symposium on English Teaching, Y.‑J. Chen & Y.‑N. Leung (eds), 40–54. Taipei: English Teachers’ Association, Republic of China.
Hoey, M. 2007a. Lexical priming and literary creativity. In Hoey et al., 7–29.
Hoey, M. 2007b. Grammatical creativity: A corpus perspective. In Hoey et al., 31–56.
    Hoey, M., Mahlberg, M., Stubbs, M. & Teubert, W. 2007. Text, Discourse and Corpora. London: Continuum.Joyce, J. 1939. Finnegans Wake. London: Faber & Faber.Mahlberg, M. 2012. Corpus Stylistics and Dickens’s Fiction. London: Routledge.Neely, J.H. 1977. Semantic priming and retrieval from lexical memory: Rules of inhibitionless spreading activation and limited capacity attention. Journal of Experimental Psychology: General 106: 226–54. doi: 10.1037/0096-3445.106.3.226Neely, J.H. 1991. Semantic priming effects in visual word recognition: A selective review of current findings and theories. In Basic Processes in Reading: Visual Word Recognition, D. Besner & G.W. Humphreys (eds), 264–336. Hillsdale NJ: Lawrence Erlbaum Associates.Pace‑Sigge, M. 2013. Lexical Priming in Spoken English Usage. Houndmills: Palgrave Macmillan.doi: 10.1057/9781137331908Partington, A. 1998. Patterns and Meanings: Using Corpora for English Language Research and Teaching [Studies in Corpus Linguistics 2]. Amsterdam: John Benjamins. doi: 10.1075/scl.2 Partington, A. & Duguid, A. 2016. Forced lexical priming: Cohesive harmony in adversarial political discourse. Paper given at Corpora and Discourse International Conference, Pon‑tignano (Siena), July.Scott, M. 2013. WordSmith Tools, Version 6. <>Shao, J. 2016 in preparation. Synonymy and Lexical Priming: A Cross‑Linguistic Investigation of Synonymy from Corpus and Psycholinguistic Perspectives. PhD dissertation, University of Liverpool.Sinclair, J.M. 1991. Corpus, Concordance, Collocation. Oxford: OUP.Sinclair, J.M. 2004. Trust the Text: Language, Corpora and Discourse. London: Routledge. Stubbs, M. 1996. Text and Corpus Analysis. Oxford: Blackwell.Stubbs, M. 2001. Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell. Stubbs, M. 2016. Two case studies of intertextuality: Illustrations from Umberto Eco and ArthurConan Doyle. 
Plenary paper given at Corpora and Discourse International Conference, Pontignano (Siena), July.Teubert, W. 2007. Parole‑linguistics and the diachronic dimension of the discourse. In Hoey et al., 57–87.
    Appendix

    The invisible influence of Planet X

    (1) Sixty years ago today Pluto was discovered, now the hunt is on for an elusive tenth planet. Simon Mitton reports.

    (2) The textbooks say there are nine planets in our solar system. (3) The most distant is Pluto, discovered on 18 February 1930 by astronomers at the Lowell Observatory, Arizona. (4) But some astronomers have continued to suspect there may be a tenth planet lurking even further away which has somehow escaped attention. (5) One, Robert Harrington, of the US Naval Observatory in Washington, has begun a new search for “Planet X”. (6) He is using similar techniques to those used by Clyde Tombaugh 60 years ago to discover Pluto. (7) The young astronomer had detected a sure sign of an object orbiting the Sun by comparing two photographs showing that a speck of light had shifted position against the stars. (8) For a number of years before Mr Tombaugh’s discovery, the existence of a ninth planet was suspected, because something large was affecting the orbital path of Uranus around the Sun. (9) The locating of Pluto was thought to explain the Uranus effect. (10) But Dr Harrington and other sceptics say Pluto is too small to explain the orbits of the planets in the outer regions of the solar system, such as Uranus and Neptune.

    (11) Most astronomers nowadays work on such exotic problems as the origin of our universe or the properties of black holes, but Dr Harrington is cast in a traditional mould. (12) As director of the astrometry section at the Naval Observatory he prefers the classical work of finding the positions of stars and planets with the greatest accuracy. (13) Dr Harrington is continuing a tradition of planet‑searching which began thousands of years ago, when ancient astronomers identified the five planets visible to the naked eye; Jupiter, for instance, is the brilliant object high up in the southern sky, and Venus can be easily seen in the east at sunrise.

    (14) In 1781, the British astronomer Sir William Herschel found a new planet, Uranus, setting off feverish planet‑hunting. (15) This led to the discovery of the asteroids, or minor planets, in greatest profusion between Mars and Jupiter. (16) Meanwhile, mathematicians had become involved in the great planet hunt. (17) They found they could not match the orbital path of Uranus around the Sun to that predicted from the laws of gravity; there must be another planet, or planets, pulling it off course. (18) Neptune, located in 1846, appeared to offer an incomplete solution. (19) This century, Percival Lowell, an American astronomer, who achieved notoriety for suggesting Mars had canals and life, urged a search for a further planet using wide‑angle cameras. (20) For 20 years, astronomers at the Lowell Observatory searched the skies. (21) Mr Tombaugh finally spotted his moving speck of light after a year at the job. (22) However, Pluto seemed surprisingly small and faint, and astronomers almost immediately suspected it was not massive enough to pull Uranus off‑course. (23) Mr Tombaugh himself had doubts. (24) Still a professional astronomer at 83, he said last week: “It was much fainter than we expected, and I carried on searching, just in case there was another one, for 14 years, until May 1943.” (25) More recent evidence confirms that Pluto is much too small to influence Uranus and Neptune’s orbits.

    (26) In 1978, James Christy at the US Naval Observatory accidentally found a moon, subsequently called Charon, orbiting Pluto. (27) Its motion showed that the mass of Pluto is a thousand times too small to influence the giant planets. (28) This has become the strongest evidence for the mysterious tenth planet. (29) David Dewhirst, of the Cambridge Institute of Astronomy, sees the current search as more promising. (30) “There are another 20 years of data for a start, and that helps. (31) But more significant perhaps are the great advances in computing. (32) The computer models used by the Jet Propulsion Laboratory in Pasadena can, for instance, handle much more complex calculations.”

    (33) At JPL, where the tracks of space probes through our solar system are computed with phenomenal accuracy, theorists find that recent observations of Uranus and Neptune do not fit computer predictions using a nine‑planet model. (34) The laboratory’s observers found that Uranus is drifting out of its predicted orbit by 1,000 miles a year. (35) “One possible explanation is an unseen planet,” Dr Harrington says. (36) Nevertheless, Mr Tombaugh remains sceptical.

    (37) “I did my searching very thoroughly and very slowly. (38) If it’s there, it should have shown on my plates. (39) However, I only covered two‑thirds of the sky and the weakest part of my search was in the south. (40) I think the case for Planet X is marginal; maybe it’s there, and maybe it isn’t. (41) Let’s see.”

    (42) Dr Harrington has now begun work on two fronts, running new computer calculations in Washington and making fresh observations in New Zealand. (43) “My computer strategy is to make model solar systems that include the nine known planets plus a guess at Planet X. (44) I then run lots of these 10‑planet simulations to give the smallest possible deviation of Uranus and Neptune from their observed positions. (45) Each time we do this, we predict a position for Planet X in 1990. (46) What we are finding is that the permitted positions for Planet X cluster in a small region in the sky.” (47) The inclusion of the irregularities in Neptune’s orbit is new, and that could be why computer models are showing a narrower search area. (48) Neptune’s true position is accurately known, following the Voyager 2 encounter in August 1989. (49) Dr Harrington says the most remarkable feature predicted for Planet X is that its orbit is tilted 30 degrees away from the ecliptic, the main plane of the solar system, where all previous searches have concentrated. (50) His models also predict a greater distance from the Sun, about 10 billion miles, or between two or three times as distant as Pluto.

    (51) In April the new sweep starts in earnest at the Black Birch Observatory in New Zealand.

    (52) A modest 8in telescope, similar to that used by Mr Tombaugh, will examine the northern part of the constellation Centaurus. (53) Pairs of photographs of the same region of sky taken on successive nights will be sent to Washington. (54) Using a blink comparator, a device that compares two photographs, Dr Harrington hopes to locate any faint object that has moved during the interval between the two pictures.

    (55) A serious problem is that the search area falls close to the Milky Way, and every plate will include millions of faint stars in our galaxy. (56) The planet, if it exists, must be picked out from this crowded background.

    (57) Dr Harrington says astronomers still do not understand the outer regions of our solar system. (58) He hopes Planet X will explain the mysterious “wobble” of Uranus and Neptune. (59) “I think we have a 50‑50 chance of showing that the anomalies are due to another planet orbiting 10 billion miles from the Sun.”

    A corpus‑based investigation into English representations of Turks and Ottomans in the early modern period

    Helen Baker, Tony McEnery & Andrew Hardie
    ESRC Centre for Corpus Approaches to Social Science, Lancaster University
    Lexical priming theory (Hoey 2005) works not just at any single moment in time. For Hoey (2005: 8) words are “primed for collocational use. A word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co‑texts in which it is encountered, and our knowledge of it includes the fact that it co‑occurs with certain other words in certain kinds of context”. So time is clearly one important context within which primings may be acquired; through exposure to word primings over time, words are imbued with meaning and a key feature of this process is collocation. Some suitable data sources are now becoming available, notably the Corpus of Historical American English (Davies 2012) and the EEBO corpus (for details see McEnery & Baker 2016). The existence of a corpus such as the EEBO corpus, which provides a billion words of English data for the seventeenth century alone, allows for the exploration of priming drift for many content words across the century. Yet the explanation for any drift in priming observed may relate to society as much as language per se. Accordingly, in this chapter, we will look with both a linguistic and historical lens at a group, the Ottomans, who, in the seventeenth century, we might assume to be subject to a change of representation in discourse.
    1. Introduction
      doi 10.1075/scl.79.02bak © 2017 John Benjamins Publishing Company

      Lexical priming theory (Hoey 2005) works not just at any single moment in time. For Hoey (2005: 8) words are “primed for collocational use. A word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co‑texts in which it is encountered, and our knowledge of it includes the fact that it co‑occurs with certain other words in certain kinds of context”. So time is clearly one important context within which primings may be acquired; through exposure to word primings over time, words are imbued with meaning and a key feature of this process is collocation. Collocates we will define briefly as words which consistently co‑occur with a word to a degree greater than chance would permit; collocates have a crucial role in determining what that word means in context through the process of priming as described by Hoey.1

      Yet given that language changes over time, it must follow that primings may change with time too, as acknowledged by Hoey (2005: 9), who introduced the concept of a drift in primings to admit this possibility. However, tracing how priming drift works has, until recently, been methodologically challenging. In order to look at primings at any point in time, one needs access to a suitably large corpus of data in which to observe primings. Given that language change will typically occur slowly, to be able to look at priming drift we should ideally have a corpus which allows us to look at large volumes of data over a long period of time. While corpora that might feasibly allow the exploration of frequent, typically grammatical, words in this way, such as the Helsinki corpus (Rissanen et al. 1993), have existed for some time, no data has been available to allow for the exploration of less frequent, typically content, words. Some suitable data sources are now becoming available, notably the Corpus of Historical American English (Davies 2012) and the EEBO corpus (for details see McEnery & Baker 2016). The existence of a corpus such as the EEBO corpus, which provides a billion words of English data for the seventeenth century alone, allows for the exploration of priming drift for many content words across the century. Yet the explanation for any drift in priming observed may relate to society as much as language per se.2 Accordingly, in this chapter, we will look with both a linguistic and historical lens at a group, the Ottomans, who, in the seventeenth century, we might assume to be subject to a change of representation in discourse. Consequently the words which are used to refer to the group may be subject to priming drift across the century. We explore these drifting primings by looking at collocations, the key product of lexical priming, following Hoey (2005: 7–9). But why look at the Ottomans?
      1. We will not labour the procedure used to derive collocates in this chapter. The curious reader is referred to McEnery and Hardie (2012: Chapter 6). For those with some experience of collocation, in this chapter collocates of Ottoman are calculated using a window of ± 5 words, frequency of at least 5, using the Log Ratio statistic; and collocates of Turk are calculated in the same way but with the difference that a frequency of at least 50 was required. We looked at plurals alongside singulars, capitalised and non-capitalised forms, and we took into account early modern variant spellings for both terms.
      2. The EEBO v3 corpus is not marked up for genre. However, from initial work done by the authors we estimate that there is a preponderance of religious material in the data – texts concerning religion appear to make up at least half of the corpus. This is unsurprising given the salience of religion in seventeenth-century life. Readers should be aware that this focus on religion pervades not only the corpus, but findings based on it also.
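      The collocation settings given in footnote 1 (a window of ± 5 words, a minimum frequency, and the Log Ratio statistic) can be sketched roughly as follows. This is an illustrative sketch only, not the CQPweb procedure used for the chapter. The function name `log_ratio_collocates`, the flat list of tokens, and the substitution of 0.5 for a zero count outside the window are all assumptions. Log Ratio is taken here in one common formulation: the binary logarithm of the ratio between a word's relative frequency inside the window and its relative frequency in the rest of the corpus. The handling of plurals, capitalisation and variant spellings mentioned in the footnote is omitted.

```python
import math
from collections import Counter


def log_ratio_collocates(tokens, node, window=5, min_freq=5):
    """Return {collocate: Log Ratio} for words within +/- `window` of `node`.

    Hypothetical helper for illustration; not the CQPweb implementation.
    """
    node_positions = [i for i, tok in enumerate(tokens) if tok == node]

    # Every corpus position lying inside at least one window, counted once.
    window_positions = set()
    for i in node_positions:
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        window_positions.update(range(lo, hi))
    # The node word itself is not treated as its own collocate here.
    window_positions -= set(node_positions)

    n_inside = len(window_positions)
    n_outside = len(tokens) - n_inside - len(node_positions)
    if n_inside == 0 or n_outside <= 0:
        return {}

    inside = Counter(tokens[p] for p in window_positions)
    overall = Counter(tokens)

    scores = {}
    for word, f_in in inside.items():
        if f_in < min_freq:
            continue  # below the frequency threshold (5 for Ottoman, 50 for Turk)
        f_out = overall[word] - f_in
        if f_out == 0:
            f_out = 0.5  # assumption: avoid taking the log of a zero frequency
        scores[word] = math.log2((f_in / n_inside) / (f_out / n_outside))
    return scores
```

      Under these assumptions, a call such as `log_ratio_collocates(tokens, "ottoman", window=5, min_freq=5)` would return a score for each word that clears the threshold; a positive score means the word is relatively more frequent near the node than elsewhere.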
    2. England and the Ottomans
      England and the Ottoman Empire were poles apart in the early modern period – they shared no common borders, did not speak the same language and practised different religions. England, an island‑nation with a population that was both ethnically and linguistically homogeneous, was relatively isolated even from its closest neighbours in Europe; but it experienced a unique and significant relationship with the Ottoman Empire. During the reign of Elizabeth I, English subjects were permitted, for the first time, to engage in trading relations with Ottoman Turks, and the Queen herself acquired Turkish clothing and corresponded with members of the Sultan’s family.3 In 1581 the Levant Company, which regulated English trade with Turkey, was formed, and a series of magnificently attired Turkish ambassadors appeared in England over the course of the century.4 Perhaps equally conducive to the establishment of friendly relations was the fact that both states shared a common enemy in Catholic Spain.

      Islam was the first non‑Christian civilisation with which English citizens came into contact, and it has been argued that interactions with the Ottoman Empire permanently left their mark on the country, helping to formulate English identity.5 The association between the Ottoman Empire and England was referenced by early modern English writers, who, by the beginning of the seventeenth century, were displaying a heightened interest in Turkish people. In 1603, the first edition of Richard Knolles’ popular General historie of the Turkes was printed, and this was followed by the publication of numerous travelogues. An array of Muslim characters, many of them Turkish, started to appear in London plays. Yet the emerging political accord between England and the Empire did not, to any real extent, counter the fear and intolerance Elizabethans felt towards Turkish people. At the beginning of the seventeenth century, James I – who had in his youth written a poem, Lepanto, revelling in a Turkish naval defeat – signed a treaty with Spain which signalled a souring of official relations with the Ottoman Empire. However, despite James’ personal feelings of animosity towards the Ottoman Turks, it was during his reign that commercial interactions between English and Turkish merchants reached unprecedented levels.

      The historiography of early modern England tends to concentrate upon the domestic turbulence of the period or, in terms of foreign policy, England’s
      3. Matar (1997: 42–43); Matar (1998: 34); Matar (1999: 123); Hutchings (2006: 2); and Beck (1987: 32).
      4. Matar (1999: 32–33).
      5. Matar (1998: 14, 184) and Kugler (2012).
        relations with Europe. However, there is a growing body of literature which charts the Anglo‑Ottoman relationship. A humanities researcher, Matar (1999) has explored English attitudes towards North Africans and responses to Islam in the late sixteenth and seventeenth centuries. Other scholars have written about Anglo‑Turkish commercial relations and the growing numbers of English travellers visiting the Ottoman Empire in the early modern period.6 There is also a wealth of scholarship concerning the characterisation of Turks in early modern English plays and pageants.7

        In this chapter we examine the textual portrayal of the Ottoman Empire and Ottoman Turks by early modern English writers. In order to amass and utilise all mentions of the Ottoman Turks, we have undertaken a corpus‑based analysis to explore a billion words of writing from the seventeenth century, drawn from the transcribed version of Early English Books Online (EEBO) being constructed by the Text Creation Partnership.8 The subset of EEBO‑TCP that we use currently includes over 39,212 texts from the seventeenth century, amounting to just under one billion words.9 By analysing this collection of material we wish to answer the following questions and interpret them in their historical context:
        1. What kind of language was used by early modern writers to describe the Ottoman Empire and identify Ottoman Turks?
        2. What drift, if any, took place in terms of the primings of words referring to the Ottomans and how did this interact with contact with the Ottomans as the decades progressed?
        3. What can this exploration of primings tell us about England’s cross‑cultural relationship with the Ottoman Empire and, more widely, its perceptions of Muslim people in general in the early modern period?
        We concentrated on printed material from the seventeenth century and selected the terms Turk and Ottoman for analysis. To facilitate meaningful exploration of word primings in such a large collection of texts, we relied upon collocation, as noted. Using the corpus analysis system CQPweb,10 we were able to easily navigate between collocations and texts for close reading, thereby combining extensive
      6. See Wood (1964) and Maclean (2004).
      7. See, for instance, Vitkus (1997); Bergeron (2010); Hutchings (2006); Jowitt (2002); and McJannet (2006).
      8. See
      9. The precise figure is 996,472,953 words.
      10. See Hardie (2012).
        quantitative computer analyses of word primings with a qualitative investigation. Both forms of analysis of the data were integrated with an exploration and understanding of current historical research on the Ottomans and the Turks in that century. This makes our study a fusion of linguistic and historical research. We believe lexical priming theory can be helpful to both. For historians, drifts in primings show shifts in meanings and perceptions and provide an approach broadly compatible with conceptual historical analysis.11 For linguists, the historical analysis of these changes has the potential to provide a socio‑cultural explanation of the forces brought to bear on language which may cause a drift in word primings.
    3. Defining Ottoman and Turk
      One of the initial questions we wished to answer concerned the nature of terminology – did English writers use the words Ottoman and Turk as synonyms, or could we tease apart their meanings, i.e. are they near synonyms? Historians, based on close reading of relevant historical material, have made a number of claims regarding the meaning of the words. Some tell us that the term Ottoman was derived from the Turkish Osmanli, meaning followers of Osman. Osman I or Osman Gazi was the leader of a small principality in Anatolia, one of many, which declared its independence from the Seljuk Sultanate of Rum at the very end of the thirteenth century and which would go on to form the Ottoman Empire. There have been discussions of the term Ottoman in recent historiography. Beck (1987: 1) argues that the name Ottoman was associated with Osman’s dynastic heirs and the military and ruling class that served the dynasty. However, Ingram (2015: 14) explains that although Ottomans tended to be a term the ruling elite reserved for themselves, westerners in sixteenth‑ and seventeenth‑century England regarded the terms Ottoman and Turk as synonymous.12 The origins of the term Turk, meanwhile, are more uncertain. The Oxford English Dictionary suggests: ‘A national name of unknown origin. Possibly the same as the Chinese equivalent Tu-kin, applied to a division of the Hiong‑nu (identified by Deguigne with the Huns), who occupied the country south of the Altaian mountains c177 b.c.’13 Beck (1987: 29) has commented that Turk was used as both a name of origin and a typology, for example, of horses, swords and sabres.
      11. Koselleck (2002).
      12. The authors would like to thank Anders Ingram for kindly allowing us access to the introduction of his book before its publication.
      13. “Turk, n.1.” OED Online. Oxford University Press, June 2016. Web. 28 June 2016.
        So what does the corpus reveal about these words? Even a cursory glance at the collocates for both Ottoman and Turk reveals significant differences between the primings of the two words, in particular with regard to what we term consistent collocates – collocates which are present for at least seven decades of a century,14 giving a strong indication that the meaning they denote, and hence the priming of that word, may be in a stable relationship with the word under investigation. The highest‑ranking consistent collocates of Ottoman – race, house and family – all support the thesis that Westerners frequently used Ottoman to refer to members of the Turkish ruling dynasty, rather than simply to the Turkish people in general. Moreover, none of these terms collocates with Turk in any decade of the seventeenth century. By contrast, race collocates with Ottoman in every decade of the seventeenth century apart from the 1620s and 1630s, appearing, for instance, in a translation of a history by René de Lucinge (1606), which argued the advantages to Europe “if the great Turk should die without heirs of the true line and race of Ottoman”. Similarly, a history of the Turks published in 1683 noted of the Sultan Ibrahim (r.1640–1648) that he was the “only Male survivor of the Ottoman Race”, meaning that he was the only living male with a direct claim to the imperial throne.15 The collocate family appears in descriptions of the Ottoman ruling dynasty: its lineage, its role in establishing the empire itself, and the long‑term threat it represented to Christendom. Similarly, the collocate house also appears within references to the imperial rulers: Henry Neville (1681) unambiguously defined the House of Ottoman as “the present Royal Family”. The use of the term Ottoman by some western contemporaries suggested that these writers believed that Turkish people themselves also reserved it for those belonging to the imperial household. For instance, Sir Thomas Higgons (1684) wrote in his biography of Isuf Bassa that, after the deposition of a Mufti by Ibrahim, the people declared “that God would never bless the Ottoman house, as long as the chief Minister of his great Prophet was thus used”.

        What do the collocates of Turk reveal about seventeenth‑century usage of the term? Saracen is a collocate of Turk in material for the 1640s, 1650s and 1680s but it does not collocate with Ottoman in any decade of the seventeenth century. Twenty‑first century Britons may only be familiar with the term Saracen in terms of its inclusion in the name of a number of British public houses, and perhaps as an archaic pejorative ethnic slur, but what did it mean to the early moderners who used it? Ameer Ali (2010: 4) believes the term Saraceni was
      14. See McEnery and Baker (2016) for a fuller discussion of types of collocation through time.
      15. I.S. (1683).
        originally used by Greeks and Romans to refer to Arabs, particularly those who lived in the desert west of the Euphrates.16 Tolan (2002: 4) has argued that the term Saracen was an ethnic term which, along with the words Arab, Turk and Moor, was used generically to refer to Muslims. Beck (1987: 29), meanwhile, tells us that the terms Saracen and Turk were used almost interchangeably in the early modern period.17

        Let us switch to considering the corpus evidence. When Saracen appears as a collocate of Turk, the relevant concordances reveal writers frequently linking both words together with the conjunction ‘and’, which suggests that these terms, though closely associated with one another, were not synonyms, i.e. their primings differ to the extent that coordinating them is meaningful: it adds rather than simply reduplicates information. In 1674, for instance, William Sherwin (1674) referred to “those two Wo‑Trumpets of Mahumetans, the Saracen’s and Turks”. Early modern English writers sometimes provide useful information regarding their understanding of the language they use – the simplest example of this would be a straightforward definition of a particular term – but Saracen, which appears almost 5000 times in the EEBO corpus, goes largely unexplained by the writers who use it, with no metalinguistic remarks about it in the corpus. An analysis of the collocates of Saracen itself does provide clues, however. A number of collocates, such as Mauvia and Mavia, which both relate to a fourth‑century Arabian warrior queen, indicate that some authors used the term to refer to a historical ethnic group of people who were believed to be ancestors of the Turks. However, other collocates suggest that the term was frequently used by early modern writers as a catch‑all categorisation for Muslims. The collocates Agarens, Hagarens, Hagar, Ismaelites, Sara and Sarah, for instance, all appear in texts which reference the belief that Muslims descended not from Abraham’s wife, Sarah, but from his bondswoman, Hagar.18 Samuel Purchas (1625) informed his readers that: “Those Bulgarians are most wicked Saracen’s, more earnestly professing the damnable Religion of Mohamed”. So while Saracens may indicate people of the same religion as the Turks, the two terms are not simple synonyms.
      16. The first attested use of the term was in Ptolemy’s ‘Geography’ in reference to people living in a region named Sarakene in the northern Sinai peninsula. Millar (1993: 140, 221) has written that classical writers tended to use the term to refer to unsettled people rather than those associated with a particular language or culture.
      17. Ingram (2015: 14), however, argues that the term Saracen, along with Turk and Moor, was used ‘in specific and differentiated ways’ by English writers from the late sixteenth century onwards.
      18. See, for example, Willet (1633).
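      The notion of a consistent collocate used in this section (a collocate present for at least seven decades of a century) lends itself to a short sketch. The code below is a minimal illustration, not the authors' procedure. The function name `consistent_collocates` and the dictionary layout are hypothetical, and the toy data encode only the two distributions reported above: race collocating with Ottoman in every decade except the 1620s and 1630s, and Saracen collocating with Turk only in the 1640s, 1650s and 1680s.

```python
from collections import Counter


def consistent_collocates(collocates_by_decade, min_decades=7):
    """Return the collocates attested in at least `min_decades` decades.

    `collocates_by_decade` maps a decade (e.g. 1600) to the collocates
    found for the node word in that decade. Hypothetical layout.
    """
    decades_attested = Counter()
    for collocates in collocates_by_decade.values():
        # set() ensures a collocate counts once per decade at most
        decades_attested.update(set(collocates))
    return {word for word, n in decades_attested.items() if n >= min_decades}


decades = range(1600, 1700, 10)

# Toy data reflecting only the distributions reported in the text.
ottoman = {d: ({"race"} if d not in (1620, 1630) else set()) for d in decades}
turk = {d: ({"saracen"} if d in (1640, 1650, 1680) else set()) for d in decades}

print(consistent_collocates(ottoman))  # {'race'}
print(consistent_collocates(turk))     # set()
```

      On this definition race, attested in eight decades, counts as a consistent collocate of Ottoman, while Saracen, attested in three, does not count as a consistent collocate of Turk.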
    4. Receptive and productive primings
      At this point, however, to better understand the data we need to use two further concepts introduced by Hoey – receptive primings and productive primings. Receptive primings (Hoey 2005: 11) “occur when a word or word sequence is encountered in contexts in which there is no probability, or even possibility, of our ever being an active participant”, while productive primings “occur when a word or word sequence is repeatedly encountered in discourses or genres in which we are ourselves expected (or aspire) to participate and when the speakers or writers are those we wish to emulate” (Hoey 2005: 11). We may characterise receptive primings as passive – I will not produce those primings myself, but I have knowledge of them and can understand text or speech which relies upon them. On the other hand, productive primings will tend to be active – these are the primings that influence my own speech and writing. Moreover, while receptive primings encode collocations that a reader/listener expects certain words to have, productive primings are those which speakers/writers use on occasion for a specific audience. There is an element of contextualization in the selection of collocations for production. They are produced in a context that is fitting for the speaker personally (they are primings which are active for the speaker), contextually (we may assume that certain contexts encourage certain productive primings, e.g. the priming of a word such as bank will be different in a financial institution compared with next to a river) and socially (for example, speakers wishing to assert membership of a group may select productive primings which reflect such a membership).

      This distinction is important here because it may be the case that there are receptive primings of Saracen that are not evident in the corpus data: primings which all or many writers at the time held as receptive primings, but which were not productive and hence not clearly evidenced in writing. To give a modern example, speakers of British English may be assumed to know the primings of racist terms, but for most these will remain receptive. As a consequence, if we look at the written record we may find scant evidence of those primings being productive, but it does not imply that these primings do not exist as receptive primings. Yet how do we explore what is known but not said? Metalinguistic comments would help – but these are lacking in the corpus. In the absence of relevant metalinguistic comments, can another resource which presents metalinguistic knowledge, the lexicography of the early modern period, provide a guide to the receptive primings of Saracen? The website Lexicons of Early Modern English (LEME)19 is a
      19.
        searchable historical database of dictionaries, encyclopaedias and glossaries which provides a valuable guide to how early moderners understood their own language. LEME directs us to Edmund Bohun who, in his Geographical Dictionary of 1693, revealed that although the term was also the subject of ambiguity in early modern times, it can more reasonably be viewed as a near synonym for Muslim rather than as a synonym of Turk:

        Some, deriving the original of this people from Hagar and Ismael, call them Hagarenes and Ismaelites. Others make them to be descended from Cham; and that they were the Inhabitants of the ancient Saraca in Arabia, (mentioned by Ptolemy;) and of the Country whereof that City was the Capital. It is certain, they were an Arabian people: and withal, that their Name in Arabick signifies Robbers, according to the common practice of their lives; which they first began to discover in the fifth Century. Attaining in the course of time to such an universal puissance, as to over-run Syria, Persia, Palestine, Egypt; part of Sicily, Italy, France, and most of the Islands of the Mediterranean, under Kings of their own; and to withstand the united Forces of Christendom in the eleventh and twelfth Ages: till the Turks, the Caliphs of Egypt, and the Sophyes of Persia, breaking severally into their Estates; the very name of Saracen became abolished, only as it is sometimes now applied to Mahometans; because the Saracens were Mahometans.

        So the evidence from the dictionaries of the time does not support the hypothesis that there is some hidden receptive priming for Saracen that establishes it as a synonym of Turk. Instead what we find is more a confirmation of what the corpus data suggested.
    5. Turk as a religious identity
      To complicate matters further, and to perhaps explain where the impression of synonymy arose, in addition to Saracen being used by writers to refer to Muslims, there is overwhelming evidence from our corpus analysis that the term Turks was also used to refer to followers of Islam, making Turk and Saracen near, but not true, synonyms. Two consistent collocates of Turk, which, again, are not present as collocates of Ottoman, are Jew and Jews. The recurrent phrases “Turks and Jews” and “Jews and Turks” are a feature of our seventeenth-century corpus and demonstrate how early modern writers could present Turk as a religious rather than an ethnic identity. Consider, for instance, this quotation from John Meredith (1624): “Sheweth that God requires truth in Religion, which must be squared to the Rule of his Word; and therefore Jews, Turks, and Papists, whose Religion is false, because contrary to the Scriptures, cannot be saved if they persist in their obstinacy; and that Papists are
      but Pseudo-Christians.”20 Meredith presented Jews, Turks and Papists as members of a group of non-Christian others and allotted them an equivalent negative status. Vitkus (1997: 161) has written that the term Turk was used to denote “a generalised Islamic Other” and that Protestant authors rarely distinguished the ethnicity of Muslims from different regions.21 Furthermore, Protestant writers demonstrated a misunderstanding of central Islamic belief. Despite most – but not all – Turks being practising Muslims and, therefore, monotheists sharing the same God as the other Abrahamic religions, another consistent and high-ranking collocate of Turk is Pagans. Although two dictionaries of 1616 and 1623 simply define pagan as “an Heathen, an Infidell”,22 within the seventeenth-century EEBO corpus, the second highest collocate of pagan is Polytheists. The vast majority of the relevant concordances link Turk together with Pagans by means of a conjunction, or present Pagans and Turk in a coordinated list of groups who have rejected Christianity. Although this suggests that the primings of early modern English writers associated Turks with paganism, it also indicates that they had different primings for the two terms, just as the primings of Turks and Saracens were similar yet different. These primings are general, however – occasionally an author clearly voiced a belief that the Turks were pagans. For instance, William Pemble (1627) declared in a lecture: “Compare we advisedly our condition at this present with that of Turks or other Pagans: and let us magnify his mercy, that hath by Grace made so great a difference between us and them, who by Nature were all alike”. 
Hence some degree of individual variation in these primings may have been possible, though the overall picture makes these variations the exception rather than the norm.

      The collocate Infidels, also a consistent collocate of Turk throughout the seventeenth century, is utilised in a similar manner.23 The discourse around Infidels is similar to that of Pagans: writers tell us that these people, as non-Christians, are doomed. William Cowper (1608), the bishop of Galloway, warned in a treatise: “Turks and Pagans shall not escape uvnpunisht, but thou that abuse thy soul and body to the service of Satan, which by Baptism were separate & consecrate to the Lord, committest a double sacrilege, and therefore must look for a double judgement except in time thou repent.” However, another discourse, albeit weaker, is centred around the belief that Turks had a greater chance of being granted divine
      20. Note that in this chapter we use spellings from the original documents as rendered in the corpus, i.e. the spellings in the examples cited are prone to variation.
      21. This view is supported by Matar (1998, p. 21, n. 2) who has written that Turk and Muslim were used interchangeably by seventeenth-century English writers.
      22. See LEME.
      23. Infidels is a collocate of Ottoman for material from the 1690s only.
        leniency in the event of a second coming than did Christian heretics: Robert Rollick (1603) preached, “Well, Pagans and Turks shall find greater case and less judgment at the coming of the Lord Jesus, then false Christians”.

        In the 1640s, Turk attracts a new collocate, Heathens, which then collocates consistently with Turk in every remaining decade of the seventeenth century. We believe that the presence of a new collocate such as this gives a strong indication that a word’s priming is drifting. Heathens operates in a similar way to Pagans and Infidels, often appearing in a list of groups that are perceived to be opposed to true Christians – but note that its appearance in such a list is an implication of, at best, near synonymy, not true synonymy. The word heathen thus represents a priming drift, not simply a reinforcement of an existing priming, which the attraction of a true synonym of an existing collocate may imply. In this case, we do find some metalinguistic comment of note in the corpus that is helpful – one author, Robert Maton (1646), confronts his contemporaries’ differentiation between heathens and Turks with irritability: “And it is observable too, that you make a difference betwixt Turks and heathens, as if Turks were not heathens.” It is clear that for Maton this new priming was receptive rather than productive: he understood what the other authors meant, but he did not want to follow their linguistic practice in this regard. But why does the priming of Heathens shift to attract Turk from the 1640s onwards? If we look at a distribution of the word Heathens in the EEBO corpus, we can see that it undergoes a massive increase in usage from the 1640s. In the 1630s, it appears with a frequency of 51.91 per million words; in the 1640s this increases to 83.30 and, a decade later, it decreases only slightly, to 80.96 mentions per million words. The explanation for this change lies in the society that produced the corpus. 
This change in frequency corresponds to a time of immense religious upheaval in English society. The English Civil War was not simply a conflict of political and class interests: most historians now accept that the conflict was reflective of a wider religious struggle within society, one not simply restricted to Protestants versus Roman Catholics. In 1646 episcopacy in the Church of England was abolished and a Presbyterian system was established. However, the subsequent overturning of the Act of Uniformity of 1558 meant that separatists, including groups such as Quakers and Fifth Monarchists, were for the first time able to pursue different faiths less secretly. Many Presbyterians reacted to such sects with intolerance throughout the years of the Commonwealth and, therefore, viewed the Restoration of the monarchy in 1660 with approval. The term Heathens was invoked in treatises which condemned Nonconformists, Anglicans and Protestants, often as part of an assertion that “Turks and Heathens” posed less of a threat than rival branches of Christianity. James Ussher (1645), the Primate of All Ireland, differentiated between the unconcealed enemies of the church and those professing friendship:
        Who are the open enemies? Heathens, Jews, Turks, and all that make profession of profaneness by sitting down in the seat of scorners. What enemies are they that make show of friendship? Such are all those, Of the general Apostasy. that bearing the name of Christians do obstinately deny the faith whereby we are joined unto Christ, which are called Heretics; or that break the bond of charity, whereby we are tied in communion one to another, which are termed Schismatics, or else add tyranny to schism and heresy, as that great Antichrist, head of the general apostasy, which the Scriptures forewarned by name.

        Interestingly, Papists collocates with Turk only in the 1640s and 1650s, adding further to the impression that the Civil War was about more than a simple political or class struggle, as discussed.

        Matar (1998: 103–106) has claimed that there were increased references to Alcoran, Mahometans and Turks as the Civil Wars loomed, arguing that seventeenth-century “attention to Islam was proportionate to religious anxiety in society”. He writes that many of these allusions to Islam were antagonistic, with writers such as John Milton highlighting analogies between Christian opponents and the Turks. However, some early modern authors expressed muted admiration for the Turko-Islamic model, whether it be for its perceived ability to effectively crush political sedition or its relative tolerance of other religions. The terms Mahometan24 and Alcoran, though neither is a collocate of Turk in any decade of the seventeenth century, do show notable increases in usage, respectively, in the 1650s and between the 1640s and 1650s, with a second larger spike occurring in the last two decades of the century for both terms. This larger peak at the end of the century probably reflects heightened English interest in the Ottoman Empire as a result of the second siege of Vienna of 1683. 
The ensuing warfare lasted until 1698 and ultimately saw the Ottoman Empire lose almost all of its territory in Hungary at the hands of the combined forces of the Holy Roman Empire of the German Nation and the Polish-Lithuanian Commonwealth.

      This historical background shows clearly why a drift in priming in the corpus is a reflection of wider cultural and historical processes. As the religious and political
      24. We also looked at other terms pertaining to Muslim people, such as Mahometist which peaks in the 1590s, and Mussleman and Musselman which peak at the end of the seventeenth century. The word Muslim is rarely used until the 1690s, when it appears 183 times in EEBO. There is only one mention of Muslem and none of Moslem. Mahomed collocates with Turk in the first decade of the seventeenth century, the 1650s and the 1680s. These references appear in descriptions of Turks adhering to Islamic faith and, to a lesser extent, in historical accounts of the Sultan Mehmed II, who led the Ottoman conquest of Constantinople in 1453. Very occasionally, an English writer presented Turk and Mohamed as synonyms. For instance, Nathanael Homes (1653) wrote of ‘the Turk (Mohamed)’. Mohamed also collocates with Ottoman in material from the 1690s.
        context within which English was used changed, some words, such as Heathens and Turks, drifted as a result because those terms were being used in discourse for particular ends. The next section explores this interface between primings, discourse and society.
    6. Perceptions of Ottoman expansionism
      Let us look in a little more depth at a major event discussed in the historical background, the second siege of Vienna. If we consider the frequency per million words of the terms Turk and Ottoman in the EEBO corpus illustrated by Figure 1, we can see that interest in the siege of Vienna is apparent. A larger peak between the 1590s and 1600s can also be explained in terms of a major Habsburg-Ottoman conflict, this time the Thirteen Years’ War of 1593–1606.
      [Figure 1 (line chart): x-axis, decade from 1570–79 to 1690–99; y-axis, frequency per million words (0–200); series: Ottoman, Turk]
      Figure 1. Frequency per million words of the terms Ottoman and Turk throughout the seventeenth century in EEBO (v3)
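
      The normalisation behind a chart such as Figure 1 can be sketched as follows. This is a minimal illustration, not the procedure actually used on EEBO: the function names `decade_label` and `per_million_by_decade`, the input layout (one tuple of year, hits and token count per text) and the toy numbers are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def decade_label(year: int) -> str:
    """Map a publication year to a decade label such as '1640-49'."""
    start = year - year % 10
    return f"{start}-{str(start + 9)[-2:]}"


def per_million_by_decade(
    texts: Iterable[Tuple[int, int, int]],
) -> Dict[str, float]:
    """Pool (year, hits_for_term, total_tokens) records by decade.

    Returns the term's frequency per million tokens in each decade.
    """
    hits: Counter = Counter()
    tokens: Counter = Counter()
    for year, n_hits, n_tokens in texts:
        decade = decade_label(year)
        hits[decade] += n_hits
        tokens[decade] += n_tokens
    # Skip decades with no tokens to avoid dividing by zero.
    return {
        d: hits[d] * 1_000_000 / tokens[d]
        for d in sorted(tokens)
        if tokens[d] > 0
    }


# Toy input, invented purely for illustration (these are not EEBO counts).
toy_texts = [
    (1641, 3, 40_000),
    (1646, 5, 60_000),
    (1683, 12, 100_000),
]
rates = per_million_by_decade(toy_texts)
```

      Pooling hits and tokens within a decade before dividing, rather than averaging per-text rates, keeps very short texts from dominating the decade figure; whether that matches the corpus tool used for Figure 1 is not stated in the chapter.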
      The low number of new collocates of Turk in addition to the high proportion of consistent collocates suggests that the priming of the term was fairly stable throughout the seventeenth century and not subject to sustained drift. However, Turk does have a large number of collocates which are highly transient in nature, appearing for one decade only and then disappearing, implying that while the priming of the word is relatively stable, it is not monolithic and temporary drifts in priming may occur, as Hoey hypothesised, calling these transitory primings (Hoey 2005: 12). Moreover, a significant number of these transient collocates appear in material for
      the 1600s, 1650s and/or 1680s only. Examples of collocates of Turk which appear only in material for the first decade of the seventeenth century and the 1650s include Bassa, forces, enemy, city, fleet and Galleys. Collocates appearing only in both the 1650s and 1680s include army, battle, league, thousand, Constantinople and siege. Empire and hundred, meanwhile, collocate with Turk in material for both the first and penultimate decade of the century. The nature of these collocates gives the most compelling explanation of their couplings in decades of the century. These are words that relate to warfare and Ottoman expansionism – in this context, even hundred and thousand take on a military flavour because their concordances relate to numbers of Turkish soldiers.

      In terms of priming theory, we can then say that primings may be productive when discourse, and world events, give cause for a priming to be used. When those events do not give cause for these productive primings to be expressed, the primings are still present and may act as receptive primings. So events in society may change the status of primings – the priming of warfare is associated with the Ottomans and can become productive when they are engaged in warfare, or appear to threaten warfare in some way, with the society within which our corpus was produced. When they are not engaged in such warfare, we may hypothesise that the priming is still present and that if an early moderner were to read about warfare involving the Ottomans then the receptive priming would mean that such an association would be felicitous. 
This explanation seems more satisfactory than one which presupposes that a priming is acquired when the Ottomans are engaged in warfare, discarded when peace breaks out, and then re-acquired when war starts again.

      Yet before we jump to the conclusions that (1) these decades share military collocates purely as a result of reflecting English interest in Ottoman expansionist activity, or (2) perhaps as a response to domestic political upheaval and that (3) this in turn switches the status of primings from receptive to productive, it is wise to examine critically the textual evidence on which the collocations are based. In particular, we need to consider the dispersion of these collocates. When we do so, we find that many of these collocates originate from the same texts, namely Richard Knolles’ The generall historie of the Turkes of 1603; Alexander Ross’ The History of the World which was printed in 1652; Paul Rycault’s The history of the Turkish empire of 1680, and two lesser known anonymous works entitled, respectively, The History of the Turks (I.S. 1683) and The history of the Venetian conquests (J.M. 1689). The high frequency of fleeting first-decade, 1650s, and 1680s collocates therefore is influenced to a significant extent by the coincidental publication of a number of volumes chronicling Turkish history – our view of primings may be unduly influenced by those of a few authors if we do not critically inspect the collocates. Of course, one might argue that these volumes appeared in print as a
      response to growing public interest in the Ottoman Empire, so that their responsibility for the appearance of collocates is a reflective rather than a distorting feature of the corpus analysis. Whether or not one accepts this theory, it is clearly crucial for a corpus analyst to be aware of the dispersion of collocates amongst the population of writers studied so that they may temper, if needs be, any claims about how widespread productive primings are in the corpus studied.

      In this case, we believe that when we consider the context in which these primings are drifting, the drift in priming may plausibly be claimed to be real. Nabil Matar (1998: 115) has described how the period of the late 1650s and 1660s was a period of military reinvigoration for the Turks. In this time, the Empire strengthened its hold over Transylvania and Moldavia and finally took possession of Crete after a twenty-year siege. Britons felt an unpleasant sense of national powerlessness, being sorely aware they possessed no means of preventing Ottoman expansion. Indeed, there is evidence that as early as the sixteenth century, people living in England felt directly threatened by Ottoman military activity. A series of common prayers for delivery from Turkish attack formulated by English ecclesiasts both reflected and propagated this public fear. After the defeat of the Ottomans at the siege of Malta in 1565, the archbishop of Canterbury ordered a prayer to be read out three times per week throughout the entire country which contained the hope that “if the Infidels… should prevail… all the rest of Christendom should lie as it were naked and open to the incursions and invasions of the said savage and most cruel enemies the Turks, to the most dreadful danger of whole Christendom…”25 This anxiety cannot simply be dismissed as national paranoia. 
From the sixteenth century onwards, the numbers of people seized by Turks during piratical raids on English and Irish coastal towns were increasing. Moreover, English indentured servants and immigrants travelling to the North American colonies were also vulnerable to attack by Turkish privateering.26 Bergeron (2010: 273) has written that during the reign of James I, a number of pageants organised in honour of various members of the royal family provided the English with a comforting means of confronting – and defeating – the Turks in battle without any of the accompanying risk.

      This perception of increased risk may also be apparent when we explore the collocates of Ottoman, which increase in number and diversity in the second half of the seventeenth century, overwhelmingly as a result of the appearance of transient collocates which wink into existence for a decade or two, producing transitory primings, and then disappear again. For example, the collocates subdued, vast
      25. Quoted in Vitkus (1997: 148).
      26. Matar (2009: 222).
        and power appear in the 1650s only; Moon, provinces, magnificence27 and grand collocate with Ottoman in the 1670s; Sultan, Bassa, Hungary and troops feature in the 1680s; and Invincible, Limits, Infidels, Ambassador, Mohamed, Policy and Paris appear as collocates in last-decade texts. Yet there are also collocates which indicate a drift – army and forces are present as collocates consistently from the 1670s onwards. Many of these terms are similar in character to the collocates for Turk discussed above which appear in two decades’ worth of material only: they relate to warfare and expansionism. However, there are some interesting additions.

        Policy appears as a collocate of Ottoman in texts of the 1690s during a period of warfare which followed the Battle of Vienna in 1683. This is unlikely to be a receptive priming becoming productive. While we may admit the possibility that at any point in time a receptive priming may be silent, i.e. we see no textual evidence that it is productive, in this case we would be looking at a priming which is solely receptive for a very long period of time. In the corpus data for the preceding 90 years, policy is not found as a collocate of Ottoman at all. In this context we would need to explain how the word retained its status as a receptive priming, given that, even if it was a receptive priming for speakers in 1600, there seems to be no use of it productively that might explain how it was communicated to future generations of speakers and hence survived to the end of the century. We believe that in this case we clearly have no evidence that policy had ever been a receptive priming of Ottoman. There seems to be the very real possibility here that the second siege of Vienna, which brought to an end the Ottoman threat to Europe, also brought about a change in approach to discussing the Ottomans, i.e. priming drift produces a new productive priming for the word as evidenced by a collocate such as policy. 
This priming emerges to reflect the change in relationship between Christian Europe and the Ottomans. When we explore that new priming, however, it does not suggest an interaction between the two. The collocate policy tends to appear in works which ruefully allude to the existence of Ottoman statecraft, albeit only to condemn its ineffectiveness or lack of moral leadership – this is a product of the perceived failings of the defeated menace, the Ottoman state. An anonymous Englishman (Person of quality 1688) assured his readers:

        Imagining my self obliged to give an Account to the Public, of whatever I could learn of the Ottoman Policy, and different Manners of Government, during Nine Years Abode at Constantinople, and several other Parts of their Empire, which I have passed over; I have therefore laid hold on this Conjuncture of their Downfall, to show the Christian World, that this great Coloss, which has been hitherto
      27. The collocate magnificence tends to appear in concordances describing Turkish food and entertainment.
        respected as Impregnable, stands on Foundations easily moved and overthrown, as subsisting by such Prejudices, and false Descriptions, as have been made of its Greatness.

        The impression that the priming of the word policy is being caused to drift by a growing perception of Ottoman vulnerability is strengthened by the collocate Limits. Although this collocate appears exclusively in translations of works by Giovanni Paolo Marana in the last decade of the century, it too suggests, for the first time through collocation in the century, an element of doubt regarding the invincibility of the Ottoman Empire.28 Another collocate indicating that the relationship to the Ottoman Empire has changed is Ambassador, which also appears in the 1690s, highlighting the development of a political and commercial relationship between European states and the Ottoman Empire at the end of the century. Yet while the collocate could imply a two-way interaction, it does not – the relevant concordances refer to English, German and French diplomats serving at the Ottoman court, rather than to any of the Turkish delegates who are known to have arrived in England from the second half of the sixteenth century onwards. Another change of relationship has been discussed by historians but is not evidenced in the corpus – the important role that these North African ambassadors played in giving Londoners their first real sight of a Muslim (Matar 1999: 32–33). There is no evidence for this in our corpus analysis, where, as noted, the focus is on European ambassadors in the Ottoman Empire, not Ottoman ambassadors in Europe.
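
        The two checks this section relies on, how persistently a collocate appears across decades and how many distinct texts it is drawn from, can be sketched in a few lines. This is a hedged illustration only: the labels ("consistent", "sustained", "transient", "intermittent"), the two-decade cut-off, the function names and the toy data are assumptions introduced here, not categories or results defined by the analysis above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def classify_collocates(
    by_decade: Dict[str, Set[str]],
    decades: List[str],
    max_transient: int = 2,  # assumed cut-off: "a decade or two"
) -> Dict[str, str]:
    """Label each collocate by its pattern of attestation across decades."""
    labels: Dict[str, str] = {}
    for word in set().union(*by_decade.values()):
        present = [word in by_decade.get(d, set()) for d in decades]
        count = sum(present)
        first = present.index(True)
        if count == len(decades):
            labels[word] = "consistent"      # attested in every decade
        elif count >= 2 and all(present[first:]):
            labels[word] = "sustained"       # appears, then never drops out
        elif count <= max_transient:
            labels[word] = "transient"       # winks in and out
        else:
            labels[word] = "intermittent"
    return labels


def dispersion(hits: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct texts behind each collocate.

    hits yields (collocate, text_id) pairs, one per co-occurrence.
    """
    texts: Dict[str, Set[str]] = defaultdict(set)
    for collocate, text_id in hits:
        texts[collocate].add(text_id)
    return {c: len(ids) for c, ids in texts.items()}


# Toy data loosely echoing the chapter's examples; not actual corpus output.
decades = [f"{y}s" for y in range(1600, 1700, 10)]
toy = {d: {"Jews"} for d in decades}
for d in decades[4:]:          # from the 1640s onwards
    toy[d].add("Heathens")
toy["1650s"].add("siege")
toy["1680s"].add("siege")

labels = classify_collocates(toy, decades)
spread = dispersion([("siege", "A"), ("siege", "A"), ("siege", "B"), ("fleet", "A")])
```

        A collocate labelled transient whose dispersion is one or two texts is the case the section warns about: it may reflect a few long histories rather than a widely shared productive priming.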
    7. The spectre of apostasy
      Turn is another collocate of Turk which only appears in the 1650s and 1680s but, in this case, almost certainly does not arise from disproportionate representation, because the texts by Knolles, Ross and others mentioned above barely mention the word in relation to Turk. This collocate, turn, appears frequently in the expression to turn Turk, and has been of particular interest to academics studying the phenomenon of Christians converting to Islam in the early modern period. Vaughan (1994: 31) has written that the phrase to turn Turk dated from the fourteenth century and had a variety of meanings: to become a Turkish person; to become a Muslim; or to become tyrannical and cruel. In the whole EEBO corpus, the phrase turn Turk appears 183 times (168 times in the seventeenth century): it first appears in 1592 and has modest peaks in the 1650s and 1680s, matching the decades in which turn appears as a collocate of Turk. An exploration of the collocates of turn
      28. See, for instance, Marana (1692).
        Turk reveals discussions of the willingness of Muslims to allow Jews to convert to Islam29 and the belief that Muslims who copulated with Christians were punished by burning.30 In 1599, the phrase springs into usage as a result of its inclusion in the first volume of Richard Hakluyt’s popular The principal nauigations, voyages, traffiques and discoueries of the English nation. Hakluyt presents a number of anecdotes of men originally from England and France converting in the hope of a better standard of life, to avoid capital punishment or as a result of coercion.

        Christian and Christians are consistent collocates of Turk. While Christians and Turks tend to be juxtaposed for rhetorical effect by writers who use the latter term, as we have noted, with a meaning equivalent to Muslim, the collocate Christian (singular) highlights a discourse of religious conversion, both in terms of a Christian converting to Islam and Muslims converting to Christianity. In a play by Philip Massinger (1630), a female character declares: “Grant me access and means, I’l undertake / To turn this Christian Turk, and marry him…” Similarly, a sermon of the same decade by Charles Fitz-Geffry (1637) warned of the dangers of conversion: “If a Christian become Turk, he is more the child of perdition then the Turks themselves.” The phenomenon of Christians converting to Islam has best been explored by Matar (1998: 15, 21, 42, 43; 1999: 43). He describes how English writers regarded apostasy as a heinous and cowardly act and viewed the thousands of converts to Islam with a growing sense of embarrassment. Conversion was a natural consequence of English servants, merchants, pirates, and soldiers travelling to North Africa and the Levant. They were, Matar writes, attracted to the greater opportunities for social mobility within the Ottoman Empire and the rich cultural opportunities offered by handsome Muslim cities. 
Conversion might also occur as a result of social displacement and powerlessness: a Parliamentary paper (England and Wales. Parliament 1641) commented:

        Whereas many thousands of your Majesties good and loving Subjects with their Ships and goods, have of late time been surprised and taken at Sea (as they were in their lawful trading) by Turkish, Moorish, and other Pirates; And some of them, to free themselves of the cruel and barbarous usage of those Pirates, have renounced the Christian Religion, and turned Turks; And others yet kept in bondage, are used with so extreme cruelty, as they are in great danger thereby to lose their lives, unless they shall also forsake the Christian Religion…

        Consequently, the link between Turk and conversion again probably betokens a concept linked to Turk which is mostly receptive but which, under certain circumstances, may become productive as a priming.
      29. For example, Marana (1692).
      30. See, for example, T.C. (1698).
        To what extent are the claims made about Turk by historians supported by the corpus data? In the EEBO corpus, there are conflicting views of Turkish religious toleration. Some writers declared that Christians were forced or tricked into converting. A traveller’s account by William Lithgow (1640) confidently asserted that:

        …there is a stately tree, called Tubah, the lease of which is partly of gold, and partly of silver: whose boughs extend round about the walls of this seventh Paradise, whereon the name of Mohamed is written, near to the name of God, in these words, Alla illa, he, all ah, Mohamed Rezul allah. The which words are in such reverence amongst the Turks, that if a Christian should happen, unadvisedly to repeat them, he is adjudged to a most cruel death, or compulsed to renounce his Christian Religion.

        Other themes involve Turks forcing Christian children into Janissary units31 or castrating them: “As it is a barbarous custom at this day among the Turks, to deprive divers Christian children of their privities.”32 Vitkus (1997: 174) has written of the widespread belief that conversion to Islam required circumcision.

        This fear of forced conversion is tempered, to a certain extent, by writers who acknowledge that a policy of religious toleration was in operation in the Ottoman Empire. Indeed, the pioneering advocate of religious tolerance, Leonard Busher (1646), put forward the Turks as an exemplary model from which Christians had much to learn: “Also I read that Iews, Christians, and Turks are tolerated in Constantinople, and yet are peaceable, though so contrary the one to the other. 
If this be so, how much more ought Christians not to force one another to Religion?” However, the image of the oppressive and manipulative Turk, forcing Christians to turn renegade, found a louder voice in popular discourse and might be seen as a dominant productive priming of the word: in a period of intense religious persecution in England itself, it is possible that many writers simply could not conceive of a nation that allowed people to openly practise different faiths.33
    8. Conclusion

    In the seventeenth century, English writers did differentiate between the terms Ottoman and Turk. The former tended to relate to members of the imperial family, but in the second half of the century, the priming of the word drifted as it started
    31. See, for example, Ryves (1648).
    32. Trapp (1649).
    33. See Vitkus (1997: 161).
      to be used in texts that referred to the power of the empire, its military operations and the religion of its inhabitants. Turk remained more stable in meaning over the course of the century, with a large number of consistent collocates appearing in texts which presented Turkish people as members of an opposing force, who, the writers of the time claimed, practised an inferior and false religion. Turk was very much conceived as a religious identity by early modern writers, who sometimes utilised the term as a near-synonym for Muslim. Hence we see the productive priming of both Ottoman and Turk being relatively stable, but we also see evidence that some receptive primings related to the words have the ability to switch between being receptive and productive, and vice versa, depending on circumstance. Also, and importantly, as the century concludes the evidence seems to point to the priming of the word Ottoman drifting, with the association of the Ottomans with invincibility being replaced by a perception of their weakness and a shift to peaceful relations instead of war.

      Our exploration of early modern dictionaries proved them to be generally poor guides to productive and receptive primings of the words we explored. Two early modern western definitions of the term Turk, “any cruel hard-hearted Man” or “one that is accursed as a vagabond” which appear, respectively, in a canting dictionary of 169934 and in The Policy of the Turkish Empire of 1597,35 are not reflected in any collocates attracted to Turk. Indeed, there are no collocates which directly relate to Turks having savage or pitiless personalities. This is somewhat surprising, as scholars specialising in early modern Anglo-Ottoman relations often go to considerable lengths to emphasise the English preoccupation with the Turkish national character. 
There remains the possibility that this was expressed in recep‑ tive primings in the era, but in the absence of any corpus evidence that shows this priming becoming productive in the century, in spite of many events which could have brought that about, the existence of this receptive priming must be doubtful. Yet the link between social reality and priming may be disjointed. Seventeenth‑ century English merchants profited from trade with Islamic countries to a greater extent than any other Europeans.36 However, there is also a complete absence of collocates relating to commercial activity in written public discourse in the cen‑ tury. So it does not necessarily follow that primings always follow the pressuresexerted on words by the social world.As shown, however, the corpus still provides very valuable evidence for word primings in the century and provides, in combination with historical research, a
    4. B.E (1699) reproduced in Lexicons of Early Modern English (LEME).
    5. See Vaughan (1994: 31, n. 58).
    6. Matar (1998: 10).
      way of approaching both a description and explanation of word primings in the past. Our research showed what can happen when the corpus is eschewed and the close reading of small numbers of texts are relied on to explore representation instead. For instance, historians have written a great deal about the emergence of the coffee house in post‑Restoration London and the accompanying fears that the Mahometan berry (the coffee bean) threatened Christianity itself. But our analysis fails to highlight any such discourse – indeed the corpus contains no examples of the phrase Mahometan berry. If it was a concern, it is not a salient one in the writ‑ ten public discourse of the time, and it certainly does not seem to have been talked about frequently using the term Mahometan berry.37It appears that the chief social pressures acting on the primings of words like Turk and Ottoman in the seventeenth century was, overwhelmingly, religious anxiety and, to a lesser extent, reactions to foreign policy.38 On occasion other fac‑ tors influenced the primings of these words – the preoccupation with the Turks’ non‑Christian status was particularly intense during periods of heightened reli‑ gious anxiety in the 1640s and 1650s. Finally, interest in Ottoman military activity peaked (unsurprisingly) during the second siege of Vienna towards the end of the century. English writers in early modern times displayed a realistic acceptance of their country’s limitations in countering Ottoman expansionism. The corpus con‑ tains plentiful expressions of wonder and envy: although Islam was regarded with contempt, the Ottomans are presented as fearsome men who are to be respected. As Matar (1999: 11) has commented, English writers did not use terms such as colonyplantation or settlement in descriptions of Muslim territories. 
It is only towards the very end of the century, during a period that witnessed Turkish mili‑ tary decline and the establishment of a powerful English navy, that the notion of Turkish invincibility is challenged in discourse to any extent with a consequent drift in priming observed in the corpus.To conclude, primings do have the capacity to drift over time, as demonstrated by this analysis. Yet though priming may drift, the pattern of change associated with such drift is complex. Drift may be the result of new primings being added, as we saw with the discussion of Ottoman vulnerability at the end of the century. But drift may also be expressed by receptive primings becoming productive, and vice versa. Another important point discussed is that drift may work at the level of the
    7. Turkish coffee appears three times within our corpus and Arabian coffee is mentioned once.
    8. This preoccupation with religious issues has been explored by William Hamlin (2014), who has undertaken a corpus analysis of EEBO to investigate the evolution of god-language in early modern printed texts.

    speech community or at the level of the individual. As was shown in this chapter, some authors may exhibit strong productive primings that the majority of authors do not. This has an important consequence – if we do not consider the disper‑ sion of a priming we may, mistakenly perhaps, project a set of primings from one author to a whole society. Primings may work on large speech communities, but they may also work on levels as small as the individual. This layering of priming suggests that there is the capacity across the speakers of a language for primings to vary across groups and individuals. This study also showed another important feature of priming: while tied to the social world, priming is not a simple reaction to it. While the social world, inevitably, exerts a huge pressure on the primings of content words, that pressure does not inevitably lead to a corresponding drift of priming either for an individual or a whole speech community. This uncertain rela‑ tionship between priming and the social world almost certainly provides another source in variation in priming across a population of speakers of a language.
Acknowledgements

The work in this chapter was supported by the Newby Trust and the UK Economic and Social Research Council, grant reference ES/K002155/1.
References

Ameer Ali, S. 2010. A Short History of the Saracens. New York NY: Routledge.
B.E (anon). 1699. A New Dictionary of the Terms Ancient and Modern Of The Canting Crew, In its several Tribes of Gypsies, Beggars, Thieves, Cheats, &c. With an Addition of Some Proverbs, Phrases, Figurative Speeches, &c. Useful for All Sorts of People, (Especially Foreigners) to Secure Their Money and Preserve Their Lives; Besides Very Diverting and Entertaining, Being Wholly New. London: Printed for W. Hawes, P. Gilbourne, W. Davis.
Beck, B.H. 1987. From the Rising Sun: English Images of the Ottoman Empire to 1715. Bern: Peter Lang.
Bergeron, D.M. 2010. “Are we turned Turks?”: English pageants and the Stuart court. Comparative Drama 44(3): 255–275. doi: 10.1353/cdr.2010.0001
Busher, L. 1646. (no title) London: Printed for John Sweeting at the Angel in Popes-head-alley.
Cowper, W. 1608. The triumph of a Christian Contayning Three Excellent and Heauenly Treatises. 1 Iacobs Wrestling with God. 2 The Conduit of Comfort. 3 A Preparatiue for the Lords Supper. Full of Sweet Consolations for all that Desire the Comfortable Sweetnesse of Iesus Christ, and Necessary for Those who are Troubled in Conscience. Written by that Worthy Man Master William Couper, Minister of Gods Word. London: Printed (by T. East) for Iohn Budge, and are to be sould at the great South doore of Paules Church.
Davies, M. 2012. Expanding horizons in historical linguistics with the 400-million-word Corpus of Historical American English. Corpora 7(2): 121–157. doi: 10.3366/cor.2012.0024
England and Wales. Parliament. 1641. (no title) London: Printed by Robert Barker, Printer to the Kings most excellent Majestie: and by the assignes of John Bill.
Fitz-Geffry, C. 1637. Compassion towards Captives Chiefly towards our Brethren and Countrymen who are in Miserable Bondage in Barbarie. Vrged and Pressed in Three Sermons on Heb.13.3. Preached in Plymouth, in October 1636. London: Printed by Leonard Lichfield, for Edward Forrest.
Hakluyt, R. 1599. The Principal Nauigations, Voyages, Traffiques and Discoueries of the English Nation: Made by Sea or Ouer-land, to the Remote and Farthest Distant Quarters of the Earth, at any Time within the Compasse of these 1600. Yeres: Deuided into Three Seuerall Volumes, According to the Positions of the Regions, Whereunto They Were Directed … (Volume 1). London: By George Bishop, Ralph Newberie, and Robert Barker.
Hamlin, W.M. 2014. God-language and scepticism in early modern England: An exploratory study using corpus linguistics analysis as a form of distant reading. English Literature 1(1): 17–41.
Hardie, A. 2012. CQPweb – Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17(3): 380–409. doi: 10.1075/ijcl.17.3.04har
Higgons, T. 1684. The History of Isuf Bassa, Captain General of the Ottoman Army at the Invasion of Candia. London: Printed for Robert Kettlewell.
Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. Abingdon: Routledge. doi: 10.4324/9780203327630
Homes, N. 1653. Apokalypsis Anastaseōs The Resurrection Revealed, or, The Dawnings of the Day-star about to Rise and Radiate a Visible Incomparable Glory far beyond any Since the Creation upon the Universal Church on Earth for a Thousand Yeers yet to Come, before the Ultimate Day of the General Judgement to the Raising of the Jewes, and Ruine of all Antichristian and Secular Powers, that do not Love the Members of Christ, Submit to his Laws and Advance his Interest in this Design: Digested into Seven Bookes with a Synopsis of the Whole Treatise and two Tables, 1 of Scriptures, 2 of Things, Opened in This Treatise. London: Printed by Robert Ibbitson and are to be sold by Thomas Pierrepont.
Hutchings, M. 2006. The “Turk Phenomenon” and the repertory of the late Elizabethan playhouse. Early Modern Literary Studies 16: 1–39.
Ingram, A. 2015. Writing the Ottomans: Turkish History in Early Modern England. New York NY: Palgrave Macmillan. doi: 10.1057/9781137401533
I.S. (anon). 1683. The History of the Turks Describing the Rise and Ruin of Their First Empire in Persia, the Original of Their Second: Containing the Lives and Reigns of Their Several Kings and Emperors from Ottoman its First Founder to this Present Year, 1683, Being a Succinct Series of History, of all Their Wars (Forreign and Domestick) Policies, Customs, Religion and Manners, With What Else is Worthy of Note in That Great Empire. London: Printed by Ralph Holt for Thomas Passinger, William Thackery and Thomas Sawbridge.
J.M. (anon). 1689. The History of the Venetian Conquests, from the Year 1684 to this Present Year 1688. Translated out of the French by J.M. Licensed, Octob. 2. 1688. London: Printed for John Newton at the three Pigeons over-against the Inner Temple-Gate in Fleet-street.
Jowitt, C. 2002. Political allegory in Late Elizabethan and Early Jacobean “Turk” plays: Lust’s Dominion and The Turke. Comparative Drama 36(3–4): 411–443. doi: 10.1353/cdr.2002.0022
Knolles, R. 1603. The Generall Historie of the Turkes from the First Beginning of That Nation to the Rising of the Othoman Familie: With all the Notable Expeditions of the Christian Princes against Them. Together with the Liues and Conquests of the Othoman Kings and Emperours Faithfullie Collected out of the Best Histories, both Auntient and Moderne, and Digested into one Continuat Historie untill This Present Yeare 1603. London: Printed by Adam Islip.
Koselleck, R. 2002. The Practice of Conceptual History: Timing History, Spacing Concepts. Stanford: Stanford University Press.
Kugler, E.M.N. 2012. Sway of the Ottoman Empire on English Identity in the Long Eighteenth Century. Leiden: Brill. doi: 10.1163/9789004225435
Lithgow, W. 1640. The totall discourse, of the rare adventures, and painefull peregrinations of long nineteene yeares travailes from Scotland, to the most famous kingdomes in Europe, Asia, and Affrica. Perfited by three deare bought voyages, in surveying of forty eight kingdomes ancient and modern; twenty one rei-publicks, ten absolute principalities, with two hundred islands. London: By I.
Lucinge, R. 1606. The Beginning, Continuance, and Decay of Estates Vvherein are Handled Many Notable Questions Concerning the Establishment of Empires and Monarchies. London: Printed at Eliot’s Court Press for Iohn Bill.
McEnery, T. & Baker, H. 2016. Corpus Linguistics and Seventeenth-Century Prostitution. London: Bloomsbury.
McEnery, T. & Hardie, A. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge: CUP. doi: 10.1093/oxfordhb/9780199276349.013.0024
McJannet, L. 2006. The Sultan Speaks: Dialogue in English Plays and Histories about the Ottoman Turks. New York NY: Palgrave Macmillan. doi: 10.1057/9780230601499
Maclean, G. 2004. The Rise of Oriental Travel: English Visitors to the Ottoman Empire, 1580–1720. New York NY: Palgrave Macmillan. doi: 10.1057/9780230511767
Marana, G.P. 1692. The Fourth Volume of Letters Writ by a Turkish Spy Who Lived Five and Forty Years Undiscover’d at Paris: Giving an Impartial Account to the Divan at Constantinople of the Most Remarkable Transactions of Europe, and Discovering Several Intrigues and Secrets of the Christian Courts (Especially of that of France) Continued from the Year 1642 to the Year 1682. London: Printed by J. Leake for Henry Rhodes.
Massinger, P. 1630. The Renegado a Tragæcomedie. As it Hath Beene Often Acted by the Queenes Maiesties Seruants, at the Priuate Play-house in Drurye-Lane. London: Printed by A[ugustine] M[athewes] for Iohn Waterson, and are to be sold at the Crowne in Pauls Church-Yard.
Matar, N. 1997. Renaissance England and the Turban. In Images of Other: Europe and the Muslim World before 1700, D. Blanks (ed.). Cairo: American University in Cairo Press.
Matar, N. 1998. Islam in Britain, 1558–1685. Cambridge: CUP. doi: 10.1017/CBO9780511582738
Matar, N. 1999. Turks, Moors and Englishmen in the Age of Discovery. New York NY: Columbia University Press.
Matar, N. 2009. Britons and Muslims in the early modern period: From prejudice to (a theory of) toleration. Patterns of Prejudice 43(3–4): 213–231.
Maton, R. 1646. (no title) London: Printed by Matthew Simmons, and are to be sold by George VVhittinton at the blew Anchor neere the Royall-Exchange.
Meredith, J. 1624. The Iudge of Heresies one God, one Faith, one Church, out of Which There is no Saluation. Excluding all Infidells, Mahumetans, Iewes, Obstinate Papists, and Other Heretikes of all Sorts, and Consequently all Newters, Who Conforme Themselues Onely Externally to any Religion, from Hope of Participation of the Kingdome of Heauen. If They Finally Persist Therein, and Returne not to the Knowledge and Zealous Profession of the True Faith. London: Printed by A[ugustine] M[athewes] for Iohn Grismand, and are to bee sold at his shop in Pauls Alley, at the signe of the Gunn.
Millar, F. 1993. The Roman Near East, 31 B.C.–A.D.337. Cambridge MA: Harvard University Press.
Neville, H. 1681. Plato Redivivus, or, A Dialogue Concerning Government Wherein, by Observations Drawn from Other Kingdoms and States Both Ancient and Modern, an Endeavour is Used to Discover the Present Politick Distemper of Our Own, With the Causes and Remedies. London: Printed for S.I. and sold by R. Dew.
OED Online. Oxford University Press. Consulted July 2016.
Pemble, W. 1627. Vindiciæ Gratiæ. A Plea for Grace. More Especially the Grace of Faith. Or, Certain Lectures as Touching the Nature and Properties of Grace and Faith: Wherein, amongst Other Matters of Great Use, the Maine Sinews of Arminius Doctrine are cut Asunder. Delivered by That Late Learned and Godly Man William Pemble, in Magdalen Hall in Oxford. London: Printed by R. Young for I. Bartlet, at the golden Cup in Cheape-side.
Person of quality. 1688. (no title). London: Printed and sold by Randal Taylor.
Purchas, S. 1625. Purchas his Pilgrimes In Fiue Bookes. The First, Contayning the Voyages and Peregrinations Made by Ancient Kings, Patriarkes, Apostles, Philosophers, and Others, to and Thorow the Remoter Parts of the Knowne World: Enquiries also of Languages and Religions, Especially of the Moderne Diuersified Professions of Christianitie. The Second, a Description of all the Circum-nauigations of the Globe. The Third, Nauigations and Voyages of Englishmen, alongst the Coasts of Africa… The Fourth, English Voyages beyond the East Indies, to the Ilands of Iapan, China, Cauchinchina, the Philippinæ with Others… The Fifth, Nauigations, Voyages, Traffiques, Discoueries, of the English Nation in the Easterne Parts of the World… The First Part. London: Printed by William Stansby for Henrie Fetherstone, and are to be sold at his shop in Pauls Church-yard at the signe of the Rose.
Rissanen, M., Kytö, M. & Palander-Collin, M. 1993. Early English in the Computer Age: Explorations through the Helsinki Corpus. Berlin: Mouton de Gruyter.
Rollock, R. 1603. Lectures Vpon the Epistle of Paul to the Colossians. Preached by That Faithfull Seruant of God, Maister Robert Rollok, Sometime Rector of the Vniuersitie of Edenburgh. London: Imprinted by Felix Kyngston, dwelling in Pater-noster row, ouer against the signe of the Checker.
Ross, A. 1652. The History of the World the Second Part in six Books, Being a Continuation of Famous History of Sir Walter Raleigh, Knight: Beginning Where he Left viz at the End of the Macedonian Kingdom, and Deduced to These Later-times: That is from the Year of the World 3806, or 160 Years before Christ till the End of the Year 1640 after Christ. London: Printed for John Saywell.
Rycault, P. 1680. The History of the Turkish Empire from the Year 1623 to the Year 1677 Containing the Reigns of the Three Last Emperours, viz., Sultan Morat or Amurat IV, Sultan Ibrahim, and Sultan Mahomet IV, his Son, the XIII Emperour Now Reigning. London: Printed by J.M. for John Starkey.
Ryves, B. 1648. (no title). London: (no publisher).
Sherwin, W. 1674. An Additional Supplement to the Eirenikon, or, Peaceable Considerations of Christs Peaceful Kingdome to Come upon the Earth in the Thousand Years Rev. 20, Lately Published, 1665. London: (no publisher).
T.C. (anon). 1698. The New Atlas, or, Travels and Voyages in Europe, Asia, Africa, and America, thro’ the Most Renowned Parts of the World … Performed by an English Gentleman, in Nine Years Travel and Voyages, More Exact Than Ever. London: Printed for J. Cleave … and A. Roper.
Tolan, J.V. 2002. Saracens: Islam in the Medieval European Imagination. New York NY: Columbia University Press.
Trapp, J. 1649. (no title). London: Printed for Timothy Garthwait, at the George in Little-Brittain.
Ussher, J. 1645. A Body of Divinitie, or, The Summe and Substance of Christian Religion Catechistically Propounded, and Explained, by Way of Question and Answer: Methodically and Familiarly Handled / Composed Long Since by James Vsher B. of Armagh, and at the Earnest Desires of Divers Godly Christians Now Printed and Published; Whereunto is Adjoyned a Tract, Intituled Immanvel, or, The Mystery of the Incarnation of the Son of God Heretofore Writen and Published by the Same Authour. London: Printed by M.F. for Tho. Downes and Geo. Badger.
Vaughan, V.M. 1994. Othello: A Contextual History. Cambridge: CUP.
Vitkus, D.J. 1997. Turning Turk in Othello: The conversion and damnation of the Moor. Shakespeare Quarterly 48(2): 145–176. doi: 10.2307/2871278
Willet, A. 1633. Hexapla in Genesin & Exodum: That is, a Sixfold Commentary upon the Two First Bookes of Moses, Being Genesis and Exodus. London: Printed by John Haviland, and are sold by James Boler at the signe of the Marigold in Pauls Church-yard.
Wood, A.C. 1964. A History of the Levant Company. London: Frank Cass.

Forced lexical primings in transdiscoursive political messaging
How they are produced and how they are received
Alison Duguid & Alan Partington
University of Siena / University of Bologna
Lexical priming is a term for the processes by which listeners, by repeated exposure, first internalise and then reproduce the constituent elements of language, their combinatorial possibilities and the semantic and pragmatic meanings associated with them (Hoey 2005). Forced priming (Duguid 2009), on the other hand, describes a process whereby speakers or authors frequently repeat a certain form of words to deliberately ‘flood’ the discourse with messages for a particular strategic purpose. There are many fields where primings can be forced for particular effect, such as education, particularly in the primary school, for example through exercises in rote learning, or advertisements, as in slogans coined to be memorable and repeatable. Advertising combines with politics in the periods around general elections and referendums where professional campaigns are run, employing advertising agencies to put over political messages in a simple way. Here, however, we are not interested in campaign posters or brief messages clearly created to express a party position, but in the linguistic discipline of day-to-day political communication, where there is the careful, studied and strategic preference of a particular form with an associated evaluation, positive for the speaker’s side or negative for the opponents.

doi 10.1075/scl.79.03dug
© 2017 John Benjamins Publishing Company

1. Introduction

Lexical priming is a term for the processes by which listeners, by repeated exposure, first internalise and then reproduce the constituent elements of language, their combinatorial possibilities and the semantic and pragmatic meanings associated with them (Hoey 2005).

Forced priming (Duguid 2009), on the other hand, describes a process whereby speakers or authors frequently repeat a certain form of words to deliberately ‘flood’ the discourse with messages for a particular strategic purpose (though we need to treat the word ‘deliberately’ with caution). There are many fields where primings can be forced for particular effect, such as education, particularly in the primary school, for example through exercises in rote learning, with group and individual reproduction of material such as the times tables, or advertisements, as in slogans coined to be memorable and repeatable. Advertising combines with politics in the periods around general elections and referendums where professional campaigns are run, employing advertising agencies to put over political messages in a simple way. Here, however, we are not interested in campaign posters or brief messages clearly created to express a party position, but in the linguistic discipline of day-to-day political communication,¹ where there is the careful, studied and strategic preference of a particular form with an associated evaluation, positive for the speaker’s side or negative for the opponents. For example, on opposite sides of debates on voluntary termination of pregnancy, participants on one side may be advised to choose the term pro-life, those on the other pro-choice. In current financial policy debates, politicians may be encouraged to use the term austerity rather than deficit reduction or fiscal responsibility, or vice versa, depending on whether or not they belong to the group carrying out the policy.

The composers of such insistent messages may be media specialists, often journalists themselves, while the addressers are usually politicians who insert the messages while answering questions or making statements. Since it is used on many occasions, by many addressers, the practice is sometimes referred to metaphorically as ‘singing from the same hymn-sheet’. This metaphor suggests a bringing together of addressers, message, timing, and attribution to produce a choral effect. The addressees, or better still beneficiaries, of the insistent messaging, that is, those whose primings the composer hopes to force, are, of course, the voting public. Baker speaks of the ‘incremental effect of discourse’ (2005: 13–14), but forced priming represents an artificial boosting of this incremental effect to give extra weight to an evaluation. The evaluation is the point of the discourse and a cohesive factor in it (Partington, Duguid & Taylor 2013: 55–57 speak of ‘evaluative harmony’). In this sense, it is a deliberate, even blatant, form of intertextuality, or rather of transdiscoursive intertextuality, where meaning-building can also be interdiscoursal, passing from one discourse type to others, indeed along a potential chain, as will be discussed in Section 2.1.
1. Clare Short, a senior UK Labour party politician, recounted how clusters were flooded into the discourse for forced priming in the New Labour administration: “A team of briefers work with Blair to prepare for Prime Minister’s Question Time, they then provide briefing on the lines and phrases to any senior politician appearing on Question Time, Any Questions, etc. Written briefings are sent out to all MPs so they too know the phrases to use and the line to take” (Short 2004: 47). The creation of clusters (lines and phrases) which are intended to become collocations is a conscious triggering of the idiom principle.
In this chapter we also discuss a particular type of intertextuality which we call transdiscoursive intertextuality, a phenomenon ever more common in a world where new discourse types are constantly being invented, in which messages pass from one discourse type to another and then perhaps to yet another, and so on. Lexical priming and meaning building are thus carried out and even reinforced by people’s exposure to a message in a variety of discourse types. The example we concentrate on here is the political message cycle (Section 2.1 below).

In this chapter we look at examples of forced priming in a variety of political discourses. A considerable number of corpora were employed, including:
• corpora of White House press briefings from various time periods (3.4 million words);
• the SiBol corpus, a sample corpus of English broadsheets from 1993–2010, whose latest version, SiBol 2013, includes thirteen English-language newspapers from around the world;
• a corpus containing the two British broadsheets, the right-leaning Telegraph and the left-leaning Guardian, called G and T, from 2014 and 2015 (121 million words);
• the AM (the Andrew Marr show) corpus, a corpus of political interviews from 2015 (100,000 words);
• CNN news reports (63 million words) from 2011;
• all New York Times articles (74 million words) from 2011.
Other ad hoc newspaper data collections were also downloaded from Lexis Nexis, particularly for the study outlined in Section 3.4, where the simple search string NHS OR National Health Service AND envy of AND the world was used to collect articles containing examples from a variety of British newspapers.
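A search string of this kind can be approximated offline once the articles have been downloaded. The following is a minimal sketch in Python, not the retrieval actually performed: it assumes that the string is to be read as (NHS OR National Health Service) AND envy of AND the world, which is our guess at the operator grouping, and the function name and the sample sentences are invented purely for illustration.

```python
import re


def matches_query(text: str) -> bool:
    """Return True if the text satisfies the assumed reading of the query:
    (NHS OR National Health Service) AND "envy of" AND "the world"."""
    lowered = text.lower()
    mentions_service = (
        re.search(r"\bnhs\b", lowered) is not None
        or "national health service" in lowered
    )
    return mentions_service and "envy of" in lowered and "the world" in lowered


# Invented sample sentences standing in for downloaded articles.
articles = [
    "The NHS was once called the envy of the world.",
    "The National Health Service is the envy of many in the world of medicine.",
    "Ministers praised the NHS yesterday.",
]

hits = [article for article in articles if matches_query(article)]
```

Note that a filter of this sort only checks co-occurrence within an article, so the second invented sentence is retrieved even though it does not contain the phrase envy of the world; hits would still need to be read in their concordance lines.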
    2. Political press briefings
Several authors have noted that press briefings constitute a rich area for the study of forced priming; indeed, the very aim of press briefings from the US administration’s point of view is to get the media, and through them the public, to accept a particular version of events. Partington (2003) notes that, in a keyword and key-cluster analysis of a 250,000-word corpus of White House press briefings (the reference corpus consisting of contemporaneous political interviews), the spokesperson or podium uses certain words differently from the press questioners. For example, the latter use the item clear to ask for more clarity, whilst the podium uses it to flood the discourse with the message that both he and his masters (usually the President) are always clear, a great political virtue, e.g. the president has made clear (13 occurrences), we have made very clear (7). The other virtues podiums attempt to project are strength and a hard-working Presidency, with keywords including effort/s, step/s and work*. Key clusters also suggest an ongoing process: to get this done (7), it’s being done (2), getting the work done (3), move forward (9) and continue to work with (7; this latter also expressing yet another frequent priming, that of a ‘cooperative’ administration) (Partington 2003: 199–203).

In her collection of White House press briefings from 2001–2005 (699 texts, 3.4 million words), Riccio (2009) uncovers a number of attempted priming messages which involve the item message/s itself. The item is relatively common in the discourse type, occurring 0.33 times per thousand words, compared to 0.06 per thousand in the BNC. She concludes that when the addressee is an opponent of the US, most frequently Saddam Hussein, the word is a euphemism for threat (2009: 139–140), for example:
1. Mr. Fleischer: All of these actions by the United States and our allies – and we have worked every step of the way with our allies – have, I believe, sent an unmistakable message to regimes that are seeking or that possess weapons of mass destruction: these weapons do not bring the benefits of security, as the President has stated; they bring isolation and unwelcome consequences. (19th December 2003; cited in Riccio 2009: 128)

The other main collocates of message, an essentially vague umbrella term, are the World (46 occurrences) and the American people (20 occurrences), both of which are represented as joining in and agreeing with the President’s message (2009: 123–128). The message primings overlap with the above-mentioned ‘clarity’ and ‘strength/power’ primings; the President’s or the US’s messages are, of course, clear (44 co-occurrences, the most frequent collocate), strong (19) and powerful (13) (Riccio 2009: 134).

During the same period of the Iraq war, and as part of the same project, Partington notes two particular lexical primings around the item job (172 occurrences), both of which had the purpose of praising the military. These primings were found not only in press briefings but also in spoken political discourses in the US House of Representatives and the UK House of Commons. The first was the pattern a(n) [adjective] job, for example:

2. Our servicemen and women are doing a remarkable job in difficult circumstances (House of Commons, 10/09/2003)

In Hoey’s terms, we might say that the pattern a(n) [adjective] job is primed in this discourse type to have an intensifier in the adjective slot (in the House of Commons data, in 70% of cases), displays a semantic preference for a military actor and has a pragmatic priming for epideictic praise. In slightly different terms, those of speaker intentionality, we might equally say that, whenever a speaker, in these particular discourse circumstances, desires to praise the activities of a military (or associated) actor, they are strongly primed to employ this pattern.

When the second template, namely [the / possessive] job, is found, it is accompanied by some form of the verbs do and get, including do the job, get on with their job and get the job done, with wider associated elements including properly, can/cannot and enough resources to. This pattern, like the previous one, is a vehicle for praise, though perhaps less outstanding praise; more for fulfilling a duty than for exceeding it. Notably, in the briefings corpus, these two templates were only ever used by the podium, never by the press, which rarely indulges in praise in its questioning.

One dramatic example of the podium’s attempts at ‘flooding’ the briefings debate with a particular lexical priming occurred in the early stages of the Arab Uprisings in North Africa and the Middle East, in 2011. In normal times, it is the diplomatic habit for the podium to refer to the administration of any foreign country, whether democratically elected or not, as its government. However, as Figure 1 shows, in the space of nine months, the podium stops referring to the administration of Syria as the Syrian government and refers to it as the Assad regime, a clear attempt to ‘flood’ the discourse with a far more negatively primed representation of the administration in question. It seems hard to believe that the shift is not deliberate; it is evidence of distancing the US administration from its Syrian counterpart.
[Figure 1 chart. Periods: Dec–Feb, March–May, June–Aug, Sept–Nov. Series: Syrian government, Syrian regime, Assad regime.]

Figure 1. How the Syrian administration is referred to by the podium in the first three months of White House press briefings in 2011
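Counts of the kind plotted in Figure 1 can be obtained by tallying each competing label within each period. The following is a minimal sketch in Python, assuming that the podium turns are available as (period, text) pairs; the function name, the period keys and the toy sentences are invented for illustration and are not taken from the briefings corpus or from the tools used in the study.

```python
import re
from collections import Counter, defaultdict

# The three competing labels discussed in the text.
LABELS = ["Syrian government", "Syrian regime", "Assad regime"]
LABEL_RE = re.compile("|".join(re.escape(label) for label in LABELS), re.IGNORECASE)


def count_labels_by_period(turns):
    """turns: iterable of (period, text) pairs.
    Returns a mapping {period: Counter({label: occurrences})}."""
    canonical = {label.lower(): label for label in LABELS}
    counts = defaultdict(Counter)
    for period, text in turns:
        for match in LABEL_RE.finditer(text):
            counts[period][canonical[match.group(0).lower()]] += 1
    return counts


# Invented toy data: two periods, three podium turns.
toy_turns = [
    ("Dec-Feb", "We urge the Syrian government to show restraint."),
    ("Sept-Nov", "The Assad regime has lost legitimacy. The Assad regime must stop."),
    ("Sept-Nov", "The Syrian regime continues its crackdown."),
]

counts = count_labels_by_period(toy_turns)
```

The matching is case-insensitive and purely string-based, so it would miss variants such as the regime of Assad; a fuller treatment would inspect concordance lines for each label.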
One question which often arises in any study of press briefing primings is whether the podium’s attempts to flood the discourse with a particular priming actually succeed; in other words, does the Washington press go on to adopt the same kind of phraseology when discussing the topic? In order to study this, a corpus of CNN news reports² and one of New York Times articles over the year 2011 were compiled (see the Introduction). These media organisations were chosen since they both send correspondents to White House briefings.
1. Transdiscoursive intertextuality

As often noted, many of the meanings conveyed in texts are incremental, that is, built up over time over many texts, the meaning in the current text depending on its ‘historical’ development through the text-type. This is true of the forced primings we have been discussing so far; they are deliberately constructed over time. What is perhaps less frequently noted is that meaning-building can also be transdiscoursive, passing from one discourse type to others, indeed along a potential chain. As regards political language, a message from a political source is designed to be adopted and reproduced in the mainstream media, both print and broadcast. But it may then be noted and discussed on-line and in the social media, such as Twitter. And the social media activity may then be noted and reported on in “Media Watch” (France 24) type programmes in the mainstream media, whilst politicians next morning read concentrates of what the media is saying in the press reviews their agents prepare for them, as in Figure 2.
a political source
→ mainstream reporting: TV, newspaper
→ CMC / social media
→ back into the mainstream: ‘Media Watch’
→ politicians’ press review, next morning
          Figure 2. Transdiscoursivity. The potential political message cycle
2. Compiled by Anna Marchi, University of Bologna.
It is beyond the scope of the current research to follow forced primings along the whole of this path; we limit ourselves to the origin of the messages and their reception or otherwise by the mainstream media.

As regards the CNN news reports, Figure 3 illustrates how the references to the Syrian administration change over the course of 2011.

[Figure 3: chart by quarter (1st Qtr to 4th Qtr) for Syrian government, Syrian regime and Assad regime]
Figure 3. CNN news reports: The proportion of mentions of Syrian government / Assad government / Assad regime over 2011
As we can see, there is a relative increase over the year in the use of regime, especially Assad regime, but it is less drastic than that seen in the briefings. This becomes clearer if we combine both types of mention of regime, as in Figure 4.

[Figure 4: chart by quarter (1st Qtr to 4th Qtr) for Syrian government and Syrian regime (combined with Assad regime)]
Figure 4. CNN news reports: the proportion of mentions of Syrian government / Syrian regime + Assad regime over 2011
Looking at an example of the printed press, Figure 5 below shows how the New York Times chose to refer to the Syrian administration over 2011. As is evident, the use of government prevails throughout. On the occasions when regime does get used, by the end of the year Assad regime is preferred to Syrian regime.

In this case, the evidence suggests that the US administration’s forced priming via briefings did not immediately alter the newspaper’s discourse.

A word of caution is needed on the process of the transdiscoursive statistical tracking over time of primings and messages in general in different discourses. Even if a correlation is apparent, this does not necessarily mean that one source, a broadcaster or newspaper, say, is following the other; they could of course both be reacting in a similar way to outside stimuli. For instance, the increase in use of the negative label Assad regime on the part of CNN may simply be the result of growing disapproval of the administration’s actions. In Popper’s terms it is easier to disprove the influence of one discourse on another, as in the case of the New York Times’s evident lack of take-up of briefings primings, than to prove it (Popper 1934/1959).
[Figure 5: chart by quarter (1st Qtr to 4th Qtr) for Syrian government, Syrian regime and Assad regime]
Figure 5. New York Times: The proportions of mentions of Syrian government / Assad government / Assad regime over 2011
    3. Lexical priming in UK party politics
1. Praise and blame: Forcing evaluations

In interactive political discourses like press briefings and political interviews, pre-prepared strategic snippets are very often reactive, that is, found in an answer to something and intended to rebut some previous piece of discourse. In the interaction between politics and public, most talk is mediated; the public rarely follows a full speech or parliamentary debate or press briefing and is more likely to consume soundbites, that is, extracts inserted into a speech specially designed to be quotable. This gives us a profile of reiterated phrases which are fairly short, may be rhythmical or contain rhetorical flourishes, but are assertive, evaluative, vague and of course reiterated.³ Evaluation is the pragmatic point of the forcing: to praise or blame, to side with or against, to support or remove support.
3. For example, the Chancellor George Osborne used the phrase ‘a budget for the next generation’, with small variations, 18 times in a one-hour speech.
2. Delivering good things and doing the right thing

A positive portrayal of a party’s policies and decisions can be made in vague but frequently repeated terms and may have several possible interpretations. Ultimately they may simply signify that ‘our hearts are in the right place, we are doing what people want or need’ (making the ‘right noises’, Partington & Taylor 2010: 94–95). Sometimes, however, an instance of forced priming becomes particularly noticeable. One’s attention is drawn to an item on hearing it repeated in the media and noting it as a recognizable feature of political discourse but not part of one’s own personal primings for production. The item deliver caught the authors’ attention because of the reiterated colligation with on and for. We then analysed these items in politicians’ speeches in a variety of sources: the G and T corpus 2014 and 2015 and the 2015 Andrew Marr Show corpus; key events in the periods covered by this data were the referendum on Scottish independence and the 2015 general election. In quantitative terms the verb deliver in the BNC written corpus has a frequency of 0.02 per thousand words (ptw) compared to the frequencies in G and T 2014 and 2015 of 0.07 ptw, and there are no instances of deliver on or for in the BNC. In G and T, we find deliver on and deliver for, collocating with promise, promises, pledge, commitment, agenda and mandate. These collocates are all anaphoric nouns which need some sort of unpacking to be understood in propositional terms. That is to say, they are general nouns which summarise a number of details which may or may not be available to the listener/reader when the delivery is mentioned; to be fully understood the noun needs more specific details.

In the AM corpus of British TV political interviews we find a frequency of 0.2 per thousand words and, with one exception, it is found in response moves, predominantly with the subject we or I.
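The per-thousand-words (ptw) figures quoted for deliver are a standard normalisation of raw hits by corpus size, which is what makes corpora of very different sizes comparable. The following is a minimal sketch only, assuming a crude regular-expression tokeniser and a hand-listed set of verb forms; the names `per_thousand_words`, `deliver_profile` and `DELIVER_FORMS` are hypothetical and do not reflect the software used for the study.

```python
import re
from collections import Counter

# Assumed inflected forms of the verb; a lemmatiser would be used in practice.
DELIVER_FORMS = {"deliver", "delivers", "delivered", "delivering"}


def tokenize(text):
    """Very rough tokeniser: lower-cased alphabetic strings only."""
    return re.findall(r"[a-z]+", text.lower())


def per_thousand_words(hits, corpus_size):
    """Normalise a raw frequency to occurrences per thousand words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 1000 * hits / corpus_size


def deliver_profile(text):
    """Report the ptw frequency of deliver and how often on / for follows it."""
    tokens = tokenize(text)
    hits = sum(1 for tok in tokens if tok in DELIVER_FORMS)
    following = Counter(
        nxt
        for tok, nxt in zip(tokens, tokens[1:])
        if tok in DELIVER_FORMS and nxt in {"on", "for"}
    )
    return {
        "tokens": len(tokens),
        "ptw": per_thousand_words(hits, len(tokens)) if tokens else 0.0,
        "deliver_on": following["on"],
        "deliver_for": following["for"],
    }
```

As a worked figure, two hits in a hypothetical corpus of 100,000 words give `per_thousand_words(2, 100_000)`, that is 0.02 ptw. A single sentence will of course yield a very high ptw value, so the function is only meaningful on whole corpora.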
Deliver has a positive evaluative prosody, something one wishes to do, something desirable, although also something one can fail at. What is not clear is exactly what the promises, my changes, the road, the plan, the job, this agenda involve. In these discourses it is primed with a semantic preference for vagueness, as in the extracts below.

Examples of deliver in the AM corpus
        2. but we made some promises in the election and I want to deliver on the promises
        3. But what we have to do is make sure that the European Union reforms to deliver on our agenda
        4. That is the central question at this coming election is, do we stick to the plan, do we stick to the road we’re on, do we complete the job, because that can deliver something better for all of Britain,
        5. This is exactly the reason we are focusing on economic development and why we have put in place the right people and the right systems to deliver on our strategy.
        6. We’ve delivered and want to go on delivering for hard working people who do the right thing
        7. I think if we deliver on this agenda, if we can demonstrate that we are doing things for the north of England because it’s the right thing to do,

The last two examples, (6) and (7), also contain the combination of an asserted and presupposed evaluation, in this case deontic modality of desirability, with vagueness, the right thing, to characterize the actions of the speaker’s party, but as we can see from Examples (8) to (11) below this phrase is also applied to behaviour by the public. It is noticeable that they are all presented inside a quotation, and not owned by the journalist.

the right thing
        8. DAVID CAMERON has refused to rule out a deal with the UK Independence Party if he fails to form a majority at the election, saying only that he would “do the right thing for Britain”. (Telegraph 2014)
        9. Referring to the coalition with the Liberal Democrats, he added: “People know with me that I will always try and do the right thing for Britain.” (Telegraph 2014)
        10. He added: “With the Conservatives, if you work hard and do the right thing we say you should keep more of your own money to spend as you choose.” (Guardian 2014)
        11. George Osborne told BBC Radio 4’s Today programme that low interest rates were required to help prop up the economy, but he acknowledged that this was hurting those “who have done the right thing” by putting money aside for a rainy day. (Guardian 2014)

This phrase is used as a useful shorthand and, like deliver, is not easy to unpick in terms of precise interpretation. The evaluative force is that of approval; it is used to evaluate the speaker’s policy favourably, and this is the pragmatic point of the utterance. It closes down the dialogistic space by asserting, presupposing and foregrounding the evaluation while keeping the actual noun to which the evaluation is applied (thing) vague and less open to rebuttal.
3. Labelling the politics of others

Most political electoral campaigns are organized around stressing the positive in one’s own policies and not mentioning the negative, but frequently enough politicians cannot avoid addressing attacks on their own policies, in which case one common strategy is to attempt to take control of language by the evaluative force of a label used to qualify an opponent’s argument. Discourses on the theme of inequality and the concentration of wealth became frequent after the 2007 economic crisis, and in the run-up to the Scottish referendum in 2014. The coalition government, and all parties in the general election campaign in 2015, needed to address the situation in the UK while rebutting accusations of unfairness or the promotion of inequality in their policies. They were being charged with a neglect of areas beyond London, discrimination against ethnic minorities, and an unequal burden being placed on the more vulnerable parts of society in fiscal policies, and all of these accusations needed to be addressed. One strategy to achieve this purpose of rebuttal was a labelling of the opposition’s accusations with a negative characterization of the motivation: the use of the item grievance (as an uncountable noun or to modify a noun) (0.00319 per thousand words in G and T 2014 and 2015 compared with the BNC written 0.00231 per thousand words). In G and T 2014 and 2015 it was used with reference to the Scottish referendum and Scottish dissatisfaction with the Westminster administration, to Labour policies, and on the theme of Islamic radicalization. In order to counter specific criticisms, statistics or cahiers de doléances without admitting to the details, a vague cover term was needed to dismiss them without having to engage in any way with the arguments, not so much rebuttal as rebranding.
Thus we find the lexical item grievance used as a negative label to characterize a position: coalition of grievance; narrative of grievance; a grievance agenda; a culture of fear and grievance; the false narrative of grievance; the art of gripe and grievance politics; the politics of grievance and subsidy; betrayal, blame, grudge and grievance narratives; a sense of grudge, envy gripe and grievance.

A number of metaphors are employed to accuse opponents of exploiting grievance: fuelling the sense of grievance and division; peddling a grievance agenda; sow grievance; whip up a grievance; stoke division and grievance; feed into a grievance narrative; pouring petrol on your grievance. Choices too are to be made between a coalition with a conscience or a government with a grievance, and promises are made to relieve a condition of permanent grievance; deliver on grievance; move away from grievance; disenchantment and blame; end the politics of grievance and blame.

Those accused of the politics of grievance are variously President Putin, the Palestinians, Muslim extremists, the Welsh, UKIP, the English but also the anti-English, but most frequently the Scots and the SNP.⁴
    4. An interesting feature in the context of both the Scottish referendum and the general elections where the SNP had a significant role was the way P. G. Wodehouse’s comic remark that it is not difficult to distinguish between a Scotsman with a grievance and a ray of sunshine is quoted nine times in G and T 2014 and 2015 when commenting on attitudes in Scotland.
A similar rebranding label which serves to summarize an opponent’s position as negative is the term grandstanding. It is used to characterize opponents or persons who do not have the writer’s approval, without necessarily disendorsing the content of their actions or pronouncements, often a difficult balancing act. It is an ad hominem strategy, often heard in political radio interviews, used in parliamentary discourse (over five hundred examples in recent Hansard archives) and taken up by the press, possibly an item from the lobby lexicon for non-attributed ‘briefing against’, ascribing self-promotional or self-aggrandizing motives to the offender. It is noticeable that there is often a contrast to be made, a positive evaluation of the general gist of an opponent’s discourse but with the label of grandstanding to negatively evaluate the motive.
      1. There is little doubt that the First Minister is tapping into genuine public unease about Britain’s role in this crisis, unease that is shared across the political spectrum in Scotland, as the reaction from all the party leaders proved yesterday. What her “summit” will produce today, however, other than an opportunity for grandstanding and for seeking to prove the moral superiority of her government remains to be seen. (Telegraph 2015)
      2. […] not least through [China’s] cyberwarfare programme. But these matters are better addressed through friendship and co‑operation, not by posturing and political grandstanding (as with Speaker John Bercow in the Commons yesterday). President Xi should rest assured that he is a welcome guest in our country. (Telegraph 2015)
      3. If people can’t afford somewhere to live and work doesn’t pay, our economy is entirely dysfunctional. Tory grandstanding on the need for cuts constantly argues that the welfare bill is too high, while steadfastly refusing to admit, or even examine, why that might be. (Guardian 2015)

So the journalists are not denying that the unease is both genuine and shared, that China’s cyberwarfare policies are not really acceptable, or that the welfare bill is too high; rather, the central thrust of the comment is a dislike of, and a wish to dismiss, the actions of Nicola Sturgeon, John Bercow, or the Tories. Often the evaluation is one attributed to a third party, as we see in the following examples: in the first, Yannis Varoufakis is seen by undefined others as grandstanding; in the second it is attributed to Mr Patterson; and in the third it is Margaret Hodge’s critics who charge her with it, but the term itself is apparently chosen by the journalist since it has not been assigned quotation marks.
      4. Greek isolation ranged against the other 18 countries of the eurozone, largely because of what was seen as know‑it‑all grandstanding by Varoufakis. (Guardian 2015)
      5. And while Labour values the credibility she has given to the party’s campaign, Patterson, who reportedly complained to Downing Street “in the strongest possible terms” about his cabinet colleague’s “grandstanding”. (Guardian 2015)
      6. some senior colleagues have been frustrated by her criticism of accountancy firm PwC, given its role in providing the party with free advice. But perhaps the most frequent charge of the Hodge critics is grandstanding – allowing her penchant for a soundbite to get in the way of useful questioning. (Guardian 2015)

Many of the examples both in Hansard and in the press refer to the parliamentary select committees where politicians, civil servants and CEOs of energy companies or of multinational companies like Google have had to face serious interrogation. In the face of such lèse majesté it would seem that to be able to dismiss this as grandstanding is a useful downplaying strategy.
1. ‘NHS, the envy of the world’: A zombie priming, refusing to die

As has already been observed, many attempted social and political primings attract media attention, and can be contested and sometimes even ridiculed. One intriguing case is the priming that the UK’s National Health Service (the NHS) is the ‘envy of the world’. The data used in this study come from the SiBol corpora of English-language newspapers from 1993 to 2013 and concordances made using Lexis Nexis’s newspaper database.

The NHS started in 1948. The government of the day was Labour but the legislation to set it up was also supported by the Conservative opposition and had in fact been drawn up under the wartime Coalition government. Health officials of both parties occasionally repeat the message that the NHS is the ‘envy of the world’. Firstly, Labour, when in power in 2004–5:
      7. Last Autumn, Sir George Alberti, the [Labour] Government’s “emergency care tsar,” went still further and declared that the NHS had become the “envy of the world.”

to which the Conservative opposition replied:
      8. The shadow health secretary, Andrew Lansley, said: “John Reid [Labour Health Secretary] boasted that the NHS was the ‘envy of the world’, but he should be ashamed of Labour’s record… A recent report from the Bow Group revealed that if we had the average European standard of cancer care, 85 lives a day would be saved. “This shows Labour has no new answers to raising the quality of care in the NHS.” (Guardian 25/01/2005)
        to which the anti‑Labour Daily Telegraph added:
      9. So much for that hoary old myth that the National Health Service is “the envy of the world”. And so much for Tony Blair’s claim that the public services are safe in his hands. (Telegraph 03/01/2005)

However, when the Conservatives returned to office, the Conservative minister Jeremy Hunt made the same claim in a courageous – or foolhardy – letter to the anti-Conservative Mirror newspaper:
      10. AS Health Secretary it’s my job to ensure patients receive the highest standards of care. NHS staff show the best of our national values and I see their dedication every week when I go out on the wards. The service remains the envy of the world more than six and a half decades after it was founded. Just a few months ago the independent Commonwealth Fund found that over the last four years the NHS has become the top-ranked health care system across 11 of the richest countries in the world. (Mirror 28/09/2014)

To which one disbelieving reader replied:
      11. How can the NHS be safe in the Tories’ hands when they are responsible for its demise and will, no doubt in time, privatise it? I’m hopeful that at my age I will not live to see the death of the NHS […] Mr Hunt, I am crying (Mirror 12/10/2015)

As some of the evidence reported below suggests, many Labour politicians and supporters claim the sole credit for the creation and safe-keeping of the NHS and often accuse the Conservatives of plotting to cripple (Mirror 02/04/2015), dismantle (Mirror 28/04/2015) or sell it off (Mirror 10/05/2013).

Of all the UK newspapers examined, the Labour-supporting Mirror and its Scottish sister newspaper, the Daily Record, are alone in using the priming ‘NHS … envy of the world’ without doubt or qualification. In the period 2013–15, it occurs twenty times in the Mirror and in 15 of these it is represented as a fact, and in three of these it is found in editorials, therefore in the newspaper’s ‘own voice’.
      12. THE NHS is the jewel in Britain’s crown. It is […] the best healthcare system in the world and gives everyone in the UK standards of treatment and service that are the envy of other countries. (Mirror 12/09/2014)

In the Daily Record it is found eleven times, ten times represented as a fact, for example, in the headline:
      13. OUR brilliant National Health Service is the envy of the world. (Daily Record 08/11/2014)

Praise for its supposedly exceptional staff is a common accompanying narrative:

      14. And that daily triumph is all down to the drive, commitment and passion of the doctors, nurses, porters and support staff, who all help to make our NHS the envy of the world. (Daily Record 09/02/2013)

Only once is there any expression of doubt, namely that it ‘used to be’ the world’s envy but may no longer be so.

The ‘NHS world’s envy’ forced priming is part of two wider narratives: firstly, that ‘despite its many problems it remains the envy of the world’ and, secondly, ‘but the Conservatives (or in Scotland, the Scottish National Party) are threatening it’; in the following we find both:
      15. For almost 70 years the NHS has cared for the people of Britain based on need – not the cash in your wallet. But the health service that is the envy of the world is under threat as never before. Hospitals are on the critical list. GPs are leaving in droves. Our ageing population puts an ever greater strain on over-stretched resources. Savaged by Tory cuts and reeling from disasters like the Mid Staffordshire Hospital scandal, the NHS is itself in intensive care. (Mirror 28/09/2014)

The scandal referred to unfolded during the previous Labour government, something the Mirror does not dwell on.

In contrast, in the same three-year period, the other popular English tabloid, the right-leaning Sun, uses the expression eleven times but only fully endorses the sentiment once: ‘Our wonderful National Health Service turned 65 last Friday, and is still the envy of the world’ (Sun 09/07/2013). The other mentions express various types of disendorsement:
      16. So is this what our great National Health Service has become? […] The envy of the world is becoming a laughing stock. (Sun 03/05/2013)
      17. Let’s stop pretending it’s the envy of the world and fix it. (Sun 02/08/2013)

During the same recent period, contributors to the right-leaning Daily Telegraph also dismissed the ‘NHS, envy of the world’ priming as outdated and/or naïve. The following is a concordance of the four occurrences of NHS and envy in a span of 7/L, 7/R in 2013:
        1. The NHS is the envy of the world? Then the world must be bonkers.
        2. We are always being told that the NHS is “the envy of the world”, a claim that hasn’t been true for years. But it has come to something when our pets are likely to get better, and certainly more personal, treatment in the evenings
        3. Care has gone missing from parts of our NHS, yet to say so is considered heresy. A culture that tolerates no narrative other than the NHS being the “envy of the world” has allowed those who complain to be cast as ungrateful and vexatious. Whistleblowers are treated so badly that too many look the other way rather than risk career
        4. SIR – I was told that the NHS is the envy of the world. God help the rest of the world. [Name and address supplied].

Endorsement of the ‘NHS world’s envy’ seems, then, to be the preserve of the two left-leaning tabloids, while right-leaning papers mention it to disendorse and distance themselves from it.

In the second and third lines of the above concordance, we find the suggestion that there was a previous era, now gone, in which the NHS truly was the world’s envy. We decided to investigate the claim by reaching back into earlier UK newspaper data. Data from the 2005 SiBol corpus, when the UK was governed by the Labour Party, can be seen in Examples (19) to (21). Of the seven occurrences of NHS and envy (of the world), not one endorses the proposition, although again there is indication of a belief that it once was the case (our italics):
      18. Bower chronicles Brown’s incredulity when he heard that the NHS was no longer the envy of the world. (Times 16/05/2005)

The earliest SiBol corpus dates from 1993, when the UK was under the Conservatives, and it contains the single following sarcastic reference:
      19. TO ANY outsider, the capacity of the British for self-delusion is amazing. ‘The NHS is the envy of the world’: really? Maybe to a yak farmer in Bhutan, but not to anyone I’ve ever spoken to in Western Europe, North America or Australia. (Guardian 03/09/1993)

Going back even further to 1985 and 1986 (times of Conservative government), the years when newspaper texts first appear in the Lexis Nexis database, we find two other mentions of the priming, both as contestation:
      20. At one time our NHS was described, justly, as the envy of the world. The position is now reversed. Waiting time, in Walsall, to see an E.N.T. consultant, is 12 to 16 months. In Holland and West Germany 4 to 5 days. (Guardian, letter, 20/01/1985)
      21. The chairman of the BMA council, Dr John Marks, told the association’s annual representative meeting at Scarborough that the public was demanding explanations, having been ‘bamboozled and mesmerised’ by governments telling them that the NHS was the envy of the world. (Guardian 24/06/1986; also Times 24/06/1986)
This ‘envy of the world’ priming appears to be highly contentious, but it is so politically useful that it refuses to die (or, perhaps, be allowed to pass on). Despite numerous reports which suggest that, among the OECD countries, the NHS delivers very average health outcomes, commensurate with the OECD average investment in it,⁵ the primed message of the world’s NHS-envy is still being forced into the discourse by a section of Labour supporters as part of a wider narrative in which they heroically defend it against the nasty Tory party and business interests. NHS-envy priming is, however, also periodically resurrected by government health ministers of both political sides, the latest episode in November 2015.⁶

The accusation of having let the NHS slip from being the object of the world’s envy is a forced priming in itself and it is a powerful tool of criticism for both parties when in opposition. However, care is required. During the 2015 election campaign the Labour leader urged his party to ‘weaponise’ the NHS, that is, to use it as a political weapon in the election campaign. The suggestion seems to have backfired. The government and parts of the press represented the utterance as political opportunism and cynicism:
      22. Andrew Percy, a Tory member of the Commons health select committee, said: A lot of people are disgusted by Labour’s weaponising of the NHS and their self-righteous belief that they and they alone speak for the NHS and for NHS patients. They don’t. Their own record […] has been abominable. (Mail 25/04/2015)
This is an example of another kind of forced priming in the armoury of political rhetoric, namely, to frequently repeat the utterance of a political opponent and represent it – and therefore the utterer – in a negative light. In fact, of the three main anti-Labour newspapers, the Telegraph refers to Labour’s supposed ‘weaponisation’ of the NHS seven times in 2015, the Daily Mail nine times, and the Sun 13 times.
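The co-occurrence searches used in this section, such as NHS and envy within a span of seven words to the left and seven to the right, can be illustrated with a minimal sketch. It assumes a crude word tokeniser and is not the concordancer used for the study; the function name `cooccurrences` and its parameters are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens (apostrophes and punctuation are dropped)."""
    return re.findall(r"\w+", text)


def cooccurrences(text, node, collocate, span=7):
    """Return the context of every `node` token that has `collocate`
    within `span` tokens to its left or right (case-insensitive)."""
    tokens = tokenize(text)
    lowered = [tok.lower() for tok in tokens]
    node, collocate = node.lower(), collocate.lower()
    hits = []
    for i, tok in enumerate(lowered):
        if tok != node:
            continue
        left = max(0, i - span)
        right = min(len(tokens), i + span + 1)
        # Look at the window on both sides, excluding the node itself.
        window = lowered[left:i] + lowered[i + 1:right]
        if collocate in window:
            hits.append(" ".join(tokens[left:right]))
    return hits
```

Applied to the first Telegraph concordance line, `cooccurrences("The NHS is the envy of the world? Then the world must be bonkers.", "NHS", "envy")` returns a single context window. Each hit still has to be read and classified by hand as endorsing or disendorsing the claim, which is the step that carries the analysis.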
    5. Health-Statistics-2015.pdf
    6. Several newspapers, including the Independent, the Telegraph and the Sun responded to a highly critical OECD assessment of NHS standards: ‘A Department of Health spokesman said […] “The OECD report shows there are many indicators where the NHS continues to be the envy of the world.”’ (Independent 05/11/2015)
1. Transdiscoursive reactions: Resistant readings and reflexive commenting

As we can see from many of the above examples, forced priming consists of forms and formulae and associated evaluations. And we have also seen that not all such evaluations encounter compliant readings.

Our investigations revealed a variety of patterns used by journalists and press commenters to disendorse and resist other parties’ attempts at forced priming. The frequency of repetition of a message is sometimes commented on, whilst the message itself is labelled negatively, perhaps, for example, as slogans (our emphasis):
        1. “For hard‑working people… who play by the rules… our long‑term economic plan”. The Conservative slogans may be flat and narrow, but they are repeated and rammed home again and again. (Telegraph 2014)
        2. On Tuesday, at Treasury Questions in the Commons, George Osborne and his fellow MPs managed to mention the Conservative slogan “Long-Term Economic Plan” a grand total of 17 times. On Thursday, during a speech in London, Ed Miliband succeeded in saying the Labour slogan “Cost-of-Living Crisis” an impressive 20 times. And yesterday, David Cameron used his visit to a JCB plant near Stoke to unveil the Conservatives’ new slogan about the economy. During a 13-minute speech, he said it a mere eight times. (Telegraph 2014)

We should note the use of quotation marks which contain the formulae but not in the context of a coherent speech. Instead, they are random extracts, uncoupled from a coherent discourse and clearly separated from the journalists’ own language by the quotation marks, in much the same way as journalists use scare quotes to show that the language is not of the kind they would wish to accept responsibility for.
        3. George Osborne has a “long term economic plan”, he told us yesterday. It must sit next to his “credible fiscal plan”, which is chock-full of “difficult decisions”. Both were part of this week’s budget for “building a resilient economy” and not “squandering the gains”. And doing that requires ministers to “hold our nerve”. … Every couple of minutes brought forth another word from the No-Turning-Back dictionary: “hard”, “determined”, “secure”. And, over and over, “plan”. (Guardian 2014)
        4. This is, by no means, a phenomenon unique to the British political scene. Whenever I am in Greece, listening to Prime Minister Samaras’s speeches, I often have a sense that they are direct translations of Cameron’s equivalent scripts. “Difficult decisions”, “the mess we inherited”, “we are on the right track, but need more time”, “hard-working people” all feature regularly… One could dismiss it as cynical political positioning, just spin – and perhaps this is how it begins. However, when a particular narrative is repeated often enough, two things happen. First, it becomes dominant, and alternative versions of the truth are suppressed. (Guardian 2014)

We note the quotation marks and scare quotes but also the items denoting forcing (chock full, every couple of minutes, another, over and over, script, feature regularly, repeated often enough). In the last example we see, alongside the use of distancing quotes, another favoured disendorsement label: spin. The repetitions of key formulaic phrases having revealed the organized intention or orchestration, a delegitimizing characterization is used to frame the item; the lines and phrases are repeated in the press (from both sides of the political spectrum) but the reiteration or forcing is rejected as strategic, and in Example (28) heavily criticized by the journalist.

A reporting label which mocks by means of a metaphor (metaphors are, of course, very generally evaluative) asserts that the tactical repetition is not achieving the desired effect. One label which performs this task uses the term mantra:
        5. The drastic cuts to the strength of the regular Army, combined with the Coalition’s mantra that there would be no “boots on the ground”. (Telegraph 2014)
        6. Mr Miliband has now overturned this for a new mantra: what’s best is what sounds good. (Telegraph 2014)
        7. Ms Sturgeon predicted that Labour will fall back on the same “desperate mantra” that voters must support them to expel the Tories next May. (Telegraph 2014)
        8. The idea that immigrants are the cause of the UK’s ills is an old rightwing mantra. It’s simply the politics of fear. (Guardian 2014)
        9. He will repeat Margaret Thatcher’s mantra that there is no such thing as government money – only taxpayers’ money. (Guardian 2014)
        Again the term is accompanied by references to forcing (same desperate, fall back on, old, repeat). A mantra is a word or formula deemed to have peculiar powers to achieve desired effects by means of much repetition. And if we have a mantra, can the composer be other than a guru?
        10. Axelrod will also be crucial in dealing with a Tory “fear and smear” campaign that the party is expecting from the camp of David Cameron and his Australian election guru, Lynton Crosby (Guardian 2014)
        In these excerpts, the phrasing is repeated by the journalist but it is re‑framed as having been consciously manufactured, with the implication that it has lost its presumably intended original priming effect in much the same way as an original or poetic figure of speech can become a cliché. The reader is primed with an evaluation – generally negative – by means of a reporting signal. And by foregrounding an awareness of the strategic calculation behind them, the press reports the formulae while at the same time signalling its own immunity from the perlocutionary intent.
      2. Uptake of a forced priming with reversal of evaluation
      Hoey’s lexical priming theory maintains that whenever a speaker encounters a word, he or she makes a mental note of the meanings with which it is associated, what style it tends to occur in and, in terms of evaluation and pragmatics, whether it is used to praise or blame.
      One development in the presentation of policy to the public in modern politics has been the proliferation of increasingly promotional communications in an attempt to prime the electorate with a positive view of plans, events, ideas, policies, often on official, even government websites. An interesting example is the lexical item flagship. A search on the website gives 645 occurrences from 2010 to the present. Flagship programmes, plans, schemes etc. are presented in a relentlessly confident way by the government with positive evaluations in the surrounding co‑text:
      1. The Civil Service Fast Stream team runs the Government’s flagship Fast Stream graduate programme. We recruit the brightest and best graduates into the Civil Service and give them the skills, knowledge and experience they need to become effective and inspiring leaders.
      2. Academy has ‘new sense of purpose’ after receiving new building through government’s flagship £4.4 billion rebuilding programme.
      3. On 5 December DWP announced plans for the next stage of implementing Universal Credit. These steps continue our progressive approach to test, learn, and implement as we deliver this flagship programme, and we will apply that same approach to the development of the Local Support Services Framework.
      4. Help to Buy, the government’s flagship housing scheme, has helped almost 100,000 people buy a new home since it was introduced.
      The term is used metaphorically, to indicate the best or most important product, idea, building, etc. that an organization owns or produces. The reader is encouraged by the co‑text to think of these flagship schemes as inherently successful. This metaphorical use of the term has been taken up by political discourse, very probably borrowed from business discourse as part of a promotional strategy, chosen to emphasise the importance given and the faith placed in the policy.7 In our G and T corpus we find it is used mostly to refer to business and commercial entities (this first usage is corroborated by dictionary examples and definitions)8 or the arts, but also to political policies.
      The most frequent R1 collocate for flagship in G and T 2014 is store, while the second is policy; we also get programme and scheme and a range of policy areas, such as education, housing and welfare reform. While the business, commercial or artistic initiatives are reported with the same positive evaluations of their promoter (possibly following press release material), the press representations of flagship in regard to government policy, on the other hand, show how it is used in a very different pragmatic environment. A flagship policy is usually framed as controversial and problematic, and negative evaluations are frequently found in the co‑text. Indeed the main point is that of making a negative evaluation, presenting an account of failure or difficulty and controversy:
      5. David Cameron’s flagship policy is not the panacea for the deep problems in our education system. (Guardian 2013)
      6. The figures represented an improvement on results last November but the slow rates of progress will lead to fresh criticism of the flagship government policy. (Telegraph 2013)
      7. The politics of aspiration have been too quickly swept away in recent years with the flagship inheritance tax pledge abandoned almost as an afterthought. Child benefit was ditched on breakfast television, to be replaced by another complicated means‑tested handout. (Guardian 2015)
      8. This was a flagship policy designed to reconnect the public and the police. Yet after spending £75m, nearly 90% of Britons have no idea who their elected police and crime commissioner even is. November’s bungled poll failed both candidates and voters. (Guardian 2013)
      In these examples, the journalists criticize the scheme or policy directly. In other examples, the evaluations come from others and are contained in reports, comments or vague attributions:
    7. We see from Google n-grams that the term flagship has been on the rise since the late 1980s, reaching 0.000140% in 2010. Our G and T corpus shows a much higher occurrence (0.00123%).
    8. This machine is the flagship in our new range of computers. The company’s flagship store is in New York. (Cambridge online dictionary) The Google dictionary definition includes the political realm: “this bill is the flagship of the government’s legislative programme”.
      1. This included a flagship policy forcing nurses to spend a year washing and feeding patients as a health care assistant before qualifying, to improve compassion on the wards. The RCN has described the plan as “stupid” and “plucked out of thin air”. (Telegraph 2014)
      2. The OECD also joined the growing chorus of voices to raise doubts about Mr Osborne’s flagship Help to Buy subsidised mortgage scheme, warning that five‑year interest‑free loans for 20pc of a home’s value and state guarantees on £130bn of mortgage debt may simply create another bubble. (Telegraph 2014)
      3. The announcement of the scheme comes amid growing anger at the Conservatives’ failure to honour their flagship manifesto pledge to introduce tax breaks for married couples. (Telegraph 2014)
      As regards this term at least, the administration’s forced priming via its pronouncements and announcements meets with mixed success in the wider media discourse. The phrase is indeed being disseminated by the press, uptake has taken place, but the line taken (the evaluation of success or failure) is diametrically opposed to the one the composers hoped for.
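The R1 collocate counts reported above for flagship (the word immediately to the right of the node) can be illustrated with a minimal sketch. It assumes plain‑text articles and a deliberately crude tokenizer; the function name `r1_collocates` and the sample sentences are hypothetical and are not drawn from the G and T corpus, nor is this the procedure of any particular concordancer.

```python
import re
from collections import Counter

def r1_collocates(texts, node):
    """Count the word immediately to the right (R1) of each occurrence of `node`."""
    counts = Counter()
    for text in texts:
        # Crude lower-cased tokenization; a real study would use the corpus tool's own.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if left == node:
                counts[right] += 1
    return counts

# Toy sentences for illustration only.
sample = [
    "The company's flagship store is in New York.",
    "Critics attacked the flagship policy on welfare.",
    "A new flagship store opened beside the flagship scheme office.",
]
counts = r1_collocates(sample, "flagship")
print(counts.most_common())
```

Reading the concordance lines behind each collocate remains necessary, since a count of this kind says nothing about whether the surrounding evaluation is positive or negative.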
  4. Conclusions
    The concept of priming is clearly understood by the press and the term itself is indeed referenced directly:
    1. Ahead of a speech in Cardiff, Osborne popped up on Radio 4’s Today programme to prime us all, but only succeeded in making listeners wonder why he kept stressing that the Bank of England is “independent” – a dozen times, I’d say. Has he quietly nationalized the central bank as Clement Attlee once did in a very different world and Venezuela’s faltering Corbynistas did this week? (Guardian 2016)
    People often recognize and can repeat advertising slogans after being exposed to a campaign and the same is true of political slogans. In those cases the forced priming is clear and the fact that such campaigns continue to be paid for is a signal that they are deemed to be successful in their effects as persuaders.9 In the everyday
      9. Sam Delaney, in his book Mad Men and Bad Men: What Happened When British Politics Met Advertising, suggests that it is the discipline in everyday communication that advertising brings to press and communication strategies which tends to prove successful. He also claims that it is above all negative campaigning which works.
        discourse of political communication it is not always so easily discerned but corpus analytical techniques can point us to the attempts.
    In terms of the White House’s attempts to set the linguistic agenda, the evidence of their efficacy was inconclusive.
    In the UK press, we saw a mixture of reactions to ‘message discipline’ on the part of the media. These included a degree of uptake and re‑broadcast, if the media outlet agreed with the priming evaluation (as with the Mirror on the NHS in Section 3.4), but we also noted many instances of distancing from the forced priming, occasionally accompanied by sarcasm and criticism. In the case of flagship we also saw that the intended positive evaluation tended to be rejected, ridiculed and turned back against the would‑be primers.
    In the case of the television interview examples, we can see how prefabricated phrases are inserted into interactive speech events, suggesting that the line to take and the phrases to use have been part of a carefully prepared language strategy which, after a creative process leading to their formulation, can be recycled for many occasions and reiterated by many voices. These items include self‑praise or blaming the other with a build‑up of evaluative cohesion.
    Though we have not used such data in the present study, it is clear that further amplification can take place through social media; the concept of going viral on Facebook, twitterstorms, and the hashtag10 affordances of technology mean that phrases can rebound and resound with ever increasing frequencies, in a cycle of what we earlier called transdiscoursive intertextuality. Even without the amplification of social media it is clear from many of our examples that the reporting of forced primings in political discourse will now often include comment on the reiterations, as if to assert that the media is not to be taken in by the strategy. The press, though, by using the phrases, have already shown themselves to have been primed. Press and public may be more aware of the strategy, they may resist the evaluations, but they do take up the phrases, even if only as part of their receptive repertoire, recognizing the colligational, collocational, textual and pragmatic associations that accompany the phrases. When they do not do so it may be noted with disappointment, as we see from the following item found after the Brexit referendum:
    2. A senior campaign source said: “Downing Street told us: ‘We won with a risk message in the Scottish referendum in 2014 and 2015, and we could do the same in 2016.’ […] Those messages are fine if they are going to be echoed every day in the rightwing press, so creating an echo chamber that the broadcasters have to follow. But the press was never on our side.” (Guardian 25 06 2016)
      10. George Osborne used the hashtag #hardworkingpeople to tweet his justifications for his policies.
        In this chapter we have inferred, from the exceptional frequency of the occurrences of specific phrases with associated evaluations of self‑praise or other‑blame, processes of forced priming, that is, the deliberate flooding of pre‑composed messages into the sphere of public political discourse. We are hardly the first to notice this forced flooding; it is part of the campaign manager’s job and indeed the media itself refers to politicians’ ‘message discipline’ and ‘singing from the same hymn‑sheet’. Here we have reinterpreted the phenomenon through the lens of Hoey’s lexical priming theory.
    What is novel about the work outlined above is the way in which, using corpus techniques, we were able to follow up transdiscoursively on the forced priming attempts, that is, to garner evidence on their effect and on their success or lack of it, at least in terms of how the media reacts, of how the message originating in the source political discourse fares in one of the target discourses, the mainstream media.
    One final question remains. How much does media reaction matter? As was pointed out in the Introduction, the real targets of these forcing attempts are not the gentlepersons of the press, but the members of the voting public. How do they respond to forced priming? In the use of slogans, the important thing is that they hear them over and over. The results of the EU referendum suggest they do have some effect. The Vote Leave campaign slogan was “Take back control”.
    3. One of the anti‑EU Cabinet ministers, House of Commons Leader Chris Grayling, said: “After we Vote Leave the public need to see that there is immediate action to take back control from the EU.” (The Sun 15th June 2016)
    4. Nigel Farage, the Ukip leader and Leave.EU supporter, did something similar during the EU referendum campaign, repeating again and again the mantra “take back control” (Guardian 01 July 2016)
    5. Michael Gove: “But if we vote to leave, we take back control… We can take back control of our borders. It’s time to take back control.” (Guardian 19 04 2016)
    6. In a rallying cry ahead of Thursday’s referendum, [Boris Johnson] said the electorate had a once‑in‑lifetime opportunity to “take back control of this great country’s destiny”. (Express June 20 2016)
    These transdiscoursive and intertextual reiterations seem to have served their purpose and were deemed successful in tapping into emotions connected to powerlessness in that part of the voting public who have little control over their lives and little to lose. The phrase exemplifies the features of forced priming in being vague and at the same time positively evaluative. The Remain campaign had no similar competing phrase and their discourse of risk, as we saw, did not result in take‑up and re‑amplification.
    There is some evidence, in the case of the ‘NHS world’s envy’ narrative, that readers are protective of their previously acquired primings, including and especially, evaluative primings. Indeed, in general, we probably do not even notice the forcing of priming and its evaluation when it concords with our own personal worldview. Of course budget reductions are austerity and cuts and something wicked for the ardent socialist, and of course the same policies relabeled as fiscal responsibility and savings are simply prudence for the well‑heeled conservative. But when primings are forced on us that are novel, or rub against the grain of our worldview, then we sit up and notice the language.
    This is entirely analogous to our reactions to other forms of priming. A good example is provided by collocations and colligations. Those we are already used to pass us by unnoticed, but the unusual, whether accidental or intentional, bring us up short. For example, deliberately generally appears in negatively evaluated environments and so we note the novel and comic effect when a Wodehouse character confesses ‘my astonishment that anyone could deliberately love this girl’ (our emphasis).11
    Finally, how has an individual’s worldview itself come to be as it is? Is it itself simply the sum of the previous primings we have experienced and internalized? Any answer is beyond the scope of the present research but such a process seems too deterministic a vision of the way mind and primings interact. A synergic use of collaborative methods involving both corpus analysis and other forms of research such as focus groups or surveys can be envisaged. Priming is a form of nurture, but nature, that is, natural predispositions and differences in receptiveness, surely also has its role to play.
    References
    Baker, P. 2005. Using Corpora in Discourse Analysis. London: Continuum.
    Duguid, A. 2009. Insistent voices, government messages. In Morley & Bayley (eds), 234–260.
    Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630
    Morley, J. & Bayley, P. (eds). 2009. Corpus-Assisted Discourse Studies on the Iraq Conflict: Wording the War. London: Routledge.
    Partington, A. 2003. The Linguistics of Political Argument. London: Routledge. doi: 10.4324/9780203218259
    Partington, A. & Taylor, C. 2010. Persuasion in Politics. Milano: LED.
    Partington, A., Duguid, A. & Taylor, C. 2013. Patterns and Meanings in Discourse [Studies in Corpus Linguistics 55]. Amsterdam: John Benjamins. doi: 10.1075/scl.55
    11. P.G. Wodehouse, The Code of the Woosters.

    Popper, K. 2002 [1934/1959]. The Logic of Scientific Discovery. London: Routledge (Originally published as Logik der Forschung, Vienna 1934, first English translation Hutchinson and Co in 1959).
    Riccio, G. 2009. White House press briefings as a message to the world. In Morley & Bayley (eds), 108–140.
    Short, C. 2004. An Honourable Deception? New Labour, Iraq, and the Misuse of Power. London: The Free Press.
    Can lexical priming be detected in conversation turn‑taking strategies?
    Michael Pace‑Sigge
    University of Eastern Finland
    Turn‑taking strategies in spoken communication have been widely researched and discussed in recent literature (see, e.g. McCarthy 1998; Myers 2009; Archer et al. 2012). Moving on from the non prosodic, non‑lexical pointers (cf. Yngve 1970; Duncan Jr. 1972), corpus‑based research has focussed on lexical items (see McCarthy 1998; Tao 2003; Myers 2009; Evison 2012). Following the tenets of psychological priming that form the basis of Hoey’s lexical priming theory (2005), some kind of trigger item should be in evidence, showing a listener that a turn is given up. Consequently, recognisable turn‑final and turn‑initial lexical items – as well as evidence of speaker alignment – should be detectable. This chapter will describe (a) some salient signals that become apparent when monologues are directly compared with dialogues; and (b) highly frequent (sets of) words found in conversational exchanges. On this basis, an argument will be made that language users appear to be primed in their turn‑taking word choices to follow a structured, recognisable pattern, thus facilitating fluency in their conversation.
    1. Introduction
      A widely discussed topic in various branches of linguistics is the issue of turn‑taking. By this, we refer to the transaction in a conversation where the first speaker indicates that they have finished their part of the conversation and are opening the floor for another speaker. Conversely, it means that a conversation partner realises they can start speaking without interrupting or appearing to be impolite. There are many prosodic, non‑lexical indicators used: the phonetic pointers employed by English speakers (cf. Yngve 1970; Duncan Jr. 1972; Knowles 1987; Gardner 2001). Furthermore, the issue has been addressed with reference to pragmatics (see, amongst others, McCarthy 1998; Myers 2009; Archer et al. 2012). While anecdotal evidence, careful listening in fact, might hint at what constitutes a marker within conversation, a more empirical approach can be found where corpus‑based evidence is used as a point of reference, as in Tao (2003), Carter & McCarthy (2006), O’Keeffe et al. (2007), and others.
      doi 10.1075/scl.79.04pac © 2017 John Benjamins Publishing Company
      This chapter examines how far the concept of textual colligation (cf. Hoey 2005: 13) can be applied to the production of casually spoken material. It must be noted that textual colligation has so far only been demonstrated for written texts, though it is predicted to apply to turn‑taking (see McCarthy 2010 or Evison 2012, for details). Hoey links the subconscious process of textual colligation to the phenomenon known as Lexical Priming – where speakers and listeners are primed through repeated exposure to a word or set of words (cf. Hoey 2005; Pace‑Sigge 2013): primes signal what can be expected next.
      When looking at evidence of signals of turn‑endings or turn‑starts, however, the nature of spoken language presents a number of clear difficulties (cf. Halliday 2004). The issue becomes difficult when one tries to find clear lexical signals of either turn‑givers or turn‑takers in a transcript. While there are a large number of turns between speakers in any conversation, there is no direct correlation between a clearly defined set of words and the length of turns. This is partly due to the mental processing power of any speaker and the online nature of conversations (cf. Cheng 2012: 13f.) and partly due to the nature of free‑flowing, casual dialogue between (fairly) equal parties, which follows no fixed norm (unlike formal dialogues or conversations between unequal parties – for example doctor‑patient, teacher‑student).
      Yet, following the tenets of psychological priming that form the basis of Hoey’s lexical priming theory, there should be a trigger item that shows a listener that a turn is being given up. Likewise, the first speaker should receive a notification whether the listener is prepared to either forgo or take the turn. When we look at the lexis employed, informal speech appears, at first sight, too anarchic to employ a formulaic, standardised protocol. In order to facilitate smooth and fluent conversation, turn‑taking must, however, follow a structured recognisable pattern.
    2. Theoretical background
      The literature on turn‑taking appears to fall into one of two categories. The first looks at the pragmatic aspects of language, with reference to the seminal works by Yngve; Sacks, Schegloff & Jefferson; Duncan Jr in 1972, 1974 and 1981, or Tannen (1989). This pragmatic approach also includes, to a certain degree, the corpus‑based research by McCarthy (1998), Carter (2004), Carter & McCarthy (2006). The second category looks at the way speech participants are primed to indicate that they are giving up a turn as speakers, to identify such signals as auditors and to indicate that they want to take or not take a turn. These forms of alignment in speech, where language‑producing agents (both human and humanoid) look at fixed patterns, are an area researched mainly by those interested in artificial intelligence. That there is a link between how humans are primed and how a “language comprehender” (i.e. a machine) is taught to disambiguate utterances has been shown by Pace‑Sigge (2013). This concept has been discussed, with reference to Natural Language Processing (NLP), by Thórisson (2002), Mehler et al. (2010) and Stent (2011) amongst others.
      1. The pragmatics of turn taking
      Starkey Duncan Jr’s research (1972) describes a list of six “turn‑yielding signals”:
        1. Intonation – the use of pitch‑pattern
        2. Paralanguage – drawl on the final or stressed syllable of a terminal clause
        3. Body motion – in particular hand gestures
        4. Sociocentric sequences – “stereotyped expressions, typically following a substantive statement”. Examples are “but uh”, “or something” or “you know”.
        5. Paralanguage – a drop in pitch / loudness – often “in conjunction with sociocentric sequences”
        6. Syntax – the completion of a grammatical clause. (Duncan 1972: 286f.)
        All of these are based on the carefully recorded conversations (“coordinated action of two individuals”) of US English speakers. It must be noted that the lexical element is recorded only as the fourth most prominent of six signals (see also Knowles 1987). This already highlights that words or sets of words can be used to investigate turn‑taking action but they are bound to give only a partial insight into the issue. Tannen (1989), while not using corpora, looks at empirical evidence and refers to phrases that are found to be repeated by speaker one as “shadowing” – something that can, otherwise, be described as “alignment”.
      2. Hoey’s lexical priming
      Hoey, explaining his 2005 theory in Hoey (2014), starts with what he perceives as an error in a lot of linguistic research: “… much of the [20th] century discoveries about vocabulary were presented in essentially granular, list‑like fashion”. Yet, in neo‑Firthian tradition, no word is without companion, nor is the position a word characteristically occupies random:
      As a word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co‑texts in which it is encountered, and our knowledge of it includes the fact that it co‑occurs with certain other words in certain kinds of context. The same applies to word sequences built out of these words; these too become loaded… (Hoey 2005: 8; my highlights)
        While there is no evidence that this “knowledge” is conscious, the speaker will have acquired an idea of how word‑usage is patterned according to the text type employed. Consequently, “every lexical item (or combination of lexical items) is capable of being primed (positively or negatively) to occur at the beginning or end of an independently recognised ‘chunk’ of text” (Hoey 2005: 129). This is explained as follows:
        …there is every reason to suppose that similar claims can be made about the beginning and end of speech turns, conversations and the like. Michael McCarthy (personal communication) … notes that the is primed negatively to occur at the beginning of speech turns. Conversely, in 51 examples of I know drawn from a small corpus of casual conversation, 26 [examples] are either the first words of a turn or within one or two words of the beginning of a turn (e. g. Yeah I know, no no I know), suggesting that I know is typically primed positively for turn beginnings; none are conversation‑initial. (Hoey 2005: 150f.)
        These examples focus on turn‑initial usage. While I shall not look at conversation starters here, both turn‑final and turn‑initial wording will be investigated.
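The kind of count Hoey reports for I know (how many occurrences fall at, or within one or two words of, the beginning of a turn) can be sketched as follows. This is only an illustration: it assumes the turns have already been separated into strings, the function names are hypothetical, and the toy turns are invented apart from the two variants Hoey himself cites.

```python
import re

def tokenize(text):
    # Crude lower-cased word tokenizer, sufficient for a toy example.
    return re.findall(r"[a-z']+", text.lower())

def turn_initial_share(turns, phrase, window=2):
    """Return (near_start, total): occurrences of `phrase` beginning within
    `window` words of the start of a turn, and all occurrences."""
    target = tokenize(phrase)
    n = len(target)
    near_start = total = 0
    for turn in turns:
        tokens = tokenize(turn)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                total += 1
                if i <= window:  # position 0, 1 or 2 counts as near the turn start
                    near_start += 1
    return near_start, total

turns = [
    "Yeah I know",
    "no no I know",
    "well it was late and I know he was tired",
    "the bus was late",
]
near, total = turn_initial_share(turns, "I know")
print(f"{near} of {total} occurrences are at or near the turn start")
```

The `window` parameter makes the "within one or two words" criterion explicit, so that a stricter reading (first word only, `window=0`) can be compared against the looser one.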
    3. Data and corpus analysis

    Using an approach linked to lexical priming, there are three issues that are of relevance when looking at turn‑taking and the occurrence of words that act as signals for a particular follow‑up (either another word or a change of speaker). These signals would be textual colligation primes.
    1. Are speakers primed to associate such signals with turn‑relinquishment and often replicate this association in their own verbal behaviour? In other words, is there evidence that they indicate that they are about to give up a turn?
    2. Are there turn‑initial words or phrases that users seem to be primed to employ?
    3. Is there an alignment in wording and word‑constructions between the initial speaker’s utterance and the responses by other speakers?
    Two main strands of inquiry are being followed in order to answer these research questions:
      1. re‑evaluation of the literature data
      2. investigation of spoken conversation corpora material.
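      As a rough illustration of how the first two questions might be approached computationally, the sketch below splits a transcript into turns and counts which words open and close them. It assumes, purely for the example, that each turn starts on a new line with a speaker label followed by a colon; the pattern, function names and toy transcript are hypothetical and do not reproduce the mark‑up of any of the corpora used here.

```python
import re
from collections import Counter

# Assumed layout: "A: text of the turn"; unlabelled lines continue the previous turn.
TURN_MARKER = re.compile(r"^\s*(\w+):\s*(.*)$")

def split_turns(transcript):
    """Return a list of (speaker, text) pairs, one per turn."""
    turns = []
    for line in transcript.splitlines():
        match = TURN_MARKER.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns and line.strip():
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns

def edge_words(turns):
    """Count the first and the last word of every non-empty turn."""
    first, last = Counter(), Counter()
    for _, text in turns:
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:
            first[tokens[0]] += 1
            last[tokens[-1]] += 1
    return first, last

transcript = """A: so we went there you know
B: yeah
A: and it was shut
B: oh right
so what did you do"""

turns = split_turns(transcript)
first, last = edge_words(turns)
print(first.most_common(), last.most_common())
```

      Single words are only a starting point: turn‑final signals such as you know span two words, so in practice the same counts would be repeated for the first and last two or three tokens of each turn.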
      1. Previous investigations in the light of the lexical priming theory
        In this section I want to focus on McCarthy (2002, 2003, 2010), Tao (2003) and Evison (2012); the latter two make clear back‑reference to McCarthy’s research, and Evison bases her investigation on the work of her two predecessors. As we have seen above, McCarthy and Hoey seem to have discussed the issue in the light of Hoey’s LP theory. In fact, Evison does mention Hoey (2005) but does not pursue this further. All three focus on turn‑initial items, looking at American English (Tao), American and British English (McCarthy), spoken fluency (McCarthy), and British academic discourse (Evison). All of them agree that the term right appears in over half of its uses as turn‑initial; the backchannelling items (cf. Yngve 1970) oh and mhm also occur to a high degree as turn‑initial. Both Tao and Evison highlight that the is only employed in circa 2–3 per cent of all of its uses as turn‑initial. A re‑interpretation of the data would class mhm, oh, right, and yeah as being typically primed positively for turn beginnings; by contrast, the is “primed negatively to occur at the beginning of speech turns” (Hoey 2005: 151).
        Tao (2003) notes that there appears to be an order of precedence (when looking at their frequency of occurrence) in his data. This is given as OH > YEAH > AND/SO. Indeed, he describes these forms as reflecting functional categories in Table 1:
        Table 1. Functional categories of some common turn initiators (based on Tao 2003: 196)
        Category        Turn initiators
        Tying           OH, WELL, BUT, AND
        Assessing       YEAH, NO, RIGHT
        Explaining      SO
        Acknowledging   MHM, UH‑HUH, OKAY
        Evison (2012) also shows that (in academic spoken material) and is more frequent at the beginning of a turn than but; and is also more frequent than so or because. Furthermore, she shows that the most frequent items tend to cluster, forming bigrams like yeah and; mhm and, right and or yeah but. These bigrams, too, could be interpreted as positive usage primings.
        McCarthy, comparing British and US‑American data, lists the most frequent (evaluative) responses. Below, the four most frequent reactive responses are compared in Table 2:
        Table 2. Occurrence of relevant tokens as single‑word responses (based on McCarthy 2002: 59f.)
        ITEM: British data   % of total   ITEM: American   % of total
        RIGHT                77           WOW              98
        GOSH                 76           GOSH             81
        TRUE                 70           ABSOLUTELY       67
        WOW                  69           EXACTLY          60
        Furthermore, right occurs only half as often (38%) in the US data, while we can see that wow is used substantially less frequently amongst British speakers. Interestingly, it is the item quite which is the least likely token to be found as a response in either set. These findings, in the light of the lexical priming theory, can be seen as the specific, strong, positive primings of two different speech communities. While the words are close in content, the actual lexical choice is specific: it can be claimed that British speakers are more strongly primed to respond with the word right than US speakers. While gosh is the second‑most frequent response, it is almost as likely to be heard as right from a British speaker. This is not the case, however, for American speakers.
      2. Corpora and method
      The material used for this investigation is exclusively British English spoken data. It comprises the section of the SCO Corpus (cf. Pace‑Sigge 2013) which has clearly identifiable turns, the Lancaster SWAT1 corpus (2003) and the 2009 “Linguistic Innovators Corpus” (LIC)2 of young and older speakers in both Hackney and Havering (i.e. North London and a part of ‘Greater London’). This third sub‑corpus presents the largest part of the data and the largest number of turns identified (see Table 3).
      These corpora have been chosen because turns are clearly identifiable: they are preceded by a speaker x: marker.
      Table 3. Corpora used

        This means that the three sub‑corpora which are combined to make the Turn‑Taking Corpus (TTC) contain a total of 212 files and have 1,643,131 tokens altogether.
        One step undertaken in order to identify words that are specific to conversations and turn‑taking was to run a keyword analysis (using Wordsmith 6, Scott 2015)
        1. The Lancaster Speech, Writing and Thought Presentation Spoken Corpus (SWAT), comprising data taken from the British National Corpus and from archives held at the Centre for North West Regional Studies (CNWRS), Lancaster University, UK. Details at fass/projects/stwp/handbook.htm
2. Linguistic Innovators Corpus (LIC). Details at: galsig/costas.pdf (Gabrielatos, C., Torgersen, E., Hoffmann, S. & Fox, S. 2010). Material kindly provided by Paul Kerswill.
between TTC and a corpus of monologues: single-speaker public speeches (SSM). The reasoning behind this was that the SSM is not only more formal but has a complete absence of lexical turn-taking markers: technically, they are monologues. The corpora used for this investigation are described in detail in Pace-Sigge (2015).

The initial method to look for turn-specific items was different from the approaches described in 3.1. First of all, a keyword analysis was employed to identify all those words which are specific to casual spoken conversation, in contrast to those found in single-speaker speeches. Items that are idiosyncratic (like personal names, places or actions) were filtered out. The remainder tended to be discourse and speech markers. Secondly, looking at turns in concordance lines, it was investigated whether there are salient words or structures which appear to be either signals to pass the turn to the next speaker or align the two speakers or are positively primed to be turn-initial.
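The keyword step can be illustrated with a minimal sketch. The keyness statistic and settings used in WordSmith are not stated here, so the log-likelihood measure below is an assumption, as are the function name and the SSM corpus size (estimated from the normalisation factor of 1.4 given in the note to Table 5). The counts for oh are those reported in Table 5.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word in a study corpus vs a reference corpus."""
    total = freq_study + freq_ref
    combined = size_study + size_ref
    # Expected frequencies if the word were equally likely in both corpora
    expected_study = size_study * total / combined
    expected_ref = size_ref * total / combined
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

TTC_TOKENS = 1_643_131                   # size given in the text
SSM_TOKENS = round(TTC_TOKENS / 1.4)     # assumption: derived from the 1.4 factor

# "oh": 8543 occurrences in TTC, 68 in SSM (Table 5)
oh_keyness = log_likelihood(8543, TTC_TOKENS, 68, SSM_TOKENS)
print(f"keyness of 'oh': {oh_keyness:.1f}")
```

A score of zero means the word is spread across the two corpora exactly as their sizes predict; the larger the score, the less plausible it is that the difference is due to chance.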
    4. Comparing monologues with dialogues
1. Keywords

An initial step is to look at keywords that distinguish spoken dialogue (conversations) from monologues (speeches). Apart from the raw data, the findings presented in 3.1 were used to assist the selection process. As a result, proper nouns without determiners or other premodifications (including “father”, “sister”), verbs that cannot be clause initial (for example: “used”), and place names have been ignored. As many spoken items are not dictionary items as such, different spellings for similar utterances (yeah / yeh; erm / um) have been counted together.

Table 4. Twelve most frequent conversational spoken English words in TTC
Key word             N       %
I                    52843   3.16
YOU                  42833   2.56
YEAH / YEA / YEH     27835   1.85
LIKE                 27128   1.62
IT                   25057   1.50
WAS                  22849   1.37
THEY                 16885   1.01
SO                   15148   0.91
ER / ERM / UM        15058   0.89
MY                   14103   0.84
JUST                 13033   0.78
DO                   12501   0.75
Table 4 shows a clear preference for referring to either of the two speakers (I, you, my) or outside parties or entities (it, they). There are hesitation markers (er / erm), and discourse particles (yeah, like). This, however, would only provide a tentative link to signals that indicate turn-taking as all of these items also appear in the single-speaker presentations, albeit with a significantly lower frequency. A further step, retaining the key-word analysis approach, would be to look at those items which are highly frequent yet occur extremely infrequently in the speeches data.

Table 5. Top 12 words proportionally most frequent in TTC3
Key word             N: TTC   N: SSM   TTC/SSM
MM                   10822    4        1967.6
YEAH / YEA / YEH     27835    21       1015.9
ER / ERM / UM        15058    31       347.0
ALRIGHT              1811     5        258.7
AH                   1559     6        185.6
PAUSE                1925     14       98.2
OH                   8543     68       89.7
SHE’S                2225     23       69.1
LAUGH                1485     23       46.1
HE’S                 3913     82       34.1
DON’T                9086     212      30.6
DIDN’T               3738     97       27.7
Table 5 compares those items which occur with a very high frequency in the casual conversation corpus (TTC) with their occurrence figures in monologues (SSM). The figures given are strictly proportional. All are highly key, that is, far more frequent in TTC than in SSM than chance would allow. The necessary caveat here is that the single-speaker speeches may have been written or transcribed in a way that leaves out pauses, contractions and discourse particles – thus making the occurrence of such items appear much lower.4 It must be noted, furthermore, that while “laugh” in TTC refers to the non-verbal action, in SSM it occurs only as part of the spoken text, as in this example “something that makes us laugh, and often
3. Given the difference in size of the corpora, SSM figures have been normalised: N times 1.4
4. Another issue is the interpretation of non-verbal sounds. The TTC has hesitations transcribed as “um”, “er” and “erm”. Yet “erm” in the SSM data refers to the European Exchange Rate Mechanism and therefore does not equate with the “erm” found in TTC at all.

those things…”. The table also shows a large number of discourse particles that are characteristic of spoken English (cf. Knowles 1987; Cheng 2012): contractions (don’t), agreement (yeah, alright), hesitation markers (er / erm / um), backchannelling (mm). There are finally two items which are usually classed as interjections: oh and ah – these are meant to draw attention to what the listener wants to say. Hesitation markers and backchannels are not dictionary words and are therefore subject to variation according to the system the individual transcriber adheres to. It can also be seen that there is a clear preference for using personal pronouns in conversations compared to monologues. Amongst the 40 items seen as “overused” (key), eleven are personal pronouns. These are frequently followed by a verb (she’s, you’re). Less frequent, though, are the possessive personal pronouns my, me, his, her. Amongst the 15 most overused items, four are negations (didn’t, don’t, can’t, wouldn’t).5

Looking at Hoey’s (2005) theory with reference to domain and genre specific priming, we can detect strong evidence that the textual collocations for the TTC key words are positively primed for spoken conversations, yet negatively primed (and therefore underused or not occurring at all) for monologues.

We have seen that, apart from the contractions, the various studies discussed in 3.1 have shown such items to be most frequently turn-initial. To test this, a selection of these items needs to be looked at to see to what degree they are actually employed in such a position within an utterance. To do so, the investigation has to move from a keyword analysis to a concordance analysis.
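The proportional figures in Table 5 can be read as the TTC count divided by the normalised SSM count. The sketch below assumes this reading of the note to Table 5 (SSM counts multiplied by 1.4); the function name is illustrative only. Under this assumption the printed values for er / erm / um, alright and oh are reproduced.

```python
SSM_SCALING = 1.4  # normalisation factor given in the note to Table 5

def ttc_ssm_ratio(n_ttc, n_ssm, scaling=SSM_SCALING):
    """TTC frequency relative to the size-normalised SSM frequency (assumed reading)."""
    return n_ttc / (n_ssm * scaling)

# (N: TTC, N: SSM) as printed in Table 5
rows = {
    "ER / ERM / UM": (15058, 31),
    "ALRIGHT": (1811, 5),
    "OH": (8543, 68),
}

for word, (n_ttc, n_ssm) in rows.items():
    print(word, round(ttc_ssm_ratio(n_ttc, n_ssm), 1))
```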
2. Keywords in positional context

In 4.1 we identified the words that are frequently used in conversations, yet are very rare in monologue-style speech. To then see whether any of these words have a marked tendency to occur either at the end of a turn (TF), or at the start of a new turn (TI), the words have to be investigated within the concordance lines. To do this, the usage patterns at the end and at the start of each of the identified turns are investigated.

The TTC corpus provides a platform for turn-taking investigations as concordance lines can be investigated following the node Speaker1TF : Speaker2TI. This assists in showing the turn-final and turn-initial words most frequently occurring in this conversation corpus. Once items (usually proper nouns) specific to individual conversations are taken out, a picture of the salient patterns of use appears.
    5. See Pace-Sigge (2013), in particular Chapters 5, 6 and 7, for more detailed analysis with regards to personal pronoun and discourse particle use; Pace-Sigge (2015), Chapters 5 and 6, with regards to differences between monologue and conversation texts.
      Table 6. Most frequent turn-final and turn-initial items in TTC
Table 6 gives a first glimpse at what to expect. In the turn-final column there are words that are expected to be clause-final and not clause-initial: for example know, them, him.6 There are also items that are, or may be connected with the use of, tag questions: innit, it. On the other hand, turn-initial utterances strongly indicate backchannelling: mm, mhm, mmm, oh, ah. Some transcripts are detailed enough to highlight that there is an overlap using these items (meaning “that a response has been squeezed in edgeways” – cf. Yngve 1996: 299f.). Furthermore, there is backchannelling indicating agreement by Speaker 2 (S2) with what Speaker 1 (S1) has said: yeah, yeh, yea, yes, right. There can also be surprise or disagreement: oh, no, but, nah.

There is clear evidence that speakers both finish and start their turn with hesitation-markers (er, erm) though they are notably more frequently used by S2 (i.e.: turn-initial).

Table 6 merely represents occurrence frequencies: this means that, while yeah is found ranked amongst the most frequent TF1 (turn-final) and TI1 (turn-initial)
    6. There are also definite equivalents to the clause-final-only terms: they and he.
co-occurrence. This does not mean, however, that a turn-ending is directly followed by a turn starting with yeah. Looking at the occurrence patterns for the last word of a turn on the one side and the first word of a turn on the other, two important features need to be taken into account: (1) the frequency of occurrence (which almost always differs) and (2), more importantly perhaps, what patterns are revealed when looking at the actual concordance lines.
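The distinction between separate occurrence frequencies and actual turn-final/turn-initial pairings can be made concrete with a small sketch. It assumes transcripts in which each turn sits on one line preceded by a marker such as "S1:"; this format, the regular expressions and the function names are illustrative assumptions, not the layout of the TTC files.

```python
import re
from collections import Counter

TURN = re.compile(r"^\s*(S\d+)\s*:\s*(.*)$")   # assumed "S1: ..." speaker marker
TOKEN = re.compile(r"[a-z'’]+")                # crude tokeniser for the sketch

def read_turns(lines):
    """Return a list of (speaker, tokens) for every non-empty turn."""
    turns = []
    for line in lines:
        match = TURN.match(line)
        if match:
            tokens = TOKEN.findall(match.group(2).lower())
            if tokens:
                turns.append((match.group(1), tokens))
    return turns

def boundary_counts(turns):
    """Count turn-final words, turn-initial words, and their actual pairings."""
    turn_final, turn_initial, pairs = Counter(), Counter(), Counter()
    for (_, previous), (_, following) in zip(turns, turns[1:]):
        turn_final[previous[-1]] += 1
        turn_initial[following[0]] += 1
        pairs[(previous[-1], following[0])] += 1
    return turn_final, turn_initial, pairs

sample = [
    "S1: she is doing a more relaxing job",
    "S2: mm",
    "S1: so she is a technician for erm",
]
final, initial, pairs = boundary_counts(read_turns(sample))
print(final.most_common(), initial.most_common(), pairs.most_common())
```

The first two counters correspond to ranked frequency lists of the kind shown in Table 6; only the third records which turn-final word is in fact followed by which turn-initial word.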
      1. Preferred and dispreferred items for speakers and respondents in conversations
1. Turn-initial items

This section shows those words which are most likely and those least likely (preferred and dispreferred) to be found when a speaker takes a turn.

Table 7. Most and least likely turn-initial items in TTC7
TI1 item preferred   N        % usage   TI1 item dispreferred   N     % usage
MM                   9,523    68.8      THEM                    9     0.4
OH                   4,702    55.8      OF                      41    0.6
YEAH, YEH, YEA       16,571   45.1      BE                      17    0.8
NO, NAH              5,093    43.4      OUT                     12    0.8
YES                  917      42.5      WILL                    25    1.0
WELL                 965      28.5      TO                      121   1.4
WHAT                 2,374    28.0      THINK                   43    1.6
RIGHT                927      27.8      UP                      26    1.6
ER, ERM, UM          1,755    29.7      GOT                     67    1.8
HOW                  547      27.3      HAD                     31    2.2
WHERE                474      26.0      HER                     34    2.4
SO                   2,315    25.6      SAY                     36    2.5
THAT’S               1,003    20.4      GET                     60    2.5
ALRIGHT              310      20.3      THINK                   70    2.7
I’M                  404      17.8      FROM                    44    3.0

    7. The percentage values here are based on the proportional usage as last item (TF1) of a spoken turn or as turn-initial (TI1) compared to the total of L4 to R5 collocates.
To calculate the relative percentage of usage of an item in turn-initial position (TI1), two options are available: either by comparing the target items’ frequency of occurrence with the number of tokens for the word in the TTC, or by looking at the total number amongst the collocates. Neither can be fully mathematically accurate. For example, repetition of discourse markers (yeah yeah yeah) can skew the word-count. As this investigation shows, there is also a text-type specific bias in the figures. Thus, some items are typical of spoken discourse. Mm (mhm) is an item that occurs almost exclusively at the beginning or end of a spoken turn. The reason for this is that it is the only utterance in a turn. This is in stark contrast to non-discourse-specific items – like the highly frequent the, which can be found in any position of an utterance.

Table 7 shows that (within the TTC at least) there are some items that speakers prefer and some which they disprefer in turn-initial position. Amongst the turn-initial words, there is an overwhelming preference to use backchannelling (mm or right); this also affirms claims made in pre-corpus studies. Furthermore, the use of discourse markers mirrors the findings by McCarthy, Tao and Evison (see above). Table 6 also demonstrates a quirk of spoken transcriptions: similar utterances are notated as oh right, ah right or as a single unit, alright.8 These positive responses are juxtaposed with the negatives no or nah. There is also a tendency to ask questions (what, how, where). Furthermore, speakers use particular items to show hesitation or to buy themselves time – oh, well, er / erm or so. Lastly, turn-initial words indicate that explanations or evaluations are being given: that’s and I’m.

At the same time, we find clear evidence of words which are strongly dispreferred in turn-initial position. Hoey and McCarthy have pointed out that the is unlikely to be a turn opener (cf. Hoey 2005: 151). The table above indicates that the is the most frequent item in the data not to be found in turn-initial position. However, prepositions (of, out, to, up, from, on), possessive pronouns (them, her, my) and infinitive forms (be, get, think, go) are all, proportionally, even less likely to start an utterance. Most interesting here is that that and don’t are somewhat dispreferred yet that’s and no are often heard at the start of a turn.
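The first of the two options described above, comparing turn-initial occurrences with all tokens of the word, can be sketched as follows. The toy turns and the function name are invented for illustration; the example also shows how a repeated discourse marker (yeah yeah yeah) lowers the resulting percentage.

```python
def turn_initial_share(turns, word):
    """Percentage of all tokens of `word` that stand in turn-initial position."""
    initial = sum(1 for tokens in turns if tokens and tokens[0] == word)
    total = sum(tokens.count(word) for tokens in turns)
    return 100 * initial / total if total else 0.0

# Invented toy data: each turn is a list of lower-cased tokens
turns = [
    ["yeah", "yeah", "yeah"],
    ["so", "the", "bus", "was", "late"],
    ["yeah", "but", "the", "bus", "is", "always", "late"],
    ["mm"],
]

for item in ("yeah", "mm", "the"):
    print(item, turn_initial_share(turns, item))
```

Both turns containing yeah open with it, yet only two of its four tokens are turn-initial because of the repetition in the first turn; the, by contrast, never opens a turn in this sample.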
8. Michael Hoey (personal communication) points out that “alright” is definitely different from the other two in my usage and, I suspect, in that of most speakers. While I cannot disagree with that, one would have to be able to compare transcripts from a variety of sources to confirm whether these are or are not the same kind of utterance. As it stands, we will have to assume (given that these lexical items appear to be used in about the same way in the different corpora) that different transcribers recorded the same form in different orthographic ways.
      Table 8. Most likely second items in a turn (TTC)
TI2 item preferred   N      % usage
RIGHT                1283   38.5
BUT                  1739   28.8
‘COS                 653    28.2
DID                  1031   25.2
SHE                  868    25.2
HE                   1094   23.4
IF                   406    22.9
WE                   785    22.7
DON’T                1010   22.2
YOU                  4745   22.0
I                    4252   21.2
AND                  3078   21.6
IT’S                 1358   21.6
THEY                 1320   21.5
WOULD                327    21.1
THAT’S               1015   20.6
In the context of items found to start a turn, the second item in a turn (TI2 - see Table 8) needs to be taken into consideration as it often combines with TI1 items. This is highly notable with the most frequently TI2-occurring right, which is usually part of oh right or yeah right (see also Evison 2012). The second-ranked but is interesting as it clearly demonstrates the use of a politeness strategy by the speaker. Rather than directly offering an opposing view, speakers say mm but, yeah but or, a lot less frequently, no but. A similar pattern emerges with ‘cos in TI2 position. In fact, the formula yeah (agreeing) but (qualifying) appears in one fifth of all TI2 but occurrences. A good example is the following, talking about a baby sister:

S1: …even though she is three
S2: she’s three – ahhh
S1: Yeh. but she’s spoilt and …

Did, though, is part of an open-ended question (where / how / when did) but also appears after a hesitation (mm / er did…). Personal pronouns tend to come after connectors (and he; but she) while the conditional if also comes after connectors like but, discourse markers, or hesitations (yeah if; mm if, erm if). That’s appears slightly more frequently in TI2 than in TI1. In TI2 it is mostly used as the fixed phrases no, that’s ok or oh, that’s right. Most frequently, however, it supports an
affirmative view: yeah, that’s it / right. Similarly, TI2 and is usually found as yeah and. These patterns fit easily with what Table 7 indicates: that respondents in a conversation seek, overwhelmingly, agreement.

Table 9. Most likely turn-initial bigrams (TTC)
R1 – R2 bi-gram      N     % use of R2 occurrence
OH RIGHT             654   51.0
MM MM                215   35.8
I KNOW               183   32.3
MM/MMM/MHM SO        473   21.9
YEAH/YEH/YEA SO      470   21.5
YEAH/YEH/YEA ‘COS    134   20.5
YEAH/YEH/YEA BUT     396   22.8
MM/MMM/MHM BUT       139   8.0
YEAH/YEH/YEA AND     568   18.5
MM/MMM/MHM AND       418   13.6
Table 9 shows a strong preference for a speaker to start a turn with items that appear associated with backchannelling. Furthermore, the choice of words and constructions is fairly restricted. It appears to mirror findings described by Evison (2012). There can be agreement (oh right; I know); clarification, where mm or yeah is followed by so as in this example: “mm so that they have been in trouble”. Alternatively, the respondents continue on from what has just been said – yeah cos or yeah and as in “yeah, cos no one’s got English accents now” or “yeah and I used to get into an argument”. It must be noted, however, that doubt (or disagreement) is more likely to be expressed through yeah but rather than by using the hesitation marker mm first and then using but.9 An example for this would be the following:

S1: She used to teach you different subjects?
S2: yeah but it’s not – that’s what they’re supposed to do

Indeed, as can be seen below, there appears to be quite an overlap between turn-starting items and items following a backchannel. Backchannelling here is following the observations and definitions of Yngve (1970) and Sacks et al. (1974).
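The column “% use of R2 occurrence” in Table 9 appears to express each bigram as a share of all TI2 occurrences of its second word, as listed in Table 8. The following check, a sketch under that assumption with illustrative names, reproduces the printed percentages for the bigrams whose second item appears in Table 8.

```python
# N for second-position items, taken from Table 8
ti2_totals = {"RIGHT": 1283, "BUT": 1739, "'COS": 653, "AND": 3078}

# N for turn-initial bigrams, taken from Table 9
bigram_counts = {
    ("OH", "RIGHT"): 654,
    ("YEAH", "BUT"): 396,
    ("MM", "BUT"): 139,
    ("YEAH", "'COS"): 134,
    ("YEAH", "AND"): 568,
    ("MM", "AND"): 418,
}

def share_of_second_item(bigram, counts=bigram_counts, totals=ti2_totals):
    """Bigram frequency as a percentage of all TI2 uses of its second word."""
    return round(100 * counts[bigram] / totals[bigram[1]], 1)

for bigram in bigram_counts:
    print(" ".join(bigram), share_of_second_item(bigram))
```

Read this way, roughly half of all second-position uses of right are accounted for by oh right, whereas mm but covers less than a tenth of second-position but.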
    9. See Gardner (2001), in particular Chapters 3 and 6, for a detailed analysis of the various forms of mm as a response token.
      Table 10. Most likely R1 after back‑channelling
Backchannel: R1 bi-gram   N     % use of R2 occurrence
[MM/MMM/MHM] BUT          431   24.5
[MM/MMM/MHM] AND          696   22.6
[MM/MMM/MHM] SO           392   18.2
[MM] MM                   102   16.9
[YEAH/YEH/YEA] AND        362   11.8
[YEAH/YEH/YEA] BUT        204   11.7
[YEAH/YEH/YEA] ‘COS       74    11.3
It is remarkable how similar Tables 9 and 10 are. While there can be no absolute certainty here, as the transcriptions might not fully reflect the dynamics of the exchange, there still seems to be an indication that, at times, backchannelling is a signal to Speaker 1 (S1) by Speaker 2 (S2) that it is ok for Speaker 1 to carry on talking. Consequently, S1 then continues, starting with a word like so – in fact using the backchannel as a request to reword as in the example below:

S1: …she is doing a more relaxing job
S2: mm
S1: so she is a technician for erm …

There can also be positive acknowledgement: 364 occurrences of oh right are backchannelling items by Speaker 2 while Speaker 1 continues to speak. Yet twice as many occurrences of oh right seem to start a completely new turn. This appears to be similar to I know, which, mainly, starts a new turn. At the same time, there are 50 occurrences where I know is simply slotted into the running conversation. This, again, fits well with Tao’s and Evison’s findings.

The longer clusters investigated appear to show only source-specific formulaic phrases. On the one hand, we can find a lot of leading questions in the ‘Linguistic Innovators’ corpus (“What do you mean by”, “What do you think of”). On the other hand, such formulaic phrases stem from speech-community specific strong use (‘overuse’) like the phrase “you know what I mean” by Liverpool speakers (see Pace-Sigge 2013). Such long phrases are, however, of no relevance to the current discussion.
1. Turn-final items

While turn-initial items are relatively frequent and while the findings above reflect fully what has been taught in discourse analysis, it is a lot harder to identify how a speaker uses lexical means to end a turn. As the data below shows, the variety of
words employed to end a turn appears to be larger than the variety employed to be turn-initial. It appears, in fact, easier to highlight those words that are least likely to be found at the end of a turn rather than those completing a turn.
Table 11. Most and least likely turn-final items in TTC

TF1 item preferred   N       % usage   TF1 item dispreferred   N     % usage
NOW                  455     30.7      YOU’RE                  8     0.6
THEM                 522     23.3      I                       146   0.7
THERE                716     21.5      I’M                     15    0.7
THEN                 453     16.5      WHERE                   12    0.7
KNOW                 768     16.3      HOW                     17    0.8
YEAH, YEH, YEA       5,947   16.2      HE’S                    15    0.8
REALLY               471     16.0      A                       82    0.9
THAT                 1,432   15.7      MY                      45    0.9
UP                   260     15.7      AT                      18    0.9
OUT                  242     15.7      AS                      21    1.1
WELL                 524     15.5      THAT’S                  61    1.2
IT                   1,601   14.9      WILL                    29    1.2
PEOPLE               219     12.6      THEY’RE                 18    1.2
THINK                323     12.2      IF                      23    1.3
MM                   1,626   11.7      WHAT                    124   1.5


Turn-final items are fairly heterogeneous. The most frequently occurring term, yeah, is only prominent because it is a single-item utterance: both turn-initial and turn-final at one and the same time. In fact, yeah, like mm, acts as a single-word acknowledgment. Otherwise, yeah acts as brief confirmation: I did, yeah; I thinks so, yeah.

Really appears to have two functions. It can either be a short form of indicating an opposing opinion – then we find not really as a brief response as the most likely usage. Alternatively, speakers do not feel there is much to say (or they are being self-deprecating) and thus want to end their turn: just a hobby really; there is no point really; I don’t know anyone really. It must be noted how turn-final really almost always carries a negative semantic association. This is similar to think, which appears almost exclusively as the turn-final I think as in three or four times I think. Likewise, that occurs mostly in forms like and that; and all that; like that; stuff like that.
McCarthy describes this as “logical that these vague category markers should cause turn-change because they need to be reaffirmed by the listener, who is effectively saying ‘yes, I understand what you mean by and stuff’. This is an important part of the continuous negotiation of meaning that has to take place in conversation”.10 There is a clear message of finality to all such phrases, which explains why a speaker would use these to give up a turn. These findings provide a clear link to what has been investigated earlier (see Section 2). Those claims were made on the strength of rather limited data – yet what was said then is supported here on the basis of larger, more recent material.

Now stands out amongst the turn-final items. It is not a very frequent word in the whole corpus (0.24 per cent, yet it is in almost every file). Within the total of collocates, it is found in turn-final position around a third of the time. Amongst the total tokens of now in the TTC, it appears in turn-final position in 11.5 per cent of cases.11 As such, ending an utterance in now indicates strong finality by a speaker: “sixty-nine is quite young now”; “she’s been living in that house now”; “I’ve got a new job now”. It must be noted that this specific usage pattern of now has not been found mentioned in any of the earlier literature on the subject.

Furthermore, a lot of turns appear to end on them, there and then. So we have utterances like “yeah, most of them”, “my house is full of them”; “it’s just there”, “I have relatives there”; “you lived in Liverpool then”, “go on then”. While it can be said that there and then are alike (pointers to space and time), them is quite different. However, they share a common quality: they are all clause (and, in particular, utterance) final pointers, back-referencing to something that must be clear to the listener. TF1 well also belongs to this group: every single TF1 (turn-final item) well is preceded by TF2 (the penultimate item of a turn) as – thus being the preferred spoken form to add something: … as well. Again, this refers back to something established earlier, as in “…so your mum is Bangladeshi as well”. These turn-final words mostly complete factual statements or statements turned into a question. Pragmatically, it therefore makes sense that these should end an utterance.

Interesting is the use of people at the end of a turn. People is, like “somebody”, an inherently vague item. So we have, for example, “… black people”, “… Columbian people” and such utterances are met by the listener’s backchannelling (mm). Alternatively, the item people seems to be a trigger, where another speaker adds to or completes an utterance:
    10. Michael McCarthy, personal communication.
    11. Now numbers 3950 in total, in 204 out of 211 files of the corpus. Amongst the L4 to L5 collocates, NOW occurs 1450 times.
S1: Just people hitting people
S2: Yeah for no reason

or

S1: I’d only go out with white people
S2: Ahh, only white people

Finally, another turn-final item is it. Around one-third of these occur within tag questions (is it? = 409 occurrences; isn’t it? = 141 occurrences) where turn-finality is expected.12 There are also a number of instances of that’s it (220 occurrences; that’s about it = 69 occurrences), which indicates the intention to complete a turn. It therefore carries a semantic association of finality for both the speaker (producer) and listener (recipient). Apart from that, the term it coheres with another topic within the lexical chain of the speaker’s utterance: it refers back to something earlier mentioned, as in the following:

Everyone claims EMA. Everyone can get it
S1: So you do the ironing? S2: No, I can’t deal with it
You’ve got loads of money so I just take some of it

Moving to the dispreferred words (right side of Table 11), it becomes fairly clear that these are all items that are not usually found in clause-final position – unless a speaker is interrupted. This also explains the extremely low frequency of these words. The exception is what – which is recorded to be in turn-final position as part of a fixed phrase like no matter what, guess what, I’ve forgotten what. Yet what is also turn-initial, being a single-word interjection to clarify, as in

S1: …you can change the speed of the tape
S2: What
S1: you can change the speed of the tape.

Prime verbs are not usually found in clause-final position (I am, you are, they are, he is – usually uttered, as can be expected in speech, in their contracted form), nor are pronouns like I, my, we, he. The exception to this rule would be the use of tags as in “I’m hungry, I am”, “Pass me that pen, will you” (cf. Carter & McCarthy 2006). Furthermore, indicators of open questions (how, where) or post-positioned modals (would) are not found in clause-final position unless, that is, they occur in set phrases as in yeah that’s how; I can’t see how; you know how; don’t ask me where; I don’t know where. All of these are back-referencing.

There are also prominent patterns of usage when one looks at the TF2 items found at the end of a speaker’s turn.
    12. It should be noted that positive tag-questions are notably more frequent here.
      Table 12. The 12 most likely second turn‑final items (TTC)
TF2 item preferred   N      % usage
AS                   518    26.3
OR                   432    19.3
FOR                  413    17.4
AT                   328    16.7
THE                  2125   16.6
MY                   813    15.5
FROM                 222    15.3
A                    1341   14.3
OF                   898    14.2
BE                   201    12.2
ABOUT                371    12.8
LIKE                 1033   10.8
Table 12 indicates that speakers in TF2 (that is, close to the end of a turn) tend to use determiners (a, the, my) or prepositions (for, at, from). Furthermore, it is interesting to note the be-ADJ construction, most prominently be honest. 23 of these occurrences are the phrase to be honest.13 Fixed phrases like these reflect the use of a colligational pattern that is associated with turn-endings. Therefore, phrases like it’ll be alright; couldn’t be bothered; needs to be done, as well as the be-ADJ form that’ll be good occur.

Similarly, TF2 of-constructions, where of is used as the penultimate item in a turn, are frequent turn-ending devices: you know the bit of X-road, you know outside of college, think about different parts of England (place); I only know one of them, about twenty of them (persons). There is also back-referral with it: no but I’ve heard of it, I left there because of it.

Looking at Table 13, we find fixed phrases that either indirectly reinforce something previously said by adding a word or phrase compatible with it, most prominently the bigram as well, or indicate that the speaker is resorting to vagueness markers, letting the turn peter out: (or) something like that, that kind of thing, that sort of thing, and that or the more final that’s about it. This is fully in line with the longest turn-final clusters found14:
    13. A typical phrase to be heard in Liverpool, cf. Pace-Sigge (2013).
    14. It also mirrors findings in Pace-Sigge (2013).
      Table 13. The 11 most likely turn‑final bigrams (TTC)
TF2-TF1 item     N     % use of L2 occurrence
AS WELL          448   86.5
ABOUT IT         256   69.0
DO YOU           274   65.9
OR SOMETHING     256   59.3
OR ANYTHING      112   25.9
LIKE THAT        600   58.1
IS IT            206   55.5
AT ALL           170   51.8
THAT’S IT        220   43.1
YOU KNOW         550   40.6
AND THAT         341   28.3
      Table 14. Longest turn‑final clusters
      According to McCarthy,15 the phrases found in Table 14 “all appeal to shared knowledge, so it’s natural that the listener should respond at this point”. Apart from tag‑questions with it, there is also the form that appears to challenge the listener: do you as in “you don’t kick off in stranger’s pubs, do you” or “you get on with him, do you?”
1. Alignment between speakers’ turns
      While there has been computational modelling to approach this matter – notably by Howes et al. (2010), Mehler et al. (2010), and Gómez González
    15. Michael McCarthy, personal communication.

(2011) – the issue of alignment between two speakers is probably the hardest feature of turn-taking to detect using corpus linguistic methods. One would have to find words or sets of words that appear in the utterance-final concordance lines and then recur in the utterance-initial concordance lines. Yet the multitude of subjects that an individual might talk about would make repeated patterns almost invisible. While it might be difficult to detect patterns, there is, however, still the option to go over hundreds of random concordance lines and gather evidence of alignment: where the respondent repeats chunks of what has just been heard. Below, therefore, only exemplary evidence is presented. It is, however, a further salient feature of turn-taking in discourse. This is what Trofimovich (2005) refers to as auditory priming. He found that sometimes primes can last for a very short period – seconds only. There seems to be some evidence in my sample that some words are more likely than others to be mirrored by a respondent.

There are examples where longer chunks (and grammatical constructions) are repeated, seemingly without conscious effort to do so. These become apparent when going over the concordance lines and finding instances where what has been said in turn-final position (by speaker 1) is found repeated in the following turn-initial position (by speaker 2):
1. S1: Is that every Saturday?
   S2: Not every Saturday
2. S1: mm .. you like playing football?
   S2: yeah I like playing football I play I play a lot …
3. S1: you can’t play football or anything?
   S2: I play football yeah
4. S1: ain’t filled it out yet
   S2: I filled mine out already
5. S1: mm .. have all your boyfriends been from this area?
   S2: all Brentwood the area of Brentwood
6. S1: is that a better area?
   S2: not better area … mm nah …
7. S1: …I don’t even know
   S2: cos we’re cold
   S1: I know we’re cold girls
8. S1: No, I’m not, I’m from Lancashire
   S2: But I haven’t got an accent
   S1: But I haven’t got an accent you see …
   S2: Where you are from hasn’t got a broad accent
   S3: Lancashire has really broad accent
9. S1: … way they dress or the way they speak?
   S2: way they speak
   S3: speak and dress innit some of some of them
   S2: speak and dress yeah
10. S1: You horrible (..) git
    S2: So am I
    S1: love you really though
    S2: love you too – love
    S1: Bye bye darling
    S2: Cheerio my sweet
Repetition and re-lexicalisation have been described and analysed in Carter and McCarthy (1988: 185ff.). The examples given above add to the considerable literature already present.

Example (1) appears almost over-precise in repeating the bigram from the question. (2) both repeats most of a longer chunk and keeps the colligational pattern. It has to be noted that the -ing form colligation is kept, though in another conversation (3) we find the infinitive form. (4) keeps the colligational pattern, which underlines the positive response to S1’s ain’t. Here the bigram is the phrasal verb fill out. Examples (5) and (6) appear to indicate clear alignment – S2 seems not usually to employ the term ‘area’ but brings this in in his/her answer.

The final four examples seem to be even better examples of alignment, as the same wording seems to be bouncing back and forth, at times between three parties. Hence we find, in (7), how Speaker 1 appears to combine her “I don’t even know” utterance with the second speaker’s contribution to create “I know we are cold girls”. In (8) S2 repeats the previous utterance verbatim and then proceeds to talk at some length. This then finds strong linkage with what S3 says, when she refers to Lancashire and ends her utterance with almost the same words as the previous speaker. Example (10) can only be understood with recourse to some background: these are two middle-aged colleagues, having a banter. Yet the use of the antonym love to git, the repeated use of the address love and the closely related words in that field (darling, sweet) indicate extremely close alignment – though this has to be seen as a more conscious act than the Examples 1–9.

Tannen (1989: 97), referring to Jakobson, summarises fittingly why such echoing is found:

Utterances do not occur in isolation. 
They echo each other in a “tenacious array of cohesive grammatical forms and semantic values,” and intertwine in a “network of multifarious compelling affinities.” One cannot understand the full meaning of
    any conversational utterance without considering its relation to other utterances (…) in prior text.
  5. Discussion and conclusion

If the evidence presented above is seen as the result of an unconscious preference or non-preference for employing such items within different parts of a spoken discourse, that is, if “every word is primed to occur in, or avoid, certain positions within the discourse”, its textual colligations (Hoey 2005: 13), then a clear case can be made. In line with the material presented in Section 2, a lot of turn-initial items are backchannelling sounds made by respondents, and these tend to show agreement. These would then be followed by personal pronouns. This confirms the conclusions drawn by Yngve, Sacks et al. with data collected in a pre-corpus era. On the other hand, prepositions (of, out, to, up, from, on), possessive pronouns (them, her, my) and infinitive verbs (be, get, think, go) are dispreferred in turn-initial position. Furthermore, backchannelling can be misunderstood by the first speaker as a turn-opener; hence the respondent is expected to carry on talking after uttering what would otherwise be a single-word acknowledgement.

Similarly, there are lexical signals at the end of a turn, as the investigated corpus shows: verb phrases, pronouns and wh-words are clearly dispreferred, while words like yeah, that or it are found to a high degree at the end of an utterance. This investigation has also uncovered that the items now and people appear to have a strong preference to be turn-final. In addition, speakers seem to run out of things to say and simply point towards vague entities, ending a turn with long prefabricated phrases like and things like that or and stuff like that.

Finally, some initial impressions of speaker alignment have been presented, linking the idea of textual colligation to the pragmatic aim of co-operation between the speakers in a conversation. Though alignment is difficult to demonstrate, and is very clearly not a constant found in every single exchange, concordances demonstrably present evidence of some use of alignment where sets of words (and, often, colligation structures) are repeated by one or more respondents in the TTC. These can be seen as auditory priming as described by Trofimovich (2005).

Further research would have to take into account the full range of communication signals (the ones given by Duncan 1972) and investigate not only corpus but also audio and video data to give a rounded account of turn-taking primings. Also, as Howes et al. (2010) and Stent (2011) have shown, there are more sophisticated systems available to track alignment between speakers. Greater use should be made of these for a more in-depth investigation.


References

Archer, D., Aijmer, K. & Wichmann, A. 2012. Pragmatics. An Advanced Resource Book. London: Routledge.

Carter, R. 2004. Grammar and spoken English. Applying English grammar: Functional and corpus approaches, pp. 25–39.

Carter, R. & McCarthy, M. 1988. Vocabulary and Language Teaching. London: Longman.

Carter, R. & McCarthy, M. 2006. Cambridge Grammar of English. Cambridge: CUP.

Cheng, W. 2012. Exploring Corpus Linguistics. Language in Action. London: Routledge.

Duncan Jr., S. 1972. Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology 23(2): 283–292. doi: 10.1037/h0033031

Duncan Jr., S. 1974. On the structure of speaker-auditor interaction during speaking turns. Language in Society 3(2): 161–180. doi: 10.1017/S0047404500004322

Duncan Jr., S. 1981. Conversational strategies. Annals of the New York Academy of Sciences 364: 144–151. Issue The Clever Hans Phenomenon: Communication with Horses, Whales, Apes, and People. <‑1/issuetoc> (5 October 2015). doi: 10.1111/j.1749-6632.1981.tb34468.x

Evison, J. 2012. A corpus linguistic analysis of turn-openings in spoken academic discourse: Understanding discursive specialisation. English Profile Journal 3(4). <journals.cambridge.org/abstract_S2041536212000049> (5 October 2015).

Gabrielatos, C., Torgersen, E., Hoffmann, S. & Fox, S. 2010. A corpus‑based sociolinguistic study of indefinite article forms in London English. Journal of English Linguistics 38(4): 297–334. doi: 10.1177/0075424209352729

Gardner, R. 2001. When Listeners Talk: Response Tokens and Listener Stance [Pragmatics & Beyond New Series 92]. Amsterdam: John Benjamins. doi: 10.1075/pbns.92

Gómez González, M. 2011. Lexical cohesion in multiparty conversations. Language Sciences 33: 167–179. doi: 10.1016/j.langsci.2010.07.005

Halliday, M.A.K. 2004. The spoken language corpus: A foundation for grammatical theory. In Advances in Corpus Linguistics, K. Aijmer & B. Altenberg (eds). Amsterdam: Rodopi.

Hoey, M. 2005. Lexical Priming. A New Theory of Words and Language. London: Routledge.

Hoey, M. 2014. Words and their neighbours. In The Oxford Handbook of the Word, John R. Taylor (ed.). Oxford: OUP. doi: 10.1093/oxfordhb/9780199641604.013.39

Howes, C., Healy, P. & Purver, M. 2010. Tracking lexical and syntactic alignment in conversation. In Proceedings of the Twenty-fifth Annual Conference of the Cognitive Science Society (CogSci 2010), Portland, Oregon.  paper0484.pdf (9 October 2015).

Knowles, G. 1987. Patterns of Spoken English. An Introduction to English Phonetics. London: Longman.

McCarthy, M. 1998. Spoken Language and Applied Linguistics. Cambridge: CUP.

McCarthy, M. 2002b. Good listenership made plain. British and American non‑minimal response tokens in everyday conversation. In Using Corpora to Explore Linguistic Variation [Studies in Corpus Linguistics 9], R. Reppen, S.M. Fitzmaurice, & D. Biber (eds), 49–71. Amsterdam: John Benjamins. doi: 10.1075/scl.9.05mcc

McCarthy, M. 2003. Talking back: “Small” interactional response tokens in everyday conversation. Research on Language and Social Interaction 36(1): 33–63. doi: 10.1207/S15327973RLSI3601_3

McCarthy, M. 2010. Spoken fluency. English Profile Journal 1(1): e4. doi: 10.1017/S2041536210000012

Mehler, A., Lücking, A. & Weiß, P. 2010. A network model of interpersonal alignment in dialog. Entropy 2010(12): 1440–1483. doi: 10.3390/e12061440

Myers, G. 2009. Structures of conversation. In English Language. Description, Variation and Context, J. Culpeper, F. Katamba, P. Kerswill, R. Wodak, & T. McEnery (eds). Houndmills: Palgrave Macmillan.

O’Keeffe, A., McCarthy, M. & Carter, R. 2007. From Corpus to Classroom. Language Use and Language Teaching. Cambridge: CUP. doi: 10.1017/CBO9780511497650

Pace-Sigge, M. 2013. Lexical Priming in Spoken English. Houndmills: Palgrave Macmillan.

Pace-Sigge, M. 2015. The Function and Use of TO and OF in Multi-Word-Units. Houndmills: Palgrave Macmillan. doi: 10.1057/9781137470317

Sacks, H., Schegloff, E.A. & Jefferson, G. 1974. A simplest systematics for the organisation of turn‑taking for conversation. Language 50: 696–735. doi: 10.1353/lan.1974.0010

Stent, A. 2011. Shared experiences, shared representations, and the implications for applied natural language processing. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference, 210–215. publication/221439080_Shared_Experiences_Shared_Representations_and_the_Implications_for_Applied_Natural_Language_Processing (9 October 2015).

Tannen, D. 1989. Talking Voices: Repetition, Dialogue, and Imagery in Conversational Discourse. Cambridge: CUP.

Tao, H. 2003. Turn initiators in spoken English: A corpus-based approach to interaction and grammar. Language and Computers 46: 187–207. Special issue Corpus Analysis: Language Structure and Language Use, P. Leistyna & C.F. Meyer (eds).

Thórisson, K.R. 2002. Natural turn-taking needs no manual: Computational theory and model, from perception to action. In Multimodality in Language and Speech Systems, B. Granström, D. House, & I. Karlsson (eds), 173–207. Dordrecht: Kluwer. doi: 10.1007/978-94-017-2367-1_8

Trofimovich, P. 2005. Spoken-word processing in native and second languages: An investigation of auditory word priming. Applied Psycholinguistics 26: 479–264. doi: 10.1017/S0142716405050265

Yngve, V. 1970. On getting a word in edgewise. In Papers from the 6th Regional Meeting of the Chicago Linguistics Society, April 16–18, Robert I. Binnick (ed.). Chicago IL: CLS.

Yngve, V. 1996. From Grammar to Science. New Foundations for General Linguistics. Amsterdam: John Benjamins. doi: 10.1075/z.80

part ii

Similes, synonymy and metaphors

Lexical priming and the selection and sequencing of synonyms

Linda Bawcom

Detailed corpus‑based research has identified factors that describe various reasons for the preference of one synonymous lexical item over another (or others). This paper continues along these descriptive lines while presenting the versatility of synonyms and their functions. In addition to statistical results, we also investigate the psychological reasons for our choices by exploring what is referred to in psycholinguistic priming tasks as the frequency effect. We will find that this psychological, subliminal effect can importantly add explanation to description for corpus‑based studies, which finely dovetails with Hoey’s theory of Lexical Priming.

  1. Introduction
Using large general corpora, researchers have performed numerous studies which provide statistical evidence of how we use, differentiate among, and thus choose a possible synonym.1 The studies carried out using a corpus-based methodology typically point to collocation, semantic preference, and semantic prosody as explanations for the preference of a particular synonym compared to another (or others) with the same intuitive meaning (see for example: Edmonds 1999; Sinclair 1991; Stubbs 1996; Partington 1998; Hoey 2005; and Barnbrook et al. 2013).

This study, on the other hand, was carried out using a small, topic-bound, specialized corpus. As Hoey points out referring to his Guardian corpus, “…certain kinds of feature only become apparent when one looks at more specialized data” (Hoey 2005: 13). This is precisely the case with this research where entire texts could be analyzed and compared. While the study presented corroborates prior studies treating synonyms, it reveals an interesting factor referred to as the frequency effect, which may also subliminally affect our selection of an appropriate synonym. This psychological singularity is only one of numerous associations that a word or phrase can be linked to in our mind.
    1. As used in this article, synonym means that when focusing on their similarities, one lexical item could intuitively be substituted for another in the same kind of discourse without changing the meaning.
doi 10.1075/scl.79.05baw © 2017 John Benjamins Publishing Company

Why associations such as collocation, colligation, and semantic preference should occur at all was the impetus for Hoey’s research which led to his lexical priming theory. He argues that these associations must be linked to the psychological phenomenon of priming. By including discussions of results based on priming tasks in the field of psycholinguistics, which complement Hoey’s theory, additional insights can be gained into the psychological process concerning word choice. This importantly enables us to add a psychological explanation to description. Following is a brief introduction to semantic priming and Hoey’s theory of lexical priming.
1.1 Semantic priming

The notion of priming comes from semantic priming tasks carried out in the field of psycholinguistics. The question that researchers seek to answer is how language patterns are stored and retrieved. Simply put, a basic semantic priming task will present a word to participants such as fish (called the prime) and then a string of letters or another word such as chips (called the target). The time taken to carry out the lexical decision task, such as recognizing whether the target is a word or non-word, is measured, with the results generally given in milliseconds. Over time, controlled experiments with lexical decision tasks have become more sophisticated, complex, and varied. They have moved from two words (one word prime and one word target) to using primes and targets with short phrases (Arnon & Snider 2010), short sentences (Bod 2000, 2001), and lexical bundles (Tremblay et al. 2011; and Conklin & Schmitt 2012). What has been found in studies such as those cited is that items that are associated in our mind (such as fish and chips) or target words that are statistically ranked as more frequent than others are recognized faster than other more random lexical items or strings of letters. This discovery has come to be referred to as the frequency effect. The task for those in that field of research has been to account for it. A more in-depth discussion of the frequency effect will be presented in Part 3 of this paper.
1.2 Lexical priming

Hoey’s theory of lexical priming takes a significant step toward bridging a gap between explanation and corpus-based descriptions. While his theory offers support for the frequency effect, as we shall see, it can be applied to account for a more complete and sophisticated explanation as to why we retrieve words and phrases that are associated.

Contrary to semantic priming, what is of interest to Hoey is not the target, but rather the prime: why the prime would make the associated target more readily accessible. Basing his investigation primarily on collocation and semantic association,2 Hoey demonstrates through rigorous statistical evidence that our repeated exposure to a word in various contexts psychologically primes each word for each individual to associate that item with its:

• collocates
• semantic associations
• pragmatic associations
• colligations
• textual collocations
• textual semantic associations
• textual colligations

With respect to co-hyponyms, synonyms, and polysemy Hoey states:

Co-hyponyms and synonyms differ with respect to their collocations, semantic associations and colligations. When a word is polysemous, the collocations, semantic associations and colligations of one sense of the word differ from those of its other senses (p. 13)

Based on his research, Hoey’s claims provide a comprehensive theory as to why lexical items collocate and why they occur in particular patterns.

In order to see how semantic priming and lexical priming can be applied to explain research results in this study, we examine synonyms from two different but complementary perspectives. The first is both qualitative and quantitative, focusing on identifying a number of the ways synonyms serve us and their similarities and differences. The second is purely quantitative in nature as we go through the process of investigating the frequency effect by exploring synonyms used in the same text in the genre of newspaper articles dealing with the same topic. The scope of this paper does not allow for presenting the finer details of the research, but with these two corpus-based studies, we can review the type of description that a corpus-based methodology offers, gain a few new insights into synonyms, and then see how semantic priming’s frequency effect and Hoey’s lexical priming complement and support each other.
    2. The functions of synonyms
Synonyms serve useful purposes. We may turn to them when we need to:

• accommodate non-native speakers or speakers of English from countries other than our own (cross-varietal synonymy);
• avoid taboo or offensive language;
• be politically correct;
• use key words to browse the Internet; or
• write a cohesive piece of work.3

What follows are four additional functions of synonyms. The majority of the texts used in the examples are taken from a small corpus of newspaper articles reporting on the tsunami that struck Indonesia in December of 2004.

2. Semantic association exists “when a word or word sequence is associated in the mind of a language user with a semantic set or class, some members of which are also collocates for that user” (Hoey 2005: 24).
2.1 Collocation and colligation

Both collocation and colligation are concerned with the company an item keeps, our knowledge of which influences our word choices. Collocation is doubtless the most frequently cited reason for our choice of a particular lexical item. In the examples which follow, we can appreciate both the similarities and divergences of synonyms.

high and tall

1. It triggered a tidal wave that reached an estimated 20 to 30 feet high when it made landfall.
2. A slightly more violent earthquake struck Alaska 40 years ago, creating a tsunami up to 20 feet tall

There are 54 instances of high but only of tall. Of the 54 instances of high, 14 are the fixed phrases high-tech, high school, high ranking, and high-speed data, but 7 do co-occur, like tall, with the height of waves. Therefore, if measuring the exact height of waves, both tall and high are available but are, unsurprisingly, not interchangeable in fixed phrases.

hit and struck

3. 5,000 people were killed after an earthquake hit off the coast of the island.
4. one capable of massive damage – struck off the coast of the Indonesian island

Coincidentally, there are 56 instances of the verb hit and 56 of the verb struck. While there are numerous instances of both words collocating with tsunami, wave/s, quake and earthquake, they distribute themselves quite differently. We find that 44 (78%) of the collocates for the verb struck are within a span of four, while the span for hit with these co-occurrences is more often wider (up to a short paragraph). In addition, 13 (24%) of the instances for the verb hit are ellipses. There are no instances of this feature with struck. We also find that, though there are only a few examples, hit is used in the passive voice 5 times but struck only 2.

3. Bawcom (2010) identifies 27 functions of synonyms.

countries and nations
5. tsunami that struck coastal areas in India, Sri Lanka and other South Asian countries
6. provide tsunami forecasts to other Asian nations starting in March
7. but at least 11 countries were hit
8. immense waves or tsunamis crashed into several countries

Out of 28 instances of nations in the tsunami corpus, a number modifies only one, whereas out of 64 instances of countries, a number modifies 17 (26.5%). The word countries, therefore, appears to have a semantic preference for numbers.

said and told

Following is an example of a synonym necessarily having to be used due to colligation. Texts 9 and 10 are identical with the exception of the use of said and told, the latter necessitating an object.

9. “Hundreds of thousands of livelihoods have gone”, he said. (Virginia Pilot, December 28th, 2004)
10. “Hundreds of thousands of livelihoods have gone”, he told reporters. (Washington Post, December 28th, 2004)

From the foregoing examples, we notice that the synonyms share collocations. This would appear to somewhat contradict Hoey’s theory, which posits that synonyms differ with respect to “their collocations, semantic associations and colligations”. However, in his investigation of the two fixed expressions around the world and round the world Hoey draws the conclusion that:

It would seem as if the synonymous word sequences we have been considering are primed similarly but distribute themselves differently across the lexical, semantic, and grammatical terrain. Thus both expressions collocate with halfway and markets, but one of them is far more strongly primed than the other for such collocates … The shared meanings means that there is overlap in the primings, but ultimately it is the difference in (the weighing of) the primings that justifies the existence of the alternatives. (Hoey 2005: 79)

Thus, we understand that ‘differ’ does not mean across the board; there is room for overlap. It is a matter of how much that overlap occurs (and in what context) that is relevant to word choice and the manner in which lexical items are stored and retrieved.
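The span-based observation reported for hit and struck (how many instances of a node word have one of the disaster collocates within a span of four) can be illustrated with a minimal sketch. The tokenisation rule, the function name and the toy sentences are assumptions made for illustration only; they are neither the concordancing software used in this study nor data from the tsunami corpus.

```python
import re

def collocates_within_span(text, node, collocates, span=4):
    """Return (hits_with_collocate_in_span, total_hits) for `node`.

    Assumption: tokens are runs of letters, and the window of `span`
    tokens on either side ignores sentence boundaries.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    targets = {c.lower() for c in collocates}
    total = within = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        total += 1
        # Tokens up to `span` positions to the left and right of the node.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        if targets.intersection(window):
            within += 1
    return within, total

# Toy text, invented for this sketch (not taken from the corpus).
sample = ("The earthquake struck at dawn. Hours later a wave hit the coast. "
          "Officials said the region was hit hard.")
disaster_words = {"tsunami", "wave", "waves", "quake", "earthquake"}

print(collocates_within_span(sample, "struck", disaster_words))  # (1, 1)
print(collocates_within_span(sample, "hit", disaster_words))     # (1, 2)
```

Dividing the first figure by the second gives the kind of percentage quoted above; widening `span` shows how far away the collocates of a node word tend to lie.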
2.2 Avoiding repetition

What generally leads us to look for a synonym is an effort to avoid repetition. It seems a commonly held position, certainly in style manuals (see for example: Faigley 2005; and Fowler et al. 2007), that repetition of the same word or phrase within a short span may cause the reader to feel the writer has not put enough thought or effort into her or his work. This concern, as we shall see in Section 3, undoubtedly applies to journalists, who must also conform to in-house style manuals.

Keeble (1998), in his textbook, The Newspapers Handbook (2nd edn.), cautions students “not to repeat a word in the same sentence or any striking words close together, unless a specific effect is intended” (p. 89). Keeble does not focus on variation at length in his textbook. However, in his analysis of an excerpt of an article taken from the February 26th, 1993 issue of the Matlock Mercury, he does mention the usefulness of variation for avoiding repetition and for dramatic effect.

Morris Dancer Dies in Road Crash

A WINSTER Morris dancer was killed in a car smash last Monday – the second from the group to be involved in a tragic road accident in four months. (original emphasis)

Para. 1: focus on the human tragedy. The reference to the previous accident provides the angle. The use of the dash highlighting this point effectively. To avoid repetition, the reporter says ‘car smash’, and then ‘tragic road accident’. There is journalistic license here. ‘Tragic accident’ might refer to a second death. As becomes clear later, the accident victim lost a leg, not his life. But ‘tragic accident’ is not inaccurate and adds to the drama of the intro. (p. 135)

In the next two examples from the tsunami corpus, and throughout those newspaper articles, journalists heed advice from such style manuals (or perhaps their in-house style manual). Note how archipelago has been used to avoid the repetition of Islands in text 11 and in text 12 underwater so as not to repeat undersea.
11. Authorities in Tamil Nadu put the death toll in the state at 1,705. India’s private ND TV television channel reported that 3,000 people had died in the remote Andaman and Nicobar Islands, an Indian territory between Sumatra and Burma. (paragraph 2)
They expressed particular concern about the fate of thousands of people in the Andaman and Nicobar archipelago. (paragraph 3)
12. Depending on a location’s distance from the undersea quake or landslide, warning times may be short. (paragraph 2)
Major earthquakes are suspected of causing underwater quakes and slides, which may contribute significantly to tsunami generation. (paragraph 6)

Lastly, there is the following example from a popular American talk show.

13. McCarthy: He [Jim Carrey] has a daughter who I love, and I have Evan, who he adores. (Dimich & Goodside 2008)
Though the example is anecdotal in nature and the relationship more collocational than synonymous (the first collocation for adore is love in the Corpus of Contemporary American English (COCA) (Davis 2008)), what attracts our attention is how seamlessly McCarthy avoids repetition in this perfectly created parallel structure. Our ability to do tasks such as this is remarkable taking into account that word choice judgments are made in approximately 200 to 400 milliseconds. The length of time depends on how fast we are talking: between 150 and 300 words per minute (Conklin & Schmitt 2008: 72). Regarding the swiftness of accessing the parallel structure, Branigan et al. (1995) in their syntactic priming reading investigation found that “processing a particular syntactic structure within a sentence affects the process of the same (or related) syntactic structure within a subsequently presented sentence” (Branigan et al. 1995: 492). From a lexical priming perspective, the rapidity of perfect retrieval makes sense if we take into account what this theory postulates, which is that each word is stored along with all its collocational, colligational, semantic and textual associations.
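The relation between the speech rates and the 200 to 400 millisecond window cited above is simple arithmetic, sketched below. The sketch assumes, as a simplification, that every word receives an equal share of the available time; the function name is ours.

```python
def ms_per_word(words_per_minute):
    """Average time available per word, in milliseconds, at a given speech rate."""
    return 60_000 / words_per_minute  # one minute = 60,000 ms

for rate in (150, 300):
    print(f"{rate} wpm -> {ms_per_word(rate):.0f} ms per word")
# 150 wpm -> 400 ms per word
# 300 wpm -> 200 ms per word
```

At 150 words per minute a speaker has on average 400 ms per word, and at 300 words per minute only 200 ms, which matches the range given for word choice judgments.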
3. Sequencing of synonyms: Use of the most frequent synonym first

3.1 Introduction

One of the purposes of presenting the function of using synonyms to avoid repetition was to lay the groundwork for the quantitative part of this paper, which follows. It is because of journalists’ desire to avoid repetition that it was possible to investigate synonyms used in the same text, which in turn made it possible to provide more evidence of the frequency effect.
3.2 The frequency effect and spreading activation

In an exhaustive review of literature at the time, Murray and Forster (2004) state that “frequency effects have been found in just about every task that could reasonably be classified as a lexical processing task” (p. 721). The ubiquity of this effect has led researchers in psycholinguistics to acknowledge that there must be a link between this phenomenon and features that trigger association. By way of explanation, both Miller and Charles (1991) and Murray and Forster (2004) assert that continued encounters with a lexical item strengthen the associations that activate it. Thus, we can assume that those encounters forge elements attached to the lexical item in our neural network that are common to the contexts in which they are found.

The importance of context is underlined by Adelman et al. (2006) and by Hoey (2005). Based on their experiments, Adelman et al. argue that what is important is not exactly the frequency of the word, but rather “the number of contexts the word has been seen in before” (p. 223). This belief coincides with semantic priming reading tasks carried out by Reali and Christiansen (2007) who write, “readers’ expectations are influenced by exposure to sequences of words (or classes of words) that have been repeatedly used in similar contexts” (p. 19). The notions of association and context based on results of semantic priming tasks neatly dovetail with lexical priming. On the subject of the pervasiveness of collocation and how to account for it, Hoey writes:

As a word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co-texts in which it is encountered, and our knowledge of it includes the fact that it co-occurs with certain other words in certain kinds of context. (p. 8)

Repeated exposure to lexical items in their particular contexts, therefore, facilitates recalling them and the words related to them quickly.

One explanation for the frequency effect and Hoey’s lexical priming research is that of spreading activation (Collins & Loftus 1975). This model posits that we are able to recall related information due to an interrelated network of spreading activation so that concepts that are closely related are activated swiftly. Collins and Loftus postulate:

The conceptual (semantic) network is organized along the lines of semantic similarity. The more properties two concepts have in common, the more links there are between the two nodes via these properties and the more closely related are the concepts. (p. 411)

Kess (1992) further explains that:

In a spreading activation model of the lexicon, the activation of a single word then spreads over its network of associated words, being strongest with the most closely associated words and weakest as the strength of the relationship decreases with semantic distance (…). The Collins and Loftus model thus explains how semantic priming operates by spreading activation of nearby related concepts in the semantic memory. (p. 223)

It is worth reiterating here that this neural network of spreading activation is personal in nature. The strength of the associations depends on the frequency with which each individual has encountered a lexical item in particular contexts. Of course it depends on one’s background, but it seems apparent that a great many of us have come across and therefore used the same words in the same kind of context. Otherwise phenomena such as collocations, semantic associations, and therefore the frequency effect would be difficult indeed to quantify or explain.
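The idea that activation is strongest for closely associated words and weakens with semantic distance can be made concrete with a toy sketch. This is not the Collins and Loftus model itself: the network, the link weights, the decay factor and the threshold are all hypothetical values chosen for illustration, standing in for one individual’s strength of association.

```python
def spread_activation(network, source, decay=0.5, threshold=0.1):
    """Propagate activation outward from `source`.

    Each link passes on the sender's activation multiplied by the link
    weight and by `decay`. Assumes weights <= 1 and decay < 1, so that
    activation shrinks with every step and the loop terminates.
    """
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour, weight in network.get(node, {}).items():
                value = activation[node] * weight * decay
                # Keep only activation above threshold and stronger than before.
                if value > threshold and value > activation.get(neighbour, 0.0):
                    activation[neighbour] = value
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return activation

# Hypothetical association strengths for one speaker (illustrative only).
network = {
    "fish": {"chips": 0.9, "sea": 0.6},
    "chips": {"salt": 0.8},
    "sea": {"wave": 0.7},
}

result = spread_activation(network, "fish")
for word, level in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {level:.3f}")
```

In this sketch chips, directly and strongly linked to the prime fish, ends up more active than salt or wave, which are two links away. Changing the weights models a different personal history of encounters.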
3.3 The tsunami corpus

The tsunami corpus (TC) comprises 69 newspaper articles gleaned from various regions throughout the United States dated December 27th and December 28th, 2004. The corpus consists of 47,564 tokens (running words) and 5,366 types (the number of different words). All texts discuss the Indonesian tsunami and the human and material impact it had in the region. Although a few texts present more technical information than others regarding what triggers a tsunami, there are no singular or what might be considered atypical accounts.
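The distinction between tokens (running words) and types (different words) can be shown with a short sketch. The tokenisation rule below is an assumption made for illustration; corpus software applies its own settings, so this would not necessarily reproduce the figures reported for the tsunami corpus.

```python
import re
from collections import Counter

def token_type_counts(text):
    """Return (number of tokens, number of types, frequency list) for a text.

    Assumption: a token is a run of letters, optionally with one internal
    apostrophe, and case is ignored.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    freq = Counter(tokens)
    return len(tokens), len(freq), freq

# Toy sentence, invented for this sketch.
sample = "The wave hit the coast and the wave struck the town."
n_tokens, n_types, freq = token_type_counts(sample)
print(n_tokens, n_types)    # 11 7
print(freq.most_common(2))  # [('the', 4), ('wave', 2)]
```

The frequency list returned alongside the counts is the kind of wordlist from which candidate synonyms can later be ranked by frequency.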
3.4 Other corpora and software used

Two general corpora, the British National Corpus (BNC) (2001) and the Corpus of Contemporary American English (COCA) (Davis 2008), were used in order to determine if the words selected in this study could be considered synonyms.

Wcopyfind 2.6 (Bloomfield 2004), a free online plagiarism tool, was used initially as a means of weeding out articles that contained too high a percentage of text reuse. This program is able to compare and detect articles where information taken from wire services, such as the Associated Press or Reuters, was duplicated among the articles. This was a necessary step to ensure that frequency statistics would not be skewed. In addition, as texts that have the same wording are presented side by side, this program proved useful for discovering synonyms where word changes were made in texts that may have used the same wire services.
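The general idea of measuring text reuse between two articles can be sketched by comparing shared word sequences. This is a generic n-gram overlap measure offered as an illustration; it is an assumption on our part and not a description of how Wcopyfind works internally. The phrase length of six words and the toy sentences are likewise hypothetical.

```python
import re

def word_ngrams(text, n=6):
    """Set of all sequences of n consecutive word tokens in a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def reuse_ratio(doc_a, doc_b, n=6):
    """Share of doc_a's n-grams that also occur in doc_b (0.0 if doc_a is too short)."""
    grams_a = word_ngrams(doc_a, n)
    if not grams_a:
        return 0.0
    return len(grams_a & word_ngrams(doc_b, n)) / len(grams_a)

# Toy sentences, invented for this sketch: one word differs.
article_a = "officials said the quake struck off the coast early on Sunday"
article_b = "officials said the quake struck off the coast late on Sunday"
print(reuse_ratio(article_a, article_b))  # 0.5
```

Pairs of articles with a high ratio would be candidates for exclusion, while the points at which otherwise identical passages diverge (early versus late in the toy example) are where word changes, including synonym substitutions, become visible.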
3.5 Categorizing

After generating a wordlist in Scott’s (2006) Wordsmith and eliminating closed sets and function words, the remaining 3,946 tokens were categorized. Initially, with the exception of verbs, they were categorized by question words such as: who, what, when, where, how, and how much, simply as a means of shortening what would have been very long lists had parts of speech been used. Afterwards, related words were grouped and then a topic assigned to them. This resulted in 64 topics.
Table 1. Sample of categorization by topic

death/injury    reactions to tsunami    relationships
bodies          awful                   boyfriend
body            bad                     brother
burials         cataclysm               child
buried          catastrophic            children
casualties      disbelief               daughter
condolences     eerie                   family
corpse          horrendous              husband
The majority of word groups were not difficult to create categories for, and subsequent words, in general, fell naturally into them since they treat the same topics concerning the disaster (see Table 1). However, polysemous items such as cover and relations were problematic. No attempt was made in most instances to investigate their context at that time as it appeared that the number of candidate synonyms was going to be plentiful.
3.6 The selection of candidate synonyms

After categorizing the lexemes, the candidate synonyms were intuitively selected and color-coded within each topic. Since colors cannot be presented here, they have been bulleted. This particular category is one of the shortest lists but, compared to others, returned the highest ratio of possibilities (see Table 2).
      Table 2. Coast category of candidate synonyms
      • seafront
      • seashore
      • shore
      • shoreline
      • coastline
      • shores
      • coastlines
      • coasts

3.7 Findings from the tsunami corpus

As the list had now become more manageable, candidate pairs and groups were subsequently arranged by part of speech. Each pair or group of words was then compared as to their collocations and their semantic preferences. A sample is presented in Table 3.
      Table 3. Sample list of candidate synonyms by part of speech
      After investigating all the candidate pairs and groups, the final selection was made. The overall results are presented in Table 4.
      Table 4. Results from final list of candidate synonyms

      Part of Speech     Number of Words
      Total Items
      Table 5 presents the final results of the research with respect to the use of the most frequent synonym first in the same article. The first column is the pair or group of synonyms in order of frequency that appear in an article describing the same event or circumstance. The second column is the number of texts in which they appear and the third is the number of times the most frequent synonym is used first.
      Table 5. Results of the frequency of synonyms used in the same text

      Synonyms (*includes morphemes)              Pairs/groups in texts   Most frequent used first   Percentage

      Adjectives
      awful/terrible                              1                       0                          0%
      deadly/lethal                               1                       1                          100%
      devastated/stricken                         3                       3                          100%
      giant/massive                               3                       0                          0%
      giant/massive/immense/enormous              2                       1                          50%
      high/tall                                   1                       1                          100%
      huge/colossal/massive                       1                       0                          0%
      huge/giant/enormous                         1                       1                          100%
      huge/giant                                  1                       1                          100%
      huge/immense                                1                       1                          100%
      huge/massive                                4                       2                          50%
      large/big                                   2                       2                          100%
      largest/biggest                             1                       1                          100%
      lucky/fortunate                             1                       1                          100%
      massive/enormous                            1                       1                          100%
      massive/gargantuan                          1                       0                          0%
      massive/immense                             1                       1                          100%
      overpowering/overwhelming                   1                       1                          100%
      powerful/mighty                             1                       0                          0%
      poignant/heartbreaking                      1                       1                          100%
      safe/unharmed                               1                       1                          100%
      Total adjectives                            30                      20                         67%

      Nouns
      seafront/oceanfront                         1                       0                          0%
      streets/roads                               1                       0                          0%
      tourists/travelers                          1                       0                          0%
      tsunami/tidal wave*                         30                      25                         83%
      Total nouns                                 68                      42                         62%

      Verbs
      (BE) killed/perished                        1                       0                          0%
      could/might                                 8                       5                          63%
      cried/keened                                1                       1                          100%
      died/(BE) killed                            16                      6                          50%
      died/(BE) killed/perished                   1                       0                          0%
      flooding/inundating                         1                       1                          100%
      found/discovered                            1                       1                          100%
      got/received                                3                       1                          33%
      happen/occur                                2                       1                          50%
      happened/occurred                           9                       4                          44%
      hit/struck                                  12                      6                          50%
      said/reported                               9                       9                          100%
      said/told                                   15                      14                         93%
      said/told/reported                          7                       6                          86%
      sent/dispatched                             1                       0                          0%
      started/began                               2                       2                          100%
      (had) started/begun                         1                       0                          0%
      triggered/caused                            6                       4                          67%
      warned/cautioned                            1                       1                          100%
      weeping/sobbing                             1                       1                          100%
      Total verbs                                 98                      66                         67%

      Adverbs
      almost/nearly                               1                       1                          100%
      probably/likely (adj.)                      1                       0                          0%
      Total adverbs                               2                       1                          50%

      Prepositions
      near/close to (+place)                      2                       2                          100%
      under/beneath                               4                       0                          0%
      under/underneath/below                      1                       1                          100%
      Total prepositions                          8                       4                          50%

      Totals                                      206                     133                        65%
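      The tally summarised in Table 5 (in how many texts a synonym set co‑occurs, and in how many of those the most frequent member comes first) could be sketched as follows. This is only an illustration under simplifying assumptions: the function names and the three sample texts are invented, and word forms are matched exactly, whereas the study's counts for starred items also include related morphological forms.

```python
import re

def first_used(text, synonyms):
    """Return the synonym occurring earliest in `text`,
    or None if fewer than two members of the set occur."""
    lowered = text.lower()
    positions = {}
    for word in synonyms:
        match = re.search(r"\b" + re.escape(word) + r"\b", lowered)
        if match:
            positions[word] = match.start()
    if len(positions) < 2:
        return None  # no choice between synonyms was exercised in this text
    return min(positions, key=positions.get)

def tally(articles, synonyms_by_frequency):
    """`synonyms_by_frequency` lists the set with the most frequent item first.
    Returns (texts containing the set, texts where that item comes first)."""
    texts = hits = 0
    for article in articles:
        first = first_used(article, synonyms_by_frequency)
        if first is None:
            continue
        texts += 1
        if first == synonyms_by_frequency[0]:
            hits += 1
    return texts, hits

# Invented sample texts, not taken from the tsunami corpus.
articles = [
    "The wave hit the coast at dawn. It struck two villages.",
    "The tsunami struck without warning and later hit the harbour.",
    "The wave hit the pier.",
]
print(tally(articles, ["hit", "struck"]))
```

      The third sample text contains only one member of the pair, so it is skipped rather than counted as a case where the most frequent synonym comes first.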
      The total shows that overall in this corpus, 65% of the time when a choice was available between or among synonyms, the most frequent was used first in describing the same event or circumstances in the same newspaper article. In the following section, we test these statistics to see whether the results could be due to random chance.

      3.8 Probability measurement

      Concerning probability measurements, in Trust the Text, Sinclair states:

      At present the only available measure of significance is to compare the frequency of a linguistic event against the likelihood that it has come about by chance (Clear 1993). Since language is well known to be highly organized, and each new corpus study reveals new patterns of organization, a relationship to chance is not likely to be very revealing. (Sinclair [1997] 2004, p. 29)

      Sinclair makes a valid point here, discussing word choice mainly within the context of collocation, which is certainly relevant in this study. Nevertheless, what we are looking at here regarding frequency is not so much a pattern of collocation but a pattern of synonymous word choice. Therefore, in this particular case, discovering whether this synonymous pattern of frequency is statistically significant is of interest as related to the frequency effect.
      3.8.1 One‑tailed binomial test

      In order to ascertain the probability of whether or not the percentages returned are the result of random chance, a one‑tailed binomial test was performed.[4] By way of a brief explanation, a two‑tailed binomial test will test for the probability of a relationship going in two directions (significantly more than a particular hypothesized mean or significantly less than that hypothesized mean). As this study is not concerned with whether a result is significantly less than a hypothesized mean frequency, we use the one‑tailed binomial test, which only tests for one direction.

      4. I am indebted to Stefan Th. Gries at the University of California, Santa Barbara for suggesting and performing this binomial statistical analysis.

    The data presented here meet the requirements of a binomial distribution test:
    • The experiment consists of repeated trials.
    • Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a failure.
    • The probability of success, denoted by P, is the same on every trial.
    • The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.

    3.8.2 Results of one‑tailed binomial distribution test

    Adjectives: There are 30 adjective tokens, and for 20 of them the most frequent is first. According to a one‑tailed binomial test, that is significantly more often than the chance frequency of 30:2=15 would lead you to expect: 0.04936

    Nouns: There are 68 noun tokens, and for 42 of them the most frequent is first. According to a one‑tailed binomial test, that is not significantly more often than the chance frequency of 68:2=34 would lead you to expect: 0.98077

    Verbs: There are 98 verb tokens, and for 66 of them the most frequent is first. According to a one‑tailed binomial test, that is very significantly more often than the chance frequency of 98:2=49 would lead you to expect: 0.00349

    Adverbs: There are 2 adverb tokens, and for 1 of them the most frequent is first. According to a one‑tailed binomial test, that is not significantly more often than the chance frequency of 2:2=1 would lead you to expect: 0.75

    Prepositions: There are 8 preposition tokens, and for 4 of them the most frequent is first. According to a one‑tailed binomial test, that is not significantly more often than the chance frequency of 8:2=4 would lead you to expect: 0.63671

    Total: There are 206 pairs or groups, and for 133 of them the most frequent synonym is first. According to a one‑tailed binomial test, that is not significantly more often than the chance frequency of 206:2=103 would lead you to expect: 1.751893

    The binomial distribution tests show that only in the case of the verbs and adjectives (though only just) was the outcome not due to random chance. Nonetheless, we need to take into account that the results for the adverbs and prepositions are based on comparatively few examples. Importantly, if it were expected that the most frequent synonym would be used first, then the result for the nouns would be statistically significant, with a value of 0.03405934 (sum(dbinom(42:68, 68, 0.5)); Stefan Th. Gries, personal communication).
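    For readers who wish to recompute tail probabilities of this kind, the following Python sketch mirrors the R expression sum(dbinom(42:68, 68, 0.5)) quoted above: it sums the binomial probabilities from the observed count up to the number of trials, with a chance probability of 0.5. The function name and the dictionary of counts are illustrative assumptions; the counts are the totals from Table 5, and the values printed by the sketch are not guaranteed to coincide with every figure reported above.

```python
from math import comb  # available from Python 3.8

def upper_tail(successes, trials, p=0.5):
    """One-tailed (upper) binomial probability P(X >= successes)
    for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# (most frequent used first, pairs/groups in texts), taken from Table 5
counts = {
    "adjectives": (20, 30),
    "nouns": (42, 68),
    "verbs": (66, 98),
    "adverbs": (1, 2),
    "prepositions": (4, 8),
    "total": (133, 206),
}
for label, (first, n) in counts.items():
    print(f"{label:13s} {first:3d}/{n:<3d} p = {upper_tail(first, n):.6g}")
```

    Under these assumptions the adjectives give about 0.0494, the adverbs exactly 0.75, the prepositions about 0.6367 and the nouns about 0.034, while the pooled total of 133 out of 206 gives an upper‑tail probability on the order of 1e‑05.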
    3.8.3 Pragmatic association

    It is of some interest to note that the order in which the news is presented can affect sequencing. For example, the word bodies is more frequent than corpses, yet out of the six instances where they are used in the same text, bodies is never used first. Stubbs (2001), in his investigation of body and corpse, finds that while body is a neutral term, corpse occurs with more unpleasant terms. The examples he provides are:

    – Lenin’s body lay in state
    – a body was washed up on the beach
    – the corpse was barely recognizable
    – the corpse was found floating in the river (p. 37)

    In the tsunami corpus, bodies is also usually neutral, often simply preceded by a number, whereas we find corpses co‑occurring with bloated, decomposing and stench. This would support Hoey’s (2005) claim that as well as semantic association, words are also primed for pragmatic association, although the boundary between the two is not clear‑cut (p. 26). This pragmatic feature may be one explanation for why the most frequent synonym was not used first and would be of interest to explore in future studies.

    Although we must remain cautious so as not to overstate our case until further research can be carried out, these results are promising in that they coincide with results of semantic priming tasks while simultaneously lending support to Hoey’s lexical priming theory. We can hypothesize from the results that one reason for the frequency effect is that we store and often retrieve first the words we have come across most frequently in certain contexts. This result also sheds light on lexical priming because not only are the most frequent words often retrieved first, but they must be retrieved with their textual (contextual) collocations in order not to change the meaning of what the writers want to express.
  4. Conclusion

We began this study with a brief introduction to semantic priming and Hoey’s claims concerning lexical priming. We then looked at synonyms from two different corpus‑based perspectives. In the section on functions, we saw how the choice of a synonym can be dependent on a number of the same factors that influence our choice of any lexical item such as collocation, colligation, genre and semantic association. While exploring those factors, we applied the claims made by Hoey in order to add explanation to description. We also noted that based on experiments with lexical decision tasks carried out in psycholinguistics and research carried out by Hoey, it would appear that we store and retrieve a word along with its associations apropos the situation and context we have repeatedly encountered it in before.

In the quantitative section, the chief focus was upon the frequency effect. This is where replicated experiments in psycholinguistics have found compelling evidence of a strong association between retrieval time and semantic similarity or items that are ranked by frequency at the word, phrase, and sentence level. The data from the tsunami corpus provide further evidence that when we are dealing with the same topic, context, and genre, the strong tendency is that the most frequent synonym will be retrieved first. However, it must be noted that the absolute amount of evidence for this is quite small, and more research needs to be done. If the current pattern of results could be replicated, this would provide stronger evidence for the way in which we process language.

Just a year after the publication of Lexical Priming, Stubbs wrote,

[Corpus linguistics] has the empirical data and the hermeneutic methods to try out some new approaches to long‑standing problems, and should therefore try to move from descriptive to explanatory adequacy, and indeed to rethink what such explanation might look like. (2006: 34) (emphasis is the author’s)

Lexical priming is both supported by and adds support to experiments performed in the field of psycholinguistics. Research demonstrates that there are grounds for proposing Hoey’s promising and serviceable theory as one that can begin to fill the existing gap in corpus‑based studies between description and explanation.


Adelman, J. S. et al. 2006. Contextual diversity, not word frequency, determines word‑naming and lexical decision time. Psychological Science 17: 814–823.

Arnon, I. & Snider, N. 2010. More than words: Frequency effects for multi‑word phrases. Journal of Memory and Language 62: 67–82. doi: 10.1016/j.jml.2009.09.005

Barnbrook, G., Mason, O. & Krishnamurthy, R. 2013. Collocation: Applications and Implications. Basingstoke: Palgrave Macmillan.

Bawcom, L. 2010. What’s in a Name? The Functions of Similonyms and Their Lexical Priming for Frequency. PhD dissertation, University of Liverpool.

Bod, R. 2000. The storage and computation of three word sentences. Paper presented at Architectures and Mechanisms of Language Processing conference, Leiden, The Netherlands.

Bod, R. 2001. Sentence memory: storage vs. computation of frequent sentences. Paper presented at CUNY 2001, University of Pennsylvania, Philadelphia, PA.

Branigan, H.P., Pickering, M.J., Liversedge, S.P., et al. 1995. Syntactic priming: Investigating the mental representation of language. Journal of Psycholinguistic Research 24(6): 489–506. doi: 10.1007/BF02143163

Clear, J.H. 1993. From Firth principles – Computational tools for the study of collocation. Text and Technology. In Honour of John Sinclair, M. Baker, G. Francis, & E. Tognini‑Bonelli (eds), 271–92. Amsterdam: John Benjamins. doi: 10.1075/z.64.18cle

Collins, A.M. & Loftus, E.F. 1975. A spreading‑activation theory of semantic processing. Psychological Review 82: 407–428. doi: 10.1037/0033-295X.82.6.407

Conklin, K. & Schmitt, N. 2008. Formulaic sequences: are they processed more quickly than non‑formulaic language by native and nonnative speakers? Applied Linguistics 29(1): 72–89. doi: 10.1093/applin/amm022

Conklin, K. & Schmitt, N. 2012. The processing of formulaic language. Annual Review of Applied Linguistics 32: 45–61. doi: 10.1017/S0267190512000074

de Bot, K. 1992. A bilingual production model: Levelt’s “speaking” model adapted. Applied Linguistics 13: 1–25. doi: 10.1093/applin/13.1.1

Dimich, M. & Goodside L. (Dir). 2008. The Ellen Degeneres Show. New York NY: National Broadcasting Company.

Edmonds, P. 1999. Semantic representations of near‑synonyms. Unpublished thesis, University of Toronto.

Faigley, L. (ed.) 2005. The Penguin Handbook. New York NY: Pearson Education.

Firth, J.R. [1951]1957. A synopsis of linguistic theory, 1930–1955. In Selected Papers of J.R. Firth 1952–59, F. Palmer (ed.), 168–205. Bloomington IN: Indiana University Press.

Fowler, H., Aaron, J. & Okoomian, J. (eds). 2007. The Little Brown Handbook. New York NY: Pearson Longman.

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.

Keeble, R. 1998. The Newspapers’ Handbook, 2nd edn. London: Routledge.

Kess, J.F. 1992. Psycholinguistics: Psychology, Linguistics, and the Study of Natural Language [Current Issues in Linguistic Theory 86]. Amsterdam: John Benjamins. doi: 10.1075/cilt.86

Miller, G.A. & Charles, W.G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6: 1–28. doi: 10.1080/01690969108406936

Murray, W.S. & Forster, K.I. 2004. Serial mechanisms in lexical access: The Rank Hypothesis. Psychological Review 111(3): 721–756. doi: 10.1037/0033-295X.111.3.721

Partington, A. 1998. Patterns and Meanings [Studies in Corpus Linguistics 2]. Amsterdam: John Benjamins. doi: 10.1075/scl.2

Reali, F. & Christiansen, M.H. 2007. Processing of relative clauses is made easier by frequency of occurrence. Journal of Memory and Language 57: 1–23. doi: 10.1016/j.jml.2006.08.014

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

Sinclair, J. [1997]2004. Trust the Text: Language, Corpus, and Discourse. London: Routledge.

Stubbs, M. 1996. Text and Corpus Analysis. Oxford: Blackwell.

Stubbs, M. 2001. Words and Phrases. Oxford: Blackwell.

Stubbs, M. 2006. Corpus analysis: The state of the art and three types of unanswered questions. In System and Corpus: Exploring Connections, G. Thompson & S. Hunston (eds), 15–36. London: Equinox.

Tremblay, A., Derwing, B.L., Libben, G. & Westbury, C. 2011. Processing advantages of lexical bundles: Evidence from self‑paced reading and sentence recall tasks. Language Learning 61: 569–613. doi: 10.1111/j.1467-9922.2010.00622.x


Bloomfield, L. 2004. Wcopyfind 2.6. Charlottesville VA: University of Virginia. www.plagiarism. (18 June 2006).

Davies, M. 2008. Corpus of Contemporary American English (COCA). www.americancorpus.org/

Scott, M. 2006. Wordsmith Tools, Version 4. Oxford: OUP.

The British National Corpus, Version 2 (BNC World). 2001. Distributed by Oxford University Computing Services on behalf of the BNC Consortium.

Lexical priming and metaphor – Evidence of nesting in metaphoric language

Katie J. Patterson

University of Eastern Finland, Universidad Austral de Chile

Metaphoricity is often regarded as a distinctive linguistic phenomenon, in opposition to literal, or non‑figurative language. Recent research from a corpus‑linguistic perspective has begun to show, however, that such a dichotomist stance to metaphor does not bear scrutiny (Deignan 2005; Partington 2006; Philip 2011). Our ability to manipulate or bend the limits of linguistic conventions (semantically, lexically, grammatically) in order to cope with communicative demands is one area where this dichotomy does not hold up. The focus of this chapter is to explore a nesting (cf. Hoey 2005) pattern of grew that is specific to its use in metaphoric contexts, and compare this to its absence in non‑metaphoric contexts. The data are taken from a 49m‑word corpus of nineteenth century writings. The findings go some way to suggesting that as a metaphor, grew is qualitatively a different lexical item when compared to its non‑metaphoric use(s). It is proposed that Hoey’s (2005) Drinking Problem Hypothesis can account for these lexical differences, providing a psychological explanation for what drives us as language users to identify metaphor. Crucially, adopting lexical priming as a means to exploring metaphor shifts the perspective of metaphoricity to the individual language user: the findings show that a metaphoric sense of an item appears to be dependent on the primings activated in a reader. It can thus be argued, based upon the lexical priming approach, that metaphoricity is inherent in the language user rather than the language itself, and that its manifestation is often dependent on the individual’s interpretation of the language.

  1. Introduction
    Creativity in language, such as metaphor, is often thought of as a free act of expression, but while this may be true to some extent, the expressive effect of that choice of language is diminished if it does not retain meaning for the user. A creative exploitation is discussed by Hoey as “the result either of making new selections from a semantic set for which a particular word is primed or of overriding one or more of one’s primings” (2008: 16). When that initial exploitation (or metaphor) becomes conventionalised, it begins to adopt its own lexical patterns or primings.

    doi 10.1075/scl.79.06pat © 2017 John Benjamins Publishing Company
    This study focuses on the lexical characteristics of metaphor and the idea that metaphoric language must operate within a set of conventions which allow us to recognize it as such. The framework of the Lexical Priming theory (Hoey 2005) offers up this claim.

    The last decade has seen researchers follow a trend of more usage‑based approaches to metaphor study, drawing their methods and theories from the field of corpus linguistics (Deignan 2005; Koller 2006; Partington 2006; Deignan & Semino 2010). The introduction of corpus linguistics has consequently created a shift away from the earlier dichotomist stance involved in metaphor theories to more usage‑driven issues, based on sociolinguistic and interpersonal contexts in which metaphors are used (Deignan, Littlemore & Semino 2013). Rather than isolated examples, corpus linguistics provides the resources to focus on repeated patterns and recurrent instances of metaphor, which are, by their nature, successful uses of the language. This focus on repetition of language use also addresses convention in metaphoric behaviour: when a metaphor is re‑used often enough, it comes to be an expected use of the language, with its own conventional behaviours and patterns. This stands in opposition to the highly original and creative exploitation of truly novel metaphors (see Svanlund 2007 for an account of conventionality).

    This research focuses on the use of a single item in both metaphoric and non‑metaphoric contexts, in order to explore what it is that allows us to recognize when language is being used metaphorically, as opposed to non‑metaphorically. By focusing on meaning within a Neo‑Firthian framework, this research aims to re‑focus discussions of metaphor within the wider discourse field, taking into consideration context, pragmatic meaning, the individual’s mental lexicon, and subsequently what role these factors play in interpreting metaphoric meaning.
    The first aim of this chapter, then, is to explore what conventional metaphoricity means and the ways in which metaphoricity is manifest in the language, as revealed through a corpus approach.

    The second, more specific aim of the research is to explore the extent to which the theory of lexical priming can account for our ability to recognize conventional metaphoric instances of an item or phrase, in contrast to non‑metaphoric instances. The theory successfully accounts for the lexical characteristics and patterns of use associated with our use of language in both spoken and written language within particular domains, but little attention has been paid to figurative language. Research focusing on polysemy (Hoey 2005; Tsiamita 2009) shows that two distinct senses of a word or item tend to avoid each other’s primings (as claimed in Hoey’s Drinking Problem Hypothesis 2005, explained in Section 2). Together with a corpus linguistic methodology, Hoey’s theory is adopted as a theoretical tool for analysing metaphoric language. In relation to a pervasive but less dichotomist phenomenon such as metaphor, lexical priming may provide an explanation for what drives us as language users to identify creativity in metaphoric language use. If primings are shown to differentiate metaphoric uses of a word or phrase from non‑metaphoric or literal uses, then this would suggest that metaphor is not as unrestricted and ambiguous as we often claim. The introduction of an extended theory involving our psychological associations with language could possibly offer an explanation for how we recognise conventional norms, metaphoric as well as literal.

    These two research aims are explored by means of a case study: a corpus‑driven lexical analysis of a keyword[1] (grew) identified in a 49‑million‑token corpus of nineteenth century writing, when compared against a general comparator, the British National Corpus. The findings are taken from a larger study aimed at comparing and contrasting the lexical behaviours of metaphoric and non‑metaphoric instances of language within the constraints of a genre‑ and time‑specific corpus. Within the present chapter, the specific aim is to highlight an example of nesting associated with metaphoric instances of the word grew and avoided by non‑metaphoric instances of the same item. The findings suggest that the lexical behaviours (and subsequently the senses) of the two uses of grew are distinct from each other. Thus they can be said to avoid each other’s primings. This consequently supports the idea that metaphoric senses have, to an extent, a fixed set of choices in terms of grammar and lexis (Deignan 2005). Lexical Priming (Hoey 2005) reveals that metaphor, like polysemy, can be characterised by the presence of regular patterns which are avoided by literal senses. The implication is that metaphor can be seen operating on the same cline as polysemy, and that analysis of lexical primings can help to identify metaphor from a lexical perspective.
  2. Theoretical background
    2.1 Metaphor, creativity and corpus linguistics

    Creativity, linguistically, is itself defined by Sampson as occurring when a product commonly falls “outside any class that could have been predicted on the basis of previous instances of the activity in question, and yet the innovation, once it exists, is recognized as in some way a valid or worthwhile example of that activity” (Sampson 2013: 4).[2] In this sense then, part of a metaphor’s inherent quality is that it overrides an expected use of the language. Carter (2004) claims that creative language “inheres in the degrees to which it departs or deviates from expected patterns of language and thus defamiliarises the reader” (Carter 2004: 58). It is this notion of deviance which often remains central to a lexical analysis of metaphor (Philip 2011; Hanks 2013). Steen (2009) states that metaphors are accurately considered “a form of linguistic deviation at the semantic level, which are used to create foregrounding effects” (Steen 2009: 87).

      1. As defined by Scott (2008) and calculated using his software Wordsmith 5.
      2. He gives the analogy of a creative painter differing from a technically accomplished one because he produces canvases that deviate in some way from the stylistic norms established by earlier artists. (Sampson 1979: 101–107).

    The exploitation or deviation from a linguistic norm, which is often considered inherent in metaphoric language, cannot occur without a collectively accepted ‘normal’ or expected way of using language. Working in the field of philosophy of language, Wittgenstein claimed that the meaning of a word or phrase is nothing other than the set of informal rules governing the use of the expression in actual life (Wittgenstein [1922] 2014). Wittgenstein emphasised the idea that language itself can only be understood as a practice, and that meaning is developed through social situations and interaction. This co‑operation is what governs the expected conventions of usage. It also means, crucially, that meaning has the ability to subtly shift according to the subjective understanding of the language users and their circumstances of use. Philosophers of language working within this tradition claim that this openness and subjectivity is what reinforces socialisation amongst individuals. Speakers, as collective individuals, become members of a society and it is the creation of this community which monitors the collective uses of language (cf. Habermas 1990; Gadamer 2004). From this perspective, language (whether figurative or non‑figurative) is a social tool, and repetitive patterns of use are adopted to conform, or can be avoided to create novel and new expressions (cf. Gibbs 1994).
    Creativity is often defined as a breaking of particular linguistic norms and conventions and as a result is thought of as a largely free act of expression, but while this may be true to some extent, the expressive effect of that choice of language is diminished if it does not retain meaning for the user. Philip (2010) claims of creative language that there is a “requirement of expressing unique, unrepeatable meanings by means of a syntax and vocabulary which must retain a high level of rigidity so that the texts can be understood by the users of language” (Philip 2010: 151). In less conventional instances of metaphor, language is often granted a less conforming ‘level of rigidity’, either in terms of the grammatical or semantic relationships, but it must still retain enough linguistic conventionality (grammatically, lexically, pragmatically) to be understood by the receiver. The focus of this research is on the conventions which govern both metaphoric and non‑metaphoric uses of language.

    J.R. Firth’s contextual theory of meaning argues that meaning is not situated within the isolation of an item itself, but inextricably tied to its place in both co‑text and context. Researchers then must look to exploring a wide range of lexical characteristics, involving grammar and lexis, but also more secondary or abstract aspects of meaning such as semantic and pragmatic association; an exploration of metaphoricity must take into account the variety and intricacy of meaning manifest in co‑text and context. Corpus linguistics offers this opportunity.

    Whilst metaphor has remained central to many cognitive, philosophical, literary, and linguistic theories of language, its role and consequently its interpretation in each of these spheres has shifted considerably in various directions. What remains paramount in most theories is the well‑rooted acknowledgement that metaphor is creative in its design and use. Black’s (1993) influential account of metaphor and philosophy formed the basis for the Interactionist approach – the idea that metaphor actually creates insight or new meaning. The primary subject in a metaphor, he claims, is coloured by a set of “associated implications” normally predicated on the secondary subject (Black 1993: 28). Ricoeur (2003) claims that metaphor revives our perception of the world, through which we become aware of our creative capacity for seeing the world anew. Similarly in literature, metaphor is assigned to the “literary lexicon” (Carter 2004), with the notion of deviance remaining central to literary scholars working with metaphor within the formalist tradition (Nowottny 1965; Leech 1969; Short 1996). Leech (2008) stresses that these deviations from the accepted code in literature are unique and meaningful rather than “unmotivated aberrations”, describing them as a “semantic absurdity” (Leech 2008: 16).
    2.2 Lexical priming and the Drinking Problem Hypothesis

    As an approach to analysing metaphor, lexical priming may be able to account for the distinction between literal and metaphoric senses of a word or phrase from a psychological perspective. Specifically, an outcome of the theory entitled the Drinking Problem Hypothesis offers up this potential. The hypothesis centres on the assumption that different word senses will avoid the patterns associated with the other sense(s) of that word for which we are primed, in order to avoid ambiguity. The name Drinking Problem Hypothesis comes from a scene in the 1980 film Airplane! outlined in Hoey (2005), in which the phrase ‘drinking problem’ is used humorously to refer to the difficulty a man has in getting liquid to his mouth. The play is on the connotations transferred from ‘drinking problem’ to the more practical issue of ‘a problem drinking’.

    These patterns, avoided by another sense of a word, take the form of collocations, colligations and semantic associations amongst others. The formulation of the claims of Lexical Priming regarding polysemy was based on polysemous nouns with two or more abstract senses each (e.g. consequence, reason, immunity).
      The hypothesis is further supported by a study of the polysemous senses by Tsiamita (2009), in which it was found that different senses of each of the words drive (‘journey’ and ‘a private road’) and face (both abstract and concrete uses) have different sets of primings.[3] The implication is that metaphoric senses will also avoid the patterns (or primings) of the literal sense(s), since a metaphor and its literal counterpart might reasonably be regarded as a special case of polysemy. This would suggest that metaphors are characterised, and therefore identified to some extent, by their avoidance of literal use primings. Deignan (2005) reverses this idea in relation to metaphor:

      It is possible that when a metaphorical mapping first takes place, a linguistic expression becomes ambiguous between literal and metaphorical. Eventually the regular association of the expression with its metaphorical meaning means that speakers start to avoid using it with a literal meaning. (Deignan 2005: 212).

      In Hoey’s own words creative exploitation is discussed as “the result either of making new selections from a semantic set for which a particular word is primed or of overriding one or more of one’s primings” (2008: 16). Thus we can talk of ‘overriding’ one’s primings in relation to metaphor use: accordingly, it is when a metaphoric sense becomes well used, or conventionalised, that readers may start to be primed to associate certain collocations, colligations, semantic, pragmatic and textual associations with the metaphoric sense. These primings in turn will become strengthened the more established the metaphoric sense is, and thus more removed from the non‑metaphoric sense.

      The hypothesis can be approached in relation to metaphor by testing the three sets of the lexical priming claims. Lexical characteristics can be explored in relation to co‑textual, contextual and text‑linguistic features of both senses of a lexical item.
    In a study of keyword items in 19th Century fiction (Patterson 2015, 2016), support was given for some of these lexical priming claims: the more conventionalized uses of an item as a metaphor displayed stronger associations or primings than novel or original metaphors associated with that item. Furthermore, metaphoric and non‑metaphoric instances of a given item were distinguishable by a range of linguistic features. Noun, adverb and personal pronoun collocates were shown to play crucial roles in the semantic and lexical distinctions of metaphoric and non‑metaphoric instances of kindle, for instance (Patterson 2016). The findings showed that when kindle is used as a verb in a metaphoric context, it could often be identified by a presence of collocations or colligations that were absent amongst the non‑metaphoric uses of the verb. It is also the case that more abstract levels of meaning (pragmatic association) can help to distinguish metaphoric and non‑metaphoric senses. It can be surmised from such previous studies that metaphoric and non‑metaphoric instances of an item or phrase behave in the same way as less ambiguous cases of polysemy, and that the primings described by Lexical Priming may provide the reader/listener with helpful signals to distinguish the senses.

  3. Hoey notes that lexical priming is a property of the person, not the word. When talking of words being primed to collocate, this is shorthand for saying that most speakers are primed for the words to collocate.
2. Lexical priming and nesting
Previous findings (Patterson 2015, 2016) explored the manifestation of primings associated with single items (kindle, flame), in terms of collocations, colligations, and semantic, textual and pragmatic associations. The studies showed that metaphoric instances of an item differed in these primings from their literal counterparts, and this in turn suggests that metaphors can be identified lexically, based on the lexical priming theory. The research also drew attention to the presence of primings within co-occurrence patterns. These are not fully determined by the primings of the individual words and are examples of what Hoey terms nesting (2005: 8–11). By way of example, collocational nestings consist of multi-layered patterns combining several lexical items (e.g. New York and Stock Exchange are collocates within the larger phrase New York Stock Exchange whilst York Stock is not, Frantzi & Ananiadou 1996), and colligational nesting is formed by multi-layered patterns that connect lexical items and grammatical elements (e.g. mass destruction, often modified by weapons of, Seretan 2011). In relation to metaphor, exploration of primings within nesting may provide a clearer picture of metaphor use by focusing on larger chunks of language and corresponding sets of metaphoric uses. This approach stands in opposition to a dichotomist view of a single metaphoric and a single non-metaphoric use of a given item.

This analysis focuses on a single co-textual linguistic feature associated with the metaphoric uses of the item grew (the motivation for choosing grew is outlined in Section 3 below). In particular, grew is shown to collocate with more to form longer chunks of language, which can be said to be evidence of nesting. The exploration of the lexical behaviour of more and grew and their nesting structures will be shown to reveal semantic associations and pragmatic associations, as well as more primary aspects of meaning (collocation, colligation), particular to the metaphoric uses of grew. The analysis is a snapshot of a larger investigation into the primings associated with grew in its metaphoric and non-metaphoric contexts.
3. Methodology
3.1 The corpus

The nineteenth century corpus consists of texts written by English authors between 1800 and 1899. In total, there are 416 texts with a running token size of 45,480,658. There are no more than two texts written by a single author, in order to gain as widely representative a collection as possible and to reduce the effect of any individual idiosyncrasy. Previous work has been undertaken on figurative language in English nineteenth century writing in the areas of corpus linguistics/stylistics (Mahlberg 2010, 2012), literary metaphor (Kimmel 2008) and cognitive stylistics (Barbera 1993; Stockwell 2002; Boghian 2009), making it a rich source for comparative and supporting research. Furthermore, focusing on the nineteenth century period allows scope for diachronic analysis of changes in metaphoric behaviour in more contemporary corpora. The BNC (written-fiction) will be used as a comparator corpus throughout the analyses, in order to determine any corpus-specific traits or behaviour.

More generally, the motivation behind choosing a time-restricted corpus largely centres on the theory of lexical priming. According to Hoey (2005) the theory is context dependent (including genre, situation, community etc.), thus any conclusions drawn from the analysis are bound to the type of text represented in the corpus. Partington (1998: 107–108) also suggests that one of the distinguishing features of genres is the types of metaphors that are found in them, which means that results from a genre-restricted corpus study cannot be generalized without qualifications. Thus by restricting the corpus to the nineteenth century, but accommodating as many genres and text types as possible, the findings can be said to be representative of the time period more generally.

WordSmith Version 5 (Scott 2008) is used to extract data from the corpus. An initial Keyword search identified words of unusually high frequency in the nineteenth century corpus in comparison with a more general and contemporary reference corpus (the BNC). The Keyword function (Scott 2008) measures the ‘keyness’ of items in one corpus relative to a larger reference corpus.4 Items with a significant ‘keyness’ appear more frequently than would be expected in one of the two corpora. The function was used to identify words which occur significantly more frequently in one corpus than the other. Grew was one of the words identified as significantly more frequent in the nineteenth century corpus. The larger research project demanded exploration of a range of word classes, and grew was chosen as the most suitable lexical verb amongst the keywords. A Wordlist of the corpus identified 3812 instances, and thus enough data to explore a range of lexico-grammatical features and their patternings and frequencies.
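As an illustration of the kind of comparison a keyword procedure performs, the following is a minimal sketch of one widely used keyness statistic, the log-likelihood ratio. It is an assumption that this statistic matches the settings used in the study, which are not stated here; the function names and the toy frequencies are hypothetical and are not taken from the corpora.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word across a study and a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def per_thousand(freq, size):
    """Normalised frequency per thousand words (PTW)."""
    return 1000 * freq / size

# Toy figures only (hypothetical): a word twice as frequent in the study corpus.
print(log_likelihood(20, 1000, 10, 1000))
print(per_thousand(20, 1000))
```

A larger value indicates a greater departure from equal relative frequency; a value of zero means the word is proportionally as frequent in both corpora.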
3.2 The metaphor identification process

The analysis and comparison of the lexical characteristics of metaphoric and non-metaphoric instances requires, in the first place, a methodological decision involving the classification of each instance as metaphoric or non-metaphoric. In order to be able to analyse the two groups statistically, they must be divided in such a way that they become, in effect, separate corpora. This entails the division of concordance lines into two clear groups of metaphoric and non-metaphoric instances. Whilst successful methods of identification exist for metaphor, such as MIP and MIPVU by the Pragglejaz group (Steen et al. 2010) and that of Cameron (2003), such procedures aim at objective classification based on criteria such as dependency and salience. The focus of the research at hand is on the readership and interpretations of metaphoricity. Moreover, the lexical priming theory places importance on the subjectivity of the individual’s interpretation and on the bearing this has on meaning.5 The decision of identifying metaphoricity was given over to nine individual readers or co-raters. Each of the raters read a selection of the 3812 concordance lines; the particular selection process means that every line was read by at least three and up to six individuals. These co-raters were not provided with any definition of metaphor and were not asked to give one themselves. Three participants have a background in linguistics but the other six do not. They were asked, without the aid of dictionaries, to decide whether a given word (in this case grew) was being used metaphorically within the context provided. Whilst such a methodology has drawbacks in establishing any form of clear-cut dichotomy between the two kinds of language use, it is important for the lexical priming theory that the metaphoricity is determined by readers on an individual basis, line by line, rather than decided upon categorically with the aid of reference works and definitions. Also, the decision to use the term ‘non-metaphoric’ rather than ‘literal’ is intended to reduce the dominance of a dichotomist stance between the two groups, and instead to see them as a set that displays metaphoric behaviours and a set that does not.

Concordance lines were all set to 120 characters in length. If not enough context was provided to permit a decision, the participants could check more co-text by clicking on the concordance line to reveal more text.6 Participants were given three options for categorization: metaphoric, non-metaphoric and unsure. Where there was discrepancy between any number of individuals, the concordance line was in any case placed in the unsure group, thus creating the assurance that all clearly identified metaphors have been unanimously agreed upon. One of the other aims of the larger research project was to accrue a middle group of not so clearly identifiable instances of metaphoricity, such as ambiguous or ‘problematic’ instances or weak or heavily conventionalized metaphors, in order to explore processes of conventionalisation of metaphoricity; in addition, the process also helps to keep the metaphoric and non-metaphoric datasets as clear and prototypical as possible. The analysis will discuss more or less metaphoric meaning and more or less non-metaphoric meaning, seeing these as “end-points on a scale, rather than absolutes”, a stance similarly adopted by Lindquist and Levin (2008: 145). The two sets of concordance lines are then treated as individual corpora and fed into WordSmith 5 (Scott 2008).

5. See Patterson 2016 for a full discussion on metaphor identification processes and the difficulty of defining metaphoricity from a neo-Firthian perspective.
6. A function of WordSmith 5 (Scott 2008).

The agreement rate for the individuals was surprisingly high. The first group of data consists of the clearly (and unanimously agreed upon) metaphoric uses of grew, which total 2863 instances and comprise over three quarters (75.10%) of the total data. The second group comprises the non-metaphoric uses of grew, which total 807 instances and make up 21.17% of the data. An example line from each group is given below:
    1. “But this was only for a moment, for the anguish came back and GREW apace, and I fell to thinking dismally of the plight…”
    2. “…and round it and on the wet patch of the roof above GREW a garden of ferns and other clinging plants. The weeks moved on…”
    The remaining instances, which were problematic, ambiguous, or not unanimously agreed upon as metaphoric or non-metaphoric (less than 4%), were left in the unsure group and will not be discussed here. The intention of the methodology was to retrieve lines of co-text immediately surrounding grew in order to analyse them for lexical behaviours and patterns. The analysis will focus on meaning associated with the nesting of grew and more, which is prominent in the metaphoric dataset only. Both of what are now called corpora (metaphor and non-metaphor) will be discussed.
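The unanimity rule described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to process the ratings: the label strings, function names and toy ratings are hypothetical, and each line is assumed to arrive as a list of its raters' labels.

```python
from collections import Counter

def classify_line(ratings):
    """Assign one concordance line to a dataset from its raters' labels.

    A line counts as metaphoric or non-metaphoric only if every rater
    chose that label; any disagreement (or any 'unsure') sends it to 'unsure'.
    """
    labels = set(ratings)
    if labels == {"metaphoric"}:
        return "metaphoric"
    if labels == {"non-metaphoric"}:
        return "non-metaphoric"
    return "unsure"

def dataset_shares(all_ratings):
    """Percentage of lines falling into each dataset."""
    counts = Counter(classify_line(r) for r in all_ratings)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Toy ratings for four lines, each read by three raters (invented data).
toy = [
    ["metaphoric", "metaphoric", "metaphoric"],
    ["non-metaphoric", "non-metaphoric", "non-metaphoric"],
    ["metaphoric", "non-metaphoric", "metaphoric"],
    ["metaphoric", "unsure", "metaphoric"],
]
print(dataset_shares(toy))
```

Applied to the reported counts, the same percentage arithmetic gives 2863 / 3812 and 807 / 3812 for the two unanimous groups, with the remainder forming the unsure group.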
  4. The study
    4.1 Grew more and more

    More was initially singled out in an analysis of the top ten collocates associated with each grew dataset, because of its status as a lexical word, in comparison to largely grammatical items. It is also specific to the metaphoric top-ten collocate list, where it is ranked fifth, occurring on average 15 times per thousand words. This can be compared to an average occurrence of 1.84 per thousand words in the non-metaphoric dataset, shown on the right-hand side of Table 1:
      Table 1. Rank and frequency of more as collocate of grew in both datasets

      Metaphor                                          Non-Metaphor
      Rank  Collocate  Freq. PTW  L Freq.  R Freq.      Rank  Collocate  Freq. PTW  L Freq.  R Freq.
      1     AND        69.08      795      1236         1     THE        37.34      297      292
      2     THE        58.67      1072     653          2     AND        23.26      154      213
      3     OF         21.84      328      314          3     OF         12.55      99       99
      4     AS         16.36      224      257          4     UP         10.97      5        168
      5     MORE       15.00      45       396          …
                                                        30    MORE       1.84       6        21
      The item is also much more fixed in its association with metaphoric instances of grew: 89.8% of all instances occur on the right of grew. The majority of these occur in positions R1 and R3: 198 (44.91%) and 116 (26.30%) instances respectively. Concordance examples of grew more are shown below in Figure 1:

      Figure 1. Random selection of grew more occurrences in metaphoric dataset
      The majority of adjectives following the collocation are related to emotion or abstract characteristics, showing that most uses of grew more relate to a change in temperament, state, or emotion. A large majority of the imagery associated with the adjectives on the right of the collocation is negative. This includes the items shown in the screenshot above: intolerable, irritating, languid and faint, loath, melancholy, miserable, nervous and pertinacious. Out of 250 adjectives following grew more, 197 (78.8%) can be described as negative in their pragmatic association, when viewed in context.

      Of the instances of the collocation grew more, 71 form part of the larger nesting structure grew more and more (+ adj.), where more fills both R1 and R3 positions simultaneously. In total 35.86% of R1 more collocates and 61.21% of R3 more collocates form part of the larger cluster grew more and more, which in turn colligates with an adjective. This priming is marked by its absence in the non-metaphoric dataset. Examples of the metaphors are shown below in Figure 2:

      Figure 2. Random selection of grew more and more + adj. in metaphoric dataset
      The adjectives are varied in their references; however, again there seems to be a unified notion of negative semantic prosody. On the left side of the cluster there is a variety of subjects; the majority are human (Job, Mrs Hadwin, Freddy, Mr Heathcliff, Tess, Jem, he and she). There are also abstract subjects (attention, atmosphere, burden, vigilance and husband’s affairs amongst others), which are not always in the same clause, and a small number of concrete subjects (country, face and light). Looking to the right of the cluster, the large majority of the adjectives are clearly negative. This is reflected in the screenshot above (e.g. extravagant, formidable, fidgety, fretful, harsh, irritating, nervous, oppressive and peremptory). In total 37 out of 98 (37.76%) of the adjectives following the colligation grew more and more are negative in their pragmatic association. More specifically, the sample above suggests that the colligational nesting grew more and more (+ adj.) is used in relation to a negative change in a character’s temperament or a situation (e.g. grew more and more nervous). The repetition more and more also suggests a slow development rather than a sudden one. This reflects a gradual, organic development associated with growth, semantically linked to the non-metaphoric meaning of animal or plant growth.

      The analysis has thus begun to demonstrate preferences, semantically and pragmatically, associated with grew and more when used metaphorically. The role of all intensifiers used alongside the adjective collocates shall now be explored. This will determine whether it is the collocation and the subsequent nesting that is particular to the metaphoric use, or a more general colligation (grew + intensifier + adj.).
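The positional counts reported in this section (left versus right of grew, and positions R1 and R3) can be sketched as below. This is only an illustration of the counting principle, not a reproduction of WordSmith's collocate routine: the whitespace tokenisation, the five-word span, the function name and the example lines are all assumptions made for the sketch.

```python
from collections import Counter

def position_counts(lines, node="grew", collocate="more", span=5):
    """Tally where the collocate falls relative to the node, e.g. 'R1' or 'L2'."""
    counts = Counter()
    for line in lines:
        # Crude tokenisation: split on whitespace, trim punctuation, lowercase.
        tokens = [t.strip('.,;:!?"\'()').lower() for t in line.split()]
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens) and tokens[j] == collocate:
                    side = "R" if offset > 0 else "L"
                    counts[f"{side}{abs(offset)}"] += 1
    return counts

# Invented example lines, not drawn from the corpus.
lines = [
    "He grew more and more nervous.",
    "She grew more silent.",
    "More and more he grew restless.",
]
counts = position_counts(lines)
print(counts)
right = sum(n for pos, n in counts.items() if pos.startswith("R"))
print(100 * right / sum(counts.values()))  # share of collocates to the right
```

A line in which more is counted at both R1 and R3 with and between them corresponds to the nesting grew more and more discussed above.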
    4.2 Grew less and less

    Table 2 summarises intensifiers in each dataset acting as collocates:

      Table 2. Intensifier collocates in both datasets

      Metaphor                               Non-Metaphor
      R   Collocate  Freq.  Freq. PTW        R   Collocate  Freq.  Freq. PTW
      1   MORE       481    15               1   MORE       29     3.84
      2   VERY       103    3.5              2   VERY       17     1.08
      3   LESS       57     1.94             3   MOST       9      0.57
      4   MUCH       24     0.82

      The second most frequent intensifier collocate in both sets is very, occurring 3.5 times per thousand words in the metaphoric set and 1.08 times in the non-metaphoric set. This highlights that more is unusually frequent in its metaphoric use in comparison both to the non-metaphoric set and to other intensifiers. Less occurs in the metaphoric data 57 times as a collocate of grew. It is ranked 50th (according to WordSmith’s collocate list) and occurs 1.94 times per thousand words. 50 out of 57 instances (87.71%) occur on the right of grew: 21 of these (42.00%) occur in R1 position whilst 11 (22.00%) occur in R3. Whilst grew less is also possible in a non-metaphoric context (i.e. at a slower rate), the collocation is most often found in the colligational structure grew less (+ adj.). Of all instances of grew less, 86.33% appear in the metaphoric data, and 81.00% of these appear in the structure grew less (+ adj.). Grew is used in the sense of ‘becoming’ in many of these instances, whereby the focus of the clause or phrase is on the adjective rather than the verb, growing. This was shown to be a key difference between the metaphoric and non-metaphoric instances of grew generally (Patterson 2016). Instances are shown below in Figure 3:

      Figure 3. Selection of instances of grew less in metaphoric dataset
      Often the items are related to abstract traits in reference to a character, their utterance or action (constrained, embarrassed, speculative, unpleasing). Many of the adjectives describe a concrete thing (dry in relation to a throat, shaky in relation to a hand). Despite this, grew is still not often used in a physical sense, but rather as a form of development or transformation. Grammatically speaking, in the majority of instances of grew less + adjective, grew can be replaced with became.

      There are 8 instances of the cluster grew less and less, making up 14.04% of all instances of less as a collocate of grew in this corpus. This can be compared to the 71 instances of grew more and more in the same corpus, making up 16.09% of all instances of more. Thus whilst less is less frequent than more, it is almost as likely to be found in the cluster less and less as more is likely to be found in the cluster more and more. This makes it more fixed in structure. Instances are shown below in Figure 4:

      Figure 4. All instances of grew less and less in metaphoric dataset
      Again, there appears to be no generalisation that can be made about what less and less is referring to in these examples. Grew less and less is used here in reference both to people (abstract and physical characteristics) and to external concrete or abstract entities. There is also less preference for the colligation grew less and less (+ adj.). Furthermore, unlike grew more and more, there appear to be no strong pragmatic associations attached to the cluster. This may be due to the small amount of data. Some instances refer to improvement, whilst others refer to a deterioration in condition or circumstance.
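The comparison of the two reduplicated clusters rests on simple proportions, which can be checked with a short worked calculation. The counts 8, 57 and 71 are those reported above; the total of 441 for more is an assumption derived here from Table 1 (45 left plus 396 right occurrences), and the function name is hypothetical.

```python
def cluster_share(cluster_count, collocate_count):
    """Percentage of a collocate's occurrences that fall inside a given cluster."""
    return 100 * cluster_count / collocate_count

# grew less and less, against all occurrences of less as a collocate of grew
less_share = cluster_share(8, 57)
# grew more and more, against all occurrences of more (45 + 396 from Table 1)
more_share = cluster_share(71, 45 + 396)

print(f"less and less: {less_share:.2f}%")
print(f"more and more: {more_share:.2f}%")
```

The two shares come out at roughly 14% and 16%, which is the basis for saying that less is almost as likely as more to appear in its reduplicated cluster despite being far less frequent overall.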
    4.3 ‘Grew’ + comparative

    It should follow that when grew is used alongside a comparative adjective or adverb (e.g. darker, smaller etc.), grew is similarly being used, metaphorically, in a transformative sense. Comparatives with a frequency higher than ten are shown below in Table 3. Columns 4 and 5 show their ranking in R1 and R3 position (taking into consideration clusters such as grew brighter and brighter):
      Table 3. Collocates and their R1/R3 positions acting as comparatives in metaphoric dataset

      Collocate  Freq.  PTW   R1 freq.  R3 freq.  R1 & R3 (x and x)
      FAINTER    53     1.80  31        22        22
      LOUDER     43     1.46  32        10        9
      STRONGER   40     1.36  25        8         6
      DARKER     37     1.26  28        8         8
      HEAVIER    26     0.88  15        2         3
      BRIGHTER   23     0.78  13        8         5
      PALER      22     0.75  13        8         5
      CALMER     17     0.58  16        1
      Instead of referring to mood and temperament, the adjectives refer more neutrally to external, environmental changes such as those relating to sound or light (e.g. grew fainter; grew louder; grew wider; grew thicker). Clusters involving the above comparatives (with a minimum frequency of 5) are shown in Table 4 below, in order to look for further evidence of nesting that is semantically associated with grew more and more:

      Table 4. Frequent clusters involving comparatives in metaphoric dataset
      Of particular interest is the colligation adj.(er) + and + adj.(er). The most prevalent of these are louder and louder, fainter and fainter, darker and darker, stronger and stronger, thicker and thicker, and brighter and brighter. Grew fainter and fainter is the most frequent, making up the majority of all occurrences of fainter. With the exception of fainter and darker, the comparatives depict an increase in intensity, which is similar to a physical, non-metaphoric sense of growing outward or upward. Other, less frequent comparatives found in the colligation grew + adj.(er) + and + adj.(er) include angrier and angrier, bleaker and wilder, closer and heavier, feeble and fainter, colder and colder, denser and denser, stupider and clumsier, and wider and wider. As with the colligation grew more and more (+ adj.), the colligation grew + adj.(er) + and + adj.(er) shows a preference for comparatives to be used emphatically, signaling a slow or gradual growth or development, rather than an immediate change. There is a difference, however, between the use of grew more and more + adj. and grew + adj.(er) + and + adj.(er), not simply in the structure but also in the semantic nature of the adjective being used in each structure. The majority of instances of grew + adj.(er) + and + adj.(er) similarly depict something negative, often creating a sense of something impending or threatening, but the pragmatic association is much more prominent than for the structure grew more and more + adj. In total there is negativity associated with 137 out of 171 (80.12%) instances of grew + adj.(er) + and + adj.(er), compared to only 37.76% of instances in the structure grew more and more + adj., as was shown previously.

      This finding can be compared with uses of both colligations (more and more + adj. and adj.(er) + and + adj.(er)) more generally, without grew, to determine if this is a more general feature of the language, rather than specific to the datasets. A small search of roughly 4 million tokens (taken from 3 random texts from the main nineteenth century corpus) yielded 21 instances of more and more + adj. and 100+ instances of verb + adj.(er) + and + adj.(er). With regard to the first structure, 13/20 are clearly negative in their pragmatic association (adjectives include incensed, astonished, silent, fretful and anxious). Another two instances reveal a degree of negativity when more context is provided. In summary, within the small sample (21 instances), three quarters display negative pragmatic association, which as a consequence appears to be a salient feature of the structure more and more in general. In comparison, the adjectives in the second structure show no sign of a characteristic pragmatic association (some instances are negative, some are positive, and some are neutral). They also refer more often to external observations, often related to speed (faster and faster), spatial description (nearer and nearer; lower and lower; hither and thither), or light (darker and slighter; blacker and thicker). There is also repetition of over and over and other degrees of intensity (harder and better; graver and steadier). With the exception of three instances, all show an increase in intensity, again similar to the non-metaphoric meaning of growth. Within the sample there is a mixture of metaphoric and non-metaphoric language. It can be concluded then that the colligation adj.(er) + and + adj.(er) is specifically negative in its pragmatic association when used alongside grew in a metaphoric sense. This finding supports the view that metaphoric instances of the item grew differ in their lexical characteristics both from non-metaphoric uses of the same item and from other, more general uses of the same colligation, in this case adj.(er) + and + adj.(er).
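A search for the colligation grew + adj.(er) + and + adj.(er) can be approximated with a simple pattern match, sketched below. This is a rough heuristic offered as an assumption, not the retrieval method used in the study: matching on the suffix -er overgenerates (it would also accept words such as over or hither) and misses irregular comparatives such as worse, so any hits would still need manual checking. The pattern, function name and example sentence are hypothetical.

```python
import re

# 'grew', then two words ending in -er joined by 'and'.
PATTERN = re.compile(r"\bgrew\s+(\w+er)\s+and\s+(\w+er)\b", re.IGNORECASE)

def comparative_clusters(text):
    """Return the (first, second) comparative pairs that follow 'grew'."""
    return [(a.lower(), b.lower()) for a, b in PATTERN.findall(text)]

# Invented example text, not drawn from the corpus.
sample = (
    "The sound grew fainter and fainter, and the sky grew darker and wilder. "
    "He grew more and more nervous."
)
pairs = comparative_clusters(sample)
print(pairs)

# Reduplicated pairs (x and x) versus mixed pairs (x and y).
print([p for p in pairs if p[0] == p[1]])
```

Separating reduplicated pairs from mixed pairs mirrors the distinction drawn above between clusters such as fainter and fainter and clusters such as bleaker and wilder.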
  5. Conclusions
    5.1 Summary of findings

    The small snapshot of findings has illustrated that metaphoric uses of grew are signalled differently in the structures and groups of items of which they form a part. These primings are characterised by their absence in the non-metaphoric dataset. Findings from the analysis show that the nesting structure grew more and more is specific to a metaphoric use of grew, signaling an abstract transformation of character. There was also shown to be a negative association, often conveying a sense of despair, anger, or weakness in a character’s temperament. Interestingly, this is not the case with grew less and less, another frequent metaphoric cluster. Of more interest is the colligation adj.(er) + and + adj.(er), which had a much stronger negative pragmatic association than grew more and more + adj. Moreover, the structure was shown to be specific to the verb grew: there was no pragmatic association found with the more general colligation adj.(er) + and + adj.(er) in the BNC. The adjectives displaying the highest degree of fixedness in R1 position also display a negative pragmatic association (grew pale, worse, tired, weary, hot). There is no such association shown in the non-metaphoric adjectives, or indeed in any collocate analysis with the non-metaphoric dataset. Pragmatic association has been shown to play a crucial role in the above nesting structures. Interestingly, Louw (1993) claims that metaphor is often enlisted “both to prepare us for the advent of a semantic prosody and to maintain its intensity once it has appeared” (Louw 1993: 172). The findings here do indeed show a prevalence of pragmatic association amongst metaphoric instances of items in comparison to the non-metaphoric uses. Thus it could be suggested that pragmatic association and metaphor form a creative relationship. More importantly for the larger investigation, the nesting and the subsequent manifestations of meaning (collocation, colligation, semantic association and pragmatic association) are avoided by non-metaphoric uses of grew, providing support for the Drinking Problem hypothesis (Hoey 2005), which claims that two senses of a word will avoid each other’s primings.
    5.2 Implications for future metaphor research

Together with the other lexical priming claims discussed in the author’s PhD thesis, it has been demonstrated that co-textual, contextual and textual linguistic features display lexical patterns which are manifest in only metaphoric or only non-metaphoric instances of grew. The corpus findings suggest that the Drinking Problem hypothesis can indeed be extended to metaphoric language, to account for our ability to identify and recognize metaphoric instances of a given item or phrase. This in turn supports the notion that metaphoric instances of words or phrases have (to an extent) a fixed set of choices in terms of grammar and lexis.

To date, corpus linguistics has pushed furthest the argument that the linguistic patterns found in metaphor are more complex than other theories can account for, and that the importance of social interaction needs to form part of an adequate explanation of the data. Together with corpus linguistics, the Lexical Priming theory permits a re-focusing of metaphor, taking into consideration society’s role in the use of language, and language’s relationship with both society and the individual. Rather than taking a compartmentalised approach to metaphor, corpus linguistics and lexical priming address both the cognitive and social aspects of metaphor, as integral parts of both the theory and the analysis of data.

Amongst other linguists, Sampson goes further in support of corpus linguistics, claiming that “it could be argued that corpus methodology should be driving the theoretical notions of metaphor and lexicology more generally” (Sampson 2001: 194). The creative link found between metaphor and pragmatic association is a finding worthy of further exploration, and only through corpus linguistics can it be explored in the first place. Louw’s (1993: 172) claim that metaphor is often enlisted to help prepare a reader for semantic prosody (pragmatic association) may in fact turn out to be a more pervasive relationship, where semantic prosody helps provide an explanation for our ability to recognise metaphor.

Hoey (2008) states that more work needs to be done in relation to creativity and lexical priming. Metaphor by its very nature is creative. Whilst the Drinking Problem hypothesis (Hoey 2005) does not shed any light on how to identify or definitively classify metaphoric language (as no theory can fully, despite good attempts), it might facilitate a focus on the set of choices being made by a speaker/writer and the level of fixedness of metaphoric senses in relation to their non-metaphoric counterparts. This might make possible a lexically driven explanation of our ability to identify metaphorical meanings, based on our encounters with language.


Barbera, M. 1993. Metaphor in 19th‑Century medicine. Knowledge and Language 3: 143–154. Black, M. 1993. More about metaphor. In Metaphor and Thought, A. Ortony (ed.), 19–41.

Cambridge: CUP. doi: 10.1017/CBO9781139173865.004

Boghian, I. 2009. The metaphor of the body as a house in 19th Century English novels. Styles of Communication 1(1): 1–13.

Cameron, L. 2003. Metaphor in Educational Discourse. London: Continuum.

Carter, R. 2004. Language and Creativity: The Art of Common Talk. London: Routledge.

doi: 10.4324/9780203468401

Deignan, A. 2005. Metaphor and Corpus Linguistics [Converging Evidence in Language and Communication Research 6]. Amsterdam: John Benjamins. doi10.1075/celcr.6

Deignan, A. & Semino, E. 2010. Corpus techniques for metaphor. In Metaphor Analysis: Research Practice in Applied Linguistics, Social Sciences and the Humanities, L. Cameron & R. Maslen (eds), 161–179. London: Equinox.

Deignan, A., Littlemore, J. & Semino, E. (eds). 2013. Figurative Language, Genre and Register.

Cambridge: CUP.

Frantzi, K. & Ananiadou, S. 1996. Extracting nested collocations. In Proceedings of the 16th International Conference on computational Linguistics, COL‑ING 96, 41–46.

Gadamer, H. 2004. Truth and Method, trans. J. Weinsheimer & D.G. Marshall. London: Continuum.

Gibbs Jr., R.W. 1994. The Poetics of Mind. Cambridge: CUP.

Habermas, J. 1990. A review of Gadamer’s Truth and Method, trans. F.R. Dallmayr & Thomas McCarthy. In The Hermeneutic Tradition: From Ast to Ricoeur, G.L. Ormiston & A.D. Schrift (eds), 213–244. Albany: Suny Press.

Hanks, P. 2013. Lexical Analysis: Norms and Exploitations. Cambridge MA: The MIT Press.

doi: 10.7551/mitpress/9780262018579.001.0001

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.

doi: 10.4324/9780203327630

Hoey, M. 2008. Lexical priming and literary creativity. In Text, Discourse and Corpora, M. Hoey,

M. Mahlberg, M. Stubbs & W. Teubert (eds), 7–30. London: Continuum.

Kimmel, M. 2008. Metaphors and soft‑ware assisted cognitive stylistics. In Directions in Empiri- cal Literary Studies, S. Zyngier, M. Borlotussi, A. Chesnovokova & J. Auracher (eds), 193–210. Amsterdam: John Benjamins.

Koller, V. 2006. Of critical importance: Using corpora to study metaphor in business media discourse. In Corpus-Based Approaches to Metaphor and Metonymy, A. Stefanowitsch &

S.T. Gries (eds), 229–257. Berlin: Mouton de Gruyter.

Leech, G. 1969. A Linguistic Guide to English Poetry. London: Longman.

Leech, G. 2008. Language in Literature: Style and Foregrounding. London: Pearson Longman. Lindquist, H. & Levin, M. 2008. Foot and mouth: The phrasal patterns of two frequent nouns.

In Phraseology: An Interdisciplinary Perspective, S. Granger & F. Meunier (eds), 143–158. Amsterdam: John Benjamins. doi: 10.1075/z.139.15lin

Louw, B. 1993. Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. Text and Technology. In Honour of John Sinclair, 157–176. Amsterdam: John Benjamins. doi: 10.1075/z.64.11lou

Mahlberg, M. 2010. Corpus linguistics and the study of nineteenth century fiction. Journal of Victorian Culture 15(2): 292–298.

Mahlberg, M. 2012. Corpus Stylistics and Dickens’s Fiction. London: Routledge.

Nowottny, W. 1965. Language Poets Use. London: Continuum.

Partington, A. 1998. Patterns and Meanings: Using Corpora for Language Research and Teaching. Amsterdam: John Benjamins.

Partington, A. 2006. Metaphors, motifs and similes across discourse types: Corpus-assisted discourse studies (CADS) at work. In Corpus-Based Approaches to Metaphor and Metonymy, A. Stefanowitsch & S. Gries (eds), 267–304. Berlin: Mouton de Gruyter.

Patterson, K.J. 2015. The confinements of ‘metaphor’ - Putting functionality and meaning before definition in the case of metaphor. Globe: A Journal of Language, Culture and Communication 2: 1–22. doi: 10.5278/ojs.globe.v2i0.74

Patterson, K.J. 2016. The analysis of metaphor: To what extent can the theory of lexical priming help our understanding of metaphor usage and comprehension? Journal of Psycholinguistic Research 45(2): 237–258. doi: 10.1007/s10936-014-9343-1

Philip, G. 2010. Why prosodies aren’t always present: Insights into the idiom principle. In Proceedings of the Corpus Linguistics Conference CL2009, M. Mahlberg, V. González-Díaz, & C. Smith (eds). Liverpool: University of Liverpool.  CL2009/317FullPaper.rtf

Philip, G. 2011. Colouring Meaning: Collocation and Connotation in Figurative Language [Studies in Corpus Linguistics 43]. Amsterdam: John Benjamins. doi: 10.1075/scl.43

Ricoeur, P. 2003. The Rule of Metaphor: The Creation of Meaning in Language, trans. R. Czerny. London: Routledge.

Sampson, G. 1979. Liberty and Language. Oxford: Oxford University Press.

Sampson, G. 2001. Empirical Linguistics. London and New York: Continuum.

Sampson, G. 2013. One man’s norm is another man’s metaphor. Review article on: Patrick Hanks, Lexical Analysis: Norms and Exploitations. Cambridge MA: The MIT Press, pp. xv + 462.

Scott, M. 2008. WordSmith Tools, Version 5. Liverpool: Lexical Analysis Software.

Seretan, V. 2011. Syntax-based Collocation Extraction. Dordrecht: Springer. doi: 10.1007/978-94-007-0134-2

Short, M. 1996. Exploring the Language of Poems, Plays and Prose. London: Longman.

Steen, G. 2009. From linguistic form to conceptual structure in five steps: Analysing metaphor in poetry. In Cognitive Poetics: Goals, Gains, Gaps, G. Brône & J. Vandaele (eds), 197–226. Berlin: Mouton de Gruyter.

Steen, G., Dorst, A., Herrmann, B., Kaal, A., Krennmayr, T., Pasma, T. 2010. A Method for Linguistic Metaphor Identification: From MIP to MIPVU [Converging Evidence in Language and Communication Research 14]. Amsterdam: John Benjamins. doi: 10.1075/celcr.14

Stockwell, P. 2002. Cognitive Poetics: An Introduction. London: Routledge.

Svanlund, J. 2007. Metaphor and convention. Cognitive Linguistics 18(1): 47–89. doi: 10.1515/COG.2007.003

Tsiamita, F. 2009. Polysemy and lexical priming: The case of drive. In Exploring the Lexis-Grammar Interface [Studies in Corpus Linguistics 35], U. Romer & R. Schulze (eds), 247–264. Amsterdam: John Benjamins. doi: 10.1075/scl.35.16tsi

Wittgenstein, Ludwig. [1922]1981. Tractatus Logico-Philosophicus, trans. C.K. Ogden. London: Routledge.

Teaching near‑synonyms more effectively

A case study of “happy” words in Mandarin Chinese

Juan Shao

Xi’an Jiaotong University, China, University of Liverpool, UK

The purpose of this study is to explore ways to explain effectively how Chinese near-synonyms are distinguished, based on corpus exploration. Lexical priming provides a theoretical framework for the collocational and colligational analysis of Chinese synonyms. A group of Mandarin Chinese “happy” words, 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn), is chosen for the case study. The results show that the three Chinese synonyms can be distinguished on the basis of corpus analysis, which may provide a useful reference for teaching Chinese to speakers of other languages.

  1. Introduction
    The use of corpora in language teaching has gained increasing prominence over the last two decades. A great number of corpus-related (corpus-based, corpus-driven and corpus-assisted) research studies have contributed to the advancement of language pedagogy, in particular Teaching English to Speakers of Other Languages (TESOL). The topics range from compiling corpus-driven dictionaries for learners, designing supplementary teaching materials and textbooks, and using corpora in the classroom, to analysing learner language, conducting comparative studies between first and target languages, and teaching ESP/EAP. Most of the research has concentrated on English (for example, Greenbaum & Nelson 1996; Altenberg & Granger 2001; Tsui 2004; Römer 2004; Yoo 2009; Liu & Jiang 2009) and some on European languages such as French and Spanish (O’Sullivan & Chambers 2006; Benavides 2015). Few studies, however, have been done on Mandarin Chinese, even though the last decade has witnessed a boom in learning Chinese as a second/foreign language across the world (Choi 2011; Yang 2015).

    A number of research studies have been conducted on the linguistic behaviour of lexis, phraseology, pattern grammar and n-grams (Sinclair 2004; Stubbs 2007; Granger & Meunier 2008), and their findings have been applied in English language pedagogy. Chinese linguistics remains a less explored field. Xiao and McEnery (2010) conducted a contrastive study of English and Mandarin Chinese focusing on tense and aspect, which not only enriches linguistic description but also provides a potential reference point for Chinese teaching and learning. However, much work remains to be done in the area of teaching Chinese to speakers of other languages, including corpus-based linguistic descriptions and their pedagogic applications.

    This study focuses on one important and difficult aspect of teaching which has scarcely been explored, namely the distinction between near-synonyms. Despite its importance and intricacy, synonymy did not garner the scholarly attention it deserves until quite recently (Divjak 2006; Edmonds & Hirst 2002; Taylor 2002). Liu and Espino have pointed out:

      Because of their subtle nuances and variations in meaning and usage, synonyms offer an array of possible word choices to allow us to convey meanings more precisely and effectively for the right audience and context. (2012: 199)

    Therefore, how to choose the most appropriate item from a list of synonyms for the right audience and context constantly frustrates language learners and also poses a problem for teachers. The focus of this study is not on the pedagogic method for teaching synonyms, but rather on exploring ways to explain effectively how Chinese near-synonyms are distinguished, based on corpus analysis. The findings should help teachers to teach synonyms more effectively and help students to make better decisions in choosing the most appropriate word or phrase for the right audience and context.

    doi 10.1075/scl.79.07.sha © 2017 John Benjamins Publishing Company
  2. Background of the study
    1. Use of corpora in second/foreign language teaching

      W. Nelson Francis (1982) defines a corpus as “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (p. 7). Using corpora in language teaching gives L2 learners access to authentic language and allows them to discover language patterns (Johns 1991). Data-driven learning (DDL), developed by Tim Johns for use with international students at the University of Birmingham, is an approach based on the theory that students act as “language detectives” (Johns 1997: 101), discovering facts about the language they are learning for themselves from authentic examples. Therefore, corpora or concordances can be used as a “language-learning activity” (Gavioli 1997) or as a language learning tool offering a vast amount of information from which learners can “conduct inductive explorations of grammatical constructions” (Meyer 2004). In other words, learners can gain an “ability to see patterning in the target language and to form generalizations” (Johns 1991) from real language usage. A number of studies have reported success in using concordances to teach vocabulary (for example, Yılmaz & Soruç 2015) and grammar (Uysal et al. 2013; Benavides 2015) in EFL classrooms.

      However, there are some concerns about the use of corpora in language teaching. One issue is that “language in corpus is de-contextualised and must be re-contextualised in a pedagogic setting to make it real for learners” (Hunston 2002; Widdowson 2000; Cook 1998). In addition, it has been pointed out that DDL is most suitable for very advanced learners who are filling in gaps in their knowledge rather than laying down the foundations (Hunston 2002). For example, Ming and Lee (2013) report that, compared with a conventional approach such as Grammar Translation, using concordances in Taiwan’s EFL grammar classrooms could be “time consuming” and “technically challenging”. The problems are

        … partly because choosing enough suitable sentences from a long list of concordance lines was in itself a laborious task, and partly because, it was rather difficult/challenging to manipulate the advanced queries from which to extract the lines containing the precise grammar patterns teachers wanted to focus on, not least because they were unfamiliar with corpora. (Ming & Lee 2013: 271)

      Therefore, although advantageous, the use of concordances in EFL classrooms still poses challenges, especially for teachers who are themselves not familiar with corpus analysis methods and for learners who are not at an advanced level.
    2. The expansion and problems of Mandarin Chinese teaching

      The last decade has witnessed a boom in the teaching of Mandarin Chinese all over the world. According to the Confederation of British Industry, Mandarin Chinese is one of the languages most sought after by British businesses (Moore 2012). In the USA, the number of learners of Chinese as a non-native language has increased by over 18% since 2006 (Furman, Goldberg & Lusin 2010). By 2014, there were over 480 Confucius Institutes in dozens of countries across six continents. The Chinese Ministry of Education estimated that 100 million people overseas might be learning Chinese by 2010 (Wikipedia 2016).

      However, problems in teaching Mandarin Chinese have been noted by many linguists and language teaching practitioners. Firstly, compared with English, corpus studies in Chinese linguistics are still at a preliminary stage. No corpus-based Chinese dictionaries are available, and the few corpus-based teaching materials that exist are relatively small in scale and compiled by individual teachers. When confused about the features of some lexical or grammatical usage, teachers do not have reliable resources to turn to. Liu (2005) also points out that most Chinese teachers still adopt traditional teaching methods, presenting prescriptive language descriptions and providing intuition-based made-up examples. Secondly, although vast in number, Mandarin learners are mostly beginners, while intermediate and advanced learners are relatively few. For example, in the UK 17% of state secondary schools and 45% of independent schools offer Chinese (CfBT Languages Trends 2013/14). Several universities have also started to offer Chinese courses for undergraduate and postgraduate degrees, such as the University of Cambridge, the University of Dundee, the University of Edinburgh and the University of Liverpool. However, Chinese teaching in schools and universities has only started in recent years, and most learners are taking introductory courses. The direct use of corpora therefore seems unrealistic in most classroom teaching; corpora may nevertheless provide useful resources for teachers to conduct inductive explorations of lexical and grammatical behaviours. As McEnery and Xiao have pointed out:

        The use of corpora in language teaching and learning has been more indirect than direct. This is perhaps because the direct use of corpora in language pedagogy is restricted by a number of factors including, for example, the level and experience of learners, time constraints, curricular requirements, knowledge and skills required of teachers for corpus analysis and pedagogical mediation, and the access to resources such as computers, and appropriate software tools and corpora, or a combination of these. (2011: 2–3)

      I would therefore suggest that Chinese teachers first learn how to use corpora to explore language patterns in Mandarin, and that corpus-based dictionaries for learners and other teaching materials be compiled on the basis of corpus studies of Chinese. This study is one of the first attempts to serve these purposes.
    3. Corpus approaches to synonyms and lexical priming

      The discrimination of near synonyms has been a very challenging issue for linguists, lexicographers, dictionary-makers and language teachers in both L1 and L2 teaching (Edmonds & Hirst 2002; Divjak 2006; Lee & Liu 2009). Neither dictionaries nor thesauruses provide satisfactory explanations for distinguishing near synonyms, as they offer only circular cross-references, which often leads to frustration when they are consulted. It seems that traditional language description based on introspection and made-up examples does not distinguish synonyms effectively. With the development of corpus linguistics, the accessibility of large amounts of naturally occurring data has facilitated various empirical analyses of near-synonym distinctions.

      A number of corpus studies on English synonyms have been conducted in recent years. For example, Gries (2001) quantifies the similarity between English adjectives ending in -ic or -ical (for example economic and economical) on the basis of the overlap between their collocations. Other sets of synonyms that have attracted attention include strong and powerful (Church et al. 1991), absolutely, completely and entirely (Partington 1998), big, large and great (Biber et al. 1998), principal, primary, chief, main and major (Liu 2010), and actually, genuinely, really, and truly (Liu & Espino 2012).

      By contrast, Chinese synonyms are less explored, with the exception of Xiao and McEnery (2006), who looked at near synonyms from a cross-linguistic perspective. Their study analysed and compared the collocation and semantic prosody of English synonyms and their Chinese equivalents, but did not look at their colligational behaviours.

      The theory of Lexical Priming (LP) was proposed by Michael Hoey in 2005. Based on corpus analysis, LP explains, from a psychological perspective, the existence of important concepts in corpus linguistics, including collocation, colligation and semantic association (Hoey 2005). Drawing on psychological experimental work and the corpus linguistic analysis of large amounts of naturally occurring data, lexical priming (Hoey 2005) argues that people are mentally primed with words through encounters in speech and writing, and that words become cumulatively loaded with the contexts and co-texts in which they are encountered. Hoey draws an analogy between the mental concordance and the computer concordance and points out:

        … the computer corpus cannot tell us what primings are present for any language user, but it can indicate the kind of data a language user might encounter in the course of being primed. It can suggest the ways in which priming might occur and the kind of feature for which words or word sequence might be primed. (2005: 14)

      Lexical priming makes a number of claims; specifically, every word is primed with

        … the word or words that characteristically accompany it (its collocations), the grammatical patterns with which it is associated (its colligations), the meanings with which it is associated (its semantic associations), and the pragmatics with which it is associated (its pragmatic associations). (Hoey & Shao 2015: 19)

      Based on an analysis of the English synonymous pair result and consequence, Hoey (2005) has demonstrated that synonyms are similar in sharing features of collocation, semantic association and colligation, but differ in the proportional distribution of these features. Lexical priming has universal application, and its applicability to Chinese has been demonstrated in previous studies (Shao 2014; Hoey & Shao 2015). Lexical priming thus seems to provide a reasonable theoretical and practical framework for distinguishing Chinese synonyms. The current study makes use of this framework to explore the collocational and colligational behaviours of Chinese synonyms.

      In addition, language transfer from L1 may also influence how learners use the target language. Hoey explains:

        the learning of a second language (L2) is necessarily a very different experience from learning a first one (L1) for a whole raft of reasons, […]. In the first place, when the vocabulary of the first language is primed, it is being primed for the first time. When the second language is learnt, however, the primings are necessarily superimposed on the primings of the first language. (2005: 183)

      Therefore, the second aim of the study is to look at how English and Chinese ‘happy’ words are similar and different in terms of collocation, semantic association and colligation. The comparison between the synonyms in the two languages may help to explain why learners make certain mistakes in the target language, and may thus suggest ways of improving the teaching of Chinese near-synonyms.
  3. Setting up the study
    This study focuses on distinguishing Chinese near-synonyms within the framework of lexical priming; specifically, the similarities and differences between Chinese near-synonyms in terms of collocation, colligation and semantic association are explored on the basis of corpus analysis. Three Chinese ‘happy’ words, 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn), are chosen for the current study for the following reasons.

    Firstly, these words are listed as required vocabulary in Mandarin test syllabuses. The Chinese Proficiency Test (HSK) is an international standardized exam that tests and rates Chinese language proficiency. It assesses non-native Chinese speakers’ abilities in using the Chinese language in their daily, academic and professional lives. It consists of six levels, from Level I (beginner) to Level VI (advanced). The three “happy” words are included in the syllabuses for Levels I, II and IV respectively.

    Secondly, these “happy” words are frequently misused by Mandarin learners. Table 1 provides some examples of mistakes made by first-year university students in the UK who have studied Chinese for six months.
    Table 1. Examples of mistakes from students’ writing/speaking

    | English expressions | Student’s mistakes | Chinese expressions |
    | I’m happy/glad. | * 我 是 高兴/快乐/开心。 Wŏ shì gāo xìng / kuài lè / kāi xīn. | 我 很 高兴/快乐/开心。 Wŏ hĕn gāo xìng / kuài lè / kāi xīn. |
    | happy life | * 高兴 的生活 gāo xìng de shēng huó; (?) 快乐 的生活 kuài lè de shēng huó; (?) 开心 的生活 kāi xīn de shēng huó | 幸福 的生活 xìng fú de shēng huó |
    | Glad to see you. | * 我 是 高兴 见 到你。 Wŏ shì gāo xìng jiàn dào nĭ. | 我 很 高兴 见 到你。 Wŏ hĕn gāo xìng jiàn dào nĭ. |
    | Happy new year | 新年快乐 (xīn nián kuài lè); (?) 新年高兴 (xīn nián gāo xìng); (?) 新年开心 (xīn nián kāi xīn) | 新年快乐 (xīn nián kuài lè) |
    As shown in Table 1, students tend to use 是 (shì) (BE) in the sentence when expressing 我很高兴 (Wŏ hĕn gāo xìng) in Chinese, which may be influenced by its English translation I’m happy. Besides, ‘happy life’ is often translated as *高兴的生活 (gāo xìng de shēng huó), which is not a correct collocation in Chinese. In expressing “happy new year”, although most of the students correctly use 新年快乐 (xīn nián kuài lè), they often ask whether they could say 新年高兴 (xīn nián gāo xìng) or 新年开心 (xīn nián kāi xīn). All these issues are related to the collocational and colligational primings of these synonymous items and their English translations.

    Finally, neither dictionaries nor teaching materials provide useful information for distinguishing these synonyms. When most language learners have difficulty in understanding or expressing themselves in a second/foreign language, the first resource they tend to consult is a dictionary. However, dictionaries may not always be helpful, especially in choosing the most appropriate word for a particular co-text and context. Take these ‘happy’ words for example. The entries for 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn) (see Table 2) provided in a modern Chinese-English dictionary, first published in 2001 by Foreign Language Teaching and Research Press in China, could be very confusing for learners, even though the preface states that the edition is intended not only for Chinese students who are learning English but also for those who are learning Chinese.
    Table 2. Entries of the three synonyms in Modern Chinese-English Dictionary

    高兴 (gāo xìng)
      1. glad; happy; cheerful
         看到孩子们有进步,心里很高兴。 He was very pleased to see that the kids had made progress.
      2. be willing to; be happy to
         他就是高兴看电影,看戏不感兴趣。 He’s fond of seeing films, and not at all interested in watching plays.
    快乐 (kuài lè)
      happy; joyful; cheerful
         快乐的微笑 a happy smile
         节日过得很快乐。 The festival was spent joyfully.
    开心 (kāi xīn)
      happy; joyous; elated
         他们很开心。 They are happy.
    (Modern Chinese-English Dictionary, first published in 2001 by Foreign Language Teaching and Research Press in China)

    In the Dictionary, several English words are provided as the explanation of each word without differentiation. For example, the English explanations glad, happy and cheerful are provided for 高兴 (gāo xìng); happy, joyful and cheerful for 快乐 (kuài lè); and happy, joyous and elated for 开心 (kāi xīn). Nothing is mentioned about how to distinguish these synonyms. In addition, some of the examples provided seem unrelated to the translations given. Take 高兴 (gāo xìng) for example: even though the second sense provided is be willing to and be happy to, the example offered is “He’s fond of seeing films, and not at all interested in watching plays”, which may cause further confusion for users.
  4. Purpose and Methodology of the study
    The purpose of the study is to explore ways to explain effectively how Chinese synonyms are distinguished, based on corpus exploration. As Hoey points out, “synonyms differ in respect of the way they are primed for collocation, semantic associations and colligations and the differences in these primings represent differences in the uses to which we put our synonyms” (2005: 79). The first aim of the study is to explore the behaviour of the Mandarin Chinese “happy” words 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn). A detailed analysis is conducted to investigate the similarities and differences between the words in terms of collocation, semantic association and colligation. In addition, differences between the Chinese words and their English translations in terms of collocation, semantic association and colligation may be the reason why learners make certain mistakes in using the synonyms. Therefore my research questions are: (1) How are Chinese words meaning “happy” primed in terms of collocation, colligation and semantic association? (2) Is there a potential link between the way English speakers are primed with respect to words meaning “happy” in English and the way they use similar words in Mandarin Chinese?

    To tackle these questions, the Lancaster Corpus of Mandarin Chinese (LCMC) was analysed with CQPweb (Hardie 2012) and FLOB was analysed with the Sketch Engine (Kilgarriff 2008), owing to the accessibility of each corpus. LCMC is a one-million-word balanced corpus of written Mandarin Chinese, designed as a Chinese match for the Freiburg-LOB Corpus of British English (FLOB). The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totalling one million words (McEnery & Xiao 2004). CQPweb is a web-based corpus analysis system developed by Andrew Hardie at Lancaster University. Rather than being bound to a particular dataset such as the BNC, it is compatible with any corpus; its flexibility therefore makes it possible to analyse Chinese corpora (Hardie 2012). As the Sketch Engine and CQPweb make use of the same statistical measurement, the Chinese and English analyses are comparable.
  5. Results and discussion
    1. Chinese grammatical terms

      This section addresses the two research questions. Before the results are presented, a brief introduction to some grammatical terms in Chinese is necessary. Affixes are grammatical morphemes that are added to other morphemes to form new words. Affixes in English may be derivational (for example -ness and pre-) or inflectional (for example plural -s and past tense -ed). Like English, Chinese has some characters which can be added at the end of a word (i.e. suffixes) to form a derivative. For example:

        地 (de): -ly
        快 (kuài): quick – 快地 (kuài de): quickly
        高兴 (gāoxìng): happy – 高兴地 (gāoxìng de): happily

      Unlike English, Chinese does not change verb forms to show tense and aspect, but rather uses “particles”, also known as “function words” (Li and Cheng 2008), which may “have a number of different functions depending on their placement in a sentence” (Wikipedia). Some general roles played by particles in Chinese include “indicating possession, a continuous action, completion, addition of emotion, softening of a command, and so forth” (Wikipedia). Consider the following example:

        他 走 了。
        Tā zŏu le.
        He walk PAR.
        He (has) left.

      As my focus is on the distinction of synonyms, the specific function of each particle in the examples provided will not be distinguished.

      Question 1: How are the Chinese “happy” words primed in terms of collocation, semantic association and colligation?
    2. Collocation and semantic association

      The first part of the analysis concerns collocation. All the collocates of the three synonyms in LCMC were retrieved with CQPweb, in which the significance of collocation strength is scored by log-likelihood. Note that the higher the score, the more evidence we have that the association between the query word and its collocate is not due to chance (Hardie 2012).

      高兴 (gāo xìng) has a long list of collocates, including 地 (de, suffix), 很 (hĕn, very), 心里 (xīn lĭ, in the heart), 非常 (fēi cháng, very), 十分 (shí fēn, very), 得 (de, PAR), 我 (wŏ, I), 太 (tài, too), 听 (tīng, listen), 不 (bù, not), 了 (le, PAR),¹ 说 (shuō, speak/talk), 他 (tā, he), 她 (tā, she), 就 (jiù, PAR) and 着 (zhe, PAR), excluding punctuation marks, which are not discussed in this paper. Table 3 lists all the collocates ranked by log-likelihood score.

      地 (de) appears at the top of the collocation list of 高兴 (gāo xìng) and is its most frequent R1 collocate (the first collocate to the right of the query word). Of the 31 instances, 地 (de) occurs before 高兴 (gāo xìng), in L1 position, only once (see Example 1); there 高兴 (gāo xìng) is used as an adjective, and 地 (de) is a suffix attached to 意外 (yì wài, surprising), changing it into the adverb 意外地 (yì wài de, surprisingly). In the other 30 instances, 地 (de) occurs in R1 position of 高兴 (gāo xìng) to form an adverb modifying a verb in the sentence (Examples 2 and 3). The analysis therefore shows that 高兴 (gāo xìng) is mostly used as an adverb together with the suffix 地 (de); dictionaries, however, do not show this usage at all.
      1. 二 人 都 感到 意外 地 高兴。
         Liăng rén dōu găn dào yì wài de gāo xìng.
         Two person both feel surprising PAR happy.
         Both of them feel surprisingly happy.²
        1. PAR is the abbreviation for particle.
        2. The Chinese is given first in character form, then in Pinyin, followed by a word-for-word translation and then a free translation.

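        Table 3. Collocation list of 高兴 (gāo xìng) in LCMC

        | Collocate of 高兴 (gāo xìng) | Total no. in whole corpus | Expected collocate frequency | Observed collocate frequency | In no. of texts | Log-likelihood |
        | 地 (de, SUF) | 3,471 | 2.557 | 31 | 28 | 99.167 |
        | 很 (hĕn, very) | 1,467 | 1.081 | 19 | 16 | 73.764 |
        | 心里 (xīn lĭ, in the heart) | 202 | 0.149 | 9 | 8 | 56.641 |
        | 非常 (fēi cháng, very) | 219 | 0.161 | 8 | 6 | 47.153 |
        | 十分 (shí fēn, very) | 321 | 0.237 | 7 | 7 | 34.107 |
        | 得 (de, PAR) | 1,610 | 1.186 | 11 | 10 | 29.564 |
        | 我 (wŏ, I) | 5,576 | 4.107 | 17 | 12 | 22.767 |
        | 太 (tài, too) | 380 | 0.28 | 5 | 3 | 19.477 |
        | 听 (tīng, listen) | 521 | 0.384 | 5 | 5 | 16.508 |
        | 不 (bù, not) | 5,687 | 4.189 | 14 | 14 | 14.311 |
        | 了 (le, PAR) | 12,787 | 9.419 | 20 | 14 | 9.121 |
        | 说 (shuō, speak/talk) | 3,754 | 2.765 | 8 | 8 | 6.572 |
        | 他 (tā, he) | 5,897 | 4.344 | 10 | 9 | 5.413 |
        | 她 (tā, she) | 2,825 | 2.081 | 5 | 4 | 2.943 |
        | 就 (jiù, PAR) | 3,476 | 2.561 | 5 | 5 | 1.823 |
        | 着 (zhe, PAR) | 3,502 | 2.58 | 5 | 5 | 1.787 |
        | 是 (shì, BE) | 11,600 | 8.545 | 7 | 7 | −0.301 |
        | 的 (de, SUF) | 51,139 | 37.67 | 16 | 14 | −16.613 |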
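        The scores in Table 3 can be approximated with a standard log-likelihood (G²) calculation over a 2×2 contingency table. What follows is a minimal sketch under stated assumptions, not CQPweb's actual implementation. The function name `signed_log_likelihood` is hypothetical. The sign convention (a negative score when the observed count falls below the expected one) is an assumption suggested by the negative values for 是 and 的. The window size of 737 tokens is also hypothetical: it is chosen only so that the expected frequency for 地 comes out near the 2.557 shown in Table 3, and the corpus size is rounded to one million tokens.

```python
import math


def signed_log_likelihood(observed, collocate_total, window_tokens, corpus_tokens):
    """Return (expected, score) for one collocate of a node word.

    observed        -- times the collocate occurs inside the node's window
    collocate_total -- times the collocate occurs in the whole corpus
    window_tokens   -- total tokens inside all windows (assumed value)
    corpus_tokens   -- total tokens in the corpus
    """
    # Observed 2x2 table: inside/outside window x collocate/other word.
    o = [
        observed,
        collocate_total - observed,
        window_tokens - observed,
        corpus_tokens - window_tokens - collocate_total + observed,
    ]
    # Expected counts if node and collocate were independent.
    p_window = window_tokens / corpus_tokens
    p_colloc = collocate_total / corpus_tokens
    e = [
        corpus_tokens * p_window * p_colloc,
        corpus_tokens * (1 - p_window) * p_colloc,
        corpus_tokens * p_window * (1 - p_colloc),
        corpus_tokens * (1 - p_window) * (1 - p_colloc),
    ]
    # G2 = 2 * sum(O * ln(O / E)); cells with O == 0 contribute nothing.
    g2 = 2 * sum(oi * math.log(oi / ei) for oi, ei in zip(o, e) if oi > 0)
    # Assumed convention: negative when the pair is rarer than expected.
    sign = -1 if observed < e[0] else 1
    return e[0], sign * g2


# 地 with 高兴: observed 31, corpus total 3,471 (both from Table 3);
# window size 737 and corpus size 1,000,000 are illustrative assumptions.
expected, score = signed_log_likelihood(31, 3471, 737, 1_000_000)
print(f"expected={expected:.3f}  score={score:.1f}")
```

        Under these assumptions the score for 地 lands close to the value reported in Table 3, and the same call with the figures for 的 (observed 16, total 51,139) yields a negative score, which is consistent with the reading that 的 co-occurs with 高兴 less often than chance would predict.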
      2. 林 小姐 高兴 地 订 下 了 16日 从 柳 州 转 郑 州 的 车票。
         Lín xiăo jiĕ gāo xìng de dìng xià le shí liù rì cóng liŭ zhōu zhuăn zhèng zhōu de chē piào.
         Lin Miss happy PAR book PAR 16 day from Liu zhou transfer Zheng zhou PAR ticket.
         Miss Lin happily booked the ticket to transfer from Liu zhou to Zheng zhou on 16th.
      3. 这种 方法, 小孩子 会 高兴 地 接受。
         Zhè zhǒng fāng fă, xiăo hái zi huì gāo xìng de jiē shòu.
         This approach, little kid will happy PAR accept.
         The kids will happily accept this approach.
        很 (hĕn, very) ranks second in the collocation list and is the most frequent L1 collocate. It appears in L1 position in 15 out of 19 instances (some examples are shown in Table 4) and in L2 position in the other 4, in which 不 (bù, not) and 是 (shì, BE) occupy L1 position with two occurrences each.
        Table 4. Concordance lines of 很 (hĕn) with 高兴 (gāo xìng) in LCMC

        | 小平 同志 听 了 | 很高兴 | ,不断 点头 , 露出 满意 的 笑容 。 |
        | 现在 你 可以 钻进 来 了 。她 | 很高兴 | 地 喊 道 。 |
        | 丈夫 写 了 一 张 戒烟 保证书 , 妻子 | 很高兴 | 。 |
        | 夫人 , 我 | 很高兴 | 接受 您 的 委托 。 |
        | 相见 之后 , 彼此 都 | 很高兴 | ,一面 喝酒 一面 谈话 |
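        The L1 and R1 counts discussed here (for example, 很 in L1 position and 地 in R1 position) could in principle be reproduced with a simple positional count over word-segmented text. The sketch below is illustrative only: the function name `positional_collocates` is hypothetical, the three sentences are taken from Table 4 with 很 and 高兴 treated as separate tokens and punctuation as tokens of their own (an assumption about the segmentation), and three lines are far too few to say anything about the corpus as a whole.

```python
from collections import Counter


def positional_collocates(sentences, node, span=1):
    """Count the tokens found at each offset from -span to +span around `node`.

    `sentences` is a list of token lists (text assumed to be segmented already).
    Returns {offset: Counter}; offset -1 is L1 and offset +1 is R1.
    """
    counts = {off: Counter() for off in range(-span, span + 1) if off != 0}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            for off in counts:
                j = i + off
                # Skip offsets that fall outside the sentence.
                if 0 <= j < len(tokens):
                    counts[off][tokens[j]] += 1
    return counts


# Three concordance lines from Table 4, split on spaces.
sentences = [
    "她 很 高兴 地 喊 道 。".split(),
    "丈夫 写 了 一 张 戒烟 保证书 , 妻子 很 高兴 。".split(),
    "夫人 , 我 很 高兴 接受 您 的 委托 。".split(),
]
counts = positional_collocates(sentences, "高兴")
print("L1:", counts[-1].most_common())
print("R1:", counts[1].most_common())
```

        Raising `span` to 2 would also collect L2 and R2 collocates, which is the kind of count behind the observation that 很 sometimes sits in L2 position with 不 or 是 in between.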
        Note that 是 (shì, BE) and 的 (de, SUF) appear at the bottom of the collocation list of 高兴 (gāo xìng) with negative log-likelihood scores, which indicates that the two collocates, in particular 的 (de, SUF), very rarely co-occur with 高兴 (gāo xìng).

        As for 快乐 (kuài lè), its collocates include 祝 (zhù, bless), 生日 (shēng rì, birthday), 感觉 (găn jué, feel), 你 (nĭ, you), 不 (bù, not), 人 (rén, person), 的 (de, suffix) and 是 (shì, BE) (Table 5).
        Table 5. Collocation list of 快乐 (kuài lè) in LCMC

        | Collocate of 快乐 (kuài lè) | Total no. in whole corpus | Expected collocate frequency | Observed collocate frequency | In no. of texts | Log-likelihood |
        | 祝 (zhù, bless) | 18 | 0.003 | 3 | 3 | 36.453 |
        | 生日 (shēng rì, birthday) | 30 | 0.005 | 3 | 3 | 33.13 |
        | 感觉 (găn jué, feel) | 166 | 0.026 | 3 | 1 | 22.698 |
        | 你 (nĭ, you) | 2,900 | 0.452 | 4 | 4 | 10.439 |
        | 不 (bù, not) | 5,687 | 0.886 | 4 | 3 | 5.899 |
        | 人 (rén, person) | 4,027 | 0.627 | 3 | 3 | 4.684 |
        | 的 (de, suffix) | 51,139 | 7.963 | 12 | 10 | 1.88 |
        | 是 (shì, BE) | 11,600 | 1.806 | 3 | 3 | 0.666 |
        Although they occur in only three instances, 祝 (zhù, bless) and 生日 (shēng rì, birthday) have log-likelihood scores of 36.453 and 33.13 respectively, which are significant. Note that 你 (nĭ, you) also appears in these instances; however, because 你 (nĭ, you) is highly frequent in the corpus, its log-likelihood score is lower than those of 祝 (zhù, bless) and 生日 (shēng rì, birthday), and it is therefore a less significant collocate (see the examples in Table 6).

        Table 6. Concordance lines of 祝 (zhù, bless) and 生日 (shēng rì, birthday) with 快乐 (kuài lè) in LCMC

        64个小伙伴在 “祝你 生日快乐” 的旋律中, 争先恐后地把
        学习《铃儿响叮当》或者《祝你生日快乐》的乐曲就会机械地一遍遍地响,
        覆着我的, 低声唱着《祝你生日快乐》。高高大大的白果树的枝叶,

        In addition, although the log-likelihood scores are low, 快乐 (kuài lè) is positively primed with 的 (de) and 是 (shì), which suggests that 的 (de) and 是 (shì) may co-occur with 快乐 (kuài lè), even if not frequently. Examples 4 and 5 illustrate this.
      4. 幸福 就 是 叫人 快乐 的 感觉
         Xìng fú jiù shì jiào rén kuài lè de găn jué.
         Happiness is make people happy PAR feeling.
         Happiness is a feeling which cheers people up.
      5. 节目 形式 的 风格, 是 可以 有 多种 不同 的 表现 方式, 同样 是 轻松 快乐。
         Jié mù xíng shì de fēng gé, shì kĕ yǐ yŏu gè zhŏng bù tóng de biăo xiàn fāng shì, tóng yàng shì qīng sōng kuài lè.
         Program form PAR style, is may have various different PAR representation way, same be relaxing joyful.
         The presentation styles of the program can be various, and also relaxing and joyful.

        In Example 4, 快乐 (kuài lè) is followed by 的 (de) to form an adjective modifying 感觉 (găn jué, feeling), and in Example 5 the word is used after 是 (shì) to function as a complement in the sentence.

        Lastly, 开心 (kāi xīn) yields the fewest collocates, namely 地 (de) and 的 (de) (Table 7). Based on the log-likelihood score, 地 (de, suffix) is a much stronger collocate than 的 (de), which indicates that 开心 (kāi xīn) may be used more frequently as an adverb (Example 6) than as an adjective (Example 7).

        Table 7. Collocation list of 开心 (kāi xīn) in LCMC

        | Collocate of 开心 (kāi xīn) | Total no. in whole corpus | Expected collocate frequency | Observed collocate frequency | In no. of texts | Log-likelihood |
        | 地 (de, suffix) | 3,471 | 0.187 | 5 | 3 | 23.68 |
        | 的 (de, suffix) | 51,139 | 2.756 | 3 | 3 | 0.022 |
      6. 一 句 话 说 得 大家 都 开心 地 笑 了 起来。
         Yí jù huà shuō de dà jiā dōu kāi xīn de xiào le qǐ lái.
         One CL word say par people all happy par laugh up.
         One remark made everybody laugh happily.

      7. 你 有 什么 不 开心 的 事情 ?
         Nǐ yŏu shén me bù kāi xīn de shì qing?
         You have what not happy par thing?
         Is there anything you are unhappy about?
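The log-likelihood scores reported in the collocation tables measure how far the observed co-occurrence of a node and a collocate departs from what chance would predict. The chapter does not spell out how the corpus tool arrives at its figures, so the following is only a minimal sketch of one widely used formulation, the log-likelihood ratio over a 2×2 contingency table. The function name and the cell counts are illustrative assumptions, not values taken from LCMC, and the sketch is not meant to reproduce the scores in the tables.

```python
import math


def log_likelihood(o11: int, o12: int, o21: int, o22: int) -> float:
    """Log-likelihood ratio (G2) for a 2x2 node/collocate contingency table.

    o11: positions where node and collocate co-occur
    o12: node present, collocate absent
    o21: collocate present, node absent
    o22: neither present
    (These cell definitions are an assumption made for illustration.)
    """
    observed = [[o11, o12], [o21, o22]]
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    n = sum(row_totals)

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o == 0:
                continue  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            expected = row_totals[i] * col_totals[j] / n
            g2 += o * math.log(o / expected)
    return 2.0 * g2


# Hypothetical counts: no association at all, then a perfect association.
no_association = log_likelihood(10, 10, 10, 10)
perfect_association = log_likelihood(10, 0, 0, 10)
```

On the usual chi-square approximation with one degree of freedom, a score above 3.84 corresponds to p < 0.05, which is in line with reading 23.68 for 地 (de) as significant and 0.022 for 的 (de) as not.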
        The analysis then moves on to semantic association, a term that may be used interchangeably with semantic preference (Hoey 2005). Stubbs defines semantic preference as “the relation, not between individual words, but between a lemma or word-form and a set of semantically related words” (2001: 65).

        By looking at the collocates of 高兴 (gāo xìng), it is not difficult to identify its semantic sets. Firstly, 高兴 (gāo xìng) co-occurs with intensifiers such as 很 (hĕn, very), 非常 (fēi cháng, very), 十分 (shí fēn, very), and 太 (tài, too). Secondly, verbs denoting sensory experiences form another semantic set, including 听 (tīng, listen) and 说 (shuō, speak/talk). Lastly, personal pronouns such as 我 (wŏ, I), 他 (tā, he) and 她 (tā, she) form a third semantic group.

        With 快乐 (kuài lè), a restricted domain can be identified from its collocates, namely birthday celebration, to which 祝 (zhù, wish) and 生日 (shēng rì, birthday) both belong. Of interest is the collocate 你 (nĭ, you). One might argue that it cannot be assigned to this semantic group; however, examination of the concordance lines shows that 3 out of 4 hits are related to the topic, as 你 is used in the structure 祝你生日快乐 (zhù nĭ shēng rì kuài lè, wish you a happy birthday). No semantic sets could be identified for 开心 (kāi xīn) because of its small number of collocates.

        To sum up, these three “happy” words share some collocates but differ in collocational strength. 高兴 (gāo xìng) is positively primed with 地 (de) and 得 (de), but negatively primed with 是 (shì) and 的 (de); 快乐 (kuài lè) is positively primed with 的 (de) and 是 (shì); 开心 (kāi xīn) is positively primed with 地 (de) and 的 (de). We return to this point when we look at the colligational features of these synonyms. As for semantic association, we do not find much overlap among the synonyms. This may be due to the corpus size and the number of collocates retrieved, and is worth further exploration with a larger corpus.
    3. Colligation

      Colligation refers to “the grammatical position and function a word tends to prefer in or avoid” (Hoey 2005: 13). In addition, Hoey (2005) emphasises that colligation includes “the avoidance of certain grammatical patterns and functions”.

      The first aspect of the colligational behaviour of the three “happy” words concerns their co-occurrence with the suffixes 地 (de, adverb suffix) and 的 (de, adjective suffix). As presented before, the top collocate 地 (de, adverb suffix) in the collocation list of 高兴 (gāo xìng) shows that 高兴 (gāo xìng) is frequently used as an adverb to modify a verb. The negative collocation of 的 (de, adjective suffix) and 高兴 (gāo xìng) suggests that the word tends not to function as an adjective. As for 快乐 (kuài lè), its positive collocation with 的 (de, adjective suffix) shows that it can be used as an adjective. Both 地 (de, adverb suffix) and 的 (de, adjective suffix) co-occur with 开心 (kāi xīn), which indicates that, though more frequent as an adverb, 开心 (kāi xīn) can function as an adjective as well.

      The second part of the colligational analysis concerns the grammatical positions of the three words in clauses. All the instances of the three words were analysed to see whether they occurred as part of the Subject, as part of the Object, as part of the Complement or as part of a prepositional phrase functioning as Adjunct.

      The following findings deserve attention. Firstly, all three words appear in the grammatical positions Part of predicate, Part of adjunct, Part of complement and Part of object; the proportions of these grammatical positions, however, vary from word to word (Table 8).

      Table 8. Grammatical positions of the three “happy” words in clauses

      | Words | Part of predicate | Part of adjunct (to modify verbs) | Part of complement | Part of object: modifier | Part of object: head of object | Other | Total |
      |---|---|---|---|---|---|---|---|
      | 高兴 (gāo xìng) | 82 (66.7%) | 32 (26%) | 2 (1.6%) | 7 (5.7%) | / | / | 123 |
      | 快乐 (kuài lè) | 8 (30.8%) | 1 (3.8%) | 3 (11.5%) | 5 (19.2%) | 6 (23.1%) | 3 (11.5%) | 26 |
      | 开心 (kāi xīn) | 3 (33.3%) | 2 (22.2%) | 2 (22.2%) | 2 (22.2%) | / | / | 9 |

      Secondly, the three words are positively primed to function as predicates in a clause, this being the position with the highest proportion for each word: 66.7%, 30.8% and 33.3% respectively (see Table 8). Note that 高兴 (gāo xìng) is used as predicate about twice as frequently as the other two words; an instance is shown in Example 8. The lexical category of 高兴 (gāo xìng) might be debated, but it seems to be commonly accepted that it functions as predicate in this Chinese clause.
      8. 张 将军 非常 高兴。
         Zhāng jiāng jūn fēi cháng gāo xìng.
         Surname general very happy.
         General Zhang is very happy.
        Thirdly, there is a positive colligation between 高兴 (gāo xìng) and the function of Adjunct, with a proportion of 26% (Example 9). However, 高兴 (gāo xìng) is negatively primed with other functions, including Object (5.7%), Complement (1.6%) and others (none).
      9. 他 高兴 地 笑 起来。
         Tā gāo xìng de xiào qĭ lái.
         He happy par smile par.
         He smiles happily.
      Fourthly, 快乐 (kuài lè) and 开心 (kāi xīn) are positively primed to function as Complement, with proportions of 11.5% and 22.2% respectively (Example 10), while 高兴 (gāo xìng) is negatively primed, with a proportion of only 1.6%.

      10. 那个 时期 的 我, 真 是 非常 忧郁 而 不 快乐 的。
          Nà gè shí qī de wŏ, zhēn shì fēi cháng yōu yù ér bú kuài lè de.
          That time par I, really is very depressed and not happy par.
          I, at that time, was very depressed and unhappy.

      Next, there is a positive colligation between 快乐 (kuài lè) and the function of Object, which can be further divided into instances that function as the head of the Object (Example 11) and instances that function as modifiers (Example 12). However, neither 高兴 (gāo xìng) nor 开心 (kāi xīn) is primed with the function of Object.
      11. 画画 还 能 给 孩子 带来 快乐。
          Huà huà hái néng gĕi hái zi dài lái kuài lè.
          Painting also can give kids bring happy.
          Painting can also bring kids happiness.

      12. 我们 说 些 快乐 的 事。
          Wŏ men shuō xiē kuài lè de shì.
          We talk some happy par thing.
          Let’s talk about something happy.
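The proportions reported in Table 8 follow directly from the raw counts, and recomputing them is a quick way to check the colligational profile of each word. The sketch below is illustrative only: the counts are copied from Table 8, cells marked "/" are assumed to be zero, and the names `TABLE_8`, `proportions` and `colligation_profile` are hypothetical.

```python
# Column order follows Table 8.
POSITIONS = [
    "predicate", "adjunct", "complement",
    "object modifier", "object head", "other",
]

# Raw counts from Table 8; "/" cells are treated as 0 (an assumption).
TABLE_8 = {
    "高兴": [82, 32, 2, 7, 0, 0],
    "快乐": [8, 1, 3, 5, 6, 3],
    "开心": [3, 2, 2, 2, 0, 0],
}


def proportions(counts):
    """Percentage share of each grammatical position, rounded to one decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


def colligation_profile(table):
    """Map each word to a {position: percentage} dictionary."""
    return {
        word: dict(zip(POSITIONS, proportions(counts)))
        for word, counts in table.items()
    }


profile = colligation_profile(TABLE_8)
for word, shares in profile.items():
    dominant = max(shares, key=shares.get)
    print(word, dominant, shares[dominant])
```

For all three words the largest share falls on the predicate position, which matches the second finding above.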
      To answer the first research question, it has been shown that these three synonyms share collocations and colligations, but differ in strength, as represented by the proportions.

      Question 2: Is there a potential link between the way English speakers are primed with respect to words meaning “happy” in English and the way they use similar words in Mandarin Chinese?

      To address the second question, three English synonymous words, happy, glad and joyful, were analysed. As Xiao and McEnery (2006) have pointed out, English and Chinese have different ranges of synonyms; these words were therefore chosen on the basis of free translation rather than one-to-one translation equivalence. Table 9 shows the raw frequency, standardised frequency and dispersion of the three words in FLOB.
      Table 9. Raw frequency, standardised frequency and dispersion of the three English words in FLOB
      | Words | Raw frequency | Standardised frequency (per million) | Dispersion (out of 500 texts) |
      |---|---|---|---|
      | happy | 158 | 19.57 | 50 |
      | glad | 49 | 6.07 | 41 |
      | joyful | 5 | 0.62 | 4 |
      The analysis of the English data follows the same structure: collocation, semantic association and colligation. Note that the focus of the analysis here is not on the differences between the English synonyms, but rather on the comparison across the two languages.

      First, all the collocates of the three words, excluding punctuation marks, were elicited and listed (Tables 10, 11 and 12). One point worth mentioning is that the collocation list shows that happy can be used to modify nouns such as family, life, marriage and home. However, as mentioned before, in LCMC 高兴 (gāo xìng) can be used to modify a noun like 事 (shì, thing), but never modifies the words 家庭 (jiā tíng, family), 生活 (shēng huó, life), 婚姻 (hūn yīn, marriage) or 家 (jiā, home).
      Table 10. Collocation list of happy in FLOB


      Table 11. Collocation list of glad in FLOB

      Table 12. Collocation list of joyful in FLOB

      | No. | Collocates | Frequency | Log-likelihood |
      |---|---|---|---|
      | 1 | and | 3 | 15.792 |
      | 2 | the | 3 | 11.257 |

      Based on the collocation lists, the semantic associations of each word were categorised and the results were compared with those for the Chinese data. The comparative analysis shows the following.

      Firstly, happy has a semantic set of intensifiers including so and very, and 高兴 (gāo xìng) likewise co-occurs with intensifiers such as 很 (hĕn, very), 非常 (fēi cháng, very), 十分 (shí fēn, very), and 太 (tài, too). Secondly, happy and glad are positively primed with “BE” (in the forms was, is, be, are …), while 高兴 (gāo xìng) is negatively primed with 是 (shì, BE). Thirdly, happy occurs only with the verb see, and glad is primed with the verbs see and note; 高兴 (gāo xìng) has a semantic association with verbs denoting sensory experiences, including 听 (tīng, listen) and 说 (shuō, speak/talk). Finally, happy and glad co-occur with personal pronouns including he, you, she, I and we, and 高兴 (gāo xìng) co-occurs with personal pronouns such as 我 (wŏ, I), 他 (tā, he) and 她 (tā, she).

      The comparative analysis then moved on to colligation. Again, all the instances of the three words were analysed to see whether they occurred as part of the Subject, as part of the Object, as part of the Complement or as part of a prepositional phrase functioning as Adjunct.

      The results show that the three words are predominantly primed to function as part of the Complement, with proportions of 71.5%, 96% and 40% respectively (Table 13). Note that 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn), by contrast, are primed to function as predicate.

      Table 13. Grammatical positions of happy, glad and joyful in clauses

      | Words | Part of subject | Part of complement | Part of object | Part of adjunct | Other | Total |
      |---|---|---|---|---|---|---|
      | happy | 8 (5.1%) | 113 (71.5%) | 21 (13.3%) | 11 (7%) | 5 (3.2%) | 158 |
      | glad | / | 47 (96%) | 1 (2%) | 1 (2%) | / | 49 |
      | joyful | / | 2 (40%) | 1 (20%) | 2 (40%) | / | 5 |

      Compare the following:
      13. He was extremely happy in all his scientific work and gave great satisfaction to his colleagues by the cheerful way he helped them. (complement in English)

      14. 今天 开 这个 大会 我们 非常 高兴。
          Jīntiān kāi zhègè dàhuì wŏmén fēicháng gāoxìng.
          Today hold this meeting we very happy.
          We are very happy to hold this meeting. (predicate in Chinese)

      In Examples 13 and 14, happy and 高兴 (gāo xìng) occupy different grammatical positions in the two languages.

      To sum up, there are similarities and differences between the Chinese and English synonyms in terms of their collocational and colligational behaviours. The similarities may be the reason why they can be considered equivalent translations, and the differences may be the cause of difficulties in language learning and of mistakes in learners’ performance.
  6. Conclusion
    This study set out to explore an effective way to distinguish Mandarin Chinese synonyms via a case analysis of “happy” words. The three Chinese synonymous words 高兴 (gāo xìng), 快乐 (kuài lè) and 开心 (kāi xīn) share similarities in terms of collocation and colligation, but differ in strength, as shown by the different proportions. However, they are divergent in terms of semantic association (at least in the LCMC). In addition, there might be a link between the way English speakers are primed with respect to words meaning “happy” in English and the way they use similar words in Mandarin Chinese. The differences between English and Chinese may also lead to students’ mistakes in using the words (for example, with 是 (shì, BE) and “BE”).
  7. Limitations and future research

The results of the analysis of the English data were compared with those for Chinese, and the differences between the two groups offer a potential explanation of the difficulty in using near-synonyms in the target language. One limitation of the study is that, without detailed analysis of learners’ performance (for example in speaking and writing), it is impossible to find out whether there is a direct link between learners’ primings in the first and target languages. Research on interlanguage may provide more reliable evidence of priming transfer (Dean Mellow & Cumming 1994; Yip 1995; Han 2014).

Despite the limitations, this study provides some indications of how corpus-based study on Chinese near-synonyms could be conducted and may provide insights into better ways of teaching Chinese as a second/foreign language. Further research needs to be conducted into other near-synonymous words and phrases as well as into the interlanguage of English-speaking Chinese language learners.


This research is supported by the Humanities and Social Sciences Youth Foundation of the Ministry of Education of China, under Grant No. 15YJC740065.


Altenberg, B. & Granger, S. 2001. The grammatical and lexical patterning of make in native and non-native student writing. Applied Linguistics 22(2): 173–195. doi: 10.1093/applin/22.2.173

Benavides, C. 2015. Using a corpus in a 300-level Spanish grammar course. Foreign Language Annals 48(2): 218–235. doi: 10.1111/flan.12136

Biber, D., Conrad, S. & Reppen, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: CUP. doi: 10.1017/CBO9780511804489

Choi, Y.-M. 2011. Global boom in Chinese language learning. Maeil Business Newspaper, February 10, 2011.

Church, K.W., Gale, W., Hanks, P. & Hindle, D. 1991. Using statistics in lexical analysis. In Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, U. Zernik (ed.), 115–164. Hillsdale NJ: Lawrence Erlbaum Associates.

Cook, G. 1998. The uses of reality: A reply to Ronald Carter. ELT Journal 52: 57–64. doi: 10.1093/elt/52.1.57

Dean Mellow, J. & Cumming, A. 1994. Concord in interlanguage: Efficiency or priming? Applied Linguistics 15(4): 442–473.

Divjak, D. 2006. Ways of intending: Delineating and structuring near synonyms. In Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, S.T. Gries & A. Stefanowitsch (eds), 19–56. Berlin: Mouton de Gruyter.

Edmonds, P. & Hirst, G. 2002. Near synonyms and lexical choice. Computational Linguistics 28(2): 105–144. doi: 10.1162/089120102760173625

Francis, W.N. 1982. Problems of assembling and computerizing large corpora. In Computer Corpora in English Language Research, S. Johansson (ed.), 7–24. Bergen: Norwegian Computing Centre for the Humanities.

Furman, N., Goldberg, D., & Lusin, N. 2010. Enrolments in languages other than English in United States institutions of higher education. Modern Language Association of America.


Gavioli, L. 1997. Exploring texts through the concordancer: Guiding the learner. In Teaching and Language Corpora, A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (eds), 83–99. London: Longman.

Granger, S. & Meunier, F. (eds). 2008. Phraseology: An Interdisciplinary Perspective. Amsterdam: John Benjamins. doi: 10.1075/z.139

Greenbaum, S. & Nelson, G. 1996. The International Corpus of English (ICE) Project. World Englishes 15(1): 3–15. doi: 10.1111/j.1467-971X.1996.tb00088.x

Gries, S.T. 2001. A corpus‑linguistic analysis of ‑ic and ‑ical adjectives. ICAME Journal 25: 65–108.

Han, Z. T. E. 2014. Interlanguage. Amsterdam: John Benjamins Publishing Company.

Hardie, A. 2012. CQPweb – Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17(3): 380–409. doi: 10.1075/ijcl.17.3.04har

Hardie, A. 2012. CQPweb. 

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630

Hoey, M. & Shao, J. 2015. Lexical priming: The odd case of a psycholinguistic theory that generates corpus-linguistic hypotheses for both English and Chinese. In Corpus Linguistics in Chinese Contexts, B. Zou, M. Hoey, & S. Smith (eds). London: Palgrave Macmillan. doi: 10.1057/9781137440037

Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP. doi: 10.1017/CBO9781139524773

Johns, T. 1991. From print out to handout: Grammar and vocabulary teaching in the context of data-driven learning. CALL Austria 10: 14–34.

Johns, T. 1997. Contexts: The background, development and trialling of a concordance-based CALL program. In Wichmann et al. (eds), 100–115.

Kilgarriff, A. 2008. The Sketch Engine. 

Lee, C. Y. & Liu, J. S. 2009. Effect of collocation information on learning lexical semantics for near synonymy distinction. Computational Linguistics and Chinese Language Processing 14(2): 205–220.

Li, D. & Cheng, M. 2008. A Practical Chinese Grammar for Foreigners, revised edn. Beijing: Language and Culture University Press.

Liu, D. 2005. Chinese language teaching: Present and future. Modern Chinese 3: 25–33.

Liu, D. 2010. Is it a chief, main, major, primary, or principal concern? A corpus-based behavioral profile study of the near-synonyms. International Journal of Corpus Linguistics 15(1): 56–87. doi: 10.1075/ijcl.15.1.03liu

Liu, D. & Espino, M. 2012. Actually, genuinely, really, and truly: A corpus-based behavioral profile study of near-synonymous adverbs. International Journal of Corpus Linguistics 17(2): 198–228. doi: 10.1075/ijcl.17.2.03liu

Liu, D. & Jiang, P. 2009. Using a corpus-based lexico-grammatical approach to grammar instruction in EFL and ESL contexts. Modern Language Journal 93: 61–78. doi: 10.1111/j.1540-4781.2009.00828.x

McEnery, T. & Xiao, R. 2011. What corpora can offer in language teaching and learning. In Handbook of Research in Second Language Teaching and Learning, Vol. 2, E. Hinkel (ed.). London: Routledge.

Meyer, C.F. 2004. English Corpus Linguistics: An Introduction. Cambridge: CUP.

Ming, H.L. & Lee, J.Y. 2015. Data‑driven learning: changing the teaching of grammar in EFL classes. ELT Journal 69(3).

Moore, M. 2012. The rise and rise of Mandarin Chinese but how many will end up speaking it?


O’Sullivan, Í. & Chambers, A. 2006. Learners’ writing skills in French: Corpus consultation and learner evaluation. Journal of Second Language Writing 15(1): 49–68. doi: 10.1016/j.jslw.2006.01.002

Partington, A. 1998. Patterns and Meanings: Using Corpora for English Language Research and Teaching [Studies in Corpus Linguistics 2]. Amsterdam: John Benjamins. doi: 10.1075/scl.2

Römer, U. 2004. A corpus‑driven approach to modal auxiliaries and their didactics. In How to Use Corpora in Language Teaching [Studies in Corpus Linguistics 12], J.M. Sinclair (ed.), 185–199. Amsterdam: John Benjamins. doi: 10.1075/scl.12.14rom

Sinclair, J. 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.

Shao, J. 2014. Near synonymy and lexical priming. Paper given at 6th International Conference on Corpus Linguistics, Universidad de Las Palmas de Gran Canaria, May 22–24.

Stubbs, M. 2001. Words and Phrases. Oxford: Blackwell.

Stubbs, M. 2007. Quantitative data on multiword sequences in English: The case of the word world. In Text, Discourse and Corpora, M. Hoey, M. Mahlberg, M. Stubbs, & W. Teubert (eds), 163–190. London: Continuum.

Taylor, J.R. 2002. Near synonyms as co‑extensive categories: ‘high’ and ‘tall’ revisited. Language Sciences 25(3): 263–284. doi: 10.1016/S0388-0001(02)00018-9

Tsui, A. 2004. What teachers have always wanted to know and how corpora can help. In How to Use Corpora in Language Teaching [Studies in Corpus Linguistics 12], J.M. Sinclair (ed.), 39–61. Amsterdam: John Benjamins. doi: 10.1075/scl.12.06tsu

Uysal, H., Bulut, T. & Hosein, Y. Al. 2013. Using concordances as supplementary materials in teaching grammar. Studies about Languages 22: 113–118.

Widdowson, H.G. 2000. On the limitations of linguistics applied. Applied Linguistics 21: 3–25. doi: 10.1093/applin/21.1.3

Wikipedia. 

Xiao, Z. & McEnery, A. 2006. Collocation, semantic prosody and near synonymy: A cross-linguistic perspective. Applied Linguistics 27(1): 103–129. doi: 10.1093/applin/ami045

Xiao, R. & McEnery, T. 2010. Corpus-based Contrastive Studies of English and Chinese. London: Routledge.

Yang, R. 2015. China’s soft power projection in higher education. International Higher Education 46.

Yip, V. 1995. Interlanguage and Learnability: From Chinese to English. Amsterdam: John Benjamins Publishing Company.

Yoo, I.W.-H. 2009. The English definite article: What ESL/EFL grammars say and what corpus findings show. Journal of English for Academic Purposes 8: 267–278. doi: 10.1016/j.jeap.2009.07.004

Yılmaz, E. & Soruç, A. 2015. The use of concordance for teaching vocabulary: A data-driven learning approach. The Proceedings of 6th World Conference on Educational Sciences. Procedia – Social and Behavioral Sciences 191: 1–2882.

Part III

Collocations, associations and priming

Lexical priming and register variation

Tony Berber Sardinha

São Paulo Catholic University

Lexical priming predicts that repeated encounters with lexical patterns will prime users for register awareness (Hoey 2013: 3344). To verify this prediction, this chapter reports on a study that determined the dimensions of collocation in American English, which are the parameters underlying the use of collocations in spoken and written text. The method was inspired by the multidimensional framework for register variation analysis introduced by Biber in the 1980s. The corpus used was the 450-million-word Corpus of Contemporary American English (COCA, 1990–2012 version). The most characteristic collocations of each register in COCA (spoken [American radio and television programs], magazine, newspaper, academic, and fiction) were computed using the logDice coefficient (Rychly 2008). These were then entered in a factor analysis, which yielded the statistical groupings of collocation across the registers. Nine dimensions were identified and are described in this chapter. The relationship between collocation and register was tested statistically through the dimensions, and the results suggested that register could predict the collocations (via the dimensions) between 39% and 67% of the time, which seems to lend support to the hypothesis that users are primed for register, as far as AmE collocations are concerned.

  1. Introduction
    doi 10.1075/scl.79.08ber © 2017 John Benjamins Publishing Company

    One of the key elements of lexical priming theory is the relationship among word combinations, priming, and textual variety:

        An important feature of lexical priming theory is that, at the same time as the listener or reader comes to recognize through repeated encounters with a word, syllable, or word combination the particular semantic, pragmatic, grammatical, and textual/discoursal contexts associated with it, s/he will also subconsciously identify the genre, style, or social situation it is characteristically used in. (Hoey 2013: 3344)

    This theory predicts that individuals are able to store in their minds information about the textual varieties (registers, genres, etc.) in which particular collocations are most typical. In other words, if presented with a particular collocation, individuals should be able to identify the register for which it is characteristically primed. This hypothesis actually presupposes a regular association between collocation and register, meaning that register differences should be marked by differences in collocational use; in other words, different registers should have largely distinct groupings of collocations, as Hoey (2005: 10) illustrated:

        An example of contextual limitation is the collocation of recent and research, which is largely limited to academic writing and news reports of research. Re-expressed in terms of priming, research is primed in the minds of academic language users to occur with recent in such contexts and no others. The words are not primed to occur in recipes, legal documentation or casual conversation, for example. In short, collocational priming is sensitive to the contexts (textual, generic, social) in which the lexical item is encountered, and it is part of our knowledge of a lexical item that it is used in certain combinations in certain kinds of text.

    Therefore, the major goal of the study reported in this chapter is to determine whether such register-characteristic collocations exist and, if so, how extensive and predictable they are. Previous research has shown that particular registers do have characteristic sets of words or terms. For instance, through the keyword procedure in WordSmith Tools, McEnery, Xiao, and Tono (2006: 308–317) identified the most typical words from a sample of English conversation, and Menon and Mukundan (2012) extracted the salient terms in science textbooks. However, no comprehensive studies of collocation have been conducted at the level of register to show what the preferred collocations for particular registers are and how these sets of collocations vary across registers.
    Previous research on collocation ‘has usually ignored register differences’ (Biber 2010: 245), and when a contrastive perspective is adopted, previous studies have largely focused on comparing varieties such as those of native and non-native speakers (e.g., Hunston 2002: 206–212). Notable exceptions include Pace-Sigge (2015), who looked at naturally occurring conversation, prepared speech, and written literature, and the seminal work by Sinclair, Jones, and Daley (1970/2004), who compared collocations occurring in a science magazine (New Scientist) and in conversation and found that the collocations could discriminate between the two registers. The authors concluded that:

        from a linguistic point of view it is interesting to find that ‘strength of collocation’ provides a useful discriminant between different types of English and it would be interesting to see if the results were so encouraging for two texts which differ very little. (p. 133)

    Unlike collocation, both fixed and semi-fixed word combinations have been investigated from a register perspective. For instance, Biber and Conrad (1999) identified lexical bundles (fixed word sequences) in conversation and academic prose in English; Berber Sardinha, Ferreira, and Teixeira (2014) described the lexical bundles across 48 different registers in Brazilian Portuguese; and Gray and Biber (2013) compared the lexical frames (both fixed and discontinuous sequences) in English academic prose and conversation. This chapter attempts to fill the gap in collocation studies by reporting the results of a multidimensional study on collocations from a register perspective using the 450-million-word Corpus of Contemporary American English (COCA, 1990–2012 full text, downloadable version). After describing the dimensions of collocation variation, this chapter will present the results of a discriminant function analysis used to determine whether it is possible to predict the register category from collocations using the dimensions as predictors.
  2. Method
    The method for this investigation was inspired by the multidimensional (MD) analysis of register variation, introduced by Biber in the 1980s (Biber 1988) and subsequently developed by him and his colleagues (cf. Berber Sardinha & Veirano Pinto 2014). The goal of an MD analysis is to determine the dimensions or underlying parameters of variation in the data. Traditionally, the MD framework has been applied to the study of cross-register variation based on patterns of lexicogrammatical characteristics. The texts in a corpus are tagged for these characteristics, and counts are taken for each characteristic in each text. These counts are then normed to a rate of (usually) a thousand words, and the normed counts are submitted to a factor analysis that identifies the latent groups of co-occurring linguistic characteristics. Standardized frequencies are computed for the characteristics that load on the factors, and these standardized frequencies are summed for each text, thereby producing factor scores, which are a value attached to each text for each factor. Next, the factors are interpreted functionally, based on the linguistic characteristics that loaded on them, and an interpretive label is suggested to capture the underlying functional and communicative parameters of variation. Various MD analyses have been conducted over time, both for language-wide register variation, such as English (Biber 1988), Spanish (Biber, Davies, Jones & Tracy-Ventura 2006), and Brazilian Portuguese (Berber Sardinha, Kauffmann & Acunzo 2014), and within particular registers, such as literature (Egbert 2012) and television (Berber Sardinha & Veirano Pinto in press), to mention just a few. However, no extensive studies exist in the literature on the systematic variation in English collocation across registers.

    Several major differences exist between a mainstream MD analysis and the MD analysis carried out here (see Table 1 for a summary).
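Before turning to those differences, the mainstream procedure just described (norming counts per thousand words, standardizing them, and summing the standardized values of the characteristics that load on a factor) can be made concrete. The following is a minimal sketch under stated assumptions: the factor analysis itself is not performed, the set of characteristics loading on the factor is simply given, and all names and figures are hypothetical rather than taken from any study cited here.

```python
from statistics import mean, stdev


def normed_rate(count, text_length, per=1000):
    """Frequency of a characteristic per `per` words of running text."""
    return count / text_length * per


def standardize(values):
    """Convert normed rates (one value per text) to z-scores."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


def factor_scores(rates_by_feature, loading_features):
    """Sum the z-scores of the characteristics loading on one factor.

    Returns one score per text, in the order the texts appear in the
    rate lists.
    """
    z = {f: standardize(rates_by_feature[f]) for f in loading_features}
    n_texts = len(next(iter(z.values())))
    return [sum(z[f][t] for f in loading_features) for t in range(n_texts)]


# Hypothetical normed rates for three texts and two characteristics
# assumed to load on the same factor.
rates = {
    "nouns": [10.0, 20.0, 30.0],
    "prepositions": [1.0, 2.0, 3.0],
}
scores = factor_scores(rates, ["nouns", "prepositions"])
```

In this toy case the first text scores below the corpus mean on the factor and the third text above it, which is the kind of per-text value that is then compared across registers.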
    First, in this investigation, the units upon which the analysis was based were collocations – more specifically, pairs of words, with one representing a node and the other a collocate (these nodes and collocates were selected from among the most frequent words in each register in COCA; see below). Second, in this investigation, the measurements taken for each unit were not text counts, but a word association statistic (logDice; see Rychly 2008) that gauged the attraction between the two words. Third, the factor scores were calculated for the collocates of each node rather than the texts in the corpus. Finally, as mentioned, the interpretation of the factors in this investigation was based primarily (but not solely) on lexical features revealed by their semantic preference (Stubbs 2007), lexical sets (Sinclair & Jones 1974/1996), word fields¹ (Lehrer 1974; Trier 1931), ‘aboutness’² (Phillips 1989; Scott 2000; Yablo 2016), topics (Berber Sardinha 1997), and subject matter (Schütze 1998). The collocations loaded on factor 8 in this study, like |mix-v + bowl~n|, |mix-v + ingredient~n|, and |cup-n + sugar~n|,³ can be used to illustrate this last point. In order to interpret a factor containing such collocations, the analyst must look at text samples where the collocations occur, examine their pattern of distribution across the registers, and then determine what the factor means. One interpretation is that the relationships holding between the words can be understood in terms of semantic preference – that is, ‘the relation between the node word and lexical sets of semantically related word-forms or lemmas’ (Stubbs 2007: 178). Furthermore, ‘semantic preference refers to what has traditionally been known as lexical field: a class of words that share some semantic feature. (…) This will relate to the topic of the surrounding co-text: what the text is about’ (Stubbs 2007: 178). As a result, this interpretation would suggest that the latent parameter underlying the factor was the lexical field, topic, or subject matter of cooking.
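The word association statistic named in the second difference above, logDice, can be stated compactly: it is 14 plus the base-2 logarithm of the Dice coefficient of the node and collocate frequencies, so its theoretical maximum is 14. The sketch below follows that definition; the function name and the frequencies are hypothetical, and nothing here is meant to reproduce the values computed from COCA.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice score for a node/collocate pair.

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node
    f_y:  frequency of the collocate
    """
    if f_xy <= 0:
        raise ValueError("logDice is undefined when the pair never co-occurs")
    dice = 2 * f_xy / (f_x + f_y)
    return 14 + math.log2(dice)


# Hypothetical frequencies, chosen so the arithmetic is easy to follow.
always_together = log_dice(10, 10, 10)    # Dice = 1, so the score is the maximum
half_the_time = log_dice(50, 100, 100)    # Dice = 0.5, one point below the maximum
```

Because the score depends only on the three frequencies and not on corpus size, values computed for different registers can be placed on the same scale, which is convenient when they are entered into a single factor analysis.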
In con‑ trast, another possible interpretation is that the relationships arising from these col‑ locations are related to a particular discourse category reflected in the register where these collocations are found in the corpus. These collocations are typically found in recipes and culinary reports included in the magazine section of COCA, and such texts can be seen as manifestations of ‘instructional’ or ‘directive discourse’ (Berber Sardinha, Kauffmann et al. 2014), thereby encapsulating the idea of language being used systematically to give directions to help users accomplish particular tasks, such as preparing a meal, operating an appliance, or playing a game. Consequently, this interpretation would highlight the discourse aspects of the collocations and might suggest an interpretive label along the lines of instructional or directive dis‑ course to the factor. As mentioned, in this research project, the interpretation was
    1. ‘A set of linguistic forms that expresses the underlying conceptual structure’ (Trier 1931, p. 1).
    2. ‘… the relation that meaningful items bear to whatever it is that they are on or of or that they address or concern’ (Yablo 2016, p. 1).
    3. In each pair, the node word appears followed by a dash (-) and the collocate, by a tilde (~). Individual node – collocate pairs are listed between bars.
primarily driven by lexical concerns; therefore, the interpretive label suggested for this particular factor invoked a lexical field (‘cooking’) rather than a discourse category (‘directive discourse’). In other cases, the most adequate interpretation of the data was functional or discourse-based, as with the first factor in this study, which corresponded to collocations like |issue~n + relate-v|, |seem~v + appropriate-j|, |identify-v + problem~n|, and |specific-j + need~n| (see below). In this case, a functional interpretation was preferred – namely, ‘literate discourse’ – because such a label indicated the discourse category where these collocations are most likely to be found. Apart from these differences, the goal of the analysis remains the same between a mainstream and a collocation-based MD analysis: to determine the underlying parameters of variation across text varieties. In a mainstream MD analysis, these parameters for the most part reflect structural characteristics, whereas here, they correspond to lexical choices. The claim here is that register variation is motivated by both structural/functional factors and lexical ones. Although a considerable body of evidence exists about how register variation is influenced by functional considerations, little evidence exists about how register variation is shaped by lexical use. To our knowledge, this is the first MD analysis of English collocations in the literature and the first large-scale study of English collocations from a register perspective. A previous MD analysis of collocations exists for Brazilian Portuguese (Berber Sardinha, Mayer Acunzo & São Bento Ferreira in press), which served as the basis for a collocation dictionary of that language and whose method was largely replicated here, but that study did not focus on register variation.

Table 1. Comparison of traditional and collocation-based MD analyses

                          Traditional MD analysis        MD collocation analysis
Goal                      Determine the underlying parameters of variation across registers (both)
Unit of observation       Texts                          Collocates
Measurement               Normed counts                  Lexical association scores
Primary interpretation    Functional, communicative      Semantic preference, topical

The corpus used for this investigation was COCA – more specifically, its full text version, purchased from the BYU website. The full text version differed from the public version available online in that, for every 200 words of text, 10 words were replaced with a wildcard character (‘@’) for copyright reasons, which means that 95% of the original corpus was used in the investigation (ca. 440 million words). COCA comprises five different registers: spoken (American radio and television programs), magazine, newspaper, academic (books and journal articles), and fiction (literary fiction). According to the COCA website (corpus.byu.edu/coca/, visited January 2016), the corpus contains some 190,000 texts.
A script was developed that produced lists with the 300,000 most frequent word pairs in each register within a window of four words to either side of the node. From these lists, the script identified the collocates of the 2,000 most frequent lemmas of common nouns, main verbs, and adjectives in each register in COCA.⁴ The script then calculated the logDice statistic of each word pair to measure the strength of association. The logDice was calculated as follows (Rychly 2008: 9): $\mathrm{logDice} = 14 + \log_2 \mathrm{Dice}$, with $\mathrm{Dice} = 2f_{xy} / (f_x + f_y)$, where $f_{xy}$ is the joint frequency of the node and the collocate within the 4–4 span, $f_x$ is the frequency of the node in the corpus, and $f_y$ is the frequency of the collocate in the corpus. The size of the span was chosen based on the work of Sinclair and Jones (1974/1996: 27), according to whom ‘by measuring the influence exerted by all the types in ten different texts, it was ascertained that for any node, a very high proportion of relevant information could be obtained by examining collocates at positions N −4 to N +4.’ The collocations were not distinguished according to the position of the collocate relative to the node (left versus right, immediate versus non-immediate, first right versus third left, etc.), because there was no reason to assume that positional restrictions would be associated with register distinctions. In other words, in this exploratory study the intent was to verify whether collocation in general rather than collocation position is influenced by register. In addition, even if collocations were to be distinguished by position, there is no consensus as to how fine the distinction among the positions should be, such as whether a lump sum of left versus right collocates would be appropriate or whether a more fine-grained position-by-position classification would be in order (e.g., individual counts of L1, L2, L3).
Further research could explore the role of the position of the collocate relative to the node in register variation. The logDice lists of each register were then merged into a single spreadsheet totaling 3,511 columns (one for each node) and 23,602 rows (one for each collocate). Table 2 presents the size of the corpus and the sample of collocates taken for each register, and Figure 1 provides a snapshot of the spreadsheet where the data were recorded.
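The logDice calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in the study: the function name `log_dice` and the frequency counts are hypothetical, and only the formula itself is taken from Rychly (2008).

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return 14 + log2(Dice), where Dice = 2 * f_xy / (f_x + f_y)."""
    if f_xy <= 0:
        raise ValueError("joint frequency must be positive")
    dice = 2 * f_xy / (f_x + f_y)
    return 14 + math.log2(dice)


# Hypothetical counts: node occurs 12,000 times, collocate 8,000 times,
# and the pair co-occurs 250 times within the 4-4 span.
print(round(log_dice(f_xy=250, f_x=12_000, f_y=8_000), 2))  # 8.68
```

When the joint frequency equals both marginal frequencies, Dice is 1 and logDice reaches 14; rarer co-occurrence gives lower values.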
4. These lemmas were treated as types, not tokens, in the spreadsheet columns. Because many of these lemmas appeared in more than one register and each lemma could only appear in one column of the spreadsheet, when they were combined, repeated occurrences were collapsed, thereby reducing the total from 10,000 (5 × 2,000) to 3,511.
Table 2. Breakdown of data used in the study⁵
(Columns: Register; COCA text version tokens; Collocate sample)

      Figure 1. Snippet of the data on a spreadsheet
The factor analysis was carried out in SAS University Edition, using principal factor as the method of extraction, which generated a scree plot of the eigenvalues⁶ per factor solution (Figure 2). The scree plot was examined for plateaus (‘elbows’) that suggested the optimum number of factors in the data; in this case, one such plateau was found at factor 9, thereby suggesting that nine factors were a plausible solution. A total of 3,201 variables (node words) with communalities lower than .15 were dropped, and the remaining 310 variables were submitted to a factor analysis
5. Initially, a list of the 300,000 most frequent lemmatized collocates of the 2,000 most frequent lemmatized nodes (more specifically, nouns, verbs, and adjectives) of each register was created. Each resulting list was saved as the collocate sample for a particular register. The size of each list is shown in the ‘collocate sample’ column in Table 2. These samples were of unequal size because the registers had unequal proportions of words that were removed (namely, proper nouns, auxiliaries, modal verbs, and foreign words). The collocate sample was not subsequently balanced in size across the register categories because that would have involved reducing the individual lists, thereby causing data loss. Moreover, the resulting individual samples are of comparable length, ranging from 19% to 22% of the total.
    6. An eigenvalue ‘indicates the amount of variance extracted and corresponds to the sum of loadings on a factor after their squaring’ (Doise, Clemence, & Lorenzi-Cioldi 1993, p. 72).
rotated with Promax that extracted nine factors. The factorial pattern table (Table A1 in the appendix) was then examined, and variables with loadings lower than .3 were discarded. Factor scores were computed for each collocate by summing up the standardized logDice values of the nodes that loaded on each factor. The resulting solution was then interpreted by looking at the collocations present in each factor; to aid in the interpretation, samples of text were tagged with the online English semantic tagger USAS, which classified each word into at least one semantic category representing a semantic field. Its output was then processed with a purpose-built script that extracted lists of the most frequent semantic fields annotated by the tagger and the words associated with them. As mentioned, the interpretation of the factors as dimensions took into consideration a range of constructs, tools, and techniques, including the whole set of collocations in the factor, the semantic preferences exhibited by the collocations, their lexical or semantic fields, the ‘aboutness’ manifested in the items, the subject matter and topics expressed, the major collocations sorted by their factor scores, KWICs of node – collocate combinations (obtained from the online version of COCA), the variation (or lack thereof) across the registers, and samples of texts in which the collocations occurred. Finally, mean dimension scores were computed for each register on each dimension, charts were produced that displayed these mean scores, and statistics were calculated in SPSS to measure the extent and size of the variation – namely, the F-score and the coefficient of determination (R²).
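The factor score computation described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure as run in SAS: the function names, the toy logDice values, and the loadings are hypothetical; the population standard deviation is assumed for standardization; and only positive loadings at or above the .3 cut-off are treated as salient.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores for one node's logDice column (population SD assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


def factor_scores(logdice, loadings, factor, cutoff=0.3):
    """Sum standardized logDice values over the nodes that load on `factor`.

    logdice:  {node: [logDice per collocate]}  (columns = nodes, rows = collocates)
    loadings: {node: {factor: loading}}
    """
    salient = [node for node, by_factor in loadings.items()
               if by_factor.get(factor, 0.0) >= cutoff]
    z = {node: standardize(logdice[node]) for node in salient}
    n_rows = len(next(iter(logdice.values())))
    return [sum(z[node][i] for node in salient) for i in range(n_rows)]


# Hypothetical toy data: three nodes (columns) and four collocates (rows).
collocates = ["bowl", "ingredient", "mixture", "problem"]
logdice = {
    "mix":   [8.0, 6.0, 0.0, 0.0],
    "stir":  [7.0, 0.0, 5.0, 0.0],
    "issue": [0.0, 0.0, 0.0, 9.0],
}
loadings = {
    "mix":   {"F8": 0.62},
    "stir":  {"F8": 0.55},
    "issue": {"F1": 0.70, "F8": 0.05},
}

scores = factor_scores(logdice, loadings, "F8")
for collocate, score in zip(collocates, scores):
    print(f"{collocate:<12}{score:6.2f}")
```

In this toy case the collocate that is strongly attracted to both salient nodes (bowl) receives the highest score on the hypothetical factor, while a collocate attached only to a non-salient node (problem) receives the lowest.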
  3. Dimensions of collocation in American English
In an MD analysis, a dimension is an underlying parameter of variation associated with a factor, which in turn is a set of correlated linguistic characteristics. In this study, the factors are groups of correlated collocations repeatedly found in texts, and the dimensions are the parameters underlying the variation of these collocations across the registers. The individual dimensions are presented below, based on the interpretation of the factors. In interpreting the factors in an MD analysis, it is crucial to consider the mean factor scores for the registers, as these give a general idea of the situational context underlying the parameters of variation. However, the mean factor scores for a collocation-based MD analysis can be a less reliable indicator of the central tendency of the data than for a grammar-based MD, because the collocations of a particular set of words are generally sparser than structural features. A collocation that occurs in a particular text might not occur at all in several other texts in the same corpus, whereas a noun or adjective will most probably occur in all texts. As a result, the mean frequencies for such collocations can be influenced by both the lack of collocations and the presence of frequent
Figure 2. Scree plot (x-axis: factors; y-axis: eigenvalues)
collocations. This aspect becomes relevant when calculating the mean factor score for each register, because the mean is highly influenced by skewed distributions like those that are likely to exist with collocation data. Skewed distributions of the factor scores were found for all the factors in this study, which was not surprising given the selective nature of phraseology (words will not combine randomly). As a result, a large number of zero values existed in the data (because collocations simply did not occur in the corpus or because they did not occur frequently enough to be included in the analysis). At the same time, some nodes entered into a large number of collocations with different collocates, resulting in large factor scores for those collocates. In summary, for each register, we found two basic sets of complementary observations: a large number of words that formed few or no collocations and a comparatively smaller set of words that entered into multiple collocations. Often, this complementary distribution varied across the registers. For example, the word ability generated a large set of collocations in academic (120), but small sets of collocations in the remaining registers (e.g., in fiction, it had only one collocation; see Table 3). This skewed distribution caused a mismatch between the mean and the median (the point below which 50% of the values fall), where one would expect them to be equal or at least close if the data were normally distributed. In this case, the mean
was driven up by the high scores, whereas the median was pulled down by the low scores. To better represent the pattern of the data, three types of graphs are shown for each dimension: a regular MD graph of the means of each register, boxplots showing the major distributional characteristics of the logarithmic values of the factor scores, and strip charts presenting the actual spread of the data points. It was important to consider these three tools for the interpretation of the factors, because each provided a different angle from which to observe the data. The mean values graphs were informative in that they indicated that high scoring registers probably had a set of really distinctive collocations that in a way ‘set the tone’ for the dimension, even if these high scores reflected the tail end rather than the central tendency of the data. The boxplots provided a visual representation of the range of the data points, including the upper and lower quartiles (the top and bottom of the ‘box,’ respectively), which cut off the highest and lowest 25% of the data, the median (the mid-point of the distribution – i.e., the line intersecting the box), the greatest and least non-outlying values (the end points of the ‘whiskers’ – that is, the lines extending from the top and bottom of the boxes), and the outliers (data points that fall an abnormal distance away from the remaining values – that is, by the usual convention, more than 1.5 times the interquartile range above the upper quartile or below the lower quartile). The boxplots were constructed using the logarithmic transformation of the factor scores so as to reduce the scale of the y-axis (the factor scores) and permit the display of the actual boxes on the chart area (otherwise, the boxes ended up being drawn as a single line). Although the log scores do not reflect the actual distance between the factor scores, they do retain the relative position of registers in relation to each other. Finally, the strip charts depict the factor scores along a natural scale, thereby providing a bird’s eye view of the data.
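The gap between mean and median in zero-heavy data, and the way a boxplot can collapse to a line on the natural scale, can be illustrated with a small sketch. The scores below are hypothetical, the function name `box_summary` is an assumption, and the conventional 1.5 × interquartile-range rule is assumed for flagging outliers.

```python
from statistics import mean, median, quantiles


def box_summary(scores):
    """Return the mean, median, quartiles, and 1.5 * IQR outliers of `scores`."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(scores),
        "median": median(scores),
        "q1": q1,
        "q3": q3,
        "outliers": [s for s in sorted(scores) if s < low_fence or s > high_fence],
    }


# Hypothetical factor scores for one register: mostly zeros, a few large values.
scores = [0] * 12 + [1, 2, 3, 120]
summary = box_summary(scores)
print(summary)
```

Here a single large score lifts the mean well above the median of zero, and the box (from the first to the third quartile) is so narrow that every non-zero score is flagged as an outlier.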
1. Dimension 1: Literate discourse
With 331 variables, factor 1 is the largest in the data, encompassing a wide variety of abstract words related to different semantic fields, such as modifying and changing (change, develop, development), affecting and causing (affect, basis, influence, etc.), speech (communication, discussion, explain, etc.), investigating (analysis, researcher, empirical, etc.), quantifying (additional, increase, multiple, etc.), conceptualizing (concept, criteria, perspective, etc.), methods (approach, framework, methodology), comparing (compare, comparison, difference, etc.), and knowledge (data, information, knowledge, etc.). Typical collocations⁷ include |issue~n + relate-v|, |factor~n + relate-v|, |seem~v + appropriate-j|, |appropriate-j + behavior~n|, |identify-v + problem~n|, |identify-v + specific~j|, |specific-j + need~n|, |specific-j + area~n|, |individual-j + difference~n|, |individual-j + right~n|, |assessment-n + tool~n|, and
7. The word pairs appear in the order in which they most frequently occur in COCA.
|risk~n + assessment-n|. The mean factor scores (Figure 3) suggest that the academic register is the most prominent; this is confirmed by both the box plot (Figure 4) and the strip chart (Figure 5). Thus, overall, the factor reflects literate discourse in general and academic language in particular. These collocations sharply distinguish academic writing from the other registers. The F-score is significant, suggesting a statistical difference among the mean register scores on the dimension. However, according to the R², only 5% of the variation in the factor scores is accounted for by the register distinctions on this dimension, which is due to the wide dispersion of the factor scores, as denoted by the large standard deviations for all registers (i.e., 297.3 for academic, 12.0 for fiction, 53.2 for magazines, 45.6 for news, and 42.3 for spoken). Samples of major collocations in the factor appear in Example 1.
Figure 3. Mean factor scores for Dimension 1, Literate discourse (F = 313.64, p < .0001, R² = .05). Registers from highest to lowest mean: Academic (+64.0), Magazines, Newspapers, Spoken, Fiction.
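How a significant F-score can coexist with a small R² may be easier to see in a short calculation. The sketch below assumes that R² is the share of the total sum of squares that lies between registers in a one-way ANOVA; the function name `anova_f_r2` and the toy scores are hypothetical and do not reproduce the SPSS output reported here.

```python
from statistics import mean


def anova_f_r2(groups):
    """One-way ANOVA F statistic and R² (SS_between / SS_total) for grouped scores."""
    values = [v for g in groups.values() for v in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between / ss_total


# Hypothetical skewed scores: most collocates score 0, a few score high.
toy = {
    "academic": [0] * 90 + [50] * 10,
    "spoken":   [0] * 95 + [50] * 5,
    "fiction":  [0] * 98 + [50] * 2,
}
f_stat, r2 = anova_f_r2(toy)
print(f"F = {f_stat:.2f}, R2 = {r2:.3f}")
```

Because F grows with the number of observations while R² does not, a very large sample can yield a significant F even when register membership explains only a small share of the variance.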
Figure 4. Boxplot for Dimension 1, Literate discourse
Figure 5. Spread of factor scores for Dimension 1, Literate discourse
Example 1.
It is an example of many of the key issues and challenges related to identifying empirically supported treatments (and assessments). (Education & Treatment of Children; academic)
By design, the nature of the task was the same throughout the administration (e.g., child was exposed to test plate/image, expected to respond), so 1 min seemed appropriate to capture the nature of the child’s behavior during the task. (Language, Speech & Hearing Services in Schools; academic)
Then fundamental problems and trends were identified and a rough geography curriculum was developed accordingly. (Education; academic)
These resources are organized toward the needs of specific students and divided into lists of availability from the Internet and in print. (Music Educators Journal; academic)
2. Dimension 2: Oral discourse
Factor 2 is the second largest, with 210 variables, and includes words expressing a large array of semantic fields, such as kinship (brother, daughter, family, etc.), time periods (moment, day, hour, etc.), speech (say, speak, talk, ask, tell, etc.), quantities (couple, number, percent, etc.), understanding and learning (know, remember, learn, understand, idea, etc.), thoughts and beliefs (believe, feel, think, know, etc.), and evaluation (bad, best, better, good, great). The variety of semantic fields suggests that the vocabulary in the factor does not represent any particular field or domain, but rather the needs of speakers in oral communication. This is reflected in the most typical collocations in the factor, such as |want~v + know-v|, |people~n + know-v|, |want-v + say~v|, |people~n + think-v|, |think-v + go~v|, |kid-n + school~n|, |young~j + kid-n|, |other~j + thing-n|, |kind~n + thing-n|, |like-v + see~v|, |like-v + ask~v|, |long~j + time-n|, |let~v + start-v|, and |get~v + start-v|, which are often found in colloquial conversations or registers that try to emulate unscripted dialog, like television programs and literary fiction. Samples of the collocations appear in Example 2. The mean factor scores (Figure 6) indicate a contrast between spoken and fiction on the positive pole and newspapers, magazines, and academic on the negative, which reinforces the interpretation of the factor as having an underlying oral, spoken-like component that is in contrast with the literate component underlying the previous dimension. Yet in both the box plot and the strip chart (Figures 7 and 8), this contrast is not apparent, and there is actually a tie among the registers. For this reason, the factor does not seem to correspond to actual spoken language, but to oral language, which can be written or spoken. Based on these observations, the proposed interpretive label is oral discourse.
As with the previous factor, the F-score suggests a statistical difference among the registers, but because the R² is quite low (0.3% – again, due to the large standard deviations), the probability of predicting the register from the characteristics of this dimension alone is also very small.
Figure 6. Mean scores for Dimension 2, Oral discourse (F = 19.612, p < .0001, R² = .003)
Figure 7. Boxplot for Dimension 2, Oral discourse
Figure 8. Spread of factor scores for Dimension 2, Oral discourse
Example 2.
So I just want to let everybody know. That’s a fun place to go. (NBC Today; spoken)
I just heard him say that it really doesn’t matter what people on Capitol Hill think of Newt Gingrich. (CBS This Morning; spoken)
He’s been married for 14 years and has two school aged kids. (NBC Today; spoken)
    3. Dimension 3: Objects, people, and actions
Factor 3 also comprises a large variable set (123), but unlike the previous dimension, the semantic fields of its vocabulary are narrower, including basically items related to the description of physical objects and the human body, such as body parts (head, face, hand, etc.), names of furniture items (bed, chair, desk, etc.), colors (blue, brown, gray, etc.), substances and materials (glass, stone), general appearance and physical properties (bare, lean, shiny, etc.), shapes (corner, round, twist), parts of buildings (bedroom, door, floor, etc.), moving things (slide, walk, drag, etc.), putting/taking/pushing things (drag, hang, pull, etc.), measurement (tiny, heavy, tall, thin, etc.), and sensory experience (kiss, touch, gaze, etc.). Typical collocations include |stare-v + window~n|, |stare-v + ceiling~n|, |slide-v + open~j|, |pull-v + trigger~n|, |car~n + pull-v|, |dark-j + hair~n|, |dark-j + eye~n|, |hang-v + phone~n|, |hang-v + wall~n|, |lean-v + kiss~v|, |lean-v + chair~n|, |tall~j + thin-j|, |thin-j + layer~n|, |pale-j + skin~n|, and |pale-j + blue~j|. According to the three graphs (Figures 9, 10, and 11), this dimension distinguishes fiction from the other registers, which suggests that these collocations are generally used to describe actions, settings, or the physical appearance of characters in a story (Example 3). The proposed interpretive label for this dimension is therefore ‘objects, people, and actions.’ As with the previous dimensions, a statistical difference exists among the registers, as indicated by the F-score; however, because of the large standard deviations in factor scores, the R² is small (4%), suggesting that the chances of predicting the register from the factor scores for this dimension alone are small as well.
Figure 9. Mean scores for Dimension 3, Objects, people, and actions (F = 246.4, p < .0001, R² = .04)
Figure 10. Boxplot for Dimension 3, Objects, people, and actions
Figure 11. Spread of factor scores for Dimension 3, Objects, people, and actions
Example 3.
Jade leaned against the window and stared toward the Tennessee River. (Love Lifted Me; fiction)
Her hair was perfect, dark, thick, lustrous, and piled up on her head. (Some Like Them Rich; fiction)
Even if you dreaded the trigger being pulled, wanted the ram to wake from his trance and bound suddenly away, to safety. (We Were the Mulvaneys; fiction)
4. Dimension 4: Colloquial and informal language use
Factor 4 includes 104 words, consisting mostly of kinship terms – both traditional/formal (aunt, cousin, grandmother, etc.) and informal (mom, dad, mama)⁸ – religious and supernatural terms (lord, angel, god, dragon, etc.), words expressing thinking and believing (guess, imagine, wonder, etc.), informal or personal attitude markers (fool, silly, stupid), living creatures (cat, horse, rabbit), expletives (fuck, fucking), and terms of address (Mrs., Miss, captain, etc.). Some typical collocations are |afraid-j + lose~v|, |mama-n + papa~n|, |mama-n + daddy~n|, |mommy~n + daddy-n|, |glad-j + hear~v|, |glad-j + see~v|, |stupid-j + question~n|, |mother~n + grandmother-n|, |grandmother-n + die~v|, and |ship~n + captain-n|. According to all three charts (Figures 12, 13, and 14), this dimension also distinguishes between fiction and the other registers, like the previous one; however, the collocations highlight colloquial, informal language use, which is therefore the interpretive label suggested for the dimension. The score for spoken on this dimension underscores the nature of COCA’s spoken register, which is not naturally occurring conversational speech; therefore, it is not as high as one would expect actual conversation to be. The interpretation of the statistics for the factor is similar to the previous dimensions: Whereas the F-score indicates a statistical difference among the registers, the low R² (1%) shows that the factor scores cannot predict the register categories. Text samples appear in Example 4.
8. It is interesting that the factor analysis split kinship terms between dimensions 2 and 4. Words like father and mother loaded on factor 2, while words like mom and dad loaded on factor 4. This reflects the different collocations and register preferences of these words.
Figure 12. Mean scores for Dimension 4, Colloquial and informal language use (F = 57.681, p < .0001, R² = .01)
Figure 13. Boxplot for Dimension 4, Colloquial and informal language use
Example 4.
Julietta Giordano slipped past her papa and mama, her elder sister, and her three brothers as they ate breakfast. (Heart Most Worthy; fiction)
It was very early morning there, but his mother was glad to hear he’d arrived safely. (The Welcome Committee of Butternut Creek; fiction)
The ship’s captain told them he had never sailed with a finer group. (True Sisters; fiction)

Figure 14. Spread of factor scores for Dimension 4, Colloquial and informal language use
5. Dimension 5: Organizations and the government
Factor 5 comprises 80 variables, including vocabulary related to the government (citizen, congress, council, etc.); money, markets, and the economy (funding, tax, budget, market, trade, etc.); organizations (board, committee, leader, etc.); groups and affiliations (association, member, public, etc.); places (country, district, national, etc.); law and order (law, regulation, security); and the military (defense, military). Typical collocations include |protection~n + agency-n|, |official-n + say~v|, |international-j + monetary~j|, |national-j + association~n|, |district-n + attorney~n|, and |federal~j + government-n|. The mean scores chart seems to distinguish between newspapers and academic on the positive pole and magazines and fiction on the negative pole, with spoken being unmarked (Figure 15). This is partly confirmed by both the box plot and the strip chart (Figures 16 and 17), which suggest a tie among academic, spoken, and newspaper and confirm that fiction is the least marked register. Although the apparent distinction among the registers is confirmed by the significant F-score, as with the other dimensions, the probability of predicting the register categories from the factor scores is very small (1.5%). The composition of the factor suggests labeling this dimension as organizations and the government; collocations are illustrated in Example 5.
Figure 15. Mean scores for Dimension 5, Organizations and the government (F = 93.325, p < .0001, R² = .015)
Example 5.
The Environmental Protection Agency and city Environmental Protection Department announced May 8 they will spend federal funds to professionally clean the apartment (Associated Press; newspaper)
‘the strategic weight is shifting south,’ said a senior Australian official. (Washington Post; newspaper)
high-level meetings that began Friday and included top government officials from the Federal Reserve (USA Today; newspaper)
Figure 16. Boxplot for Dimension 5, Organizations and the government
Figure 17. Spread of factor scores for Dimension 5, Organizations and the government
6. Dimension 6: Politics and current affairs
The 59 variables in this factor generally reflect radio and television programs about politics and current affairs, with most of the variables relating to the program format (interview), politics (politician, democrat, elect, etc.), giving opinions (comment, admit, react, etc.), expressing opposition (blame, criticize), acknowledging and disputing (question, admit, warn, etc.), agreeing and disagreeing (agree, disagree), and evaluating (exciting, interesting, awful, wonderful). Some frequent collocations are |other~j + politician-n|, |decline~v + interview-v|, |police~n + interview-v|, |deserve-v + credit~n|, |think~v + deserve-v|, |hurt-v + economy~n|, |worry-v + future~n|, |respect-v + right~n|, and |disagree-v + president~n|, among others, some of which are illustrated in Example 6. Based on the means graph (Figure 18), the basic distinction drawn by the dimension seems to be between spoken on the positive pole and the remaining registers on the negative pole. The prominence of spoken seems to be confirmed by the strip chart (Figure 20), but on the box plot (Figure 19), there seems to be a tie, with fiction slightly ahead. It is therefore not clear how the dimension corresponds to register distinctions, which is corroborated by the low coefficient of determination. As a result, the dimension label proposed takes into account topical selection rather than registerial preferences – namely, ‘politics and current affairs.’ This again highlights the nature of the spoken texts in COCA, which consist of radio and television shows, where this subject matter is usually addressed, and not colloquial conversation, where such topics would probably not be talked about as often.

Figure 18. Mean scores for Dimension 6, Politics and current affairs (F = 48.634, p < .0001, R² = .008)
Example 6.
All of the other GOP politicians only got single digits. (Hannity; spoken)
And two months later, those attorneys, Stufft and Battle, who declined our requests for interviews, were off the case. (NBC Dateline; spoken)
The Dow is up and so are gas prices. Will that hurt the economy and President Obama? (This Week; spoken)
Figure 19. Boxplot for Dimension 6, Politics and current affairs
Figure 20. Spread of factor scores for Dimension 6, Politics and current affairs
7. Dimension 7: Feelings and emotions
Factor 7 is unusual for two main reasons. First, unlike the previous, larger factors, factor 7 is small, containing only four variables related to the semantic field of feelings and emotions. Second, the collocations loaded on fiction only, which means that – although they did occur in all the registers – they did not occur as frequently as required by the cut-off point and, therefore, were dropped. As mentioned, the cut-off point was equivalent to the 300,000th most frequent word pair in each register; for academic, this was N = 95, for magazine, N = 87, for newspaper, N = 88, for spoken, N = 80, and for fiction, N = 73. The collocations loading on this factor had frequencies below this cut-off for all registers but fiction. For instance, |feel~v + shame-n| had a frequency of 59 in academic; therefore, it was excluded from the data for that register. In contrast, in fiction its frequency was 217; therefore, it was included in the pool for the register. The fact that a single register had loadings meant that the factor could not differentiate among the registers (the F-score did not reach statistical significance; all three charts are in agreement, see Figures 21, 22, and 23). In fact, only four collocates entered into collocation with the nodes – namely feel~v, face~n, voice~n, and eye~n. The resulting collocations were |feel~v + shame-n|, |feel~v + guilt-n|, |feel~v + rage-n|, |face~n + rage-n|, |feel~v + excitement-n|, |excitement-n + voice~n|, and |eye~n + excitement-n| (see Example 7).
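The effect of the per-register cut-off can be made concrete with a short sketch. The cut-off values and the two frequencies for |feel~v + shame-n| are those reported above; the function name is hypothetical, and treating a frequency equal to the cut-off as retained is an assumption.

```python
# Frequency of the 300,000th most frequent word pair in each register (from the text).
CUTOFF = {"academic": 95, "magazine": 87, "newspaper": 88, "spoken": 80, "fiction": 73}


def registers_retaining(pair_frequencies):
    """Return the registers in which a word pair meets the frequency cut-off.

    pair_frequencies: {register: frequency of the pair in that register}
    A frequency equal to the cut-off is assumed to be retained.
    """
    return [register for register, freq in pair_frequencies.items()
            if freq >= CUTOFF[register]]


# |feel~v + shame-n|: 59 in academic and 217 in fiction, as reported above.
print(registers_retaining({"academic": 59, "fiction": 217}))  # ['fiction']
```

A pair retained in only one register contributes zero values everywhere else, which is why a factor built from such pairs cannot separate the registers.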
      Figure 21. Mean scores for Dimension 7, Feelings and emotions. Fiction (Academic/Magazines/Newspapers/Spoken)*; F = 2.341, NS, R2 = .000
      * Registers in brackets had no loadings on this factor
      Example 7
      There is no shame in feeling such deep sorrow at your loss. (A Little Bit Sinful; fiction)
      ‘When she called me, there was excitement in her voice,’ Smith says. (Cosmopolitan; magazine)
      Bumble’s face turns red with rage. (Oliver Twist; fiction)
      Figure 22. Boxplot for Dimension 7, Feelings and emotions (x-axis: register)

      Figure 23. Spread of factor scores for Dimension 7, Feelings and emotions (axes: register, score)

    8. Dimension 8: Cooking
    Factor 8 is specific to the topic of cooking, with all of its 37 variables relating to various aspects of culinary recipes, such as the actions involved in preparing food (mix, add, stir, etc.), utensils and measurements (cup, bowl, tablespoon, etc.), ingredients (salt, pepper, vegetable, etc.), and their characteristics (fresh, all-purpose). Major collocations include |mix-v + bowl~n|, |mix-v + ingredient~n|, |cup-n + sugar~n|, |add-v + heat~n|, |add-v + onion~n|, |tablespoon-n + olive~j|, |cook~n + stir-v|, |stir-v + mixture~n|, |teaspoon~n + salt-n|, |salt-n + pepper~n|, |cup-v + flour~n|, |cup-v + all-purpose~j|, |combine-v + bowl~n|, and |combine-v + ingredient~n|, as illustrated in Example 8. The mean scores graph suggests a distinction between magazines on the positive pole and the remaining registers on the negative pole (Figure 24), which is corroborated by the other charts (Figures 25 and 26). However, as with the other dimensions, the coefficient of determination (R2) was low, again suggesting the low probability of correctly predicting the registers from the factor scores for this single dimension.

      Figure 24. Mean scores for Dimension 8, Cooking. F = 107.772, p < .0001, R2 = .018 (register labels recovered from the figure: Newspapers, Fiction, Academic/Spoken)

      Figure 25. Boxplot for Dimension 8, Cooking (x-axis: register)

      Figure 26. Spread of factor scores for Dimension 8, Cooking (axes: register, score)
      Example 8
      In a medium bowl, mix vinegar, sugar, and salt, stirring occasionally, until sugar dissolves. (Country Living; magazines)
      Stir together 3 cups sugar and 2 cups fresh lime juice in a large punch bowl. (Southern Living; magazines)
      2 Tbsp. olive oil; 1/8 tsp. ground red pepper (cayenne); salt and pepper; 1/3 c. plain nonfat yogurt. (Good Housekeeping; magazines)
    9. Dimension 9: Education research
    Like factor 7, this factor has a reduced number of variables (6), which on close inspection revealed texts generally related to the field of education research. Consequently, literary fiction texts did not incorporate any of these collocations (and are therefore missing from the graphs). This is quite apparent in the collocates, which include education-specific words from the semantic fields of teaching (teacher), students (student, child), schooling (high, school), assessment (high, low), and research (item, scale, respondent). Typical collocations include |student~n + benefit-v|, |benefit-v + program~n|, |expose-v + student~n|, |expose-v + child~n|, |educate-v + public~n|, |educate-v + child~n|, |item~n + rate-v|, |rate-v + scale~n|, |exhibit-v + behavior~n|, |exhibit-v + characteristic~n|, and |student~n + creativity-n|. The mean scores graph (Figure 27) shows a small preference for academic (on the positive pole), but this did not come through as clearly in the other charts (Figures 28 and 29). Again, this indicates that the dimension does not correlate with register distinctions, which is confirmed by the low R2, but with topical preferences that cut across the different registers (with the exception of fiction, as mentioned).
      Figure 27. Mean scores for Dimension 9, Education research. F = 10.536, p < .0001, R2 = .002

      Example 9
      Finally, the findings confirm that a more positive perception by students toward the environment benefits the students’ lives. (College Student; academic)
      Children who have been exposed to early-childhood education are entering school with a leg up. (Education Week; academic)
      Calculating the percentage of participants in each group that exhibited behavior within the normative range at post-treatment. (School Psychology; academic)

      Figure 28. Boxplot for Dimension 9, Education (x-axis: register)

      Figure 29. Spread of factor scores for Dimension 9, Education (axes: register, score)

      In sum, the nine dimensions of collocation were determined as follows:
      1. Literate discourse
      2. Oral discourse
      3. Objects, people, and actions
      4. Colloquial and informal language use
      5. Organizations and the government
      6. Politics and current affairs
      7. Feelings and emotions
      8. Cooking
      9. Education
      As mentioned, the R2 statistics revealed that predicting the register from the dimension scores for each individual dimension was unlikely, given the wide range of variation among the collocations. However, by combining the scores of the collocation on all of the dimensions, the task of predicting the register from the collocation becomes a more realistic goal; this is taken up in the next section.
  4. Assigning collocations to register categories based on their MD profile
    The goal of this part of the analysis was to verify the extent to which the individual collocations could be classified according to the registers in which they occurred. In other words, while the preceding section showed that collocations may not be strong predictors of register categories, it is possible that register categories are strong predictors of collocation. To this end, a discriminant function analysis (DFA) was employed, which used the factor scores of each collocate with each node on each dimension as input and produced discriminant equations that were used to place the collocation in its most likely register, based on its factor scores. DFAs have been used in the literature before to predict register categories from dimension scores. For instance, Berber Sardinha and Veirano Pinto (2016) used a DFA based on Veirano Pinto’s (2014) MD analysis of American cinema to determine whether major film genres (action, adventure, and comedy) predicted the language occurring in the movies. In this study, the idea is that individual collocations have a dimensional ‘fingerprint’ made up of the set of their factor scores and that this fingerprint is indicative of the register in which the collocation occurred. According to Cantos Gómez (2013: 104):

    A discriminant function analysis […] is concerned with the problem of assigning individuals, for whom several variables have been measured, to certain groups that have already been identified in the sample. It is used to determine those variables that discriminate between two or more naturally occurring groups.

    The individuals, in this case, were the collocates, and the groups, the register categories.
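The analysis was run through the discriminant option in SPSS 20 for Mac, which yielded the following equations, one for each register:

$$
\begin{aligned}
D_{\text{spok}} &= -1.954 + F_9 \times (-.002) + F_8 \times (-.001) + F_7 \times .007 + F_6 \times .005 + F_5 \times .001 + F_4 \times (-.004) + F_3 \times (-.002) + F_2 \times .004 + F_1 \times (-.001) \\
D_{\text{mag}} &= -2.011 + F_9 \times (-.001) + F_8 \times .017 + F_7 \times .004 + F_6 \times (-.006) + F_5 \times .001 + F_4 \times (-.003) + F_3 \times .000 + F_2 \times .004 + F_1 \times (-.001) \\
D_{\text{news}} &= -2.035 + F_9 \times .009 + F_8 \times .004 + F_7 \times .005 + F_6 \times (-.006) + F_5 \times .011 + F_4 \times (-.002) + F_3 \times (-.001) + F_2 \times .003 + F_1 \times (-.002) \\
D_{\text{acad}} &= -2.600 + F_9 \times (-.043) + F_8 \times (-.001) + F_7 \times .001 + F_6 \times (-.004) + F_5 \times .001 + F_4 \times .000 + F_3 \times (-.001) + F_2 \times (-.001) + F_1 \times .006 \\
D_{\text{fic}} &= -2.304 + F_9 \times .002 + F_8 \times (-.002) + F_7 \times (-.005) + F_6 \times (-.006) + F_5 \times (-.005) + F_4 \times (-.001) + F_3 \times .010 + F_2 \times .003 + F_1 \times (-.001)
\end{aligned}
$$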
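As a rough check on how these classification functions operate, the sketch below evaluates all five equations for one set of factor scores and picks the register with the greatest result. The coefficients are copied from the equations; the factor scores are those the chapter reports for the collocate ability in academic. The function and variable names are illustrative only and do not come from SPSS.

```python
# Classification function coefficients copied from the five equations.
# Each entry: (constant, [coefficients for F1, F2, ..., F9]).
COEFFS = {
    "spoken":    (-1.954, [-.001, .004, -.002, -.004, .001, .005, .007, -.001, -.002]),
    "magazine":  (-2.011, [-.001, .004, .000, -.003, .001, -.006, .004, .017, -.001]),
    "newspaper": (-2.035, [-.002, .003, -.001, -.002, .011, -.006, .005, .004, .009]),
    "academic":  (-2.600, [.006, -.001, -.001, .000, .001, -.004, .001, -.001, -.043]),
    "fiction":   (-2.304, [-.001, .003, .010, -.001, -.005, -.006, -.005, -.002, .002]),
}

# Factor scores F1..F9 reported in the chapter for "ability" in academic.
ABILITY_ACADEMIC = [785.34, 126.01, -8.30, -3.29, 6.95, 38.27, -.03, -2.22, -.11]


def discriminant_scores(factor_scores):
    """Evaluate each register's equation: constant + sum(coefficient * factor score)."""
    return {
        register: constant + sum(c * f for c, f in zip(coefs, factor_scores))
        for register, (constant, coefs) in COEFFS.items()
    }


def classify(factor_scores):
    """Assign the case to the register whose equation yields the greatest value."""
    scores = discriminant_scores(factor_scores)
    return max(scores, key=scores.get)


for register, score in discriminant_scores(ABILITY_ACADEMIC).items():
    print(f"{register:10s} {score:6.2f}")
print(classify(ABILITY_ACADEMIC))  # academic
```

Because the published coefficients are rounded to three decimals, the results agree with the values reported in the chapter (1.86 for academic, −2.01 for spoken, −2.54 for magazine, −3.38 for newspaper, −3.05 for fiction) only to within rounding.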
    The equations incorporate the classification function coefficients output by the DFA procedure, which include a constant (the first value in the equations) and a series of values (Fisher’s linear classification coefficients) multiplied by the actual factor score of each collocation. SPSS automatically entered the factor scores in the equations in place of the F1, F2, F3, etc., placeholders and assigned a collocate to the register whose equation yielded the greatest result. To illustrate, let us take the collocate ability, which occurred in five different registers in conjunction with different sets of nodes, with different levels of strength of association, as shown in Table 3 by the logDice scores for each collocation. Based on these data, ability received one score for each factor for each register, as shown in Table 4 (rows beginning with ‘F’). These factor scores were then entered in the discriminant function equations, which yielded one solution for each equation for each register (see Table 4, rows beginning with ‘D’). For example, for academic, the scores for ability on the various factors were F1 = 785.34, F2 = 126.01, F3 = −8.30, F4 = −3.29, F5 = 6.95, F6 = 38.27, F7 = −.03, F8 = −2.22, and F9 = −.11. Substituting these values into the equations, the following results were obtained: Dspok = −2.01, Dmag = −2.54, Dnews = −3.38, Dacad = 1.86, and Dfic = −3.05. For each register, the collocation was attributed to that register whose equation produced the greatest value. As Table 4 shows, when ability was evaluated for spoken based on its collocation profile for that register (given in Table 3), the DFA predicted that its collocation profile did indeed resemble that of spoken, as the equation result was highest for spoken (−1.88).
Similarly, with academic, the DFA correctly predicted that the collocations of ability were more likely to be found in academic writing (1.86), because the collocation profile of ability in academic is highly distinctive: Of the 120 collocations in academic, only nine (7.5%) were also found in spoken (affect_v, give_v, government_n, lose_v, make_v, people_n, take_v, think_v, use_v), 14 (12%) in magazines (affect_v, child_n, confidence_n, give_v, improve_v, level_n, limit_v, lose_v, make_v, new_j, people_n, see_v, take_v, use_v), 10 (8%) in newspapers (athletic_j, give_v, government_n, limit_v, lose_v, make_v, people_n, show_v, take_v, use_v), and one (.8%) in fiction (lose_v). However, when the collocations of ability in magazines were evaluated, the DFA wrongly predicted they were more likely to be found in spoken, given the greatest equation result for spoken (−1.92). This result was perhaps caused by the fact that spoken and magazine did have eight collocations of ability in common (namely affect_v, get_v, give_v, lose_v, make_v, people_n, take_v, and use_v). The same misclassification occurred with news and fiction: nine of the 14 collocations of ability in news were also found in spoken (get_v, give_v, government_n, lose_v, make_v, pay_v, people_n, take_v, use_v), and the single one found in fiction was also present in spoken (lose_v).

    Table 3. Collocations of ability

    | Register | Collocations of ability (logDice) |
    |---|---|
    | Spoken | affect_v (8.03), best_j (7.65), get_v (5.09), give_v (6.37), go_v (3.85), government_n (5.79), lose_v (7.27), make_v (5.18), pay_v (6.42), people_n (4.98), president_n (5.07), take_v (4.65), think_v (4.78), use_v (5.85). |
    | Magazine | affect_v (7.83), body_n (7.24), child_n (5.93), confidence_n (8.26), get_v (5.41), give_v (6.84), improve_v (7.96), level_n (7.43), limit_v (8.06), lose_v (8.09), make_v (6.02), new_j (5.14), people_n (5.94), see_v (5.86), take_v (5.59), use_v (5.96). |
    | Newspaper | athletic_j (8.64), company_n (5.70), get_v (5.78), give_v (6.71), government_n (6.48), limit_v (8.53), lose_v (7.27), make_v (5.99), pay_v (6.42), people_n (5.56), play_v (6.03), show_v (6.82), take_v (5.51), use_v (5.93). |
    | Academic | academic_j (7.95), achieve_v (6.73), achievement_n (8.06), act_v (6.81), affect_v (7.96), apply_v (6.78), assess_v (7.99), athletic_j (8.04), average_j (7.32), base_v (6.56), belief_n (7.20), child_n (7.70), cognitive_j (8.92), communicate_v (8.11), confidence_n (8.40), control_v (7.63), cope_v (7.78), create_v (6.81), critical_j (6.73), define_v (6.75), demonstrate_v (8.42), depend_v (7.36), determine_v (6.46), develop_v (8.15), development_n (5.85), difference_n (6.90), different_j (6.35), enhance_v (8.44), evaluate_v (7.38), factor_n (6.07), find_v (5.80), general_j (7.09), give_v (6.48), government_n (6.37), great_j (6.49), group_n (6.20), help_v (6.05), high_j (7.69), human_j (6.33), identify_v (7.44), improve_v (8.17), include_v (6.45), increase_v (7.23), individual_j (7.45), individual_n (7.52), influence_v (7.67), information_n (6.56), intellectual_j (7.95), interest_n (6.87), knowledge_n (7.32), lack_v (7.22), language_n (7.19), leadership_n (6.81), learn_v (7.46), learning_n (6.32), level_n (7.89), limit_v (8.56), lose_v (7.11), low_j (7.31), maintain_v (7.64), make_v (6.89), manage_v (7.28), measure_n (6.90), measure_v (7.83), meet_v (6.42), mental_j (7.04), natural_j (6.89), need_n (6.29), new_j (5.74), other_j (5.60), others_n (6.71), participant_n (6.49), people_n (6.16), perceive_v (7.64), perceived_j (8.53), perception_n (8.06), perform_v (7.91), performance_n (6.85), person_n (7.05), personal_j (6.47), physical_j (7.24), power_n (6.24), predict_v (7.14), problem_n (6.07), produce_v (6.58), provide_v (6.86), read_v (8.01), reading_n (7.42), recognize_v (7.47), reduce_v (6.95), reflect_v (6.92), relate_v (6.86), relationship_n (6.22), require_v (6.61), respond_v (7.08), school_n (5.69), see_v (6.10), show_v (6.55), skill_n (8.15), social_j (5.94), spatial_j (7.61), speak_v (6.73), specific_j (6.53), state_n (6.69), student_n (8.18), system_n (5.63), take_v (5.87), task_n (6.76), teach_v (6.80), teacher_n (6.73), test_n (7.47), test_v (7.30), think_v (7.11), thinking_n (7.08), understand_v (7.22), use_v (7.07), verbal_j (7.58), woman_n (5.85), work_v (6.82), write_v (6.63). |
    | Fiction | lose_v (7.34). |

    Table 4. Factor scores and discriminant function results for the collocations of ability

    In order to classify the collocations, the ‘leave one out’ model was used in the DFA, whereby for each collocate, a new set of equation parameters was obtained by excluding the values for that particular collocate so as not to skew the classification in its favor. As a result, instead of a single set of five equations, as illustrated, 23,602 different equation sets were obtained, one for each new case in the data. The classification results reported below are for this cross-validated option. In addition, because the data sets for the registers were of different sizes, the same number of collocates was selected for each register so as not to skew the results in favor of the registers with larger collocation sets, as unequal sample sizes would have invited more cases to be assigned to the larger samples simply because of the better odds of making a correct prediction. Four samples of different sizes were used – namely, 500, 1,000, 2,000, and 4,000 collocates per register; these consisted of the collocates with the greatest scores per register. However, because each collocate was paired with each one of the 310 nodes (regardless of whether they formed collocations or not, each word pair was considered), the total number of word combinations covered by the samples was actually much larger than the size of the samples indicates. For instance, for the 1,000 sample, 1,550,000 (i.e., 1,000 collocates × 5 registers × 310 nodes) node–collocate pairs were involved. As there were five registers, the chance classification baseline was 20% (i.e., 100/5); as Table 5 shows, all sample sizes achieved prediction results that were better than the baseline. The best results were obtained with the 500-word sample, where the majority of collocations (56.7%) were assigned to their current registers, a rate nearly three times better than chance. The table also shows that prediction accuracy increased as the sample size decreased; the loss of accuracy in the larger samples can be attributed to the ‘noise’ created by the inclusion of individual collocations that appeared in different registers.
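The overlaps with spoken that the discussion of ability relies on can be recomputed directly from the collocate lists in Table 3. The sketch below is only a convenience check: the sets are transcribed from Table 3 (the long academic list is left out for brevity), and the helper name is illustrative.

```python
# Collocates of "ability" per register, transcribed from Table 3
# (academic omitted here for brevity).
TABLE3 = {
    "spoken": {"affect_v", "best_j", "get_v", "give_v", "go_v", "government_n",
               "lose_v", "make_v", "pay_v", "people_n", "president_n", "take_v",
               "think_v", "use_v"},
    "magazine": {"affect_v", "body_n", "child_n", "confidence_n", "get_v", "give_v",
                 "improve_v", "level_n", "limit_v", "lose_v", "make_v", "new_j",
                 "people_n", "see_v", "take_v", "use_v"},
    "newspaper": {"athletic_j", "company_n", "get_v", "give_v", "government_n",
                  "limit_v", "lose_v", "make_v", "pay_v", "people_n", "play_v",
                  "show_v", "take_v", "use_v"},
    "fiction": {"lose_v"},
}


def shared_with(register, other="spoken"):
    """Collocates of 'ability' that two registers have in common, sorted."""
    return sorted(TABLE3[register] & TABLE3[other])


for register in ("magazine", "newspaper", "fiction"):
    common = shared_with(register)
    print(f"{register}: {len(common)} of {len(TABLE3[register])} shared with spoken")
```

On the Table 3 data this gives 8 of 16 for magazine, 9 of 14 for newspaper, and 1 of 1 for fiction, which fits the observation that these three profiles were pulled toward spoken.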
    Table 5. Cross-validated classification results

    | Register sample | Total sample | Word combinations | Correct prediction |
    |---|---|---|---|
    | 500 | 2,500 | 775,000 | 56.7% |
    | 1,000 | 5,000 | 1,550,000 | 48.9% |
    | 2,000 | 10,000 | 3,100,000 | 41.1% |
    | 4,000 | 20,000 | 6,200,000 | 34.7% |
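The arithmetic behind Table 5 is simple enough to reproduce. The sketch below recomputes the total sample, the number of word combinations, and the ratio of each accuracy figure to the chance baseline; the constants come from the text (5 registers, 310 nodes), while the names are illustrative.

```python
N_REGISTERS = 5
N_NODES = 310
CHANCE = 100 / N_REGISTERS  # chance baseline in percent (20%)

# Cross-validated accuracy (percent) per register sample size, from Table 5.
ACCURACY = {500: 56.7, 1000: 48.9, 2000: 41.1, 4000: 34.7}


def total_sample(per_register):
    """Number of collocates across all registers."""
    return per_register * N_REGISTERS


def word_combinations(per_register):
    """Every sampled collocate is paired with every node in every register."""
    return per_register * N_REGISTERS * N_NODES


for size, accuracy in ACCURACY.items():
    ratio = accuracy / CHANCE
    print(f"{size:>5} {total_sample(size):>6} {word_combinations(size):>9} "
          f"{accuracy:5.1f}% ({ratio:.2f} x chance)")
```

Even the largest sample, at 34.7%, stays well above the 20% baseline, while the smallest sample approaches three times that baseline.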
    The prediction results are broken down by register in the confusion matrix in Table 6. The best predicted register was academic, with 67%, followed by spoken (63%), newspaper and fiction (both at 57%), and magazine (39%). This difference in prediction reflects the level of specificity of the collocations in each register: Correctly classified collocations had a unique set of collocates that set them apart from other registers, and in general the collocations had higher logDice stats in the current register. For instance, the collocate study~n, which was correctly classified as academic, entered in collocation with nodes such as present-j (logDice = 10.8), result-n (10.1), examine-n (10.1), and conduct-v (10) – all of which immediately bring to mind academic discourse. In contrast, in fiction, study~n produced different collocations, with lower logDice stats, than in academic: hall-n (logDice = 7.4), father-n (6.3), door-n (5.6), night-n (5.6), and year-n (5.2). With spoken, the collocate people~n formed strong collocations with lot-n (logDice = 10.8), american-j (10.4), say-v (9.9), other-j (9.9), and know-v (9.9), which are quite typical of radio and television dialog. In contrast, in academic, people~n created collocations with indigenous-j (logDice = 9.5), live-v (9.4), native-j (9), old-j (8.9), and number-n (8.6); again, not only was the set of node words different, but the logDice stats were lower than in spoken. In news, a frequent collocate was take~n, which collocated characteristically with place-n (logDice = 10.4), care-n (10.2), step-n (9.9), advantage-n (9.8), time-n (9.3), and action-n (9.3), whereas in fiction, it collocated with care-n (logDice = 11.1), breath-n (10.4), step-n (10.1), deep-j (10.1), place-n (9.7), and hand-n (9.7). Notice how, in both registers, take~n collocated with place-n, but it did so more robustly in news. In magazines, a typical collocate was add~n, which entered in collocation with heat-n (logDice = 10.5), onion-n (10.3), mixture-n (9.8), garlic-n (9.7), and salt-n (9.6), whereas in academic, it did so with emphasis-n (logDice = 10.2), additional-j (8.5), dimension-n (8.5), value-n (8.2), and layer-n (8.0). Finally, in fiction, a characteristic collocate was hand~n, whose customary collocates included hold-v (logDice = 10.9), put-v (10.6), shake-v (10.5), raise-v (9.8), take-v (9.7), and wave-v (9.1), whereas in spoken, they included other-j (logDice = 10.3), raise-v (8.8), left-j (8.5), tie-v (8.3), count-n (8.3), and right-j (8.2). Again, although both registers had a mutual collocate (raise-v), hand~n had a stronger attraction to it in fiction than in spoken.
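For readers unfamiliar with the logDice statistic compared throughout this discussion, the sketch below computes it from raw frequencies, following the definition in Rychlý (2008): logDice = 14 + log2(2·f_xy / (f_x + f_y)). The frequencies in the usage lines are hypothetical and serve only to show how the score behaves; they are not taken from COCA.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score (Rychlý 2008).

    f_xy: co-occurrence frequency of node and collocate
    f_x, f_y: individual frequencies of node and collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: the two words always occur together (theoretical maximum) ...
print(log_dice(100, 100, 100))  # 14.0
# ... and the same pair co-occurring half as often: one point lower.
print(log_dice(50, 100, 100))   # 13.0
```

Because the score is logarithmic, a difference of one point corresponds to a pair co-occurring twice as often relative to the words' own frequencies, which gives a sense of scale for gaps such as 10.4 versus 9.7.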
    Table 6. Confusion matrix for discriminant function analysis

    | Current register | Predicted: Spoken | Predicted: Magazine | Predicted: Newspaper | Predicted: Academic | Predicted: Fiction | Total |
    |---|---|---|---|---|---|---|
    | Spoken | 63.4% (317)* | 7.0% (35) | 28.4% (142) | .0% (0) | 1.2% (6) | 100.0 (500) |
    | Magazine | 27.4% (137) | 39.4% (197) | 26.0% (130) | .6% (3) | 6.6% (33) | 100.0 (500) |
    | Newspaper | 28.4% (142) | 13.2% (66) | 57.0% (285) | .0% (0) | 1.4% (7) | 100.0 (500) |
    | Academic | 2.4% (12) | 13.8% (69) | 17.2% (86) | 66.6% (333) | .0% (0) | 100.0 (500) |
    | Fiction | 22.2% (111) | 19.6% (98) | 1.2% (6) | .0% (0) | 57.0% (285) | 100.0 (500) |
    | Total | 28.8% (719) | 18.6% (465) | 26% (649) | 13.4% (336) | 13.2% (331) | 100.0 (2500) |

    * Numbers in brackets represent the count of instances

    The confusion matrix also shows how each register was cross-classified. Spoken was mostly cross-classified as newspaper (28%), an example of which is american~n: Six collocates out of the top 10 in both registers matched – namely, african-j (logDice spoken 8.7 versus newspaper 10.5), average-j (8.6 versus 7.8), believe-v (8.1 versus 8.1), kill-v (8.5 versus 7.9), majority-n (8.8 versus 8.6), and percent-n (9.0 versus 8.0). Magazine was mostly cross-classified as spoken (27%), an example being want~n: Of the top 10 collocates in each register, seven were mutual – namely, get-v (logDice magazine 9.2 versus spoken 9.6), go-v (9.2 versus 9.3), hear-v (8.9 versus 9.4), know-v (10.3 versus 10.3), make-v (8.7 versus 9.5), people-n (9.4 versus 9.7), and see-v (9.3 versus 9.7). Newspaper was also mostly cross-classified as spoken (28%). For the word people~n, eight of the top collocations were similar in both registers: get-v (logDice news 9.5 versus spoken 9.6), know-v (9.6 versus 9.8), lot-n (10.1 versus 10.8), other-j (9.3 versus 9.9), say-v (9.3 versus 9.9), see-v (9.4 versus 9.3), think-v (10.0 versus 10.5), and want-v (9.8 versus 9.7). Academic was mostly cross-classified as newspaper: Eight of the top 10 collocations of government~n in both registers were similar – namely, agency-n (logDice academic 9.2 versus newspaper 9.2), central-j (9.0 versus 8.1), chinese-j (8.5 versus 8.2), federal-j (10.8 versus 11.0), local-j (9.5 versus 9.9), official-n (9.7 versus 9.8), and state-n (8.9 versus 8.8). Finally, fiction was often mistakenly classified as spoken (22%). An example is life~n, which shared six of the top 10 collocates in each register – namely, change-v (logDice fiction 8.3 versus spoken 9.1), live-v (9.5 versus 9.5), real-j (8.5 versus 8.7), rest-n (9.4 versus 9.3), save-v (9.5 versus 9.9), and whole-j (9.2 versus 8.7).
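The summary figures quoted from Table 6 can be recovered from its raw counts. The sketch below recomputes the overall accuracy, the per-register accuracy, and each register's most frequent misclassification; the counts are those given in brackets in Table 6, with the column order (spoken, magazine, newspaper, academic, fiction) inferred from the position of the correctly classified cases. The function names are illustrative.

```python
REGISTERS = ["spoken", "magazine", "newspaper", "academic", "fiction"]

# Rows: current register; columns: predicted register (counts from Table 6).
COUNTS = [
    [317, 35, 142, 0, 6],     # spoken
    [137, 197, 130, 3, 33],   # magazine
    [142, 66, 285, 0, 7],     # newspaper
    [12, 69, 86, 333, 0],     # academic
    [111, 98, 6, 0, 285],     # fiction
]


def overall_accuracy(counts):
    """Share of all cases that fall on the diagonal."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    return correct / sum(sum(row) for row in counts)


def accuracy_by_register(counts):
    """Share of each current register's cases that were predicted correctly."""
    return {REGISTERS[i]: row[i] / sum(row) for i, row in enumerate(counts)}


def main_confusion(counts):
    """For each current register, the wrong register it was most often assigned to."""
    result = {}
    for i, row in enumerate(counts):
        wrong = [(n, REGISTERS[j]) for j, n in enumerate(row) if j != i]
        result[REGISTERS[i]] = max(wrong)[1]
    return result


print(f"{overall_accuracy(COUNTS):.1%}")  # 56.7%
print(main_confusion(COUNTS))
```

The overall figure matches the 56.7% reported for the 500-collocate sample in Table 5, and the most frequent confusions match those discussed in the text: newspaper for spoken and academic, and spoken for magazine, newspaper, and fiction.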
  5. Conclusion

This chapter presented a large-scale study of cross-register variation among American English collocations. Nine dimensions were defined, each representing a distinct semantic parameter underlying the use of collocations in text. Just as texts live in a multidimensional space, as evidenced by the continued work in the mainstream MD tradition, so do collocations, as this chapter has shown. However, these two levels do not seem mutually reducible, as there was little overlap between the English textual dimensions (Biber 1988) and the English collocation dimensions. In general, the collocation dimensions reflect the corpus upon which they are based; therefore, to be directly comparable, they would have to be drawn from the same corpus as the text dimensions or vice versa. This issue is being examined in ongoing research.

Taking account of register differences in studies of language use in general and in studies of lexical choice in particular is not a trivial matter. As Biber (2012) has argued, most corpus-based studies of linguistic patterning fail to recognize the effect of register on language use. Accounting for register is therefore critical in studies of language use, especially in a corpus as large and varied as COCA. However, as Biber (2012) noted, ‘it is still the norm in most studies of collocation and lexico-grammatical associations to disregard the possible influence of register differences’ (p. 34). He recommended ‘begin[ning] a research study with the hypothesis that such register differences exist’ (p. 34), which was the course taken here. The register differences associated with collocation use shown by this study provide another ‘nail in the coffin’ of attempts to describe ‘general English’ or any other language as if language were a homogeneous whole. Although we still see researchers working to characterize a language from that perspective, it is seldom realistic, given the mounting evidence of the effect of register on language use at different linguistic levels (e.g., see Herrmann & Berber Sardinha 2015, on the effect of register on metaphor use). By starting with register and describing how register differences constrain choice, our descriptions can reflect more naturally ‘the perspective of a conversational participant or a normal reader of a text’ (Biber 2012: 33) in terms of their actual experience as language users, which ultimately lends more validity to the research undertaken.

The lexical dimensions uncovered in this project reveal the kind of patterning that Sinclair began to explore in his groundbreaking work on collocation more than 40 years ago. By observing the collocates of individual nodes, Sinclair and Jones (1974/1996) noticed how the frequent collocates of different words formed sets around concepts. They called the ‘groups of words with a tendency to occur in the same environment’ (p. 44) lexical sets. For instance, they observed that several words shared collocates related to the concept of time:

ago is a significant collocate of two, time, and years; spend of time, and year; and after of year and day. Many collocates significantly with years and time, and long with time, while and ago. Time and year collocate significantly with each other, and day with night, hour and hours. (Sinclair & Jones 1974/1996: 45).

The current study took a principled approach to the identification of lexical sets through the MD approach, and the resulting dimensions reflected the concept-clustering property of lexis that Sinclair and Jones noticed. Furthermore, they predicted that ‘eventually it might be expected that most open class items could be arranged into lexical sets, using a standard clustering technique’ (p. 45), a prediction confirmed by this study. However, this study went further in that it detected the most distinctive sets in American English and showed that such lexical sets as embodied in the dimensions are influenced by register.

In conclusion, lexical priming theory claims that individuals should be able to ‘subconsciously identify the genre, style, or social situation [in which a word combination] is characteristically used’ (Hoey 2013: 3344). Substituting register and collocation in this statement, the claim would be that registers have such characteristic collocations. The findings of the present study provide evidence of both the predictability and extent of such register-characteristic collocations, with academic having the most predictable collocations, followed by radio and television material (i.e., spoken), fiction, newspaper, and magazine. As the current study demonstrated, a statistical association exists between collocation and register in text; therefore, it seems plausible to expect that speakers do in fact store in their minds some information about the register preferences of collocations, as predicted by lexical priming theory. Theoretically, this study provides evidence of the convergence of major contemporary theories in corpus linguistics – namely, that collocation is a fundamental principle of language use (Sinclair), that register is a major underlying parameter of lexico-grammatical variation (Biber), and that language users are primed for both collocation and register variation (Hoey).


I want to thank both CNPq (Brasília, DF; grants #477586/2013-9; 303710/2013-6; 471052/2010-8) and Fapesp (São Paulo, SP; grant #2010/18736-5), whose support enabled the research presented in this chapter. I’m grateful to Jesse Egbert and the editors for their insightful comments on an earlier version of the chapter.


Berber Sardinha, T. 1997. Automatic Identification of Segments in Written Texts. PhD dissertation, University of Liverpool.

Berber Sardinha, T., Kauffmann, C. & Acunzo, C.M. 2014. A multidimensional analysis of register variation in Brazilian Portuguese. Corpora 9(2): 239–271. doi: 10.3366/cor.2014.0059

Berber Sardinha, T., Mayer Acunzo, C. & São Bento Ferreira, T. In press. Dimensions of collocation in Brazilian Portuguese: Exploring the Brazilian Corpus on Sketch Engine. In Essays in Lexical Semantics in Honor of Adam Kilgarriff, M. Diab & A. Villavicencio (eds). Berlin: Springer.

Berber Sardinha, T., São Bento Ferreira, T. & Teixeira, R. d. B.S. 2014. Lexical bundles in Brazilian Portuguese. In Working with Portuguese Corpora, T. Berber Sardinha & T. São Bento Ferreira (eds), 33–68. London: Bloomsbury.

Berber Sardinha, T. & Veirano Pinto, M. (eds). 2014. Multi-dimensional Analysis, 25 Years on: A Tribute to Douglas Biber [Studies in Corpus Linguistics 60]. Amsterdam: John Benjamins. doi: 10.1075/scl.60

Berber Sardinha, T. & Veirano Pinto, M. 2016. Predicting American movie genre categories from linguistic characteristics. Journal of Research Design and Statistics in Linguistics and Communication Science 2(1): 75–102. doi: 10.1558/jrds.v2i1.27515

Berber Sardinha, T. & Veirano Pinto, M. In press. American television and off‑screen registers: A corpus‑based comparison. Corpora.

Biber, D. 1988. Variation across Speech and Writing. Cambridge: CUP. doi: 10.1017/CBO9780511621024

Biber, D. 2010. What can a corpus tell us about registers and genres? In The Routledge Handbook of Corpus Linguistics, A. O’Keeffe & M. McCarthy (eds), 241–254. London: Routledge. doi: 10.4324/9780203856949.ch18

Biber, D. 2012. Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory 8(1): 9–37. doi: 10.1515/cllt-2012-0002

Biber, D. & Conrad, S. 1999. Lexical bundles in conversation and academic prose. In Out of Corpora – Studies in Honour of Stig Johansson, H. Hasselgard & S. Oksefjell (eds), 181–190. Amsterdam: Rodopi.

Biber, D., Davies, M., Jones, J.K. & Tracy‑Ventura, N. 2006. Spoken and written register variation in Spanish: A multi‑dimensional analysis. Corpora 1(1): 1–37. doi: 10.3366/cor.2006.1.1.1

Biber, D. & Gray, B. 2013. Lexical frames in academic prose and conversation. International Journal of Corpus Linguistics 18: 109–135. doi: 10.1075/ijcl.18.1.08gra

Cantos Gómez, P. 2013. Statistical Methods in Language and Linguistic Research. Sheffield: Equinox.

Doise, W., Clemence, A. & Lorenzi-Cioldi, F. 1993. The Quantitative Analysis of Social Representations. Hemel Hempstead: Harvester Wheatsheaf.

Egbert, J. 2012. Style in nineteenth century fiction: A multi‑dimensional analysis. Scientific Study of Literature 2(2): 167–198. doi: 10.1075/ssol.2.2.01egb

Herrmann, J.B. & Berber Sardinha, T. (eds). 2015. Metaphor in Specialist Discourse [Metaphor in Language, Cognition, and Communication 4]. Amsterdam: John Benjamins. doi: 10.1075/milcc.4

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630

Hoey, M. 2013. Lexical priming. In The Encyclopedia of Applied Linguistics, C. Chapelle (ed.), 3342–3347. Hoboken NJ: Wiley.

Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP. doi: 10.1017/CBO9781139524773

Lehrer, A. 1974. Semantic Fields and Lexical Structure. Amsterdam: North‑Holland.

McEnery, T., Xiao, R. & Tono, Y. 2006. Corpus-based Language Studies: An Advanced Resource Book. London: Routledge.

Menon, S. & Mukundan, J. 2012. Collocations of high frequency noun keywords in prescribed science textbooks. International Education Studies 5(6): 149–162. doi: 10.5539/ies.v5n6p149

Pace‑Sigge, M. 2015. The Function and Use of TO and OF in Multi-word Units. Houndmills: Palgrave Macmillan. doi: 10.1057/9781137470317

Phillips, M. 1989. Lexical Structure of Text. Birmingham: ELR, University of Birmingham.

Rychlý, P. 2008. A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008, P. Sojka & A. Horák (eds), 6–9. Brno: Masaryk University.

Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1): 97–123.

Scott, M. 2000. Focusing on the text and its key words. In Rethinking Language Pedagogy from a Corpus Perspective, Vol. 2, L. Burnard & A. McEnery (eds), 103–122. Frankfurt: Peter Lang.

Sinclair, J.M. & Jones, S. 1974/1996. English lexical collocations: A study in computational linguistics. In J.M. Sinclair on Lexis and Lexicography, J.A. Foley (ed.), 22–68. Singapore: UniPress.

Sinclair, J.M., Jones, S. & Daley, R. 1970/2004. English Lexical Studies: The OSTI Report,. Ramesh Krishnamurthy (ed.). London: Continuum.

Stubbs, M. 2007. Quantitative data on multi‑word sequences in English: the case of the word ‘world’. In Text, Discourse and Corpora, M. Hoey, M. Mahlberg, M. Stubbs,W. Teubert & (eds), 163–190. London: Continuum.

Trier, J. 1931. Der deutsche Wortschatz im Sinnbezirk des Verstandes; die Geschichte eines sprachlichen Feldes. Heidelberg: C. Winter.

Veirano Pinto, M. 2014. Dimensions of variation in North American movies. In Berber Sardinha & Veirano Pinto (eds), 109–149.

Yablo, S. 2016. Aboutness. Princeton NJ: Princeton University Press.


Table A1: Factor loadings

No more than 50 words are shown per factor, for reasons of space. Values in brackets are the loadings of the node word in the factor. The words are sorted by loading in decreasing order in each factor. The letter following the dash (‑) after each word designates part of speech (n: noun; v: verb; j: adjective).

Factor 1

relate‑v (.63), appropriate‑j (.60), identify‑v (.60), specific‑j (.59), individual‑j (.58), assessment‑n (.58), examine‑v (.57), learning‑n (.57), assess‑v (.57), participant‑n (.57), knowledge‑n (.56), suggest‑v (.55), approach‑n (.55), method‑n (.55), demonstrate‑v (.55), individual‑n (.55), evaluate‑v (.54), behavior‑n (.54), determine‑v (.54), indicate‑v (.54), intervention‑n (.54), performance‑n (.53), type‑n (.53), classroom‑n (.53), analysis‑n (.53), develop‑v (.53), model‑n (.53), educational‑j (.53), various‑j (.52), significant‑j (.52), result‑n (.52), activity‑n (.52), practice‑n (.52), describe‑v (.52), teaching‑n (.52), use‑n (.52), measure‑n (.51), involve‑v (.51), focus‑v (.51), teacher‑n (.51), study‑n (.51), define‑v (.51), effective‑j (.51), base‑v (.51), similar‑j (.50), evaluation‑n (.50), skill‑n (.50), subject‑n (.50), social‑j (.50), academic‑j (.50).

Factor 2

know‑v (.70), want‑v (.69), think‑v (.67), kid‑n (.66), thing‑n (.65), like‑v (.64), time‑n (.63), start‑v (.63), go‑v (.63), way‑n (.63), come‑v (.63), good‑j (.62), see‑v (.62), man‑n (.62), day‑n (.61), people‑n (.61), try‑v (.61), old‑j (.61), talk‑v (.61), work‑v (.61), tell‑v (.61), mean‑v (.60), woman‑n (.60), place‑n (.60), look‑v (.60), find‑v (.59), year‑n (.59), lot‑n (.59), life‑n (.59), get‑v (.59), friend‑n (.59), guy‑n (.58), family‑n (.58), young‑j (.58), let‑v (.57), child‑n (.57), live‑v (.57), take‑v (.56), call‑v (.56), ask‑v (.56), other‑j (.55), leave‑v (.55), happen‑v (.55), big‑j (.55), love‑v (.55), father‑n (.54), keep‑v (.54), mother‑n (.54), story‑n (.54), feel‑v (.54).

Factor 3

stare‑v (.57), slide‑v (.53), pull‑v (.52), dark‑j (.49), hang‑v (.48), lean‑v (.48), thin‑j (.47), pale‑j (.47), slip‑v (.47), shake‑v (.46), window‑n (.46), shoulder‑n (.45), back‑n (.45), head‑n (.45), touch‑v (.44), stand‑v (.44), gray‑j (.44), face‑n (.44), hand‑n (.44), tiny‑j (.44), grab‑v (.43), smile‑v (.43), arm‑n (.43), glass‑n (.43), push‑v (.42), rub‑v (.42), bed‑n (.42), eye‑n (.42), finger‑n (.42), mouth‑n (.42), walk‑v (.41), leg‑n (.41), thick‑j (.41), twist‑v (.41), hair‑n (.40), swing‑v (.40), blue‑j (.40), peer‑v (.40), door‑n (.40), stick‑v (.40), close‑v (.40), press‑v (.39), wall‑n (.39), lift‑v (.39), neck‑n (.39), floor‑n (.39), pink‑j (.39), foot‑n (.39), glance‑v (.39), chair‑n (.38).

Factor 4

afraid‑j (.60), mama‑n (.60), daddy‑n (.57), glad‑j (.57), stupid‑j (.55), grandmother‑n (.55), captain‑n (.52), miss‑n (.52), supposed‑j (.51), wonder‑v (.51), aunt‑n (.50), okay‑j (.49), dare‑v (.48), sorry‑j (.48), strange‑j (.48), wish‑v (.48), hell‑n (.48), angry‑j (.47), lucky‑j (.47), sick‑j (.47), dad‑n (.47), cop‑n (.47), honey‑n (.46), tired‑j (.46), uncle‑n (.45), while‑n (.45), guess‑v (.45), hate‑v (.44), hungry‑j (.44), funny‑j (.44), shit‑n (.43), stranger‑n (.43), careful‑j (.43), silly‑j (.42), forget‑v (.42), cry‑v (.42), imagine‑v (.42), laugh‑v (.41), startle‑v (.41), dragon‑n (.41), wake‑v (.40), mind‑v (.40), terrible‑j (.40), scared‑j (.40), mom‑n (.40), cousin‑n (.40), grin‑v (.40), mrs‑n (.40), crazy‑j (.40), surprised‑j (.39).

Factor 5

agency‑n (.47), official‑n (.46), international‑j (.44), national‑j (.44), district‑n (.43), government‑n (.42), federal‑j (.42), security‑n (.42), state‑n (.42), local‑j (.41), department‑n (.41), service‑n (.41), public‑j (.41), nation‑n (.40), center‑n (.40), office‑n (.40), employee‑n (.39), state‑v (.39), authority‑n (.39), leader‑n (.39), law‑n (.39), administration‑n (.39), director‑n (.38), policy‑n (.38), organization‑n (.38), support‑v (.38), bank‑n (.37), resident‑n (.36), foreign‑j (.36), representative‑n (.36), military‑j (.36), congress‑n (.36), citizen‑n (.36), member‑n (.36), county‑n (.36), aid‑n (.35), university‑n (.35), commission‑n (.35), plan‑v (.35), health‑n (.35), major‑j (.35), council‑n (.35), plan‑n (.35), industry‑n (.34), south‑n (.34), operation‑n (.34), association‑n (.34), economy‑n (.34), union‑n (.34), european‑j (.33).

Factor 6

politician‑n (.47), interview‑v (.46), blame‑v (.44), deserve‑v (.44), matter‑v (.43), disagree‑v (.43), react‑v (.42), interesting‑j (.42), hurt‑v (.41), worry‑v (.41), respect‑v (.40), attack‑v (.40), interested‑j (.39), ok‑j (.39), like‑j (.39), deal‑v (.39), sue‑v (.39), concerned‑j (.38), trust‑v (.38), act‑v (.38), sort‑n (.38), me‑n (.38), exciting‑j (.38), extraordinary‑j (.38), question‑v (.37), vote‑v (.37), juror‑n (.37), ms‑n (.37), criticize‑v (.36), democrat‑n (.36), suspect‑v (.36), wonderful‑j (.35), figure‑v (.35), well‑n (.35), agree‑v (.35), senator‑n (.34), tremendous‑j (.34), honest‑j (.34), dangerous‑j (.34), cooperate‑v (.33), amazing‑j (.33), tough‑j (.33), admit‑v (.33), fear‑v (.32), warn‑v (.32), convince‑v (.32), awful‑j (.32), appeal‑v (.32), respond‑v (.32), fun‑j (.31).

Factor 7

shame‑n (.35), guilt‑n (.35), rage‑n (.33), excitement‑n (.32).

Factor 8

mix‑v (.43), cup‑n (.42), add‑v (.41), tablespoon‑n (.41), stir‑v (.41), salt‑n (.41), cup‑v (.40), combine‑v (.39), pepper‑n (.39), vegetable‑n (.39), fresh‑j (.39), teaspoon‑n (.37), butter‑n (.37), sugar‑n (.37), onion‑n (.37), sauce‑n (.36), oil‑n (.36), garlic‑n (.35), mixture‑n (.35), medium‑j (.35), tomato‑n (.34), hot‑j (.34), lemon‑n (.33), juice‑n (.33), egg‑n (.33), milk‑n (.33), olive‑j (.32), chop‑v (.32), heat‑n (.31), bowl‑n (.31), chopped‑j (.31), sprinkle‑v (.31), cook‑v (.31), green‑j (.31), rice‑n (.30), flour‑n (.30).

Factor 9

benefit‑v (.41), expose‑v (.39), educate‑v (.38), rate‑v (.34), exhibit‑v (.34), creativity‑n (.30).

Colligational effects of collocation

Lexically‑conditioned dependencies between modification patterns of the noun cause

Pascual Cantos & Moisés Almela

University of Murcia

Previous research into lexical constellations has uncovered the existence of dependency relations among different collocations of a word (Cantos & Sánchez 2001; Almela 2011; Almela et al. 2011; Almela 2014). Such dependencies are obtained when the strength of the attraction between a node and one or more of its collocates is contingent on their co‑occurrence with a third element (a co‑collocate). For instance, the probability that face (verb) collocates with decision is increased by the presence of modifiers of a specific semantic type (e.g., hard, difficult, tough) but weakened by the presence of other types of modifiers such as wise, informed, rational, etc. Implications of this phenomenon for the analysis of word meaning and for the notion of ‘collocation’ have been examined in previous studies. With this chapter we attempt to explore the possibility of extending the notion of co‑collocation – and the methodology associated with it – to the analysis of some aspects of colligational priming (Hoey 2005). We hypothesize that the strength of attraction between a lexical item and a grammatical slot can be influenced (strengthened or weakened) by the instantiation of other colligations of the same node in the same syntagmatic environment, and that it is possible to capture these dependencies between colligations by adapting the methodology of co‑collocation analysis. We are also interested in determining whether these phenomena of dependency interact with collocational primings. The study will be focused on the relationship of particular collocations and syntactic preferences of the noun CAUSE. Using data from a large English web corpus, we will analyze the association between specific collocations of verbs with CAUSE as object and their impact on the co‑occurrence probability of two different types of modifiers: premodifiers and of-headed prepositional postmodifiers. The results suggest that the strength of association between these two modifiers in the context of CAUSE is influenced by the type of verbal collocate.

doi 10.1075/scl.79.09can © 2017 John Benjamins Publishing Company

  1. Introduction and research questions
    Our goal is to explore an aspect of co‑collocation that has not been tackled yet, namely, its relationship with colligational priming. In Hoey’s (2005) theory of lexical priming, colligational priming refers to the grammatical profile of the combinatory behaviour of a word. Colligation thus represents one of the levels of description of a word’s primings. The term colligation itself is attributed to J. R. Firth and is used very frequently in the Firthian tradition and in the corpus linguistics literature.

    The main theoretical issue addressed in this study is whether the different colligations of a word are independent of one another or whether they depend as much on the lexical item with which they are associated as on other grammatical preferences of the same item. This question can be extended to formulate the more general issue of whether the primings observed for a given word at a particular level of description (colligational, collocational, or through semantic association) represent a specific property of that lexical item, as formulated in “classic” lexical priming theory, or whether they also depend on a more complex pattern or cluster of primings integrating two or more levels of description (for instance, an interaction between a set of collocates and two or more colligations).

    The answer to this question may differ from one word to another, since different lexical items exhibit different patterns of behaviour towards their contextual associations. For the sake of focus, this research concentrates on the node CAUSE (noun). Previous studies have analysed the collocational behaviour of this word – see, for instance, Stubbs (1995), who analysed the collocational profile of CAUSE both as noun and as verb. Here we will adopt a different perspective on the analysis of this word, since we will focus on the interaction between colligations and also on the relationship between colligational and collocational priming.

    The questions we seek to answer are two:
    1. Are there dependency relations between different colligations of CAUSE (noun)?
    2. If so, are these dependencies influenced by the interaction with collocational primings of CAUSE (noun)?
  2. Methodology
    All the data used in this study have been extracted from the corpus enTenTen2013. This corpus is a member of the TenTen family, a group of web corpora available at the Sketch Engine.¹ English corpora from this family are named enTenTen and are general corpora of English. With 19,717,205,676 tokens, enTenTen2013 is at present the largest of this group. Corpus size is an important factor in the analysis of complex co‑occurrence patterns. Many of the co‑occurrence patterns involving the presence of two different collocates of a node do not emerge from smaller samples,² and data sparseness is a major problem.

    1. Sketch Engine is a corpus query system developed by Lexical Computing Limited; www.
    2. For instance, the combination mitigate … unintended consequence(s) is not found in the BNC, which with 100 million words is a relatively small corpus according to present-day standards. However, in the enTenTen2013 corpus this combination occurs 28 times. This allows us to consider mitigate and unintended as potential members of a co-collocational pattern, something which is not possible using data from the BNC alone. Needless to say, this does not mean that the BNC will not be more useful than TenTen corpora for many other research purposes.

    The analysis of the data will be organized in two phases: colligation analysis and co-colligation analysis. In the first phase, we will apply descriptive statistics (essentially probabilities and dispersion data) to the analysis of some aspects of the colligational behaviour of the noun CAUSE. In addition, this data will be used to describe the bias of this word towards particular syntactic slots and against others. Thus, the methods used in this phase will not differ substantially from the usual corpus‑linguistic techniques for describing the grammatical behaviour of words. This phase of the analysis will focus on two pairs of grammatical relations: subject and object, on the one hand, and premodifier and postmodifier, on the other. These grammatical relations are among the most salient ones in the enTenTen2013 corpus, and the two members of each pair are mutually exclusive, which makes them well suited to direct comparison. At this point the goal of our analysis will be to determine whether the syntagmatic behaviour of CAUSE shows a preference for either of the two slots set in contrast.

    In the second phase we shall move from standard colligation analysis to co‑colligation analysis. The latter describes the relationship of attraction – or repulsion – between different syntactic slots associated with the behaviour of a particular lexical item or lexical pattern under scrutiny. To our knowledge, this perspective on colligational patterning has not been explored before in the literature on lexical priming.

    The method of co‑colligation analysis will be based on an adaptation of the method of co‑collocation analysis employed in previous research (Almela et al. 2011; Almela 2014). The strategy of co‑collocation analysis is to compare the strength of attraction between the node and a given collocate (let us call it collocate1) and the strength of attraction of the same collocate towards the pair formed by the node and another collocate, which we shall call collocate2. In both cases the strength is measured in terms of conditional probabilities. Thus, in the formulae below, Prob1 represents the likelihood that collocate1 occurs given the presence of the collocation of the node and collocate2, while Prob2 represents the likelihood that collocate1 occurs given the presence of the node, but not necessarily of collocate2 too.
      • Prob1 = Prob(collocate1|node,collocate2) and
      • Prob2 = Prob(collocate1|node)

    Prob1 may be called inter-collocational conditional probability; Prob2 will be called intra-collocational conditional probability. Prob1>Prob2 implies that the conditional probability of collocate1 given the collocation of the node and collocate2 is greater than that of collocate1 given the node. This indicates that the co‑occurrence of the node and collocate2 exerts a greater attraction towards collocate1 than the sole presence of the node. Whenever this is the case, we can say that the syntagmatic attraction between node and collocate is strengthened by the inter‑collocability relation, i.e. by the interaction of the collocational pair with another collocation of the same node. This relation between collocations will be termed positive inter-collocability. Conversely, negative inter-collocability is when the reverse occurs: Prob1<Prob2, which is an indication that the attraction between node and collocate1 is weakened rather than strengthened by the presence of collocate2.

    To put it simply, the comparison of inter‑ and intra‑collocational conditional probability focuses on differences observed in the probability of a particular outcome given different cues. The cue is a feature that defines some kind of information about a context of occurrence of an item, and the outcome is the item whose probability of occurrence in that context is being measured. The higher the value, the greater the capacity of the cue for predicting the outcome. The maximum is 1, where all the occurrences of the cue are accompanied by the outcome, and the minimum is 0, where none of the instances of the cue occur in the company of the outcome.
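The comparison of Prob1 and Prob2 described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the function names, the "neutral" label for the case Prob1 = Prob2, and the frequency counts in the usage example are all hypothetical, and each conditional probability is assumed to be estimated as a plain ratio of corpus frequencies.

```python
def conditional_probability(joint_freq: int, cue_freq: int) -> float:
    """Estimate P(outcome | cue) as F(cue with outcome) / F(cue)."""
    if cue_freq <= 0:
        raise ValueError("cue frequency must be positive")
    if not 0 <= joint_freq <= cue_freq:
        raise ValueError("joint frequency must lie between 0 and the cue frequency")
    return joint_freq / cue_freq


def inter_collocability(f_node: int, f_node_c1: int,
                        f_node_c2: int, f_node_c1_c2: int):
    """Return (Prob1, Prob2, label) for collocate1 under the two cues."""
    # Prob1: the cue is the collocational pair (node, collocate2)
    prob1 = conditional_probability(f_node_c1_c2, f_node_c2)
    # Prob2: the cue is the node alone
    prob2 = conditional_probability(f_node_c1, f_node)
    if prob1 > prob2:
        label = "positive"
    elif prob1 < prob2:
        label = "negative"
    else:
        label = "neutral"  # assumption: the text does not name this case
    return prob1, prob2, label


# Hypothetical counts, for illustration only
prob1, prob2, label = inter_collocability(
    f_node=1000, f_node_c1=100, f_node_c2=50, f_node_c1_c2=20)
print(prob1, prob2, label)
```

Because both values are conditional probabilities, they always fall between 0 and 1, and only their difference indicates whether collocate2 strengthens or weakens the attraction towards collocate1.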
    In co‑collocation analysis we compare the probability of a particular outcome (for instance, collocate1) given cues of different complexity: in inter‑collocational probability (Prob1) the contextual feature that functions as the cue is a collocational pair, while in intra‑collocational probability (Prob2) the cue is simpler, as it consists of a single word. In the formula provided above, the cue for Prob2 is the node. However, since conditional probabilities are directional, it is also possible in principle to use the node as the outcome in both cases.

    In several respects, co‑collocation analysis is similar to Gries’ (2013) ‘Delta P Dispersion’ (ΔP). Although these methods pursue different goals, they share several characteristics. ΔP was proposed as an alternative method of collocation extraction, while co‑collocation analysis is intended to capture dependency relations between collocates after they have been extracted. Thus, co‑collocation analysis is not a method for selecting or sorting collocation candidates but rather a method for analysing relations among collocates. The similarity with ΔP lies in the fact that both make use of conditional probabilities involving lexical co‑occurrence data. However, in ΔP the probabilities are corrected by the use of variables from a contingency table, while in co‑collocation analysis such correction is not applied. The only reason for this is that, as far as we know, no method of collocation analysis developed so far has been devised to apply such correction of conditional probabilities across intra‑ and inter‑collocational levels (as befits a method for collocation extraction, ΔP does this only at the intra‑collocational level).

    In our approach, in order to adapt the method of co‑collocation analysis to the more abstract level of co‑colligation, we can reduce the level of lexical specification in the patterns described and replace some of their lexically filled slots with lexically unfilled syntactic slots. Each such slot can be called a colligate if it is shown that the syntactic behaviour of the node is strongly biased towards co‑occurring with words occupying that position. In an analogous manner to co‑collocation analysis, we can compare the strength of attraction between the node and a particular colligate (say, colligate1) with the strength of attraction of that colligate towards the pair formed by the node and another of its colligates, which will be termed colligate2. ProbA and ProbB below stand for inter-colligational and intra-colligational conditional probability, respectively:
      • ProbA = Prob(colligate1|node,colligate2) and
      • ProbB = Prob(colligate1|node)

        The quantitative technique for computing the value of ProbA and ProbB is displayed below:

        \[ \mathrm{Prob}_A = \frac{\text{node word} + \text{colligate}_1 + \text{colligate}_2}{\text{node word} + \text{colligate}_2} \tag{1} \]

        \[ \mathrm{Prob}_B = \frac{\text{node word} + \text{colligate}_1}{\text{node word}} \tag{2} \]

        Following the analogy with co‑collocation, ProbA>ProbB implies that the conditional probability of colligate1 given the co‑occurrence of the node and colligate2 is greater than that of colligate1 given the node. This relation between colligates of a node will be termed positive inter-colligability. Conversely, negative inter-colligability obtains when ProbB>ProbA.

        Like co‑collocation analysis, these measures are directional and, in principle, we could also input colligates as cues and assign the node the role of outcome. In the present study we shall apply only the directionality specified in the formulae above (with the node as a component of the cue and one of the colligates as an outcome).

        However, it is unlikely that a colligate on its own – or even a combination of colligates – can operate as a powerful cue, as the capacity of syntactic slots for predicting their lexical contexts is in general much weaker than the capacity of lexical items for predicting their grammatical contexts. Alternatively, we can also use a lexical pattern, instead of a single word, as the cue in co‑colligation analysis. In fact, it is reasonable to predict that a collocation will be a more powerful cue than a single word: the probability of finding a word from a specific grammatical class and function as outcome will increase if the cue is a collocation instead of a single word because, in general, the environment of a collocation tends to be syntactically more restricted than that of an individual item (thus, the presence of, say, a possessive determiner is much more predictable from the collocation own cause than it is from the occurrence of cause alone). When a collocation is used as cue, the formula for co‑colligation analysis can be extended as follows:
      • ProbA’ = Prob(colligate1|collocation,colligate2) and
      • ProbB’ = Prob(colligate1|collocation)

        which is equivalent to:

      • ProbA’ = Prob(colligate1|[node,collocate],colligate2) and
      • ProbB’ = Prob(colligate1|[node,collocate])

        The method for calculating the values of ProbA’ and ProbB’ is:

        \[ \mathrm{Prob}_{A'} = \frac{(\text{node word}, \text{collocate}) + \text{colligate}_1 + \text{colligate}_2}{(\text{node word}, \text{collocate}) + \text{colligate}_2} \tag{1} \]

        \[ \mathrm{Prob}_{B'} = \frac{(\text{node word}, \text{collocate}) + \text{colligate}_1}{(\text{node word}, \text{collocate})} \tag{2} \]
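As a rough illustration of how ProbA’ and ProbB’ could be obtained from annotated corpus hits, the sketch below counts slot fillings for one verbal collocate. Everything in it is an assumption made for the example: the `Occurrence` record, the function name `co_colligation`, and the toy data are hypothetical and do not reproduce the Sketch Engine output or the study's figures.

```python
from typing import Iterable, NamedTuple, Tuple


class Occurrence(NamedTuple):
    """One hypothetical hit of the node as object of a verb."""
    verb: str              # verbal collocate (lexical part of the cue)
    premodifier: bool      # is the colligate1 slot filled?
    of_postmodifier: bool  # is the colligate2 slot filled?


def co_colligation(occurrences: Iterable[Occurrence],
                   verb: str) -> Tuple[float, float, float]:
    """Return (ProbA', ProbB', difference) for the collocation (node, verb)."""
    hits = [o for o in occurrences if o.verb == verb]
    n = len(hits)                                   # F(node, collocate)
    n_c1 = sum(o.premodifier for o in hits)         # ... + colligate1
    n_c2 = sum(o.of_postmodifier for o in hits)     # ... + colligate2
    n_c1_c2 = sum(o.premodifier and o.of_postmodifier for o in hits)
    if n == 0 or n_c2 == 0:
        raise ValueError("not enough data for this collocate")
    prob_a = n_c1_c2 / n_c2   # cue: collocation plus colligate2
    prob_b = n_c1 / n         # cue: collocation alone
    return prob_a, prob_b, prob_a - prob_b


# Toy data, invented for illustration only
toy = [
    Occurrence("determine", True, True),
    Occurrence("determine", True, True),
    Occurrence("determine", False, True),
    Occurrence("determine", False, False),
    Occurrence("find", True, False),
    Occurrence("find", True, False),
    Occurrence("find", False, True),
    Occurrence("find", False, True),
]

for v in ("determine", "find"):
    a, b, diff = co_colligation(toy, v)
    relation = "positive" if diff > 0 else "negative" if diff < 0 else "none"
    print(v, round(a, 3), round(b, 3), relation)
```

A positive difference corresponds to positive inter-colligability for that collocation, a negative one to negative inter-colligability.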
        As one of our goals is to obtain information about the interplay of colligation and collocation, we shall use elements of different levels of abstraction as cues. In the first level of co-colligation analysis we shall use the node (CAUSE) as the only lexical component of the cue, the other component being a non‑lexically specified slot:
      • ProbA = Prob(colligate1|CAUSE,colligate2)
      • ProbB = Prob(colligate1|CAUSE)

        In the second level of co-colligation analysis, the level of collocational specification, the cue will be extended by adding specific lexical fillers for an additional slot. These fillers will be collocates of CAUSE occurring in syntactically defined combinations. Thus the formulae are:

      • ProbA’ = Prob(colligate1|[CAUSE,collocate],colligate2)
      • ProbB’ = Prob(colligate1|[CAUSE,collocate])
        The collocations used as cues in the second level of co‑colligation analysis will be obtained using a conventional method of collocate extraction. The frequency threshold was set to 3 (i.e. node and collocate had to co‑occur at least three times for them to be considered as potential collocation candidates), and the measure of lexical association that we used is logDice (Rychlý 2008). This measure is a refinement of the Dice score. One of the disadvantages of the Dice score is that its values are usually very small figures. Rychlý (2008: 9) fixed this problem by adding 14 to the base‑2 logarithm of the Dice coefficient, so that the theoretical maximum is 14, values are usually below 10, and negative values mean no statistically significant co‑occurrence. Another advantage of logDice, in addition to its interpretability, is its stability across subcorpora and across samples of different size (Rychlý 2008). Its formula is:

        \[ \mathrm{logDice} = 14 + \log_2 \frac{2 f_{xy}}{f_x + f_y} \]

        where f_x and f_y are the frequencies of the two words and f_xy is the frequency of their co‑occurrence.

        Instead of establishing a collocational window (a maximum number of words to the left and right of the node), the collocational distance was defined syntactically using the Word sketch tool of Sketch Engine, which classifies logDice collocates automatically into grammatical relations such as Subject, Object, Modifier, etc.

        At all stages of the analysis the results provided by the Word sketch tool were manually supervised in order to filter out possible tagging or parsing errors (for example, lead was excluded from the list of top verbal collocates because we found that the occurrences of leading as an adjective in leading cause had been erroneously tagged as instances of the verb lead taking CAUSE as an object).
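The logDice score described above can be sketched as follows. This is an illustrative implementation of the published formula, not the Sketch Engine code; the function names and the example frequencies are hypothetical, and the candidate filter simply mirrors the co‑occurrence threshold of 3 mentioned in the text.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    if f_xy > min(f_x, f_y):
        raise ValueError("co-occurrence frequency cannot exceed either word frequency")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def is_candidate(f_xy: int, min_cooccurrence: int = 3) -> bool:
    """Apply the co-occurrence frequency threshold used for candidate selection."""
    return f_xy >= min_cooccurrence


# Hypothetical frequencies, for illustration only
pairs = {
    ("cause", "root"): (500, 40000, 2000),
    ("cause", "rare"): (2, 40000, 900),
}
for pair, (f_xy, f_x, f_y) in pairs.items():
    if is_candidate(f_xy):
        print(pair, round(log_dice(f_xy, f_x, f_y), 2))
    else:
        print(pair, "below frequency threshold")
```

The score reaches its ceiling of 14 only when the two words never occur without each other, which is why values observed in practice are much lower.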
  3. Results and analysis
    The first step in our analysis was the identification of a set of syntactic slots around the noun CAUSE that can form potentially strong colligation combinations. We focused on four grammatical positions that occur frequently in the context of CAUSE and meet the said requirements:
    1. Premodifier: noun or adjective occupying a premodifier slot headed by cause (e.g. root in root cause, or probable in probable cause).
    2. Premodified: head noun premodified by cause (e.g. analysis in cause analysis).
    3. Verb(Subject_of): verb taking cause as the head of its subject (e.g. trigger in the cause triggered diverse reactions). This category also includes co‑occurrences with cause in by‑agentive phrases of passive constructions (e.g. triggered by many causes such as…).
    4. Verb(Object_of): verb taking cause as its object head noun (e.g. determine in determine the cause of…).
      Each of these four positions constitutes a potential colligate, i.e. a syntactic slot with which the word under scrutiny (the node, to borrow the term from colloca‑ tion studies) might be primed to co‑occur. Besides, the four slots selected allow for a great range of flexibility in their combinations. The first two slots (Premodifier and Premodified) can be combined with CAUSE in the same phrase (e.g. root cause analysis), although this is not frequently the case, and each of them can co‑occur both with the Verb(Subject_of) slot and with the Verb(Object_of) slot of CAUSE, as in the following examples:
      1. to determine the root cause of the issue (Verb[Object_of] + Premodifier + CAUSE);
      2. the root cause of the plaque buildup is unknown (Premodifier + CAUSE + Verb[Subject_of]);
      3. multidisciplinary teams conduct a root cause analysis of reported adverse events (Verb[Object_of] + Premodifier + CAUSE + Premodified);
      4. A root cause analysis showed that a failure in the system… (Premodifier + CAUSE + Premodified + Verb[Subject_of]).

        The complete set of colligates considered in this phase of the analysis is shown in Table 1.
        Table 1. Potential colligational slots and their combinations


        Figure 1. Premodifier vs. Premodified distributions (bar chart, % of co‑occurrences with CAUSE: Premodifier 99.46, Premodified 0.54)
        The second step was aimed at identifying a possible bias in the colligational preferences of CAUSE towards some of these grammatical positions and against others. A preliminary analysis of the data (Figure 1 and Table 2) reveals that there is a preference for CAUSE to co‑occur with the Premodifier slot (99.46%) rather than the Premodified position (0.54%). That is, CAUSE has a strong preference for being premodified by an adjective or another noun instead of being used as a premodifier itself. This is relevant because in other nouns the bias against occurring as a premodifier is not as overwhelming. For example, with the noun SOURCE the proportion of co‑occurrences with the Premodified slot (18.12% compared to 81.88% of Premodifier slot) is 22 times higher than with CAUSE. Another example is BASIS: in this case the proportion of co‑occurrences with the Premodified slot (5.62% compared to 94.38% of Premodifier slot) is ten times higher than with CAUSE. It is also worth noting that there are relatively few collocates of CAUSE that fill both slots.

        As for the distribution of CAUSE in subject and object functions, Figure 2 (see also Table 2) also shows a bias towards one of the slots, which in this case is the object function. Comparing co‑occurrences of CAUSE with the slots Verb(Object_of) and Verb(Subject_of), we observe that the former accounts for a much higher proportion of instances (72.86%) than the latter (27.14%). Thus, combinations of CAUSE with Verb(Object_of) are 2.7 times more frequent than combinations of the same noun with Verb(Subject_of). Again, the results for the noun investigated indicate a strong preference for co‑occurring with particular syntactic functions. In the light of lexical priming theory, it can be said that CAUSE is primed to co‑occur with verbs that take it as their object and with nouns and adjectives that premodify it. This adds to previous findings in the corpus linguistics literature showing that lexical items have preferences for particular grammatical contexts even when other combinations are also grammatically possible in principle (Sinclair 1991; Francis 1993; Hunston & Francis 2000; Hoey 2005; among others).

        Table 2. Distribution data of two pairs of slots in the context of CAUSE

        Figure 2. Verb(Subject_of) vs. Verb(Object_of) distributions (bar chart, % of co‑occurrences with CAUSE: Verb(Subject_of) 27.14, Verb(Object_of) 72.86)

        Hence, the noun CAUSE tends to be primed to fill object slots (72.86%) rather than subject slots (27.14%), and to function as a head noun modified by adjectives or by other nouns (99.46%) rather than being used as a premodifier of other nouns (0.54%). This is evidence in favour of treating Verb(Object_of) and Premodifier as colligates of CAUSE.

        In the third step we applied the two levels of co‑colligation analysis explained in the previous section. Due to limitations of space, we concentrated our analysis on only one of the two colligates identified in the previous step. The slot selected was Premodifier. Next, we looked for a potential colligate that could be introduced as cue in the formulae for co‑colligation analysis, in order to measure the influence it might exert on the co‑occurrence probability of Premodifier. Logically, the candidate had to be one that is compatible in the same clause with the colligates already selected, and even with their combination. Ideally, it should also be a relatively frequent slot, as in principle that should increase the probability of finding a sufficiently large number of combinations with the Premodifier slot and with the node (CAUSE) in specific syntactic positions.

        The slot of-Postmodifier meets these criteria. This slot represents the position filled by the head of a noun phrase occurring inside an of‑headed prepositional phrase that functions as a phrasal postmodifier of CAUSE (e.g. failure in cause of failure, or freedom in cause of freedom). The frequency of co‑occurrence of this slot with CAUSE is substantially higher than that of the Premodified slot, though slightly lower than the frequency of Premodifier (compare Table 2 and Table 3).
        Table 3. Distribution data of two slots in the context of CAUSE

        Table 4 shows the results from applying the first‑level of co‑colligation analy‑ sis, where the only lexical filler is the node (CAUSE) and the slots corresponding to its colligates remain collocationally unspecified. At this stage, the formulae applied for obtaining the inter‑colligational probability (ProbA) and the intra‑colligational probability (ProbB) are:
        • ProbA = Prob(Premodifier|CAUSEObject_of,Of‑Postmodifier)
        • ProbB = Prob(Premodifier|CAUSEObject_of)ProbA measures the probability that an occurrence of CAUSE as object of a verb is preceded by a premodifier given the presence of an of‑headed prepositional post‑ modifier; ProbB measures the probability that an occurrence of CAUSE as object is preceded by a premodifier, regardless of whether or not the combination is also accompanied by an of‑postmodifier of CAUSE.In, Table 4w1 stands for the uses of CAUSE as an object noun; s1 and s2 rep‑ resent two different slots: Premodifier (s1) and of-Postmodifier (s2). The rightmost column (DIF) represents the difference between inter‑ and intra‑collocational probability. ProbA is less than ProbB, which means that, at this level of description
        • i.e. a collocationally unspecified level – the instantiation of the of-Postmodifier slot does not increase the co‑occurrence probability of a Premodifier. On the con‑ trary, the effect is a decrease in the probability of the latter. Thus, they are not positive co‑colligates.Table 4. Distribution data of two slots in the context of CAUSE3

F(w1)    F(w1,s1)  F(w1,s2)  F(w1,s1,s2)  ProbA = P(s1|w1,s2)  ProbB = P(s1|w1)  DIF(ProbA; ProbB)
424410   157536    137032    38957        0.28429              0.37119           −0.08690
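The first-level figures can be reproduced from the four frequencies in Table 4. The following is a minimal Python sketch, assuming that both probabilities are estimated as simple relative frequencies; the function and variable names are illustrative and do not come from the original study.

```python
# Frequencies as reported in Table 4 (w1 = CAUSE as object noun).
F_W1 = 424410        # F(w1): CAUSE as object of a verb
F_W1_S1 = 157536     # F(w1,s1): ... with a premodifier
F_W1_S2 = 137032     # F(w1,s2): ... with an of-postmodifier
F_W1_S1_S2 = 38957   # F(w1,s1,s2): ... with both slots filled


def co_colligation(f_node, f_node_s1, f_node_s2, f_node_s1_s2):
    """Return (ProbA, ProbB, DIF) from raw co-occurrence counts.

    Assumption: each conditional probability is a relative frequency.
    """
    prob_a = f_node_s1_s2 / f_node_s2   # P(s1 | w1, s2)
    prob_b = f_node_s1 / f_node         # P(s1 | w1)
    return prob_a, prob_b, prob_a - prob_b


prob_a, prob_b, dif = co_colligation(F_W1, F_W1_S1, F_W1_S2, F_W1_S1_S2)
print(f"ProbA = {prob_a:.5f}, ProbB = {prob_b:.5f}, DIF = {dif:.5f}")
```

A negative DIF, as here, means that the presence of the of-postmodifier lowers the estimated probability of a premodifier.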
However, the second level of co-colligation analysis, with partial collocational specification of the grammatical context of the node, yields different results (see Table 5). The values given for ProbA’ and ProbB’ in Table 5 result from applying the following formulae:
• ProbA’ = Prob(Premodifier | [CAUSE_{Object_of}, collocate_{Verb}], Of-Postmodifier)
• ProbB’ = Prob(Premodifier | [CAUSE_{Object_of}, collocate_{Verb}])
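Table 5 reports raw co-occurrence frequencies alongside the two probabilities, and its figures are consistent with estimating each conditional probability as a relative frequency. Under that assumption (the notation follows the column headings of Table 5), the second-level measures can be written as:

```latex
\begin{align*}
\mathrm{ProbA}' &= P(s_1 \mid [w_1, w_2], s_2) = \frac{F(w_1, s_1, w_2, s_2)}{F(w_1, w_2, s_2)} \\[4pt]
\mathrm{ProbB}' &= P(s_1 \mid w_1, w_2) = \frac{F(w_1, s_1, w_2)}{F(w_1, w_2)} \\[4pt]
\mathrm{DIF} &= \mathrm{ProbA}' - \mathrm{ProbB}'
\end{align*}
```

For determine, for instance, the frequencies in Table 5 give ProbA’ = 2141/3865 ≈ 0.55395 and ProbB’ = 3865/15788 ≈ 0.24481, hence DIF ≈ 0.30914.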
The first column of Table 5 shows the list of 20 top logDice collocates of CAUSE in the Verb(Object_of) slot (during the manual supervision of the Word sketch, three verbs – underlie, lead, and be – were filtered out as parsing errors). In this table, w2 stands for any verbal collocate taking w1 (CAUSE) as object noun; w1, s1, s2 and DIF represent the same as in Table 4. The difference with respect to Table 4 is that in this case the Verb(Object_of) slot has been filled in with specific collocates of CAUSE. The collocates have been arranged in order of decreasing difference between inter-colligational and intra-colligational probability (the value of ProbA’ − ProbB’ is given in the rightmost column).

3. w1 = CAUSE as object noun; s1 = Premodifier slot; s2 = of-Postmodifier slot; DIF = difference between ProbA and ProbB.

Four aspects of these results merit special attention (see Figure 3). First, there are six collocates in position w2 that are associated with a positive co-colligational relation between the slots analyzed (see the positive DIF scores in Table 5). Thus, while the first level of co-colligation analysis indicated that the probability of CAUSE (as object) co-occurring with a premodifier is weakened by the presence of an of-postmodifier, the results obtained from the second level of co-colligation analysis suggest that this effect is contingent on specific collocations of CAUSE. With some verbs the presence of a postmodifier decreases the probability of a premodifier of object CAUSE, but with other verbs (determine, treat, know, address, understand, pinpoint) the presence of a postmodifier increases this probability. We can interpret these data as an indication that collocational patterning has an effect on the relationships between different colligates of the node.

Table 5. Co-colligation analysis: Second level (partially specified by collocations)⁴

w2           F(w1,w2)  F(w1,s1,w2)  F(w1,w2,s2)  F(w1,s1,w2,s2)  ProbA’ = P(s1|[w1,w2],s2)  ProbB’ = P(s1|w1,w2)  DIF(ProbA’; ProbB’)
determine    15788     3865         3865         2141            0.55395                    0.24481               0.30914
treat        5159      2010         2752         1208            0.43895                    0.38961               0.04934
know         13709     3266         5386         1452            0.26959                    0.23824               0.03135
address      9009      4904         5310         3014            0.56761                    0.54434               0.02326
understand   8773      2077         4484         1139            0.25401                    0.23675               0.01727
pinpoint     2099      795          1173         452             0.38534                    0.37875               0.00658
discover     3879      1241         2101         653             0.31080                    0.31993               −0.00912
explain      2610      522          1462         273             0.18673                    0.20000               −0.01327
eliminate    3817      1730         2143         938             0.43770                    0.45324               −0.01553
diagnose     2283      608          1527         368             0.24100                    0.26632               −0.02532
identify     12790     5000         6742         2448            0.36310                    0.39093               −0.02783
investigate  4568      790          3124         440             0.14085                    0.17294               −0.03210
plead        1413      188          466          19              0.04077                    0.13305               −0.09228
further      4280      571          1766         31              0.01755                    0.13341               −0.11586
advance      6951      967          3929         84              0.02138                    0.13912               −0.11774
champion     5200      825          2242         34              0.01517                    0.15865               −0.14349
help         14144     3573         1108         72              0.06498                    0.25262               −0.18763
promote      5483      1513         1387         83              0.05984                    0.27594               −0.21610
serve        3975      1163         1611         98              0.06083                    0.29258               −0.23175
support      24141     11510        1477         217             0.14692                    0.47678               −0.32986

4. w2 = verbal collocate taking w1 as object noun; w1, s1, s2 as in Table 4 (see previous footnote).
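The second-level analysis can be sketched in a few lines of Python. The counts below are copied from Table 5; the relative-frequency estimate and the names TABLE5 and dif_score are assumptions made for illustration, not the authors' own implementation.

```python
# verb -> (F(w1,w2), F(w1,s1,w2), F(w1,w2,s2), F(w1,s1,w2,s2)), from Table 5
TABLE5 = {
    "determine": (15788, 3865, 3865, 2141),
    "treat": (5159, 2010, 2752, 1208),
    "know": (13709, 3266, 5386, 1452),
    "address": (9009, 4904, 5310, 3014),
    "understand": (8773, 2077, 4484, 1139),
    "pinpoint": (2099, 795, 1173, 452),
    "discover": (3879, 1241, 2101, 653),
    "explain": (2610, 522, 1462, 273),
    "eliminate": (3817, 1730, 2143, 938),
    "diagnose": (2283, 608, 1527, 368),
    "identify": (12790, 5000, 6742, 2448),
    "investigate": (4568, 790, 3124, 440),
    "plead": (1413, 188, 466, 19),
    "further": (4280, 571, 1766, 31),
    "advance": (6951, 967, 3929, 84),
    "champion": (5200, 825, 2242, 34),
    "help": (14144, 3573, 1108, 72),
    "promote": (5483, 1513, 1387, 83),
    "serve": (3975, 1163, 1611, 98),
    "support": (24141, 11510, 1477, 217),
}


def dif_score(f_pair, f_pair_s1, f_pair_s2, f_pair_s1_s2):
    """DIF = P(s1 | [w1,w2], s2) - P(s1 | w1,w2), as relative frequencies."""
    prob_a = f_pair_s1_s2 / f_pair_s2
    prob_b = f_pair_s1 / f_pair
    return prob_a - prob_b


scores = {verb: dif_score(*counts) for verb, counts in TABLE5.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
positive = [verb for verb in ranked if scores[verb] > 0]
print("Positive co-colligation with:", ", ".join(positive))
```

Sorting by the recomputed score reproduces the row order of Table 5, and only the first six verbs have a positive DIF.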
[Figure 3: bar chart. Bars, from left to right: L1 (collocationally unspecified), followed by L2 for determine, treat, know, address, understand, pinpoint, discover, explain, eliminate, diagnose, identify, investigate, plead, further, advance, champion, help, promote, serve and support.]

Figure 3. DIF(ProbA’; ProbB’): First level (collocationally unspecified: L1) vs second level (collocationally specified: L2)⁵

5. L1 = first level of co-colligation analysis (collocationally unspecified); L2 = second level of co-colligation analysis (partially specified by collocations).

The second point to be highlighted is that the six verbs that are associated with positive co-colligation of Premodifier and of-Postmodifier have a consistent semantic relation with CAUSE. All of them co-occur with one of the major senses of CAUSE, namely, with the sense of ‘origin of an event’, as opposed to the ‘effect/consequence’. The occurrences of CAUSE as object of these six verbs are systematically associated with this primary sense of the noun analyzed (see Examples (5)–(10) below). This stands in sharp contrast to the heterogeneity of senses that CAUSE activates in collocation with those verbs that are associated with a negative co-colligational relation between Premodifier and of-Postmodifier. Some of them are combined with the ‘origin’ sense of CAUSE too (e.g. discover, explain, eliminate, diagnose, as in Examples (11)–(14)), but others activate the meaning of ‘aim’ or ‘ideal’ (see, for instance, advance, support, further, help in Examples (15)–(18)), and there is one (plead) that is combined with a variety of senses of CAUSE.
(5) Mainpine can often determine the exact cause of the problem in a very short period of time.
(6) In some cases, blindness can be reversed if the underlying cause is treated very early.
(7) …you need to know the specific cause of your symptoms.
(8) …their task was the creation of high-level policies or systems aimed at addressing the more fundamental causes of homelessness.
(9) The goal is to better understand the potential causes of these disorders.
(10) Over the years, psychologists and other experts have pinpointed these specific causes of conflict.
(11) The Foundation has funded over 200 research projects designed to discover the causes of traffic crashes.
(12) explain the major causes and events of World War II.
(13) These diagnostic imaging tests are also used to eliminate other common causes of abdominal and inguinal pain.
(14) It is very important to diagnose the real cause of snoring.
(15) …disadvantaged groups are able to use the law to advance their causes of social progress and equality.
(16) I would invite you to revisit the issues and financially support the cause of affordable, reliable energy if…
(17) …and its associated silent auction brought in about $10,000 in charitable contributions to further the cause of literacy.
(18) Colourful but empty rhetoric cannot help the cause of democracy and it has no place in the current debate.
The third interesting aspect is that the order of the DIF scores (the difference between ProbA’ and ProbB’ in the rightmost column of Table 5) is related to the semantic differences observed above. In Figure 3 all the verbal collocates listed above are arranged from left to right in order of decreasing difference between the inter- and the intra-colligational probability (the leftmost bar does not stand for any particular verbal collocate but for the value of the first-level co-colligation analysis, prior to the specification of collocational data). Now, all the verbal collocates to the left of plead (from determine to investigate) are systematically associated with the ‘origin’ sense of CAUSE; those to the right are systematically associated with the meaning of ‘aim’. Between the two groups is plead, whose range of meaning and usage differs from that of the other verbs in the list for two reasons: first, because it is often used in collocations with a very specific legal sense, not identical to either of the two senses of CAUSE mentioned (e.g. …to properly plead a cause of action); and second, because many of its occurrences include archaic expressions from the Bible (e.g. Blessed be the Lord, that hath pleaded the cause of my reproach…). Thus, even though several verbal collocates connected with the ‘origin’ sense produce negative values of co-colligation between Premodifier and of-Postmodifier, one thing they have in common is that their scores are all higher than those of the verbs related to other senses of CAUSE. All the verbal collocates related to the meaning of ‘aim’ occupy the right-hand side of the graph, to the right of plead.

This could indicate that the effect of the of-postmodifier on the probability of the premodifier is related to the meaning of CAUSE. The most negative effect is observed in collocations associated with the meaning of ‘aim’. Positive effects are invariably linked in this list with collocates that activate the meaning of ‘origin’, and where verbs from this semantic group have negative effects on the said co-colligational relation, the effects are milder than those of verbs of the other groups.

Moreover, there is a certain semantic consistency among the verbs of each of these two major groups. That similarity lies not only in the aforementioned relation to specific senses of CAUSE but also in the meaning that each of them contributes to the collocation. Among the verbs of the first semantic group there are many that describe a cognitive activity: determine, know, understand, pinpoint, discover, explain, identify, investigate. All these are associated with the ‘origin’ sense of CAUSE. The second group of verbs (further, advance, champion, help, promote, serve, support) is even more consistent, because they all express the idea of ‘doing something in order to make X succeed’.
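The relation between DIF order and sense group can be checked mechanically. In the sketch below, the DIF values are those printed in Table 5 and the sense labels follow the grouping given in the text (plead is kept out of both groups); the dictionary and list names are hypothetical.

```python
# DIF(ProbA'; ProbB') values as printed in Table 5.
DIF = {
    "determine": 0.30914, "treat": 0.04934, "know": 0.03135,
    "address": 0.02326, "understand": 0.01727, "pinpoint": 0.00658,
    "discover": -0.00912, "explain": -0.01327, "eliminate": -0.01553,
    "diagnose": -0.02532, "identify": -0.02783, "investigate": -0.03210,
    "plead": -0.09228, "further": -0.11586, "advance": -0.11774,
    "champion": -0.14349, "help": -0.18763, "promote": -0.21610,
    "serve": -0.23175, "support": -0.32986,
}

# Sense groups as described in the text: verbs left of plead vs right of plead.
ORIGIN_VERBS = ["determine", "treat", "know", "address", "understand",
                "pinpoint", "discover", "explain", "eliminate", "diagnose",
                "identify", "investigate"]
AIM_VERBS = ["further", "advance", "champion", "help", "promote",
             "serve", "support"]

lowest_origin = min(ORIGIN_VERBS, key=DIF.get)
highest_aim = max(AIM_VERBS, key=DIF.get)

# The two groups do not overlap on the DIF scale, and plead lies between them.
separated = DIF[lowest_origin] > DIF["plead"] > DIF[highest_aim]
positive_verbs = [verb for verb, score in DIF.items() if score > 0]

print(f"lowest 'origin' verb: {lowest_origin} ({DIF[lowest_origin]})")
print(f"highest 'aim' verb:   {highest_aim} ({DIF[highest_aim]})")
print("groups separated by plead:", separated)
```

Every verb with a positive score belongs to the ‘origin’ group, which matches the observation that positive effects occur only with that sense.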
Finally, the fourth point that we would like to emphasize is that the relationship observed between meaning components and distribution is specific to the interaction of collocation and co-colligation. This means two things:
1. The same patterns are not obtained when the co-colligational relations are analyzed without specification of collocational data. We have explained this above in relation to the differences observed between Table 4 and Table 5.
2. The same patterns are not obtained when the interaction of collocation and colligation is analyzed without specification of co-colligation, i.e. of dependency relations between colligates. This is illustrated by the differences between Table 5 and Table 6.

    Table 6. Simple colligational dependency
Table 6 indicates the probability of the Premodifier slot given the presence of particular collocations of verbs and the noun CAUSE (w1, w2, s1 and s2 represent the same as in Table 4 and Table 5). Crucially, in this case the distribution of semantic sets of verbal collocates is not closely related to the probability figures. The verbs associated with the highest scores are not always connected to the ‘origin’ meaning of CAUSE (for example, support ranks second), and conversely, the verbs at the bottom of the list – those which are less powerful predictors of the Premodifier slot for CAUSE – are not always associated with the meaning of ‘aim’ (thus, investigate is placed towards the lower part of the list). Verbal collocates belonging to different sets are scattered in different parts of the list.
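This ranking can be illustrated under one explicit assumption: that the simple colligational probability reported in Table 6 corresponds to P(s1|w1,w2), i.e. the ProbB’ column of Table 5. The values below are copied from that column; the names PROB_B and ranking are illustrative.

```python
# ProbB' = P(s1 | w1, w2) as printed in Table 5.
# Assumption: these are the probabilities that Table 6 ranks.
PROB_B = {
    "determine": 0.24481, "treat": 0.38961, "know": 0.23824,
    "address": 0.54434, "understand": 0.23675, "pinpoint": 0.37875,
    "discover": 0.31993, "explain": 0.20000, "eliminate": 0.45324,
    "diagnose": 0.26632, "identify": 0.39093, "investigate": 0.17294,
    "plead": 0.13305, "further": 0.13341, "advance": 0.13912,
    "champion": 0.15865, "help": 0.25262, "promote": 0.27594,
    "serve": 0.29258, "support": 0.47678,
}

# Rank verbs by how strongly they predict a premodifier of CAUSE.
ranking = sorted(PROB_B, key=PROB_B.get, reverse=True)

for rank, verb in enumerate(ranking, start=1):
    print(f"{rank:2d}  {verb:<12} {PROB_B[verb]:.5f}")
```

On this assumption, support comes second and investigate sixteenth of the twenty verbs, so verbs of the two sense groups are interleaved rather than separated.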
  4. Discussion and conclusions

The results presented in this chapter point towards an influence of collocation on the co-occurrence probabilities of different colligates of a node. Although the behaviour of CAUSE is biased towards particular syntactic slots and against others, the evidence analyzed here suggests that such preferences are not always a characteristic of the syntagmatic behaviour of the node (CAUSE in this case), and that they may be influenced by factors that lie beyond the relationship between the node and the colligate in question. One such factor (the one we have examined here) is the interplay with collocation and with other colligations of the same node simultaneously.

Thus, the answer to the two research questions posed in Section 2 is “yes” in both cases. There is a dependency relation between different colligations of CAUSE (at least between Premodifier and of-Postmodifier) and this dependency is influenced by the interaction with collocational primings of CAUSE.

Moreover, we have found some indications that the different effects that particular collocations exert on co-colligational relations are related to semantic properties of such collocations. These results dovetail well with the general principle, well known to corpus linguists, that semantic properties tend to correlate with distributional properties. However, to the best of our knowledge, the study of correlations between meaning and distribution has not yet addressed the kind of interaction between collocational patterns and colligational dependencies that has been described here. The complexity of the patterns identified in this study lies in the fact that they point to an interaction not only between units from different levels of analysis (for example, collocations and colligations) but also between dependency relations operating at more than one level of analysis. This is observable in the effect that particular collocations exert not just on the preference of CAUSE for particular colligations but, more accurately, on the preferences that particular colligations of CAUSE have for other colligations of this word.

The existence of such collocationally motivated dependencies between colligations poses a serious methodological challenge to corpus lexical studies because the level of complexity that it introduces cannot be adequately dealt with using the traditional methods of corpus-based collocational and colligational description. Here we have attempted to overcome this problem by adapting the methodology of co-collocation analysis, which in previous research has been used to describe dependency relations between different collocates of a node.

The relation observed between co-colligation and polysemy can contribute to the development of some aspects of sense-context correlations stated in the theory of Lexical Priming. One of the hypotheses about the relation between polysemy and lexical priming formulated by Hoey (2005) (see also Tsiamita 2009; Patterson 2016) predicts that “where two senses of a word are approximately as common as each other, they will both avoid each other’s collocations, semantic associations and/or colligations” (Hoey 2005: 82). That is, two similarly common senses will tend to be primed for separate types of co-occurrences describable at different levels (lexical, semantic, grammatical). The conclusions obtained from this study add a further, fine-grained parameter for describing the different behavior of similarly common senses of a word: in addition to avoiding each other’s primings, it may be the case that their respective primings also avoid each other’s dependencies on other contextual preferences of the same word.


The terms node and collocate are used here in the strictly Sinclairian sense. The distinction between them is purely methodological (see also Jones & Sinclair 1974; Mason 2000; Sinclair 1991; among others). Node and collocate are not categories of lexical items with different properties. The node‑collocate distinction simply refers to different steps in the method of collocation extraction (input and output, respectively). Thus, each member of a collocation can be treated successively both as node and as collocate in different empirical studies. The node is simply the search term.


Almela, M. 2011. Improving corpus-driven methods of semantic analysis: A case study of the collocational profile of ‘incidence’. English Studies 92: 84–99. doi: 10.1080/0013838X.2010.537050

Almela, M. 2014. ‘You shall know a collocation by the company it keeps’: Methodological advances in lexical-constellation analysis. In Investigating Lexis: Vocabulary Teaching, ESP, Lexicography and Lexis Innovation, J.R. Calvo-Ferer & M.A. Campos (eds), 3–26. Newcastle upon Tyne: Cambridge Scholars.

Almela, M., Cantos, P. & Sánchez, A. 2011. From collocation to meaning: Revising corpus-based techniques of lexical semantic analysis. In New Approaches to Specialized English Lexicology and Lexicography, I. Balteiro (ed.), 47–64. Newcastle upon Tyne: Cambridge Scholars.

Cantos, P. & Sánchez, A. 2001. Lexical constellations: What collocates fail to tell. International Journal of Corpus Linguistics 6(2): 199–228. doi: 10.1075/ijcl.6.2.02can

Francis, G. 1993. A corpus-driven approach to grammar: Principles, methods and examples. In Text and Technology, M. Baker, G. Francis, & E. Tognini-Bonelli (eds), 137–156. Amsterdam: John Benjamins. doi: 10.1075/z.64.10fra

Gries, S.T. 2013. 50-something years of work on collocations. What is or should be next. International Journal of Corpus Linguistics 18(1): 137–165. doi: 10.1075/ijcl.18.1.09gri

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630

Hunston, S. & Francis, G. 2000. Pattern Grammar. A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins. doi: 10.1075/scl.4

Jones, S. & Sinclair, J. 1974. English lexical collocations. Cahiers de Lexicologie 24: 15–61.

Mason, O. 2000. Parameters in collocation: The word in the centre of gravity. In Corpora Galore. Analyses and Techniques in Describing English, J.M. Kirk (ed.), 267–280. Amsterdam: Rodopi.

Patterson, K.J. 2016. The analysis of metaphor: To what extent can the theory of lexical priming help our understanding of metaphor usage and comprehension? Journal of Psycholinguistic Research 45(2): 237–258. doi: 10.1007/s10936-014-9343-1

Rychlý, P. 2008. A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, P. Sojka & A. Horák (eds), 6–9. Brno: Masaryk University.

Sinclair, J.M. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

Stubbs, M. 1995. Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language 2(1): 23–55. doi: 10.1075/fol.2.1.03stu

Tsiamita, F. 2009. Polysemy and lexical priming: The case of drive. In Exploring Lexis-Grammar Interface [Studies in Corpus Linguistics 35], 247–264. Amsterdam: John Benjamins. doi: 10.1075/scl.35.16tsi

part iv

Language learning and teaching

Lexical and morphological priming

A holistic phraseological analysis of the Finnish time expression kello

Jarmo Harri Jantunen

University of Jyväskylä

Using the International Corpus of Learner Finnish, this study examines Finnish time expressions and the lexical item kello (‘watch, time, o’clock’), which is overused by learners of Finnish. The chapter provides a holistic phraseological account of kello that includes not only its collocates but also its morphological priming, n-grams and semantic associations. Previous studies on phraseology have mostly concentrated on languages like English, which have little inflection, with the result that morphology has rarely been touched upon in phraseology studies. The results suggest that the analysis of learner language benefits from a holistic approach to phraseology and that morphological priming as well as semantic preference play an important role in learner language phraseology.

  1. Introduction
Several studies (e.g. Granger 1998; Nesselhauf 2005) have shown that learners face difficulties in using lexical co-occurrence patterns, i.e. in lexical priming. Furthermore, previous pre-corpus studies (e.g. Martin 1995; Kaivapalu 2005) and corpus analyses (e.g. Spoelman 2013) on learner morphology and morphosyntax have shown that the rich morphology of a target language causes problems for learners. Learners tend to overuse and avoid certain morphological structures (e.g. Jantunen & Brunni 2013), and learning results may be affected by the differences and similarities in inflectional systems between source and target languages (e.g. Spoelman 2013). However, previous research into learner phraseology has not been conclusive in so far as it has taken little account of the role of morphology in learning phraseology. This is perhaps due to the fact that most corpus studies of learner language are heavily biased towards English, which is not a morphologically rich language compared to many other languages. Furthermore, there is also a lack of information on how learners cope with semantic associations. These phenomena have not previously been discussed as part of learner phraseology research, although semantic preference and semantic prosody are essential core features of lexical items and ought to be mastered as well as collocations (see Kennedy 2008).

doi 10.1075/scl.79.10jar © 2017 John Benjamins Publishing Company

The present chapter endeavours to move from previous analyses to a more holistic analysis of a phraseological unit. It is also guided by the assumption that a corpus-driven method will reveal interesting aspects of learner language. Thus far, learner corpus studies have been mostly corpus-based (however, see Durrant 2009; Ivaska 2015), which means that earlier findings, SLA theories and researchers’ intuition have provided the basis for the selection of items studied, and not the data itself. The present analysis takes a keyword analysis as a starting point and focuses on one single lexical item in order to provide a holistic picture of learner phraseology. The data come from the International Corpus of Learner Finnish (Jantunen & Brunni 2013).

The chapter is structured as follows. Section 2 describes lexical priming, phraseology and morphological priming in the context of language learning. Next, the methodology and data are outlined in Section 3. The results of the keyword analysis are provided at the beginning of Section 4, which is followed by morphological, lexical and semantic analyses of learner phraseology. Finally, Section 5 discusses the role of morphological priming in learning and phraseology description.
  2. Priming, phraseology and learner language
2.1 Lexical priming and language learning

The acquisition of native-like phraseology clearly poses challenges for language learners. In general, for both native and non-native speakers the mastery of phraseology is similar: it is based on the ability to recognise and learn language elements, and to store them in the memory from which they are subsequently retrieved whenever needed. However, as Wray puts it (in line with Pawley & Syder 1983),

    [k]nowing which subset of grammatically possible utterances is actually commonly used by native speakers is an immense problem for even the most proficient of non-natives, who are unable to separate out and avoid the grammatical but non-idiomatic sequences. (2000: 468)

She further claims that second language learners chiefly register and remember not meaningful chunks, but individual words (Wray 2002: 209), which, of course, is related to how lexical items are taught and acquired: as individual items or chunks. According to Hoey (2005: 184), when a language is acquired as an L1, the range of speakers around the learner, the social context as a whole and, finally, the quantity of input are clearly different from, and more extensive than, those in L2 learning situations. Thus, the exposure to primings is plentiful for native speakers and they encounter more primings and possibilities to prime. L2 learners, in turn, normally try to build their primings through fewer encounters. Furthermore, teaching, the study materials and the learning context often do not allow cotextual priming to take place in a native-like fashion, which causes cracks in the priming. These are evidenced by the numerous studies that highlight language learners’ problems with collocational associations (e.g. Granger 1998; Nesselhauf 2005; Jantunen 2015).

Although a vast number of corpus studies have focused on associations that are more conceptual and abstract than collocation, such as colligations, semantic preference and semantic prosody, this kind of research has not extended much to the study of learner language, nor has it provided support for teaching. Cotextual primings other than collocations have, to our knowledge, been referred to only in a handful of previous studies (e.g. Kennedy 2008; Zhang 2009), and only in a few studies has the analysis been conducted on two cotextual levels (e.g. collocations and colligations, Flowerdew 2006). The literature on teaching the lexicon also maintains a very narrow view of phraseology, concentrating mostly on collocations. From the point of view of language learning and teaching, however, it is essential that phraseology is studied more holistically, as language learners face difficulties on all levels of phraseology, not just with collocations (on the need for holistic analysis, see Ellis 2008; Mahlberg 2006).
2.2 Morphological priming

When we focus on morphologically rich languages, the need to examine phraseology and lexical priming also arises with regard to inflection in noun and verb paradigms. The Finnish case system comprises 15 cases with partly different endings in singular and plural forms, and there is a vast number of different conjugational forms for verbs (see ISK 2004 § 81; Karlsson 1985). Furthermore, suffixes (e.g. the plural marker for nouns and the past tense marker for verbs) and endings cause morphophonological variation in vowels and consonants in stems, and progressive vowel harmony determines which of the allomorphs of the endings will be chosen. Consequently, the native speaker of Finnish primes not only several co(n)textual associations but also several morphophonological forms. However, word forms are not alike when it comes to their frequency and usage. This is pointed out by Karlsson (1985, 1986), who argues that not all word forms are equally frequent and thus important in a paradigm; this is also evident in his corpus analyses of noun paradigms. Rather, he states (1985: 136), “many paradigms – are systematically stratified in regard to what forms actually occur”. Thus some forms are core items in a paradigm and others are peripheral, or even non-existent (e.g. plural forms of singular personal pronouns and comparatives of absolute adjectives). He concludes that at least the most frequent core forms are stored as wholes in the mental lexicon (p. 148). This storage hypothesis of morphological wholes resembles Sinclair’s (1991) idiom principle: certain core inflectional forms with their associations are retrieved from the mental lexicon after the fashion of recurrent lexical combinations.

The existence of paradigmatic priming is also discussed by Hoey (2004: 24) when he defines grammatical priming as “the grammatical category a word belongs to”. He continues that “instead of saying ‘This word is a noun’ – I would argue we should say ‘This word is primed for use as a noun’”. He (2005: 155) also gives an example of the word consequence, which, according to him, shows clear priming for use as a noun. Hoey, then, relates grammatical priming to a node’s word classes. Jantunen and Brunni (2013) and Sonnenstuhl et al. (1999), in turn, relate it to a node’s inflectional forms. In their corpus analysis of the noun ihminen (‘person, human being’) and the verb pitää (e.g. ‘to like, to consider, to hold, must’), Jantunen and Brunni (2013) provide evidence that both lexemes favour certain core forms and that, especially in the case of pitää, the morphological primings are clearly sense dependent (different senses favour different inflected forms). They also claim that language learners are not familiar with the typical core forms and morphological primings but over- and underuse certain forms. Sonnenstuhl et al. (1999), in turn, tested how German participles were recognised and target stem forms primed, and how regular (e.g. öffnen – öffnete – geöffnet ‘open – opened – opened’) and irregular (e.g. schreiben – schrieb – geschrieben ‘write – wrote – written’) inflection affected the priming; the results indicated that regular verbs showed full priming, whereas irregular verbs displayed only a partial priming effect. This kind of morphological priming has also been analysed in a similar manner in English verb forms (for a review, see Sonnenstuhl et al. 1999).

The role of paradigmatic morphological priming (PMP) as an integral part of a phraseological unit can easily be illustrated with the help of synonymous word pairs. Synonyms often have clearly different patterns of use with respect to their collocations, colligations and semantic associations (see e.g. Stubbs 1995; Jantunen 2004), but PMP also clearly seems to differentiate synonymous expressions from each other. For example, the analysis (Jantunen 2001) of the Finnish adjectives tärkeä and keskeinen ‘important, central’ has shown that in addition to differences in collocational patterning, these adjectives differ in their PMPs: tärkeä seems to favour the partitive case (40% for tärkeä, and 19% for keskeinen) whereas keskeinen favours the nominative (33% vs. 43%, respectively). The former is also more often used in comparative and superlative forms than keskeinen, while the latter is mostly used in the positive form. It is also hypothesised (Sinclair 1991; Stubbs 2001) and at least partly verified (e.g. Tognini-Bonelli 1996: 77–80) that different inflectional forms of one lexeme have co(n)textual associations of their own and “that each distinct form is potentially a unique lexical item” (Sinclair 1991: 8). Consequently, it is indispensable to also take morphological and other grammatical paradigmatic primings into account when we aim to describe words’ usage holistically from the phraseological point of view in morphologically rich languages.

In language learning, morphology and word inflection cause obvious problems for learners (for an overview, see e.g. Ellis 2008: 82–91). Several studies of learning Finnish show that learners encounter difficulties because of its rather rich morphology (e.g. Martin 1995; Siitonen 1999; Kaivapalu 2005), and even students whose L1 is closely related and has inflection (like Estonian) stumble with Finnish morphology (Kaivapalu 2005). In addition, the avoidance of difficult forms, or the favouring of forms that learners find unproblematic, may also serve as a basis for atypical morphological priming in L2. In contrast to the previous examples, however, in this case the learner produces grammatically acceptable forms. Siitonen and Mizuno (2010), for example, discuss morpho-syntactic structures in which learners have produced grammatical but redundant and perhaps also hypercorrect possessive structures instead of more target-like structures. Jantunen and Brunni (2013), in turn, noted that learners clearly overuse the form ihmiset from the paradigm of ihminen (‘human’). One reason for this atypical PMP is that ihmiset is used with a preceding quantifying pronoun (such as kaikki everyone.nom), although these pronouns on their own already denote human beings and thus kaikki ihmiset is a redundant structure. Both cases indicate that learners do not feel safe with certain morphological structures, but by adding information try to be certain that they end up with the correct form and meaning. That is to say, learner productions violate the paradigmatic morphological primings of L2, since words are not primed to occur in those forms which are frequently applied in L2.
  3. Methodology and data
    1. Corpus‑driven approach and keywords

      The analysis will start with a keyword analysis. Keywords are words whose frequency is atypically high in the research data in comparison with another data set and which occur in that data more frequently than would be expected by chance alone (Scott & Tribble 2006: 55–59). Thus keywords may reveal which words are used in learner language more or less frequently than in the native language and which words thus characterize this variant. These items can tentatively be called learner language keywords. This statistical keyness corresponds broadly to what is called over‑ and underuse, and usage of lexical teddy bears (Hasselgren 1994), in SLA studies, but differs in that keywords have a clear basis in statistical significance.

      In order to obtain both statistical and qualitative information on phraseology in learner Finnish, the present study is structured as follows: first, keywords are calculated using two comparable data sets; second, one keyword is chosen for a detailed analysis, which provides qualitative (phraseological) information and explains the reasons for the overuse (i.e. keyness) of that item. In the qualitative analysis, PMPs, collocations, n‑grams, and semantic priming are studied in order to give a multifaceted picture of the lexical priming of the chosen item in learner Finnish and to clarify the role of PMPs in phraseology, their explanatory power in the analysis of keywords in learner data, and, finally, their significance in language learning. The keyword analysis is produced using the Keywords program within WordSmith Tools 5.0. Since the aim of this study is to concentrate on typicality rather than atypicality, various parameters are set before calculation. Setting the minimum frequency cut‑off at 20 ensures that low‑frequency items are excluded from the calculation. The strength of the difference is assessed using a log‑likelihood test with a p value threshold of 0.000001.
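The keyword procedure described above can be illustrated with a minimal sketch. It assumes the widely used two‑corpus log‑likelihood formula and a chi‑square approximation with one degree of freedom; it is not the WordSmith Tools implementation, whose internal details are not described in this chapter. The function names and the toy frequency lists are hypothetical; only the cut‑off of 20 and the p threshold of 0.000001 come from the text.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood for one word.

    a, b: frequency of the word in the study and reference corpus
    c, d: total number of tokens in the study and reference corpus
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_study)
    if b > 0:
        g2 += b * math.log(b / expected_ref)
    return 2 * g2


def p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g2, 0.0) / 2))


def keywords(study, reference, min_freq=20, alpha=0.000001):
    """Return (word, keyness, sign) tuples, strongest first.

    sign is "+" for overuse and "-" for underuse in the study corpus.
    Only words reaching min_freq in the study corpus are considered.
    """
    c, d = sum(study.values()), sum(reference.values())
    found = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        g2 = log_likelihood(a, b, c, d)
        if p_value(g2) < alpha:
            sign = "+" if a / c > b / d else "-"
            found.append((word, round(g2, 1), sign))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Toy frequency lists, invented for illustration (not corpus figures).
study = Counter({"kello": 200, "ja": 800})
reference = Counter({"kello": 50, "ja": 9950})
for word, keyness, sign in keywords(study, reference):
    print(word, keyness, sign)
```

In this sketch a word is reported as a keyword whether it is over‑ or underused; the sign column keeps the two cases apart.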
    2. Data

      The data come from the International Corpus of Learner Finnish (ICLFI, Jantunen & Brunni 2013). The size of the data is 730,000 tokens. Texts vary from fictional to non‑fictional and are produced by Finnish language learners from universities worldwide. The mother tongues of the students belong to various different language families, the total number of L1s being 22. Each main proficiency level is covered in the data, but the majority of the data falls into CEFR levels A2‑C1. The learner, text and learning context variables are not taken into account. Some variables (e.g. text type and task variables), however, will be discussed when the data is analysed.

      In learner corpus studies, especially in a Contrastive Interlanguage Analysis framework (CIA, Granger 1996), learner data is often compared with native speakers’ production, in order to reveal features that characterize learner language. Both Scott and Tribble (2006: 58–59) and Culpeper (2009: 34–35) stress that the reference corpus is to be chosen carefully, since the reference corpus directly affects whether the keywords are relevant to illustrating the language one is studying. In the following analysis, native data is retrieved from the non‑translational subset of the Corpus of Translated Finnish (Mauranen 2000). This Native Finnish Corpus (hereinafter NFC) consists of published fictional and non‑fictional texts. Consequently, the NFC broadly corresponds to the ICLFI, since both data sets consist of argumentative, descriptive and narrative texts. The size of the NFC is 3.8 million tokens.
  4. Results
    1. Keywords in learner Finnish

      Table 1 lists the 20 most significant keywords in the ICLFI. The biggest group in the keyword list seems to be verbs. They are mostly in the 1st person singular present tense indicative forms (menen, olen, syön, opiskelen, pidän, asun). These forms as well as the 3rd person singular forms and noun keywords (e.g. kielen, suomea) seem to indicate what learners wrote about – for instance, studying Finnish (opiskelen (‘I study’), kielen (‘language’‑gen), suomea, suomen (Finnish‑part/gen); yliopistossa (‘at the university’)). Writing tasks such as “Why do I study x” are common at beginner’s level, while another popular theme is ‘My day’, which at least partly explains certain verb forms, such as syön (‘to eat’) and menen (‘to go’). Also, the pronoun minä (‘I’), and probably also me, he (‘we, they’), are keywords due to these themes. Minä can also be frequent due to the explicit, redundant pronoun‑predicative structure favoured by learners, in which the subject pronoun is present although it could be omitted, since verbs are inflected for person (Minä juokse-n, I‑nom run‑1sg vs. Juokse-n, run‑1sg, ‘I run’). The abovementioned keywords are, then, obviously task‑dependent and do not describe learner language as such, and are consequently omitted from the analysis.

      Table 1. The 20 most significant keywords in the ICLFI

      Keyword        Gloss                Keyness
      on             to be‑3sg            10,213
      koska          because              4,460
      paljon         a lot, much          4,333
      minun          I‑gen                4,254
      minä           I‑nom                3,545
      olen           to be‑1sg            3,053
      menen          to go‑1sg            2,851
      kello          time, o’clock        2,617
      minulla        I‑adess              2,391
      pidän          to like/hold‑1sg     2,265
      syön           to eat‑1sg           2,087
      kielen         language‑gen         1,969
      minusta        I‑elat               1,890
      me             we‑nom               1,853
      opiskelen      to study‑1sg         1,770
      suomea         Finnish‑part         1,768
      täytyy         must                 1,714
      he             they‑nom             1,671
      yliopistossa   university‑iness     1,522
      asun           to live‑1sg          1,406

      The items koska (‘because, when’), paljon (‘a lot/lots of, many’) and kello (‘watch, time, o’clock’) are, in turn, of more interest when investigating learner language. The overuse of the adverb paljon has already been analysed in two learner corpus studies: both Kallioranta (2009) and Jantunen (2007) have noted that it has more syntactic functions in L2 than in L1, and that in particular its incorrect usage as an intensifier for positive adjective forms increases its frequency in the learner data. Koska in turn functions as a grammatical item, and its overuse may indicate that it is preferred to the synonymous expressions sen tähden että and siitä syystä että (‘because’), which are obviously structurally more difficult, and to the synonymous milloin (‘when’). As a noun, the word kello is perhaps the most interesting among these three. It must be stressed, furthermore, that the overuse of the item kello could hardly have been discovered without a statistical analysis: since the structures in which it is used are well‑formed and grammatically correct (see Section 4.3.1), kello does not catch the eye in any single or even multiple texts. This was also discussed with experienced teachers who have collected material for the corpus: they had not noticed its overuse over the course of their long teaching careers. Thus, it seems extremely unlikely that this item would have been chosen as a research object through introspection. The following sections will therefore provide a holistic view of the phraseology of kello (nominative singular) and, in doing so, also suggest possible phraseological reasons for its clear overuse in learner production.
    2. The case of kello: A learner Finnish keyword or a genre‑specific item?

      When the corpus is analysed as a whole, there is a risk that keywords occur only in a limited part of the data, for example in certain text types or registers (see Rayson 2008: 526). This possibility is examined first. The ICLFI consists of texts of which approximately 5% are diaries. Consequently, we could tentatively assume that the overuse of kello (‘watch, time, o’clock’) may result from the time expressions that are used in diaries. In these texts it is typical to write what has happened at a certain time of the day, as illustrated by Example 1:

      1. Kello neljä bussi lähti kotiin. (Swedish, beginner, diary)
         O’clock four bus‑nom leave‑pst:3sg home‑illat.
         ‘The bus home left at four o’clock.’

        In order to check whether the overuse is the result of the text type of diaries and ‘My day’ narratives, a new keyword list was made. This time, however, these texts were omitted from the data. With its frequency of 505 and keyness value of 728, kello still remains on the list of significant keywords, now ranked 24th after omitting topic‑related keywords. Thus, it seems that diary‑type texts are not the only reason for the high number of instances of kello. A final adjustment concerns the fact that kello is occasionally used in idioms and figurative expressions (e.g. biologinen kello ‘biological clock’) and in the meanings ‘church bell’ and ‘doorbell’; these instances were also omitted from the data. After omitting diaries and figurative cases, the final normed scores are 691 for learner Finnish and 194 for native Finnish per million words. The data show that learners of Finnish use kello in time expressions more than three times as much as native speakers. Consequently, it seems that kello is a potential learner language keyword in Finnish. The following analysis discusses the reasons for the overuse of kello.
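The arithmetic behind the normed scores can be made explicit with a short sketch. The helper name per_million is hypothetical, and the raw counts behind the scores 691 and 194 are not reported above, so only the ratio of the two normed scores is computed from figures given in the text.

```python
def per_million(raw_count, corpus_tokens):
    """Normalise a raw frequency to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_tokens


# Normed scores reported in the text (occurrences per million words).
learner_pm = 691
native_pm = 194

ratio = learner_pm / native_pm
print(f"Learners use kello about {ratio:.2f} times as often as native writers")
```

Normalisation is needed here because the two corpora differ greatly in size (730,000 vs. 3.8 million tokens), so raw frequencies are not directly comparable.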
    3. Kello as a phraseological unit
      1. Morphological priming in time expressions

        In Finnish there are several morpho‑syntactic ways of expressing time. The first means is inflection: time can be expressed by inflecting the time expression (i.e. a numeral or number) in the ablative case, the endings of which are ‑lta/‑ltä (Example 2). Secondly, it can be expressed by adding an optional kello (‘o’clock’) before the numeral or number in the ablative case (3), which makes the whole time expression more explicit and perhaps more literary as well. Finally, Example (4) shows a construction which consists of kello and a time expression in the nominative case.

        2. Tulen syömään kahde-lta.
           Come‑1sg eat‑ma‑inf‑illat two‑abl.
        3. Tulen syömään kello kahde-lta.
        4. Tulen syömään kello kaksi-Ø.
           ‘I shall come to eat at two o’clock.’

          Though all these expressions are possible and well‑formed, the last one (4) illustrates a case that is morphologically the most simple: in this case time is expressed using a structure in which kello (in base form) is followed by a numeral that is not inflected, but is in the base form, contrary to the constructions displayed in (2) and (3). For the language learner, the possibilities of choosing inflected or non‑inflected forms are not equal. If a learner decides to use a time expression that contains only a numeral (Example 2), they must know several morphophonological rules before the right form can be produced. For example, in the case of the numeral kaksi (‘two’), to put it simply, the stem has an hd : ks change before the following vowel and ending (kahd-e-lta [ABL]: kaks-i [NOM]), which means that the learner must know the relevant rule of consonant change when forming this inflectional form. Another rule concerns vowels: the nominative case kaksi has i at the end, but e instead before the case endings in singular forms. Finally, they must know which of the 15 cases ought to be used in time expressions. It is self‑evident that the construction in Example (4), i.e. kello‑nom + the basic dictionary form of the numeral (kaksi-nom), causes much less of a headache for the learner than the inflected alternatives in (2) and (3).

          To find out whether learners more often prime the simple non‑inflected forms rather than the structurally complicated inflected forms, the occurrences of the three structures for all possible clock times were counted in the data (Table 2). The data reveal that learners clearly favour the simple kello + numeral‑nom construction: its proportion is as high as 77.6% of all the time expressions, whereas in the NFC the share is only 40.4%. In learner production the ablative occurrences together account for only 22.4%, whereas in native texts the proportion is almost 60%. Thus it seems that learners prime and overuse morphophonologically simple time expressions and underuse complex structures. It appears that the overuse of the word kello in learner data is not due to the text type (diaries, ‘My day’ narratives) but is clearly the result of avoiding complex morpho‑syntactic structures and favouring simple ones instead.

          Table 2. The number and proportions of different time expressions in the data

                        ICLFI               NFC
          ABL           194 (17.4%) –***    157 (48.5%)
          kello + ABL   55 (5.0%) –***      36 (11.1%)
          kello + NOM   863 (77.6%) +***    131 (40.4%)
          Total         1112 (100%)         324 (100%)

          The usage of non‑inflected forms instead of inflected ones somewhat resembles Hasselgren’s (1994, see also Jantunen 2015) teddy bears, that is, items learners feel safe with. It is reasonable to hypothesise, in line with Hasselgren, that learners also tend to use syntactic and morpho‑syntactic structures that they know well and feel to be unproblematic. In the case of kello + numeral‑nom, it seems that learners have used morphophonological and morpho‑syntactic achievement strategies (for strategies, see e.g. Færch & Kasper 1984): to achieve the communicative goal they have primed structures which they know well and which are simple but still serve the same communicative goal as other available but more complex structures. This lends support to earlier findings that L2 learners tend to avoid complex grammatical structures in favour of less complex ones (see, e.g. for avoidance of complex verb structures in English, Laufer & Eliasson 1993; and in Finnish, Siitonen 1999) and to the so‑called economized production strategy (Winkler 2009: 126), according to which learners, when minimizing production effort, resort to syntactically simple, but linguistically effective and correct, structures.

          The usage of non‑inflected forms of kello also supports Skehan’s (2009) Trade‑off Hypothesis: when prioritising non‑inflected and morphologically less complex forms, learners endeavour to ensure that they end up with accurate forms but at the same time lose the aspect of complexity in language production. Skehan states that accuracy and complexity are in competition in language production and that committing attention to either of them may cause weaker performance in the other (2009: 510–511). Furthermore, kello may also occur frequently in learners’ texts owing to transfer from L1, since in several languages an equivalent word exists in time expression structures before or after the numeral: e.g. in English He will come at eight o’clock, and in German Er wird um acht Uhr kommen. Thus, reasons other than morphological priming could also be provided for the overuse.
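The proportions in Table 2 can be recomputed from the raw counts, and the overall difference between the two distributions can be checked with a simple test. The sketch below uses a Pearson chi‑square test of independence; this is an assumption made for illustration, since the test behind the asterisks in Table 2 is not named in the text. All function and variable names are hypothetical.

```python
import math

# Raw counts copied from Table 2: (ICLFI, NFC)
table2 = {
    "ABL": (194, 157),
    "kello + ABL": (55, 36),
    "kello + NOM": (863, 131),
}


def column_shares(table):
    """Share of each construction within its own corpus."""
    totals = [sum(row[j] for row in table.values()) for j in range(2)]
    return {
        label: tuple(row[j] / totals[j] for j in range(2))
        for label, row in table.items()
    }


def chi_square(table):
    """Pearson chi-square statistic for a rows x 2 contingency table."""
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected
    return statistic


shares = column_shares(table2)
statistic = chi_square(table2)
# With 3 rows and 2 columns there are 2 degrees of freedom, for which
# the upper-tail probability has the closed form exp(-x / 2).
p = math.exp(-statistic / 2)

for label, (iclfi, nfc) in shares.items():
    print(f"{label:12s} ICLFI {iclfi:.1%}  NFC {nfc:.1%}")
print(f"chi-square = {statistic:.1f}, p = {p:.3g}")
```

Recomputed shares may differ from the printed table in the last decimal because of rounding.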
      2. Collocates and n‑grams of kello

        For the collocation analysis the span of 2L–2R was chosen; that is, the two immediate words to the left and right of the node. Table 3 shows the 30 most frequent lemma collocates in frequency order. The language variants share collocates, which indicates that learners’ priming is at least partly similar to native speakers’ priming. These words are usually numerals (either written with a number (#) or with letters), signifying that kello <number, numeral> collocations might not distinguish learner language from native language. Nonetheless, these collocations are clearly more frequently primed by learners: the proportion of these collocates is 17.6% (= 232) in the learner data and 10.3% (= 343) in the native data (z = 7.13; p < 0.001).

        In addition, there are other shared collocates: both lists share verbs such as olla, tulla and alkaa, pronouns minä, hän and se, frequent conjunctions ja and kun, and the adverb jo (‘already’). These collocates, however, clearly differ in their frequency and ranking. For example, in native speakers’ production se (‘it’) is the commonest pronoun followed by hän (‘s/he’) and minä (‘I’), whereas in learner data the most frequent pronoun is minä, followed by hän and se. This may be due to two reasons. First, language learners often produce, at least at the lower proficiency levels, texts on topics such as the learners’ daily life, studies, friends and so forth, which causes the overuse of the pronoun minä. Secondly, language learners are possibly not familiar with the informal use of the demonstrative pronoun se. In Finnish, the pronoun se is commonly used as a third person singular pronoun instead of the more formal hän. This possibility is naturally exploited by writers in the NFC, but not by learners who adhere to hän, since they are either not familiar with the informal use of se or wish to produce more formal utterances.

        The data also provide evidence that learners prime patterns that do not occur in native language and that some patterns that are frequent in the natives’ productions are absent from the learners’ texts. First of all, a set of verbs is missing from the NFC collocates: in the ICLFI the eighth most frequent collocate herätä (‘to wake up’), as well as nousta (‘to get up’) and nukkua (‘to sleep’), all relate to stages of sleeping. These words as well as the word syödä (‘to eat’) occur in the list since time expressions are used when learners describe their own or someone else’s daily routine. It is notable that these expressions still exist in the data although diaries and texts akin to them have been omitted. Other words related to diurnal activities are the nouns koti (‘home’) and työ (‘work’). This suggests that learners tend to use kello with concrete words related to daily routines.

        In the NFC, among the ten most frequent collocates there are two verbs denoting looking: the fifth commonest vilkaista (‘to take a glance’) and the eighth commonest katsoa (‘to look at’). In learner data, katsoa occurs only four times, and vilkaista is missing entirely. This suggests that learners are not familiar with these common collocational primings. Furthermore, in the NFC, kello is often associated with the conjunctions mutta (‘but’) and että (‘that’), which provides evidence that in native Finnish, but not in non‑native Finnish, kello is often used either in subordinate clauses or in main clauses followed or preceded by a subordinate clause. Finally, in the list of frequent collocates in the NFC, there are adverbs (jo ‘already’, vasta ‘not until’ and vielä ‘yet’) and a postposition (jälkeen ‘after’) all referring to time. This indicates that native speakers often prime kello + numeral with another time expression, as shown in Examples (5) and (6). This means that in native Finnish time expressions often consist of several different individual items denoting time.
        5. Vapauduin vasta kello kaksi yöllä.
           ‘I only finished at two o’clock in the morning.’
        6. Poikaystävä ei ollut kotona vielä kello kuusi.
           ‘The boyfriend was not yet at home by six o’clock.’

        Finally, we note that the 2L–2R collocates of kello in the ICLFI constitute a more limited set of words, with the top ten making up 36.5% of the total, and the top 30 accounting for 47.4% of the total instances in the span. In the NFC, the proportions are 29.7% (z = 4.07, p < 0.001) and 30.4% (z = 3.59, p < 0.001), respectively. This indicates that in learner production collocational patterns are more fixed and less varying in contrast to native language.
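The z values reported in this section compare proportions across the two corpora. A minimal sketch of such a comparison is given below, assuming a pooled two‑proportion z‑test; the text does not state which formula was used, so this is an illustration rather than a reconstruction. The span totals are not reported either: the values n_learner and n_native below are back‑calculated from the percentages for numeral collocates (232 = 17.6%, 343 = 10.3%) and are therefore approximations, which means the result should only be expected to land near, not on, the reported z = 7.13.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Numeral collocates of kello: counts from the text.
x_learner, x_native = 232, 343
# Span totals are NOT given in the text; these are back-calculated
# approximations (an assumption of this sketch).
n_learner = round(x_learner / 0.176)
n_native = round(x_native / 0.103)

z, p = two_proportion_z(x_learner, n_learner, x_native, n_native)
print(f"z = {z:.2f}, p = {p:.3g}")
```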
        Table 3. Top 30 collocates in learner and native data

            ICLFI                                  NFC
        N   collocate             total  subtotal  collocate                     total  subtotal
        1   numeral (any)         125              olla ‘to be’                  227
        2   # (number, any)       107              numeral (any)                 217
        3   olla ‘to be’          74               ja ‘and’                      134
        4   ja ‘and’              46               # (number, any)               126
        5   kello ‘time, etc.’    40               vilkaista ‘to take a glance’  72
        6   minä ‘I’              26               kello ‘time, watch’           69
        7   aamu ‘morning’        22               se ‘it’                       64
        8   herätä ‘to wake up’   20               katsoa ‘to look at’           48
        9   hän ‘s/he’            11               hän ‘s/he’                    42
        10  koti ‘home’           11     482       ei ‘no, not’                  39     1038
        11  mennä ‘to go’         10               jo ‘already’                  35
        12  ilta ‘evening’        10               kun ‘when’                    34
        13  nousta ‘to get up’    9                mutta ‘but’                   25
        14  syödä ‘to eat’        9                aamu ‘morning’                22
        15  tulla ‘to come’       9                alkaa ‘to begin’              21
        16  alkaa ‘to begin’      8                ilta ‘evening’                21
        17  kanssa ‘with’         8                minä ‘I’                      21
        18  noin ‘circa’          8                aika ‘time’                   20
        19  se ‘it’               8                että ‘that’                   20
        20  he ‘they’             7                niin ‘so’                     17
        21  kun ‘when’            7                soida ‘to ring’               17
        22  puoli ‘half’          6                jälkeen ‘after’               16
        23  paljon ‘a lot’        6                tulla ‘to come’               16
        24  työ ‘work’            6                vasta ‘not until, only’       16
        25  yö ‘night’            6                iltapäivä ‘afternoon’         15
        26  jo ‘already’          5                sanoa ‘to say’                15
        27  käydä ‘to visit’      5                käydä ‘to visit’              14
        28  lähteä ‘to leave’     5                puoli ‘half’                  14
        29  nukkua ‘to sleep’     5                vielä ‘yet’                   13
        30  soida ‘to ring’       5      625       tasan ‘sharp’                 12     1422
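Collocate counts of the kind shown in Table 3 can be obtained by counting every lemma that falls within the 2L–2R span of the node. The sketch below assumes that the texts have already been lemmatised and split into sentences, and that the span does not cross sentence boundaries; neither detail is specified in this chapter. The function name and the toy sentences are hypothetical.

```python
from collections import Counter


def window_collocates(sentences, node, left=2, right=2):
    """Count lemmas occurring within `left`/`right` positions of the node."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - left)
            end = min(len(tokens), i + right + 1)
            for j in range(start, end):
                if j != i:  # skip the node occurrence itself
                    counts[tokens[j]] += 1
    return counts


# Toy lemmatised sentences, invented for illustration.
sentences = [
    ["minä", "herätä", "aamu", "kello", "kuusi", "ja", "syödä"],
    ["kello", "olla", "jo", "kahdeksan"],
]
for lemma, frequency in window_collocates(sentences, "kello").most_common():
    print(lemma, frequency)
```

A second occurrence of the node inside the span is counted as a collocate in this sketch, which is consistent with kello appearing among its own collocates in Table 3.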
        Table 4. Top 3‑ and 4‑grams in the data

                 ICLFI                                         F   NFC                                               F
        4‑grams  herätä aamu kello # (wake morning o’clock #)  6   kello olla vasta # (time be only #)               9
                 kello # minä olla (o’clock # I be)            5   kello olla jo # (time be already #)               8
                 minä herätä kello # (I wake up o’clock #)     3   kello olla jo puoli (time be already half)
        3‑grams  kello # ja (o’clock # and)                    12  kello olla # (time be #)                          29
                 kello # aamu (o’clock # morning)              10  kello olla jo (time be already)                   22
                 kello olla # (o’clock be #)                   10  vilkaista kello ja (take a glance at watch and)   16

        Since collocations are qualitatively and quantitatively partly different in the ICLFI and NFC, we can suppose that n‑grams in which kello is incorporated are also dissimilar in these two data sets. Table 4 sums up the findings of 3‑ and 4‑grams in both data sets. Again aamu ‘morning’ and herätä ‘to wake up’ seem to dominate in the most frequent grams, but native clusters appear to contain more time expressions as well as the verb vilkaista ‘to take a glance’. The result seems to indicate that certain clusters are again over‑ or under‑used and that learners are not familiar with the typical multi‑word patterns of the native language.
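The extraction of n‑grams containing the node can be sketched in the same way. The example assumes lemmatised sentences in which numbers have been replaced by the placeholder #, mirroring the notation of Table 4; the function name and the toy data are hypothetical.

```python
from collections import Counter


def ngrams_with_node(sentences, node, n):
    """Count contiguous n-grams that contain the node at any position."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if node in gram:
                counts[gram] += 1
    return counts


# Toy lemmatised sentences, invented for illustration.
sentences = [
    ["minä", "herätä", "kello", "#"],
    ["minä", "herätä", "kello", "#", "ja", "syödä"],
]
trigrams = ngrams_with_node(sentences, "kello", 3)
fourgrams = ngrams_with_node(sentences, "kello", 4)
for gram, frequency in fourgrams.most_common(3):
    print(" ".join(gram), frequency)
```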
      3. Semantic priming of kello

        The preceding collocation analysis showed that the words co‑occurring with kello can be grouped into semantic sets such as ‘sleeping’ in learner data and ‘looking at’ in native data. In order to get a more precise picture of the semantic priming of kello, the semantic subsets in the cotext of kello have been investigated in both data sets (Table 5).

        Table 5. The semantic preferences of kello in the data

                                       ICLFI               NFC
        time                           185 / 79.1% +***    343 / 47.8%
          ‑ only with numeral          105 / 44.9% +***    204 / 28.4%
          ‑ numeral + time expression  80 / 34.2% +***     139 / 19.4%
        awakening/getting up           35 / 15.0% +***     7 / 1.0%
        travelling                     27 / 11.5% +**      20 / 2.8%
        celebration                    22 / 9.4% +*        12 / 1.7%
        location                       16 / 6.8%           25 / 3.5%
        ringing/chiming                9 / 3.4%            41 / 5.7%
        looking at/taking a glance     6 / 2.6% –***       157 / 21.8%
        The analysis of the semantic preference gives a fuller picture of learner priming: kello is used most often in time expressions in both data sets, but learners favour this priming much more than native speakers do (79.1% vs. 47.8%). Furthermore, learners favour both structures that contain only a numeral time expression (Lihakauppa menee kiinni kello kahdeksan ‘The butcher’s will shut at eight o’clock’) and expressions that also contain another, less precise time expression beside the numeral (Päivällinen tarjoillaan kello yhdeksän illalla ‘Dinner is served at nine o’clock in the evening’). Both structures are also primed for usage by native speakers, yet they cannot be seen as preferred. In fact, this result is partly contrary to the previous collocation analysis: we noted that natives prime certain time adverbs and postpositions together with kello more frequently than learners, but when the concordance lines are investigated more holistically utilizing semantic preference, the result is more precise and in fact opposite to that of the collocation analysis: time expressions are favoured by learners. Furthermore, the proportion of time expressions in the cotext of kello seems to be higher in learner data than the collocation analysis actually shows, and the difference in proportions turns out to be rather striking.

        The semantic preferences of ‘awakening/getting up’, ‘travelling’ and ‘celebration’ form an interesting group of semantic associations primed by learners. Since diaries and ‘My day’ narratives were omitted before the analysis, these semantic primings cannot, by definition, be attributed to text‑type conventions. A more thorough study of texts where the semantic preference ‘awakening/getting up’ is used reveals that these texts are often written by beginners or pre‑intermediate learners, who frequently write about concrete events in their lives with more or less limited vocabulary. Thus, the above‑mentioned semantic associations may occur in the data on account of language proficiency. Consequently, we could suggest that semantic priming may be inextricably linked to the proficiency levels of the learners. The absence of ‘looking at/taking a glance’ from learner data is, however, somewhat more complicated to explain. It may also be a matter relating to proficiency level, but is more likely a consequence of vocabulary teaching, textbooks and other teaching materials, which are very often “non‑phraseological”.
  5. Conclusions

Understanding phraseology and its significance in language learning and teaching is extremely important when we aim to teach the native‑like usage of the target language. By studying the phraseological aspects in morphologically rich languages, it is possible to challenge previous research results and provide information on the phraseological features that are less prominent in morphologically poor languages. Without phraseological studies of several language groups our view of phraseology and its function in language learning will remain incomplete.

The preceding analysis discussed lexical and morpho‑syntactic features which may cause foreign‑soundingness or non‑nativeness in learner production. Atypical frequencies of lexical elements or forms, which are generally accepted as features that characterize learner language, were also found in the present study: the overuse of the word kello seemed to characterize learner Finnish irrespective of the text type or task variable, and it could be considered a learner Finnish keyword. The analysis also showed that the overuse of kello could only be noted using a corpus‑driven approach, and that the qualitative information found in the cotext analysis both completed the picture of deviant phraseology in learner language and offered clear explanations for its overuse.

Furthermore, the present analysis revealed that morphology and morpho‑syntactic structures of the target language play an important role in the structural choices that learners make. If there are several grammatically acceptable alternatives, learners tend to choose the morphologically easiest one. That, in turn, may cause atypicality and overuse of items in the production. Thus, it seems that learners’ priming of the target language deviates not only lexically but also morphologically. Although the product may be grammatically well‑formed, it can still differ from actual usage in the target language at the lexical and morpho‑syntactic level.

Familiarity with formulaic sequences helps learners to become more fluent language users, and teaching that raises language students’ awareness of such sequences may, indeed, be beneficial for language production and processing. However, if phraseology is not taught explicitly and holistically, the results are rather modest. The holistic approach to phraseology involves not only collocations, which are widely discussed in learner language research and teaching, but also semantic associations and colligations. It is important to bear in mind that deviant semantic associations cause non‑nativeness even when the text is grammatically correct. This has been highlighted in the present study, which has shown that the semantic preference of kello is somewhat biased in the learner data. To what extent semantic preferences are text type‑ and task‑dependent, and how this should be taken into consideration in classroom teaching, ought to be analysed in future studies. It also became clear that a holistic analysis gives a more precise picture of phraseology: the analysis of the frequent collocates turned out to give a limited and biased view of phraseology, which, however, has been complemented by analysing semantic associations. Concentrating on only one item reveals that even one single phraseological unit demands several cotextual skills from the language learner: collocational and morphological associations and also knowledge of rather abstract associations beyond the lexical level, namely semantic preferences. The present chapter naturally oversimplifies the picture of the priming of phraseology in learner Finnish, ignoring variables such as mother tongue background and proficiency levels. Nevertheless, the study has revealed that the hypothesised morphological priming is an essential factor in the process of phraseological priming, and its role in language learning and instruction ought to be studied more thoroughly in the future.


Culpeper, J. 2009. Keyness: Words, parts‑of‑speech and semantic categories in the character‑talk of Shakespeare’s ‘Romeo and Juliet’. International Journal of Corpus Linguistics 14(1): 29–59. doi: 10.1075/ijcl.14.1.03cul

Durrant, P. 2009. Investigating the viability of a collocation list for students of English for academic purposes. English for Specific Purposes 28: 157–169. doi: 10.1016/j.esp.2009.02.002

Ellis, R. 2008. The Study of Second Language Acquisition. Oxford: OUP.

Færch, C. & Kasper, G. 1984. Two ways of defining communication strategies. Language Learning 34(1): 45–63. doi: 10.1111/j.1467-1770.1984.tb00995.x

Flowerdew, J. 2006. Use of signalling nouns in a learner corpus. International Journal of Corpus Linguistics 11(3): 209–226. doi: 10.1075/ijcl.11.2.04cla

Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In Languages in Contrast: Text-based Cross-linguistic Studies, K. Aijmer, B. Altenberg, & M. Johansson (eds), 37–51. Lund: Lund University Press.

Granger, S. 1998. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis and Applications, A. Cowie (ed.), 145–160. Oxford: OUP.

Hasselgren, A. 1994. Lexical teddy bears and advanced learners: A study into the ways Norwegian students cope with English vocabulary. International Journal of Applied Linguistics 4(2): 237–260. doi: 10.1111/j.1473-4192.1994.tb00065.x

Hoey, M. 2004. The textual priming of lexis. In Corpora and Language Learners [Studies in Corpus Linguistics 17], G. Aston, S. Bernardini, & D. Stewart (eds), 21–41. Amsterdam: John Benjamins. doi: 10.1075/scl.17.03hoe

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge. doi: 10.4324/9780203327630

ISK = Hakulinen, A., Vilkuna, M., Korhonen, R., Koivisto, V., Heinonen, T.R., & Alho, I. (eds). 2004. Iso suomen kielioppi (A comprehensive Finnish grammar). Helsinki: Finnish Literature Society.

Ivaska, I. 2015. Longitudinal changes in academic learner Finnish: A key structure analysis. International Journal of Learner Corpus Research 1(2): 210–241. doi: 10.1075/ijlcr.1.2.02iva

Jantunen, J.H. 2001. ’Tärkeä seikka’ ja ’keskeinen kysymys’: Mitä korpuslingvistinen analyysi paljastaa lähisynonyymeista? (What can corpus analysis reveal about near synonyms?). Virittäjä 105(2): 170–192.

Jantunen, J.H. 2004. Synonymia ja käännössuomi (Synonymity and translated Finnish). Joensuu: University of Joensuu.

Jantunen, J.H. 2007. Oppijansuomen piirteitä korpusvetoisesti (A corpus‑driven study on learner Finnish). In Virsu 3: Suomalais-ugrilaisia kohdekieliä ja kontakteja, P. Muikku‑ Werner, O. Kokko, & H. Remes (eds), 69–83. Joensuu: University of Joensuu.

Jantunen, J.H. 2015. Oppimiskontekstin vaikutus oppijanpragmatiikkaan: astemääritteet lek‑ sikaalisina nallekarhuina (Learning context and its effect on learner pragmatics: Degree modifiers as lexical teddy bears). Lähivõrdlusi–Lähivertailuja 25: 105–136.

doi: 10.5128/LV25.05

Jantunen, J.H. & Brunni, S. 2013. Morphology, lexical priming and second language acquisition: A corpus‑study on learner Finnish. In Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead, S. Granger, G. Gilquin, & F. Meunier (eds), 235–245. Louvain‑la‑ Neuve: Presses Universitaires de Louvain.

Kaivapalu, A. 2005. Lähdekieli kielenoppimisen apuna (Contribution L1 to foreign language acquisition). Jyväskylä: University of Jyväskylä.

Kallioranta, O. 2009. Paljon-adverbin kollokointi oppijansuomessa: Korpusvetoinen tutkimus (The collocations of paljon adverb in learner Finnish: A corpus‑driven study). MA thesis, University of Oulu.

Karlsson, F. 1985. Paradigms and word forms. Studia Gramatyczne VII: 135–154.

Karlsson, F. 1986. Frequency considerations in morphology. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 39: 19–28.

Kennedy, G. 2008. Phraseology and language pedagogy: Semantic preference associated with English verbs in the British National Corpus. In Phraseology in Foreign Language Learning and Teaching, F. Meunier & S. Granger (eds), 22–41. Amsterdam: John Benjamins. doi: 10.1075/z.138.05ken

Laufer, B. & Eliasson, S. 1993. What causes avoidance in second language learning: L1‑L2 difference, L1‑L2 similarity, or L2 complexity? Studies in Second Language Acquisition 15(1): 35–48. doi: 10.1017/S0272263100011657

Mahlberg, M. 2006. Lexical cohesion: Corpus linguistic theory and its application in English language teaching. International Journal of Corpus Linguistics 11(3): 363–383. doi: 10.1075/ijcl.11.3.08mah

Martin, M. 1995. The Map and the Rope. Finnish Nominal Inflection as a Learning Target. Jyväskylä: University of Jyväskylä.

Mauranen, A. 2000. Strange strings in translated language: A study on corpora. In Intercultural Faultlines: Research Models in Translation Studies I, M. Olohan (ed.), 119–141. Manchester: St. Jerome.

Nesselhauf, N. 2005. Collocations in a Learner Corpus [Studies in Corpus Linguistics 14]. Amsterdam: John Benjamins. doi: 10.1075/scl.14

Pawley, A. & Syder, F.H. 1983. Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In Language and Communication, J.C. Richards & R.W. Schmidt (eds), 191–226. New York NY: Longman.

Rayson, P. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13(4): 519–549. doi: 10.1075/ijcl.13.4.06ray

Scott, M. & Tribble, C. 2006. Textual Patterns: Key Words and Corpus Analysis in Language Education [Studies in Corpus Linguistics 22]. Amsterdam: John Benjamins. doi: 10.1075/scl.22

Siitonen, K. 1999. Agenttia etsimässä. U-verbijohdokset edistyneen suomenoppijan ongelmana (In Search of an Agent. U‑verb Derivations and Advanced Students of Finnish). Turku: University of Turku.

Siitonen, K. & Mizuno, M. 2010. Suomen monitahoinen possessiivisuffiksi ja suomenoppija. (Many‑sided possessive suffix of Finnish and a Finnish learner). Lähivõrdlusi–Lähivertailuja 19: 136–159. doi: 10.5128/LV19.09

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

Skehan, P. 2009. Modelling second language performance: Integrating complexity, accuracy, fluency, and lexis. Applied Linguistics 30(4): 510–532. doi: 10.1093/applin/amp047

Spoelman, M. 2013. Prior Linguistic Knowledge Matters: The Use of the Partitive Case in Finnish Learner Language. Oulu: University of Oulu.

Stubbs, M. 1995. Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language 2(1): 23–55. doi: 10.1075/fol.2.1.03stu

Stubbs, M. 2001. Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.

Sonnenstuhl, I., Eisenbeiss, S. & Clahsen, H. 1999. Morphological priming in the German mental lexicon. Cognition 72(3): 203–236. doi: 10.1016/S0010-0277(99)00033-5

Tognini‑Bonelli, E. 1996. Corpus Theory and Practice. Birmingham: TWC Monographs.

Winkler, S. 2009. The acquisition of syntactic finiteness in L1 German: A structure‑building approach. In Functional Categories in Learner Language, C. Dimroth & P. Jordens (eds), 97–134. Berlin: Mouton de Gruyter. doi: 10.1515/9783110216172.97

Wray, A. 2000. Formulaic sequences in second language teaching: Principle and practice. Applied Linguistics 21(4): 463–489. doi: 10.1093/applin/21.4.463

Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: CUP. doi: 10.1017/CBO9780511519772

Zhang, W. 2009. Semantic prosody and ESL/EFL vocabulary pedagogy. TESL Canada Journal 26(2): 1–12. doi: 10.18806/tesl.v26i2.411

Concordancing lexical primings

The rationale and design of a user‑friendly corpus tool for English language teaching and self‑tutoring based on the Lexical Priming theory of language

Stephen Jeaco

Xi’an Jiaotong‑Liverpool University, China

Lexical Priming (Hoey 2005) brings together a range of linguistic patterns that should be an important focus of language learning and teaching. But it also places an additional load on learners and teachers, demanding attention to different primings of words and nested combinations of words. With many tendencies difficult to observe in dictionary entries or other concordancing software, learners and teachers will face difficulties finding and presenting information about these primings. This chapter introduces the design of a concordancer (The Prime Machine), created for a doctoral degree project and developed to be firmly based on the theory of Lexical Priming. It introduces the pedagogical rationale for the development of some key features, including the search screen interface and the display of concordance lines.

    doi 10.1075/scl.79.11jea © 2017 John Benjamins Publishing Company

  1. Introduction
    A prevalent view of how language operates has been that grammar and vocabulary are separate systems and sentences can be constructed merely by choosing any syntactic structure and slotting in vocabulary. This view is still prevalent in many areas of language teaching. It is evident in China (where the author lives and works) in materials designed to familiarize students with grammatical constructions following sets of rules, and in the wordlists of vocabulary which are frequently used to introduce isolated meanings of individual words, usually with just one or two word by word translations provided. Over the last few decades, corpus linguistics has presented challenges to this view of language, and by drawing on evidence which can be found in the patterning of language choices in texts, it provides both a means of narrowing down the range of items to be taught through an emphasis on the most frequent usage, and also a raising of the bar in the sense of demanding attention be paid to relationships between items in terms of collocation and colligation. From Firth (1957) through to Sinclair (1991), and in a wide variety of corpus linguistic research as well as Systemic Functional Grammar, the necessity for language educators to move away from a belief in a grammar separated from the lexicon is plainly evident. The theory of Lexical Priming (Hoey 2005) makes a valuable contribution to linguistic theory by building on a range of insights gained from corpus linguistics and establishing a framework and evidence for the existence of other relationships which account for a sense of the naturalness or creativity of produced language. Hoey introduces the theory by providing a cognitive explanation for why collocation is so pervasive and it is clear that the other claims which Lexical Priming makes also challenge prevailing notions of how words and collocations can be used. The concept of textual colligation, for example, challenges the idea that words can be used freely in different positions in the sentence, paragraph or text. The concepts of semantic association and pragmatic association challenge the idea that words can be freely slotted into sentence structures purely based on some sort of inherent or isolated meaning. While other theories of language arising from corpus linguistics are clearly aiming to enhance a description of language and to thereby drive developments for language teaching, they are for the most part not particularly concerned with describing a model for the acquisition of language or the processes that underpin language learning.¹ Lexical Priming, however, fills this gap by using insights from corpus linguistics and corpus data as evidence to explain how individuals are primed through exposure to and use of language, and explaining how this priming process is the basis for first and second language acquisition.
    1. An exception would be developments in usage-based linguistics. For example, the CHILDES project was originally conceived in the early 1980s as “an archive for typed, handwritten, and computerized transcripts” for researchers in child language (MacWhinney 2014: 24), and its collection of corpus texts has been used for analysis in usage-based linguistics. Corpus methods have been applied to explore first language acquisition of individuals (Lieven, Behrens, Speares & Tomasello 2003). Developing areas of usage-based approaches look to greater use of corpora for the exploration of spontaneous speech in child language acquisition (Tomasello 2003) and there is a recognised need for larger corpora for second language acquisition research (Ellis, O’Donnell & Römer 2013).

    From a pedagogical perspective, the theory could also be used as a powerful metaphor for explaining to adult learners why their understanding of language may need to be adjusted and how they might go about exploring wider relationships between words, context and meaning. Given some of the entrenched views about language which are often held by students regarding vocabulary learning strategies, it would be unrealistic to expect a piece of software to be able to completely shake and remodel their view of language and language learning priorities. Nevertheless, presenting a simple image of the human brain encountering words and phrases through hearing, reading and production, and thereby building up patterns and expectations for how these could and should be used, could provide an impetus for encouraging language learners to look more deeply at the contexts of the language they encounter and the language that they produce. The idea that traces in the human mind of language which has previously been encountered are similar to concordance lines is a potent analogy, and also promotes a balanced understanding of how corpus resources can be used to find evidence but cannot ever represent the true priming of any one individual. In this sense, students can be assured of the relevance of corpus data as a way of gaining insights into real language use, while they are also encouraged to be critical and mindful of any resource’s limitations.

    Although corpora have had an indirect influence on language teaching through the creation of dictionaries and materials which draw on corpus data, the main pedagogic implementation of corpus linguistics is Data Driven Learning (DDL). Johns (2002) listed several advantages of DDL over other types of learning materials, including new ways of approaching problem areas such as prepositions with a main focus on meaning and also helping teachers and learners prioritize what should be learned. Bernardini (2004) argues that concordancing tasks can be used as a means of meeting a variety of language teaching goals. There are several reasons highlighted in the literature that explain why direct use of concordancing software can be especially useful for learners. First, as Sinclair (1991) pointed out, if learners want to learn about common patterns of syntax associated with a particular word, dictionaries do not usually provide this. Secondly, it has been argued that, as well as providing more information in an accessible way, concordancers give the learner an “ideal” space to test hypotheses (Kettemann 1995; cited in Meyer 2002). Studies have shown that teaching learners to use concordancers and then explore aspects of syntax by themselves can reduce their anxiety, and it has been suggested that this is because they can be freed from a sense of being subject to human judgement (Hunston 2002). As well as providing the opportunity for learning about language use at the time concordancers are consulted, another advantage of teaching learners to use corpora is that it is a skill which can form part of their life‑long learning (Mills 1994). The procedures learners follow when they systematically perform searches and analyse corpus output help develop disciplines for self‑access (Kennedy 1998). As Thomas explains, “teachers need to be aware of how much studying, learning and acquiring are taking place simultaneously when learners are engaged in corpus‑based guided discovery tasks” (Thomas 2015: 17).

    Two of the primary aims of using concordancers with language learners are likely to be based on second language acquisition principles: that learners should be exposed to target language in use (see, amongst others, Krashen 1989); and that “intake is what learners consciously notice” (Schmidt 1990: 149). Tomlinson argues that an important objective in language learning should be for learners to discover for themselves language features which can be found in the authentic texts they encounter, so as to strengthen the positive effects of noticing and recognising a gap in their own language use (Bolitho et al. 2003; Tomlinson 1994, 2008). When used in language learning contexts, concordancing software leads language learners to read multiple examples from authentic texts, and the potential for concordancers to promote active discovery of patterns is clear.

    However, despite some success, only a limited number of teachers and learners of a second language seem to make regular use of these tools. Factors which may be holding teachers back from learning to use and teach corpus tools include issues with the context, the level of detail, the means of interpretation, and the time required to get results as well as the design of the software itself. Traditional Key Word in Context (KWIC) concordance output is almost completely cut away from its context (Hunston 2002). Also, the amount of detail that concordances can provide to a learner can be confusing (Kennedy 1998). However, Varley (2009) reports some success for students if they can cope with the “overwhelming” amount of corpus data. Another point is that beyond dealing with the amount of raw data, the skills required to actually interpret them in order to understand grammatical patterns are far from simple (Gaskell & Cobb 2004). Effort is still needed to make concordancers more user‑friendly and more suitable for language learners (Horst, Cobb & Nicolae 2005; Krishnamurthy & Kosem 2007).

    The motivation for the development of a concordancer for lexical priming was twofold.
    As well as being deeply rooted in an appreciation of some of the struggles and difficulties faced by English teachers and language teacher managers in terms of helping students in China (and, by extension, any other cohort of L2 language learners) appreciate their language needs and develop their language skills accordingly, the project was also designed to enable teachers and students to explore various features of the theory of Lexical Priming without the theory needing to be taught explicitly. It would not be desirable to replace the wordlists and sets of grammar rules that students and teachers may currently use with a complicated exposition of Lexical Priming with all the technical and linguistic background knowledge which that would require. The software is designed, however, to encourage exploration of some of the theory’s features and to make it possible to see tendencies of words and phrases which are not usually apparent in either dictionary examples or the output from other concordancing software. The software aims to make insights about the English language based on Lexical Priming accessible and rewarding, by providing a multitude of examples from corpus texts and additional information about the contextual environments in which words and combinations of words tend to occur. While inspiration and methodological approaches have been drawn from other concordancing software, the design of each aspect of the new concordancer, called The Prime Machine, has focused first and foremost on how the most basic building blocks of the data structures and the user interface can support pedagogical priorities. The project is also in line with suggestions from two reviewers of Hoey’s book on Lexical Priming: Garretson (2007) suggested that technology could provide ways to make analysis of Lexical Priming less time‑consuming; Kaszubski highlighted the scope for the theory in the design of “learner concordancing practices” (2007: 292).

    The rest of this chapter will introduce some of the ways in which design features of The Prime Machine were inspired and driven by concepts from the theory of Lexical Priming. A fuller description of the technical procedures, a fuller discussion of some of the background issues, and further examples and evidence are presented in the doctoral thesis (Jeaco 2015). What follows here is a list of three claims about the software design, with a brief description of some of its features related to: the search query screen; the features of the concordance line displays; and the ways the user is encouraged to interact with the data.
  2. Developing learner‑friendly design criteria for The Prime Machine
    When designing the screen that language learners will use to formulate queries in a concordancer, it is important to consider what the main reasons might be for them to perform searches. As such, The Prime Machine highlights links between words and related words, thus showing how far near‑synonyms differ. Furthermore, alternative English translations of Chinese words are provided. Access to wider contexts of each word a learner looks at is important. The application aims to help the learner find out about textual colligation, context and co‑text, going beyond what KWIC would offer the lay user. A third feature focusses on the tendency for words or phrases to occur in different positions in a text. This is an interesting and under‑researched area which is somewhat difficult to explore using standard concordancing software and an issue The Prime Machine aims to address.
    1. Claim 1: The design should help language learners explore differences between words and phrases
      Looking through the literature on Data Driven Learning (DDL) and studies which have evaluated corpus tools with language learners, there seems to be a consensus that comparisons of synonyms, as well as prompts to explore other word forms, would be particularly helpful. In one of the earliest papers on DDL, Johns (1991) explained that students often come to concordancers wanting to compare pairs of words. Corpora are thought to help demonstrate differences between synonyms clearly (Kaltenböck & Mehlmauer‑Larcher 2005). All of the suggested activities given by Coniam (1997) for how corpora could be used in teaching require learners to compare. Three out of the six uses of corpora in the classroom given by Tsui (2004) involve different aspects of synonymy: near synonyms, words which are very close in meaning, and words which have the same translation in the learner’s own language. However, student feedback from some studies has also shown that while it can be rewarding, learners find the discovery of differences between synonymous words both difficult and time‑consuming (Yeh, Liou & Li 2007). There are several other obstacles which learners need to overcome. In order to see a pattern, learners may need to perform two or more searches (Gaskell & Cobb 2004). Learners are not always ready to call to mind suitable words for comparisons. They may not be able to come up with further ideas on what to search for (Gabel 2001). Sun (2003) notes that ineffective search skills also lead to frustration.

      Given the importance placed by teachers and researchers on the power of comparisons in DDL, it seems strange that little support is provided in most concordancing software to facilitate this. Both WordSmith Tools (Scott 2010) and AntConc (Anthony 2004) require use of multiple windows or saved results in order to view two sets of concordance results or collocations simultaneously. While The Sketch Engine (Kilgarriff, Rychly, Smrz & Tugwell 2004) includes the Sketch-Diff function, only the summary Word Sketches are available in this view, and comparing actual concordance lines would require moving backwards and forwards between pages or having multiple tabs open in the browser. Each of these tools can provide a rich variety of ways for a researcher to make comparisons between items but the pathway for making these comparisons can be complicated.

      A key design feature of The Prime Machine was to make comparisons between search terms as easy as possible through two facilities: the option to enter two queries at one time, leading to the retrieval of two sets of results, and the provision of pop‑up lists giving the user suggestions in terms of alternative word forms, related words and collocations. In the list of claims summarizing the theory of Lexical Priming, Hoey (2005: 13) draws attention to several contrasts: differences between synonyms; differences between senses of polysemous words; differences between nested combinations of words; and differences across domain and genre. The user interface was designed to facilitate the selection of items across such contrasts and to enable the user to view results for concordance lines, collocations and other data in a side‑by‑side view.

      Figure 1. Screenshot showing auto‑complete support for a query
      The text input box was designed to aid the user with spelling and in choosing between similar strings of words. Figure 1 shows how, as the learner starts to type a search term into the box, the words in the currently selected corpus with the same first few letters appear, displayed in descending order of frequency. Once a complete word has been entered or selected, the application provides other pop‑up lists, giving suggestions for comparisons which could be made. A stemming process makes links between different word forms held in the database, so these other word forms are presented to the user underneath the right‑hand text box.

      The software also provides a pop‑up list of related words. Links between words and related words are based on the words being alternative English translations of Chinese words or on WordNet (Miller 1995). To create the Chinese translation‑based links, the CC-CEDICT database file (MDBG 2013), a freely available Chinese‑English dictionary file, was downloaded and imported into Microsoft Excel. The columns of English words for each Chinese headword were then imported into a database so as to establish links between English words if they occurred in the same row in the original table. Through accessing the database, a list of words which are alternative translations for Chinese headwords can be retrieved for any of the words listed as English translations in the dictionary. The concordancer was developed specifically with Chinese learners of English in mind, but future versions could incorporate lists derived from dictionary mappings for a range of different languages or simple thesaurus data. Links derived from WordNet are based on semantically related words and include additional links across word forms. A DICE mutual information score is used to provide a ranking for the similar word pairs, so that they appear in the drop‑down list with more mutually exclusive items towards the top.
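      The prompts described above can be illustrated with a short sketch. It is not the implementation used in The Prime Machine: the function names, the toy frequency table and the toy dictionary rows are assumptions made for illustration. The Dice coefficient is given in its standard form, 2·f(x, y) / (f(x) + f(y)); the chapter does not state which counts feed the score or how the resulting list is ordered, so the sketch stops at computing the score.

```python
from collections import defaultdict
from itertools import combinations


def suggest_by_prefix(prefix, frequencies, limit=10):
    """Return words starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [w for w in frequencies if w.startswith(prefix)]
    # Descending frequency; ties are broken alphabetically for stable output.
    matches.sort(key=lambda w: (-frequencies[w], w))
    return matches[:limit]


def build_translation_links(rows):
    """Link English words that share a row (one Chinese headword per row)."""
    links = defaultdict(set)
    for english_words in rows.values():
        for a, b in combinations(sorted(set(english_words)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def dice(pair_count, count_x, count_y):
    """Standard Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    total = count_x + count_y
    return 2 * pair_count / total if total else 0.0


# Hypothetical toy data, not taken from the BNC or from CC-CEDICT.
FREQUENCIES = {"consequence": 120, "consequently": 80, "consent": 200, "result": 500}
ROWS = {
    "结果": ["result", "outcome", "consequence"],
    "后果": ["consequence", "aftermath"],
}

if __name__ == "__main__":
    print(suggest_by_prefix("conse", FREQUENCIES))
    print(sorted(build_translation_links(ROWS)["consequence"]))
    print(dice(pair_count=10, count_x=120, count_y=80))
```

      Because every pair of words in a row is linked in both directions, a lookup for any English word returns all of the alternative translations with which it shares at least one headword, which is the behaviour the paragraph above describes.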
      As well as auto‑complete for single words, computer users are also familiar with multi‑word units appearing as they enter queries into various search boxes across different applications and websites. In The Prime Machine, since collocations are extracted, stored and indexed in advance, short lists of them can be retrieved very quickly, allowing suggestions beyond single words to be provided with almost instantaneous feedback on the collocational strength of two or more items.² The pop‑up lists for collocations, alternative word forms and words with similar meaning are shown in Figure 2.
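      Retrieval of this kind is fast because the counting is done before any query arrives. A minimal sketch of that idea follows. It assumes word pairs only (the chapter's collocations also cover combinations of three, four or five words), raw co-occurrence counts rather than the measure set out in Jeaco (2015), and hypothetical function names. The window of four words either side of the node follows the definition given in the chapter's footnote on collocations.

```python
from collections import Counter, defaultdict


def index_window_pairs(sentences, window=4):
    """Count, for every node word, the words within `window` tokens either side."""
    index = defaultdict(Counter)
    for tokens in sentences:
        for i, node in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            index[node].update(neighbours)
    return index


def top_collocates(index, node, limit=5):
    """Read a precomputed list; no corpus scan is needed at query time."""
    return [word for word, _ in index.get(node, Counter()).most_common(limit)]


# Hypothetical toy sentences, already tokenised and lower-cased.
SENTENCES = [
    "the long term outcome of the trial".split(),
    "the outcome of the study was clear".split(),
    "a positive outcome of the treatment".split(),
]

if __name__ == "__main__":
    index = index_window_pairs(SENTENCES)
    print(top_collocates(index, "outcome", limit=2))
```

      The index is built once when a corpus is imported, so suggesting collocates while the user types reduces to a dictionary lookup.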
      Figure 2. Screenshot showing prompts which appear for consequence in the BNC: Academic sub‑corpus

      2. Because the software was purposefully designed to support the examination of the kinds of relationship between words that are introduced in Hoey’s theory of Lexical Priming, collocations are defined in this project based on his 2005 definition. In the software, collocations refer to combinations of two, three, four or five words in a four-word window either side of a node. Full details of the way in which these are calculated are provided in Jeaco (2015).

      Collocations that contain words with the same stem and/or the same words in a different order appear on the right‑hand side to encourage users to compare these with their main query. Figure 3 and Figure 4 show how these suggestions appear on screen.

      Figure 3. Auto‑Complete suggestions showing collocations for data from the BNC: Academic sub‑corpus for the query outcome longterm

      Figure 4. Auto‑Complete suggestions showing raw window search queries for data from the BNC: Academic sub‑corpus for the query outcome longterm

      As well as looking at pairs of words, comparisons of a single item across different corpora can be a good way to show how use varies across different registers. Comparing the results of the analyses of two or more language samples is an important part of register analysis, since it is through comparison with other registers that the characteristics of one register become clear (Biber & Conrad 2009). Just as most concordancing software does not provide an easy way to view and compare the results of two different items on the same screen, being able to view and compare results from two different corpora is also far from straightforward. WMatrix (Rayson 2008) makes comparisons of two texts or two collections of texts very clear and is an excellent tool for researchers wanting to use differences in frequency between two corpora as a starting point for exploration of differences between the two collections of texts. If, however, a language learner wants to see how a word is used differently in two different corpora, corpus software packages do not provide much support.
        In The Prime Machine, a comparison between two corpora can be made easily using the “Compare with another corpus” sub‑tab. The search box on this screen looks and behaves as before, with auto‑complete support at the word and collocation level. To the right of this box, a drop‑down menu is provided which contains a list of all the other corpora. When the user clicks on the “Compare” button, the application checks that the word or combination of words is present in both corpora at least once before the query is allowed to proceed. If the words are missing from either of the two corpora, feedback is provided. Figure 5 shows the search screen for comparing corpora. In order to allow access to the complete corpus as well as comparisons across its sub‑corpora, texts from the BNC are stored in the database twice: once as part of the complete corpus and once in a sub‑corpus determined according to the text, following the major groups provided by Lee (2001).
        Figure 5. The “Compare with another corpus” sub‑tab on the main Search Tab
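        The check made before a comparison is allowed to proceed can be sketched as follows. This is an illustration under assumptions, not the application's own code: the function name, the shape of the count table and the wording of the feedback message are hypothetical, the corpus name "Other corpus" is a placeholder, and a multi-word query is treated here as a single lookup key.

```python
def check_comparison(query, corpus_a, corpus_b, counts):
    """Return (ok, message). The query proceeds only if it occurs in both corpora."""
    missing = [
        name
        for name in (corpus_a, corpus_b)
        if counts.get(name, {}).get(query, 0) < 1
    ]
    if missing:
        return False, "'{}' does not occur in: {}".format(query, ", ".join(missing))
    return True, ""


# Hypothetical counts; the figures are invented for illustration.
COUNTS = {
    "BNC: Academic": {"pilot": 42, "outcome": 310},
    "Other corpus": {"pilot": 17},
}

if __name__ == "__main__":
    print(check_comparison("pilot", "BNC: Academic", "Other corpus", COUNTS))
    print(check_comparison("outcome", "BNC: Academic", "Other corpus", COUNTS))
```

        Returning a message alongside the result lets the interface tell the learner which corpus lacks the item, rather than simply showing an empty set of concordance lines.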
    2. Claim 2: The design of the display for concordance lines should help language learners notice textual colligation, co‑text and contextsOnce the concordance lines for a query or a pair of queries have been retrieved, the results must be presented to the user. While it has been recognised that in order to access some information it may be necessary to have longer contexts than the standard KWIC concordance line (Hunston 2002; Sinclair 1991), as many researchers have asserted, there are some advantages of viewing vertical lists of truncated sentences four words either side of the search term. Being able to see a large number of results provides a degree of “safety” for conclusions which the user draws (Mair 2002). They can provide a “snapshot” of how lexis is usually used (Johns 2002), can be seen as focusing on the “central” and “typical” (Hun‑ ston 2002), and can be organised in such a way as to highlight patterns (Gaskell & Cobb 2004). Sinclair (1991) suggested that KWIC provides access to patterns which are not meaning‑bearing, allowing the distinction between the physical objects of text in the corpora and their meanings to be clear. However, for a corpus engine built on the theory of Lexical Priming, it would seem that access to wider contexts is important. For all the advantages of KWIC, by showing the node word in the centre of the screen, not only are paragraph breaks usually masked, but the
      position of the node in the sentence is not very prominently displayed either. Even if the KWIC window is limited to words occurring in the same sentence, white space to the left of a sentence initial instance gives some indication that the word occurs towards the beginning of a sentence, but then masks whether or not this is a paragraph break. Concordance lines in which the node word is more than 4 or 5 words away from the start of a sentence appear much the same whether or not they are towards the beginning of a long sentence, part of a singleton paragraph, or towards the middle of an average length sentence. One challenge for this project was to find a way to present a much wider context than usual in a way which also facilitates visual scanning of patterns, while at the same time enjoying many of the benefits of KWIC. The Lines Tab in the application provides a KWIC view, and although this is much more similar to other concordancers, the design also incor‑ porated some consideration of the position in paragraph and sentence. However, one of the main differences in the presentation of concordance lines in The Prime Machine is the Cards Tab and the single card shown on the Lines Tab for the cur‑ rently selected line. A screenshot of the Cards Tab showing the paragraph layout, different heights of cards, the collocation captions and the citation information can be seen in Figure 6. The card template that is used to organise and present the words in sentences before and after each node will accommodate a fairly wide range of configurations including cards where all three sentences appear as one field, and others where paragraph breaks and headings can be seen before or after the node sentence. The beginning or end of a text is indicated by a blank line at the

      Figure 6. The Cards Tab with captions at the top and highlighting of the line containing the node; with incidental data from the BNC: Academic sub‑corpus for a search on the node pilot. The currently selected card is shown with a yellow caption
      top or bottom of the card. The card view is intended to be a compromise between the desire to provide additional information about headings and paragraphing and trying to reduce the complexity of both displaying text as it would be shown in the original sources. It is a simplification bringing some order and uniformity to aspects like font size, colour and highlighting, while providing some visual infor‑ mation about the position of words in sentences and sentences in paragraphs.One issue regarding cards is that it is rather more difficult to scan across sev‑ eral concordance lines and to see patterns in the co‑text. As well as gentle high‑ lighting of the row in the card that contains the node word, the list of collocations for the current node word is also used to provide a visual cue at the top of each card in the form of a caption. This was designed to highlight the relationship between the concordance line and collocations. The caption provides an important way of helping learners see nearby words which have a strong relationship with the node, without disrupting the flow of text. Including collocates in a caption goes some way towards overcoming Kenning’s (2000) concern that language learners may need help in seeing how a search term is actually part of a longer unit. It should also support teachers wanting to follow some of the other recommendations in the literature; recommendations such as teaching learners how to note collocations by drawing attention to extra words around a collocation (Lewis 2000: 134) and directing learners away from separate word analysis (Siyanova & Schmitt 2008).As well as providing additional data and information in the extended context and the captions, the Cards view in The Prime Machine also prominently shows the source of each concordance line. 
Language learners using a concordancer are much less likely to be aware of the composition of the corpus and also tend to be less sensitive to notions of how language use changes across different genres and registers. However, as mentioned earlier, an important point Hoey makes regarding all of the claims forming his theory of Lexical Priming is that they are “constrained by domain and/or genre” (2005: 13). In the design of the Cards and Lines views for The Prime Machine, the question of how best to provide clearer information about the source of each concordance line was considered carefully. In order to give a quick sense of the kind of text from which the concordance line is taken, each individual text in the corpus is assigned to one main text category, which is set at the time the text is imported, and this category is used in a heading at the top of each card. Below this heading, other tags or metadata are displayed in the style of an academic reference or other referencing convention.
3. Claim 3: The design should help language learners notice features in the patterning of words and phrases

The tendency for words or phrases to occur in different positions in a text is an interesting and under‑researched area, and one which is somewhat difficult to
explore using standard concordancing software. Nevertheless, some work has been done looking at some of the possible different text units and the tendency for words and longer phrases to occupy positions at the beginning of these. WordSkew is a software tool which allows counts to be performed for items within sentences, paragraphs, sections or texts in terms of absolute slots or by dividing the discourse unit into portions of equal length (Barlow 2016). The Concord tool in WordSmith Tools provides columns of data showing the position of each concordance line as a percentage relative to several text units. WordSmith Tools was the software used by Hoey (2005), and some of the ways it can be used to investigate textual colligation are demonstrated by Scott and Tribble (2006). Garretson’s CenDiPede software (2010) includes three features under the heading “Pseudo‑Colligation”, two of which are relevant to textual position. The first uses the results from clausal analysis to report the raw frequency of occurrences of the node occurring before the verb within its clause. This is designed to be a rough mapping to theme‑rheme. The second is described as a “nod to Hoey’s notion of textual colligation” (Garretson 2010, p. 149), and is the percentage of instances of the node where it is sentence initial. At the text and paragraph level, Hoey and O’Donnell (2008) and O’Donnell et al. (2012) compared the first sentences of texts and paragraphs against the sentences from the remainder of these texts in order to establish which words had a tendency to be used in text initial and paragraph initial position.
Their procedure was complicated, especially for the generation of concgrams, and involved splitting the corpus into sub‑corpora according to each of the required set of positions, using concgrams in WordSmith Tools and then a Python script before running the wordlist function in WordSmith Tools again.

The use of the key word method to identify words which occur with statistical significance in text initial or paragraph initial position seems very promising. However, concordancing software provides little integration of functions to explore such features, and few language learners would be skilled or motivated enough to go through the process of splitting a corpus themselves and then performing key word analysis and interpreting the results. The results from the study by O’Donnell et al. (2012), which found that one in forty individual words showed a tendency to be used in specific positions, provide good evidence that this is something worth researching further, but they also suggest that if the starting point is a word or phrase and the aim is to discover whether or not this word or phrase has such a tendency, the overwhelming majority of cases are likely to be disappointingly negative. Writing about concordancing software in more general terms, Cobb (1999) argues that language learners need software which does not assume detailed linguistic knowledge and which also does not assume that the users will be curious enough to explore. It would seem obvious that for phenomena like textual colligation, which are less well‑understood by both teachers and students, these two aspects of software design are even more important.
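The splitting step that learners would otherwise have to carry out by hand can be illustrated with a short sketch. It is not the procedure used by O’Donnell et al. (2012); the function name `split_counts`, the input format (each text as a list of sentence strings) and the naive whitespace tokenisation are all assumptions made for illustration.

```python
from collections import Counter


def split_counts(texts):
    """Count word forms in text-initial sentences and in all other sentences.

    `texts` is a list of texts, each given as a list of sentence strings
    (a hypothetical input format). Tokenisation is a plain whitespace split.
    """
    initial, rest = Counter(), Counter()
    for sentences in texts:
        for i, sentence in enumerate(sentences):
            # The first sentence of each text goes to the "initial" sub-corpus.
            target = initial if i == 0 else rest
            target.update(sentence.lower().split())
    return initial, rest


# Toy data, invented for the example.
toy_texts = [
    ["Recent advances are notable", "We test a model"],
    ["Advances in parsing continue", "Results are mixed", "We hope for more"],
]
initial, rest = split_counts(toy_texts)
```

The two frequency lists produced in this way are the kind of input a key word comparison needs: one list for the positional sub‑corpus and one for the remainder.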
Therefore, procedures were developed for The Prime Machine to calculate, store and display tendencies of words and nested combinations to occur in various environments. As well as measures related to textual colligation, several other measures were developed to target some of the other features of lexical priming. It is hoped that the aim of drawing learners’ attention to this selection of features will resonate with language teachers and that it will help learners engage with the data in the concordance lines more easily. Although the range of features is limited, some of the well‑known trouble‑spots for English for Academic Purposes have been targeted, with the use of articles and prepositions, passive voice, and modal verbs included. Rather than looking for specific features and then looking at the words which display a specific tendency, the aim of processing and storing these data is to highlight to the user any tendencies which exist for the specific words or collocations that they have used in their search query. The results of key word analyses for the features are made available in the database so that it is possible to retrieve the tendencies which are key for the search query.

Table 1 shows the list of features and how they are organized into 5 groups.

In order to measure tendencies, features are flagged in the database either at the word or sentence level through a series of processes. Lists of words and collocations are generated according to the proportion of instances in the corpus for each feature of the contextual environment, and a statistical test is applied so that those meeting a threshold will also be stored in a list of significant items for each feature. The contingency table used for sentence level measures is shown in Table 2, and that for word level measures is shown in Table 3. For collocations, the contingency tables are based on the number of occurrences of the node of the multi‑word unit in each environment. Summary data are stored for all log‑likelihood values reaching a BIC value of 2. Following Wilson (2013), Bayes Factors are used as a way of standardizing the cut‑off point for the key word method, and the level of significance is stored using the BIC interpretation given there.

Table 1. Features of lexical priming measured and stored in The Prime Machine

| Group | Feature | Values | Level |
|---|---|---|---|
| Headings | Title | Title; Not a title | Sentence |
| Headings | Heading | Heading; Not a heading | Sentence |
| Position in text* | Sentence position in text | Text Initial; Text Ending; Not text initial or text ending | Sentence |
| Position in text* | Paragraph position in text | First Paragraph; Last Paragraph; Not first or last paragraph | Sentence |
| Position in text* | Sentence position in paragraph | First Sentence; Last Sentence; Not first or last sentence | Sentence |
| Position in text* | Word position in sentence | First Fifth; First Third; Last Third; Last Fifth; Not first or last third | Word |
| Complexity, Modality, Voice & Polarity | Word position in sentence | Theme; Rheme; (unknown) | Word |
| Complexity, Modality, Voice & Polarity | Complexity | Simple Sentence; Complex Sentence | Sentence |
| Complexity, Modality, Voice & Polarity | Modality | Volition/prediction; Permission/possibility/ability; Obligation/necessity; No modals | Sentence & Word |
| Complexity, Modality, Voice & Polarity | Voice | Active Voice/Other; Passive Voice | Sentence & Word |
| Complexity, Modality, Voice & Polarity | Polarity | Positive; Negative | Sentence & Word |
| Determiners & Prepositions | Determiners | Definite articles / Possessives; Indefinite articles; No articles | Word |
| Determiners & Prepositions | Prepositions | Near Prepositions; Not Near Prepositions | Word |
| Repetition | Repetition | Same form; Same stem; Not repeated | Summary information only |

* Not all the values for features in this group are mutually exclusive. For example, words that are in the first fifth of a sentence will also be in the first third.
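The overlap noted in the footnote to Table 1 (a word in the first fifth of a sentence is also in the first third) can be made concrete with a small sketch. The boundary rule used here, under which a word counts as inside a portion only if it lies wholly within it, is an assumption; the source does not state how The Prime Machine draws these boundaries, and the function name `position_flags` is hypothetical.

```python
def position_flags(index, length):
    """Return the positional labels for one word in a sentence.

    `index` is the 0-based position of the word and `length` the number of
    words in the sentence. Integer arithmetic avoids rounding problems.
    The boundary rule is an assumption made for this illustration.
    """
    if not 0 <= index < length:
        raise ValueError("index must lie inside the sentence")
    flags = set()
    if (index + 1) * 5 <= length:
        flags.add("First Fifth")
    if (index + 1) * 3 <= length:
        flags.add("First Third")
    if index * 3 >= 2 * length:
        flags.add("Last Third")
    if index * 5 >= 4 * length:
        flags.add("Last Fifth")
    if not flags:
        # Neither in the first third nor in the last third.
        flags.add("Not first or last third")
    return flags


first_word = position_flags(0, 10)   # carries two labels at once
middle_word = position_flags(4, 10)  # carries only the residual label
```

Because the labels are returned as a set, a single word can carry more than one value for this feature, which is why the values in this group cannot be treated as mutually exclusive categories.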
Table 2. Contingency table for sentence level features

| | Corpus One | Corpus Two |
|---|---|---|
| Freq. of word | A = inside sentences with the specific feature | B = Outside the sentences with the specific feature |
| TOTAL | C = Count of all words inside sentences with the specific feature | D = Whole corpus – C |

Table 3. Contingency table for word level features

| | Corpus One | Corpus Two |
|---|---|---|
| Freq. of word | A = where the specific feature has been marked | B = where the specific feature is absent |
| TOTAL | C = Count of all words with the specific feature | D = Whole corpus – C |
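The cells A to D in Tables 2 and 3 can be turned into a key word score along the following lines. This is a minimal sketch, not the code of The Prime Machine: it assumes the familiar two‑cell log‑likelihood calculation, the approximation BIC ≈ LL − ln(N) with N taken as the size of the whole corpus (C + D), and a conventional banding of BIC values. The source itself names only the cut‑off of 2 and the “Positive evidence” level; the other bands, the function names and the counts are assumptions.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-cell log-likelihood for the layout of Tables 2 and 3.

    a: frequency of the word inside the environment
    b: frequency of the word outside the environment
    c: count of all words inside the environment
    d: count of all words outside the environment (whole corpus - c)
    """
    total = c + d
    expected_a = c * (a + b) / total
    expected_b = d * (a + b) / total
    ll = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero cell contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def bic_score(ll, corpus_size):
    """Approximate BIC from a log-likelihood value (assumed: LL - ln N)."""
    return ll - math.log(corpus_size)


def evidence_label(bic):
    """Map a BIC value to a band; band names other than 'Positive evidence' are assumed."""
    if bic < 2:
        return "Below cut-off"
    if bic < 6:
        return "Positive evidence"
    if bic < 10:
        return "Strong evidence"
    return "Very strong evidence"


# Hypothetical counts: a word occurring 1,000 times, 130 of them in headings,
# in a corpus of 1,000,000 words of which 6,000 are in headings.
ll = log_likelihood(130, 870, 6000, 994000)
label = evidence_label(bic_score(ll, 6000 + 994000))
```

Subtracting ln(N) means that the same log‑likelihood value counts for less in a larger corpus, which is what allows a single cut‑off to be applied across corpora of different sizes.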

Table 4. Selected examples from the help screen

Headings: Heading
Examples from the BNC: Academic sub‑corpus
Only 0.6% of words in this corpus are part of a heading. Yet 13% of the occurrences of the word conclusion are paragraph headings and none of the occurrences of the word ending are paragraph headings. Obviously, the heading used for the last section of an academic article is usually Conclusion, but it also occurs very frequently within sentences.

Position in text: Paragraph position in text
Examples from the Hindawi Computer Science corpus
Only around 3 in 100 words are part of the first paragraph of texts. Yet around a quarter of the occurrences of the words advances and increasingly are in the first paragraph of texts. Other words often used in the first paragraph are emerging, novel, and growing. These give a sense of how changes have occurred and progress has been made.
Only around 1 in 200 words are part of the last paragraph of texts. Yet words like hope and future occur in the last paragraph much more often than that. Words which frequently occur in the last paragraph of a text often give a sense of looking forward to the future.

CMVYN group: Modality
Examples from the BNC: Academic sub‑corpus
Less than 5% of words in the corpus are near modal verbs. Yet words like legitimately, usefully, conceivably and easily are often used with the words can, could, may or might. Words like remembered, noted, emphasised and stressed are often used with the words must, should, need to or ought to. Other words often used with these modals are carefully and surely. Words like suffice, cease, depend and disappear are often used with the words will, would or shall. Other words often used with these modals are examine, argue and discuss.

Det. & Prep. group: Prepositions
Examples from the BNC: Academic and BNC: Newspapers sub‑corpora
A little more than half of all words in these corpora are near prepositions. Yet 99% of the occurrences of the word spite are near prepositions while none of the occurrences of the word despite are near prepositions. Sometimes similar words can be quite tricky to use correctly when writing in a foreign language, but a quick search for despite vs. spite in either of these corpora can show preposition patterns very clearly. We would expect the concordance lines to show us despite near verbs and in the phrase “despite the fact”. We would also expect to see spite used in sentences in the phrase “in spite of”.
A few examples are presented in Table 4 as they appear in the software’s help screens, stripped of all the technical evidence, where they are provided in order to help explain to an advanced learner what each feature was designed to measure. The reader is not being asked to dwell too heavily on whether there is anything remarkable or surprising about the tendencies of the example words to be used in these specific contexts, but rather to consider whether, given a learner’s interest in the use of one or more of these words, it would not be to his or her advantage to have attention drawn to the existence of such tendencies.

When the concordance lines are retrieved, the concordancer is able to present information about the proportion of instances for the currently downloaded sample and the proportion of instances in the corpus as a whole, as well as a list of features for which the search term has been pre‑calculated as having a statistically significant relationship. This information is displayed in the form of graphs, designed to help the learners appreciate that these primings are almost always representative of relative frequencies rather than absolute restrictions on use. Krishnamurthy and Kosem (2007) make many suggestions about the visual design of a corpus tool, and the incorporation of icons and graphs into The Prime Machine was in part a response to these. An example of a graph is shown in Figure 7.

Figure 7. Graph display for compare mode for the Voice submenu on the Graphs Tab with results for consequences compared against outcomes from the BNC: Academic sub‑corpus

One of the striking things from Hoey’s (2005) presentation of the evidence for the priming of words is the need to consider what the expected values or typical environments for each kind of feature would be. Clearly, the number of text initial sentences will always be very small compared to the whole corpus, yet because of the differences in the length of the texts in different corpora, these proportions can vary. Similarly, some features such as passive voice tend to be much less common in some text types than in others, and so it is useful to be able to highlight cases where the proportion is much higher or lower than would be expected based on a collection of texts as a whole. For the graphs, expected values are calculated using the total number of words in each priming environment in the whole corpus, and these are displayed using arrows marked “norm”.
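The three quantities behind each graph (the proportion in the downloaded sample, the proportion in the corpus as a whole, and the “norm”) can be sketched as simple ratios. The function and parameter names below are hypothetical, the counts are invented, and the calculation is an assumption based on the description above rather than the software’s actual code.

```python
def proportion(part, whole):
    """Return part / whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def graph_values(sample_hits, sample_size, corpus_hits, corpus_freq,
                 env_words, corpus_words):
    """Compute the three proportions plotted for one priming environment.

    sample_hits / sample_size: node instances in the environment within
        the currently downloaded sample of concordance lines
    corpus_hits / corpus_freq: the same for all instances in the corpus
    env_words / corpus_words: all words in the environment against the
        whole corpus, i.e. the expected value shown as the "norm" arrow
    """
    return {
        "sample": proportion(sample_hits, sample_size),
        "corpus": proportion(corpus_hits, corpus_freq),
        "norm": proportion(env_words, corpus_words),
    }


# Invented counts: 20 of 100 sampled lines, 130 of 1,000 corpus instances,
# and 6,000 of 1,000,000 corpus words fall in the environment.
values = graph_values(20, 100, 130, 1000, 6000, 1_000_000)
```

Comparing the corpus proportion with the norm, rather than with a fixed figure, is what makes the display meaningful across corpora whose texts differ in length and type.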
One of the important goals of the project was to find a way to make tendencies of words and collocations more prominent and guide the learner to find interesting and useful patterns. A researcher who is highly motivated to explore exhaustively the evidence for primings of a particular word or phrase based on tendencies revealed through corpus analysis may well be motivated enough to spend time trying different features, not losing too much interest if no relationship is found. However, if a vast array of options is made available to learners without any guidance, they could either waste time filtering the data or become frustrated. Therefore, a means was needed of helping direct the user’s attention to priming information which might be explored more fruitfully, and this is the purpose of the “hot” icons which appear on a dock at the bottom of the results screen. When each list of concordance lines and other summary data are retrieved, the application goes through the table of statistically significant priming environments and changes the icons to match the features. Icons representing priming environments which do not reach the “Positive evidence” BIC Factor Score for the current search term are set to be invisible. Clicking on the icon takes the user directly to the subsection on the Graphs Tab menu corresponding to this feature. Figure 8 shows concordance lines with the dock at the bottom showing statistically significant tendencies for position, complexity, indefinite articles and repetition. Figure 9 shows how the icon grows in size when the mouse is hovered over it.
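The selection of icons for the dock amounts to a threshold filter over the stored scores for the current search term. The sketch below assumes that each feature has a single stored BIC value and that “Positive evidence” begins at 2, as suggested by the cut‑off mentioned earlier; the function name, the dictionary layout and the scores are hypothetical.

```python
POSITIVE_EVIDENCE = 2.0  # assumed lower bound of the "Positive evidence" band


def visible_icons(feature_bic, threshold=POSITIVE_EVIDENCE):
    """Return the features whose icons should be shown on the dock.

    `feature_bic` maps a feature name to the BIC value stored for the
    current search term (a hypothetical layout). Features below the
    threshold are left out, which corresponds to hiding their icons.
    """
    return [name for name, bic in feature_bic.items() if bic >= threshold]


# Invented scores for one search term.
scores = {"position": 7.5, "voice": 0.4, "indefinite articles": 3.1, "repetition": 2.0}
dock = visible_icons(scores)
```

Hiding weak results rather than greying them out keeps the dock short, so that every icon a learner sees leads to a tendency with at least positive evidence behind it.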
      Figure 8. Lines Tab showing the card for the currently selected concordance line and the dock of icons for the node pilot in the BNC: Academic sub‑corpus
      Figure 9. Enlarged icon showing positive evidence for a tendency to occur after indefinite articles. The hand icon represents the mouse cursor position
An important point is that providing a summary of typical environments for a word or collocation should not be an end in itself; rather the software should encourage learners to consider and explore for themselves whether the words they encounter or want to use in their own writing might be primed to occur with other features. To this end, a system was devised to allow users to move from the list of features on the Graphs Tab to a filtered list of concordance lines matching those features. Figure 10 shows the checkboxes and filter buttons available for one of the priming menus.

By removing the ticks from some of these boxes, the user can filter down the results. Looking at filtered results may help to show learners how a word or collocation is used in particular priming environments. The option is intended to let learners compare concordance lines for the same item, to see whether patterns can be seen or conclusions can be drawn according to different contexts, and to allow them to see variation as well as common patterns. The complex categories used for some of the priming features can also be made easier to understand by showing users lines matching the features