Emergence of a large corpus of written texts for some languages has enabled quantitative as well as qualitative analyses of the synchronic conditions for diachronic changes, over both long and short spans of time. This title presents studies on synchronic and diachronic variations based on such corpus data.
Focuses on aspects of variation and change in language use in spoken and written discourse on the basis of corpus analyses, providing new descriptive insights, and new methods of utilising small specialized corpora for the description of language variation and change.
Age is by far the most underdeveloped of the sociolinguistic variables in terms of research literature.
This book examines the discourse of adulthood and accounts for sociolinguistic variation, with regards to age and gender, through the exploration of a 90, word age and gender-differentiated spoken corpus of Irish English.
This title aims to examine the perceived gap between teenage language and that of the older population, by giving an insight into the London teenage vernacular towards the end of the 20th century.
Constitutes a large-scale empirical investigation of an international variety of English. Using a combination of the corpus compiled for this project and relevant sections of ICE-India as its database, this work tests descriptions and characterizations of English in India, and provides an account of register variation in Indian English.
This book is not only highly relevant to readers interested in Sri Lankan English and South Asian Englishes since it also offers more general sociolinguistic perspectives on the dynamics of postcolonial Englishes world-wide and on the inextricable link between language and identity.
The contributions to this volume apply and extend the techniques of corpus linguistics and diachronic linguistics to the challenge of describing and explaining grammatical change in varieties of English world-wide.
Many large-scale investigations of linguistic variation are unfeasible using traditional approaches. This volume is a collection of papers that illustrate the ways in which linguistic variation can be explored through corpus-based investigation. Variability is characteristic of any living language. This volume approaches the life cycle of linguistic variability in English using data sources that range from electronic corpora to the internet.
In the spirit of the Weinreich, Labov and Herzog classic, the fifteen contributions divide into three sections, each highlighting different stages in the dynamics of English across time and space. They show, first, how increase in variability can be initiated by processes that give rise to new patterns of discourse, which can ultimately crystallize into new grammatical elements.
The next phase is the spread of linguistic features and patterns of discourse, both new and well established, through the social and regional varieties of English. The final phase in this ebb and flow of linguistic variability consists of processes promoting some variable features over others across registers and regional and social varieties, thus resulting in reduced variation and increased linguistic homogeneity.
Covers such topics as: language standardization in ideological context; Nuremberg and its language; the distribution of variable consonant sets; social identity, stylistic factors and orthographic congruity; and statistical models of Nuremberg's consonantal variation.
Taking sociolinguistic globalization yet a step further, this monograph investigates the use of Jamaican Creole in a web discussion forum by combining quantitative and qualitative methodology in a sociolinguistic third wave approach.
We observe that in some ways sociolinguistics and corpus linguistics have always been converging, while in other ways they have always been diverging.
A consideration of Eckert's (under review) paper, "Three waves of variation study", sheds further light on this situation. In this historiographical assessment of quantitative sociolinguistic work, Eckert classifies the study of sociolinguistic variation into three major categories, or "waves." The second wave of study involves ethnography and studying smaller groups of speakers and writers in greater depth, focusing on more local patterns of language use.
The third wave of study is about practice and agency, rather than social structures. Eckert points out that these three waves are not necessarily chronologically ordered. Yet, despite there not being a direct chronology that corresponds to the three waves, current interest in sociolinguistics is moving increasingly towards third wave-like approaches (see also COUPLAND). First wave, and to a lesser extent second wave, sociolinguistic research would appear to fit comfortably within a corpus linguistics mold. It is in these "waves" of research, where the quantitative large-scale analysis of corpora is most helpful, that we can draw strong connections between sociolinguistic and corpus linguistic methods and practice.
However, the focus and methodologies of third wave research appear to share less with corpus linguistics and perhaps have more similarities with Conversation Analysis, cf. The survey method's primary virtues are coverage and replicability, both of which depend on the use of pre-determined social categories and fairly fleeting social contact with the speakers chosen to represent those categories. As a result, the social significance of variation can be surmised only on the basis of a general understanding of the categories that serve to select and classify speakers.
There is no question that the broad demographic patterns of variation are important. But just as a map of New York City does not tell you what the streets are like, or what it's like to walk on them, the macro-sociological patterns of variation do not reveal what speakers at different places in the socioeconomic hierarchy are doing socially with those variables (Eckert under review: 6). It does not negate the importance of structure, but emphasizes the role of structure in constraining practice and, in turn, the role of practice in producing and reproducing structure.
In the study of variation, a focus on practice brings meaning into the foreground, as we try to get at what speakers are doing on the ground. At the same time, it moves us closer to the goal of studying the actual process of change (Eckert under review). These passages help explain the tension in actualizing a "corpus sociolinguistics." Yet, it appears to be an impossible task to make replicable and generalizable, especially through corpus-based methods, the ethnographic and instance-specific knowledge a researcher must gain in order to understand the actual creation and negotiation of social meaning "on the ground."
From this, it seems that some sociolinguists will continue to be uninterested in corpora and corpus methods. Nonetheless, there are concrete steps that corpus developers could take to enhance the possibilities of a "corpus sociolinguistics" and to increase the utility of corpora for pursuing sociolinguistic research, and I turn to these now. Technological advancements have been paramount in the development of sociolinguistics. The same is true of course for corpus linguistics.
The current research in both approaches would be impossible without modern recording equipment and the ability to store, process, and analyze large amounts of text and audio data through computerized means. While it may be the case that sociolinguistics and corpus linguistics diverge in coming years in their research orientations and methodologies in some ways (as is indicated by Eckert's third wave), it seems likely that continued technological advancements in the development, annotation, and analysis of corpora will lead to increased opportunities for sociolinguistic engagement with corpora.
This is especially true for research investigating aspects of language and social structure i. All researchers working with recorded data can benefit from advancements in the treatment of these data.
Some of these advances will occur, I believe, without the need for an explicit "call to arms." I consider these in turn.
The need for large, publicly available corpora of more diverse spoken language varieties and increased sharing of existing data.
One might argue that a primary benefit of corpus-based approaches to linguistic analysis is that the development, publication, and sharing of public corpora allows for the best possible advancement of empirical knowledge about language.
By allowing and, further, promoting the repeated and repeatable analysis of the same publicly available datasets, corpus linguistics fosters an environment that more fully fits the "scientific method" mode of research than many other areas in linguistics. Scholars can question and refine previous findings by re-analyzing the original data; they can extend or modify the annotation schemes and data coding used in previous research; they can compare previously analyzed datasets directly to newly developed datasets; and so on.
By working from a shared pool of data, researchers are best able to collectively develop agreed-upon knowledge about language. This, I believe, is a major benefit of corpus-based work, one which, in my opinion, corpus linguists have boasted about too little. The vast bulk of sociolinguistic research, even that based on thoroughly balanced and representative linguistic databases, has been conducted on proprietary datasets that are not available for peer review or outside consideration. The common practice in sociolinguistics is for individual groups of researchers to develop highly specialized, but closed, databases, which are not made widely available to outsiders.
This tendency is not ill-intentioned, but rather is the outcome of historical processes in the field. A huge amount of effort, time, and money goes into the collection of sociolinguistic data and the compilation of any spoken language dataset (NEWMAN), and within sociolinguistics (as, unfortunately, with many disciplines), academic "credit" has come from the analysis of the data and not its collection or compilation. Researchers traditionally have not wanted to get "scooped" cf.
These issues of anonymity and privacy are complex and difficult to answer when deciding to share fieldwork data (CHILDS et al.). Finally, since sociolinguistic datasets have typically been developed in order to research a specific question or set of questions, it has often been assumed that once the original questions have been studied in depth there is no further interest in the datasets themselves.
This trend of closed data appears to be changing and it is now the case that more groups of sociolinguistic researchers are making their data available to colleagues and to the public cf. One could imagine there being much richer corpora available, especially "conventional" corpora, if the developers of those corpora could draw on the spoken language data collections of sociolinguists. To give a specific example, African American English (AAE) has been studied at exceptional length in North American sociolinguistics and has been the subject of a vast body of empirical and quantitative investigations cf.
Is present-day AAE converging with, or diverging from, white varieties or regional varieties of American English? However, one could argue that additional progress could be made if scholars had access to a large, shared pool of data against which they could test competing theories or could cite broadly available evidence in order to support or refute particular positions. While some groups of sociolinguistic researchers have invested in developing thorough transcription and annotation schemes for their data e. Thus, there are massive amounts of sociolinguistic recordings, which are simply not available in forms that lend themselves to corpus linguistic approaches.
The costs of developing complete "corpus-like" data collections can unfortunately be too high, especially when the research questions at hand often involving particular sociolinguistic variables, cf. In recent work investigating the dative alternation in African American English, Kendall, Bresnan, and Van Herk (forthcoming) attempted to take stock of the amount of transcribed sociolinguistic AAE data that was available if one pooled data from across several research groups.
All told, we obtained only about a quarter million words of transcribed AAE speech, even though many scholars were extremely generous in making data available to us for analysis. This is not to say that a quarter million words is all that exists, but rather that these data i. It seems clear that doing corpus-based analysis on AAE will require further corpus compilation and creation work.
In sum, countless researchers would be greatly aided by the availability of a large, publicly available corpus of African American English. And this is just one example of a non-standard variety of English. We can readily imagine how many language researchers would benefit from corpora developed for other varieties and varieties of other languages.
We need more large-scale publicly available corpora of non-standard language varieties.
Connecting "data" to data and the question of "taming".
A second area from which sociolinguistic research could benefit would be a greater focus on the kinds of annotation available in corpora. This is not to propose that Hymes' model in particular be adopted by corpus developers, but more simply to highlight some of the kinds of annotation that would further sociolinguistic research possibilities through corpora and, more generally, might lead to richer annotation frameworks than are most often currently used.
Of course, there are huge difficulties in implementing these kinds of ethnographically informed annotation systems in a general way. They are often not readily applicable on a wide-scale, or individual annotation schemes are too bound up with a specific project, or a specific researcher's agenda, to be of use beyond a specific corpus or a specific research project.
Even social measures that may seem straightforward at first glance, like socio-economic class or education level, must often be contextualized for the particulars of the group under study cf. For instance, when studying the language use of non-mainstream populations, such as rural African Americans in the U. How to best achieve the kind of annotation necessary to make cross-group comparisons in these sorts of situations, or whether such annotation is possible in the first place, is a difficult question to answer.
This question, however, returns us to the quote on the first page of this paper (BEAL et al.). In their two edited volumes about "unconventional" corpora, Beal et al.
Poplack, in her foreword to the volumes, explains: Taming, as understood here, is largely a question of representation: How to represent forms for which there is no standard orthography, what to represent, how much to annotate, how much analysis to impose on the materials, how to represent ambiguities and indeterminacies, how to represent the finished product to the end-user (POPLACK). While I find Beal et al. It seems to me that one reason traditional corpora have not been used as extensively for sociolinguistic research is precisely because they have been extensively "tamed," and this "taming" has rendered them less sociolinguistically "real" or useful.
I resist the temptation to label this something like "data left in the wild". In earlier work (KENDALL), I proposed a model for considering data within sociolinguistics that attempts to maintain close connections between layers of annotation or metadata. Crucially, this involves being explicit about layers of abstraction (steps away from the original source data) in our annotation and metadata creation processes. Figure 1, from that paper, contrasts what I consider to be a traditional approach to sociolinguistic analysis and data management with an approach that I believe has greater benefits.
The basic premise is that sociolinguists are interested in understanding patterns of language in their social contexts, but that all quantitative work (or in fact any work based on records of speech, including audio-only and even video recordings, since recordings never capture the entirety of a real-world event) involves abstractions away from the true, contextualized language data, the actual real-world speech event.
In the "traditional" model, layers upon layers of annotation are developed, many of which increase the distance between the "data" (in quotes, indicating some level of abstraction from the actual or ideal data) and the real-world speech events that are ultimately the objects of interest, the true data (no quotes). For example, if I am interested in studying variable realizations of the English past tense (like unmarking or non-standard past tense marking), I might audio record a speech event and from that recording develop a transcript, which, for the sake of the example, we will assume accurately captures the variable realizations of the English past tense morpheme.
I then extract the frequencies of the various realizations of the past tense morpheme along with other contextual information and then compile this as a spreadsheet, which I add to compiled data from other speakers and other speech events.
In the end, I have a data file ready for quantitative analysis, but I have also moved several steps away from the original speech event. My language data has become a spreadsheet of frequencies or data tokens with very little available matrix talk, perhaps a concordance-like "keyword in context" amount of surrounding context. It is no longer quite "language," having been separated from its full communicative context. This likely does not matter as far as the success of my quantitative analysis goes, but the closer examination of individual tokens has become difficult, as has my ability to question the original coding of the morphemes.
The "re-conceived" model of Figure 1 focuses primarily on maintaining linkages between levels and types of annotation. As in the hypothetical example discussed for the traditional model, I may wish to transcribe the recording and then to extract quantitative data from that. However, here the emphasis would be on maintaining links between each of these layers of data and the other layers. This is achieved through a focus on accurate time-stamping and the development and use of software built for time-aligned linguistic (or at least audio) annotation.
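The idea of keeping each annotation layer linked to the layer beneath it can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any existing system: the class names (`Utterance`, `Token`), the variant labels "marked" and "unmarked", the helper functions, and the sample data are all hypothetical. The point it shows is that a frequency table can still be derived as in the traditional workflow, while every coded token keeps a path back to its transcript line and to a time span in the recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One transcript line, anchored to the recording by start/end times in seconds."""
    recording_id: str
    speaker: str
    start: float
    end: float
    text: str


@dataclass(frozen=True)
class Token:
    """One coded instance of a variable; it keeps a reference to its utterance."""
    utterance: Utterance
    word: str
    variant: str  # hypothetical coding labels, e.g. "marked" / "unmarked"


def variant_counts(tokens):
    """Tabulate variants per speaker (the 'spreadsheet' layer of the analysis)."""
    counts = {}
    for tok in tokens:
        key = (tok.utterance.speaker, tok.variant)
        counts[key] = counts.get(key, 0) + 1
    return counts


def audio_span(token, padding=0.5):
    """Return (recording_id, start, end) so a token can be re-heard in context."""
    utt = token.utterance
    return (utt.recording_id, max(0.0, utt.start - padding), utt.end + padding)


# Invented sample data, for illustration only.
utterances = [
    Utterance("rec01", "A", 12.5, 15.0, "yesterday she walk to the store"),
    Utterance("rec01", "B", 15.5, 17.0, "and then we talked for a while"),
]
tokens = [
    Token(utterances[0], "walk", "unmarked"),
    Token(utterances[1], "talked", "marked"),
]

counts = variant_counts(tokens)
span = audio_span(tokens[0])
```

Because each `Token` holds its `Utterance`, the original coding of any row in the frequency table can be questioned by going back to the transcript text and the corresponding stretch of audio, rather than to a detached spreadsheet cell.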
Both of these projects feature a time-aligned transcription model which is dynamically linked to the underlying audio recordings and to any additional researcher notes or quantitative data. In addition to the transcript text, the user has direct access to the recording audio, as well as to fine-grained information about where silences occur and their lengths.
Users can also get "close up" views of individual transcript lines, as in Figure 3, which displays the text of a line along with the audio itself, as well as a spectrogram and pitch track for the utterance created dynamically from the audio. Users can extract phonetic information directly from this view (only pitch data is illustrated in Figure 3). Transcripts in SLAAP are dynamic entities and can be reformatted in numerous ways, from textual representations, like the columnar format suggested by Ochs, to various graphical formats (screenshots of these other transcript views and a fuller discussion of SLAAP's transcript model are available in KENDALL). Traditional corpus analysis features are available, such as in Figure 4, which displays the highest frequency bigrams on the left and a sample concordance on the right, for the phrase "high school" from the same transcript shown in Figures 2 and 3.
Since all of the utterances are time-stamped, SLAAP is able to show a graphical timeline at the top-right indicating where each of the concordance lines occurs in the recordings (the single line that extends the length of the timeline image represents the temporal duration of the full recording; the filled bar that extends roughly across the left-half of the line represents the transcribed portion of the recording; the dots below the lines show when the concordance lines occur in time).
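The combination of bigram counts, a concordance, and a timeline of hits can be illustrated with a small Python sketch. It is not SLAAP's implementation, which is not described here; the function names, the tuple layout of the utterances, and the sample sentences are assumptions made for illustration. Each concordance hit carries the start time of the utterance it occurs in (not the exact time of the word), which is what allows hits to be placed along the duration of the recording.

```python
from collections import Counter

# Hypothetical time-stamped utterances: (start_seconds, end_seconds, text).
utterances = [
    (3.0, 6.5, "I went to high school in town"),
    (40.0, 43.0, "after high school I moved away"),
    (95.5, 99.0, "we moved to town"),
]


def bigram_counts(utts):
    """Count adjacent word pairs within each utterance (not across utterances)."""
    counts = Counter()
    for _start, _end, text in utts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def concordance(utts, phrase, width=2):
    """Return keyword-in-context hits as (utterance_start, left, match, right)."""
    target = phrase.lower().split()
    n = len(target)
    hits = []
    for start, _end, text in utts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + n:i + n + width])
                hits.append((start, left, " ".join(target), right))
    return hits


def timeline_positions(hits, recording_length):
    """Map each hit to a 0-1 position along the recording, as for a timeline plot."""
    return [start / recording_length for start, *_rest in hits]


bigrams = bigram_counts(utterances)
hits = concordance(utterances, "high school")
positions = timeline_positions(hits, recording_length=100.0)
```

Finer placement of hits would require word-level rather than utterance-level time stamps; the sketch uses utterance start times because those are the only times assumed to be available.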
The connection between the audio recording and the transcript and other annotation layers is not the only step available towards spoken language corpora that fit the "re-conceived" model of Figure 1, but it is, I believe, a large step towards improved spoken language data. Further, SLAAP's transcript implementation has been shown here only as one demonstration of a way that this can be accomplished, and SLAAP itself is meant only as one possible example.
Such systems, by basing the annotation on the temporal record of the recording, allow for multiple versions of annotation (and multiple versions even of transcription) and give the end-users, the analysts, the ability to customize their interfaces with the data. The "re-conceived" model of abstraction for (socio)linguistic data in Figure 1 is perhaps less a proposal for the future than it is a way to think about and steer the changes that are occurring in the ways that audio-based spoken language recordings are manageable and increasingly managed.
By focusing on building flexible annotation systems that maintain links through various levels of annotation and, most importantly, to the source recording, we can build corpora, which, instead of needing to be "tamed", can be utilized in a richer variety of ways than currently possible. I believe these sorts of models present the best opportunities for fruitful future work at the interface of corpus linguistics and sociolinguistics.
They also would yield more flexible spoken language corpora for a range of applications beyond sociolinguistics. In this paper, I have outlined some areas where corpus linguistics and sociolinguistics have strong existing connections and some areas where these connections are less strong.
These are advancements that I believe would greatly aid sociolinguistic research, as well as non-sociolinguistically oriented corpus-based research, and would build stronger bridges between sociolinguists and corpus linguists. The bulk of this paper has approached the relationship between sociolinguistics and corpus linguistics primarily from the perspective of sociolinguistics and, as such, has largely framed its discussion in terms of what corpus linguistics "can do" for sociolinguistic research.
Yet, these suggestions have important ramifications for corpus linguistics more generally and I hope these ramifications are clear to readers: The development of more spoken language corpora, from a range of varieties and with more flexible annotation, will benefit corpus linguistic research widely. As my discussion of Eckert's "three waves" account of the development of variationist sociolinguistic research indicates, much important sociolinguistic work will likely remain heavily engaged in and devoted to a kind of analysis that is probably impossible through the use of corpora.
At the same time, as Baker points out, tools from corpus linguistics can still be used for examining transcribed data, regardless of the overall direction the research or data takes (provided it is transcribed, of course). Software-based archives, like that demonstrated by SLAAP above, can help bring corpus-based methods and a more explicit focus on data to sociolinguistic research, even that which is not interested in large-scale analysis. I would like to end by posing the question: What can corpus linguists do now to best advance sociolinguistic research and to best promote the use of corpora and corpus methodologies in sociolinguistics?
There are clearly several answers to this question and while others may respond differently, my own wish would be that corpus linguists (especially those who have extensive experience in corpus development) work directly with sociolinguists (especially those who focus on field-based research and ethnography) to develop sociolinguistically rich, "unconventional" corpora, to make those corpora publicly available to researchers, and to work towards developing best practices for the corpus-like treatment of sociolinguistic spoken language data.
As I have argued elsewhere (KENDALL), sociolinguistic data and data management practices could greatly benefit from the knowledge and expertise of corpus linguists and language documentarians.
I thank the editors and the anonymous reviewers for excellent and helpful comments on an earlier draft of this paper. I also thank Gerard Van Herk, for many conversations relating to these topics, and Charlotte Vaughn, who helped with the original conception and production of Figure 1.
Any errors, of course, remain my own.
Creating and Digitizing Language Corpora. Corpus linguistics in the UK: Resources for sociolinguistic research. Language and Linguistics Compass, v. Cambridge: Cambridge University Press, Sociolinguistics and Corpus Linguistics.
Edinburgh: Edinburgh University Press, Inferring variation and change from public corpora. The Handbook of Language Variation and Change. BEAL, J. Corpus methods for sociolinguistics. Retrieved: April 1, Drift and the evolution of English style: A history of three genres. Language , v.
BIRD, S. A formal framework for linguistic annotation. Speech Communication , v. Praat: Doing phonetics by computer. Retrieved: September 18, Gender, sexuality and language. Encyclopedia of Language and Linguistics. Oxford: Elsevier, Variable rules: Performance as a statistical reflection of competence. Language, v. Sex and gender in variationist research. Safe harbour: Ethics and accessibility in sociolinguistic corpus building.
Corpus Linguistics and Linguistic Theory , v. Style : Language variation and Identity. Tracking the development of discourse 'like' in contemporary Canadian English. Doctoral Thesis Proposal. University of Toronto, Toronto, Canada, March 16, Counting matters: Normalization and accountability. In preparation. Prestige, accommodation, and the legacy of relative who. Language in Society , v. New York: Random House, Variation, convention, and social meaning. Three waves of variation study: The emergence of meaning in the study variation.
Under review. Retrieved: September 7, Brown Corpus manual. Revised and amplified, Corpora, v. Acta Linguistica Hafniensia , v. Exploring variability within and between corpora: Some methodological considerations. Berlin: De Gruyter Mouton, New York: Cambridge University Press, Foundations in Sociolinguistics : An Ethnographic Approach. Philadelphia: University of Pennsylvania Press, Getting off the Goldvarb standard: Introducing Rbrul for mixed-effects variable rule analysis.