Workshops

Times: 9am-12noon and 1pm-4pm

In this workshop, we will be dealing with some basic questions: What is a corpus? What is corpus linguistics? What are the opportunities and limitations of corpus linguistic studies? What do we need to take into account in corpus design? How does corpus analysis actually work?

We will be working with different English and German corpora (e.g. BNC, COCA, COSMAS…), corpus tools (e.g. AntConc) and general data processing software (e.g. Excel).

Overall, this workshop aims to enable students to work successfully on their PhD or Master's theses and to provide them with the necessary set of tools. For this reason, there will be many hands-on activities with corpora and the respective software.
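To give a first taste of what corpus analysis looks like in practice, here is a minimal word-frequency count in Python, the kind of result a tool like AntConc produces and that can then be processed further in Excel. The two-sentence mini-corpus is an invented example:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return Counter(tokens)

# Invented two-sentence mini-corpus for illustration.
corpus = "The corpus is small. The corpus is a sample."
freqs = word_frequencies(corpus)
print(freqs["corpus"])  # -> 2
```

Real corpus work of course operates on far larger texts, but the principle (tokenise, then count) is the same.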

This workshop familiarises participants with a range of approaches to analysing speech using the freely available program PRAAT by Paul Boersma and David Weenink (www.praat.org). In terms of thematic depth, the workshop is aimed mainly at participants with little or no prior experience with PRAAT. Knowledge of phonetics at the level of an introductory linguistics class is assumed, but no further experience with acoustic phonetics is required.

We will approach the topic by looking at a range of well-known linguistic variation phenomena including, inter alia, vowel length in Scottish English vis-à-vis RP and vowel quality differences between major varieties of English. The data we will be looking at comes from the International Dialects of English Archive (http://www.dialectsarchive.com/). The workshop focuses on four interwoven aspects of basic speech analysis:

1. PRAAT – the basics

Participants will be familiarised with the basic interface options that are available in PRAAT. A short overview of the available representational layers (waveform, LPC, etc.) will also be provided.

2. Spectrogram analysis

The main “mode” of visualised speech that will be looked at in some detail is the spectrogram. Although the workshop is not intended to provide a full-fledged introduction to acoustic phonetics, participants will be familiarised with the acoustic correlates of basic phonetic parameters such as pitch and vowel quality, and with how these can be identified and measured on the basis of spectrographic data.

3. Annotation

Moving beyond basic phonetics proper, we will also spend some time looking at how to produce so-called “TextGrids” in PRAAT; that is, how to tag stretches of speech orthographically, providing additional information on phonemes, words, etc.
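Conceptually, an annotation tier pairs time spans with labels. As a language-neutral sketch of that idea (plain Python data structures, not PRAAT's actual TextGrid file format; all interval values are invented), tiers and a simple label lookup might look like:

```python
# Each interval on a tier: (start_time_s, end_time_s, label).
# Invented example: a one-word utterance segmented into phones.
word_tier = [(0.00, 0.62, "speech")]
phone_tier = [(0.00, 0.12, "s"), (0.12, 0.24, "p"),
              (0.24, 0.40, "i:"), (0.40, 0.62, "tʃ")]

def label_at(tier, time):
    """Return the label of the interval containing the given time, if any."""
    for start, end, label in tier:
        if start <= time < end:
            return label
    return None

print(label_at(phone_tier, 0.30))  # -> i:
```

PRAAT stores the same kind of information in its own TextGrid text format, which is what we will be producing in the workshop.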

4. Automatised measurements

As a last step, annotated stretches of speech will be used to familiarise participants with data mining using a number of scripts that automatise some aspects of phonetic analysis, i.e. that extract large amounts of relevant data in a short time.
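The workshop itself uses PRAAT scripts for this. Purely as an illustration of what an automatised measurement can look like, the following Python sketch estimates the fundamental frequency of a synthetic pure tone by counting zero crossings, a deliberately crude method that PRAAT's actual pitch tracking improves on considerably:

```python
import math

def zero_crossing_pitch(samples, sample_rate):
    """Crude F0 estimate: count sign changes, divide by twice the duration."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)

# Synthetic 200 Hz sine, 1 second at 16 kHz; the phase offset avoids
# samples that are exactly zero, which the sign test above would miss.
sr = 16000
tone = [math.sin(2 * math.pi * 200 * n / sr + 0.1) for n in range(sr)]
print(round(zero_crossing_pitch(tone, sr)))  # -> 200
```

The point is the workflow, not the method: once a measurement is expressed as a script, it can be applied to hundreds of annotated stretches of speech automatically.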

Participants are encouraged to bring up their own ideas either beforehand or during the workshop. I am looking forward to seeing you!

Annotation and query are a mainstay of corpus linguistic research. Corpus studies can benefit greatly from high-quality annotation. It is a means of making linguistic expertise and intuitions explicit and thus accessible to the analyst and to further steps in the research process. Yet, in order to harness the linguistic knowledge provided by corpora and annotations, linguists must also be able to formulate or operationalise their linguistic questions and address them to the corpus.

This workshop offers students of linguistics and other philologies a hands-on tutorial in annotation and query techniques for the analysis of digital corpora. In the workshop, we are going to explore techniques of automatic and semi-automatic annotation together with matching techniques for querying corpus data. Students are going to learn to employ the Stanford NLP Part of Speech Tagger[1] as an example of an automatic annotation process they can use to annotate their own corpora. We are furthermore going to study the structure of linguistic data in plain text and annotated corpora and the resulting affordances for operationalising linguistic research questions in the form of queries. Query techniques to be explored in the workshop include the integrated search for linguistic forms and annotations based on regular expressions, as implemented in standard tools such as text editors (e.g. Notepad++[2]) and widely used concordancing software such as AntConc[3]. Working examples in the workshop are from a corpus of political speeches.
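To give a flavour of such a query: the Stanford tagger's default output joins each token to its tag with an underscore (e.g. speech_NN). A regular expression like the one below, usable in Notepad++ or in a short script, then finds adjective–noun pairs in the tagged text. The tagged sentence here is an invented example, not taken from the workshop corpus:

```python
import re

# Invented sample of tagger-style output: token_TAG, Penn Treebank tags.
tagged = "The_DT minister_NN gave_VBD a_DT long_JJ speech_NN today_NN ._."

# Query: an adjective (JJ, JJR, JJS) immediately followed by a noun (NN...).
pattern = re.compile(r"(\w+)_JJ[RS]? (\w+)_NN\w*")
print(pattern.findall(tagged))  # -> [('long', 'speech')]
```

This is the sense in which annotation makes linguistic categories queryable: the form–tag pairs turn an abstract question ("which adjectives modify which nouns?") into a concrete search pattern.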

Tools used in the workshop:

[1] Stanford NLP Part of Speech Tagger
[http://nlp.stanford.edu/software/tagger.shtml]

[2] Notepad++
[https://notepad-plus-plus.org/]

[3] AntConc
[http://www.laurenceanthony.net/software/antconc/]

[4] AnQCor Workshop materials (forthcoming):
[http://linguisticsweb.org/doku.php?id=linguisticsweb:tutorials:linguistics_tutorials:anqcor]