The Dialogue Diversity Corpus
William C. Mann
Version 2.0
September 2003

The Dialogue Diversity Corpus (DDC) is about finding data to use in research on human interaction, especially dialogue. This edition, Version 2.0, retains access to all of the still-accessible sources that were available through the original release. To review the changes, click Additions to DDC in Version 2.

DDC is intended to facilitate all varieties of research that require dialogues from multiple situations as data. For studies of dialogue dynamics, situational effects in dialogue, dialogue coherence, dialogue genre comparison, studies of role and status in dialogue and many other topics, very diverse dialogue data must be brought to bear on single studies.

This corpus is a modest effort to facilitate such research. Although it is a small website, it gives access to hundreds of dialogues, many thousands of lines of interaction, and an open-ended way to find more.

Creating dialogue data can be very expensive, but accessing existing data can be very inexpensive, which makes more research resources available for study.

DDC consists of two parts:

    Part One presents actual dialogue data, below, for a limited number of dialogues. Because this data is very accessible, it may help with pilot studies, planning, conceptual development and short projects such as term papers.

    Part Two is focused on how to find additional dialogue data. It includes information on corpora indexes, corpora collections and websites, along with methods for finding dialogue data collected for non-research purposes. It is on another page: Finding Dialogue Data.

The Dialogues:

There are 54 dialogues directly accessible on this site (23 individual dialogues and 2 collections). All of them are freely available for research use, including study, citation and inclusion in publications. Some of them require a particular acknowledgment, available here with the dialogue.

Generally the amount of text represented here is a tiny fraction of the amount available at the source. Because file availability is a variable thing, moving and melting like an iceberg, we suggest that data users who need more than is given here go to the sources and obtain all that they need as soon as they know they might use it.

Links and Files:

The dialogues are represented, as much as possible, in the context of the Internet and by their original presentations. (This saves work, makes upgrades automatically available and often keeps the dialogues together with their documentation and exposition.) However, if the Internet versions disappear, I have enough backup to put these particular dialogues, but not the entire sources, directly into the Internet-accessible corpus.

In some cases the dialogues are only available on the Internet in a highly embedded form: zipped up, behind an unfamiliar search engine, or even distributed across a database in which they are only implicit and must be constructed. In such cases they are directly accessible here.

| Item | Collective Name | Situation | Source | Link to Dialogue 1 | Link to Dialogue 2 | Link to Supplement if any |
|---|---|---|---|---|---|---|
| 1 | Medical Interview | Physician and patient | H. Mischler book | | | |
| 2 | Academic Speech | Academic Advising for Linguistics, Introductory Astronomy Discussion | University of Michigan | Advising | Astronomy | The MICASE Corpus |
| 3 | Physics Tutoring | One student, one tutor | Circle Corpus, University of Pittsburgh | Student S1 | Student S13 | Physics Overview |
| 4 | Algebra Tutoring | One student, one tutor | Circle Corpus, University of Pittsburgh | Overview 2 | Multi Problem Transcript | Overview |
| 5 | Travel Agent | Working Telephone Travel Services | American Express/SRI | Tape 6, Call 1 | Tape 14b, Call 1 | Master File |
| 6 | Santa Barbara Corpus of Spoken American English | Friends interact | U. C. Santa Barbara | Interaction 7 | Interaction 11 | Corpus Source: |
| 7 | ICE Singapore English | Friends interact face to face | International Corpora of English (ICE) | Spoken Dialogue 45 | Spoken Dialogue 81 | ICE |
| 8 | ICE Singapore English | Friends interact by telephone | International Corpora of English (ICE) | Spoken Dialogue 91 | Spoken Dialogue 99 | ICE Singapore Annotation Manual (.pdf) |
| 9 | Trains 91 | Transport planning task | University of Rochester | Multiple Trains 91 Dialogues | ------- | The Trains Corpora |
| 10 | Trains 93 | Transport planning task | University of Rochester | Dialogue 1.1 | Dialogue 12.4 | The Trains Corpora |
| 11 | Operator Console Corpus | Computer mediated manual computer help | USC/ISI | OC397 | OC399 | Genesis of the Operator Console Corpus |
| 12 | Maptask | Cooperate to draw lines on maps | U Edinburgh | Q4nc4 (.DOC file) | Q3nc2 (.DOC file) | Overview |
| 13 | Corpus of Spoken Professional American English (CSPAE) | Professional Meetings | Athelstan (a private company) | "Sample" | ------ | CSPA |
| 14 | Monroe Emergency Service Management | Allocation in Simulated Emergency | University of Rochester | Session 6 Transcript | Session 6 - DAMSL annotation; Session 6 Initiative annotation | Monroe Project |

Ideal data:

To study the commonalities of dialogue, the skills and orientations by which people move effectively from one use of dialogue to another, to engage in any sort of study that deals directly with ranges of dialogue use, one is faced immediately with overwhelming complexity. People, situations and purposes all vary together in an unbounded way.

If we could somehow study the same people in interaction, hour by hour and day by day, the dynamics of their interaction would not be constant. They would vary by situation and by the moment-by-moment changes in the participants' intentions. Such data would allow us to start to tell the differences between personal variation and changes of situation and purpose.

Of course, such data cannot be found. Individual people do not enter into every sort of situation, so restriction of study to a few people would eliminate a lot of the diversity that occurs. Study of dialogues from certain sorts of situations, such as airplane crashes or spacecraft emergencies, absolutely requires studying dialogues not reproducible in laboratories. Other kinds of study similarly need comparable diversity of data. Fortunately, there is a surprising quantity of data available.

Dialogue Distinctives:

Distinctives of dialogues will interact with research goals to create some helpful selectivity in what is actually useful. The list below is intended to suggest potentially relevant contrasts:

  1. Natural personal intentions vs. assigned intentions
  2. Interaction between strangers vs. interaction between familiar persons
  3. Peer interaction vs. socially unequal interaction
  4. Language and dialect variation
  5. Shared language and dialect vs. interacting across a difference
  6. Language, dialect and situation well understood by the research team vs. not
  7. Two party vs. more than two
  8. Auditory medium vs. textual medium
  9. Presence of visual contact vs. absence
  10. Video record vs. other
  11. Immediate access vs. delayed access (e.g., email)
  12. Complete dialogues vs. excerpts
  13. Participation of children
  14. Original purpose of the recording

DDC does not come close to representing all of these contrasts. They are rather a kind of checklist for thinking about what sorts of dialogues are acceptable for particular research purposes.

Whole-Corpus Aspects of the DDC

It might be tempting to study the entirety of the DDC and draw some conclusions about dialogue or conversation as a whole. Such conclusions would be unjustified. The DDC is simply not representative of dialogue as a whole, nor is it even representative of the kinds of dialogue that get recorded and circulated.

The assembly of DDC has been opportunistic and effort-limited. The result is in no way representative of human dialogue. DDC may grow in the future, and its composition would then surely change. Again, there would be no reason to expect either the older or the newer version to be representative.

Thus DDC is unsuitable for studying the frequency of anything. It is also unsuitable for establishing that any particular thing does not happen.

Instead, DDC is representative of a wide range of phenomena for which we can say “This happens.” It is thus more representative of dialogue as a whole than a corpus of dialogue representing a single situation, a single configuration of media or a single kind of participants' interests.

The corpus should therefore not be studied as if it were homogeneous.

The support of the University of Southern California is gratefully acknowledged.
This entire web site is Copyright 2002, 2003: William C. Mann

