OCSystem - Collection of the OC Corpus
Bill Mann - 2000, 2001
The OC Corpus was collected on a large multi-user mainframe computer system in the mid-1970s. It represents over 600 interactions between individual system users and the computer operator on duty. The system typically served about 50 users at once under a time-sharing operating system. It ran in a large, isolated computer room, separate from the users. Some users were in the same building, and others were connected to it through a network or telephone dialup.
All of the users' interfaces to the operating system were based on characters, typed onto a command line. In addition to CRT monitors, some of the users used printing terminals, often teletype machines, as their only display devices. (All alphabetic characters were upper case.) Each active, logged-on user had a continuously running "job" on the mainframe, identified with the user's logon name. Each user had an allocation of hard disk space, which persisted over long periods of time, and in addition could use magnetic tapes stored in the computer room.
Part of the operating system was a job for the computer operator to use in controlling the system. It was identified with the name OPERATOR. The controlling interface for OPERATOR was a printing terminal called the Operator Console or OC. This terminal produced a continuous stream of paper representing all of the actions of the operator, and all of the notices and error reports that the system produced. These were generally saved for a few weeks to assist in diagnosis of system behavior.
The operating system allowed users to link their terminals together, using a command called LINK. If two terminals were linked together, then typically each would continue to control its own job, but any character displayed on one terminal would also be displayed on the other.
Often users would link to the job of OPERATOR. This was necessary in order to get magnetic tapes mounted and to restore archived or recently lost files, but it was also used to seek information about the system status and about how to do things. The operator's terminal was occasionally linked to more than one other terminal.
The paper stream from the console of OPERATOR was therefore a complete, self-transcribing source of dialogues. Every character of the dialogue appeared just as it was during the interaction. This paper stream was collected, with permission, for many months. Large fractions that did not contain dialogues were recycled, and a residue of six inches or so of paper has been retained. As part of studies of dialogue in the 1970s, the dialogues were numbered and retyped into files. The original files have been lost, but the paper has not. The OC Corpus is an unselective sample of these dialogues, retyped from the original paper versions. Aside from modifications to preserve privacy, the human-typed portions are just as they appear on the paper. Where there were identifiable residues of actions of the computer operating system, not typed by either dialogue participant, they have been skipped. Apparent spelling errors have been retained.
===========
One of the advantages of the OC Corpus is its completeness and verifiability. There are no lost gestures, intonation, shared visual space or other invisible influences on the communication. Only the timing is missing, and it seems generally unimportant, partly because many users were poor typists, and partly because the operators were often busy and continually interrupted by the telephone. For the study of the basic features of two-party interaction, these are advantages.
===========
The system was a DEC PDP-10, running the TENEX operating system at the Information Sciences Institute (ISI) of the University of Southern California. The network was the ARPANET, later divided and renamed the Internet. Users included staff members of the Institute, staff members of ARPA, the sponsor, and others. Generally, but not always, when users linked to the OPERATOR, they did not know what individual they would be conversing with.
===========
A sample dialogue, OC1, is below. This presentation retains control characters, including leading semicolons (which prevent the system from taking the line as a command), @ (the system prompt), ^C (which is a temporary interruption of ongoing job activity), the BRE command (which breaks the link), and similar computer artifacts. Such items are not retained in the retyped corpus.
OC 1
LINK USERNAME
LINK FROM OPERATOR, TTY40
@;;THE FILES YOU REQUESTED HAVE BEEN RESTORED IN THE SAME DIR UNDER THE SAME NAME.
@^C
@;I WOULD HOPE SO!! OK, THANKS, SEEYA
@;YOU BET
@BRE
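
The corpus itself was retyped by hand, and the exact rules applied are not recorded here. Purely as an illustration of what dropping these artifacts amounts to, the following Python sketch filters a raw console transcript such as OC1 down to its human-typed text. The function names (clean_line, clean_dialogue) are hypothetical, and treating LINK lines as artifacts alongside ^C and BRE is an assumption, not something stated above.

```python
import re

# Assumption: a dialogue line may begin with "@" (the system prompt)
# followed by one or more ";" that stop the system from reading the
# rest of the line as a command.
PROMPT_AND_SEMICOLONS = re.compile(r"^@?;+")


def clean_line(line):
    """Return the human-typed text of one line, or None for a pure artifact."""
    text = line.strip()
    bare = text.lstrip("@")
    # ^C interruptions, BRE commands and LINK lines carry no dialogue
    # (skipping LINK lines is an assumption of this sketch).
    if bare in ("^C", "BRE") or bare.startswith("LINK "):
        return None
    if PROMPT_AND_SEMICOLONS.match(text):
        return PROMPT_AND_SEMICOLONS.sub("", text).strip()
    return text or None


def clean_dialogue(lines):
    """Keep only the lines that contain human-typed dialogue text."""
    cleaned = (clean_line(line) for line in lines)
    return [text for text in cleaned if text is not None]


# The body of sample dialogue OC1, as it appears on the console paper.
sample = [
    "LINK USERNAME",
    "LINK FROM OPERATOR, TTY40",
    "@;;THE FILES YOU REQUESTED HAVE BEEN RESTORED IN THE SAME DIR UNDER THE SAME NAME.",
    "@^C",
    "@;I WOULD HOPE SO!! OK, THANKS, SEEYA",
    "@;YOU BET",
    "@BRE",
]

cleaned = clean_dialogue(sample)
for turn in cleaned:
    print(turn)
```

Applied to OC1, the sketch keeps three lines of dialogue. It does not say who typed each line, because the linked terminals echoed both parties onto the same paper stream; speaker attribution has to come from reading the dialogue itself.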