. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
H U M D R U M N E W S
Issue No. 1 1994 September 23
A Newsletter for Music Researchers Using the Humdrum Toolkit.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Welcome to the first issue of HUMDRUM NEWS. This newsletter is intended
to facilitate communication with music researchers who are either using
the Humdrum Toolkit, or are contemplating using the Humdrum Toolkit.
In this inaugural issue, we provide an extended tutorial on building musical
inventories. A second article reviews how to download free copies of the
Humdrum Toolkit from an FTP archive.
Your comments and questions are welcome. Mail to
David Huron
dhuron@ccrma.stanford.edu
c/o Center for Computer Assisted Research in the Humanities
525 Middlefield Road, Suite 120
Menlo Park, California 94025-3443
U.S.A.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
:::::::::::
TALLAHASSEE
:::::::::::
If you are planning to attend the Society for Music Theory Conference
in Tallahassee, Florida, plan to visit the display booth for the Center
for Computer Assisted Research in the Humanities. Browse through
recent electronic editions of music releases from CCARH, and see a
demonstration of the Humdrum Toolkit.
If you don't already have the Humdrum Toolkit, bring three 1.4 megabyte
DOS-format disks to the SMT Conference. You can take Humdrum home
with you for free! The software runs on any UNIX system, and can also
run under DOS, OS/2, or Windows NT -- provided you have access to
UNIX utilities. Such utilities are available from commercial
vendors for approximately $250.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
::::::::
TUTORIAL Generating Inventories
::::::::
A common task carried out using computers is that of building an
*inventory* -- that is, identifying the number of occurrences of
various types of data.
Questions such as the following all pertain to the generation of
inventories:
* Does Liszt use a greater variety of harmonies than Chopin?
* What is the most frequently used dynamic marking in Beethoven,
and how does Beethoven's practice compare with that of Brahms?
* Are flats more common than sharps in Monteverdi?
* Did Bartok's preferred articulation marks change over his
lifetime?
* Is there a tendency to use the subdominant pitch less often in
pop melodies than in (say) French chanson?
* How frequent are light-related words such as "lumen" or "lumine"
in the different monastic offices for Thomas of Canterbury?
* Is it true that 90 percent of the notes in a given work by Bach
use just two durations (such as eighths and sixteenths, or
eighths and quarters)?
* What is the most common instrumental combination for sonorities
by Musorgsky?
[N.B. Humdrum commands to answer the above questions are
given at the end of this tutorial.]
The above questions are all variations of one of the following forms:
How many different types of _____ are there?
What is the most/least common _____?
What is the frequency of occurrence for various _____s?
In some cases, we're asked to compare two or more repertoires when
answering one of these basic questions.
For illustration purposes, consider the case of a Humdrum file named "foo"
containing the following simple input:
**foo
A
B
A
A
C
B
*-
Remember that any line beginning with an asterisk is a Humdrum
*interpretation*. Any interpretation beginning with two asterisks
indicates the TYPE of data being represented -- in this case a fictitious
data-type called "foo". The final line (*-) is a spine-path terminator
that simply identifies the end of the data. All other lines here are
data records.
It doesn't matter what the data represent. The "A", "B", and "C" might
signify different articulation marks, chords, harmonic intervals, or
instrumental configurations. Whatever is represented, the process of
generating an inventory is the same. Ultimately, we'd like to produce
a simple distribution that indicates:
3 occurrences of "A"
2 occurrences of "B"
1 occurrence of "C"
Building an inventory is a three-step process. First we need to *filter*
the input so only the data of interest is present. Second we need to
*sort* like-with-like. And third we need to *count* the number of
occurrences of each type of data token.
Let's begin by discussing the second process. The UNIX "sort" command will
rearrange lines of data so that they are in alphabetical/numerical order.
The command:
sort foo > sorted.foo
will sort the file "foo" and place the results in a file named "sorted.foo."
The file "sorted.foo" will contain the following:
**foo
*-
A
A
A
B
B
C
Notice that the asterisk is treated as alphabetically prior to the
letter `A', so all the Humdrum interpretation records have been
moved to the beginning of the output. Notice also that all of the lines
beginning with the letter `A' are now collected on successive lines in
the output. Similarly, the `B's have been rearranged on successive lines.
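Where the asterisk lines land depends on the collating order that "sort" is using, and on some systems the locale setting can change it. The following is a minimal sketch, assuming a POSIX shell; it rebuilds the sample "foo" file in a scratch directory and sets LC_ALL=C to request plain byte-order collation, under which "*" precedes "A".

```shell
# Build the sample "foo" file from the tutorial in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '**foo' A B A A C B '*-' > "$workdir/foo"

# LC_ALL=C requests plain byte-order collation, in which "*" precedes "A".
sorted=$(LC_ALL=C sort "$workdir/foo")
printf '%s\n' "$sorted"

rm -rf "$workdir"
```

If your own output places the interpretations somewhere else, the locale is the first thing to check.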
The third step in generating an inventory is to count the number of
occurrences of each unique data token. The UNIX "uniq" command will
eliminate successive duplicate lines. For example, if we type:
uniq sorted.foo
The output will be as follows:
**foo
*-
A
B
C
Notice that repetitions of the data "A" and "B" have disappeared.
The simple "uniq" command is useful for telling us "how many different
things" there are in an input. For example, the above output identifies
just five different records.
If we wish, we can have "uniq" also identify the number of instances
of each type of record. The "-c" option for "uniq" will cause a `count'
to be prepended to each output line. If we type:
uniq -c sorted.foo > unique.foo
The output will be as follows:
1 **foo
1 *-
3 A
2 B
1 C
The prepended counts tell us that `A' has three instances, `B' occurred
twice, and all other records occurred just once.
In the above output, **foo and *- are Humdrum interpretations rather
than data, so we probably don't want them to appear in our inventory.
If our file had contained comments, or null data records, these would
have also appeared in our output, although we are not likely to be
interested in them. This leads us to the first step in generating
an inventory -- *filtering* the input in order to eliminate records
that we'd prefer to omit from our final output.
The Humdrum "rid" command can be used to eliminate various classes
of Humdrum records. The "rid" command provides a number of options,
and each option eliminates a different class of records. Here are
the record classes with their associated options:
-G eliminate all global comments
-g eliminate only global comments that are empty
-L eliminate all local comments
-l eliminate only local comments that are empty
-I eliminate all interpretations
-i eliminate only null interpretations
-D eliminate all data records
-d eliminate only null data records
If you don't know the difference between these types of Humdrum records,
it's advisable to read Section I (General Introduction) in the Humdrum
Reference Manual.
Returning to our **foo output --
1 **foo
1 *-
3 A
2 B
1 C
-- we have little interest in the interpretations (**foo and *-).
So, before sorting our original file, we could use the "rid" command
with the -I option, to eliminate all interpretations. It's also
common to want to eliminate comments and null data records from an
inventory, so a frequent invocation is as follows:
rid -GLId foo > filtered.foo
:::::::
REPRISE
:::::::
So let's summarize what we've done so far. Generating an inventory is
really a three-step process. First we *filter* the input so only the
data of interest is present. Typically, this means using the "rid"
command with one or more options to eliminate comments, interpretations,
and perhaps null data records. Second we need to *sort* the data using
the "sort" command so that identical records are amalgamated as
neighbors. Finally, we can use "uniq -c" to *count* the number of
occurrences of each type of data token. By way of summary, the command
sequence is:
rid -GLId foo > filtered.foo (step #1)
sort filtered.foo > sorted.foo (step #2)
uniq -c sorted.foo > inventory.foo (step #3)
On UNIX, a set of commands that sequentially process a given input
can be joined together as a "pipeline". A pipeline feeds the output
of one process to the input of another process. This means that we
can simplify the above sequence of commands into a single pipeline --
and avoid generating intermediate files:
rid -GLId foo | sort | uniq -c > inventory.foo
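For readers who want to try the filter-sort-count pipeline before Humdrum is installed, here is a rough sketch that uses only standard UNIX tools. The "humfilter" function is a hypothetical stand-in for "rid -GLId", not part of Humdrum: it assumes that comments begin with "!", interpretations begin with "*", and null data records contain nothing but "." tokens. The sample file adds one comment and one null record to the "foo" data so that the filter has something to remove.

```shell
# Stand-in for "rid -GLId" (an approximation, not the Humdrum tool itself):
# drop comments (lines starting with "!"), interpretations (lines starting
# with "*"), and null data records (lines holding only "." tokens).
humfilter() {
    grep -E -v '^([!*]|\.([[:space:]]+\.)*$)' "$@"
}

workdir=$(mktemp -d)
printf '%s\n' '!! sample file' '**foo' A B A . A C B '*-' > "$workdir/foo"

# filter | sort | count; awk only normalizes the spacing that "uniq -c" emits.
inventory=$(humfilter "$workdir/foo" | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$inventory"

rm -rf "$workdir"
```

The result should list three occurrences of "A", two of "B", and one of "C", matching the distribution we set out to produce.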
Notice that the inventory will pertain to whatever data was provided
in the original input. We've been using the abstract data "A", "B",
and "C". However, this data might represent any type of discrete
data, such as melodic intervals, articulation marks, instrumentation
indications, or Latin text.
Once you have generated the inventory, use your favorite "spread-sheet"
or "graphics" package to format or display the results.
:::::::::::
VARIATION 1
:::::::::::
In the above example, we assumed that the input consists of a single
Humdrum spine (i.e. a single column of data). However, Humdrum files
can have any number of spines, and each spine might represent radically
different types of data. For example, the following file (named
"foobar") contains two spines, one with "foo" data, and the second
with "bar" data. These data types might represent melodic intervals
and fingering information, or dynamic markings and stem-directions,
or whatever.
**foo **bar
A G
B G
A G
A A
C G
B G
*- *-
Notice that the letter `A' might signify something very different
in the "bar" representation than the same letter in the "foo"
representation.
If we apply our above inventory-generating commands for the
file "foobar", the result will be as follows:
1 A A
2 A G
2 B G
1 C G
Notice that the inventory is based on *entire records* containing
both "foo" and "bar" data. This is the reason why the foo-bar
data-pair "A A" is considered different from foo-bar data "A G".
Depending on the user's goal, this may or may not be the most
appropriate output.
A situation where this approach might be desired is when we are
counting the number of different spellings of chords (e.g., how
many different sonorous arrangements are there?). If "**foo" and
"**bar" represent pitches in two concurrent voices, then it may be
important to have both concurrent data tokens participating in
the inventory.
In other circumstances, we may not want this. For example, if
we are interested only in foo-related data, we need to eliminate
the irrelevant "**bar" data so it won't interfere. This is
easily done using the Humdrum "extract" command.
The "extract" command is quite powerful -- and deserves a tutorial
in its own right. But let's just note that the "-i" option for
"extract" allows you to specify the interpretation(s) of interest.
If we type the following command:
extract -i '**foo' foobar > foo
-- the resulting "foo" file will contain just the **foo data:
**foo
A
B
A
A
C
B
*-
Having eliminated the **bar data, we can then proceed as before
to generate our inventory. In fact, we can include the "extract"
command as part of our pipeline:
extract -i '**foo' foobar | rid -GLId | sort | uniq -c > inventory.foo
Alternatively, if we wanted to generate an inventory consisting
just of "**bar" data, we need make only a slight alteration:
extract -i '**bar' foobar | rid -GLId | sort | uniq -c > inventory.bar
This will extract the **bar data, and use that data as the basis to
build an inventory. Incidentally, given the "foobar" file as input,
the above "inventory.bar" file will be:
1 A
5 G
-- meaning 5 occurrences of the data "G" and 1 occurrence of "A".
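The same **bar inventory can be approximated with standard tools alone. This sketch rests on two assumptions: that the spines are separated by tab characters, and that the file contains no spine-path changes, so that a spine is always the same column. Note that "cut" selects a column by position, whereas "extract -i" selects by interpretation, so this is an illustration rather than a replacement.

```shell
workdir=$(mktemp -d)
# The two-spine "foobar" sample; spines are separated by a tab character.
printf '%s\t%s\n' '**foo' '**bar' A G B G A G A A C G B G '*-' '*-' > "$workdir/foobar"

# Keep only column 2, drop interpretation lines, then sort and count.
bar_inventory=$(cut -f2 "$workdir/foobar" | grep -v '^\*' |
    LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$bar_inventory"

rm -rf "$workdir"
```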
Finally, consider the situation where we want both "foo" and "bar"
data to participate in our inventory, but we want the data treated
independently, rather than as concurrent data pairs. For example,
imagine that both "foo" and "bar" encode dynamic markings. (It
might be that "foo" encodes dynamic markings above the staff --
so-called "overlay" -- while "bar" encodes dynamic markings below
the staff -- so-called "underlay"). We might not care where the
dynamic markings are located, we simply would like to create an
inventory of *all* dynamic markings.
In this case, we will need to use "extract" twice so that each
spine is placed in a separate file:
extract -i '**foo' foobar > justfoo
extract -i '**bar' foobar > justbar
Now we can concatenate the two files so that they form a single
column of data. Amalgamating files end-to-end can be done using
the UNIX "cat" command:
cat justfoo justbar > foobar.cat
The file "foobar.cat" will look like this:
**foo
A
B
A
A
C
B
*-
**bar
G
G
G
A
G
G
*-
Now that each data token of interest is on its own line, we can
generate the appropriate inventory. The command:
rid -GLId foobar.cat | sort | uniq -c
-- will result in the following combined inventory:
4 A
2 B
1 C
5 G
Notice that our use of "extract" is part of the process of *filtering*
our initial data so that the inventory is based on the data of
interest to us.
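The intermediate files can also be avoided here. In a POSIX shell, braces group several commands so that their outputs flow, one after the other, into a single pipeline. The sketch below shows the idea with "cut" and "grep" standing in for "extract" and "rid" (again assuming tab-separated spines and no spine-path changes); with Humdrum installed, the two "extract" commands would take the place of the two "cut" commands.

```shell
workdir=$(mktemp -d)
printf '%s\t%s\n' '**foo' '**bar' A G B G A G A A C G B G '*-' '*-' > "$workdir/foobar"

# The braces group two commands so their outputs flow, one after the other,
# into a single pipeline; "cut" stands in for the two "extract" calls.
combined=$({ cut -f1 "$workdir/foobar"; cut -f2 "$workdir/foobar"; } |
    grep -v '^\*' | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$combined"

rm -rf "$workdir"
```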
:::::::::::
VARIATION 2
:::::::::::
For short inventory lists, it is easy to identify which records are
the most common and which records are the least common. For longer
inventory lists, it may be more difficult to scan through the output
to find the most frequent or least frequent occurrences. For such
long outputs, it might be more convenient to produce an output sorted
according to frequency of occurrence. Notice that each output record
from "uniq -c" begins with a number, and so the output is ideally
suited for numerical sorting. We've already learned that the UNIX
"sort" command rearranges input records in alphabetic/numeric order.
If we type
sort inventory.foo
The output will be as follows:
1 C
2 B
3 A
Now the output is sorted so that the least frequent occurrences are
at the beginning, and the most frequent occurrences are at the end
of the output. Incidentally, "sort" has a "-r" option that causes
the output to be sorted in reverse order. If we use "sort -r", then
the most common occurrences will be placed at the beginning of the
output:
sort -r inventory.foo
produces the following output:
3 A
2 B
1 C
Once again, we can amalgamate all of the required commands into a
single UNIX pipeline. The following pipeline produces an inventory
for any type of Humdrum input, sorted from the most common to the
least common data:
rid -GLId foo | sort | uniq -c | sort -r > inventory.foo
Incidentally, if we place the above command line in a file called
"inventory", and if we replace the filename "foo" by the shell
variable "$1", then we could create a new command that would
automatically generate a data inventory. The syntax would be:
inventory <filename>
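One possible shape for such a command is sketched below as a shell function; the same lines could equally be saved in an executable file. So that it runs without Humdrum, the first stage is a hypothetical "grep" stand-in for "rid -GLId" (comments and interpretations are assumed to begin with "!" or "*", and null records to contain only "." tokens). The final stage uses "sort -rn": the "-n" option asks for numerical comparison of the counts, which guards against an unpadded "10" being placed before "9" by a character-by-character comparison.

```shell
# Hypothetical "inventory" command written as a shell function.
# With Humdrum installed, the first stage would be:  rid -GLId "$1"
inventory() {
    grep -E -v '^([!*]|\.([[:space:]]+\.)*$)' "$1" |   # stand-in filter
        LC_ALL=C sort | uniq -c |
        LC_ALL=C sort -rn                               # highest count first
}

# Try it on the tutorial's sample data.
workdir=$(mktemp -d)
printf '%s\n' '**foo' A B A A C B '*-' > "$workdir/foo"
result=$(inventory "$workdir/foo" | awk '{print $1, $2}')
printf '%s\n' "$result"
rm -rf "$workdir"
```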
:::::::::::
VARIATION 3
:::::::::::
In other circumstances, it may be helpful to determine the proportion
or percentage values rather than the actual numerical count. This can
be calculated by dividing each of the inventory count numbers by the
total number of data records processed. First, we can establish the
total number of data records using the following pipeline:
rid -GLId foo | wc -l
The UNIX "wc" (word count) command counts the number of lines, words,
and characters in an input. With the "-l" option, only the number of
lines is output. This will give us the total number of elements.
Simple division will generate the percentages.
Suppose the total number of data records was determined to be 874.
If you are familiar with the UNIX "awk" command, you could easily
generate the percentages for each data type via the command:
awk '{print $1/874*100 "\t" $2}' inventory.foo
This will create a two-column output. The first column will
indicate the percentage of occurrence, and the second column
will identify the corresponding type of data.
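Typing the total by hand invites slips, so it may be safer to let "awk" add up the counts itself. The sum of the counts in the inventory is the same as the number of data records that went into it. The following is a sketch using the counts from the "foo" example; the array names are simply illustrative.

```shell
# Counts as produced by "uniq -c" for the tutorial's sample data.
inventory_text=$(printf '%s\n' '3 A' '2 B' '1 C')

# Main rule: remember each count and token, and accumulate the total.
# END rule: print the percentage, a tab, then the data token.
percentages=$(printf '%s\n' "$inventory_text" | awk '
    { count[NR] = $1; token[NR] = $2; total += $1 }
    END {
        for (i = 1; i <= NR; i++)
            printf "%.1f\t%s\n", count[i] / total * 100, token[i]
    }')
printf '%s\n' "$percentages"
```

For the sample data this gives 50.0 for "A", 33.3 for "B", and 16.7 for "C".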
:::::::::::
VARIATION 4
:::::::::::
The "uniq" command provides two other options (besides the -c option),
that are occasionally useful. The "-d" option causes "uniq" to output
ONLY those records that are duplicated. In other words, records that
occur only once are eliminated from the output. This option can be
useful when there are a lot of single-occurrence data tokens and you
are only interested in those data records that occur more frequently.
Another useful option for "uniq" is the "-u" option. This causes ONLY
those records that are unique (occur only once) to be output. This
option can be useful when you are looking for rare circumstances in
your data.
rid -GLId foo | sort | uniq -uc (output only the rare events)
rid -GLId foo | sort | uniq -dc (eliminate all the rare events)
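The effect of the two options is easy to see on the sorted "foo" tokens. The sketch below also shows one way to generalize the idea to an arbitrary threshold by testing the count with "awk"; the threshold of 3 is just an example value.

```shell
sorted_tokens=$(printf '%s\n' A A A B B C)

# -u: keep only lines that occur exactly once.
rare=$(printf '%s\n' "$sorted_tokens" | uniq -u)
# -d: keep one copy of each line that occurs more than once.
repeated=$(printf '%s\n' "$sorted_tokens" | uniq -d)
# Arbitrary threshold: tokens that occur at least 3 times.
frequent=$(printf '%s\n' "$sorted_tokens" | uniq -c | awk '$1 >= 3 { print $2 }')

printf 'rare:\n%s\n' "$rare"
printf 'repeated:\n%s\n' "$repeated"
printf 'at least 3 times:\n%s\n' "$frequent"
```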
:::::::::::
VARIATION 5
:::::::::::
Notice that two data records must be identical in order for them
to be considered "the same" by "sort" and "uniq". This means that
records such as the following are considered entirely different:
ABC
abc
Abc
"ABC"
ABC.
CBA
Remember that step #1 in generating inventories requires that you
filter the data so only the data of interest is passed to "sort"
and "uniq". This means you must be careful about the state of the
input. Depending on your goal, you will either want to TRANSLATE
the input to some other more appropriate representation, or you may
EDIT the existing representation in order to discard or transform
otherwise confounding data.
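As an illustration of such editing, the sketch below folds the variants listed above to lower case and strips everything that is not a letter or digit before counting. Whether this is appropriate depends entirely on the representation: in **kern, for example, upper and lower case letters carry different meanings, so folding case there would destroy information.

```shell
variants=$(printf '%s\n' 'ABC' 'abc' 'Abc' '"ABC"' 'ABC.' 'CBA')

# Fold to lower case, then delete everything that is not a letter or digit.
normalized=$(printf '%s\n' "$variants" |
    tr '[:upper:]' '[:lower:]' |
    sed 's/[^[:alnum:]]//g' |
    LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$normalized"
```

Five of the six records now count as "abc"; "cba" remains distinct because the order of characters still matters.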
TRANSLATING your data involves changing from one type of information
to another -- that is, changing the exclusive interpretations. For
example, if you want to produce an inventory of melodic intervals,
then you might need to translate a **pitch representation to a melodic
interval representation (such as **mint). Or if you would like to
produce an inventory of scale degrees, you might need to translate
a **pitch representation to a scale-degree representation (such as
"**deg" or "**degree"). The number of ways in which your data can be
translated is far too large to discuss here. But later tutorials will
introduce many appropriate Humdrum translations.
For now, we will introduce one of the most useful types of data
filtering -- EDITING with the Humdrum stream-editor ("humsed").
The stream-editor will not change the type of exclusive interpretation
in your data. It will merely provide you with ways of automatically
editing your data so that certain types of information are eliminated,
replaced, or otherwise transformed.
The "humsed" stream-editor is fashioned along the lines of the
UNIX "sed" stream-editor. A stream-editor is like an ordinary text
editor; it will allow you to insert, delete, replace, or otherwise
manipulate the data in a file. Stream-editors differ from ordinary
editors in that they carry out the editing operations *automatically*
rather than having you open the document and edit the text manually.
Humsed provides a great range of capabilities, but we will limit our
discussion to a single process. Consider the following simple "humsed"
command:
humsed 's/"//' foo
The material in single quotes ('s/"//') is an editing command that
is passed to the humsed stream-editor. In this case, the letter "s"
identifies a *substitution* command. The ensuing slashes (/) are
used to delineate two character strings. The first character string
is the string that is to be replaced -- in this case the double-quote
character ("). The second character string is the replacement string
-- in this case nothing. In other words, this command will cause the
first double-quote character on each line to be eliminated.
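With any sed-style substitution it is worth checking how many matches per line are replaced. The sketch below uses the standard UNIX "sed" rather than "humsed", on the assumption that the two treat the "g" (global) flag alike; the commands at the end of this tutorial do use "humsed" with that flag.

```shell
record='"ABC"'

# Without the "g" flag only the first match on each line is replaced.
first_only=$(printf '%s\n' "$record" | sed 's/"//')
# With "g", every match on the line is replaced.
all_quotes=$(printf '%s\n' "$record" | sed 's/"//g')

printf '%s\n%s\n' "$first_only" "$all_quotes"
```

The first result still carries the trailing quote; the second is the bare ABC.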
How might we use the "humsed" substitute command? Suppose we had a
file (named "notes") consisting of pitch information, and we wanted
to create an inventory of the diatonic pitch-letter names. Our input
might look like this:
**notes
A
B
B
D
F#
D#
E
*-
Without modification, our inventory would appear as follows:
1 A
2 B
1 D
1 D#
1 E
1 F#
But this inventory distinguishes D-sharp from D-natural -- which
is not what we want. The answer is to filter our input so that
the sharps are removed.
Adding the appropriate "humsed" command to our pipe:
humsed 's/#//' notes | rid -GLId | sort | uniq -c
-- will produce the following output:
1 A
2 B
2 D
1 E
1 F
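The same result can be reproduced with standard tools, which also shows why some care is needed. Plain "sed" knows nothing about Humdrum record types, so this sketch adds an address that limits the substitution to lines that do not begin with "!" or "*". The comment line in the sample file is hypothetical, added only to show that its "#" survives.

```shell
workdir=$(mktemp -d)
printf '%s\n' '!! C# minor excerpt' '**notes' A B B D 'F#' 'D#' E '*-' > "$workdir/notes"

# "/^[!*]/!" limits the substitution to lines that are not comments or
# interpretations, so a "#" inside a comment is left alone.
edited=$(sed '/^[!*]/!s/#//g' "$workdir/notes")
letters=$(printf '%s\n' "$edited" | grep -v '^[!*]' |
    LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$letters"

rm -rf "$workdir"
```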
::::
CODA
::::
Having read through the above tutorial on generating inventories,
see if you can understand how the commands given below can be used
to solve the question posed:
Does Liszt use a greater variety of harmonies than Chopin?
extract -i '**harm' liszt* | rid -GLId | sort | uniq | wc -l
extract -i '**harm' chopin* | rid -GLId | sort | uniq | wc -l
What is the most frequently used dynamic marking in Beethoven,
and how does Beethoven's practice compare with that of Brahms?
extract -i "**dynam" beeth* | rid -GLId | sort | uniq -c | sort -r | head -1
extract -i "**dynam" brahm* | rid -GLId | sort | uniq -c | sort -r | head -1
[You might want to refer to the UNIX documentation for the
"head" command.]
Are flats more common than sharps in Monteverdi?
humsed 's/[^#-]//g' montev* | rid -GLId | sort | uniq -c
[This assumes monophonic **kern inputs.]
Did Bartok's preferred articulation marks change over his lifetime?
extract -i '**kern' early | humsed 's/[^"`~^:I]//g' | rid -GLId | sort | uniq -c
extract -i '**kern' late | humsed 's/[^"`~^:I]//g' | rid -GLId | sort | uniq -c
[This assumes that copies of early and late works have been
concatenated to the files "early" and "late."]
[See Section 2 of the Humdrum Reference Manual for details on
articulation marks for the **kern representation.]
Is there a tendency to use the subdominant pitch less often in pop
melodies than in (say) French chanson?
deg -xt pop* | grep -c '4'
deg -xt chanson* | grep -c '4'
[This assumes that the inputs are monophonic.]
How frequent are light-related words such as "lumen" or "lumine"
in the different monastic offices for Thomas of Canterbury?
extract -i '**words' office* | egrep -ic 'lum.+n[e]*$'
[This is the fast way. Familiarity with regular expressions helps.]
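Before trusting a count from a regular expression, it can help to see what the pattern actually matches. The word list below is hypothetical, and "grep -E" is used as the equivalent of "egrep". Two things are worth noticing: the pattern is not anchored at the start, so it also matches inside a longer word, and the "-c" option counts matching lines rather than individual words.

```shell
words=$(printf '%s\n' lumen Lumine lumina volumen lux)

# -E: extended regular expressions (what "egrep" provides); -i: ignore case.
matches=$(printf '%s\n' "$words" | grep -E -i 'lum.+n[e]*$')
printf '%s\n' "$matches"
```

Here "lumen", "Lumine", and "volumen" match, while "lumina" and "lux" do not. If matches such as "volumen" are unwanted, the pattern would need to be tightened to suit the layout of your text data.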
Is it true that 90 percent of the notes in a given work by Bach
use just two durations (such as eighths and sixteenths, or
eighths and quarters)?
humsed 's/[^0-9.]//g' bach | rid -GLId | sort | uniq -c
[Repeat the above command for each work and inspect the results.]
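Inspecting the results can itself be automated. The sketch below takes a hypothetical duration inventory (the counts are invented for illustration), sorts it with the highest counts first, and reports what percentage of all notes is accounted for by the two most common durations.

```shell
# Hypothetical duration inventory (count, then duration value).
durations=$(printf '%s\n' '120 8' '95 16' '14 4' '6 4.' '3 2')

# Sort by count, highest first, then compare the two largest counts
# with the grand total.
share=$(printf '%s\n' "$durations" | sort -rn |
    awk '{ total += $1; if (NR <= 2) top += $1 }
         END { printf "%.1f\n", top / total * 100 }')
printf '%s\n' "$share"
```

For these invented counts the two most common durations cover 215 of 238 notes, or 90.3 percent.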
What is the most common instrumental combination for sonorities
by Musorgsky?
This problem is slightly more complicated and so will be
deferred to a future tutorial.
[End of Tutorial]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
::::::::::
RETRIEVING A Copy of the Humdrum Toolkit Via FTP
::::::::::
The letters "ftp" stand for "file transfer protocol" -- a popular
way of retrieving information from the Internet.
If you have access to e-mail, you probably have immediate access
to FTP.
Try typing:
ftp <ENTER>
If you receive the following prompt, you are all set to go:
ftp>
First, you need to connect to the Humdrum archive site at the
University of Waterloo. Type:
ftp> open archive.uwaterloo.ca <ENTER>
If the connection is successful, you will be asked to login:
Name (archive.uwaterloo.ca):
Use the special login name "anonymous":
Name (archive.uwaterloo.ca): anonymous <ENTER>
You will then be asked for a password. Type your e-mail address:
Password: your_id@machine.place.etc
Change directories to the Humdrum archive:
ftp> cd uw-data/humdrum
To view the contents of the Humdrum FTP directory, type:
ftp> ls
Begin by downloading two "readme" files using the FTP "get" command:
ftp> get readme
ftp> get readme.2nd
These files will then be transferred to your home account.
You might also want to download two additional files which give
further information about Humdrum:
ftp> get faq
ftp> get install.txt
The "faq" file is a file containing "Frequently Asked Questions"
(and their answers) concerning Humdrum. The "install.txt" will
give you some advanced information about the circumstances
required for installing Humdrum on various machines and operating
systems.
If at any time you run into problems, type the FTP "help" command.
This will list all of the available FTP commands:
ftp> help
If you want to know more about a particular command, type "help"
followed by the command name, e.g.
ftp> help get
End your FTP session by typing:
ftp> close
followed by:
ftp> bye
Take the time to carefully read the files "readme" and "readme.2nd."
Also have a look at the "faq" and "install.txt" files. These files
explain in detail the capabilities and requirements of Humdrum, and
so will help you make up your mind whether you really want to download
and install the entire Humdrum Toolkit.
If you decide that you want to receive the entire Humdrum package,
invoke ftp again, and return to the archive site.
Set the transfer mode to binary by typing:
ftp> binary
If you are transferring to a hard disk, simply type the command:
ftp> mget *
(The transfer process will take some time.)
If you are transferring directly to floppy disk, three 1.4
megabyte disks will be required. Transfer the files "hum.2"
and "hum.3" to your `second' and `third' disks, respectively.
(insert disk #2:)
ftp> get hum.2
(insert disk #3:)
ftp> get hum.3
(insert disk #1:)
ftp> get hum.1
ftp> get licence
ftp> mget install.*
ftp> mget hum*.ksh
ftp> get humunix
Close the ftp connection, and print out a copy of the installation
guide -- either "install.txt" (simple ASCII text) or "install.ps"
(a postscript-format file).
When installing Humdrum, don't forget to read the licensing agreement.
Although Humdrum is free, you must register your copy.
::::
NOTE
::::
Please note: The Humdrum FTP files can be downloaded to any
machine. However, UNIX or UNIX utilities must be present on the
machine of final destination before the Humdrum Toolkit can be
*installed*. For non-UNIX users, use of Humdrum may require the
purchase of commercial UNIX utilities for DOS or OS/2. Detailed
technical information is available in the Humdrum Installation Guide.
::::::::::::::::::::::::::::::
What's in the Humdrum Toolkit?
::::::::::::::::::::::::::::::
The FTP distribution includes a complete copy of the Humdrum
Toolkit software, an Installation Guide, demonstration software,
an electronic edition of the 48 fugues from J.S. Bach's Well-
Tempered Clavier, as well as a selection of one hundred additional
scores. In addition, a 550-page Humdrum Reference Manual is
available in postscript form, and may be printed locally on any
postscript printer.
[End of HUMDRUM NEWS]