The goal of statistical analysis is to estimate the probability of truth or falsity for some statement.

Estimating the likelihood of some statement is common in all areas of scholarship. For example, historical musicologists have debated the circumstances surrounding Tchaikovsky's death. Did Tchaikovsky commit suicide? Which of the following conclusions is best?

- Tchaikovsky almost certainly committed suicide.
- Tchaikovsky very likely committed suicide.
- Tchaikovsky probably committed suicide.
- Tchaikovsky might have committed suicide.
- Tchaikovsky probably didn't commit suicide.
- Tchaikovsky very likely didn't commit suicide.
- Tchaikovsky almost certainly didn't commit suicide.

Notice first that all of these statements are *probabilistic*.
No one is declaring that "Tchaikovsky committed suicide"
or that "Tchaikovsky didn't commit suicide."

At the moment, most scholars would probably say that Tchaikovsky "very likely" committed suicide. But some scholars think this language is too strong, and would prefer to say that Tchaikovsky "probably" committed suicide. Notice that the debate hinges on what is an appropriate adjective.

The choice of adjectives can get very testy. For example, a scholar might assert in professional meetings that Tchaikovsky "probably" committed suicide, but then later publish an article that says Tchaikovsky "very likely" committed suicide. This change in language might well cause consternation for knowledgeable scholars who regarded the first statement as prudent but regard the second statement as exaggerating the strength of the evidence.

A statistical analysis typically ends with the calculation
of a value called *p* (for probability)
which ranges between 0 and 1.
A value near 0 means that the probability of the
statement's being true is very low;
a value near 1 means that the probability is high.

One virtue of calculating probabilities is that it can bring an end to constant bickering about appropriate adjectives. In essence, statistics formalizes intuitions and helps to exclude the influence of our personal biases when presenting information.

In effect, the statistical musicologist endeavors to replace statements like those on the left with statements like those on the right:

| Verbal statement | Probabilistic statement |
| --- | --- |
| Tchaikovsky almost certainly committed suicide. | Tchaikovsky committed suicide. (*p* = .99) |
| Tchaikovsky very likely committed suicide. | Tchaikovsky committed suicide. (*p* = .80) |
| Tchaikovsky probably committed suicide. | Tchaikovsky committed suicide. (*p* = .70) |
| Tchaikovsky might have committed suicide. | Tchaikovsky committed suicide. (*p* = .50) |
| Tchaikovsky probably didn't commit suicide. | Tchaikovsky committed suicide. (*p* = .30) |
| Tchaikovsky very likely didn't commit suicide. | Tchaikovsky committed suicide. (*p* = .20) |
| Tchaikovsky almost certainly didn't commit suicide. | Tchaikovsky committed suicide. (*p* = .01) |
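The correspondence between adjectives and probabilities might be sketched as a simple lookup. The cut-off values below are just the illustrative figures paired with each statement above, not standard conventions:

```python
# Map a probability p to a verbal characterization, using the
# illustrative values paired with each statement above.
# These thresholds are for exposition only, not standard conventions.
def verbal_label(p):
    if p >= 0.99:
        return "almost certainly true"
    if p >= 0.80:
        return "very likely true"
    if p >= 0.70:
        return "probably true"
    if p >= 0.50:
        return "might be true"
    if p >= 0.30:
        return "probably not true"
    if p >= 0.20:
        return "very likely not true"
    return "almost certainly not true"

print(verbal_label(0.80))  # very likely true
```

Whatever specific thresholds one adopts, the point is that a number replaces the contested adjective.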

Calculating the value of *p* can be done only
if the evidence can be expressed quantitatively.
For many historical and other questions, it may not
be possible to characterize the evidence quantitatively.
But it is often the case that, with a little ingenuity,
it is easy to formulate the evidence in a quantitative manner.

Quantitative evidence is referred to as "data".

There are a surprising number of claims about music that can be tested using statistical methods. The kinds of statements that might be evaluated by statistical means are suggested by the following:

- There are a lot of dynamic changes in Beethoven's music.
- Brahms tends to use a lot of hemiolas.
- Gershwin avoided syncopation in his melodies.
- Dowland's lute music shows little influence of the instrument.

Statisticians have devised innumerable procedures for
calculating *p* depending on the type of data,
the quality of the data, the volume of data,
and the nature of the claim being examined.
Each evaluative procedure is called a statistical
*test*.

There are many kinds of statistical tests.
These tests are often identified by the names of their
inventors (*e.g.,* Hartley's F-test,
the Wilcoxon signed-ranks test, the Kolmogorov-Smirnov test,
and the Kaiser-Meyer-Olkin test).

Contemporary statistical methodology follows a modified form of Popperian logic: theories cannot be proven, but they can be tested. Contemporary statistical method also adheres to the logic of falsification.

The researcher formulates some hypothesis (e.g. All (or most)
*X* have the property *Y*).
The researcher cannot prove statements, only falsify them.
If the statement is, in reality, true, then attempts to
falsify it will fail.
Conversely, if the statement is, in reality, true,
then it should be possible to falsify the
*reverse* of the hypothesis.

The reverse hypothesis would be:
All (or most) *X* **do not** have the property *Y*.
The researcher's task is to attempt to *falsify*
this "reverse hypothesis".

In practice, the researcher is not interested in showing that the experimental hypothesis has a high probability of being true. Rather, the researcher attempts to show that the null version of the hypothesis has a low probability of being true.

*H*: Funeral marches tend to be in minor keys.

*H0*: Funeral marches tend not to be in minor keys.

*Operational Definition 1*: A minor key is
defined as a lowercase letter in Grove's repertory list.

*Operational Definition 2*: A funeral march
is any work containing the word "funeral" in the title
(in English, French, German, Italian, or Spanish).

*Population*: All funeral marches.

*Control*: All works.

*Sample*: All "funeral" works listed in a randomly
selected volume of Grove's (Vol. 17):
seven funeral marches.

*Control Probability*: Use the control group to establish
the proportion of all works that are in minor keys.
For the sake of simplicity, suppose that 50% of all
works are in minor keys.

*Measurement*:
All seven are in minor keys.

*Sample Probability*: 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.0078125
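The arithmetic above is simply the control proportion multiplied by itself once per sampled work. A quick sketch (the variable names are ours):

```python
# Probability, under the null hypothesis, of observing seven
# minor-key works in a row, given that 50% of all works are in
# minor keys (the control proportion assumed above).
control_proportion_minor = 0.5
sample_size = 7

p = control_proportion_minor ** sample_size
print(p)  # 0.0078125
```

In other words, if funeral marches were no more likely than other works to be in minor keys, a run of seven minor-key funeral marches would occur less than 1% of the time, so the null hypothesis looks implausible.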

One of the most common statistical tests is the chi-square test.
This test takes its name from the Greek letter χ
("chi," pronounced "kai") and is used to compare proportions
or ratios.
It facilitates the determination of whether the proportion of
occurrences of some feature in one data sample is significantly
greater than or less than the proportion of the same feature
in a different data sample.

By way of example, let us test the hypothesis that Brahms uses hemiolas more than his contemporaries do. To set up the test, we first need to count the number of hemiolas in (a) a sample of Brahms's music and (b) a sample of music by other composers who were active at the same time. Let us suppose that the raw data look like this:

| Composer(s) | Measures searched | Hemiolas found |
| --- | --- | --- |
| Various | 5000 | 23 |
| Brahms | 500 | 9 |

Proportionally, hemiolas occurred in 1.8% of the measures for Brahms (9 out of 500) but in just 0.46% of the measures by other composers (23 out of 5000). Is this a big difference? A chi-square test will tell us.
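As a sketch, here is a pure-Python chi-square computation for the 2×2 table implied by these hypothetical counts. It uses the standard shortcut formula for 2×2 tables and treats each measure as an independent observation (a simplifying assumption):

```python
# Chi-square test of the hemiola counts above.
# Rows: Brahms vs. other composers; columns: hemiola vs. no hemiola.

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

brahms_hemiolas, brahms_other = 9, 500 - 9        # 9 of 500 measures
various_hemiolas, various_other = 23, 5000 - 23   # 23 of 5000 measures

chi2 = chi_square_2x2(brahms_hemiolas, brahms_other,
                      various_hemiolas, various_other)
print(round(chi2, 2))  # 14.11
```

The resulting statistic (about 14.1) is well above 3.84, the critical value for *p* = .05 with one degree of freedom, so the difference in proportions would be judged statistically significant. (A continuity correction is often recommended for 2×2 tables; it lowers the statistic somewhat but would not change the conclusion here.)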