Chapter 1: The nature of sound and the structure and function of the auditory system

Psychoacoustics, or auditory psychophysics, attempts to specify the relationships between the physical characteristics of sounds that enter the ear and the sensations that they produce.

Emphasis on underlying mechanisms at various levels of explanation.
Must first know something about the physical nature of sound and basic anatomy and physiology of the auditory system.

The physical characteristics of sounds.

The nature of sound.

Sound originates from the motion or vibration of an object that is imposed upon the surrounding medium as a pattern of changes in pressure.
The sound wave is propagated in all directions from the vibrating object by transmission of the vibration (condensation and rarefaction) of individual molecules along the axis of propagation through the surrounding medium.
Such a longitudinal wave weakens with distance from the object and is subject to reflections and refractions caused by objects in its path.
The simplest type of such vibration mathematically, physically and auditorily is the sine wave that can be modeled by a pendulum, a spring, or a tuning fork.

Plotting pressure variations against time to obtain a waveform for the sine wave yields a function of the form A sin(2πft), where A is the maximum amplitude of the vibration, f is the frequency of the vibration, and t is time, as shown in Fig. 1.1.
A single continuous sine wave can thus be completely specified by two parameters.

A is the amount of pressure variation about the mean.
The frequency parameter f is the number of times per second the waveform repeats itself, specified in hertz (where 1 Hz = 1 cycle per second).
Instead of frequency, one can specify the period, which is simply the reciprocal of frequency.
If the sine wave is turned on and off or we are interested in the relationship between two or more different sine waves, the phase must also be given to specify the portion of the cycle through which the wave has advanced in relation to some fixed point in time.
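
As a rough illustration of these parameters, the sketch below evaluates A sin(2πft + phase) at a few instants. The function names and the 1000-Hz example values are assumptions chosen for illustration; they do not come from the text.

```python
import math

def sine_pressure(t, amplitude, frequency, phase=0.0):
    """Pressure deviation from the mean at time t (s): A sin(2*pi*f*t + phase)."""
    return amplitude * math.sin(2 * math.pi * frequency * t + phase)

def period(frequency):
    """Period in seconds is the reciprocal of frequency in Hz."""
    return 1.0 / frequency

# Illustrative values only: a 1000-Hz tone with unit amplitude.
f = 1000.0
T = period(f)                                        # one cycle lasts 1/f seconds
quarter = sine_pressure(T / 4, 1.0, f)               # a quarter cycle in: the positive peak
one_period_later = sine_pressure(T / 4 + T, 1.0, f)  # same point, one cycle later
shifted = sine_pressure(0.0, 1.0, f, math.pi / 2)    # a quarter-cycle phase lead at t = 0
```

Here `quarter` and `one_period_later` agree because the waveform repeats every period, and `shifted` shows that a phase of π/2 moves the peak to t = 0.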

The perceptual qualities of the pure tone to which a sine wave gives rise are related to the parameters of the wave.

Pitch is related monotonically to frequency (demo).
Loudness is related monotonically to amplitude (demo).

Fourier analysis and spectral representations.

All sounds can be specified in terms of variations in sound pressure over time, but when sounds are complex it is often more useful to specify them in the frequency domain.
This is made possible by Fourier analysis that breaks down the complex wave into a series of sinusoids, each with a specific frequency, amplitude and phase.
Adding these sinusoids together produces the original complex wave and is referred to as Fourier synthesis.
The simplest type of complex tone is one that is periodic.

Such periodic or harmonic complex tones are composed of a number of sinusoids whose frequencies are integer multiples of a (not necessarily present) fundamental frequency.
The fundamental frequency equals the repetition rate of the complex waveform as a whole.
The components of the tone are called harmonics and are numbered beginning with the fundamental as one.
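
The point about a "not necessarily present" fundamental can be checked numerically. The sketch below builds a complex tone from harmonics 2, 3 and 4 of an assumed 200-Hz fundamental and confirms that the waveform still repeats at the fundamental period. The amplitudes and the helper name are hypothetical.

```python
import math

def harmonic_complex(t, f0, harmonics):
    """Sum sinusoids at integer multiples of f0.

    harmonics maps harmonic number n -> (amplitude, phase in radians).
    """
    return sum(a * math.sin(2 * math.pi * n * f0 * t + ph)
               for n, (a, ph) in harmonics.items())

f0 = 200.0                      # illustrative fundamental, Hz
# Harmonics 2, 3 and 4 only: the fundamental (harmonic 1) is absent.
components = {2: (1.0, 0.0), 3: (0.5, 0.0), 4: (0.25, 0.0)}

T0 = 1.0 / f0                   # period of the (missing) fundamental
times = [k * T0 / 50 for k in range(50)]

# Largest change in the waveform after a shift of one fundamental period.
diff_full_period = max(abs(harmonic_complex(t, f0, components)
                           - harmonic_complex(t + T0, f0, components))
                       for t in times)

# Same comparison for a shift of half a period: the waveform does not repeat.
diff_half_period = max(abs(harmonic_complex(t, f0, components)
                           - harmonic_complex(t + T0 / 2, f0, components))
                       for t in times)
```

`diff_full_period` is zero apart from rounding error, while `diff_half_period` is large, so the repetition rate of the whole waveform is the fundamental frequency even though no component lies at that frequency.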

Fig. 1.2 illustrates how a complex tone can be built up from a series of sinusoids (demo).
The structure of a sound, in terms of its frequency components, is often represented by its magnitude spectrum, a plot of sound amplitude, energy or power as a function of frequency. Fig. 1.3 shows examples of magnitude spectra.
The term partial is used to describe any discrete sinusoidal component of a complex sound, whether it is a harmonic or not.

The measurement of sound level.

Instruments used to measure magnitudes of sounds, such as microphones, normally respond to changes in air pressure.

Sound magnitudes are often specified in terms of intensity, which is the sound energy transmitted per second (i.e. the power) through a unit area in a sound field.
For our purposes, acoustic intensity is proportional to the square of the pressure variation.

Our auditory systems can deal with a huge range of sound intensities, which makes it inconvenient to work with intensities directly. Instead, a log scale expressing the ratio of two intensities is used.

One intensity, I₀, is chosen as a reference and the other intensity, I₁, is expressed relative to this.
One bel is defined to be an intensity ratio of 10:1. Thus intensity in bels = log₁₀ (I₁/I₀).
The bel is a rather large unit and thus is usually divided into 10 decibels (dB) so that number of decibels = 10 log₁₀ (I₁/I₀).

When the magnitude of a sound is specified in dB, it is referred to as a sound level.

The sound level is an intensity ratio, not an absolute intensity.
To specify the absolute intensity it is necessary to state that the sound, I₁, is n dB above or below some reference intensity I₀.
The most common reference intensity for sound measurements is 10^-12 W/m², which was chosen to be close to the average human absolute threshold for a 1000-Hz pure tone. A sound level specified using this reference is referred to as sound pressure level (SPL).
It is sometimes useful to choose as a reference level the threshold of a subject for the sound being used. A sound level specified in this way is referred to as sensation level (SL).
It is also useful to adapt the dB notation for ratios of pressures. Because intensity is proportional to the square of pressure, number of decibels = 10 log₁₀ (I₁/I₀) = 10 log₁₀ (P₁/P₀)² = 20 log₁₀ (P₁/P₀).
Table 1.1 gives some examples of sound levels, in dB SPL, corresponding to various common sounds.
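
The decibel formulas above can be sketched as a few small functions. The reference intensity is the one given in the text. The reference pressure of 20 µPa is the conventional pressure counterpart for dB SPL and is an assumption here, since the text does not state it. Function names are hypothetical.

```python
import math

I_REF = 1e-12      # W/m^2, reference intensity for dB SPL (from the text)
P_REF = 20e-6      # Pa, conventional reference pressure (assumption, not in the text)

def level_from_intensity(i1, i0=I_REF):
    """Sound level in dB for intensity i1 relative to i0: 10 log10(i1/i0)."""
    return 10 * math.log10(i1 / i0)

def level_from_pressure(p1, p0=P_REF):
    """Sound level in dB for pressure p1 relative to p0: 20 log10(p1/p0)."""
    return 20 * math.log10(p1 / p0)

def intensity_from_level(level_db, i0=I_REF):
    """Invert the dB formula to recover absolute intensity from a level."""
    return i0 * 10 ** (level_db / 10)

level_a = level_from_intensity(1e-6)          # a million times the reference: about 60 dB
level_b = level_from_pressure(10 * P_REF)     # ten times the reference pressure: about 20 dB
level_c = level_from_intensity(2 * I_REF)     # doubling intensity adds about 3 dB
```

Note that a tenfold pressure ratio gives 20 dB while a tenfold intensity ratio gives 10 dB, which is why the multiplier changes from 10 to 20.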

Beats.

When two sinusoids with slightly different frequencies are added together, the resulting wave resembles a single sinusoid, with frequency equal to the mean frequency of the two components, but with amplitude fluctuating at a regular rate.
These fluctuations are known as beats and occur because of the changing phase relationship between the two sinusoids, which causes them alternately to reinforce and cancel one another (Fig. 1.3b).
Beats are heard as loudness fluctuations and can be a problem in some experiments (demo).
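
The description of beats follows from the identity sin(a) + sin(b) = 2 cos((a − b)/2) sin((a + b)/2). The sketch below checks it numerically for two equal-amplitude tones. The frequencies 1000 and 1004 Hz are arbitrary illustrative values.

```python
import math

def two_tone(t, f1, f2):
    """Sum of two equal-amplitude sinusoids."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def beat_form(t, f1, f2):
    """The same signal written as a mean-frequency sinusoid with a slow envelope."""
    mean_f = (f1 + f2) / 2
    envelope = 2 * math.cos(math.pi * (f1 - f2) * t)
    return envelope * math.sin(2 * math.pi * mean_f * t)

f1, f2 = 1000.0, 1004.0
beat_rate = abs(f1 - f2)        # envelope magnitude peaks this many times per second

times = [k / 8000 for k in range(4000)]   # half a second sampled at 8 kHz
max_error = max(abs(two_tone(t, f1, f2) - beat_form(t, f1, f2)) for t in times)
```

The two forms agree to within rounding error. The envelope 2 cos(π(f1 − f2)t) changes sign, but its magnitude, which is what is heard as a loudness fluctuation, repeats at the difference frequency of 4 Hz.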

The concept of linearity.

The auditory system is often conceived as being made up of a series of devices or systems, each with input from the previous device and output to the subsequent device.
Such a device is said to be linear if certain relationships between its input and output are true.

Superposition--The output of the device in response to a number of independent inputs presented simultaneously should be equal to the sum of the outputs that would have been obtained if each input were presented alone.
Homogeneity--If the input to the device is changed in magnitude by a factor k, then the output should also change in magnitude by a factor k, but be otherwise unaltered.
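
The two conditions can be tried out on simple hypothetical devices. In the sketch below, a fixed gain of 0.5 stands in for a linear device and a squarer for a nonlinear one. Both are illustrative assumptions, and passing the checks for a few inputs is only consistent with linearity, not a proof of it.

```python
import math

def linear_device(x):
    """Hypothetical linear device: a fixed gain of 0.5."""
    return [0.5 * v for v in x]

def squaring_device(x):
    """Hypothetical nonlinear device: squares each input sample."""
    return [v * v for v in x]

def close(a, b, tol=1e-9):
    return all(abs(p - q) <= tol for p, q in zip(a, b))

def obeys_superposition(device, x1, x2):
    """Response to x1 + x2 should equal response to x1 plus response to x2."""
    summed_input = [a + b for a, b in zip(x1, x2)]
    summed_outputs = [a + b for a, b in zip(device(x1), device(x2))]
    return close(device(summed_input), summed_outputs)

def obeys_homogeneity(device, x, k):
    """Scaling the input by k should scale the output by k."""
    scaled_input = [k * v for v in x]
    scaled_output = [k * v for v in device(x)]
    return close(device(scaled_input), scaled_output)

n = 64
x1 = [math.sin(2 * math.pi * 3 * i / n) for i in range(n)]
x2 = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]

linear_passes = (obeys_superposition(linear_device, x1, x2)
                 and obeys_homogeneity(linear_device, x1, 3.0))
squarer_passes = (obeys_superposition(squaring_device, x1, x2)
                  or obeys_homogeneity(squaring_device, x1, 3.0))
```

The gain device satisfies both conditions for these inputs. The squarer fails both: its response to the sum contains a cross term, and tripling the input multiplies the output by nine.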

The output of a linear device never contains frequency components that were not present in the input signal.
Some parts of the auditory system are approximately linear, while other parts behave in a grossly nonlinear way.
If a device is linear, measuring its response to a sinusoidal input as a function of frequency tells us all we need to know to predict its response to any input.

Perform a Fourier analysis of the arbitrary complex input.
The response to the complex input can then be calculated as the sum of the responses to its sinusoidal components.
This is one of the reasons sinusoidal stimuli are so frequently studied in psychoacoustics.
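
A minimal sketch of this procedure follows, assuming a hypothetical linear device that averages the current and previous sample of a sampled signal. Its gain and phase shift for a sinusoid are computed first. The response to a two-component complex input is then predicted as the sum of the component responses and compared with the device's actual output. All names and values are illustrative.

```python
import cmath
import math

def device(x_now, x_prev):
    """Hypothetical linear device: average of the current and previous sample."""
    return 0.5 * (x_now + x_prev)

def sinusoid_response(freq, fs):
    """Gain and phase shift (radians) of the device for a sinusoid at freq Hz."""
    w = 2 * math.pi * freq / fs
    h = 0.5 * (1 + cmath.exp(-1j * w))
    return abs(h), cmath.phase(h)

fs = 8000.0                                            # sampling rate, Hz (assumed)
components = [(1.0, 500.0, 0.0), (0.5, 1500.0, 0.3)]   # (amplitude, Hz, phase)

def complex_input(n):
    """Sample n of the complex input, defined for every integer n."""
    return sum(a * math.sin(2 * math.pi * f * n / fs + ph)
               for a, f, ph in components)

def predicted_output(n):
    """Sum of the responses to each sinusoidal component taken alone."""
    total = 0.0
    for a, f, ph in components:
        gain, shift = sinusoid_response(f, fs)
        total += a * gain * math.sin(2 * math.pi * f * n / fs + ph + shift)
    return total

max_error = max(abs(device(complex_input(n), complex_input(n - 1))
                    - predicted_output(n))
                for n in range(200))
```

The prediction matches the direct output to within rounding error, which is the sense in which the sinusoidal responses of a linear device tell us its response to any input.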

If a device is not linear, its response to complex inputs cannot generally be predicted from its response to sinusoidal inputs.
To discover the characteristics of such a nonlinear device its response to both sinusoidal and various complex inputs of interest must be studied directly.

Filters and their properties.

Filters are used to manipulate the spectra of stimuli for psychoacoustic experiments and provide models of how certain parts of the auditory system behave.
Filters are linear devices that attenuate some frequencies more than others.

A highpass filter removes all frequency components below a certain cutoff frequency, but does not affect components above that frequency.
A lowpass filter does just the opposite.
A bandpass filter has two cutoff frequencies, passing components between those two frequencies and removing components outside this passband.
A bandstop filter also has two cutoff frequencies, but it removes components between these two frequencies, leaving other components intact.

In practice it is not possible to design filters with perfectly sharp cutoffs. Instead there is some range of frequencies over which components are increasingly attenuated, but not completely eliminated.
Thus, in order to specify a filter we have to define both its cutoff frequency and the slope of the filter response curve.
Some typical filter characteristics are shown in Fig. 1.5.

The cutoff frequency is usually defined as the frequency at which the output of the filter has fallen by 3 dB, i.e. its power has been halved, relative to the output in the passband.
For a bandpass or bandstop filter, the range of frequencies between the two cutoffs defines the bandwidth of the filter and the midpoint of the pass or stop band is called the filter's center frequency (CF).
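
The link between "3 dB" and "half power" can be worked through directly, along with the bandwidth and center frequency of a bandpass filter. The cutoff values of 900 and 1100 Hz are assumptions for illustration, and the center frequency is taken as the arithmetic midpoint, following the wording above.

```python
import math

half_power_db = 10 * math.log10(0.5)            # halving the power: about -3.01 dB
amplitude_ratio = math.sqrt(0.5)                # the matching amplitude ratio, about 0.707
same_in_db = 20 * math.log10(amplitude_ratio)   # the same drop, computed from amplitude

def bandwidth(low_cutoff, high_cutoff):
    """Range of frequencies between the two cutoffs, in Hz."""
    return high_cutoff - low_cutoff

def center_frequency(low_cutoff, high_cutoff):
    """Midpoint of the band, in Hz (arithmetic midpoint, as in the text)."""
    return (low_cutoff + high_cutoff) / 2

bw = bandwidth(900.0, 1100.0)          # 200 Hz
cf = center_frequency(900.0, 1100.0)   # 1000 Hz
```

The 3-dB figure is a rounding of 10 log₁₀(0.5), and the corresponding amplitude falls to about 71% of its passband value rather than to half.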

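An alternative way to measure bandwidth is the equivalent rectangular bandwidth (ERB), which is simply the bandwidth of a rectangular filter with the same peak height as our filter and the same area under its power response, so that both pass the same total power when the input is white noise.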
The characteristics of filters outside their pass bands are often linear when plotted on dB versus log-frequency coordinates. Thus, slopes are often specified in dB/octave.
A filter does not affect the waveform of a sinusoid, but usually does alter waveforms that are more complex. E.g., passing white noise through a narrow bandpass filter produces a waveform that resembles a sinusoid fluctuating in amplitude from moment to moment; the resulting sound has a pitch-like quality corresponding to the center frequency of the filter.
The characteristics of a filter can be obtained by a Fourier analysis of its impulse response.
A filter's response is not instantaneous.

The narrower the bandwidth and the steeper the slope of a filter, the longer its response time.
Thus, an increase in frequency selectivity can only be obtained at the expense of a loss of resolution in time.

Basic structure and function of the auditory system.

The outer and middle ear.

The outer ear consists of the pinna and auditory canal, as shown in Fig. 1.7.

The pinna modifies incoming sound, particularly at high frequencies, and plays a role in sound localization.
The auditory canal channels the sound to the middle ear.

The middle ear consists of the tympanic membrane (eardrum) and three small bones (ossicles), the malleus (hammer), incus (anvil), and stapes (stirrup).

Sound arriving at the tympanic membrane sets it into vibration that is transmitted through the ossicles, acting as a series of levers, to the inner ear.
The main function of the middle ear is to match the acoustic impedance of the air to that of the inner ear.
Transmission of sound through the middle ear is most efficient at middle frequencies (500-4000 Hz).
Middle ear reflex.

The inner ear and the basilar membrane.

The inner ear consists of the cochlea, which is a rigid, bony, snail-shaped structure filled with almost incompressible fluids.
The cochlea is divided along its length by the basilar membrane into two connected chambers, the scala vestibuli and the scala tympani.
The stapes contacts the oval window at the base of the cochlea and its movements in response to sound force fluid from the upper to the lower chamber of the cochlea, with the pressure producing an outward movement of the round window.
This creates a pressure difference across the basilar membrane that causes it to move.
The response of the basilar membrane to sinusoidal stimulation takes the form of a travelling wave that moves along the membrane from base to apex.
The amplitude of this wave increases at first and then decreases abruptly, as shown in Fig. 1.8, producing an envelope maximum at a particular position along the membrane.
The position of the peak vibration along the basilar membrane varies with frequency of stimulation, as shown in Fig. 1.9.

High frequencies produce a maximum displacement near the base of the basilar membrane.
Low frequencies produce patterns of vibration that extend all along the membrane, but reach a maximum near the apex.
The frequency that gives maximum response at a particular point on the basilar membrane is known as the characteristic frequency of that place.

In response to steady sinusoidal stimulation, each point on the basilar membrane vibrates in an approximately sinusoidal manner with a frequency equal to that of the input waveform.
Each point on the basilar membrane may be considered a bandpass filter with a center frequency, bandwidth and slopes outside the passband. The bandwidth increases roughly in proportion to the center frequency (typical bandwidths are 0.5-0.15 octaves).
If two sinusoids of different frequencies are presented simultaneously, the response of the basilar membrane is somewhat more complex and depends on the frequency separation of the two tones.

At large separations, two separate peaks of vibration are obtained, much as if each tone had been presented separately.
When the tones are closer together, some points respond to both tones and those points have a complex, rather than a sinusoidal vibration.
If the tones are yet closer together there is simply a single peak vibration, but that peak is somewhat wider than for a single tone.

For frequencies above 500 Hz, the position on the basilar membrane most excited by a given frequency varies approximately with the logarithm of frequency, and relative bandwidths of the vibration patterns are approximately constant.
The impulse response of a particular point on the basilar membrane resembles a damped oscillation with a frequency corresponding to the center frequency of that point.

The transduction process and the hair cells.

How is information about frequency, amplitude and time, which is carried in the vibration patterns of the basilar membrane, converted or coded into neural signals in the auditory nervous system?
Between the basilar membrane and the tectorial membrane are hair cells, which form part of a structure called the organ of Corti, as shown in Fig. 1.13.

On the side of the tunnel of Corti nearest the outside, there are about 25,000 outer hair cells, each with about 140 'hairs'.
On the other side of the tunnel are about 3,500 inner hair cells, each with about 40 'hairs'.

The functions of the inner and outer hair cells are different from each other.

Motion of the basilar membrane excites the inner hair cells that in turn excite the approximately 20 auditory neurons that contact each cell.
The outer hair cells receive efferents from higher brain centers and affect the mechanics of the cochlea to produce high sensitivity and sharp tuning.

Cochlear echoes.

Kemp found that if a low-level click is applied to the ear, then it is possible to measure sound being reflected from the ear, using a microphone sealed into the ear canal.
The early part of this reflected sound comes from the middle ear, but at longer delays it reflects active processes occurring inside the cochlea.
Investigation of these cochlear echoes suggests several conclusions.

These processes have a strong nonlinear component.
They are biologically active.
They are physiologically vulnerable.
They appear to be responsible for the sensitivity and sharp tuning of the basilar membrane.

Neural responses in the auditory nerve.

Spontaneous firing rates and thresholds.

Auditory nerve fibers have spontaneous firing rates from less than 0.5 to about 250 spikes/sec.
The spontaneous rates are correlated with position and size of the synapses on the inner hair cells.

High rates go with large synapses on the side of the inner hair cells facing the outer hair cells.
Low rates go with small synapses on the other side of the inner hair cells.

High spontaneous rates are correlated with low thresholds and vice versa.
Thresholds vary from near 0 dB to 80 dB SPL or more.

Tuning curves and iso-rate contours.

Frequency selectivity of a single nerve fiber can be illustrated by a tuning curve, which plots the fiber's threshold as a function of frequency, as shown in Fig. 1.14.

On the log frequency scale, the tuning curves are steeper on their high frequency side.
The frequency at which a fiber's threshold is lowest is called its center frequency (also known as its characteristic frequency, CF).
The frequency selectivity of a fiber is derived from the frequency selectivity of the point on the basilar membrane that activates it.
The tonotopic or place representation of frequency on the basilar membrane is preserved in the auditory nerve bundle with high center frequencies in the periphery of the bundle and an orderly decrease in center frequency towards the center of the bundle.
Sharpness of tuning on the basilar membrane now appears to be the same as for single neurons in the auditory nerve.

Iso-rate contours can be used to describe the characteristics of single fibers above threshold.

The intensity of sinusoidal stimulation required to produce a predetermined firing rate in the neuron is plotted as a function of frequency.
Resulting curves have the same general shape as tuning curves, but sometimes broaden at high sound levels.

Another alternative is iso-intensity contours that plot firing rates at equal sound levels as a function of tone frequency.

Shape depends on sound level chosen.
Difficult to interpret because relationship between firing rates and intensity of stimulation is non-linear.
Show potentially important result that for some fibers the frequency that gives maximum firing rate varies with sound level.

Rate versus level functions.

Fig. 1.16 shows how the rate of firing of an auditory neuron varies with intensity of a sinusoid at the neuron's center frequency.
Neurons vary in their spontaneous and maximum firing rates.
Neuron is said to saturate when further increases in intensity produce no further increases in firing rate.
Range between threshold and saturation is called dynamic range, and is between 20 and 50 dB for most neurons.
Some neurons show sloping saturation, or a gradual increase in firing rate even at high sound levels. This occurs mainly for neurons with low spontaneous rates.

Neural excitation patterns.

In response to low levels of sinusoidal stimulation, there is a high level of activity in neurons with center frequencies close to that of the stimulus, with activity falling off rapidly to either side.
At higher levels of stimulation, saturation can produce a high level of activity across units with a wide range of center frequencies.

Phase locking.

Information about the stimulus is carried not only in the firing rate of neurons, but also in the temporal pattern of these firings.
In response to sinusoidal stimulation, nerve firings tend to be phase locked or synchronized to the stimulating waveform.

A given fiber does not necessarily fire on every cycle of the waveform, but its firings occur at roughly the same phase of the waveform.
Thus, the time intervals between firings are approximately integer multiples of the period of the waveform.
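
A toy simulation makes the interval pattern concrete. It assumes an idealized, hypothetical fiber that, on each cycle, either fires at one fixed phase or skips the cycle at random. There is no timing jitter, so it does not model a real neuron, and the firing probability and preferred phase are arbitrary.

```python
import random

def phase_locked_spikes(freq, n_cycles, fire_prob, seed=0):
    """Spike times (s) for an idealized fiber that fires only at one phase of the cycle."""
    rng = random.Random(seed)        # seeded so the run is repeatable
    period = 1.0 / freq
    preferred_phase = 0.25           # fraction of a cycle; arbitrary illustrative choice
    return [(k + preferred_phase) * period
            for k in range(n_cycles) if rng.random() < fire_prob]

freq = 500.0                         # stimulus frequency, Hz (illustrative)
spikes = phase_locked_spikes(freq, n_cycles=2000, fire_prob=0.3)

# Intervals between successive firings, expressed in units of the stimulus period.
intervals = [later - earlier for earlier, later in zip(spikes, spikes[1:])]
multiples = [interval * freq for interval in intervals]
```

Every entry in `multiples` is a whole number of periods (1, 2, 3, ...). A histogram of `intervals` would therefore show peaks only at integer multiples of the 2-ms period; in a real fiber, variability in firing time would broaden each peak.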

One way to demonstrate phase locking in a single auditory nerve fiber is to plot a histogram of the time intervals between successive firings as in Fig. 1.19.
There is variability in the exact instant of initiation of a nerve impulse and as frequency increases the period of the waveform eventually becomes as short as this variability.
Thus, phase locking in the human auditory system breaks down above 4-5 kHz.

Two-tone suppression.

The tone-driven activity of a single fiber in response to one tone can be suppressed by the presence of a second tone.

For a neuron responding to a tone near its center frequency, a second tone presented within the excitatory area bounded by the tuning curve for that neuron usually increases its firing rate.
When the second tone falls just outside this area, the firing rate is usually reduced, as shown in Fig. 1.20.

The suppression effects begin and cease very rapidly, and are thought to occur on the basilar membrane.
Phase locking of the neuron may also shift from the original tone to the suppressor tone.

Limited investigations of phase locking to stimuli that are more complex have begun, but are beyond the scope of our course.

Neural responses at higher levels in the auditory system.

Anatomy is well known, but physiology is not.
Feature detectors are involved, but cataloging of relevant features is just beginning.