Spectrogram of a banded wren from the Cornell Lab of Ornithology website
You can build a real-time audio spectrogram that can display a spectrum of audio signals. Using two microcontrollers and a little know-how, your system can display frequency spectrum content from a microphone or an audio line-in using 4-bit grayscale scrolling on any NTSC television.
If you have ever wondered what a sound looks like, then our real-time scrolling spectrogram is the solution to all of your problems—or at least the one where you cannot quite visualize sounds. That being said, the device we designed enables any audio enthusiast to visualize patterns in live audio input with an inexpensive setup.
Our real-time audio spectrogram can display a spectrum of audio signals as a grayscale spectrogram. While our initial goal was to create a system that could detect and compare a user’s voice to a prerecorded audio sample, that would’ve required a much stronger microcontroller than the Atmel ATmega1284p microcontrollers we’d planned to use. So instead, we determined that providing a real-time spectrogram of the input audio signal would be a fun and practical way of recognizing and comparing various audio samples.
A spectrogram is a two-dimensional image of an audio signal, where the x-axis and y-axis represent the time and frequency, and the intensity of the signal at a particular time and frequency is represented by its whiteness in grayscale—or by different colors, in the case of color spectrograms (see Photo 1). Humans are quite capable of reading spectrograms of various audio inputs, ranging from simple bird calls to human speech. Thus, we decided that a microcontroller-based, real-time spectrogram would be a useful device in aiding audio and voice recognition by humans.
The project is a real-time scrolling spectrogram-style visualization of an audio signal (see Figure 1). It displays the frequency spectrum content taken from a microphone or an audio line-in in real time using 4-bit grayscale scrolling on any NTSC television. Features include play/pause functionality, several scroll speed settings, and grayscale amplitude display in either linear or log scale. The FFT algorithms and display processes are implemented on two ATmega1284P microcontrollers. One microcontroller is purposed entirely for data acquisition and audio processing (i.e., FFT calculation), while the other is used only for generating the video output to the screen. As long as your NTSC television can support a resolution greater than 160 × 200 pixels, the television is compatible. On top of that, the spectrogram supports frequencies up to 4 kHz, covering an overwhelming majority of frequency ranges present in voice and typical modern music.
Our design features two ATmega1284p microcontrollers mounted on a breadboard, a soldered audio amplification and filtering circuit, and a small NTSC television for displaying the spectrogram. We used two microcontrollers since the FFT algorithm and video processing both take up a major portion of a microcontroller’s processing power.
The audio amplifier circuit accepts either a stereo line-in signal or a microphone audio input signal (see Figure 2). The line-in and microphone inputs have separate gain stages using Texas Instruments LM358 op-amps, and a mechanical switch selects one of the two signals to feed to the low-pass filter. We chose a Maxim Integrated MAX295 eighth-order, low-pass filter for its steep roll-off, so that it acts as a good antialiasing filter without attenuating the lower-frequency content that needs to be displayed. The low-passed signal is then fed to the ADC through a 330-Ω resistor. We also made sure that the filters were near linear phase so that phase distortion is kept to a minimum.
On the line-in amplifier specifically, the stereo line-in’s left and right channels are fed through a summing unity-gain inverting amplifier, whose output is then fed through a DC-biasing circuit. The biased output is then fed through an AC-coupled noninverting amplifier with a gain of 6.67 V/V, which was determined experimentally to be suitable for the given line-in input voltage range. Similar to the line-in circuit, the mic-in circuit runs the input from the microphone through the biasing circuit. This signal is then passed through a high-gain noninverting amplifier. As with the line-in circuit, the gain was chosen experimentally, after we measured the maximum voltage swing of the mic-in audio to be around 20 mV peak-to-peak. Thus, to be able to detect voices of reasonable strength and distance from the microphone, we chose resistor values that give a gain of 266.67 V/V, or +48.5 dB. This ensures that the maximum output of the microphone results in a full voltage swing across the ADC range.
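As a quick check of these figures, the quoted gains can be converted to decibels and the mic-stage output swing estimated. The 16.5-dB line-in figure and the 5.33-V swing are computed here rather than quoted from the design, and they assume ideal amplifier stages:

```latex
\begin{align*}
G_{\text{mic}}  &= 20\log_{10}(266.67) \approx 48.5\ \text{dB} \\
G_{\text{line}} &= 20\log_{10}(6.67) \approx 16.5\ \text{dB} \\
V_{\text{out,pp}} &\approx 20\ \text{mV} \times 266.67 \approx 5.33\ \text{V}
\end{align*}
```

A swing of roughly 5.3 V would fill an ADC range referenced to a nominal 5-V supply. The supply voltage is an assumption here.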
Since we sample the audio signal with the ADC at an 8-kHz sampling rate, we must low-pass filter the audio signal before the ADC conversion to prevent aliasing. With our eighth-order, low-pass filter, we were able to use a 3-kHz cutoff frequency, which allowed us to preserve much of the higher-frequency content and still attenuate the frequencies above 4 kHz by a factor of 10 for minimal aliasing.
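The factor-of-10 figure can be sanity-checked against an ideal eighth-order Butterworth magnitude response. Treating the filter as an ideal Butterworth with a 3-kHz cutoff is an assumption made only for this estimate; a real part will deviate somewhat:

```latex
\[
|H(f)| = \frac{1}{\sqrt{1 + (f/f_c)^{2n}}}, \qquad
|H(4\ \text{kHz})| = \frac{1}{\sqrt{1 + (4/3)^{16}}}
\approx \frac{1}{\sqrt{1 + 99.8}} \approx \frac{1}{10.04}
\]
```

With n = 8 and f_c = 3 kHz, the attenuation at the 4-kHz Nyquist frequency comes out at about 20 dB, which agrees with the factor of 10 stated above.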
For 16-level grayscale generation, we borrow the circuit built by Francisco Woodland and Jeff Yuen in their project, “Gray-Scale Graphics: Dueling Ships” (Cornell University, ECE4760, 2003). As you can see in Figure 3, we use an R-2R resistor ladder network to add the 4 bits representing grayscale intensity and the sync signal. The summed voltage is passed through a common collector circuit that serves as a voltage buffer, and the resulting voltage is shifted to the 0-to-1-V range by a voltage divider.
Lastly, the push button circuits are simple debounced push buttons with resistors that toggle the play/pause, speed control, and log/linear conversion functionalities for the audio and video microcontroller units. The ATmega1284p microcontroller is embedded on a custom microcontroller board designed by Dr. Bruce Land. You can refer to the course webpage (http://people.ece.cornell.edu/land/PROJECTS/ProtoBoard476/) for the full schematic.
The software portion of the project was split in two between the code for the audio microcontroller and the video microcontroller. The audio microcontroller is responsible for the audio data acquisition: converting the analog audio signal into a digital signal by ADC sampling; converting the digital time domain signal into the frequency domain using FFT; and transmitting the processed data to the video microcontroller through USART. The video microcontroller is responsible for receiving the data from the audio microcontroller through USART, performing the visualization processing by using 4-bit grayscale, implementing a circular buffer to get a scrolling display, and transferring the data to the NTSC television. The FFT code in the audio microcontroller and the video display code in the video microcontroller were based on example code by our instructor, Dr. Bruce Land (Cornell University, ECE4760). Regarding the software setup, we used Atmel AVR Studio 4 version 4.15 with the WinAVR GCC Compiler version 20080610 to build and write the code and to program the microcontroller.
The conditioned audio signal from the analog circuit was fed into ADC input A.0, with the ADC voltage reference set to AVCC. According to the ATmega1284’s specifications, the successive approximation circuitry requires an input clock frequency between 50 and 200 kHz to get maximum resolution. We therefore ran the ADC clock at 125 kHz, which supports the 8 bits of precision we deemed sufficient for the project. At 125 kHz, a conversion takes 13 ADC clock cycles, or about 104 µs, to produce a new 8-bit value (ranging from 0 to 255). The ADC was set to “left adjust result,” so all 8 bits of the ADC result were stored in the ADCH register. To get accurate results, we must sample the ADC at precisely spaced intervals. At a sample rate of 8 kHz, an ADC value is requested every 125 µs, which leaves sufficient time for a new value to be ready each time it is requested. The sample rate was made cycle-accurate at 8 kHz by setting Timer 1, clocked at 16 MHz, to interrupt every 2,000 cycles and by having a second interrupt put the microcontroller to sleep at 1,975 cycles (slightly before the main interrupt). This ensures that no other process interferes with the precise execution of the Timer 1 ISR where the ADC is sampled. Each ADC result was stored in bits 2 to 9 of a 16-bit fixed-point (8:8 format) buffer entry, a Hanning window was applied to the buffer, and the windowed data was then used as input for the FFT.
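The timing arithmetic in this paragraph can be checked with a few lines of host-side C. This is a minimal sketch, not the project firmware: the macro and function names are hypothetical, and placing the 8-bit result in bits 2 to 9 by a left shift of 2 is our reading of the description above.

```c
#include <stdint.h>

/* Values quoted in the article; names are hypothetical. */
#define F_CPU_HZ      16000000UL  /* CPU and Timer 1 clock        */
#define ADC_CLOCK_HZ  125000UL    /* ADC clock                    */
#define ADC_CYCLES    13UL        /* ADC clocks per conversion    */
#define SAMPLE_RATE   8000UL      /* audio sample rate in Hz      */

/* Timer 1 ticks between two ADC samples (2,000 at 16 MHz / 8 kHz). */
uint32_t timer_ticks_per_sample(void) {
    return (uint32_t)(F_CPU_HZ / SAMPLE_RATE);
}

/* Duration of one ADC conversion in whole microseconds. */
uint32_t adc_conversion_us(void) {
    return (uint32_t)(ADC_CYCLES * 1000000UL / ADC_CLOCK_HZ);
}

/* Time between sample requests in whole microseconds. */
uint32_t sample_period_us(void) {
    return (uint32_t)(1000000UL / SAMPLE_RATE);
}

/* Assumed mapping: put the 8-bit ADCH value into bits 2..9 of a
 * 16-bit fixed-point word by shifting it left by two places. */
int16_t adc_to_fixed(uint8_t adch) {
    return (int16_t)((uint16_t)adch << 2);
}
```

The conversion time (104 µs) is shorter than the sample period (125 µs), which is the margin the paragraph relies on.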
After waiting for 128 samples of the audio signal, FFT conversion begins in the main loop of the program. We use the FFT conversion code provided for us by Dr. Bruce Land (“DSP for GCC,” 2013), which takes real and imaginary input arrays of size N (where N is a power of 2) and calculates the resulting real and imaginary frequency vectors in place. Since the resulting array contains both positive and negative frequencies symmetrically, we use only the first 64 entries of the array.
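With a 128-point FFT at an 8-kHz sample rate, each bin is 8,000 / 128 = 62.5 Hz wide, so the 64 retained bins cover 0 Hz up to just under 4 kHz. The following sketch illustrates that mapping; the names are hypothetical and it is not part of the FFT code referenced above.

```c
#include <stdint.h>

#define SAMPLE_RATE_HZ 8000u              /* ADC sample rate           */
#define N_FFT          128u               /* FFT length (power of two) */
#define N_BINS         (N_FFT / 2u)       /* bins kept after the FFT   */

/* Frequency in Hz at the lower edge of FFT bin k. */
float bin_frequency_hz(unsigned k) {
    return (float)k * (float)SAMPLE_RATE_HZ / (float)N_FFT;
}
```

For example, `bin_frequency_hz(16)` gives 1,000 Hz, so a 1-kHz test tone should light up the seventeenth row of the display, counting from bin 0.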
Following the conversion, we calculate only the magnitude of the frequency, since phase is not used. Since taking a square root of the sum of the squares is too costly, we use the following approximation to calculate the magnitude:
Here, fr_k represents the kth element of the real vector and fi_k represents the kth element of the imaginary vector. As shown in “Function Approximation Tricks” (www.dspguru.com/book/export/html/62), the above approximation has an average error of 4.95%. This is acceptable because our frequency magnitude resolution is already low (only 16 discrete values), so the magnitude calculation error is insignificant compared to the quantization error.
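One common square-root-free estimator of this kind is the "alpha max plus beta min" form, in which the larger of the two absolute values is added to a fraction of the smaller one. The sketch below uses alpha = 1 and beta = 0.4 (implemented as 2/5 in integer arithmetic). These coefficients and the function names are assumptions for illustration and are not confirmed as the exact values used in the project.

```c
#include <stdint.h>

/* Absolute value computed in 32 bits so that -32768 is handled safely. */
static int32_t abs32(int32_t v) {
    return v < 0 ? -v : v;
}

/* Approximate sqrt(re^2 + im^2) as max + 0.4 * min.
 * alpha = 1 and beta = 0.4 are assumed values, not taken from the article. */
int32_t mag_estimate(int16_t re, int16_t im) {
    int32_t a = abs32(re);
    int32_t b = abs32(im);
    int32_t hi = a > b ? a : b;
    int32_t lo = a > b ? b : a;
    return hi + (lo * 2) / 5;   /* integer form of hi + 0.4 * lo */
}
```

For inputs of 300 and -400 the estimate is 520, against a true magnitude of 500. An error of that size is small next to a 16-level quantization step.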
The calculated magnitude is then divided into 16 discrete levels by linear or base-2 logarithmic scaling, according to the user’s choice. With linear scaling we take the middle 8 bits of the 16-bit fixed-point value (bits from 4 to 11) and then assign a value to a new char array that contains the scaled values. With base-2 logarithmic scaling, we use the position of the most significant bit of the magnitude as the logarithmic value to store in the char array. For example, a magnitude value with the most significant bit at position 5 would be approximated as log(magnitude) = 5 + 1 = 6. After discretization is done, the 64 bytes of frequency values are transmitted to the video microcontroller via the USART data transmission protocol. This is discussed in detail in the following section.
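The base-2 logarithmic scaling described above amounts to finding the position of the most significant set bit. Here is a minimal sketch; the function name is hypothetical, and clamping the result to 15 so that it fits the 16 gray levels is an assumption.

```c
#include <stdint.h>

/* Returns (position of the most significant set bit) + 1, or 0 for a
 * zero magnitude. The result is clamped to 15 to fit 4-bit grayscale
 * (the clamp is an assumption, not described in the article). */
uint8_t log2_level(uint16_t mag) {
    uint8_t level = 0;
    while (mag != 0) {      /* count how many bits are needed */
        level++;
        mag >>= 1;
    }
    return level > 15 ? 15 : level;
}
```

A magnitude of 0x0020 has its most significant bit at position 5, so `log2_level(0x0020)` returns 6, matching the example in the text.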
USART COMMUNICATION PROTOCOL
For transmitting data from the audio MCU to the video MCU, we use the USART data transmission and reception protocol. The USART protocol uses two wires for transmit and receive, and an additional clock wire when using synchronous mode. We opted to use USART in synchronous mode because it provides the fastest data transfer rate: the additional clock line is used to maintain the integrity of the communication, whereas asynchronous mode requires additional overhead for checking data integrity. We use USART channel 0 on the audio microcontroller to transmit the frequency magnitude data to the video microcontroller and USART channel 1 on the video microcontroller to receive the transmitted data. We chose USART channel 1 instead of 0 on the video microcontroller since we originally used channel 0 to output a black-and-white NTSC signal to the television.
Both USARTs are set up to transmit 8-bit characters. Since we are using synchronous mode, start and stop bits are not needed. Thus, a frame consists of a total of 8 bits. Since both USARTs are operating in synchronous mode, we set the data rate of the transmission by the following equation:
data rate = fOSC / (2 × (UBRRn + 1))
fOSC is the CPU clock frequency. Since the external clock frequency has to be less than a quarter of the CPU frequency—which is at 16 MHz—the maximum data rate we can achieve from the above equation is 2 MHz, which we achieved by setting the register UBRR0 = 3. Since we can keep sending data at every clock edge as long as we keep the USART Transmit Data Buffer register full at all times, we thus calculate the maximum data transfer rate as 250,000 byte frames per second.
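The link between the clock frequency, the UBRR0 setting, and the resulting frame rate can be checked numerically. This sketch assumes the synchronous master mode relation from the ATmega datasheet, rate = fOSC / (2 × (UBRRn + 1)), and the 8-bit frame described above. The function names are hypothetical.

```c
#include <stdint.h>

/* Bit rate in synchronous master mode: fOSC / (2 * (UBRR + 1)). */
uint32_t usart_sync_master_rate(uint32_t f_osc_hz, uint16_t ubrr) {
    return f_osc_hz / (2u * ((uint32_t)ubrr + 1u));
}

/* Frames per second when frames are sent back to back. */
uint32_t frames_per_second(uint32_t bit_rate, uint8_t bits_per_frame) {
    return bit_rate / bits_per_frame;
}
```

With a 16-MHz clock and UBRR0 = 3, the bit rate is 2,000,000 bits per second. At 8 bits per frame, that works out to 250,000 frames per second.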
A major hurdle in USART communication was flow control: namely, when to start and stop sending data to the video microcontroller. When the video microcontroller is busy drawing the visible lines on the television, it does not have time to accept data from the audio microcontroller. Thus, we must ensure a correct flow control scheme where we only send data when the audio microcontroller is ready to transmit (i.e., it has all the frequency magnitudes calculated) and the video microcontroller is ready to receive (i.e., drawing blank lines on screen).
To achieve this, we adapted the flow control scheme used by Alexander Wang and Bill Jo in the project, “Audio Spectrum Analyzer” (Cornell University, ECE4760, 2012). We used Pin D.6 as the “transmit ready” pin that the audio microcontroller controls and Pin D.7 as the “receive ready” pin that the video microcontroller controls. When the audio microcontroller is ready to transmit, it sets the transmit ready pin HIGH and waits for the receive ready pin to go HIGH. When the video microcontroller is in turn ready to receive, it will set its receive ready pin HIGH, and the audio microcontroller starts transmitting data. Note that the audio microcontroller cannot send data indefinitely since the video microcontroller has to put out a sync pulse every 63.625 µs. Thus, we designed the audio microcontroller to send 4 bytes of data per video line timing, which requires a total of approximately 16 μs. When the video microcontroller starts receiving the data per line, it immediately lowers the Receive Ready pin so that the next 4 bytes are not sent until the next line. After all 64 bytes of data have been transmitted, the audio microcontroller sets the Transmit Ready pin LOW and waits for the new sample data.
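The handshake can be illustrated with a small host-side simulation in which two flags stand in for the transmit ready and receive ready pins. This is a sketch of the scheme as described, not the firmware: the struct, function, and macro names are hypothetical, and the USART transfer itself is replaced by a simple copy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES    64   /* frequency magnitudes per spectrum      */
#define BYTES_PER_LINE 4    /* bytes sent during one video line       */

typedef struct {
    bool tx_ready;  /* driven by the audio MCU (pin D.6 in the text) */
    bool rx_ready;  /* driven by the video MCU (pin D.7 in the text) */
} handshake_t;

/* Simulates sending one 64-byte spectrum in 4-byte bursts.
 * Returns the number of video lines the transfer occupied. */
unsigned transfer_frame(const uint8_t *src, uint8_t *dst) {
    handshake_t hs = { false, false };
    size_t sent = 0;
    unsigned lines = 0;

    hs.tx_ready = true;                  /* audio: all magnitudes computed  */
    while (sent < FRAME_BYTES) {
        hs.rx_ready = true;              /* video: a blank line has started */
        lines++;
        if (hs.tx_ready && hs.rx_ready) {
            hs.rx_ready = false;         /* video: hold off the next burst  */
            for (int i = 0; i < BYTES_PER_LINE; i++) {
                dst[sent] = src[sent];   /* stands in for one USART frame   */
                sent++;
            }
        }
    }
    hs.tx_ready = false;                 /* audio: wait for new sample data */
    return lines;
}
```

At 4 bytes per line, a full 64-byte spectrum takes 16 video lines to transfer.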
The video microcontroller is responsible for taking the received frequency values and displaying them in a grayscale scrolling spectrogram on an NTSC television. A television operating on the NTSC standard is controlled by periodic 0-to-0.3-V sync pulses that determine the start of video lines and frames being displayed. One can imagine a single trace moving across the monitor in a “Z-pattern” along each line. Although the standard dictates drawing 525 lines at 30 frames per second, one can choose to display only half of the lines (262) at 60 frames per second. This results in a video line timing of roughly 63.5 μs. Although the exact timing may vary by about 5%, it is crucial to have the timing be consistent. After a sync pulse, a voltage level ranging from 0.3 to 1 V determines the black/white level of the display, where higher voltage results in a whiter image. For more detailed information on the NTSC standard and its implementation on ATmega microcontrollers, refer to Dr. Bruce Land’s “Video Generation with Atmel Mega644/1284 and GCC” (http://people.ece.cornell.edu/land/courses/ece4760/video/index.html).
For accurate sync generation, we use Timer 1 to generate the sync signal by entering the COMPA interrupt service routine every 63.625 μs. Since one can only enter the interrupt after finishing the execution of the current instruction, this results in an approximately one- to two-cycle inconsistency in our video line timing. To prevent this, we set OCR1B such that we enter the COMPB ISR just before the video line timing to put the CPU to sleep. Since it always takes the same number of cycles to wake up from Sleep mode, this ensures consistent video line timing.
When receiving the 64-frequency magnitudes from the audio microcontroller, the video microcontroller stores these values in a temporary receive buffer. The receive buffer is a 64-byte size char array, with each element storing the magnitude of each frequency bin. After the receive buffer is filled, the content of the buffer is copied to the Circular Screen Buffer for displaying to the television screen.
To have a scrolling spectrogram, one must display the contents of the Receive Buffer as a column of grayscale “intensities” on the rightmost side of the television screen, while shifting the older data left by one. In a previous lab involving the NTSC television in our design course, we had a single screen array of size bytes per line × screen height. Thus, the single screen array acts as a combination of multiple screen line buffers, where each line buffer immediately follows the previous line buffer. Obviously, this is impractical for a scrolling display, since shifting every byte of data on the linear screen array will take O(n²) time.
To solve the scrolling problem efficiently, we use circular line buffers. For each line buffer in the screen array, we keep track of a “start” pointer of the buffer. When a new frequency magnitude value is to be added, we add it at the position the pointer is indicating, and then move the pointer right by one. When we need to actually display the content of the buffer, we start displaying from the element the pointer is pointing at, and end at the element left of the pointer. Thus, as shown in Figure 4, we can efficiently shift the position of each data relative to the buffer’s starting position without doing any actual data shifting.
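The circular line buffer can be sketched as follows. The type and function names are hypothetical, and the line length of 64 matches the 64 time intervals mentioned in this article. The project's actual buffer layout may differ.

```c
#include <stdint.h>

#define LINE_LEN 64   /* time intervals shown per line (assumed) */

typedef struct {
    uint8_t data[LINE_LEN];  /* one gray level per time interval     */
    uint8_t start;           /* oldest entry; also the next write slot */
} line_buf_t;

/* Adds the newest gray level by overwriting the oldest entry,
 * then advances the start pointer. No data is shifted. */
void line_push(line_buf_t *b, uint8_t level) {
    b->data[b->start] = level;
    b->start = (uint8_t)((b->start + 1) % LINE_LEN);
}

/* Copies the line in display order: oldest (leftmost) to newest
 * (rightmost), wrapping around the end of the array. */
void line_read(const line_buf_t *b, uint8_t *out) {
    uint8_t idx = b->start;
    for (int i = 0; i < LINE_LEN; i++) {
        out[i] = b->data[idx];
        idx = (uint8_t)((idx + 1) % LINE_LEN);
    }
}
```

Each new column costs a single write and a pointer increment per line, regardless of how much data is already on the screen.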
As we explained in the Hardware section, we use four GPIO pins (pin A.0 to A.3) to generate the grayscale intensity and pin B.0 to output the sync signal. To display each line, we keep a temporary array pointer starting from the start pointer of the circular line buffer. For each “pixel,” we take the byte value indicated by the array pointer, assign the value to PORTA, and increment the array pointer, making sure that the pointer loops back to the lowest index of the circular buffer when it goes over the highest index. Thus, a higher frequency magnitude corresponds to a whiter level and lower magnitude corresponds to a darker one. With this method we were able to generate up to 64 discrete time intervals on the television without any assembly coding.
PUSH BUTTON CONTROLS
We wanted the user to be able to play or stop the display at any given moment, and also change the speed of scrolling so that audio signals of longer duration can be captured within a single screen. Thus, we created two simple debouncing state machines for the Video MCU, which change their states given input from their respective push buttons. At every time interval, the play/pause state and speed state of the system are evaluated and changed according to the debounced push button inputs. On the Audio MCU, a single log/linear state machine selects whether logarithmic or linear scaled magnitude information is transmitted to the Video MCU.
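A debouncing state machine of this kind can be sketched with four states, where a press is accepted only after the button reads as pressed on two consecutive polls. The state names, the two-poll confirmation, and the function name are assumptions for illustration. The project's exact state machine is not described in this article.

```c
#include <stdbool.h>

typedef enum {
    DB_RELEASED,        /* button is up                            */
    DB_MAYBE_PRESSED,   /* saw one pressed reading, not confirmed  */
    DB_PRESSED,         /* press confirmed                         */
    DB_MAYBE_RELEASED   /* saw one released reading, not confirmed */
} db_state_t;

typedef struct {
    db_state_t state;
} debouncer_t;

/* Call once per polling interval with the raw button reading.
 * Returns true exactly once for each confirmed press. */
bool debounce_step(debouncer_t *d, bool raw_pressed) {
    switch (d->state) {
    case DB_RELEASED:
        if (raw_pressed) d->state = DB_MAYBE_PRESSED;
        return false;
    case DB_MAYBE_PRESSED:
        if (raw_pressed) {
            d->state = DB_PRESSED;
            return true;             /* confirmed: report the press */
        }
        d->state = DB_RELEASED;      /* it was only a glitch        */
        return false;
    case DB_PRESSED:
        if (!raw_pressed) d->state = DB_MAYBE_RELEASED;
        return false;
    case DB_MAYBE_RELEASED:
        d->state = raw_pressed ? DB_PRESSED : DB_RELEASED;
        return false;
    }
    return false;
}
```

A play/pause toggle would then flip a flag each time `debounce_step` returns true, for example `if (debounce_step(&play_button, raw)) playing = !playing;`.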
The resulting real-time spectrogram was clear and accurate: we were able to observe various audio samples and distinguish them clearly. We were able to visualize a wide range of audio signals, from simple claps and whistles to recorded music. We found that the scrolling spectrogram generator was very accurate, albeit low in resolution, at displaying even the most complex spectrograms, such as human speech. This is because most speech and music fall within our 4-kHz limit; for reference, the highest note on a standard 88-key piano is 4,186 Hz.
Of particular interest were our results from animal calls—specifically bird songs—which had spectrograms that we could easily compare to those posted on the ‘Net. Every bird species has its own distinctive mating and social calls, and these can often easily be recognized by observing their spectrograms. We were able to clearly and accurately distinguish the different bird calls. This also alludes to some practical implications of our project. Photo 2 shows the spectrogram of a banded wren as produced by our system. Notice that you can clearly observe the various characteristics of the bird call, such as the sudden rise in its pitch and the dissipation of the sound over time.
Photo 3 shows a snapshot of dual-tone multi-frequency (DTMF) signaling, in this case a modem dial-up handshake. Notice how the specific frequency bands that DTMF uses are accurately captured by the spectrogram. The modem’s data scrambling in the latter half of the spectrogram, which is the white-noise background that we hear during a modem dial-up sequence, is also clearly visible.
Due to the finite window of the spectrogram displayed on screen, we feel that pictures are inadequate for conveying the full capabilities of our system. To experience our spectrogram to the fullest, check out the videos posted on Bruce Land’s YouTube page: https://www.youtube.com/user/ece4760. (Direct links to specific videos are listed in the Resources section at the end of this article.)
Our project is not only useful aesthetically, but also in terms of distinguishing different bird calls. Although it is lower in resolution and range than spectrograms generated from dedicated hardware and software, we believe our portable, low-cost device will aid ornithology researchers in their fieldwork. We are very excited about this prospect and plan to pursue this approach in the future.
Authors’ Note: The contributions of Madhuri Kandepi, another collaborator on this project, made this project possible. In addition, we’d like to thank Bruce Land and the support offered to us through the ECE4760 class he teaches at Cornell University. Lastly, we’d like to acknowledge the work of Alexander Wang and Bill Jo on their “Audio Spectrum Analyzer,” which was the source of our inspiration for this project.
B. Hayes, “Spectrogram Reading Practice,” 2004, www.linguistics.ucla.edu/people/hayes/103/SpectrogramReading/Index.htm.
Wikipedia, “Audio Frequency,” http://en.wikipedia.org/wiki/Audio_frequency.
Cornell Lab of Ornithology, “What is a Sound Spectrogram?,” www.birds.cornell.edu/brp/the-science-of-sound-1/what-is-a-spectrogram/.
B. Land, “Prototype Board for Atmel Mega 644,” http://people.ece.cornell.edu/land/PROJECTS/ProtoBoard476/.
B. Land, “Bird Call 1 Desktop,” www.youtube.com/watch?v=y6yHE4OqZSI.
B. Land, “DialUpModem Desktop,” https://www.youtube.com/watch?v=3MZB4BtSd7M.
B. Land, “DSP for GCC,” 2013, http://people.ece.cornell.edu/land/courses/ece4760/Math/avrDSP.htm.
B. Land, ECE4760: Designing with Microcontrollers, Cornell University, http://people.ece.cornell.edu/land/courses/ece4760/.
B. Land, “Realtime Sound Spectrogram,” https://www.youtube.com/watch?v=bZ4p090KL3w.
B. Land, “Video Generation with Atmel Mega644/1284 and GCC,” 2012, http://people.ece.cornell.edu/land/courses/ece4760/video/index.html.
Iowegian International Corp., “Function Approximation Tricks,” www.dspguru.com/book/export/html/62.
A. Wang and B. Jo, “Audio Spectrum Analyzer,” http://people.ece.cornell.edu/land/courses/ece4760/FinalProjects/f2012/ajw89_bwj8/ajw89_bwj8/index.html.
F. Woodland and J. Yuen, “Gray-Scale Graphics: Dueling Ships,” 2003, http://people.ece.cornell.edu/land/courses/ece4760/FinalProjects/s2003/fww3jhy5/.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JANUARY 2016 #306
Varun Hegde is a veteran web developer and loves DIY electrical projects. His interests include applications of embedded systems in audio processing and human-computer information. He is a senior studying Electrical & Computer Engineering at Cornell University.
Hyun Ryong is a senior in Cornell College of Engineering, majoring in Electrical & Computer Engineering. His primary interests include computer architecture, embedded systems, and memory systems, and he conducts research in these areas in the M3 Architecture Research Group.