
Build a SoundFont MIDI Synthesizer (Part 1)

Written by Brian Millier

Using Teensy 4

In this two-part article series, Brian discusses several ways to emulate musical instruments with an electronic synthesizer, and the technologies they require. He then focuses on the SoundFont standard, setting up the groundwork for Part 2 where he programs and builds a MIDI wavetable synthesizer using Teensy 4.

What you'll learn:

  • How to emulate musical instruments with an electronic synthesizer
  • How to understand the SoundFont standard
  • How to build a MIDI wavetable synthesizer using Teensy 4
  • How to understand the concept of looping in wavetable synthesis
  • How to convert a SoundFont file using Python
  • How to work with wavetable synth objects

Technologies and tools discussed:

  • Teensy 4 module
  • Teensy audio library
  • Python language
  • Microsoft’s Visual Studio
  • Teensy Audio System Design Tool
  • SoundFont
  • MIDI

Back in Circuit Cellar issue 328 (November 2017), I described a Hammond tonewheel organ emulator using a Teensy 3.6 module. The Hammond organs used mechanical tonewheels to generate 91 sine waves at the required “musical” frequencies, and the organ mixed these to produce a wealth of different “voices.” It was based on the mathematical principle that you can generate virtually any desired waveform by combining sine waves consisting of the fundamental tone and various proportions of higher harmonic frequencies.

Although these organs produced a rich variety of voices, many sonic subtleties are present in conventional musical instruments that were not present in the Hammond organ’s sound—or any electronic organ, for that matter. The most notable difference is that when you play a conventional musical instrument, each note has an amplitude envelope. That is, its amplitude rises quickly from silence, stays at some relatively constant value while the note is being held and then decays (usually exponentially) back to silence after the musician stops playing that note.

There are actually more “phases” to this amplitude envelope, but you get the idea. Conversely, the Hammond organ’s sound goes from complete silence to some fixed amplitude immediately, and stays constant until the key is released, at which time it returns immediately to silence. I am ignoring the “percussive” voicing on the Hammond organ in this comparison, but it is a limited form of the envelope concept.

To emulate conventional musical instruments with an electronic synthesizer, many different approaches have been taken over the last 70 years. To a large extent, the forms of emulation that were commercially developed depended heavily on the available electronic technology of the time. I won’t delve into the various approaches taken in the past, but will concentrate on the modern wavetable synthesis method that is in common use today. This method requires fast microcontrollers (MCUs) and lots of cheap RAM memory—both of which are commonplace today.

WAVETABLE SYNTHESIS
The concept here is to forget about using a complex algorithm to generate the required waveform, but instead to “record” a conventional musical instrument to obtain the actual waveform of the instrument’s sound. Alternately, a wavetable synthesizer can use algorithmically-derived waveforms, instead of acoustic musical instrument samples, to generate waveforms that are musically pleasing, but not derived from any actual musical instrument.

These sound “samples” are then stored in some form of non-volatile storage, such as a memory card or other high-capacity ROM memory chip(s). When you play the synthesizer, that waveform is read out at a fixed sample rate. A method known as Direct Digital Synthesis (DDS) is used to achieve all the necessary note frequencies.
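The DDS readout just described can be sketched roughly as follows. This is a minimal illustration with my own names and table size, not the Teensy library's actual code: a 32-bit phase accumulator is advanced by a per-sample increment proportional to the desired frequency, and its top bits index the wavetable.

```cpp
#include <cstdint>

// Illustrative DDS sketch (hypothetical names, assumed 256-entry table).
constexpr int TABLE_BITS = 8;                 // 256-entry wavetable
constexpr float SAMPLE_RATE = 44100.0f;

// Phase increment for a given output frequency: one full sweep of the
// 32-bit accumulator range corresponds to one waveform period.
uint32_t phaseIncrement(float freqHz) {
    return static_cast<uint32_t>(freqHz * (4294967296.0 / SAMPLE_RATE));
}

// Read the next sample: the top TABLE_BITS bits of the accumulator form
// the table index; the accumulator wraps automatically at 2^32.
int16_t nextSample(const int16_t* table, uint32_t& phase, uint32_t inc) {
    int16_t s = table[phase >> (32 - TABLE_BITS)];
    phase += inc;
    return s;
}
```

Because the accumulator has far more resolution than the table index, arbitrarily fine pitch steps are possible even with a short table.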


The process I just described, while accurate, only describes a small part of the overall wavetable synthesizer functionality that is necessary to produce musically accurate recreations of acoustic instruments. If you did nothing beyond implementing the simple procedure described, the results would be musically boring and quite unrealistic in the emulation of many acoustic musical instruments. There are three reasons why this would be the case.

First, the whole concept of the amplitude envelope, mentioned above, is not captured in the recorded waveform. You could record the note being played for several seconds, thus including the amplitude envelope. However, this would take up a lot of memory. And how would you handle notes of varying durations?

Second, conventional musical instruments, being physical objects, have acoustical properties, such as resonances, that vary as you go from the lowest note that they can produce up to the highest note. Therefore, you can’t assume that a waveform recorded for a low note will bear any resemblance to that of higher notes within the instrument’s range. In practice, you must record several sample waveforms spread over the instrument’s range, and store each of them in memory.

Third, most musical instruments do not produce a perfectly stable frequency while a note is being played. Often a small frequency modulation, called “vibrato,” occurs. Sometimes this vibrato isn’t present at the very beginning of the note, but gets introduced slowly as the note is held. This vibrato may or may not be present for each note played, and is part of the musician’s style of playing. Therefore, the synthesizer should allow for this possible variation, by responding to “modulation” commands coming from the attached keyboard.
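One rough way to picture delayed vibrato in a DDS-style synthesizer is as a slow sine wave gently scaling the phase increment, with a fade-in factor that rises while the note is held. The names and parameters below are hypothetical, purely for illustration:

```cpp
#include <cmath>
#include <cstdint>

// Illustrative vibrato sketch (hypothetical names, not the library's code).
// depth is the modulation depth as a fraction of pitch (e.g. 0.01),
// fadeIn ramps from 0 to 1 while the note is held, giving delayed vibrato.
uint32_t vibratoIncrement(uint32_t baseInc, float lfoPhase,
                          float depth, float fadeIn) {
    float mod = 1.0f + depth * fadeIn * std::sin(lfoPhase);
    return static_cast<uint32_t>(baseInc * mod);
}
```

Tying `fadeIn` (or `depth`) to incoming MIDI modulation messages is what lets the player control vibrato expressively from the keyboard.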

Those three reasons are not a complete list of the parameters that must be taken into consideration to produce a realistic wavetable synthesizer. However, they are the most important ones.

LOOPING
Another required concept that’s important in wavetable synthesis is “looping.” The idea here is that the electronic waveform sample that has been recorded contains two main sections. They are: 1) the initial “attack” section, where the sound goes from silence to a reasonably stable frequency and maximum amplitude; and 2) the “loop” section, where the sound is reasonably stable in terms of frequency and amplitude.

Therefore, when you are “playing” a sample, you first read out the initial attack part of the sample. Then you continuously repeat the loop section of the waveform, for as long as the note is being held. Once the key is released, you can continue reproducing that loop section of the waveform, but you use another part of the wavetable routine that is handling the amplitude envelope to exponentially reduce the amplitude of the note down to silence.
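The attack-then-loop readout described above might be sketched like this (illustrative names, not the library's actual code): the read position simply advances through the sample, and whenever it passes the end of the loop region it wraps back into that region.

```cpp
#include <cstdint>

// Hypothetical sketch of loop playback. After the attack portion plays
// once, the read position stays inside the loop region while the note
// is held (and through the release phase, as the article describes).
struct LoopedSample {
    const int16_t* data;
    uint32_t loopStart;   // first sample index of the loop region
    uint32_t loopEnd;     // one past the last sample of the loop region
};

int16_t readLooped(const LoopedSample& s, uint32_t& pos) {
    int16_t out = s.data[pos++];
    if (pos >= s.loopEnd)                      // past the loop region?
        pos = s.loopStart + (pos - s.loopEnd); // wrap back into the loop
    return out;
}
```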

You might question why you even need the initial attack portion of the waveform, when you have a separate section of the program handling the amplitude envelope. Conventional musical instruments often have quite complex attack waveforms, in terms of both amplitude and frequency/harmonic content. Therefore, it would not be sufficiently accurate to depend solely on the amplitude envelope function in the program to emulate the attack section of the sound. The amplitude envelope is handled by a function called ADSR (from the Attack, Decay, Sustain and Release phases). An ADSR envelope is shown in Figure 1. Note that in acoustic musical instruments, the attack, decay and release phases are generally exponential in nature.

FIGURE 1 – Any musical instrument’s sound has an amplitude envelope comprising four basic sections: Attack, Decay, Sustain and Release. The richer the instrument’s sound, the more complex this envelope will be.

SoundFont is a file format and associated technology that uses sample-based synthesis to play MIDI files. In short, SoundFont files contain recorded audio samples of musical instruments. Looking at SoundFont files, the decay time is often specified as zero, because this isn’t a big consideration in the overall replication of the note. (The Decay section of Figure 1 is somewhat exaggerated for illustration’s sake.) Also, due to the relatively short amount of computing time that is available with a sample rate of 44,100Hz, the exponential curves found in the attack/decay/release phases are often simulated with a simpler linear ramp, instead of an exponential curve.
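A linear-ramp ADSR of the kind just described — each phase changing the gain by a fixed per-sample step rather than following an exponential curve — could be modeled as follows. This is an illustrative sketch with my own names, not the Teensy library's implementation:

```cpp
// Illustrative linear-ramp ADSR state machine (hypothetical names).
enum class Phase { Attack, Decay, Sustain, Release, Done };

struct Adsr {
    float gain = 0.0f;
    Phase phase = Phase::Attack;
    float attackStep = 0.0f;    // gain added per sample during Attack
    float decayStep = 0.0f;     // gain removed per sample during Decay
    float sustainLevel = 0.0f;  // gain held during Sustain
    float releaseStep = 0.0f;   // gain removed per sample during Release

    float next(bool keyDown) {
        if (!keyDown && phase != Phase::Done) phase = Phase::Release;
        switch (phase) {
        case Phase::Attack:
            gain += attackStep;
            if (gain >= 1.0f) { gain = 1.0f; phase = Phase::Decay; }
            break;
        case Phase::Decay:
            gain -= decayStep;
            if (gain <= sustainLevel) { gain = sustainLevel; phase = Phase::Sustain; }
            break;
        case Phase::Sustain:
            break;
        case Phase::Release:
            gain -= releaseStep;
            if (gain <= 0.0f) { gain = 0.0f; phase = Phase::Done; }
            break;
        case Phase::Done:
            break;
        }
        return gain;
    }
};
```

Multiplying each wavetable sample by `gain` gives the shaped note; a real implementation would typically update the envelope in blocks rather than per sample for efficiency.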


When wavetable synthesis was first developed commercially, the synthesizer companies basically generated their own sample waveforms, stored them in ROM memory devices in the instrument and developed proprietary methods to handle the envelope requirement and various other modulation requirements. These synthesizers were called “samplers.” There were ways to extend the available voices on such instruments using plug-in memory cartridges, but nothing was standardized among the various synthesizer companies. In time, this led to the development of the SoundFont standard, which, in its own way, advanced the state of the art in wavetable synthesizers in much the same way that the earlier MIDI standard advanced electronic musical instruments in general.

THE SOUNDFONT STANDARD
You could write a whole book about this standard. In the RESOURCES at the end of this article, you can find all the references for this article—including a link [1] to just such a book that describes the SoundFont specifications in detail. The initial version 1.0 of the SoundFont standard was introduced by Creative Labs for its Sound Blaster AWE32 product. This wasn’t a conventional synthesizer, but rather a sound board designed to be mounted in a PC, plugged into the internal ISA bus. Figure 2 shows the AWE32 sound board. The SoundFont format is now also used by some stand-alone synthesizers, and in instrument plug-ins used by DAW software on PCs and Macs.

FIGURE 2 – Back in the ‘80s, a good-quality sound card for IBM PC computers looked like this. Creative Labs, which produced this sound card, was the originator of the SoundFont standard.

The SoundFont file structure is, by necessity, somewhat non-rigid. Depending on the acoustic “richness” of the instrument being emulated, the size of the waveform sample(s) required can vary dramatically. As mentioned earlier, to replicate the sound of any acoustic instrument accurately, you must provide different samples over the range of notes that the instrument can produce. A piano, for example, might require 20 or more individual samples to cover its 88 keys—one for every group of three or four keys. Some deluxe piano wavetable voices use one waveform sample per note!

These individual samples are assigned to “regions,” and in addition to the region’s basic waveform, you also have other required data—such as the ADSR envelope, vibrato parameters and numerous other parameters that affect the nuances of the instrument’s voice. I refer to these other parameters in this article as “metadata,” although that isn’t the term that is used in the SoundFont reference. Listing 1 shows the data structure used in my program to define what I call the metadata. Note that there is one such structure for every region of the voice.

struct sample_data {
  int16_t* sample;
  bool LOOP;
  int INDEX_BITS;
  float PER_HERTZ_PHASE_INCREMENT;
  uint32_t MAX_PHASE;
  uint32_t LOOP_PHASE_END;
  uint32_t LOOP_PHASE_LENGTH;
  uint16_t INITIAL_ATTENUATION_SCALAR;
  // VOLUME ENVELOPE VALUES
  uint32_t DELAY_COUNT;
  uint32_t ATTACK_COUNT;
  uint32_t HOLD_COUNT;
  uint32_t DECAY_COUNT;
  uint32_t RELEASE_COUNT;
  int32_t SUSTAIN_MULT;
  // VIBRATO VALUES
  uint32_t VIBRATO_DELAY;
  uint32_t VIBRATO_INCREMENT;
  float VIBRATO_PITCH_COEFFICIENT_INITIAL;
  float VIBRATO_PITCH_COEFFICIENT_SECOND;
  // MODULATION VALUES
  uint32_t MODULATION_DELAY;
  uint32_t MODULATION_INCREMENT;
  float MODULATION_PITCH_COEFFICIENT_INITIAL;
  float MODULATION_PITCH_COEFFICIENT_SECOND;
  int32_t MODULATION_AMPLITUDE_INITIAL_GAIN;
  int32_t MODULATION_AMPLITUDE_SECOND_GAIN;
};

LISTING 1 – The part of the SoundFont file structure that contains what I call the “metadata” for the voice, in addition to the actual wavetables that define the voice.

The SoundFont file structure was adapted to fit into the pre-existing Microsoft RIFF-wave file structure. This structure uses various “chunks” and “sub-chunks” to store data. In the case of SoundFonts, the various sample waveforms required for each keyboard region are stored in one type of chunk, and that region’s metadata (ADSR envelope settings, vibrato, sample rate, loop length and so on) are stored in other chunks. Another chunk stores identification data, such as the voice name and who engineered the voice samples. I provide a link to the RIFF file format [2], although that is very generic. The SoundFont reference manual describes the way in which the RIFF-file format is used for that purpose in greater detail.
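The chunk structure just described is the generic RIFF shape: every chunk begins with a four-character ID followed by a little-endian 32-bit payload size. A minimal header-parsing sketch (my own names, and only the generic header, not the SoundFont-specific sub-chunks) might look like this:

```cpp
#include <cstdint>
#include <cstring>

// Generic RIFF chunk header: a four-character ID and a little-endian
// 32-bit payload size. Parsing a SoundFont means walking nested chunks
// ("RIFF", "LIST", sample and metadata sub-chunks) of this shape.
struct RiffChunkHeader {
    char id[4];        // e.g. "RIFF", "LIST", "smpl"
    uint32_t size;     // number of payload bytes that follow this header
};

// Parse a header from a raw byte buffer, assembling the size field
// byte-by-byte so the code works regardless of host endianness.
RiffChunkHeader parseChunkHeader(const uint8_t* buf) {
    RiffChunkHeader h;
    std::memcpy(h.id, buf, 4);
    h.size = static_cast<uint32_t>(buf[4]) |
             (static_cast<uint32_t>(buf[5]) << 8) |
             (static_cast<uint32_t>(buf[6]) << 16) |
             (static_cast<uint32_t>(buf[7]) << 24);
    return h;
}
```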

If the SoundFont file structure had been defined differently—for example, as one in which there were a certain number of well-defined data fields with known lengths and readily-discernible terminators/separators—I might have been tempted to write a routine in the C language to parse that file into a form that this project could use. This could have been handled by the Teensy 4 itself, since it is certainly powerful enough, and there is plenty of program memory space available to easily handle a complex routine such as this.

However, seeing how complex these SoundFonts were in the RIFF format, I decided that it was more practical to use the work that had already been done by the group of students who had written the Teensy Wavetable library object itself.

CONVERTING A FILE USING PYTHON
I touched upon the complexity of the SoundFont file format in the last section. The people who wrote the Wavetable object for the Teensy Audio library decided to write a Python program that would allow you to:

1) Browse your filesystem for a desired SoundFont file.
2) Observe and choose which of the keyboard regions you wanted to import into the wavetable object library.
3) Format the SoundFont waveforms and the metadata into several blocks of information, which are then stored as a “.cpp” file. A small amount of remaining data is stored in an associated “.h” file.
4) Choose a primary filename for the “.cpp” and “.h” files.

The developers call this program “decoder.py.” I commend them on using Python for this task. When it comes to handling/parsing out complex file structures, Python is an excellent choice. If you look at the original decoder.py program, you can see that it’s doing a whole lot of parsing and conversions in a relatively small program—a tribute to the efficiency of the Python language.

Python is available for free for Windows computers, Mac OS X and Linux. Personally, I had only used Python sporadically several years ago, as part of some work I did with the Raspberry Pi (running Linux). I had never used it in Windows. I was optimistic that I could remember enough Python to convert this program to something that would work the way I needed for this project.

I have been using Microsoft’s Visual Studio for 5 years, and am currently using the latest version—Visual Studio 2019. In the past, I had only used Visual Studio with the Visual Micro plug-in, which allows me to develop programs on any MCU platform that the Arduino IDE supports. For me, that includes AVR, Arm (Teensy 3.x and 4) and the Espressif ESP8266 and ESP32. However, Visual Studio also supports Python development. I was able to add this functionality by running the Visual Studio Installer (from the Windows Start Menu) and adding Python support. This puts the Python executable program into the folder:

Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64


You will have to add this location to your Windows PATH, so that you can run Python from other folders (where your Python programs will be located). If you don’t know how to modify the Windows PATH, I describe how to do this in the “Python Resources help file” that is included with the rest of the source files for this project.

FOCUS ON FILES
Once you have Python installed, you can get the SoundFont decoder.py program (and the associated controller.py program, which provides a Windows GUI which runs it) from the GitHub source [3]. There are some dependencies that need to be installed, and this is also described on that GitHub site and summarized in my “Python Resources help file.” While you will need the other files from this GitHub repository, you must use the decoder.py file that I provide with the rest of the source code, because I modified it to work for this project. This is available on the Circuit Cellar code and files webpage.

Why can’t you just use the original decoder.py program written by the developers of the AudioSynthWavetable library? They designed decoder.py to decode the SoundFont file data in such a way that it could be “#included” into a Teensy C language program as “.cpp” and “.h” files. Basically, this makes the chosen SoundFont voice an integral part of the synthesizer program—it is stored in flash memory. Their program only produces the one voice that you have selected and embedded in the Teensy program. I wanted to be able to choose various voices at will, loaded from an SD card.

The format of the .cpp and .h files from the original decoder.py program was dictated by how the compiler could handle the various data structures. In fact, many of my metadata parameters were defined in such a way that certain Audio library constants were embedded in the parameter itself. For example:

uint32_t(431*SAMPLES_PER_MSEC/8.0+0.5) , //RELEASE_COUNT

In equation form, the release count parameter is defined as:

RELEASE_COUNT = 431 × SAMPLES_PER_MSEC / 8.0 + 0.5, truncated to an unsigned integer

Expressing it in this way is fine when the expression evaluator used in the pre-compile phase translates the expression into a number. However, I did not want the SoundFont data to arrive in my program in a format that would require a lot of expression evaluation. Instead, I modified the original decoder.py program to format its parameters as actual numbers, with no system-dependent constants such as SAMPLES_PER_MSEC embedded in the parameter.
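The pre-evaluation that the modified decoder.py performs can be illustrated in C++ as follows. The division by 8.0 is my reading of the library's envelope update interval, and the function name is my own, so treat both as assumptions:

```cpp
#include <cstdint>

// At a 44,100Hz sample rate there are 44.1 samples per millisecond.
constexpr double SAMPLES_PER_MSEC = 44100.0 / 1000.0;

// Convert a release time in milliseconds to the plain integer count the
// modified decoder emits. The /8.0 is assumed to match the library's
// envelope update interval; the +0.5 rounds to the nearest count.
constexpr uint32_t releaseCount(double ms) {
    return static_cast<uint32_t>(ms * SAMPLES_PER_MSEC / 8.0 + 0.5);
}

// The example parameter from the text: 431 ms -> 2376.
static_assert(releaseCount(431.0) == 2376, "matches the example parameter");
```

Emitting `2376` directly, instead of the textual expression, is what frees the Teensy program from having to evaluate expressions at load time.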

I made other modifications to simplify the processing of the SoundFont data by the Teensy 4 program, and to eliminate some other lines of code that were originally needed to allow the “.cpp” and “.h” files to be directly included in the Teensy program code.

Figure 3 shows a screen capture of the controller.py program in action. The controller.py program is a Windows GUI skin over the decoder.py program (a command-line Python program). In the upper-left portion, you can see a black rectangle: the Python interpreter (py.exe) running from the command line and interpreting the controller.py program. When the controller.py program has all the required user input, it passes that information to the decoder.py program to do the actual decoding. Running an interpreted language like Python is different from what most of us are used to with compiled (.exe) applications!

FIGURE 3 – The GUI for the Python program that converts standard SoundFont files into a format that can be used by the C code running on the Teensy MCU.

In Figure 3, I’ve loaded a SoundFont file containing a whole set of General MIDI programs or voices, and selected the Baritone Sax voice. All that’s left to do is choose how many of the samples (regions) you’re going to select. This will depend upon how many keyboard regions will fit into the Teensy 4’s 480,000-byte sample waveform memory. The size of the regions that you have selected is reflected in the “Sample Stats” window. Finally, you must pick a filename and select a folder in which to store the files.

For this project, all voices must be named in sequence from 1.cpp to 127.cpp—plus the like-named .h files. You can save these files to the PC hard disk itself, or insert an SD card adapter into your PC and save them to the SD card directly. One way or the other, you need all the sound files on the SD card. Now that the SoundFont files have been converted into a form that this project will accept, let’s look at the Wavetable Synth object in the Teensy Audio library.

WAVETABLE SYNTH OBJECT
The Teensy Audio library object AudioSynthWavetable was written by a group of Portland State University graduate students. It is documented on the GitHub site [3]. Figure 4 shows an instance of this object placed on the Teensy Audio System Design Tool workspace (I’ve provided a link to this online tool [4]), with its Help screen on the right. The seven available functions are shown. Basically, you set up the instrument by passing a string describing it to the setInstrument() function. That string is defined in the SoundFont’s “.h” file that has been “#include”-d in the program, after you have run the decoder program described in the last section. You use the playFrequency() function to start a note playing at a defined frequency and amplitude, the stop() function to end a note, and the isPlaying() function to know when the final release phase of the note is over.

FIGURE 4 – The Teensy Audio System Design Tool is used to interconnect the various sound modules that have been written for the Audio library. This shows the Wavetable object in the Design Tool. I made extensive changes to it, to make it do what I expected from the project.

The Wavetable synth object is monophonic—it can play only one note at a time—so you must define many instances of this object to handle many simultaneous notes. Considering that a note may linger for up to a few seconds during its release phase, I chose to include 48 separate Wavetable objects in my program. Note that you would generally use the Audio System Design Tool to arrange/wire up your various audio objects, and let it produce the C code that integrates those objects into your program. However, in this case, 48 discrete AudioSynthWavetable objects are needed, along with 17 mixers to combine all the Wavetable objects, so it was easier just to use the code written by the original developers in their sample program (lines 15-67 in my program).
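With 48 monophonic objects, each incoming MIDI Note On has to be routed to a free voice, and each Note Off has to find the voice currently sounding that note. A bare-bones allocation sketch (hypothetical names and bookkeeping, not the article's actual code) might look like:

```cpp
#include <cstdint>

// Hypothetical voice-allocation bookkeeping for 48 monophonic objects.
constexpr int NUM_VOICES = 48;

struct Voice {
    bool active = false;
    uint8_t note = 0;      // MIDI note number currently sounding
};

Voice voices[NUM_VOICES];

// Note On: find a free voice, or return -1 if all 48 are sounding.
int allocateVoice(uint8_t note) {
    for (int i = 0; i < NUM_VOICES; ++i)
        if (!voices[i].active) { voices[i] = Voice{true, note}; return i; }
    return -1;
}

// Note Off: locate the voice playing this note (to call its stop()).
int findVoice(uint8_t note) {
    for (int i = 0; i < NUM_VOICES; ++i)
        if (voices[i].active && voices[i].note == note) return i;
    return -1;
}
```

A real synthesizer would also clear the `active` flag only once isPlaying() reports the release phase has finished, so a voice isn't reused while its tail is still audible.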

I must commend the students who wrote this library object. Generating sounds based upon the SoundFont template requires the following considerations:

1) Scan through the waveform table (from the region corresponding to the given note) using the DDS method, to provide the basic waveform.
2) Perform the above scan using the initial section of the waveform table for the attack phase of the note.
3) Scan through the loop portion of the waveform repeatedly, during the time that the key is being depressed on the MIDI keyboard controller.
4) Control the amplitude of the signal throughout the loop duration, using the decay time and then the sustain time envelope parameters.
5) When the stop() function is called, continue scanning the waveform, but exponentially decay its amplitude to silence, according to the release time parameter.
6) Keep track of when the sound has decayed to zero amplitude, for use by the isPlaying() function.
7) Apply modulation, such as vibrato, the depth of which will often start at zero and increase while the note is still sounding.

Those are the main functions. There might be some minor ones implemented that I am unaware of. While the original AudioSynthWavetable object was very nicely implemented, in my opinion, it had two significant shortcomings.

First, you must load a specific SoundFont voice into flash memory—so it’s a part of the Teensy program. Therefore, the program can reproduce just that one particular voice. It’s possible to #include more than one set of .cpp and .h files into the program. This could give you several voices, but in practical terms, there is not enough memory space to handle more than one reasonably rich voice, even with the Teensy 4.

Second, although the original library object can handle vibrato (and in some SoundFonts a delayed vibrato), it cannot produce vibrato that is triggered by the musician moving the Modulation controller (the “mod wheel”), or from Channel Aftertouch messages. This “expressiveness” is commonly used by keyboard musicians.

When the Teensy 4 became available, it greatly increased the available amounts of both Flash and SRAM memory, compared to earlier Teensy modules. Given that the Teensy 4 had 1MB of SRAM, I felt it would be possible to convert the AudioSynthWavetable object to allow the following activities:

1) Move the waveform/metadata memory from program memory, where the original object placed it into SRAM. Note: This was a lot more time consuming than I had anticipated, mainly because I was using someone else’s code instead of writing my own.
2) Use an SD card to store up to 127 separate voices, which could be selected and loaded into SRAM on demand.
3) Allow the “mod wheel” to modulate the vibrato amount in real time. Also allow MIDI Channel Aftertouch messages to do the same thing.

In Part 2 of this article (Circuit Cellar 360, July 2020), I’ll describe the programming I wrote to accomplish this, and show the circuitry that I built to implement the MIDI wavetable synthesizer. 

RESOURCES

References:
[1] SoundFont Technical Specification:
http://freepats.zenvoid.org/sf2/sfspec24.pdf
[2] WAVE PCM soundfile format (the WAVE file format is a subset of Microsoft’s RIFF specification):
http://soundfile.sapp.org/doc/WaveFormat
[3] AudioSynthWavetable object library and associated utilities:
Project repository located on GitHub at
https://github.com/TeensyAudio/Wavetable-Synthesis
[4] Teensy Audio System Design Tool:
https://www.pjrc.com/teensy/td_libs_Audio.html

Espressif Systems | www.espressif.com
Microsoft | www.microsoft.com
PJRC | www.pjrc.com

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • MAY 2020 #358



Brian Millier runs Computer Interface Consultants. He was an instrumentation engineer in the Department of Chemistry at Dalhousie University (Halifax, NS, Canada) for 29 years.

