CC Blog Research & Design Hub

Real-Time Automatic Music Transcriber

Using a Raspberry Pi RP2040

This article describes our final project for Cornell University’s course on Digital Systems Design with Microcontrollers. We designed a system that uses a Raspberry Pi RP2040 microcontroller to generate a MIDI output from microphone input. We also researched ways to utilize machine learning to improve our note classification process.

  • What’s a project using the Raspberry Pi RP2040?
  • How can I build an automatic music transcriber?
  • How can I implement machine learning in a project?
  • Raspberry Pi RP2040

Automatic Music Transcription (AMT) is the multidisciplinary problem of converting recorded music into symbolic notation such as sheet music, with a broad set of applications. AMT is an especially challenging problem that continues to drive research in electrical engineering, specifically in digital signal processing and machine learning. An ideal system capable of perfect automatic transcription would enable a plethora of other technologies with far-ranging applications to the teaching, creation, and study of music. While AMT systems with convincing efficacy are not unheard of, most take advantage of the information available in a full recording, and thus require a pre-recorded audio input. Comparatively little work has focused on designing a simple, real-time device.

The idea of an embedded digital system for transcribing music in real time fit perfectly with both our personal interests and the learning goals of the final project for our capstone design class (Digital Systems Design with Microcontrollers) at Cornell University. During the final 4 weeks of our course, we set out to design and build a simple, low-cost system, based on the Raspberry Pi Pico (RP2040 microcontroller) and utilizing the RP2040 C SDK, to generate a MIDI output from an audio input in real time. Building off what we learned in the course, we developed a real-time note detection algorithm, along with the generation of corresponding MIDI (Musical Instrument Digital Interface) protocol messages. MIDI is a common protocol for digital music devices, and with some post-processing, our messages can be played back. In addition, we investigated the prospect of utilizing a machine-learning model to improve the note categorization process.


Our physical system is shown in Figure 1. We designed it for simplicity of user experience and construction. The Raspberry Pi RP2040 MCU defaults to a mode in which it listens for audio input. When music or sounds are played nearby, the microphone picks up the audio and the system processes it. The user can then press an integrated button to stop the audio listening and processing, and send the generated MIDI stream through serial UART (universal asynchronous receiver/transmitter) to a Python script, which turns it into an appropriate MIDI file. The end result is a file that can be used in any DAW (Digital Audio Workstation) or other software capable of parsing MIDI to display, modify, or play back the transcribed audio. Simply pressing the RP2040's onboard reset button allows the user to play new audio.

Figure 1: Labeled image of our project

The system has four components: 1) an audio input through an electret microphone and the RP2040's analog-to-digital converter (ADC); 2) a fast Fourier transform (FFT) and power spectrum analysis of the audio; 3) the note detection algorithm; and 4) the MIDI output and file generation.

The analog signals detected by the microphone are fed through the Raspberry Pi Pico's onboard ADC. From there, we perform a fixed-point implementation of an FFT on the ADC output, to gather frequency and power information from the perceived sounds. By referencing the output of the FFT, we are able to analyze and determine which notes were just heard, and appropriately convert them to corresponding MIDI messages. The generated MIDI messages are stored in hexadecimal on the microcontroller; when directed by the user or by a preprogrammed timeout, they are sent to one's computer through UART serial, to be turned into a usable .MID file by a Python script.

Our system also includes a VGA driver for the RP2040's programmable I/O (PIO) module to display the FFT and note detection on a monitor in real time. Although the VGA component is not imperative to the overall functionality of the project, it proved to be a useful debugging tool throughout the design and implementation of the project. Furthermore, it is interesting to visualize how the system is working in real time.


A block diagram is given in Figure 2, and Figure 3 is the circuit schematic. Because our project was mostly software-based, aside from the microphone, there are relatively few hardware components. The only components required for our project are the Raspberry Pi Pico, the microphone, and a STOP_TRACK button. The VGA display is not required, but proved to be useful for debugging and verification of our algorithm and design.

Our project runs on the RP2040 MCU chip embedded in the Raspberry Pi Pico board. The RP2040 contains: a dual-core Arm Cortex-M0+ processor, meaning we can run two processes simultaneously; several programmable input/output (PIO) state machines (Figure 4), which allow us to utilize the VGA display; an ADC, which is needed to convert the output of the microphone into a form that is interpretable by the FFT; and other unutilized features.

Figure 2: Block diagram of our hardware layout
Figure 3: Circuit schematic of our project
Figure 4: State machine diagram of our software implementation

We used an Electret Microphone Amplifier, which combines a microphone with a Maxim MAX4466 op-amp. Together, these give our system an amplified, lower-noise signal to process. Although this device is designed to be less noisy than a standard microphone, it was heavily affected by 60Hz noise coupled in from the power supply, which showed up as low-frequency power in our FFT. We effectively eliminated this issue by placing a grounded metal sheet above and below the entire system. The output of this device is connected to a GPIO pin on the Pico, which is routed internally to the Pico's ADC.

The last piece of hardware is the VGA screen, which we used to visualize debugging information throughout the 4 weeks of lab time. As already mentioned, the VGA screen was an incredibly helpful tool for understanding how our system was working, and for debugging when we ran into problems with our code. The VGA display was a better debugging method than printing to the serial monitor, because it displays with less latency and makes information easier to visualize graphically. This is not to say the serial monitor was obsolete; serial communication was critical for transferring the compiled MIDI from the microcontroller to one's computer.


The code used in our project is available on the Circuit Cellar Article Code and Files webpage. To facilitate the timely detection of notes and avoid the increased latency of floating-point arithmetic, we utilize a fast Fourier transform implementation that uses fixed-point arithmetic, which is generally faster than an equivalent floating-point algorithm on a processor without a floating-point unit. Because of this choice, we could not use an off-the-shelf floating-point FFT library. For an earlier lab in this class, we were provided with code to detect a single peak in an FFT: a 1024-sample, in-place, fixed-point FFT algorithm operating at a 10kHz sample rate. We modified this code to detect every peak above a threshold, enabling us to detect chords of music that contain multiple played notes in a single unit of time.
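The multi-peak rule (above a threshold and a local maximum) can be sketched as follows; `NUM_BINS`, `PEAK_THRESHOLD`, and `find_peaks` are illustrative names and values, not our actual project code:

```c
#include <assert.h>

/* Illustrative constants: usable bins of a 1024-point FFT and an
 * empirically chosen magnitude floor (placeholder value). */
#define NUM_BINS 512
#define PEAK_THRESHOLD 100

/* Record every bin that exceeds the threshold AND is a local maximum.
 * Returns the number of peaks found; their bin indices go into peaks[]. */
int find_peaks(const int spectrum[NUM_BINS], int peaks[NUM_BINS]) {
    int count = 0;
    for (int i = 1; i < NUM_BINS - 1; i++) {
        if (spectrum[i] > PEAK_THRESHOLD &&
            spectrum[i] > spectrum[i - 1] &&
            spectrum[i] >= spectrum[i + 1]) {
            peaks[count++] = i;
        }
    }
    return count;
}
```

Each returned bin index i corresponds to a frequency of i times (sample rate / FFT size), so roughly i × 9.77Hz at a 10kHz sample rate, which is what gets handed to the note classifier.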

Fixed-point arithmetic (Figure 5) is similar to standard signed integer arithmetic, except that we place an implicit binary point inside the word: the low 15 bits hold the fraction, and the remaining high bits (including the sign bit) hold the whole number. In doing so, we gain resolution to handle smaller numbers, but lose range to handle larger numbers. Importantly for our purposes, fixed-point makes our FFT and other arithmetic operate much faster, since it uses integer machine instructions rather than floating-point operations. This speed is essential for detecting notes in a timely manner.

Figure 5: Fixed-point arithmetic mapping to a signed int. The MSB is still the sign bit.

Fixed-point arithmetic is identical to integer arithmetic for addition, subtraction, and absolute value. For multiplication, however, to maintain the correct position of the binary point, we must cast both inputs to a wider signed type, multiply, shift the product, and cast it back to fixed-point, which allows us to capture all the bit data we need. Similarly, for division, we must cast the numerator to the wider type, and division is still rather slow: it takes roughly 8 additional cycles to address the RP2040's division accelerator and generate a result. In contrast, multiplication does not need an accelerator and requires no additional cycles over standard ALU functions.

Conversion from floating point to fixed point can be achieved by multiplying the float by 2^15, or 32,768, and typecasting; dividing by 2^15 as a float performs the inverse conversion from fixed point to floating point. Integers can be converted with an arithmetic shift by 15 bits: left to convert into fixed point, and right to convert out of it. For all the aforementioned arithmetic operations and type conversions, we wrote macros we could call upon, since C lacks the operator overloading that C++ provides.
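These operations can be collected into macros along the following lines (a sketch of the s16.15 convention described above, not our exact project code; we use `long long` for the wide intermediate type so the sketch is portable):

```c
#include <assert.h>

/* s16.15 fixed point: 1 sign bit, 16 integer bits, 15 fraction bits. */
typedef signed int fix15;

/* Multiply in a 64-bit intermediate, then shift right to restore the
 * position of the binary point. */
#define multfix15(a, b) ((fix15)((((signed long long)(a)) * \
                                  ((signed long long)(b))) >> 15))
/* Widen and pre-shift the numerator so the quotient lands as s16.15. */
#define divfix15(a, b)  ((fix15)((((signed long long)(a)) << 15) / (b)))
#define float2fix15(a)  ((fix15)((a) * 32768.0f))  /* multiply by 2^15 */
#define fix2float15(a)  ((float)(a) / 32768.0f)    /* divide by 2^15   */
#define int2fix15(a)    ((fix15)((a) << 15))       /* arithmetic shift */
#define fix2int15(a)    ((int)((a) >> 15))
```

Addition, subtraction, and comparison need no macros at all, since they behave exactly like their integer counterparts.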

Our design is multi-core, and we utilized the Protothreads library to enable basic threading on our embedded system without initializing an entire OS to adjudicate the division of work. Protothreads are unique because they are stackless and non-preemptive. One core handles most of the note detection, and the other polls for the signal to output our MIDI file once we determine that the song being detected has ended. This is a rudimentary use of threading and is suboptimal, since the second core is underutilized; however, due to end-of-semester deadlines, we did not have the chance to optimize this part of the project.

Note Classification: From the output of the FFT, we implemented note classification, which maps FFT bins to actual musical notes. For a sound to be classified, the value in its respective bin must reach a certain threshold. We chose this threshold experimentally, selecting a value that ensured a low false-positive rate. Additionally, the bin must be a peak, meaning it must have a larger magnitude than its neighbors. This requirement prevents us from mistakenly classifying the several adjacent bins excited by a single tone as several distinct notes.

Once those requirements are met, we can classify the detected bins as notes. On start-up, we populate a look-up table of notes (A0 through C8) and their respective frequencies (27.5Hz through 4.19kHz). Populating the table once at start-up means we do not have to constantly repeat calculations that would waste crucial CPU time. To classify a note, we iteratively compare the bin's corresponding frequency against each frequency in the look-up table, retaining the index of the note with the smallest distance from the bin under examination. That index is then used to index into the look-up table to return the actual note played and the octave in which it was played.
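A sketch of this look-up-table approach (a hypothetical reimplementation, not our actual project code): each of the 88 entries from A0 to C8 sits one equal-tempered semitone, a factor of the twelfth root of 2, above the previous one, and classification is a linear nearest-neighbor scan:

```c
#include <assert.h>

#define NUM_NOTES 88  /* A0 (27.5Hz) through C8 (~4186Hz) */

static float note_freq[NUM_NOTES];

/* Populate the table once at start-up. Each semitone is a factor of the
 * twelfth root of 2 (~1.0594631) above the previous note. */
void init_note_table(void) {
    float f = 27.5f;
    for (int n = 0; n < NUM_NOTES; n++) {
        note_freq[n] = f;
        f *= 1.0594631f;
    }
}

/* Linear scan for the table entry closest to the measured frequency.
 * Returns an index: 0 = A0, 48 = A4 (440Hz), 87 = C8. */
int classify_note(float freq) {
    int best = 0;
    float best_dist = freq - note_freq[0];
    if (best_dist < 0) best_dist = -best_dist;
    for (int n = 1; n < NUM_NOTES; n++) {
        float d = freq - note_freq[n];
        if (d < 0) d = -d;
        if (d < best_dist) { best_dist = d; best = n; }
    }
    return best;
}
```

The returned index maps directly to a note name and octave, and (offset by 21) to a MIDI note number.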

Additionally, due to the poor quality of our microphone, we implemented a debouncer, which converts a detected note to MIDI if and only if the note has been heard on or off for more than a certain number of code cycles. We define a code cycle as one iteration through our main logic loop. To accomplish this on the microcontroller, we implemented a data structure that holds, for each note, whether it has just been turned on or off and how many sequential cycles it has been held in that state.
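The debouncer's per-note state can be sketched as follows (a simplified illustration; the struct fields and the 5-cycle threshold are placeholders, since the real threshold was tuned experimentally):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_NOTES 88
#define DEBOUNCE_CYCLES 5  /* placeholder; tuned experimentally in practice */

typedef struct {
    bool stable;  /* debounced on/off state fed to the MIDI generator */
    int  run;     /* consecutive cycles the raw reading has disagreed */
} note_state_t;

static note_state_t notes[NUM_NOTES];

/* Call once per code cycle with the classifier's verdict for one note.
 * Returns true when the debounced state flips, i.e., exactly when a
 * MIDI note-on or note-off event should be emitted. */
bool debounce_note(int idx, bool detected) {
    note_state_t *n = &notes[idx];
    if (detected != n->stable) {
        if (++n->run >= DEBOUNCE_CYCLES) {
            n->stable = detected;
            n->run = 0;
            return true;
        }
    } else {
        n->run = 0;  /* agreement resets the disagreement streak */
    }
    return false;
}
```

A brief dropout or spurious spike shorter than the threshold therefore never reaches the MIDI stream.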

Implementation of MIDI: Our next challenge was the implementation of MIDI. MIDI is a standardized protocol that facilitates the playback and editing of music across computers and various instruments. MIDI enumerates a series of events, which can range from copyright information to changes in time signature; the only ones we care about for this project are note events, which play notes in the MIDI protocol. MIDI note events work by activating and deactivating a note. In general, MIDI events are timed by the interval between consecutive events, called the "delta time," rather than by the absolute time elapsed in a song. Thus, to turn a note into MIDI, we must be able to signal when a note is turned on or off, and keep track of the time between events.
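In the Standard MIDI File format, each delta time is stored as a variable-length quantity: 7 payload bits per byte, with the high bit set on every byte except the last. A minimal sketch of that encoding plus note-event assembly (the function names are our own illustrative choices, not the project's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Encode a MIDI variable-length quantity: 7 payload bits per byte, high
 * bit set on every byte except the last. Returns bytes written to out. */
int write_vlq(uint32_t value, uint8_t *out) {
    uint8_t buf[5];
    int i = 0;
    buf[i++] = value & 0x7F;
    while (value >>= 7)
        buf[i++] = 0x80 | (value & 0x7F);
    for (int j = 0; j < i; j++)  /* bytes were produced low-first; flip */
        out[j] = buf[i - 1 - j];
    return i;
}

/* Append one note event (delta time, then status/key/velocity) on
 * channel 0. key is the MIDI note number (A0 = 21, A4 = 69). */
int write_note_event(uint8_t *out, uint32_t delta, int note_on,
                     uint8_t key, uint8_t velocity) {
    int n = write_vlq(delta, out);
    out[n++] = note_on ? 0x90 : 0x80;  /* note-on vs. note-off status */
    out[n++] = key;
    out[n++] = velocity;
    return n;
}
```

Delta times below 128 ticks fit in a single byte, which keeps the event stream compact for closely spaced notes.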

To determine the delta time, we implemented an interrupt service routine (ISR) that increments a global variable representing the current time, so that the time between events can be quantized to follow the MIDI protocol accurately. After plenty of testing, we found that a timer interrupt running at 40kHz was optimal for our purposes. This ISR presented an interesting challenge: running it significantly increased our processing time. To mitigate this, we dedicated the RP2040's second core to handling the ISR, leaving the first core devoted solely to music detection and classification.

The last requirement for MIDI was to generate the header, track, and end-of-track chunks of the MIDI file, which contain information about the song as a whole and the track that will be played. The header chunk essentially lets a parser know that the binary is a MIDI file and what to expect in terms of the tracks within. A track chunk enumerates timing information for the events within the track and how many events the track contains; and an end-of-track event denotes the end of a track, so that multiple tracks can be contained within one MIDI file.

To generate a track chunk, we must define the size of the track in bytes, meaning we must know how many MIDI events will occur before the song begins. Thus, we are not able to emit a complete, valid MIDI file in real time, due to this length requirement. Instead, throughout note detection and classification, whenever a note met all the specified requirements to be counted as heard, we generated the appropriate MIDI message in real time and stored it in an array of chars. At the end, the size of this char array told us the size of the track, and we updated the track chunk accordingly. In addition, upon completion of the song, we append an end-of-track event, so that a MIDI parser can find the end of the track and we follow the MIDI protocol accurately. After all of this is finished, the MIDI file is complete for our purposes.
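To make the chunk layout concrete, here is a minimal sketch that wraps an already-recorded event buffer in a format-0 MIDI file (the function name and buffer handling are illustrative; per the MIDI file spec, the header chunk is tagged "MThd" and the track chunk "MTrk", whose 4-byte length can only be filled in once the event stream is complete):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Wrap an already-recorded event buffer in a minimal format-0 MIDI file.
 * division is the number of ticks per quarter note. Returns total bytes
 * written to out. */
int build_midi_file(uint8_t *out, const uint8_t *events, uint32_t ev_len,
                    uint16_t division) {
    static const uint8_t end_track[4] = {0x00, 0xFF, 0x2F, 0x00};
    uint32_t track_len = ev_len + sizeof end_track;
    int n = 0;
    /* Header chunk: "MThd", length 6, format 0, one track, division. */
    memcpy(out + n, "MThd", 4); n += 4;
    uint8_t hdr[10] = {0, 0, 0, 6,  0, 0,  0, 1,
                       (uint8_t)(division >> 8), (uint8_t)division};
    memcpy(out + n, hdr, 10); n += 10;
    /* Track chunk: its byte length is only known once the song is over,
     * which is why the file cannot be finalized in real time. */
    memcpy(out + n, "MTrk", 4); n += 4;
    out[n++] = (uint8_t)(track_len >> 24);
    out[n++] = (uint8_t)(track_len >> 16);
    out[n++] = (uint8_t)(track_len >> 8);
    out[n++] = (uint8_t)track_len;
    memcpy(out + n, events, ev_len); n += (int)ev_len;
    /* End-of-track meta event terminates the track. */
    memcpy(out + n, end_track, sizeof end_track); n += 4;
    return n;
}
```

All multi-byte fields in a MIDI file are big-endian, which is why the length and division values are emitted byte by byte.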

Conversion to .MID Files: Once the conversion to MIDI is complete on the microcontroller, it can be sent to a personal computer, for conversion into a usable .MID file, which is the standard file type for MIDI files. This is done by sending the MIDI information stored in hexadecimal on the microcontroller through serial UART to the computer. Using a Python library called “pyserial,” one can easily receive information from the serial monitor by directing it to read from a communication (COM) port. The Python file for our application performs a simple conversion from hexadecimal to binary, and creates and stores the generated MIDI file in the local directory. This gives us the final product of a parsable .MID file that can be uploaded to any MIDI-compatible device or application.

VGA Driver: The last software component of our system is the VGA driver for the RP2040's programmable input/output (PIO), which was provided to us by our course director V. Hunter Adams. This driver feeds three PIO state machines on the RP2040 with the data needed to display images on a VGA screen. The three state machines are responsible for: 1) driving the VGA's HSYNC signal, which tells the screen to start writing the next line of pixels; 2) the VSYNC signal, which tells the screen to start writing a new frame; and 3) the analog red, green, and blue pins, which output the color of each pixel. This driver allowed us to output to the VGA screen for ease of debugging and development.


Overall, our project worked successfully: it accurately transcribed pure-tone instruments such as a recorder or flute and generated valid MIDI files. Our embedded system performs note detection and classification, MIDI conversion, and VGA display for debugging purposes. The design was intentionally kept simple for ease of user experience and construction, with the RP2040 microcontroller defaulting to a mode in which it listens for audio input until the user presses a button to end processing and send the MIDI stream.

Our system also highlights successful hardware and software co-design. The hardware comprises the RP2040 microcontroller on the Raspberry Pi Pico board, the Electret Microphone Amplifier, and the VGA screen, while the software ties these components together: it utilizes the Protothreads library to enable basic threading on the RP2040, and implements note detection and classification by thresholding the power spectrum and picking its peaks.

Our system also includes a VGA driver for the RP2040’s PIO module, which allows for real-time display of the FFT and note detection on a monitor, making it easier for users to visualize the system’s operation and debug any issues. The VGA display proved to be a critically helpful tool in understanding how the system was working throughout the 4 weeks of lab time.

Finally, our system successfully generates a MIDI file for the detected notes through a simple conversion process and sends the final file to a computer via UART.

References, sources, a parts list, and a link to a demonstration video of our project are given on the Circuit Cellar Article Materials and Resources webpage [1].


The original inspiration for the project came specifically from a TensorFlow article by ARM, demonstrating real-time detection of fire alarms. As musicians, we had a strong interest in trying to extend this idea to a musical concept that we (and others) would find personally useful. We chose pitch detection and MIDI conversion, because we figured that this task might be well suited to a convolutional neural network that could theoretically learn to extract features, allowing it to differentiate among fundamentals, harmonics, and noise. This would enable more accurate polyphonic detection (detecting multiple pitches simultaneously) than if we were to only use our short-time Fourier transform (STFT).

From this example, we were able to understand and experiment with the main steps required for deploying a model on the Pico: first, building and training the model with TensorFlow; and second (more importantly), converting the model to a format deployable on the Pico. Specifically, to run on the RP2040, the model needs to be converted into the TensorFlow Lite Micro (TFLμ) format and then compiled and stored with the rest of the program in flash, meaning it must be converted to a C/C++ header or source file. Converting a saved TFLμ model into a C header can be done trivially with a hexdump tool (for example, the xxd command), ensuring the resulting output is stored in an unsigned character array. Before the saved model file can be converted, however, the model itself must first be quantized.

While TensorFlow allows for varying degrees of quantization, for our purpose of deploying a model to run in real time on a microcontroller without a dedicated floating-point unit, all floating-point math in the model needs to be quantized to integer operations. Specifically, to be compatible with the optimized Arm CMSIS-NN kernels used by the Pico TFLμ port, all variables must be quantized to 8-bit integers.

The final step is writing a program to actually invoke the model on the Pico. In our case, to feed the model's output to the MIDI generator, we would ideally only have to re-use the peak-picking code from our STFT pipeline. From here, the next goal was to develop the actual model we would use. We believed that converting and/or building off a pre-existing model with proven performance would be more interesting, let us achieve a higher-quality end result, and better fit our time constraints. To this end, we explored a few pre-existing open-source models (such as Spotify Research's "basic pitch"), but ultimately ran out of time.

Other potential improvements to pitch detection include a cepstrum approach or a method for ignoring the harmonics present in the FFT. Although our system detected pure tones with good accuracy, the lack of such techniques often led to malformed or mistimed MIDI files. Extracting fundamentals from more complex signals was challenging, making it impractical to transcribe recordings of instruments such as a piano.

Additionally, our initially limited understanding of the MIDI file structure made it difficult to follow the protocol accurately, leading to timing issues in the generated files. With a better understanding of the header fields and a delta-time resolution better matched to our application, these issues could potentially have been mitigated, resulting in better-behaved MIDI files.

[1] Demonstration video of project:

Sources:
Adafruit | Raspberry Pi

V. Hunter Adams, ECE 4760/5730 Webpage
V. Hunter Adams, Fixed-Point FFT Webpage
V. Hunter Adams, VGA PIO Webpage
Raspberry Pi, RP2040 Datasheet
Raspberry Pi, Pi Pico Datasheet
Bruce Land, Protothread Description
MIDI Association, MIDI Specification
David Back, Standard MIDI-File Format

ML Research
Pico TensorFlow Lite Port
Updated Pico TensorFlow Lite Port used in project
"End-to-end tinyML audio classification with the Raspberry Pi RP2040"
"A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation"
"Automatic Music Transcription: An Overview"

Code and Supporting Files



Chris Schiff, Jacob Lashin, and Romano Tio studied Electrical and Computer Engineering at Cornell University. They are continuing their education in Cornell’s ECE Master of Engineering program. Their technical interests include medical devices and machine learning, robotics and embedded systems, and computer hardware and architecture, respectively.

Copyright © KCK Media Corp.
All Rights Reserved
