Roberto developed a glove that enables communication between the user and those around him. While the design is intended for use by people communicating in American Sign Language, you can apply what you learn in this article to a variety of communications applications.
While studying at Cornell University in 2014, my lab partner Monica Lin and I designed and built a glove to be worn on the right hand that uses a machine learning (ML) algorithm to translate sign language into spoken English (see Photo 1). Our goal was to create a way for the speech impaired to be able to communicate with the general public more easily. Since every person’s hand is a unique size and shape, we aimed to create a device that could provide reliable translations regardless of those differences. Our device relies on a variety of sensors, such as flex sensors, a gyroscope, an accelerometer, and touch sensors to quantify the state of the user’s hand.
These sensors allow us to capture the flex on each of the fingers, the hand’s orientation, rotation, and points of contact. By collecting a moderate amount of this data for each sign and feeding it into a ML algorithm, we are able to learn the association between sensor readings and their corresponding signs. We make use of a microcontroller to read, filter and send the data from the glove to a PC. Initially, some data is gathered from the users and the information is used to train a classifier that learns to differentiate between signs. Once the training is done, the user is able to put on the glove and make gestures which the computer then turns into audible output.
We use the microcontroller’s analog-to-digital converter (ADC) to read the voltage drop across each of the flex sensors. We then move on to reading the linear acceleration and rotation values from the accelerometer and gyro sensor using I2C. And finally, we get binary readings from each of the touch sensors regarding if there exists contact or not. We perform as many readings as possible within a given window of time and use all of this data to do some smoothing. This information is then sent through serial to the PC where it is gathered and processed.
Python must listen to information coming in from the microprocessor and either store data or predict based on already learned information. Our code includes scripts for gathering data, loading stored data, classifying the data that is being streamed live, and some additional scripts to help with visualization of sensor readings and so on.
MCU & SENSORS
The design comprises an Atmel ATmega1284P microcontroller and a glove onto which the various sensors and necessary wires were sewn. Each finger has one Spectra Symbol flex sensor stitched on the backside of the glove. The accelerometer and gyro sensors are attached to the center of the back of the glove. The two contact sensors were made out of copper tape and wire that was affixed to four key locations.
Since each flex sensor has a resistance that varies depending on how much the finger is bent, we attached each flex sensor as part of a voltage divider circuit in order to obtain a corresponding voltage that can then be input into the microcontroller.
We determined a good value for R1 by analyzing expected values from the flex sensor. Each one has a flat resistance of 10 kΩ and a maximum expected resistance (obtained by measuring its resistance on a clenched fist) of about 27 kΩ. In order to obtain the maximum range of possible output voltages from the divider circuit given an input voltage of 5 V, we plotted the expected ranges using the above equation and values of R1 in the range of 10 to 22 kΩ. We found that the differences between the ranges were negligible and opted to use 10 kΩ for R1 (see Figure 1).
Our resulting voltage divider has an output range of about 1 V. We were initially concerned that the resulting values from the microcontroller’s ADC converter would be too close together for the learning algorithm to discern between different values sufficiently. We planned to address this by increasing the input voltage to the voltage divider if necessary, but we found that the range of voltages described earlier was sufficient and performed extremely well.
The InvenSense MPU-6050 accelerometer and gyro sensor packet operates on a lower VCC (3.3 V) compared to the microcontroller’s 5 V. So as not to burn out the chip, we created a voltage regulator using an NPN transistor and a trimpot, connected as shown. The trimpot was adjusted so that the output of the regulator reads 3.3 V. This voltage also serves as the source for the pull-up resistors on the SDA and SCL wires to the microcontroller. Since the I2C devices are capable only of driving the input voltages low, we connect them to VCC via two 4.7-kΩ pull-up resistors (see Figure 2).
As described later, we found that we needed to add contact sensors to several key spots on the glove (see Figure 3). These would essentially function as switches that would pull the microcontroller input pins to ground to signal contact (be sure to set up the microcontroller pins to use the internal pull up resistors).
Interfacing with the MPU-6050 required I2C communication, for which we chose to use Peter Fleury’s public I2C library for AVR microcontrollers. I2C is designed to support multiple devices using a single dedicated data (SDA) bus and a single clock (SCL) bus. Even though we were only using the interface for the microcontroller to regularly poll the MPU-6050, we had to adhere to the I2C protocol.
Fleury’s library provided us with macros for issuing start and stop conditions from the microcontroller (which represent different signals that the microcontroller is requesting data from the MPU-6050 or is releasing control of the bus). These provided macros allowed for us to easily initialize the I2C interface, set up the MPU-6050, and request and receive the accelerometer and gyroscope data (described later).
While testing our I2C communication with the MPU-6050, we found that the microcontroller would on rare occasions hang while waiting for data from the I2C bus. To prevent this from stalling our program, we enabled a watchdog timer that would reset the system every 0.5 seconds, unless our program continued to progress to regular checkpoint intervals, at which time we would reset the watchdog timer to prevent it from unnecessarily resetting the system. We were able to leverage the fact that our microcontroller’s work consists primarily of continuously collecting sensor data and sending packets to a separate PC.
For the majority of the code, we used Dan Henriksson and Anton Cervin’s TinyRealTime kernel. The primary reason for using this kernel is that we wanted to take advantage of the already implemented non-blocking UART library in order to communicate with the PC. While we only had a single thread running, we tried to squeeze in as much computation as possible while the data was being transmitted.
The program first initializes the I2C, the MPU, and the ADC. After it enters an infinite loop it resets the watchdog timer and gets 16 readings from all of the sensors: accelerometers, gyroscopes, flex-sensors, and touch sensors. We then take all of the sensor values and compute filtered values by summing all of the 16 readings from each sensor. Since summation of the IMU sensors can produce overflow, we make sure to shift all of their values by 8 before summing them up. The data is then wrapped up into byte array packet that is organized in the form of a header (0xA1B2C3D4), the data, and a checksum of the data. Each of the sensors is stored into 2 bytes and the checksum is calculated by summing up the unsigned representation of each of the bytes in the data portion of the packet into a 2-byte integer. Once the packet has been created it is sent through the USB cable into the computer and the process repeats.
Communication with the microcontroller was established through the use of Python’s socket and struct libraries. We created a class called SerialWrapper whose main goal is to receive data from the microcontroller. It does so by opening a port and running a separate thread that waits on new data to be available. The data is then scanned for the header and a packet of the right length is removed when available. The checksum is then calculated and verified, and, if valid, the data is unpacked into the appropriate values and fed into a queue for other processes to extract. Since we know the format of the packet, we can use the struct library to extract all of the data from the packet, which is in a byte array format. We then provide the user with two modes of use. One that continuously captures and labels data in order to make a dataset, and another that continuously tries to classify incoming data.
Support Vector Machines (SVM) are a widely used set of ML algorithms that learn to classify by using a kernel. While the kernel can take various forms, the most common kind are the linear SVMs. Simply put, the classification, or sign, for a set of readings is decided by taking the dot product of the readings and the classifier. While this may seem like a simple approach, the results are quite impressive. For more information about SVMs, take a look at scikit-learn’s “Support Vector Machines” (http://scikit-learn.org/stable/modules/svm.html).
PYTHON MACHINE LEARNING
For the purposes of this project we chose to focus primarily on the alphabet, a-z, and we added two more labels, “nothing” and “relaxed”, to the set. Our rationale for providing the classifier “nothing” was in order to have a class that was made up of mostly noise. This class would not only provide negative instances to help learn our other classes, but it also gave the classifier a way of outputting that the gestured sign is not recognized as one of the ones that we care about. In addition, we didn’t want the classifier to be trying to predict any of the letters when the user was simply standing by, thus we taught it what a “relaxed” state was. This state was simply the position that the user put his/her hand when they were not signing anything. In total there were 28 signs or labels.
For our project we made extensive use of Python’s scikit-learn library. Since we were using various kinds of sensors with drastically different ranges of values, it was important to scale all of our data so that the SVM would have an easier time classifying. To do so we made use of the preprocessing tools available from scikit-learn. We chose to take all of our data and scale it so that the mean for each sensor was centered at zero and the readings had unit variance. This approach brought about drastic improvements in our performance and is strongly recommended. The classifier that we ended up using was a SVM that is provided by scikit-learn under the name of SVC.
Another part that was crucial to us as developers was the use of plotting in order to visualize the data and qualify how well a learning algorithm should be able to predict the various signs. The main tool that was developed for this was the plotting of a sequence of sensor readings as an image (see Figure 4). Since each packet contained a value for each of the sensors (13 in total), we could concatenate multiple packets to create a matrix. Each row is thus one of the sensor and we look at a row from left to right we get progressively later sensor readings. In addition, every packet makes up a column. This matrix could then be plotted with instances of the same sign grouped together and the differences between these and the others could then be observed. If the difference is clear to us, then the learning algorithm should have no issue telling them apart. If this is not the case, then it is possible that the algorithm could struggle more and changes to the approach could have been necessary.
The final step to classification is to pass the output of the classifier through a final level of filtering and debouncing before the output reaches the user. To accomplish this, we fill up a buffer with the last 10 predictions and only consider something a valid prediction if it has been predicted for at least nine out of the last 10 predictions. Furthermore, we debounce this output and only notify the user if this is a novel prediction and not just a continuation of the previous. We print this result on the screen and also make use of Peter Parente’s pyttsx text-to-speech x-platform to output the result as audio in the case that it is neither “nothing” or “relaxed.”
Our original glove did not have contact sensors on the index and middle fingers. As a result, it had a hard time predicting “R,” “U,” and “V” properly. These signs are actually quite similar to each other in terms of hand orientation and flex. To mitigate this, we added two contact sensors: one set on the tips of the index and middle fingers to detect “R,” and another pair in between the index and middle fingers to discern between “U” and “V.”
As you might have guessed, the speed of our approach is limited by the rate of communication between the microcontroller and the computer and by the rate at which we are able to poll the ADC on the microprocessor. We determined how quickly we could send data to the PC by sending data serially and increasing the send rate until we noticed a difference between the rate at which data was being received and the rate at which data was being sent. We then reduced the send frequency back to a reasonable value and converted this into a loop interval (about 3 ms).
We then aimed to gather as much data as possible from the sensors in between packet transmission. To accomplish this, we had the microcontroller gather as much data as possible between packets. And in addition to sending a packet, the microcontroller also sent the number of readings that it had performed. We then used this number to come up with a reasonable number of values to poll before aggregating the data and sending it to the PC. We concluded that the microcontroller was capable of reading and averaging each of the sensors 16 times each, which for our purposes would provide enough room to do some averaging.
The Python algorithm is currently limited by the rate at which the microcontroller sends data to the PC and the time that it takes the speech engine to say the word or letter. The rate of transfer is currently about thirty hertz and we wait to fill a buffer with about ten unanimous predictions. This means that the fastest that we could output a prediction would be about three times per second which for our needs was suitable. Of course, one can mess around with the values in order to get faster but slightly less accurate predictions. However, we felt that the glove was responsive enough at three predictions per second.
While we were able to get very accurate predictions, we did see some slight variations in accuracy depending on the size of the person’s hands. The accuracy of each flex-sensor is limited beyond a certain point. Smaller hands will result in a larger degree of bend. As a result, the difference between slightly different signs with a lot of flex tends to be smaller for users with more petite hands. For example, consider the signs for “M” and “S.” The only difference between these signs is that “S” will elicit slightly more flex in the fingers. However, for smaller hands, the change in the resistance from the flex-sensor is small, and the algorithm may be unable to discern the difference between these signs.
In the end, our current classifier was able to achieve an accuracy of 98% (the error being composed almost solely of u, v sign confusion) on a task of 28 signs, the full alphabet as well as “relaxed” and “nothing” (see Figure 5). A random classifier would guess correctly 4% of the time, clearly indicating that our device is quite accurate. It is however worth noting that the algorithm could greatly benefit from improved touch sensors (seeing as the most common mistake is confusing U for V), being trained on a larger population of users, and especially on larger datasets. With a broad enough data set we could provide the new users with a small test script that only covers difficult letters to predict and relies on the already available data for the rest. The software has currently been trained on the two team members and it has been tested on some users outside of the team. The results were excellent for the team members that trained the glove and mostly satisfying though not perfect for the other volunteers. Since the volunteers did not have a chance to train the glove and were not very familiar with the signs, it is hard to say if their accuracy was a result of overfitting, individual variations in signing, or inexperience with American Sign Language. Regardless, the accuracy of the software on users who trained was near perfect and mostly accurate for users that did not know American Sign Language prior to and did not train the glove.
Lastly it is worth noting that the amount of data necessary for training the classifier was actually surprisingly small. With about 60 instances per label the classifier was able to reach the 98% mark. Given that we receive 30 samples per second and that there are 28 signs, this would mean that gathering data for training could be done in under a minute (see Figure 6).
The project met our expectations. Our initial goal was to create a system capable of recognizing and classifying gestures. We were able to do so with more than 98% average accuracy across all 28 classes. While we did not have a solid time requirement for the rate of prediction, the resulting speed made using the glove comfortable and it did not feel sluggish. Looking ahead, it would make sense to improve our approach for the touch sensors since the majority of the ambiguity in signs come from the difference between U and V. We want to use materials that lend themselves more seamlessly to clothing and provide a more reliable connection. In addition, it will be beneficial to test and train our project on a large group of people since this would provide us with richer data and more consistency. Lastly, we hope to make the glove wireless, which would allow it to easily communicate with phones and other devices and make the system truly portable.
Arduino, “MPU-6050 Accelerometer + Gyro,” http://playground.arduino.cc/Main/MPU-6050.
Atmel Corp., “8-Bit AVR Microcontroller with 128K Bytes In-System Programmable Flash: ATmega1284P,” 8059D–AVR–11/09, 2009, www.atmel.com/images/doc8059.pdf.
P. Fleury, “AVR-Software,” 2006, http://homepage.hispeed.ch/peterfleury/avr-software.html.
Lund University, “Tiny Real Time,” 2006, www.control.lth.se/~anton/tinyrealtime/.
P. Parente, “pyttsx – Text-to-speech x-platform,” pyttsx 1.2 documentation, 2015, https://pyttsx.readthedocs.org/.
Python Software Foundation, “socket—Low-Level Networking Interface,” https://docs.python.org/2/library/socket.html.
———, “struct—Interpret Strings as Packed Binary Data,” https://docs.python.org/2/library/struct.html.
scikit-learn, “Preprocessing Data,” http://scikit-learn.org/stable/modules/preprocessing.html.
———, “Support Vector Machines,” http://scikit-learn.org/stable/modules/svm.html.
Spectra Symbol, “Flex Sensor FS,” 2015, www.spectrasymbol.com/wp-content/themes/spectra/images/datasheets/FlexSensor.pdf.
R. Villalba and M. Lin, “Sign Language Glove,” ECE4760, Cornell University, 2014, http://people.ece.cornell.edu/land/courses/ece4760/FinalProjects/f2014/rdv28_mjl256/webpage/.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JUNE 2016 #311 – Get a PDF of the issueSponsor this Article
Roberto Villalba is a Systems Software Engineer at Green Hills Software. He earned BS and MS degrees in Computer Science at Cornell University. Roberto’s is interested primarily in robotics, machine learning, and artificial intelligence.