Introduction
As the final project for my Computer Vision class (CSCI1430), my group and I developed MyPitch. The basis of this project is making music with gesture detection. We were inspired by the hand movements DJs make when they create music and decided to build a computer vision take on that. The left hand changes the volume: if the hand is closed, the volume drops to zero; if the hand is open, the volume increases as the hand moves up and decreases as it moves down. The pitch rises as the other hand moves up and falls as it moves down.
In this implementation we use hand detection and gesture detection to obtain the hands' coordinates and to determine whether each hand is an open palm or a closed fist. Our implementation consists of three main parts: hand location detection, hand posture detection, and audio integration.
Hand Location Detection
We use the hands' y-coordinates to change the audio (left hand for volume and right hand for pitch). To determine hand location in the frame, our first approach was to use a feature extraction algorithm. We tried finding the contours of hands using OpenCV functions and building a color histogram based on the initial background and the skin color of the moving hands, as sketched below:
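The original snippet is not reproduced here; the following is a minimal sketch of that kind of approach, using OpenCV histogram back-projection and contour finding. The skin-sample calibration input, histogram bins, threshold, and area cutoff are illustrative assumptions rather than our exact values.

```python
import cv2
import numpy as np

def find_hand_contours(frame, skin_sample):
    """Sketch of the skin-histogram + contour approach (assumed parameters).

    `skin_sample` is a small BGR patch of the user's skin captured at startup
    (a hypothetical calibration step).
    """
    # Build a hue/saturation histogram of the skin sample
    hsv_sample = cv2.cvtColor(skin_sample, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_sample], [0, 1], None, [32, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    # Back-project the histogram onto the current frame to get a skin-likelihood map
    hsv_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv_frame], [0, 1], hist, [0, 180, 0, 256], scale=1)

    # Smooth and threshold the map into a binary mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    backproj = cv2.filter2D(backproj, -1, kernel)
    _, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)

    # Large contours in the mask are hand candidates
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) > 1000]
```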
However, this method was not as effective as we expected: it was susceptible to subtle changes in the background (lighting or moving objects), to objects of similar color (faces or other exposed skin), and to the fast motion of the hands. We therefore settled on a neural-network-based hand detector.
Instead of segmenting the exact hands out of the frame, it was enough for our needs to determine whether there were hands in the frame and to obtain bounding boxes around them. We just needed a detector trained on a large dataset of hands that could produce results on every frame in real time. After some research, we decided that TensorFlow's Object Detection API would be a good fit. We referred to an article in which the author trained a model from this API on hand images from the EgoHands dataset. After testing the accuracy of this detector, we decided to use the frozen graphs provided as open resources. With the frozen graphs, we can detect whether there are hands in each frame and, if so, where they are, using the centers of the bounding boxes as the y-coordinates.
To continue with posture detection, we extract the image inside each bounding box and pass it on to our posture detector.
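As an illustration, a frozen detection graph of this kind can be run on each frame roughly as below. This follows the TensorFlow 1.x frozen-graph workflow and the standard tensor names exported by the Object Detection API ('image_tensor', 'detection_boxes', 'detection_scores'); the graph path and score threshold are assumptions, not our exact values.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API, matching the frozen-graph workflow

PATH_TO_FROZEN_GRAPH = 'frozen_inference_graph.pb'  # hypothetical path

# Load the frozen detection graph once at startup
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

sess = tf.Session(graph=detection_graph)

def detect_hands(frame_rgb, score_threshold=0.5):
    """Return (y_center, crop) pairs for each detected hand in an RGB frame."""
    h, w, _ = frame_rgb.shape
    boxes, scores = sess.run(
        ['detection_boxes:0', 'detection_scores:0'],
        feed_dict={'image_tensor:0': np.expand_dims(frame_rgb, axis=0)})

    results = []
    for box, score in zip(boxes[0], scores[0]):
        if score < score_threshold:
            continue
        # Boxes come back normalized as [ymin, xmin, ymax, xmax]
        ymin, xmin, ymax, xmax = box
        top, left = int(ymin * h), int(xmin * w)
        bottom, right = int(ymax * h), int(xmax * w)
        y_center = (top + bottom) / 2.0           # drives the audio changes
        crop = frame_rgb[top:bottom, left:right]  # passed on to the posture classifier
        results.append((y_center, crop))
    return results
```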
Gesture Detection
Users can pause and resume the audio by closing and opening their left hand. To detect this, we decided to train a new model to distinguish an open palm from a closed fist using TensorFlow. Since training a model from scratch would require a lot of data and resources, we used a technique called transfer learning, which takes a pre-trained model and trains a new classification layer on top of it. We opted for the Inception V3 architecture, which is pre-trained on ImageNet.
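A comparable transfer-learning setup can be sketched with tf.keras as below. This illustrates the general idea rather than our exact training code; the directory layout, image size, and hyperparameters are assumptions.

```python
import tensorflow as tf

# Base model pre-trained on ImageNet, without its original classification head
base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet',
                                         input_shape=(299, 299, 3), pooling='avg')
base.trainable = False  # freeze the base: only the new head is trained

# New classification layer on top: open palm vs. closed fist
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Hypothetical directory layout: dataset/{fist,palm}/*.jpg
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    'dataset', image_size=(299, 299), batch_size=32)
train_ds = train_ds.map(
    lambda x, y: (tf.keras.applications.inception_v3.preprocess_input(x), y))

model.fit(train_ds, epochs=10)
```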
To train a new classification layer on top of the model, we built our own dataset. We took pictures of both the left and the right hand against different backgrounds and assembled a dataset containing around 300 images of closed fists and around 300 images of open palms. In our first few attempts, the models we trained did not perform well: testing on images that our hand detection algorithm produced from the webcam, we only got around 50% accuracy.
After more testing, we discovered that our model performed poorly on images that were not square, images that did not have clear backgrounds, and images where the hand was not centered. We found that this was because the dataset we had created was too simple and too homogeneous. We therefore created a new dataset with more variety, more complicated backgrounds, and different sizes, as shown below:

[Image: previous dataset]

[Image: updated dataset]

With the new dataset, we successfully trained a model with a classification accuracy of 99%, and the probability the model assigns to the correct class is usually above 95%.
Audio Changes
The default behavior in MyPitch is to play a constant "A" note, so that the user can really notice the changes in pitch and volume. However, the user can also supply any sound they want, as long as it is in *.wav format.
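As an illustration of that default, a constant 440 Hz "A" tone can be generated with NumPy and written out with the standard-library wave module; the sample rate and duration below are assumptions rather than our exact settings.

```python
import wave
import numpy as np

SAMPLE_RATE = 44100   # samples per second (assumed)
FREQUENCY = 440.0     # concert A
DURATION = 2.0        # seconds of audio to generate (assumed)

# Synthesize a sine wave and scale it to 16-bit integer range
t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * FREQUENCY * t) * 32767).astype(np.int16)

# Write it out as a mono 16-bit .wav file
with wave.open('a_note.wav', 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(tone.tobytes())
```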
To make the changes in audio feel real-time, we use NumPy to loop through the wave data in chunks and play each chunk at the end of the loop. We also use multiprocessing: the main loop runs in the main process, and a second process runs the audio loop. When the y-locations change, the main process communicates the changes to the audio process, which adjusts the volume and pitch values accordingly.
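A simplified sketch of that structure is below, assuming 16-bit PCM input, a multiprocessing Queue for communication, and PyAudio for playback; the chunk size is an assumption, and the actual pitch and volume transforms (described next) are stubbed out.

```python
import multiprocessing as mp
import wave
import numpy as np
import pyaudio

CHUNK = 4096  # frames per chunk (assumed)

def audio_loop(wav_path, control_queue):
    """Runs in its own process: plays the file chunk by chunk, applying updates."""
    wf = wave.open(wav_path, 'rb')
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pa.get_format_from_width(wf.getsampwidth()),
                     channels=wf.getnchannels(), rate=wf.getframerate(), output=True)

    volume, shift = 1.0, 0
    data = wf.readframes(CHUNK)
    while data:
        # Pick up the latest volume/pitch values sent by the vision loop, if any
        while not control_queue.empty():
            volume, shift = control_queue.get()

        samples = np.frombuffer(data, dtype=np.int16)  # assumes 16-bit samples
        # ... apply the pitch shift and volume scaling here (see the next section) ...
        stream.write(samples.astype(np.int16).tobytes())
        data = wf.readframes(CHUNK)

    stream.stop_stream()
    stream.close()
    pa.terminate()

if __name__ == '__main__':
    ctrl_q = mp.Queue()
    proc = mp.Process(target=audio_loop, args=('a_note.wav', ctrl_q), daemon=True)
    proc.start()
    # The main vision loop would run here, calling ctrl_q.put((volume, shift))
    # whenever the detected hand positions change.
```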
• Pitch Shifting: To change the pitch, we apply a Discrete Fourier Transform and shift the wave's frequency content. First, we choose the shifting range we want and convert the y-coordinate into a shift value n. We then check whether n is positive or negative and shift the spectrum accordingly, as sketched after this list. The function rfft computes the one-dimensional discrete Fourier Transform for real input, and irfft computes the inverse of the n-point DFT for real input.
• Volume Changes: To change the volume, we tried several approaches, such as changing the system volume and the wave amplitude. However, we could not guarantee that these would work on every computer system and with every sound input, so we decided to use audioop, a Python module that works alongside PyAudio to manipulate raw audio data. Another advantage of audioop is that samples are automatically truncated in case of overflow. To "pause" the sound when the user closes their hand, we simply set the volume value to 0 (see the second sketch after this list).
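A minimal sketch of the pitch-shifting step, using NumPy's rfft, roll, and irfft; the mapping from the y-coordinate to the bin shift n is assumed to happen elsewhere.

```python
import numpy as np

def shift_pitch(samples, n):
    """Shift the pitch of a chunk of 16-bit samples by rolling its spectrum.

    `n` is the (positive or negative) number of frequency bins to shift,
    derived from the hand's y-coordinate (mapping assumed elsewhere).
    """
    spectrum = np.fft.rfft(samples)        # DFT of the real-valued input
    spectrum = np.roll(spectrum, n)        # move energy up (n > 0) or down (n < 0)
    if n > 0:
        spectrum[:n] = 0                   # zero the bins that wrapped around
    elif n < 0:
        spectrum[n:] = 0
    shifted = np.fft.irfft(spectrum, len(samples))  # back to the time domain
    return shifted.astype(np.int16)
```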
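And a minimal sketch of the volume step, assuming 16-bit samples: audioop.mul scales every sample by a factor and truncates on overflow, and a factor of 0 mutes the chunk, which is how the "pause" behaves.

```python
import audioop

def scale_volume(chunk_bytes, volume):
    """Scale a chunk of raw 16-bit audio by `volume` (0.0 mutes it).

    audioop.mul multiplies every sample by the factor and truncates samples
    that would overflow, so no extra clipping logic is needed.
    """
    return audioop.mul(chunk_bytes, 2, volume)  # 2 = bytes per sample (16-bit)
```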
Conclusion
This was an interesting and satisfying exploration of how computer vision allows us to develop new interfaces through which people can interact with software. Using your hands to control audio feels much more natural than simply pressing buttons on a keyboard and makes, in our opinion, for an immersive experience. In the future, we hope to extend the gestures to cover more audio functions, such as transitioning between songs and syncing beats with one another, to truly complete the button-free DJing experience.
Thanks for reading!