Because a single conversation can be interpreted in very different ways, some researchers are pursuing a tech-enabled solution that can read the tone of our words and body language.
Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute for Medical Engineering and Science (IMES) say they've gotten closer to a potential solution: an artificially intelligent, wearable system that can predict whether a conversation is happy, sad, or neutral based on a person's speech patterns and vitals. This sort of technology could be especially useful for people with anxiety or conditions such as Asperger's.
"Imagine if, at the end of a conversation, you could rewind it and see the moments when the people around you felt the most anxious," Tuka Alhanai, a graduate student who co-authored a related paper with PhD candidate Mohammad Ghassemi that they will present at next week's Association for the Advancement of Artificial Intelligence (AAAI) conference in San Francisco, said in a statement. "Our work is a step in this direction, suggesting that we may not be that far away from a world where people can have an AI social coach right in their pocket."
As a participant tells a story, the system can analyze audio, text transcriptions, and physiological signals to determine the overall tone of the story with 83 percent accuracy. Using deep-learning techniques, the system can also provide a "sentiment score" for specific five-second intervals within a conversation.
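The idea of scoring a conversation in five-second chunks can be sketched as follows. This is a minimal illustration, not the MIT team's model: the window length comes from the article, but the feature stream, the sampling rate, and the toy linear scorer standing in for the deep-learning model are all assumptions.

```python
import numpy as np

WINDOW_SECONDS = 5       # interval length described in the article
SAMPLE_RATE_HZ = 20      # assumed: 20 feature vectors per second from the wearable

def window_features(features: np.ndarray) -> list:
    """Split a (time, n_features) stream into five-second windows."""
    step = WINDOW_SECONDS * SAMPLE_RATE_HZ
    return [features[i:i + step] for i in range(0, len(features), step)]

def sentiment_score(window: np.ndarray, weights: np.ndarray) -> float:
    """Toy linear scorer standing in for the real deep-learning model:
    scores near +1 lean 'happy', near -1 lean 'sad'."""
    return float(np.tanh(window.mean(axis=0) @ weights))

# Usage with random stand-in data: a 60-second story with 4 features.
rng = np.random.default_rng(0)
stream = rng.normal(size=(60 * SAMPLE_RATE_HZ, 4))
weights = np.array([0.5, -0.2, 0.1, 0.3])
scores = [sentiment_score(w, weights) for w in window_features(stream)]
print(len(scores))  # 12 five-second windows in a 60-second story
```

A real system would replace the linear scorer with a trained network, but the windowing step, one score per five-second interval, is the shape the article describes.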
"As far as we know, this is the first experiment that collects both physical data and speech data in a passive but robust way, even while subjects are having natural, unstructured interactions," Ghassemi added. "Our results show that it's possible to classify the emotional tone of conversations in real-time."
The researchers say that the system's performance would be further improved by having multiple people in a conversation use it on their smartwatches, creating more data to be analyzed by their algorithms. This, however, doesn't mean the software will record every word users say — the algorithm runs locally on a user's device as a way of protecting personal information.
The prototype device is based on a Samsung Simband, which can capture high-resolution physiological waveforms to measure features such as movement, heart rate, blood pressure, blood flow, and skin temperature. The system also captures audio data and text transcripts to analyze the speaker's tone, pitch, energy, and vocabulary.
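Raw waveforms like these are typically reduced to summary statistics before classification. The sketch below shows one plausible reduction; the signal names and the fidgeting proxy are assumptions for illustration, not the system's actual feature set.

```python
import numpy as np

def summarize_vitals(heart_rate: np.ndarray, accel: np.ndarray) -> dict:
    """Reduce raw wearable signals to per-interval summary statistics.
    accel is a (time, 3) array of accelerometer readings."""
    return {
        "hr_mean": float(heart_rate.mean()),
        "hr_var": float(heart_rate.var()),
        # Fidgeting proxy: variability of the acceleration magnitude
        "movement": float(np.linalg.norm(accel, axis=1).std()),
    }

# Usage with made-up readings over one short interval.
hr = np.array([72.0, 75.0, 74.0, 78.0])
acc = np.array([[0.0, 0.1, 9.8],
                [0.2, 0.0, 9.7],
                [0.1, 0.3, 9.9],
                [0.0, 0.1, 9.8]])
feats = summarize_vitals(hr, acc)
print(feats["hr_mean"])  # 74.75
```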
And the initial results are promising, with the algorithm's findings aligning well with what we humans might expect to observe. For instance, long pauses and monotonous vocal tones were associated with sadder stories, while more energetic, varied speech patterns were associated with happier ones. In terms of body language, sadder stories were also strongly associated with increased fidgeting and cardiovascular activity, as well as certain postures like putting one's hands on one's face.
Overall, the model was able to classify the mood of each five-second interval with an accuracy that was approximately 18 percent above chance, and a full 7.5 percent better than existing approaches.
There's still work ahead before the algorithm is reliable enough to be deployed for social coaching. For future work, the team plans to collect data on a much larger scale, potentially using commercial devices such as the Apple Watch, which would allow them to deploy the system in the real world more easily.
"Our next step is to improve the algorithm's emotional granularity so that it is more accurate at calling out boring, tense, and excited moments, rather than just labeling interactions as 'positive' or 'negative,'" Alhanai added. "Developing technology that can take the pulse of human emotions has the potential to dramatically improve how we communicate with each other."
This research was made possible in part by the Samsung Strategy and Innovation Center.