The National Institute of Mental Health estimates that more than 40 million adults in the US experienced some form of mental illness in 2015; 16 million of them (almost 7% of US adults) experienced at least one major depressive episode. Over their lifetime, almost 30% of US adults will develop an anxiety disorder.
This places a massive load on the healthcare system and demands particularly adept clinicians who can skillfully and quickly diagnose a given mental illness. Even once a diagnosis is made, judging the success of a treatment and gauging how well a patient responds to an intervention can be a subjective exercise. In an effort to assist clinicians with these difficult tasks, researchers are looking into bringing computer vision advances into the clinic. Medgadget had the opportunity to chat with Dr. Louis-Philippe Morency, a Carnegie Mellon professor who is a leader in this field, about his group’s MultiSense technology.
Mohammad Saleh, Medgadget: Tell us about your work on computer vision and human interaction.
Dr. Louis-Philippe Morency: We are developing technologies to automatically sense human non-verbal behaviours such as facial expressions, eye gaze, and head gestures, as well as vocal non-verbal behaviours such as voice quality and tenseness. The reason we’re automatically detecting these behaviours is to help clinicians who work with mental health patients diagnose and treat mental disorders such as depression, anxiety, PTSD, schizophrenia, and autism.
Medgadget: How does it work? Can you touch on the science underlying these algorithms?
Morency: One of the cornerstone algorithms, from a computer-vision perspective, is facial landmark detection. It automatically identifies the position of 68 “landmarks,” or key points, on the face. These were defined over the years as points that are reliable to track over time. Examples are the eyebrows, the contour of the mouth, the eye corners, and the jawline. These landmarks are the foundation for later stages of analysis, because knowing their current shapes really helps us understand and recognize the facial expression. This is coupled with things like head-tilt and eye-gaze estimation.
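For readers curious what 68-point landmark detection looks like in practice, here is a minimal sketch using the open-source dlib library and its pre-trained 68-point shape predictor. This is only an illustration of the general technique, not the MultiSense implementation, and the file paths are assumptions for the example.

```python
# Minimal sketch of 68-point facial landmark detection with dlib.
# Illustrative only -- not the MultiSense pipeline. Assumes the pre-trained
# model file "shape_predictor_68_face_landmarks.dat" has been downloaded
# and that "frame.jpg" is one video frame (both paths are hypothetical).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("frame.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    shape = predictor(gray, face)
    # Collect the 68 (x, y) key points: jaw line, eyebrows, nose,
    # eye corners, and mouth contour.
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(landmarks[:5])
```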
Medgadget: How much information can you get from these 68 facial parameters? How do you help a computer interpret them to actually “read” an expression?
Morency: These usually allow us to start looking at muscle changes, which in a sense is a way of quantifying facial expression. There was some very well-received early work from Paul Ekman on the Facial Action Coding System, which was popularized in recent years through the TV show “Lie to Me”. It’s the idea that the muscles of the face can be reliably annotated and are informative for interpreting emotions. Depending on how you count, there are about 28-50 facial action units, so the motion of these landmarks and the wrinkles on the face allow us to identify which muscles are changing. This information is still low-level, but later on we analyze it over time to look for indicators of depression, anxiety, or PTSD.
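As a rough illustration of how landmark motion can be turned into action-unit-style measurements, the sketch below computes simple proxies for AU12 (lip corner puller) and AU1/2 (brow raise) by comparing landmark distances against a neutral-frame baseline, normalized by inter-ocular distance. The landmark indices follow the common 68-point convention; the formulas are hypothetical stand-ins, not the group’s actual method.

```python
# Toy action-unit proxies from 68-point landmarks (illustrative assumptions,
# not the MultiSense method). Standard 68-point, 0-based indexing:
# 36/45 = outer eye corners, 48/54 = mouth corners, 19/24 = mid-brows,
# 37/44 = upper eyelids. Image y-coordinates increase downward.
import numpy as np

def au_proxies(landmarks, neutral):
    """landmarks, neutral: (68, 2) arrays for the current and a neutral frame."""
    lm, nt = np.asarray(landmarks, float), np.asarray(neutral, float)
    iod = np.linalg.norm(lm[36] - lm[45])  # inter-ocular distance, used as a scale factor

    # AU12 proxy: mouth corners pulled apart relative to the neutral frame.
    mouth_width = np.linalg.norm(lm[48] - lm[54])
    mouth_width_neutral = np.linalg.norm(nt[48] - nt[54])
    au12 = (mouth_width - mouth_width_neutral) / iod

    # AU1/2 proxy: brows raised further above the eyes than in the neutral frame.
    brow_eye = np.mean([lm[19][1] - lm[37][1], lm[24][1] - lm[44][1]])
    brow_eye_neutral = np.mean([nt[19][1] - nt[37][1], nt[24][1] - nt[44][1]])
    au1_2 = (brow_eye_neutral - brow_eye) / iod

    return {"AU12_smile": au12, "AU1_2_brow_raise": au1_2}
```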
Through great collaborations with clinicians in the mental health field and medical centers such as McLean Hospital, we gathered large datasets of interviews. For depression, we had almost 500 participants who interacted with our system, and as they were talking with the system we analyzed their non-verbal behaviours. Over time, we aggregate these numbers through summary statistics and look at correlations between the non-verbal behaviours and each participant’s score on a depression scale. That allowed us to identify the behavioural indicators that correlate best with depression. This gave us about 20 behavioural indicators for depression which can be summarized for a doctor.
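To make the “aggregate per-frame behaviours, then correlate with a depression scale” step concrete, here is a minimal sketch with pandas and SciPy. The file and column names (participant_id, smile_intensity, gaze_down, phq_score) are hypothetical placeholders for whatever per-frame features and clinical scale a study actually uses.

```python
# Sketch: summarize per-frame behavioural features per participant, then
# correlate each summary statistic with a depression-scale score.
# File and column names are hypothetical, not taken from the actual study.
import pandas as pd
from scipy.stats import pearsonr

frames = pd.read_csv("per_frame_features.csv")   # participant_id + numeric features (smile_intensity, gaze_down, ...)
scores = pd.read_csv("depression_scores.csv")    # participant_id, phq_score

# Summary statistics per participant (mean and standard deviation of each feature).
summary = frames.groupby("participant_id").agg(["mean", "std"])
summary.columns = ["_".join(col) for col in summary.columns]

data = summary.join(scores.set_index("participant_id")).dropna()

# Correlate every behavioural summary with the depression scale.
for feature in summary.columns:
    r, p = pearsonr(data[feature], data["phq_score"])
    print(f"{feature}: r={r:+.2f} (p={p:.3f})")
```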
We’d like to eventually use this for screening, but in the short-term the main use-case for this technology is for monitoring patients during treatments. It’s much easier to look at changes of behaviour over time for the same person than to use it on a new person you’ve never seen before. There’s a lot of calibration that’s needed when you see a person for the first time – you adapt to their unique idiosyncrasies. But when you see them over time then you can gauge whether there have been changes between sessions. This could be useful for clinicians, helping indicate whether the treatment is going well, or if therapy approaches or medications need to be altered.
As for the technology under the hood, we depend on some AI algorithms – a lot of the technology behind it is based on probabilistic graphical models for facial landmark detection. We’ve also recently worked with more deep learning and neural network approaches.
Medgadget: Do you go into this telling the algorithm that these are all patients, or are you “blinding” it and letting the computer pick up on the differences and compare them to “normal” behaviours?
Morency: For some of our early studies, like the one with 500 volunteers, not everyone was a patient from the hospital. They were participants who were invited to come talk to our computer system or to an interviewer. Before and after people interacted with our system, we had them fill out self-report questionnaires for depression, PTSD, and anxiety. From these 500, somewhere around 15-20 percent had symptoms related to depression, about the same range for PTSD, and it might have been higher for anxiety. It was an interesting population because it was more representative of the real world. You could also imagine that observing the full range of symptoms was challenging, because people who are severely depressed are admitted into hospital units, so we were looking at patients on the lower range of depressive symptoms. Now, though, we’ve also started working with hospitals to study referrals from the ER, as we recently started studying suicidal ideation. With McLean Hospital, we’re really working with patients.
Medgadget: You mentioned calibration being an important factor, particularly in setting a baseline measurement. I’m wondering – how accurate are these algorithms?
Morency: We’re working on this for an academic purpose. The goal of the software is not to diagnose depression; that’s always the job of the doctor. We’re building these algorithms as decision support tools for clinicians so that they can do their assessments. But from an academic perspective, we do want to know how well these behavioural markers correlate with the assessments of clinicians. We’ve done this work and seen around a 78% correlation. So it’s not 100%, but the correlation is statistically significant. We’re definitely heading in the right direction! It’s also important to note that these algorithms work best with a specific patient interview style. Open-ended questions help us gather these non-verbal cues. We want questions that try to bring out the emotions and memories of the patient, and we get those correlations when these types of questions are asked by the clinicians.
Medgadget: Where do you envision this technology in a clinical setting? How has it looked so far and how will that change?
Morency: The early stage of the technology focused on screening for depression, PTSD, or anxiety. But as the technology has become more mature, we’re seeing that the best use case is in treatment. We’re working closely with McLean Hospital to study patients in hospital units. We’re currently studying a psychosis population (which includes schizophrenia and bipolar disorder) to track their behavioural markers over time. We’d like to better identify the type of psychosis and give live feedback to clinicians.
Medgadget: Do any other factors play into these behavioural marker assessments? You mentioned visual non-verbal aspects, but do you take into account vocal or verbal aspects as well?
Morency: Now that we are getting these promising results from non-verbal cues, our next line of research looks at the verbal aspects and the content of what patients are saying. We’re interested in patients’ lexical and grammatical usage and how that changes. Some previous studies have already seen signs related to language usage in schizophrenia. But we’re also interested in this because non-verbal behaviours are best understood when they are contextualized with the verbal cues. The gestures and facial expressions of a person are often better interpreted if you also know what they’re saying. So this multi-modal analysis is what we’re pushing our algorithms to do.
Medgadget: That makes sense! It’s like trying to watch TV while it’s on mute – you don’t understand as much as you would if the volume was on.
Morency: Exactly! So our early work was using only video. It’s quite amazing that we were able to get such behavioural indicators from video alone. We’re expecting much more robust indicators as we begin to integrate verbal cues.
Medgadget: You mentioned a few different medical conditions. Does the algorithm itself change between the different conditions?
Morency: You could see it as a three-layer problem. The first two layers generalize quite well between populations. The first is unimodal, almost instantaneous sensing, where you quantify the expression and gaze from the image. The second layer integrates that information over time for recognition. These two seem to work well for our purposes, since we are serving mostly adults. We’ve done some work with teenagers, but not with children. We’re expecting to have to adapt the first two layers for those contexts.
The last layer is the detection of specific behavioural markers, and these are definitely specific to each disorder. We see some markers that generalize between depression and PTSD, for example, but a lot of them do change.
As an example of these behavioural markers, think about smiling in depression. When we studied it, we were expecting that people who are depressed smile less often than others. It turns out that we were seeing about the same number of smiles, but surprisingly the dynamics of the smile were different. Those who were depressed had shorter smiles with less amplitude. It seems that because of social norms, they smile to be polite in a sense, but don’t feel it as much.
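As a purely illustrative version of this kind of smile-dynamics measurement, the sketch below segments a per-frame smile-intensity signal (for example, an AU12 estimate) into smile events and reports their count, mean duration, and mean amplitude. The intensity signal and the 0.2 threshold are assumptions for the example, not values from the study.

```python
# Sketch: count, duration, and amplitude of smile events from a per-frame
# smile-intensity signal. The threshold and fps are hypothetical.
import numpy as np

def smile_dynamics(intensity, fps=30.0, threshold=0.2):
    intensity = np.asarray(intensity, float)
    active = intensity > threshold                    # frames where a smile is "on"

    # Rising/falling edges segment the signal into individual smile events.
    edges = np.diff(active.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if active[0]:
        starts = np.r_[0, starts]
    if active[-1]:
        ends = np.r_[ends, len(active)]

    durations = (ends - starts) / fps                 # seconds per smile
    amplitudes = [intensity[s:e].max() for s, e in zip(starts, ends)]
    return {
        "count": len(starts),
        "mean_duration_s": float(np.mean(durations)) if len(starts) else 0.0,
        "mean_amplitude": float(np.mean(amplitudes)) if amplitudes else 0.0,
    }
```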
Another really interesting example comes from looking at PTSD. We expected more negative facial expressions from those affected. We didn’t see any distinction like that. However, when we separated men and women, men with PTSD showed an increase in negative facial expressions, while women showed a decrease. That was really interesting because it’s a gender-specific interaction that is probably also founded in social norms – men are typically allowed to show their negative expressions in American culture, but women are often expected to be more “smiley.”
Medgadget: So given how social norms factor into these behavioural cues, how well would these algorithms apply in different cultures? Is it particularly calibrated to Western norms?
Morency: We expect that some of these factors will generalize, but there will definitely be some changes. We have some indicators that gaze behaviour, for example, changes. There is a reduction in eye-contact when one is depressed, but in a different culture where it’s respectful not to make eye-contact, we expect the trend will still be observable, but to a lesser extent. That’s actually something we want to test, and are very interested in collaborating with international institutions to study it.
Dr. Morency gave a talk about his research at the World Economic Forum in 2015.
Medgadget: It sounds like a lot of what you’re doing is trying to get at the emotions underlying human interactions. You’ve also mentioned your interest in applications dealing with autism. Could you touch on the intersection of those two?
Morency: One aspect we’re interested in is to help better categorize and diagnose patients with autism. That’s work we’re doing with our collaborator at Yale University. But we’re also interested in helping them with general interactions and public speaking. We have a system with USC which is designed for everyone, but will be particularly well-suited for people on the lower part of the autism spectrum. The goal is to help them speak publicly and eventually present themselves for job interviews. So, although we’re interested in the uses for this technology in a clinical setting, there’s room for its use as a training system that gives live feedback to the user.
Medgadget: I came across a paper of yours looking at speech patterns in suicidal teenagers. What was that about?
Morency: The suicidal teens study was another surprising result. We were very interested in studying teenagers who visit the ER with suicide attempts or suicidal ideation. The goal originally was to differentiate between those with and without suicidal thoughts. It had previously been demonstrated that the way language is used can serve as a marker: those with suicidal thoughts use personal pronouns like “me” and “myself” a lot more often. We were able to differentiate between them based on their language use. But what we really wanted to do was predict repeated suicide attempts. So we called them a few weeks later to inquire, and it turns out that the breathy quality of the voice was predictive of reattempts. It was counter-intuitive – we thought that a tense voice would be the most predictive. One hypothesis is that they’ve already made up their minds and have made peace with it, so their voice is breathy instead of tense.
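As a simple illustration of the pronoun-use marker mentioned here, the snippet below computes the rate of first-person singular pronouns in a transcript. The pronoun list and the per-100-words normalization are assumptions for the example, not the study’s actual feature set.

```python
# Sketch: rate of first-person singular pronouns per 100 words in a transcript.
# The pronoun list and normalization are illustrative assumptions.
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_rate(transcript: str) -> float:
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FIRST_PERSON)
    return 100.0 * hits / len(words)

print(first_person_rate("I told myself it was my fault."))  # ~42.9
```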
Medgadget: So what’s your vision for this field of work over the next decade or two?
Morency: We’re seeing a great openness from the medical field to integrate technology. In the next 5 years or less, we’ll see more validation studies. We’re seeing so many promising results for behavioural markers, and we’ll see more studies looking at how these results generalize and apply in different contexts. We’ll also see this technology coming into the field of telemedicine. Specialists are not always available locally to the patient, so being able to interact remotely and gather behavioural markers along the way is an aspect we see having a huge impact on the healthcare field in the coming years.
Medgadget: Are there any applications beyond the medical context that this technology could be used for?
Morency: It’s helpful for at least two more applications. One is mining online videos. There’s a huge wealth of information from people posting videos online to express their opinions about everything! So summarizing these videos and being able to understand what they’re talking about and the opinions they’re expressing – what is sometimes called opinion mining or sentiment analysis – is a very intriguing application of our multi-modal system. Another line of research that personally excites me is to push forward an agenda to help with online learning. It’s a field that has huge potential, but the results so far have not been uniformly positive. We believe this technology could help students have more productive remote work groups. We want to bring some of the advantages of face-to-face interactions to an online collaborative setting.
More of Dr. Morency’s work on his Google Scholar page…