Detecting Depression in Social Media Via Twitter Usage

By Tulasi ram Ponaganti
Published in The Startup · Oct 30, 2020 · 8 min read

Collecting tweets using Twint and analyzing them to detect depression

Twint is a magic tool for scraping and fetching data from Twitter based on desired keywords, in just a few lines of code executed from any terminal. In addition, I used Python packages such as TensorFlow, Matplotlib, NLTK, WordCloud, and Gensim.

More than 300 million people suffer from depression and only a fraction receive adequate treatment. Depression is the leading cause of disability worldwide and nearly 800,000 people every year die due to suicide. Suicide is the second leading cause of death in 15–29-year-olds. Diagnoses (and subsequent treatment) for depression are often delayed, imprecise, and/or missed entirely.

It doesn’t have to be this way. Social media provides an unprecedented opportunity to transform early depression intervention services, particularly in young adults.

Every second, approximately 6,000 Tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day, and around 200 billion tweets per year.

Pew Research Center states that currently, 72% of the public uses some type of social media. This project captures and analyses linguistic markers associated with the onset and persistence of depressive symptoms in order to build an algorithm that can effectively predict depression. By building an algorithm that can analyze Tweets exhibiting self-assessed depressive features, it will be possible for individuals, parents, caregivers, and medical professionals to analyze social media posts for linguistic clues that signal deteriorating mental health far earlier than traditional approaches allow.

Because a diagnosis of depression so often requires the self-reporting of symptoms, social media posts provide a rich source of data and information that can be used to train an efficient model.

Sources:

https://www.who.int/news-room/fact-sheets/detail/depression

https://www.internetlivestats.com/twitter-statistics/

Data Collection and Dataset Description:

Below is the procedure I used for collecting tweets with Twint on Ubuntu Linux.

Tweets indicating depression were retrieved with the Twitter scraping tool Twint, using linguistic markers indicative of depression. The scraped tweets may contain tweets that do not indicate the user having depression, such as tweets linking to articles about depression or talking about loved ones who have depression. As a result, the scraped tweets will need to be manually checked for better testing results.

Using the above procedure, I collected data for the following terms:

command line: twint -s "depression" --since 2019-07-20 -o depression --csv

  • Depressed
  • Depression
  • Hopeless
  • Lonely
  • Suicide
  • Antidepressant
  • Antidepressants

These Tweets proved to contain lexical features strongly indicative of depression and were ideal for training an efficient and robust classifier.
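The same collection can also be scripted with Twint's Python API. Here is a minimal sketch (not the exact script used in this project) that loops over the search terms above; the output file names are illustrative:

```python
# Minimal sketch: collect tweets for each depression-related term with Twint.
# File names are illustrative; the date mirrors the CLI example above.
import twint

SEARCH_TERMS = ["depressed", "depression", "hopeless", "lonely",
                "suicide", "antidepressant", "antidepressants"]

for term in SEARCH_TERMS:
    c = twint.Config()
    c.Search = term            # keyword to search for
    c.Since = "2019-07-20"     # same start date as the CLI example
    c.Store_csv = True         # write results as CSV
    c.Output = f"{term}.csv"   # one output file per term
    twint.run.Search(c)
```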

Data Exploration:

In order to build a depression detector, two kinds of tweets were needed for this project: random tweets that do not necessarily indicate depression, and tweets that demonstrate that the user may have depression and/or depressive symptoms. A dataset of random tweets can be sourced from the Sentiment140 dataset available on Kaggle; however, for this binary classification model, the dataset that builds on Sentiment140 and offers a set of binary labels proved to be the most effective for building a robust model. There are no publicly available datasets of tweets indicating depression, so "depressive" Tweets were retrieved using the Twitter scraping tool Twint. The scraped Tweets were manually checked for relevance (for example, Tweets indicating emotional rather than economic or atmospheric depression), then cleaned and processed. Tweets were collected by searching for terms related to depression, particularly the lexical terms identified in the unigram analysis.
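As a rough illustration of the cleaning step, here is a minimal sketch of the kind of preprocessing typically applied to scraped tweets; the exact steps used in this project may differ:

```python
# A sketch of typical tweet cleaning: lowercase, strip URLs/mentions/hashtags,
# keep letters only, and drop English stopwords. Illustrative, not the
# project's exact pipeline.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOP = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # strip URLs
    text = re.sub(r"[@#]\w+", "", text)       # strip mentions and hashtags
    text = re.sub(r"[^a-z\s]", "", text)      # keep letters and spaces only
    return [w for w in text.split() if w not in STOP]

print(clean_tweet("Feeling so alone tonight :( https://t.co/xyz #depressed"))
```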

Because the nature of social media content poses serious challenges to applications of sentiment analysis, VADER was also utilized for general sentiment analysis of Tweets. VADER is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media and microblog-like contexts. It allows for not only the classification of sentiment but also the associated sentiment intensity measures, which is extremely useful because Tweets often contain multiple sentiments. VADER doesn't require training data: it is constructed from a human-curated, valence-based, generalizable sentiment lexicon and is fast enough to be used with streaming data. While VADER does not detect depression in text, it gives a foundation for understanding the general sentiment of the data.
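For reference, here is a short sketch of scoring tweets with VADER via NLTK; the example tweets are made up, not drawn from the project's data:

```python
# Score example texts with VADER; polarity_scores returns neg/neu/pos
# components plus a normalized compound intensity score in [-1, 1].
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
for tweet in ["I feel so hopeless and alone :(",
              "What a beautiful morning!"]:
    print(tweet, "->", analyzer.polarity_scores(tweet))
```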

Here are the links to the datasets:

https://www.kaggle.com/kazanova/sentiment140

https://www.kaggle.com/ywang311/twitter-sentiment

Exploratory Visualisation:

Frequencies of characters and words

Most Common Words After Data Preprocessing

Certain bigrams were also extremely common, including "smile wide" (appearing 42,185 times), "afraid loneliness" (4,641 times), and "feel lonely" (3,541 times).

Frequency of Bigrams
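These counts come from a simple bigram frequency pass over the cleaned corpus. A minimal sketch with NLTK, using placeholder tokenized tweets:

```python
# Count adjacent word pairs (bigrams) across tokenized tweets.
# `cleaned_tweets` is a placeholder; in the project it would hold the
# full preprocessed corpus.
from collections import Counter
from nltk.util import bigrams

cleaned_tweets = [["smile", "wide", "today"],
                  ["feel", "lonely", "again"],
                  ["afraid", "loneliness", "never", "ends"]]

counts = Counter()
for tokens in cleaned_tweets:
    counts.update(bigrams(tokens))  # sliding pairs of adjacent tokens

print(counts.most_common(3))
```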

As expected, the datasets used for this model are imbalanced. Most classification datasets don’t have exactly an equal number of instances in each class, but a dataset involving mental or physical health will almost certainly be imbalanced, as this one is. When analyzing a dataset of random Tweets for any kind of health issue, it is unsurprising that the random data would be heavily imbalanced. When creating a dataset that specifically looks for information associated with this health issue, it is unsurprising that the resulting dataset would be imbalanced in the opposite direction.

While accuracy is a good initial measure to use when evaluating a model, this model has an imbalanced dataset, and as such, requires additional measures to assess its accuracy and robustness. In a model that uses imbalanced data, it is likely that the accuracy score will be high but the accuracy score may only reflect the underlying class distribution. This is known as the accuracy paradox. The reason for this is that the model will learn to always predict the most common class because that is often the correct class to predict. Because of the accuracy paradox, it was important to not only provide the two separate datasets but also to collect a larger amount of data and change the performance metric to include more than simply the accuracy score.
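A toy example makes the accuracy paradox concrete: on a dataset that is 95% non-depressive, a classifier that always predicts the majority class scores 95% accuracy while detecting nothing.

```python
# Always-predict-majority baseline on an imbalanced toy dataset:
# high accuracy, but zero precision, recall, and F1 for the minority
# (depressive) class.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0] * 95 + [1] * 5   # 95 random tweets, 5 depressive tweets
y_pred = [0] * 100            # baseline: always predict "not depressive"

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```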

Once the Tweets were cleaned, it was easy to see the difference between the two datasets by creating a word cloud from the cleaned Tweets. Even with only an abbreviated Twint scrape, the differences between the two datasets were clear:

Random Tweet Word Cloud:

Depressive Tweet Word Cloud:
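Word clouds like those above can be generated with the WordCloud package; here is a minimal sketch, where `depressive_text` is a placeholder for the concatenated cleaned tweets:

```python
# Render a word cloud from a blob of cleaned tweet text.
# `depressive_text` is a placeholder, not the project's actual corpus.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

depressive_text = "lonely hopeless tired alone empty numb depression"

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(depressive_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```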

Because of the nature of mental illness and its subjectivity, it made sense to use a binary classification model. The benchmark model chosen was a logistic regression model.
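For context, a logistic regression benchmark of this kind can be set up in a few lines with scikit-learn. The TF-IDF features, placeholder tweets, and train/test split below are illustrative assumptions, not the author's exact setup:

```python
# Illustrative logistic regression benchmark on TF-IDF features.
# The four tweets and their labels are placeholders (1 = depressive).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

tweets = ["feel so alone tonight", "nothing matters anymore",
          "great game last night", "loving this sunny weather"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(tweets)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```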

Graphs representing stages of model improvement after refinement:

Model Evaluation and Validation:

The final architecture, parameters, and hyperparameters were chosen because they performed the best among all the combinations tried. The model takes in each tokenized Tweet and feeds it into an embedding layer to produce an embedding vector, then runs that vector through a convolutional layer, which is well suited for learning spatial structure from data; here it learns structure from the sequential text and passes it to a standard LSTM layer. The output of the LSTM layer is fed into a dense layer, which outputs a number representing the probability that the Tweet indicates depression. The model uses max pooling, a dropout of 0.5, binary cross-entropy loss, the Nadam optimizer, a ReLU activation in the convolutional layer, and a sigmoid activation in the dense layer. Accuracy and loss are recorded, visualized, and compared to the benchmark logistic regression model.
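A hedged Keras sketch of this architecture follows; the vocabulary size, embedding dimension, filter count, and unit counts are illustrative assumptions, not the exact hyperparameters used in the project:

```python
# CNN + LSTM binary classifier over tokenized tweets, as described above.
# All sizes are illustrative; only the layer types, dropout of 0.5,
# Nadam optimizer, and binary cross-entropy loss follow the text.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     LSTM, Dropout, Dense)

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # token ids -> embedding vectors
    Conv1D(filters=32, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    LSTM(64),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # probability the tweet indicates depression
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.build(input_shape=(None, 100))  # e.g. tweets padded to 100 tokens
model.summary()
```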

Conclusion:

The use of linguistic markers as a tool in the analysis and diagnosis of depression has enormous potential. Depression can be seen in text remarkably quickly, even without the use of complex models. Simply by collecting, cleaning, and processing available data, visual analysis alone can illuminate the difference between random Tweets and Tweets with depressive characteristics.

The potential of linguistic analysis in the arena of mental health cannot be overstated. By analyzing a person’s words, you have a clear and valuable window into his or her mental state. Even the simplest analysis of social media can provide us with unprecedented access to individuals' thoughts and feelings and lead to substantially greater understanding and treatment of mental health.

The final model proves to be far more accurate than the benchmark model. The benchmark model, run on the same data for the same number of epochs, shows an accuracy of approximately 64%, while the final model reaches approximately 97%. This is a much more robust and effective model for depression prediction, and the results suggest the approach goes a long way toward solving the difficulty of effectively analyzing Tweets for depression.

GITHUB:

https://github.com/ram574/Detecting-Depression-in-Social-Media-via-Twitter-Usage
