Deep neural networks for text classification, a part of NLP (Natural Language Processing), are a useful tool for capturing ‘encoded’ features from a collection of text (a corpus) and for learning hidden relationships between words in a given context. These features are encoded as numerical values that are almost impossible for a human to understand and interpret.
LSTM, a special type of RNN, has the advantage of uncovering temporal, recurrent patterns in natural language, which makes it a good approach for this project. Because a well-labeled training dataset is still under construction, we apply an LSTM to existing posts, using the forum each post belongs to as a stand-in for its label.
By contrast, traditional text classification algorithms, such as the Bayesian classifier, decision tree, and logistic regression, rely heavily on manual feature engineering and statistical interpretability. These methods are also mostly unable to retain the sequential linguistic information that is worth noticing in social media posts.
We assume that posts from ‘eating disorders’ forums mostly discuss eating disorder behaviors and mentalities and carry higher risk scores for developing eating disorders, so we label all of these posts positive (‘label (y)’ equal to one). Posts from ‘control group’ and ‘general health group’ forums, by contrast, are labeled negative.
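The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the forum names, the post structure, and the use of zero for the negative class are assumptions, not our actual data schema.

```python
# Hypothetical forum identifiers; the real names in the dataset may differ.
POSITIVE_FORUMS = {"eating disorders"}
NEGATIVE_FORUMS = {"control group", "general health group"}


def label_post(forum: str) -> int:
    """Return 1 for eating disorders forums, 0 for control/general health forums.

    Using 0 for the negative class is an assumption; the text only fixes
    the positive label at one.
    """
    name = forum.strip().lower()
    if name in POSITIVE_FORUMS:
        return 1
    if name in NEGATIVE_FORUMS:
        return 0
    raise ValueError(f"unknown forum: {forum!r}")


# Toy posts, made up for illustration only.
posts = [
    {"forum": "eating disorders", "text": "..."},
    {"forum": "control group", "text": "..."},
    {"forum": "general health group", "text": "..."},
]
labels = [label_post(p["forum"]) for p in posts]
```

Note that the label depends only on the forum, never on the text of the post, which is exactly why off-topic posts end up mislabeled.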
Train the Model
We split the original data into 80 percent for training the parameters and 20 percent for model testing. Within the 80 percent, we set aside 10 percent as a validation set for model selection and hyper-parameter tuning throughout training.
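A minimal sketch of this split is shown below. It assumes the 10 percent validation share is taken from the training portion (so 8 percent of all data); the function name, the seed, and the shuffling step are illustrative choices, not a record of our exact pipeline.

```python
import random


def split_dataset(items, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    val_frac is applied to what remains after the test set is removed,
    which is an assumption about how the validation set is carved out.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_frac)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    n_val = round(len(remaining) * val_frac)
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 720 80 200
```

Fixing the seed keeps the test set identical across runs, so results from different hyper-parameter settings stay comparable.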
Here is our LSTM recurrent neural network:
The model performance is shown in the following accuracy and loss graphs:
The continuing decrease of training loss and increase of training accuracy show that our network is learning. However, the validation accuracy goes down and the validation loss goes up after 2 epochs, which suggests that training should stop early, before the model starts to overfit from the third epoch onward.
We select the model parameters from the second epoch. Evaluating this model on the held-out test set gives a loss of 0.3681 and an accuracy of 0.8087, which is not quite satisfying. The likely reasons range from poor labeling (for example, people might submit off-topic posts to eating disorders forums, such as recovery stories after being cured or complaints about messy daily life) to errors caused by the similarity between posts in general health forums and posts in eating disorders forums that concern physical and mental well-being. This type of error, which comes from the data itself, is hard to reduce simply by increasing model complexity or switching algorithms.
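The early-stopping rule described here can be written as a small helper that watches the validation loss. This is a sketch under assumptions: the function name and the patience parameter are ours, and the loss values are illustrative numbers, not the ones from our training run.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; best_epoch is the epoch whose
    parameters should be kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Illustrative validation losses only; not measured values.
example_val_losses = [0.45, 0.38, 0.41, 0.47]
stop_epoch, best_epoch = early_stop_epoch(example_val_losses)
print(stop_epoch, best_epoch)  # 3 2
```

With a curve shaped like ours, the rule stops at the third epoch and keeps the parameters from the second.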
Exploratory Text Analysis on Prediction
It’s hard to tell from the two values above alone what kind of classification our model has made on posts in eating disorders forums. We are curious which posts in eating disorders forums are predicted to be ‘negative’ (low-risk), since by assumption these forums should contain only ‘positive’ (high-risk) posts. We can use exploratory text analytics methods such as BOW (bag of words) and tf-idf to get a general idea of the predicted results.
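To make the tf-idf idea concrete, here is a small pure-Python sketch. It uses one common variant (term frequency times the log of inverse document frequency, without smoothing), which is an assumption; libraries such as scikit-learn use slightly different formulas. The toy token lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf scores for a list of tokenized documents.

    docs is a list of token lists; the result is one {term: score} dict
    per document, using tf = count / doc_length and idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return scores


# Toy documents: one per predicted class, tokens invented for illustration.
toy_docs = [
    ["food", "weight", "day"],
    ["people", "story", "day"],
]
toy_scores = tf_idf(toy_docs)
```

A term that appears in every document (here ‘day’) scores zero, so tf-idf highlights the words that set one predicted class apart from the other.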
Among all the posts in eating disorders forums, 5115 are predicted as positive and 86 as negative. Word cloud visualizations can help us get a better idea of the classification results. The following word clouds are built on all words, words without stop-words, adjectives, verbs, and nouns of the posts.
It looks like our network places more emphasis on words like ‘food’, ‘disorders’, and ‘weight’ in positive posts, while it favors neutral expressions like ‘people’, ‘story’, and ‘day’ in the negative class. The word clouds for nouns seem to convey more useful distinguishing information about the predictions than the others.
Our next step will be launching a Zooniverse data collection website for crowd-labeling, which we expect will address the current problem of poor labeling.
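The word counts that sit behind such word clouds can be approximated as follows. This is a rough sketch: the stop-word list is a tiny hypothetical one, the tokenizer is a simple regular expression, and the example sentences are invented. The part-of-speech clouds (adjectives, verbs, nouns) would additionally need a tagger, which is left out here.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "i", "and", "to", "of", "my", "is"}


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lowercase word occurrences across texts, skipping stop-words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


# Invented example sentences, not real forum posts.
toy_posts = [
    "I worry about my weight and the food I eat",
    "The story of my day",
]
freq = word_frequencies(toy_posts)
```

Running this separately on the posts predicted positive and those predicted negative gives the two frequency tables that the word clouds visualize.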