Incorporating relationships between tweets for topic derivation in Twitter
Thesis posted on 28.03.2022, 15:36 by Robertus Setiawan Aji Nugroho
Twitter has become one of the most popular social media platforms, widely used for discussions and information dissemination on all kinds of topics. As a result, much research has been concerned with deriving topics from Twitter and applying the outcomes in a variety of real-life applications such as emergency management, business advertisements, and corporate/government communication. However, deriving topics in this short-text, highly dynamic environment remains a major challenge. In Twitter, the frequency of term co-occurrences across messages (tweets) is very low because of the limit on the number of characters allowed per post. In addition, a tweet often includes informal language expressions, such as emoticons, abbreviations, and misspelled terms. The result is a very sparse relationship between tweets and the terms they contain, which renders methods that exploit only content features ineffective. Deriving topics from tweets is also problematic because topics change quickly over a short period of time. To address these problems, we propose a novel topic derivation approach that incorporates tweet text similarity and time-sensitive interaction measures. Besides the tweet contents, the approach takes into account several types of interactions amongst tweets: tweets which mention the same user, replies, and retweets. We propose a joint probability model that integrates the effects of content similarity, user mentions, and replies-retweets to measure the relationships between tweets. Given the dynamic aspect of the environment, we also hypothesize that temporal features could further improve the quality of topic derivation results. We therefore incorporate a time factor, introducing a half-life exponential decay function so that the weight of an interaction diminishes as the time gap between tweets grows.
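As an illustration of the time factor mentioned above, the following is a minimal sketch of a generic half-life exponential decay weight. The function name `decay_weight`, the use of hours as the time unit, and the 24-hour half-life are assumptions made for this example; the abstract does not state the exact form or parameter values used in the thesis.

```python
def decay_weight(elapsed_hours: float, half_life_hours: float = 24.0) -> float:
    """Generic half-life decay: the weight halves every `half_life_hours`.

    Hypothetical helper for illustration only; the thesis's actual decay
    function and half-life value are not given in the abstract.
    """
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    if elapsed_hours < 0:
        raise ValueError("elapsed_hours must be non-negative")
    return 0.5 ** (elapsed_hours / half_life_hours)


# An interaction between two tweets posted at the same moment keeps its
# full weight; one half-life later the weight has dropped to one half.
same_time = decay_weight(0.0)
one_half_life = decay_weight(24.0)
two_half_lives = decay_weight(48.0)
```

Under these assumptions, a reply posted a day after the original tweet would contribute half as much to the tweet-to-tweet relationship as an immediate reply, and a reply two days later a quarter as much.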
Topic derivation is done through our proposed Non-negative Matrix inter-joint Factorization (NMijF) method, which jointly co-factorizes the tweet-to-tweet relationship matrix and the tweet-to-term relationship matrix within a single iterative-update process. NMijF clusters the tweets based on their relationships while learning the topic words from the tweet clusters and the content features of the tweets. We conducted a number of experiments on several Twitter datasets to reveal both the individual and integrated effects of the various features being considered. Experimental results with the TREC2014, tweetSanders, and tweetMarch datasets demonstrate that the proposed method consistently outperforms other advanced topic derivation methods, yielding improvements of 10-70% in all evaluation metrics.
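To make the idea of co-factorizing two matrices in one iterative-update loop concrete, here is a minimal sketch of a generic joint non-negative matrix factorization with a shared tweet-to-topic factor. It is not the NMijF algorithm itself, whose objective and update rules are not given in the abstract. The objective ||A - WY||^2 + alpha * ||V - WH||^2, the standard Lee-Seung style multiplicative updates, the function name `joint_nmf`, the weight `alpha`, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def joint_nmf(A, V, k, alpha=1.0, iters=200, seed=0, eps=1e-9):
    """Generic joint NMF sketch (NOT the thesis's NMijF update rules).

    A: (n, n) non-negative tweet-to-tweet relationship matrix
    V: (n, m) non-negative tweet-to-term matrix
    Approximates A ~ W @ Y and V ~ W @ H with a shared factor W (n, k).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps  # tweet-to-topic memberships (shared)
    Y = rng.random((k, n)) + eps  # auxiliary factor for A
    H = rng.random((k, m)) + eps  # topic-to-term weights
    losses = []
    for _ in range(iters):
        # Multiplicative updates keep every factor non-negative.
        W *= (A @ Y.T + alpha * (V @ H.T)) / (
            W @ (Y @ Y.T + alpha * (H @ H.T)) + eps
        )
        Y *= (W.T @ A) / (W.T @ W @ Y + eps)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        losses.append(
            np.linalg.norm(A - W @ Y) ** 2
            + alpha * np.linalg.norm(V - W @ H) ** 2
        )
    return W, Y, H, losses


# Toy data (hypothetical): four tweets, five terms, two intended topics.
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])
V = np.array([[2.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 2.0, 1.0, 1.0]])

W, Y, H, losses = joint_nmf(A, V, k=2)
tweet_topics = W.argmax(axis=1)  # cluster label for each tweet
top_terms = H.argsort(axis=1)[:, ::-1][:, :2]  # two strongest term indices per topic
```

In this sketch the shared factor `W` plays the role of the tweet clustering, while `H` carries the term weights from which topic words could be read off. Sharing `W` across both reconstructions is what lets the interaction structure in `A` compensate for the sparsity of `V`.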