Sentence-BERT for Interpretable Topic Modeling in Web Browsing Data

Nowadays, much intellectual exploration happens through a web browser, yet the breadcrumbs this activity leaves behind are largely unstructured. Common browsers retain timestamped browsing history, but its tabular form, a flat list of URLs, titles, and timestamps, leaves the data neglected and difficult to explore semantically. Topic modeling and document clustering address this challenge: they organize and search collections of text for information retrieval and potential knowledge discovery. In this work, I leverage Sentence-BERT (SBERT) to build expressive embeddings that form an interpretable space for topic modeling over my own browsing history. In a qualitative analysis, topic clusterings built from SBERT web page embeddings outperform those built from Doc2Vec document embeddings. The method shows promise as a tool for semantically exploring one's browsing history, and more broadly, other diverse collections of documents and text.
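The pipeline described above, embedding pages with SBERT and then grouping the embeddings into topics, can be sketched roughly as follows. This is a minimal illustration, not the paper's exact implementation: the model name `all-MiniLM-L6-v2` and the choice of KMeans as the clustering step are assumptions for the sake of a concrete example.

```python
# Hedged sketch: embed each page title with Sentence-BERT, then cluster the
# embedding vectors into topics. Model and clusterer are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pages(embeddings: np.ndarray, n_topics: int, seed: int = 0) -> np.ndarray:
    """Assign each embedded page to one of n_topics clusters."""
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=seed)
    return km.fit_predict(embeddings)

# With the sentence-transformers package installed, embeddings could be
# produced like this (an assumed API usage, not the paper's code):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = model.encode(page_titles)  # page_titles: list of str

# Synthetic demo: two well-separated groups of 4-d vectors stand in for
# SBERT embeddings so the clustering step itself is runnable here.
rng = np.random.default_rng(0)
fake = np.vstack([rng.normal(0.0, 0.1, (5, 4)), rng.normal(5.0, 0.1, (5, 4))])
labels = cluster_pages(fake, n_topics=2)
```

Each cluster of labels can then be inspected (e.g. by its most frequent title terms) to judge how interpretable the resulting topics are.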