All events are in Central time unless specified.

M.S. Thesis Defense: Rojina Deuja

Date: Time: 1:00 pm–2:00 pm
“Semantically Meaningful Sentence Embeddings”

Text embedding is an approach used in Natural Language Processing (NLP) to represent words, phrases, sentences, and documents. It is the process of obtaining numeric representations of text that are fed into machine learning models as vectors (arrays of numbers) for further processing. One of the biggest challenges for text embedding is representing longer segments of text in a way that captures both the meaning of the segment and the semantic relationships among its constituents. Such representations are known as semantically meaningful embeddings.

In this study, we seek to improve the quality of semantically meaningful embeddings generated for sentences. The current state-of-the-art models are mostly based on transformer networks that utilize attention mechanisms. Such networks use encoders that generate dense vectors to represent input sentences. Most of these models simply combine the dense vectors into fixed-size embeddings using heuristic pooling, yet there is no evidence that such pooling techniques work best for capturing semantic relationships. We argue that combining the vectors in this way incorporates a lot of unwanted information into the embeddings. To capture the true semantic relationship between words in a sentence and remove linguistic noise, we propose a modified version of the DeBERTa model with a novel pooling technique. The model uses a fully connected neural network (FCNN) to reduce the size of the encoder output while enriching the expressiveness of semantic information in the embeddings. Our experiments show that the proposed model achieves significant improvement over existing sentence embedding methods on two datasets: STS Benchmark (STS-B) and SICK-Relatedness (SICK-R). We also create a semantic search engine that encodes an input sentence and returns the top N sentences that are most similar to it.
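
As a rough illustration of two ideas mentioned in the abstract, the sketch below shows a common heuristic pooling baseline (masked mean pooling over token vectors) and a top-N similarity lookup of the kind a semantic search engine might perform. It is not the thesis implementation: the function names `mean_pool` and `top_n_similar`, the use of NumPy, and the choice of cosine similarity as the ranking score are all assumptions made for this example.

```python
import numpy as np


def mean_pool(token_vectors, mask):
    """Masked mean pooling (a typical heuristic baseline, assumed here).

    token_vectors: array of shape (batch, tokens, dim) from an encoder.
    mask: array of shape (batch, tokens); 1 for real tokens, 0 for padding.
    Returns one fixed-size vector per sentence, shape (batch, dim).
    """
    mask = mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


def top_n_similar(query, corpus, n=3):
    """Rank corpus embeddings by cosine similarity to the query embedding.

    query: array of shape (dim,); corpus: array of shape (sentences, dim).
    Returns (indices, scores) for the n most similar corpus rows.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:n]
    return order, scores[order]
```

In this sketch, a query sentence would be encoded and pooled into a single vector, then passed to `top_n_similar` together with the precomputed embeddings of the candidate sentences. The pooling step is the part the study proposes to replace with a learned FCNN.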

Dr. Stephen Scott (Adviser)
Dr. Mohammad Rashedul Hasan (Co-adviser)
Dr. Vinodchandran N. Variyam

Meeting ID: 998 4120 7914
