All events are in Central time unless specified.

Ph.D. Dissertation Defense: Yuji Mo

3:00 pm–4:00 pm
Zoom Room:
“Cost-Aware Hierarchical Active Learning and Sub-Linear Time Embedding Based Deep Retrieval”

The research in this dissertation consists of two parts: an active learning algorithm for hierarchical labels and an embedding-based retrieval algorithm. In the first part, we present a new active learning model that works with training data labeled according to a hierarchical scheme, e.g., data from named entity recognition (NER), document classification, and biological sequence analysis. Our model allows a high-level learning problem (e.g., location versus non-location) to be decomposed into a collection of finer-grained problems (e.g., museum versus non-museum), which can then be learned individually and their results combined as an alternative means of learning the high-level concept. We show that this approach can result in higher levels of precision for the same level of recall. Since such finer-grained labels could be more expensive to obtain, we work in the active learning setting and study the trade-off between the increased cost of purchasing finer-grained labels and the potential increased benefit in learning. We then present a family of parameterized algorithms to work in our new model, and empirically evaluate our approach on NER and document classification problems.
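As a rough illustration of how fine-grained predictions might be combined into a high-level one, the sketch below uses a noisy-OR rule. This is an assumption made for illustration only: the abstract does not say how the dissertation combines the finer-grained learners, and the function name, label names, and the independence assumption are all hypothetical.

```python
def combine_fine_grained(scores):
    """Combine fine-grained probabilities into one high-level probability.

    Assumption (not from the dissertation): a noisy-OR rule, i.e. the
    high-level label (e.g. "location") holds if at least one fine-grained
    label (e.g. "museum") holds, and the fine-grained scores are treated
    as independent probabilities.
    """
    p_none = 1.0
    for p in scores.values():
        p_none *= 1.0 - p  # probability that this fine-grained label does not apply
    return 1.0 - p_none


# Hypothetical fine-grained scores for a single candidate entity.
location_score = combine_fine_grained({"museum": 0.5, "park": 0.5})
```

Other combination rules, such as taking the maximum score, would fit the same interface.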

In the second part, we present a Bayesian Deep Structured Semantic Model (BDSSM) that performs efficiently in retrieval tasks with a large pool of candidates for real-time applications, e.g., in search engines, digital ads, and recommendation systems. The efficiency is achieved by indexing the items into groups based on the sparse representations of their embeddings during offline pre-computation. In the online retrieval phase, the algorithm only retrieves and ranks items from indices that are relevant to the query. We explore optimization strategies in the algorithm to make the sparse representations sparser. In evaluation, the algorithm is compared with other popular clustering-based, hashing-based, and tree-based retrieval methods. We measure the differences in multiple dimensions, including retrieval recall, storage of embeddings, and CPU time. We show that this algorithm outperforms the other algorithms in the comparison in both recall and CPU time under the same storage limit. Finally, we also show that this algorithm can be used in exploration when the model is recurrently retrained.
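The offline indexing and online retrieval phases described above can be sketched as a simple inverted index over the non-zero dimensions of sparse embeddings. This is a minimal sketch under stated assumptions, not the BDSSM itself: the embeddings are hand-written dictionaries rather than model outputs, ranking uses a plain dot product, and all names (`build_index`, `retrieve`, `ITEMS`) are hypothetical.

```python
from collections import defaultdict


def build_index(item_embeddings):
    """Offline phase: group item ids by the non-zero dimensions of their embeddings."""
    index = defaultdict(set)
    for item_id, embedding in item_embeddings.items():
        for dim, value in embedding.items():
            if value != 0.0:
                index[dim].add(item_id)
    return index


def retrieve(query, index, item_embeddings, top_k=2):
    """Online phase: score only items that share a non-zero dimension with the query."""
    candidates = set()
    for dim, value in query.items():
        if value != 0.0:
            candidates |= index.get(dim, set())

    def score(item_id):
        # Assumed ranking function: dot product of the sparse vectors.
        embedding = item_embeddings[item_id]
        return sum(w * embedding.get(d, 0.0) for d, w in query.items())

    # Sort by descending score; break ties by item id for determinism.
    return sorted(candidates, key=lambda i: (-score(i), i))[:top_k]


# Hypothetical sparse embeddings, stored as {dimension: value}.
ITEMS = {"a": {0: 1.0, 3: 0.5}, "b": {3: 2.0}, "c": {7: 1.0}}
INDEX = build_index(ITEMS)
results = retrieve({3: 1.0}, INDEX, ITEMS)
```

In this sketch, item `c` is never scored for the query because it shares no non-zero dimension with it. Sparser representations shrink the candidate set in the same way, which is where the sub-linear retrieval cost would come from.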

Dr. Stephen Scott, Advisor
Dr. Ashok Samal
Dr. Leen-Kiat Soh
Dr. Etsuko Moriyama
