Statistics Weekly Seminar: Andee Kaplan
Record Linkage for Streaming Data
3:00 pm – 4:00 pm
Hardin Hall - North Wing, HARH 49
3310 Holdrege Street
Lincoln NE 68583-0963
Virtual Location:
Zoom
Contact:
Department of Statistics, statistics@unl.edu
The Department of Statistics invites you to our Weekly Seminar Series. In our next seminar, guest speaker Andee Kaplan, Assistant Professor in the Department of Statistics at Colorado State University, will present her talk: Record Linkage for Streaming Data.
Abstract:
With the ubiquity of data, linking data sets has become crucial for myriad applications including healthcare, official statistics, ecology, fraud detection, and national security. Record linkage is the task of resolving duplicates in two or more partially overlapping sets of records, or files, from noisy data sources without a unique identifier. In any field where multiple sources of messy data are available to address a scientific problem, record linkage is a critical step in the analysis pipeline. In streaming record linkage, files arrive sequentially in time and estimates of the linkage structure are updated after the arrival of each file. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. In this talk, I present the first multi-file Bayesian record linkage model formulated specifically for the streaming data context. This model is fit using recursive updates, incorporating each new batch of data into the model parameters’ posterior distribution. A novel Markov chain Monte Carlo algorithm is presented that performs recursive Bayesian updates while avoiding the issue of degradation, common to many recursive algorithms. This sampler achieves near-equivalent posterior inference to non-streaming algorithms at a small fraction of the compute time.
Bio:
Andee Kaplan is an assistant professor in the Department of Statistics at Colorado State University. Her research interests lie in the intersection of Bayesian methodology and statistical computing, particularly as applied to large social science and ecological problems with complex dependence and messy data structures. Her research program has been supported by multiple grants, including the NSF CAREER award. Prior to joining Colorado State University, Andee spent two years as a Postdoctoral Associate at Duke University after completing her Ph.D. in Statistics from Iowa State University. In her free time, Andee enjoys riding bikes, skiing, and rock climbing.
This event originated in Statistics Seminar.