All events are in Central time unless specified.

M.S. Thesis Defense: Lei Yu

Date:
Time: 3:30 pm – 4:30 pm
Location: Zoom
“Information Extraction and Classification on Journal Papers”

Digitized documents have become a ubiquitous medium of information, and the number of scholarly documents on the web is growing rapidly. Most scientific literature is stored in Portable Document Format (PDF). PDF documents have a complex structure, which makes comprehending them and extracting useful information from them a challenging task. In response, the research community has proposed a range of rule-based and machine-learning-based techniques over the past several years. Because a major portion of the scholarly literature is stored as PDF, we believe that accurate and efficient information extraction from PDF files is an important problem.

To help a soil science team from the United States Department of Agriculture (USDA) build a queryable journal paper system, we used a web crawler written in Python to download soil science journal papers from the digital library, so that users can be provided with the papers they are interested in. The system extracts useful information from each paper, including authors, journal, publication date, abstract, DOI, journal type, experiment location, and keywords, and highlights these paper characteristics in the data system. Named entity recognition is used to extract the authors and experiment locations, and table analysis is used to store the tables from each paper in a computer-queryable form. Text analysis is applied to identify the parts of interest, which are stored in the database to save time. For text analysis, we used traditional machine learning techniques, including logistic regression, support vector machines, decision trees, naive Bayes, k-nearest neighbors, random forests, ensemble modeling, and neural networks, and we conclude by comparing the advantages of these approaches.
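As a rough illustration of one of the metadata fields listed above, the sketch below pulls DOIs out of text that has already been extracted from a PDF. It is not the method used in the thesis, which the abstract does not describe at this level of detail. The function name `extract_dois`, the sample string, and the regular expression (modelled on a commonly used pattern for modern DOIs) are all assumptions made for this example.

```python
import re
from typing import List

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# It covers most modern DOIs but is not guaranteed to match every valid one.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text: str) -> List[str]:
    """Return the unique DOIs found in `text`, in order of first appearance."""
    found: List[str] = []
    for match in DOI_PATTERN.finditer(text):
        # Drop trailing sentence punctuation that the pattern may have absorbed.
        # Caveat: this would also trim a DOI that legitimately ends in ")".
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical sample text; the DOIs here are made up for the example.
sample = (
    "See doi:10.1234/abcd.5678. "
    "Also https://doi.org/10.1234/abcd.5678 and 10.5555/xyz-01."
)
dois = extract_dois(sample)
```

A rule-based pass like this suits fields with a fixed syntax such as DOIs. Fields such as authors and experiment locations have no fixed syntax, which is presumably why the abstract turns to named entity recognition for those.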

Committee:
Prof. Stephen D. Scott
Prof. Vinodchandran Variyam
Prof. Ashok Samal

Zoom: https://unl.zoom.us/j/93959397461
Meeting ID: 939 5939 7461
