Authors: |
Batra, A; Rajkumar, S; Nielsen, D; Clarke, S. S. B; Kalyanam, K. M; Tejasen, K; Ohsfeldt, M; Copp, M. |
Abstract: |
When working with technical documents, it is often helpful to classify them into specific categories. In this paper, we conduct a thorough review of natural language processing techniques for this classification task on Letters of Agreement (LOAs), technical aviation documents that outline rules for using US airspace. We evaluate multiple techniques, including transfer learning, for representing the text of the documents as embeddings: unigram and bigram Term Frequency-Inverse Document Frequency (TFIDF), Word2Vec, Doc2Vec, GloVe, and RoBERTa. We investigate a wide range of classification models: K-Nearest Neighbors, Random Forest, Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Feed-Forward Neural Networks, Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) networks. Comparing these methods, we found that the best overall approach for our task was unigram TFIDF representations with an SVM, and we gained insight into how the other methods perform on a small technical dataset. |
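The best-performing combination named in the abstract, unigram TFIDF features fed to an SVM, can be illustrated with a minimal sketch. It assumes scikit-learn, which the abstract does not name, and it uses a linear SVM with default hyperparameters as a stand-in, since the kernel and settings are not stated. The example sentences and the category labels `altitude` and `handoff` are hypothetical and are not taken from the LOA dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: invented sentences and labels, not real LOA text.
train_texts = [
    "aircraft shall maintain altitude at or below 5000 feet",
    "descend and maintain flight level 240 before the boundary",
    "handoff shall be completed prior to the sector boundary",
    "transfer of communications and radar handoff to approach control",
]
train_labels = ["altitude", "altitude", "handoff", "handoff"]

# ngram_range=(1, 1) restricts the vocabulary to unigrams.
# LinearSVC is an assumed choice of SVM; the abstract does not specify one.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
    ("svm", LinearSVC(random_state=0)),
])
model.fit(train_texts, train_labels)

# Classify unseen (also invented) sentences.
new_texts = [
    "maintain altitude 3000 feet",
    "radar handoff to approach",
]
predictions = model.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{label}: {text}")
```

Switching to the bigram variant mentioned in the abstract would, under the same assumptions, only require changing `ngram_range`, for example to `(2, 2)` for bigrams alone or `(1, 2)` for unigrams and bigrams together.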