Abstract:
This study presents a retrieval-augmented question-answering (QA) system designed
to extract academic policies and resolutions from the University of the
Philippines (UP) Gazette. Historical printed documents were digitized using Optical
Character Recognition (OCR) and preprocessed with natural language processing
techniques. Dense vector representations were generated using
embedding models and stored in Pinecone, a vector database that supports hybrid
search, combining semantic and keyword-based retrieval. Retrieved passages were
reranked using a
two-stage approach: a cross-encoder for semantic matching, followed by PageRank-based
graph reranking to promote contextually central chunks. A fine-tuned large language
model (LLM) was then used to generate coherent, context-aware responses
based on the top-ranked passages. The system was evaluated using retrieval
and generation metrics, including Precision@k, Recall@k, Mean Reciprocal Rank
(MRR), ROUGE-L, METEOR, and Jaccard Similarity. Results indicate that,
while the LLM frequently identifies the correct answer, partial outputs lower the
text generation scores, pointing to generation grounding as an area for future improvement.
This research demonstrates how hybrid search and graph-based reranking enhance
retrieval effectiveness in open-domain QA for historical documents.
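
For readers unfamiliar with the retrieval metrics named above, the following is a minimal Python sketch of Precision@k, Recall@k, MRR, and Jaccard Similarity under their common textbook definitions. It is illustrative only and is not the evaluation code used in this study. The function names, the chunk identifiers, and the choice of lowercase whitespace tokenization for Jaccard Similarity are all assumptions made for this example.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevants: Sequence[Set[str]]
) -> float:
    """Mean over queries of 1 / rank of the first relevant chunk (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def jaccard_similarity(generated: str, reference: str) -> float:
    """Token-set overlap; assumes lowercase whitespace tokenization."""
    a, b = set(generated.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical chunk IDs for a single query.
ranked = ["c1", "c2", "c3", "c4"]
relevant = {"c2", "c4", "c9"}
p_at_2 = precision_at_k(ranked, relevant, k=2)
r_at_4 = recall_at_k(ranked, relevant, k=4)
mrr = mean_reciprocal_rank([ranked, ["c9", "c1"]], [relevant, {"c9"}])
jac = jaccard_similarity("The board approved the policy", "the board approved")
```

Under these definitions, an answer that is correct but incomplete shares only part of its token set with the reference, so overlap-based generation scores such as Jaccard Similarity fall even when the retrieved passage is the right one.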