Bachelor Thesis
Duration
October 2024 - March 2025
Project Overview
LLM-Powered Metadata Extraction GUI for DBpedia Databus
Metadata plays a key role in improving data accessibility and interoperability. The DBpedia Databus aims to establish FAIR (Findable, Accessible, Interoperable, and Reusable) Linked Data on the Web by providing a registry for files using DataID metadata. To support this ecosystem, Databus Mods were introduced in 2021 as a metadata enrichment service for files registered on the Databus. Mods are, in essence, code snippets that extract metadata from certain file types. To ease the development of Databus Mods, this thesis implements an interactive Graphical User Interface (GUI) that uses Large Language Models (LLMs) to support the creation of these code snippets.
Technical Details
Research Contributions
- Custom benchmark for LLM code generation evaluation
- Performance analysis of LLMs in specialized topics
- Recommendation of three suitable models
- Interactive GUI for code snippet generation
Implementation
- Flask API for universal execution
- Code snippet packaging system
- Databus Mods integration framework
- GitHub repository for open access
Project Goals
Primary Objectives
- Improve development of Databus Mods through LLM assistance
- Create interactive GUI for code snippet generation
- Evaluate LLM performance in specialized code generation
- Advance FAIR Linked Data principles
Expected Outcomes
- Enhanced metadata extraction capabilities
- Streamlined Databus Mods development process
- Improved data accessibility and interoperability
- Open-source contribution to the research community
Research Methodology
Custom Benchmark Development
This work presents a custom benchmark that evaluates how well LLMs generate code for metadata extraction. The results show the increasing performance of LLMs on small, specialized topics and lead to a recommendation of three models suited to the application.
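To make the idea of such a benchmark concrete, the following is a minimal sketch of a pass-rate harness. It is not the benchmark used in the thesis, whose design is not described here. It assumes that each model returns Python source defining a function named `extract`, and the test cases, model names, and metadata keys are all hypothetical.

```python
def run_snippet(source: str, data: bytes) -> dict:
    """Execute generated source and call the `extract` function it defines."""
    namespace: dict = {}
    # NOTE: exec runs arbitrary code; real LLM output should be sandboxed.
    exec(source, namespace)
    return namespace["extract"](data)


def pass_rate(source: str, cases: list[tuple[bytes, dict]]) -> float:
    """Fraction of test cases for which the snippet returns the expected metadata."""
    if not cases:
        return 0.0
    passed = 0
    for data, expected in cases:
        try:
            if run_snippet(source, data) == expected:
                passed += 1
        except Exception:
            # Syntax errors, crashes, and a missing `extract` all count as failures.
            pass
    return passed / len(cases)


# Hypothetical test cases: raw file content and the metadata a correct snippet returns.
CASES = [
    (b"a,b\n1,2\n", {"line_count": 2}),
    (b"", {"line_count": 0}),
]

# Hypothetical model outputs; the model names are placeholders.
SNIPPETS = {
    "model_a": "def extract(data):\n    return {'line_count': len(data.splitlines())}\n",
    "model_b": "def extract(data):\n    return {'line_count': len(data)}\n",
    "model_c": "def extract(data) return {}\n",  # deliberately invalid syntax
}

scores = {name: pass_rate(src, CASES) for name, src in SNIPPETS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Scoring by exact match against expected metadata keeps the comparison between models simple; a real benchmark would likely also need sandboxing, timeouts, and more varied file types.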
Practical Implementation
The practical contribution of this thesis is a GUI (accessible via GitHub) that uses these models to generate code snippets. Each snippet is packaged into a Flask API to enable universal execution. These packaged applications form the basis for creating Databus Mods from the generated code in the future.
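As an illustration of what packaging a snippet into a Flask API could look like, here is a minimal sketch. It is an assumption about the general shape, not the thesis's actual packaging code: the `/extract` route, the `extract_metadata` function, and the CSV example are hypothetical stand-ins for a generated snippet.

```python
import csv
import io

from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_metadata(data: bytes) -> dict:
    """Hypothetical stand-in for an LLM-generated snippet: describe a CSV file."""
    text = data.decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    return {
        "columns": header,
        "column_count": len(header),
        "row_count": max(len(rows) - 1, 0),
    }


@app.route("/extract", methods=["POST"])
def extract():
    # The file content is sent as the raw request body.
    data = request.get_data()
    if not data:
        return jsonify({"error": "empty request body"}), 400
    try:
        return jsonify(extract_metadata(data))
    except UnicodeDecodeError:
        return jsonify({"error": "file is not valid UTF-8"}), 400


if __name__ == "__main__":
    app.run(port=5000)
```

With the server running, a file could be submitted with, for example, `curl --data-binary @data.csv http://localhost:5000/extract`. Wrapping the snippet behind an HTTP endpoint means callers only need to send a file, regardless of which language or environment they run in.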
Key Contributions
- Developed custom benchmark for LLM evaluation in metadata extraction
- Created interactive GUI for code snippet generation using LLMs
- Implemented Flask API for universal code execution
- Established framework for Databus Mods development
- Contributed to FAIR Linked Data advancement