Bachelor Thesis

Duration

October 2024 - March 2025

Project Overview

LLM-Powered Metadata Extraction GUI for DBpedia Databus

Metadata plays a key role in improving data accessibility and interoperability. The DBpedia Databus aims to establish FAIR (Findable, Accessible, Interoperable, and Reusable) Linked Data on the Web by providing a registry for files described with DataID metadata. To support this ecosystem, Databus Mods were introduced in 2021 as a metadata enrichment service for files registered on the Databus. Mods are, in essence, code snippets that extract metadata from certain file types. To make Databus Mods easier to develop, this thesis implements an interactive Graphical User Interface (GUI) that uses Large Language Models (LLMs) to support the creation of these code snippets.

LLMs, FAIR Data, Metadata Extraction, Flask API, Research

Technical Details

Research Contributions

  • Custom benchmark for LLM code generation evaluation
  • Performance analysis of LLMs in specialized topics
  • Recommendation of three suitable models
  • Interactive GUI for code snippet generation

Implementation

  • Flask API for universal execution
  • Code snippet packaging system
  • Databus Mods integration framework
  • GitHub repository for open access

Project Goals

Primary Objectives

  • Improve development of Databus Mods through LLM assistance
  • Create interactive GUI for code snippet generation
  • Evaluate LLM performance in specialized code generation
  • Advance FAIR Linked Data principles

Expected Outcomes

  • Enhanced metadata extraction capabilities
  • Streamlined Databus Mods development process
  • Improved data accessibility and interoperability
  • Open-source contribution to the research community

Research Methodology

Custom Benchmark Development

This work presents a custom benchmark that evaluates how well LLMs generate code for metadata extraction. The results show that LLM performance on small, specialized topics is improving, and the evaluation concludes with a recommendation of three models suited to the application.
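To illustrate what a benchmark of this kind measures, here is a minimal sketch, not the thesis benchmark itself. It assumes each model returns a snippet defining a function named `extract_metadata`, and scores a snippet by the share of test cases it passes. The model names, test cases, and function name are all hypothetical.

```python
# Hypothetical test cases: (file content, expected metadata). Not from the thesis.
TEST_CASES = [
    ("a,b\n1,2\n3,4\n", {"columns": 2, "rows": 2}),
    ("x\n1\n", {"columns": 1, "rows": 1}),
]

# Hypothetical generated snippets, keyed by made-up model names.
CANDIDATES = {
    "model_a": (
        "def extract_metadata(text):\n"
        "    lines = text.strip().splitlines()\n"
        "    return {'columns': len(lines[0].split(',')), 'rows': len(lines) - 1}\n"
    ),
    "model_b": (
        "def extract_metadata(text):\n"
        "    return {'columns': 0, 'rows': 0}\n"
    ),
    # Deliberately invalid Python: the colon is missing.
    "model_c": "def extract_metadata(text) return {}",
}


def pass_rate(source: str) -> float:
    """Return the fraction of TEST_CASES that the snippet handles correctly."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code. A real benchmark should run
        # generated snippets in a sandbox or a separate process.
        exec(source, namespace)
        func = namespace["extract_metadata"]
    except Exception:
        return 0.0  # snippet does not compile or lacks the expected function

    passed = 0
    for content, expected in TEST_CASES:
        try:
            if func(content) == expected:
                passed += 1
        except Exception:
            pass  # a crash on a test case counts as a failure
    return passed / len(TEST_CASES)


def rank_models(candidates: dict) -> list:
    """Return (model, score) pairs sorted by score, best first."""
    scores = {name: pass_rate(src) for name, src in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_models(CANDIDATES):
        print(f"{name}: {score:.0%}")
```

Pass rate on fixed test cases is only one possible metric; it is used here because it is easy to compute and compare across models.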

Practical Implementation

The practical contribution of this thesis is the implemented GUI (accessible via GitHub), which uses the models to generate code snippets and packages them into a Flask API to enable universal execution. These packaged applications serve as the baseline for building future Databus Mods from the generated code.
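As a rough picture of what packaging a snippet into a Flask API can look like, the sketch below wraps a stand-in extraction function in a single HTTP endpoint. The route `/extract`, the function `extract_metadata`, and the returned fields are assumptions for illustration; the thesis repository may structure this differently.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_metadata(content: bytes) -> dict:
    """Stand-in for an LLM-generated snippet (hypothetical)."""
    text = content.decode("utf-8", errors="replace")
    return {"bytes": len(content), "lines": len(text.splitlines())}


@app.route("/extract", methods=["POST"])
def extract():
    # The file content is sent as the raw request body.
    content = request.get_data()
    if not content:
        return jsonify({"error": "empty request body"}), 400
    return jsonify(extract_metadata(content))


if __name__ == "__main__":
    # Development server only; the port is an arbitrary choice.
    app.run(port=5000)
```

With the server running, a file could be posted with, for example, `curl --data-binary @data.csv http://localhost:5000/extract`. Exposing the snippet over HTTP means any client can call it regardless of its own language or environment, which is one reading of "universal execution".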

Key Contributions

  • Developed custom benchmark for LLM evaluation in metadata extraction
  • Created interactive GUI for code snippet generation using LLMs
  • Implemented Flask API for universal code execution
  • Established framework for Databus Mods development
  • Contributed to FAIR Linked Data advancement

Technologies & Tools

AI & ML

Large Language Models, Code Generation, Benchmarking

Development

Python, Flask, GUI Development

Data & Research

FAIR Data, DBpedia Databus, GitHub