12 June 2025
My research focuses on how knowledge can be meaningfully represented, integrated, and accessed across different modalities, languages, and structures. As more knowledge is encoded in unstructured and opaque formats, ranging from natural language text to LLMs, I study how we can restore semantic clarity, ensure consistency, and bridge representational gaps. LLMs play a central role in my work, both as a method and as a data source: I use them to extract structure from text, to interface with complex data such as knowledge graphs and statistical tables, and to support tasks such as classification, reasoning, and data alignment. This allows me to build systems that do not just retrieve information but integrate and make sense of it in a more transparent and robust way.
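To make the "extract structure from text" part concrete, here is a minimal sketch of the kind of extraction step I mean. It is an illustration rather than code from my projects: the model name, the prompt, and the triple schema are all placeholder assumptions.

```python
# Minimal sketch: turning free text into (subject, relation, object) triples
# with an LLM. Model name, prompt, and output schema are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_triples(text: str) -> list[dict]:
    """Ask the model to express the factual content of `text` as triples."""
    prompt = (
        "Extract factual (subject, relation, object) triples from the text "
        "below. Respond with a JSON object of the form "
        '{"triples": [{"subject": ..., "relation": ..., "object": ...}]}.\n\n'
        + text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # constrain output to JSON
    )
    return json.loads(response.choices[0].message.content)["triples"]
```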
Since last year, I have been supervising a PhD student in collaboration with CBS (Statistics Netherlands), the Dutch national statistics office. CBS provides thousands of open datasets on topics such as population, economy, education, and healthcare. These datasets are valuable but hard to use: they are structured as large, complex tables that often require expert knowledge to interpret.
Our goal is to make this information more accessible through natural language interfaces. We are developing methods that allow people to ask questions in plain language and receive meaningful, data-driven answers. Answering such questions is typically the work of trained statisticians, so automating it involves both technical and linguistic challenges. The data is rich but highly specific, and understanding the meaning of the columns often requires external background knowledge. For me, this project is exciting because it combines technically challenging problems in semantic parsing and reasoning with a clear societal goal: improving public access to government data.
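As a rough illustration of what such an interface involves, the sketch below translates a question into SQL over a toy table. Everything here is a stand-in: the table, its invented numbers, and the model name are assumptions, and a real system needs far more care with schema semantics and query validation.

```python
# Minimal sketch of a question-to-query pipeline over a CBS-style table.
# The table, its numbers, and the model name are made-up placeholders.
import duckdb
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Hypothetical stand-in for a statistical table (values are invented).
population = pd.DataFrame({
    "municipality": ["Amsterdam", "Utrecht", "Eindhoven"],
    "year": [2023, 2023, 2023],
    "inhabitants": [900_000, 360_000, 240_000],
})

def answer(question: str) -> pd.DataFrame:
    """Translate a plain-language question into SQL and run it on the table."""
    schema = "population(municipality TEXT, year INT, inhabitants INT)"
    prompt = (
        f"Table schema: {schema}\n"
        f"Write one DuckDB SQL query that answers: {question}\n"
        "Return only the SQL, with no explanation or code fences."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    sql = response.choices[0].message.content.strip()
    return duckdb.sql(sql).df()  # DuckDB resolves `population` by name

# answer("Which municipality had the most inhabitants in 2023?")
```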
I like seeing how others use data science methods to solve real problems in different fields. It helps me step outside the technical details of my own work and think more broadly. Earlier this year, I gave a workshop on extracting structured data from text, and I really appreciated the practical questions people brought in. These kinds of interdisciplinary exchanges keep me grounded and often lead to new directions in my own research.
One method I find especially powerful is using language models for classification. Many tasks in data integration, such as detecting errors or aligning schema elements, can be reframed as simple classification problems. This allows us to leverage the background knowledge embedded in these models in surprisingly effective ways. It has shifted how I think about knowledge representation and opened up new approaches to building systems that are both flexible and robust.
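As a concrete example of that reframing, here is a minimal sketch of schema alignment posed as a yes/no classification problem; the prompt wording and model name are assumptions, not a fixed recipe.

```python
# Minimal sketch: schema matching reframed as binary classification.
from openai import OpenAI

client = OpenAI()

def same_attribute(column_a: str, column_b: str) -> bool:
    """Classify whether two column descriptions denote the same attribute."""
    prompt = (
        f"Column A: {column_a}\n"
        f"Column B: {column_b}\n"
        "Do these columns describe the same real-world attribute? "
        "Answer with exactly 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Ideally: same_attribute("GeboorteJaar (birth year)", "year_of_birth") -> True
```

The appeal is that the model's background knowledge does the heavy lifting: a first baseline needs no training data and no hand-built mapping rules.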
Mostly Python. In applied data science and NLP, Python has become the standard, with a broad ecosystem of tools and libraries that make experimentation fast and reproducible. But I am not tied to any particular language. I often use whatever tool fits the task best, whether that is SPARQL for structured queries, SQL for data preparation, or a simple shell script for automation.