WIKIDATA-ELTEdata-WIKIMEDIA The technical background of source publications

FELLEGI Zsófia

WIKIDATA – ELTEdata – WIKIMEDIA: The technical background of source publications


In January 2019, the preparation of the digital edition of the texts representing the seven fields of knowledge began with the professional collaboration of Gábor Palkó, then head of the Digital Humanities Centre of the Faculty of Humanities of Eötvös Loránd University, and Zsófia Fellegi, a colleague of the Institute for Literary Studies of the Eötvös Loránd Research Network. Over the last few decades, the Text Encoding Initiative (TEI) recommendation, developed for XML transcriptions, has become a leading recommendation for digital text editions. However, producing an XML-based edition requires considerable practice and expertise in digital philology. In addition, the secure archiving, delivery, and visualisation of XML transcripts are quite expensive. Although ready-made visualisation tools exist, customising them requires further IT development. An additional backend database would also have been needed to store the data relevant to the research, because the XML structure does not support overlapping markup, so, for example, marking a concept in the transcript is not possible. Designing the structure of the backend database, uploading and storing the data, linking the data to the published source, and presenting everything on a single platform would also have required significant development. Considering these factors, the colleagues sought a solution that would both support the publication of sources and provide a suitable backend database.

The digital edition was created within the open-source Wikibase software, and the storage space was provided by the ELTE Digital Humanities Centre (https://elte-dh.hu/). Sustainability was an important factor in the selection of the software: the system is operated and developed by the Wikimedia group, and it is also used by international research groups for publishing the results of prosopographical research. The ELTE Digital Humanities Centre has tested the system in cooperation with the Humanism in East Central Europe (HECE) “Momentum” Research Group of the Hungarian Academy of Sciences and Eötvös Loránd University, and based on the experience gained from this work, the system was chosen here as well.

The Wikibase software supports text publication and also provides the backend database needed for data enrichment, with a user-friendly visual interface similar to those of Wikipedia and Wikidata. The source publication does not follow the TEI recommendation; it was prepared in Wikimedia’s own format, the so-called wikitext format. This format can be archived easily and converted to XML later, if necessary. With this system, it is possible to represent the relationships between the seven fields of knowledge through the sources and to display the annotations in the texts. The database contains the data collected during the research and follows the practices of the Semantic Web. The system makes it possible to create data visualisations that could reveal previously unknown patterns, underpin research findings, and identify new directions in research.
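To illustrate why annotated wikitext remains convertible to XML, the following minimal sketch rewrites annotated links as XML elements. It rests on assumptions that do not come from the project: that annotations take the form `[[Item:Q123|displayed text]]`, that the namespace is called `Item:`, and that the target element is `<name ref="...">`. The actual conventions used in ELTEdata may differ.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical annotation pattern: [[Item:Q123|displayed text]]
LINK = re.compile(r"\[\[Item:(Q\d+)\|([^\]|]+)\]\]")


def wikitext_to_xml(text: str) -> str:
    """Turn annotated links into <name ref="..."> elements and escape the rest."""
    parts = []
    pos = 0
    for match in LINK.finditer(text):
        # Plain text before the link is escaped so the output stays well-formed.
        parts.append(escape(text[pos:match.start()]))
        entity_id, label = match.group(1), match.group(2)
        parts.append(f"<name ref={quoteattr(entity_id)}>{escape(label)}</name>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<p>" + "".join(parts) + "</p>"
```

Because the link already carries the entity identifier, the conversion needs no lookup in the database; a real converter would also have to handle headings, templates, and other wikitext constructs that this sketch ignores.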

In the first phase, Gábor Palkó and Zsófia Fellegi prepared the automatic loading of personal and geographical names extracted from the most extensive corpus of the digital source anthology, representing the field of “economics and agricultural sciences”. The first step was to tabulate, annotate, and standardise the data. Next, links were created following the logic of the Semantic Web, and, finally, the datasets were loaded automatically. Building on this experience, a methodology was developed that made the uploading of further texts and data significantly easier and faster.
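The tabulation and standardisation step can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `name` and `type`, the normalisation rules, and the sample values in the usage note are hypothetical and are not taken from the project's actual tables or loading scripts.

```python
import csv
import io
import unicodedata


def normalise(name: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    name = unicodedata.normalize("NFC", name)
    return " ".join(name.split())


def prepare_rows(raw_csv: str) -> list[dict]:
    """Read rows with hypothetical 'name' and 'type' columns and return
    unique, normalised rows in first-seen order, ready for a bulk load."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        label = normalise(row["name"])
        kind = row["type"].strip()
        key = (label.casefold(), kind)  # case-insensitive duplicate check
        if label and key not in seen:
            seen.add(key)
            rows.append({"label": label, "type": kind})
    return rows
```

For example, `prepare_rows("name,type\n  Pest ,place\nPest,place\n")` yields a single row for "Pest". Deduplicating before the load keeps the same name from being created as several separate entities.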

During the development of the data structure, we had to take into consideration that the database provides research infrastructure and a publication platform for various research projects, and that some of the research data therefore overlap (e.g., geographical names). Each research project forms a subcollection of the database. For the claims assigned to entities, the system can display which research team contributed each claim. This makes it possible to search the whole database and each subcollection at the same time.
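The idea of shared entities whose claims are attributed to individual projects can be modelled in a few lines. The sketch below is a simplified in-memory illustration, not the Wikibase data model itself: the `Claim` class, its field names, and the sample identifiers and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    entity: str    # shared entity identifier, e.g. a place used by several projects
    prop: str      # property name
    value: str     # property value
    project: str   # research project (subcollection) that contributed the claim


def claims_for(claims: list[Claim], entity: str,
               project: Optional[str] = None) -> list[Claim]:
    """Return the claims on an entity, optionally limited to one project."""
    return [
        c for c in claims
        if c.entity == entity and (project is None or c.project == project)
    ]


# Hypothetical sample data: one shared entity, claims from two projects.
SAMPLE = [
    Claim("Q1", "instance of", "settlement", "HECE"),
    Claim("Q1", "mentioned in", "Source A", "Circulation of Knowledge"),
    Claim("Q2", "instance of", "person", "HECE"),
]
```

Calling `claims_for(SAMPLE, "Q1")` searches across the whole database, while `claims_for(SAMPLE, "Q1", "HECE")` restricts the result to one subcollection; both views rely on the same stored claims.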

In the second phase of the project, Gábor Palkó, Head of the Department of Digital Humanities of the Faculty of Humanities of Eötvös Loránd University, and Zsófia Fellegi, a colleague of the Institute for Literary Studies, involved additional colleagues in organising the transcribed texts and the data extracted from them into a database. The personal and geographical names were uploaded to ELTEdata, a Wikibase-based system operated by ELTE DH. In the source texts, the personal and geographical entities were matched to records in the database, so that the data and their textual occurrences can be searched in one system. In addition, the concept map was visualised, and its concepts were tagged at the corresponding locations in the texts. In a future research project, the system will allow the concept map and the textual sources to be complemented by further texts and data, and further information to be added to the existing data.

In 2020, the Department of Digital Humanities was established at Eötvös Loránd University, taking over the infrastructure of the former Digital Humanities Centre. In addition to the HECE and Circulation of Knowledge projects, the Prosopography Research Group of the Faculty of Social Sciences of Eötvös Loránd University also uses the ELTEdata software. The amount of data in the system is therefore growing dynamically, which will significantly speed up the data enrichment of the sources to be uploaded later. Using the Wikibase software to organise the results of historical research into a database is common practice in international scholarship; one of the most prominent examples is FactGrid (https://database.factgrid.de/wiki/Main_Page), provided to historians by the Gotha Research Centre of the University of Erfurt. In Hungary, the Institute for Literary Studies, building on the experience of the Department of Digital Humanities of Eötvös Loránd University, has developed ITIdata, in which bibliographical data are published and a database of personal names and toponyms is being built.

These examples also demonstrate that the flexibility of its data structure makes Wikibase suitable for serving the specific needs of different research projects. Institutional embeddedness ensures sustainability and extensibility for the research that has been carried out, and will be carried out, in the framework of the Circulation of Knowledge project. In the future, it will be possible to visualise bibliographic data in the system, thus creating a unique database. Within this system, it will be possible to create data visualisations and thus help to discover new relations and new patterns that will support further research.