Curatorial tasks related to the harvesting of the Hungarian web domain |
| Content: | http://ocs.mtak.hu/index.php/nws/2025/paper/view/208 |
|---|---|
| Archive: | NETWORKSHOP |
| Collection: | Papers |
| Title: |
Curatorial tasks related to the harvesting of the Hungarian web domain (A magyar webtér aratásával kapcsolatos kurátori feladatok)
|
| Creator: |
Gyula Kalcsó; Magyar Nemzeti Múzeum Közgyűjteményi Központ Országos Széchényi Könyvtár, Digitális Bölcsészeti Központ, Digitális Filológiai és Webarchiválási Osztály
|
| Publisher: |
NETWORKSHOP
|
| Date: |
2025-10-25 09:20:54
|
| Description: |
Curatorial tasks related to the harvesting of the Hungarian web domain
Under a government decree, the national library is required to harvest the Hungarian web domain as completely as possible twice a year and to keep a register of the known sites. This complex task is carried out by the web archiving team of the Digital Philology and Web Archiving Department of the Digital Humanities Centre of the Hungarian National Széchényi Library. This paper describes the most important curatorial activities related to this mandated task: how websites are registered and the methodology for collecting seed URLs. Since the launch of the Hungarian Web Archive in 2017, the number of registered sites has grown significantly. The main sources of new URLs are our own harvests, recommendations received, and cooperation with the Internet Archive.
The seed URL lists need to be maintained before each of the two yearly harvests, which is a complex, multi-step process. The first step is to extract the URLs from the previous captures and select those that are not yet known. We automatically retrieve the HTTP status code to determine which sites are live, then retrieve the value of the title tag in the HTML head and check whether the site has a robots.txt file. Based on the structure of the URLs and the information obtained, we classify the new URLs into the appropriate seed list. The status codes, title data, and robots.txt are also checked for the previously harvested URLs; inactive sites are removed from the lists, and the remaining URLs are classified into the appropriate seed list.
Keywords: web archiving, born digital, harvesting of the Hungarian web, curatorial tasks in web archiving
DOI: https://doi.org/10.31915/NWS.2025.21
|
| Language: |
Hungarian
|
| Type: |
Peer-reviewed Paper
|
| Identifier: | |
| Source: |
NETWORKSHOP; Networkshop 2025
|
| Rights: |
Authors who submit to this conference agree to the following terms:
**a)** Authors retain copyright over their work, while allowing the conference to place this unpublished work under a [Creative Commons Attribution License](http://creativecommons.org/licenses/by/3.0/), which allows others to freely access, use, and share the work, with an acknowledgement of the work's authorship and its initial presentation at this conference.
**b)** Authors are able to waive the terms of the CC license and enter into separate, additional contractual arrangements for the non-exclusive distribution and subsequent publication of this work (e.g., publish a revised version in a journal, post it to an institutional repository or publish it in a book), with an acknowledgement of its initial presentation at this conference.
<strong>c)</strong> In addition, authors are encouraged to post and share their work online (e.g., in institutional repositories or on their website) at any point before and after the conference.
|
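
The seed-list maintenance steps outlined in the description (select unknown URLs from previous captures, check the HTTP status, read the title tag, look for robots.txt, classify) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the archive's actual tooling: the function names, the `SiteCheck` record, the user-agent string, and the list names `hu_domain`, `other_domain`, and `inactive` are all hypothetical, and the real classification rules are not given in the abstract.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit
import urllib.error
import urllib.request


def site_root(url: str) -> str:
    """Reduce a captured URL to scheme://host/ so that pages collapse to sites."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def find_new_sites(captured: Iterable[str], known: Iterable[str]) -> list:
    """Return site roots seen in previous captures that are not yet registered."""
    known_roots = {site_root(u) for u in known}
    return sorted({site_root(u) for u in captured} - known_roots)


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> Optional[str]:
    """Return the stripped <title> text, or None if the page has no title."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


def fetch(url: str, timeout: float = 10.0) -> Tuple[Optional[int], str]:
    """Return (final status code after redirects, first 64 KiB of the body).

    The status is None when the host cannot be reached at all.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "seed-check-sketch/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:  # 4xx/5xx responses still carry a code
        return err.code, ""
    except OSError:  # DNS failure, refused connection, timeout
        return None, ""


@dataclass
class SiteCheck:
    url: str
    status: Optional[int]
    title: Optional[str]
    has_robots_txt: bool


def check_site(url: str) -> SiteCheck:
    """Probe one site: status code, title text, and presence of robots.txt."""
    status, body = fetch(url)
    robots_status, _ = fetch(site_root(url) + "robots.txt")
    title = extract_title(body) if body else None
    return SiteCheck(url, status, title, robots_status == 200)


def classify(check: SiteCheck) -> str:
    """Assign a checked site to a seed list (hypothetical list names and rules)."""
    if check.status is None or check.status >= 400:
        return "inactive"
    host = urlsplit(check.url).hostname or ""
    return "hu_domain" if host.endswith(".hu") else "other_domain"
```

In use, one might call `find_new_sites(captured_urls, registered_urls)`, run `check_site` on each result, and group the sites by `classify`. Running the same check over the already registered URLs would yield the inactive sites to remove. Only `fetch` and `check_site` touch the network, so the selection and classification logic can be tested offline.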