Guide for website owners

The purpose of this guide is to improve the chances that a website can be successfully preserved in the online archive, and thus live up to the intentions of the Danish Legal Deposit Act.

The Royal Danish Library can never guarantee that a site is harvested completely, but we will be able to collect more content if you, as the owner of a website, follow the instructions in this guide.

The most important thing

Have a sitemap with links to ALL pages and data to be archived, including any paginated pages
(../result.php?page=1, ../result.php?page=2 etc.).

Call it "sitemap.xml" and place it in the root of the website. If you can't link to a page, we can't archive it!

You must declare the link to the sitemap in robots.txt.
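
A minimal robots.txt could look like this (www.example.com is a placeholder for your own domain):

    User-agent: *
    Disallow:

    Sitemap: https://www.example.com/sitemap.xml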

Feel free to also add important resources (for example PDF files, JSON data, images, and audio and video files) directly in the sitemap, or in a separate sitemap that it links to.
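
As a sketch, a sitemap following the sitemaps.org protocol, listing ordinary pages, paginated pages and a PDF (all URLs are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.com/</loc></url>
      <url><loc>https://www.example.com/result.php?page=1</loc></url>
      <url><loc>https://www.example.com/result.php?page=2</loc></url>
      <url><loc>https://www.example.com/files/annual-report.pdf</loc></url>
    </urlset>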

  • Have all necessary resources on the same domain
    This includes JavaScript files, CSS, media, images, and so on. Our crawlers collect only a very limited amount of content located on other domains.
     
  • Consider using explicit links to your media files and listing them in the website's sitemap
    The crawler can download certain types of audio and video files, but only if it can detect them in the first place. If the path (URL) to a video is hidden, for example inside a JavaScript or Flash player, the crawler will not be able to find it.
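    For example, an explicit href in the page markup gives the crawler something to follow (the file name is hypothetical):

      <a href="https://www.example.com/media/interview.mp4">Watch the interview</a>

    The same URL can also be listed in the sitemap as an ordinary <url> entry.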
     
  • Lost data
    Data that is not immediately available when a page is accessed is not picked up by our crawlers. This applies, for example, to AJAX content, infinite scroll, and pagination without href links. Pages that can only be reached in these ways should be included in the sitemap.
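    For example, pagination rendered as plain href links can be followed, while a script-only button cannot (the loadPage function is hypothetical):

      <!-- crawlable: a real link the crawler can follow -->
      <a href="/result.php?page=2">Next page</a>

      <!-- not crawlable: no href, only a JavaScript handler -->
      <button onclick="loadPage(2)">Next page</button>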
     
  • Test your page with JavaScript disabled
    This gives you an idea of what our crawler can see.
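    One quick approximation, if you have curl installed, is to fetch the raw HTML without executing any scripts (placeholder domain):

      curl -s -o page.html https://www.example.com/

    Content that is missing from page.html is likely to be invisible to the crawler as well.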
     
  • Search fields and other input forms typically stop our crawler
    The same applies to POST requests. If there are pages that can only be reached in this way, they should be linked to in the website's sitemap.
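    As a sketch, plain GET links next to a form give the crawler an alternative route to the same content (search.php and its parameter are hypothetical):

      <form method="post" action="/search.php">...</form>
      <!-- crawler-friendly links to the same results -->
      <a href="/search.php?q=news">News results</a>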
     
  • Avoid using dynamic URLs
    Whether for links, calendars, contact forms, and so on. In general, avoid "infinite" URL spaces, such as those in calendar modules. If possible, limit them to a realistic time range.
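    For example, a calendar module could offer links only for years that actually have content, rather than an unbounded "next year" link (calendar.php is hypothetical):

      <a href="/calendar.php?year=2023">2023</a>
      <a href="/calendar.php?year=2024">2024</a>
      <!-- no open-ended link that generates endless empty pages -->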
     
  • Active links
    Make sure all links on your website work; if the website contains broken links, archived copies of it will also have broken links.
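    One way to find broken links, assuming wget is available, is a recursive spider run (placeholder domain):

      wget --spider -r -o linkcheck.log https://www.example.com/

    With --spider, wget requests pages without saving them and summarises any broken links at the end of linkcheck.log.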
     
  • Grant access
    In order for us to archive and display your website correctly, our crawler must have access to all the resources that determine how the website is displayed, including images, scripts and stylesheets. We use the Heritrix crawler, and the crawler's user agent identifies itself as:

    Mozilla/5.0 (compatible; heritrix/3.4.0 https://www.kb.dk/netarkivindsamling/) Firefox/57
     
  • Avoid "wrong" http status codes
    If a page cannot be found, respond with 404, not 200. When a 200 code is the response the crawler thinks it is on the right track.
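    For example, in PHP (the stack suggested by the result.php URLs above), a missing-page handler should set the status before sending any output:

      <?php
      // send a real 404 so the crawler knows the page does not exist
      http_response_code(404);
      echo 'Page not found';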
     
  • Date and time
    If the page displays the date and time, use the server-generated date instead of the client-side date. A date generated dynamically on the client will always show the current date, not the date of archiving.
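    A sketch of the difference, using PHP for the server side:

      <!-- server-side: frozen at the moment the page is generated and crawled -->
      <?php echo date('j F Y'); ?>

      <!-- client-side: will always show the visitor's current date, even in the archive -->
      <script>document.write(new Date().toLocaleDateString());</script>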
     
  • RSS feed
    Feel free to offer an RSS feed for new content if your website is frequently updated with new pages, articles, and so on. That way, the online archive can collect the latest content without having to crawl the entire website. Remember to link to the RSS feed in the website's sitemap.
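    The feed can be announced both in the HTML head and in the sitemap (the feed URL is a placeholder):

      <link rel="alternate" type="application/rss+xml" title="New articles" href="https://www.example.com/feed.xml">

    and in the sitemap:

      <url><loc>https://www.example.com/feed.xml</loc></url>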
     
  • All browsers
    Always also design for browsers that do not support JavaScript or have it disabled.
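    A noscript fallback is one way to keep script-dependent content reachable (the link target is a placeholder):

      <noscript>
        <a href="/articles/">Browse all articles</a>
      </noscript>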
     
  • Alternative access
    Provide alternative access methods for content, such as simple HTML.
     
  • Comply with web standards
    It is generally good practice to adhere to current web standards and to validate your code: http://validator.w3.org/
     
  • Rejected material
    We cannot accept "dumps" or "backups" of websites from content management systems or databases, whether delivered on hard drives, CDs, DVDs, or other external media. Only snapshots crawled directly by our system are accepted. It therefore pays to make the website archivable from the start - remember the sitemap.
     
  • Embedded content
    Embedding content on a page using a third-party service makes it unlikely that the web crawler will be able to read and store it. Examples of such services include YouTube, Flickr, Scribd, SlideShare, Storify, and SoundCloud.
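    Where you hold the rights, hosting the file on your own domain and referencing it directly gives the crawler something it can fetch, as a sketch (file name hypothetical):

      <video src="/media/lecture.mp4" controls></video>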
     
  • Stick to the domain
    You should retain ownership of the website's domain after the last crawl has been completed and the website has been closed, in order to:
    • Avoid cybersquatting
    • Be able to refer, if necessary, to the archived copy of the website in the online archive.