To enable access to web archives, we use CDX files as indexes, so that we can look up which ARC or WARC files contain which URLs and responses.
The original CDX files were generated for us by the Internet Archive, with one CDX file for each ARC or WARC file. This makes it easy for us to manage those files, but downloading and handling over half a million separate small files is not convenient for researchers. Therefore, we have processed those CDX files and aggregated the data into 18 separate CDX files, one per year of crawling activity. Please note that the lines within each of these CDX files are not sorted.
There are a few variations on the CDX format, but for this dataset, the CDX lines look like this:
vanguard.ntu.ac.uk/ 19961018104851 http://vanguard.ntu.ac.uk:80/ text/html 200 2TAC6RS2DMTHHFVWCSDHNL6W6RIIOQIV - 34954008 DOTUK-HISTORICAL-1996-2010-GROUP-AA-XABEGS-20110428000000-00000.arc.gz
These space-separated fields can be interpreted as follows:
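1. Canonicalized URL (the lookup key)
2. Timestamp of the crawl (in YYYYMMDDHHMMSS format)
3. Original URL
4. Content-Type (as returned in the original server response)
5. HTTP status code
6. Checksum of the response
7. Redirect URL (from the Location header, populated only for 3** responses, "-" for others)
8. Offset of the record within the ARC or WARC file
9. Name of the ARC or WARC file containing the record
This large dataset cannot be hosted on GitHub. It can be downloaded from here instead, as compressed files containing the CDX data for each year of crawler activity.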
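As a minimal sketch of how a downloaded file might be read, the Python below splits a CDX line into its nine space-separated fields and scans for a given canonicalized URL. The names `CdxRecord`, `parse_cdx_line` and `find_captures`, and the field names, are hypothetical and not part of the dataset. The code assumes the file has already been decompressed into text lines and that the offset field is always an integer. Because the files are not sorted, the lookup is a plain linear scan rather than a binary search.

```python
from typing import Iterable, Iterator, NamedTuple


class CdxRecord(NamedTuple):
    """One index line; field names are illustrative, not official."""
    url_key: str        # canonicalized URL used for lookup
    timestamp: str      # crawl time, YYYYMMDDHHMMSS
    original_url: str   # URL as crawled
    content_type: str   # Content-Type from the server response
    status: str         # HTTP status code, kept as text
    digest: str         # checksum of the response
    redirect: str       # redirect URL for 3** responses, "-" otherwise
    offset: int         # offset of the record in the ARC/WARC file
    filename: str       # ARC/WARC file holding the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one CDX line into its nine space-separated fields."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    # int() raises ValueError if the offset is not numeric (assumed integer).
    return CdxRecord(*parts[:7], int(parts[7]), parts[8])


def find_captures(lines: Iterable[str], url_key: str) -> Iterator[CdxRecord]:
    """Yield every record whose canonicalized URL equals url_key (linear scan)."""
    for line in lines:
        try:
            record = parse_cdx_line(line)
        except ValueError:
            continue  # skip lines that do not fit the nine-field layout
        if record.url_key == url_key:
            yield record


# Usage with the example line shown above.
sample = (
    "vanguard.ntu.ac.uk/ 19961018104851 http://vanguard.ntu.ac.uk:80/ "
    "text/html 200 2TAC6RS2DMTHHFVWCSDHNL6W6RIIOQIV - 34954008 "
    "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XABEGS-20110428000000-00000.arc.gz"
)
record = parse_cdx_line(sample)
print(record.timestamp, record.status, record.offset, record.filename)
```

In practice `lines` could be any iterable of text lines, such as an open file handle, so a yearly file can be scanned without being loaded into memory.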
If you wish to cite this dataset, please use this DOI: 10.5259/ukwa.ds.2/cdx/1
Based on DataCite guidelines, we recommend this full citation:
The Internet Archive (2013). JISC UK Web Domain Dataset (1996-2013) Crawled URL Index. The British Library. https://doi.org/10.5259/ukwa.ds.2/cdx/1
To the extent possible under law, The Project Partners have waived all copyright and related or neighboring rights to the JISC UK Web Domain Dataset (1996-2013) Crawled URL Index (10.5259/ukwa.ds.2/cdx/1).