The entire selective archive is manually curated, including the classification of sites into a two-tiered subject hierarchy. We have made this manually generated classification information available as an open dataset, in tab-separated column format. The structure of the data is as follows:

Primary Category        Secondary Category      Title   URL
Arts & Humanities       Architecture    68 Dean Street  http://www.sixty8.com/
Arts & Humanities       Architecture    Abandoned Communities   http://www.abandonedcommunities.co.uk/
...
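
As a minimal sketch of how the file could be read, the following uses Python's standard csv module. It assumes the download starts with the header row shown above and uses a single tab between columns; the inline sample simply repeats the two rows listed here, and the function name read_classification is our own illustrative choice rather than part of any published tooling.

```python
import csv
import io
from collections import Counter

# The two sample rows shown above, with real tab separators.
# Assumption: the downloaded file begins with this header row.
SAMPLE = (
    "Primary Category\tSecondary Category\tTitle\tURL\n"
    "Arts & Humanities\tArchitecture\t68 Dean Street\thttp://www.sixty8.com/\n"
    "Arts & Humanities\tArchitecture\tAbandoned Communities\t"
    "http://www.abandonedcommunities.co.uk/\n"
)


def read_classification(handle):
    """Return the rows as a list of dicts keyed by the header names."""
    # QUOTE_NONE treats quote characters in titles as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_classification(io.StringIO(SAMPLE))

# Count how many sites fall under each (primary, secondary) pair.
counts = Counter(
    (row["Primary Category"], row["Secondary Category"]) for row in rows
)

for (primary, secondary), n in counts.items():
    print(f"{primary} / {secondary}: {n}")
```

To read a real copy of the dataset, the same function could be given a file opened with open(path, newline="", encoding="utf-8"); the encoding is an assumption and may need adjusting.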

Use Cases

We are particularly interested in understanding whether high-level metadata like this can be used to train an automatic classification system, so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that an appropriate classifier might require more information about each site in order to produce reliable results, and we are looking at augmenting this dataset with further information in the future. Options include:

  • For each site, make the titles of every page on that site available.
  • For each site, extract a set of keywords that summarise the site, via the full-text index.
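
To make the classification use case concrete, here is a rough sketch of a multinomial naive Bayes classifier trained on site titles alone, written with only the Python standard library. It is one possible baseline, not a method we have adopted or evaluated. The class TitleNaiveBayes and the helper tokenize are illustrative names, and the two "Hypothetical ..." training rows and their label are invented placeholders, not entries from the dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TitleNaiveBayes:
    """Multinomial naive Bayes over title tokens, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # training titles per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.class_counts[label] += 1
            for tok in tokenize(title):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, title):
        total = sum(self.class_counts.values())
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(title) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# The first two titles come from the sample rows above. The last two rows
# and the second label are HYPOTHETICAL placeholders for illustration only.
titles = [
    "68 Dean Street",
    "Abandoned Communities",
    "Hypothetical Cricket Club",
    "Hypothetical Football Club",
]
labels = [
    "Arts & Humanities",
    "Arts & Humanities",
    "Sport (hypothetical label)",
    "Sport (hypothetical label)",
]

model = TitleNaiveBayes().fit(titles, labels)
print(model.predict("Village Cricket Club"))
print(model.predict("Dean Street"))
```

Titles are short, so a model like this has very little evidence per site. Concatenating page titles or extracted keywords into the text passed to fit and predict would be one way to try the augmentation options listed above.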

Download

Citing this dataset

If you wish to cite this dataset, please use this DOI: 10.5259/ukwa.ds.1/classification/1

Based on DataCite guidelines, we recommend this full citation:

The UK Web Archive & Partners (2013). UK Selective Web Archive Website Classification Dataset. The British Library. https://doi.org/10.5259/ukwa.ds.1/classification/1

License

CC0: To the extent possible under law, The UK Web Archive have waived all copyright and related or neighboring rights to the UK Selective Web Archive Website Classification Dataset (10.5259/ukwa.ds.1/classification/1).