The entire selective archive is manually curated, including classification of sites into a two-tiered subject hierarchy. We have made this manually generated classification information available as an open dataset, in tab-separated column format. The structure of the data is as follows:
Primary Category	Secondary Category	Title	URL
Arts & Humanities	Architecture	68 Dean Street	http://www.sixty8.com/
Arts & Humanities	Architecture	Abandoned Communities	http://www.abandonedcommunities.co.uk/
...
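As a rough illustration of how the four-column layout might be read, here is a minimal Python sketch. It assumes the file begins with the header row shown above and that fields are never quoted. The function name `read_classification` and the inline `SAMPLE` text are hypothetical, introduced only for this example; the two data rows are the ones listed above.

```python
import csv
import io
from collections import Counter

# Column names as they appear in the header row of the sample above.
FIELDS = ["Primary Category", "Secondary Category", "Title", "URL"]


def read_classification(handle):
    """Parse tab-separated rows into dicts keyed by the header columns.

    Hypothetical helper: assumes a header row and unquoted fields.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Inline stand-in for the real file, using only the rows shown above.
SAMPLE = (
    "Primary Category\tSecondary Category\tTitle\tURL\n"
    "Arts & Humanities\tArchitecture\t68 Dean Street\thttp://www.sixty8.com/\n"
    "Arts & Humanities\tArchitecture\tAbandoned Communities\t"
    "http://www.abandonedcommunities.co.uk/\n"
)

rows = read_classification(io.StringIO(SAMPLE))

# Count sites per (primary, secondary) pair in the two-tier hierarchy.
counts = Counter(
    (r["Primary Category"], r["Secondary Category"]) for r in rows
)
```

To read a downloaded copy instead, the same function could be given a file opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.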
We are particularly interested in understanding whether high-level metadata like this can be used to train an automatic classification system, so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that an appropriate classifier might require more information about each site in order to produce reliable results, and we are looking at augmenting this dataset with further information in the future. Options include:
If you do wish to cite this dataset, please use this DOI: 10.5259/ukwa.ds.1/classification/1
To the extent possible under law, The UK Web Archive have waived all copyright and related or neighboring rights to the UK Selective Web Archive Website Classification Dataset (10.5259/ukwa.ds.1/classification/1).