The ~2.5 billion 200 OK responses in the (1996-2010) tranche of the JISC UK Web Domain Dataset have been scanned for geographic references, specifically postcodes. This set of postcode citations, found at particular URLs, crawled at particular times, forms an historical geoindex of the UK web. For more details about how the data was created, its format, and how to use it, see here.
The geoindex is composed of 700,641,549 lines of TSV data, each asserting that a given web page, crawled at a given date, contained one or more references to a given postcode. Uncompressed, this is a total of 61 GB of text, so care should be taken before downloading or attempting to use this data set.
2011-201304: WARCs 1.4 GB per file, ARCs 350 MB per file, 50 files.
The data is not hosted on GitHub because it is far too large. It can be downloaded from here in a compressed format (total download size about 8 GB).
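Because the uncompressed index is about 61 GB, it is usually safer to stream it one line at a time than to load it into memory. The following is a minimal sketch of that approach, not an official tool: it counts rows per outward code (the first half of a postcode, such as LS2). The function name `count_outward_codes` and the column layout (crawl date, URL, postcode, with the postcode in the third field) are assumptions made for illustration; check the format documentation linked above for the actual column order and adjust `postcode_col` accordingly.

```python
from collections import Counter
from typing import Iterable


def count_outward_codes(lines: Iterable[str], postcode_col: int = 2) -> Counter:
    """Count rows per outward code, reading one TSV line at a time.

    postcode_col is the zero-based index of the postcode field. The
    default of 2 is an assumption, not the documented layout.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= postcode_col:
            continue  # skip rows that have too few fields
        postcode = fields[postcode_col].replace(" ", "").upper()
        if len(postcode) < 5:
            continue  # too short to be a full UK postcode
        # The inward code is always the last three characters,
        # so everything before it is the outward code.
        counts[postcode[:-3]] += 1
    return counts


# Tiny made-up sample in the assumed layout: date, URL, postcode.
sample = [
    "20050101120000\thttp://example.co.uk/a\tLS2 9JT\n",
    "20060101120000\thttp://example.co.uk/b\tLS29JT\n",
    "20070101120000\thttp://example.org.uk/\tNW1 2DB\n",
    "a malformed line without tabs\n",
]
counts = count_outward_codes(sample)
```

Any iterable of text lines works as input, so an open file handle can be passed directly. If the download turns out to be gzip-compressed (the compression format is not stated here), a handle from `gzip.open(path, "rt")` would let the file be read without unpacking it to disk first.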
If you wish to cite this dataset, please use this DOI: 10.5259/ukwa.ds.2/geo/1
Based on DataCite guidelines, we recommend this full citation:
Andrew N. Jackson (2017). JISC UK Web Domain Dataset (1996-2010) Geoindex. The British Library. https://doi.org/10.5259/ukwa.ds.2/geo/1
To the extent possible under law, The Project Partners have waived all copyright and related or neighboring rights to the JISC UK Web Domain Dataset (1996-2010) Geoindex (10.5259/ukwa.ds.2/geo/1).