Ian Gregory is an expert in digital humanities, which has significant potential for historians interested in mapping urban communities. Here he talks about the direction of this fast-moving field – including Geographical Information Systems (GIS), data mining and internet applications (5-10-2011)
David Rosenthal: “There’s a potential in Geographical Information Systems (GIS) and in digital humanities more generally to re-invigorate almost all aspects of historical geography, and actually bring historians who never thought of themselves as geographers into this kind of discourse as well.” That’s a quote from your book of a few years ago where you also said that the full implications of this kind of technology are yet to be established [I.N. Gregory and P.S. Ell, Historical GIS: Technologies, methodologies and scholarship, CUP, 2007]. So my first question is, what is GIS, as it stands today, and to what extent has it been taken up by historians?
Ian Gregory: Although GIS tends to be seen as a mapping system, it’s much more than that. It’s really a database system that allows you to query your data in a spatial way, so you can ask questions whose answers say, “Over here this is happening, but over here this is happening, and over there something else is happening.” Traditionally, certainly at the time when I wrote that book, the fact that it was a database technology tended to mean it was best suited to the more quantitative ends of the discipline, but I think that’s starting to go as computing generally becomes much better at handling information in textual form. To what extent is it being taken up by historians at this point? On the one hand I’d say very, very rapidly, but on the other hand it hasn’t spread nearly far enough yet. I think it’s fair to say a problem at the moment is still, how do you start? There are two big barriers at the moment. The first – if you’re a PhD student or beyond that – is simply understanding what GIS is and what it has to offer to you, because at the moment most of the people with expertise in GIS are still more likely to be found in geography rather than history departments. So there’s that, “Is this really the thing for me?” barrier. The second barrier is probably the, “Okay I am interested, so how do I go about doing it?” The software is not that difficult, but it’s not that easy either, and it probably requires a certain amount of training. And training that is focused on historians is still relatively hard to get. I’ve run a few courses, but I’m not sure that any history department is including GIS skills as part of PhD programmes as yet. ArcGIS is the standard software, and most universities have got that on a site licence. As for expenses, if you’re then going to digitise a large amount of material or try and match it to locations manually, that takes quite a lot of time. It’s either the expense of your time or the expense of employing someone to do it for you.
DR: To some degree GIS seems to favour collaborative projects, perhaps because of the kinds of barriers you’ve just outlined. So, for example, you have a couple of historians who are interested in asking questions that have a geographical element. And then they would probably get some kind of funding for that, and would also draw in somebody like you.
IG: It encourages collaboration, yes. And I think one reason why Britain and Europe have done well in this field is that we are more sympathetic to collaborative scholarship, even in the humanities, than the US is, where the tenure system is basically hostile to it. But that said, a lot of people have set off on their own and made a lot of progress on both sides of the Atlantic. So it doesn’t have to be collaborative but I think it benefits from it. There are quite a few good examples of projects like that. There’s a really good one at the Open University at the moment. Elton Barker is looking at the Greek historian Herodotus [Hestia project]. Barker is in Classics, but he collaborated with some people from Geography, because what they’re interested in is this ancient historian’s view of the Mediterranean world: where was he talking about, where was he visiting, what was he talking about in various places? And they’ve had money not just from the AHRC but from Google as well to try and follow some of their ideas through. It’s very much one of those collaborative projects where Elton had the idea of where he wanted to go, and knew he didn’t have the expertise to do it, so he enrolled some people from the geography department to be involved. Now, that’s all well and good if you can get the money to do it, but it would be better if PhD students, particularly, were able to actually start working on this themselves. GIS projects don’t have to be that labour intensive, or that hard to do, but they do require a certain amount of skill to start with.
DR: Could you give some detail about a couple of successful historical GIS projects that you have been involved in or are aware of – how they were set up, what they produced?
IG: Well, take 19th-century infant mortality in Britain. The Victorians knew they had a problem with infant mortality, or with health more generally because urban areas were growing very rapidly, with very poor sanitation and poor living conditions leading through into high death rates, particularly among babies and children. In the 1870s and 1880s, the government starts to enact public health legislation to try and do something about that, and around that same time infant mortality starts to decline, and it carries on declining all the way through the 20th century. Therefore government intervention in these problem areas, particularly associated with sanitation, was the cause of infant mortality decline – that’s been the orthodoxy since the Victorian age. But we were able to use GIS and various techniques to take the whole of England and Wales from the 1850s up to the Edwardian era, to 1911. We found that the biggest declines actually occurred in rural areas, and they started long before the public health legislation came in. That existing orthodoxy, then, at best, can only apply in urban areas that had high rates of infant mortality where the decline started after the public health legislation. At worst, it may be completely wrong. But what the GIS can’t do is then advance to an explanation, or at least not as we’ve implemented it at the moment, so it’s very hard to know what was driving down infant mortality in those rural areas, particularly in the south and east. Nevertheless it completely re-frames the question. That’s one example from my own work.
A similar example is Geoff Cunfer’s work on the dust storms on the Great Plains in the 1930s – which doesn’t sound similar, but it’s the same kind of thing, in that when the dust storms were happening it very much fitted the politics of the time that this was down to over-intensive agriculture driven by perhaps over-enthusiastic capitalism, which the New Deal was kind of opposed to. This orthodoxy, that over-ploughing driven by the pressures of capitalism caused the top soil to fall apart, became very much established. Cunfer did, in some ways, the same kind of thing I did: he actually mapped out where dust storms had occurred over the Great Plains over quite a long period of time, and found that they were actually occurring in areas that had no ploughing at all going on. The explanation that he started to develop was that the dust bowl was caused by unprecedented levels of drought across the Great Plains. Agriculture may have been a contributory factor, but it was more a natural disaster than an environmental disaster driven by capitalism and the like. So by taking a broader view and by bringing more data in and looking at it across larger areas of time and space, you’re able to challenge existing orthodoxies, existing explanations.
DR: I am coming at this from the point of view of thinking about communities in early modern cities. If I said to you, I’ve got lots of census data and I’ve got a lot of text and other kinds of data that you can feed into a machine quite easily, what can GIS do for me?
IG: There are a lot of people that have done exactly that sort of work for cities, although not looking at the early modern period, looking at modern North America particularly, focusing on issues like segregation and race. But what can GIS do for you, the typical early modern urban historian or, to broaden that out slightly, the typical urban historian? Basically what you’d be likely to be able to do would be to take whatever records you’ve got, statistics from the census or taxation or whatever, and, I presume, link them to individual addresses if you’re very lucky, or perhaps to something slightly more aggregate such as streets. So what you’re able to do is then reallocate the data to the houses that the people were living in. And I assume you may well have data from a variety of different sources. So you can actually start to link up taxation records with census records and perhaps other records, based on where people lived, and you may have them from different dates, so you can look at how things changed over time. But more than that, you can start then to ask questions about how different parts of the city were different from each other. Where were the immigrants living? Were they also the poor areas? How did this relate to things like, perhaps, transport routes, or, almost anything, the red light districts, the sanitation, the city walls?
DR: One early modern project I do know about that’s kicking off, and one that you are involved in, is Nick Terpstra’s at Toronto, which aims in the first instance to look at the relationship between zones of prostitution and holy sites such as convents [‘Sex and the Sacred: Negotiating Boundaries in Renaissance Florence’ aims to produce a digital map of Florence primarily using a census of 1561 in combination with the ‘Buonsignori’ map of 1584. The idea is to layer sensory data together with information on patrician homes, religious sites, and where prostitutes lived and worked. The map is intended to become an interactive tool that can be downloaded by others, who will then be able to layer their own data on to it. Outline provided by Nick Terpstra. Also live now is Locating London’s Past]. Potentially, it seems, that you can layer just about as much as you like into a GIS.
IG: Yes. The big thing for GIS is, can you fix a location for the data that you’ve got? As long as you can do that, you can layer it without any problems at all. If you can’t do that, you’ve got problems. But if you can attach a decent location to it, then you’re away.
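[Editorial sketch, not part of the interview: the point that data can be layered only once it has a fixed location can be illustrated in a few lines of Python. Everything here is an assumption made for illustration: the layer names, the records, the coordinates (metres on an arbitrary local grid) and the helper functions `add_record` and `near` are hypothetical, and a real project would use GIS software rather than plain lists.]

```python
from math import hypot

def add_record(layer, name, xy=None):
    """Add a record to a layer; a record with no fixed location is rejected."""
    if xy is None:
        return False
    layer.append({"name": name, "xy": xy})
    return True

def near(source_layer, target_layer, radius):
    """For each source feature, list the target features within `radius`."""
    found = {}
    for s in source_layer:
        sx, sy = s["xy"]
        found[s["name"]] = [
            t["name"]
            for t in target_layer
            if hypot(t["xy"][0] - sx, t["xy"][1] - sy) <= radius
        ]
    return found

# Two hypothetical layers with invented coordinates.
convents, households = [], []
add_record(convents, "Convent A", (100.0, 200.0))
add_record(households, "Household 1", (130.0, 240.0))  # 50 m from Convent A
add_record(households, "Household 2", (400.0, 900.0))
placed = add_record(households, "Household 3")         # no location: cannot be layered

# "What is here in relation to here?" asked across the two layers.
nearby = near(convents, households, radius=100.0)
print(nearby)
```

The record without coordinates never enters the layer, which is the practical meaning of "if you can’t do that, you’ve got problems".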
DR: The other thing here, and this seems crucial, is it’s active rather than passive, in the sense that you can ask it your own questions once it’s online. Is the natural home of GIS the internet?
IG: Its actual home is increasingly becoming the internet with technologies like Google Earth, which are nice and easy to use. And, yes, it is very much about asking it questions, because it offers you a complicated, layered map, where there is usually too much information to take in all at once, and you either ask it specific questions, like, “What is here? What is here in relation to here?” Or you ask it more general questions, basically, “Show me a map of this kind of thing.” It might be locations of convents and the locations of red light districts, because you think that there might be a relationship between the two.
DR: It seems that part of the potential is also to be able to show things in an animated fashion, again perhaps online. And the reason I’m interested in that is because, like many urban historians, part of what interests me is how people move around through urban space. Where, technologically, are we at with that?
IG: Animations are great and they’re very easy to do. But there are a number of problems. They’re fine if you’re publishing them on the internet or doing PowerPoint presentations; not so good if you want to publish them in a conventional journal. Secondly though, they’re easy to produce, but they can be quite difficult to understand. If you think about it, what you’re trying to do, effectively, is represent space, time and theme simultaneously. If you just produce an ordinary map, you usually simplify theme quite a lot to stress space, and if you’re going to add time as well, and still make it understandable, you probably have to simplify the three of them even further. So as long as you’ve got a relatively simple theme, you can do these things quite well, but where you’re trying to represent data which is complex in theme, space and time it’s quite hard to do, just simplifying it down enough to make it understandable without throwing out all the information that should be in there. The technical level is not hard, the challenges are more at the cartographic and conceptual level.
DR: Again, the more the technology moves that way the more it seems that conventional publication is becoming degraded as a forum for this kind of spatial computational history or sociology, or whatever it happens to be.
IG: Yes, that’s a very fair point. It is a problem. Because it’s a problem even with something as basic as colour maps that can’t be reproduced, and you certainly can’t put interactive maps or animations or anything like that in. And it is a limitation because it means you have to simplify your results, or the way you present your results, because of limitations in publishing, in other words greyscale paper. But it’s becoming less of a problem, and it’s likely to continue to do so as electronic publishing allows you to put colour in, even in PDFs and the like. A lot of journals now will take colour versions of the map in the PDF version that they produce, and only go to greyscale in the other version. PDFs won’t let you put animations in, but you can start putting them up on websites and things like that. E-books are likely to move this along even further.
DR: Also, say I want to isolate a particular data set within a census, for example, where are the female heads of households in Florence in the census of 1632, the only way I’m going to be able to ask that is through an interactive online model. Unless the researcher that put up and analysed the census in the first place had decided that that was something that I needed to know.
IG: I mean there’s two ends to this aren’t there? There’s the kind of, “I’m a researcher,” end to this where you’re actually dealing with the database and framing your own queries, and then there’s the, “I’m a reader,” I suppose, who is largely following through what another researcher has presented, and following their argument along. It’s pretty fundamental to the way that we do scholarship, because if you’ve authored something, if you’ve just created a database for instance, and stuck it on the internet for other people to use, you haven’t actually generated any argument. You could debate this, but you haven’t really done much in the way of scholarship, you’ve simply said, here’s a big lump of, whatever it is, sort it out yourself. Whereas if you’ve really done the scholarship on it, and produced something like a journal article or a book, the chances are what you’re doing is presenting, “Well I’ve taken this lump of material and this is what I think of it, and this is why I think it.” And obviously as a reader, you may be challenging that argument. You may be thinking, “No, I don’t agree with that.” But nonetheless you are following along someone else’s version of what this is telling you.
DR: I take that point, but what I would like to do – and this is where I’m wondering if things are going to go more and more online – is not just have the argument you’ve made from the data that you’ve collected, and turned into a GIS of some form, but also put my own questions. At the moment you can’t do that?
IG: You can’t. There’s a couple of examples of people that have tried to do it, and they’re quite interesting. One is by a guy called Ben Ray, looking at the Salem witchcraft trials, where on his website he tries to combine having an archive with also having his interpretation of that archive, and linking the two together [Salem Witch Trials, Documentary Archive and Transcription Project]. And there’s another paper in the American Historical Review, by Ed Ayers and Will Thomas on slavery [“An Overview: The Difference Slavery Made: A Close Study of Two American Communities”, AHR, 108, 5, 2003, 1299-1307]. The journal, very interestingly, published a paper version which described the electronic version; it did nothing more than say, ‘Have a look at the electronic version, this is what we’re trying to do on it.’ And the electronic version did very much what we’re talking about, which was it tried to talk you through their argument, but it put plenty of links back to the original sources, and it also allowed you to read through their argument in a non-linear way, so you could move around the text freely. Unfortunately AHR didn’t follow up on that kind of thing, because it was a really interesting experiment.
DR: You mentioned before the issue of representing time – and one of the things you say in your book is that the biggest problem at that moment, the biggest critique, of GIS, was that it had trouble dealing with change over time. Is that still true? What kinds of projects have looked at change, and have they been successful, or at least interesting in some way?
IG: I’m not entirely convinced by that statement any more, about GIS not being good at change over time. It depends how you want to represent time. As long as time can be represented as a series of layers, then it works quite well. So time, in terms of census data, it does quite well. Your censuses are every x years or whatever, so each census is a slice through time, and you can bring them together very nicely within the GIS. Where it doesn’t handle time so well, although technology has improved a little bit, is where you’re trying to deal with continuous time. An example might be work that Richard Healey and Anne Knowles published in the Journal of Economic History in 2006, where they were interested in firms in the north-eastern United States in the 19th century, and those firms could open and close, and they were near railways which could open and close. Firms could also rename themselves, and they could move from one location to another. [R. Healey and A. Knowles, ‘Geography, Timing and Technology: A GIS-Based Analysis of Pennsylvania’s Iron Industry, 1825-1875’, Journal of Economic History, 66, 3, 2006, 608-34]. It’s not so good at coping with that much more complex sort of information. It can do it, but it does it slightly clumsily.
But there’s also a kind of conceptual problem with it, in that GIS very much comes from a perspective of the map, of space, and change over space, how things are different over space, whereas historians tend to think of things more as change over time. Going right back to the early Nineties when GIS first came on the scene, a woman called Gail Langran wrote something where she said that we don’t want to be contextualising things either from a spatial point of view or a temporal point of view. We want to do both together. [G. Langran, Time in Geographical Information Systems, Taylor & Francis, 1992]. Which seems like a good idea until you actually try and think, well how do you represent things through time and space simultaneously, in a way that you can understand them? One example being, if you have some data, you might either map it, or you can show time series graphs of it, but actually trying to represent those time series and maps simultaneously is very hard. And I think as human beings, almost, we prefer to have things served up in a way that either emphasises space or time, because we just can’t cope with the complexity of both simultaneously. You can try animating things for instance, but if it’s complicated data you just stare at it, and either see what you want to see, or what you’re expected to see, or just think, ‘This is a mess, help!’ You need to simplify it down. I think there are still problems. I think there are likely to continue to be problems with that, just because I don’t think we conceive of space and time simultaneously very well.
DR: I was also interested in the “data mining” piece on 17th-century English news pamphlets that you did, which matched words or concepts in a large body of text to places. [A. Dunning, I. Gregory, A. Hardie, ‘Freeing up digital content with text mining: New research means new licenses’, Serials, 22, 2009, 166-173]. Could you tell me about that, because that seems to me something that would have been unthinkable a few years ago, simply because it would have been too labour intensive?
IG: What it does is use what we call “corpus linguistic” techniques to identify what may be place names in your large amount of text. You then pull them all out and compare them to what’s called a gazetteer, which is basically just a database file which gives place names and coordinates for those place names. So you’re trying to match – and this is the slightly difficult bit, the bit where some human activity is required – your suspected place names against a list of known place names, and where they don’t match you’re trying to work out whether it’s actually a person’s name, or a different spelling and that sort of thing. But once you’ve done that, you’ve then got coordinates, potentially, for every place name in a very large amount of text, because actually all of this can be largely automated. At that point you can start asking questions. The first question you can ask is simply, “Where is this text talking about?” Because you’ll get a map of every place it’s talking about. And then you can ask questions like, “What is it saying about the different places?” It picks up on the idea that if certain types of word, like words related to war, for example, or words related to scenic beauty, are occurring near a place name, then they’re relevant to that place name, so you can start mapping which places a certain writer is viewing positively, or which places are being mentioned in relation to words like war and finance, governance and things like that, which were things we were picking up in that article. I’ve published something else more recently that goes into more detail on that, with Andrew Hardie, and we just got funding to follow that up. [I.N. Gregory and A. Hardie, “Visual GISting: Bringing Together Corpus Linguistics and Geographical Information Systems,” Literary and Linguistic Computing, 26, 2011].
Another way of going about it would be simply to take the whole text, and ask the question, “Well what are the major themes this text is talking about?” and then try and map what it’s talking about in relation to those. Or the third way of doing it would be to say, “Well I’m interested in the following places,” or, “The following areas, what are the major things that are being talked about in relation to these places?”
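[Editorial sketch, not part of the interview: the workflow described here (pick out suspected place names, match them against a gazetteer, set aside the non-matches for a human to check, then count theme words occurring near each matched place) might look roughly like this in Python. It is a toy under stated assumptions, not the method used in the published work: the two-entry gazetteer with approximate coordinates, the sample sentence, the word lists, the three-word window and the "capitalised word" rule for spotting candidates are all hypothetical simplifications.]

```python
import re
from collections import Counter

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {"York": (53.96, -1.08), "London": (51.51, -0.13)}
WAR_WORDS = {"army", "siege", "battle"}   # hypothetical theme word list
STOPWORDS = {"The", "A", "An"}            # capitalised words that are not names

def find_places(text, gazetteer):
    """Split capitalised candidates into gazetteer matches and leftovers."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matched, unmatched = [], []
    for i, tok in enumerate(tokens):
        if not tok[0].isupper() or tok in STOPWORDS:
            continue
        if tok in gazetteer:
            matched.append((i, tok))
        else:
            unmatched.append(tok)  # left for a human: person? variant spelling?
    return tokens, matched, unmatched

def theme_counts(tokens, matched, theme_words, window=3):
    """Count theme words within `window` tokens either side of each place."""
    counts = Counter()
    for i, place in matched:
        nearby = tokens[max(0, i - window): i + window + 1]
        counts[place] += sum(1 for w in nearby if w.lower() in theme_words)
    return counts

TEXT = ("The army marched from York to London. "
        "A siege of Yorke followed, and Fairfax held the field.")

tokens, matched, unmatched = find_places(TEXT, GAZETTEER)
war_by_place = theme_counts(tokens, matched, WAR_WORDS)

print([(name, GAZETTEER[name]) for _, name in matched])  # "Where is this text talking about?"
print(unmatched)                                         # needs human judgement
print(dict(war_by_place))                                # war words near each place
```

The leftovers show why some human activity is still required: "Yorke" is a variant spelling of a known place, while "Fairfax" is a person.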
DR: And how would a computer identify themes unless you told it to? Unless you gave it the key words to begin with?
IG: One thing that corpus linguistics does is it gives you what’s called semantic tagging, whereby it groups words into major themes associated with those words. So, for instance, a word like “admiral”, as in the navy, would be associated with “war”. So there’d be kind of hierarchical cataloguing systems for all types of words, which you can use in this kind of way.
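[Editorial sketch, not part of the interview: semantic tagging as described here amounts to looking each word up in a hierarchical catalogue of themes and then summarising the text by theme. The lexicon below is a five-word hypothetical stand-in invented for illustration; real semantic taggers use far larger lexicons and also resolve ambiguous words, which this sketch does not attempt.]

```python
from collections import Counter

# Hypothetical two-level lexicon: word -> (major theme, sub-theme).
LEXICON = {
    "admiral": ("war", "navy"),
    "fleet": ("war", "navy"),
    "regiment": ("war", "army"),
    "treasury": ("government", "finance"),
    "tax": ("government", "finance"),
}

def theme_profile(words, lexicon, level=0):
    """Count themes at one level of the hierarchy (0 = major, 1 = sub-theme)."""
    profile = Counter()
    for word in words:
        entry = lexicon.get(word.lower())
        if entry is not None:
            profile[entry[level]] += 1
    return profile

words = "The admiral sent the fleet while the treasury raised a tax".split()
major = theme_profile(words, LEXICON, level=0)
minor = theme_profile(words, LEXICON, level=1)
print(dict(major))
print(dict(minor))
```

No keywords are supplied by the reader in advance; the themes come from the lexicon, which is how a text can be summarised without first being told what to look for.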
DR: This seems to me like a powerful tool for analysing online printed text.
IG: Yes, that’s where it’s going. I don’t think we’re 100 per cent there yet. As I say, I’ve just got more funding to develop this, but if you stand back and think about it, and this only really occurred to me recently, IT is not about numbers any more, it’s about text, and we’ve got far more text now than we know what to do with, so we need to develop tools for handling them, and that’s very much where this is coming from. It’s one way of approaching what you can do with very large bodies of texts that you can’t possibly read all of. How can you go about summarising them quickly? Online resources are a bit problematic, because a lot of the time the people that have put them online only allow you to search them in quite primitive ways. And that’s partly deliberate, I think, because they want to protect their copyright. What we’re going to be dealing with is where you’ve actually got access to the raw data, the raw texts themselves, and be working on that. Whether you can then set that kind of technology loose on an internet resource I’m not yet sure. I think the thing to do is to ask them nicely whether they would supply you with the raw data, and take it from there.
DR: Even if they did, that raw data could present a computer with considerable challenges of reading it, simply because an early modern printed pamphlet can actually be quite hard to read until you’re used to them, and because the typefaces can be diverse. That must present some obstacles.
IG: That would present obstacles. Hopefully with things like EEBO [Early English Books Online], there is an electronic text behind it, so somebody’s already dealt with turning weird typefaces into ASCII text, or whatever the computer can handle.
DR: One of the more recent things you’ve done is the Lake District narratives, which was a case study to try and find out how GIS could deal with qualitative information. How successful was that? [This aimed to explore how subjective spatial experiences could be mapped by following accounts of tours of the English Lake District by two poets, Thomas Gray in 1769 and Samuel Taylor Coleridge in 1802. D. Cooper and I. N. Gregory, ‘Mapping the English Lake District: A literary GIS’, Transactions of the Institute of British Geographers, 36, 2011, 89-108.]
IG: It was really successful. I was surprised just how far we were able to go, because it enables you to think about text in a completely different way, because it presents a summary of the geographies in that text. Where are they talking about? For the Lake District, things like, what heights are those places that they’re talking about? Are they near lakes, are they near towns? Lots of questions like that which were very difficult to answer in the past. And as you move through the text, how do the places talked about change, how were they represented, how does one writer compare with another writer and one source compare with another source? We were only dealing with 20,000 words, which is pretty short, but potentially you could start applying those techniques to millions of words. Once you start doing that, the potential is absolutely amazing, because all of a sudden you’ve moved to a position where you can start summarising these large volumes of text without actually having to read them, whereas in the past your scholarship was always limited by your ability to read quickly. That no longer needs to be the case. Not that you want to stop reading entirely, but it points you at where you need to be reading in detail and which parts you can ignore, which in the past we’ve always done, but we’ve almost had to do it through guesswork.
DR: I noticed with the Lake District that you had the “mood maps”. If a writer was to express a complex, even paradoxical, emotion, then that can’t be placed in a simple way on a scale. Though you might be led to that passage of text.
IG: I think if this is going to work properly, in a way that historians and others are comfortable with, in the humanities particularly, then what it needs to be able to do is combine these kinds of grand, but somewhat crude, summaries with the ability to still add nuance yourself. To pick up on that mood map example, I think you get the feeling that in general people are responding to this place in a very positive way; however, a number of people seem not to be doing that. Therefore, you then perhaps want to follow up a writer who has not responded in the typical way and ask why. It’s a tool that helps your scholarship; it’s not something that does your scholarship for you.
DR: What other pitfalls are there?
IG: Sources are probably the biggest one, but then as historians you’re used to dealing with sources so you just have to remember that those limitations are still there. The next big problem is probably the time and expense of getting the data into digital form and then GIS form. That can be a big job. It shouldn’t be underestimated. Another pitfall that people can and do fall into – it’s probably a good pitfall – is to get too excited about the potential for then extending the database or doing new bits of technology, and forgetting that at the end of the day it is about the scholarship and not about the technology. Keeping focus on research questions is, then, important, but so is allowing your research questions to adapt as the work progresses, as they inevitably will.