Session: Play – This&THATCamp Sussex Humanities Lab

TALK/PLAY – TextLab Project in Practice

Rebecca Russell — Wed, 02 Mar 2016 16:09:50 +0000

TextLab is a Vertically Integrated Project at the University of Strathclyde involving students from the English Literature department and the Computer & Information Science department.

We use tools like:

AntConc (www.laurenceanthony.net/software/antconc/), a freeware toolkit for concordancing and text analysis.

Ubiqu+ity (vep.cs.wisc.edu/ubiq/), which generates statistics and identifies linguistic patterns and groups.

WordHoard (wordhoard.northwestern.edu/), an application for the close reading and scholarly analysis of texts. On this project it is used largely to calculate the log-likelihoods of specific words and to generate word clouds that display this information in a user-friendly way.

In TextLab we use these programs to analyse the language of Shakespeare and to find patterns and discrepancies that would almost certainly be invisible to the naked eye.

But can we also use them to solve a murder?

To demonstrate the uses of these tools, we have developed a murder-mystery scenario in which Romeo (of Romeo and Juliet) has been found murdered while staying in a house with Hamlet, Brutus, and Lady Macbeth. A confession note signed by Brutus was found by the body, but he claims he is innocent. We will demonstrate how some of these analytical tools could help us identify the killer, simply from the language used in the note.
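As a rough illustration of the log-likelihood comparison mentioned above, the sketch below scores how unexpectedly frequent each word in one text is relative to another. It uses the standard two-corpus log-likelihood (G2) formula; it is not WordHoard's own code, and the function names `tokenize`, `log_likelihood` and `keywords` are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """G2 score for a word seen a times in c tokens and b times in d tokens."""
    # Expected counts if the word were equally likely in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target, reference, top=5):
    """Rank words that are over-represented in target compared with reference."""
    t, r = Counter(tokenize(target)), Counter(tokenize(reference))
    c, d = sum(t.values()), sum(r.values())
    scored = []
    for word, a in t.items():
        b = r[word]
        if a / c > b / d:  # keep only words relatively more frequent in target
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]
```

In the murder-mystery scenario, one could call `keywords(note_text, suspect_lines)` once per suspect, where both arguments are plain strings supplied by the user, and compare which suspect's language the note departs from least. A real analysis would need far more text and more careful tokenization than this sketch provides.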

TEACH/PLAY – Scaling Up Impact

Julie Weeds — Fri, 26 Feb 2016 09:57:32 +0000

Researchers in every field are increasingly made aware of the need for their research to have impact. Often, however, researchers do not realize where beyond their own field of study their research might have impact. How does one find documents on the Internet that are conceptually connected to another document, such as one’s own work or project proposal? Searching for key terms is a good start, but the same kinds of documents tend to come top of the results again and again, and it is difficult to sift through them to find something both different and relevant. Further, documents from other domains or sources (e.g., government policy documents) may use a different vocabulary, making them less likely to come up in keyword searches.

In the Text Analytics Group (TagLab) at Sussex, we are developing a system which does four things. First, it automatically identifies key words and phrases for a document or set of documents. Second, it searches the web using queries based on combinations of the key words/phrases and related words. Third, it allows the user to build custom classifiers (using active learning), e.g., for relevance. Finally, it clusters the results with a view to making it easier to identify documents or clusters of documents outside of existing known clusters. The purpose of this session is to teach delegates about the underlying technology and to give delegates the opportunity to play with and evaluate the prototype system, using their own work as input.
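The first two steps described above (extracting key words and combining them into queries) can be sketched as follows. This is only an illustration under stated assumptions: it uses plain TF-IDF against a small background collection, which may differ from what the TagLab prototype actually does, and the names `top_keywords` and `build_queries` are hypothetical. The active-learning and clustering steps are not shown.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(document, background, k=3):
    """Return the k words in document with the highest TF-IDF score."""
    docs = [set(tokenize(d)) for d in background + [document]]
    tf = Counter(tokenize(document))
    n = len(docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)  # documents containing the word
        scores[word] = count * math.log(n / df)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def build_queries(keywords, size=2):
    """Combine keywords into search queries of the given size."""
    return [" ".join(combo) for combo in combinations(keywords, size)]
```

For instance, `build_queries(top_keywords(my_paper, other_papers))` would yield two-word queries that could be passed to a search engine; the strings `my_paper` and `other_papers` stand for text the user supplies.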

Delegates will ideally have a laptop with Google Chrome installed to be able to access the software, which runs as a web service. It would also be helpful if delegates brought a digital copy of some of their own work (e.g., an academic paper or grant proposal) in raw text format (i.e., ‘.txt’), which can be uploaded and processed by the system. However, we can help with both software installation and file formatting as required.

TEACH/PLAY – Treating BMD data as a ‘(semi-)closed set’

Richard Light — Thu, 18 Feb 2016 18:13:21 +0000

Starting from 1837, the GRO Civil Registration index provides a nominally complete record of births, marriages and deaths for England and Wales. The FreeBMD project has transcribed the vast majority of this material up to 1983, when the GRO register went digital. (Ironically, the FreeBMD project is still negotiating to make this more recent data available.) Free UK Genealogy, of which FreeBMD is one project, is committed to making all its data available under an open data licence, and is working towards this goal, initially by getting all contributors to sign an agreement which allows this. This session is intended to explore what might be possible once these data sources are indeed open.

The high degree of completeness in this data makes it feasible to think in terms of a ‘closed set’ (as opposed to the ‘open world’ assumption that cultural history usually has to adopt). In principle, it should be possible to match deaths algorithmically to births that fall within this period, thereby providing an extra impetus to single-name studies.
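A minimal sketch of such matching is given below. It assumes, purely for illustration, that each death record carries a name, a year and an age at death, and that each birth record carries a name and a year; the `Birth` and `Death` classes, the field names and the sample names in the usage note are invented and do not reflect the actual FreeBMD data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Birth:
    name: str
    year: int


@dataclass(frozen=True)
class Death:
    name: str
    year: int
    age: int  # age at death; assumed to be present in the record


def match_deaths_to_births(deaths, births, tolerance=1):
    """Map each death to the births with the same name and a plausible year."""
    by_name = {}
    for birth in births:
        by_name.setdefault(birth.name.lower(), []).append(birth)
    matches = {}
    for death in deaths:
        estimated = death.year - death.age  # approximate year of birth
        matches[death] = [
            birth
            for birth in by_name.get(death.name.lower(), [])
            if abs(birth.year - estimated) <= tolerance
        ]
    return matches
```

With invented records such as `Birth("Ann Sample", 1850)` and `Death("Ann Sample", 1921, 70)`, the death is paired with the 1850 birth because the estimated birth year (1851) falls within the tolerance. Under the closed-set assumption, a death with exactly one candidate could be treated as resolved and that birth removed from the pool for other deaths; that refinement, and the handling of spelling variants, are left out here.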

A companion project – FreeCEN – offers census data, which places individuals within households on a specific date. While the coverage of FreeCEN is less complete than that of FreeBMD, the data it does hold offers much richer information about relationships between individuals, placing them in a social/family context.

Richard Light has scraped all the FreeBMD and FreeCEN data relating to his own name and his mother’s maiden name. The data behind these experiments will be made available as open data. It currently lives in an XML database, and can be published using the Linked Data approach. The plan for this session is for Richard to explain what has been achieved so far with this data, and then for everyone to explore what other techniques might be applied to it.
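To give a flavour of what publishing XML records as Linked Data can involve, the sketch below turns a small XML fragment into N-Triples statements. Everything specific in it is an assumption made for illustration: the element names, the `id` attribute, the `http://example.org/` namespace and the sample record are invented, and the real database may be structured quite differently.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # placeholder namespace, not a real project URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Invented sample record; the element names are hypothetical.
SAMPLE = """
<records>
  <birth id="b1"><surname>Sample</surname><forename>Ann</forename><year>1850</year></birth>
</records>
"""


def to_ntriples(xml_text):
    """Convert each <birth> element into a list of N-Triples lines."""
    triples = []
    root = ET.fromstring(xml_text.strip())
    for record in root.findall("birth"):
        subject = f"<{BASE}record/{record.get('id')}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}vocab/Birth> .")
        for field in record:
            # Simplification: literal values are not escaped here.
            triples.append(f'{subject} <{BASE}vocab/{field.tag}> "{field.text}" .')
    return triples
```

Calling `to_ntriples(SAMPLE)` gives one type statement plus one statement per child element. A production version would use an established vocabulary and an RDF library rather than string formatting.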