Session: Teach – This&THATCamp Sussex Humanities Lab

TEACH/PLAY – Scaling Up Impact

Julie Weeds — Fri, 26 Feb 2016 09:57:32 +0000

Researchers in every field are being made increasingly aware of the need for their research to have impact. However, researchers often don’t realize where, beyond their own field of study, their research might have impact. How does one go about finding documents on the Internet which could be connected conceptually to another document, such as one’s own work or project proposal? Searching for key terms is a good start, but the same kinds of documents often come out on top again and again, and it is difficult to sift through the results to find something both different and relevant. Further, documents from other domains or sources (e.g., government policy documents) may use a different vocabulary, making them less likely to come up in keyword searches.

In the Text Analytics Group (TagLab) at Sussex, we are developing a system which does four things. First, it automatically identifies key words and phrases for a document or set of documents. Second, it searches the web using queries based on combinations of the key words/phrases and related words. Third, it allows the user to build custom classifiers (using active learning), e.g., for relevance. Finally, it clusters the results with a view to making it easier to identify documents or clusters of documents outside of existing known clusters. The purpose of this session is to teach delegates about the underlying technology and to give delegates the opportunity to play with and evaluate the prototype system, using their own work as input.
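
As a rough illustration of the first two steps only (key term identification and query building), here is a minimal sketch in Python. It is not the TagLab system: the TF-IDF weighting, the tiny stopword list, and the function names `key_terms` and `build_queries` are all assumptions made for the example, and it handles single words rather than phrases or related words.

```python
import math
import re
from collections import Counter
from itertools import combinations

# A deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def key_terms(target, background, top_n=3):
    """Rank the target document's words by TF-IDF against background documents."""
    docs = [tokenize(d) for d in [target] + list(background)]
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    counts = Counter(docs[0])
    scores = {
        term: count * math.log(len(docs) / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def build_queries(terms, size=2):
    """Combine key terms into search queries of `size` words each."""
    return [" ".join(combo) for combo in combinations(terms, size)]


# Invented example texts, not real project data.
target = "Census records and census transcription for genealogy"
background = ["Records of parliament", "Transcription of medieval records"]
terms = key_terms(target, background)
queries = build_queries(terms)
```

Words that appear in every document (here "records") score zero, which is one simple way of pushing a search away from the terms that dominate an existing field.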

Delegates will ideally have a laptop with Google Chrome installed to be able to access the software which is run as a web service. It would also be helpful if delegates brought with them a digital copy of some of their own work (e.g., an academic paper or grant proposal) in raw text format (i.e., ‘.txt’) which can be uploaded and processed by the system. However, we can help with both installation of software and file formatting as required.

TEACH/PLAY – Treating BMD data as a ‘(semi-)closed set’

Richard Light — Thu, 18 Feb 2016 18:13:21 +0000

Starting from 1837, the GRO Civil Registration index provides a nominally complete record of births, marriages and deaths for England and Wales. The FreeBMD project has transcribed the vast majority of this material up to 1983, when the GRO register went digital. (Ironically, the FreeBMD project is still negotiating to make this more recent data available.) Free UK Genealogy, of which FreeBMD is one project, is committed to making all its data available under an open data licence, and is working towards this goal, initially by getting all contributors to sign an agreement which allows this. This session is intended to explore what might be possible once these data sources are indeed open.

The high degree of completeness in this data makes it feasible to think in terms of a ‘closed set’ (as against the ‘open world’ assumption that cultural history usually has to adopt). In principle it should be possible to algorithmically match deaths to births that fall within this period, thereby providing an extra impetus to single-name studies.
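
To make the ‘closed set’ idea concrete, here is a minimal sketch in Python of one way deaths might be matched to births. It is an illustration under stated assumptions, not the method used in Richard’s experiments: it assumes each death record carries an age at death, it matches on exact surname and forenames only, and the record classes, the `tolerance` parameter and the sample names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Birth:
    surname: str
    forenames: str
    year: int


@dataclass(frozen=True)
class Death:
    surname: str
    forenames: str
    year: int
    age: int  # assumed to be present in the death record


def candidate_births(death, births, tolerance=1):
    """Return births with the same name whose year fits the age at death."""
    estimated_birth_year = death.year - death.age
    return [
        b for b in births
        if b.surname.lower() == death.surname.lower()
        and b.forenames.lower() == death.forenames.lower()
        and abs(b.year - estimated_birth_year) <= tolerance
    ]


def match_deaths_to_births(deaths, births, tolerance=1):
    """Link a death to a birth only when exactly one candidate birth exists."""
    matches = {}
    for death in deaths:
        candidates = candidate_births(death, births, tolerance)
        if len(candidates) == 1:
            matches[death] = candidates[0]
    return matches


# Invented records for illustration only.
births = [
    Birth("Smith", "Ada", 1850),
    Birth("Smith", "Ada", 1871),
    Birth("Smith", "John", 1850),
    Birth("Smith", "John", 1851),
]
deaths = [
    Death("Smith", "Ada", 1921, 70),
    Death("Smith", "John", 1900, 50),
]
matches = match_deaths_to_births(deaths, births)
```

Because the set is treated as closed, a single remaining candidate can be accepted as a match, while the ambiguous case (two John Smiths born a year apart) is left unlinked. A fuller version would also need to handle spelling variants and stop one birth being claimed by two deaths.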

A companion project – FreeCEN – offers census data, which places individuals within households on a specific date. While the coverage of FreeCEN is less complete than that of FreeBMD, the data it does hold offers much richer information about relationships between individuals, placing them in a social/family context.

Richard Light has scraped all the FreeBMD and FreeCEN data relating to his own name and his mother’s maiden name. The data behind these experiments will be made available as open data. It currently lives in an XML database, and can be published using the Linked Data approach. The plan for this session is for Richard to explain what has been achieved so far with this data, and then for everyone to explore what other techniques might be applied to it.

TEACH – Open Source Personal Digital Archiving

James Baker — Mon, 15 Feb 2016 09:43:09 +0000

SESSION REQUIREMENTS
If you intend to come to this session and want to do this work on your own laptop, please make sure you do the following before coming to Sussex (it may be possible to do it during the session, but the files are a bit big!):

Download and extract the latest BitCurator Virtual Machine at wiki.bitcurator.net/index.php?title=Main_Page
Download and set up VirtualBox and make sure the BitCurator Virtual Machine works as per wiki.bitcurator.net/index.php?title=BitCurator_Virtual_Machine_Install
If you have any old floppies, flash drives, CDs, DVDs, or hard disks you want to try to capture as part of the session, bring them along! (And don’t worry, I’ll bring along some dummy media for us to play with.)
See Processing Workflow for Digital Media for the session handout (I’ll bring some copies along!)

The paper archive has been replaced by physical data storage, a new format that requires historians, archivists, and humanists to think and act afresh. In just 35 years most people, in Britain and worldwide, have come to create text and data in a fundamentally new way. The first step towards working with these personal digital archives is to preserve them. You can’t just turn on an old computer and start browsing: the act of booting it up adds new data to the archive with fresh date stamps, thus compromising its authenticity. Thankfully, open source digital forensic tools aimed at archivists and scholars have made huge strides in recent years, thanks largely to the efforts of the BitCurator project led by the University of North Carolina at Chapel Hill.
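
One common way to detect the kind of unintended change described above is to record a checksum of a captured disk image and compare it again later. The following Python sketch illustrates that idea only; it is not a description of how BitCurator itself works, and the function names and the file name `fixity_demo.img` are assumptions made for the example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Return True if the file still matches the digest recorded at capture."""
    return sha256_of_file(path) == recorded_digest


# Demonstration with a small stand-in file rather than a real disk image.
with open("fixity_demo.img", "wb") as handle:
    handle.write(b"captured bytes")
recorded = sha256_of_file("fixity_demo.img")

# Simulate a later write to the image, as booting an old machine might cause.
with open("fixity_demo.img", "ab") as handle:
    handle.write(b" plus a new timestamp")
still_authentic = is_unchanged("fixity_demo.img", recorded)
```

Any change to the image, however small, produces a different digest, so a mismatch signals that the copy no longer reflects what was originally captured.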

In this session, we’ll work together to capture some dummy media (bring your own if you want to work with the real thing!) and explore that media using BitCurator: a suite of open source digital forensics and data analysis tools designed to help collecting institutions process born-digital materials.