Towards a Big Data History of Music

By Thomas Delpeut & Mascha van Nieuwkerk

The ‘A Big Data History of Music’ project is not short on ambition. Initiated by Stephen Rose of Royal Holloway, University of London, and Sandra Tuppen of the British Library, the collaborative project brings together several of the world’s biggest musical-bibliographical databases. It aims to inspire new musicological research that quantitatively analyses and visualises Western music history with the help of digital tools. During a workshop and symposium on 10 and 11 March at the British Library in London, the work-in-progress project was presented, and researchers were invited to experiment with the data and reflect on its use.

The combined dataset contains over a million catalogue records of musical scores by over forty thousand composers. This information is mainly derived from six prominent library databases, themselves the product of cumulative bibliographical research. They focus on different periods – between 1500 and the present – and different forms of sheet music – from manuscripts to published scores and volumes of printed music. The core of the project is built from the repositories of the Répertoire International des Sources Musicales (RISM) and the Musical Collections of the British Library. [For a complete overview of these databases, see this article by Stephen Rose and Sandra Tuppen].

By bringing this data together, the project hopes to stimulate new research into patterns in the production and distribution of music scores, and into aspects such as changing musical taste, compositional genres, canon formation, and the geographical transmission of music. The database encompasses not only the works of the most prominent composers; it also makes it possible to study the enormous number of unknown and peripheral composers, or ‘the Great Unheard’, a play on Moretti’s ‘Great Unread’. During the second day of the event, several scholars were invited to speak about the database and their digital musicological research. Laurent Pugin (RISM Switzerland) spoke about visualising and working with the RISM data. Marnix van Berchum (Utrecht University, DANS) addressed networks of early music sources, and Tim Crawford and Ben Fields (Goldsmiths, London) discussed musical collections and social networks.

A separate database, from the In Concert project, explores new approaches to building a digital archive from varying types of performance datasets based on concert ephemera, such as programmes, bills, and reviews and advertisements published in historical newspapers and periodicals. Cowgill’s contribution focused mainly on evaluating ways of overcoming the barriers that have hampered such digital musicology projects in the past: the expertise required, the volume of data, and the gap between cost and benefit. Solutions proposed by the In Concert research team include linkage to external sources and a step-by-step curation of sources, based on relevant research questions. From this starting point, the In Concert group is supporting many different small-scale expertise projects that have a very direct relation to performed research. This is an inspiring approach, and the opposite of the working method of the British Library. Whereas librarians are looking for researchers and digital tools to match their sources, researchers such as Cowgill are looking for tools and sources to match their research questions. At a symposium like this one, such matches can be successfully made.

Besides the ambitious promises of the combination of databases, considerable attention was also given to their limitations. Because of their heterogeneous nature, aligning the seven databases causes considerable problems. Differences in information and varying levels of detail prevent the project from offering one fully integrated database. Aspects such as spelling variations and the diverse ways of categorising the information show that the data still needs extensive pre-processing in order to be fully functional and in tune with specific research questions. During the Data Exploration Day we experienced this first-hand while using open access software such as OpenRefine, Google Fusion Tables and Palladio – presented by Loukia Drosopoulou (Royal Holloway). The project’s initiators are, furthermore, clear about the fact that even after further processing, the databases will never be complete because of archival limitations, and will always contain conjectural decisions, man-made errors and outdated information. It will be interesting to see how these limitations are communicated when the databases are made publicly available.
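
To give an impression of what such pre-processing can involve, the sketch below groups spelling variants of composer names by reducing each name to a normalised key, in the spirit of the key-collision clustering offered by tools such as OpenRefine. It is a minimal illustration and not the project's own method or OpenRefine's exact implementation; the function names and the sample names are hypothetical and do not come from the project's data.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name to a normalised key so that spelling variants collide."""
    # Lowercase and decompose accented characters (e.g. "ä" -> "a" + diaeresis).
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    # Drop the combining marks left over from the decomposition.
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation with spaces so "Handel," and "Handel" match.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = unaccented.translate(table).split()
    # Sort and de-duplicate tokens so that word order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster_variants(names):
    """Return groups of two or more names that share the same fingerprint."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return [group for group in clusters.values() if len(group) > 1]


# Hypothetical sample records, for illustration only.
sample = ["Händel, Georg Friedrich", "Georg Friedrich Handel", "Purcell, Henry"]
candidate_groups = cluster_variants(sample)
```

A key of this kind only catches differences in accents, punctuation, capitalisation and word order. Genuinely different spellings of the same name would still need fuzzy matching or manual review, which is part of why the alignment described above remains laborious.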