Archive for the ‘Data’ category

Brad Scott

What have I been doing?

New projects, data and taxonomies. No wonder my blogging has been minimal. I thought I’d have a week or two to catch my breath, but it looks like another busy period is about to start.

I’ve been lucky to have an interesting few months with some lovely clients and projects. First off was a London-based corporation that needed advice on a taxonomy and metadata strategy. They have a huge amount of data from lots of different sources, and a pressing need to integrate it so that users can draw greater benefit from it, whether through a conventional classification, an existing thesaurus or even by developing the folksonomies they already have. I gave them a roadmap for the next stages, and a pragmatic sense of where they could end up.

Then it was back to the more familiar academic publishing environment for some project planning. The new project needed to be started urgently, so I’ve spent the last couple of months working with the extremely bright publishing team to pull together the outline requirements for an exciting new service to be launched next year. Although the subject area was relatively unfamiliar to me, it proved to be a fascinating resource, with an extremely innovative back-end system for linking data. Nice to know I can still throw in a few quirky insights that add something challenging to a project…

And finally, there’s the data. I’ve always enjoyed the data side of digital publishing, ever since I managed both the editorial and the technical ends of projects at Routledge in the 90s. It was clear then that to manage data well it really helps to have a detailed understanding of what it does, and the fine attention to picky detail that proofing and copy-editing give you. Consequently, I’ve done lots of modelling and converting of publishers’ data over the years, so it was a pleasure to convert some very lightly HTML-tagged encyclopedia data to rather more robust TEI markup using XSLT. The highlights this time have been the rather extensive use of for-each-group, and of inordinately complex regexes in analyze-string to add structure to the bibliographies so that OpenURL linking might work. Needless to say, Michael Kay’s book is always by my side.
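For readers who have not met the technique, here is a rough sketch of the general idea, written in Python rather than the XSLT used on the project. It is not the project's code. The citation layout ("Surname, Forename. Title. Place: Publisher, Year."), the sample entry and the resolver address are all invented for illustration; the query keys follow the OpenURL (Z39.88-2004) key/encoded-value format for books.

```python
import re
from urllib.parse import urlencode

# Assumed citation layout (hypothetical, for illustration only):
#   "Surname, Forename. Title. Place: Publisher, Year."
CITATION = re.compile(
    r"^(?P<aulast>[^,]+),\s*(?P<aufirst>[^.]+)\.\s*"  # author surname, forename
    r"(?P<btitle>.+?)\.\s*"                            # book title (may contain full stops)
    r"(?P<place>[^:.]+):\s*(?P<pub>[^,]+),\s*"        # place of publication and publisher
    r"(?P<date>\d{4})\.?$"                             # four-digit year
)


def parse_citation(text):
    """Return the citation's fields as a dict, or None if the layout does not match."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


def to_openurl(fields, resolver="https://resolver.example.org/openurl"):
    """Build an OpenURL query from parsed fields.

    The resolver address is a placeholder, not a real service.
    """
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
    ]
    params += [("rft." + key, value) for key, value in fields.items()]
    return resolver + "?" + urlencode(params)


# Invented sample entry, not a real reference.
sample = "Smith, Jane. A History of Examples. London: Example Press, 1998."
fields = parse_citation(sample)
url = to_openurl(fields)
print(url)
```

Real bibliographies are far messier than this single pattern allows for (multiple authors, editors, journal articles, missing dates), which is why the regexes grow so quickly; entries that fail to match are best left untouched and flagged for manual checking.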


Lots and lots of data

I’ve been involved with the publication of products containing fairly large amounts of data for well over a decade now, and finding some old articles of mine made me think about what has changed for publishers who handle such content.

Certainly, the volume of data for individual projects has increased, which in turn has meant that publishers have got a bit better at managing and archiving their data assets, though I wish that were more generally true; valuable data can still be stored in the equivalent of a shoe box with inadequate documentation. Suppliers are generally better (and cheaper) too, not least since they now have more familiarity with the important data standards. Even so, data testing and QA can still be problematic, and that is equally true within publishers.

Compared with a decade ago, user requirements and expectations tend to inform data design more, and some publishers certainly have well-thought-out and documented data models that have been constructed with usage in mind. But the technology platform that delivers the content, rather than the user, can sometimes be what shapes the data, and that can lead to some ugly and inflexible choices.

Nevertheless, when faced with a new data creation or migration project, there is still an unavoidably large amount of grind and planning required to get it right. That’s what I found so interesting re-reading these ten-year-old articles. Though the delivery technology has changed, the processes and thinking required aren’t very different, and I could have written similar things about many of the projects I’ve worked on since then.

Cover of Asia Official British Documents package

The articles themselves date back to when I was digital publisher at Routledge in the late 90s. One describes the creation of Asia: Official British Documents (1998) [1], which was published with the British National Archives, and comprised 40,000 page images of original archive content plus metadata; and the second focuses on the data of the Calendar of State Papers Colonial series (1999) [2].

The former was mostly an exercise in tracking bits of paper in a database, but the latter was an SGML implementation, drawing on the models of the Text Encoding Initiative (TEI) and the Model Editions Partnership. In the years since then I’ve been extending the TEI for several other projects, such as the New Palgrave Dictionary of Economics, and the MLA Handbook, which has meant adding in MathML and the CALS table model. Fundamentally though, the process for planning and creating the data for these products hasn’t changed much at all.

  1. Scott, Brad. “Creating an Image Edition of Historical Material: Asia: Official British Documents, 1945-1965.” 1998. http://www.brambletye-publishing.co.uk/consultancy/creating-an-image-edition-of-historical-material/
  2. Scott, Brad. “Retrospective Data Conversion in a Commercial Publishing Environment: The Calendar of State Papers, Colonial.” 1999. http://www.brambletye-publishing.co.uk/consultancy/retrospective-data-conversion-in-a-commercial-publishing-environment/

London Book Fair

The LBF has prompted a number of posts and ruminations. Mike Shatzkin’s paper at the digital seminar has been posted on the Bookseller blog [1]. It’s well worth a read for a wide-ranging view of the industry. With my data-preparation hat on I was particularly struck by the comment:

Peter Balis, the ebook wizard at John Wiley in Hoboken, talks about the fact that he has to take IP designed to be optimized on a print page and figure out how to make it work on different sized ebook screens. He openly longs for the day when his outputs become the dog and the printed book the tail. He points out, correctly, that it would be an easier workflow for everybody if it worked that way around.

This is surely the experience of most publishers, and it’s not just in respect of ebooks, but all output formats; platform neutrality is the key. Shatzkin’s paper makes clear how much the ebook phenomenon is causing waves through the industry, and the impact of that is addressed by Michael Cairns, who speaks of all-important standards, interoperability and collaborative action between publishers [2], and how essential it is that publishers direct the technology rather than be led by it [3].

  1. Shatzkin, Mike. “The future of trade publishing in the digital marketplace.” 27 April 2009. BookBrunch. http://www.bookbrunch.co.uk/index.php?option=com_content&view=article&id=1682:the-future-oftrade-publishing-in-the-digital-marketplace&catid=918:digital&Itemid=97
  2. Cairns, Michael. “Amazon Stanza: This Changes Nothing.” 27 April 2009. PersonaNonData. http://personanondata.blogspot.com/2009/04/amazon-stanza-this-changes-nothing.html
  3. Cairns, Michael. “London Days of Futures Past.” 27 April 2009. PersonaNonData. http://personanondata.blogspot.com/2009/04/london-days-of-futures-past.html

Digital publishing consulting

With twenty years' experience in the information industry, and a broad range of activities in the digital/new media sector since 1994, Brambletye Publishing offer invaluable expertise for publishers and other information professionals.