Pilhofer shows early working version of DocumentCloud

In: journalism

2 Oct 2009

In a session on data at the Online News Association conference, Aron Pilhofer of The New York Times gave a detailed look at DocumentCloud.

The project is intended to help present, share, find and deal with mounds of documents that investigative reporters unearth.

DocumentCloud treats documents as data but attempts to put some structure around them. The aim is to have an analytical tool that processes documents and makes the data accessible.

The goals are to improve transparency and help journalists discover linked data buried in the documents.

The tool, says Pilhofer, is going to be 100% free so long as you are willing to make your documents public.

PDFs are a terrible way of putting documents online, says Pilhofer.

Pilhofer talks about OpenCalais as a tool that takes the text and, after the “magic part”, returns details from those documents, such as companies, people, places and more.

The tool becomes more powerful because DocumentCloud would be a store of documents from diverse news organisations, so it would be able to find connections between documents from different sources, explains Pilhofer.

He calls the process “entity extraction” and, using his new favourite word, “disambiguation”.

Calais works by assigning an individual ID number to an entity, say IBM, and that same reference will apply across all documents.
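To illustrate the idea of one ID per entity, here is a minimal Python sketch. It does not use the OpenCalais API. The alias table, the IDs and the function names are all made up for illustration, and real entity extraction is far more sophisticated than substring matching. The point is only that once different spellings resolve to the same ID, documents from different sources can be linked.

```python
from collections import defaultdict

# Hypothetical alias table: each surface form maps to a made-up canonical ID.
ALIASES = {
    "ibm": "entity:0001",
    "i.b.m.": "entity:0001",
    "international business machines": "entity:0001",
    "new york times": "entity:0002",
}


def extract_entity_ids(text):
    """Return the set of canonical IDs whose aliases appear in the text."""
    lowered = text.lower()
    return {entity_id for alias, entity_id in ALIASES.items() if alias in lowered}


def build_index(documents):
    """Map each canonical ID to the set of document names that mention it."""
    index = defaultdict(set)
    for doc_name, text in documents.items():
        for entity_id in extract_entity_ids(text):
            index[entity_id].add(doc_name)
    return index


# Invented sample documents, standing in for uploads from different newsrooms.
documents = {
    "memo-a": "I.B.M. signed the contract in March.",
    "report-b": "International Business Machines disclosed the deal.",
    "letter-c": "The New York Times requested the records.",
}

index = build_index(documents)
print(sorted(index["entity:0001"]))
```

Here the memo and the report use different names for the same company, but both end up filed under the same ID, which is the connection a reporter would want surfaced.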

Pilhofer demonstrated an early alpha of DocumentCloud. Uploaded documents appear on the left side of the screen. Entering a search term starts to narrow down the documents.

A list of documents appears on one side of the screen, with a list of topics on the other. By selecting a topic, the list of documents is further narrowed, enabling you to extract specific documents.
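The narrowing described in the demo can be sketched in a few lines of Python. This is not DocumentCloud's code. The document records, field names and the `narrow` function are assumptions made for illustration: a search term filters the list first, and picking a topic filters it further.

```python
def narrow(documents, search_term=None, topic=None):
    """Return titles of documents matching the search term and, if given, the topic."""
    titles = []
    for doc in documents:
        # Skip documents whose text does not contain the search term.
        if search_term and search_term.lower() not in doc["text"].lower():
            continue
        # Skip documents not tagged with the selected topic.
        if topic and topic not in doc["topics"]:
            continue
        titles.append(doc["title"])
    return titles


# Invented sample records with hypothetical topic tags.
docs = [
    {"title": "Budget memo", "text": "The city budget shortfall grew.",
     "topics": {"finance", "city hall"}},
    {"title": "Contract letter", "text": "The budget for the contract was approved.",
     "topics": {"procurement"}},
    {"title": "Inspection report", "text": "Inspectors found violations.",
     "topics": {"safety"}},
]

by_search = narrow(docs, search_term="budget")
by_search_and_topic = narrow(docs, search_term="budget", topic="finance")
print(by_search)
print(by_search_and_topic)
```

Searching for "budget" leaves two documents, and selecting the "finance" topic then leaves only the memo.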

DocumentCloud has 27 members and is looking for more. Anyone interested can get in touch at info@documentcloud.org.

About this blog

This blog is run by Professor Alfred Hermida, an award-winning online news pioneer, digital media scholar and journalism educator.