ODI Leeds

Emergent Alliance - Creating an Open Catalogue of Data

ODI Leeds recently joined the Emergent Alliance, a collaboration set up to share data, work, and resources related to global recovery from the COVID-19 pandemic. Our founder, Paul Connell, wrote an excellent blog post explaining the approach we are taking, in line with our #RadicallyOpen strategy.

The key to the Alliance is collaboration, and a very important part of that is sharing data. This doesn't need to be difficult - we already have an amazing tool at our disposal, so why not use it as intended? I'm talking about the web. We set up an open catalogue of data, where members of the Alliance can share information about datasets they are using and publishing.

The data catalogue site - https://emer2gent-data.netlify.app
Credit: ODI Leeds

We're no strangers to working openly (the clue is in the name!). We've seen the huge benefits it brings in terms of collaboration, efficiency, and time saved - that's why we thought it was important to set up a central place where everyone from the Emergent Alliance can share the datasets they are using or publishing. By putting everything in one place, we make it much easier for everyone to find the data they need, see what data everyone else is using, and quickly get the metadata surrounding it.

This is a catalogue of metadata. It doesn't try to be something it's not. It's not a datastore or a data science platform, nor is it intended to replace any existing infrastructure or platforms. Instead, it is, very simply, a tool to allow everyone to share information and metadata about the datasets being used and published. The data itself could be stored anywhere on the web, but by providing the URL in the metadata, it can still be accessed through the catalogue. Perhaps the data isn't publicly available, or isn't on the web - that's fine - the metadata around it can still be added to the site. Even if that's just a short description of the data, people then know that the dataset is available, and who has it, whereas before, they might not have even known it existed.

We have taken a decentralised approach to actually storing the metadata: the site is powered by a set of index files, each containing datasets. All of the technical information is available on GitHub, and I will be writing a separate technical blog, but it's essentially just a collection of JSON index files stored on the web.

Each organisation can create their own index, which could be stored anywhere on the web. It could even be an API endpoint rather than a static JSON file (as long as the schema matches). This means whoever created the index maintains full control over it, and can add, update, and remove datasets freely, without needing permission. Once an organisation creates an index, they simply need to add the URL to our master index file, which is stored in our GitHub repository. This is the only part of the process which requires any intervention on our part, and it only needs to be done once.
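To give a rough idea of how a catalogue like this could pull the separate indexes together, here is a minimal Python sketch. It is not the site's actual code: the field names (`indexes`, `datasets`, `title`), the `source_index` key, and the `fetch_json` and `load_catalogue` helpers are all assumptions made for illustration. The real schema is documented in the GitHub repository.

```python
import json
from typing import Callable
from urllib.request import urlopen


def fetch_json(url: str):
    """Fetch and parse a JSON document from the web."""
    with urlopen(url) as response:
        return json.load(response)


def load_catalogue(master_url: str, fetch: Callable[[str], dict] = fetch_json) -> list:
    """Merge the datasets from every index listed in the master index.

    Assumed (hypothetical) shapes, not the real schema:
      master index: {"indexes": ["https://example.org/index.json", ...]}
      each index:   {"datasets": [{"title": "..."}, ...]}
    """
    catalogue = []
    master = fetch(master_url)
    for index_url in master["indexes"]:
        try:
            index = fetch(index_url)
        except Exception as error:
            # One unreachable index shouldn't take down the whole catalogue.
            print(f"Skipping {index_url}: {error}")
            continue
        for dataset in index.get("datasets", []):
            # Record which index each dataset came from.
            catalogue.append({**dataset, "source_index": index_url})
    return catalogue
```

Because each index is just a URL, it makes no difference to a loader like this whether the URL points at a static JSON file or an API endpoint, as long as the response matches the schema.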

If that seems complicated, that's OK - we've tried to make things as easy as possible by also providing a form on the site, which allows you to add a dataset to a special 'public' index we have created specifically for this purpose. You can even use that form to generate a JSON object for a dataset to copy and paste into another index.
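For anyone who would rather script it than use the form, the sketch below shows roughly what generating a dataset object and adding it to an index might look like in Python. The field names (`title`, `description`, `owner`, `url`, `datasets`) and both helper functions are hypothetical, and the example values are made up. Check the schema on GitHub before relying on any of it.

```python
import json
from typing import Optional


def make_dataset_entry(title: str, description: str, owner: str,
                       url: Optional[str] = None) -> dict:
    """Build a metadata entry (hypothetical fields).

    The URL is optional, because the data itself need not be on the web.
    """
    entry = {"title": title, "description": description, "owner": owner}
    if url:
        entry["url"] = url
    return entry


def add_to_index(index: dict, entry: dict) -> dict:
    """Return a copy of the index with the entry appended to its datasets."""
    return {**index, "datasets": [*index.get("datasets", []), entry]}


# A made-up dataset with no public URL: the description alone is still useful.
entry = make_dataset_entry(
    title="Example footfall counts",
    description="Hourly counts from a hypothetical sensor network.",
    owner="Example Organisation",
)
print(json.dumps(entry, indent=2))
```

The printed JSON object is the kind of thing you could paste into your own index file.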

The form to add data to the public index
Credit: ODI Leeds

It's hard to overstate the benefits that something as simple as this brings, but it only works if everyone gets involved. So if you're part of the Emergent Alliance, start adding data.

If you need help or have feedback, use the Emergent Microsoft Teams channel, or open an issue on GitHub. I don't know exactly what this should look like - it's not a finished product, and probably never will be. If something can be improved, it should be, so feel free to give us feedback, ideas, and suggestions. As the Alliance expands, it becomes more and more important to have a collaborative, open platform to share data. Everyone has data, and everyone needs data. Why make it harder to find?