Mapping the economic structures of the UK
ODI Leeds and Bloom have been working together over the last 18 months to build The Data City. Integral to this is building the tools that let us map our cities' economic structures, give us real-time insight and let us measure whether our interventions are successful.
Part of our journey to build The Data City has been the IoT Nation Database. Pete Laflin, who leads Bloom's data-science team, has kindly written a technical blog post to explain how we have used open data and the power of the web to create a new way to measure how our economy is structured and how it is developing.
Watch this space for more of our mapping, clustering and innovation work with the web of data to answer questions on how our economy is structured, how it develops and how it is changing. - Paul Connell, Founder of ODI Leeds
Guest post by Pete Laflin
In March, the Digital Catapult launched the IoTUK Nation Database. The database aims to bring together "a snapshot of the current state of the businesses and organisations that make up the Internet of Things sector in the UK", to help understand the economic impact of the sector and measure the growth of an important enabling technology.
The project has started the process of curating an open data asset, with a data visualisation and exploration tool that anyone can use to query the data without having to understand how to query a database. As well as delivering a dynamic data asset, the project has developed a platform for innovation focused on telling the stories about the Internet of Things rather than focusing on a static data collection process.
The production of the database raised a number of interesting challenges for our team, as well as providing us with a great use case for exploring the data in Virtual Reality. This article describes some of those challenges and how we've tackled them to deliver an innovative solution for the Digital Catapult.
How do you go about curating a list of organisations that are actively involved in the Internet of Things ecosystem?
But, first, what is the "Internet of Things"?
Our first challenge is to identify exactly what it means to be in the "Internet of Things ecosystem". Without a useful working definition, it's going to be very hard to decide if an organisation is in, or out, of this ecosystem.
This raises the very basic question of "What is the Internet of Things?" How broad does it go? Is it a technology, a process, a collective term or a mixture of all three? Is every business that uses a remote sensor now an Internet of Things business, or is there a restriction we must impose to maintain some sector focus?
Rolls-Royce make aircraft engines that are capable of sharing data with engineers in-flight using a data link via satellite. Is this the Internet of Things, or does the closed nature of the communication process put this use case into another category? Enforcing some boundaries is key to solving our problem.
Can't we just use the "Internet of Things" SIC code?
SIC codes, or Standard Industrial Classification codes, were developed in 1937 by the US government as a way of classifying "industries". Over the years, both the US and UK systems have been amended, and they now differ in detail as requirements have changed.
In the UK, we've updated our SIC codes seven times to reflect how industries have changed. The latest update happened in 2007 and this version contains 15,599 different industry descriptions, mapped to 728 different code numbers.
In our search for boundaries for the Internet of Things, we should be able to look up all the businesses that are listed as an "Internet of Things" business. A quick search reveals 80 categories that contain "thing". There are lots of clothing categories, categories involving breathing such as "Breathing apparatus for diving (manufacture)", and there is even a category for businesses involved with "Bathing caps of rubber (manufacture)". But, frustratingly, there isn't a classification for "Internet of Things".
So, no. We can't just use the "Internet of Things" SIC code as there isn't one. Maybe it'll make it into the next revision of SIC codes but, for now, we need a different plan.
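To see why a search for "thing" throws up such odd results, here is a minimal sketch of a case-insensitive substring search. The sample list is an assumption: it contains only the handful of descriptions quoted in this article, not the full SIC 2007 index, and the function name `search_descriptions` is ours.

```python
# Illustrative sample only: descriptions quoted in this article,
# not the full list of SIC 2007 industry descriptions.
SAMPLE_DESCRIPTIONS = [
    "Breathing apparatus for diving (manufacture)",
    "Bathing caps of rubber (manufacture)",
    "Installation of industrial machinery and equipment",
    "Other telecommunications activities",
    "Wireless telecommunications activities",
    "Activities of head offices",
]


def search_descriptions(descriptions, term):
    """Return every description that contains `term`, ignoring case."""
    needle = term.lower()
    return [d for d in descriptions if needle in d.lower()]


# "thing" matches inside "Breathing" and "Bathing"...
thing_matches = search_descriptions(SAMPLE_DESCRIPTIONS, "thing")
# ...but the phrase we actually care about matches nothing.
iot_matches = search_descriptions(SAMPLE_DESCRIPTIONS, "internet of things")

print(thing_matches)
print(iot_matches)
```

A plain substring match has no idea where words start and stop, which is why the hits are about breathing and bathing rather than connected devices.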
Can't we just use whatever code is used by IoT businesses?
If there isn't a SIC code for the "Internet of Things", we won't be able to search Companies House data using the business classification alone. Thinking laterally though, what if we find a known Internet of Things business and look at their classification? That might shed some light on the situation.
Vodafone are a key player in the IoT ecosystem. How do they classify themselves? Vodafone PLC has three SIC codes associated with its public record at Companies House:
- Installation of industrial machinery and equipment
- Other Telecommunications Activities
- Activities of Head Offices
Three problems are immediately apparent. The first is inconsistency between businesses. Company directors decide these codes as part of the confirmation statement, choosing from a long list without seeing all of the possibilities, so two businesses engaged in the same activity may classify themselves differently. Where does "Other Telecommunications Activities" start and stop, and how can we be sure that it is a better classification for Vodafone than "Wireless telecommunications activities"?
The second issue is that businesses can have multiple SIC codes. This isn't a problem per se, but does create some complexity in our search.
The third, and most important, problem is that not every business listed as an "Other Telecommunications Activities" business will be involved in the Internet of Things. So, we cannot simply include all businesses in this category in our list of Internet of Things businesses, as this would over-inflate our numbers very quickly and lead to a confusing picture.
Does this mean SIC codes are broken?
In our newly agile world, where businesses frequently pivot to take advantage of opportunity, a hierarchical classification system is becoming an outdated concept. Organisations are more frequently working across industries, focusing on their skills rather than narrow industrial sub-speciality knowledge. A business can choose multiple SIC codes to reflect its cross-sector work, but will the company secretary really spend the time to make sure the data is accurate and that they have chosen the optimal classification for their business?
Recall our first problem - that we rely on an individual classifying their business consistently with others - and you quickly realise we need a better way.
The same hyper-connected world that makes the Internet of Things a realistic possibility also gives us an opportunity to harness the power of the web to collect our data.
We should focus on the individual business and make individual choices.
What if we use the organisation's website as a source of information and we make our own decision about what they do? Looking for terms like "Internet of Things" or "P2P" or "Sensor" might give us a clue that the organisation has an interest in the sector.
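As a rough illustration of that idea, the sketch below counts how often a few terms appear in the text of a web page. The three keywords are the ones mentioned above; the function names, the sample sentences and the `min_hits` threshold are assumptions made for the example, not a description of our production process.

```python
import re

# Terms taken from the paragraph above; a real list would be longer and curated.
KEYWORDS = ["internet of things", "p2p", "sensor"]


def keyword_hits(page_text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive matches for each keyword.

    An optional trailing "s" is allowed so that "sensors" counts as "sensor".
    """
    hits = {}
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"s?\b"
        hits[kw] = len(re.findall(pattern, page_text, flags=re.IGNORECASE))
    return hits


def looks_relevant(page_text, min_hits=1):
    """Flag a page as a possible IoT page if enough keywords appear."""
    return sum(keyword_hits(page_text).values()) >= min_hits


# Hypothetical website text, for illustration only.
sample = "We build low-power sensors and Internet of Things gateways."
print(keyword_hits(sample))
print(looks_relevant(sample))
print(looks_relevant("We sell handmade furniture."))
```

Using word boundaries avoids the "breathing" and "bathing" problem we hit with SIC descriptions, but a fixed keyword list is still a blunt tool.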
This seems like a reasonable approach and one which doesn't suffer from the problems we identified earlier with using SIC codes. If we had a list of businesses, we could make a reasonable attempt at classifying them as either "in" or "out" of the ecosystem, based on their activities. Better still, if we had a list of businesses known to be an IoT business, we could use this as a training set for a machine learning process. We could build a machine learning "black box" to help identify Internet of Things businesses using data science.
The Digital Catapult had such a list, which allowed us to train our model to spot words relevant to the Internet of Things.
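This article doesn't go into the model inside the black box, so the following is only a sketch of the general idea: learn which words are typical of known IoT businesses, then score new text against them. It uses a small multinomial naive Bayes classifier with add-one smoothing, which is our choice for illustration and not necessarily the technique used in the project. The training snippets and labels are hypothetical stand-ins for text scraped from websites.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Start from the log prior, then add a log likelihood per known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data: a "seed" list of known examples.
train_texts = [
    "connected sensors and internet of things platform",
    "smart devices with wireless sensor networks",
    "family bakery selling fresh bread and cakes",
    "accountancy and tax advice for small firms",
]
train_labels = ["iot", "iot", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("low power sensor devices for smart homes"))  # iot
print(model.predict("fresh cakes and bread"))                     # other
```

With only four training snippets this is a toy, but it shows why a good seed list matters: the model can only recognise vocabulary it has seen.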
Now we can spot an IoT business, how do we find more?
A reasonable approach is to try and push a large number of organisations through our black box to decide whether they are, or are not, involved in the Internet of Things.
Now we have this black box, we simply need to get a list of organisations and a URL for each organisation.
But, wouldn't buying a list of all UK organisations be very expensive?
Yes, so we need another way. By harvesting information from the web and looking at how Internet of Things websites link to each other, we can build up a list of organisations for our black box to think about. By harnessing the power of Open Data we can curate a list for our black box to work on, at a fraction of the cost of approaching a traditional list vendor. The Companies House API and the OpenCorporates API have been fantastic assets in this regard, giving us access to the public data record for UK organisations.
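The link-harvesting step can be sketched with nothing more than the Python standard library. The example below takes the HTML of one page and returns the other domains it links to, which become candidates for the black box. Downloading the pages and looking organisations up via the Companies House or OpenCorporates APIs are left out. The HTML and the `example.*` domains are placeholders, and the names `LinkCollector` and `external_domains` are ours.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_domains(page_url, html):
    """Return the set of domains a page links to, excluding its own."""
    collector = LinkCollector()
    collector.feed(html)
    own = urlparse(page_url).netloc.lower()
    domains = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before inspecting them.
        parsed = urlparse(urljoin(page_url, href))
        host = parsed.netloc.lower()
        if parsed.scheme in ("http", "https") and host and host != own:
            domains.add(host)
    return domains


# Hypothetical page content from a "seed" organisation's website.
html = """
<a href="/about">About us</a>
<a href="https://partner.example.org/iot">Our IoT partner</a>
<a href="http://supplier.example.net">Sensor supplier</a>
<a href="mailto:hello@seed.example.com">Email</a>
"""
candidates = external_domains("https://seed.example.com/", html)
print(sorted(candidates))
```

Repeating this for each newly found domain grows the candidate list outwards from the seed organisations.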
How do we check that the black box is making the right decision?
Once the algorithm has run and made its decisions, we need to manually check the accuracy of the classification for some organisations. By feeding back this validation step into the model, we continue to improve the accuracy of our model and ensure that our process learns as it goes.
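One simple way to picture this check, assuming a reviewer has labelled a sample of the model's decisions, is to measure what share of the organisations flagged as IoT were confirmed, and to turn the reviewer's labels into extra training examples. The records, labels and function names below are hypothetical.

```python
# Hypothetical review records: (website text, model label, reviewer label).
reviewed = [
    ("smart sensor platform", "iot", "iot"),
    ("connected device analytics", "iot", "iot"),
    ("web design agency", "iot", "other"),
    ("village bakery", "other", "other"),
]


def precision(records, positive="iot"):
    """Share of records the model flagged as `positive` that the reviewer confirmed."""
    flagged = [r for r in records if r[1] == positive]
    if not flagged:
        return None  # nothing was flagged, so precision is undefined
    confirmed = sum(1 for r in flagged if r[2] == positive)
    return confirmed / len(flagged)


def feedback_examples(records):
    """Reviewer-labelled examples that can be appended to the training set."""
    return [(text, human) for text, _model, human in records]


print(precision(reviewed))  # 2 of the 3 flagged organisations were confirmed
extra_training = feedback_examples(reviewed)
print(extra_training)
```

Retraining on the original seed list plus `extra_training` is what lets the process learn from its mistakes, such as the web design agency above.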
Will this get every IoT business in the UK?
No, but the harvesting step we use has been designed to pick up organisations which are discussed and linked to by other IoT organisations. By starting with a good "seed" list of known Internet of Things organisations, we should find a good proportion of Internet of Things organisations. If we've missed your organisation, you can tell us about it here.
Will the data be updated?
Yes, probably on a six-monthly refresh cycle. As we've built an automated process to collect and validate the data, the initial investment allows for repeated refreshes of the data. We can focus more time on analysing the data and telling stories about the sector, rather than spending most of our time collecting and organising data.
Can this be used for other sectors?
Yes. We can use this approach to build accurate audits of specific sectors on a global basis.
We can build forward-looking data assets that track changes over time, measuring the impact of interventions and investments and tracking longitudinal changes across a sector.
Is it scalable for all organisations in a country?
Yes. Our approach allows us to process large amounts of data reliably and our choice of underlying technology enables us to apply this to all sectors in a country, or continent. In part, this is due to our choice of database technology, which has merged the benefits of MySQL with the benefits of a graph database, such as Neo4j.
Why was a graph database used?
Graph databases are fantastic at storing information about how things are linked together. Our ultimate interest in this data set is to understand the linkages within the ecosystem and a graph database has allowed us to do this more effectively than a relational database system.
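To give a feel for the kind of question a graph makes easy, here is a toy sketch that stores "links to" relationships in plain Python dictionaries and asks which organisation is linked to most often. This is not a graph database and not our schema. The organisation names and edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical "links to" relationships between organisations.
edges = [
    ("Org A", "Org B"),
    ("Org A", "Org C"),
    ("Org B", "Org C"),
    ("Org D", "Org C"),
]

# Store each relationship in both directions so either can be queried directly.
links_to = defaultdict(set)
linked_from = defaultdict(set)
for source, target in edges:
    links_to[source].add(target)
    linked_from[target].add(source)


def most_linked(linked_from):
    """Return the organisation with the most inbound links."""
    return max(linked_from, key=lambda org: len(linked_from[org]))


print(most_linked(linked_from))   # Org C
print(sorted(links_to["Org A"]))  # ['Org B', 'Org C']
```

In a relational database, following these relationships several steps out means joining a table to itself repeatedly. A graph database treats the links as first-class data, which suits questions about how an ecosystem hangs together.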
The video below shows the relationships between organisations, as seen through our Virtual Reality data exploration platform.
What did the data show?
We identified 603 UK-based organisations where our black box decided that they were engaged in some IoT activity. These organisations fall into 146 different SIC classifications and you can explore how varied these codes are in the visualisation tool we built to help people explore the data. The raw data itself can be downloaded from Data Mill North.
IoTUK are to publish a number of reports explaining how this data describes the growth of the sector and its importance to UK plc. It is fitting that innovation in data science is helping to describe innovation in other sectors.
Innovation delivered by the Data City
Head of Data, Jaywing Intelligence