Glossary

API

Application Programming Interface. For data, this is usually a way provided by the data publisher for programs or apps to read data directly over the web. The app sends the API a query asking for the specific data it needs, e.g. the time of the next bus leaving a particular stop. This allows the app to use the data without downloading the whole dataset, saving bandwidth and ensuring that the data used is the most up-to-date available.
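
A minimal sketch of such a query, assuming a hypothetical bus-times API at api.example.org with made-up parameter names (stop, limit). It only builds the request URL an app would send and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, used purely for illustration.
BASE_URL = "https://api.example.org/v1/departures"


def build_query(stop_id: str, limit: int = 1) -> str:
    """Return the URL an app would request to get the next departures from one stop."""
    params = {"stop": stop_id, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query("A1"))
```

The app asks only for the departures at one stop, rather than downloading the whole timetable.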

API Documentation

Quality API documentation is the gateway to a successful API. API documentation needs to be complete yet simple, a very difficult balance to achieve. Striking this balance takes work, and usually the work of more than one person on an API development team.

API documentation can be written by developers of the API, but additional edits should be made by developers who were not responsible for deploying the API. As a developer, it’s easy to overlook parameters and other details that developers have made assumptions about. —source

Analytics

Rate limiting will be part of any API platform; without some sort of usage log and analytics showing developers where they stand, the rate limits will cause nothing but frustration. Clearly show developers where they stand with daily, weekly or monthly API usage, and provide proper relief valves that allow them to scale their usage. —source
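
As a rough sketch of the kind of usage summary meant here, assuming a hypothetical request log that records one API key per request and a made-up daily limit:

```python
from dataclasses import dataclass


@dataclass
class UsageReport:
    used: int   # requests made so far in the period
    limit: int  # requests allowed in the period

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    @property
    def percent_used(self) -> float:
        return 100.0 * self.used / self.limit


def summarise(request_log: list[str], developer_key: str, daily_limit: int) -> UsageReport:
    """Count one developer's requests in the log and compare them with the limit."""
    used = sum(1 for key in request_log if key == developer_key)
    return UsageReport(used=used, limit=daily_limit)


report = summarise(["key-a", "key-b", "key-a"], "key-a", daily_limit=10)
print(report.used, report.remaining, report.percent_used)
```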

Anonymisation

Processing data that includes personal information so that individuals can no longer be identified in the resulting data. Anonymisation enables data to be published without breaching data protection principles. The principal techniques are aggregation and de-identification. Care must be taken to avoid data leakage that would result in individuals’ privacy being compromised. UKAN studies best practice in data anonymisation.
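
The two techniques can be illustrated with a small sketch. The records, field names and the choice of postcode district and ten-year age bands are all made up for illustration; real anonymisation needs far more care, since small counts can still identify individuals.

```python
from collections import Counter

# Made-up records containing personal information.
RECORDS = [
    {"name": "Alice", "postcode": "AB1 2CD", "age": 34},
    {"name": "Bob", "postcode": "AB1 3EF", "age": 37},
    {"name": "Carol", "postcode": "XY9 1ZZ", "age": 61},
]


def de_identify(record: dict) -> tuple[str, str]:
    """Drop the name; keep only the postcode district and a ten-year age band."""
    district = record["postcode"].split()[0]
    band_start = record["age"] // 10 * 10
    return district, f"{band_start}-{band_start + 9}"


def aggregate(records: list[dict]) -> Counter:
    """Count people per (district, age band) instead of publishing individual rows."""
    return Counter(de_identify(r) for r in records)


print(aggregate(RECORDS))
```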

Anonymization

See Anonymisation.

App / Application

A piece of software (short for ‘application’), especially one designed to run on the web or on mobile phones and similar platforms. Apps can make network connections to large databases and thus be a powerful way of consuming open data, which may be real-time, personalised, and (using a mobile phone’s GPS) location-specific information. Crowdsourcing apps can also be used to build or improve datasets.

Application Programming Interface

A way for computer programs to talk to one another. It can be understood in terms of how a programmer sends instructions between programs. See API.

Attribution

Acknowledging the source of data when using or re-publishing it. A data licence permitting the data to be used may include a requirement to attribute the source. Data subject to this restriction may still be considered open data according to the Open Definition.

Authentication

Authentication is a way for an application or system to associate an account with a user, usually through credentials in the form of a username and password. Because Basic Auth is integrated into the HTTP protocol, it is the easiest way for users to authenticate with a RESTful API.

Basic Auth is easily integrated; however, if SSL is not used, the username and password are passed effectively in plain text (they are only Base64-encoded, not encrypted) and can easily be intercepted on the open Internet. —source
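
A short sketch of how a Basic Auth header is built, using only the Python standard library. The credentials are placeholders. Decoding the token again shows why the scheme is unsafe without an encrypted connection.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the HTTP Authorization header used by Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header("Aladdin", "open sesame")
print(header)

# Anyone who intercepts the header can reverse the encoding.
token = header["Authorization"].split(" ", 1)[1]
print(base64.b64decode(token).decode("utf-8"))
```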

Bandwidth

The rate at which data can be transferred between computers. As bandwidth is limited, apps aim to download only the minimum amount of data needed to fulfil a user’s request.

Big Data

A collection of data so large that it cannot be stored, transmitted or processed by traditional means. The increasing availability of and need to process such datasets (for example, huge collections of weather or other scientific data) has led to the development of specialised computer technologies, architectures and programming languages.

BitTorrent

BitTorrent is a protocol that spreads the bandwidth needed to transfer very large files across the computers participating in the transfer. Rather than downloading a file from a single source, BitTorrent allows peers to download pieces of it from each other.

Bulk

Data is available in bulk if the entire dataset can be downloaded easily and efficiently to a user’s own system. Conversely, it is non-bulk if one is limited to getting small parts of the dataset, for example if you are restricted to a few elements of the data at a time and would therefore need thousands or millions of requests to get the entire dataset. The provision of bulk access is a requirement of open data.

CKAN

An open-source software platform for creating data portals, built and maintained by Open Knowledge. CKAN is used as the official data-publishing platform of around 20 national governments and powers many more local, community, scientific and other data portals. Notable features are configurable metadata, user-friendly web interface for publishers and data users, data preview, organisation-based authorisation levels, and APIs giving access to all features as well as data access.

CSV

‘Comma-separated values’, a standard format for spreadsheet data. Data is represented in a plain text file, with each data row on a new line and commas separating the values on each row. As a very simple open format it is easy to consume and is widely used for publishing open data.
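
A small illustration of how easily CSV can be consumed, using Python's standard csv module. The file content is made up; note that every value is read as text.

```python
import csv
import io

# Made-up CSV content: a header row, then one data row per line.
CSV_TEXT = "school,town,pupils\nHill Road Primary,Leeds,210\nSt Mary's,York,185\n"


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dictionary per data row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


for row in read_rows(CSV_TEXT):
    print(row["school"], row["town"], row["pupils"])
```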

Citizen Engagement

Actively involving the public in policy and decision-making. Citizen engagement is a central aim of open government, with the aims of improving decision making and gaining or retaining citizens’ consent and support. Open data is an essential tool for ensuring informed engagement.

Civic Hacking

Building tools and communities, usually online, that address particular civic or social problems. Examples could be tools that help users meet like-minded people locally based on particular interests, report broken infrastructure to their local council, or collaborate to clear litter from their neighbourhood. Local-level open data is particularly useful for civic hacking projects.

Cloud

Data stored ‘in the cloud’ is handled by a hosting company, relieving the data owner of the need to manage its physical storage. Instead of being stored on a single machine, it may be stored across or moved between multiple machines in different locations, but the data owner and users do not need to know the details. The hosting company is responsible for keeping it available and accessible via the internet.

Connectivity

Connectivity relates to the ability for communities to connect to the Internet, especially the World Wide Web.

Conversion

The process of automatically reading data in one file format and emitting the same data in a different format, thus making the data accessible to a wider range of applications.
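
As one possible example, the sketch below reads data in CSV format and emits the same data as JSON, using only the Python standard library. The sample input is made up.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Read CSV text and emit the same rows as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


print(csv_to_json("town,population\nLeeds,800000\nYork,200000\n"))
```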

Copyright

A legal right over intellectual property (e.g. a book) belonging to the creator of the work. While individual data (facts) cannot be copyrighted, a database will in general be covered by copyright protecting the selection and arrangement of data within it. Within the European Union separate ‘database rights’ protect a database where there was a substantial effort in ‘obtaining’ the data. A copyright holder may use a licence to grant other people rights in the protected material, perhaps subject to specified restrictions.

Corruption

Misuse of public positions or funds, e.g. by embezzling money, extorting bribes, claiming unreasonable expenses, illicitly favouring friends or particular groups in public services or appointments, etc. Open data and, more generally, open government are important tools in the fight against corruption.

Cost recovery

The principle of setting a price for a resource, e.g. data, aiming to recover the cost of collecting the data, as distinct from marginal cost. Data charged for on a cost-recovery basis is not open data according to the Open Definition. Studies show that charging for public sector information (PSI) on a cost-recovery basis leads to lower growth than free or marginal-cost pricing.

Creative Commons

A non-profit organisation founded in 2001 that promotes re-usable content by publishing a number of standard licences, some of them open (though others include a non-commercial clause), that can be used to release content for re-use, together with clear explanations of their meaning.

Crowdsourcing

Dividing the work of collecting a substantial amount of data into small tasks that can be undertaken by volunteers. Some examples: Wikipedia is a crowdsourced encyclopedia, and Galaxy Zoo was an early example of crowdsourcing scientific data, asking non-expert volunteers to classify galaxies based on their visual appearance. NOVAM was a service which allowed the public to verify or correct official data on the locations of UK bus stops, crowdsourcing about 18,000 corrections.

DOI

Digital Object Identifier, an identifier for a digital object (such as a document or dataset) that is assigned by a central registry and is therefore guaranteed to be a globally unique identifier: no two digital objects in the world will have the same DOI.

Data

Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured and presented so as to be useful and relevant for a particular purpose, it becomes information available for human apprehension. See also knowledge.

Data Access Protocol

A system that allows outsiders to be granted access to databases without overloading either system.

Data cleaning

Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration), etc. See data quality.
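
Two of these steps, removing formatting from numbers and representing dates consistently, might look like the sketch below. It assumes, purely for illustration, that amounts carry a pound sign and thousands separators and that dates are written day/month/year.

```python
from datetime import datetime


def clean_number(raw: str) -> float:
    """Strip currency symbols, thousands separators and whitespace from a number."""
    return float(raw.replace("£", "").replace(",", "").strip())


def clean_date(raw: str) -> str:
    """Convert a day/month/year date (an assumption about the input) to ISO 8601."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").date().isoformat()


print(clean_number(" £1,250.50 "))
print(clean_date("03/02/2014"))
```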

Data collection

Datasets are created by collecting data in different ways: from manual or automatic measurements (e.g. weather data), surveys (census data), records of decisions (budget data) or ongoing transactions (spending data), aggregation of many records (crime data), mathematical modelling (population projections), etc.

Data integration

Almost any interesting use of data will combine data from different sources. To do this it is necessary to ensure that the different datasets are compatible: they must use the same names for the same objects, the same units or co-ordinates, etc. If the data quality is good this process of data integration may be straightforward but if not it is likely to be arduous. A key aim of linked data is to make data integration fully or nearly fully automatic. Non-open data is a barrier to data integration, as obtaining the data and establishing the necessary permission to use it is time-consuming and must be done afresh for each dataset.
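
A toy sketch of the compatibility work described above: two made-up datasets name the same towns differently and one uses square miles, so names and units are reconciled before the data is combined. All figures are invented.

```python
# Two made-up datasets describing the same towns with different labels and units.
POPULATION = {"Leeds": 800_000, "York": 200_000}
AREA_SQ_MILES = {"LEEDS ": 213.0, "york": 105.0}

SQ_KM_PER_SQ_MILE = 2.589988


def normalise(name: str) -> str:
    """Use the same name for the same object in both datasets."""
    return name.strip().title()


def density_per_sq_km(population: dict[str, int], area_sq_miles: dict[str, float]) -> dict[str, float]:
    """Combine the datasets after reconciling names and converting units."""
    area_sq_km = {normalise(n): a * SQ_KM_PER_SQ_MILE for n, a in area_sq_miles.items()}
    result = {}
    for town, people in population.items():
        key = normalise(town)
        if key in area_sq_km:  # skip towns missing from the second dataset
            result[key] = people / area_sq_km[key]
    return result


print(density_per_sq_km(POPULATION, AREA_SQ_MILES))
```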

Data journalism

The ability to work with data is an increasingly important part of a journalist’s armoury. Skills needed to research and tell a good data-based story include finding relevant data, data cleaning, exploring or mining the data to understand what story it is telling, and creating good visualisations.

Data leakage

If personal data has been imperfectly anonymised, it may be possible by piecing it together (perhaps with data available from other sources) to reconstruct the identity of some data subjects together with personal data about them. The personal data, which should not have been published (see data protection), may be said to have ‘leaked’ from the ‘anonymised’ data. Other kinds of confidential data can also be subject to leakage by, for example, poor data security measures. See de-identification.

Data management

The policies, procedures, and technical choices used to handle data through its entire lifecycle from data collection to storage, preservation and use. A data management policy should take account of the needs of data quality, availability, data protection, data preservation, etc.

Data portal

A web platform for publishing data. The aim of a data portal is to provide a data catalogue, making data not only available but discoverable for data users, while offering a convenient publishing workflow for publishing organisations. Typical features are web interfaces for publishing and for searching and browsing the catalogue, machine interfaces (APIs) to enable automatic publishing from other systems, and data preview and visualisation.

Data preservation

The Domesday Book of 1086 was written with ink on vellum, a technology that is still legible today. Long-term preservation of present day datasets is more difficult to ensure owing to uncertainty about the future of file formats, computer architectures, storage media and network connectivity. Projects that put particular stress on data preservation take a variety of approaches to solving these problems.

Data protection legislation

Data protection legislation is not about protecting the data, but about protecting the right of citizens to live without fear that information about their private lives might become public. The law protects privacy (such as information about a person’s economic status, health and political position) and other rights such as the right to freedom of movement and assembly. For example, in Finland a travel card system was used to record all instances when the card was shown to the reader machine on different public transport lines. This raised a debate from the perspective of freedom of movement and the travel card data collection was abandoned based on the data protection legislation.

Data quality

A measure of the usability of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and in its handling of e.g. missing data, and directly machine-readable (see data cleaning). It also conforms to standards of nomenclature in the field and is published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset.

Data wrangler

A person who converts data into a usable form so that it can easily be used with automated or semi-automated tools. Data wrangling may include further data cleaning.

Database

(i) Any organised collection of data may be considered a database. In this sense the word is synonymous with dataset.

(ii) A software system for processing and managing data, including features to extend or update, transform and query the data. Examples are the open source PostgreSQL, and the proprietary Microsoft Access.

Database rights

A right to prevent others from extracting and reusing content from a database. Exists mainly in European jurisdictions.

Dataset

Any organised collection of data. ‘Dataset’ is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources.

Discoverable

It is not enough for open data to be published if potential users cannot find it, or even do not know that it exists. Rather than simply publishing data haphazardly on websites, governments and other large data publishers can help make their datasets discoverable by indexing them in catalogues or data portals.

EU PSI Directive

The Directive on the re-use of public sector information, 2003/98/EC, “deals with the way public sector bodies should enhance re-use of their information resources.” (Source: Legislative Actions - PSI Directive)

File format

The description of how a file is represented on a computer disk. The format usually corresponds to the last part of the file name (‘extension’), e.g. a file in CSV format might be called schools-list.csv. The file format refers to the internal format of the file, not how it is displayed to users. E.g. CSV and XLS files are structured very differently on disk, but may look similar or identical when opened in a spreadsheet program such as Excel.

Five stars of open data

A rating system for open data proposed by Tim Berners-Lee, inventor of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.
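
The cumulative scoring can be sketched as follows. The criterion names are invented labels for conditions (1) to (5); a dataset earns one star per condition until the first one it fails.

```python
# Invented labels for conditions (1)-(5), in order.
CRITERIA = (
    "open_licence_on_web",
    "structured_data",
    "non_proprietary_format",
    "uses_uris",
    "links_to_other_data",
)


def star_rating(dataset: dict[str, bool]) -> int:
    """Count how many conditions are met in order, stopping at the first failure."""
    stars = 0
    for criterion in CRITERIA:
        if not dataset.get(criterion, False):
            break
        stars += 1
    return stars


# A CSV file published under an open licence meets conditions (1)-(3).
open_csv = {"open_licence_on_web": True, "structured_data": True, "non_proprietary_format": True}
print(star_rating(open_csv))
```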

Freedom of Information

Also known as FOI. A requirement in law (e.g. the Freedom of Information Act 2000 in the UK or the Right to Information Act 2005 in India) for public bodies to provide data held by them to citizens on request, unless a specific exemption applies, e.g. the data is confidential. The fact that information must be supplied under FOI laws does not in general make it open data, as it is not distributed, may not be available under an open licence, etc.

GIS

Geographical Information System, any computer system designed to read, display, analyse and manipulate geodata.

GPS

The Global Positioning System, a satellite-based system which provides exact location information to any equipment with a suitable receiver (including modern smartphones). GPS is invaluable to many location-based apps, providing users with e.g. route-finding information or weather forecasts based on their current location. GPS is also a striking example of successful open data, as it is maintained by the US government and provided free of charge to anyone with a GPS receiver.

GeoJSON

A dialect of JSON with specialised features for describing geodata, and hence a popular interchange format for geodata.
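
For illustration, a single point location can be written as a GeoJSON Feature like this. The place name and coordinates are examples; note that GeoJSON lists longitude before latitude.

```python
import json


def point_feature(name: str, longitude: float, latitude: float) -> dict:
    """Build a GeoJSON Feature for one point; coordinates are [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": {"name": name},
    }


feature = point_feature("Example bus stop", -0.1276, 51.5072)
print(json.dumps(feature, indent=2))
```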

Geodata

Any dataset where data points include a location, e.g. as latitude and longitude or another standard encoding. Maps, transport routes, environmental data, cadastral data, and many other kinds of data can be published as geodata.

Government data

The work of government involves collecting huge amounts of data, much of which is not confidential (economic data, demographic data, spending data, crime data, transport data, etc). The value of much of this data can be greatly enhanced by releasing it as open data, freeing it for re-use by business, research, civil society, data journalists, etc.

Hackathon

An event, usually over one or two days, where developers, subject experts and others come together to create apps, visualisations and prototypes that aim to address problems in a particular domain, usually making heavy use of data. Hackathons focusing on a particular collection of data are a possible form of community engagement by data publishers. The hackathon is a popular format in the open source community.

Host

A company that stores a customer’s data on its own (the host’s) computers and makes it available over the internet. A hosted service is one that runs and stores data on the service-provider’s computers and is accessed over the network. See also SaaS.

Human Readable

Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.

IP rights

See Intellectual property rights.

Identifier

The name of an object or concept in a database. An identifier may be the object’s actual name (e.g. ‘London’ or ‘W1 1AA’, a London postcode), or a word describing the concept (‘population’), or an arbitrary identifier such as ‘XY123’ that makes sense only in the context of the particular dataset. Careful choice of identifiers using relevant standards can facilitate data integration. See linked data.

Information

A structured collection of data presented in a form that people can understand and process. Information is converted into knowledge when it is contextualised with the rest of a person’s knowledge and world model.

Information Asset Register

IARs are registers specifically set up to capture and organise metadata about the vast quantities of information held by government departments and agencies. A comprehensive IAR includes databases, old sets of files, recent electronic files, collections of statistics, research and so forth.

The EU PSI Directive recognises the importance of asset registers for prospective re-users of public information. It requires member states to provide lists, portals, or something similar. It states: “Tools that help potential re-users to find documents available for re-use and the conditions for re-use can facilitate considerably the cross-border use of public sector documents. Member States should therefore ensure that practical arrangements are in place that help re-users in their search for documents available for reuse. Assets lists, accessible preferably online, of main documents (documents that are extensively re-used or that have the potential to be extensively re-used), and portal sites that are linked to decentralised assets lists are examples of such practical arrangements.”

IARs can be developed in different ways. Government departments can develop their own IARs and these can be linked to national IARs. IARs can include information which is held by public bodies but which has not yet been – and maybe will not be – proactively published. Hence they allow members of the public to identify information which exists and which can be requested. “For the public to make use of these IARs, it is important that any registers of information held should be as complete as possible in order to be able to have confidence that documents can be found. The incompleteness of some registers is a significant problem as it creates a degree of unreliability which may discourage some from using the registers to search for information.”

It is essential that the metadata in the IARs should be comprehensive so that search engines can function effectively. In the spirit of open government data, public bodies should make their IARs available to the general public as raw data under an open licence so that civic hackers can make use of the data, for example by building search engines and user interfaces.

Intellectual property rights

Monopolies granted to individuals for intellectual creations.

Internet

A worldwide network of interconnected computer networks that use the Internet protocol suite (TCP/IP) to facilitate data transmission and exchange among several billion devices, which are logically linked together by a globally unique address space.

JSON

JavaScript Object Notation, a simple but powerful format for data. It can describe complex data structures, is highly machine-readable as well as reasonably human-readable, and is independent of platform and programming language. It is therefore a popular format for data interchange between programs and systems.
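
A brief sketch of JSON as an interchange format, using Python's standard json module. The record is made up; it nests a list and an object to show that complex structures survive the round trip through text.

```python
import json

# A made-up record with nested structures.
record = {
    "stop": "Market Square",
    "routes": ["4", "17A"],
    "next_departure": {"route": "4", "minutes": 6},
    "step_free": True,
}

text = json.dumps(record)    # serialise to plain text that any language can read
restored = json.loads(text)  # parse the text back into native data structures

print(text)
print(restored == record)
```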

KML

Keyhole Markup Language, an XML-based open format for geodata. KML was devised for Keyhole Earth Viewer, later acquired by Google and renamed Google Earth, but has been an international standard of the Open Geospatial Consortium since 2008.

Knowledge

The sum of a person’s - or mankind’s - information about and ability to understand the world. See also data.

Licence

A legal instrument by which a copyright holder may grant rights over the protected work. Data and content is open if it is subject to an explicitly-applied licence that conforms to the Open Definition. A range of standard open licences are available, such as the Creative Commons CC-BY licence, which requires only attribution.

Licence mixing

If Project X publishes content, and wants to include content from Project Y, it is necessary that Y’s licence permits at least the same range of re-uses as X’s licence. For example, content published under a non-commercial licence cannot be included in Wikipedia, since Wikipedia’s open licence includes rights for commercial re-use which cannot be granted for the non-commercial data, an example of a failure of licences to mix well.
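
The rule can be modelled, very roughly, as a subset check on permitted re-uses. The permission labels below are invented, and real licences carry conditions (such as attribution or share-alike) that this sketch ignores.

```python
def can_include(y_permits: set[str], x_grants: set[str]) -> bool:
    """Y's content may be included in X only if Y permits every re-use that X grants."""
    return x_grants <= y_permits


# Invented permission labels, for illustration only.
OPEN_LICENCE = {"share", "modify", "commercial_use"}
NON_COMMERCIAL_LICENCE = {"share", "modify"}

# Non-commercial content cannot go into an openly licensed project...
print(can_include(NON_COMMERCIAL_LICENCE, OPEN_LICENCE))
# ...but openly licensed content can go into a non-commercial one.
print(can_include(OPEN_LICENCE, NON_COMMERCIAL_LICENCE))
```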

Linked data

A form of data representation where every identifier is an http://… URI, using standard lists (see vocabulary) of identifiers where possible, and where datasets include links to reference datasets of the same objects. A key aim is to make data integration automatic, even for large datasets. Linked data is usually represented using RDF. See also five stars of open data; triple store.

Machine readable

Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable.

Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable.

As another example, scans (photographs) of text are not machine-readable (but are human readable!), whereas the equivalent text in a format such as a simple ASCII text file or a text-processing format such as a Microsoft Word file is machine readable.

Note: The appropriate machine readable format may vary by type of data - so, for example, machine readable formats for geographic data may differ from those for tabular data.

Many eyes principle

If something is visible to many people then, collectively, they are more likely to find errors in it. Publishing open data can therefore be a way to improve its accuracy and data quality, especially where a good interface for reporting errors is provided. See crowdsourcing.

Marginal cost

The additional cost incurred by supplying a single copy of a resource, e.g. data. For data to be open according to the Open Definition, it must be charged for at no more than marginal cost. Where data is available for download over the internet the marginal cost will usually be zero. There may be a small marginal cost in exceptional cases, e.g. if for reasons of size the data needs to be put on a disk and posted.

Metadata

Information about a dataset such as its title and description, method of collection, author or publisher, area and time period covered, licence, date and frequency of release, etc. It is essential to publish data with adequate metadata to aid both discoverability and usability of the data.

NGO

Non-governmental organisation. NGOs are voluntary, non-profit organisations focussing on charitable work, community-building, campaigning, research, etc, making up a vital part of civil society.

Non commercial

A restriction, as part of a licence, that content cannot be freely re-used for ‘commercial’ purposes. Content or data subject to a non-commercial restriction is not open, according to the Open Definition. Such a restriction reduces economic value and causes problems with licence mixing, as well as often ruling out more than is intended (for example, it is often unclear whether educational uses are ‘commercial’). The intent of a non-commercial clause may be better captured by a share-alike requirement.

ODRA

Open Data Readiness Assessment, a framework created by the World Bank for assessing the opportunities, obstacles and next steps to be taken in a country (especially a developing country) considering publishing government data as open data.

ODbL

Open Database Licence, an attempt to create an open licence for data which covers the ‘database rights’ (see copyright) as well as copyright itself. It does this by imposing contractual obligations on the data re-user. Unfortunately contract law is fundamentally different from copyright law, since copyright is inherent in a work and binds all downstream users of the work, whereas a contract only binds the parties to the contract and has no force on a later re-user of re-published data. The ODbL remains useful nevertheless, and other attempts are being made to create open licences specifically for data.

OGP

The Open Government Partnership, a partnership of national governments launched in 2011 with the aim of promoting open government in the member countries and collaborating on multi-lateral agreements and best practice. At the time of writing (2014) there are 64 participating countries.

Open Access

The principle that access to the published papers and other results of research, especially publicly-funded research, should be freely available to all. This contrasts with the traditional model where research is published in journals which charge subscription fees to readers. Besides benefits similar to the benefits of open data, proponents suggest that it is immoral to withhold potentially life-saving and valuable research from some readers who may be able to use or build on it. Open-access journals now exist and the interest of research funders is giving them some traction, especially in the sciences.

Open Data

Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose - subject only, at most, to requirements to provide attribution and/or share-alike. Specifically, open data is defined by the Open Definition and requires that the data be (A) legally open, that is, available under an open (data) licence that permits anyone freely to access, reuse and redistribute it; and (B) technically open, that is, available for no more than the cost of reproduction and in machine-readable and bulk form.

Open Development

Open development seeks to bring the philosophy of the open movement to international development. It promotes open government, transparency of aid flows, engagement of beneficiaries in the design and implementation of development projects, and availability and use of open development data.

Open Science

The practice of science in accordance with open principles, including open access publishing, publication of and collaboration around research data as open data together with associated source code, and use and development of open source data processing tools.

Open Source

Software for which the source code is available under an open licence. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source.

Open definition

The Open Definition, first released by Open Knowledge in 2005, sets out under what conditions data and content is open. The “standard” provided by the Open Definition is crucial because much of the value of open data lies in the ease with which different sources of open data can be combined. Both legal and technical compatibility are vital, and the Open Definition ensures that openly-licensed data can be combined successfully, avoiding a proliferation of licences and terms of use for open data leading to complexity and incompatibility. As governments and organisations jostle to wear the ‘open’ label, the Open Definition ensures that the term does not lose its meaning amid the hype. Today it is the main international standard for open data and open data licences, with an advisory council of senior open data practitioners, and can be found at opendefinition.org. The expert-governed licence conformance process and recommendations for conformance have strengthened licences around the world, for example, in the revision of the UK Government’s internationally influential “Open Government Licence”. The Open Definition has also influenced and steered other communities of practice in the open movement, including open access to publicly-funded research, open hardware, and more. See open data for a summary.

Open format [Permalink]

A file format that has no restrictions, monetary or otherwise, placed upon its use and that can be fully processed with at least one free/libre/open-source software tool. Patents are a common source of restrictions that make a format proprietary. Often, but not necessarily, the structure of an open format is set out in agreed standards, overseen and published by a non-commercial expert body. A file in an open format enjoys the guarantee that it can be correctly read by a range of different software programs or used to pass information between them.

Open government [Permalink]

Open government, in line with the open movement generally, seeks to make the workings of governments transparent, accountable, and responsive to citizens. It includes the ideals of democracy, due process, citizen participation and open government data. A thorough-going approach to open government would also seek to enable citizen participation in, for example, the drafting and revising of legislation and budget-setting. See OGP.

Open movement [Permalink]

The open movement seeks to work towards solutions of many of the world’s most pressing problems in a spirit of transparency, collaboration, re-use and free access. It encompasses open data, open government, open development, open science and much more. Participatory processes, sharing of knowledge and outputs and open source software are among its key tools. The specific definition of “open” as applied to data, knowledge and content, is set out by the Open Definition.

Open standards [Permalink]

Generally understood as technical standards which are free from licensing restrictions. Can also be interpreted to mean standards which are developed in a vendor-neutral manner.

PDF [Permalink]

Portable Document Format, a file format for representing the layout and appearance of documents on a page independent of the layout software, computer operating system, etc. Originally a proprietary format of Adobe Systems, PDF has been an open format since 2008. Data in PDF files is not machine-readable; see structured data.

Privacy [Permalink]

The right of individuals to a private life includes a right not to have personal information about themselves made public. A right to privacy is recognised by the Universal Declaration of Human Rights and the European Convention on Human Rights. See data protection legislation.

Proprietary [Permalink]

(i) Proprietary software is owned by a company which restricts the ways in which it can be used. Users normally need to pay to use the software, cannot read or modify the source code, and cannot copy the software or re-sell it as part of their own product. Common examples include Microsoft Excel and Adobe Acrobat. Non-proprietary software is usually open source.

(ii) A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished, and can be changed by the company at any time. Proprietary software usually reads and saves data in its own proprietary format. For example, different versions of Microsoft Excel use the proprietary XLS and XLSX formats.

Public Sector Information [Permalink]

Information collected or controlled by the public sector.

Public domain [Permalink]

Content to which copyright does not apply, for example because it has expired, is free for any kind of use by anyone and is said to be in the public domain. CC0, one of the licences of Creative Commons, is a ‘public domain dedication’ which attempts so far as possible to renounce all rights in the work and place it in the public domain.

Publisher [Permalink]

Anyone who distributes and makes available data or other content. Data publishers include government departments and agencies, research establishments, NGOs, media organisations, commercial companies, individuals, etc.

Query [Permalink]

A type of question accepted by a database about the data it holds. A complex query may ask the database to select records according to some criteria, aggregate certain quantities across those records, etc. Many databases accept queries in the specialised language SQL or dialects of it. A web API allows an app to send queries to a database over the web. Compared with downloading and processing the data, this reduces both the computation load on the app and the bandwidth needed.
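As an illustration of a query that selects records by a criterion and aggregates across them, here is a minimal sketch using SQL through Python's built-in sqlite3 module. The departures table, its columns and its values are hypothetical, invented purely for this example.

```python
import sqlite3

# Hypothetical table of bus departures, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE departures (stop TEXT, route TEXT, minutes_late INTEGER)"
)
conn.executemany(
    "INSERT INTO departures VALUES (?, ?, ?)",
    [
        ("Main St", "12", 3),
        ("Main St", "12", 5),
        ("Main St", "40", 0),
        ("Park Rd", "12", 10),
    ],
)

# Select the records for one stop, then aggregate (average) per route.
query = """
    SELECT route, AVG(minutes_late)
    FROM departures
    WHERE stop = ?
    GROUP BY route
    ORDER BY route
"""
results = conn.execute(query, ("Main St",)).fetchall()
conn.close()

for route, average_delay in results:
    print(route, average_delay)
```

Only the aggregated rows come back from the database, so the program asking the question never has to load or process the full table itself.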

RDF [Permalink]

Resource Description Framework, the native way of describing linked data. RDF is not exactly a data format; rather, there are a few equivalent formats in which RDF can be expressed, including an XML-based format. RDF data takes the form of ‘triples’ (each atomic piece of data has three parts, namely a subject, predicate and object), and can be stored in a specialised database called a triple store.
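The three-part shape of a triple can be sketched in a few lines of Python. This is only an illustration of the subject, predicate and object structure, not an RDF library; the example.org identifiers and the Triple class are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""

    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers under example.org, used purely for illustration.
BOOK = "http://example.org/book/1"
triples = [
    Triple(BOOK, "http://example.org/vocab/title", "An Example Book"),
    Triple(BOOK, "http://example.org/vocab/author", "http://example.org/person/7"),
    Triple("http://example.org/person/7", "http://example.org/vocab/name", "A. N. Author"),
]

# Every statement about the book shares the same subject.
about_book = [t for t in triples if t.subject == BOOK]
for t in about_book:
    print(t.subject, t.predicate, t.obj)
```

Note that the object of one triple (the author's identifier) is the subject of another, which is how separate statements link together into a graph.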

Raw data [Permalink]

The original data, in machine-readable form, underlying any application, visualisation, published research or interpretation, etc.

Re-use [Permalink]

It is rare that data gathered for a particular purpose does not have other possible uses. Happily, data is an infinite resource (see tragedy of the anti-commons); once gathered, for whatever reason, it can be re-used again and again, in ways that were never envisaged when it was collected, provided only that the data-holder makes it available under an open licence to enable such re-use.

Real time [Permalink]

Data (such as the current location of trains on a network) which is constantly updated, so that a query needs to be run against the latest version of the data.

Research data [Permalink]

Experimental research in the sciences and social sciences produces large quantities of data. Research data management (RDM) is an emerging discipline that seeks best practices in handling this. Traditionally the data was kept by researchers and only final research outputs, such as papers analysing the data, would be published. Open science holds that the data should be published, both to increase verifiability of the work and to enable it to be used in other research. The full spirit of open science collaboration demands data publication early in the project, but research culture will need to change appreciably before this becomes widespread.

Resource [Permalink]

CKAN uses this term to denote one of the individual data objects (a file such as a spreadsheet, or an API) in a dataset.

SPARQL [Permalink]

A query language similar to SQL, used for queries to a linked-data triple store.

SQL [Permalink]

Structured Query Language, a standard language used for interrogating many types of database. See query.

SaaS [Permalink]

Software as a Service, i.e. a software program that runs, not on the user’s machine, but on the machines of a hosting company, which the user accesses over the web. The host takes care of associated data storage, and normally charges for the use of the service or monetises its client base in other ways.

Scraping [Permalink]

Extracting data from a non-machine-readable source, such as a website or a PDF document, and creating structured data from the result. Screen-scraping a dataset requires dedicated programming and is expensive in programmer time, so is generally done only after all other attempts to get the data in structured form have failed. Legal questions may arise about whether the scraping breaches the source website’s copyright or terms of service.
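To give a feel for the dedicated programming involved, here is a minimal sketch that scrapes a small HTML table into rows using only Python's standard html.parser module. The TableScraper class and the sample page are hypothetical, and real pages are usually far messier than this.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded web page.
html = """
<table>
  <tr><th>Stop</th><th>Next bus</th></tr>
  <tr><td>Main St</td><td>10:15</td></tr>
  <tr><td>Park Rd</td><td>10:22</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(html)
rows = scraper.rows
for row in rows:
    print(row)
```

Even this toy scraper depends on the exact markup of the page, which is one reason scraped pipelines tend to break when a site is redesigned.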

Server [Permalink]

A computer on the internet, usually managed by a hosting company, that responds to requests from a user, e.g. for web pages or downloaded files, or for access to features in a SaaS package running on the server.

Shapefile [Permalink]

A popular file format for geodata, maintained and published by Esri, a manufacturer of GIS software. A Shapefile actually consists of several related files. Though the format is technically proprietary, Esri publish a full specification and Shapefiles can be read by a wide range of software, so they function somewhat like an open standard in practice.

Share-alike License [Permalink]

A license that requires users of a work to provide the content under the same or similar conditions as the original.

Source code [Permalink]

The files of computer code written by programmers that are used to produce a piece of software. The source code is usually converted or ‘compiled’ into a form that the user’s computer can execute. The user therefore never sees the original source code, unless it is published as open source.

Spreadsheet [Permalink]

A table of data and calculations that can be processed interactively with a specialised spreadsheet program such as Microsoft Excel or OpenOffice Calc.

Standard [Permalink]

A published specification for, e.g., the structure of a particular file format, recommended nomenclature to use in a particular domain, a common set of metadata fields, etc. Conforming to relevant standards greatly increases the value of published data by improving machine readability and easing data integration.

Structured data [Permalink]

All data has some structure, but ‘structured data’ refers to data where the structural relation between elements is explicit in the way the data is stored on a computer disk. XML and JSON are common formats that allow many types of structure to be represented. The internal representation of, for example, word-processing documents or PDF documents reflects the positioning of entities on the page, not their logical structure, which is correspondingly difficult or impossible to extract automatically.
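A short sketch of what explicit structure buys you: because the JSON below states which values belong to which record, a program can pull out exactly the elements it wants. The field names and values are hypothetical.

```python
import json

# Hypothetical JSON document: one stop with a nested list of departures.
text = """
{
  "stop": "Main St",
  "departures": [
    {"route": "12", "time": "10:15"},
    {"route": "40", "time": "10:22"}
  ]
}
"""

record = json.loads(text)

# The nesting is explicit, so no guessing from page layout is needed.
routes = [departure["route"] for departure in record["departures"]]
print(record["stop"], routes)
```

Recovering the same list from a PDF timetable would instead require inferring which pieces of text sit in the same row or column.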

Tab-separated values [Permalink]

Tab-separated values (TSV) are a very common form of text file format for sharing tabular data. The format is extremely simple and highly machine-readable.
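As a minimal sketch of how simple the format is to process, the following reads TSV text with Python's standard csv module by setting the delimiter to a tab. The column names and values are hypothetical.

```python
import csv
import io

# Hypothetical TSV content: a header line, then one record per line,
# with fields separated by tab characters.
tsv_text = "stop\troute\ttime\nMain St\t12\t10:15\nPark Rd\t40\t10:22\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["stop"], row["route"], row["time"])
```

With a real file, the io.StringIO wrapper would be replaced by an opened file object.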

Tragedy of the anti-commons [Permalink]

The well-known tragedy of the commons occurs when a common resource, such as grazing land, is degraded through over-use. Effectively, users are treating a limited resource as if it were limitless, owing to a poor incentive structure. The economist Michael Heller coined the term ‘tragedy of the anti-commons’ to describe the opposite failure, where poor incentives lead to under-use of an abundant or limitless resource. The case of data which is unpublished or charged for at above marginal cost is a prime example, data being in fact a limitless resource.

Transparency [Permalink]

Governments and other organisations are said to be transparent when their workings and decision-making processes are well-understood, properly documented and open to scrutiny. Transparency is one of the aspects of open government. An increase in transparency is one of the benefits of open data.

Transport data [Permalink]

Public transport routes, timetables and real time data are valuable but difficult candidates for open data. Even when they are published, data from different transit authorities and companies may not be available in compatible formats, making it difficult for third parties to provide integrated transport information. Many transport authorities distribute public transport data using the General Transit Feed Specification (GTFS), which was originally developed by Google. Work on standardisation and more open data is ongoing in the sector.

Triple store [Permalink]

The ‘triples’ of RDF data can be stored in a specialised database, called a triple store, against which queries can be made in the query language SPARQL.
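The idea of querying triples by pattern can be sketched with a toy in-memory store. This is not SPARQL and not a real triple store, which is far more capable; the TinyTripleStore class, its match method and the ex: identifiers are all hypothetical, and None is used as a wildcard in place of a query variable.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """Toy in-memory store for (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return matching triples, sorted; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t
            for t in self._triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


store = TinyTripleStore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:alice", "ex:worksFor", "ex:acme")
store.add("ex:bob", "ex:knows", "ex:carol")

# "Who knows whom?": fix the predicate, leave subject and object open.
who_knows_whom = store.match(predicate="ex:knows")
for triple in who_knows_whom:
    print(triple)
```

A query language such as SPARQL expresses the same kind of pattern declaratively and can combine several patterns in a single query.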

URI / URL [Permalink]

Uniform Resource Identifier / Uniform Resource Locator. A URL is the http://… web address of some page or resource. When a URL is used in linked data as the identifier for some object, it is not strictly a locator for the object (e.g. http://dbpedia.org/page/Paris is the location of a document about Paris, but not of Paris itself), so in this context it is referred to as a URI.

Unconference [Permalink]

A meeting, similar to a conference, but with no agenda fixed in advance. Using various established techniques, participants jointly agree on the day what sessions will run. Some more traditional conference sessions with invited speakers may also be included. A popular format among the tech community, an unconference can be combined with or run alongside a hackathon based on open data. It is a possible method of community engagement by data publishers.

Unique identifier [Permalink]

(or UID): An identifier for an object which is guaranteed to be different from identifiers of all other objects in a collection. Within a database, every object will have a UID that is unique within the database. A UID assigned by a central registry (such as an ISBN for books, or a DOI for data) will be unique for all objects for which it is assigned. The http://… identifiers of linked data provide a technique for guaranteeing UIDs without a central authority.

Visualisation [Permalink]

A visual representation of data is often the most compelling way of communicating the data, bringing out its key features, correlations and outliers. Though many tools exist, creating a visualisation for a dataset is not an automatic process, but requires careful attention to the meaning of the variables, the relations between them and the stories inherent in the data, to design a visual representation that lets the message of the data shine through.

Vocabulary [Permalink]

A standard specifying the identifiers to be used for a particular collection of objects. Using standard vocabularies where they exist is key to enabling data integration. Linked data is rich in vocabularies in different topic areas.

Web [Permalink]

The World Wide Web, the vast collection of interlinked and linkable documents and services accessible via ‘web browsers’ over the Internet.

Web API [Permalink]

An API that is designed to work over the Internet.

XLS(X) [Permalink]

A proprietary spreadsheet format, the native format of the popular Microsoft Excel spreadsheet package. Older versions use .xls files, while more recent ones use the XML-based .xlsx variant.

XML [Permalink]

Extensible Markup Language, a simple and powerful standard for representing structured data.

de-identification [Permalink]

A form of anonymisation where personal records are kept intact but specific identifying information, such as names, is replaced with anonymous identifiers. Compared to aggregation, de-identification carries a greater risk of data leakage: for example, if prison records include a prisoner’s criminal record and medical history, the prisoner could in many cases be identified even without their name by their criminal record, giving unauthorised access to their medical history. In other cases this risk is absent, or the value of the un-aggregated data is so great that it is worth making de-identified data available subject to carefully designed safeguards.
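The replacement step can be sketched as follows. This is a minimal illustration, not a safe anonymisation procedure: the de_identify function, the field names and the sample records are hypothetical, and the fields left in place may still allow re-identification, as described above.

```python
import itertools


def de_identify(records, id_field="name"):
    """Replace the identifying field with a consistent anonymous identifier.

    The same person always receives the same identifier, so their
    records stay linked, but the original name is dropped.
    """
    counter = itertools.count(1)
    pseudonyms = {}
    result = []
    for record in records:
        name = record[id_field]
        if name not in pseudonyms:
            pseudonyms[name] = f"person-{next(counter)}"
        cleaned = {key: value for key, value in record.items() if key != id_field}
        cleaned["id"] = pseudonyms[name]
        result.append(cleaned)
    return result


# Hypothetical records, invented for this example.
records = [
    {"name": "A. Example", "visits": 2},
    {"name": "B. Sample", "visits": 1},
    {"name": "A. Example", "visits": 5},
]

anonymous = de_identify(records)
for row in anonymous:
    print(row)
```

The mapping from names to identifiers is discarded when the function returns; keeping it anywhere would let the process be reversed.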

dimension [Permalink]

An ordinary table or spreadsheet can easily represent two data dimensions: each data point has a row and a column. Plenty of real-world data has more dimensions, however: for example, a dataset of Earth surface temperature varying with position and time (two co-ordinates are required to specify the position on Earth, e.g. latitude and longitude, and one to specify the time).
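One common way to handle more than two dimensions, sketched below, is to index each value by all of its co-ordinates and, when a flat table is needed, to give each dimension its own column. The temperature readings here are hypothetical values made up for the example.

```python
# Hypothetical readings: each temperature is indexed by three
# dimensions (latitude, longitude, date).
temperature = {
    (51.5, -0.1, "2020-01-01"): 7.0,
    (51.5, -0.1, "2020-01-02"): 6.5,
    (48.9, 2.4, "2020-01-01"): 5.0,
}

# Look up a single data point by giving all three co-ordinates.
reading = temperature[(51.5, -0.1, "2020-01-02")]

# Flatten into an ordinary table: one row per data point, with one
# column per dimension plus a final column for the value.
rows = [
    (lat, lon, date, temp)
    for (lat, lon, date), temp in sorted(temperature.items())
]
for row in rows:
    print(row)
```

The flat form fits in a spreadsheet or TSV file, at the cost of repeating the co-ordinate values on every row.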