Big Data Dictionary
Data Word Soup — What Does It All Mean?
Rikus Combrinck, October 2017
The explosive growth in data science and machine learning has burdened the buzzword bandwagon with confusing terminology from a number of related disciplines. The mad rush of marketing departments and journalists to capitalise on novel terms has turned an already complex subject into a quagmire of confusion and miscommunication. This is an attempt to bring a measure of clarity to a selection of terminology in the fields of data science, artificial intelligence, and machine learning.
The Venn diagram below depicts approximate relationships among popular terms.
The diagram will change depending on the exact type of relationship one is emphasising, e.g. historical origin, culture, technical foundations, or functional dependency. In addition, these terms are used somewhat differently in various domains and contexts. With these disclaimers in mind, here are some brief descriptions, rather than watertight definitions, of a selection of relevant terms.
Business intelligence — Concerned with the collection, analysis, and presentation of business information. It provides historical, current, diagnostic, and predictive views of business operations. It is an older, fairly broad term that originated in the business environment. It emphasises the business-oriented, rather than the technical, aspects of extracting strategically relevant information from business data. In practice, most BI tools are fairly simple and labour-intensive in that they depend on, and are limited by, human understanding and insight. Traditional approaches are overwhelmed by the sheer volume and high dimensionality of contemporary data streams. Usage of the term has been in slow but steady decline for a long time.
Predictive analytics — Technically, this is the application of mathematical models to predict selected variables from historical data. Using this definition, it would include much of the current machine learning technology. In practice, historically, the term has implied a relatively sophisticated set of techniques typically employed by large organisations using expensive software tools in a small niche market dominated by IBM and other software providers. The primary context in this area has been that of business intelligence.
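As a minimal illustration of "applying a mathematical model to predict a selected variable from historical data", the sketch below fits a straight line to a handful of past observations by ordinary least squares and extrapolates one step ahead. The data, the variable names, and the functions fit_line and predict are hypothetical and chosen only for illustration; real predictive analytics tools use far richer models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: month number versus units sold.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(months, sales)
forecast = predict(model, 6)  # predicted sales for month 6
print(forecast)
```

Here the form of the model (a straight line) is assumed in advance, and only its two parameters are estimated from the data.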
Machine learning — Subsuming conventional predictive analytics, machine learning functionally expands its scope to include techniques that can handle unstructured data (e.g. free text, images, video, speech). Data-driven techniques are emphasised, i.e. nonparametric models that make few or no assumptions about the underlying form of the model. It has been made practical in recent years by (1) an unprecedented flood of data, (2) advances in processing power, and (3) better algorithms. Theoretically, the term covers any regression, classification, clustering, or mapping model that is built or adapted using observational data. In practice, the field is currently dominated by deep learning.
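To make the idea of a nonparametric, data-driven model concrete, here is a small sketch of a k-nearest-neighbour classifier, one of the simplest such techniques. It assumes no functional form at all: a new point is labelled by a majority vote among the stored observations closest to it. The function knn_predict, the observations, and the labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where features is a
    tuple of numbers with the same length as `query`.
    """
    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))

    # Sort the stored observations by distance to the query point.
    by_distance = sorted(train, key=lambda item: squared_distance(item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical observations: two numeric features and a class label.
observations = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((4.0, 4.2), "high"),
    ((4.1, 3.9), "high"),
    ((3.8, 4.0), "high"),
]

near_low = knn_predict(observations, (1.1, 1.0))
near_high = knn_predict(observations, (4.0, 4.0))
print(near_low, near_high)
```

The "model" here is simply the stored data, which is why such methods tend to improve as more observations become available.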
Deep learning — A relatively new (2006) set of artificial neural network (ANN) architectures and associated training algorithms that dramatically improve on conventional ANN performance. It is an extremely powerful and popular machine learning technique and has managed to break through the complexity barriers that have stymied conventional neural networks for decades. The most significant drawback is that it is extremely data hungry. Taxonomically it is a small part of machine learning, but represents the bulk of new practical applications, including state-of-the-art face recognition, object recognition, speech recognition, and machine translation.
Pattern recognition — A term from the AI world that hasn't made it onto the buzzword bandwagon. It concerns the initial phase in an intelligent system that interacts with the environment, where raw sensory input is converted into symbolic or conceptual form, i.e. something like image recognition or speech recognition. The extraordinary advances brought about by deep learning are mostly instances of pattern recognition.
Big data — Refers both to enormous datasets and to the technology for dealing with them, be it in traditional business environments (like the financial transaction data of a multinational retailer) or newer machine learning contexts (like face recognition on Facebook photographs). Relevant aspects include (1) creating, editing, querying, and maintaining the integrity of large databases, and (2) creating algorithms that can efficiently and effectively utilise the data.
Data science — A broad term referring to building data-centric models and applications. Data scientists are the primary creators and users of machine learning technology. Data science is wider than machine learning, though, in that it includes support technologies like storage, management, and efficient access of large data stores (i.e. big data), as well as the application development required to deploy machine learning models.
Cognitive computing — Strongly connected to machine learning, the term is reserved for systems that concentrate on processing unstructured data. This includes audio, video, images, and natural language. A significant subset are systems with conversational interfaces, i.e. virtual agents or chatbots. The field includes all sorts of advanced searching and indexing, concept discovery, document classification, rudimentary language understanding, question answering, etc. IBM has appropriated the term and in practice it is mostly encountered in connection with their Watson technology suite.
Question answering — A class of technologies that allows one to query a knowledge base in natural language.
Artificial intelligence — A very general term that has endured several waves of terminology inflation over the last few decades. As a result, people who are active in the field tend to avoid it in favour of more specific terms. It originated in an academic environment where the aim was to replicate human-level, human-like cognitive abilities in a machine. Research specifically aimed at establishing broad, human-level machine intelligence currently goes under the term artificial general intelligence (AGI) and is limited to a handful of small research groups globally.
The graph depicts relative volume of Google search queries for some of the terms over the last thirteen years.
The term business intelligence, for example, has been in steady decline for many years. Big data follows an S-curve, growing rapidly between 2011 and 2015 and then reaching a plateau as awareness and technology matured.
Artificial intelligence displays a bathtub curve, with the effect of the most recent AI winter clearly visible. Around 2013, interest started picking up again, and usage of the term has grown exponentially since. The same is true for data science and machine learning, but usage of these terms has already far surpassed historical usage. They have become the primary terms used to refer to both a new wave of technology and an associated data-centric culture that is carrying us into the future.
Rikus Combrinck is a machine learning expert and a Senior Data Scientist at OLSPS Analytics. OLSPS Analytics will be presenting on business solutions using predictive analytics across a diverse range of industries. With a focus on machine learning, OLSPS has taken the lead in predictive analytics consulting within sub-Saharan Africa.
For more information, please visit our website at: http://www.olspsanalytics.com/