ABIS Infor - 2016-11

Big Data and Analytics - ABC (part 2)

Arnout Veugelen (ABIS) – 15 November 2016

Abstract

Lately, Big Data and Analytics have been particularly fashionable words in IT. New fashion comes with new terminology, and before you know it, you are at a loss for words. From now on, this will no longer be the case. Thanks to our Big Data alphabet, you will impress your boss and colleagues during meetings and coffee breaks.

In our previous edition we presented part 1 (A–K); this time we add part 2 (L–Z).

Aggregation – Collecting and summarizing information, in preparation for analysis.

Analytics – Discovering, interpreting and communicating relevant insights in data.

AWS – Amazon Web Services: a collection of cloud services by Amazon, including relational databases, NoSQL databases (such as DynamoDB), a Hadoop implementation, machine learning services etc.

Behavioural Analytics – Using data to gain insights into human behaviour.

BI – Business Intelligence: the theories, techniques and tools used to acquire and process data into valuable business information.

Big Data – Often defined by 3 'V's: Volume, Variety and Velocity. Big Data means working with great amounts of data of all sorts, which is acquired at great speed and typically has to be analysed quickly (often real-time).

Cassandra – An open source NoSQL database, originally developed at Facebook and now an Apache project.

Cloud Computing – Using internet services for certain tasks (e.g. data storage or processing), rather than a local server.

Cloudera – American company that provides Hadoop-related software.

Cold Data Storage – Storage of 'old data' that rarely needs to be accessed. Such an archive can use a compact file format and relatively cheap servers, in exchange for slower processing.

Column-oriented Database or Columnar Database – A Database Management System based on columns. Traditional relational databases are usually row-based. For example, in a table with company information, every company will have its own row, containing the company's name, address, telephone number ... A columnar database stores the data in columns: all company names, all addresses etc. This can lead to better compression and column operations (calculating averages, sums ... ) will be faster.
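
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not how a real DBMS stores data on disk; the company records and the function names are made up for the example.

```python
# Hypothetical company records, invented for this illustration.
rows = [
    {"name": "Acme", "city": "Leuven", "employees": 120},
    {"name": "Globex", "city": "Ghent", "employees": 80},
    {"name": "Initech", "city": "Brussels", "employees": 40},
]

# Row-oriented layout: one record per company (the list above).
# Column-oriented layout: one list per attribute.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def average_row_store(records, field):
    # Every full record has to be visited to pick out a single field.
    return sum(record[field] for record in records) / len(records)


def average_column_store(cols, field):
    # Only the one column of interest is read.
    values = cols[field]
    return sum(values) / len(values)
```

Both functions return the same average; the columnar version simply touches less data, which is where the speed-up for column operations comes from.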

Confabulation – Psychiatric term. In a data context: the act of making an already made decision appear to be based on data-analysis.

Data Exhaust – Data that is produced as a byproduct of digital activities: log files, cookies, click streams, temporary files etc. This data can reveal a lot of information about a person, and is used eagerly for marketing purposes.

Data Governance – A set of processes and rules meant to ensure the proper management of data: availability, security, privacy ...

Data Science – A general term spanning the different disciplines used to obtain insight from data. It incorporates statistics, visualisation, data mining, machine learning etc.

Data Scientist – Excellent job title on a business card, also popular in job advertisements.

Data Virtualization – Making data available to an application, without requiring all technical details (such as the data's physical location).

Data Warehouse – A central repository of data, used as a base for analysis and reporting. Typically, the data is extracted from various sources and transformed to adhere to the required structure (ETL).

Database – An organised collection of interrelated data elements, that can be easily processed by one or more applications.

DBMS – Database Management System: a tool for data management which serves as a buffer between user and database.

Document-oriented Database or Document Store – A type of NoSQL database which uses semi-structured data, like XML or JSON documents.

ETL – Extract, Transform and Load: the classic method used to make data available for a database or data warehouse. The data is retrieved from various sources, then it is transformed into the appropriate format, and subsequently it is loaded. Many Big Data solutions consider this process to be too cumbersome.
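
The three steps can be sketched as a tiny Python pipeline. This is a minimal example under stated assumptions: the CSV content, the table name `company` and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs or databases.
RAW_CSV = "name,revenue_eur\n acme ,1200\nGlobex,  950\n"


def extract(text):
    # Extract: read the raw records from the source.
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    # Transform: tidy up the names and convert revenue to an integer.
    return [(r["name"].strip().title(), int(r["revenue_eur"])) for r in records]


def load(rows, connection):
    # Load: write the cleaned rows into the target table.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS company (name TEXT, revenue_eur INTEGER)"
    )
    connection.executemany("INSERT INTO company VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
result = connection.execute(
    "SELECT name, revenue_eur FROM company ORDER BY name"
).fetchall()
```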

Exabyte – (1 followed by 18 zeros - bytes), or 1 million terabytes. In 2013 it was estimated that Google had 15 exabytes of data in its data centres.

Fog Computing – Decentralising computer infrastructure to optimise cloud services. For example, a service provider can use servers close to the customer, to make transport faster and more efficient.

Graph Database – A database in which the connections between the data elements are an essential component of the data model. Some relational databases provide this feature, but it is more common in certain NoSQL databases.

Grid Computing – Combining computer resources from multiple locations to reach a common goal.

Hadoop – An open source framework developed by Apache used to store and process very large amounts of data, distributed over clusters of multiple computers. By using many machines in parallel, there is no need for specialised (expensive) hardware. Hadoop's core features are the file system HDFS and the programming model MapReduce.

HBase – A NoSQL database of the key-value (wide-column) type, modelled on Google's Bigtable. It runs on top of HDFS and is part of the Hadoop ecosystem.

HDFS – Hadoop Distributed File System: Hadoop's central file system. Very large files (typically in the range of terabytes) are spread across multiple machines. Since HDFS provides redundancy, it is not a problem when certain machines are unavailable.

Hive – Software which can be used on top of Hadoop. It makes it possible to write queries in HQL (Hive Query Language), an SQL-like language. Originally developed by Facebook, now used by Netflix and many other companies.

Impala – A query-engine for Hadoop, developed by Cloudera. Can be considered a competitor of Hive, with a focus on performance.

In-database Analytics – Integrating analytics into the DBMS or data warehouse, instead of using a separate analytics-environment.

In-memory Database – A database management system that primarily uses main memory (RAM) for data storage instead of the hard disk.

IoT – Internet of Things: connecting all sorts of devices (refrigerators, traffic lights, windmills ...) to the internet. Sensors collect data (e.g. about the device's energy consumption), which can be processed elsewhere.

JSON – JavaScript Object Notation: a data format. It uses semi-structured text to build data objects consisting of one or more pairs of attributes and values.
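
A short Python sketch shows what such a document looks like and how a program reads it. The company document below is hypothetical; only the standard `json` module is used.

```python
import json

# Hypothetical document, invented for this illustration.
text = '{"name": "Acme", "employees": 120, "offices": ["Leuven", "Ghent"]}'

# Parsing turns the text into nested Python objects (dict, list, int, str).
company = json.loads(text)

# Attributes are looked up by name; values can themselves be structured.
first_office = company["offices"][0]

# Serialising goes the other way, from objects back to text.
round_trip = json.dumps(company, sort_keys=True)
```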

Key Value Store – A type of NoSQL database in which every value is stored under a unique key; the values themselves do not need a fixed structure.

Load balancing – Distributing the workload as efficiently as possible over multiple computers.

Log files – In these files data is gathered automatically during operations. Log files are a typical source of big data.

Machine Learning – Algorithms and techniques that allow a computer to learn while operational, without the need to explicitly program this new knowledge or functionality.

Mahout – A framework and development environment that allows you to build machine learning-applications applicable in a big data-context.

MapReduce – Hadoop's processing component. First, the input-data of the operation is split into independent parts, which are processed in parallel (the Map-phase). Then, all these intermediate results are combined and processed together into the end result (the Reduce-phase).
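
The classic illustration is counting words. The Python sketch below mimics the two phases on a single machine; it is not Hadoop code, it runs sequentially rather than in parallel, and the function names and input chunks are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    # Map: each chunk is processed independently into (key, value) pairs.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    # Group the intermediate values by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Reduce: combine the values of each key into one result.
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already split into independent parts.
chunks = ["big data big", "data analytics"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

On a real cluster, each call to `map_phase` could run on a different machine, because no chunk depends on another.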

Massively Parallel Processing (MPP) – Complex or large problems can be handled more efficiently by using multiple processors or computers at the same time.

Matlab (Matrix Laboratory) – A software environment and programming language for all sorts of mathematical uses, with extensive visualisation possibilities.

Metadata – Data that holds information about other data: file size, author, time stamps ...

MongoDB – An open source document-oriented database, often used for big data. The data is stored as JSON-like documents.

Multithreading – The ability of one processor (core) to execute multiple threads concurrently.

NoSQL – Used to mean 'non-SQL', nowadays usually explained as 'Not Only SQL': a term referring to a group of databases that don't adhere strictly to the relational theory (in contrast with relational (SQL-) databases). A more flexible approach towards permanent consistency of all data makes it easier to distribute a NoSQL database over multiple computers. Well-known examples are Cassandra, HBase, Couchbase and MongoDB.

Object (Oriented) Database – A NoSQL database that represents the data as objects, similar to object-oriented programming languages. This allows for a smooth integration between database and program.

Online Analytical Processing (OLAP) – A term from the world of Data Warehousing: obtaining and analysing information from different points of view (multidimensional), for instance when you want to analyse sales numbers per product and per region at the same time.

Outlier Detection – Outliers are observations that differ greatly from most other observations. The presence of such outliers can indicate that something unusual is going on, and therefore it is important to detect them.
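
One simple way to flag outliers is the z-score: how many standard deviations an observation lies from the mean. The sketch below assumes this rule and a threshold of 2; both are illustrative choices rather than the only valid ones, and the readings are made up.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    # Flag values lying more than `threshold` standard deviations from the mean.
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(readings)
```

Note that extreme values also inflate the mean and standard deviation themselves, so more robust rules (based on the median, for example) are often preferred in practice.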

Petabyte – (1 followed by 15 zeros - bytes), or one million gigabytes. The human brain can store approximately 2.5 petabytes of memories.

Pig – A platform for developing Hadoop-programs, with its own language: Pig Latin.

Predictive Analytics – Using statistics, (big) data analysis, machine learning ... to predict events in the future, based on the past.

Privacy – An important (and often neglected) concern in a world where more and more (personal) data is gathered all the time.

Python – An open source programming language, supporting multiple programming paradigms (Object Oriented, imperative, functional). The focus is on code readability and ease of programming. Thanks to a number of specialised libraries, Python became a very popular language for data-analytics.

Qlik – Data visualisation software. Their QlikView product allows you to build dashboards that can be used throughout the company; Qlik Sense offers DIY visualisations for end users.

Query – A request to obtain information from a database or another information system.

R – An open source programming language and software-tool, very popular for statistical and visual analytics.

Real-time Data – Data that are processed and analysed immediately (often within less than a second) after they emerge. This way, a system can intervene immediately.

Relational Database – A database that is organised in accordance with the relational model. The data is represented as tables (relations): each row contains a record, each column a specific attribute. Logical links can be defined between these tables, which are to be maintained at all times. Relational databases are usually managed and queried with the SQL language. Well-known examples include Oracle, DB2, SQL Server and MySQL.
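
The "logical links" between tables can be tried out with SQLite, which ships with Python. This is a minimal sketch: the tables `company` and `employee` and their contents are hypothetical, and the helper `try_insert_orphan` exists only for this example. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " company_id INTEGER NOT NULL REFERENCES company(id))"
)
conn.execute("INSERT INTO company VALUES (1, 'Acme')")
conn.execute("INSERT INTO employee VALUES (1, 'An', 1)")


def try_insert_orphan(connection):
    # An employee pointing at a non-existent company breaks the link.
    try:
        connection.execute("INSERT INTO employee VALUES (2, 'Bob', 99)")
        return True
    except sqlite3.IntegrityError:
        return False


# The link lets us combine both tables in one query.
joined = conn.execute(
    "SELECT e.name, c.name FROM employee e JOIN company c ON c.id = e.company_id"
).fetchall()
```

The database itself refuses the orphan row, which is what "maintained at all times" means in practice.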

Radio Frequency Identification (RFID) – A wireless sensor technology, nowadays found in many objects: library books, access badges ... Combined with all sorts of sensor data, they are another typical source of big data.

SAS (Statistical Analysis System) – A software environment and programming language, especially suited for data analytics, statistics etc.

Sentiment Analysis – Trying to find out someone's mood through algorithms, mostly by analysing texts (such as e-mails), usually for marketing reasons. Also called opinion mining.

Smart Home, Smart Grid, Smart City etc. – It is believed that, by gathering and processing all sorts of data, houses, power grids, cities ... can be managed more efficiently.

Spark – Just like Hadoop, Spark is an open source framework from the Apache Foundation that can be used to process large volumes of data, by using clusters of computers. Spark's emphasis is speed.

SQL – Structured Query Language: the de facto standard language for communication with relational databases.

Statistics – A basic skill for every aspiring Data Scientist!

Tableau – Data visualisation software; the drag-and-drop design allows it to be used by non-IT-specialists as well.

Terabyte – (1 followed by 12 zeros - bytes). The hard disk of a typical personal computer nowadays has a capacity of a couple of terabytes.

Text Analytics (Text Mining) – Using algorithms to retrieve relevant information from text sources.

Unstructured Data – Data that isn't structured in a fixed way can still contain useful information, like the content of e-mails. Besides unstructured and structured data, we can use semi-structured data as well, such as XML- and JSON-documents.

Volume, Variety and Velocity – The 3 Vs traditionally used to define big data. A fourth V, Veracity, is often used to stress the importance of data quality. A fifth V could be Value (that you generated through your amazing data science skills) and creative minds often come up with yet more V-words.

Visualisation – The graphical representation of data and information. Excellent visualisation skills are necessary to present data and information in a convincing way.

Wrangling – Transforming unprocessed ('raw') data into a suitable format for analysis.

XML – Extensible Markup Language: a markup language (like HTML) that can be used to store data in a readable yet structured format. The universal text format allows us to exchange data in a language- and system-neutral way. There are specific XML-based document-oriented databases, but many traditional databases support XML objects as well.

Yarn (Yet Another Resource Negotiator) – Hadoop's job scheduler: while MapReduce parallelises algorithms on a logical level, Yarn distributes the workload over the available machines.

Yottabyte – (1 followed by 24 zeros - bytes), or 1000 zettabytes. It is the largest unit commonly used for data size.

Zettabyte – (1 followed by 21 zeros - bytes). In order to store 1 zettabyte of data, you would need a billion 1 TB hard drives.
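
The arithmetic behind these units is easy to check in Python. The sketch below uses the decimal definitions given in this alphabet (powers of 1000); the dictionary `UNITS` and the function `drives_needed` are names chosen for the example.

```python
# Decimal (power-of-ten) sizes in bytes, as used in the definitions above.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def drives_needed(total_bytes, drive_bytes=UNITS["terabyte"]):
    # Round up, since a partly filled drive still counts as one drive.
    return -(-total_bytes // drive_bytes)


zettabyte_in_drives = drives_needed(UNITS["zettabyte"])
petabyte_in_gigabytes = UNITS["petabyte"] // UNITS["gigabyte"]
```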

Want to Know More?

If you are looking for in-depth knowledge, we'd love to welcome you to one of our courses. This fall, we greatly expanded our Big Data and Analytics curriculum, with, among others, new courses in Statistics, Spark, MongoDB, R and Python. Make sure to explore our full course range.