
ABIS Infor - 2016-06

Big Data and Analytics - ABC (part 1)

Arnout Veugelen (ABIS) – 13 June 2016

Abstract

Lately, Big Data and Analytics have been particularly fashionable words in IT. New fashion comes with new terminology, and before you know it, you are at a loss for words. Thanks to our Big Data alphabet, this will no longer be the case. From now on, you will impress your boss and colleagues during meetings and coffee breaks.

In this edition we present part 1 (A–K).

Aggregation – Collecting and summarizing information, in preparation for analysis.
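As a minimal sketch of what aggregation can look like in practice, the snippet below summarizes a handful of raw records into a count and a total per group. The `sales` data and the `aggregate` function are made up for illustration only.

```python
from collections import defaultdict

# Hypothetical raw records: (region, sale amount)
sales = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)]

def aggregate(records):
    """Summarize raw records into a count and a total per region."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for region, amount in records:
        summary[region]["count"] += 1
        summary[region]["total"] += amount
    return dict(summary)

totals = aggregate(sales)
print(totals)
```

The summarized `totals` are far smaller than the raw records, which is the point: analysis then works on the summary rather than on every individual record.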

Analytics – Discovering, interpreting and communicating relevant insights in data.

AWS – Amazon Web Services: a collection of cloud services by Amazon, including relational databases, NoSQL databases (e.g. DynamoDB), a Hadoop implementation, machine learning services etc.

Behavioural Analytics – Using data to gain insight into human behaviour.

BI – Business Intelligence: the theories, techniques and tools used to acquire and process data into valuable business information.

Big Data – Often defined by 3 'V's: Volume, Variety and Velocity. Big Data means working with great amounts of data of all sorts, which is acquired at great speed and typically has to be analysed quickly (often real-time).

Cassandra – An open source NoSQL database, originally developed at Facebook and now an Apache project.

Cloud Computing – Using internet services for certain tasks (e.g. data storage or processing), rather than a local server.

Cloudera – American company that provides Hadoop-related software.

Cold Data Storage – Storage of 'old data' that rarely needs to be accessed. Such an archive can use a compact file format and relatively cheap servers, in exchange for slower processing.

Column-oriented Database or Columnar Database – A Database Management System based on columns. Traditional relational databases are usually row-based. For example, in a table with company information, every company will have its own row, containing the company's name, address, telephone number ... A columnar database stores the data in columns: all company names, all addresses etc. This can lead to better compression, and column operations (calculating averages, sums ...) will be faster.
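The difference between the two layouts can be sketched in a few lines. The company data below is invented, and `to_columns` is a hypothetical helper, not the storage format of any real DBMS.

```python
# Row-oriented layout: one record per company (hypothetical data)
rows = [
    {"name": "Acme", "city": "Leuven", "employees": 120},
    {"name": "Globex", "city": "Ghent", "employees": 80},
    {"name": "Initech", "city": "Leuven", "employees": 40},
]

def to_columns(records):
    """Pivot a list of row records into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# A column operation only needs to read the one column it works on
average_employees = sum(columns["employees"]) / len(columns["employees"])
print(columns["city"], average_employees)
```

Values of the same kind now sit next to each other (note the repeated city), which hints at why a columnar layout tends to compress well.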

Confabulation – Psychiatric term. In a data context: the act of making a decision that has already been made appear to be based on data analysis.

Data Exhaust – Data that is produced as a byproduct of digital activities: log files, cookies, click streams, temporary files etc. This data can reveal a lot of information about a person, and is used eagerly for marketing purposes.

Data Governance – A set of processes and rules intended to ensure the proper management of data: availability, security, privacy ...

Data Science – A general term spanning the different disciplines used to obtain insight from data. It incorporates statistics, visualisation, data mining, machine learning etc.

Data Scientist – Excellent job title on a business card, also popular in job advertisements.

Data Virtualization – Making data available to an application, without requiring all technical details (such as the data's physical location).

Data Warehouse – A central repository of data, used as a base for analysis and reporting. Typically, the data is extracted from various sources and transformed to adhere to the required structure (ETL).

Database – An organised collection of interrelated data elements that can be easily processed by one or more applications.

DBMS – Database Management System: a tool for data management which serves as a buffer between user and database.

Document-oriented Database or Document Store – A type of NoSQL database which uses semi-structured data, like XML or JSON documents.

ETL – Extract, Transform and Load: the classic method used to make data available for a database or data warehouse. The data is retrieved from various sources, then it is transformed into the appropriate format, and subsequently it is loaded. Many Big Data solutions consider this process to be too cumbersome.
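The three ETL steps can be sketched with Python's standard library alone. Everything here is an assumption made for illustration: the CSV text, the `company` table and the cleaning rules are invented, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be a file, an API or another database
SOURCE_CSV = "name,revenue_eur\n acme ,1200\nglobex,not available\ninitech,300\n"

def extract(text):
    """Extract: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: tidy up names, convert revenue to a number, skip unusable rows."""
    cleaned = []
    for record in records:
        try:
            revenue = float(record["revenue_eur"])
        except ValueError:
            continue  # revenue is not numeric, so the row is dropped
        cleaned.append((record["name"].strip().title(), revenue))
    return cleaned

def load(rows, connection):
    """Load: write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS company (name TEXT, revenue_eur REAL)"
    )
    connection.executemany("INSERT INTO company VALUES (?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
loaded = connection.execute(
    "SELECT name, revenue_eur FROM company ORDER BY name"
).fetchall()
print(loaded)
```

Even in this toy version, the transform step has to know the target structure in advance, which gives a feel for why the process is considered cumbersome at large scale.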

Exabyte – 10^18 bytes (a 1 followed by 18 zeros), or 1 million terabytes. In 2013 it was estimated that Google had 15 exabytes of data in its data centres.
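The arithmetic behind these figures can be checked in a few lines, assuming decimal (SI) units in which a terabyte is 10^12 bytes. The constant names are chosen for this example only.

```python
# Decimal (SI) units are assumed: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes
BYTES_PER_TERABYTE = 10**12
BYTES_PER_EXABYTE = 10**18

terabytes_per_exabyte = BYTES_PER_EXABYTE // BYTES_PER_TERABYTE

# The 15 exabytes mentioned in the estimate above, expressed in terabytes
estimate_in_terabytes = 15 * terabytes_per_exabyte

print(terabytes_per_exabyte, estimate_in_terabytes)
```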

Fog Computing – Decentralising computer infrastructure to optimise cloud services. For example, a service provider can use servers close to the customer, to make transport faster and more efficient.

Graph Database – A database in which the connections between the data elements are an essential component of the data model. Some relational databases provide this feature, but it is more common in certain NoSQL databases.
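To illustrate why connections matter, here is a minimal sketch that keeps a small graph in a plain Python dictionary and follows its edges to find the shortest path between two people. The `follows` data and the `shortest_path` function are hypothetical; a real graph database offers its own query language for this kind of traversal.

```python
from collections import deque

# Hypothetical "who follows whom" graph, stored as adjacency lists
follows = {
    "ann": ["bob", "cara"],
    "bob": ["dan"],
    "cara": ["dan"],
    "dan": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of connections, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(follows, "ann", "dan"))
```

In a row-based table, the same question would typically require joining the table with itself once per step in the chain.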

Grid Computing – Combining computer resources from multiple locations to reach a common goal.

Hadoop – An open source Apache framework used to store and process very large amounts of data, distributed over clusters of computers. By using many machines in parallel, there is no need for specialised (expensive) hardware. Hadoop's core features are the file system HDFS and the programming model MapReduce.
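The MapReduce programming model can be imitated on a single machine to show the idea. The sketch below is plain Python and does not use the Hadoop API: the function names are invented, and in a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input; each document could live on a different machine
documents = ["big data big insights", "data beats opinion"]

pairs = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, the work splits naturally across machines.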

HBase – A NoSQL database of the wide-column type, in which rows are looked up by key. It runs on top of HDFS and is part of the Hadoop ecosystem.

HDFS – Hadoop Distributed File System: Hadoop's central file system. Very large files (typically in the range of terabytes) are spread across multiple machines. Since HDFS provides redundancy, it is not a problem when certain machines are unavailable.

Hive – Software which can be used on top of Hadoop. It makes it possible to write queries in HQL (Hive Query Language), an SQL-like language. Originally developed by Facebook, now used by Netflix and many other companies.

Impala – A query engine for Hadoop, developed by Cloudera. Can be considered a competitor of Hive, with a focus on performance.

In-database Analytics – Integrating analytics into the DBMS or data warehouse, instead of using a separate analytics environment.

In-memory Database – A database management system that primarily uses main memory for data storage, instead of the hard disk.

IoT – Internet of Things: connecting all sorts of devices (refrigerators, traffic lights, windmills ...) to the internet. Sensors collect data (e.g. about the device's energy consumption), which can be processed elsewhere.

JSON – JavaScript Object Notation: a data format. It uses semi-structured text to build data objects consisting of one or more pairs of attributes and values.
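As a small illustration, the snippet below parses a JSON text with Python's standard `json` module and writes it back out. The sensor reading is a made-up example.

```python
import json

# Hypothetical sensor reading, written as JSON text
reading_text = '{"device": "fridge-01", "energy_kwh": 1.2, "tags": ["kitchen", "sensor"]}'

# Parse the text into Python objects: attribute/value pairs become a dict
reading = json.loads(reading_text)

# No fixed schema: an extra attribute can simply be added
reading["checked"] = True

# Serialize back to JSON text
round_trip = json.dumps(reading, sort_keys=True)
print(round_trip)
```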

Key Value Store – A type of NoSQL database in which every object is stored and retrieved by a unique key and can hold a number of values, without needing a fixed structure.
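The basic idea can be sketched with a Python dictionary. The `KeyValueStore` class below is a toy written for this example, not the interface of any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store: values are looked up by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value under a key, replacing any earlier value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under a key, or a default if absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key if it exists."""
        self._data.pop(key, None)

store = KeyValueStore()
# Values need no common structure: a nested record and a plain string
store.put("user:1", {"name": "Ann", "languages": ["nl", "en"]})
store.put("user:2", "just a string")
print(store.get("user:1"), store.get("user:3", "not found"))
```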

Want to Know More?

In our next newsletter we present the second half of our alphabet, but if you are looking for in-depth knowledge, we'd love to welcome you to one of our courses. We greatly expanded our Big Data and Analytics programme for fall 2016. Make sure to explore our full course range. In this issue, Peter Vanroose's article about Perl Text Analytics offers a peek into the Big Data world.