Thursday, January 15, 2015

Big Data – Comments from a Beginner

Pedro Francisco Borges Pereira
Undergraduate Student – Universidade Luterana do Brasil (Lutheran University of Brazil)
Major – Information Systems
International Exchange Student at Kansas State University
Fall/2014 Semester

Foreword

 I was very interested in Big Data – is there any IT professional who is not, nowadays? – and decided to learn about it. So I enrolled in, and am attending, CIS 798 – Programming Techniques for Big Data Analytics at Kansas State University, taught by Professor William Hsu.
I started the course aware of some catchwords, like
“Big data is NoSQL”.
“Big data made relational databases obsolete”.

Professor Hsu started the course by teaching us that “Big Data is about Volume, Velocity, Variety”, and proceeded with a machine lab – implementing the prototypical Word Count Big Data programming example, using Hadoop with a Java plug-in.
Well, that makes sense, but I think Professor Hsu underestimated my lack of knowledge.  I was not just ignorant of the programming techniques; I did not even know when – in which use cases – to apply them.
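To make the idea concrete, the essence of that Word Count example can be sketched in plain Python, with no Hadoop cluster at all. The sketch below is my own (the function names and sample sentences are invented; this is not the actual lab code): a mapper emits (word, 1) pairs, and a reducer sums them per word.

```python
# A minimal, illustrative sketch of the classic Word Count (plain Python,
# no Hadoop cluster; the sample sentences are invented for this post).
# The mapper emits (word, 1) pairs; the reducer sums the counts per word,
# mimicking what Hadoop's map, shuffle/sort, and reduce phases do.
from itertools import groupby

def mapper(lines):
    # Emit one (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop's shuffle/sort delivers pairs grouped by key; here we sort
    # explicitly, then sum the counts for each distinct word.
    for word, group in groupby(sorted(pairs), key=lambda p: p[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(dict(reducer(mapper(sample))))  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In a real Hadoop job, the sorting between the two phases is done by the framework's shuffle/sort stage, across many machines, rather than by the program itself.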

So I thought I should take some time to learn about the context: in which use cases we could and/or should use Big Data techniques, and also, since “traditional” relational databases are not dead, when we should stay with them.

Since my knowledge about Big Data (or, more properly, my lack of it) amounted to little more than the catchword “Big Data is NoSQL”, when I speak, in these initial paragraphs, about “Big Data infrastructure”, please understand that I mean “anything different from relational databases”.
I called myself a “beginner”.  Indeed I am – regarding Big Data.  I'm familiar (proficient, I would dare to say) with relational databases.  I think these notes can be useful to anyone aiming to get a general overview of Big Data, but please take into account that I'm writing as somebody used to relational databases and new to Big Data.  I have done a reasonable amount of reading about Big Data, but of course I am still a beginner.  I may have made mistakes; I very probably did. Your feedback will be very welcome at my personal e-mail:  pedrofbpereira@yahoo.com.br .  Thank you very much!


Initial Statements

 I have read a dozen articles (see the Bibliography in these notes), as well as Professor Hsu's handouts, and reached, in short, the following conclusions:
·         Despite the fact that (understandably) some NoSQL database vendors say so, not all applications using relational databases are worth converting to Big Data techniques; relational databases – and data warehouses using denormalized “star schemas” and cubes – are still, frequently, the best option.
·         Despite the fact that some relational database vendors (understandably) state that their products can do everything that Big Data frameworks do, they can't. They can indeed deliver some Big Data tools. But amazing new fields – for example, Complex Event Processing – are being pioneered thanks to Big Data techniques.
·         “Volume, Velocity and Variety” is a simple way to describe Big Data features to outsiders.  Reality is not always that simple.  Actually, most relational databases can scale to “big data” volumes (usually at a higher cost, and usually not instantly) and can perform at high speed (depending on the use case, faster than Big Data frameworks).  “Variety” would be the key feature.   I'd like to highlight that, as I see it, this “variety” matters more in the sense of “changes in database schema” (or even absence of schema) than in the sense of variety of data sources or data content.
·         Economic and infrastructure differences:
o   A strong point of Big Data frameworks is distributed processing and storage, which makes scaling and failure tolerance much easier.   Relational databases can be distributed and grow to sizes that would grant them the right to also be called “big data”.  But since Big Data frameworks are based on “standard” cheap computers, and relational databases (mostly) use expensive, powerful servers, distributing “conventional” relational databases is a lot more expensive.  Big Data frameworks tend to be cheaper.
o   Big Data frameworks can scale instantly.  Relational databases usually cannot respond instantly to explosive increases in demand.
o   Relational databases “can” be hosted in the cloud.  Most Big Data frameworks are born in the cloud.  Open source frameworks such as Apache Hadoop are less expensive than “cloud” versions of relational databases.  And big providers of Infrastructure as a Service – for example, Amazon – allow using a safe, best-practices-managed, and highly (and instantly!) scalable platform at an affordable cost.
·         Giants such as Google, Yahoo, Amazon, the government, and also academic researchers almost certainly need to use Big Data frameworks.  Smaller companies – thanks to IaaS providers – can also use Big Data.
·         There is no free lunch!  The increased capability of Big Data frameworks to deal with variable schemas implies not being ACID[1] (Atomic, Consistent, Isolated, Durable).  Although some NoSQL vendors claim to be ACID, I honestly can't understand how you can be distributed, un-related, un-locked and, at the same time, ACID.  They aim to be, instead, BASE[2] (Basically Available, Soft-state, Eventually consistent).
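To illustrate the “variety” point above – schema changes, or even the absence of a schema – here is a tiny sketch of my own in plain Python (a list of dicts standing in for a document store; the names are invented, and no real database product or API is involved): records with different attributes coexist side by side, and queries simply have to tolerate fields that may be missing.

```python
# A tiny sketch (invented names, plain Python; a list of dicts stands in
# for a document store, no real database API) of the "variety" point:
# records with different schemas coexist, where a relational table
# would first require an ALTER TABLE and a data migration.
documents = []

def insert(doc):
    # A schema-agnostic store accepts any record; there is no fixed
    # schema to violate.
    documents.append(doc)

insert({"name": "Alice", "email": "alice@example.com"})       # the original "schema"
insert({"name": "Bob", "phones": ["555-0100", "555-0101"]})   # a new, nested field
insert({"name": "Carol", "email": "carol@example.com",
        "member_since": 2014})                                # another new attribute

# The flip side of schema freedom: queries must tolerate missing fields.
with_email = [d["name"] for d in documents if "email" in d]
print(with_email)  # ['Alice', 'Carol']
```

This flexibility is exactly what a fixed relational schema forbids – and also part of why such stores give up ACID guarantees in favor of BASE.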



[1] This performance/ACID trade-off is clearly stated in the article “Dynamo: Amazon’s Highly Available Key-value Store”, at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf .
[2] See “Enterprise NoSQL for Dummies”, MarkLogic Special Edition, by Charlie Brooks. 


Contents
Foreword
Initial Statements
For some Use Cases, Relational databases are better
Big Data does things that Relational Databases can't
Un-Structured Data
Complex Event Processing
Sessionization
Volume, Velocity, Variety
Volume
Velocity
Volume, Velocity – Facebook
Variety
NoSQL
Schema-agnostic, but organized
Key-Value Stores
Column Family
Document Databases
Graph Databases
Big Table and HBase
Infrastructure Characteristics
Distributed, Parallel and Redundant – High Performance and Availability
Super-Computing / Cloud Computing
Hadoop Distributed File System - HDFS
MapReduce
The MapReduce Pipeline
Hadoop – Simple Definition
Hadoop – BEOCAT instance
AMAZON WEB SERVICES – CLOUD COMPUTING
The Canonical Wordcount Example
Environment: Linux;   Programming Language: Python
Word Count Diagram
hadoop-streaming.jar
Mapper
Reducer
Our Mapper and Reducer can run just in Linux
Running our Hadoop Job
Using Parallel Processing
BIBLIOGRAPHY
Glossary
Appendix – a BigData Solution Architecture
Felipe Renz MBA Project at Unisinos College, Brazil


1 comment:

  1. Very good post, Pedro Francisco.
    I think there are some challenges that have not yet been well solved in the area of Big Data.
    First, Variety also concerns complexity. What about an Electronic Patient Record that includes images, sound recordings, structured data, and textual descriptions, with all these data related to each other in different contexts, dates, people, etc.?
    Second, exceptions must be faced. Processes are now being modeled with many different paths, not ordinary tasks. Processes are defined on the fly, and officers may change the course by necessity.
    In the same way, exceptions occur with data. How many different genders exist? Some hospitals work with 12 different values for the attribute “gender”. A person's name is not the same as in the past. There is the birth name, the marriage name, and nowadays it is important to support the social name, especially for transsexuals.
    Well, that's my contribution.
    Keep up the nice work there.
