Big Data – Data Analytics and Analysis


Continuing the research on the “Top 25 Digital Startup Ideas and Technologies for 2017,” this section evaluates and highlights aspects of “Big Data – Data Analytics and Analysis.”

Big Data

We create 2.5 quintillion bytes of data every day; over 90% of the data in the world today has been created in the last two years alone. The data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. Such data from structured and unstructured sources is known as big data.

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term “big data” often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. (Wikipedia)

Put another way, big data is a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques. Some of the key attributes of big data include:

  • Volume – Data at rest. Terabytes to exabytes of existing data to process.
  • Variety – Data in many forms, including structured, unstructured, text, and multimedia.
  • Velocity – Data in motion. Streaming data, with milliseconds to seconds to respond.
  • Veracity – Data in doubt. Uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, and approximations.

Researchers, analysts and data scientists use the term “data mining” to describe how they refine raw data into information or knowledge. Besides data mining, several other technologies and techniques have been used to analyze data. In the early stages of data analysis, statistical methods were used to understand the implications of patterns in data, for example, forming conclusions from the results of a public opinion poll or from TV program ratings.
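As a small illustration of the statistical approach mentioned above, the sketch below estimates the share of favorable answers in an opinion poll along with an approximate 95% margin of error. It assumes a simple random sample and uses the normal approximation; the function name and the sample figures are hypothetical.

```python
import math

def poll_estimate(favorable, sample_size, z=1.96):
    """Return the sample proportion and its approximate margin of error.

    z=1.96 corresponds to a 95% confidence level under the normal
    approximation, which assumes a simple random sample.
    """
    proportion = favorable / sample_size
    margin = z * math.sqrt(proportion * (1 - proportion) / sample_size)
    return proportion, margin

# Hypothetical poll: 520 favorable answers out of 1,000 respondents.
share, margin = poll_estimate(favorable=520, sample_size=1000)
print(f"{share:.1%} +/- {margin:.1%}")
```

With a sample of this size the margin is a few percentage points, which is why a narrow lead in such a poll is usually reported as inconclusive.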

Process of Data Discovery

Why should corporate executives pay attention to Big Data?

There is a confluence of forces impacting the way consumers interact with information technology, including what some in the industry collectively call SMAC — social, mobile, analytics, and cloud — that present unique opportunities.

Innovative data aggregators, organizations, and scientists are applying different types of analytic techniques, such as investigative data discovery, descriptive data aggregation, and predictive analytics focused on outcomes. The figure below shows a framework for analytics in a corporate environment:

  • Reporting
  • Dashboards
  • Discovery

 

Framework for Visualization in a corporate environment

At the core of the framework are structured and unstructured data sources. Companies, government agencies, and research organizations generate reports and transactional data in formats that can be stored in and retrieved from relational, structured databases. Such transactional and reference data may exist within software applications running on commercially developed databases like IBM’s DB2, Microsoft’s SQL Server, or Oracle. Such data can be cataloged, indexed, and queried using well-understood tools and techniques.
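As a minimal sketch of the structured querying described above, the following Python snippet uses the standard-library sqlite3 module as a stand-in for a commercial database such as DB2, SQL Server, or Oracle. The table name, columns, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

def total_sales_by_region(rows):
    """Load (region, amount) rows into an in-memory table and aggregate them."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema for transactional data.
        conn.execute("CREATE TABLE transactions (region TEXT, amount REAL)")
        # An index on the grouping column, as a production database might have.
        conn.execute("CREATE INDEX idx_region ON transactions (region)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM transactions "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()

# Hypothetical sample data.
sample_rows = [("east", 120.0), ("west", 75.5), ("east", 30.0)]
print(total_sales_by_region(sample_rows))
```

The same SQL would run largely unchanged on the commercial systems named above, which is part of what makes structured data comparatively easy to work with.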

More about Data visualization can be found here (link).

 

Here are a few examples of Big Data analytics from a cross-section of industries.

  • How Experian Is Using Big Data And Machine Learning To Cut Mortgage Application Times To A Few Days – Credit reference agency Experian holds around 3.6 petabytes of data on people all over the world. This makes it an authority for banks and other financial institutions that want to know whether we represent a good investment when we come to them asking for money. “Just a few years ago when we did analytics on a dataset it was based on a smaller, representative set of information. Today we don’t really reduce the size of the dataset, we do analytics across a terabyte, or petabyte, and that’s something we couldn’t do before.” (Forbes.com)
  • Introducing a New Coffee at Starbucks – During a recent product rollout, Starbucks executives were concerned about the “strong taste” of a new coffee the company was introducing. On the day of the rollout, its data scientists began monitoring blogs, Twitter, and niche coffee forum discussion groups to review customers’ reactions in real time. By mid-morning, Starbucks discovered that relatively few customers complained about the taste of the coffee, while many more thought it was too expensive. The company immediately lowered the price, and by the end of the day most of the negative comments had disappeared. Such real-time analysis using big data techniques helps companies react to customer feedback much faster than traditional approaches like waiting for market surveys.
  • Drilling for Oil at Chevron – Oil drilling is an expensive business, and each well drilled in the Gulf of Mexico costs Chevron upwards of $100 million. Traditionally, the odds of finding oil have been around 1 in 5. To improve its chances, Chevron invested in a program to digitally analyze seismic data, and its geologists leveraged advances in computing power and storage capacity to refine their already advanced computer models. With advances in analytics, Chevron has improved the odds of drilling a successful well to nearly 1 in 3, resulting in tremendous cost savings.
  • Formula One – Learning by using data. Formula One cars generate terabytes of data during a typical race. The cars are equipped with hundreds of sensors that provide a stream of data, which is analyzed in real time. During a typical race, dozens of engineers at the track comb through the data in near real time, looking for any adjustment that could decide whether the team wins or loses.
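The Starbucks example above comes down to tallying what customers are talking about as comments arrive. The following is a rough sketch of that idea, assuming comments are plain strings and that simple keyword matching is enough for illustration (a real system would likely rely on more robust text analysis). The topic names, keyword lists, and sample comments are all hypothetical.

```python
from collections import Counter

# Hypothetical topics and the keywords that signal them.
TOPIC_KEYWORDS = {
    "taste": ("bitter", "strong", "taste"),
    "price": ("expensive", "price", "pricey"),
}

def tally_topics(comments):
    """Count how many comments mention each topic at least once."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            # Substring matching is crude but keeps the sketch short.
            if any(word in text for word in keywords):
                counts[topic] += 1
    return counts

# Hypothetical comments collected from social media.
sample_comments = [
    "Way too expensive for a small cup",
    "Tastes a bit strong but I like it",
    "Love it, but the price is steep",
]
print(tally_topics(sample_comments))
```

Run repeatedly over a rolling window of recent comments, a tally like this would show which complaint dominates at any point in the day.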

 

A few popular tools for big data analysis include:

  • Hadoop – An open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating if a node fails. This approach lowers the risk of catastrophic system failure and unexpected data loss, even if a significant number of nodes become inoperative.
  • Cassandra – A free and open-source distributed NoSQL database management system from the Apache project, designed to handle large amounts of data across many commodity servers and to provide high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.
  • Aerospike – A flash-optimized, in-memory, open source NoSQL database, and the name of the company that produces it. Aerospike Database was first known as Citrusleaf 2.0. In August 2012, the company rebranded both itself and the software as Aerospike.
  • Plotly – An online data analytics and visualization tool, also known by its URL, Plot.ly, from a company headquartered in Montreal, Quebec. Plotly provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST. Plotly was built using Python and the Django framework, with a front end using JavaScript and the visualization library D3.js, HTML and CSS. Files are hosted on Amazon S3.
  • Cloudera – A United States-based software company that provides Apache Hadoop-based software, support and services, and training to business customers. Cloudera’s open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology. Cloudera claims that more than 50% of its engineering output is donated upstream to the various Apache-licensed open source projects (Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Hadoop platform. Cloudera is also a sponsor of the Apache Software Foundation.
  • Neo4j – A graph database management system developed by Neo Technology, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is a popular graph database. It is implemented in Java and is accessible from software written in other languages using the Cypher Query Language through a transactional HTTP endpoint or through the binary Bolt protocol.
  • OpenRefine – A standalone open source desktop application, formerly called Google Refine, for data cleanup and transformation to other formats, an activity known as data wrangling. It is similar to spreadsheet applications (and can work with spreadsheet file formats); however, it behaves more like a database. Google stopped actively supporting the project on October 2, 2012, and it has since been rebranded as OpenRefine.
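To make the Hadoop entry above more concrete: Hadoop’s best-known processing model is MapReduce, in which a map step emits key-value pairs, the framework groups them by key, and a reduce step combines each group. The single-process Python sketch below only imitates those three phases on an in-memory list. It does not use Hadoop or any of its APIs, and the function names are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input; in a real cluster each line could live on a different node.
print(word_count(["big data big insight", "data in motion"]))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, the work can be spread across many machines, which is the property the description above relies on.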