Cloud Computing - Processing and Analysis of Big Data

Introduction

Geographic information covers every subject related to space and location, which makes it a very broad discipline. Long-term geographic observation produces huge amounts of data. Traditional relational database management systems (RDBMS) are not capable of loading data at this scale, let alone supporting high-quality analysis. The distributed processing offered by cloud computing technology provides a solution to this problem. In brief, this approach relies on remote data storage, analysis, and computing.

Applications

  • Observatories of research institutions and government agencies: NASA, NOAA, NSPO, JAXA
  • Governmental organizations or private institutions: Water Resources Agency and New York Times
  • Remote sensing image analysis (for example, the multi-spectral imaging studies of the lunar exploration satellite SELENE carried out by our center with Japan's National Institute of Advanced Industrial Science and Technology (AIST) and the Asian Institute of Technology (AIT) in Thailand)

Application Benefits

  • Assists disaster prevention systems in storing, managing, and analyzing large data sets.
  • Enables processing, storing, and managing satellite images and other data, and accelerates their analysis.
  • Supports effective management of governmental and corporate documents.

Contact

+886-4-24156669 ext. 301 | Miss Emily Lu

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key (Dean, 2004). MapReduce is inspired by the map and reduce primitives present in Lisp and many other functional languages (Dean, 2004). The MapReduce library in the user program first splits the input into M pieces, typically 16 to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines. Because the library handles this partitioning and distribution, programmers without any experience with parallel and distributed systems can easily utilize the resources of a large distributed system.

MapReduce Concept Map
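The map and reduce steps described above can be illustrated with a minimal, single-process sketch in Python. It is not the Google or Hadoop implementation: the names split_input, map_fn, reduce_fn, and run_mapreduce are hypothetical, the splits are counted in records rather than megabytes, and everything runs sequentially instead of on a cluster. The example counts words across a few lines of text.

```python
from collections import defaultdict


def split_input(records, num_splits):
    """Divide the input into num_splits pieces (the M splits in the text)."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_fn(key, value):
    """Map: for one (line id, line text) pair, emit (word, 1) for each word."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn, num_splits=2):
    """Run map over every split, group by intermediate key, then reduce."""
    intermediate = defaultdict(list)
    # Map phase: on a real cluster each split would go to a different worker.
    for split in split_input(records, num_splits):
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                # Grouping by intermediate key stands in for the shuffle step.
                intermediate[ikey].append(ivalue)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))


lines = [(0, "the moon orbits the earth"), (1, "the earth orbits the sun")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The user writes only map_fn and reduce_fn; splitting, grouping, and scheduling live in run_mapreduce, which is the part a real MapReduce library would distribute across machines.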

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers (Chang, 2006). HBase is the Hadoop database. Use it when you need random, real-time read/write access to your Big Data. The project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop (Apache project, 2011).

    • (row key, column key, timestamp) -> value

HBase Database Model

To put it simply, HBase can be reduced to a nested map: Map<RowKey, Map<ColumnFamily, Map<ColumnKey, Map<Timestamp, Value>>>>. The first map maps row keys to their column families. The second maps column families to their column keys. The third maps column keys to their timestamps. Finally, the last one maps the timestamps to a single value. The keys are typically strings, the timestamp is a long, and the value is an uninterpreted array of bytes. The column key is always preceded by its family and is represented as follows: family:key. Since a family maps to another map, a single column family can in theory contain an unlimited number of column keys. So, to retrieve a single value, the user has to do a get using three keys: row key + column key + timestamp -> value (Hadoop Wiki).
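The nested-map view described above can be sketched with plain Python dictionaries. This is only an in-memory illustration of the data model, not the HBase client API: make_table, put, and get are hypothetical helper names, and the row key "truck-001" and column "gps:lat" are made-up sample data. Returning the newest version when no timestamp is given is a convenience of this sketch.

```python
from collections import defaultdict


def make_table():
    """row key -> column family -> column key -> timestamp -> value."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row_key, column, timestamp, value):
    """Store one versioned cell; column is written as 'family:key'."""
    family, key = column.split(":", 1)
    table[row_key][family][key][timestamp] = value


def get(table, row_key, column, timestamp=None):
    """Look up a value by row key + column key + timestamp.

    If timestamp is None, the newest stored version is returned.
    Returns None when the cell or the requested version does not exist.
    """
    family, key = column.split(":", 1)
    # .get() avoids creating empty entries for rows that were never written.
    versions = table.get(row_key, {}).get(family, {}).get(key, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)


table = make_table()
put(table, "truck-001", "gps:lat", 1, b"24.15")
put(table, "truck-001", "gps:lat", 2, b"24.16")

latest = get(table, "truck-001", "gps:lat")
older = get(table, "truck-001", "gps:lat", timestamp=1)
missing = get(table, "truck-999", "gps:lat")
print(latest, older, missing)
```

Values are stored as bytes to mirror the "uninterpreted array of bytes" in the description; interpreting them is left to the application.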

  • Clients: National Institute of Advanced Industrial Science and Technology (AIST), Japan / Asian Institute of Technology (AIT), Thailand
  • Projects: Processing of spectral data collected by Japan's KAGUYA (SELENE) lunar exploration satellite

KAGUYA (SELENE)

The Japan Aerospace Exploration Agency (JAXA) launched KAGUYA (SELENE) on an H-IIA launch vehicle in 2007. The project focuses on collecting lunar surface topography data, analyzing the spectral response of lunar geology, and searching for evidence of moisture on the lunar surface. During the nominal and extended operation periods, the Spectral Profiler (SP) acquired data from about 7,000 revolutions around the Moon, and the total number of lunar surface spectra obtained is close to seventy million.

SELENE lunar exploration platform

The platform displays over 100 billion records of lunar spectral data in Google Earth and is available to researchers from all over the world.

Using the Hadoop cloud platform and the HBase distributed database to build the SELENE data cloud

  • Clients: Formosa Plastics Transport Corporation
  • Projects: Formosa Plastics Transport Corporation Customer Management System

SkyEyes Smart Transportation Management Platform

The SkyEyes Smart Transportation Management Platform has been used by Formosa Plastics Transport Corporation since 2001. Having accumulated tens of billions of driving records, it has been called the largest driving database in the country. The previous solution relied on traditional commercial relational databases, which were unable to handle the 10 million new records added every day. Driving records more than 3 months old had to be stored separately on magnetic media, which was inconvenient for clients who wanted to look up older records and thus reduced the quality of customer service.

Using the Hadoop cloud information platform and the HBase distributed database to store and manage traffic data

Formosa Plastics Transport Corporation has used this platform to import 10 billion historical traffic records with very little hardware investment. Looking up a driving record online takes less than one second, a significant performance improvement over the previous approach. Moreover, the historical traffic data can serve as a basis for travel time estimation.