Monday, 21 September 2015

Introducing the Infinispan Hadoop Connector

The version 0.1 of the Infinispan Hadoop connector has just been made available!

The connector will host several integrations with Hadoop related projects, and in this first release it supports converting Infinispan server into a Hadoop compliant data source, by providing an implementation of InputFormat and OutputFormat.

The InfinispanInputFormat and InfinispanOutputFormat


A Hadoop InputFormat is a specification of how a certain data source can be partitioned and how to read data from each of the partitions. Conversely, OutputFormat is used to write.

Looking closely at the Hadoop's InputFormat interface, we can see two methods:

    List<InputSplit> getSplits(JobContext context);
    RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext context);

The first method defines essentially a data partitioner, calculating one or more InputSplits that contain information about a certain partition of the data. With possession of a InputSplit, one can use it to obtain a RecordReader to iterate over the data. These two operations allow for parallelization of data processing across multiple nodes, and that's how Hadoop map reduce achieves a high throughput over large datasets.

In Infinispan terms, each partition is a set of segments on a certain server, and a record reader is a remote iterator over those segments. The default partitioner shipped with the connector will create as many partitions as servers in the cluster, and each partition will contain the segments that are associated with that specific server.

Not only map reduce

Although the InfinispanInputFormat and InfinispanOutputformat can be used to run traditional Hadoop map reduce jobs over Infinispan data, it is not coupled to the Hadoop map reduce runtime. It is possible to leverage the connector to integrate Infinispan with other tools that, besides supporting Hadoop I/O interfaces, are able to read and write data more efficiently. One of those tools is Apache Flink, that has a dataflow engine capable of doing batch and stream data processing that supersedes the classic two stage map reduce approach. 

Apache Flink example


Apache Flink supports Hadoop's InputFormat as a data source to execute batch jobs, so to integrate with Infinispan it's straightforward:

Please refer to the complete sample that has docker images for both Apache Flink and Infinispan server, and detailed instructions on how to execute and customise job.

Stay tuned

More details about the connector, maven coordinates, configuration options, sources and samples can be found at the project repository

In upcoming versions we expect to have a tighter integration with the Hadoop platform in order to run Infinispan clusters as a YARN application (ISPN-5709), and also support other tools from the ecosystem such as Apache Pig (ISPN-5749)

No comments:

Post a Comment