Monday, 17 August 2015

Infinispan Spark connector 0.1 released!

Dear users,

The Infinispan connector for Apache Spark has just been made available as a Spark Package!

What is it?


The Infinispan Spark connector allows tight integration with Apache Spark, allowing Spark jobs to be run against data stored in the Infinispan Server, exposing any cache as an RDD, and also writing data from any key/value RDD to a cache. It's also possible to create a DStream backed by cache events and to save any key-value DStream to a cache.

The minimum version required is Infinispan 8.0.0.Beta3.

Giving it a spin with Docker


A handy docker image that contains an Infinispan cluster co-located with an Apache Spark standalone cluster is the fastest way to try the connector. Start by launching the container that hosts the Spark Master:


And then run as many worker nodes as you want:


Using the shell


The Apache Spark shell is a convenient way to quickly run jobs in an interactive fashion. Taking advantage of the fact that Spark is already installed in the docker containers (and thus the shell), let's attach to the master:


Once inside, a Spark shell can be launched by:


That's all it's needed. The shell grabs the Infinispan connector and its dependencies from spark-packages.org and exposes them in the classpath.

Generating data and writing to Infinispan


Let's obtain a list of words from the Linux dictionary, and generate 1k random 4-word phrases. Paste the commands in the shell:


From the phrases, we'll create a key value RDD (Long, String):


To save to Infinispan:



Obtaining facts about data


To be able to explore data in the cache, the first step is to create an infinispan RDD:


As an example job, let's calculate a histogram showing the distribution of word lengths in the phrases. This is simply a sequence of transformations expressed by:


This pipeline yields:

2 chars words: 10 occurrences
3 chars words: 37 occurrences
4 chars words: 133 occurrences
5 chars words: 219 occurrences
6 chars words: 373 occurrences
7 chars words: 428 occurrences
8 chars words: 510 occurrences
9 chars words: 508 occurrences
10 chars words: 471 occurrences
11 chars words: 380 occurrences
12 chars words: 309 occurrences
13 chars words: 238 occurrences

...

Now let's find similar words using the Levenshtein distance algorithm. For that we need to define a function that will calculate the edit distance between two strings. As usual, paste in the shell:



Empowered by the Levenshtein distance implementation, we need another function that given a word, will find in the cache similar words according to the provided maximum edit distance:


Sample usage:


Where to go from here


And that concludes this first post on Infinispan-Spark integration. Be sure to check the Twitter demo for non-shell usages of the connector, including Java and Scala API.

And it goes without saying, your feedback is much appreciated! :)

No comments:

Post a Comment