Slides 67 to 73 from this presentation
provide a simple Spark Streaming application that uses Spark SQL to process the data.
A little effort is required to make this into a functional program for Spark 1.2.
Note that the application will not run in Eclipse and must be forked when run from SBT, due to
this issue.
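Forking can be enabled in the SBT build itself; a minimal sketch (the setting is standard sbt, but the exact build layout of the original project is an assumption):

```scala
// build.sbt (sketch): run the application in a forked JVM so the streaming
// job's threads and shutdown do not interfere with the SBT shell.
fork in run := true
```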
The code was tested by running `yes 12 a b c 13 x y z | nc -kl 12345` in another console.
The code adds the following parameters so as to avoid disk I/O and to prevent exhausting both
disk and memory space:
StorageLevel.MEMORY_ONLY
set("spark.executor.memory", "1g")
set("spark.cleaner.ttl", "30")
set("spark.streaming.receiver.maxRate", "400000")
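Put together, the configuration might look like the following sketch (Spark 1.2-era API; the application name, master URL, and port are placeholders, not taken from the original code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of the settings listed above; names in setAppName/setMaster
// are assumptions for illustration.
val conf = new SparkConf()
  .setAppName("StreamingSqlExample")                 // placeholder
  .setMaster("local[6]")                             // assumed local run on 6 cores
  .set("spark.executor.memory", "1g")
  .set("spark.cleaner.ttl", "30")                    // expire old metadata/RDDs
  .set("spark.streaming.receiver.maxRate", "400000") // crude flow control

val ssc = new StreamingContext(conf, Seconds(1))

// MEMORY_ONLY keeps received blocks off disk entirely; port matches the
// `nc -kl 12345` test command above.
val lines = ssc.socketTextStream("localhost", 12345, StorageLevel.MEMORY_ONLY)
```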
The maxRate parameter is unfortunately needed because Spark does not implement the necessary flow control.
The value was derived empirically to achieve roughly 97% CPU load across all six cores of an AMD Phenom II X6 1100T.
The web UI indicates a median rate of 400K records/second, but the output indicates 800K records/second.
In other words, the reported rate is the number of lines read from the socket, rather than the number of objects written
into the DStream.
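The factor of two is consistent with each socket line being parsed into two records. A small sketch of that assumption (the `Record` case class and the two-records-per-line split are hypothetical, chosen only to match the observed 400K-lines vs. 800K-records discrepancy):

```scala
// Each line produced by `yes 12 a b c 13 x y z` carries two 4-field
// groups; parsing it yields two records, so the DStream sees twice as
// many objects as the receiver sees lines.
case class Record(n: Int, a: String, b: String, c: String) // hypothetical

object RateExample extends App {
  val line = "12 a b c 13 x y z"
  val records = line.split(" ").grouped(4).collect {
    case Array(n, a, b, c) => Record(n.toInt, a, b, c)
  }.toList
  println(records.length) // 2 records from 1 line
}
```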