Friday, June 2, 2017

Apache Spark Design Patterns Using Scala 0


Apache Spark supports both batch and streaming analysis, so you can use a single framework for batch processing as well as near-real-time use cases. Spark also introduces a fantastic functional programming model, which is arguably better suited to data analysis than Hadoop's Map/Reduce API.

This blog series explores whether a common set of use cases can be solved using Spark.
The use cases are based on:

http://oreil.ly/mapreduce-design-patterns
"MapReduce Design Patterns by Donald Miner and Adam Shook (O'Reilly). Copyright 2013 Donald Miner and Adam Shook, 978-1-449-32717-0."


The Hardware and Software stack used

Wednesday, May 10, 2017

Apache Spark Design Patterns Using Scala apache spark Series 1 The word count


A simple word count using Scala in Spark
Simple word count example
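
The listing itself is collapsed on this page, so here is a minimal sketch of what a naive word count over Posts.xml might look like. It is not the original code: the object name `SimpleWordCount`, the helper `tokenize`, the `local[*]` master, and the input path are all assumptions made for illustration, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; required on Spark 1.2, harmless later

object SimpleWordCount {
  // Split a raw line on whitespace and drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumed settings: local mode and a Posts.xml in the working directory.
    val conf = new SparkConf().setAppName("SimpleWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.textFile("Posts.xml")
        .flatMap(line => tokenize(line)) // every token on every line, XML markup included
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(20).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because this version splits whole lines, attribute names and markup from each row are counted along with the post text.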
The code above has several limitations. The objective is to count the words in each post, but Posts.xml carries a lot of metadata, such as OwnerUserId, Title, and Tags. The information we need is in the Body.
The missing logic is:
1) Count words in the Body
2) Error handling
3) Data cleanup - we don't count single quotes or special characters
This example uses case classes and XML parsing, which is built into Scala.
Enhanced word count example
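
The enhanced listing is also collapsed here, so the following is only a sketch of how the three points listed above could be addressed, not the original implementation. The names `Post`, `EnhancedWordCount`, `parsePost`, `words`, and `countBodyWords` are hypothetical. It assumes each post is a single `<row ... />` line with `Id` and `Body` attributes, and that the `scala.xml` package is available (it is bundled with Scala 2.10; on Scala 2.11 and later it is the separate scala-xml module).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits; required on Spark 1.2, harmless later
import org.apache.spark.rdd.RDD
import scala.util.Try
import scala.xml.XML

// Hypothetical record type: only the fields this count needs.
case class Post(id: Long, body: String)

object EnhancedWordCount {
  // Error handling: lines that are not a well-formed row with a numeric Id
  // (the XML declaration, <posts>, </posts>, truncated rows) become None.
  def parsePost(line: String): Option[Post] =
    Try {
      val row = XML.loadString(line.trim)
      Post((row \ "@Id").text.toLong, (row \ "@Body").text)
    }.toOption

  // Data cleanup: strip HTML tags, lower-case, drop single quotes, and treat
  // anything other than ASCII letters and digits as a separator (an assumption).
  def words(body: String): Seq[String] =
    body
      .replaceAll("<[^>]+>", " ")
      .toLowerCase
      .replaceAll("'", "")
      .replaceAll("[^a-z0-9]+", " ")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // Count words in the Body only, ignoring the other attributes.
  def countBodyWords(sc: SparkContext, path: String): RDD[(String, Int)] =
    sc.textFile(path)
      .flatMap(line => parsePost(line).toSeq)
      .flatMap(post => words(post.body))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}
```

With an existing SparkContext `sc`, `EnhancedWordCount.countBodyWords(sc, "Posts.xml").take(20)` would return a sample of (word, count) pairs. Keeping `parsePost` and `words` as plain functions means they can be checked without a cluster.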

Sunday, April 30, 2017

Apache Spark Design Patterns Using Scala 1 The Setup


The Hardware and Software stack used  

Spark version 1.2.0
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)

scala -version
Scala code runner version 2.11.4 -- Copyright 2002-2013, LAMP/EPFL

java -version
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

uname -a
Linux SERVER 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/redhat-release
Fedora release 20 (Heisenbug)

Data files used
8.0G Sep 18 03:06 Comments.xml
29G Sep 18 04:34 Posts.xml
1.8G Sep 23 02:01 stackoverflow.com-Comments.7z
5.8G Sep 27 01:26 stackoverflow.com-Posts.7z
101M Sep 23 21:49 stackoverflow.com-Users.7z
895M Sep 18 04:36 Users.xml
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
microcode       : 0xa0b
cpu MHz         : 1998.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips        : 5985.62
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
microcode       : 0xa0b
cpu MHz         : 1998.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips        : 5985.62
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
