Wednesday, 3 April 2019

Spark tuning parameters

Spark parameters

Dynamic Executor Allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=2m
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=2000
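
One way to apply these settings is to pass them as --conf key=value arguments to spark-submit. The sketch below only builds that argument string in plain Scala, so it runs without a Spark installation. The object name DynamicAllocationConf and the helper toSubmitArgs are hypothetical, and the values mirror the list above.

```scala
// Hypothetical helper: renders Spark properties as spark-submit --conf arguments.
object DynamicAllocationConf {
  // Values mirror the dynamic allocation list above; a Seq keeps their order.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.executorIdleTimeout" -> "2m",
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.dynamicAllocation.maxExecutors" -> "2000"
  )

  // Each pair becomes "--conf key=value"; pairs are joined with single spaces.
  def toSubmitArgs(conf: Seq[(String, String)]): String =
    conf.map { case (k, v) => s"--conf $k=$v" }.mkString(" ")

  def main(args: Array[String]): Unit =
    println(toSubmitArgs(settings))
}
```

On Spark 2.x, dynamic allocation also needs the external shuffle service (spark.shuffle.service.enabled=true), so that flag would normally be added to the same list.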

Better fetch failure handling
spark.max.fetch.failures.per.stage = 10

Scaling Spark driver
spark.rpc.io.serverThreads = 64

Tuning memory configurations
  1. Enable off-heap memory
  spark.memory.offHeap.enabled = true
  spark.memory.offHeap.size = 3g
  spark.executor.memory = 3g
  spark.yarn.executor.memoryOverhead = 0.1 * (spark.executor.memory + spark.memory.offHeap.size)

  2. Garbage collection tuning
  spark.executor.extraJavaOptions = -XX:ParallelGCThreads=4 -XX:+UseParallelGC
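
The overhead formula in item 1 is an expression, not a literal value, so it has to be evaluated before it goes into the configuration. The sketch below works it out in MiB for the 3g heap and 3g off-heap sizes given above. The object name MemoryOverhead and the helper overheadMiB are hypothetical, and rounding up to a whole MiB is an assumption.

```scala
// Hypothetical helper: evaluates 0.1 * (executor heap + off-heap) with sizes in MiB.
object MemoryOverhead {
  // Integer arithmetic; adding 9 before dividing by 10 rounds the result up (assumption).
  def overheadMiB(executorMiB: Long, offHeapMiB: Long): Long =
    (executorMiB + offHeapMiB + 9) / 10

  def main(args: Array[String]): Unit = {
    val executorMiB = 3L * 1024 // spark.executor.memory = 3g
    val offHeapMiB  = 3L * 1024 // spark.memory.offHeap.size = 3g
    println(s"spark.yarn.executor.memoryOverhead=${overheadMiB(executorMiB, offHeapMiB)}m")
  }
}
```

With the values above, 0.1 * (3072 + 3072) = 614.4, so the sketch prints spark.yarn.executor.memoryOverhead=615m.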

Eliminate disk I/O bottleneck
1. spark.shuffle.file.buffer=1MB
   spark.unsafe.sorter.spill.reader.buffer.size=1MB
2. spark.file.transferTo=false
   spark.shuffle.unsafe.file.output.buffer=5MB
3. spark.io.compression.lz4.blockSize=512KB

Cache index files on Shuffle Server
spark.shuffle.service.index.cache.entries=2048

Scaling External Shuffle Service
Tune shuffle service worker thread and backlog
spark.shuffle.io.serverThreads=128
spark.shuffle.io.backLog=8192

Configurable shuffle registration timeout and retry
spark.shuffle.registration.timeout = 2m
spark.shuffle.registration.maxAttempts = 5

XML parsing in Spark using the databricks/spark-xml library

This section uses databricks/spark-xml to read an XML file into a Spark DataFrame.

Assume the following sample XML, saved as Books.xml:

<?xml version="1.0"?>
<catalog>
    <book id="bk101">
    <author>
        Gambardella, Matthew</author>
        <title>
        XML Developer's Guide</title>
        <genre>
        Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>


            An in-depth look at creating applications
            with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
            and query XML data in the database.


            After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository,
            the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB
            application. It provides examples of how and where you can use Oracle XML DB.


            The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating
            XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of
            the manual discusses how to use Oracle XML DB repository, including versioning and security,
            how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage
            your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and
            Oracle Streams Advanced Queuing XMLType support.
        </description>
        </book>
<book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies,
        an evil sorceress, and her own childhood to become queen
        of the world.</description>
</book>
    <book id="bk103">
        <author>Corets, Eva</author>
        <title>Maeve Ascendant</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-11-17</publish_date>
        <description>After the collapse of a nanotechnology
            society in England, the young survivors lay the
            foundation for a new society.</description>
    </book>
    <book id="bk104">
        <author>Corets, Eva</author>
        <title>Oberon's Legacy</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2001-03-10</publish_date>
        <description>In post-apocalypse England, the mysterious
            agent known only as Oberon helps to create a new life
            for the inhabitants of London. Sequel to Maeve
            Ascendant.</description>
    </book>
    <book id="bk105">
        <author>Corets, Eva</author>
        <title>The Sundered Grail</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2001-09-10</publish_date>
        <description>The two daughters of Maeve, half-sisters,
            battle one another for control of England. Sequel to
            Oberon's Legacy.</description>
    </book>
    <book id="bk106">
        <author>Randall, Cynthia</author>
        <title>Lover Birds</title>
        <genre>Romance</genre>
        <price>4.95</price>
        <publish_date>2000-09-02</publish_date>
        <description>When Carla meets Paul at an ornithology
            conference, tempers fly as feathers get ruffled.</description>
    </book>
    <book id="bk107">
        <author>Thurman, Paula</author>
        <title>Splish Splash</title>
        <genre>Romance</genre>
        <price>4.95</price>
        <publish_date>2000-11-02</publish_date>
        <description>A deep sea diver finds true love twenty
            thousand leagues beneath the sea.</description>
    </book>
    <book id="bk108">
        <author>Knorr, Stefan</author>
        <title>Creepy Crawlies</title>
        <genre>Horror</genre>
        <price>4.95</price>
        <publish_date>2000-12-06</publish_date>
        <description>An anthology of horror stories about roaches,
            centipedes, scorpions  and other insects.</description>
    </book>
    <book id="bk109">
        <author>Kress, Peter</author>
        <title>Paradox Lost</title>
        <genre>Science Fiction</genre>
        <price>6.95</price>
        <publish_date>2000-11-02</publish_date>
        <description>After an inadvertant trip through a Heisenberg
            Uncertainty Device, James Salway discovers the problems
            of being quantum.</description>
    </book>
    <book id="bk110">
        <author>O'Brien, Tim</author>
        <title>Microsoft .NET: The Programming Bible</title>
        <genre>Computer</genre>
        <price>36.95</price>
        <publish_date>2000-12-09</publish_date>
        <description>Microsoft's .NET initiative is explored in
            detail in this deep programmer's reference.</description>
    </book>
    <book id="bk111">
        <author>O'Brien, Tim</author>
        <title>MSXML3: A Comprehensive Guide</title>
        <genre>Computer</genre>
        <price>36.95</price>
        <publish_date>2000-12-01</publish_date>
        <description>The Microsoft MSXML3 parser is covered in
            detail, with attention to XML DOM interfaces, XSLT processing,
            SAX and more.</description>
    </book>
    <book id="bk112">
        <author>Galos, Mike</author>
        <title>Visual Studio 7: A Comprehensive Guide</title>
        <genre>Computer</genre>
        <price>49.95</price>
        <publish_date>2001-04-16</publish_date>
        <description>Microsoft Visual Studio 7 is explored in depth,
            looking at how Visual Basic, Visual C++, C#, and ASP+ are
            integrated into a comprehensive development
            environment.</description>
    </book>
</catalog>

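To parse this into a Spark DataFrame, create an sbt project with the following structure. Books.xml must sit under src/main/resources so that getClass.getResource("/Books.xml") can find it on the classpath.
src
 - main
   - scala
     - sample
       - Books.scala
   - resources
     - Books.xml
build.sbt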

Books.scala

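package sample

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

import com.databricks.spark.xml._

object Books {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Books xml Parsing").master("local[*]").getOrCreate()

    // With rowTag = "catalog" the whole file becomes one row holding an array column named "book".
    val booksXML: DataFrame = spark.read.option("rowTag", "catalog").xml(getClass.getResource("/Books.xml").getPath)
    booksXML.show()

    // Assumption: the original selection step was not shown. Here the array is exploded
    // so that each book becomes one row, with its fields (and the _id attribute) as columns.
    val selectedData: DataFrame = booksXML.select(explode(col("book")).as("book")).select("book.*")

    // Parquet stores its own schema, so the CSV-style "header" option is dropped.
    selectedData.write.parquet("C:\\Users\\Jijo\\flatten_xml_spark\\books_flatfile")

    spark.stop()
  }
}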

build.sbt
lazy val root = (project in file(".")).
  settings(
    inThisBuild(List(
      organization := "sample",
      scalaVersion := "2.12.7",
      version      := "0.1.0-SNAPSHOT"
    )),
    name := "xmlparsing",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.0",
      "org.apache.spark" %% "spark-sql"  % "2.4.0",
      "com.databricks"   %% "spark-xml"  % "0.5.0",
      // Replaces `scalaTest % Test`, which needed project/Dependencies.scala (not part of the
      // structure above). The version 3.0.5 is an assumption.
      "org.scalatest"    %% "scalatest"  % "3.0.5" % Test
    ))
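
A quick way to sanity-check a flattened result is to count the book elements independently of Spark; the sample above has twelve (bk101 to bk112). The sketch below does this with the JDK's DOM parser on an inline two-book excerpt. The object name BookRowCheck and the helper bookIds are hypothetical and are not part of spark-xml.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: lists the id attribute of every <book> element,
// i.e. one entry per row expected after flattening.
object BookRowCheck {
  def bookIds(xml: String): List[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
    val books = doc.getElementsByTagName("book")
    (0 until books.getLength).toList
      .map(i => books.item(i).asInstanceOf[Element].getAttribute("id"))
  }

  def main(args: Array[String]): Unit = {
    // Two-book excerpt of the sample catalog, inlined to keep the sketch self-contained.
    val excerpt =
      """<catalog>
        |  <book id="bk101"><title>XML Developer's Guide</title></book>
        |  <book id="bk102"><title>Midnight Rain</title></book>
        |</catalog>""".stripMargin
    println(bookIds(excerpt).mkString(", "))
  }
}
```

Running bookIds over the full Books.xml text and comparing its length with the row count of the written Parquet data gives a simple consistency check.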