Using databricks/spark-xml to read an XML file into a Spark DataFrame.
Assume the following sample XML, saved as Books.xml:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>
An in-depth look at creating applications
with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
and query XML data in the database.
After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository,
the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB
application. It provides examples of how and where you can use Oracle XML DB.
The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating
XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of
the manual discusses how to use Oracle XML DB repository, including versioning and security,
how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage
your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and
Oracle Streams Advanced Queuing XMLType support.
</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
To parse this into a Spark DataFrame, create an sbt project with the following structure:
src
  - main
    - scala
      - sample
        - Books.scala
    - resources
      - Books.xml
build.sbt
Books.scala
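package sample

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import com.databricks.spark.xml._

object Books {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Books xml Parsing").master("local[*]").getOrCreate()
    // rowTag "catalog" produces a single row whose "book" column is an array of structs
    val booksXML: DataFrame = spark.read.option("rowTag", "catalog").xml(getClass.getResource("/Books.xml").getPath)
    booksXML.show()
    // Flatten: one row per <book>, one column per child element or attribute
    val selected_data: DataFrame = booksXML.select(explode(col("book")).as("book")).select("book.*")
    selected_data.write.parquet("C:\\Users\\Jijo\\flatten_xml_spark\\books_flatfile")
    spark.stop()
  }
}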
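The reader above uses rowTag "catalog", so the whole file becomes one row with a nested "book" array. As an alternative sketch, setting rowTag to "book" should give one row per book directly. The object name BooksByRow is hypothetical; the "_id" column name assumes spark-xml's default attribute prefix of "_".

```scala
package sample

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import com.databricks.spark.xml._

// Hypothetical companion to Books.scala: reads the same Books.xml,
// but treats each <book> element as a row instead of the whole <catalog>.
object BooksByRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Books by row")
      .master("local[*]")
      .getOrCreate()

    // One row per <book>; child elements become columns.
    val books: DataFrame = spark.read
      .option("rowTag", "book")
      .xml(getClass.getResource("/Books.xml").getPath)

    // Print the inferred schema to check column names and types.
    books.printSchema()

    // The id attribute is exposed as "_id" under the default attribute prefix.
    books.select(col("_id"), col("title"), col("price")).show(truncate = false)

    spark.stop()
  }
}
```

With this variant no explode step is needed, at the cost of losing anything that lives on the <catalog> element itself.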
build.sbt
import Dependencies._

lazy val root = (project in file(".")).
  settings(
    inThisBuild(List(
      organization := "sample",
      scalaVersion := "2.12.7",
      version      := "0.1.0-SNAPSHOT"
    )),
    name := "xmlparsing",
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0",
    libraryDependencies += scalaTest % Test,
    libraryDependencies += "com.databricks" %% "spark-xml" % "0.5.0"
  )
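The build file imports Dependencies._ and refers to scalaTest, neither of which appears in the project structure listed earlier. A minimal sketch of the missing file, assuming it lives at project/Dependencies.scala as in the standard sbt seed template; the ScalaTest version shown is an assumption, so use whichever release matches your Scala version.

```scala
// project/Dependencies.scala (assumed location; not shown in the original layout)
import sbt._

object Dependencies {
  // Version is an assumption for illustration, not taken from the original project.
  lazy val scalaTest = "org.scalatest" %% "scalatest" % "3.0.5"
}
```

If you do not need tests, removing both the import and the scalaTest line from build.sbt is an equally valid option.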