Evaluate XQuery in PySpark on RDD elements

Stack Overflow Asked by Fahad Rana on September 6, 2020

We are trying to read a large number of XML files and run an XQuery on them in PySpark (for example, the books XML). We are using the spark-xml-utils library.

  • We want to feed the directory containing xmls to pyspark.
  • Run Xquery on all of them to get our results.

reference answer: Calling scala code in pyspark for XSLT transformations

The definition of the XQuery processor, where xquery is the XQuery string:

proc =

We are reading the files in a directory using:


This gives us an RDD containing all the files as a list of tuples:

[ (Filename1,FileContentAsAString), (Filename2,File2ContentAsAString) ]

The xquery evaluates and gives us results if we run on the string (FileContentAsAString)

whole_files = sc.wholeTextFiles("xmls/test_files").collect()
# Prints proper xquery result for that file


If we try to run proc.evaluate() on the RDD using a lambda function, it fails:

test_file = sc.wholeTextFiles("xmls/test_files").map(lambda x: proc.evaluate(x[1])).collect()

# Should give us a list of xquery results 


PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects

These calls work, but the evaluate call above does not:

Print the content the XQuery is applied on:

sc.wholeTextFiles("xmls/test_files").map(lambda x: x[1]).collect()
# Outputs the content. With x[0], it gives us the list of filenames

Return the number of characters in the contents:

sc.wholeTextFiles("xmls/test_files").map(lambda x: len(x[1])).collect()
# Output: [15274, 13689, 13696]

Books example for reference:

books_xquery = """for $x in /bookstore/book
where $x/price>30
return $x/title/data()"""

proc_books =

books_xml = sc.wholeTextFiles("xmls/books.xml").map(lambda x: proc_books.evaluate(x[1])).collect()
# Error
# I can share the stacktrace if you guys want

One Answer

Unfortunately, it is not possible to call a Java/Scala library directly within a map call from Python code. This answer gives a good explanation of why there is no easy way to do this. In short, the Py4J gateway (which is needed to "translate" the Python calls into the JVM world) lives only on the driver node, while the map calls that you are trying to execute run on the executor nodes.
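The PicklingError in the question can be reproduced without Spark. The sketch below is only an illustration: FakeGatewayProxy is a hypothetical stand-in for the processor object, and it assumes (as the error message suggests) that the real object holds a thread lock somewhere. Spark has to pickle everything a lambda captures before shipping it to the executors, and a lock cannot be pickled.

```python
import pickle
import threading


class FakeGatewayProxy:
    """Hypothetical stand-in for a Py4J-backed object (not the real API).
    It holds a lock, which is what makes it unpicklable."""

    def __init__(self):
        self._lock = threading.RLock()

    def evaluate(self, xml_string):
        return len(xml_string)


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the raised exception."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:
        return exc


proc = FakeGatewayProxy()
error = try_pickle(proc)
print(type(error).__name__, error)

# A plain string, like the file content in each tuple, pickles without error.
print(try_pickle("<bookstore/>"))
```

This is why the lambdas that only touch x[0] or x[1] work: they capture nothing but plain Python data.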

One way around that problem would be to wrap the XQuery function in a Scala UDF (explained here), but it still would be necessary to write a few lines of Scala code.

EDIT: If you are able to switch from XQuery to XPath, a probably easier option is to change the library. ElementTree is an XML library in the Python standard library, and it also supports a subset of XPath.

The code

import xml.etree.ElementTree as ET

xmls = spark.sparkContext.wholeTextFiles("xmls/test_files")
xpathquery = "...your query..."
# Parentheses let the chained calls span several lines
results = (xmls.flatMap(lambda x: ET.fromstring(x[1]).findall(xpathquery))
    .map(lambda x: x.text)
    .collect())
print(results)

prints all results of running the xpathquery against all documents loaded from the directory xmls/test_files.

First, flatMap is used because the findall call returns a list of all matching elements within each document. flatMap flattens this list (the result might contain more than one element per file). In the following map call, the elements are mapped to their text in order to get readable output.
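The books query from the question filters on price > 30, and ElementTree's XPath subset has no numeric comparison, so that part would have to move into Python. A minimal sketch, assuming a bookstore/book/title/price layout; the SAMPLE document and the expensive_titles helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the shape the books query expects.
SAMPLE = """<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>"""


def expensive_titles(xml_string, min_price=30.0):
    """Return titles of books whose price is strictly above min_price."""
    root = ET.fromstring(xml_string)
    titles = []
    # findall("book") matches the direct <book> children of the root element.
    for book in root.findall("book"):
        price = float(book.findtext("price", default="0"))
        if price > min_price:
            titles.append(book.findtext("title"))
    return titles


print(expensive_titles(SAMPLE))
```

Because expensive_titles is plain Python, it should be usable inside the RDD pipeline, for example xmls.flatMap(lambda x: expensive_titles(x[1])).collect().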

Answered by werner on September 6, 2020
