Spark, the general-purpose computing framework written in Scala, allows Java, Python, and R clients to interact with it. This has increased Spark's acceptance among programmers, and as a side effect, Spark benefits from the programming libraries each of these languages offers.
But I was wondering how the architecture is designed to handle this. When we submit a Spark job written in Python, how is core Spark functionality such as the SparkContext or RDD creation taken care of?
I have come across some points that describe how this works.
· Only Scala client code uses the Spark libraries directly.
· Clients in other languages do not interact with Spark's Scala code directly.
· Since the Spark libraries are available only in Scala, whatever programming language we use, the terminal operations can happen only in Scala.
· Ex: If we use a third-party Java API from our Java code, the calls to the API's methods happen inside the single JVM that holds both our application code and the external API code. But if we want to call C/C++ code from Java, we need to use JNI or JNA.
If you have used these in your career, you know the pain of building and testing such bindings.
This is the kind of problem we face when we try to interact with Scala-written Spark from other languages.
· Spark made multi-language support simpler by using an RPC interface. There is an RPC server implementation for each language Spark supports, with APIs that abstract away the Scala implementation in the backend.
· The RPC server used to support Python is Py4J. The Py4J gateway server acts as the communication pipe between Python and Scala.
For example, when you run a list.append() command in Python [ internal_list.append('Second item') ], one element is added to a list that lives in the Java JVM. You are typing Python syntax in a Python terminal, but the backend operation happens on JVM objects, as in the sketch below.
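Here is a minimal Py4J sketch of that interaction. It assumes a JVM with a Py4J GatewayServer is already listening on the default port (PySpark starts one automatically when it launches the driver); internal_list is just an illustrative name.

    from py4j.java_gateway import JavaGateway

    # Connect to a JVM that is already running a Py4J GatewayServer.
    gateway = JavaGateway()

    # Instantiate a java.util.ArrayList inside the JVM; the Python variable
    # only holds a proxy that forwards every call over the Py4J socket.
    internal_list = gateway.jvm.java.util.ArrayList()

    internal_list.append('First item')   # executed on the JVM-side list
    internal_list.append('Second item')  # executed on the JVM-side list

    # Even sorting happens inside the JVM, via java.util.Collections.
    gateway.jvm.java.util.Collections.sort(internal_list)
    print(internal_list)  # ['First item', 'Second item']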
You can look at the Spark source code that is used to run a Python file.
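To connect this back to the original question about the SparkContext and RDD creation: PySpark's SparkContext is a thin Python wrapper that drives a real SparkContext inside the JVM through this gateway. A small sketch is shown below; the underscore-prefixed attributes are PySpark internals and may change between versions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("gateway-demo").getOrCreate()
    sc = spark.sparkContext

    # The Python SparkContext holds Py4J proxies to the JVM-side objects.
    print(sc._jvm)    # JVMView: entry point into the driver JVM
    print(sc._jsc)    # proxy for the JavaSparkContext living in the JVM

    # Creating an RDD from Python still creates an RDD object in the JVM.
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd._jrdd)  # proxy for the underlying Java RDD

    spark.stop()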
· Similarly, for R scripts, Spark uses RBackend, a Netty-based backend server that handles the communication between R and Scala.
You can look at the Spark source code that is used to run an R file.
· In the case of Java, a thin wrapper written in Scala acts as the agent that converts Java objects/collections to Scala (see the Java API internals).
· The fastest approach is to use Scala as the client language.
· But Python is widely used due to its rich machine learning APIs and ease of use.
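One reason for the performance difference is that RDD transformations written as Python lambdas have to ship data between the JVM and Python worker processes, while DataFrame expressions are compiled and executed entirely inside the JVM. A rough illustration of the contrast (names and numbers here are arbitrary):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("jvm-vs-python").getOrCreate()
    sc = spark.sparkContext

    # RDD map with a Python lambda: each partition is serialized from the JVM
    # to a Python worker process, transformed there, and serialized back.
    squares_rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

    # Equivalent DataFrame expression: the computation runs inside the JVM,
    # so the data never makes a Python round trip.
    squares_df = spark.range(1000).select((F.col("id") * F.col("id")).alias("square"))

    print(squares_rdd.take(3))
    print(squares_df.take(3))
    spark.stop()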