What is Diyotta’s run-time architecture for data processing on Hadoop using Spark?
Diyotta’s Spark run-time architecture for the Hadoop platform is outlined below; illustrative code sketches follow the list.
- Extraction of source data as a flat file on the controller or agent file system
- Transfer of the file to HDFS through HDFS FileSystem API commands
- Obtain a Spark session with Hive support enabled
- Get the Spark context from the Spark session
- Load the source data in HDFS into a Spark RDD
- Apply row formatting with the schema supplied by the Diyotta data objects and create a DataFrame in the SQL context
- Register the DataFrame as a Spark temporary table/view in the SQL context
- Apply transformations, if any, through SQL on the temporary table data and store the transformed data in another Spark temporary table/view in the SQL context
- Insert into the Hive target table by selecting from the SQL context temporary table
- If the target is HDFS, persist the transformed data in the SQL context table to an HDFS file
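
The first two steps (extraction to a flat file and transfer into HDFS) can be sketched with the Hadoop FileSystem API as shown below. This is only a minimal illustration, not Diyotta's actual generated code; the local staging path and the HDFS target path are assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsStageSketch {
  def main(args: Array[String]): Unit = {
    // Hadoop picks up cluster settings from core-site.xml / hdfs-site.xml on the agent host.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Copy the extracted flat file from the controller/agent file system into HDFS.
    // Both paths below are illustrative assumptions.
    fs.copyFromLocalFile(
      new Path("/opt/diyotta/stage/customer.csv"),
      new Path("hdfs:///staging/customer.csv"))

    fs.close()
  }
}
```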
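
The remaining steps map onto the Spark SQL API. The sketch below assumes a Hive-enabled Spark session and walks through the RDD load, DataFrame creation from a supplied schema, temporary-view registration, SQL transformation, Hive insert, and optional HDFS write. The two-column schema, the table names, and the paths are hypothetical placeholders; the code Diyotta generates at run time will differ.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SparkLoadSketch {
  def main(args: Array[String]): Unit = {
    // Obtain a Spark session with Hive support enabled.
    val spark = SparkSession.builder()
      .appName("diyotta-style-load")   // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()

    // Get the Spark context from the session.
    val sc = spark.sparkContext

    // Load the staged HDFS file into an RDD (path is an assumption).
    val rawRdd = sc.textFile("hdfs:///staging/customer.csv")

    // Apply row formatting with the schema supplied by the data object;
    // this two-column schema is purely illustrative.
    val schema = StructType(Seq(
      StructField("id",   StringType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))
    val rowRdd   = rawRdd.map(_.split(",")).map(a => Row(a(0), a(1)))
    val sourceDf = spark.createDataFrame(rowRdd, schema)

    // Register the DataFrame as a temporary view.
    sourceDf.createOrReplaceTempView("src_customer")

    // Apply transformations through SQL and keep the result as another temp view.
    spark.sql("SELECT id, upper(name) AS name FROM src_customer")
      .createOrReplaceTempView("xform_customer")

    // Insert into the Hive target table (assumed to already exist) from the temp view.
    spark.sql("INSERT INTO TABLE tgt.customer SELECT id, name FROM xform_customer")

    // If the target is HDFS instead, persist the transformed data as files.
    spark.table("xform_customer")
      .write.mode("overwrite")
      .csv("hdfs:///target/customer_out")

    spark.stop()
  }
}
```

In practice the SQL in the middle step is where mapping-level transformations are expressed, so the same session/temp-view pattern repeats once per transformation stage.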