Estimated reading time: 3'
Overview
Tableau is a popular, powerful visualization platform that can draw on a large variety of data sources, from files, databases and applications to frameworks such as Apache Spark, which makes it particularly well suited to visualizing the results of queries against a Spark dataset. Tableau Desktop is a chart and query builder that relies on simple drag and drop to render query results. It supports a very large variety (50+) of data source connectors, from files, databases, CRMs and enterprise or cloud applications to scalable frameworks such as Apache Spark. Tableau's statistical and computational capabilities help data scientists and product managers spot critical patterns in their data.
There are a few options to query and visualize Apache Spark datasets in Tableau:
- Using an ODBC driver to access Spark data frames through the Spark SQL connector
- Through the Hive2 thrift server, using the Amazon EMR Hadoop Hive connector
For this post we select the second option and walk through a common use case: installing and configuring the thrift server, loading data from S3, applying a transformation to a Spark dataset, and leveraging the Parquet format.
Setup
Starting configuration
Steps
2- Download the Simba ODBC driver for macOS for the Amazon EMR Hadoop Hive connector (Driver Download)
3- Write and build the Spark code that loads the CSV data from S3, converts it into a typed dataset, stores it as Parquet and creates the Hive table:
import java.io.FileNotFoundException

import org.apache.spark.SparkException
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

final private val dataDir = "/tmp/records_table"
final private val s3Bucket = "...."
final private val s3CSVFile = "...."

final val createTable = "CREATE EXTERNAL TABLE records_table(" +
    "id varchar(64), " +
    "description varchar(512), " +
    "value float, " +
    "unit char(8)) " +
    "STORED AS PARQUET LOCATION"

def loadRecords(implicit sparkSession: SparkSession): Unit = {
  import sparkSession.implicits._

  try {
    // Load the data frame from the CSV file stored on S3
    val recordsDF = s3CSVToDataFrame(s3CSVFile, s3Bucket, header = true, isMultiLine = true)
    // Convert the data frame into a typed data set
    val recordsDS = recordsDF.map(Record(_))

    // Generate and store the data in a Parquet columnar structure
    recordsDS.write.mode(SaveMode.Overwrite).parquet(dataDir)

    // Create the Hive table pre-populated from the Parquet structure stored on HDFS
    sparkSession.sql("DROP TABLE IF EXISTS records_table")
    sparkSession.sql(s"${createTable} '$dataDir'")
    // logger is assumed to be provided by the enclosing class (e.g. an SLF4J logger)
    logger.info(s"New table for $dataDir was created")

    // Just a quick validation test
    sparkSession.sql("SELECT id, value FROM records_table").show
    sparkSession.close()
  } catch {
    case e: Exception => logger.error(s"Failed to create HIVE table ${e.toString}")
  }
}

@throws(classOf[IllegalStateException])
def s3CSVToDataFrame(
    s3CSVInputFile: String,
    s3Bucket: String,
    header: Boolean,
    isMultiLine: Boolean
)(implicit sparkSession: SparkSession): DataFrame = {
  // Initialize the S3 access configuration for Hadoop
  val accessConfig = sparkSession.sparkContext.hadoopConfiguration

  try {
    accessConfig.set("fs.s3a.access.key", "xxxxxx")
    accessConfig.set("fs.s3a.secret.key", "xxxxxxx")

    val headerStr = if (header) "true" else "false"
    // Read the content of the CSV file from S3 to generate a data frame
    sparkSession
      .read
      .format("csv")
      .option("header", headerStr)
      .option("delimiter", ",")
      .option("multiLine", isMultiLine)
      .load(path = s"s3a://${s3Bucket}/${s3CSVInputFile}")
  } catch {
    case e: FileNotFoundException => throw new IllegalStateException(e.getMessage)
    case e: SparkException => throw new IllegalStateException(e.getMessage)
    case e: Exception => throw new IllegalStateException(e.getMessage)
  }
}
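The Record type referenced by recordsDF.map(Record(_)) is not shown above; a minimal sketch, assuming the CSV columns map one-to-one to the records_table schema (id, description, value, unit), could look like this:

import scala.util.Try
import org.apache.spark.sql.Row

// Hypothetical typed representation of one CSV record; the field names and
// their order are assumptions matching the records_table schema defined above.
case class Record(id: String, description: String, value: Float, unit: String)

object Record {
  // Build a Record from a raw data frame row; a non-numeric value defaults to 0.0f
  def apply(row: Row): Record = Record(
    row.getString(0),
    row.getString(1),
    Try(row.getString(2).toFloat).getOrElse(0.0f),
    row.getString(3)
  )
}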
4- Log into the target EMR instance and upload the jar file for execution
5- Add or edit a few Spark configuration parameters (a programmatic equivalent is sketched after this list):
- spark.sql.warehouse.dir=/hive/spark-warehouse
- spark.hadoop.hive.metastore.warehouse.dir=/hive/spark-warehouse
- spark.sql.hive.server2.thrift.port=10000
- spark.sql.hive.thriftServer.singleSession=true
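For reference, here is a minimal sketch of the equivalent programmatic configuration, assuming the session is built in the application itself rather than through spark-defaults.conf (the application name is an arbitrary placeholder):

import org.apache.spark.sql.SparkSession

// Hive-enabled session carrying the same settings as the configuration entries listed above
implicit val sparkSession: SparkSession = SparkSession.builder()
  .appName("RecordsToHive")   // placeholder name
  .config("spark.sql.warehouse.dir", "/hive/spark-warehouse")
  .config("spark.hadoop.hive.metastore.warehouse.dir", "/hive/spark-warehouse")
  .config("spark.sql.hive.server2.thrift.port", "10000")
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .enableHiveSupport()
  .getOrCreate()

With such a session in scope, step 6 reduces to invoking loadRecords from the code in step 3.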
6- Execute the code to generate the Hive table from the Spark dataset
7- Set up or edit the Hive configuration file /usr/lib/spark/conf/hive-site.xml as a super user (sudo):
<configuration>
  <property>
    <name>hive.metastore.connect.retries</name>
    <value>10</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>~/hive/warehouse</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.thrift.client.user</name>
    <value>root</value>
  </property>
  <property>
    <name>hive.server2.thrift.client.password</name>
    <value>xxxxxx</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxxx</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>~/tmp/hive</value>
  </property>
</configuration>
Note: The configuration variables defined in the Spark configuration file override some of these entries.
- Remove potential locks in the metastore: rm -r metastore/*.lck (a locked store generates an error when accessing and reading the table)
- Stop the Hive2 thrift server: sudo /usr/lib/spark/sbin/stop-thriftserver.sh
- Optionally kill the two processes related to the thrift server: ps -ef | grep RunJar, then sudo kill -9 {processId}
- Restart the thrift server: sudo /usr/lib/spark/sbin/start-thriftserver.sh --master local
- Verify the Parquet data is correctly stored on HDFS: hdfs dfs -ls /tmp/records_table/
- Verify the table is created and populated on the EMR instance: hive> show tables (a JDBC-based check is sketched after this list)
- Launch Tableau Desktop
- Select the Amazon EMR Hadoop Hive connector
- Configure the connection through the UI (see the attached snapshot): 1) enter the public DNS URL of the EMR master instance, 2) select authentication = Username, 3) enter user name = hadoop, 4) select SSL required.
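As noted above, the table can also be verified from any JDBC client before pointing Tableau at the cluster. A minimal Scala sketch, assuming the Hive JDBC driver is on the classpath, the thrift server listens on the default port 10000, and the check runs on the master node itself (host, user and empty password are assumptions matching the setup above):

import java.sql.DriverManager

// Register the Hive JDBC driver and query records_table through the Hive2 thrift server
Class.forName("org.apache.hive.jdbc.HiveDriver")
val connection = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hadoop", "")
try {
  val resultSet = connection.createStatement.executeQuery("SELECT id, value FROM records_table LIMIT 5")
  while (resultSet.next())
    println(s"${resultSet.getString(1)}: ${resultSet.getFloat(2)}")
} finally {
  connection.close()
}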
Thank you for reading this article. For more information ...
The author has been director of data engineering at Aideo Technologies since 2017 and is the author of "Scala for Machine Learning", Packt Publishing, ISBN 978-1-78712-238-3.