Showing posts with label Tableau. Show all posts
Showing posts with label Tableau. Show all posts

Monday, October 9, 2023

Tableau-like Visualization with PyGWalker

Target audience: Beginner
Estimated reading time: 3'
Ever thought about presenting test results in a format similar to Tableau, one that management is acquainted with? A visually appealing, business-centric display can effectively convey messages. 
In this article, we delve into the PyGWalker Python library, which mirrors the interactive visualization style of Tableau, especially when it comes to geospatial graphics.


Table of contents
      Installation

Follow me on LinkedIn
Notes:
  • Environments: Python 3.10, Pandas 2.12, PyGWalker 0.3.9, Pedantic 2.4.2, GeoPy 2.4.0
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

PyGWalker, which stands for Python binding for Graphic Walker, is a visual representation library in Python, designed to work seamlessly with Jupyter-style notebooks for data probing and assessment [ref 1]. 
This library renders a user interface reminiscent of Tableau [ref 2], generated directly from pandas data frames. Its user-friendly interface facilitates pattern visualization and analysis through effortless drag-and-drop actions.

Please refer to an older post, Setup Tableau with Amazon EMR-Spark for more information about Tableau configuration and deployment [ref 3].

Installation

pip:  pip install pygwalker --upgrade
conda: condo install -c condo-force pygwalker 
JupyterLab:  pip install jupyterlab
Notebook: pip install notebook

Integration with Jupyter notebook

PyGWalker utilizes the Jupyter engine to produce an interactive user interface (UI) that resonates with the business community. Unlike Matplotlib, where visualization is code-centric, in PyGWalker, plot customization is managed directly through the UI. Thus, Python's role is mostly restricted to setting up and initiating the UI. 

After importing the necessary libraries and setting up the Pandas data frame, activating the UI is just a single line of code away.

import pandas as pd
import pygwalker as pyg

my_csv_file = 'input/locations.csv'
df = pd.read_csv(my_csv_file)

  # Launch the PyGWalker interactive UI
walker = pyg.walk(df)


Use case

Depicting geospatial data graphically can be quite daunting. In this context, we aim to illustrate the spread of tech and science firms throughout California.

Our scenario leverage GeoPy, a Python tool that interfaces with multiple renowned geocoding online platforms [ref 4].
This tool streamlines the process for Python programmers, allowing them to determine the coordinates of various locations—be it addresses, cities, nations, or significant landmarks—by utilizing independent geocoders and assorted data resources. Notably, while GeoPy supports geocoding from platforms like Google Maps, Bing Maps, and Nominatim, it maintains no direct affiliations with any of them.

installation: pip install geopy

In our case, the data comes in a straightforward 2-column table, detailing the city names and the count of tech/science enterprises, labeled as 'num_companies'. We've established a class, 'TechCity', which incorporates additional attributes – longitude and latitude – facilitating the data's visualization on a geographical map.

from typing import AnyStr, TypeVar, List
from dataclasses import dataclass

@dataclass
class TechCity:
  city: AnyStr
  num_companies: float
  longitude: float
  latitude: float

  @staticmethod
  def header() -> List[AnyStr]:
     return ['city', 'num_companies', 'longitude', 'latitude']


Following that, we establish a generator class named 'TechCitiesGenerator' that transforms the input data (comprising city names and the 'num_companies' for each city) into 'TechCity' instances for display purposes.
We employ the Nominatim geolocation service, set up during the class construction. Nominatim taps into OpenStreetMap data to pinpoint locations globally by either name or address (a process called geocoding) [ref 5].

The procedure __call__,  can be broken down into three steps:
  1. Create a 'TechCity' instance.
  2. Transition these instances into a dictionary format.
  3. Archive this dictionary as a CSV or JSON file.
class TechCitiesGenerator(object):
  """ 
  Generate the input to PyGWalker table with geo-location data
       :param cities List of cities with significant number of tech/science companies
       :param num_companies List of number of companies associated with each city
       :param filename Name of the output file (CSV or JSON)
  """
  def __init__(self, cities: List[AnyStr], num_companies: List[int], filename: AnyStr):
    from geopy.geocoders import Nominatim

    self.filename = filename
    self.cities = cities
    self.num_companies_lst = num_companies
    self.loc = Nominatim(user_agent='Geopy Library')

  def __call__(self) -> bool:
    import csv
    import logging

    # Step 1: Generate the records of type TechCity
    tech_cities = [
       TechCity(city, num_companies, self.loc.geocode(city).longitude, self.loc.geocode(city).latitude)
        for index, (city, num_companies)
        in enumerate(zip(self.cities, self.num_companies_lst))
    ]
    # Step 2: Convert to list into a dictionary
    records = [vars(tech_city) for tech_city in tech_cities]
    
    # Step 3: Store the dictionary in CSV or JSON format, give the file name
    try:
       match self.filename[-4:]:
         case '.csv': 
            with open(self.filename, 'w') as f:
               writer = csv.DictWriter(f, fieldnames=TechCity.header())
               writer.writeheader()
               for record in records:
                   writer.writerow(record)
            return True

         case 'json':
            import json
                    
            json_repr = json.dumps(records, indent=4)
            with open(self.filename, 'w') as f:
                 f.write(json_repr)
            return True

         case _:
            logging.error(f'Extension for {self.filename} is incorrect')
            return False
   '
   except Exception as e:
       logging.error(f'Failed to store object {str(e)}')
       return True


Output

The most basic visualization is a table akin to standard Tableau worksheets, where the columns depict the four attributes of the TechCity class, and each row corresponds to an individual instance.


Tabular representation of TechCity instances


The display below showcases PyGWalker's ability to map the count of companies across the cities listed in the table, superimposed on a geographical layout. Achieving this visualization involves three straightforward steps:
  1. Choose 'Geographic' for the Coordinate System.
  2. Drag the longitude (and subsequently, latitude) column to the respective 'Longitude' (and 'Latitude') fields.
  3. Drag the 'num_companies' column, representing the number of companies, into the size field.

That's it.


Tableau-like geospatial representation of number of 
tech & science companies for California cities

Conclusion

Effective communication of findings between data scientists and stakeholders is pivotal for any project's triumph. PyGWalker equips engineers with the ability to represent model outcomes in a style reminiscent of Tableau, a platform that many executives recognize, right within their notebooks.

Additionally, PyGWalker's visualization approach is both instinctive and interactive, sidestepping the clutter that additional coding can sometimes introduce in notebooks

Thank you for reading this article. For more information ...

References





---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3



Sunday, July 19, 2020

Setup Tableau with Amazon EMR-Spark

Target audience: Intermediate
Estimated reading time: 3'

This article describes the configuration of Amazon EMR service and set up of Tableau desktop to query and visualize Spark SQL datasets and content stored on S3. The installation process can daunting as documentation and useful tips are spread across various sites, chat rooms.


Table of contents
Overview
Follow me on LinkedIn

Overview

Tableau is a popular, powerful visualization platform that leverages a large variety of data sources from files, databases, applications to frameworks such as Apache Spark. Tableau is particularly suitable to visualize the results of queries to a Spark dataset.
Tableau desktop is a charts/query builder that relies on simple drag and drop to render results of query. It support a very large variety (50+) data source connectors from files, databases, CRMs, enterprise and cloud applications to scalable frameworks such as Apache Spark. Tableau's powerful statistical and computational capabilities help data scientists and product managers to spot critical data patterns.
There are few options to query and visualize Apache Spark datasets in Tableau:
  1. Using an ODBC driver to access Spark data-frame using the Spark SQL connector
  2. Through Hive2 thrift server using the Amazon EMR Hadoop Hive connector

For this post we select the second option and describe a common use case: installation and configuration of Thrift server, loading data from S3, transformation applied to a Spark dataset and leveraging parquet format.

Setup

Starting configuration

We assume that the data has been previously stored on Amazon S3. The same procedure would apply to any other storage such as HDFS, database or local file. The other assumptions is that Apache Spark has been deployed through an Amazon EMR.

Steps

1- Download Tableau Desktop Tableau Desktop

2- Download Simba ODBC driver for MacOS for Amazon EMR Hadoop Hive connector Driver Download 

3- Implement processing of the dataset loaded from S3 in CSV format, and Hive table using parquet. The procedure (Scala code snippet below) can be also easily implemented using Python/PySpark.
  1. final private val dataDir = "/tmp/records_table"
    final private val s3Bucket = "...."
    final private val s3CSVFile = "...."
    
    final val createTable = "CREATE EXTERNAL TABLE records_table(" +
     "id varchar(64), " +
     "description varchar(512), " +
     "value float, " +
     "unit char(8) " +
     "STORED AS PARQUET LOCATION"
    
    
    def loadRecords(implicit sparkSession: SparkSession): Unit = {
      import sparkSession.implicits._
      try {
        // Load dataframe from CSV file
        val recordsDF  = s3CSVToDataFrame(s3CSVFile , s3Bucket, true, true)
        // Convert data frame into a typed data set
        val recordsDS = recordsDF.map(Record(_))
    
        // Generate and store data in Parquet columnar structure
        recordDS.write.mode(SaveMode.Overwrite).parquet(dataDir)
    
        // Create the HIVE table pre-populated from Parquet structure stored on HDFS
        sparkSession.sql("DROP TABLE IF EXISTS records_table")
        sparkSession.sql(s"${createTable} '$dataDir'")
        logger.info(s"New table for $dataDir was created")
    
        // Just a quick validation test
        sparkSession.sql("SELECT id, value FROM records_table")
                    .show
        sparkSession.close
      }
      catch {
        case e: Exception => 
          logger.error(s"Failed to create HIVE table ${e.toString}")
      }
    }
    
    
    
    @throws(clazz = classOf[IllegalStateException])
    def s3CSVToDataFrame(
         s3CSVInputFile: String, 
         header: Boolean,
         s3Bucket: String,
         isMultiLine: Boolean
    )(implicit sparkSession: SparkSession): DataFrame = {
     import sparkSession.implicits._
     
     // Initialize the access configuration for Hadoop
     val loadDS = Seq[String]().toDS
     val accessConfig = loadDS.sparkSession.sparkContext.hadoopConfiguration
    
     try {
       accessConfig.set("fs.s3a.access.key", "xxxxxx")
       accessConfig.set("fs.s3a.secret.key", "xxxxxxx")
       
       val headerStr = if (header) "true" else "false"
       // Read the content of the CSV file from S3 to generate a data frame
       sparkSession
             .read
             .format("csv")
             .option("header", headerStr)
             .option("delimiter", ",")
             .option("multiLine", isMultiLine)
             .load(path = s"s3a://${s3Bucket}/${s3CSVInputFile}")
      } catch {
         case e: FileNotFoundException =>  
             throw new IllegalStateException(e.getMessage)
         case e: SparkException =>  
             throw new IllegalStateException(e.getMessage)
         case e: Exception => 
             throw new IllegalStateException(e.getMessage)
      }
    }
    


4- Log into the target EMR instance and upload the jar file for execution

5- Add or edit few spark configuration parameters 

  • spark.sql.warehouse.dir=/hive/spark-warehouse 
  • spark.hadoop.hive.metastore.warehouse.dir=/hive/spark-warehouse 
  • spark.sql.hive.server2.thrift.port=10000 
  • spark.sql.hive.thriftServer.singleSession=true

6- Execute the code to generate the Hive table from Spark dataset  

7- Set up/edit HIVE configuration file /usr/lib/spark/conf/hive-site.xml as a super user (sudo). 

 

Note: The default Derby embedded driver is used for convenience but can be easily replaced by a MySql or PostgreSQL driver by updating the javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName values. 

The default port for the thrift server is 10000 (hive.server2.thrift.port) may have to be changed to avoid conflict with other services.
  • <configuration>
    <property>
    <name>hive.metastore.connect.retries</name>
    <value>10</value>
    </property>
    <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>
    <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>
    <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>~/hive/warehouse</value>
    </property>
    <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    </property>
    <property>
    <name>hive.server2.thrift.client.user</name>
    <value>root</value>
    </property>
    <property>
    <name>hive.server2.thrift.client.password</name>
    <value>xxxxxx</value>
    </property>
    <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    </property>
    <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
    </property>
    <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    </property>
    <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxxx</value>
    </property>
    <property>
    <name>hive.exec.local.scratchdir</name>
    <value>~/tmp/hive</value>
    </property>
    </configuration>

    Note: The configuration variables defined in the spark configuration file overrides some of these entries. 

    • Remove potential locks in the metastore: rm -r metastore/*.lck (Locked access to the store will generate an error accessing and reading the table)
    • Stop the Hive2 thrift server sudo /usr/lib/spark/sbin/stop-thriftserver.sh 
    • Optionally kill the 2 processes related to thrift server ps -ef | grep RunJar sudo kill -9 {processId} 
    • Restart the thrift server sudo /usr/lib/spark/sbin/start-thriftserver.sh --master local 
    • Verify the parquet data is correctly stored on HDFS hdfs dfs -ls /tmp/metrics/ 
    • Verify table is created and populated in the EMR instance hive => show tables
    • Launch Tableau desktop
    • Select Amazon EMR-Hadoop connector
    • Configure the connection through UI (see attached snapshot): 1) Enter the public DNS URL for the EMR master instance, 2) Select authentication = user name 3) Enter user name = hadoop 4) Select SSL required.


  • Thank you for reading this article. For more information ...


    References


    ---------------------------
    Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
    He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3