Thursday, June 15, 2023
Extend LangChain Sequences with Data Types
This post introduces the concept of a generative workflow by extending LangChain with typed inputs and conditions on input values. The post assumes the reader has some basic knowledge of Python, OpenAI ChatGPT, and LangChain.
Table of contents
Introduction
Notes:
- As I publish this post, OpenAI has released a new version of gpt-3.5-turbo that supports functions with typed input and output (ChatGPT functions) [ref 3]
- The code snippets use Python 3.9 and LangChain 0.0.200
- To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.
- This post is not generated by ChatGPT but assumes the reader is already familiar with large language models
- The source code is available on GitHub https://github.com/patnicolas/chatgpt-patterns
Introduction
The LangChain Python framework is built on the OpenAI API for building large language model applications. The framework organizes ChatGPT API functionality into functional components, similar to object-oriented design. These components are assembled into customizable sequences or chains that can be dynamically configured. It allows developers to define a sequence of tasks (chains) with a message/prompt as input (role=user) and an answer (role=assistant) as output [ref 4]. This concept is analogous to a traditional function call:
def func(**kwargs: dict[str, any]) -> output_type:
....
return x
Python | LLM |
---|---|
Function call | LLM message/request |
Function name (func) | Prompt prefix |
Arguments (**kwargs) | List of tuples (variable_name, variable_type, condition) |
Returned type (output_type) | LangChain output key |
LangChain does not explicitly support types such as integer, dictionary, or float in input messages. The next section extends LangChain by adding types to ChatGPT request messages and, given the data type, a condition or filter on the variable.
Example:
- Prompt prefix: "Compute the sum of elements of an array"
- Arguments: (x, list[float], element > 0.5)
generates the following prompt: "Compute the sum of the elements of an array x of type list[float] for which elements > 0.5"
The next section describes the Python implementation of a workflow of typed chains for ChatGPT using the LangChain framework.
Generative workflow
The first step is to install the LangChain Python module and set up the OpenAI API key as an environment variable on the target machine. The LangChain Quickstart guide [ref 5] is concise and easy to follow, so there is no need to duplicate that information in this post.
LLM chains and sequences are important features of the LangChain framework. They allow developers to assemble basic functions (type LLMChain) into a fully functional workflow or sequence (type SequentialChain).
Let's extend SequentialChain with typed inputs and conditions on input values by implementing a workflow defined by the class ChatGPTTypedChains. The constructor has two arguments:
- Temperature, _temperature, to initialize the ChatGPT request
- Optional task builder, task_builder, that defines the task implemented as a chain from:
  - a task description (prompt)
  - a list of input variables defined as tuples (variable name, data type, optional condition on the variable)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import SequentialChain, LLMChain
from collections.abc import Callable
"""
This class extends the langchain sequences by defining explicitly
- Task or goal of the request to ChatGPT
- Typed arguments for the task
The components of the prompt/message
- Definition of the task (i.e. 'compute the exponential value of')
- Input variables defined as tuple (name, type, condition) (i.e. 'x', 'list[float]', 'value < 0.8')
"""
class ChatGPTTypedChains(object):
    def __init__(self, _temperature: float, task_builder: Callable[[str, list[tuple[str, str, str]]], str] = None):
        """
        Constructor for the typed sequence of LLM chains
        @param task_builder: Builder or assembler for the prompt, with the task definition and
               list of arguments (name, type, condition) as input and the prompt as output
        @param _temperature: Temperature for the softmax log probabilities
        @type _temperature: floating value >= 0.0
        """
        self.chains: list[LLMChain] = []
        self.llm = ChatOpenAI(temperature=_temperature)
        self.task_builder = task_builder if task_builder else ChatGPTTypedChains.__build_prompt
        self.input_keys: list[str] = []
The method append adds a new task (LLM chain) to the workflow, given:
- Task descriptor, task_definition (prompt)
- Task parameters, arguments, as tuples (name of input, type of input, optional condition)
- Return type, _output_key
    def append(self, task_definition: str, arguments: list[tuple[str, str, str]], _output_key: str) -> int:
        """
        Add a new task (LLM chain) to the current workflow
        @param task_definition: Definition or specification of the task
        @param arguments: List of tuples (variable_name, variable_type, variable_condition)
        @param _output_key: Output key or variable
        """
        # Initialize the input variables for the workflow
        if len(self.input_keys) == 0:
            self.input_keys = [key for key, _, _ in arguments]

        # Build the prompt for this new task
        this_input_prompt = self.task_builder(task_definition, arguments)
        this_prompt = ChatPromptTemplate.from_template(this_input_prompt)

        # Create a new LLM chain and add it to the sequence
        this_llm_chain = LLMChain(llm=self.llm, prompt=this_prompt, output_key=_output_key)
        self.chains.append(this_llm_chain)
        return len(self.chains)
    @staticmethod
    def __build_prompt(task_definition: str, arguments: list[tuple[str, str, str]]) -> str:
        def set_prompt(var_name: str, var_type: str, var_condition: str) -> str:
            prompt_variable_prefix = "{" + var_name + "} with type " + var_type
            return prompt_variable_prefix + " and " + var_condition \
                if bool(var_condition) \
                else prompt_variable_prefix

        embedded_input_vars = ", ".join(
            [set_prompt(var_name, var_type, var_condition)
             for var_name, var_type, var_condition in arguments]
        )
        return f'{task_definition} {embedded_input_vars}'
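For illustration, the prompt-assembly logic can be exercised as a standalone sketch (the function name build_prompt is ours; it mirrors the private method above, appending the condition only when one is provided):

```python
def build_prompt(task_definition: str, arguments: list[tuple[str, str, str]]) -> str:
    # Embed each variable as '{name} with type <type>', plus an optional condition
    def set_prompt(var_name: str, var_type: str, var_condition: str) -> str:
        prefix = "{" + var_name + "} with type " + var_type
        return prefix + " and " + var_condition if var_condition else prefix

    embedded = ", ".join(set_prompt(n, t, c) for n, t, c in arguments)
    return f"{task_definition} {embedded}"

# Prompt for the summation task of the first use case below
prompt = build_prompt("Sum these values", [("x", "list[float]", "values < 0.8")])
print(prompt)  # Sum these values {x} with type list[float] and values < 0.8
```

The curly braces around the variable name are intentional: ChatPromptTemplate.from_template interprets them as template placeholders to be substituted at execution time.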
The method __call__ implements the workflow as a LangChain sequential chain. It takes two arguments: the input to the workflow (input to the first task), _input_values, and the names/keys of the output values (output from the last task in the sequence), output_keys.
    def __call__(self, _input_values: dict[str, str], output_keys: list[str]) -> dict[str, any]:
        """
        Execute the sequence of typed tasks (LLM chains)
        @param _input_values: Input values to the sequence
        @param output_keys: Output keys for the sequence
        @return: Dictionary of output variable -> values
        """
        chains_sequence = SequentialChain(
            chains=self.chains,
            input_variables=self.input_keys,
            output_variables=output_keys,
            verbose=True
        )
        return chains_sequence(_input_values)
Simple use cases
- Two numerical tasks (math functions: sum and exp)
- Term frequency-Inverse document frequency (TF-IDF) scoring and ordering task
Numerical computation chain
The first use case chains two numerical tasks:
- the sum of an array x of type list[float] for which values < 0.8
- the exponential function applied to the sum
In this example, an array of 128 floating-point values is generated through a sine function, then filtered with the condition x < 0.8. The output value is a dictionary with a single key, 'u'.
def numeric_tasks() -> dict[str, str]:
    import math

    chat_gpt_seq = ChatGPTTypedChains(0.0)

    # First task: sum the values of x = [sin(n*0.001) for n in range(128)] for which x < 0.8
    input_x = ','.join([str(math.sin(n * 0.001)) for n in range(128)])
    chat_gpt_seq.append("Sum these values ", [('x', 'list[float]', 'values < 0.8')], 'res')

    # Second task: u = exp(sum(x))
    chat_gpt_seq.append("Compute the exponential value of ", [('res', 'float', '')], 'u')

    input_values = {'x': input_x}
    output: dict[str, str] = chat_gpt_seq(input_values, ["u"])
    return output
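The sequence's numerical result can be sanity-checked locally. A minimal sketch in plain Python, reproducing the two tasks on the same 128 sine values (no LLM involved):

```python
import math

# Same input as numeric_tasks: 128 values of sin(n * 0.001)
values = [math.sin(n * 0.001) for n in range(128)]

# First task: sum the values for which value < 0.8
res = sum(v for v in values if v < 0.8)

# Second task: exponential of the sum
u = math.exp(res)
print(res, u)
```

Since sin(n * 0.001) stays well below 0.8 for n < 128, the condition filters nothing for this particular input; it becomes relevant for larger arrays or amplitudes.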
TF-IDF score
This second use case consists of two tasks (LLM chains):
- computation of the TF-IDF score, tf_idf_score, of terms extracted from 3 documents/files (file1.txt, file2.txt, file3.txt). The key for the input values, documents, is the content of the 3 documents.
- ordering of the terms by their TF-IDF score. The output key, ordered_list, is the list of terms ranked by decreasing TF-IDF score.
def load_content(file_name: str) -> str:
    with open(file_name, 'r') as f:
        return f.read()

def load_text(file_names: list[str]) -> list[str]:
    return [load_content(file_name) for file_name in file_names]

def tf_idf_score() -> str:
    chat_gpt_seq = ChatGPTTypedChains(0.0)

    # Load the documents for which the TF-IDF scores have to be computed
    input_files = ['../input/file1.txt', '../input/file2.txt', '../input/file3.txt']
    input_documents = '```'.join(load_text(input_files))

    # First task: compute the TF-IDF score of the terms in the documents
    chat_gpt_seq.append(
        "Compute the TF-IDF score for words from documents delimited by triple backticks with output format term:TF-IDF score ```",
        [('documents', 'list[str]', '')], 'terms_tf_idf_score')

    # Second task: sort the terms by decreasing TF-IDF score
    chat_gpt_seq.append("Sort the terms and TF-IDF score by decreasing order of TF-IDF score",
                        [('terms_tf_idf_score', 'list[float]', '')], 'ordered_list')

    output = chat_gpt_seq({'documents': input_documents}, ["ordered_list"])
    return output['ordered_list']
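As a due-diligence step, the LLM's ranking can be cross-checked with a small local implementation. A sketch using a common TF-IDF formulation (tf * log(N/df), scoring each term by its best document); ChatGPT may apply a slightly different variant:

```python
import math
from collections import Counter

def tf_idf(documents: list[str]) -> dict[str, float]:
    # Naive whitespace tokenization, lower-cased
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores: dict[str, float] = {}
    for doc in tokenized:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)                    # term frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # inverse document frequency
            scores[term] = max(scores.get(term, 0.0), tf * idf)
    return scores

docs = ["the cat sat", "the dog ran", "the cat ran fast"]
ranked = sorted(tf_idf(docs).items(), key=lambda kv: kv[1], reverse=True)
```

Note that a term appearing in every document (such as "the" above) gets an IDF of log(1) = 0 and drops to the bottom of the ranking, which is the expected behavior to verify in the LLM output.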
References
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3
Wednesday, May 31, 2023
Generate Code with ChatGPT Reusable Prompt Patterns
Reusable patterns
Use case
Persona pattern
Template pattern
Visualization generator pattern
Reflection pattern
Conclusion
References
Overview
Notes:
- See the posts ChatGPT API Python client [ref 2] and Secure ChatGPT API client in Scala [ref 3] on creating a client application for ChatGPT, for experimenting with prompt engineering
- This post uses the gpt-3.5-turbo model
Reusable patterns
Pattern | Category | Purpose |
---|---|---|
Persona | Output customization | Assign a persona/role/domain expert to the LLM. The persona can be expressed as a role, job description, title, or historical or known figure. |
Template | Output customization | Ensure the output follows a precise template (i.e. format, URL, example...). This pattern instructs the LLM to use a unique format in a specific portion of the answer. |
Cognitive verifier | Prompt improvement | The quality of LLM answers improves if the initial question is broken into additional sub-questions. |
Fact check list | Error identification | Request the LLM to provide/append a list of facts/assumptions to the answer, so the user may perform due diligence. |
Output automater | Output customization | Have the LLM generate a script or automated task that can execute any steps the LLM recommends. |
Reflection | Error identification | Ask the LLM to automatically explain the rationale behind a given answer to the user. The pattern clarifies any points of confusion, underlying assumptions, and gaps in knowledge. |
Visualization generator | Output customization | Use generated text to create a visualization, as complex concepts are easier to grasp with diagrams and images. The LLM output should create a pathway for the tool to produce imagery. |
Use case
Each pattern is illustrated with:
- The prompt's objective.
- The prompt message itself.
- ChatGPT's response, accompanied by comments and source code.
- Our subsequent analysis.
Initial request
Prompt
Answer
import psycopg2
from psycopg2 import sql
import random

# Define connection parameters
db_name = "EHR"
db_user = "username"
db_password = "password"
db_host = "localhost"
db_port = "5432"

# Connect to the PostgreSQL database
connection = psycopg2.connect(
    dbname=db_name,
    user=db_user,
    password=db_password,
    host=db_host,
    port=db_port
)

# Create a cursor object
cursor = connection.cursor()

# Create the 'predictions' table
cursor.execute("""CREATE TABLE predictions (
    id SERIAL PRIMARY KEY,
    num_entries INTEGER,
    variable VARCHAR(255),
    mean FLOAT,
    variance FLOAT);"""
)
connection.commit()

# Insert a large number of rows into the 'predictions' table
for i in range(10000):  # adjust this number as necessary
    cursor.execute(
        """INSERT INTO predictions (num_entries, variable, mean, variance)
           VALUES (%s, %s, %s, %s)""",
        (
            random.randint(1, 100),   # num_entries
            f'variable_{i}',          # variable
            random.uniform(0, 100),   # mean
            random.uniform(0, 10)     # variance
        )
    )
connection.commit()

# Close the cursor and connection
cursor.close()
connection.close()
# Re-establish connection and cursor
connection = psycopg2.connect(
dbname=db_name,
user=db_user,
password=db_password,
host=db_host,
port=db_port
)
cursor = connection.cursor()
# Execute a SELECT query
cursor.execute("SELECT * FROM predictions;")
# Fetch all rows from the query
rows = cursor.fetchall()
# Print each row
for row in rows:
print(row)
# Close the cursor and connection
cursor.close()
connection.close()
Analysis
- The code functions as intended. ChatGPT accurately infers that PostgreSQL is deployed locally and employs randomly generated values for code evaluation. The only minor issue is that the response omits a table-creation check ('CREATE TABLE IF NOT EXISTS').
- In the second code snippet, the query is implemented with a straightforward print statement for each row, serving as a validation measure. To prevent excessive standard output, these print statements could have been applied to a subset of rows.
- ChatGPT doesn't include an exception handling mechanism, as it wasn't explicitly required. Software developers should anticipate incorporating proper error handling in the generated source code for safety.
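Both observations above (the missing idempotent table creation and the missing error handling) are easy to retrofit. A minimal sketch of the try/except/finally structure around a CREATE TABLE IF NOT EXISTS statement; sqlite3 is used here only to keep the example self-contained and runnable, and the same structure applies to psycopg2:

```python
import sqlite3

try:
    connection = sqlite3.connect(":memory:")
    cursor = connection.cursor()
    # Idempotent creation: re-running the statement does not raise an error
    cursor.execute("""CREATE TABLE IF NOT EXISTS predictions (
        id INTEGER PRIMARY KEY,
        num_entries INTEGER,
        variable TEXT,
        mean REAL,
        variance REAL);""")
    # Safe to execute a second time thanks to IF NOT EXISTS
    cursor.execute("CREATE TABLE IF NOT EXISTS predictions (id INTEGER PRIMARY KEY);")
    connection.commit()
    tables = cursor.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
except sqlite3.Error as error:
    print(f"Database error: {error}")
finally:
    # Always release the connection, whether an error occurred or not
    connection.close()
```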
Persona Pattern
Prompt
Context: I am a software engineer developing an application with the lowest latency possible.
Answer
ChatGPT comments: "In PostgreSQL, VARCHAR2 is not a valid datatype. We use VARCHAR instead."
# Re-establish connection and cursor
connection = psycopg2.connect(
dbname=db_name,
user=db_user,
password=db_password,
host=db_host,
port=db_port
)
# Create a server-side cursor
cursor_name = "my_cursor"
cursor = connection.cursor(cursor_name)
# Execute a SELECT query
cursor.execute("SELECT * FROM predictions;")
# Fetch rows in manageable chunks
while True:
rows = cursor.fetchmany(size=1000) # adjust this number as necessary
if not rows:
break
# Process each row
for row in rows:
print(row)
# Close the cursor and connection
cursor.close()
connection.close()
Analysis
- ChatGPT detected the incorrect type VARCHAR2 for the variable.
- The low-latency constraint is enforced by fetching batches of 1,000 rows at a time.
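The batching behavior of fetchmany is easy to demonstrate without a PostgreSQL instance; a sketch with sqlite3, whose DB-API cursor exposes the same fetchmany semantics (memory per iteration is bounded by the batch size):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY, mean REAL)")
# Populate 2,500 rows to force several batches
cursor.executemany(
    "INSERT INTO predictions (mean) VALUES (?)",
    [(float(i),) for i in range(2500)]
)
connection.commit()

cursor.execute("SELECT * FROM predictions")
batches = 0
total_rows = 0
while True:
    rows = cursor.fetchmany(size=1000)  # at most 1,000 rows held at a time
    if not rows:
        break
    batches += 1
    total_rows += len(rows)
connection.close()
print(batches, total_rows)  # 3 2500
```

With 2,500 rows and a batch size of 1,000, the loop runs three times (1,000 + 1,000 + 500 rows) instead of materializing the whole result set at once.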
Template Pattern
Prompt
The output has to be formatted in HTML. I provide a template for your output with the following placeholders in all caps: <title>MYTITLE</title> and the CSS style background-color:BACKGROUNDCOLOR. Please fit the output into the two placeholders MYTITLE and BACKGROUNDCOLOR.
Context: I am a software engineer developing an application with the lowest latency possible.
Answer
import pandas as pd
# Execute a SELECT query
cursor.execute("SELECT * FROM predictions;")
# Fetch all rows into pandas DataFrame
df = pd.read_sql_query("SELECT * FROM predictions;", connection)
# Convert DataFrame to HTML
html = df.to_html()
# HTML template
html_output = f"""
<!DOCTYPE html>
<html>
<head>
<title>MYTITLE</title> <!-- replace with your actual title -->
<style>
body {{
background-color: BACKGROUNDCOLOR; <!-- replace with your actual background color -->
}}
table {{
width: 100%;
border-collapse: collapse;
}}
th, td {{
border: 1px solid black;
padding: 8px;
text-align: left;
}}
</style>
</head>
<body>
{html}
</body>
</html>
"""
# Save the HTML output
with open("output.html", "w") as file:
file.write(html_output)
# Close the cursor and connection
cursor.close()
connection.close()
Analysis
- The code executes as intended.
- ChatGPT generates the HTML output by dumping the result set into a pandas DataFrame and invoking the to_html() method, as we did not specify an approach in our request.
- However, ChatGPT did not parameterize html_output, merely suggesting replacing the placeholders in HTML comments.
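Parameterizing the template ourselves takes one small function. A sketch where the placeholders become arguments (the names render_page, my_title, and bg_color are ours, not ChatGPT's):

```python
def render_page(table_html: str, my_title: str, bg_color: str) -> str:
    # Substitute the MYTITLE and BACKGROUNDCOLOR placeholders directly
    return f"""<!DOCTYPE html>
<html>
<head>
    <title>{my_title}</title>
    <style>
        body {{ background-color: {bg_color}; }}
        table {{ width: 100%; border-collapse: collapse; }}
        th, td {{ border: 1px solid black; padding: 8px; text-align: left; }}
    </style>
</head>
<body>
{table_html}
</body>
</html>"""

# table_html would be the output of df.to_html() in the generated code
page = render_page("<table><tr><td>0.5</td></tr></table>", "Predictions", "#f0f0f0")
```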
Visualization generator
Prompt
Answer
ChatGPT comments: "The resulting HTML output of the query will be displayed in a new tab in your default web browser. If you wish to open it specifically in Chrome, you'll need to provide the path to your Chrome executable and uncomment the relevant lines."
ChatGPT code update:
import os
import webbrowser

# Open the web page in the web browser
filepath = 'file://' + os.path.realpath(filename)

# To open in your default web browser
webbrowser.open_new_tab(filepath)

# Or, to specifically open in Chrome, uncomment the following lines
# and replace path_to_chrome with the path to your Chrome executable
# path_to_chrome = r'/path/to/chrome'  # update path as per your configuration
# webbrowser.get(path_to_chrome).open_new_tab(filepath)
Analysis
- We didn't specify the programming language for the visualization generator/script in our request. Consequently, ChatGPT correctly assumes that Python is the language to be used for rendering the HTML output of the query.
- It's worth noting that ChatGPT provided two statements for launching and opening the browser: one for the default browser and another tailored for Chrome.
- Similar to the previous response, ChatGPT did not include error handling code.
Reflection pattern
Prompt
Answer
ChatGPT comments:
"In the context of your request, here is the reasoning and assumptions:
PostgreSQL Connection: You mentioned that the PostgreSQL database is hosted on a remote service with the URL "mypostgresql". This implies that the Python script will require network access and the necessary credentials (username, password) to access the database.
Data Query: You want to query a large number of rows. This implies that the 'predictions' table will hold a large amount of data. For large result sets, it's good practice to use a server-side cursor, which fetches and processes the rows in manageable chunks. This reduces memory consumption and can potentially speed up the response times of your queries.
Table Creation: You want to create a table named "predictions" with specific fields and data types. This implies that you have the required permissions to create tables in the database.
Assumptions:
import psycopg2
import pandas as pd
import webbrowser
import os

# Define connection parameters
db_name = "EHR"
db_user = "username"      # replace with your actual username
db_password = "password"  # replace with your actual password
db_host = "localhost"     # replace with your actual host
db_port = "5432"          # default PostgreSQL port

try:
    # Connect to the PostgreSQL database
    connection = psycopg2.connect(
        dbname=db_name,
        user=db_user,
        password=db_password,
        host=db_host,
        port=db_port
    )
    # Create a cursor object
    cursor = connection.cursor()

    # Create the 'predictions' table
    cursor.execute("""CREATE TABLE predictions (
        id SERIAL PRIMARY KEY,
        num_entries INT,
        variable VARCHAR(128),
        mean FLOAT,
        variance FLOAT);""")
    connection.commit()

    # Close the cursor
    cursor.close()

    # Re-establish the cursor as a server-side cursor for fetching a large number of rows
    cursor = connection.cursor("my_cursor")

    # Execute a SELECT query
    cursor.execute("SELECT * FROM predictions;")

    # Fetch all rows into a pandas DataFrame
    df = pd.read_sql_query("SELECT * FROM predictions;", connection)

    # Convert the DataFrame to HTML
    html = df.to_html()

    # HTML template
    html_output = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>MYTITLE</title>  <!-- replace with your actual title -->
        <style>
            body {{ background-color: BACKGROUNDCOLOR; }}  <!-- replace with your actual background color -->
            table {{ width: 100%; border-collapse: collapse; }}
            th, td {{ border: 1px solid black; padding: 8px; text-align: left; }}
        </style>
    </head>
    <body>
    {html}
    </body>
    </html>
    """

    # Save the HTML output
    filename = "output.html"
    with open(filename, "w") as file:
        file.write(html_output)

    # Open the web page in the web browser
    filepath = 'file://' + os.path.realpath(filename)

    # Or, to specifically open in Chrome, uncomment the following lines
    # and replace path_to_chrome with the path to your Chrome executable
    # path_to_chrome = r'/path/to/chrome'  # update path as per your configuration
    # webbrowser.get(path_to_chrome).open_new_tab(filepath)

except psycopg2.Error as error:
    print(f"An error occurred while connecting to the PostgreSQL database: {error}")
finally:
    # Always close the cursor and connection, whether an error occurred or not
    if 'cursor' in locals():
        cursor.close()
    if 'connection' in locals():
        connection.close()
- In accordance with our specific request, ChatGPT has incorporated exception handling into the most recent version of the code.
- Furthermore, ChatGPT offers insights into its choice of using Pandas for HTML generation and makes assumptions about the PostgreSQL installation status, table condition, query size, and other aspects.
- It's worth noting that the comment regarding server/database performance could have included additional guidance on multi-threaded execution, creating views, and manipulating cursors within the Python code.
Conclusion
References
[4] Design Patterns: Elements of Reusable Object-Oriented Software, E. Gamma, R. Helm, R. Johnson, J. Vlissides, Addison-Wesley Professional Computing Series, 1995