Target audience: Beginner

Estimated reading time: 3'

Posts history

It can be challenging for a data engineer or data scientist to produce, update and maintain the documentation for a project. This article presents the idea of "latent technical documentation" which utilizes tags on software development items (or artifacts), combined with a large language model (LLM), to develop, refine, and maintain a project's documentation.

Overview
Challenges
What is latent documentation?
Tagging artifacts
Python code
Scala code
Github commits
Airflow DAG & tasks
Docker compose
Generating documentation
Large Language Models
Retrieval-Augmented Generation (RAG)

References

Notes:

The post describes the simple use of large language models. It is not an in-depth description or even an introduction to LLM.
ChatGPT API is introduced in two of my previous posts: Secure ChatGPT API client in Scala and ChatGPT API Python client.
To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Overview

Challenges

The key challenges for documenting an AI-based project to be deployed in production are

Information spread across multiple platforms and devices.
Uneven documentation from contributors, experts with different background; Data engineers, DevOps, Scientists, Product managers, ...
Sections of the documentation being out-of-date following recent change in product or service requirements.
Missing justification for design or modeling decision.

What is latent documentation?

Latent technical documentation is a two-step process:

Tagger: Insert comments or document fragments related to a project for each artifact, item or step in the development process (coding, architecture design, unit testing, version control commits, deployment scripts, configurations containers definition and orchestration, ...).
Generator: Gather and consolidate various doc fragments into a single pre-formatted document for the entire project. A large language model (LLM) is an adequate tool to generate a clear and concise documentation.

The following diagram illustrates the tagging-generation process.

Illustration of two step latent documentation

Tagging artifacts is accomplished by defining an easy to use format that does not add overhead in the development cycle.

In this post we use the following format to select and tag relevant information in an artifact:

^#tag_key comments^

Example: ^#KafkaPipelineStream Initiate the processing of streams given properties assigned to the Kafka streams, in config/kafka.conf^

The second step, generation of the project document consists of collecting, parsing tags across the various artifacts, then generate a summary or/and formal document for the project.

Let's review some of the artifact tags.

Tagging artifacts

The engineers, data scientists tags, key & comment, for as many as possible artifacts used in the development and in the case of AI, training, and validation of models. A partial list of artifacts:

Source code files
Deployment scripts
Version control comments and logs
Unit tests objectives
Test results
Orchestration libraries such Airflow
Container-based frameworks such as Docker, Kubernetes or TerraFlow
Product requirement documents (PRD, MRD)
Minutes of meetings.

Python code

Documentation can be extracted from Python source code by selecting and tagging section of the comments. This process does not add much overhead to the development cycle as the contentious developers document their code, anyway.

async def post(self) -> list:
    """
       ^#AsyncHTTP Process the list of iterator (1 iterator per client). 
       The steps are:
        1. Create a tasks from co-routine
        2. Aggregate the various tasks
        3. Block on the completion of all the tasks^
        :return: List of results
    """
    tasks = self.__create_tasks()      
    all_tasks = asyncio.gather(*tasks) 
    responses = await all_tasks       

    assert self.num_requests == len(responses), \
            f'Number of responses {len(responses)} != number of requests {self.num_requests}'

    return responses

In this code snippet, the documentation fragment "Process the list .... of all the tasks" will be associated with the key AsyncHTTP.

Scala code

The following code snippet define a tag with key KafkaPipelineStream for the class constructor PipelineStreams and method start.

/**
 * ^#KafkaPipelineStream Parameterized basic pipeline streams that consumes requests.

    using Kafka stream.The topology is created from the request and response topic.

    This class inherits from PipelineStreams.^
 * @param valueDeserializerClass Class or type used in the deserialization for Kafka consumer
 * @tparam T Type of Kafka message consumed
 * @see org.streamingeval.kafka.streams.PipelineStreams
 */
private[kafka] abstract class PipelineStreams[T](valueDeserializerClass: String) {
  protected[this] val properties: Properties = getProperties
  protected[this] val streamBuilder: StreamsBuilder = new StreamsBuilder

  /**
   * ^#KafkaPipelineStream Initiate the processing of streams given properties assigned

      to the Kafka streams, in config/kafka.conf^
   * @param requestTopic Input topic for request (Prediction or Feedback)
   * @param responseTopic Output topic for response
   */
  def start(requestTopic: String, responseTopic: String): Unit =
    for {
      topology <- createTopology(requestTopic, responseTopic)
    }

    yield {
      val streams = new KafkaStreams(topology, properties)
      streams.cleanUp()
      streams.start()


      logger.info(s"Streaming for $requestTopic requests started!")
      val delayMs = 2000L
      delay(delayMs)

      // Shut down the streaming
      sys.ShutdownHookThread {
          streams.close(Duration.ofSeconds(12))
      }
    }
}

GitHub commits

Documentation can be augmented by tagging the comment(s) to a version control commit request. The following command line add comment for the key KafkaPipelineStream for a commit.

git commit -m "^#KafkaPipelineStreams Implementation of streams using RequestSerDe and R
esponseSerDe serialization-deserialization pairs^ for parameterized requests and responses" .

Airflow DAG & tasks

Here is an example of tagging section of comments on a Airflow Direct Acyclic Graph (DAG) of executable tasks, with the same key KafkaPipelineStream.

default_args = {
    'owner': 'herold',
    'retries': 3,
    'retry_delay': timedelta(minutes=10)
}



@dag(dag_id='produce_note_from_s3',
     default_args=default_args,
     start_date=datetime(2023, 4, 12),
     schedule_interval='@hourly')



"""
    ^#KafkaPipelineStream Definition of the DAG to load unstructured medical 
    documents from AWS S3. It relies on the loader function, 
    s3_loader defined in module kafka.util.^
"""
def collect_from_s3_etl():

    @task()
    def load_from_s3():
        return s3_loader()

    produced_notes = ProduceToTopicOperator(
        task_id="loaded_from_s3",
        kafka_config_id="kafka_default",
        topic=KAFKA_TOPIC,
        producer_function=loader_notes,
        producer_function_args=["{{ ti.xcom_pull(task_ids='load_from_s3')}}"],
        poll_timeout=10,
    )
    
    produced_notes()

Docker compose

Comments and tags can be also added to container application development such as Docker or a container orchestrator like Kubernetes.

The following multi-container descriptor, docker-compose.yml uses KafkaPipelineStream tag to add information regarding application deployment configuration to the project documentation.

version: '0.1'
networks:
    datapipeline:
        driver: bridge

services:
    zookeeper:
        # .... image and environment

        # ^#KafkaPipelineStream Kafka docker image loaded from bantam following zookeeper deployment
        # Port 29092
        # Consumer properties
        # KAFKA_CONSUMER_CONFIGURATION_POOL_TIME_INTERVAL: 14800
        # KAFKA_CONSUMER_CONFIGURATION_MAX_POLL_RECORDS: 120
        # KAFKA_CONSUMER_CONFIGURATION_FETCH_MAX_BYTES: 5428800
        # KAFKA_CONSUMER_CONFIGURATION_MAX_PARTITION_FETCH_BYTES: 1048576^
    kafka:
        image: bitnami/kafka:latest
        container_name: "Kafka"
        restart: always
        depends_on:
            - zookeeper
        ports:
            - 29092:29092
        environment:
            KAFKA_BROKER_ID: 1
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
            KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092, PLAINTEXT_HOST://localhost:29092
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXST_HOST:PLAINTEXT
            KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
            KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
            KAFKA_CONSUMER_CONFIGURATION_POOL_TIME_INTERVAL: 14800
            KAFKA_CONSUMER_CONFIGURATION_MAX_POLL_RECORDS: 120
            KAFKA_CONSUMER_CONFIGURATION_FETCH_MAX_BYTES: 5428800
            KAFKA_CONSUMER_CONFIGURATION_MAX_PARTITION_FETCH_BYTES: 1048576
        volumes:
            - ./producer:/producer
            - ./consumer:/consumer
        networks:
            - datapipeline

Generating documentation

The next challenge is to collect and generate the documentation. The step for the generation of overall project documents consists of

Collecting artifact using a script
Extracting tags as key value pairs
Grouping the various documentation fragments per key
Formatting, optionally and forwarding the document to a LLM model.

Large language models

Let's look at generative AI to create a formal, final project document. The process aggregates the various tag comment into a single text which is used as the context prompt ('system' role in ChatGPT).

The following document was produced by ChatGPT 4.0 [ref 1] although alternative large language models could be also used.

Please refer to the implementation of client to ChatGPT in Scala Secure ChatGPT API client in Scala and in Python ChatGPT API Python client.

1. Overview of PipelineStreams Class:
- The class inherits from PipelineStreams.
- It sets up parameterized basic pipeline streams that consume requests using Kafka stream.
- The topology is derived from both the request and response topics.


2. Kafka Streams Configuration:
- The processing of streams is initiated based on properties assigned to the Kafka streams.
- These properties are located in config/kafka.conf.


3. Implementation Details:
- Streams are implemented using RequestSerDe and ResponseSerDe serialization-deserialization pairs.


4. Loading Medical Documents from AWS S3:
- A Directed Acyclic Graph (DAG) is defined to load unstructured medical documents.
- The loading relies on the s3_loader function.
- The s3_loader function is defined in the kafka.util module.


5. Kafka Docker Deployment:
- Kafka docker image is sourced from bitnami.

- It follows a zookeeper deployment.
- The default port is 29092.


6. Kafka Consumer Properties:
- KAFKA_CONSUMER_CONFIGURATION_POOL_TIME_INTERVAL: 14800
- KAFKA_CONSUMER_CONFIGURATION_MAX_POLL_RECORDS: 120
- KAFKA_CONSUMER_CONFIGURATION_FETCH_MAX_BYTES: 5428800
- KAFKA_CONSUMER_CONFIGURATION_MAX_PARTITION_FETCH_BYTES: 1048576

The LLM produces a document of quality based on the quality of the prompt you provide. So, you must craft the tags used as context in the LLM request meticulously.

The maximum token limit for an LLM prompt (ChatGPT 4.0: 8192, Llama-2: 4096, Llama-code: 16384) can constrain the quantity of tagged information used to create the project document.

Using Retrieval-Augmented Generation (RAG), you can bypass the token restriction by defining the various tag inputs as embedded vectors and storing them in a vector database.

Retrieval-Augmented Generation (RAG)

Retrieval augmented generation is a more sophisticated leverage of large language models (LLMs). This is a machine learning framework that relies on an external knowledge base to improve the accuracy of LLMs [ref 2]. The knowledge base contains up to date, domain specific information.

In our case, the knowledge base is built by defining questions (tags) and expected output documentation.

Thank you for reading this article. For more information ...

References

[1] OpenAPI Getting started

[2] What is retrieval-augmented generation?

---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Thursday, June 15, 2023

Automate Technical Documentation using LLM

Overview

Challenges

What is latent documentation?

Tagging artifacts

Python code

Scala code

GitHub commits

Airflow DAG & tasks

Docker compose

Generating documentation

Large language models

Retrieval-Augmented Generation (RAG)

References

Contact Form

Equation Editor

Popular Posts