I delve into a diverse range of topics, spanning programming languages, machine learning, data engineering tools, and DevOps. The articles are enriched with practical code examples, ensuring their applicability to real-world scenarios.
Sunday, November 3, 2024
Posts History
Labels: Airflow, BERT, Big Data, ChatGPT, Data pipelines, Deep learning, Docker, Genetic Algorithm, Java, Kafka, Machine learning, Monads, NoSQL, Python, PyTorch, Reinforcement Learning, Scala, Spark, Streaming
Monday, June 21, 2021
Open Source Lambda architecture for Deep Learning
Target audience: Beginner
Estimated reading time: 4'
Data scientists familiar with Python's scientific libraries have found their landscape transformed with the rise of 'big data' frameworks like Apache Hadoop, Spark, and Kafka.
This article introduces a modified Lambda architecture and briefly describes how its diverse open-source components fit together. It provides a broad overview of the key services found in a typical deployment of such an architecture.
Amazon Simple Storage Service (S3) is a highly available, secure object storage service with very high durability (eleven 9's), scalability, and support for versioning. It is versatile enough to accommodate any kind of data format.
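As a quick illustration, here is a minimal boto3 sketch for staging a training set in S3. The bucket name my-training-bucket and the object keys are hypothetical, and AWS credentials are assumed to be configured.

```python
# A minimal sketch: stage a training set in S3 with boto3.
# The bucket 'my-training-bucket' and the object keys are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload a local training file to S3
s3.upload_file("training.csv", "my-training-bucket", "datasets/training.csv")

# Retrieve it later, e.g., on a Spark driver or worker node
s3.download_file("my-training-bucket", "datasets/training.csv", "training_local.csv")
```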
Conceptual data flow
The concept and architecture are versatile enough to accommodate a variety of open-source and commercial solutions and services besides the frameworks prescribed in this presentation. The open-source framework PyTorch is used to illustrate the integration of big data frameworks such as Apache Kafka and Spark with a deep learning library to train, validate, and test deep learning models. Alternative libraries such as Keras or TensorFlow could also be used.
Let's consider the use case of training and validating a deep learning model, using Apache Spark to load, parallelize, and pre-process the data. Apache Spark takes advantage of a large number of servers and CPU cores.
In this simple design, the workflow is broken down into 6 steps (a sketch of the first two steps follows the list):
- Apache Spark loads, then parallelizes, the training data from AWS S3
- Spark distributes the data pre-processing, cleansing, and normalization across multiple worker nodes
- Spark forwards the processed data to the PyTorch cluster
- Flask converts REST requests into prediction queries against the PyTorch model
- The PyTorch model generates a prediction
- Run-time metrics are broadcast through Kafka
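To make steps 1 and 2 concrete, here is a minimal PySpark sketch. The file s3a://my-bucket/training.csv, its header row, and its all-numeric columns are assumptions; the normalization shown is a simple per-column standardization.

```python
# Sketch of steps 1-2, assuming a hypothetical file s3a://my-bucket/training.csv
# with a header row and numeric columns only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-training-prep").getOrCreate()

# Step 1: load and parallelize the training data from AWS S3
raw_df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("s3a://my-bucket/training.csv"))

# Step 2: distributed cleansing and per-column standardization
clean_df = raw_df.dropna()
for c in clean_df.columns:
    mean, std = clean_df.select(F.mean(c), F.stddev(c)).first()
    clean_df = clean_df.withColumn(c, (F.col(c) - mean) / std)
```

The resulting data frame, or its underlying partitions, is then handed off to the PyTorch cluster (step 3).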
Key services
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It extends the functionality of NumPy and scikit-learn to support the training, evaluation, and commercialization of complex machine learning models.
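As a brief, self-contained sketch of the training step, the toy regression model and random tensors below are stand-ins for the real model and the pre-processed Spark output.

```python
# A toy PyTorch training loop; the architecture and random tensors are
# stand-ins for the real model and the pre-processed data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

features = torch.randn(256, 16)   # stand-in for Spark's pre-processed output
labels = torch.randn(256, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()                # back-propagate the loss
    optimizer.step()               # update the model weights
```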
Apache Spark is an open-source cluster computing framework for fast, real-time processing. It supports the Scala, Java, Python, and R programming languages and includes streaming, graph, and machine learning libraries.
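For instance, Spark's Structured Streaming library can consume the run-time metrics broadcast in step 6. This sketch assumes a local Kafka broker, a hypothetical 'metrics' topic, and the spark-sql-kafka connector on the classpath.

```python
# Sketch: consume the hypothetical 'metrics' Kafka topic with Structured
# Streaming. Assumes a broker at localhost:9092 and the spark-sql-kafka
# connector package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metrics-stream").getOrCreate()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "metrics")
             .load())

# Decode the raw Kafka payload and dump it to the console
query = (stream_df.selectExpr("CAST(value AS STRING) AS metric")
         .writeStream.format("console").start())
query.awaitTermination()
```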
Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.
It captures data from various sources in real time as a continuous flow and routes it to the appropriate processor.
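Here is a sketch of step 6 with the kafka-python client; the broker address, the topic name 'metrics', and the metric payload are all hypothetical.

```python
# Sketch of step 6: broadcast run-time metrics through Kafka.
# The broker, topic name, and payload are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a metric after each prediction, e.g., from the Flask service
producer.send("metrics", {"model": "regressor-v1", "latency_ms": 12.4})
producer.flush()
```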
Apache Hive is an open-source data warehouse platform that facilitates reading, writing, and managing large datasets residing in distributed storage such as Hadoop, with query access through engines such as Apache Spark.
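Since Spark ships with Hive integration, querying a warehouse table is straightforward; the table 'training_runs' and its columns below are assumptions for illustration, and the Spark installation is assumed to be configured with Hive support.

```python
# Sketch: query a hypothetical Hive table 'training_runs' from Spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-reader")
         .enableHiveSupport()
         .getOrCreate())

runs_df = spark.sql("SELECT run_id, loss FROM training_runs WHERE loss < 0.1")
runs_df.show()
```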
Flask is a Python-based web development platform built as a micro-framework to support the REST protocol. Its minimalist approach to the web interface makes it a very intuitive tool for building micro-services.
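Tying steps 4 and 5 together, a Flask micro-service can expose the PyTorch model behind a REST endpoint. The /predict route and the TorchScript file model.pt are hypothetical.

```python
# Sketch of steps 4-5: a Flask endpoint that queries the PyTorch model.
# The route '/predict' and the file 'model.pt' are hypothetical.
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
model = torch.jit.load("model.pt")   # pre-trained, serialized model
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Convert the REST request into a prediction query (step 4)
    features = torch.tensor(request.get_json()["features"])
    with torch.no_grad():
        prediction = model(features).tolist()   # step 5: generate a prediction
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=8080)
```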
Thank you for reading this article. For more information ...
Note: This informational post introduced the high-level components of a Lambda architecture. Such an orchestration of services is the foundation of the iterative machine learning modeling concept known as MLOps, which will be discussed in a future post.
---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design, and end-to-end deployment and support, with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and is the author of "Scala for Machine Learning" (Packt Publishing, ISBN 978-1-78712-238-3).