Author: Om Kamath

RAG-as-a-Service: Unlock Generative AI for Your Business

With the rise of Large Language Models (LLMs) and generative AI trends, integrating generative AI solutions in your business can supercharge workflow efficiency. If you’re new to generative AI, the plethora of jargon can be intimidating. This blog will demystify the basic terminologies of generative AI and guide you on how to get started with a custom AI solution for your business with RAG-as-a-Service.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is a key concept in implementing LLMs or generative AI in business workflows. RAG leverages pre-trained Transformer models to answer business-related queries by injecting relevant data from your specific knowledge base into the query process. This data, which the LLMs may not have been trained on, is used to generate accurate and relevant responses.

RAG is both cost-effective and efficient, making generative AI more accessible. Let’s explore some key terminologies related to RAG.

Key Terminologies in RAG

Chunking

LLMs are resource-intensive and can only process a limited amount of text at a time, known as the ‘context window.’ The size of the context window varies based on the LLM used. To work within this limit, business data provided as documents or textual literature is segmented into smaller chunks, which are then used during the query retrieval process.

Since the chunks are unstructured and the queries may differ syntactically from the knowledge base data, chunks are retrieved using semantic search.
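
For illustration, here is a minimal sketch of one common chunking approach, splitting by words with a small overlap; real pipelines often chunk by tokens or sentences instead, and the sizes below are arbitrary placeholders:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks that fit a context window."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
```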

RAG-as-a-Service Process

Vector Databases

Vector databases like Pinecone, ChromaDB, and FAISS store the embeddings of business data. Embeddings convert textual data into numerical form based on its meaning and are stored in a high-dimensional vector space where semantically similar data points sit closer together.

When a user query is made, the embeddings of the query are used to find semantically similar chunks in the vector database.
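
As a minimal sketch of how this storage and retrieval might look with ChromaDB (one of the databases mentioned above) using its default embedding function; the collection name and query are placeholders, and `chunks` is the list produced by the chunking sketch above:

```python
import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients are also available
collection = client.create_collection("business_docs")

# Store the chunks; ChromaDB embeds them with its default embedding function
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieve the chunks most semantically similar to the user's query
results = collection.query(query_texts=["What is our refund policy?"], n_results=3)
print(results["documents"][0])
```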

RAG-as-a-Service

Implementing RAG for your business can be daunting if you lack technical expertise. This is where RAG-as-a-Service (RaaS) comes into play.

We at meetcody.ai offer a plug-and-play solution for your business needs. Simply create an account with us and get started for free. We handle the chunking, vector databases, and the entire RAG process, providing you with complete peace of mind.

FAQs

1. What is RAG-as-a-Service (RaaS)?

RAG-as-a-Service (RaaS) is a comprehensive solution that handles the entire Retrieval Augmented Generation process for your business. This includes data chunking, storing embeddings in vector databases, and managing semantic search to retrieve relevant data for queries.

2. How does chunking help in the RAG process?

Chunking segments large business documents into smaller, manageable pieces that fit within the LLM’s Context Window. This segmentation allows the LLM to process and retrieve relevant information more efficiently using semantic search.

3. What are vector databases, and why are they important?

Vector databases store the numerical representations (embeddings) of your business data. These embeddings allow for the efficient retrieval of semantically similar data when a query is made, ensuring accurate and relevant responses from the LLM.

Integrate RAG into your business with ease and efficiency by leveraging the power of RAG-as-a-Service. Get started with meetcody.ai today and transform your workflow with advanced generative AI solutions.

How to Automate Tasks with Anthropic’s Tools and Claude 3?

Getting started with Anthropic’s Tools

The greatest benefit of employing LLMs for tasks is their versatility. LLMs can be prompted in specific ways to serve a myriad of purposes, functioning as APIs for text generation or converting unstructured data into organized formats. Many of us turn to ChatGPT for our daily tasks, whether it’s composing emails or engaging in playful debates with the AI.

The architecture of plugins, also known as ‘GPTs’, revolves around identifying keywords from responses and queries and executing relevant functions. These plugins enable interactions with external applications or trigger custom functions.

While OpenAI led the way in enabling external function calls for task execution, Anthropic has recently introduced an enhanced feature called ‘Tool Use’, replacing their previous function calling mechanism. This updated version simplifies development by utilizing JSON instead of XML tags. Additionally, Claude-3 Opus boasts an advantage over GPT models with its larger context window of 200K tokens, particularly valuable in specific scenarios.

In this blog, we will explore the concept of ‘Tool Use’, discuss its features, and offer guidance on getting started.

What is ‘Tool Use’?

Claude has the capability to interact with external client-side tools and functions, enabling you to equip Claude with your own custom tools for a wider range of tasks.

The workflow for using Tools with Claude is as follows:

  1. Provide Claude with tools and a user prompt (API request)
    • Define a set of tools for Claude to choose from.
    • Include them along with the user query in the text generation prompt.
  2. Claude selects a tool
    • Claude analyzes the user prompt and compares it with all available tools to select the most relevant one.
    • Utilizing the LLM’s ‘thinking’ process, it identifies the input parameters required for the selected tool.
  3. Response Generation (API Response)
    • Upon completion, the model’s thinking, along with the selected tool and its input parameters, is returned as the output.

Following this process, you execute the selected function/tool and utilize its output to generate another response if necessary.
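
As a rough sketch of what this round trip looks like with the Anthropic Python SDK; the model ID, the `get_weather` tool, and the prompt below are illustrative assumptions, not code from the demo that follows:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-opus-20240229",  # assumed model ID; any tool-capable Claude 3 model works
    max_tokens=1024,
    tools=[{
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a given city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What's the weather like in Mumbai?"}],
)

# If Claude decides a tool is needed, stop_reason is "tool_use" and the content
# includes a tool_use block carrying the selected tool name and its input parameters.
print(response.stop_reason)
print(response.content)
```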

General schema of the tool

This schema serves as a means of communicating the requirements for the function calling process to the LLM. It does not directly call any function or trigger any action on its own. To ensure accurate identification of tools, a detailed description of each tool must be provided. Properties within the schema are utilized to identify the parameters that will be passed into the function at a later stage.

Demonstration

Let’s go ahead and build tools for scraping the web and finding the price of any stock.

Tools Schema

Code 1

The scrape_website tool extracts the website URL from the user prompt, while the stock_price tool identifies the company name in the prompt and converts it to a yfinance ticker.
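
The exact schemas used in the demo are shown in Code 1 above; a hedged approximation of what they might look like in the Anthropic tools format is:

```python
tools = [
    {
        "name": "scrape_website",
        "description": "Scrape the textual content of a website. Use when the user provides or implies a URL.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Full URL of the website to scrape"},
            },
            "required": ["url"],
        },
    },
    {
        "name": "stock_price",
        "description": "Get the latest stock price. Convert the company name mentioned by the user into its yfinance ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "yfinance ticker symbol, e.g. AAPL"},
            },
            "required": ["ticker"],
        },
    },
]
```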

User Prompt

Code 2

Asking the bot two queries, one for each tool, gives us the following outputs:

Code 3

The thinking process lists all the steps the LLM takes to select the correct tool for each query and to execute the necessary conversions described in the tool descriptions.

Selecting the relevant tool

We will have to write some additional code that will trigger the relevant functions based on conditions.

Code 4

This function triggers the appropriate code based on the tool name returned in the LLM response. In the first condition, we scrape the website URL obtained from the tool input; in the second, we fetch the stock ticker and pass it to the yfinance Python library.
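
The dispatch logic shown in Code 4 might look roughly like this; the helper functions, the requests/BeautifulSoup scraping, and the parameter names below are illustrative assumptions rather than the exact code from the screenshot:

```python
import requests
import yfinance as yf
from bs4 import BeautifulSoup


def scrape_website(url: str) -> str:
    # Fetch the page and return its visible text
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)


def get_stock_price(ticker: str) -> float:
    # Latest closing price for the ticker via yfinance
    return float(yf.Ticker(ticker).history(period="1d")["Close"].iloc[-1])


def select_tool(tool_block):
    # Dispatch on the tool name Claude selected and pass along its input parameters
    if tool_block.name == "scrape_website":
        return scrape_website(tool_block.input["url"])
    if tool_block.name == "stock_price":
        return get_stock_price(tool_block.input["ticker"])
    raise ValueError(f"Unknown tool: {tool_block.name}")
```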

Executing the functions

We will pass the entire ToolUseBlock in the select_tool() function to trigger the relevant code.
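
For instance, a minimal sketch, assuming the response object and select_tool() from the earlier sketches:

```python
# Pull the first tool_use block out of the response and run the matching function
tool_block = next(block for block in response.content if block.type == "tool_use")
result = select_tool(tool_block)
print(result)
```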

Outputs

  1. First prompt (Code 5)
  2. Second prompt (Code 4)

If you want to view the entire source code of this demonstration, you can view this notebook.

Some Use Cases

The ‘tool use’ feature for Claude elevates the versatility of the LLM to a whole new level. While the example provided is fundamental, it serves as a foundation for expanding functionality. Here is one real-life application of it:

To find more use-cases, you can visit the official repository of Anthropic here.

Top Hugging Face Spaces You Should Check Out in 2024

Hugging Face has quickly become a go-to platform in the machine learning community, boasting an extensive suite of tools and models for NLP, computer vision, and beyond. One of its most popular offerings is Hugging Face Spaces, a collaborative platform where developers can share machine learning applications and demos. These “spaces” allow users to interact with models directly, offering a hands-on experience with cutting-edge AI technology.

In this article, we will highlight five standout Hugging Face Spaces that you should check out in 2024. Each of these spaces provides a unique tool or generator that leverages the immense power of today’s AI models. Let’s delve into the details.

EpicrealismXL

Epicrealismxl is a state-of-the-art text-to-image generator built on the Stable Diffusion epicrealism-xl model. This space lets you provide a prompt, negative prompts, and the number of sampling steps to generate breathtaking images. Whether you are an artist seeking inspiration or a marketer looking for visuals, epicrealismxl offers high-quality image generation that is as realistic as it is epic.

Podcastify

Podcastify revolutionizes the way you consume written content by converting articles into listenable audio podcasts. Simply paste the URL of the article you wish to convert into the textbox, click “Podcastify,” and voila! You have a freshly generated podcast ready for you to listen to or view in the conversation tab. This tool is perfect for multitaskers who prefer auditory learning or individuals on the go.

Dalle-3-xl-lora-v2

Another stellar text-to-image generator, dalle-3-xl-lora-v2, utilizes the well-known DALL-E 3 model. Similar in function to epicrealismxl, this tool allows you to generate images from textual prompts. DALL-E 3 is known for its versatility and creativity, making it an excellent choice for generating complex and unique visuals for various applications.

AI Web Scraper

AI Scraper brings advanced web scraping capabilities to your fingertips without requiring any coding skills. This no-code tool lets you easily scrape and summarize web content using advanced AI models hosted on the Hugging Face Hub. Input your desired prompt and source URL to start extracting useful information in JSON format. This tool is indispensable for journalists, researchers, and content creators.

AI QR Code Generator

The AI QR Code Generator takes your QR codes to a whole new artistic level. By using the QR code image as both the initial and control image, this tool allows you to generate QR Codes that blend naturally with your provided prompt. Adjust the strength and conditioning scale parameters to create aesthetically pleasing QR codes that are both functional and beautiful.

Conclusion

Hugging Face Spaces are a testament to the rapid advancements in machine learning and AI. Whether you’re an artist, a content creator, a marketer, or just an AI enthusiast, these top five spaces offer various tools and generators that can enhance your workflow and ignite your creativity. Be sure to explore these spaces to stay ahead of the curve in 2024. If you want to know about the top 5 open source LLMs in 2024, read our blog here.

Gemini 1.5 Flash vs GPT-4o: Google’s Response to GPT-4o?

The AI race has intensified, becoming a catch-up game between the big players in tech. The launch of GPT-4o just before Google I/O was no coincidence. GPT-4o’s incredible capabilities in multimodality, or omnimodality to be precise, have made a significant impact on the generative AI competition. However, Google is not one to hold back. During Google I/O, they announced new variants of their Gemini and Gemma models. Among all the models announced, Gemini 1.5 Flash stands out as the most impactful. In this blog, we will explore the top features of Gemini 1.5 Flash and compare it with Gemini 1.5 Pro and GPT-4o to determine which one comes out ahead.

Comparison of Gemini 1.5 Flash vs GPT-4o

Based on the benchmark scores released by Google, the Gemini 1.5 Flash has superior performance on audio compared to all other LLMs by Google and is on par with the outgoing Gemini 1.5 Pro (Feb 2024) model for other benchmarks. Although we would not recommend relying completely on benchmarks to assess the performance of any LLM, they help in quantifying the difference in performance and minor upgrades.

Gemini 1.5 Flash Benchmarks

The elephant in the room is the cost of the Gemini 1.5 Flash. Compared to GPT-4o, the Gemini 1.5 Flash is much more affordable.

Price of Gemini

Price of GPT

Context Window

Just like the Gemini 1.5 Pro, the Flash comes with a context window of 1 million tokens, which is more than any of the OpenAI models and is one of the largest context windows for production-grade LLMs. A larger context window allows for more data comprehension and can improve third-party techniques such as RAG (Retrieval-Augmented Generation) for use cases with a large knowledge base by increasing the chunk size. Additionally, a larger context window allows more text generation, which is helpful in scenarios like writing articles, emails, and press releases.

Multimodality

Gemini-1.5 Flash is multimodal. Multimodality allows for inputting context in the form of audio, video, documents, etc. LLMs with multimodality are more versatile and open the doors for more applications of generative AI without any preprocessing required.

“Gemini 1.5 models are built to handle extremely long contexts; they have the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio.” — DeepMind Report

Dabbas = Train coach in Hindi. Demonstrating the Multimodality and Multilingual performance.

Having multimodality also allows us to use LLMs as substitutes for other specialized services, for example OCR or web scraping.

OCR on gemini

Easily scrape data from web pages and transform it.
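
As a rough sketch of how such an OCR-style extraction could be done with the google-generativeai Python SDK; the API key placeholder and the input image file below are illustrative:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key
model = genai.GenerativeModel("gemini-1.5-flash")

image = Image.open("scanned_receipt.png")  # hypothetical input image
response = model.generate_content([image, "Extract all readable text from this image as plain text."])
print(response.text)
```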

Speed

Gemini 1.5 Flash, as the name suggests, is designed to have an edge over other models in response time. In the web-scraping example above, there is roughly a 2.5-second difference in response time, almost 40% quicker, making Gemini 1.5 Flash a better choice for automation or any use case that requires lower latency.

Speed on Gemini 1.5 Pro

Some interesting use-cases of Gemini 1.5 Flash

Summarizing Videos


Writing Code using Video

Automating Gameplay

GPT-4o: OpenAI Unveils Its Latest Language Model, Available for Free to Users

After a ton of speculation on social media and other forums about what OpenAI has in store for us, yesterday, OpenAI finally revealed their latest and most powerful LLM to date — GPT-4o (‘o’ for omni). In case you missed the launch event of GPT-4o, let’s go over the capabilities of GPT-4o and the features it offers.

Enhanced Audio, Text, and Vision Capabilities

GPT-4 Turbo is a powerful model, but it comes with one drawback — latency. When compared to GPT-3.5 Turbo, GPT-4 Turbo is still considerably slower. GPT-4o addresses this drawback and is 2x faster than GPT-4 Turbo. This opens up a broader spectrum of use cases involving the integration of data from speech, text, and vision, taking it one step further from multi-modal to omni-modal. The main difference between multi-modal and omni-modal is that in omni-modal, all three sources can be seamlessly run in parallel.

These enhancements also enable the model to generate speech with improved voice modulation, the capability to understand sarcasm, and enhanced natural conversational abilities.

Reduced pricing and available for free to ChatGPT users

Despite being more efficient and faster than the outgoing GPT-4 Turbo, GPT-4o is half the price on the API: US$5.00 per 1M input tokens and US$15.00 per 1M output tokens. Alongside the better pricing, the context window is 128k tokens and the knowledge cutoff is October 2023.

As a cherry on top, GPT-4o will be available to all ChatGPT users for free (ChatGPT Plus users get a 5x higher message cap for GPT-4o). Alongside this, OpenAI also unveiled the ChatGPT desktop app, which will allow users to make use of GPT-4o’s vision capabilities to read and comprehend the content displayed on the screen. Users will also be able to talk to ChatGPT using the desktop app.

GPT-4o Demo

OpenAI stated that they are rolling out access to GPT-4o in stages over the next few weeks, with ChatGPT Plus users receiving priority and early access to the model. We will understand the true potential of this model only once we get access to it in the coming weeks. Exciting times ahead!

Groq and Llama 3: A Game-Changing Duo

A couple of months ago, a company named ‘Groq’ emerged seemingly out of nowhere, making a breakthrough in the AI industry. They provide a platform for developers to access LPUs as inferencing engines for LLMs, especially open-source ones like Llama, Mixtral, and Gemma. In this blog, let’s explore what makes Groq so special and delve into the marvel behind LPUs.

What is Groq?

“Groq is on a mission to set the standard for GenAI inference speed, helping real time AI applications come to life today.” — The Groq Website

Groq isn’t a company that develops LLMs like GPT or Gemini. Instead, Groq focuses on enhancing the foundations of these large language models: the hardware they operate on. It serves as an ‘inference engine.’ Currently, most LLMs in the market run on traditional GPUs deployed on private servers or in the cloud. These GPUs, sourced from companies like Nvidia, are powerful but expensive, and their traditional architecture may not be optimally suited to LLM inferencing (though it remains the preferred choice for training models).

The inference engine provided by Groq works on LPUs — Language Processing Units.

What is an LPU?

A Language Processing Unit is a chip designed specifically for LLM workloads. It is built on a unique architecture that combines aspects of CPUs and GPUs to transform the pace, predictability, performance, and accuracy of AI solutions for LLMs.

Key attributes of an LPU system. Credits: Groq

An LPU system has as much as or more compute than a graphics processing unit (GPU) and reduces the amount of time spent per word calculated, allowing faster generation of text sequences.

Features of an LPU inference engine as listed on the Groq website:

  • Exceptional sequential performance
  • Single core architecture
  • Synchronous networking that is maintained even for large scale deployments
  • Ability to auto-compile >50B LLMs
  • Instant memory access
  • High accuracy that is maintained even at lower precision levels

Services provided by Groq:

  1. GroqCloud: LPUs on the cloud
  2. GroqRack: 42U rack with up to 64 interconnected chips
  3. GroqNode: 4U rack-ready scalable compute system featuring eight interconnected GroqCard™ accelerators
  4. GroqCard: A single chip in a standard PCIe Gen 4×16 form factor providing hassle-free server integration

“Unlike the CPU that was designed to do a completely different type of task than AI, or the GPU that was designed based on the CPU to do something kind of like AI by accident, or the TPU that modified the GPU to make it better for AI, Groq is from the ground up, first principles, a computer system for AI”— Daniel Warfield, Towards Data Science

To know more about how LPUs differ from GPUs, TPUs and CPUs, we recommend reading this comprehensive article written by Daniel Warfield for Towards Data Science.

What’s the point of Groq?

LLMs are incredibly powerful, capable of tasks ranging from parsing unstructured data to answering questions about the cuteness of cats. However, their main drawback currently lies in response time. The slower response time leads to significant latency when using LLMs in backend processes. For example, fetching data from a database and displaying it in JSON format is currently much faster when done using traditional logic rather than passing the data through an LLM for transformation. However, the advantage of LLMs lies in their ability to understand and handle data exceptions.

With the incredible inference speed offered by Groq, this drawback of LLMs can be greatly reduced. This opens up better and wider use-cases for LLMs and reduces costs, as with an LPU, you’ll be able to deploy open-source models that are much cheaper to run with really quick response times.

Llama 3 on Groq

A couple of weeks ago, Meta unveiled their latest iteration of the already powerful and highly capable open-source LLM—Llama 3. Alongside the typical enhancements in speed, data comprehension, and token generation, two significant improvements stand out:

  1. Trained on a dataset 7 times larger than Llama 2, with 4 times more code.
  2. Doubled context length to 8,000 tokens.

Llama 2 was already a formidable open-source LLM, but with these two updates, the performance of Llama 3 is expected to rise significantly.

Llama 3 Benchmarks

To test Llama 3, you have the option to utilize Meta AI or the Groq playground. We’ll showcase the performance of Groq by testing it with Llama 3.

Groq Playground

Currently, the Groq playground offers free access to Gemma 7B, Llama 3 70B and 8B, and Mixtral 8x7b. The playground allows you to adjust parameters such as temperature, maximum tokens, and streaming toggle. Additionally, it features a dedicated JSON mode to generate JSON output only.

Only 402ms for inference at the rate of 901 tokens/s
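
Beyond the playground, the same models can be called through Groq’s API. Here is a minimal sketch assuming the Groq Python SDK’s OpenAI-style chat interface and the llama3-70b-8192 model ID:

```python
from groq import Groq

client = Groq()  # assumes GROQ_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="llama3-70b-8192",  # Llama 3 70B as exposed on GroqCloud
    messages=[{"role": "user", "content": "Explain what an LPU is in two sentences."}],
    temperature=0.5,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```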

Coming to the most impactful domain/application in my opinion, data extraction and transformation:

Asking the model to extract useful information and providing a JSON using the JSON mode.

The extraction and transformation to JSON format was completed in less than half a second.
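
A rough sketch of reproducing this kind of extraction through the API’s JSON mode, reusing the client from the earlier sketch; the system prompt and the raw_text variable are illustrative assumptions:

```python
raw_text = "..."  # the unstructured text you want to transform

completion = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[
        {"role": "system", "content": "Extract the name, email, and company from the text. Reply with JSON only."},
        {"role": "user", "content": raw_text},
    ],
    response_format={"type": "json_object"},  # JSON mode: constrains the output to valid JSON
)
print(completion.choices[0].message.content)
```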

Conclusion

As demonstrated, Groq has emerged as a game-changer in the LLM landscape with their innovative LPU Inference Engine. The rapid transformation showcased here hints at the immense potential for accelerating AI applications. Looking ahead, one can only speculate about future innovations from Groq. Perhaps an Image Processing Unit could revolutionize image generation models, contributing to advancements in AI video generation. Indeed, it’s an exciting future to anticipate.

Looking ahead, as LLM training becomes more efficient, the potential for having a personalized ChatGPT, fine-tuned with your data on your local device, becomes a tantalizing prospect. One platform that offers such capabilities is Cody, an intelligent AI assistant tailored to support businesses in various aspects. Much like ChatGPT, Cody can be trained on your business data, team, processes, and clients, using your unique knowledge base.

With Cody, businesses can harness the power of AI to create a personalized and intelligent assistant that caters specifically to their needs, making it a promising addition to the world of AI-driven business solutions.