Step 3: Converting Data to Documents

We'll use the MarkdownTextSplitter from Langchain to split the text into smaller segments. Each segment is then wrapped in a Document object, which pairs the segment's text (page_content) with its metadata and serves as a knowledge nugget for your chatbot.

Text splitters have two main parameters:

chunk_size: the maximum number of characters a chunk can contain.
chunk_overlap: the number of characters shared between two adjacent chunks.

Together, these two parameters control the granularity of the split. A smaller chunk size produces more, shorter chunks, while a larger chunk size produces fewer, longer ones. A larger chunk overlap repeats more text at the boundary between adjacent chunks, which helps preserve context that would otherwise be cut off at a boundary, at the cost of storing more duplicated text. A smaller chunk overlap repeats less.
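
To make the effect of the two parameters concrete, here is a minimal sketch in plain Python. It is a simplified fixed-window splitter, not the algorithm MarkdownTextSplitter uses, and the function name sliding_chunks is made up for this illustration. It only shows how chunk size and chunk overlap interact.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters (illustrative only).

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so adjacent windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"

small = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
large = sliding_chunks(sample, chunk_size=20, chunk_overlap=2)

print(small)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
print(large)  # ['abcdefghijklmnopqrst', 'stuvwxyz']
```

The smaller chunk size yields three chunks and the larger one yields two, and in both cases neighbouring chunks share two characters. Langchain's splitter also tries to break at Markdown-aware separators such as headings, so its chunk boundaries will generally differ from this fixed-window sketch.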

INSTRUCTIONS

Importing Libraries:

from langchain.text_splitter import MarkdownTextSplitter
from langchain.docstore.document import Document

Define a function text_to_docs that takes text and metadata as input and returns a list of Documents:

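def text_to_docs(text, metadata):
    """Split Markdown text into chunks and wrap each chunk in a Document."""
    doc_chunks = []
    # Each chunk holds at most 2048 characters; adjacent chunks may share up to 128.
    text_splitter = MarkdownTextSplitter(chunk_size=2048, chunk_overlap=128)
    chunks = text_splitter.split_text(text)
    for chunk in chunks:
        # Copy the metadata so the documents do not all share one mutable dict.
        doc = Document(page_content=chunk, metadata=dict(metadata))
        doc_chunks.append(doc)
    return doc_chunks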
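
text_to_docs attaches the same metadata values to every chunk. One possible extension, sketched below, is to record each chunk's position so that a retrieved chunk can be traced back to its place in the source page. This is an assumption about a useful variation, not part of the step's required solution. To run without Langchain installed, the sketch uses a hypothetical stand-in class Doc in place of Langchain's Document and takes chunks that are already split. The names chunks_to_docs and chunk_index, and the example URL, are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for Langchain's Document (same two field names)."""
    page_content: str
    metadata: Dict = field(default_factory=dict)


def chunks_to_docs(chunks: List[str], metadata: Dict) -> List[Doc]:
    """Wrap pre-split chunks in Doc objects, tagging each with its position."""
    docs = []
    for i, chunk in enumerate(chunks):
        # Build a fresh dict per chunk so that editing one document's
        # metadata does not affect the others. "chunk_index" is an
        # illustrative key name, not a Langchain convention.
        chunk_metadata = {**metadata, "chunk_index": i}
        docs.append(Doc(page_content=chunk, metadata=chunk_metadata))
    return docs


docs = chunks_to_docs(
    ["# Intro\nWelcome.", "## Details\nMore text."],
    {"source": "https://example.com/page"},
)
for doc in docs:
    print(doc.metadata)
```

Every document keeps the shared source value and gains its own chunk_index, and each metadata dict is a separate object.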


Project - Building a RAG Chatbot from Your Website Data using OpenAI, Langchain and Vector Database
