Project - Building a RAG Chatbot from Your Website Data using OpenAI and LangChain


Step 3: Converting Data to Documents

We'll use LangChain's MarkdownTextSplitter to split the text into smaller segments. Each segment is then wrapped in a Document object, which pairs the chunk's text with its metadata and serves as one retrievable unit of knowledge for your chatbot.

Text splitters have two main parameters:

  1. chunk_size: the maximum number of characters a chunk can contain.
  2. chunk_overlap: the maximum number of characters that two adjacent chunks may share.

Together, these parameters control the granularity of the split. A smaller chunk size produces more, shorter chunks; a larger chunk size produces fewer, longer chunks. A larger chunk overlap means adjacent chunks repeat more of each other's text, which helps preserve context that would otherwise be cut at a chunk boundary but adds redundancy; a smaller chunk overlap reduces that repetition.
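To make the effect of the two parameters concrete, here is a minimal sketch of a fixed-width splitter written in plain Python. It is an illustration only: the helper split_with_overlap is hypothetical and is not how MarkdownTextSplitter works internally (the real splitter prefers to break on Markdown structure such as headings and paragraphs), but the relationship between chunk size, overlap, and the number of chunks is the same in spirit.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width splitter: illustrative only, not LangChain's algorithm."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters

no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
smaller_chunks = split_with_overlap(sample, chunk_size=5, chunk_overlap=0)

print(no_overlap)           # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)         # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
print(len(smaller_chunks))  # 6
```

With an overlap of 4, each chunk starts with the last four characters of the previous one, so the same 26 characters yield four chunks instead of three. Halving the chunk size from 10 to 5 doubles the chunk count from three to six.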

INSTRUCTIONS
  1. Import the required libraries:

    from langchain.text_splitter import MarkdownTextSplitter
    from langchain.docstore.document import Document
    
  2. Define a function text_to_docs that takes text and metadata as input and returns a list of Documents:

    def text_to_docs(text, metadata):
        """Split Markdown text into chunks and wrap each chunk in a Document."""
        text_splitter = MarkdownTextSplitter(chunk_size=2048, chunk_overlap=128)
        chunks = text_splitter.split_text(text)
        # Give each Document its own copy of the metadata so that editing one
        # chunk's metadata later cannot affect the others.
        return [Document(page_content=chunk, metadata=dict(metadata)) for chunk in chunks]
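Every chunk from one page receives the same metadata, so it is worth being deliberate about whether the chunks share one dictionary or each get their own. The sketch below shows the difference using FakeDocument, a hypothetical stand-in that only mimics the two fields used here (page_content and metadata); the URL is a placeholder as well. Whether the real Document class copies the dictionary for you may depend on the library version, so copying explicitly is the conservative choice.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


chunks = ["first chunk", "second chunk"]
page_metadata = {"source": "https://example.com/pricing"}  # placeholder URL

# Shared dict: every document points at the same metadata object.
shared = [FakeDocument(chunk, page_metadata) for chunk in chunks]
shared[0].metadata["chunk_index"] = 0
print(shared[1].metadata)  # the change leaks into the second document

# Copied dict: each document gets its own metadata plus its position.
page_metadata = {"source": "https://example.com/pricing"}
copied = [
    FakeDocument(chunk, {**page_metadata, "chunk_index": i})
    for i, chunk in enumerate(chunks)
]
print([doc.metadata["chunk_index"] for doc in copied])  # [0, 1]
```

Recording a per-chunk field such as chunk_index is optional, but it can make it easier to trace a retrieved chunk back to its position in the source page.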
    