We'll use the MarkdownTextSplitter from LangChain to chop the text into smaller segments. Each segment will then be transformed into a "document" object, essentially a knowledge nugget for your chatbot.
Text splitters have two main parameters, chunk_size and chunk_overlap, which together control the granularity of the split:
chunk_size is the maximum length of each chunk, measured in characters by default. A smaller chunk size produces more chunks, while a larger chunk size produces fewer. chunk_overlap is the maximum number of characters that neighbouring chunks may share. A larger overlap repeats more text across chunk boundaries, which helps keep context intact when a passage is cut in two; a smaller overlap repeats less.
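To make the effect of the two parameters concrete, here is a minimal sketch of a fixed-window character splitter. It is not how MarkdownTextSplitter works internally (that splitter prefers to break on Markdown structure such as headings); the function name split_with_overlap and the sample text are assumptions made purely for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive illustration: fixed-size character windows that overlap.

    Hypothetical helper, not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"

# Smaller chunk size: more chunks.
small = split_with_overlap(text, chunk_size=10, chunk_overlap=2)
print(small)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']

# Larger chunk size: fewer chunks.
large = split_with_overlap(text, chunk_size=20, chunk_overlap=2)
print(large)  # ['abcdefghijklmnopqrst', 'stuvwxyz']
```

With an overlap of 2, the last two characters of each chunk reappear at the start of the next one, so text that straddles a boundary is still seen whole in at least one chunk.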
Importing Libraries:
from langchain.text_splitter import MarkdownTextSplitter
from langchain.docstore.document import Document
Define a function text_to_docs that takes text and metadata as input and returns a list of Documents:
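def text_to_docs(text, metadata):
    # Split the Markdown text into chunks of at most 2048 characters,
    # with up to 128 characters shared between neighbouring chunks.
    text_splitter = MarkdownTextSplitter(chunk_size=2048, chunk_overlap=128)
    chunks = text_splitter.split_text(text)

    # Wrap each chunk in a Document that carries the supplied metadata.
    doc_chunks = []
    for chunk in chunks:
        doc = Document(page_content=chunk, metadata=metadata)
        doc_chunks.append(doc)
    return doc_chunks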
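Note that text_to_docs attaches the same metadata dictionary to every chunk. If you later want to tell chunks apart, or change one chunk's metadata without affecting the others, one option is to give each Document its own copy that also records the chunk's position. The sketch below shows that idea under stated assumptions: the Document class here is a small stand-in so the example runs without LangChain installed, and chunks_to_docs and the chunk_index key are hypothetical names, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_to_docs(chunks, metadata):
    """Wrap chunks in Documents, each with its own metadata copy.

    Hypothetical helper: adds a 'chunk_index' key (an assumed name)
    so every chunk can be traced back to its position in the source.
    """
    docs = []
    for i, chunk in enumerate(chunks):
        # Build a fresh dict per chunk instead of sharing one object.
        chunk_metadata = {**metadata, "chunk_index": i}
        docs.append(Document(page_content=chunk, metadata=chunk_metadata))
    return docs


docs = chunks_to_docs(
    ["# Intro\nHello", "## Setup\nInstall it"],
    {"source": "guide.md"},
)
for doc in docs:
    print(doc.metadata)
```

Because each Document gets a fresh dictionary, editing one chunk's metadata leaves the others, and the dictionary you passed in, unchanged.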