AgentLabs LLM Knowledge Base Best Practices

When uploading a PDF knowledge base into AgentLabs’ LLM bot, it’s essential to follow a systematic approach to ensure that the content is uploaded, processed, and integrated efficiently. Here’s a best practices guide to help you through the process.

Pre-Processing the PDF Document

Before uploading the PDF into the bot, ensure the document is in the best format for consumption by the LLM.

Remove unnecessary elements: Eliminate metadata, footers, watermarks, or background graphics.
Check file size: AgentLabs LLM accepts files up to 35 MB. Compress the file if it is too large, but make sure it remains readable.
Keep document structure consistent: Use clear headings, bullet points, and consistent formatting so the LLM can parse the content more easily.
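The file size check can be automated before upload. The following is a minimal sketch, not part of AgentLabs itself: the names MAX_UPLOAD_BYTES, is_within_upload_limit, and check_pdf_size are hypothetical, and it assumes the 35 MB limit means 35 × 1024 × 1024 bytes. If the platform counts megabytes differently, adjust the constant.

```python
import os

# Assumption: the documented 35 MB limit is treated as 35 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 35 * 1024 * 1024


def is_within_upload_limit(size_bytes: int, limit_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file of size_bytes fits within the limit."""
    return 0 <= size_bytes <= limit_bytes


def check_pdf_size(path: str) -> bool:
    """Read the size of the file at path from disk and compare it to the limit."""
    return is_within_upload_limit(os.path.getsize(path))
```

Calling check_pdf_size("handbook.pdf") on a local file returns False when the file needs to be compressed or split before upload.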

Types of content that can be extracted by AgentLabs LLM

🗒️ Note

AgentLabs LLM currently supports multimodal (image) input, so it can extract information from the images in your PDF file.

Two types of content can be extracted and trained into the LLM:

Text: The LLM extracts all the text contained in the PDF file, which is then trained into the model.
Image: The LLM extracts each image and describes it in text form. That description is then trained into the model.

Make sure the content of your PDF file is text or images so that it can be read and extracted.

Content Segmentation and Structuring

Break into sections: Organize content into sections, topics, or chapters, if possible. This segmentation helps the LLM understand and retrieve specific information more accurately.
Use meaningful headings: Ensure that each section is well-labeled with informative titles.

Content Filter

Content filtering refers to the process of monitoring, controlling, or restricting certain types of content that the LLM can generate or process. This is done to ensure the outputs are appropriate for a given context, audience, or purpose. Filtering can be applied to prevent:

Offensive language (e.g., hate speech, discriminatory terms)
Violent or harmful content
Misinformation or disinformation
Sensitive or private information leakage
Inappropriate or explicit material
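As an illustration of one category above (private information leakage), the sketch below redacts email addresses from text before it goes into a knowledge base. This is an assumption-laden example, not a description of how AgentLabs filters content: redact_emails and EMAIL_PATTERN are hypothetical names, and the pattern is deliberately simplified.

```python
import re

# Simplified pattern for illustration only; it does not cover every valid email address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)
```

A real filter would need to cover the other categories as well, which pattern matching alone is unlikely to handle.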

Examples of Best Practice

The following best practices show how to create a PDF knowledge base that AgentLabs LLM can understand.

Avoid ambiguity

If it’s confusing or ambiguous for a human to read, it’s also going to be ambiguous for AgentLabs LLM! (That’s when there’s a chance it’ll give the wrong answer, make an incorrect inference, or fail to answer something that should otherwise be answerable.)

Use headers

It’s important to use headers (H1, H2, or H3) to make your content scannable for both the LLM and humans. You should also repeat some of the header information in the paragraph below it, in case the LLM fails to capture all the headings from the image or text.

The LLM picks up relevant topics based on the headers you use, so make sure each header is clear and easy for the LLM to read.

🗒️ Important

Headings are very important for making the content clearer and more focused, because the LLM detects content from the headings you provide. Provide headings in a clear format that fits the context of the content beneath them.
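The advice to repeat header information in the paragraph below it can be checked mechanically before you export the PDF. The sketch below is only an illustration: the function names, the stop-word list, and the "at least one shared keyword" rule are all assumptions, and it expects you to supply your sections as (heading, body) pairs.

```python
import re

# Illustrative stop-word list (an assumption); extend it for your own content.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "how"}


def keywords(text: str) -> set[str]:
    """Lower-case words in text, minus common stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def sections_missing_heading_terms(sections: list[tuple[str, str]]) -> list[str]:
    """Return headings whose opening paragraph repeats none of the heading's keywords."""
    missing = []
    for heading, body in sections:
        # Only the first paragraph (text before the first blank line) is checked.
        first_paragraph = body.strip().split("\n\n", 1)[0]
        if not keywords(heading) & keywords(first_paragraph):
            missing.append(heading)
    return missing
```

For example, a section headed "Refund Policy" whose first paragraph begins "Our refund policy allows returns within 30 days." passes, while a section headed "Shipping Times" that opens with "Orders usually arrive within a week." is flagged for revision.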

Consistent Format

Maintaining a consistent format is crucial for a well-structured Knowledge Base, especially for topics related to LLM. Consistency improves readability, user comprehension, and ease of use.

Use clear headings and consistent bullet points.

Keep the title of each piece of content short but clear, and place it on the same page as the content it describes.

👍 Good Practice: Clear and short title, preferably on the same page as its content.
👎 Poor Practice: Title or heading on a separate page from its content.

Things to Know in AgentLabs

After the PDF uploads successfully, wait until all images have been extracted before you continue to Train Knowledge Base.

When uploading more than one PDF, wait for each one to finish training successfully before you upload the next.
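If you script your uploads, the one-at-a-time rule can be sketched as a simple loop. Everything here is hypothetical: this guide does not describe an AgentLabs API, so upload and get_status are placeholder callables you would supply yourself, and the status strings "trained" and "failed" are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable


def upload_sequentially(
    pdf_paths: Iterable[str],
    upload: Callable[[str], str],       # hypothetical: uploads a PDF, returns a document id
    get_status: Callable[[str], str],   # hypothetical: returns e.g. "training", "trained", "failed"
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> list[str]:
    """Upload PDFs one at a time, waiting for each to finish training before the next."""
    finished = []
    for path in pdf_paths:
        doc_id = upload(path)
        for _ in range(max_polls):
            status = get_status(doc_id)
            if status == "trained":
                break
            if status == "failed":
                raise RuntimeError(f"Training failed for {path}")
            time.sleep(poll_seconds)
        else:
            # The loop ran out of polls without seeing "trained".
            raise TimeoutError(f"Training did not finish for {path}")
        finished.append(doc_id)
    return finished
```

The next upload starts only after the previous document reports that training succeeded, which mirrors the manual workflow described above.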