AI RAG with LlamaIndex, Local Embedding, and Ollama Llama 3.1 8b

6 min readJan 11, 2025

An AI generated black and white image of a close up of a llama eye

In this post, I cover using LlamaIndex LlamaParse in auto mode to parse a PDF page containing a table, using a Hugging Face local embedding model, and using local Llama 3.1 8b via Ollama to perform naive Retrieval Augmented Generation (RAG). That’s a mouthful. I won’t go into how to setup Ollama and Llama 3.1 8b; this post assumes it is running.

First off, you can find the code for this in my LlamaIndex_Test Github repo under Test1/src folder. At the time of this writing there is a Test0 and a Test1. To see the post about Test0 code see Using LlamaIndex — Part 1 OpenAI.

The code uses a .env and load_dotenv() to populate the needed LLAMA_CLOUD_API_KEY. I recommend that if you have an OPENAI_API_KEY entry in the .env that you comment it out for this experiment to prove to yourself that the embedding and LLM are local and not OpenAI. See the part 1 post for more details on the LLAMA_CLOUD_API_KEY.

#OPENAI_API_KEY=YOUR_API_KEY
LLAMA_CLOUD_API_KEY=YOUR_API_KEY

The pip install dependencies I put as comments at the top of the python file. There is also a requirements.txt for the project as a whole that covers all the “Test” experiments package requirements.

# pip install llama-index-embeddings-huggingface
# pip install llama-index-llms-ollama
# pip install llama-index-core llama-parse llama-index-readers-file

The nice thing about LlamaIndex LlamaParse is that it provides an auto mode that will use premium mode when specified criteria are met. In this experiment, I have set auto mode on with triggers for mode change on in- page images or tables. Also, to save on parsing credit usage in LlamaParse and because, for this example, it is all that is needed, I have set the pages to be parsed to PDF page 9 only (note that PDF page 9 is target page 8 to LlamaParse because it uses a 0 based page index). Like the part 1 post, I am using an output of markdown because it provides greater context to the LLM; though, I did try it with result_type=text and received the proper query response despite the answer to the query being in a table.

# set LlamaParse for markdown output and auto_mode only parsing page 8
parser = LlamaParse(
    result_type="markdown", 
    auto_mode=True,
    auto_mode_trigger_on_image_in_page=True,
    auto_mode_trigger_on_table_in_page=True,
    target_pages="8",
    verbose=True
)

So that you don’t have to open the PDF document that gets parsed to understand the input below is a screenshot of the page.

Image of the “Amazon_nova_technical_report” PDF page 9 that shows the text and table on the page. See also the PDF in the GitHub repo under sample_docs.

As in part 1, I use LlamaParse.load_data to read the page and parse it. Since it has a table in-page and we are in auto mode it will automatically use Premium mode to potentially better handle the page and table. This will cause the page parse to cost 15 credits on LlamaIndex. Note that LlamaIndex will cache your parsed page for 48 hours unless you specify otherwise or change the parse parameters which allows you to run the code more than once and only get the credit cost once. I did try using the default “accurate” mode by removing the auto_mode parameters on the LlamaParse and it still parsed the table properly and returned the proper answer to the query — but this is a sample for showing the use of “auto mode” so just pretend that is not the case.

If you want to see the output of the parser, uncomment the print command after the documents variable is populated. I like to then paste it into a markdown viewer to see it as rendered markdown output. See the below image for that output.

with open(f"../../sample_docs/{file_name}", "rb") as file_to_parse:
    # LlamaParse will cache a parsed document 48 hours if the parse parameters are not changed
    # thus not incuring additional parse cost if you run this multiple times for testing purposes
    # see the history tab in the LlamaParse dashboard for the project to confirm that 
    # credits used = 0 for subsequent runs
    # 
    # must provide extra_info with file_name key when passing file object
    documents = parser.load_data(file_to_parse, extra_info=extra_info)
    # to manually check the output uncomment the below
    #print(documents[0].text)

A screenshot of the LlamaParse output and the rendered markdown side by side.

I like to set the default settings for LLM and embedding model so that I don’t need to pass them around as parameters. Here is where I set the embedding model to a Hugging Face provided model. When you run the python for the first time it will pull down the embedding model automatically — nice!

# set the default embeddings and llm so that it doesn't have to be passed around
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3.1:latest", request_timeout=120.0)

The next part of the code does the same that it did in Part 1 except that this time the VectoreStoreIndex and the query engine use the models I set in the Settings singleton versus the LlamaIndex default of OpenAI.

# index the parsed documents using the default embedding model
index = VectorStoreIndex.from_documents(documents)

# generate a query engine for the index using the default llm
query_engine = index.as_query_engine()

# provide the query and output the results
query = "What is the latency in seconds for Nova Micro?"
response = query_engine.query(query)
print(response)

If all goes well you should get the response output as 0.5 and if you look back at the table from the page you’ll see that is correct.

(.venv) PS C:\python\LlamaIndex_Test\Test1\src> python parse_ollama.py
Started parsing the file under job_id 37dce328-aaa7-499b-afe9-498c32b63944
.0.5

To validate that the value was coming from the RAG provided PDF page and not the the LLMs inherent “knowledge”, I asked a similar question via the command line to Ollama without providing the RAG context— output below:

PS C:\temp> ollama run llama3.1:latest "what is the latency in seconds for Nova Micro Amazon LLM model?"
I don't have access to specific information about the latency of the Nova Micro Amazon LLM (Large Language Model)
model. The details regarding models like this, especially concerning their performance metrics such as latency,
are typically available from the developers or through official documentation and may be subject to change. If
you're looking for accurate and up-to-date information on this topic, I recommend checking directly with Nova
Micro's resources or contacting them for the most current data.

There you have it. But I am not done quite yet in reporting my results. In LlamaIndex’s examples, they used this PDF but used PDF page 1 which contains an image. See below an image of the page.

Screenshot of page 1 of the PDF sowing text and an image of a diagram.

They use this page to demonstrate how LlamaParse in auto mode moves into premium mode for the page parsing because of the image and then creates a mermaid diagram from the image because it recognizes the image is of a diagram. Below is what they report as the outcome in part.

# The Amazon Nova Family of Models:
# Technical Report and Model Card

Amazon Artificial General Intelligence

```mermaid
graph TD
    A[Text] --> B[Nova Lite]
    C[Image] --> B
    D[Video] --> E[Nova Pro]
    F[Code] --> E
    G[Docs] --> E
    B --> H[Text]
    B --> I[Code]
    E --> H
    E --> I
    J[Text] --> K[Nova Micro]
    L[Code] --> K
    K --> M[Text]
    K --> N[Code]
    O[Text] --> P[Nova Canvas]
    Q[Image] --> P
    P --> R[Image]
    S[Text] --> T[Nova Reel]
    U[Image] --> T
    T --> V[Video]
    
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style K fill:#f9f,stroke:#333,stroke-width:2px
    style P fill:#f9f,stroke:#333,stroke-width:2px
    style T fill:#f9f,stroke:#333,stroke-width:2px
    
    classDef input fill:#lightblue,stroke:#333,stroke-width:1px;
    class A,C,D,F,G,J,L,O,Q,S,U input;
    
    classDef output fill:#lightgreen,stroke:#333,stroke-width:1px;
    class H,I,M,N,R,V output;
```

Figure 1: The Amazon Nova family of models

When I tried this I did not get the same outcome from the parse. It did not even attempt to generate a mermaid diagram. I received the following output for the diagram image section; far from their professed output.

The Amazon Nova Family of Models:
Technical Report and Model Card
Amazon Artificial General Intelligence
Nova
Lite Nova
Nova Micro Ix
Pro <l> <l > </>
A Ix
</>
=
Nova Nova
Canvas Reel
Figure 1: The Amazon Nova family of models

In the experiment, everything is local except LlamaIndex which is nice. I hope that this example is of use to you.

AI RAG with LlamaIndex, Local Embedding, and Ollama Llama 3.1 8b

Written by Michael Ruminer

No responses yet