Local RAG with AnythingLLM and Llama 3.1 8B
I decided to do some retrieval augmented generation (RAG) with Llama 3.1 8B to see how it went. In short, it went pretty well in my limited testing. More interesting is probably how I set it up to perform the RAG.
First, I didn’t want to reinvent the wheel. Not yet. I was looking for a way to do RAG with Llama 3.1 without coding up a user interface for embedding multiple documents and creating a chatbot that would use those embeddings. The combination I landed on was AnythingLLM and LM Studio, backed by the Llama 3.1 8B Instruct q_8 model.
I ran LM Studio as a server with Llama 3.1 underneath, and ran AnythingLLM as a front end to the AnythingLLM embedder and the AnythingLLM-provided LanceDB vector store, with the combined setup pointing to the LM Studio server as the LLM provider.
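If you want to hit the LM Studio server directly (outside of AnythingLLM), it exposes an OpenAI-compatible endpoint on localhost. Here is a minimal sketch, assuming the default port of 1234 and the openai Python package; the model name is a placeholder for whatever identifier LM Studio shows for the loaded model:

```python
# Minimal sketch: talk to the LM Studio local server directly.
# Assumes LM Studio's default OpenAI-compatible endpoint (http://localhost:1234/v1);
# the api_key can be any non-empty string, and the model name below is a
# placeholder for the identifier LM Studio shows for the loaded model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # placeholder identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize retrieval augmented generation in two sentences."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```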
For AnythingLLM you can then create workspaces that allow documents to be embedded into the vector store for that workspace. I tried a doc I was especially curious to chat across, as it was both a large PDF and a topic of political and policy interest: the 900+ page Project 2025 “manifesto”. It took 3 minutes to create the embeddings. I am not sure how many dimensions it embeds, but AnythingLLM will show you how many vectors it creates; in this case 5,038, which is smaller than I would have expected for 900 pages. It does provide some basic settings around embedding: you can set the chunk size (up to 1,000 characters) and the overlap (in characters). I used the defaults, which were the maximum chunk size of 1,000 characters and an overlap of 20. At first I thought the overlap was a percentage, but it states it is in characters, not tokens or percentage. I am suspicious that a 20-character overlap is really enough for the best context and may play with that. It seems a really small number to me; I know Microsoft’s embedding guidance recommends starting with a 10% overlap.
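To make the chunk size and overlap numbers concrete, here is a rough sketch of character-based chunking with overlap. This is my own illustration of the idea, not AnythingLLM’s actual implementation; working backwards, 5,038 chunks at a 980-character step would imply roughly 4.9 million extracted characters for the 900+ page PDF.

```python
# Rough sketch of character-based chunking with overlap -- my own illustration
# of the idea, not AnythingLLM's actual implementation.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 20) -> list[str]:
    """Split text into chunks of `chunk_size` characters, each starting
    `chunk_size - overlap` characters after the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Back-of-the-envelope check: ~4.9 million extracted characters with the
# default 1000/20 settings lands in the same ballpark as the 5,038 vectors
# reported above.
chunks = chunk_text("x" * 4_900_000)
print(len(chunks))  # 5000
```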
You can also select your embedding model. So far I have only tried the AnythingLLM Embedder, but I will likely try other local embedding models and probably Azure OpenAI, as I have access to that via my Azure account.
After setting up my environment I took a workspace I had created, uploaded the Project 2025 document, and began to query. My first query was about IVF, and it reported back that there was no mention of IVF but did provide a very general description of the document as a whole. So I decided to drill in a little more deeply and asked about the rights of embryos and fetuses. It came back with what seemed to be a response I would have expected from the document. That’s when I realized one issue with my test: I hadn’t read the document, so I didn’t really know if it provided an accurate analysis. Based on the results it seemed likely to be accurate, but I couldn’t be certain. In the image below you can’t see my initial prompt, but it was “You are a society and government policy analyst. What does the 2025 Mandate For Leadership say about IVF?”
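For context on what a query like that actually does: a RAG front end typically embeds the question, pulls the closest chunks from the vector store, and stuffs them into the prompt ahead of the question. Here is a toy sketch of that flow; the “retriever” below is a deliberately naive word-overlap stand-in for the AnythingLLM embedder and LanceDB, and only the prompt-assembly shape is the point.

```python
# Toy sketch of the retrieve-then-generate flow a RAG front end performs.
# The retriever is a naive word-overlap stand-in for the real embedder and
# vector store; AnythingLLM's actual internals will differ.
def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_messages(question: str, chunks: list[str]) -> list[dict]:
    context = "\n\n".join(retrieve(question, chunks))
    return [
        {"role": "system",
         "content": "You are a society and government policy analyst. "
                    "Answer from the context below where possible.\n\n"
                    f"Context:\n{context}"},
        {"role": "user", "content": question},
    ]

# The assembled messages would then go to the LLM endpoint, e.g. the
# client.chat.completions.create(...) call from the earlier LM Studio snippet.
```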
This led me to a different test.
I created a new workspace and added a document I was familiar enough with. I could have added it to the existing workspace but wanted to keep a separation for testing purposes. The uploaded document was 43 pages. I didn’t notice how long it took to create the vectors, but it ended up with 177 as the vector count. Again, lower than I would have expected.
I asked it what should have been a pretty softball question about the 10 principles of SSI (self-sovereign identity). The embedded document specifically summarizes them. Keep in mind that it could use the Llama model and/or the document provided; I didn’t restrict or steer it to use just the document that had been embedded in the vector database. It came back again with what seemed to be a valid response. What I now need to do is compare the document to the inferences and see if there was hallucination, or how much it may have mixed the retrieved content with the underlying Llama model’s knowledge. It does provide a citation and listed the document, but I don’t know if that means all of the inference came from the document or just some of it. I would expect it to mean the latter, so anywhere from 1–100% of what it responded was from the document. I may never know the answer to that question. I do know that when I created yet another workspace with no document and provided the same prompt, it came back with some responses that I knew were not quite accurate. The inference with the RAG in place was definitely more accurate.
The “thinking” and the returned tokens/sec were slow, but this is running on a local machine with very limited horsepower. (If you wish to see the machine specs, check out my prior post “Started Llama 3.1 8B Locally”.) If I had an OpenAI API account I’d likely try the RAG against that LLM endpoint and see how it performed, but I currently don’t. Perhaps I can do that in Azure as well.
Keep in mind I had to upload the documents for RAG individually. I don’t see an option to point it to a folder and let it walk the folder recursively. AnythingLLM does say on the upload screen that it “supports text files, csv’s, spreadsheets, audio files, and more!” I’ll have to test it out on some of those items. You can also add website URLs and it has a data connector tab for some other inputs — see the below image.
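Since uploads are one document at a time in the UI, one workaround I may try is scripting it against AnythingLLM’s developer API. I haven’t verified the exact endpoint, so treat the path, port, auth header, and form field below as assumptions rather than the documented API.

```python
# Sketch: walk a folder recursively and upload each supported file to
# AnythingLLM's developer API. The endpoint path, port, auth header, and
# form field name are assumptions I have not verified against the API docs.
import pathlib
import requests

BASE_URL = "http://localhost:3001"      # assumed default AnythingLLM port
API_KEY = "YOUR-ANYTHINGLLM-API-KEY"    # generated in AnythingLLM settings
SUPPORTED = {".pdf", ".txt", ".md", ".csv", ".docx"}

for path in pathlib.Path("docs").rglob("*"):
    if path.suffix.lower() in SUPPORTED:
        with open(path, "rb") as fh:
            resp = requests.post(
                f"{BASE_URL}/api/v1/document/upload",   # assumed endpoint
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": (path.name, fh)},
            )
        print(path.name, resp.status_code)
```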
All in all, it behaved as I would have hoped for the environment. It is, after all, only an 8 billion parameter model. AnythingLLM gave me a lot of flexibility with very little effort and turned my LM Studio server into something more than just the underlying Llama 3.1.
Oh, and I didn’t mention that AnythingLLM also has some form of agent support. I’ll certainly be trying that out in the future.