CrewAI, Simple Enough but It Once Made 100 API Calls Instead of 1
I continued my experimentation with CrewAI this weekend. To see the code, take a look at the following repo and path: Starter CrewAI Series and the day_04 folder specifically. In the day_04 package, I created a simple custom tool, one that uses Tavily. The two agent, two task package queries for news information on CrewAI and then processes that info to create a report (markdown) on the latest news about CrewAI. I learned a few things outside that CrewAI just raised $18 million (US).
How it Works
It uses decorators on the classes and functions. I love decorators. I always have. I filed a patent once that used decorators in C# as a part of a solution for business rule traceability. But back to CrewAI. You’ll see code snippets like:
@CrewBase
class Day04Crew():
@agent
def researcher(self) -> Agent:
@task
def research_task(self) -> Task:
While to define the agent behavior, you use some YAML like the below:
researcher:
role: >
Senior Data Researcher on {topic}
goal: >
Find recent most relevant news on {topic} and limit your response to {limit} results
backstory: >
You're a seasoned researcher with a knack for uncovering the latest
developments on {topic}. Known for your ability to find the most relevant
information and present it in a clear and concise manner.
To define the task you use YAML like the below:
research_task:
description: >
Search news about {topic}
expected_output: >
A list of news articles about {topic} with the title, url, and content
agent: researcher
If I were to experiment deeper I’d try the research task description to be a more sophisticated prompt but this one returned decent results.
And that is largely it for a simple enough example. I was off to the races with the exception of some calls to actually run the agents.
What I Found
First off. I found that I need to do some discovery of the CrewAI capabilities for some increased logging or traceability. Even with agent verbosity turned on, it was too much of a black box for me. Maybe I didn’t look closely enough at the verbose output but it seemed a bit too superficial. I want to know what exactly was passed to the LLM, a timestamp, its response and that timestamp, which endpoints on the LLM, etc. I think some of that can be found using LangTrace or CrewAI AgentOps. I’ll almost certainly try that soon.
I also found that one time it got stuck in what I assume was some sort of loop. I can’t be certain exactly where, as I didn’t have any real logging or traceability (black box). But it was running far too long on just the first agent and task. I had to cancel out and when I did and looked at my usage of Tavily it had bumped up 100 API calls for that run versus the expected of only 1. That was very disconcerting. All other runs with the unmodified code performed only the expected 1 API call to Tavily.
The report output was what I was hoping for, but that has more to do with the LLM and Tavily results than with CrewAI.
I did notice that each task can have only one agent. That makes sense, I think. I would like to try where an agent has multiple tasks and has to choose the appropriate task for its job and also a scenario where an agent might call a task multiple times with slightly different input to get a more nuanced or expanded context for its actions. I don’t currently have an example use case for the latter. Give me some time or recommend one below. In these scenarios, traceability becomes even more important, and limits on task calling or tool usage are probably needed.
Final Thoughts
CrewAI covered the simple use case I wanted to try though it left me desiring more visibility into what it was doing. The implementation in my limited use case was easy. It was slow, but I don’t know where it was slow, because I didn’t have any instrumentation to see where it was spending its time. It might have been in the LLM and/or Tavily. All in all, I plan to experiment more with a hierarchical structure and with some attempts into observability and traceability. I wish I could say more and provide greater depth than what you can probably easily and quickly discern from the documentation but for now this is what I have. The next question is will I try the same in LangChain to compare or will I dig deeper into CrewAI first.