LLMs from the trenches: Data is how you create a competitive advantage, not models

June 14, 2024
Excerpt from "Let's Build a Startup S2E2 - Anatomy of a Unicorn: Hugging Face with Julien Simon": https://www.twitch.tv/videos/2170990579

Transcript

The kind of customers I've been speaking with often need a readiness assessment before they start exploring AI. I believe having a solid data strategy is the first step. Sometimes, I have conversations with C-suite executives looking for use cases where they can apply generative AI, or AI as a general tool. However, the work we tend to do is ensuring their data is operable by AI, which is a significant step. That's the long-term asset.

I keep saying models are expendable; don't fall in love with models. With LLMs, no one would still be using models from six or nine months ago. I remind everyone that Llama 2, arguably the first great open-source LLM, came out in July 2023. That's not even a year ago. The gap between Llama 2 and the best models today is vast. Llama 3 came out about two months ago, and even Llama 3 isn't the best anymore. The pace of innovation is insane. If you focus too much on models, you'll be stuck with ones that underperform compared to the state of the art. Models are just a tool. You use the right model now, and next month you'll use another one. You'll upgrade existing solutions over time.

But the data lives forever. If you're in healthcare and have datasets from 30 years ago, they are absolute gold because very few people have that. The same goes for chemical engineering data, financial data, or company data. The more you have, the greater your competitive advantage. In 30 years, these datasets will still be valuable. It's an investment that will pay dividends forever. Models, on the other hand, feel outdated in just six months.

We need more data engineers and data analysts. I'm sorry to say, but we probably need fewer data scientists. If you're looking for a career in AI, the best job right now, in my opinion, is turning company knowledge (whether it's in people's heads or in PDF files) into datasets that can be used for model evaluation and fine-tuning. This is the only way to create a competitive advantage for your organization, not with models.
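
As a rough illustration of turning company knowledge into a dataset for model evaluation, here is a minimal sketch. The records, the field names (`prompt`, `reference`, `source`), and the file name `support-policy.pdf` are all hypothetical, and JSON Lines is just one common choice of format, not something prescribed in the talk.

```python
import json

# Hypothetical question/answer pairs captured from experts or documents.
knowledge = [
    {"question": "How long does a SIM replacement take?",
     "answer": "Two business days.",
     "source": "support-policy.pdf"},
    {"question": "Can a lost phone be blocked remotely?",
     "answer": "Yes, support can block the device on request.",
     "source": "support-policy.pdf"},
    {"question": "What is the roaming fee?", "answer": ""},  # incomplete record
]


def to_jsonl(records):
    """Serialize complete records as JSON Lines (one JSON object per line)."""
    lines = []
    for record in records:
        question = record.get("question", "").strip()
        answer = record.get("answer", "").strip()
        if not question or not answer:
            continue  # skip records that cannot be used for evaluation
        lines.append(json.dumps(
            {"prompt": question,
             "reference": answer,
             "source": record.get("source", "unknown")},
            ensure_ascii=False,
        ))
    return "\n".join(lines)


dataset = to_jsonl(knowledge)
print(dataset)
```
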
If you invest one hour of your time or $1 in improving your datasets, the return on investment will be much higher than if you invest the same time or money in chasing the latest and greatest model. Just don't overthink model selection. People try too hard. I could say: just take Llama 3 8 billion and be done with it. That's as much model evaluation as you should do. Then work very hard at plugging great-quality, hopefully unique data into it.

For LLM applications, the content needed to answer user questions comes from your organization, not from the knowledge the model was trained on. Say you're building a banking chatbot or a telco chatbot, and a user asks, "I lost my phone. My phone number is blah, blah, blah. What should I do? Please send me a replacement phone as quickly as possible." Any LLM would understand what a cell phone, a subscription, a SIM card, and an address are. However, to answer the question, you need to know where the user lives, where to send the phone and SIM, what kind of plan they have, and the specific procedure. For many applications, the model is just a writing assistant, there to understand the question and answer it with the right tone of voice and facts from an external source of knowledge. This is where you need great data. The ability to turn company knowledge, policies, data, and customer data into datasets or data stores that can be used for AI applications is the critical skill.

Take any great-quality model out there: go look at the Hugging Face LLM leaderboard, at 7 billion parameters and below. Any one of them can do this equally well. People will say, "Oh, no, but this is a tiny bit better than that." Seriously, stop thinking too hard about that. Work on the data, user experience, UI, and cost-performance optimization. That's where you make a difference, not because one model is 0.1% or whatever higher than another on a particular benchmark. That's just a waste of time.
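
The lost-phone example can be sketched roughly as follows: the facts come from company data, and the model only has to write the answer. This is a minimal sketch under stated assumptions. `CUSTOMERS`, `PROCEDURES`, `Customer`, and `build_prompt` are hypothetical stand-ins for a real customer database and policy store, and the resulting prompt would be sent to whichever model you picked.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    address: str
    plan: str


# Hypothetical in-memory stand-ins for a customer database and a policy store.
CUSTOMERS = {
    "555-0100": Customer("Alex Doe", "1 Example Street, Springfield", "Unlimited 5G"),
}
PROCEDURES = {
    "lost_phone": "Suspend the SIM, then ship a replacement phone and SIM "
                  "to the address on file.",
}


def build_prompt(question, phone_number, topic):
    """Combine the user question with facts looked up from company data."""
    customer = CUSTOMERS.get(phone_number)
    procedure = PROCEDURES.get(topic)
    if customer is None or procedure is None:
        # Without the facts, the model could only guess, so fail loudly.
        raise LookupError("missing customer record or procedure")
    return (
        "Answer the customer politely, using only the facts below.\n"
        f"Customer: {customer.name}\n"
        f"Shipping address: {customer.address}\n"
        f"Plan: {customer.plan}\n"
        f"Procedure: {procedure}\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "I lost my phone. Please send me a replacement as quickly as possible.",
    "555-0100",
    "lost_phone",
)
print(prompt)
```

Everything the answer depends on (address, plan, procedure) sits in the lookup tables rather than in the model, which is why swapping the model later leaves this part of the application unchanged.
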

Tags

Data Strategy, AI Readiness Assessment, Data Engineering, Model Obsolescence, Competitive Advantage Through Data

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.