(Disclaimer: I might be wrong about data-training trade-offs. Also, there is always nuance but I’ve oversimplified a few things for understanding. Please bring any inaccuracies to my notice).
Background
To those who are unfamiliar, we sometimes classify businesses as B2B (selling to businesses), B2C (selling to consumers) and B2G (selling to government). There are also niches like B2D (selling to developers). Fundamentally, I’ve always believed that it is very hard for a startup to have B2B and B2C co-exist. The culture and the team strengths are very different. Microsoft vs. Apple vs. Google has always been a fascinating case study here:
Microsoft started as a B2B company in 1975, and remained firmly rooted as a B2B company culturally, though in 1990 they evolved to selling consumer products. Today, you can see Microsoft excelling at B2B due to their ingrained strength there. ~70% of Microsoft’s revenue comes from B2B (very quick estimate based on their financials). Microsoft has recently made some decent consumer products (like Surface Tablet), but I would be cynical about their commercial success since it doesn’t easily align with their DNA.
Apple, on the other hand, started as a B2C company. B2B sort of came in later, and their inherent nature even today is B2C, as is evidenced by the ~85% of Apple’s revenue that comes from B2C (again a quick, crude estimate from their financials). If tomorrow Apple gets a higher share from B2B, it might still be on the back of its consumer strength (like advertising or ecosystem taxation).
Google/Meta seem B2B, but I would call them business-sponsored B2C. I always felt that Google was a B2C company at heart, even though 90% of their revenue comes from B2B (estimated from their financials). What is important to understand is that the 90% that comes from B2B is almost entirely advertising to consumers. Hence, building great consumer products and attracting consumers at scale is at the heart of who Google is. Google Cloud today is smaller than Microsoft’s cloud business, and I expect it to stay that way given their respective strengths. Google was like the TV channels of the 2000s - the TV channels’ main skill was to have a pulse on what shows would entertain the masses, and advertising revenue (B2B) would follow.
So, what is OpenAI - B2B or B2C?
Firstly, Sam Altman (a co-founder and the CEO of OpenAI) is hard to classify. His previous startup (pre-YC) was a social networking startup, so Sam’s background has been in B2C, though through YC he has guided a large variety of companies including B2B, Fintech etc. OpenAI started as a non-profit in 2015 and focused on research till about 2019. They were neither B2B nor B2C - they were a lab. In 2018, OpenAI launched GPT-1, followed by GPT-2 and GPT-3 in 2019 and 2020. Both DALL-E (image generation) and Codex (code generation - later incorporated in GitHub Copilot) were released in 2021. Around this period (in 2019), they also received USD 1B in funding from Microsoft. At this time, they were a B2B company in a limited way. GPT-1, GPT-2 and GPT-3 were primarily APIs used by developers, enterprises and educational institutes. OpenAI was firmly a B2B company (with developers being a path to larger revenue from enterprises). But something changed in 2022 - OpenAI became a B2C company, with B2C today contributing 75% of its revenues (Source: Bloomberg). Reports (Source: CMSWire) point to OpenAI expecting ChatGPT to drive the lion’s share of its revenue till 2029. Why?
Source: https://sacra.com/c/openai/
The cost of going B2C
I have heard simple arguments like “They did it for the PR” as a rationale for this. But that didn’t make sense to me given that the call to go B2C is seriously expensive for OpenAI. How much does OpenAI spend/lose on ChatGPT every year?
Today OpenAI loses about USD 5B a year (Source: Forbes), while it makes about USD 3.7B (Source: New York Times), with ChatGPT contributing around USD 2.7B of this. So how much does it cost to run ChatGPT? According to the research firm SemiAnalysis (Source), it costs USD 700K per day (and 36 cents per query). At its peak in 2023 (and this is probably a poorer estimate), ChatGPT used to do 10M queries a day. At 36 cents per query, that works out to USD 3.6M per day, or about USD 1.3B a year. That’s a lot, but not a lot compared to how much they’ve raised and how much they’ve made from ChatGPT. But what do they get for it? Surely a tenth of that would’ve bought them a crazy amount of PR.
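To make the back-of-the-envelope math above easy to check, here is a minimal sketch in Python. Every input is one of the rough estimates quoted in this post (not a verified figure), and the variable names are my own. It also annualises the quoted USD 700K per day figure separately, since the per-day and per-query numbers give quite different totals.

```python
# All inputs are rough estimates quoted in the post, not verified figures.
COST_PER_QUERY_CENTS = 36               # per-query cost as quoted above
QUERIES_PER_DAY = 10_000_000            # 2023 peak estimate quoted above
QUOTED_DAILY_COST_USD = 700_000         # per-day cost as quoted above
CHATGPT_REVENUE_USD = 2_700_000_000     # ChatGPT revenue as quoted above


def annual_cost_usd(daily_cost_usd: float, days: int = 365) -> float:
    """Scale a daily cost to a yearly one."""
    return daily_cost_usd * days


# Estimate 1: per-query cost multiplied by daily queries.
daily_from_queries = QUERIES_PER_DAY * COST_PER_QUERY_CENTS / 100
annual_from_queries = annual_cost_usd(daily_from_queries)

# Estimate 2: the quoted per-day figure, taken at face value.
annual_from_quoted = annual_cost_usd(QUOTED_DAILY_COST_USD)

# How the larger estimate compares with ChatGPT's quoted revenue.
cost_share_of_revenue = annual_from_queries / CHATGPT_REVENUE_USD

print(f"Per-query estimate: USD {daily_from_queries:,.0f}/day, "
      f"USD {annual_from_queries / 1e9:.2f}B/year")
print(f"Quoted per-day figure: USD {annual_from_quoted / 1e9:.2f}B/year")
print(f"Per-query estimate as share of ChatGPT revenue: "
      f"{cost_share_of_revenue:.0%}")
```

Even on the larger of the two estimates, the running cost comes to roughly half of what ChatGPT is said to bring in.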
Is the data generated from ChatGPT valuable?
At the top level, estimates (Source) show that GPT-3 was trained on 570GB of data. Compare that with this (Source): each person on the internet generates 15TB+ of data every day. TikTok generates 7.5TB of data daily, while YouTube has 4.3PB of data! So, at the top level, it looks like OpenAI shouldn’t run out of data to train its larger and larger models. But...
In 2023, OpenAI put out a blog (Source) stating “We're particularly looking for data that expresses human intention”.
From here on, it’s pure conjecture on my part, but OpenAI doesn’t OWN any of the data sources pointed to above. Without getting into whether Google “owns” all YouTube content, they at least have access to it legally. According to reports, OpenAI has also used YouTube’s data, despite the potential grey area there.
OpenAI’s internal talks show that they have gone through “almost every available English-language book, essay, poem and news article on the internet.” and are considering synthetic data and/or buying large publishers outright.
All the above shows that “owning” your data sources will be a critical strategy for an LLM company like OpenAI. And that’s where ChatGPT comes in. The USD 1.3B is well worth it for a company that doesn’t own a social network like Meta or a content platform like YouTube/TikTok.
So, just like how Google was a B2C company for its products (where the consumers might be the product), OpenAI is a B2C company too (wherein the product may not be users but conversations). While brand awareness from the use of ChatGPT can’t hurt, the conversations themselves are an invaluable (and I suspect profitable) source of data and insight. For example, comparing the volume of queries like “how to book plane tickets” against “what is the ideal social media image for this message” helps OpenAI better prioritize which direction AI applications should go in.
ChatGPT generates a lot of data too. GPT-3 was trained on 570GB of data. According to this source (though other sources show that this might be a lower estimate and the actual number might be 6x of this), ChatGPT averaged 10M queries a day in 2023. At an estimate of 1KB per query, that is 10GB generated per day, and 3,600GB+ per year (compare that to the 570GB used for GPT-3) that OpenAI owns for its own training.
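The same estimate can be written out as a short Python sketch. The inputs are the rough numbers quoted above (10M queries a day, an assumed 1KB per query, 570GB for GPT-3), and I use decimal units (1KB = 1,000 bytes, 1GB = 1,000,000,000 bytes). The 6x case simply applies the higher query estimate mentioned above; it is not a measured figure.

```python
# Rough inputs quoted in the post; the per-query size is an assumption.
QUERIES_PER_DAY = 10_000_000
BYTES_PER_QUERY = 1_000            # assumed 1KB per query
GPT3_TRAINING_GB = 570             # training data size quoted for GPT-3
BYTES_PER_GB = 1_000_000_000       # decimal gigabyte


def yearly_conversation_gb(queries_per_day: int,
                           bytes_per_query: int = BYTES_PER_QUERY,
                           days: int = 365) -> float:
    """Conversation data generated in a year, in decimal GB."""
    return queries_per_day * bytes_per_query * days / BYTES_PER_GB


daily_gb = QUERIES_PER_DAY * BYTES_PER_QUERY / BYTES_PER_GB
yearly_gb = yearly_conversation_gb(QUERIES_PER_DAY)
yearly_gb_high = yearly_conversation_gb(QUERIES_PER_DAY * 6)  # 6x estimate

print(f"Daily: {daily_gb:,.0f}GB")
print(f"Yearly: {yearly_gb:,.0f}GB "
      f"({yearly_gb / GPT3_TRAINING_GB:.1f}x GPT-3's training data)")
print(f"Yearly at 6x queries: {yearly_gb_high:,.0f}GB "
      f"({yearly_gb_high / GPT3_TRAINING_GB:.1f}x GPT-3's training data)")
```

On these assumptions, a single year of conversations is several times the size of the data GPT-3 was trained on, and the gap widens quickly if the higher query estimate is closer to the truth.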
Look at the results too - their B2C focus becomes apparent. According to Similarweb, ChatGPT got 3.1 billion visits in September 2024! That makes them the 11th most visited site in the world, ahead of Bing (maybe not a great benchmark) and Amazon.com. Their competitors (like Gemini) have also been attracting significant traffic, though nothing compared to ChatGPT. Competitors like Perplexity (with the exception of Anthropic) have positioned themselves as B2C too (as a “Google search alternative”). So, it’s fairly clear that they’re going the Google route - wherein they build their offerings mainly for consumers but then use that to get B2B revenue (including advertising).
The same holds for their App too (source), though they launched the app much later.
Does OpenAI have everything it needs with ChatGPT? Probably not. They still need the below:
Voice conversations (which is why ChatGPT introduced a hear and speak mode). It’ll be interesting to see how much voice data they have, and what they’ll do further to get it.
Visual Data (which is why ChatGPT has image and video inputs).
Video Data - OpenAI doesn’t have this yet - and I’m sure solving for this is a part of their focus in the coming years. In the chart above, YouTube gets 28B monthly web visits, compared to 3.1B that ChatGPT gets.
If OpenAI could own some conversational real-world data, they would acquire it in a heartbeat. For example, if there was a source of all CCTV conversations across the world, with people talking in various languages, OpenAI would probably jump at it. If someone has built an active video-calling tool and has stored all those conversations (again not going into the legality of things here - this is hypothetical), OpenAI would come knocking on their doors. OpenAI is a B2C company because their product is very much an AI that is learning to mimic a human (and hence, a consumer). How do AI companies get access to conversational real-world data? OpenAI uses ChatGPT to get human-AI conversations going, Meta has conversations on its social platforms going for it (though WhatsApp would be the world’s largest source of said conversations for Meta), while Google has search + Gemini (human to AI). Of these, Meta seems best placed to get good conversational data for now, while both Meta and Google have access to high-quality video data. You can expect to see that in their LLMs in the future.
If OpenAI craves real-world conversational data so much, what does that mean for Indian startups like Sarvam and Krutrim? That’s for the next post..
Beautiful analysis. The fundamental insight is that conversational AI needs to mimic humans, and hence billions of actual conversations give it an unassailable advantage.