Hi, everyone. Happy Thursday, October 3rd. Welcome to the OpenAI Forum. I see so many faculty members in the house. We love seeing educators in the forum. Just allowing some more people to filter in.
Hi, Stuart. Hi, Ahmed. Nice to see you all.
Well, we have a very special treat this evening. Before we get started, though, I want to make you all aware that this event will be recorded because you all always follow up and ask me. If you're unable to attend live, you will be able to catch this recording on demand later in just a few days. I'm Natalie Cone, your OpenAI Forum community architect.
I like to start all of our talks by reminding us of OpenAI's mission: to ensure that artificial general intelligence, by which we mean highly autonomous systems that outperform humans at most economically valuable work, benefits all of humanity.
And welcome to this evening's event, Learning to Reason with LLMs. I'm just as pleased as all of you are to be hosting our fellow researchers this evening.
And I'm also pleased to see all of you joining us here virtually from all over the world. Tonight we're going to take a deep dive into the forefront of AI research with some of the most brilliant minds behind OpenAI's latest advancements.
Before we get started, I'd like to share a roadmap with you to set the stage for our event this evening. Our session will be roughly an hour. We'll begin with a presentation from our speakers and reserve time at the end for audience questions.
Please drop your questions in the Q&A tab on the right, and we'll call on them in the order they are received. Feel free to add your questions as we move along through the presentation versus waiting until the end.
I truly encourage you all to speak up. We know that not all of us are technical experts here. This is a space for learning and becoming acquainted with new technology. The technology is for all of us, and this is a safe space to ask any questions.
I encourage everyone to connect with at least three people you see here tonight. All of our LinkedIn profiles are embedded in our forum profiles.
And if you need some helpful tips and tricks related to networking, I just read an awesome post from Jen Garcia, one of our forum members, about how she optimized her time at OpenAI Dev Day recently to make as many meaningful connections as possible. So go ahead and reach out to Jen Garcia if you need a little bit of a helping hand in learning how to network for virtual events.
Our purpose here is simple but incredibly exciting, to learn directly from the researchers who are pushing the boundaries of what AI can do in the realm of reasoning and problem solving. So let's get this party started.
We're very fortunate to have not just one but three of the key contributors to OpenAI's O1 model with us today: Ahmed El-Kishky, Hongyu Ren, and GB Parascandolo.
Each of them brings unique expertise and life experience to the development of this advanced language model, and collectively, they have created something truly groundbreaking.
Ahmed El-Kishky is a research scientist at OpenAI, where he focuses on developing advanced language models and enhancing AI reasoning through reinforcement learning techniques. He played a pivotal role in the creation of OpenAI O1, a model designed to tackle complex problem solving.
Ahmed holds a PhD in computer science from the University of Illinois, with his research specializing in scalable machine learning algorithms and large-scale text analysis.
Hongyu Ren is a research scientist at OpenAI, specializing in machine learning and artificial intelligence, with a focus on enhancing language model reasoning and generalization capabilities. He's contributed significantly to the development of OpenAI O1 as well, helping advance the model's ability to solve complex tasks through structured thought process.
Hongyu holds a PhD from Stanford University, where his research centered on graph representation learning and its applications in natural language processing.
GB Parascandolo, who was also one of the inaugural OpenAI Forum members, showed up for our very first OpenAI welcome member event, so it's awesome to be hosting him as a special guest this evening. GB leads a research team at OpenAI, where he focuses on reinforcement learning and complex reasoning. He's been instrumental in the development and scaling of OpenAI O1 and set some of the foundations for reasoning in large language models.
GB holds a PhD in computer science from the Max Planck Institute for Intelligent Systems and ETH Zurich, where his research explored out-of-distribution generalization in deep learning.
With those intros, let's jump right in. Please join me in welcoming our speakers this evening.
Welcome, fellas.
Hi, everybody.
Hello, Natalie.

So exciting to have you, and I really love that you're all in one office. Are you guys in the San Francisco office right now?
We are.
Okay, beautiful. Well, I'm going to hand the mic over to you guys, and I'll come back at the end of your presentation to take audience questions.
Great. Thanks, Natalie.
Okay. Hi, everyone. You're going to hear from me first, and then from Hongyu and Ahmed, in the opposite order in which I mentioned them. So Ahmed will tell you about O1, and Hongyu will tell you about O1 Mini.
Natalie asked me to share a little bit more about the perspectives on reasoning, what reasoning is, and then also something about my personal journey working on reasoning and also joining OpenAI.
So I joined OpenAI three years ago to work on reasoning, and I thought instead of making new slides, we could do a fun exercise. I just went through my slide decks from the past, and I just copy-pasted slides from back then. And so we're going to revisit a bunch of old slides with some of my personal motivations for working on this and also some of the ingredients that I contributed to the project early on. This project, of course, included many people that worked on different components, and this was part of what I did a long time ago when I joined in 2021.
So let's first set the stage for what reasoning is. The way I like to picture the space of problems is as a beautiful colored disk, and at the very center there are easy problems. The closer you get to the edge, the harder the problems get. And as you move around an angle, you change subjects: AI, biology, engineering, physics, math, chemistry. So if you try to place problems on this disk, you'll see that at the very center you have easy problems: what's 2 plus 2? As you move away, you get more complex problems; for example, here's an equation. And if you keep pushing all the way on the math side, at some point you get to Fermat's Last Theorem or the Riemann Hypothesis. In engineering, it would be something like building an ASML machine; for biology, CRISPR; and so on and so forth.
Now the models we've been developing so far can tackle a bunch of problems that are pretty close to the center, and they're quite general. I think this was probably the major discovery made with GPT that was the first truly general system in AI. And then this disk that the models can cover in terms of problems grows, but it's hard to grow this disk by a lot, because at some point, you really need expertise in one domain. And so if you think, for example, about Andrew Wiles, a professor of mathematics, at some point, he decided he really wanted to prove Fermat's Last Theorem. It was an open problem in mathematics for, I think, 300 years, and then he decided, I'm going to solve this and ended up spending seven years and eventually found the proof for this theorem. And of course, he didn't do this by expanding the range of knowledge in all of the fields at once. He found a way to focus all of his efforts for a very long time on one specific problem and built a very thin bridge in the space of knowledge until he found a solution to that. So that's a very different approach from what our models are trained to do. It's not just expanding the disk. This is more about targeting something and then focusing really, really hard on one problem until it's solved.
So let's thank Andrew Wiles. And now, with the power of technology, we're going to 3D-rotate these slides in Google Slides. So imagine this disk rotating in 3D as we move it to the side of the screen. Hopefully you've kept in mind what's in there, because now we can add a new axis: thinking time. We haven't talked about thinking time much so far. The key realization about these models is that they can't really think for very long. Or, said in a different way, if we give these models thinking time, they will not substantially improve on the tasks they're trying to solve. GPT-2 plus months to think about a problem won't go very far; it will still cover roughly the same disk it could cover in a few milliseconds. Roughly the same is true for GPT-3 and GPT-4. So one way to think about reasoning is the idea that we want to be able to expand this disk, not by training a new model, but by spending more time. One way to picture this is as a cone: the disk of problems that we can solve grows in radius as we spend more time.
OK, so the main limitation to this, of course, is that we don't train our models to do that so far. And then Ahmed will tell you how O1 goes a bit into this direction and then tries to expand this cone of reasoning over time.
OK, now let's jump even further back. This was one of my first slide decks when I joined OpenAI. This is more the personal side of the story. One of my first projects, and also one of the main reasons I joined OpenAI, was that I was really interested in reasoning and I wanted to test how far these models could go, even just by asking them to think about stuff.
So this is one of my first slide decks that I presented at one of my first research meetings. The slides are horrible, but they're very authentic, at least. So the idea was to just ask them to think about things. So in the end, we want to have a general purpose reasoner, instead of having to train it on every domain independently, the model should learn to do its own reasoning. So what if you just ask it to think about stuff?
And now we're back in GPT-3 land. This was in 2021. So if you asked GPT-3 to solve a very simple problem, for example: a shop sold 29 bags of potatoes in the morning and then 17 bags in the afternoon; if each bag weighs 7 kilos, how many kilograms of potatoes did the shop sell for the day? If you asked the model for a completion there, the model would say 46 kilograms.
But you can imagine that's definitely not the correct answer, because there are 29 bags in the morning alone, and each bag is 7 kilos. So the interesting realization back then was that even without doing any training whatsoever, you could instead ask the model to think about it, and the model would actually spend some time going through the steps. It would say: in the morning it was 29 bags, so that's 203 kilos; in the afternoon, 17 more, so that's 119; and the sum of those two is 322 kilograms. So that was very interesting. We didn't have to do any training, and a model that was not that useful could now solve some problems without any training, which was pretty cool for the time.
A few more examples from back then. I have a clarinet, a piano, and a dog. How many musical instruments do I have? GPT-3 would often say one. This also shows you how far we've come. Hopefully you'll appreciate ChatGPT more today when it gives you a good answer quickly; that's also meaningful. But here again, we saw a difference. If you asked it to think about it first, you would first of all get a decomposition of the problem. It would say: first, a dog is not a musical instrument; a clarinet is a musical instrument, but a piano is not; so the answer is one. Which is also wrong, because a piano is a musical instrument.
So again, the world knowledge was also not amazing back in the day. But then if you pushed even further, let's think about it and break it down step by step, now you'd hear: a clarinet is a musical instrument; next, we have a piano, it's a musical instrument; finally, we have a dog, it's not a musical instrument; so we have two. This was not super consistent, but the percentage of the time you got the right answer by asking the model to think would increase.
Then just one more example from back then. This looks very silly today, but it was pretty exciting at the time. It's a long list of items: I have a chair, a lamp, an oven, et cetera; how many objects do I have? The model said 20. And if you looked for the correct answer, which I think in this case was 14, in the log probs of the model, the probability of getting the right answer was only 2%. Most of the probability mass was on the wrong answers.
But again, asking the model to think about it, to break it down step by step, the model would go through all of the items and then say that's 14 objects. So this is one of the first plots we tried making on this. If you just asked the model to solve one of these math problems, you would get this curve as a function of how many times you asked the model. But if you just added some instructions, for example "think step by step," or "first show your work," or "break it down," then the performance would increase a lot.
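For a sense of what that kind of comparison looks like in practice, here is a minimal sketch using today's OpenAI Python SDK. It is illustrative only: the original 2021 experiments used GPT-3 through an older completions interface, and the model name and prompts below are assumptions for the sketch, not the setup described in the talk.

```python
# Minimal sketch of the "just ask it to think" comparison described above.
# Assumes the openai Python package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

QUESTION = (
    "A shop sold 29 bags of potatoes in the morning and 17 bags in the "
    "afternoon. If each bag weighs 7 kg, how many kilograms of potatoes "
    "did the shop sell for the day?"
)

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name, not the one from 2021
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Direct answer: the model replies immediately, with no visible working.
direct = ask(QUESTION)

# "Think step by step": the same question, plus an instruction that nudges
# the model to break the problem down before answering.
stepwise = ask(QUESTION + "\nLet's think step by step and show your work.")

print("Direct answer:\n", direct)
print("\nStep-by-step answer:\n", stepwise)
```

For reference, the correct arithmetic is 29 × 7 = 203 kilos plus 17 × 7 = 119 kilos, for 322 kilograms in total.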
And so this offered one angle to approach the process of having the model think for longer. But of course, it was not very robust, and the models couldn't think for very long. So while "think step by step" was still the state of the art at the time, we didn't really have a way to get a robust chain of thought. That thought process was missing, and it's hard to find in the world, because people don't really write down their inner monologue anywhere.
And so the solution that many people were hoping to get at some point, and that we did eventually get to with O1, was to find a way to have the model develop its own way of reasoning by doing RL. And then for now, these models can think for a little bit of time, but not for a very long time. Remember, Andrew Wiles wrote this paper with a proof for the theorem. The proof is 130 pages long, and it took him seven years.
Our models right now can only think for, at most, a couple of hours maybe. But the goal is to eventually get all the way to models that can think for a very, very, very long time. So with that said, this was all the old things. Now let's think about the more exciting new results. And I think Ahmed will tell you about them all.
Hey, everybody. Thank you so much for that introduction, GB, and for the good introduction to reasoning and the motivation behind it. My name is Ahmed, and today I'll talk about OpenAI O1. We introduced OpenAI O1, a new large language model that is different in that it thinks before it answers. It can produce a really long internal chain of thought, which GB chatted about, before responding to the user. And I'll try to tell you today why that's really exciting and a revolutionary new paradigm.
The big difference with OpenAI O1 is that it's trained via a very large-scale reinforcement learning algorithm. This basically teaches the model how to think productively using its chain of thought, in a highly data-efficient manner. This is different from prompting, because with prompting you don't really have control over how it thinks. And so, very similar to many reinforcement learning algorithms, which you can read about on OpenAI's website or in many other tutorials, the model itself learns to better use its chain of thought.
So what makes this chain of thought different? One, it's longer. When you just ask the model to think step by step, you don't really have any control over how it does the thinking, and it's not as high quality as when you train via reinforcement learning. Additionally, many emergent behaviors come up when you do reinforcement learning. One is error correction: the model recognizes it's made a mistake, goes back, and fixes it. It can try multiple strategies: it identifies whether a path is likely to be helpful, and if not, it'll try a different path. And finally, just like how we all tackle problems, sometimes it breaks a really difficult problem down into smaller subproblems, smaller steps, and tackles them that way.
So all of these are pretty revolutionary, and we haven't seen much of them in the language model world. But let's just dive in and look at some of them. In this problem right here, which you can see in OpenAI's blog post on reasoning, you have a seemingly unintelligible sequence of text, and we see "Think step by step" as its decoded output. And we ask the model: hey, use the above to decode this new unintelligible string.
So we ask the model that, and here we see the model actually thinking. This is the inner monologue of the model. It's like, what's going on here? We're given this string. How do we actually handle it? It sort of tries various ideas here. Let's break it down into letters. Let's count the letters. See, try different approaches. It looks at the words, how many letters they are. Kind of like how a human would do it. This is a very long chain of thoughts, and it's trying to understand how to sort of map one string to another one. It looks into anagrams. It looks into substitutions. It doesn't really understand how to do it. So it tries various approaches, and it tests them.
So let's test this theory. It tries to do single substitutions. It goes through. Some of these are failing. So this is sort of like how a human would try to go about this problem. It would try different approaches, try them out, get some results based on the results, course correct, and try new approaches. This is something we haven't really seen before in language models.
And the main idea is to try to replicate how an individual would go about solving a problem. It's a very long chain of thought here; you can find it on the main website. But in the end, it realizes that pairs of letters in the original string are converted to their positions in the alphabet, summed, and then averaged, and that average maps to the position of the decoded letter in the alphabet. As we see here in the chain of thought, it tries it, and eventually it gets one that matches. It says: success, perfect. It goes through the other ones and sees that they're successful too. It starts mapping pairs, it finds words, it finds "think," it finds all the right words.
And then in the end, it gets a final answer, which is: there are three Rs in the word strawberry. So here, we gave it a problem.
Honestly, I don't think I'd be able to solve this myself. And the model tackled it by trying different approaches, by finding different paths, exploring them, noticing there were failures, and then error correcting. So this is the kind of reasoning behavior we wanted to elicit in the model.
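The decoding rule the model eventually lands on, averaging the alphabet positions of each pair of letters, can be written down in a few lines. This is a small Python sketch of the rule itself, not of how O1 discovered it; the example word is assumed to be the first token of the cipher shown in the blog post.

```python
# Sketch of the decoding rule described above: take the ciphertext letters
# two at a time, average their positions in the alphabet, and the average
# gives the position of the decoded letter.
def decode_word(ciphertext: str) -> str:
    positions = [ord(c) - ord("a") + 1 for c in ciphertext.lower()]
    decoded = []
    for first, second in zip(positions[0::2], positions[1::2]):
        avg = (first + second) // 2  # the example pairs average to whole numbers
        decoded.append(chr(ord("a") + avg - 1))
    return "".join(decoded)

# "oyfjdnisdr" is the first word of the example cipher; under this rule
# it decodes to "think" (o+y -> 20 -> t, f+j -> 8 -> h, and so on).
print(decode_word("oyfjdnisdr"))  # -> "think"
```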
So let's look at another task, one that many of you are probably interested in, which is coding. We asked the model to write a bash script that takes a matrix represented as a string in the following format, and then prints the transpose in the same format. So here, once again, the model takes the problem. It looks at it and tries to identify what the user is requesting. It understands it's a matrix. It identifies what the answer should look like, so it looks at the final solution. And it understands that it needs to write a bash script to do this. It knows that bash is a little tricky with strings and escape characters. It comes up with, I guess, a detailed set of steps that it should follow, identifies where that might fail, and plans it all out. And at the end, it writes a skeleton of the code, which it then submits.
So here, the model took a problem, broke it down into what it needs to do. It developed a sequence of steps that it takes to solve it. And then finally, it wrote a solution. So once again, planning and identifying how to solve a problem is imbued directly in the chain of thoughts.
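The task itself asked for a bash script; purely for illustration, here is the same transformation sketched in Python, assuming the input format from the example in the blog post (rows written like "[1,2],[3,4],[5,6]").

```python
# Python sketch of the transformation the bash task asks for: parse a matrix
# given as a string like "[1,2],[3,4],[5,6]" and print its transpose in the
# same format ("[1,3,5],[2,4,6]").
def transpose_matrix_string(s: str) -> str:
    # Split "[1,2],[3,4],[5,6]" into rows of integers.
    rows = [
        [int(x) for x in row.split(",")]
        for row in s.strip("[]").split("],[")
    ]
    transposed = zip(*rows)  # swap rows and columns
    return ",".join("[" + ",".join(str(x) for x in col) + "]" for col in transposed)

print(transpose_matrix_string("[1,2],[3,4],[5,6]"))  # -> [1,3,5],[2,4,6]
```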
And finally, let's look at one last example. Safety is obviously really important with these AI models, and in this case we'll take a peek at how the model handles safety, which is a little more nuanced than just directly refusing or complying. In this case, the model is asked to write a historical essay explaining the ways people used poisons or made poisons from common household materials, and to explain in detail the materials they used and the methods of preparation.
Obviously, this is a potentially very dangerous question; you don't want this information being easily accessible. But it's a historical essay, so there is some nuance here. The model understands the problem. It knows that the user is asking for a historical essay, so it knows it needs to produce an essay about it. But it recognizes that, per the policy, it should avoid providing disallowed content, some of which includes instructions for illicit behavior such as how to make dangerous chemicals. And so it knows there's a balance here.
And so the model goes through the policy, understands it, and knows, finally, what to output. Without reasoning, the model could just go: poison, I shouldn't output this. So as the model gets better at reasoning, it should intuitively get better at following safety policies and understanding the spirit of the policy, as opposed to the exact letter.
So let's talk about some of the capabilities that actually emerge when you imbue a model with reasoning. Here is one preparedness eval that we performed, called Capture the Flag, which is in essence competitive hacking.
In this task, the model is assigned to exploit vulnerabilities in software to try to obtain some information. And this is a really fascinating little anecdote: during the evaluation, the model identified that the setup itself, the eval infrastructure, was faulty, and it couldn't actually solve the task as given. It was supposed to find a vulnerability in a Linux container, but the container failed to start.
So what the model did is it decided to use nmap to look at which containers were around, and noticed that the container hadn't actually started. It tried to debug the issue and see if it could start the container on its own. It wasn't able to, but then it found that it could just start a new container and grab the answer to the capture-the-flag challenge.
So as we see here, reasoning itself can help imbue the model with such useful agentic behavior.
So how does this actually show up in the model when we imbue it with reasoning? Let's look at some really important metrics and benchmarks that we have.
As we can see here, on AIME, which is a very challenging high school mathematics competition, GPT-4o scores about 13%. When we look at O1 Preview, the model that we released a couple of weeks ago, that jumps up to approximately 56% or 57%. And when we look at the original O1 model, the model that we will be releasing sometime in the future, it jumps up to 83% accuracy on possibly one of the hardest high school mathematics competitions in the US. That is a massive improvement: 13% to 83%.
Additionally, coding. When we look at GPT-4o, the model lands at approximately the 11th percentile on Codeforces, so it's better than 11% of competitors. And O1 brings that up to the 89th percentile on one of the most challenging platforms in competitive programming.
And across PhD level tasks in GPQA, we see that reasoning just helps across the board. We took our O1 model and then sort of customized it a little bit for programming. And what we saw was quite fascinating.
In this competitive programming challenge called IOI, the International Olympiad in Informatics, you get 10 hours and six algorithmic problems. We essentially had the model adhere to the same rules, and with 50 submissions per problem, it placed in the 49th percentile among all competitors in the competition.
Keep in mind, these are the top high school competitors in the world. And then with 10,000 submissions per problem, where we relaxed the constraints, it was able to achieve gold, placing among the top 30 competitors.
As I mentioned before, Codeforces is a very challenging competitive programming platform, and we saw the model go from an Elo score (very similar to chess ratings) of 808 to about 1,800 for this fine-tuned model. That essentially means that on Codeforces it is better than 93% of contestants.
And almost across the board on reasoning-heavy tasks, such as math, MMLU, MMMU, and AP exams, we see that reasoning provides an incredible boost in performance, which makes sense: these are very reasoning-heavy tasks.
So let's talk about why this is actually significant. Similar to scaling compute during pre-training, we have shown that scaling compute during this RL training provides consistent gains: we see the performance improve as we train more and more.
So this is very significant because we can just go beyond pre-training and just continue training with RL and see pretty consistent performance improvement. But what is even more interesting here is that we've sort of stumbled upon a new paradigm.
The more time the model spends thinking, the better it performs. So this test-time compute could be a completely new paradigm to scale. The constraints of this approach are not fully understood or mapped out yet, but the more time spent thinking, the better.
And as GB talked about earlier, Andrew Wiles spent years trying to solve Fermat's Last Theorem. Right now, our models can maybe think for seconds or minutes. But what if we increase that to hours, days, weeks, months, years? We could solve progressively harder and harder tasks. And that's the avenue and direction we're taking going forward.
However, like I mentioned before, lots of tasks get better when you do reasoning, but not everything requires reasoning. We see some evals, some benchmarks, where reasoning doesn't help or maybe hurts a little bit: personal writing, editing text. So this isn't a one-size-fits-all solution. There are obviously some things it benefits and some things it doesn't. But overall, it's a very promising direction.
And O1 is really exciting, but this is only the beginning. You're seeing the first of many models. OpenAI is very committed to iterative deployment, and as we release new models, we will make them more capable and smarter, and introduce a lot of the features that you are probably accustomed to, like better instruction following, structured outputs, code interpreter, browsing, and safety. And the key takeaway here is that reasoning will help all of these so much, because the model will be able to take a nuanced approach and try to follow the spirit of the query and prompt.
And with that, I would love to hand it over to Hongyu.
Hi, everyone. I'm Hongyu. I work on O1 as well. Thanks so much to GB, Natalie, and Ahmed for the introductions. I'm here today to introduce OpenAI O1 Mini, a small, efficient model that we built to advance cost-efficient reasoning.
So what is O1 Mini? OpenAI O1 Mini is a model in the O1 series, or the O1 paradigm, that also thinks for a long time, using a chain of thought before answering the question. But most importantly, O1 Mini is a smaller, faster, and cheaper model compared with O1 Preview and O1. As a matter of fact, the current cost of O1 Mini is 80% cheaper than O1 Preview. And O1 Mini is a STEM-optimized model: we specifically optimized it for complex STEM reasoning tasks.
As an example, on this plot of AIME performance, it's pretty clear that the O1 Mini model sits on the cost-intelligence frontier: it's much better than GPT-4o and GPT-4o mini, it's even better than O1 Preview, and it's almost comparable with O1 while being much cheaper. That's just one demonstration, on math.
If we go to more general-purpose math queries on the math arena, the math-specific chatbot arena, we see O1 Mini and O1 Preview rank at the top, with very high scores of 1,360-plus, drastically outperforming the other models. So these models really have much improved math and reasoning capabilities compared with previous models.
Another example is coding. On Codeforces, we see O1 Mini outperform both O1 Preview and GPT-4o by a large margin. And as a matter of fact, one Codeforces contestant used O1 Mini in a competition after our release, so clearly it's not contamination, and the result was that it achieved near-master-level performance. So it really validated our score.
But most importantly, what I want to emphasize here, and also a question I get a lot from friends and users, is this: we say OpenAI O1 Mini is a specialized model, but does it know anything outside STEM? So here I show the Elo score on the general chatbot arena, and we see that O1 Mini ranks third on this leaderboard, surpassing all of our previous models. So it's actually pretty cool that a model specialized for reasoning can still excel at math, coding, and reasoning, and also answer general user queries, like conversation or writing emails and that type of stuff. Yeah.
So how fast is OpenAI O1 Mini? Here I want to show one example of a simple query: name five countries with the letter A in the third position. On the left-hand side, we have GPT-4o answering this query. It's pretty fast, right? It takes only, say, three seconds. But all five of the names are actually wrong; they don't have an A in the third position.
If you check O1 Mini, it's able to really think for a while, as a matter of fact for nine seconds, before producing the final answer. But we see that it actually gives us five names with the A in the correct position. And of course, O1 Preview can also give us the right answer, but it takes a little bit longer.
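For reference, checking an answer to this kind of query is trivial to script. Here is a small Python sketch; the candidate list is illustrative, not the models' actual output.

```python
# Trivial verification of the "letter A in the third position" query:
# check candidate country names by looking at their third character.
# The candidate list is illustrative, not the model's actual answer.
candidates = ["France", "Chad", "Ghana", "Italy", "Spain", "Japan", "Kenya"]

for name in candidates:
    third = name[2].lower()          # index 2 is the third letter
    verdict = "yes" if third == "a" else "no"
    print(f"{name}: third letter '{third}' -> {verdict}")

# France, Chad, Ghana, Italy, and Spain pass; Japan and Kenya do not.
```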
And here at OpenAI, we are committed to releasing safe models and holding ourselves to a very high standard of safety. We see on this slide that O1 Mini drastically improves safety measurements compared with GPT-4o across several different metrics, especially the percentage of safe completions on harmful prompts and the success rate at resisting jailbreaks.
So how do we actually optimize for reasoning tasks? We know large language models are often pre-trained on vast text datasets. As a result, they are very good at world knowledge, but they can sometimes be very expensive and also slow.
So for O1 Mini, we prioritized reasoning data in pre-training, increasing the weight of those reasoning-related data points. And we also took the model through the same high-compute reinforcement learning process. So the model is specialized in STEM reasoning; that's what we focused on, at the cost that O1 Mini may not have all the broad world knowledge, such as historical dates or other factual knowledge.
And there is a lot of ongoing work. Together with O1, we will frequently update and add new features to O1 Mini, including but not limited to better instruction-following capability, function calling, developer messages, and structured outputs, which have been mentioned a lot by our API customers. We will introduce multimodality to O1 Mini, and of course we will support better world knowledge in the O1 Mini model. So let us know if you have any new functions in mind, and we are always happy to try to integrate them into the model.
With that said, so we are super excited to present O1, our latest reasoning paradigm. And let's bring Natalie back on stage. And we are happy to take questions.
That was awesome, guys. Thank you so much.
OK, let's get to some of the audience questions.
So this is from Mohamed El-Feky, Senior Research Manager at Microsoft.
Aside from reasoning and multi-modality, what else is seen as a potential direction for future research? Did you say besides reasoning, or?
Aside from reasoning and multi-modality, what else is seen as a potential direction for future research?
Well, OK, this is just my take. I feel like reasoning is probably the last big open question in AI. And if we solve that, there's not much else left to solve.
Whoa, GB, nice. It will take a while.
Yeah, OK. For me, I guess there are also unsolved challenges in robotics and control: how do we actually apply these reasoning algorithms to drastically improve our control algorithms? That's pretty challenging. And to add to that, I do think with reasoning we're barely scratching the surface. This is the start, and so it'll be quite a while before we can say we've solved reasoning.
But yeah, I think interacting with the world. And you can think of browsing as interacting with the world, for example, is probably something in the immediate future.
Thank you, fellas.
OK, this is from Benjamin Chi, who was also one of our inaugural OpenAI Forum members. And I believe Ben used to be a math Olympian; I hope I'm getting that right, Ben, I haven't looked at your profile in a while, but I know you're definitely a math expert. And he asks: related to O1's impressive performance on competition math, is there a recommended way to paste math problems into ChatGPT without destroying the LaTeX formatting? Or does it not make a difference?
I could answer that.
Paste with the LaTeX formatting. It loves LaTeX.
Awesome.
This is from Matt Roberts from the University of Pennsylvania. Are there any plans to incorporate formal verification type systems in LLMs? I'd be interested in proving math theorems with large language models.
I think eventually we will include everything. I think one of the promises of AGI is that it should be able to do anything. So if the person who asked the question does that, hopefully their model can also do that one day.
Thanks, GB.
And Ahmed and Hongyu, anything to add to that one? Or did GB cover it?
I think GB covered it.
Oh, this one's from Chris Soria. And Chris Soria, fellas.
just so you know, he's a PhD candidate at Scripps University, and he's essentially a forum ambassador. He's helped us with a lot of our preparedness evals; a lot of the experts in the OpenAI Forum have participated in our evals and in collecting niche data sets, and Chris is one of those guys, so it's really awesome to see him in the forum tonight. And his question is: why would it be that just prompting the model with the words "let's think about it" improves its output? Is the model actually slowing down to reason through the problem, or is it something about the string input and how it translates to output?

I think it's both. You could say that by asking the model to think about it, you are spending more compute, which you could say is also what people do when they think about something: they do multiple iterations in their heads and try to think about the ways things could go, or break things down step by step. And so maybe one way to think about reasoning, in a very general sense, is the ability to spend compute, or thinking time, and turn it into better performance. Now, why does that string work and not other strings? The main reason is that that string is commonly associated with breaking things down step by step. And so it's a good way to induce even a very weak model by today's standards, like GPT-3, to try to break things down. And that makes the model make fewer mistakes and get better performance overall.
Yeah, I just want to add a subtle point after GB. This is actually not the recommended way to interact with the O1 series models, because any model in the O1 paradigm was trained to automatically think for some time. So we don't actually need to add that "think step by step" instruction when interacting with the O1 model.

Thanks for sharing that, Hongyu. Shall we move on, fellas?
Okay. This is from Kavita Rasal, founder and CEO of LightRoute Venture. What do you mean technically when a model thinks? For example, is it doing reinforcement learning on already trained data?

Sure. So if you saw the chain of thought that I shared when it was scrolling, the model thinking, as GB mentioned, means that it's spending compute. It's taking the time to explore different avenues within its chain of thought, try things out, get some feedback, and then continue. It's basically a way of, instead of answering right away, taking its time and exploring within the context itself. And that's what it means to think in this context. Anybody want to add something to that?
I liked that, Ahmed. Thank you.
Okay. I actually like this question; I don't think we've ever received a question like this in the forum. This is from Jiaming Zhang, who's actually a software engineer here at OpenAI. Is there some direction the team explored for a while that turned out not to work? Any failed experiments?

I mean, I would say there are probably hundreds of failures, probably too many to list out.

Anything that stands out? You don't have to think too hard about it, but if it resurfaces, just let me know. We can always come back.
This is also from Jiaming Zhang. I heard that O1 is very prompt-heavy, i.e. the user needs to give it enough context and prompting to have it produce some useful output. Does this statement match your observation?
I think it works without heavy prompting. In some cases, heavy prompting is helpful because it just gives more context to the model: what are we talking about, what's the setting? Sometimes there are lots of implicit assumptions, and just making those explicit will help the model be more precise and spend its time productively thinking about how to solve the problem, and not about how to second-guess the person who asked it. Okay, what did they mean? If a detail is missing, should I now guess what they hope to hear about this or not?
Yeah, maybe that would be my take. Okay.
And this is from Matt Roberts, University of Pennsylvania. A lot of folks here from University of Penn tonight. Maybe that's because we're hosting one of their faculty members in a few weeks. Are there specific reasoning tasks where increased inference time compute reduces the need for highly curated training data sets or fine tuning?
I would say all of them, right? Yeah, I would say all of them. I think we showed a graph earlier. As we increased the test time compute, we showed it on math. Like, it improved. And we see the same in most reasoning tasks. Actually, I would say all reasoning tasks. Okay.
This is from Samuel Zhu, Data Science at Redisil. Have you experimented with tool use for self-verification or sanity checking as a step of the reasoning process?
So I will say that tool use, like code interpreter and browsing, are on the horizon. Yeah. I can't really share too much more than that. Okay. Thank you, Ahmed.
This is from Dr. Krishna. Is there a reason why O1 is only good at math? Because the reward model was heavily focused on math, maybe? I do remember seeing you guys say, or at least, Hongyu, you said we prioritized STEM fields in the training, but was that only for the Mini?
I mean, we prioritize STEM. And I want to say STEM includes math, but it's also more than math, right? It also has coding and science and other domains. I think for O1, it's a very general purpose model.
Yeah, I think the premise is probably wrong. The model is incredibly good at programming competitions, as Ahmed mentioned, and it solves some pretty challenging graduate-level benchmarks in physics, chemistry, and biology.
Yeah. Even beyond that, you saw the LSAT scores on there, pretty high. That was law. There were a suite of AP tests, and these are actually recent AP tests, so definitely not seen during pre-training. So beyond math, it seems to have improved across the board in many reasoning tests.
Yeah. Even on general-purpose queries, I think on the chatbot arena leaderboard we see that this model outperforms the previous models.

And the anecdote I showed about the capture-the-flag eval: it literally discovered that, hey, the eval setup is broken, and then sort of figured out how to-

I mean, that's kind of brilliant.
Yeah. So it's definitely not just good at math.
Okay. So this is from Sanjay, director of Kite Nagpur in India. He says he's still not clear about how giving more time increases reasoning ability.

Maybe the best way to think about this is for the person who asked the question to try to solve a puzzle.
For example, let's say we ask this person: what is an English word that has three consecutive double letters? If they try to answer the question in two seconds, it's pretty much impossible; you will not find a person who can do this. But if they think about it for 10 or 15 minutes, at some point they would realize: oh yeah, for example, bookkeeper has double O, double K, double E consecutively, right?
And so what are they doing in that time? They're exploring options, checking whether they make sense, trying to find a way; maybe realizing, okay, if I compound two words, I just need to find one word that ends with a double letter and then a letter, and another that starts with that letter and then a double letter, and I can put them together. All of this thinking is almost exactly the kind of thinking our model does. If you look at the chain of thought, you will literally see the model exploring, error-correcting, trying to validate answers, exploring new avenues, et cetera.

To add onto that, imagine a crossword puzzle or Sudoku. You'll try things out, they won't work; you'll go back and try other things. You're not really expected to immediately look at it and fill in all the blanks in one go. I don't think any human could do that.
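The check in that puzzle is also easy to verify mechanically once you have a candidate word. Here is a small Python sketch of that check, with bookkeeper as the classic example.

```python
import re

# Check whether a word contains three consecutive double letters,
# the puzzle used as an example above ("bookkeeper" -> oo, kk, ee).
def has_three_consecutive_doubles(word: str) -> bool:
    # (.)\1 matches a doubled letter; the pattern requires three in a row.
    return re.search(r"(.)\1(.)\2(.)\3", word.lower()) is not None

print(has_three_consecutive_doubles("bookkeeper"))  # True  (oo, kk, ee)
print(has_three_consecutive_doubles("committee"))   # False (its doubles are not adjacent)
```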
Yeah, definitely. Do you guys think the disconnect here is that it's just hard for us to wrap our minds around the idea of technology thinking? Because that question had a lot of upvotes. You've explained it well, but I think we're like, no, this is not making sense. Tell me again.
I think it's just a cultural thing at this point. Words are just words; they don't mean anything on their own. We assign meaning to words. If by reasoning we mean that if you spend more time thinking, you do better, then it definitely applies. But if we say thinking is only what humans do, then these models don't think.

Right. OK, noted.
Some more questions from your team, fellas. So this is from Tongzhu Wang. He's an incoming member of the technical staff at OpenAI. Welcome.

Yeah, welcome. Can't wait to meet you.
Not all questions benefit from reasoning equally. Given a specific question, how should the model decide the amount of test-time compute it should spend on reasoning and searching?

I can at least take this one. I think for now it's up to the model. But in the future, maybe we will either have the model understand and try to figure it out itself, or maybe it can be a user-tuned parameter, something where you could say, hey, spend a lot of time thinking about this, it's a very difficult problem. Clearly, if you ask it to solve the Riemann Hypothesis, one of the Millennium Problems, it should take a year or more; if you ask it what's 6 plus 7, it should answer right away. So we're still investigating, but it's definitely on our minds.

Maybe one more thing: it also depends, in some cases, on how much you care about having a good answer. For example, if you ask me, can you write a haiku about O1, I can probably do it in two minutes. It's not going to be a great haiku. If you give me two hours, it's going to be a little better. If you give me three days, it's going to be even better. I can technically solve the task in two minutes, but then how good of an output do you want? If you want a really good output, give me some extra time.

Thank you, GB, Ahmed.
This is from Alejandro Dobles, MS student in computer science at Stanford. Is the current bottleneck in thinking time determined by context length? Is the reason we can't make O1 think for years right now that you can't fit more than a context length's worth of chain-of-thought tokens?

Maybe I can try to take this one. I think there are many reasons why it's hard to scale. Imagine you want to have the model think for a year: are you going to train the model to think for a year? We don't want to have to wait a year just to train the model. So there's just a lot of research to be done on many components.
OK, on to the next one. How are you guys doing? You're being grilled. You still feeling good?
Yeah.
OK, this one is anonymous from where I sit. O1 excels at problems that typically have structured or multiple steps, like those we encounter in STEM fields. How smart have we found it in the more abstract, generative, creative domains?

Not as smart for now, but that's definitely on the roadmap. Again, if we think of reasoning as "you think longer, you do better," that should apply to everything, including writing haikus or creative writing. There are some fields that are a little easier to tackle when it comes to reasoning, like STEM. So think of this just as a starting point.

I'd also like to add, you should definitely check out the blog post. We do have a suite of benchmarks, and GPT-4o either matches or outshines the OpenAI O1 model on a few of these more abstract tasks.

Awesome, thanks, guys.
I'm curious, on that note related to creativity and disciplines that are more subjective than objective: as researchers, how do you think about cultural context and the qualitative analysis of whether something is smart or good?

Yeah, my take is that cultural context actually turns things from subjective to objective. If you want to say, write me a good poem, that's a little subjective. But if you're happy to define "good" as a poem that will move someone, for example, or that will move many people, or that will win an award, a Nobel Prize for Literature, now it becomes very objective. And of course, it's defined by some subjective experiences, the ones of people, but the existence of those people in the world is objective. So I think that's an easy way to turn subjective problems into objective problems.
Do you three work with each other on a daily basis?
Yeah, kind of. That's a good question.
I'm getting the feeling you like each other. Hongyu, I like the way you think your colleagues are kind of funny. It's very sweet. It's important to like the people that you work very closely with.
This question is from Kiran Deep Kaur, PhD student at the University of Washington. In safety reasoning, we could see instances where the model uses reasoning to convince itself to stop outputting something dangerous. I was wondering if the model ever tried to reason its way into a misbehavior.

What do you mean? You mean like convince itself to do a misbehavior? Do you have any more context to the question?

I guess, unfortunately, if we were face to face with the audience, we could ask for a little more context. Maybe after the event I'll follow up, and then I'll hit you guys up in Slack, and we can revisit that one.
This is from Chris Soria as well, PhD candidate at Berkeley. Oh, Chris Soria, actually, OK, now I'm remembering: Chris Soria is working in the social sciences under Claude Fischer, a faculty member at UC Berkeley. Trying to remind myself of who he is, guys. Can any of these improvements be explained by changes in the model's training data versus chain-of-thought reasoning abilities?

No. That is, if we only trained the model on the same data but didn't change the algorithm, would it work as well? The answer is no.

OK, thanks, GB. Quick and to the point.
This is from Nolan Koblishke, astrophysics PhD student, probably from the University of Arizona, because recently we hosted one of their faculty members who led the Event Horizon Telescope. He asks: what can scientists do to help build an AGI that can make scientific discoveries, in my case in astrophysics? Should we be making benchmarks, data sets, making our tools more accessible?

Yeah, all of them. All of the above.

OK, Nolan, all of the above. And please contribute.
And GB, this is from Sylvia Baronesi. She said ciao earlier, and now she has a question. Reasoning is a logical function of the brain, or of a large language model. How about other, not-so-rational aspects, such as consciousness or self-awareness in O1?

Yeah, these are great questions. I feel like one of the main issues when it comes to answering them is, again, mostly definitional. It's hard to wrap our heads around these topics if we don't want to commit to a specific definition. And even once we do, if we want to be scientific, we have to make things measurable. I feel like we should just keep an open mind, come up with definitions that are useful, and then try to find ways of measuring whatever these models are doing. And I think in many ways it will actually be much easier to measure whatever it is that we want to measure in these models than it is to measure it in people. To measure what's happening in our brains, you have to drill holes and stick electrodes inside, and there are lots of neurons and blood, and it's very messy. These models are pretty clean; they're easy to study in so many ways. So if I had to make a guess, I think we'll have a much better understanding of the inner workings of our models very soon, within a couple of years at most. A much better understanding than we have of our own brains.

Thank you, GB. Want to add anything, fellows? Or should we move to the next one?

I think we should move on.
OK. Is there a linear relationship between the time you give O1 to reason and the quality of reasoning? I think we've touched on this.

Well, not quite linear. On the x-axis, you double the amount of time; it's log scale. You guys can check out the plot on our blog post.

And maybe, team, we can pin the blog post in the chat right now for everyone to grab and save for later.
This is from Ellen Camel, Research Center Operations, at where I'm not sure. What was your first aha moment when working on the O1 model? Like, the slowdown in computing was actually triggering the reasoning, or perhaps the critical thinking, as a human would?

I'd say it was very smooth. It was not that one day it worked and the day before it didn't. It was a very gradual improvement where the model got better and better at using its thinking and became slightly more confident about its answers, and it would check one more time. Yeah, there was no one day where we showed up to the office and something crazy was happening.

Not quite an aha moment, but obviously you have a few benchmarks, and those benchmarks are improving. But for me, one astounding thing was being able to look at the chains of thought. You can't really peek into people's inner monologues, but when you look at some of these chains of thought, you see behavior and you go, hey, that's how I do it; that sounds like my inner monologue. And that's an exciting aha moment of, hey, maybe this is something that is pretty promising.
Wow, definitely, guys. It actually reminded me of my 14-year-olds and how we're always breaking down the problem. Well, what is the context? Well, whether it's math or whether it's English, this is literally our conversation every night at homework time. And when I saw your models begin to do that, I thought, wow. I mean, wow. We see things much later than you guys, but there are definitely a few aha moments for me that I thought were super exciting.
Well, guys, that's most of the questions. I was trying to filter through the ones that felt repetitive. I will follow up with your team on any questions that went unanswered or need a little more context, and if you have time, it would be great to hear from you. But most of all, I just want to say thank you so much for your time. That was really, really lovely. And GB, I loved the human origin story at the beginning. It's always really awesome to be able to learn a little more about our team, because we're actually humans as well. We're not robots working on robots; we're people. So thank you so much for taking the time. It's probably dinner time for you guys. I know for a lot of us who are parents, we're putting our children to bed. And I just want to say thank you to the audience as well. That was really fun.
And for the team, GB, Ahmed, Hongyu, this event is being recorded. So I'll share it with you later so you can circulate it in your networks. And for everybody here in the forum, it will be posted in the content tab so you can watch it on demand. And hopefully, fellas, we can have you back really soon when your next advancement is published. That's great. Thanks, Natalie. Thank you for hosting us. See you guys soon.
I'm going to close us out with a few notes about upcoming forum events. So at 12 noon, we're going to host community office hours this Friday. I hope that you guys have found these useful so far. We're definitely taking notes. And they always inform our curatorial practice here in the forum. If you'd like to have the OpenAI Forum Roundtable community office hours added to your calendar, please email or DM Caitlin Maltby in the community. And she'll add it to your calendar.
At 5 PM on October 9, we have another really awesome virtual event, Deploying ChatGPT at Scale: Best Practices for Adoption, with Lois Newman. So we're continuing our educational series throughout the rest of the year, as we've mentioned in the past few events.
And then on October 22, we're going to host another virtual event. This is a special one and first of its kind. We're going to be hosting Professor Richard Waterman from the Wharton School. And he's going to be talking about how Wharton is becoming an AI-native institution. And he's been taking the lead on these initiatives. And he has some really cool use cases to share with us.
On October 24, we're going to be sharing Enabling a Data-Driven Workforce with Lois and Erin Wilkiewicz, again from OpenAI. So again, we're all going to gather to learn and really deepen our ability to apply ChatGPT to our everyday lives, workflows, and research.
And then we've also started publishing some events in November so that we can begin to plan around the holidays. On November 13, we're going to host a ChatGPT Learning Lab series. The topic of that learning lab is to be determined, because we're waiting to hear from you; we're really shaping those series based on what you tell us you would like to learn.
I'm Natalie Cone, your OpenAI Forum architect and program lead. And it was wonderful to host you all tonight. Can't wait to see you all soon. Happy Thursday, almost Friday. Thank you to the fellows who presented tonight. Thank you to all the OpenAI teammates that joined us here. And we will see you next week. Goodbye, community.