- Let's Design a Computer: Transistors, Logic Gates (Security Now 233, 42:00)
- Machine language (Security Now 235, 46:00)
- Listener feedback (Security Now 236, 51:43)
- Indirection: The Power of Pointers (Security Now 237, 27:45)
- Listener feedback (Security Now 238, 40:28)
- Stacks, Registers & Recursion (Security Now 239, 1:01:00)
- Listener feedback (Security Now 240 48:16, 1:01:21)
- Hardware Interrupts (Security Now 241, 42:25)
- Listener feedback (Security Now 242, 59:27)
- The Multiverse: multi-threading, multi-tasking, multi-processing, multi-core (Security Now 247, 47:15)
- Listener feedback (Security Now 249, 1:30:21)
- Operating Systems (Security Now 250, 43:22)
- Listener feedback (Security Now 251, 1:00:22)
- RISC vs CISC (Security Now 252, 53:31)
- Listener feedback (Security Now 253, 50:38)
- What we'll do for speed: pipelining, branch prediction (Security Now 254, 29:30)
- See also
Leo: That's all I can say without giving away a very critical part of that book. What a fun book that is, too. All right, Steve. Let us talk about computers. How far back do we have to go to understand this?
Leo: When you say "first principles," are you talking silicon? What are we talking about?
Steve: Before that, actually. If we wind ourselves back in time, get into our Wayback Machine and want to understand the first successful computers - and I'm talking about, frankly, the PDP DEC machines. There was Honeywell and Burroughs, Digital Equipment Corporation, IBM. These early machines were pre-integrated circuit. So there wasn't this notion of multiple components that could be mass produced. That was an amazing breakthrough which is very easy to take for granted. I mean, lord knows everybody, well, everybody listening to this has integrated circuits surrounding them. And I would say throughout the day we're surrounded by integrated circuits doing different things.
But before that, before it was possible to integrate different types of components onto a single piece of silicon, which allowed then mass production, all of these components were separate. And it was the separateness of them and the need to interconnect them which put a tremendous limiting factor on the feasible complexity of these early machines. So what I want to do is really start at the beginning with, if you have resistors and transistors, how do you create logic from that?
We know that computers use ones and zeroes. There were computers, analog computers, which people tinkered with, which for a while could - they had the benefit of being able to, with a lower degree of accuracy, do things in the analog world where currents and voltages conveyed meaning. They were able to do things far more easily than digital computers of the era, within the same timeframe, except that what ended up happening was we needed more accuracy. Things like temperature could affect the accuracy of an analog computer.
And so it turned out that just treating everything as collections of binary information, ones and zeroes, even though you needed a lot of them in order to get the same resolution that you could get sort of for free with the voltage on a wire, you could represent that voltage with a sufficiently large number of ones and zeroes and get the resolution you needed, and also get absolute accuracy that would not drift over time. You didn't have to worry about calibration and temperature and super closely controlled power supply voltages. There was just all this forgiveness by taking the binary approach. So now analog computers are sort of long gone, and they've been completely supplanted by digital technology.
So one of the simplest, most basic components of digital logic is called an inverter. And I want to explain - here's where we wish we had GoToMeeting. But we're in a podcast, an audio format, so I'm going to need people to sort of...
Leo: To visualize here.
Steve: Yeah. If you're driving...
Leo: You're proving my point.
Steve: Exactly. If you're driving while you're listening to this, do not close your eyes. But anybody else, I'm going to draw a picture for you. We have to do a little bit of schematic work with electricity and early electronics to explain the principles. But when I'm done, I think everyone's going to get a kick out of what they understand. I'm going to simplify things a little bit here and there. But fundamentally this is the way all of this stuff began to work.
Imagine in this visual slate that there's a wire running along the top which carries a voltage, and another wire running along the bottom which is the ground. And this is the way most of these logic diagram schematics are drawn, is you'll have sort of a bus running across the top that has a voltage, which is just a pressure, essentially, created by a power supply. And anchored at the bottom is another wire, sort of a bus running horizontally that is the ground. You then - you interconnect things in between this positive power supply potential at the top and the ground at the bottom.
If we had two resistors - a resistor is a component with two wires coming out of each end which, as the name sounds, resists the flow of current through it. Essentially what it does is you run current through it, and it gets hot. It dissipates current in the form of heat. So imagine in this circuit diagram that we have two resistors connected, the first one at the top, coming down to the second one, which then connects to the ground at the bottom. So that we have a circuit formed just with two resistors in series. And for the sake of simplicity we'll assume that they have the same amount of resistance. Well, this forms something called a "voltage divider" because essentially, when we make this circuit with just two resistors in a series, voltage will flow through this circuit.
And the direction of voltage flow is sort of controversial. I can't remember now, I was trying to remember which direction I learned in high school. Some people think of voltage flowing from the negative to the positive. Some people think of it from the positive to the negative. It really doesn't matter. Technically one direction is current flow, the other is the flow of the electrons, which sort of goes, being negative, goes in the other direction. So either way, all you have to have is a consistent system, since it's really sort of an arbitrary designation which way the current is flowing.
So we have this what's called a "voltage divider." So at the very top is our power supply voltage. What happens is the resistors share this voltage drop, as it's called, between the positive power supply voltage and ground, so that the junction where they're connected in the middle will be at half of that power supply voltage because they evenly divide it. And so that's sort of the first thing to see is you have two resistors connected together. They form what's called a voltage divider. And the voltage in the middle, or the voltage at their junction, where they're connected, is half of the total voltage.
So now we take out the bottom resistor, and we replace it with a switch, just a standard mechanical switch. It's got two wires; and, depending upon whether the switch is open or closed - "open" means they're not connected, "closed" means they are. If we close the switch, then the switch is essentially a short circuit. So now that resistor that's still on the upper half of this little circuit, its lower lead is connected through the closed switch to ground. So its lower lead is now at zero voltage, at ground, when this switch is closed. If we open the switch, then we've disconnected the circuit, and the lower lead now has the same voltage as the power supply because there's no current flowing through this resistor. There's no voltage drop across the resistor.
So now we go to the next step, and we replace the switch with a transistor. A transistor is a three-lead device, a three-terminal electronic device. We've all heard of transistors, of course. The way it works is it's like a - it works like an electronic switch. We put this transistor in the circuit. And so the transistor has an input on what's called the base lead of the transistor such that, when we put a positive voltage on that base lead, on the input of the transistor, the switch closes. That is, the transistor sort of works like the switch that we just took out. But it's controlled with the voltage on its base.
Actually voltage and current get complicated here, and I want to keep this sort of simple so we can stay to what's important. But the idea is that, if we put a positive voltage on the base of the transistor, that is, the input of the transistor, some current will flow through the base, which turns the transistor on. But remember that when the transistor is on, it pulls the lower end of that resistor that's coming down from the supply voltage, it pulls it down to ground, that is, down to zero. So what we have is an inverter because, when we put a positive voltage on the input of the transistor, it turns on, which pulls that junction between the resistor and the transistor down to zero. So a one goes in, and a zero comes out. And if we move the voltage on the base of this transistor, the input of the transistor down to ground, then the transistor turns off. And with the transistor off, then that junction between the resistor and the transistor goes up to the power supply voltage. In other words, a one.
So what we have is this, with just two components, this resistor that goes up to the positive power supply with a transistor hooked to it going down to ground. We have an input into the transistor, and the output is that junction between the resistor and the transistor. And that creates an inverter. So we have with these two components probably the most basic logic system that you can have.
So that's an inverter. It doesn't, I mean, it's certainly useful by itself. But we can do something, make one additional change to it to begin to create some logic gates. And that is, we take another transistor and hook it to the same place. That is, we put these two transistors in parallel with each other. Another transistor hooked to the same place so that either of them are able to be turned on and pull this output down to ground, that is, hook the bottom of the resistor down to ground. So now look what we have. If we turn either transistor on by putting a one, binary one into either of the inputs, then that transistor will turn on and pull the output down to ground. And they can both be turned on. We get the same result. So what we have is a, in logical terms, is called a NOR gate. Which NOR stands for "not or." So if either input is a one, the output is a zero. So we have the beginning of logic.
Now, we know how an inverter works. The inverter was just the transistor and the resistor. So we could take one of those and hook it to the output of this little NOR gate, and that would invert its output, turning it into an OR gate. So now if either of the inputs is high, the output of the first part is low. And then that's inverted so that the output of the final thing is high. So if either input is high, the output of our little OR gate, composed of this NOR followed by an inverter, is high. We have an OR gate. Or, conversely, we go back to this NOR gate, where either input is high, and the output is low. If we put inverters on the inputs, on each input, then look what we have. If we just, with the NOR itself, if either input is high, the output is low.
The other way of saying it is, only if both inputs are low is the output high. So if we invert the inputs, then only if both inputs are high will the output be high. Which is an AND gate. So we could have two inverters that feed into the NOR gate, and we end up with an AND gate. So it's possible just with this, just using simple components of transistors and resistors - and this is actually a family of logic called RTL. RTL stood for Resistor Transistor Logic. And circuits that were exactly like this were very popular. And this is the way digital logic was originally created. So it's clear that, by assembling these little building blocks in different configurations, we're able to create sort of fundamental logical connections.
Now, one of the core things that a computer has is registers. It needs to have, it needs to somehow hold data that it's working on. We need to be able to, for example, load data from memory into something called a register. You know, an accumulator. Well, what is that? How do we go from having AND and OR things, which we know now how to build, how do we have memory? How do we store something?
Well, there's an interesting trick. It would be fun to know who actually was the first person to come up with this because it's clever. And that is, we've seen how we create an inverter, where we just have a resistor coming down from the power supply to a transistor such that, if we put a one in, we get a zero out. Well, if we connect another inverter to the output of the first one, and then connect the output of that second inverter back into the first one, so essentially we have two inverters that are connected in a chain, or in a ring. Look what happens. If we have a one input going into the first inverter, we know that it gives us a zero out. Well, that zero goes into the second inverter, giving us a one out, which we feed back into the first inverter. So it's stable. That is, it just sits there like that. And it'll sit there forever like that.
But if something were to briefly, for example, pull the input of that first inverter, which is a one, pull it down to ground, to zero, well, then, its output would go to one. So the input to the second inverter, which would now be one, it turns into a zero, which it feeds back into the beginning, and that will be stable. So whatever state these two inverters are in, they stay in. And that's the basis for another fundamental logical thing called a flip-flop because it flips or flops in either direction, and it stays that way.
Now, when I talked about like if something pulled that input down, the way this is actually implemented is with something like a NOR gate. So if - and this circuit gets a little bit more complicated, and I'm about to sort of start waving my arms around and not going into the same level of detail as we pull back from this. But if we, instead of hooking these inverters to each other, we hooked our NOR gates to each other, then imagine that both - so the circuit is we have a NOR gate - we have two NOR gates. The output of the first one goes to one of the inputs of the second one. The output of that second NOR gate goes to one of the two inputs of the first one.
So we still have the notion of these things being connected to each other in a ring. But now we have each of those NOR gates has the other input, which is not participating in this circular interconnection. And that's actually where we're able to briefly change one, briefly, like, put a pulse on one to flip this flip-flop from one state to the other. And that's the basis for a register bit.
Now, we want to do other things like add bits together, add numbers. It turns out that addition of a binary number is essentially synthesizable just from these fundamental logic blocks. And we've sort of talked about this a few weeks ago where, if you look at adding two bits together, if both are zero, the result is zero. If either one is a one, then the result is one. If they're both one, then the result is zero with a carry. And so binary math works exactly the same as, for example, the decimal base 10 math that we're used to where, for example, if you had five, you were adding five and zero, you'd get five. If you add five and five, you get 10, meaning that the units is zero, and we've carried a one into the tens position. Well, binary works the same way, where if we have two ones, the result is zero, and we carry the one into the next position. So it's possible to express that logic with just the gates that we've seen designed here.
What I just described is known in logical terms as a "half adder" because it's able to take two binary bits and produce a result with a carry. Now, the tricky part is, the next bit over, the next highest bit, it needs to be able to add its two bits, but also accept the carry from the position to the right, the one less significant bit. That's a little more complex nest of logic which is called a "full adder" because it's able to take two inputs and the possibility of the carry from the prior result and incorporate that into its output and also send a carry to the next one.
So essentially, by stacking enough of these full adders together and interconnecting them so the carry-out of one goes into the carry-in of the next, and the carry-out from that goes into the carry-in of the next, you're able to take two binary numbers; and, after this thing settles down, there's like a ripple effect. If you imagine that you put the two numbers in, well, the result of the first two will produce a carry that goes into the second two. And that may produce a carry that goes into the third two and so forth. So this carry ripples down through these full adders to produce a result.
And then the final stage of this full adder, it produces a carry which in computers is often called the "overflow bit." Because the idea is, if we were adding 16-bit numbers together, the result of adding 16-bit numbers could be a 17-bit result, that is, a number that would not fit in 16 bits. And that's exactly the same as in decimal. If we're adding, like, single decimal digits, well, we know that a single digit can hold up to nine. So we could be adding any two decimal digits that sum up to nine. But if we try to add seven and seven, well, we know that that's 14. Well, that 14 won't fit in a single digit. It needs two.
Similarly, it's possible to add two binary numbers where the result won't fit in the same size as the binary numbers, so you have that final - that's what happens with that final carry bit that overflows outwards. So that's sort of the fundamental architecture for the way bits are managed in a computer.
Leo: Do you think that people figured that out a priori? I guess, you know, Alan Turing did all this, long before hardware was available to do it, in his head. And maybe even going back to Ada Lovelace; right? I mean...
Steve: Well, I mean...
Leo: But we didn't have this binary thing until we knew it was going to be a switch.
Steve: Right. That's a very good point. And all of this can be expressed mathematically rather than electrically.
Leo: It's Boolean logic; right?
Steve: Exactly. Boolean algebra, Boolean logic, allows you to do all of this kind of stuff and work out these problems. I mean...
Leo: And that's well known. I remember studying that in college, before there were personal computers. And it's fun. You do it all on paper. And you have AND, OR, NOT. I can't remember if you have things like NAND and NOR. But you learn all those. And there's even symbols for all of that.
Leo: So it makes sense that then, if you give somebody some hardware, and you say, well, okay, you have a switch and you have inverters and all that stuff, now, how do you duplicate those functions in this hardware? And that's really what you're talking about.
Steve: Exactly. And at the time, now, if we think about the cross-connected inverters with some additional logic around them, one of the things which basically forms a storage register. And then you want the ability to load them with a value that's coming in on a set of signal lines for however many bits. Well, that's going to take, oh, maybe, call it 20 transistors and some resistors. So that's for, like, one bit of NOR gates cross-connected with some other gates to gate their inputs. So that's maybe 20 transistors.
Well, back in 1960 a transistor cost a couple dollars. I mean, like, one transistor was a couple dollars. And it was a little, sort of a little silver can, smaller than a dime, sort of like a pencil eraser. Think of it like the end of a regular pencil eraser, sort of like that, with three leads. So it's a couple dollars. Well, say that the resistors that are also going to be scattered all over the place are free. We'll just toss them in because they were always pretty inexpensive. But 20 transistors at $2 each is $40 for the logic to implement one bit of a register. So there's - and that's not - that's just, like, raw cost. That's not even burdened with all the other costs.
The other thing then you have is the need to interconnect all of these because you've got 20 of these little eraser head things with three wires and a bunch of resistors. Now they have to physically live somewhere, on a circuit board. And that's got to have interconnections, which are traces on a circuit board. But now you've got to interconnect them into a family of these. So you need connectors to allow this little circuit board that represents, remember, just one bit of storage forming one bit of a register. It's got to be - you've got to be able to plug that into some sort of a back plane where wires then go from one to the other to connect them into a multi-bit conglomeration.
So maybe this is $50 by the time you're done for this one-bit register. And you want to do a, what, a 20 - you want 20 bits of binary resolution. So now you've got $1,000 that you've spent for 20 binary bits of storage. That's all this is, is just, you know, it can store a 20-bit value. But you haven't been able to do anything else with it yet, and you've spent $1,000. So, and that's not profit. I mean, that's not $1,000 of sale price, that's $1,000 of cost, including something for the interconnection.
So from the beginning the engineers who were trying to design a computer that could do something useful, they were designing them with these individual switches called transistors, and resistors that sort of go along with them, at a cost that meant that literally every single bit that they used was expensive. And they were trying to bring the costs down as quickly as they could, as far as they could, to make these computers much more accessible to people.
What I want to do next week, since we sort of understand this, is take a look then at the next stage, which is what do you do when you've got memory, and you've got a register like this, how do you turn this thing into something that can get some work done? And that's what we'll do in two weeks.
Leo: I love it. I love it. You know, it's so interesting to think, what, you said a thousand bucks for 20 transistors, something like that...
Steve: 20 bits.
Leo: 20 bits.
Steve: One 20-bit register.
Leo: And now we've got, somebody pointed out in the chatroom, we've got NVIDIA GT200 cards which cost about $100, $200 for, get ready, 1.3, what is it, 1.4 billion transistors. Billion. Billion. It's amazing, isn't it. But it was a huge insight to say we can do this. And then that began, with Moore's Law behind it, that began the amazing revolution that we've seen today.
Steve: I read a really interesting book last summer about the invention of the integrated circuit. And the breakthrough, it was actually there were some people in Texas, at Texas Instruments, and also at Fairchild in California. And there was some argument about who actually got it first. But at about the same time there was parallel invention because what happened was everybody was running up against what they called the "interconnection problem." It's like, we are dreaming big. We're ready to, like, do more.
But what happened was, just the physical need to interconnect the pieces, that became the limiting factor. They just, you know, the individual components were so big, and that they physically took up so much room, that you just - you needed to lay out the space. And there was - before long it got too big to run wires all around it. And so the interconnection problem was what the integrated circuit solved when it was able to say, hey, you know, I mean, they even knew they could make resistors out of silicon, they could make diodes, they could make transistors, the fundamental pieces they knew how to synthesize. But they didn't know how to, even on silicon, how to interconnect them. That breakthrough then opened the floodgates.
Leo: It's amazing. Eden in our chatroom sent me a link to the Wikipedia article on transistor count, which has a remarkable graph that shows how rock-solid Moore's Law is. It's a logarithmic chart that starts in 1971 with the 4004 which had 2,300 transistors. Essentially a transistor's a switch; right, Steve?
Steve: Yes. Exactly as we just covered, it is just like - it replaces a switch.
Leo: So it's 2,000 switches. Going up to 2008, where a 2 billion-switch quad-core Itanium - and but look how straight that line is, if you go to this curve, and because it's a logarithmic scale that means doubling. Incredible. I mean, it's such a - it's one of the most remarkable predictions of all time because it's held true for almost 40 years. And, I mean, true. I mean, right-on true.
Leo: Almost self, maybe self, I don't know, self-inflicting because in fact Gordon Moore was the chairman of Intel, so maybe they said, well, we have to do this. I don't know. But amazing. Just amazing.
Leo: Really remarkable. What an age we live in. And it all starts where we just started.
Steve: Well, and what is so remarkable, and this is what we'll look at in two weeks, is I'm going to - I hope to be able to demonstrate with such crude tools, with something so simple as a really basic computer, it is possible to do an amazing amount. Again, as we said at the beginning, like a dumb box of rocks. But really fast rocks.
Leo: Where do we start?
Steve: Well, two weeks ago we sort of laid down the foundation by demystifying AND and OR and NAND gates and how you could cross-connect two inverters that would create a little memory cell which could remember if it was set to one or set to zero, a so-called "flip-flop." And I wanted to convey to people the sense from back in 1960, essentially, for the number of components that were required to do even the simplest things. So now we have gates and the ability to create a register of individual bits which can be read and written.
So how do we, literally, how do we make a computer? And I know from talking to people so much, there's sort of this mystique about assembly language and machine language, as if, like, you have to be some galactic guru in order to understand that. It's like, oh, that's like really deep voodoo. And I'm going to use C or Perl or PHP or Python or something. The truth is that what's actually happening down at the hardware level is really simple. And, I mean, I'm going to demonstrate that now by looking at and sort of designing, developing right now with our listeners a completely workable, usable computer, using only what we understand, no magic. And I believe that, once we've gone through this exercise, sort of wiping the slate clean and just saying, okay, let me just think about this, people are going to end up thinking, well, okay, that's it? And the answer is yes. I mean, it's not that big a deal.
So we have memory for any machine. Back in the early '60s we had gone from drum memory to core memory. Drum memory was sort of the predecessor to core, the idea being that you'd have a magnetized drum that was spinning and literally use the impulses coming off of the drum as the contents of the computer's memory. Thank goodness that was replaced when this concept of cores, little tiny doughnuts, essentially, that are magnetizable in either a clockwise or counterclockwise direction. We've talked about it once before, the idea being that this memory could store a one or a zero based on whether the individual little doughnut was magnetized in one direction or the other. And you could tell which direction it was magnetized in by forcing them, like a set of them, all to zero. If they were already at zero, nothing would happen. If they had been at one, then the act of switching their direction would induce a pulse in a so-called "sense wire," and that would tell you that, ah, we just moved that one from one to zero.
Well, that was called a "destructive read" because the act of reading the contents destroyed the contents. We wrote zeros to everything, getting pulses out of those ones that switched from one to zero. Which meant that, unless we wanted to leave the zero there, we needed to rewrite the original contents in order to put it back, which is what these memories typically did.
So let's imagine that we have memory, core memory, which is nonvolatile, meaning that it just - we can magnetize these little cores, and they'll stay set that way. And we have a way of reading out the contents of a location and getting the ones and zero bits that are there. So the first thing we need to have is what's called the PC, the Program Counter, which is a - it's a counter which increments one at a time, reading out the contents of successive words of this memory. Now, the word length can be pretty much whatever we want. There were word lengths back in the beginning of as much as, like, 36 bits, sometimes even more. The early DEC machines were 18-bit word length. And people are used to thinking these days in terms of 16 bits or 8-bit bytes. We know, for example, that the Pentium machines were 32-bit machines, and of course we now have 64-bit systems. So these are the - currently there's been complexity added to what so-called "word length" means, which we're going to actually talk about in two weeks. In two weeks we're going to talk about all of the stuff that's happened since. But back in the beginning the word length was whatever the designers wanted.
Now, there was pressure on keeping it short because everything cost so much. Remember that, back then, this was before integrated circuits. So a bit that you had was a bunch of circuitry that you had to pay for every time you made one of these machines. So, sure, the programmers would like more bits because that allowed them to store more stuff. But the management was saying, wait a minute, we can't afford this many bits. So there was sort of a compromise. So if we look, for example, we don't really have to worry about specifically, but just sort of imagine you had 18 bits because that's where the first machines of this era sort of landed, 18 sort of being a compromise of different pressures, cost and capability.
So we have this program counter which will address the memory sequentially, basically stepping through it. So say we start at location zero. So out comes 18 bits into a register which we call the "instruction register." And it's just, it's one of these registers made out of individual bit memories, which we talked about last week. And they're all expensive, but we can afford 18 of them. So this instruction register holds the data that we just read out of a given location in memory. So what do we do with that?
Well, there's essentially a subdivision of the bits into different purposes. And a term that probably everybody has heard is opcode, the operation code. And sort of traditionally, the opcode has been on the left of one of these long words. So, for example, in the case of this computer we're making, we'll dedicate some number of bits to the opcode. So, okay, what does that mean? What things do we want to be able to do? We want to be able to load and store memory. We want to be able to add and subtract and maybe perform some logical operations.
Now, we're performing these against something called the "accumulator," which is another register. We had the instructor register; now we have an accumulator, which is sort of our scratch pad, so that that's the main working register where the data moves through where we perform these operations. So, for example, if an instruction said load a certain location into the accumulator, then the computer would transfer the data in a given location in its memory into the accumulator. And if another instruction said store that somewhere else, the computer would store whatever happened to be in the accumulator now into the location specified.
So we need to be able to perform some operations on the data in this accumulator. And sort of - so this is everything is centered around the accumulator, with the rest of the hardware sort of all existing to serve the purposes and needs of this accumulator. So if we had an opcode of, say, 5 bits, well, we know how binary works. We know that each bit gives us twice as many as we had before; 5 bits means that there's 32 different combinations of 5 bits. So if we think of those as sort of as the verb of this instruction, we could have 32 different things. And in fact the PDP-1 was an 18-bit computer that did have a 5-bit opcode. But back then 32 verbs, 32 actions that you could specify turned out to be more than they ended up being able to use.
So as the DEC minicomputers evolved, in fact with the very next one, which was the PDP-4 - there was no 2 or 3; the 4 and the 7 and the 9 and finally the 15 were the 18-bit lineage - they dropped the opcode to 4 bits, which is where they stayed for quite a while, for many years. So 4 bits gives us 16 different verbs, 16 different things we could do. So, for example, the opcode, meaning the first four bits of this word, might be 0000 or 0001, 0010, and so forth. Each combination of those 4 bits would specify a different action. And just one simple action. So absolutely one of them would be load the accumulator with something in memory. Now, where in memory? Well, that's where the rest of the bits come in.
Leo: All right, Steve. So we're building a machine language. And it really is based on kind of the architecture of the CPU; isn't it?
Steve: Well, I think what's significant, the point that's worth making is that even though I'm talking about an architecture that is 50 years old, this is still today exactly the way computers work. What I'm talking about is a simple CPU, a simple Central Processing Unit. But the fundamentals haven't changed at all.
Leo: Probably not even since Alan Turing imagined how a computer would work in the '40s.
Leo: This is the fundamental way a computer works.
Steve: So we've got a 16-bit word. And the left-hand 4 bits are allocated to the opcode, which leaves us 14 bits for the address. Meaning that the word is two parts. There's "what to do," and then the second part is "and what to do it with." So a 14-bit address gives us 16K words. If we think of, like, 10 bits is 1K, 11 is 2K, 12 bits is 4K, 13 bits is 8K, 14 bits is 16K. So the right-hand 14 bits provides the address, sort of the address argument for the opcode verb.
So say that the opcode 0000 stood for "load the accumulator." So when we fetch this 18-bits instruction into the instruction register, there's some logic which looks at the combination of bits in the opcode and essentially does this one simple thing that the opcode specifies, like load accumulator, if all four of those bits are zero. And so what that means is that that 14-bit argument is used as the address to fetch another piece of data from memory, different from the instruction. We fetch the instruction from where the program counter is pointing. Then we fetch the data from where the 14-bit argument of that instruction is pointing and load that into the accumulator.
So the opcode 0001 might be "store accumulator." And then the 14 bits following it would specify where to store the accumulator. So with those two instructions we have the ability of picking up data from somewhere and storing it somewhere else, moving the data from one place to another in memory. We might - we would certainly have an instruction called ADD. That might be 0011. And what that would do is - and then the 14 bits that follow would specify where to go to get the data to add to what's in memory. Again, it would - and this class of instructions are collectively called "memory reference instructions" because each of those opcodes references memory. It loads it; it stores it; it adds it to the accumulator; it might subtract it from the accumulator; it might AND it against the accumulator or OR it with the accumulator. Basically very simple, simple bit manipulations against the accumulator.
Now, the computer is useless to us unless it's able to have some sort of I/O, some sort of input/output. So one of those instructions, which would not be a memory reference instruction, would be an I/O instruction. Maybe that's, like, 1111, all the way at the other end, the 16th instruction, 1111. That would - it would be formatted differently. That is, the memory reference instructions were all an opcode followed by 14 bits that specified where in memory to do its thing. Whereas the last instruction, 1111, that's an I/O instruction.
So the rest of the 14 bits might, for example, specify an I/O device. Many of the early computers had, like, you could attach up to 64 devices. Well, 64 is another power of 2 which you require 6 bits to specify. So there might be a field in those remaining 14 bits that is a 6-bit I/O device number, meaning the teletype, the mag tape, the card reader, the card punch, whatever device it was. And then some of the other bits might be start the device, stop the device, read the device, write the device, different bits that are about input/output rather than, well, because those apply to that specific instruction. So what we see is we see that there's always a field in the instruction word for specifying the operation. And then depending upon that operation, the remaining bits provide arguments of one form or another to it.
Now, at this point we've got a computer which is able to move through memory, incrementing its program counter once for every instruction, and reading what's there and causing something to happen. Read, load and store, input something, output something. The problem is, it just goes in a straight line. And while that's certainly what you want some of the time, one of the things that computers do is make decisions. And that requires altering the normal linear incrementation to jump somewhere else.
The way this was done then, and even now, was to have a skip instruction, the ability to skip over a word in memory. Even though that wasn't very powerful, it was powerful enough because what you might have had, and certainly would have, one of our instructions. We talked about load and store and add and so forth, well, one of those, like instruction eight - 1000 - that instruction could be the jump instruction. And so when we load the instruction in the instruction register, and the opcode is 1000, that is, the first, the left-hand 4 bits is that pattern, well, the argument to that instruction, the other 14 bits, is the address we want to jump to.
So all the computer does is it loads that 14 bits into the program counter. So that instead of the program counter incrementing one at a time, we've just replaced the contents of the program counter with the 14 bits in the jump instruction. Which means that the next instruction we fetch is at that location. We've just jumped our program execution to a different place. That's all there is to it.
And so the way the skip comes into play is that, if we tested something, like say that one of our instructions was skip if the accumulator is zero, or skip if the accumulator is not zero, that kind of thing, well, if we were to subtract two items, and they were the same, that is, if they were equal, then the result would be zero. So that allows us to determine if two things are equal or not. And if we had an instruction that said skip if the accumulator is zero, then the instruction it's skipping over would be a jump instruction, which is - this is all a very simple way of implementing the control of the program's flow, so that if the two things we were comparing were not the same, the accumulator would not be zero, so we would not skip the instruction that follows. That instruction that follows would be jump completely somewhere else, so that if we don't skip, then we land on that jump instruction and go completely somewhere else. If the accumulator was zero, we skip over that jump instruction.
And all skipping means is, instead of adding one to the program counter, we add two, or we add one twice, which is actually how these machines worked back then. And that just causes us to skip over a jump. So essentially that means we can branch to anywhere we want to in memory or continue on our way, which gives us, even though that's very simple, that gives us enough power to allow machines to make decisions. And we've got input/output; we've got math; we've got the ability to transfer data from one location in memory to another. Those are all the essentials of the way a machine functions. That is machine language.
Now, the one layer of humanity that's put on top of that is what's called "assembly language," which is nothing but naming things. For example, you create sort of a so-called mnemonic for the different instructions. So, for example, load the accumulator would be LDA. Store the accumulator, STA. You want them to be short because you're going to be typing them a lot. Remember that you end up using lots of little instructions in order to get something done. And then the only other thing really that assembly language does, it allows you to name locations in memory.
So, for example, you might say LDA, for load accumulator, current score. And current score would simply refer to a, like a variable essentially, a location in memory that you had labeled "current score." And then if you did STA, store accumulator, new score, well, it would first load the current score into the accumulator, and then store that into a different location called new score. So really that's all we're talking about is some simple abbreviations for helping sort of remember and use these individual instructions and convenient labels for locations in memory so that you're not having to remember, oh, that's in location 329627. I mean, who can do that? So instead you just, you label that location with an English, an alphanumeric phrase of some sort, and then you refer to that location by the phrase rather than by its actual number.
And in fact you don't care what the number is. That's one of the things that the assembler will do for you is you just say I need memory called these things. And it worries about where they go because it doesn't really matter to you as long as they're consistently referred to. And that's the whole process. That's machine language and assembly language. And that's the way it was 50 years ago, and more or less that's the way it is now.
Leo: Very cool. It's amazing, really.
Steve: It is. We referred to it the other day as a dumb box of rocks that was just very fast.
Leo: Exactly. And this is - I think that was the most valuable thing about me learning how assembler works is you see every individual thing it does. And so you see exactly that. That's the lesson, is it's not doing very much. It's doing it fast.
Steve: It's why I like it, because nothing is hidden.
Steve: That is, there's nothing going on underneath that. One of the problems that I see programmers having is they assume that the compiler, like a C programmer is expressing much more abstract things. For example, when you're dealing at the machine level, you are truly dealing with fixed numbers of bits that you're moving around under your command. When you abstract that a lot, you're now talking about sort of like double-precision something. But the details matter. And it's where the programmer assumes that something is going to be done for him or her by the compiler that the compiler doesn't agree with. The compiler says, no, that's not what you told me to do. I'm going to go off and do this. So that kind of miscommunication in assumptions is where a lot of problems crop up. And for me, by dealing with it, by insisting on actually doing the individual small little bite-size pieces, there's no room for argument.
Steve: I mean, when I make a mistake, it's mine. It's because I told the computer, move this chunk of bits over here, and that was the wrong place to go. It's not that I told it something, and it did something different.
Leo: Yeah. Well, doesn't mean there are no bugs or surprises. I mean, because humans may think they're saying one thing and the computer think another. But it's much less ambiguous.
Leo: I mean, it's pretty clear. And I would guess there's kind of fewer interactions. Although, I don't know about you, but as I used assembler, I built larger and larger macros that, in effect, represented higher level commands. You must do that; right? You're not going to write out each little thing every single time.
Steve: Well, we're going to talk - one of the things we're going to talk about in two weeks is the nature of indirection and pointers.
Leo: Oh, boy. That's fun.
Leo: Oh, boy. If you - that was - there are two things I found very difficult to learn in programming. Indirection was one, and recursion was the other.
Steve: It's hard. It requires you being very clear about whether you mean something or the thing that that thing points to.
Leo: Right. I remember it very well. Now it's obvious to me. But I do remember very well when I first started writing in C, learning where to put that little caret and where not to. Oh, this'll be fun.
Leo: This'll be fun. Oh, I'm really enjoying this, Steve. And it's bringing back memories, and it makes me want to drag out my copy of MSM.
Steve: Well, and, I mean, what we just described, I mean, that is - what I described is a working computer that has everything it needs to get something done. And I think the mystery or the surprise is that just that, I mean, that's all our computers do. They load and store and perform simple operations on little bits of data. And, I mean, look what we get as a result. Because they're able - because there's enough of these little bits of data, and it's able to do stuff so fast, that they perform magic, really.
Leo: Let's see, here. Dan White in Winchester, VA wonders about incrementing the program counter. Oh, good. We got some programming stuff here. Just listened to the last episode on machine language, thoroughly enjoyed it, he says. It brings back memories of when I programmed in Z80 machine language for a computer my dad and I built based on the S-100 bus, if you remember that.
Steve: Oh, yeah.
Leo: Oh, yeah. I'm looking forward to your discussion of indirection next week and wherever you go after that. My question relates to - I want you to do recursion, too.
Steve: Yeah, we're going to.
Leo: Oh, good, okay. That's the one I'm still wrapping my head around.
Steve: Yup, because we need to talk about stacks.
Steve: And so the plan is to talk about adding various types of complexity onto the very simple model that we built last week.
Leo: Yeah. My question relates to something you just glossed over in your jump from the previous discussion of simple gates and flip-flops - which was excellent, by the way - to this discussion of machine language. You spoke of the program counter to allow the program to step through instructions. But doesn't that require more than just simple gates? Seems like it would involve an adding function, a timer, and a looping mechanism to continually add one to the counter. But that seems to require more complex functions of a program which depend on the program counter. So would you then need a program to create a program? How do you get this chicken-and-egg thing started? Is the program counter all done in hardware? Did I miss something in your previous discussion, or is this something you plan to address in future episodes? Thanks for Security Now! and for SpinRite. No great stories, just the normal maintenance of my drives. Dan.
Steve: Well, I thought that was a neat question. I did sort of just talk about the program counter incrementing and never talked about how that's done. It's not worth a whole episode, so I thought this made a great question for a Q&A. Counting in binary is a process that is interesting and sort of fun to work out on the back of a napkin, or on the front of a napkin, for that matter.
It turns out that a binary counter has a very simple logic to it. If you have a string of bits, say individual bit cells, and say that it's initially all set to zero, well, to turn it to a one we invert the lowest order bit. And so now we've got all zeroes and then a one. To increment again, we invert that first, the rightmost bit again. But when we invert the bit, and the bit goes from a one to a zero, we invert the next bit to the left. And what's really cool is that simple logic. You just - you invert a bit. And when you invert it, and it goes from one to zero, you invert the bit to the left that is the most, the next most significant bit. If you apply that, that counts. So you start off with all zeroes.
Leo: "Counts" in the sense of counts one, two, three, four; not "counts" as in is significant. It's counting.
Leo: It's adding, yeah.
Steve: It's essentially, it's adding one every time. So we start off with all zeroes. We invert the least significant bit. That goes to a one. And then we invert it again. Well, that bit goes to zero, which kicks the next one, causes it to invert. So that goes - now you have a one zero, which is the number two in binary. Now we invert the least significant bit again, so now we have a one one. Now, when we do it again - and a one one is a three - now we invert the least significant bit. So that one goes to zero, which kicks the next one over. It's a one. It goes to zero. Which kicks the next one over, forming a one. So now you have one zero zero, which is binary four.
So the logic in our machine, this program counter is a register of flip-flops that I talked about before. And there's some logic you can put on a flip-flop such that you're able to cause it to toggle. It flips. If it's on, it flips off. If it's off, it flips on. And so just by wiring those sort of in series, you get a counter. And that allows our machine to address successive incrementing memory locations in sequence.
And we also talked last week about altering the instruction flow, that is, this notion of skipping an instruction if the result of an operation was zero or had some particular characteristics. Well, skipping an instruction merely requires incrementing twice, rather than once. So you just send two pulses in, in the event that you want to skip over an instruction, and it adds two to the program counter instead of adding one. So it's a very elegant, very simple solution. And it works.
Leo: It's amazing. I just love the - elegant's the word. In fact, that's one of the - I mentioned art. That's why programming is an art. Because it's not, if it's done right, it's not a struggle. It falls into place in a way that's elegant. And you know immediately that's the correct solution because of the elegance of the solution.
Steve: Yes. I think that's - that really says it well. I've seen that when I don't really grasp the problem that I'm trying to solve, but I just sort of start writing code because I'm in a hurry or I'm anxious or impatient, I can sometimes code myself into a corner. I get myself tangled up where I think, oh, wait, I forgot about this. And so I add some code to fix that. It's like, oh, wait, I forgot about that, and add some code over here. And then before long you end up with just this big mess.
And in fact one of my very best friends once said something to me. This is, oh, I am an old man. This is about 30 years ago maybe. More than that, actually. He said that sufficiently complex problems need to be coded three times because the act - and you have to solve it all the way three times. Because his observation had been that when he - and this was, like, a big system, like a CAD/CAM PC board routing problem. It's like, you know, you start off, and you think you know how to solve it. So you start programming. And the act of reducing this abstract problem to individual instructions to reach a solution, that act teaches you so much about the - more than you knew in the beginning about the problem you're trying to solve, that when you're done you really need to scrap it. You just need to throw it away and start again.
And when you're starting the second time you know, you understand the problem you're trying to solve so much more clearly than you did the first time, even though you thought you understood it then. Now you do it the second time, and again you still haven't got it. You got it better. But now you're solving it sort of almost at a meta level because you really do have a grasp of it, having solved it once before already.
And then his point was the third time's the charm. I mean, and here's the problem. The real world never lets us do this. The real world has managers and marketing schedules and timelines and commitments and all this. And so it's so difficult to ever have an environment where - except as you said, Leo, as an artist, where fundamentally you don't have a commercial goal. You have an artistic sort of aim. And there you can say, okay, wait a minute, I'm just going to throw this away because I don't like it, and I'm going to start again.
Leo: You have to do that.
Leo: It's part of the process.
Leo: I've got to do, I keep wanting to do - and we've got people like Randal Schwartz and you who are just topnotch, best in breed programmers. And I just would love to do a programming show. It's such an abstract subject that I don't know how we would do it. I mean, I guess it's no more abstract than what you and I talk about every week. But I would like to do a show about programming as art. And there are people like Paul Graham. Paul Graham's fascinating on this. He wrote a book, in fact, called I think "Hackers and Painters," that compares fine artists and great programmers. It's just a fascinating subject. Anyway, maybe someday.
Leo: So there's a word in programming, "indirection," that is one of the fundamental programming concepts. And I remember very well when I first started studying C. I mean, I remember PEEK and POKE from Basic, which is essentially the same idea. But when I first started studying C, that was the hardest thing to get was indirection.
Leo: And that's what you're going to teach us.
Steve: We're going to talk about it. It is - it's something I didn't want to get into week before last when we laid out the fundamental architecture for how simple a computer is because it begins to add a little more complexity. But it existed back then on a PDP-8, on the PDP-11, the early Data General Nova machines. It's always been there. And it's a very powerful concept, but the power is the problem. So we're going to cover the whole issue of indirection and pointers.
Okay, so if we turn back two weeks to where we talked about machine language, we'll remember and recall from that that we have a bunch of memory, which is organized in words, each word containing a set of bits; and that a program counter is used to address a particular word in main memory, which it reads from memory. The bits then determine what the machine does in that step. And everything is done sort of step by step.
So, for example, in the machine we sort of - the virtual machine we designed two weeks ago, the upper four bits are opcode, and that would give us one of 16 different possible operations. And so, for example, if it was 0000, if those first, the left-most four bits were all zeroes, that might be the add instruction. And the balance of the bits in the word would be sort of, where the opcode is the verb, the balance of the bits would be the noun, that is, add what? That is to say, add the contents of a certain location, where those bits in the word would specify the address. Or we might load from a location, or store to a location, or AND the contents of the accumulator, which is sort of our scratch pad storage, with a certain location.
So once doing that, the program counter would increment to the next word in memory and do whatever that said, whatever the opcode in that word specified. And so, if you really think about it, it's a script. This program is a script of step-by-step instructions which the computer executes. And it gets a little more complicated because it's able to step out of sequence using skip and jump instructions to go somewhere else. So there's our computer.
Now, imagine a problem, as the designers of this early computer and all early computers did, where for example we have a document that's living in the computer's memory, and we want to search it for a certain word, which, you know, and we use FIND in our word processors all the time, the idea being that the computer needs to scan down through memory, through this document, to find what we've told it we want it to locate. So with this simple computer that we've got, how do we specify a succession of different addresses in memory? That is, the word contains the address we want to load, but it just contains that one address. It doesn't, like, how do we scan through memory?
Well, if we only had what we've described so far, there would be two ways to do this. You could have individual instructions, one after the other, that loaded successive locations in memory into the accumulator to see whether it had what we were looking for, that is, an instruction would be "load location 100," and then it would check to see, then it would be "load location 101" and would check to see, and "load location 102." Well, obviously that's a hugely long program because you're needing several instructions in order to check each location in memory. So that's arduous.
Now, another approach, the other approach would be something that is generally frowned on, and that is self-modifying code. That is to say, since the instruction itself is in memory, and for example it said "load location 100," then the program could change the actual data for that instruction from 100 to 101 and then load it, see if we found it. If not, then increment that location, the actual specified in the program, to 102. So the problem is that it requires that the program is modifying itself, which becomes messy pretty quickly. So what the original architects of these early machines decided is instead of the instruction, like said load 100, instead of that instruction specifying what to load, the instruction would have an option of specifying the location that contains the address of what to load.
Okay, so now - and we have to be careful, even like the way we talk about this, because it's amazing how easy it is to get tangled up. But in these early instruction sets, as I talked about it so far, we had for example a four-bit opcode, and the rest of the bits specified what the opcode acted on, what address in memory was loaded or stored or added or whatever. These early computers used one more bit, that is, so there was an opcode of four bits, and then for example another bit right next to it called the "indirection bit." And then the rest of the bits that were remaining specified the location. That is to say that the designers of these machines took one more bit for this purpose from the instruction.
So what this meant was, if it was a so-called, for example, an indirect load, if it said "load indirect 100," what that meant was the computer would get the contents of location 100 and treat the contents as the address to load the data. In other words, that the location, the contents of location 100 was a pointer to the data that should be loaded. And that is an incredibly powerful concept. That is...
Leo: It seems so simple.
Steve: Well, yes. And the reason it, I mean, it is simple, and it was even simple to do in hardware. I mean, all they had to do was they were going to load the contents of 100 anyway, so they did. They loaded the contents of location 100, for example. So the question is, do you use what you just loaded, or do you treat it as the pointer to what you want to load? And that's - so the logic in the computer was, I mean, it was inexpensive for them to implement this. And they got something so powerful as a result.
So if we return to our simple example of searching memory, all we need to do now is the program refers to location 100, but we're using the value of that as the address of the data that we're going to load. So we simply increment that location. And in fact the early machines, like the PDP-8 and the PDP-11 and even the Data General Novas that was another machine of that time, they had what was called "auto-incrementing memory," that is, auto-incrementing locations, literally a reserved block of memory, typically down at the beginning of memory. In the case of the Data General Nova it was location, I think it was 78 - I'm trying to think of which location. I don't remember now. I think it might have been octal 10 through octal 17.
Leo: It's so funny that you remember it.
Steve: So it was the eighth through the 15th locations. And the way that worked was, if ever you referred to those locations indirectly, the computer itself would increment the value for you. And what was really neat - remember we talked two weeks ago about this notion of memory which destroys its contents when you read it, like core memory, which is what was used back then - in order to read the contents of the memory, you needed to destroy what was there. Essentially you wrote all zeroes into the memory word. And the inductive pulse caused by the core's switching from one to zero is what let the computer know what had been stored there.
But in the process you wrote zeroes. So it was necessary to have a second memory cycle to write back what you had just destroyed. Ah, but in the case of auto-incrementing, you wanted to write back one greater. So what was so clever is that you sort of got this auto increment, or auto decrement, for free. That is, it sort of folded it right into the recovery of the prior contents of the core memory so that, again, very simple logic to just increment the value by one. We talked about that last week in one of our Q&A questions about how do you increment something. And it's very simple logic to do.
So now, with this very bare-bones instruction set, we're able to easily scan through as much memory as we want to. We simply say, instead of using location 100, for example, on a PDP-8 or even in the early Data General Nova, the Nova also had auto-decrement, a block of memory. And when you referred to it indirectly, the computer would decrement so that you're able to sort of scan from high memory down, as opposed to low memory up.
And so in the case of our little project here to locate something in memory, we would establish the beginning of the buffer that we want to scan. We would put its address into, say, location octal 10. Then we would say "load indirect 10." So we're not loading the contents of 10. The computer reads the contents of location 10 and increments it and puts one more than that back in 10. Then it uses the value that it read from location 10 as the address containing the data to be loaded. And so our program can be very small, very efficient. And every time it does this load indirect octal 10, it gets - what actually is loaded is a word somewhere else in memory, and it's the successively next word every time we do this. So again, very simple, tiny steps, but very powerful.
Now, what we've done is to empower the programmer with a concept which is needed for the sake of programming efficiency. But it's tricky because even today we're talking about security problems all the time that contemporary software has. And Leo, you were talking about you've done programming.
Steve: And, for example, in C, pointers are used...
Leo: It's built into the language. It's...
Steve: Well, exactly, it's intrinsic property of the language. And in fact pointers have historically caused so much problem that there are languages that boast they don't have them. Because it's like, oh, if you don't have this, you can't get yourself in trouble. And what happens is, programmers can easily make the mistake of whether they are referring to something or they're referring to where that something points to. And it's funny, I think the problem is there isn't a good analogy in life, that is, we're used to seeing something, and you reach out and grab it. And there's, you know, there's no indirection most of the time.
And so I don't think mentally we humans model something as abstract as a pointer. I mean, we understand intellectually what it is. But in the years I've been programming, I'm always having to be very careful. And programmers who have used pointers extensively know they have to be very careful to make sure that there isn't a gap between what they mean and what they tell the computer. Because the computer, as we know, is very literal. It'll do exactly what you tell it. So one of the, for example, in C or any of these pointer-based languages, you need to be able to get the address of an object as opposed to the contents of the object. And if you think about it, if you had a language, say like Basic, the Basic language, until you had, for example, PEEK and POKE, as you were referring to, Leo...
Leo: Yeah, yeah. Which is - that's indirection in a way, right, because you can...
Steve: Oh, it absolutely is. Yeah. If you just have the Basic language, where you say A equals 1, B equals 2, C equals A plus B, you cannot get yourself in trouble. I mean, there's no notion of pointing to something else. You know, A and B are variables. The language takes care of that for you. If you say C equals A plus B, then, again, the compiler is completely hiding that.
Steve: But as soon as you say A equals where B is pointing to, now, I mean, you have let the genie out of the bottle because, as a pointer, that B, where B is pointing to, it can point to anything. I mean, it could point to outer space. It can point to the operating system. It can point, I mean, to data structures inside the program. I mean, suddenly there is an awesome amount of responsibility that comes with that power. And frankly, it's one of the things that makes C, that allows people to regard C as a relatively low-level language. It was designed from the beginning to be a language close to the machine in which you could implement system-level software. You know, UNIX was written in C.
Leo: It's huge, yeah. Huge.
Steve: Yeah. And so it is a - it's an intrinsic of machine language. It's always been there. One of the variations as we evolved is the notion of what was called "index registers." You could - or "indexing," which is just another way of saying the same thing, where you could, in some of the early machines that had, for example, like the Data General Nova had four accumulators, AC0, 1, 2, and 3. And the last two accumulators, AC2 and 3, could be treated as so-called index registers, which is exactly what we're talking about. We're saying that they contain the address of the target location rather than their contents being used directly. And index registers are a component, and indirection is a component of all contemporary machines today. They come in different shapes and sizes and additional complexity. But this basic structure has been around from the beginning and is really powerful.
Leo: Indirection. Next, recursion. I mean, I tell you, pointers was hard for me, even though I'd had, as I said, had experience with PEEK and POKE. The little caret and little ampersand in C, it was like, I use what, when? But once you get it, it is so powerful. And it's really not that hard. You just did a great job in 15 minutes of explaining it.
Steve: Well, it's not hard. What I think, though, is it is mistake prone.
Leo: Ah, well, okay. Now the security issues arise, yeah.
Steve: Exactly. Because this is what we see. And in fact if you - remember, one of the things that the bad guys do is they are able to confuse data and instructions. The bad guys, when we talk about remote code execution exploits, the idea is that data in a GIF image or in a buffer that is moved across the Internet, it is not supposed to be executable. But because of the power of pointers, literally for this reason, because of the power of pointers, it is possible for data to get confused with code and for the bad guys to leverage this power to get the data that they provided to be executed as code. And at that point all bets are off.
Leo: Yeah. That's why you have this feature in Windows where you can't execute code out of the data stack.
Steve: Right, DEP, Data Execution Protection.
Steve: The idea is that there are regions of memory which a programmer can, when they're setting things up, they're able to say, okay, I intend for this buffer to be data, not executable. And that was a feature added relatively recently to, in the case of Intel, to the Intel architecture so that blocks of memory that were being allocated by the operating system could be marked as read-only, writeable, or executable. Or not, in the case of leaving this bit off. So that literally, if your program attempted to jump, that is, again, we have a program counter in today's processors, just like we did back then.
So if your program counter attempted to be set to the address of an address inside this block of memory, there are gates in the chip which check the privilege bits associated with this allocation of memory and say, oh, wait a minute, the execute bit is off on this block of memory. We cannot execute from this memory. Therefore the program counter is not allowed to fetch from that. And what that does is it pulls an exception, essentially, a violation deliberately that returns control to the operating system, saying whoops, this program just tried to do something it should not try to do. We don't know why. We don't know that it's a bad guy. It could just be a mistake in the code. But...
Leo: Or it could be intentional. Programmers do this all the time. They stick program code on the stack which is, as we now know, bad.
Steve: Yes. And in fact Windows depended upon that in the old days. Back before hardware graphics acceleration, where you were wanting to move rectangles of data around from one place to another on the screen, it was too slow if you had a general purpose chunk of code that would say move this many bits, and then is counting down the bits as you move them, and then goes back over and does another line, and so it's like that does line by line in a raster scan order. The problem was that was just too slow.
So what Windows did was, when you said I want to move a rectangle of data from one place on the screen somewhere else, it actually wrote custom code on the stack in order to do just that one operation one time, much faster than you could execute a much more general piece of code to do that. And then it would just discard it. And in fact we're going to be talking about what is a stack week after next because it's one of the next...
Leo: Oh, good. Oh, yay, yay, yay.
Steve: ...the next evolution of fundamental technology that - and actually the early machines did not have a stack. The machine we've been talking about, our little hypothetical machine, there was no stack. But the introduction of that concept was another sort of crucial, critical addition to the way computers work that it was so good, no one would do one without them these days.
Leo: Yeah, yeah. It's these little abstractions that advance computer science in big leaps. And it's wonderful when you get it because your brain and your understanding of how this stuff works advances in a big leap, too. You really feel it. You go, I get it, pointers. Or, I get it, stacks.
Steve: And the abstraction is fun.
Leo: Oh, I love it.
Steve: I mean, it's fun to, well, in fact it is, I think that's one of the hooks for people who love to program is that they just - they get off on this kind of true abstract thinking. It's just great.
Leo: So forget the hash. Steve's promising nothing. You're asking too much. Paul Scribbans in Cumbria, UK wonders how to play with machine code. He says: Steve and Leo, I've been a listener since Episode 1. I think the work you do is excellent. SpinRite owner, as well. My father tried to get me into machine code when I was about 12 years old in 1982 after we purchased our first computer, a ZX81.
Remember the Sinclair? They also bought a book, "How to Program in Machine Code." It never really took off for me, but through the proceeding years I've always wanted to give it another go, so your episode on the subject really interested me.
I do have one request, though. Would you do a bit on which assembler package or software to use? I'm currently using Windows 7 and plan on buying a MacBook Pro soon, as well. But I have no idea what software to download to enable me to play with assembler based on your teachings. Keep up the good work. Scrib.
Steve: There have been a number of people, listeners who have said, hey, you've got me interested. Where do I start? And the problem is there really isn't a, that I'm aware of, a starting-from-ground-zero tutorial.
Leo: There's a good book. Wait a minute, let me see if I can find it because I have it, I think, on my shelf. There's a number of very good books on this.
Steve: Well, there's a site.
Leo: Oh, good, all right.
Steve: There's a site for Windows called MASM32.com. And this is relatively high-level stuff. I mean, it's, again, if you want to play with assembly code in Windows, I know of no better resource. There's lots of links, lots of samples. They have a complete SDK, as it's called, a Software Development Kit, that provides all the pieces that you need, linkers and help files and header files and include files that provide the definitions and all sorts of stuff. None of this existed back when I was doing assembler in 16-bit code, but it has come together now for the 32-bit world. So MASM32.com for people who are interested in beginning to poke around. I mean, again, it's not what I would create if I were going to create a from-the-beginning tutorial. But I haven't had a chance to do that yet.
Leo: I'm going to try to find this book. It's a thick book, I think. It comes with MASM in the back. And it's all about assembly language. And it - yeah. And there used to be a wonderful assembler on the Macintosh that Apple made. I don't know if they still do it anymore. But I do believe they offer free, you know, they have all the free developer tools, and I believe they offer...
Steve: The whole Xcode system...
Leo: Xcode; right.
Steve: ...is on the disk.
Leo: And it's probably got an assembler. I'm sure it has an assembler.
Steve: Oh, absolutely it does.
Leo: Yeah. So you're set on the Mac, as well. It's the education that's hard.
Steve: Well, I haven't ever talked yet about my long-term plan legacy project. I will tell everyone about that when we're through talking about computers.
Leo: Now you have intrigued me, Mr. Gibson. Question 3 from Eric Stearns in Denver, Colorado. He wants to know how a computer starts up from the very beginning: Steve and Leo, I've enjoyed the discussion so far on the basics of how a computer works. After the first two podcasts I feel I largely understand the basics of how a computer works once it is operating. But one point that's never been clear to me is how a computer starts to work. So we have this dumb box of rocks that's powered off. Now we provide it with a flow of electrons. Well, maybe not a flow, but at least a voltage that can provide a flow. But how exactly does this lead to the ability to do something useful like executing some very basic instructions to eventually be able to run a program? How does it start?
Steve: Yeah, it was a neat idea, I mean, a neat question, which we never talked about.
Steve: And this is something which has evolved over time. The very early machines, the mini computers, my favorite the PDP-8 or the 11, those early machines which used core memory, because the memory itself was magnetic and could and would persist across a power-off and power-on, it was often the case that the computer still had the program in it when you turned it back on. I mean, it's magnetic in the same sense that a hard drive is magnetic. And of course we know that the intention of hard drives, they don't always succeed, but the intention is that they will retain their data, even when they've been powered off.
So mini computers would typically - you'd turn it on, and it would come up just sort of sitting there saying, okay, I'm beginning to burn up power and making whirring sounds. What do you want me to do? And so the operator would put the starting address of the program, like timesharing basic or something, if you had a bunch of teletypes in a classroom of kids, you'd put the starting address in, load that into the program counter and press Run, and it would just pick up sort of where it left off.
Then some machines of that era, specifically, like, doing process control work or real-time stuff, there was something called Power Fail Restart, where there was typically an option that you could get such that, if Power Fail Restart was installed, then once the computer's power levels came up and stabilized, this option would simply jam a starting address into the program counter and sort of electronically do the equivalent of pressing the start button. So it would just - it would just start itself up.
And in fact you might have a power fail occur while it was running, in which case this power fail sensing auto restart, when it would see the power beginning to fail, it would interrupt what the computer was doing. And we're going to be talking about interrupts here before long because that's a huge issue, and it's the problem that Toyota is having at the moment. And it would interrupt what the computer was doing and save the state of the computer, that is, where it was, what was in the accumulator, and the carry bit, so that later when power was restored the state of the computer could be restored to exactly where it was. So these things would be stored in core, and then the computer would stop itself before the power levels had dropped to a level that it was no longer reliable.
And then similarly, when the power came back on and what was stored in memory would tell this power fail option that the computer had been running when power was interrupted, and so read from core memory these things like the original content of the accumulator, the original content of the carry flag at the time of this interruption, put those back, and then resume execution from literally the instruction immediately following the last one that had been executed before the power failed. And so it would be just as if nothing happened.
So now we move forward decades to our current machines. And
our current machines all have some code in them separate from RAM, that is, they have ROM, and that's traditionally called a BIOS, which is - BIOS is an acronym, Basic I/O System. And the startup code for the machine is in the BIOS. So to directly answer Eric's question about contemporary machines, how our computers that we're using now start up, it's as simple as the chip, the actual processor chip is designed so that when it comes out of reset, when reset is over - and, now, reset is the state the machine is in when it's powered up, or if your computer still has a reset button on the front, that's a physical hardware reset button that typically pulls a wire down, pulls the voltage on a wire down that's going into the chip that just stops everything.
And essentially, when you remove your finger from the button or when power levels have come up and stabilized, the processor is hardwired in its hardware to start executing code at a certain location. We've talked about how the program counter steps through locations in memory, reading instructions one at a time. So it's simply a matter of the chip always starting from, it might be zero, it might be FFFF, you know, like all ones. It'll be some predefined hardware location. And at that location is a little bit of ROM, a little bit of Read-Only Memory. That is, the rest of the machine's memory, RAM, will just be random. It'll be not necessarily zeroes.
It's interesting, too, when you power up RAM, it generally comes up in some noisy state, but not all necessarily zero. One of the first things that the BIOS normally does is go and clear out the RAM in order to start it. And it also begins refreshing the RAM in order to keep its contents alive. So there's housekeeping being done. But literally it's just a matter of the processor always going to a predefined hardware, defined in its hardware, starting location, where there'll be a little bit of memory, a little bit of ROM that will be the first instruction that executes, and the second, and the third, and the fourth, and the rest is Windows or Mac or whatever.
Leo: (Laughing) And the rest is Windows.
Steve: And the rest is Windows.
Leo: Question 4 from PDP-10 Programmer - hmm, I like that - regarding - is the PDP-8 different than the PDP-10? Or is it pretty much...
Steve: Oh, yeah. They had different - the PDP-10 was a 36-bit, really long word machine, a beautiful, beautiful computer. But much bigger. I mean, it was like one of those raised-floor, forced-air cooling sort of machines, a beautiful machine.
Leo: You wouldn't have it blinking behind you, with the blinking lights.
Steve: No. Or if you did, you wouldn't be able to hear me talking.
Leo: PDP-10 programmer writes: Steve and Leo, with respect to the Lower Marion School District (LMSD) spying case, I heard you mention on the podcast that kids in the district had become so suspicious that they started putting post-its over the camera. I also read on some Mac-related web forums posts by the IT guy at LMSD regarding this. He said if Macs came to the IT with post-it notes over the camera, the IT guy should just put a pinhole in the post-it, but leave it in place so the student thinks the camera is still covered. With info coming out like this, seems they've got these guys dead to rights. They were spying on the students. Boy, that is shocking.
Steve: Yeah. We have another interesting question a little bit later on because this news really stirred up our listeners.
Leo: Rightly so.
Steve: And lots of people had some questions, but more about the technology. This is more about the ethics of it. I just - I wanted to put this in the podcast because, I mean, assuming that it's true, it really does seem wrong. You can argue that, well, that the camera technology was there to be used by the district to recover stolen and lost laptops. The problem of course was that there's evidence that is conclusive that non-stolen, non-lost laptops were having their cameras turned on. And whoever was doing the looking was feeding information back to the assistant principal, who was then confronting the kids who were registered to this laptop about their behavior at home.
So given that that's the case, the notion that the IT people were being told to poke a little hole in the post-it note so that the camera would be reenabled - I'm not really sure how well that would work, by the way. I mean, it sounds a little apocryphal to me. But given that that was what they were doing, that seems extra annoying and slimy. If you were to defend the whole practice, that is, if notification had been provided, if students and family knew that this was a feature of the laptop, certainly they would have reason to believe that no one was going to be spying on them unless the laptop was lost or stolen. But at the same time, then I guess you could argue, and you could imagine the school district's attorney arguing, that, well, this is a security feature which we need as part of our laptop loaning program. So cutting a hole in a post-it note is necessary and justified in order to keep the security feature functioning.
Leo: Oh, please.
Steve: I know. I'm not defending it, I'm just saying this is, you know, you can imagine what the LMSD attorney would be doing. But still...
Leo: Uh, Your Honor - it's just pretty hard to explain.
Steve: It's bad.
Leo: Yeah, it's not good. All right, get ready for a long one about assembly language, once again. Jeff in Washington, DC wonders about assembly language, CryptoLink, and Intel Macintoshes; and, as long as we're throwing everything in, how the computer works series. Steve, first I'd like to thank you and Leo for a terrific podcast. I thoroughly enjoy listening each and every week. The information covered and so elegantly presented is truly a gift to the world of computing.
He says he's found the "How a Computer Works" series very informative. While reviewing the "Machine Language" module - oh, they're modules now, not episodes - I discovered I had a few questions. One, a general question. I'm wondering how well assembly language lends itself to multiple platforms and OSes - Linux, Mac, OS X, and Windows - as well as various architectures, 32-bit versus 64-bit. Does assembler compare with other languages like C, C++, et cetera, that have odd nuances when deploying on different OSes? That is, does it require heavy modification, depending on the OS in which it's used? C code on a Unix system doesn't really mirror a Windows implementation of the same C code. Does a change from a 32-bit to 64-bit architecture generally require major changes to the assembly source code? Why don't you answer - you want to answer this one first, and then I'll give you the next part?
Steve: Yeah, I think I'll probably end up covering his specific example as part of this.
Leo: Okay, well, I'll give you his specific example, then. Knowing that you write the majority of your programs, including the upcoming CryptoLink, in assembler, and most of your applications are ultimately delivered in an EXE file format, I'm wondering if your ASM-developed applications easily lend themselves to other non-Windows platforms. For example, will CryptoLink be available on the aforementioned various platforms - Mac, Linux, 64- and 32-bit Windows? I really hope that will be the case. If not, will it run well under an emulator like Wine, like your DNS Benchmark does? Keep up the phenomenal work. Jeff.
Steve: So we could shorten this by sort of asking, what's the relative portability of assembly code compared to higher level code? And the answer is they're pretty much equally portable and not. That is, C code is portable from one processor architecture to another. That is to say, it was designed so that the same code could be compiled to different machine language where the compiler would be doing the translation between the C code and the specific hardware architecture, how many registers the chip has and so forth.
But C code is not portable across operating system families, that is, for example, Linux, Mac, Windows, because the operating systems provide radically different sets of services. That is, the operating systems - and we're going to talk about in detail in the future what is an operating system as we continue moving up the abstraction hierarchy from diodes and resistors where we began all the way to a working system. But essentially operating systems provide an abstraction to the software running in or on the operating system of the outside world. It's the operating system provides an abstraction for the I/O and for memory and for storage and for the passage of time.
Sort of the application itself in a modern operating system does not actually connect or contact the hardware. Rather, the operating system publishes services which the program takes advantage of. For example, the program says, hi there, operating system. I need a block of memory for my own purposes. Can I have one, please? And the OS looks at its pool of available memory, which is shared among all the applications, and has a chunk that's free, and provides a descriptor back to the program saying, oh, yeah, here's a chunk of memory. Let me know when you're done with it because we want to put it back into the common pool once you're through.
And so that's the case regardless of what language you're programming in. That is, for example, I'm talking to Windows, "I" meaning Steve who writes in assembly language. My assembly language code is using the same services that someone writing in C uses. So I'm using the same operating system services as someone in C. But if we took either me and my assembly language or a program in C over to a Mac, while you could recompile the C program under Mac's assembler to create machine language that would technically run on the Mac, that program would be lost. It would be trying to - it would be expecting Windows-oriented services when none are available because the Mac has an entirely different view, it presents a completely different abstraction of the outside world to the programs that run on it.
So relative to CryptoLink, for example, I'm already aware that I will be under immediate pressure to make this thing run on platforms other than Windows. And this has been discussed already in the GRC newsgroups. Because I really do want to serve as large a market and offer this solution to as many people as possible, my plans are already to carefully modularize my code so that the bulk of CryptoLink will be portable to different platforms, which they will all have to be Intel chipsets because this is largely going to be assembly language.
But I could still deliberately, for example, keep the code that deals with the LAN adapter, that deals with the kernel driver. I'll have a very different one on a Mac and a very different on a Unix or Linux system. But if I'm careful, rather than just mixing all of that code together in a big stew, if I'm knowing that I'm going to be moving this thing other places, then it behooves me to deliberately keep some modules to this so that, for example, when I do want to port it to the Mac, most of my code was carefully written so it doesn't care where it is. But then it then asks a network module to do the network magic on that particular platform. So there would be some things that are OS-specific. But they would be well encapsulated from the beginning.
So it's possible even for assembly language to be as portable as C, as long as you're not changing processor platforms. C allows you to jump from one processor, you know, PowerPC to Intel to any other type of microprocessor relatively easily because you're abstracting the architecture of the hardware in C, whereas you're absolutely not doing that in assembly language. The language is the language of the hardware, exactly as we've been discussing.
Leo: Yeah, that was the point of C was to create a non-processor-specific portable language. Back in the days when you didn't have GUIs, you didn't have really a lot of interfacing with the operating system.
Steve: Yeah, and Unix was written on a PDP-11.
Steve: The PDP-11 was the first Unix machine.
Leo: So they, I mean, they provided libraries for low-level file access, things like that, so you didn't have to see anything that - but you can do that with assembler, too. As you said, encapsulate it.
Leo: Just separate it out.
Steve: Just plan ahead.
Leo: Yeah. So that's a revelation to me. I had no idea that you were planning portability because you've never done that before. You've only written for Windows.
Steve: And I'm already getting heat...
Leo: Or DOS.
Steve: ...for it about SpinRite. So I'm not going to make that same mistake.
Leo: Yeah, wow, that's very interesting. Well, SpinRite relies on, what, N13? A BIOS call; right? So you can't really do that on a Mac.
Steve: Yeah. The Mac has the EFI BIOS. And so it just, you know, it would be neat at some point for me to move SpinRite over. I know that there are a lot of people who would like it. It's just right now it's down on the queue a little ways, yeah.
Leo: But CryptoLink, because it is not relying on the OS so much, or the BIOS so much, you could do - that's portable, you can make that portable. It's the UI that wouldn't be portable.
Steve: Well, I mean, frankly, SpinRite could talk to the Intel hardware just as easily. It's just because of its history it doesn't. But were I to do it today, I would be talking to the hardware directly.
Steve: And get a lot more power and flexibility.
Leo: Not getting better. So, Steve, we're ready for more in our Computer 101 series.
Steve: Yeah, we've done an hour of security news and updates and stuff, so let's talk about how computers work. This is the fourth installment. And what people will eventually recognize is that I've been very carefully and deliberately laying a foundation which I believe will end up in everyone really getting how computers function in a fundamental way, that is, demystifying this really by showing that, in my opinion, there's not that much to it. I mean, the complexity comes from refinement over time, not from the fundamentals.
So in the first installment we talked - we went to the very beginning and talked about resistors and transistors and how they could be connected to create simple logic gates and inverters, and how, for example, when you connect two inverters to each other in sort of a ring, that's an inherently stable device called a flip-flop which forms a bit of a register which is able to be loaded and remember a binary value, one or zero. And you concatenate a bunch of those in order to be able to store more complex, larger values.
We then, in the second installment, looked at machine language, the idea that machine language is nothing other than some form of storage which is addressable, where the address is stored in a program counter, which is just the address of the current instruction we're executing, and that the instruction itself is a collection of bits. We read a word out of memory, and the individual bits have specific assignments. Like, for example, in the instances I've been using, the top four might be the opcode, the operation code that says what this instruction instructs the computer to do. For example, add the contents of another location in memory, specified by the rest of the bits in the instruction, to an accumulator, called the "accumulator" because it accumulates things. Or store the value of the accumulator into a location in memory, or add or subtract, you know, whatever. So it's just that simple.
And then in our third installment we looked at indirection, which is a very powerful concept. We looked at the power of pointers because in the machine language, at the machine language level, we might sacrifice one of those bits, one of the precious bits in our machine language word, to be an indirection flag. That is to say, if this bit is off, then the address specified by the rest of the bits is the address of the data that we want to load or store or add. But if that indirection bit is a one, then that says that that location contains the address of the location to work from. That is, it's indirect. It's not a direct access, an indirect access. And the other way of saying that is that that location is a pointer to the actual data. So two weeks ago we looked at what that means to have pointers.
So this week I want to introduce the next major advance in the notion of how sort of the fundamentals of how computers work at the hardware level with the introduction of the concept of a stack which uses this notion of indirection in order for it to function. We need to back up a little bit, though, and talk about multiple register machines because that sort of factors into the value of the stack. So far we've talked about a simple machine which had a single accumulator. And many of the early machines did because the point I was early making was that all of this hardware was so expensive in the beginning. Transistors and resistors and the space they required, the power they consumed, the problem of interconnecting them all meant that the designers had to be extremely lean with the design of their systems. So a register was a luxury to have.
At the same time, it turns out that it's really convenient for a programmer to have more registers, that is, more than one accumulator. Now, remember, though, that we don't get these things for free. That is to say that bits in the instruction word, if we're going to have more than one register, bits have to be consumed in order to specify which register. So the advantage of having one was that you didn't need to specify which one because there was only one. So it was implied, there was this implication that if you're loading, well, there's only one place to load it into, and that's the accumulator. And if you're storing, there's only one place to store it from, which is the accumulator.
So when machines began to have multiple accumulators, which was extremely handy to programmers, some bits of that instruction word needed to be dedicated to specifying which accumulator. So, for example, one of them, one of the early machines at the time was the family of machines from a company called Data General that was an early competitor to DEC, Digital Equipment Corporation, called the Nova minicomputers. The Nova was also one that I programmed back in the day when computers had control panels, you know, front panels with lights and switches. And it had four accumulators, four accumulator registers, which were designated AC0 through AC3.
And so the instructions in the Nova instruction set had two bits to specify which one of the four accumulators the instruction would act on. So there was load accumulator and store accumulator. But now you had four accumulators, not just one. So two bits in the instruction needed to be sort of taken away in order, well, needed to be dedicated to that purpose for specifying which accumulator you meant. Typically that meant that you were taking bits from the rest of the word, like from the addressing range. So you were able then to address less memory with the instruction, but that was made up for by the fact that having multiple registers was so convenient.
So the way the early machines handled subroutines was originally very awkward. The notion of a subroutine, probably our listeners are familiar with the concept, is that you would have some piece of code located somewhere in the computer which performs some useful function. For example, the original PDP-8 didn't have a multiply instruction. You could "add," and you could "and," and you could "shift," but there was no multiply.
So the multiplication process, which is very useful to a computer, you had to simulate a multiplication out of the much smaller instructions. So typically what a programmer would do, and in fact I did this on the code that I wrote for my little PDP-8 front panel toys, I wrote a multiply because the PDP-8 chip that I was using had no multiply built in, but I needed one. So I created a multiply subroutine where it was a series of the available instructions, which had the effect of performing multiplication. Well, if I was only doing that one place in my program, then the program would just - it would execute into this multiplication operation and then continue.
But say that I wanted to be able to multiply from various locations in the program. Well, were it not for having this concept of a subroutine, I would have to duplicate that multiplication code everywhere I wanted to perform a multiplication. But the cool thing about this notion of a subroutine is you could have one instance of it in memory, and you could jump to it from all the places in your program where you want to perform a multiply.
The problem is getting back because, if you're going to jump to this routine from various locations in memory, different locations in your program, then once the multiply routine is done executing, it needs to know where to go back to. You can't hardwire that back to one location because you're jumping to it from different locations, and you want to return to the instruction right after you have jumped to the subroutine so that you can continue executing where you were.
Leo: Too bad there's no way to kind of store that place that we came from, just temporarily keep it somewhere.
Steve: Well, what the original designers of the PDP-8 did was they kind of came up with what we would now regard as a real kludge. They stored the location of the instruction after the subroutine call in the first word of the subroutine and then started executing from the second word of the subroutine. That meant that the subroutine always sort of knew who had jumped to it, where it had been jumped to from, because that address was in its first word. So there's a perfect example of where indirection comes in. It would, when it was all through doing its work, or in this case this multiply, it would jump indirectly through the first word of the subroutine. Since that first word contained the address that it had come from, that took it back to the place, the instruction underneath the one that had called it. So it was very, I mean, it was elegant from that standpoint.
But it turns out that this approach has a problem. And that is, if the subroutine were doing something more complicated, like calling other code somewhere else, what if that code needed a multiply? Well, then the other code would call the multiply, and its return address, the location from which it called, would overwrite the first call of the multiply. In other words, well, the way to say this is that this was not recursive. You could not nest these calls.
Leo: Yeah, because you'd overwrite the first call when you did the second call.
Leo: You could never get back to the beginning.
Steve: Exactly. So programmers, if they knew that their code needed to be recursive, that is, if there was a chance that the work they were doing might cause, through any means, might cause them to be executed again, before they returned they would have to go to all kinds of extra work, for example, save that return address at the beginning of the subroutine somewhere else so that, if they got called again, it would be safe against replacement.
Well, what eventually was developed is a brilliant concept called a "stack." A stack is a complete abstraction, that is, we can think of it, and people have, in describing how it worked, have talked about, for example, a stack of plates in a cafeteria. There's a buffering acronym that people may have heard, "last in, first out," or "first in, first out," or "first in, last out" or various combinations of that. What it really is, the way to think of it in concrete terms is there's a register called the "stack pointer."
So remember we have the program counter, which is a register that points to the current instruction that we're executing. Imagine now that this computer has another register called a stack pointer which is, at the beginning of the program, say that it just points at the very end of memory, like way, way up high at the end of the computer's memory, where nothing is going on. And so essentially, we'll call that all ones, you know, 111111111 is the highest value of the computer's storage. So that value is put into this stack pointer register.
Now, a few instructions are used. For example, in common terminology they would be called "push" and "pop," where the conceptual notion is you push something onto the stack, or you pop something from the stack. What pushing on the stack does is it stores the value of what you're pushing in the location pointed to by the stack pointer, and then subtracts one from the stack pointer. So again, we're using this powerful concept of indirection so that the stack pointer - it's called a stack pointer because it's pointing at a location in memory.
So say that we had something in an accumulator that we just needed to put somewhere safe for a while. We do what's called "pushing it on the stack," which merely says we ask the computer's hardware, the processor hardware, to sort of deal with it. Here, take this. And...
Leo: I like that. Deal with it.
Steve: Deal with it, exactly. So the idea is, the stack pointer, which is pointing at the end of memory in the beginning, we say, here, store this on the stack. Push this on the stack. So that value is copied to where the stack pointer is pointing, at the very last word of memory. And then the stack pointer is decremented by one. Well, that decrementing by one is this concept, the concept of pushing it on the stack because, notice if I say, oh, now deal with this, something new has come along, deal with this. So, again, we push this new thing on the stack. Same stack, but this time the stack pointer that was decremented by one last time, well, it's now pointing to the second-to-the-last word of memory. So we store this second thing that came along in the second-to-the-last word of memory, and so on.
So we're able to say, deal with it, deal with it, deal with it. And we don't really care where these things are being stored because that housekeeping is being done by the hardware. Eventually, though, we say, hey, I need that back. And so that's called "popping the value" from the stack. Popping the value does the reverse. It copies the contents of where the stack pointer is pointing back to wherever we're telling it to go, like we want to load that, we want to - if we're popping the stack into a register, it copies where the stack pointer is pointing, back to our register.
Well, I stepped a little bit on this. I'm sorry. Since we store something, in order to push the stack, we store something and then increment the pointer. If we pop the stack, we first decrement the pointer. So it's now pointing at the top of the stack, and then we copy where that is pointing back to our register. That sequence is important because it's got to be at the exact reverse to pop as what we do when we push. But the advantage is that, again, we told the computer to accept some data in a certain sequence of pushes. When we tell it we want that data back through a series of pops, it returns them in reverse order because the stack is being popped, and the pointer is incrementing back towards what's called the bottom of the - back toward the bottom of the stack, back to the beginning. So we get this data in reverse sequence.
So, okay. There's sort of the fundamentals. Well, now, look at this in terms of subroutines. Because if the computer is set up, as all computers today are, to store that return address on the stack, then this problem of recursion completely goes away. If our jump-to subroutine instruction, instead of storing the address of the instruction which follows it, and as the PDP-8 did, at the beginning of the subroutine, it simply pushes it on the stack. That's the phrase, "pushes it on the stack." Then we don't have to worry about it. The return address has been "stacked," as again is the way it's referred to. We execute the subroutine.
And the last instruction of the subroutine is called a "return," a "return from subroutine" instruction. And it does the reverse, essentially, of this jump-to subroutine. It removes the last thing that was put on the stack and uses that as the address to return to. So the value of this, the reason this is recursive, is that if this subroutine called some other subroutine and executed some other code, which came back and called the original subroutine, that is, it would stack that address also, then now two return addresses are on the stack, one for each time the subroutine was called. And in fact other subroutines could be called in any order, the only requirement being that they are returned from in the reverse order that they're called, which is normally the way a programmer would design things. You always - you finish up what you're doing, and then you return from the subroutine.
So the beauty of this is that the series of return instructions properly remove the addresses from the stack in the reverse order that they were stored, which is exactly what you want in order to sort of unwind yourself and get back to where you came from. So that technology is now intrinsic to the design of our computers. One of the other things that the stack is useful for is saving the state of the computer. Say that this multiply subroutine needs to use the registers, which were probably being used by the person that called the subroutine. Well, it can't mess them up. It needs to leave them alone.
So one of the things that this multiply subroutine could do is save those registers which it will be altering, but which it wants to be able to restore, on the stack. So at the beginning of the subroutine it'll say, push register zero, push register one, push register two, which stores their contents on the stack. The subroutine doesn't care where the stack is, what the stack pointer is currently pointing to. It just sort of said "deal with it" to the computer. Then it's able to use those registers any way it wants to as scratchpads for its own work. And once it's done, it needs to put things back the way they were so that the routine which called the subroutine doesn't have things messed up.
So it says pop register two, pop register one, pop register zero, remember, in the reverse order that it pushed them. And so, successively, the values that were stored are restored into these registers. Then it says return from the subroutine, and the stack will be pointing at that point at the proper return address, so long as the same number of registers were popped as were pushed because, again, you need to keep these things balanced, which is some of the problems programmers get into if they're working at a low level and are not careful. You do have to be careful because with this power comes, not surprisingly, responsibility.
Leo: A compiler will do it automatically. It's only if you're managing the stack yourself.
Steve: Well, yes. Except that many of the security flaws that we have talked about are a consequence of stack overflow.
Steve: We've talked about that often. So it is possible, in a powerful language like C, to allocate buffers. One of the other uses of the stack, and this is where this directly impinges on security, notice that we've been talking about pushing and popping things. Well, say that a subroutine needed some, like a buffer, it needed 256 bytes of buffer space for its own purposes, just sort of like to use for a while, while it's doing stuff. One of the ways a programmer could get buffer space is by manually subtracting, like, 256 bytes from the stack pointer.
So look what that does. Essentially, it's almost like - it's like if you had said push push push push push push push, 256 times, you would have moved the stack pointer down in memory, coming down from the top. And you end up with this block of memory. So a programmer is able to say - is able to subtract a value from the stack pointer and use that area which is above the stack pointer as their own scratchpad because, if other things, for example, if this subroutine then pushed other values or called other subroutines, those would all be down lower in the stack. So the act of moving the stack pointer down creates a buffer. Sort of it's a nice, very fast and efficient way of allocating some memory that could be used.
The problem is, look what happens if you were to ever write too much data in that buffer. Well, that would overwrite values on the stack which were above the buffer that you allocated. So that if you then popped values, like the register values off the stack, and even more significantly, if you did a subroutine return - remember the subroutine return's address is stored on the stack. So if a bad guy were able to cause a buffer overrun and write over any of the stack, they could put their own return address of the malicious code into the computer, and it would jump there. So that instead of returning where it was supposed to go from the subroutine, it would go into the malware. And that's exactly how much of this malware gains control over the system, just like that, by taking advantage of the stack. Powerful as it is, it does need to be treated with extreme care because it controls the execution path of the processor.
Leo: If we turn - that's one of the reasons to turn on Data Execution Prevention; right? Because if you can't execute code in the data area, doesn't that prevent that kind of attack?
Steve: Yes, exactly. One of the...
Leo: Is the stack considered - the stack's considered data area.
Steve: Yeah. Now, it is the case that some programs will deliberately put executable code on the stack, or they will allocate other data areas and, for example, load some of themselves, like they may be a modular program where only part of the program is loaded in memory at a time. And so at some point it says, oh, I need to go get another chunk of myself. So they'll load themselves another piece of themselves off the hard drive into memory, which technically is data memory, and then execute that. There's nothing preventing that, except it's regarded as poor programming to do that. And so it's the kind of thing that Data Execution Prevention would flag as an exception, which is why some benign software will not run with DEP enabled because it'll cause that problem.
Leo: Good to know. Good to know. And some benign software wants to use the stack as a code area because it gives them some flexibility.
Leo: It's a kind of programmer's trick.
Leo: Push code on the stack.
Steve: It's very, very handy and very efficient.
Leo: And very bad. And stop doing it. Well, now, that's stacks and registers. Do you want to get into recursion today, too? So I guess in a way we talked about recursion because recursion required the stack.
Steve: Yes. I'm a little self-conscious about having made everyone's eyes cross here with this stuff.
Leo: No, that was great.
Steve: But it's a crucial concept because what we're going to do in two weeks is discuss how computers are able to do everything at once. That is, one thing we've never talked about, and it's...
Leo: Selecting, yeah.
Steve: ...crucial, is interrupts, this notion of interrupts and the way that sounds can play, the mouse can move, the pointer can move, the hard drive is reading and writing, I mean, there's so much happening at once. And this concept of being able to interrupt code, which we're going to cover in two weeks, absolutely requires the stack in order for no matter what's going on to have its state saved. That is, we talked about one instance, for example, where the subroutine, in my example of multiply, needed to use some of the registers, that there aren't an infinite number of registers. So if it was going to be modifying them, it needed to be able to save and restore them. Well, so does a system where anything could happen at any time.
So I wanted to explain, both from a security standpoint, but also sort of from this fundamental architecture, how vital this concept of a stack is, how cool it is that you're able just to say stack this, stack this, stack this, stack this, and not have any specific concern with where exactly it's been stored.
One of the reasons this is so efficient, remember we talked about the notion of a register being implicit in an instruction, when we're not having to specify the stack pointer because that's implied by the push and pop instructions. They know how to do the work. We just sort of say, we say to the computer, take the data, I don't care where you put it, just as long as you can give it back to me when I tell you that I want it. And the key is that it's this notion of it being in the - we get it in the reverse direction than we gave it allows all of this power and essentially allows this recursion because, as long as we sort of unwind our self in the reverse direction that we wound up, we end up right back where we're supposed to be. So it works.
Leo: That's very cool. Love it. I love talking about the deep roots of computer science. From a guy who was there at the beginning, frankly. Pretty close.
Steve: Well, there are just a couple more key concepts. And we will have really pretty much everything we need in order to understand where we are today. Because I want to quickly, probably the episode after we talk about interrupts, talk about today's microprocessors, the whole CISC versus RISC controversy, the issue of how many registers is enough, how Intel got itself tangled up as power continued to be consumed with the design of its processors, and the insane things that designers have done in order to squeeze every last bit of processing power out of these chips. Now that people, now that our listeners have a sense for exactly how these things run, there's enough understanding to understand very cool technology like out-of-order execution and superscalar and pipelining and the stuff that we've all got in today's processors, not just the way PDP-8s used to work.
Leo: All right, moving on to Question 4. G. Wade Johnson in Houston, Texas comments about write-only programming languages: Steve, I've been enjoying Security Now! for a while. Particularly I like the way you carefully explain concepts that may be new to your listeners. I've been a professional programmer for a number of years in several languages and find that many of your programming comments really match my experience.
However, I recently listened to Episode 236, where you made a comment that I had to disagree with. After talking about programming in assembler, C, and Perl, you stated that Forth really was a write-only language, meaning that it was easy to write, but difficult or impossible later to read. Many moons ago I was actually a professional Forth programmer. For eight years I helped develop and maintain a very large codebase in Forth. It was no more unreadable than code I've maintained in C, C++, Java, or Perl, before or since that time. While it is possible to write unreadable code in Forth, that's true about any programming language. Given some good design skills, you can write truly elegant code in Forth - that's kind of been my experience, too - much like it's possible to write elegant assembler, I would bet.
Throughout my career I have regularly heard people claim languages that I've used were write-only - Perl, C, C++, Lisp, et cetera. In every case I've also seen really clean and readable code in each of those languages. I suspect most of the problem has to do with the commenter not being familiar with the idioms of the language, rather than a failing of the language itself. Sorry for the rant. Keep up the good work. I really am enjoying the "computer from the ground-up" series. It takes me back.
Now, Forth, the problem with Forth is it doesn't have much of an idiom. The idiom is created as you create the dictionary. But if you do it sensibly, it looks like English.
Steve: I guess I'll temper my comment. And I thought a lot about why I'm feeling that where he's not. And I guess so what I'll say is that I do think it's fair to say that different languages encourage and facilitate different degrees of, I'll call it transparency. For example, my experience with Pascal has been that I can come back long after I wrote some Pascal, and it just seems so clear to me what I did.
Leo: Well, yeah, but Niklaus Wirth designed that as a teaching language. So it was intended that way.
Steve: Which is my point.
Steve: Is that it succeeds in just - in having - in being transparent. And when I'm thinking about trying to understand some Forth code that someone else wrote, the problem for me is that there is, well, the problem is the elegance and the power of Forth is its use of the stack. So we ought to talk a little bit about Forth and the stack because that was the context in which I mentioned Forth last week also was when we were talking about the whole concept of a stack. In Forth, you don't have arithmetic expressions that are algebraic the way you often have in other languages, where you say A equals five plus four. Instead you push five on the stack, and then you push four on the stack, and then you do a plus sign, which adds the top two things on the stack, leaving the sum of them on the stack. And it's very powerful in the way that, like, the early HP calculators, which used RPN, Reverse Polish Notation, which were also stack-based calculators. They were very powerful.
But for me, you can't look at Forth code and see what's going on. You have to follow along with Forth code because there's this hidden state, that is, the current state, the current contents of Forth's execution stack completely affects the result of the verbs which you are using to apply against the stack. So, I mean, I respect a professional Forth programmer, and I will take our listener's word for the fact that, if you embed yourself in Forth long enough, you can read it. I find it difficult to read, but I have not spent any great length of time programming in Forth. I think it's interesting, and I learned it and used it for a while.
And as you say, Leo, the nature of, at a much higher level, viewing Forth at a much higher level, the way you create - you essentially sort of build your own little environment by creating a meta language with your own verbs out of the lower level intrinsic language and verbs in Forth. And it's really neat. I mean, it's an interesting development environment, unlike anything else. And by the way, if any listeners are curious, there's all kinds of free Forths around for all the different platforms. I mean, it's an interesting enough language that it's been implemented many times on all kinds of processors. So it's...
Leo: It's also very spare. It's small.
Steve: Yes. And in fact that's one of the reasons - in fact, is it Sun? Somebody uses Forth to boot their machines.
Leo: Oh, really.
Steve: Yeah. Forth is used as, like, the BIOS language.
Leo: Yeah. It's a great embedded language. You can have a Forth interpreter in a few K, maybe even less. And then of course everything that Forth does is in a dictionary. So it can be very, very small. It was written for telescopes. It was written to control telescopes by Charles Moore.
Leo: I interviewed him when we were at Tech TV. And he was stunned that I even knew who he was or what Forth was. It was - poor guy, I mean, he really created an amazing thing, which is still I think used in robotics a little bit. I'm sure it must be. Wonderful language.
Steve: Yeah. It has not died. And as you say, if someone were to create a new chip, and with a random instruction set, and needed to quickly get something up and going, one of the quickest ways to bootstrap a system is to write a small Forth interpreter and then start writing Forth code. Especially if you've got a body of Forth code you want to immediately port over. You can get it up and running on an arbitrary architecture very quickly. But still, I would - my feeling is it's the fact that you have to follow along with the code to track in your mind what's currently on the stack, and that that's a completely opaque thing. You can't see that in the language. You have to execute the language in your mind in order to see what's on the stack, to know what's going on.
To me that's very different than looking at a language like Pascal or even like C, where there isn't anything hidden based on, I mean, I guess the contents of variables would be in a sense hidden. But to me you're seeing it - I guess it's just that I'm not used to Forth. But you see what I mean, that there is this state of the Forth machine which the language affects. And you have to know what that is in order to be able to read the code. So I think it's a little different.
Leo: Question 6, as we move along through the list. Luke in Boston, Massachusetts asks: Why a language virtual machine? Steve, I've really enjoyed your current "how to build a computer" episodes. Your detailed discussion of how computers work in hardware has made me wonder what the story is with virtual machines, in particular the virtual machines that show up in various language implementations. I understand how a full system VM like VMware works, and what the value is. But I'm not sure what a language virtual machine, like a JVM, the Java Virtual Machine, or Google's V8, or Parrot, which is the new platform for Perl 3, is, or why - I'm sorry, Perl 6 - or why a language designer would want one.
It's all about cross-platform portability. Is it all about cross-platform portability, or are there other benefits? Are language VMs just software simulations of a particular existing chip, or at least simulations of a chip that could exist? This is - I love this question. Or if not, and the instruction set has surpassed what could be done with the transistor logic you described in Episode 233, why are they virtual machines instead of just programs that read other programs? Thanks for making such a great show. Luke. Well, you mentioned Pascal. Pascal was originally a P machine.
Steve: Yes, p-code was what the compiler produced. It is the case that a language virtual machine is different, as he says, like for example from a VMware system where we're virtualizing an operating system, so to sort of create a containment or a duplicate of the operating system. In a language virtual machine the virtual machine is designed for any of a number of different purposes. It is normally a sort of an idealized processor, that is, you would, if you were implementing a language like, well, we'll take the example of Pascal, but Java is a good one, too. You're implementing a language.
So as a compiler designer you can say, gee, wouldn't it be great if the processor that we were running on, we were able to just, like, design our own, that it had the following characteristics. And so the virtual machine is created to emulate that environment. I don't want to say a processor because it could sort of be more than a processor. So the virtual machine is, exactly as the name says, a virtual machine. It is an emulated pseudo computer which doesn't necessarily exist in hardware. But, for example, there actually was a Forth chip created which implemented the Forth language virtual machine in hardware. So it is possible, in fact I think there were some p-machines that were implemented in hardware, too, so...
Leo: Yeah, I think so, yeah.
Steve: Yeah. So you can, if you wanted to, devirtualize the virtual machine, make it a real machine, and you get all kinds of speed advantages if you were to do that.
Leo: This might go back to Donald Knuth, who wrote his classic books on programming in a pseudo-language called...
Steve: MIX was what Don created, yes.
Leo: Because he didn't want to make it machine dependent, I guess.
Steve: Well, he wanted his books to be able to survive through the years.
Leo: Smart man.
Steve: And he also wanted to illustrate his concepts in actual instructions, not just sort of an algebraic abstraction. So he had to create something that the students could read. In the inside front covers of his books is the MIX machine language, sort of as a handy reference as you're reading his texts, so that it's always there.
So anyway, to answer Luke's question, or to continue, probably the reason this is done, it's done because the boundary between the virtual machine and the compiler can be set by the language designers so that the virtual machine is very powerful or not very powerful. But as he suggests about cross-platform portability, and as we were talking about in the case of Forth, if you had something like a Java technology - or Perl, Perl's another great example because it's wildly cross-platform - if you implement the language against a custom-designed virtual machine, you can first of all design the virtual machine so that it's really good at doing the things the language needs, that is, it provides the facilities to, like, real-time garbage collection and memory management and very high-level constructs, much more than you would get from actual hardware. So that makes implementing the language on top of that virtual machine much easier and more convenient.
And all you have to do to move that whole blob that you've created, the whole implementation, to a different platform is rewrite a much smaller portion of it, that is, just the virtual machine for a different platform, and everything else that you've written runs. Because you've virtualized that the actual hardware into something which is not only non-specific for the platform, but also a close match to the things the language wants to do. So you can design a computer yourself that your language will run well on. And then you only have to implement that virtual computer that you've designed on the actual hardware.
Leo: So cool.
Steve: Yeah, it's just - look at all the cool technology we've come up with in computers over the years.
Leo: All right. Are you ready to take a look now at hardware interrupts?
Steve: We're going to move forward, yes. I'm going to do a quick little review of where we've come. And then, building on everything we've done so far, talk about how computers have become very deft at doing many things at once.
Leo: Well, in fact, if you think about it, if you're a computer and you need to communicate to the outside world, you've got to find a way to do it while you're busy doing other stuff. We see, it's funny, we see it all the time where a computer gets tied up. On the Mac you call it "beachballing," where it's just - it's busy doing something, and it won't pay any attention to you.
Steve: Actually I'm going to talk about that, too. I've got it here in my notes. There's a simple thing that some Windows programmers don't quite get right involving something called the "Windows message loop" that is part of what we're going to cover today.
Leo: Great. All right, we're talking with Steve Gibson, the security guru, the head honcho at GRC.com. And we're going to continue on in Steve's series on building, kind of building a computer up from scratch by solving kind of the fundamental problems that you have to solve to make the computer work.
Steve: Yeah. I don't - you know that famous cartoon with a professor on the blackboard who's trying to balance his equation, and gets to a certain point, and he can't figure out how to make it go, and he says, "And then a miracle happens." And...
Leo: (Laughing) We're not going to have any miracles here.
Steve: Yes. I don't want at any point to just sort of wave my hand and say, oh, and just take my word for it; or, oh, that just kind of, you know, happens. There is nothing that requires we do that. We've got intelligent listeners who have been paying attention. The fact is, mysterious as the inner guts of computers are, they're just not that complicated. By choice it's where I live, programming in assembly language, because I like dealing with the truth. I mean, with the actual raw capability of the machine. And it's easy to sort of say, oh, that's just sort of beyond me. But the fact is it's not beyond any of us.
So this is the fourth in our series of sort of laying down a foundation of understanding what the exact operation, the true functioning of the computers that we're all using now in our daily lives. In the first installment we looked at the whole concept of having a programmable machine where we had just a block of memory that was addressable, word by word, and something called a program counter, which counted through the steps of the program, advancing word by word; and that these words that were read were broken down into fields of bits where one group of bits was the so-called "opcode," the operation code, which instructed how the computer would interpret the rest of the bits in the word, which for example might have been another address in memory telling, for example, the computer to add the contents of that memory address into an accumulator, or subtract them, or invert them, or rotate them, or a number of different things, maybe send them out to a peripheral device or read something from a peripheral device into the accumulator, so you had I/O, input/output instructions. And that's really, at that level, that's all that's going on.
Then in the second installment we introduced the notion of indirection, which is a very powerful concept, the concept of a pointer where, instead of the instruction directly specifying an address from which you did some transaction, the instruction specified an address which contained the address with which you did the transaction. In other words, it was a pointer to the actual target of the operation, a very powerful concept, which we then advance to the next stage with our third installment, which was our discussion of having multiple registers and a stack.
The multiple registers gave the programmer more flexibility. He wasn't spending all of his time loading and storing things to and from memory using a single accumulator. If he had multiple accumulators, multiple registers, then it was possible to sort of have little temporary scratchpads. He could keep things in various registers as part of the algorithm he was doing.
And then we introduce the notion of a stack, which was the focus of the last episode in this series, as a hugely important and hugely powerful concept, the idea being that this was sort of a sequence-oriented storage concept, storage facility, where you could say, place a value on the stack, and then place another value on the stack, and then recover a value from the stack, and recover the prior value from the stack. The idea being that, as long as you put things on a stack and removed them in the reverse order, you're able to not worry about the immediate history of the stack as long as you have some discipline about always "popping," as is the jargon, popping something from the stack in the reverse order that you pushed some things on the stack. You were able to restore the contents that you had used the stack as a temporary storage to contain.
Well, that's an incredibly powerful idea because it allows us then, for example, to call to a subroutine, to jump to a subroutine to perform some work for us. And the subroutine knows that it's going to use, for example, a certain number, a certain subset, maybe, of the registers that's in this computer hardware. But it knows that, when we call the subroutine, we don't expect it to mess up the work we're doing. We just want some service from it. So it is able to use the stack to save the contents of the registers that it might be changing. And then, just before it returns to us, it restores them, it pops them, pops these values off the stack back into the registers so that, when it comes back to us, it's very much like nothing happened. I mean, we got a little work done. But if we were in the middle of doing things, then we were able to continue, trusting that subroutine to clean up after itself. And that's sort of the key concept.
So, moving forward, we have a computer which is able to do work. If we wrote some software, it could compute pi for as long as we allowed it to work on computing pi, or do polynomials, or do roots and cubes and sort of pure math computational stuff. Well, that's useful, except that computers really came into their own when they began interacting with the physical world, with so-called peripherals. We wanted them to interact with a teletype where we can type things in. We want them to be able to interact with storage facilities, where they're able to not only store things in their main core memory, but sort of have more like an archival storage, to magnetic tape or to disk drives. So that comes back to this notion of input/output devices, or input/output instructions, where we're able to say, take the contents of an accumulator and send that out to a device.
Well, in order to do that, since the computer can typically run vastly faster than the physical devices that it's using for input and output, it needs some way of pacing itself so that it's able to provide data to a device as the device is able to accept it. And it's similarly, going in the other direction, it's able to wait around for new data to arrive from a device. If we take the case, for example, of a keyboard, somehow the computer needs to know when we press a key and then to read the value of the key we pressed and accept it, store it somewhere, and then somehow be notified when we press another key.
Well, the original computers did this in a very awkward way, but it was all that they really had. There was an instruction, part of the input/output instructions, which would allow them to sense the readiness of a device to receive data or the availability of data from a device. And so the computer would essentially, it would execute an instruction to read the - say that it was trying to read data from the keyboard. It would execute an instruction which would allow it to sense whether new data was available. And, if so, it would branch in the program to a series of instructions that would read the data and then reset that little sensor so that it could then start waiting for the sensor to again say, oh, we've got new data available, in which case it would read that and reset the sensor and so forth.
So essentially, the computer would loop, like in a tight little loop, just reading the status, checking to see if that status said there was new data available, and if not it would go back and read it again. Now, the problem is, the computer is completely occupied while that's happening. That is, while it's reading data, waiting for us to press keys one after the other, it can't get anything else done because it's spending all of its time sitting there just waiting for data to become available.
So some early programmer said, well, we could get some work done if we sort of, like, went away and did some work, and then checked to see if data was available and, if not, go back to kind of to what we were doing again. And so the idea would be, they would take responsibility for polling the status of the keyboard every so often. Now, that was clever, and it would work, in theory. But it required a great deal of discipline from the programmers who were wanting to get sort of work done on the side while being ready to accept data from the outside because, if they got too busy, for example, just doing computational work, then the computer would seem unresponsive to the user. Or maybe two keys would get pressed one after the other, and the computer would have missed the first one. It would only see the second one when it came back, if it didn't come back often enough. So people writing code like that had to make sure that they didn't spend - they never, didn't ever spend two months' time not checking to see if something new was available.
So the programmers complained to the hardware guys and said, okay, look, this is crazy. We can write code to keep things moving. But it's really hard to do that. There's got to be a better way. Well, what was invented was the concept of a hardware interrupt, which is an incredibly powerful and, as always when I start talking about "incredibly powerful," you know that what's going to follow is "and dangerous," I mean, with power comes responsibility. Like we were talking four weeks ago when we talked about pointers, and I said pointers are incredibly powerful; but, oh, boy, can they be trouble if you're not very careful with them.
Similarly, hardware interrupts are very powerful, but I would be - I would not be surprised to learn that, for example, Toyota has had a problem with that kind of power, because it's the way computers become asynchronous. If you didn't have any interruptions, if all you were doing was running code like computing pi, then the computer is entirely - what the computer does is entirely deterministic. You start it in the morning, and it starts working on pi. And eight hours later, if you were to stop it, then it's done a certain amount of work.
If the next morning you started it at the same time, and if you checked on it in exactly the same length of time later in the second day, it is doing exactly the same thing. It's in exactly the same place. It's been - every single thing would have been predetermined based on its starting state. The entire future was determined by the way it started because, with simple software, it's entirely deterministic.
Well, that immediately goes out the window as soon as you start interacting with the real world. What hardware interrupts allow is they allow an electrical signal of some sort, an electrical signal representing some sort of physical event, like a key being pressed on the keyboard, to interrupt at any time the normal flow of instructions. So now, with a hardware interrupt system in this computer that we were just talking about, the main code, the main work that's being done, never has to check to see if a character is available on the keyboard. It doesn't have to explicitly check.
Instead, before it gets busy doing whatever it's going to do, it sets things up so that hardware, a hardware event that occurs can interrupt it anywhere it is, literally in between instructions. So what the hardware in the computer does is, as it's stepping its program counter, word by word, through memory, reading each instruction in turn and doing that, some additional logic is added to the hardware which, just as it finished executing an instruction, there's a hardware check to see if the interrupt signal is active. If not, it just proceeds as it normally would have to read the next instruction in turn and execute that. And again, after that instruction it - it takes no time to do this in hardware. So there's no overhead in testing this hardware interrupt line between every instruction. It happens at light-speed in the logic. The computer either moves forward; or, if the hardware interrupt is active, it suspends, essentially doesn't execute the next instruction it would have in sequence. Instead it does something different.
Now, what it does in detail is a function of sort of the generation of the hardware. For example, the PDP-8, my favorite machine to use as sort of an example of in the beginning, did have hardware interrupts. And in fact I used them in the programs I wrote for blinking the lights and playing games and things. What the PDP-8 did was, when a hardware interrupt occurred, wherever it was executing in main memory, it forced that next - the location of that next instruction it would have executed to instead be stored in location zero in main memory. And it changed the program counter to a one. So what that did was, suddenly the instruction at program location one got executed, and the computer followed the trail from there with the exact location where it had been interrupted stored in memory location zero. Of course, that allowed - storing it in zero allowed the computer to get back to, to return eventually to exactly where it was.
So this was a huge innovation. Suddenly you could set things up so that, for example, when a key is pressed, no matter what you're doing at the time, suddenly you are where you were, it's stored in location zero, and you start executing at location one. And the formal name for that location one is an interrupt service routine, or an ISR, as it's abbreviated. Interrupt service routine, meaning that that routine, that code is going to service this interruption that the computer has just experienced.
So what does it have to do? Well, we have no idea now where we were. We don't know what we were doing when we got interrupted. So now what we've introduced is nondeterministic computing, where real-time events occurring in any time change the flow of code through the computer. Well, if we want to, for example, read the data from a keyboard and store it in a buffer somewhere, we have to make sure that, almost like a physician that promises to do no harm, we have to make sure we don't change anything. That is, the computer was in the middle of work of some kind. But now we've got to use the computer's own resources to read the keyboard and figure out where we were in the buffer and store what we read in the next byte of the buffer. And so we - but then we need to return to what we were doing when we were so rudely interrupted with - but leave the machine in exactly the same condition as we found it.
So this is where the stack, brilliantly, comes in because, remember, it's this beautiful sort of flexible scratch pad, which as long as we pull things back from it in the reverse order we stored them, we get them all back. And what's cool is that the program we interrupt can be using the stack, too. We just need to make sure we put everything back the way it was. So our interrupt service routine knows that it's going to use some of the registers in the computer which were probably in use when it got - when this interrupt service routine got invoked. So it pushes the values stored in those registers onto the stack in a certain careful sequence and says, okay, I have saved things. Then it reads the data from the keyboard, figures out where it needs to store it in the buffer, does so. And then it needs to clean up after itself. So what is to say is, it needs to restore the state of the machine to exactly what it was when it got interrupted. When it does...
Leo: Now, Jimmy has an interesting question in the chatroom.
Leo: Is this recursive? In other words, what happens when you interrupt an interrupt?
Steve: Ah, well, that's a very good point. Now, in this PDP-8 model, there is a problem, which is that, when the interrupt occurred, remember where we were got stored in location zero. So imagine that, while we were doing this work in this interrupt service routine, another interrupt were to occur. Well, what would happen is where we were would get stored in location zero, and we'd start executing at location one, like we said. Well, that's horrifying because we had already stored in location zero where we were when the first interrupt occurred. Now another one has happened, while we were still in our interrupt service routine, which wiped out the record of the first one.
Well, naturally there was a way, there are several mechanisms that come into play here. The first is that the hardware disables interrupts as part of its servicing of the interrupt. So even in the very earliest machine, like the PDP-8, they understood that there was a danger with the architecture that they had. So the act of the computer doing this interruption, storing where it was at location zero and then executing from location one, the act of it doing that disables further interrupts. So what that does is it prevents exactly the problem that I just stated of an interrupt occurring while we're still in the process of servicing an existing interrupt. So that prevents there being a problem with recursion.
Now, what happened as we moved forward in architectures is things naturally got more complicated. It was recognized, for example, that there were some peripherals that were high-speed, like a tape drive or a disk drive, which were generating data or requiring data at a much greater rate than, for example, a teletype or a keyboard. And so the notion of priorities of interrupt, interrupt priority came into being, where an interrupt would be serviced only if the interrupt - if any interrupt was in the process of being serviced, if the new interrupt coming in was a higher priority than the interrupt that we were in the middle of working on. So that's confusing unless you listen to it a few times.
But what that meant was that you could have a low priority interrupt assigned to the keyboard, and a higher priority interrupt assigned to the disk drive or the tape. And by agreement of the architecture and the programmers, a higher priority interrupt could interrupt the work being done by a lower priority interrupt. And that required a fancier architecture. For example, rather than having just a single location, like location zero, you might have a block of locations, call it 100, 101, 102, 103, 104, 105, that is, a location for each priority of interrupt. So that would allow - that meant that different priorities of interrupts stored their interrupted data in different locations to prevent a collision.
So again, mechanisms were created that allowed that. But the concept here, the main concept is that we've gone from a system where the starting state determines the entire future of the computer - which would be the case, for example, of us computing pi, where we're just following instructions, one after the other, ad infinitum - to a very different system where we're now able to respond in, essentially, in real-time to physical events happening in the real world.
And thanks to this ability for the flow of our instructions to be interrupted at any time, suddenly other code - completely unrelated, perhaps, to what we were doing - is being executed. It, that other code, it promises that when it's done it will put the machine exactly back to the way it was. Then it returns - because the location where we were has been stored in memory, it treats that like an indirect pointer, remember, so it's not the location but the contents of the location. So it does, for example, an indirect jump through location zero, where the location that we were executing has been stored, which actually takes us back to where we were. The program that was running has no idea that that all just happened.
If it were really fancy, it could sense that time had been lost. And in fact that's the way some of the anti-hacking technology that has existed, like anti-copy protection or copy defeating or anti-debugging. Because things like debuggers will be breaking the normal flow of execution, time is lost. And so it's technically possible for that software to sense that, wait a minute, some time has been lost between this instruction and the next one. But in terms of sort of its own sense of the machine and the contents of the registers and what's going on, it would have no knowledge that what had just gone on had happened. It was completely separate from it.
It could look around the computer and see evidence of things. For example, it might be keeping an eye on the buffer which is being filled. The fact that it's being filled is sort of magical to it. Every so often it looks at the input buffer, and it sees characters, new characters are appearing. And there's, like, a buffer pointer that says, here's how full the buffer is. And that's changing magically, sort of by itself. It's actually being done by an interrupt service routine reading those characters from the keyboard. But that activity no longer needs to be monitored that closely. The actual main program can just kind of keep an eye on it lazily and decide if there's enough characters yet for it to, like, get them all at once and go maybe move them somewhere else or do whatever it's doing.
So it's an incredibly powerful advance in the technology of the way computers work, which has been responsible for taking us sort of to the next level of allowing computers to interact with us in a very rich way.
Leo: It's amazing how this all pieces together.
Steve: Yeah, and not that tough.
Leo: No, it's not at all.
Steve: I mean, if our listeners listened carefully, they now understand what hardware interrupts are. And it's sort of, like, duh, I mean, that's how they ought to work, and that's how they do. That's all there is to it.
Leo: Well, I think the thing I like about this series is that it just - it's almost that it has to be this way. It's like, well, these are the problems you have to solve. And one by one you solve the problems. And it just kind of inevitably almost is this way. I guess there are other ways you could solve it. But this is such an easy, straightforward, logical way to do it. It's like there's a certain inevitability to it, is I guess what I'm saying.
Steve: Right, right. I think it follows logically. One concept follows logically from the next. I mean, some of these innovations were brilliant. The notion of a stack, the idea that you could have a stack pointer where you'd sort of give it a region of memory for it to worry about, and you could just kind of give it things. And then as long as you took them back in the reverse order, it would remove them from the stack in the reverse order that it had pushed them on the stack, creating this incredibly flexible facility for having, like, temporary storage in the computer. I mean, that was a brilliant addition.
Leo: I love the stack.
Leo: I just love it.
Steve: I did mention that I would talk a little bit about Windows when we were talking about applications freezing.
Leo: Right. The message event, or the event...
Steve: Yeah, the message loop, it's called.
Leo: Loop, right.
Steve: One of the first things a Windows programmer learns, and this may be less true now than it was, things like Visual Basic and some of the more recent languages sort of obscure the reality of what's going on inside of Windows. But the original concept with Windows was that programs would - that Windows itself, the Windows environment, before it was even an operating system, when it ran on top of DOS, it would hand programs tokens that were called "messages." It would hand them messages, which was just a value from zero to something which represented an event. And so it was also called the event loop sometimes, or the message loop.
And the idea was that the code that the programmer, a Windows programmer would write would ask Windows for the next event. And in doing so, it was turning control back to Windows. That is, it would say, give me the next event. Well, if there wasn't a next event that affected that window, then the software would just sort of wait. It would wait for another event to occur. And that was how you could have multiple windows on the screen that sort of seemed to all be alive at once, all active at once. That is, you could grab it and move it, or you could type in one, you could do different things. In fact, only one was receiving these Windows events. And so as the application processed these events, displaying text or moving the window around or resizing itself, whatever the event told it to do, the window was animated.
Well, one of the things that could happen is some applications take time to do something. Maybe they're copying a file to the hard drive, or they're doing some encryption, for example, some serious number crunching that's going to take many seconds. Well, if a Windows programmer didn't really understand the way to program Windows, they might, upon receiving, like, an event saying do the work from a button, because buttons, when you press buttons or you select menu items, those just generate messages. Everything in Windows is a message. So the programmer might just go off and do that work, like start doing some big encryption project.
The problem is, while he's doing that, he's no longer asking Windows for any new messages. So when Windows sends a message saying, hey, someone's trying to drag you to a different location, that window won't move because - not because anything's broken, not because anything's hung, it's just that the programmer of that Windows application didn't consider that something else might still be going on while he's busy doing work. It's very much like this polling problem we just talked about where, if the program - if we didn't have interrupt system, and the program wasn't going and checking back often enough to see if a new character had been typed, it could miss some.
So the proper way to write a Windows application is with something called "multiple threads." And in fact the next topic we're going to cover, in two weeks, after next week's Q&A, I call it the "multiverse" - multicore, multiprocessing, multitasking, multithreading. It's multi everything, what is all of that multiness at all the various levels that it can occur. We now have enough of an understanding of how computers function to tackle that.
But the idea, briefly, of multiple threads, which we're going to cover in detail in two weeks, is that it's possible to sort of split off another execution ability from sort of within your program, so that you could still - you sort of split it so that you could still be asking Windows for any new messages while at the same time you are inside the same program, you're able to be doing the math or the long, time-consuming copy operation or whatever it was you were doing. And if programs are written that way, then they do not freeze when they're just busy doing something. If you don't write your program that way, even if there's nothing wrong with the program, it freezes. And people are so used to programs not freezing, that is, staying responsive to the user interface, that they immediately think something's broken.
And in fact this was such a problem that Microsoft added technology to sense whether a program was not responding to its message loop. And that's where we've seen, I don't know when it popped in, like maybe it was Windows 98, where Windows itself will put up "not responding" in the message bar, I mean, in the bar at the top of the window, as if we didn't know that it wasn't responding. It's like, yes, we know it's not responding because we're clicking on buttons, and nothing's happening. Again, nothing necessarily is broken. It's just that the program wasn't written with enough flexibility in mind.
Leo: Very interesting stuff. It's very helpful to understand the underlying principles so that when your programs go south, you know what's going on. It's not a mystery.
Steve: And again, there are reasons for all of this.
Leo: Question #8. David W. Griffin in Atlanta, Georgia comments about programming in assembly language: I respect your abilities to program in assembly language, but much of the world's software these days is designed for large-scale software for which high-level solutions rather than low-level solutions are the right way to go. Developing large software projects with large staffs and then maintaining them for a decade is not a job for which you would select assembly language, not if you could help it, anyway. I'm not sure I agree with that.
Software engineering has made little progress toward reusable components, but at least high-level languages have some effect on achieving reliability. Nothing you have said contradicts this. You, after all, are doing small, well-focused applications with a single author. But I thought I'd make the point that much of the world's software today has other design considerations. I enjoy the podcast and your lectures on computer science.
Steve: And largely I completely agree. When I talk about my use of assembly language, I regard it as a personal preference. I'm not pushing it on people. I'm not suggesting that the world would be a better place if people programmed in assembly language. Well, maybe I am. But I completely recognize that high-level languages are here to stay; that they make much more sense for many applications. I mean, it's just - it's programmer productivity. I guess the metric that I've seen which is most compelling is that, no matter what level of language you're programming in, programmers generally produce the same number of lines per day. So if a high-level language line of code does much more than an assembly language line of code, and both programmers are going to be equally productive when measured in lines of code, then it's clear that more functionality is being written per day by someone whose lines of code do more per line because they're using a high-level language.
So I have no problem with that. I'm sticking with what I love and like and know and I'm so comfortable with, assembly language. But by no means have I been intending to denigrate in any way the value of high-level languages. We wouldn't have nearly as much stuff if we didn't have high-level languages. I would argue maybe we'd have better stuff, but much less of it. So would that be a bad thing? I'm not so sure that would be a bad thing.
Leo: Yeah. I mean, I guess my thinking is, you can make assembly look just like a high-level language with macros, and probably make it very efficient.
Steve: Well, mine is very clean, and I write a lot of it in a day. So again, it might be a little bit like, you know, the guy who wrote in about Forth, who took exception to my saying, "I can't read that. How can anybody read that? Nobody can read that." And he said, "I can read that." It's like, okay. So it's what you know.
Leo: It really is.
Leo: To each his own. It's like arguing which human language is the best. I mean...
Leo: They're all - they can all be equally expressive. But I think I like the metric, which is the more you can - assembly language does require more typing.
Steve: To get the same job done, you're absolutely doing more typing, yes.
The Multiverse: multi-threading, multi-tasking, multi-processing, multi-core (Security Now 247, 47:15)
Leo: The multi-verse. This is part of our basics of computing; right?
Steve: Yes. We discussed two episodes ago the concept of a stack, which is one of the fundamental core principles that underlies all current computing. I mean, there just isn't anything you could replace it with. And visually, just to remind our listeners, you can sort of model it on the stack of dishes in a cafeteria, where there's like it's spring-loaded, the idea being if you were to place dishes on this stack, they sort of are absorbed by it. That is, they go down into this well. And as you remove them, you naturally get them in the reverse order that you put them on.
So the way the computer uses this, say that you had four registers, and you needed to put their - you needed to save them somewhere safe while you were doing something else with them, and then later you wanted to restore them. Well, you could simply say, push register one on - the contents of register one on the stack. Push the contents of register two on the stack, the contents of register three on the stack, the contents of register four on the stack. And you don't have to worry about where the contents went. This abstraction, this cool idea of a stack where it's a region of memory with its own pointer, the stack pointer, which automatically advances to always be pointing at the top of the stack.
Then in our example you do anything you want to with these four registers, not worrying about disturbing their contents, because you've saved them on the stack, as the jargon goes. And then when you're done, you would - where "pushing" something is the term for putting it on the stack, "popping" it is the term of removing it from the stack. So you would pop the stack into the contents of register four, pop the stack into the contents of register three, pop the stack into the contents of register two, pop the stack into the contents of register one. That is, in the reverse order. So just as the plates would go down and come off in the reverse order, so does the contents of the stack.
And the point is that this is sort of a place that, as long as you always back out of what you're doing in the reverse sequence that you went in, the stack is your friend. It always provides you what you're expecting. And so that's a - that's one fundamental principle. The second we talked about in the last episode, which was hardware interrupts, the idea being that, if you wanted to do a couple things at once, if you wanted, for example, if your software program wanted to be printing, well, a printer operates vastly slower than the code running in the computer. And that means the printer can at some level only accept one character at a time.
Say that it was like an old-style teletype where you're going chung-chung-chung-chung-chung-chung-chung, you know, 10 characters per second, typing them out on a yellowish kind of roll of paper. You don't want to sit around in your program waiting for the teletype to say, oh, I finally got through printing the A, now I'm ready for the next character, whatever it is, because that would require that your program could do nothing else at the same time. It would just have to sit there watching the status of the teletype until it's ready to accept the next character.
So this notion of hardware interrupts was generated where you could have the hardware interrupt your program, which is off doing something else. And that interrupt, what's called an interrupt service routine, which services the hardware interrupt, it would yank control away from your program, just right in between the execution of any two instructions. You would never know when it was going to happen. In fact, that program running in the foreground, it would sort of even be unaware that anything had happened because, in between successive instructions, control would get yanked away from it by the hardware interrupt, which would save the registers, whatever they happened to have in them at the time, which would then free up those registers for its own use to, for example, get the next character in an I/O buffer and hand it to the teletype and start that being printed. Then it would restore the previous interrupted state of the machine and jump right back to the instruction that was about to be executed in the normal main foreground process. So hardware interrupts were the beginning of allowing us to do multiple things at once.
Well, the next evolution of that is when we said, wait a minute, we actually do want to do more things with a computer at once, not just printing in the background, but what about timesharing? What about having multiple users who are all able to use the same computer? And of course timesharing was a big innovation that we saw in the early '70s because these computers were many tens of thousands of dollars. And it became clear that, well, we've got this hardware which is much faster than a single user. I mean, even someone, Bill Gates in 1973, at Harvard, using the teletype on one of these early machines, he was typing relatively slowly, I mean, because you couldn't type fast on those machines. You could only input things at 10 characters per second, no matter how fast a typist you were.
And so what they realized was, wait a minute, the computer is sitting around waiting for the next character to be typed in. We could have 10 teletypes connected to this computer, and each user would feel like they had the computer to themselves. The question was, how do you share a machine among multiple users? How do you do timesharing?
And what they realized was, we'll have hardware interrupts generated by a timer. And that was, just that little change, that was a massive innovation because now what this meant was that part of the hardware, part of the fundamental hardware of the computer would be some piece of hardware, a little timer chip, or a divider which was dividing down the frequency of the main clock crystal, down into something like maybe 100 Hz, or maybe it used the AC line, actually used the AC current, which is 60 cycles in the U.S., 50 cycles in Europe. And every time the AC waveform crossed through zero, which would be 120 times a second, or 100 times a second, that could generate a hardware interrupt.
And the idea was that software, you could essentially have five programs, say, all loaded into memory at the same time. And then you'd have a supervisor. And here we begin to get the concept of an operating system. At that point we'd have a supervisor which would be supervising the execution of more than one program at once. Well, you needed some way to get control, to sort of yank control back from any one of these programs back to the supervisor. And hardware interrupts was the secret for doing that.
So when you started up the computer, you'd have the supervisor, sort of an early form of an operating system, which would initialize the hardware interrupt on the so-called "counter timer," or a periodic interrupt which occurred very often, but not too often, typically several hundred times a second, something like that. Maybe a thousand in later, faster machines. But initially you were basically slicing up time based on these interrupts. And every time the timer expired or the AC line crossed through zero volts, like 100 or 120 times a second, that would generate an interrupt that would give control back to the supervisor. It would look to see how much time had gone by and decide if the same program should keep executing, or maybe if it was time to give somebody else a chance.
And so essentially all of the programs which were running at once, sort of - as we know, computers don't actually run things, well, actually they do later on, when you have multi-core. They actually are running more than one thing at once, which we'll get to in a second. But back then they were just - they were time slicing. They were giving, often in a round-robin fashion, that is, program one, then program two, then program three, then program four, then program five, then back to program one, they would - this supervisor would get control and then hand control back to the next successive program. Meanwhile, each of those five programs had essentially been interrupted.
So execution would start on the program. And after this period of time, this clock interrupt would occur, yanking control away from that program, so putting it in sort of an interrupted state, which was fine. It would just sort of patiently sit there until the supervisor decided it was time for it to get another time slice, to get some more slice. And so the supervisor would essentially return from that interrupting event back to the program, having restored the state of the machine, the way the program had left it, and then it would continue executing just like nothing had ever happened. So this was sort of the first notion of there being sort of a nascent operating system, the supervisor, that would start up programs. And the supervisor was able to keep track of the passage of time by these periodic interrupts which had the power, thanks to the hardware architecture, of yanking control away from anything going on at any time.
Now, later on, programmers said, okay, well, I've got a program here, and it's nice to be able to have other programs running at the same time, but I'd like my own one program to be able to do multiple things at once. And what was invented was the notion of a thread, the idea being that a traditional original sort of older style program could only be inside of itself, could only be doing one thing at a time. That is, you'd have - you'd call it a single-threaded program, which is to just sort of say a non-threaded program. That is, it's only doing one thing at a time.
But what we realized was you could - the same abstraction which allowed us to apparently be running five programs at a time, could also give you the ability to apparently be actually in many places of a single program at a time, each one of those places being a thread. And so essentially it's like timesharing or multi-tasking, which is timesharing, but within a single program. Thus that's called multi-threading. And so what happens is, when the program starts, it always starts with just one thread. That is, you know, it starts at the beginning of the program. And so it goes along until it decides it wants to sort of branch out. It wants to be able to, for example, maybe it wants to be in charge of sending characters out to a device while it's doing something else, much like I was talking about before. And so it's literally able to spawn a thread that is, like, it's able to, within the program, it's able to say, okay, create another thread and start executing it at this location.
So because in a traditional single-core system the CPU actually only has a single program counter, which can only be in one location at a time, we know that we're not actually executing multiple threads, or in this case these two threads at the same time. But once again we have a timer as part of the hardware which is able to get control back to the supervisor. And so whereas before the supervisor was dividing all time up into, for example, five main programs, now it's more granular. Now it's dividing time up, not only among whatever programs may be running, but within those programs, among whatever threads each individual program may have created. So this is a tremendous benefit and very powerful thing that modern programs are able to do because it sort of gives the individual programmer within an individual application the ability to set different things running and let them sort of handle themselves.
You have to be careful, I mean, and this is where some of the skill of programming comes in because we're, as we do this, we're escalating complexity, in some cases dramatically. For example, you need to be careful about not having the threads collide with each other or not having multiple threads trying to do something that conflicts with each other because essentially the threads don't know, they have no awareness of when they're running and when they're not. Each thread sees itself as if it's running all the time, when in fact that time is being chopped up among all the threads in a system. You could have many different threads in different processes, that is to say, different programs, and from sort of a thread-centric view they just sort of see that they're running along executing, unaware that chunks of time are missing during which time other threads in the system have the actual access to the processor, and they're running moving forward.
So Intel at one point said, you know, this is all well and good, but what if we actually supported multiple threads in the hardware? What if, instead of this threading being a complete abstraction, what if we actually created hardware that could do sort of more things at once? And so this was before multi-core, but this was what they called "hyper-threading." And the idea was that you could have multiple program counters and share a lot of the resources of the processor; but, for example, have multiple sets of registers. And the idea would be that, so long as the programmers set these things up right, you could actually be doing multiple things at once. That is, the hardware could.
And so we went for a couple years like that. And then of course Intel went to the next stage and gave us multiple cores, where now you actually had full processors which, again, had their own program counters; they had their own registers; they had all their own resources. They shared memory. So that was the common thing that they shared, as well as the hardware, like the display and so forth. That is, they had common access to the system's I/O and also to its memory resources. But otherwise they were independent.
Well, by the time that this happened, operating system technology had evolved substantially. I mean, we already had Windows and multiple tasking and multiple threads. And what was so cool about this notion, first of hyper-threading and then of this multiple core, was that you'll notice that our operating system software and the applications we were running, nothing changed. That is, we didn't have to have different applications. We didn't have to have a dramatically different operating system. That is, there was this, sort of this notion of multi-core support. But from the user's standpoint, our systems just got faster. Things worked better.
Well, and the reason that happened was that we had already, before that time, there was this mature concept of multiple processes and multiple threads. And the operating system was given the task of deciding what the computer was actually doing at any one instant. And the operating system was jumping the program counter all over the place, giving all these different threads a chance to run until the next hardware interrupt yanked control back to the operating system, where it would decide what to do. And it was the so-called "scheduler," the operating system scheduler is, I mean, it alone, the optimal strategy for that is the subject of doctoral theses and whole books on computer science, how to maximally schedule all of these things that are going on simultaneously in the computer. Do you want to allow some things to run with high priority? Do you want some things to run in the background? Do you - how do you manage the simultaneous demand across all of this?
But what was so interesting was that the abstraction of all these threads essentially running at the same time, that is, in a given program, the threads were set up so they thought they were all going at once. And individual programs in the operating system thought they were all going at once. So when we added multiple cores, they actually could be going at once. And the way we got away with this was there was already the discipline in place about not having them conflict with each other. That is, there was already some interprocess containment so that processes couldn't stop on each other's memory. And within a process there was already inter-thread discipline to make sure that these threads that conceptually were all running at the same time behaved themselves and didn't mess with each other's business, essentially.
So when we added hardware that actually was able to run multiple things at once, first through this hyper-threading that was sort of poor man's multi-core, and then real multiple core, well, everything just got faster. It scaled beautifully because everything we already had, the whole structure was oriented toward the simultaneity which initially we were faking out, but later we were actually able to do it.
So that's basically the multi-verse of contemporary computers. And what we're all sitting in front of right now is all of that going on at once, all behaving itself, for the most part. When it doesn't, things definitely go sideways. But when everybody plays their part and behaves themselves, we end up with just an incredibly powerful facility for computing.
Leo: It's been fun watching Intel slowly add these capabilities, as you say, with hyper-threading and multi-cores. And they also use something called QPI, interprocess communications now in the new i3, 5, and 7 chips. They've really gotten very sophisticated about how to do this. It's no longer just time slicing. They're really getting sharp about it. And it does make a difference. And I think the reason that they put so much effort into it is they realized that they couldn't put more megahertz out, that they were getting all sorts of problems past 3 GHz. And so they backed off and said, well, we're not going to have a 6 GHz processor. How else can we improve speed? Multiple processors.
Steve: Well, and Leo, I mean, 3 GHz.
Leo: Is a lot.
Steve: Three billion. Three billion cycles per second. I mean, it's nuts. The first IBM PC was 4.77 MHz. And we were all thinking, whoa, this puppy just - it cooks.
Leo: I remember very well when we first started doing TV for Ziff-Davis in the early, I guess it was between '92, '93, '94. And Bill Machrone came in, and he brought in the new Pentium. And it was 90 MHz. Megahertz. And we were all saying, oh, it's so fast. And I remember using it, saying, wow, it feels slippery, it's so fast. It's moving around. That was 90 MHz.
Leo: I mean, we are now more than 30 times faster. In just clock cycles. And then you double the core or quadruple the core until it's got six core processors coming out. Each is hyper-threaded. That's 12 processes, 12 threads at a time.
Steve: Well, and what's really interesting, too, is that while the processor performance has been able to scale generation after generation after generation, main memory stalled, essentially. The so-called DRAM, Dynamic Random Access Memory, its technology turned out to be stubbornly non-scalable. So, and it's just due to the way, just due to the nature of the way DRAM operates. There are, you know, we hit essentially the physical fundamental limits of its speed much sooner than we did processors. And so what's been done is, we've made it wider, that is, so that one memory access is fetching many more bits, sort of in width, at a time.
And then we've also had - we've gotten very sophisticated with caching. Because one of the things that the early researchers noticed was that computers would sort of go over here in memory for a while and work in some loops and do some work in a certain region. And then they'd go over somewhere else for a while and do some work in a different region. I mean, it wasn't like they were, like the actual execution was randomly occurring throughout all of memory. There was a regionality to it.
And so what they realized was that, if you had a much smaller but much faster memory, a so-called cache, a stack memory that was able to be made to be kept at the same speed as the registers of the computer, where they could run in a single cycle, what they realized was, well, you might have to wait, essentially, to read what's in main memory into the cache to get it into the cache. And that would take a while. The processor would essentially, it might be stalled while you were painfully slowly pulling main memory into the cache. But once you got the cache loaded, then the processor could run again at full speed, full sort of local speed, as long as there were, as the term is, a high percentage of cache hits as opposed to misses. So that while the processor was doing things locally, staying there, it was able to execute again at full speed.
And so that's - and they extended this notion, not to just a single level of cache, but to then two levels of cache, having a very high speed, but smaller; and then a semi high speed, but larger; and then, like, finally, main memory, where we just can't make it any faster, unfortunately.
Leo: Yeah, in fact we know L2 cache, that Level 2 cache, the amount of that makes a huge difference in overall processor speed.
Steve: Yes. And when people are selecting processors, it's also expensive because it takes up chip real estate, and it takes up power. And so one of the things that purchasers do is they decide, oh, yeah, how much speed do I really want, because the chips with the larger Level 2 cache, the L2 cache, they're going to be more pricey.
Leo: Right. It's a great subject. At some point maybe, I don't know if there's enough for a whole show, but I'd love to talk or hear you talk about cache, the concepts behind caching. Because we see cache all over computing. We used to see hard drive caches, there's browser caches, and of course there are these processor level caches. And it's all the same idea, I would guess.
Steve: It's absolutely, it's very similar.
Leo: And there's a lot of computer science into how you do this; right?
Steve: Oh, well, yes. And there's also write-back versus write-through. And there's this notion of the cache being dirty. That is, the processor also wants to be able to write into the cache. It's not only that data from main memory is pulled into the cache, but the processor may be modifying the cache. So the question is, do you start writing back to main memory when the cache is changed? Or do you wait until you have, like, a different flush event? That is, for example, the cache may be changed a whole bunch of times, in which case you wouldn't want to be writing it back all the time. But then you've got a problem, for example, with multiple cores, multiple processors, because they need to have a coherent view of what's in memory. And if one processor's got modified memory in its local cache on its chip, then the other processor can't see into that. It's seeing main memory. And so it gets very tricky.
Leo: Question 7, Richard Doyle in Sydney, Australia stumbled upon your legacy project. I don't know what that is. Dear Steve, I'm 32, and I've only been listening to Security Now! for a few months, but I'm quickly catching up. Your explanation over the last several weeks of the fundamentals of computer architecture, organization, design and evolution over time has been accessible enough to inspire me to genuinely want to learn more and more about this entire area myself, from very first principles.
Have you ever thought of writing and publishing a book encompassing in greater detail everything you've explained - and would like to explain - in the current Security Now! series on the Fundamentals of Computing? Every other resource out there, mostly books, are dry, boring, and many assume a level of knowledge that most people just don't have. And for the most part every other resource out there is techie from the start. Not a bad thing, but we are badly in need of something that can begin to explain a thing in an extremely simple way, then scale up in plain language to the relevant level of detail. Other authors seem to enjoy an abundance of technical jargon for its own sake. And the people you've inspired through your current series in Security Now! are left with no entry point into this wonderful and amazing field.
Please consider, Steve - blah blah blah blah blah. How a computer works - the substance is there. Kindest regards, Richard. I don't have a Twitter account. You know, there is - my good friend wrote and has kept up to date a wonderful book called "How Computers Work." I think you probably know him. I'm trying to remember his name. He's been doing it for years. Let's see if I can find it on Amazon. And it's beautiful. It's a very - it's Ron, Ron White. He's been doing it for years.
Steve: Oh, my goodness, yeah.
Leo: You know Ron; right?
Leo: This is kind of the definitive book on this. And it's done with great illustrations. It may not be as tech- it's not as technical, in fact, as what you're talking about, Steve. But it is definitely aimed at the nontechnical, and it's a good start, if you're looking for that kind of thing. Incredible illustrations. But Ron is a smart guy. I mean, the detail in here is fantastic, and he has kept it up to date, which I really admire. So I think this would be a good start.
Steve: I agree. And it's available today.
Leo: Right. Would you like to do a book?
Steve: Well, no.
Leo: Yeah, I didn't think so.
Steve: But I've wondered what - this is going to sound strange, and I'm 55 and still have a lot of life left in me. But I've wondered what I'm going to do when I'm 75. That is, you know...
Leo: A good question.
Steve: And because I really do believe that, if you retire and sit on the patio in a rocking chair, you expire not long afterwards.
Leo: Seems to be the case in many cases.
Steve: And we know that I spent a chunk of time in the last year sort of looking at antique machines. I built the PDP-8s. And I also spent a lot of time researching instruction sets. I looked at field-programmable gate arrays, which I referred to last week as being these fantastic electronic building blocks which can be used for defining hardware out of software. And, for example, one of the things that people are doing is they're implementing processor instruction sets in field-programmable gate arrays, FPGAs, like taking classic instruction sets and creating computers out of these field-programmable gate arrays. There's something called OpenCores.com or .org, OpenCores.org, which has a lot of these. And so I thought about that.
And I also thought about these antique machines, the PDP-8 and the PDP-11s that I have. The problem is, they're not interfaced to any contemporary peripherals. I mean, you have to have a teletype, or maybe a serial interface. But what are you really going to do with them? And so I've sort of just - everything was sort of in a big mashup. And then I looked at instruction sets. And I sort of, like, surveyed the evolution of instruction sets over time. And all of this sort of ended up giving me the incentive to do this Fundamentals of Computer series that we've embarked on.
And for a while I was thinking, well, I was thinking the PDP-11 was the right instruction set, sort of like something that would be fun to program. And then I thought maybe the VAX. There was one instruction set from a company called National Semiconductor. They had the NS32000 series, which unfortunately never got off the ground. But in many ways I think it's the best instruction set ever designed. It was fun and nice to program, eight general purpose registers, a very regular instruction set, just beautiful. And then I thought, okay. I'd like to program the instruction set. But the chips don't exist anymore.
So then for a while I was thinking, okay, I could write the software to put this chip into an FPGA, a field-programmable gate array, basically create this chip that no longer exists out of contemporary silicon. The problem is that, when I thought about, okay, what kind of performance would I get, well, the things that have been done by, like, Intel and AMD to squeeze performance out of current processors are just over the top. I mean, it's unbelievable the technology that is in these things, with multiple parallel execution paths and multi-cores and pipelining, and even optimization of instruction sequences, and branch prediction, where it guesses based on local knowledge what path your code's going to take. And it's just daunting.
So I realized that, if I were to create this idealized processor that would start off probably as the instructions that National Semiconductor designed - five years after DEC's VAX, by the way, so they had five years of experience with the VAX instruction set and sort of tweaked it a little bit to make it better - I might do the same, you know, tweak it some more. The idea would be to create this ideal instruction set. And but if I put it into hardware, it would never perform like a contemporary machine because there's no way I'm going to invest the unbelievable resources that an Intel or an AMD has.
And then it hit me that, if I emulated that instruction set, this ideal instruction set, in a contemporary processor, like the current Intel, the current Intel architecture, then I'd be, in machine language, I would emulate another machine, sort of like PASCAL, and we've talked about p-code, like a pseudomachine. But because I was writing the emulator in machine language, and because I was writing it for hardware, the Intel architecture, which is already so unbelievably powerful, I would end up with an amazing amount of performance of this, like, the most beautiful instruction set I've ever seen, that I've, I mean, like, that there is, in my opinion. And so then I thought, okay. That's what I would want to write my operating system in. And that's what I would want to create an environment in.
And so my plan for retirement, my legacy, is to essentially create an entire open source free environment around the most ideal, beautiful instruction set that we've ever had, and write an entire world, a computing world in that - meaning assembler, editor, operating system, environment - with the goal of teaching a low-level operation of all of this stuff. Because it's hosted on contemporary hardware, everyone gets to play with it for free. And because it's a virtual machine, it gets to live forever. All anyone would have to do to port it into the future is write that little interpreter for the instruction set, and then everything else is available. So at this point, 20 years before I'm ready to start, that's what I think I'm going to do.
Leo: That's exciting. I look forward to it. I thought - so you've got a few projects, actually, for when you retire. I think you're going to be very busy in your 70s.
Steve: I want to be busy in my 70s.
Leo: Well, I'm ready if you are. We're going to get to our fundamentals of computing series. Kind of the final, although I think networking really counts as still part of it. But the operating system is about as high a level as you can get in the PC.
Leo: So let's hear how operating systems work.
Steve: Okay. We began back in the '50s, so operating systems are - today here we are in 2010 recording this. As of this day, or date, or year, rather, not specifically this particular day, they're 60 years old. Originally computers back at the beginning had no notion of an operating system. They were just these very expensive, rather limited machines. Typically they had, like, word sizes of 32K. Sometimes big ones were 64K. But, I mean, that's as big as they got back then. That wasn't a small machine. That was a big, multi-hundred-thousand-dollar installation in a university would have 32K.
And so the problem was, how do you make use of this machine? And in fact over time, as machines grew larger and also a lot more expensive, keeping them busy was the biggest problem because it would take time to load a program into the operating system. Then the OS would spend some time running, and of course you've got the debugging of the program to deal with, too. I mean, the thing to remember is that computers at this time were rarified. I mean, this was the men in the silver - silver - in the white smocks on elevated floors with air conditioning. And you had one machine that basically held hundreds of people in thrall because their jobs were about keeping this very, very expensive, half-a-million-dollar piece of equipment busy.
So the original model was what was called an "open shop," where the people who had a job to run on this machine would sit down and load their code, load their program into the machine over some length of time and then work with it for however long it took. Meanwhile, everyone else was standing around tapping their feet, wondering when they were going to be through, because it was their turn. Again, the goal was to keep this thing that was so expensive somehow busy all the time.
So this open shop model switched to a so-called "closed shop," where instead of the actual people who wanted the work done doing it themselves on the machine, you created sort of a layer of separation. Now, instead, people submitted their jobs to the staff, who were more adept and more expert at using the machine more quickly. So we got a level of efficiency that way. And also there would be a queue of jobs to be run that had been submitted, so there was a backlog. Well, of course this meant that people often had to wait longer to get their results. But in general the goal of keeping this machine busy, being better at keeping it busy was achieved that way. And so that sort of introduced this notion of batch processing, where there would be a queue of things to do, and people would submit their work into the beginning of the queue, and then experts who were better at processing the data or these jobs in the queue would then do the work.
Well, the problem still was that there was a large setup time in between these jobs. So, for example, programs might take a few minutes to run, but would take half an hour to get themselves set up and going. So people looked at these systems and saw that they were still being very inefficiently used. And the problem was the I/O, that is, typing up punch cards and then having a stack of cards would take a long time to read a deck of cards into the machine because the machine was much faster than the card reader.
And similarly, on the backside, ultimately you were trying to print out reports of some sort, and the printer was much slower than the computer. And we've talked about this when we were talking about interrupts and I/O, the I/O systems, the idea that the computer could be busy doing other things, and every so often send a character to the printer, the printer being so much slower. But here, in a system that was - there was no concept yet of timesharing, of multiprocessing, of doing more than one thing at once. That just hadn't occurred to anyone. Well, and frankly the systems at the time, they weren't capable architecturally of doing that. They didn't have stacks. That didn't come along until later with Burroughs, the early Burroughs machines, the 5000 series machines, first introduced a stack architecture.
So here we've got this machine that is 100 percent I/O bound. It's sitting around waiting for cards to get read. When the program is finally loaded in, then it takes a short time to run it compared to the time it took just to read it in. And then on the backside, once it's done, now the computer sits around waiting for the printer. So again, looking at this, it's like, okay, how do we solve this problem?
Well, what they used was they used magnetic tape in order to decouple the slow speed of the physical I/O devices from the much faster speed of the computer. So now what happened was, people would punch their cards as sort of like the slowest step of the process. So now you'd have card decks. Then there would be a number of machines which read cards and wrote them to mag tape because mag tape was much faster than cards. So jobs would get punched on cards, and that would happen by some keypunch operator that was skilled at running keypunch. Then those would get written to mag tape. And as many jobs as possible would be put on the tape.
So the tape would be filled up with jobs to do until it was full. Then that would be mounted on a - taken from this machine that its sole purpose was just to read cards onto tape. Then the tape would be stuck on a large mag tape drive connected to the computer. And it had several mag tape drives that were being used in round-robin fashion for input, and another set used in round-robin fashion for output. So this mag tape would get loaded. And as soon as the computer was done with the previous mag tape on a different drive, it would start reading in jobs from mag tape drive number two. Meanwhile, the first one would be rewinding. It would get pulled off and stuck back on the card-reading machine.
And so you could see, I mean, literally it's like, I mean, the people were running around spending all their time trying to keep this machine busy. And you can sort of think of it as like a hierarchical funnel where slow processes fed into a faster process that then fed into a finally very fast process. The upshot of this was that you finally had this very expensive machine. Thanks to having funneled all the I/O into it, you had solved that speed problem. You were still running everything one at a time. That is, again, still no notion of the machine doing more than one thing at a time. But at least now you had solved the input problem so that this thing was - the machine itself was now very quickly loading programs off mag tape, running the programs, and then dumping the output back onto one of a number of output mag tapes.
The printers weren't capable of keeping up with that. And so you had sort of the same sort of, in the way that you have a funnel on the input, you had an expansion on the output. The computer would fill up a mag tape with output and then switch to the next mag tape that was waiting. The operators would pull that full mag tape off, now take it to offline printing, where that mag tape would then be servicing a number of printers that would print the results out. And so that was sort of like the next level of how to keep this extremely expensive machine busy.
So that notion, this whole thing was called "spooling." And it's funny, I didn't realize until recently, when I was doing some buffing up on the history, that "spooling" was actually an acronym. It just seemed to me, I mean, and I'm sure you, Leo, have heard of, like, spooling this off to tape, spooling it off to the printer. Spooling seemed like a thread on a spool, where you were just storing it somewhere. Turns out that it's an acronym. SPOOL stands for Simultaneous Peripheral Operation On Line.
Leo: Oh, you're kidding. SPOOL is an acronym?
Leo: I never knew that.
Steve: I know, I didn't either. I ran across it.
Leo: It makes sense because it's spooling something out like it's a thread. So I thought that that's what it was doing.
Steve: I always just assumed it was a verb, but it's an acronym.
Leo: Holy cow.
Steve: Simultaneous Peripheral Operation On Line, that's where the word came from.
Leo: How funny.
Steve: So it was in 1959 that, essentially after 10 years of this fire drill of trying to keep a very expensive single machine busy, that John McCarthy at MIT, who of course famously gave us the LISP programming language and lots of other computer science in between, he wrote a memo first suggesting the notion of timesharing, the idea being that - which was radical at the time - the idea being that, rather than the entire machine being committed to a job, to a single thing, we would slice up time and handle it in a round-robin fashion or in some sort of a priority fashion. But the idea being, have many different jobs operating in the machine at the same time.
And in fact McCarthy proposed the notion of terminals being connected to the machine in such a way that the individual users all thought they had exclusive use of the machine. He recognized that the duty cycle of someone typing characters, and how rarified character typing is, would allow essentially the same kind of funneling as was being done with keypunch operators concentrating onto cards, concentrating onto tape, concentrating onto the machine. He said hey, you know, you could achieve the same thing if you had real-time connection among many different consoles, all feeding into a machine which was able to look at them all at the same time. And so which was at the time a huge change, conceptually a big leap in the way systems operated.
So we talked last time, when we talked about the episode called "The Multi-verse" - I guess that was actually three weeks ago because we had a Q&A, and then we had the portable dog killer episode. Oh, and a Q&A before that, so I guess four weeks ago. Anyway, we talked about in the multi-verse this notion of - we've looked at this concept of the stack, which is able to store the state of the machine; and that hardware interrupts that were caused by people pressing keys or a printer saying I'm ready for another character, they could steal tiny bits of time just to fill an input buffer or to empty an output buffer, and then switch right back to what was being done before. And so this notion of interrupting the flow of instructions, and saving the state of the machine so that you can restore it when you come back, and jumping somewhere else and doing something, it's the growth of that that created what we now have with the current operating systems.
So operating systems over time have had various structures. Before there was a theory, sort of, of operating systems, before we had enough experience, they were just sort of big, monolithic things. The idea was that programs would be clients of the operating system. There would be sort of this code in the machine which is always there. That's the part that's not changing. Programs would be transient. They would come and go and run for some length of time while their user was running them, and then they would be terminated, releasing their resources.
And one of the things that quickly happened with this notion of timesharing was that the designers, the people wanting to use these resources were a little bit aggressive because notice that one thing happens when we chop time up, which is different from when we process jobs sequentially. If we process jobs sequentially, each job has the entire machine to itself while it's running. That is, however much RAM, for example, the machine has, the job can fill all of that RAM. But with McCarthy's notion of timesharing, things get a little complicated because, if you're going to have many programs running at once, then they've all got to be in RAM. That is, now - and I'm saying RAM because I'm used to today. But that was core.
Leo: Core, yeah.
Steve: Yeah. And 64K, remember, was the kind of size of core these machines had. So one of the notions that was created was this notion of swapping. And swapping was a sort of an early form of what we now call "virtual memory." We'll talk about virtual memory and how that works a little bit later in this episode. But with swapping the idea was that you would have fixed-size partitions in core. And you might have, say, 16 users, using the machine at once, at terminals. And the single computer that was being shared might have an operating system that used some piece of core.
And again, back then the operating systems were all written by hand in assembly language, which we know is sort of just an ASCII-ized version of machine language. So a one-to-one correspondence between instructions that had been manually written by the operating system creators and the code that's being executed. And then whatever memory was left over might be, for example, divided into four partitions. So you might have, say in a 64K machine, you might have - normally these operating systems were relatively small because they weren't doing that much. You might have 8K taken up by the operating system, and then the balance of that memory, like, what, 56K would be divided into four equal-size pieces. Those would be partitions.
And you'd have, however, four partitions and 16 people. So obviously we've got a problem because what these four partitions mean is that the system can only actually be running, can be quickly jumping between four programs at once. So swapping was the solution they came up with. They said, okay, we'll have a drum, and drum memory was - this predates disk memory. And we will have partitions on the drum to essentially hold all 16 of these partitions. And we'll swap them one by one, or actually four at a time, essentially, into the machine's memory. So four programs can be running. And after a program had run its allotment of time, it would be sort of frozen and put out, swapped out onto the drum. And another waiting program would be swapped into one of these partitions. And then it would be, being now in core memory, it had the opportunity of running.
So that was the way, back then, they solved the problem of needing to have essentially more resources because now they were trying to do this timesharing. They needed to have more resources than the machine could technically support because oftentimes you couldn't just add memory, even if you had the budget. Remember that the instructions were so short that the instructions themselves couldn't address more than 64K words of memory at the time.
So originally there was no sort of theory of operating system design because no one had ever designed one before, or written one before. So operating systems started as sort of big, monolithic systems. And they sort of grew awkwardly and not very elegantly. And back at this time a lot of this research was being done, if not all of it, primarily in universities. So the university sort of model is one to say, okay, wait a minute, time to reset. This thing, this operating system has gotten unwieldy. No one really - the people who wrote it graduated from - got their Ph.D.s and left a couple years ago, so we've got these huge chunks of code that no one here knows about anymore. We need to just start over.
One of the approaches that was taken was called a "layered" approach. As people began to apply real sort of academic discipline and thinking to the design of an operating system, they said, okay, wait. Let's try and do this in, like, a layered fashion. The bottom layer will be just the processor's own usage and allocation and the division of time. Then on top of that layer we'll write the sort of the management of memory and the drum swapping code. And then on top of that we need to have an operator, some operator interface that is able to control starting and stopping these jobs running. And then on top of that we need to have input/output so that the operator's console can receive input and then display the output. And then on top of that is user programs that are running. So there was this attention, they began to get the concept of some sort of structure, a hierarchy, a layering of the way things were being built on the OS.
The next problem they ran into, as systems became larger, was again one of complexity. The problem was that they recognized quickly that mistakes that were made in the operating system brought the whole system down. And the operating systems were continuing to sort of grow without end. So at one point this notion of a microkernel was proposed, the idea being to recognize that the kernel is this privileged resource where it's sort of the supervisor, the monitor of all the programs that are running on it. And it provides services, which we'll be talking about in a second, to the programs that are running on the operating system.
The problem with it getting big is that there's a given number of mistakes are going to be made per thousand lines of code, and that just sort of seems to be an immutable fact of the way software is being written. So if that's the case, and if mistakes are bad, and they're especially bad in the kernel because a mistake in the kernel brings the whole system down, as opposed to just a user program being aborted, but everybody else gets to keep running, a mistake in the kernel and it's game over. So the logic was, make it as small as possible. If it's as small as possible, it'll have as few lines of code as possible; and, based on the average number of mistakes per line of code, there'll be fewer mistakes. So the microkernel approach was specifically designed in order to minimize the number of bugs that could bring the system down.
On top of that, then, was created this notion of a client-server model. And that's what we now see in modern-day, for example, UNIX systems and in Windows, where you have a relatively small kernel which provides the fundamental services to both user programs and services which are running on the system. So the idea is that the kernel is very small, and then - but also not very capable. It does the fundamental sort of lowest common denominator things like handling processes, dealing with threads, slicing up time, managing memory, managing file systems. Sort of the core services of the operating - that everything uses in the operating system. And then you've got two classes of programs essentially running on top of that, that is, using those services. You've got service programs which provide additional features, much richer features. And then you've also got the actual client programs that are clients of those services. And that's sort of the model of operating systems that has evolved.
We've sort of been talking, we were talking originally about mainframe operating systems and evolved into this notion of personal computer operating systems. But it's worth noting that there are operating systems of some sort in a microwave oven. As we were talking about complex systems running in cars, there's operating systems in cars. I mean, there's operating systems literally in anything today that is complex enough to have a computer in it. There will be some sort of operating system. So there's also a hierarchy sort of in OS scale. Mainframe operating systems still exist today, running big Cray hardware. Large insurance companies will have huge mainframes, still with the - I don't know if the guys run around in white coats anymore. But large-scale systems that are still doing oftentimes batch processing, that are producing reams of reports and data, and also lots of communications. They tend to be very I/O heavy.
Coming down smaller we've got server operating systems which are still large, capable hardware that are then servicing the needs of many people at once. Then we come a notch down to the personal operating system, running on our computers. We're having many, typically, many things going on at once, but generally serving one person at a time. And then handheld operating systems that are running smart phones and PDAs still have a computer. There's still a core set of services that those are offering to programs that run on them.
And then the notch down are embedded operating systems. And that's the class of OS where you don't think in terms of there being an operating system. There isn't, often there's no file system. There's no sort of generic user interface. This is an embedded OS, for example, what's probably in our cars and microwave ovens, consumer appliances, DVD players and things. We see a very purpose-specific user interface where there's a display of remote control, buttons that you can push. But back there, in there is an OS.
There are a number of companies that license these for the common chips. And they're embedded in also sometimes real-time. An RTOS, a Real-Time Operating System is typically a small system that prioritizes time over features. That is, it guarantees that code that's running on it will be responsive, and that nothing else going on in the OS will tie up the computer's resources more than a certain amount of time. So there's a guaranteed latency in the operating system's ability to respond to the tasks that are running on top of it.
So what do all these operating systems do? What features do they offer? Well, one way or another the mainframe operating systems, server operating systems, personal operating systems, even handheld operating systems, that is, everything above sort of the embedded operating system, they provide a means for loading a program from mass storage into memory. We know now that programs, processes, are typically composed of, well, at least one, but many times many more than one, threads, where threads are sort of an abstraction of a series of instructions that are being executed.
A program could very well just be single-threaded, as is the terminology, where it starts, and the program executes only conceptually one thread of execution at a time. So there's only, for example, one program counter associated with that process because there's only one thread that is at a specific location at any point in time. And as we explained in the multi-verse, you can have multiple threads because a thread is able to spawn or fork another, essentially another conceptually asynchronous stream of execution so that it's able to go, this forked or spawned thread is able to go off on its own and do something else. For example, it may be responsible for reading the keyboard or updating the display or doing some other things that are sort of asynchronous to the main work that the primary thread of the process is doing.
So the operating system is the thing that oversees all that. It has, somewhere, some hardware, a timer which is generating interrupts at a given frequency. That interrupt is what's responsible for yanking control away from individual threads running in individual processes and getting control back to the operating system. When that happens, the operating system takes a look at what's going on and decides if the thread should continue running, if it's time to give it to a different thread in the same process, or if that process has used up its allotment of time, time to bring back to life a thread in a different process that was previously suspended. In which case it restores the state of a given thread as if that thread had never been stalled and just continues executing where that thread let off.
So that's the job of scheduling, which is the subject all by itself of books on operating system theories. Scheduling is amazingly complex, and it's easy to sort of stand back from a distance and say, okay, well, this just has to divide time up among all these threads. But it turns out that when you start actually doing it, there's all kinds of complexity with deadlocks and stalls and competing priorities. And if something has a low priority, and the things with higher priority never take a breath, then the things with low priority never get a chance to run, so that's not good. So it's like an amazing mess of special cases and, again, has been the topic of many doctoral theses over the years.
One of the other things that operating systems do is allow interprocess communication. That is, they support a means for allowing processes to communicate with each other, which is often something that co-working processes want to be able to do. You want to be able to have some sort of communications between them. Even though the operating system is all about interprocess isolation, in some cases you do need to facilitate communication. The operating system provides that mechanism because there is otherwise no direct means, for example, for one program to mess with another program's memory. You want to manage that carefully to prevent mistakes from happening, and also of course to prevent deliberate abuse from being possible.
One of the other things that all of this relies on is memory management, which we've talked a little bit about but never really directly. The idea is that, in every case, all these operating systems basically produce an abstraction of the machine's underlying resources and of time. Time and the passage of time is an abstraction, as well, because from a thread-centric standpoint, a thread just thinks it's executing all the time. The thread has no awareness that time isn't all its own, that it isn't running all the time because control is yanked away from it, and its state is saved. Then the state is restored, and it picks up exactly where it left off. So from its standpoint it's just running all the time.
In fact, we know that's not the case. There's thousands of threads in a typical system that all believe they're running all the time, when in fact on a single-core system only one is actually running at any time. In a multi-core system you can have as many actual threads running as you have cores. So the operating system is essentially creating an abstraction of the underlying resources. One of the resources is memory. Processes can, much as was the case back in the days of early timesharing, where it was possible for a process to be sharing memory with other processes. Back then you could only have as many partitions of a certain size as there was room left over after the operating system had been accounted for.
In this day and age we've got this notion of virtual memory and the ever-present, for example, Windows users are familiar with the paging file, or sometimes still called the "swap" file, the idea being that processes believe they have a much larger amount of memory than they actually do. And this is hidden from them. This reality of how much memory they have is hidden by the operating system. If the process attempts to execute or read or write memory outside of what's currently loaded in the system, the operating system, using this virtual memory technology, will get control, literally at the instant of a program's attempt for read or write memory that is not physically loaded at this time. That creates what's called a "page fault."
The operating system gets control at that instant, sees what's going on, and says, ah, the program is trying to access memory which we've had to swap out to the hard drive because of limited resources. Other programs needed to use physical memory. So what happened was that memory was paged out, as is the term, onto the hard drive while another program that was running was actually using the underlying physical memory. So when a program attempts to use memory which doesn't physically exist, it's been paged out, the operating system gets control, sees what's going on, decides, okay, do we page this in, do we suspend the process, there's a whole bunch of decisions to be made. As I said, scheduling is itself a huge issue for operating systems.
Typically what happens is, because physical I/O is necessary and always takes time, the act of the process touching memory that isn't physically located or that isn't physically present in the system, suspends it. The memory is cued up to be written or to be read from the system, except that there's no doubt some other process using the physical memory that's in use right now. So that needs to be written out to the swap area so that the memory which had been swapped out could be brought back in. Like I said, this gets really complicated very quickly.
But the bottom line is, what ultimately happens is, the so-called swap file or paging file or virtual memory file, it's an oftentimes very large extension of the physical memory in the system. And we talked last week, or last time we were talking about this, about the notion of caching, where the processor had very fast access to its own registers. And then it would also have a cache of maybe 8, 16, 32K in the old days, megs now, maybe several megabytes. And that would be a copy of what was in physical memory that would be maybe hundreds of megabytes or a gigabyte. Well, you can see that that forms a hierarchy, from the register in the chip to the cache. And oftentimes there's a Level 1 cache and a Level 2 cache - Level 1 being smaller and instantly fast, Level 2 being larger and not quite so fast. And then you have physical memory.
Well, virtual memory is the last largest layer of this hierarchy, where it represents sort of this super physical memory that is really slow inasmuch as it's written out on mass storage, which is typically a rotating medium in this day and age. And so it's very slow to read and write, but it's very large. And so that sort of - the virtual memory forms the final layer of this multilayered memory hierarchy where we get smaller and faster all the way up to the individual register of the machine. And the operating system manages all of that.
One of the other things that the OSes do is support a file system so that the programs running on the operating system are able to, again, see an abstraction known as files which exist only because the operating system creates that abstraction for the programs that are running on the OS. It knows about the physical organization of the media, the mass storage on the system, but hides all of those details from the programs that are running under it.
Leo: Steve is taking a small drink at this point. For those of you listening at home.
Steve: My throat was getting scratchy. Believe it or not.
Leo: Well, you said it wasn't a lecture, but this is a lecture, but a great lecture I'm really enjoying. It's fascinating.
Steve: So one of the things that happens is that, notice that we've got floppy disks, which are small, 1.44MB, for example, in the famous instance of the IBM PC, and have a physical organization with some small number of sectors around some small number of cylinders on two sides. But we might have a multi-platter hard disk that's got thousands of sectors arranged around millions of cylinders and multiple platters. Well, the operating system's drivers understand the specifics of the hardware that they're driving. All of that is hidden from the programs that run under the OS. The program sees this abstraction of named files, and so it's able to say to the operating system, hi, open for me this file of this name and let me read it.
And one of the main innovations that operating systems had was that this concept of I don't care where it is, I don't care what media it's on. And in fact, even as far back as remember UNIX and CPM, there was this notion of this could be paper tape, or this could be a hard drive. There was this disconnection with the I/O system and the physical devices that were running underneath the operating system. So this abstraction is provided by the operating system. And then there's also an abstraction provided by I/O drivers, where the operating system would have a defined mechanism for interfacing to, for example, file system devices. And different drivers would be used to interface the operating system to very different storage mechanisms.
So you'd have a floppy driver which is able to answer specific uniform requests from the operating system, translating them into the instructions needed to work with the physical floppy drive hardware. And you'd have a mass storage driver, for a given hard drive, which is able to take the same operating system requests as the floppy driver, but in this case translate them into very different specific functions for its hard drive. So again, we create these boundaries of abstraction where the entities on each side of the boundary have agreed upon a protocol that is uniform, typically documented, and not changing. So drives can get bigger. The drivers may need to change over time. But that agreed-upon protocol at the boundary of the driver and the operating system, or even the operating system and the application program, that doesn't change. And so that allows things on either side to evolve over time, which was another major innovation in operating system design.
And then finally, the last thing that's come along in recent operating system architecture, on top of file systems and I/O, is security, where there's an increased notion of both wanting to prevent mistakes from compromising the system and wanting to prevent deliberate malicious intent from compromising the system. So there's this notion, which is layered on top of everything, of privilege. The idea that certain files can be privileged, certain memory regions can be privileged, essentially there's this notion of ownership and rights which are applied to all of the objects within the operating system, right down to and including the individual thread of execution. Threads can own, objects can have rights, and everything's scaled all the way up from threads. And all the other abstractions that the operating system creates can have rights.
And so, in addition to everything else the operating system is already doing, it also has to check the security. It has to make sure for, I mean, at the level of the individual action, it has to make sure that that action is permitted, that the thing asking for something to be done has rights to do that to whatever it is asking for it to be done to. So OSes have gone from starting off being basically a loader of code, for a system which was struggling to be used enough, to what we have today. I mean, we're all sitting right now, I mean, you're hearing this through probably a device with an embedded operating system.
Leo: Almost guaranteed. I don't think you can play it back any other way.
Steve: No, there's no other way to play it back. Leo and I are sitting in front of computers that are right now, as we sit here, have thousands of threads all running, all competing for resources, using the hardware, somehow staying out of trouble. And to me it's amazing.
Leo: Amazing. Absolutely. Absolutely. It's truly a miracle that these things even work. It's amazing. Are you done?
Steve: I think that's it.
Steve: That's operating systems.
Leo: I tell you, you've got to put these all together on a DVD or something, and we'd have the complete set of building a computer from scratch.
Leo: Yeah. Question 2. Actually 3. No, 2. Gary Robinson in, now, this is a good one, Magherafelt, Ireland, asks a question: What happens after arbitrary code execution? Hey, Steve and Leo. Thanks for the great show in Security Now!, especially the ground-up computer principle series. I've been listening for just over a year now and have finally caught up on all 250+ episodes. Leo's right, I'd much rather listen to podcasts during my hour-and-a-half daily drive than the same boring thing on radio. It's true. That's our plan, anyway, our evil plan.
I've a question about what happens after a bad guy has found a vulnerability in an application and gets that arbitrary code to run. Based on your descriptions of this subject, bad guy uses some buffer overflow or some other vulnerability to get the arbitrary code into memory, causes the program counter to jump to that arbitrary code instead of returning to the previous function, or however else they've corrupted the program counter. When the bad guy's arbitrary code finishes, does the application crash? Is the bad guy smart enough to populate the arbitrary code with the correct return address so the application continues as normal? If the app will always crash after arbitrary code execution, isn't that a good indicator for us to know something bad may have just happened?
I know it could be hard to tell this application crash apart from the other standard common application crashes, but might be one way you could kind of be aware that something may have gone wrong, and we should run some diagnostic tool or virus checker before going to that banking site. I would be delighted to know what you think on this. Please keep up the good work on Security Now!. Gary.
Steve: I thought it was a great question because we've absolutely not even once discussed that.
Leo: We've talked about the mechanics of it, but not what happens afterwards.
Steve: Yeah, exactly. And the answer is, we don't know.
Leo: Well, couldn't they just push the return code on the stack?
Steve: Precisely. I love the fact that in his question he was talking about the program counter and setting it to something. And you're exactly right, Leo. It may very well be that the attack could look like a subroutine. In which case, if it was clever, I mean, if it deliberately wanted not to crash the application, it could push the current program counter on the stack, immediately push any registers it was going to modify on the stack, do whatever nefarious things it wants to do and then, just like a subroutine, pop the registers off the stack that had been modified and do a return instruction, which will continue just like nothing happened.
Leo: In fact, exactly how a subroutine works.
Steve: That's a subroutine. Now, in practice, it's more often the case that whatever function was trying to execute fails. That may be the whole app crashing, as Gary suggests. Or it may be that, like, edit/copy doesn't work. That's a bad example because edit/copy often doesn't work for other reasons. But it might be that what you try to do fails, and you kind of think, oh, well, I wonder why that was. Maybe it's not feeling good today.
Leo: We'll never know.
Steve: We'll never know. So there isn't a definitive answer because three things could happen: nothing at all, the application runs, meanwhile something bad happened in the background and you never noticed; or part of the application no longer works, but the rest of the application manages to continue limping along and sort of doesn't need to be restarted; or the application completely just crashes. It disappears from the screen, it locks up, something bad happens. And again, I got a kick out of him saying, but, you know, that happens all the time anyway, so would we really know. I would say more likely, from what I've seen, it's more common that the application does die after having achieved the hacker's goal. The hacker is generally much less concerned about a smooth exit than they are about getting their stuff to run, figuring, exactly as most of us would, oh, well, it just crashed.
Leo: Yeah, crash.
Steve: I mean, it's funny, I've sometimes spent hours working in a graphics editor on something, and then thought, oh, goodness, I haven't saved. Then I'll quickly save my work because, similarly, I've spent hours working on some graphic stuff, and then it just disappears from the screen. It's like, (groan), why didn't I, you know, save it.
Steve: So both things happen.
Leo: And I believe, correct me if I'm wrong, but this special code only has to execute once. It executes to install the trojan horse, which will then continue to execute normally without crashes. So it's not like you're going to crash all the time. You're going to crash once when the bad guy gets the code injected. And after that the trojan will run.
Steve: Correct. For example, it would establish a connection to a remote server, which it turns out is very simple to do, just a few instructions because Microsoft in the case of Windows provides an API for that. Then it would download a block of code which allows it to get much more of its own code into your machine. And that code might - just might be another process. It would spawn another process with that code in it, which it would then run. And none of that takes up that much space. I looked at shell, as it's called, shell code, which is this kind of stuff that does this. And it's, you know, it's written in assembly language, and very tight, and doesn't take up much space. But by the time the app crashes, the other thing is now running in your system. It then tends to be a bootstrap to go and get much more code from the remote server, and it just starts shuttling stuff into your computer in the background while you're thinking, huh, I wonder why my word processor just died. I'll just fire it up again and hope I saved a copy recently.
Leo: That's the gift. All right, Steve. I'm ready. I've got my thinking cap on. Some may call this a dunce cap, but it's a thinking cap. And I'm prepared to think. Tell me, tell me, sir, how do we - where do we go now that we've been building a computer?
Steve: We've established a beautiful foundation over the series that we've covered so far, the idea being, of course, to demystify what a computer is, how it works, what it does. We know that there's a main memory, and there's a program counter that points to a current address of a word in memory, and that that word is composed of a bunch of bits, and that those bits are broken into fields, the most important field being the so-called "opcode," the operation code. The pattern of bits in there describes to the computer or specifies what the computer should do with that instruction, each one of these being an instruction. And the rest of the bits in this word provide the parameters for the instruction. For example, it might be "clear the accumulator." Or it might be "load from a location in memory into a register," or "store the contents of a register into a location in memory."
So, and we've looked at what the stack does, the notion of the convenience of being able to have sort of a scratchpad where we can push things on, and we can pop them off in the reverse order so that as long as we kind of keep track of our own housekeeping, we don't have to pre-set aside areas to read and write into. Everybody's able to share this as long as they do so carefully. And so we've talked about subroutines. We talked about hardware interrupts and how interrupts from physical I/O devices which are much slower than the computer is running very fast, are able to be used to yank control away, yank the program counter to somewhere else, allowing the computer to service this interruption, send characters out to a slow printer, for example, and then return to the code that was running as if nothing had happened.
So we've got this architecture, and that's where everything began. What I want to do today is describe the evolution from there to now. That is, what happened after this and what were the pressures that were on the industry and on the engineers and on the companies that evolved this from that very clear and sort of basic beginning to present-day machines. So one of the things that happened, as hardware became less expensive, and programmers were complaining to the engineering team that, you know, like hey, be nice to have some more registers here. We've got the accumulator, but we're having to use it sort of as a scratchpad for main memory. We can't really hold much of what's going on in the accumulator. So they said we need some more - we need more registers, more accumulators.
And so the engineers said, okay, well, that means we're going to have to make the word longer because we're going to need some more bits in the instruction to specify which register you want to load and store and add and so forth. And so if the engineer says yeah, okay, fine, do whatever you - or the programmer said that, oh, that'll be fine. So words got longer. And then some of the engineering guys said, well, you know, a simple instruction like "clear a register," that doesn't have a memory address. It doesn't need a memory address. So that one could be shorter. And how about, the engineer said to the programmers, what about if, like, you could add two regions of memory and store it in a third. And the programmer said, oh, you mean like with one instruction? And the engineering guy said, yeah, why not? And then the programmer said, well, that would be fantastic.
Well, of course that would mean that the instruction, a single instruction, would need to contain three memory addresses for an add. It'd have to be, you know, the address of one of the operands, the address of the other operand to be added, and the address of the location where the result would be stored. Well, memory addresses take lots of bits in order to address a region, you know, enough memory. So this kind of an instruction would be really long.
So what evolved was, and this was a major break, was the notion of variable-length instructions. We started off with fixed-length instructions on our very simple machine, where most of the instructions were memory reference, and only referencing one piece of memory at a time - load from this memory, store to this memory, add from this memory to the accumulator, or/and from the memory to the accumulator as we've discussed. But as machines became more powerful, essentially because hardware prices were coming down, integrated circuits were happening, it was possible, it was practical to put more complexity into the systems. The programmers were demanding more features.
And so if you were going to have an instruction which had three memory addresses in it, it was going to be really long, multiple bytes of memory. But if you had an instruction that was going to just say "clear an accumulator," that could be one or two bytes. So we broke from this notion of fixed-length instructions into variable-length instructions. And at that point all bets were off because now we could make our computers very complex. They could have, for example, you could have an instruction that was a prefix byte that said, you know that opcode we had, well, here's a whole 'nother set of opcodes. So now we weren't limited to eight or nine or 12 opcodes. We could have hundreds. It was practical to have many more possible operations.
So the engineers said, okay, well, what are we going to do with all this power? And the programmers said, well, since you're asking, there's some other instructions we'd like to have, like how about - we do a lot of things with linked lists, so how about an instruction to do linked lists? And the engineers said, wow, that'll take, like, a lot of manipulation. And the programmers said, well, what are you guys doing? Go do that for us. And so the engineers said, okay, write that down, Harvey. We've got to go implement that.
And then they said to the programmers, what else do you want? And they said, well, we do a lot of subroutine calling. And we know that when we call a subroutine we need to save all the registers. But it's tedious to have to, like, push this register and push that one, then push the next one, and push the one after that, and then before we exit to pop them in reverse sequence. How about an instruction for calling subroutines where, like, we had bits in the instruction which specified which of the registers to preserve in the subroutine call. And the engineers said, oh, we like that, that's cool. Write that one down.
So the programmers got very creative with the stuff they wanted. But then the engineers who left that imaginary meeting went back to their own area and said, okay, now we're in trouble. How in the world are we going to arrange to implement instructions this complicated? I mean, these things had sort of gotten out of control. And remember that the very simple fixed-length instruction computers were, for example, in the case of a PDP-8, they were contained on just three 8x10 circuit boards that weren't very dense, that just had simple AND and OR gates on them because everything happened in a single cycle.
Well, that's one of the other things that changed. As these instructions got more complex, no way were you able, for example, in the case of pushing a random, well, not a random, a specified subset of registers, well, this instruction was going to take, could take a long time to execute, one instruction, because the instruction would specify some number of registers get pushed. So it's going to have all these memory cycles after it reads the instruction, all these memory cycles to push these registers in succession. And there's going to be a reciprocal instruction to pop some subset back off. So that's going to take many cycles. And an instruction for managing a linked list, there actually was such a thing in the VAX. The DEC VAX had a linked list.
Leo: Really. That's such a...
Steve: Oh, yeah.
Leo: ...high-level primitive, I mean, golly.
Steve: Yeah, it was amazing how complicated these instructions got. There were some that were like, the programmers said, well, you know, we spend a lot of time trying to find the first bit which is set in a long run of memory. We want to, like, find a bit. How about an instruction that just reads memory until it finds a bit set?
Leo: Oh, geez.
Steve: I mean, and there is such a thing. There is that instruction. And so, for example, in the famous Intel x86 instruction set, instructions can range from one byte long to 17 bytes long.
Steve: There's that kind of range. So the engineers who actually had to create a machine that would deliver this kind of power said, okay, look. We cannot...
Leo: Give us a break.
Steve: Yeah. We cannot design hardware that, I mean, do you know how many AND and OR gates it would take to do this?
Leo: We're not going to do your job for you, for crying out loud.
Steve: So what they said was - I mean, and it was probably an instruction like this pop multiple registers thing. They said, well, now that we've got all these cycles that it's going to take, we need microcycles. We need something like a computer in the computer, which can be smart enough to implement very complex instructions. And so someone said, what are we going to call that? Well, we'll call it "microcode." And it's like, oh, okay. So what they did was, and this was a revolution in computer architecture, was they actually - they dropped the idea of unbelievable rats nests of AND and OR gates, essentially, to try to implement this ever-expanding aggressive instruction set. And they said, wait a minute, computers have sort of flow paths. There's an adder that can get its data from different places, and there's a memory buffer that has the contents of memory, and there's maybe a multiplier that has its input.
So imagine a very different kind of instruction word, probably long, many bits. But the bits in this so-called microcode, they just enable the flow of data along these different data paths at different times. And so just by sort of opening and closing these data paths in sequence, we can implement a complex instruction in multiple small steps so the outside world still sees it as a single instruction. Inside, the microcode has many small steps which it goes through to pull off this complex instruction. So the programmers don't see it on the outside, but the engineers who engineer the microcode, they came up with a whole new way of thinking about how to engineer a computer by...
Leo: I always assumed that all processors had microcode. I didn't realize that was a later invention.
Steve: Yeah, it was. And it was something that was developed through this necessity of they just couldn't, you know, it was like how are we going to do this with just AND and OR gates?
Steve: And so the idea was that you'd have the so-called "microstore" was very fast because it had to run in, like, much faster than main memory. Main memory was chunking along, just reading instructions. But the microcode had to run many times faster in order essentially to have all those little microcycles fit within one major instruction cycle in order to get the job done. And that also meant that the instructions no longer all took the same amount of time. As we said, more complex instructions would take more time. But that allowed this flexibility of instruction length, for example. The microcode, if you were doing a fancy operation like adding two words, both in main memory, and storing them to a third, that was no problem now. That instruction had many more microcode steps in it in order to get the job done.
And so this really changed the way the system was working, the internal design of the computers. And so what that also of course did was create a renaissance in complexity because, when the programmers heard that now there was like a computer in the computer, then they went hog wild with instructions they wanted because it was like, I mean, it was very much like, you know, we all have heard the slogan Apple has, "There's an app for that."
Leo: Right. There's a microcode for that. There's an instruction for that.
Steve: There's an instruction for that. You know, anything these guys, these programmers could think of, they'd say, hey, how about one that does this?
Steve: And the engineer said, really? You need that? Oh, yeah, yeah, I needed that yesterday, and I wish I had that. Okay, fine. Well...
Leo: So it does operate faster if you put it in microcode in the chip than if you wrote it. I mean, it's easy to write a linked list in a higher level language, or even assembler. But it's faster if you put it in microcode.
Steve: Well, and, see, that - the perfect question. Because remember that main memory was still very slow.
Steve: And it was very expensive.
Leo: Got it.
Steve: So what had happened was computers turned into interpreters. It was...
Leo: Right. To save memory.
Steve: Well, exactly. So what you had was super powerful instructions that then inside the computer it ran around and pulled it off. But that meant that you were saving main memory and you didn't have to fetch it that often. That is, if fetching from memory was time constrained, and it was, because main memory was slow still, then - it was core. Remember, every time you had to read, reading was a destructive process. So then you had to rewrite what you'd read because reading, the way to read from core is to set everything to zero, and you see where pulses come out in the cores that switched from a one to a zero, meaning that those were ones before you set them to zeroes.
Leo: And now they're nothing.
Steve: Now they were nothing, so you had to reset them to ones again.
Leo: Holy moly.
Steve: So this gave a lot of opportunity for the microcode to run much faster, and it meant that your code density and main memory was very high. Essentially, we'd created an interpreter. And we know that an interpretive system allows very dense interpreted instructions where the instructions themselves are doing a lot.
Leo: Ironically, we do have a constrained hardware environment these days with mobile devices, with cell phones. And they, for the most part, use interpreted virtual machines to save space.
Steve: Well, so, yes. What happened was, as technology moved forward, memory became faster. And more than that, it became cheaper. We just got better production rates, and we got more clever, and we began to not be memory constrained. And at some point some engineers said, you know, let's profile actual code that's in the wild out there that's running. Let's analyze the code and see what instructions are being used. The other thing that had happened during the same period of time is compilers came into use. Back in the beginning it was the programmers writing all this in machine language/assembly language that said, oh, I'd love to have an instruction for doing linked lists. Then I wouldn't have to be writing that all the time.
And so over time compilers came into play. It just made sense to create a higher level language that itself would then create the assembly language, the machine language that would drive the machine. And but here was a problem. Compilers, it turns out, couldn't make use of all this fancy stuff. The compiler, the nature of the compiler - see what I mean?
Leo: Yeah, yeah.
Steve: Yes. Humans could; but the compiler was like, wait a minute.
Leo: I don't want to do that.
Steve: Yeah. Well, it just - it didn't make sense. What had happened was instruction sets had become very specialized. And individual...
Leo: Right. So it's great for assemblers, for people who write, like you, who write machine language. But not for an automated system.
Steve: Not for a mechanized code generator that would sort of always be using the least common denominator. And so...
Leo: Well, that's a good point, too, since not every chip would have that capability.
Steve: Right. So when the computer architects of the past profiled the actual instructions that were being used, they discovered the familiar 90/10 rule. You know, 10 percent of the instructions were used 90 percent of the time.
Leo: Right, right.
Steve: And then they said, wait a minute. If these fancy instructions are not being used, look at the expense in this machine for a linked list instruction. Or these wacky bit search nonsense that we ended up getting talked into by the programmers. Compilers never use these. Yet even though they're not using them, they're in the machine. Every one that we push out the door has very expensive-to-implement instructions that nobody uses because we fired the assembly language programmers, and we hired Fortran programmers or Cobol programmers. And the compilers now do the job that the assembly language guys used to do, and the compilers aren't using any of these instructions, which every single time we push one off the assembly line we're paying for all this microcode store to implement things that no one's using. So there was a complete rethink at that point. And...
Leo: Where does this happen? Does it happen at Intel, or is it happening elsewhere?
Steve: It was happening in - this was sort of - it was happening in universities, actually. The SPARC came from Stanford and Sun, or Sun later. And the MIPS RISC machine was at Berkeley. And the ARM from Acorn was in the U.K. And so that's where this notion - they said, hey, let's reduce the instruction set complexity. And someone said, ooh, RISC, R-I-S-C, Reduced Instruction Set Complexity, or Reduced Instruction Set Count, it's sometimes known as. The idea was that they realized they'd sort of gotten way off track with these incredibly expensive instruction sets because, they said, wait a minute, let's - hold on. Let's instead kind of go back to where we were. That is, let's have very simple instructions which, now that main memory was fast, had caught up in terms of the speed we needed, and it got cheap.
The other thing is that memory costs dropped through the floor. So you no longer had to have an entire operating system and 20 shared timesharing partitions all fitting somehow into 64K, which actually was done once. No, now we had, at this point, megabytes of memory, easily, so code didn't have to be as dense. And essentially there was this wave of simplification that said, wait a minute, let's go back to - oh, the other thing is that, remember, we went to a - we had variable-length instructions? Well, the only way you could handle a variable-length instruction was in microcode, having microcode that knew that, oh, look, this particular opcode requires these different parameters so we've got to go fetch those now. And that means this instruction has more bytes after the opcode of parameters for the opcode.
Well, all that was thrown out the window. They went back to a radically simplified architecture. One of the most popular is called a load-store architecture, which is now, currently, today, the most popular of architectures, the idea being that you have exactly one instruction for loading something from memory. You have exactly one instruction for storing something into memory. And all the other instructions deal with a much richer set of registers, maybe 32 is the number that most of these architectures have now. So you've got 32 registers. And you can do things like add register one to register two and store it in register three. You cannot add register one to the contents of memory. The other thing, again, in a code analysis, what was seen was that that wasn't being done that often.
So they said, okay, let's try a whole different approach. Instead of instruction, instead of, like, having all instructions able to add and subtract and multiply and do things with main memory, let's have a load instruction and a store instruction. And then everything else, our jumps and our branches and our map and logical things, those only work with the contents of what's in the registers. And we're going to have - oh, that allows us to go back to a fixed length. And that means that every single instruction, like back in the old days, is a single cycle. That is, one cycle. So now that main memory can catch up - or, if not, then caching had come into vogue.
So then there was this notion of they noted that programs tend to sort of stay in the area where they're executing a lot. And so the concept of caching came in. In order to provide a RISC processor that was able to ask for an instruction every single cycle because it was able to execute an instruction in a single cycle, the cache would feed instructions at that rate. And so the idea was lots of little simple instructions executed very fast. And then other things could be done, like you could, if you knew that the instructions were all the same length, you could fetch a block of them ahead of time and bring them into the cache so that they were ready in case the processor wanted them.
So that's the - there's a little bit of, like, coming back where we came from, but with a much more mature understanding of how to get the most bang for the buck. And the major thing that happened is that this rethink of architecture allowed for a dramatic simplification. Now, Intel was, and even to this day is stuck with this insane instruction set, which I happen to...
Leo: How interesting.
Steve: ...know by heart.
Leo: Yeah. The x86. They wanted to get out of it. They wanted to stop.
Steve: Yes. The x86, with its variable byte length between one and 16 bytes, they're stuck. The problem is, the reason they're stuck is backward compatibility. It matters too much, I mean, they've always wanted to, like, break free. But how can they ever do an incompatible chip that, like, didn't run everything? And they've just...
Leo: In fact, they were going to, and AMD was the one that put the screws to them by saying, well, we're going to keep doing x86.
Steve: Exactly, yes. And so what happened was, it was, for example, a company like Acorn, back in the late '80s, that was trying to come up with a successor. Acorn was the preeminent personal computer company in the U.K. There was Sinclair, and Commodore, and then also some U.S. companies had strong sales. But Acorn was the number one. They had a system based on the 6502, which was originally the processor technology - I'm sure that was the company. Although processor technology was a different company also.
Leo: Yeah, there was - oh, my. You know, it's funny how blurred it all starts to get. The 6502 was in the Apple II and the Atari, I remember.
Steve: Oh, yeah, yeah, exactly.
Leo: A crappy chip, by the way, the worst instruction set. Oy.
Steve: Careful, there, Leo.
Leo: I liked the 68000 instruction set, the Motorola instruction set. That was a very clean..
Steve: Well, but that was a whole different generation. It's really not fair...
Leo: It was, you're right. Oh, no no no, I know. But even the Z80 I think was a little bit cleaner than the 6502.
Steve: Well, that, too, was a different generation. What the 6502 was, was incredibly inexpensive.
Steve: And it had a beautifully...
Leo: It was MOS Technology.
Steve: M-O-S, that's it, MOS Technology, thank you.
Leo: Thank the chatroom.
Steve: MOS being Metal Oxide Semiconductor, of course. It had a beautiful instruction set. And what that allowed was it had just enough to get the job done, which meant it had a low transistor count, which meant it had a small die size, which meant your yield on wafers of the time, that is, you got many more processors per silicon wafer because the individual dies were so small because the transistor counts were so low. And that meant that made these processors incredibly inexpensive. And that's why Commodore, Apple, and everybody jumped on the 6502. So what the Acorn guys did was they took a similar approach. Because they'd been 6502 users, they took a similar approach to a brand new design, and they were going to do a RISC processor. So naturally they called theirs the Acorn RISC Machine, ARM.
Leo: Oh, my goodness. I had no idea.
Steve: And they ended up with a beautiful, lean design.
Leo: Right, right.
Steve: It was the small transistor count. Now, the other thing that is expensive in large dies and large transistor counts is every transistor you've got not only takes up space, but it takes up a little bit of power. And so the other problem Intel has been fighting valiantly all along is they've got this insane instruction set that they can't get away from which they're trying to make faster and faster somehow. The problem is they are just dogged by power consumption because this complex instruction set requires a huge amount of power, I mean, like, disproportionately so. Just to feed the latent complexity in the instruction set takes its toll. Whereas the ARM, the Acorn RISC Machine chip, because they had the idea, I mean, they came a little bit after the SPARC and the MIPS, so they were able to learn from those designs. But they also came from a 6502 era of understanding low complexity. And interestingly, they were a small company with a small design team. And they didn't have fancy tools. So they also had no choice but to do a simple, small, RISC machine.
Well, even though it couldn't compete then with the SPARC and the MIPS, it competes now because of its low power consumption. And what happened was, in the evolution of Acorn, they ended up renaming themselves Advanced RISC Machines Ltd. Again, ARM. And they're a licensing company that licenses the intellectual property of the ARM RISC core and allows other people - they don't produce them themselves. That's why, for example, Qualcomm has the Snapdragon, which is an ARM-based chip. And it's why Apple with their A4 is an ARM-based chip.
So Advanced RISC Machines is an intellectual property licensing company, licensing this beautiful core which has, because of its - it was always an efficient design, but it was a small design because that's the only thing they could design. But "small" means very inexpensive, just like it did on the 6502; and it means there just aren't that many transistors. The first ARM had only 25,000 transistors, compared to now you've got hundreds of millions of transistors. So you had very low power and, because it was a relatively state-of-the-art design, the right architecture. And that's why everybody who's doing handheld devices has ARM-based technology.
Leo: Makes sense.
Steve: You get a lot of - you get the processing power you want, and it's inexpensive because it's a small die, and it's inexpensive in power consumption because it doesn't have that many transistors hungrily burning up power. And the ones it has are all in use. The other problem with the Intel instruction set is, as we saw, it's inefficient due to the 90/10 rule. So you have all these transistors sitting around not doing much because they just end up being the exception rather than the rule. Whereas this lean RISC design has all your transistors busy all the time, so that's efficient, too. You get a lot of use out of the power that the transistor is using. And that's the story of RISC.
Leo: I love it. And very timely because of course here we are sitting, looking at RISC processors all the time now, in our phones, in our iPad, and all these small devices.
Leo: You know, Android, which is originally running on an ARM processor, is being ported to the Intel Atom processor so that Google TV can run on an Atom-based system. Of course, it runs on a Java Virtual Machine, so it's probably a simple port to make. You just need a virtual machine to do it. But I wonder what kind of efficiency hit they're going to take.
Steve: It'll be interesting to see. In two weeks what I want to talk about, and I think this exhausts me in talking about technology, is the need for speed. Because there's a fascinating story that's the last thing I want to cover on what has been done inside these machines to give us the speed they have. People are not going to believe, I mean, it is - it just grays your hair to think of the engineering that has gone into this. We're going to talk about pipelining and multiple issue and out-of-order execution.
Leo: Ooh, I love that stuff.
Steve: Branch prediction, superscalar stuff. I'm going to lay that out. And at the end of that episode everyone's just going to look at their little, their iPod or their pad or their phone and think...
Leo: Oh, my.
Steve: ...that stuff is in there. And, I mean, it makes you appreciate, I mean, I'm just stunned by what went into the technology that we so easily take for granted today.
Leo: It's really true. It's so complex. And yet it works.
Steve: It does.
Leo: John Hughan with Question 2 in Austin, Texas, wonders why microcode reduces complexity. This is from the last episode where we talked about "RISCy Business." Hey, Steve. Great show, as always. I'm hoping in the upcoming Q&A that you might be able to explain in a bit more detail how having microcode made engineers' jobs easier in terms of the number of AND and OR gates required to implement complex instructions. Why is it not the case that having a "computer within a computer" just meant that those AND and OR gates had to be implemented in the microcode area in order to run those instructions and manage the "main" area? Or, if microcode allows those types of instructions to be executed in a fundamentally different way that doesn't require those AND and OR gates, why can't the rest of the instruction set be implemented that way? Keep up the great work.
So he's saying really, when you were saying that one of the things that came up was that they were building into the silicon these fancy instructions, like linked lists, so they came up with microcode as a way to implement it within the silicon, almost in software. But was it software? Or is it - does it require actual AND and OR gates?
Steve: Yes, exactly. And I liked John's question. I thought it was a really good one because he's saying, well, okay, all you've really done is move the complexity from one place to somewhere else. Why is it any less complex? There's two things that microcode does. The first is that, as I described it, the microcode which is used to implement instructions is generally a long word, that is, it's many, many bits wide. And the bits are turned on and off in order to open and close paths through the system in order to implement the instruction. So you route some bits of the instruction word to the adder. And then you route some, like the memory fetch results to the adder, and those get added. And then they go into a buffer.
And so one of the real powers of using a ROM, a Read-Only Memory, is as a lookup table. If you imagine a matrix, a two-dimensional grid, where you have a bunch of inputs on one side, that is, like on the horizontal, and a bunch of outputs on the vertical. And this grid is filled with a collection of ones and zeroes at the intersections such that when you select one of these addresses, some number of bits change on the output. What a ROM does, it allows you to have an arbitrary association between the inputs and the outputs. And if you were to implement that same arbitrary association in discrete logic, you'd just pull your hair out trying to, with standard ANDs and ORs and NANDs and NORs, inverters and all that, trying to wire up what you can do so easily with a simple table.
So the first part of this is that a table lookup, as it's called, can beautifully, with almost no components, just like a little ROM, can allow you to map an arbitrary combination of inputs into a different combination of outputs. So that's a huge simplifying thing, which is one of the things that microcode is, is a table.
And the second part is that, by doing a big job in steps, you don't have to do it all at once. So microcode implies multiple steps to achieve some end. So without this notion of multiple steps, all the instructions you had, no matter how complex they were, were just going to have to happen, bang, just in a single cycle. I mean, it would be like an amazing amount of work somehow almost magically being done, bang, all at once. Instead, if you've got microcycles, then you're able to break up a complex job into many smaller steps, each of which is more simple. So that's the second way that you get simplicity is by sort of factoring all the kinds of things the computer might do into simpler, smaller steps, and then allowing yourself multiple steps to achieve a bigger result. And as a consequence, the savings are dramatic, such that virtually all systems today have used microcode in order to get the job done.
Leo: All right, are you ready for another question, my friend?
Steve: Yeah. Or an observation, in this case.
Leo: In this case, from Simon in Canada, with a security data point from a hospital operating room: Hi, Steve. This week my five-year-old daughter underwent a relatively minor surgical procedure, but still one that required full anesthetic. Oh, that's always scary. Standard operating procedure - literally in this case - dictates that when possible one of the parents attend until the point the child is unconscious, which is why I found myself standing in an OR of a well-known children's hospital, clad in a surgical gown, mask, and paper booties.
After the anesthesiologist had done his stuff, and my little angel was peacefully sleeping, I had time to take in a little more of my surroundings. It was then I noticed - oh, dear - that the rest of the nurses and doctors, who currently had nothing to do, were watching a World Cup game streaming live on one of the operating room computers via a Flash player. Now, I obviously have no idea whether this PC was segmented from the more critical systems in the OR, but I do know that the screen immediately next to it was displaying the medical imaging. I also know that, A, there is at least one computer with a Internet connection in that operating room; and, B, it's got Flash installed - one of the most fertile attack vectors for recent malware. Just an interesting observation I thought you might be interested in sharing with your listeners and viewers. Thank you, Steve and Leo, for your great work. Wow.
Steve: Yeah. You know. Not surprising, unfortunately. I don't know what it will take for the word to get out that this kind of thing is a problem. I mean, we had, remember, UK hospitals that were almost shut down by Conficker getting into their networks, into their operating room computers...
Leo: It's amazing, just amazing.
Steve: ...and causing problems. So you've just got to shake your head. I mean, there's nothing we can do about it. But it's worth just sort of being aware of it.
Leo: Question 4, James Truesdale in St. Louis, Missouri had a RISCy question: Listening to the podcast, heard your explanation of how instruction sets grew due to programmer requests for more complex instructions to make their life easier. I had this thought: Instead of adding instructions, why not just use macros - you mean, do more work? - for commonly used operations?
Steve: Well, you just answered the question, Leo.
Leo: You mean work harder? No.
Steve: Yeah. The idea was that back then, memory was very expensive. And so it really wasn't the program, I mean, the programmers wanted more powerful instructions, rather than using more, less powerful, instructions.
Steve: So, for example, in the case of, for example, the VAX that has a linked list instruction, which is like doing all this amazing pointer moving around, programmers were able already to manually manage linked lists, and they could have certainly hidden all of the instructions that they were using underneath a macro. Of course, that would have been dangerous because it's very easy to forget how much work a macro is doing, specifically because it's hiding all the work it's doing from you. It's convenient from a programming standpoint. But the problem is it would be expensive in terms of time, and also the memory that it would take up.
So back then, different from now, where you might say, hey, wait a minute, you know, RISC approaches are much more many small, simple instructions than CISC machines were back then. Back then memory was expensive and slow. So what the programmers were saying was, hey, we're expending all these instructions to manipulate pointers in a way that it'd be really convenient if we just had an instruction that could do it for us. Then we'd save all of this expensive memory and all the time it takes to fetch from this expensive memory. So macro doesn't do the job that implementing complex instructions in microcode does.
Leo: Yeah, I think we kind of touched on that last week, but just it's worth reiterating. It isn't laziness, it's a response really to scant resources, as a lot of this stuff is. And as resources change, you change what you do. It's why we don't need RISC so much anymore.
Leo: I feel the need, Mr. Gibson, for speed.
Steve: Well, so we've established sort of the original technology of computers, looking at the way, for example, early minicomputers like my favorite old PDP-8 operate, where memory is a bunch of words, and the words are broken into fields of bits, and the bits specify, for example, the opcode, the operation code, what the word of instruction is going to cause the hardware of the machine to do. And even then, even though the machine went from one instruction to the next, the execution of that instruction internally took several phases.
You would fetch the instruction from main memory into the - we talked about the IR, the Instruction Register, where the machine would then look at the opcode to determine what this instruction was telling it to do. So there was a fetch of the instruction. Then there was a decode, where you'd decode what it is that you fetched.
Then comes to execute the instruction, whether it's incrementing the accumulator or adding a register to another, maybe jumping to somewhere else. And then in some cases you would be writing the results back, maybe writing the result of incrementing the accumulator back into the accumulator, or writing it back into main memory, if you were storing.
So from the programmer's view, the programmer sees this as atomic events, one instruction per word. The engineer who's designed the computer to do this sees that there's more going on. A single execution of an instruction actually requires many different phases - fetch, decode, execute, and then write back the results. So machines were being produced like that. And people naturally wanted them to go faster.
And what the engineers saw was that, well, you know, we fetched an instruction. Then we're decoding it, and we're executing it. But while we're doing those things we're not using main memory. That is, it's waiting for the next fetch. And so the concept dawned on them, and this actually happened on the mainframe level in the late '60s, this notion of sort of overlapping things. And the best example, sort of I think the model that's clearest is, because we've all seen examples of it, is the automobile assembly line - which, as I understand it, Ford invented to create his cars, the idea being...
Leo: Just a side note, by the way, we're going to be going to visit the Ford assembly line on July 30th.
Steve: Who, "we"?
Leo: Me. Who, "me."
Steve: Oh, cool.
Leo: Yeah, and I'm going to bring the live camera, and we're going to actually show the state of the art in modern assembly, which I can't wait, their Dearborn plant.
Steve: I would love to see that because you only get little snapshot snippets of...
Leo: I know.
Steve: ...pictures with robot arms swinging stuff around in the air.
Leo: I know, I'm so excited.
Steve: That will be really cool.
Leo: I'm going to go see where my Mustang was born. Anyway, sorry, didn't mean to interrupt, go ahead.
Steve: So the idea with an assembly line is that, at every stage of assembly, you do a little bit of work towards producing a finished car. The time required to produce one car is the time it takes to go the length of the assembly line. But once the assembly line is full of partial cars being assembled, the rate at which cars come out is much faster than the total time it takes for a car to move through the assembly line. So...
Leo: Wait, now, let me think about that. The cars come out faster than it takes for a car to go through the assembly line.
Steve: Yes. So say that you had an assembly line of 10 stages.
Steve: And that each stage took a minute.
Steve: Well, when you start making a car...
Leo: It takes 10 minutes.
Steve: It's going to take 10 minutes for that car to go all the way through the assembly line.
Leo: Oh, but then cars will come out one every minute.
Steve: Once the assembly line is full, then they come out every single minute.
Leo: Got it, okay. I'm glad - sorry. I'm stupid, but I needed to understand that. Okay.
Steve: And so in processor technology we call this a "pipeline." And virtually every machine now being made, and actually made for the last two decades, has been "pipelined" to one degree or another. So let's first apply that to the very simple model of this machine which fetches the codes, executes, and writes back. The idea with a pipeline there would be that you fetch an instruction, then you start decoding it. Well, while you're doing that, memory is free. You're not using memory. So most instructions, most code is sequential. That is, we know that after normal instructions are executed, the program counter is incremented by one for the next word, which is then fetched. And the one after that and so forth.
That changes in the case of jump instructions, which jump us to somewhere else; or branch instructions, which may or may not branch to somewhere else. But in general it's a safe bet that we're going to be moving forward. So the engineers who wanted more performance out of the system basically - and this will be a recurring theme through this podcast. You look at the various components of your system and think, how can we keep them busy all the time? How do we get the most performance out of it? Well, it's by keeping all the pieces busy.
So if, while we're decoding an instruction we just fetched, we assume that we're going to be executing the next one here in a while, well, go ahead and fetch it. Get it read from memory. And similarly, after that first instruction's been decoded, then it's time to execute it. Well, meanwhile, at that point the decoder is not busy because it just did its work on the first instruction. Well, now we've got the second instruction that we fetched while the first one was being decoded. It can now be decoded.
And so the analogy is exactly similar to the assembly line where instructions move through several stages. And once they get going, rather than an instruction having to go all the way through before you even start on the next one, you're able to make some assumptions that allow you to basically create an assembly line for computer instructions, just like you do for cars.
Now, it gets, from that simple sort of start, then things really get interesting because one of the things that happens is that instructions may interact with each other. That is to say, if we were to add two registers - say that we had a machine with multiple registers, as all contemporary machines have now. Back, you know, that PDP-8 had just the one accumulator, which you sort of ended up using as a scratchpad. Now we've got 8, 16, 32, lots of registers. So say that an instruction that you read was adding two registers together, that is, adding one into another, and that the instruction that followed took the value from that add and stored it. Well, so now we have a problem because we have instructions in this pipeline which interact with each other.
So over time engineers have made these pipelines longer because they'd like to get as much simultaneity going as possible. But they've also had to deal with the fact that there can be interactions, and often are, between instructions that are in the pipeline at the same time. So the first thing that's done is that instructions are broken into smaller pieces. They're called "micro ops" (uOps).
So, for example, say that we had a simple instruction. We've talked about how the stack works; how, for example, when you push something on the stack, what happens is the stack pointer is decremented to point to a new, lower location in memory. And then the value that you're pushing is written to the location where the stack pointer is pointing, sort of in this scratch pad. So that operation, a single instruction, "push a register," can actually be broken into two micro operations. The first one is decrement the stack pointer. The second one is copy the register to where the stack pointer is pointing.
And imagine another instruction, like adding a register to what's in memory. Well, to do that you have to read out what's in memory. Then you have to add the register to what you read out and then write that sum back to that same location in memory. Well, that's three micro operations. So what the processors do now is they take these sort of what look - the programmer sees them as instructions, but they're actually complicated enough that they require - they can be broken down into smaller pieces. So the processor fragments these single instructions into multiple micro operations and then basically pours them into this pipeline, which is getting increasingly long, in some cases as long as, like, 20 stages of, like, staging of instructions.
Now, one of the things that engineers noticed was that some instructions, like this - imagine the instruction I talked about where we're wanting to add a value to something in memory, where we're having to read the thing out of memory, then sort of into some internal temporary location that isn't even named. Then we add a register to that and then write it back out. Well, so we've taken that single instruction and broken it into these three micro operations.
Now imagine that there's an instruction behind it, that is, it actually is later in the code, that's doing something else entirely. It's adding two registers that aren't related to these micro operations. What the engineers realized was, while the computer was out fetching the value to be added to, it had already fetched more instructions behind. And the ones it had behind were independent of the outcome of the instructions that it was working on currently. And, for example, while fetching something from memory, the machine's adder, the ALU, the Arithmetic Logical Unit, was idle.
So the processors of today are able to realize that they've got other things they can be doing which are independent of the outcome of earlier instructions. So why not execute them out of order? That is, literally rearrange these micro operations in the pipeline so that things that are taking longer can literally be passed by instructions which can take advantage of resources in the chip which are not currently in use.
And so what we've ended up with is this amazing technology which pours instructions in the front end of the pipeline, fractures them into smaller sort of individual granules, which need to be executed in order for that to happen. Then logic which is sophisticated enough to look at the interdependencies between these micro operations and reorder them on the fly so that the assets that the chip has, like Arithmetic Logical Units, like a floating point processor, like instruction decoders, all these assets are being maximally used.
And in fact one of the things that happened was that processers went so-called "superscalar." What I've described so far is a scalar processor. A superscalar one is one which is actually able to execute more than one instruction per cycle. That is, normally you would be retiring instructions when you're done out of this pipeline, sort of at a given speed.
Well, if you have enough assets to execute instructions, there's no reason you can't go faster than a hundred percent. And so superscalar processors go faster than a hundred percent. They've got, for example, there are some that have four ALUs and two Floating Point Units. And so they're literally able to be doing four additions at once. Sometimes those are part of a very complex instruction, or sometimes they're part of different instructions.
The point is, the processor has become smart enough to break all of these down into little subfunctions and then sort through them, analyzing the interdependencies among these subfunctions and taking advantage of anything that might require a delay in order to say, oh, wait a minute, we've got a guy back further here who isn't dependent upon any of our outcomes. And we've got a free adder. Let's do that addition right now. And if you think for a minute about the logical complexity of any instructions which you might encounter, and having to, on the fly, I mean, we're talking - there's no time to do this, either. This is not slowing things down. The goal is to speed everything up.
So there's no - you can't even catch your breath. This is all happening billions of times a second. At gigahertz speeds this is all being managed. So now we have a system which is able to do, literally, sucking instructions in, cracking them down, rearranging them on the fly, looking at interdependencies. Well, that wasn't enough for the engineers. Management said "faster, faster, faster." And so the engineers are like, wait a minute. We're going as fast as we can.
Well, what they realized was that wasn't true because there was still a way they could get a little more clever. I used the word "retiring an instruction" before. And that's a term used in this art where you finally say - you, like, write the results of the instruction back out. So inside this pipeline you've got an amazing amount of resources. You've got unnamed registers. By that I mean they're not like the register 0, register 1, register 2, or AX, BX, CX. That is, they're not the registers that the programmer sees. These are like temporary scratchpad registers which are not visible to the outside world, not visible to the instruction stream. But they're used, for example, if you were adding something to memory where you've got to read that into somewhere in order to add something to it. So that's an unnamed register.
So when you retire an instruction, you're sort of writing its results back out to, like, the real registers, to the programmer-accessible registers. But the engineers realized that in some cases they did have a result which a later instruction was waiting for, even though they hadn't yet retired the earlier instruction out to, for example, writing to, like, the AX register. They did have it in the pipeline.
So they added a whole 'nother layer of nightmare by allowing results to be forwarded, and that's the term that's used, within the pipeline to, like, track this so that partially executed instructions which had not yet been retired could have their results sent sort of back into the future in order to allow instructions that had stalled because they were dependent upon an outcome which hadn't been resolved yet. And all of that exists also. So what we have now is something unbelievably complicated.
Now, what happens if you hit a branch? Because branching, any change of linear flow is the worst possible thing that can happen. Think about it. We've got all this happening. We've got 20 instructions maybe that have been taken apart, all under the assumption, remember we made one fundamental assumption at the beginning, which was we're going to go linear. All of this sucking in things ahead of where we are assumes we're going to use them. All of this work says that we know where we're going.
Except when we come to a conditional branch, or even a jump that's going to go somewhere, suddenly everything changes. We now don't know whether we're going to keep going or go somewhere else until later in that instruction's phasing. Remember, now instructions are being cracked apart. They're being decoded. They're being executed. There's, like, all this work being done before the outcome of the instruction is known.
The problem is, if it's a branch instruction that might change the sequence, if it does change, if it's branching us to somewhere else, well, everything behind that instruction has to be scrapped. So the entire pipeline has to be dumped. And we stall until we are able to then load a series of instructions from the new location and sort of get all this going again.
Leo: And that's what screwed up Prescott because I think their prediction wasn't good, or their pipelines were too long, and they got a lot of dumps.
Steve: Well, yes. So having developed this amazing complexity for dealing with, I mean, like, just incredible acceleration of performance, as long as you go straight, the second you change away from that, that linear execution, you're in serious trouble. So engineers realized that branch prediction was crucial, that is, literally being able to guess without knowing what a branch was going to do.
Well, the way they've come up with doing this, there was a first level. You can imagine a simple-minded way which says, okay, let's assume that the branch that we encounter, if we've ever encountered it before, is going to do the same thing. So that sort of makes an assumption that branches generally do whatever they're going to do. In fact, microprocessor designers realized that many branches that are branching backwards are at the bottom of a loop, sort of a loop of code which tends to get executed a lot, and then finally isn't executed. So the branch, a branch backwards tends to be taken, as opposed to a branch forward. So there was some simple-minded sort of branch guessing that way.
Then they said, well, wait a minute. Let's record the last outcome of the branch and assume that it's going to do the same thing again. So an early branch predictor simply did that. And the idea was that you would take a chunk of the lower address bits, so like the least significant address bits in the instruction counter; and you would, for every one of those address bits, you'd create a table that had a single bit in it, which recorded a branch at this location did the following thing last time. It was taken or it wasn't taken.
Now, we're not talking about having a bit for every branch in the computer. We're saying that we're going to have sort of a bit, maybe 256 bits, for, like, the lowest byte of the instruction. So branches could collide with each other. A branch that was exactly 256 words further down would end up having the same least significant byte of address. So its bits would collide with each other. There's nothing we can do about that. But the probability of that is relatively low. And so there was always this cost versus performance tradeoff that's being made.
But the engineers weren't happy with just using a single bit because imagine that you had a branch which, in the code, literally alternated what it did every other time. It turns out that's also very common. Well, that would literally mean that every prediction you would make was wrong. If you remembered what it did last time, and you assumed it was going to do it again, and the logic in this branch was in every other logic, then you'd always be guessing wrong. And so the performance would just be abysmal. You'd get no benefit from your pipeline. You'd be constantly dumping the pipeline and then needing to refill it.
So the developers came up with a two-bit branch predictor, which they call a saturatable or a saturating counter, the idea being that - so two bits could obviously have four states. You could be zero zero. And then if you count up, you go to zero one. You count again, you go to one zero. And again, you go to one one. So those are the possible values. So the idea of a two-bit branch predictor was that, if you took the branch, you would increment this two-bit counter, but never more than one one. So that's the saturating part. It would saturate, it would go to one one and then just stay there. If you did not take the branch, you would decrement the counter down to zero zero, but then you never go below zero zero. It saturates at the bottom end also.
So what this gave you was a better, sort of more of a probability. You could, if you generally took the branch, but not always, this counter would - it would still make a mistake, but it wouldn't change its mind completely. So if you generally took a branch, even if you occasionally didn't, it would still remember that you generally took it. So it would, again, it would generally be guessing correctly. And so that increased the performance of branch prediction substantially. But there was still a problem, which was that there were patterns of branching which this simple-minded two-bit predictor couldn't handle.
And so in real-world applications it was better than nothing, way better than nothing. But some other engineers realized, hey, we can do something even more clever. We can do a two-level prediction. So what they created was a shift register of bits which was whether the branch was taken or not, in history. And it wouldn't be very long. Maybe let's say for a discussion that it's only four bits long. So the shift register is remembering whether branches actually were taken or not. And every time we come to a branch, we first of all look at the least significant byte of the address to choose one of 256 whole worlds.
So each possible location in memory, with this 256 cycle, has its own entire little branch prediction world. Okay, so within that world is a four-bit shift register that remembers for that branch, or branches at that location in memory, whether the branch was taken or not. Okay, those four bits, we know that four bits gives us 16 possibilities. Those four bits are used to choose one of 16 of our little two-bit saturating counter predictors.
And what we end up with is literally pattern recognition, where over time this acquires a knowledge of any sequence of up to four long of branches and not branches being taken. That will be recorded in the two-bit predictor which will tell the computer with very good probability whether the branch will be taken again or not. And these predictors have grown in length and in size. And so remember that there's one of these whole predictors for each of a number of different locations in memory where these branches could fall.
So now what we've done is we've got this pipeline sucking in instructions, cracking them down, looking at their interdependencies, reorganizing them on the fly, taking it - we've decoupled the Arithmetic Logical Units and the floating point processors and the instruction decoding and all of this so that those are all now separate resources which are being assigned and used as soon as they can. As soon as we're able to see that we know enough to allow one of these micro operations to proceed, we do.
At the same time, the system is filling up the pipeline at the top using the results, assuming we're going linearly, unless we hit a branch or a jump, and then recording the history and literally learning the pattern of the past sequence of branches in the code and sort of heuristically developing an awareness of pattern recognition of whether - I mean, so that it's able to guess with as much as, it turns out, 93 percent probability whether a given branch will be taken or not, only missing about 7 percent of the time. And when it's wrong...
Leo: Is that the average for all processors, or...
Steve: Yes, state-of-the-art prediction now.
Leo: That's amazing.
Steve: I know. It's just incredible.
Leo: Just amazing. It's like that old joke, how do it know? It's like predicting the future, really.
Steve: It's like 6.75 percent misprediction, so about 7 percent misprediction. 93 percent of the time they're able to guess right.
Steve: And so making a mistake is expensive in prediction because we have to flush all the work we were doing, and then go somewhere else. But 93 percent of the time we're able to get it right.
Leo: Somebody's asking in the chatroom, this isn't security. Well, in a way it is. This is a series Steve's been doing all along on the basics, the fundamentals of computing. In fact, from day one on Security Now! you've really done a great job, I think, of getting people up to speed with these fundamentals, things you have to understand to understand security; right? These are not completely incidental to security.
Steve: It certainly is the case that everything is interrelated. For example, I'm thinking as I'm working toward getting going on CryptoLink, the VPN product that I'm going to do next, well, encryption performance and decryption performance is very important. And understanding the internals of what the chip is doing really does allow a developer who wants to truly optimize their code to arrange the instructions so that the logic in the chips have the most latitude for working. And certainly performance has been something that we've been questing after forever.
Leo: Yeah. And we're getting it with this amazing pipelining and parallelism and so forth.
Steve: So the engineers have got this incredible pipeline built. They've got now this amazing branch prediction system. And then they realize that they've still got a problem because they suck in a return instruction into the top of the pipeline. Well, we know from having discussed this before what a subroutine return does. When we call a subroutine, we're using the stack. We decrement the stack pointer and put the address of the instruction after the call on the stack so that later, once that subroutine has finished, it's able to do a return instruction which looks on the stack for where to return to, which has a beautiful effect from a programmer's standpoint of bringing us right back to the instruction following the one which invoked a subroutine.
Well, one of the first things a subroutine does, because most subroutines don't want to mess up what was going on when they were called, they'll push a bunch of registers value onto the stack themselves so that they can be popped off the stack and restored prior to returning. That allows them to have sort of those registers to work with and then not mess up what was going on in the main code that called the subroutine. Okay, so with that in mind, visualize what's going on in the processor now with the pipeline, where the pipeline is full of instructions toward the end of the subroutine, and then the subroutine is finished, and it does a return.
Now, the problem is that the return uses the value on the stack. But the thing that the subroutine is doing just before it returns is cleaning up its registers, getting their values off the stack in order to restore them to what they were. And this is happening further down in the pipeline. Which means the stack pointer is going to be changing a lot, and there's no way we can use, there's no way we can execute any of the return instruction until literally we get - we know what the stack pointer's going to be. And then we have to go read where it's pointing, get that value. That tells us where to return to.
So then we start fetching instructions from there. Which means a return instruction is deadly. It literally brings everything to a halt because we don't - we don't know where the stack pointer will be because the instructions typically occurring, all of those instructions just before the return are changing the stack pointer as they pop the values of the registers back off the stack into the registers so that they're restored when we go home.
So the engineers scratch their head for a while, and they say, wait a minute. What we need is an internal call stack. We need our own private stack because we know that, more often than not, subroutines nest perfectly. That is, some code calls a subroutine, which will return. Or maybe that subroutine calls a subroutine, but that one returns to the one that called it, and then it returns to the one that called it. In other words, there's a nesting which is almost always followed. Which means that this incredible execution unit in the processor now maintains its own call stack. When it sees that it has been jumped to a subroutine, it records internally the address of the instruction after that call on its own little stack. There's no instructions to get to it. Programmers can't see it. It's completely invisible.
The call stack ranges from 4 to 32 entries in modern processors now. And so what happens is, since the internal pipelining miracle has recorded this, the second a return instruction is seen, which is just a byte, for example, in an Intel instruction is just a 60 hex, a six zero hex, the second that byte is seen, the system says, ah, that's a return instruction. We don't have to wait for anything. We can immediately pull where we know it's ultimately going to return to off of our own internal stack and continue without interruption sucking in more data from the point we're going to be returning to, without missing a beat. So that's another level of what was added.
Now, once all of this was finished, and this was maybe, oh, about a decade ago we had this level of technology, there was still some unhappiness with the contention for resources. That is, there was still not what's called "instruction-level parallelism." There still was, like, the ALUs and the Floating Point Units. They were sitting around not being used all the time. The engineers weren't able to get them busier because there was still too much interdependence among these micro operations that they were just - they couldn't get enough, they weren't able to use the resources fully.
Well, this is when this notion of simultaneous thread execution occurred to them, which Intel calls "hyper-threading." I mentioned it a couple weeks ago in passing. I couldn't do it justice because we didn't have enough foundation to understand what hyper-threading is. Well, we do now, after the last 45 minutes of this. What hyper-threading is, is the recognition that there is what's called "register pressure." There is not enough freedom of value assignment among registers. There's just too much interdependency. But if we had a whole second set of registers, if we duplicated everything, then where some microinstructions are fighting with each other, too interdependent, where they're having to wait for results to finish before the later ones can start, therefore assets like the Arithmetic Logic Unit and Floating Point Unit are sitting around being unused, if we have another physical thread, that is, we have another whole set of registers, well, there, because it's a different thread, they're logically disconnected from the first thread's registers. There is no conflict at all possible between these separate banks of registers.
So what hyper-threading does is, I mean, and this - talk about it being confusing already. This literally pours instructions from two entirely different threads of execution down into the same pipeline, breaking them all up, keeping them all straight, realizing that these micro ops and these registers are actually different from those micro ops and those registers. So now we have - we've doubled the opportunity for exploiting these fixed assets, the Arithmetic Logical Unit and the Floating Point Unit, being able to keep them busy much more of the time, which is what hyper-threading does. Essentially, it doesn't duplicate the entire system, but it allows us to pour two different threads of execution into the same pipeline and get a tremendous boost, I mean, it's not like doubling. We don't get double because the resources weren't that underutilized. Typically it's about a 25 percent gain, which in this quest for performance is better than a kick in the head.
So, lastly, with all of this, sort of with this much industry having been expended in order to satisfy essentially CISC, that is, Complex Instruction Set Computers, the guys designing the RISCs were just dizzy. They said, okay, wow. Do we want to do the same thing? Are we going to basically duplicate this insanity? RISC architecture is different in a number of ways. Fundamentally, the RISC concept was designed to prevent there from being a lot of this kind of like available performance boost because the instruction design just doesn't get itself into trouble so much.
One of the very clever things that RISC instructions do is there's something called "conditional instructions" and something called an "explicit condition code update." Now, what that means is that, notice that we have a stream of instructions that are being executed by the processor, and then a branch instruction is often skipping over some. There'll be something that, like, you don't want to execute in this certain case. So you jump ahead 10 instructions or five instructions or something. You're skipping over them. Which is many times what a branch will do.
What the RISC designers realized was, at the expense of some more bits in the instruction word, and it does widen the instruction word a bit, they could make what's called "conditional instructions" instead of branches, that there are still branches and jumps, and those are being optimized still very much the way they are in CISC instructions, with branch prediction and so forth. There's no way around that. But essentially the RISC guys said, wait a minute. If we just want to skip over five or six instructions, for example, if the result of an add was zero, or if the result did not overflow and the carry bit was set, why not add to any instruction some additional bits that say "execute this unless the condition code is zero." Which means that we've saved ourselves a branch. We don't have to branch over those instructions. We can make the instructions themselves just sort of skip over themselves. The instruction says "only execute me in this certain case," that is, the case where we wouldn't have taken the branch.
So what this did was, this allowed a very aggressive forward-fetching pipeline to go in a straight line. And we understand why pipelines like to go in a straight line. We were talking about that before. This allows the pipeline to fetch ahead. And even though it may not be executing instructions, it saves all of the possibility of a branch misprediction because we don't have a branch at all.
Now, the other trick is, if you had a group of instructions which you collectively wanted to execute or not in a certain case, if you were executing them, you wouldn't want them to change, like, the state of the carry bit or the zero bit or any of the condition codes because then that would mess up the conditional execution of the instructions that followed. So the other thing that was added, in addition to this notion of a conditional, conditionally executed instruction, is the ability for the instruction not to modify the condition code when normally it would. You might be, like, doing some addition. And normally the add instruction will set a flag saying, oh, the outcome was zero, the outcome sent the carry bit, the outcome was not zero, you know, various condition code situations like that.
So what was done was a bit was added that said, do the add, but don't change the condition code because we're wanting to continue the instructions afterwards along the same - to have the same effect as the one we just executed based on a condition code which was set deliberately earlier in the path. And so that was essentially the final optimization that the RISC guys brought into the design of the instruction set, which further made pipelines able to absorb this huge number of instructions, sort through everything, and perform really this just overwhelming job of making processors incredibly fast.
Leo: It is such an amazing, mind-boggling thing, especially when you think that we're operating now at the microscopic - microscopic - at the level of a molecule's width, in some of these newer 45-nanometer processors. It's truly amazing.
Steve: Well, yeah, and I would imagine that probably everyone listening to the podcast has at one time or another seen one of those very cool photomicrographs just of a processor chip, sort of as if it was taken - it looks like a satellite photo of a city. And you look at that, and you think, my goodness, look at that, I mean, you can just tell by the level of detail in there that an incredible amount of something is going on. Well, what we've just described is what that something is. This technology is what has increased the power consumption, increased the size, increased the cost, but dramatically allowed the performance of these processors to increase.
And this, what we described today, this kind of incredible, out-of-order execution, branch prediction, internal call stack, register renaming, all of this is in all of today's processors. It's just being taken for granted now. It's the way we have the kind of performance that we do. Without any of this stuff, we'd be back with an 8088 running at 4.77 MHz.
Leo: There was a really good book, must be 10 years ago now, called "The Silicon Zoo," where they had those little pictures, the pictures of the stuff. Of course, it's so old now, it's changed a lot. But these photomicrographs, if you search for Silicon Zoo, they're still online. Some pretty amazing pictures. You can tell how old this site is, though, because it says "This is going to take a minute at 28.8." It's a big image. But you can do a little googling, and you'll find it. Fascinating stuff, Steve. Once again, you've done a great job of explaining how this stuff works. And networking next. But next week we're going to do a Q&A, I think; yes?
Steve: Yup. We have a Q&A, and then I'm going to - many listeners have said, hey, Steve, I thought you were going to tell us about LastPass. We're using it. We want to know we should be and it's safe. So that's queued up for two weeks from now. We'll do a Q&A next week, and then I'm going to explain in detail the cryptographic protocols for LastPass and how the whole system works.
Leo: Oh, great. That's great. If you have Q&A questions, GRC.com/feedback is the place to go. He's got a feedback form there. GRC is a great place to go for not only SpinRite, the world's best hard drive maintenance and recovery utility, but also this show. 16KB versions are there for bandwidth-impaired fellas and gals. Of course I love the transcripts, it's a great way to follow along. And I suspect more than one teacher is using your lectures on computer fundamentals in their classes. So I think transcripts would be very helpful in that case, as well. You're more than welcome to do that.
I hope that you understand you don't have to get our permission to use these podcasts. They're Creative Commons licensed. Attribution-Noncommercial-Share Alike is the license. You can find out more at TWiT.tv at the bottom of the page there about our license. And you're more than welcome to use these. In fact, I think it's great if you do in courseware. Somebody was asking in the chatroom.