Apple's Former GPU Supplier Imagination Tech Shares AI GPU Advantages Over NVIDIA
2025-10-31

As AI GPUs continue to dominate the technology conversation, we decided to sit down with Kristof Beets, Vice President of Product Management at Imagination Technologies. Imagination Technologies is one of the oldest GPU intellectual property firms in the world and is best known for previously supplying the GPUs used in Apple's iPhone and iPad. With GPU architectures being a natural fit for AI processing, our discussion with Kristof centered on how Imagination Technologies' products are suited to AI computing. He also compared them with NVIDIA's GPUs, and the conversation started off with Kristof giving us a presentation of Imagination's latest E-Series GPUs. The E-Series chips are designed for traditional graphics workloads as well as AI processing. According to Imagination, they can scale performance up to 200 trillion operations per second (TOPS) for INT8 and FP8 workloads, which covers both edge AI applications, such as AI PCs, and high-performance applications, such as training and inference.

Kristof started out by giving us an overview of the E-Series GPUs and Imagination's experience with CPUs and NPUs. An NPU, short for neural processing unit, is a specialty chip designed for AI workloads. The Imagination executive explained that while his firm did find NPUs "fantastic," they ended up suffering from scalability issues:

Fundamentally, the E-Series is really, for us, about looking at compute on the AI side. We used to play around in dedicated accelerators, but we really found that they were difficult to program. They lacked flexibility. They were very efficient as long as you stayed on the narrow path, but they really lacked a lot of flexibility. And if we looked at the market continuing to grow and evolve, that flexibility was really a key requirement. And that drove us very much to look at, you know, those CPUs, NPUs, GPUs as the space. Because on the CPU side, we, for a short period of time, owned MIPS. So we had quite a lot of experience with CPUs; we dabbled around in the RISC-V processor space. And CPUs are great, right: they're very programmable, they're very flexible, but they really lack on the scalability and a lot of the AI-related efficiency, because they're not really parallel processing engines. They're much more designed to be the single-thread, kind of decision-style engines. So not really very efficient to target them. The NPUs, as we played around with them: fantastic efficiency, very high density, very high power efficiency. But they really traded off that flexibility and a lot of the scalability mechanisms. A lot of NPUs end up with very poor utilization, especially if they drop off the fast path. Now, we call them NPUs, but you've got a lot of DSPs and similar things which operate in that space as well. And then finally you have the GPU, and ultimately, if we look back at the whole AI space, and partially this is driven by where all the innovation is and where all the usage is with NVIDIA, the GPUs tend to come out on top if you look at the support, the performance, the flexibility, the density, the efficiency, just because it's such a rapidly moving space.

The conversation then shifted to the synergies between AI and graphics processing. Kristof explained that tile-based rendering and tile-based compute are a key link between graphics and AI processing. Tile-based rendering reduces the memory a GPU needs to process an image, as it renders portions of the screen on chip before sending them out to memory. Tile-based programming in AI sees developers break matrices down into tiles and rely on multi-threading for execution: tiling divides matrices into chunks and computes partial products on those chunks, instead of streaming each full row and column individually.
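To make that parallel concrete, here is a minimal, illustrative sketch of tiled matrix multiplication written in OpenCL C, one of the standard APIs Imagination supports. This is not Imagination's library code; the tile size and kernel shape are arbitrary assumptions, and the matrices are assumed square with sides divisible by the tile size. Each work-group stages small tiles of the inputs in on-chip local memory and reuses them many times before touching external memory again, mirroring the load-process-write-out flow of a tile-based renderer.

```c
// Illustrative OpenCL C tiled matrix multiply (not Imagination's actual kernels).
// Each work-group stages TILE x TILE blocks of A and B in on-chip local memory,
// reuses them for TILE partial products, then writes one tile of C back out --
// the same "bring a region on chip, process it, write it out" flow as tile-based
// rendering. Assumes N is a multiple of TILE and a TILE x TILE work-group size.
#define TILE 16

__kernel void matmul_tiled(__global const float *A,
                           __global const float *B,
                           __global float       *C,
                           const int N)              // square N x N matrices
{
    __local float Asub[TILE][TILE];                  // on-chip tile of A
    __local float Bsub[TILE][TILE];                  // on-chip tile of B

    const int row = get_global_id(1);                // output row this thread owns
    const int col = get_global_id(0);                // output column
    const int lr  = get_local_id(1);
    const int lc  = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Stage one tile of each input matrix into local memory.
        Asub[lr][lc] = A[row * N + (t * TILE + lc)];
        Bsub[lr][lc] = B[(t * TILE + lr) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Consume the staged tiles: every element fetched from external memory
        // is reused TILE times before the next trip out to DRAM.
        for (int k = 0; k < TILE; ++k)
            acc += Asub[lr][k] * Bsub[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;                          // write the finished result out
}
```

AI compute libraries apply the same pattern at larger scale, which is why a GPU that already has tile memory and tile scheduling for graphics can reuse that hardware for matrix workloads.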
Kristof continued:

So with the E-Series we really wanted to address that and look at our GPUs and fundamentally evolve them from our tradition. Right, we've been doing GPUs for over 30 years. Starting in the PC market, doing Dreamcast for the consoles, going back into the PC market before we ended up in the IP licensing space. So we have been doing this for a very long time, the kind of parallel compute for graphics, but now we're really looking at how do we bring that architecture and that efficiency into the AI space. And the key thing for us on the AI side, as we started looking at it, was that there's a lot of synergy between graphics processing requirements and AI processing. GPUs are already very good at all the kind of data management, data flow, data type conversions. And equally, we do a lot of optimizations that we see in the AI space. Right, from the graphics point of view, we're a tile-based rendering engine. That means we bring data on chip, we do all the processing for a region of the screen, and then we write it out. If you look at AI, it's very similar. It's tile-based compute: you load a whole bunch of matrix data in, you do your processing, you try to keep it there for multiple layers before you write it out. So tile-based rendering, tile-based compute, very similar. And that means we already have a lot of the memory on chip to be able to do that. So whether we use that memory for graphics or for compute, that is where we can bring a lot of the efficiency. We don't need to put new memory down. We can reuse what we have already invested in. If we look at our deferred rendering, we try to figure out what is visible and then process it. With compute, and with sparseness, it's very similar, right: do I really need to compute this, or is it all zeros and I can quickly and efficiently skip processing? So again, that potentially triggers bubbles in your processing pipeline. But again, because our GPU was doing this for graphics, we can also tap into all those mechanisms for keeping the GPU busy for those compute usage cases. So that's really taking all the stuff we already have for graphics and being able to very effectively reuse it for compute and especially AI processing. Now, one of the things we learned from the AI side and were able to apply to the graphics side is something that we call utilization.

NVIDIA came up as Kristof shifted to TOPS and their contribution to AI workload performance. He outlined that since Imagination is an IP (intellectual property) company, its customers are not limited to buying one kind of product for their AI compute; instead, they can customize their chips. He also detailed the firm's arithmetic logic unit (ALU) and warp processing capabilities. A warp is a group of threads that execute together; NVIDIA's warps contain 32 threads, while Imagination's are 128 threads wide. Kristof explained that Imagination's hardware packs smaller workloads together to fill these wide warps:

So you heard utilization a lot for AI. I've got hundreds of TOPS, but how many of them are truly contributing to the inference or the AI performance? And very often those are disappointing numbers. Now, you've got equivalent things in graphics.
That is, you have an amount of raw statistics: how many gigaflops, how many teraflops, how much fill rate, how much pixel drawing capability you have. But that's really just a theoretical metric. How good is the GPU at translating that into actual frame rate? And when we started looking at that, we found that our GPU is very efficient. So it was very good at turning the sometimes limited gigaflops that we are given by a customer in a specific design into, relatively speaking, higher frame rates. So being able to do that conversion very effectively is very good for your power efficiency and, of course, for your density as well, because it means you can extract more performance out of a relatively small GPU. So all of that is combined into the E-Series of GPUs. And what we really focused on was, you know, compute efficiency, so how can we boost the power efficiency of the design? How can we scale? You know, traditionally, a lot of people remember us from the mobile days, but actually mobile has, you know, in many ways not been our primary market. We've been very dominant in the automotive market. That's both automotive graphics and growing automotive compute, so quite a few autonomous vehicle partners using our GPUs as big compute engines. That means we need to be able to scale a lot higher than what we would normally do in a mobile phone, so scalability to 200 TOPS, which really puts you in those kinds of compute domains very efficiently. And then, of course, ultimately, developer and customer flexibility. We're an IP licensing company, so the biggest benefit of IP is that you can build exactly what you want and need. So working with customers, we can build exactly the right configurations in size, but also very often in feature set, to adapt to their needs. So we're not like NVIDIA, where we say this is what you can buy, it's this chip or nothing else. With IP, we very much deliver what our customers tell us to do, right? This is not IMG saying you must use this or it must be that size; ultimately all those choices are made by the customer.

I put in a quick slide to talk about the fundamentals. So, as I mentioned, tile-based rendering: very much about on-chip processing, efficient processing, bandwidth and power efficiency benefits. Deferred rendering very much means that we figure out what's visible in each pixel on the screen before we execute any of the shader program. So that saves a lot of bandwidth, because you don't need to fetch anything that you're not processing, and of course not executing those cycles is a saving in power. Our ALU engines are pretty much asynchronous everything. You very often hear about asynchronous compute, but in reality we are processing four different types of tasks concurrently within the GPU. So that's geometry processing, pixel processing, compute, but also what we call 2D or kind of housekeeping data movement. All of those tasks can be running concurrently in any of our compute engines. And it really helps us to keep them busy, because that means we have different divergent workloads, we have different bottlenecks. We will just execute the warps that we have available to us. If we have the data and they can progress, they become active and we can use them to keep our engines busy. We're ultra-wide, which is a little bit different from some other players, so we've got 128 threads within a warp. If you compare that with NVIDIA, NVIDIA has 32 threads in a warp.
So NVIDIA's warps are much smaller, but to make ours efficient, what our architecture does a lot of is basically task packing and repacking. So if a compute workload gives us work groups of 32, we just pack four of them together and execute them. And we do the same with geometry and pixel processing: we're basically merging triangles, merging tasks, to keep the engines as busy as possible.

The conversation moved towards mobile applications, where Kristof explained how Imagination's products rely on lossy and lossless compression for image processing. Lossless compression, as the name suggests, reduces data size without losing any information in the process. As for AI, according to the Imagination executive, the firm's product designs for automotive applications also flow into AI servers. He outlined that idle cycles within the GPU are used to run small test patterns that check whether the hardware is processing data correctly. This monitoring allows the chips to detect faults before they cause a server failure, helping avoid shutdowns, and lost progress, during training:

On the geometry side and overall on bandwidth management, you know, lots of mobile devices are constrained by power. But that also constrains the kind of memory chips, the DDR, and that means compression. So our geometry data flows use a block-based compression, so all the tiling, all the geometry storage we do is compressed in memory, again for footprint reasons and power and bandwidth benefits. On the image side there's full lossless and lossy compression. It's a wavelet-based technology. If you go purely lossless you would typically expect a two-to-one saving, but if there's a lot of constant colors or gradients you'll get much higher compression ratios, of course. We also fully support lossy compression, so this is where you guarantee your footprint reduction but you trade off the quality. And we use a hybrid scheme: we first try to compress lossless, and if we can't hit the bit budget, then we will fall back to a lossy algorithm. So typically a lot of the screen will be perfectly compressed because it's using a lossless mode; we will only fall back to the lossy algorithm where we really need to.

As I mentioned, we're a big player in the automotive space, and that means you have to deliver functional safety. And that's a certifiable requirement, so we have to go to auditors together with our partners to get those certifications, and we have the option of building a lot of mechanisms into our GPUs to enable that functional safety. So that includes zero-overhead virtualization, ECC or parity on all the memories that we have. We can tag specific parts of the screen as safe, so we can say this is really critical information, make sure it is rendered correctly; so we can render that twice and double-check that it's rendered correctly. We also have what are called distributed safety mechanisms, or something that we call idle cycle filling. Basically, when an ALU pipeline within the GPU, or other processing pipelines, aren't used, we will use those idle cycles not to just sit around but to run small localized test patterns. So we're continuously checking if the processing pipelines are operating correctly. But of course we only do that where customers need it, so that's the functional safety automotive market. Or if you look at the AI space, there's a lot of talk about the stability of servers. You know, a server that fails can be very damaging to the progress of the training. We can detect those faults very, very quickly in hardware and then flag that the hardware needs to be switched out.
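Stepping back to the compression scheme Kristof described, lossless first, with a lossy fallback only when a block misses its bit budget, the decision logic can be modeled in a few lines of C. The sketch below is a toy model: the two "codecs" are deliberately trivial stand-ins (run-length coding and 2:1 decimation), not Imagination's wavelet-based technology.

```c
// Toy C model of a hybrid "lossless first, lossy fallback" block compressor.
// The real scheme is wavelet-based; these stand-in codecs only illustrate the
// per-block decision described in the interview.
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

// Trivial run-length coder: returns encoded size, or (size_t)-1 if it does not
// fit in `cap` bytes. Flat colours and gradient-like runs compress well.
static size_t rle_encode(const uint8_t *px, size_t n, uint8_t *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && px[i + run] == px[i] && run < 255) run++;
        if (o + 2 > cap) return (size_t)-1;      // over the bit budget
        out[o++] = (uint8_t)run;
        out[o++] = px[i];
        i += run;
    }
    return o;
}

// Trivial lossy fallback: keep every other sample so the output size is
// guaranteed (~n/2 bytes), at the cost of quality.
static size_t decimate_2to1(const uint8_t *px, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += 2) out[o++] = px[i];
    return o;
}

// Hybrid decision per block: lossless if it fits the budget, lossy otherwise.
size_t compress_block(const uint8_t *px, size_t n,
                      uint8_t *out, size_t budget, bool *was_lossy)
{
    size_t sz = rle_encode(px, n, out, budget);
    if (sz != (size_t)-1) { *was_lossy = false; return sz; }
    *was_lossy = true;
    return decimate_2to1(px, n, out);
}
```

Most blocks take the first path, which is why the screen is "perfectly compressed" in the common case while the footprint guarantee still holds. Kristof continued: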
As I mentioned, we've been designing our own RISC-V firmware processors. We've integrated those into our GPU, so the top-level scheduling, all the virtualization, all the safety mechanisms are driven by that RISC-V firmware processor. So it's high performance, very flexible, and we also use it to interact with a lot of other components in the system. So if a customer has dedicated neural network engines that they want to use, but they want to very quickly exchange data with the GPU, we can basically do hardware-based handshake signals and say, are you ready with your data, we can do processing. So those are very efficient, low-latency synchronization mechanisms that we can deploy there. And that works through GPIO signals, so we have a number of general-purpose IO pins on the GPU that we can use to interact with all those other blocks. Our ALU is multi-issue, so that means there's not just floating-point data going through it; we can be running integer operations, complex operations, moves, and kind of packing and unpacking and data conversions, and many of those things can all be running concurrently within the same cycle. Another element we have is on the texturing side: we make a distinction between what we call 2D image sampling and 3D. 3D sampling is more complex; you have a lot of perspective correction and level-of-detail calculations that you need to do as you do 3D texturing. In 2D you don't need any of that, you're really just sampling data, filtering it and processing it, so we've doubled up the speed of 2D access. That's very good for image processing and very good for a lot of compute usage scenarios as well.

Kristof then shared some of the "funky" things that NVIDIA has been discussing and how Imagination has provided the same features for a long time. These include full hardware ray tracing and the ability for pipelines within the chip's ALU to exchange information. He stressed that these features are nevertheless optional for customers. Imagination also allows customers to utilize accelerated matrix multiplication within the GPU and to rely on Vulkan and OpenCL. According to Kristof, customers can choose to use these functionalities for graphics-only computing as well:

That's all the fundamentals; those are things we've been doing in previous generations. With E-Series we've really focused on compute, and that's what you can see at the top. We have introduced a new technology called Burst Processors, which is where we're really getting our improved power efficiency, and I'll explain that in a bit more detail on the following slides. We also bring full hardware subgroup support, so this is the ability for ALU pipelines to exchange information. And this is increasingly common. So in AI, you know, there's a lot of shared data between the different processing units; they're using the same data or they're trying to share compute. With subgroups, it means that we can just exchange all of that very fast in hardware, and in a lot of image processing cases you can get those benefits for graphics as well. We can deliver full hardware ray tracing up to Level 4, which means box and triangle testers, full BVH walking in hardware, as well as coherency sorting. So all the funky things you've heard NVIDIA talk about, we've already been doing them for quite a few years. But it's optional, right? If customers don't want it, we can really just exclude that functionality. And a lot of automotive customers, for example, don't care about ray tracing.
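For a sense of what that subgroup exchange looks like from a developer's point of view, here is a minimal kernel using the generic OpenCL C subgroup built-ins (cl_khr_subgroups, OpenCL 2.x/3.0). It is a generic sketch rather than anything Imagination-specific: lanes that execute together combine their partial results directly in the ALUs instead of bouncing data through local or external memory.

```c
// Generic OpenCL C sketch of subgroup data exchange: every lane contributes a
// value and the hardware combines them across the subgroup, with no explicit
// local-memory staging. Illustrative only; requires cl_khr_subgroups.
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void subgroup_partial_sums(__global const float *in,
                                    __global float       *out)
{
    const float v = in[get_global_id(0)];

    // Hardware reduction across the lanes of this subgroup.
    const float subgroup_total = sub_group_reduce_add(v);

    // One lane per subgroup writes the combined result.
    if (get_sub_group_local_id() == 0)
        out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] =
            subgroup_total;
}
```

The same idea applies to sharing weights or partial dot products in AI kernels, and to filters in image processing. Kristof continued: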
Then on the AI side, you know, we're really seeing the GPU as the optimal processing block. So we've added dedicated matrix multiply acceleration ALU pipelines very deep into our GPU. Properly standards-based, so all the Vulkan and OpenCL extensions that NVIDIA has been prototyping, we can fully tap into those and use them. And we've tried to keep that unit very flexible, so we can use a lot of that functionality for graphics as well. Full multi-precision support, so both 16-bit and 8-bit data types, including the FP8 variants, FP4 data types, as well as the MX block-compressed formats, and we can boost that by a factor of 8 or 16x, so much higher AI throughput. And since the D-Series, we're backing all of this up with optimized compute libraries. So very specific kernels for computer vision, for AI, convolutions, large language models, transformers; all those building blocks are very much optimized. So we're building effectively the equivalent of a lot of the CUDA libraries. Most developers don't really use CUDA at a low level, they are using CUDA libraries. So if we match the functionality at those higher-level libraries, we can deliver really good performance portability for our partners. Virtualization boosted what used to be eight clients to 16 clients, so we doubled up there. That is really 16 hardware-software interfaces, so we can be driven as a GPU by 16 completely independent driver stacks that all believe they are talking to their own hardware. But of course, in reality, all of that schedules across the engines that we have. Full API and OS support, so Vulkan, Android, Windows, DirectX Feature Level 11, OpenCL, OpenGL ES, all delivered as reference driver stacks, which, of course, our customers need to integrate in their systems and, where necessary, optimize. So things like DVFS or power management or the interaction with video or displays are areas where customers need to make modifications to our stack, because it's really just a driver for the GPU engine. So customers get full access to that source code to be able to do all the porting, tweaking and tuning that they want to do.

A key feature of Imagination's E-Series GPUs is the Burst Processor. The firm claims it delivers up to a 35% improvement in power efficiency for edge use cases. Kristof explained that the Burst Processor focuses on the additions and multiplications that dominate AI and graphics workloads. According to him, the design stemmed from earlier architectures whose deep compute pipelines left capacity idle while results worked their way through:

Neural super-resolution: we hear lots about different algorithms. We've been developing our own in-house as well, which we deliver to our partners to match our hardware capabilities. And we've really tuned them to operate in a single pass, where it all fits into the local memory of the GPU. So it's very bandwidth- and performance-efficient. And of course, you can just go to our website for everything in terms of development tools, profiling and simulation tools as well. Those last simulation tools are again more something that we deliver to our customers and our partners. That means they can do a lot of analysis ahead of silicon and basically optimize their software, should they wish to do so. So really, the headline number was that we looked at the ALU and the compute. And originally that was a compute-focused effort, but it translates into benefits for graphics processing as well. So, like you can see here, there are three different graphics workloads where we boost the power efficiency significantly.
And you can trade that for higher clock frequency, higher performance, or you can take the power efficiency savings. Now, what was that based on? That's really the Burst Processor. If you look at our generations, so if we go back to the D-Series or the C-Series, we had an ALU pipeline that was very powerful but very deep. We had up to 10 pipeline stages where we could do different amounts of processing. And if you have 10 stages of compute, you can't really do lots of things back-to-back, because if you start processing something it takes 10 cycles before that result is available. So what that means is that on every cycle we were scheduling a different warp. And that means that a lot of those units were reading and writing data to the big register banks that we have inside of our GPU. Now, the problem with all those pipeline stages is that they're very powerful, but if you don't use them they're kind of wasted. So you have all that capability, but most of the time you're not tapping into it or it's not needed. So it just burns power, and all that reading and writing to the big register banks is quite expensive. It's distance, it's toggling bits, it's big memories, so it consumed a lot of the power. What we've done with the Burst Processor is to say that most of the compute we are seeing, especially in AI but also in graphics workloads, is really just a lot of multiplies and adds. There are no branches, there's no pre-processing, there's nothing special going on; you're just doing a lot of pure compute. So we split that out into a special Burst Processor that can execute these kinds of bursts of instructions. And that requires only two pipeline stages, so you can reuse previous results very quickly. And that's what we did with the Burst Processor. So we now run bursts back to back, and that means we can keep a lot of the data very local to those ALU engines, we can exchange data with the neighboring pipelines, so there are a lot of ways of using data that is very, very close to the ALUs, and we can offload a lot of the accesses into the big SRAMs, the big storage that we have inside of our GPU. So it's much more power efficient, much more effective execution, many more back-to-back instructions that you're focused on executing. And that's really what was translating into that 35% power saving that we were seeing for heavy compute usage cases.
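The pipeline-depth argument can be put in rough numbers. The sketch below uses the figures Kristof gives (10 stages in the older C/D-Series ALU versus two in the Burst Processor); the one-warp-per-cycle scheduling model is a simplification of ours, not a description of the actual hardware scheduler.

```c
// Back-of-the-envelope model of why pipeline depth matters. A dependent result
// is not ready until the pipeline drains, so the scheduler needs roughly
// depth x issue-rate independent warps resident to stay busy; a two-stage
// burst pipeline can run dependent multiply-adds nearly back to back from a
// far smaller pool, cutting traffic to the big register banks.
#include <stdio.h>

static int warps_needed_to_hide_latency(int pipeline_stages, int issue_per_cycle)
{
    return pipeline_stages * issue_per_cycle;
}

int main(void)
{
    printf("10-stage ALU (C/D-Series style): ~%d warps in flight\n",
           warps_needed_to_hide_latency(10, 1));
    printf(" 2-stage Burst Processor:        ~%d warps in flight\n",
           warps_needed_to_hide_latency(2, 1));
    return 0;
}
```

Fewer warps in flight means fewer register-bank round trips per result, which is where the quoted power saving is claimed to come from.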
The Imagination executive then delved deeper into the E-Series' design and explained that customers can use single cores or merge them depending on the use case. This scalability is important for desktop and data center customers, he added. According to Kristof, the way the E-Series integrates its matrix engines is similar to NVIDIA's model, which enables targeting the same programming models, such as Vulkan and OpenCL:

Our GPUs are very scalable, so we deliver single cores which go into the lower-end markets, but then we can deploy multi-core. We can have multiple GPUs which are just sitting on the bus, but they can then split or they can merge. And again, the automotive market loves this, because they might want to do lots of things concurrently, and then they want a lot of these cores driven independently, doing different things. On the other hand, you may have one big compute job. What we can then do is group all those GPUs together and have them collaborating and acting as if they were one larger, bigger GPU. So this is what really allows us to build the very high-end design that we have here. So going to, you know, 13 teraflops, 400 gigapixels; for AI that translates to 100 and 200 TOPS at 16-bit and 8-bit. So very high density and very wide scalability of those solutions, for the cloud gaming customers, for our desktop and data center customers, as well as those automotive customers.

Now, that AI performance is again something we looked at very carefully. So this is a simplified view of our GPU: you can see at the top the USC, the ALU engines, the texture processing unit underneath it, then an amount of shared fixed-function logic, and then a top-level cache with the RISC-V firmware processor. So as we looked at this for integrating AI, we considered the top level. But that's very inefficient: it's quite a long distance for data to travel, and if you're that far away and you have to co-work with the GPU, you need a lot of extra SRAM to buffer up data and basically absorb that latency. The same problems happen if we try to put things at the accelerator level inside the GPU. It's just too long a distance, and there's not a lot of SRAM you can share. Most of the SRAM is really deep inside our USC engine. So this is where we have a lot of memories, those big register banks, the local memory. And that means we can reuse all those resources and very quickly exchange data between the classic FMA and GPU pipelines and the new dedicated AI matrix multiply engines. And that's very similar to NVIDIA, right? So it's a very similar model, but of course a very different design as to how we've integrated and designed it. But functionally, it sits at the same level. And that means we can tap into all those existing programming models for OpenCL and for Vulkan. And that really then delivers us a lot of that efficiency.

He also believes that the E-Series GPUs' ability to support a myriad of AI software development frameworks is an important factor in today's fragmented industry. Imagination also addresses the industrial sector through its open-source program, which targets another highly fragmented space; the firm is working to extend the same support to higher-end use cases, and its neural shaders anticipate the future development of AI-integrated GPUs:

From a software point of view, we're then backing that up with a number of reference stacks. The great benefit is that, you know, there's a lot of fragmentation in AI: lots of different frameworks at the top level, whether they're TensorFlow or PyTorch or PaddlePaddle, lots of different things, but they're all open source. And there are a lot of intermediate stacks as well, which are open source, whether that is TVM, which is more compiler-like, whether it's the LiteRT that Android and Google use, or oneDNN, which was a kind of standardization effort with the UXL Foundation, originally started by Intel. Now, all of them ultimately use OpenCL, which is the Khronos standard for a lot of the high-compute capabilities. The real optimization part for us, you can see, is in those compute libraries and the graph compilers. So that's really where we are inserting highly optimized primitives for these kinds of AI operations and running them efficiently. And then with the graph compiler we can merge layers and basically use all that on-chip memory to keep the data there and avoid becoming bandwidth-limited. All of those are example implementations.
The libraries are where we are very highly optimized, and then we teach our customers how to integrate and optimize for their specific framework choices and for their deployments. So, yeah, this is really a summary for the OS and API side. There's a lot of information for developers: documentation, all the tooling, all of that is available for free, so they can just go to our website and access it. We also have an open-source program, which has initially been focused on the industrial market, so low-end solutions, where there's a lot of fragmentation in use cases. But we continue to build that up for the higher-end solutions as well, so over time even these bigger GPUs and even these automotive solutions will be backed by open-source driver stacks. And then, of course, for AI, a lot of it is co-work within the open-source projects to get the best results for our solutions. Of course, RISC-V is very important for us. We have a lot of partners that we work with, and we tune our GPUs to work very efficiently with the RISC-V architecture, but we're largely agnostic, right? We've been used with ARM CPUs for many, many decades, but equally we've been used with x86 solutions, because we still deliver PC add-in boards together with our Chinese partners. So we're really fully flexible on the CPU architectures that we deliver our driver stacks with. And the neural cores: of course, there's a lot of innovation in the AI space. Super-resolution and frame generation are the most common, but a lot of those algorithms can be extended as well, so things like post-processing, whether that's blurs or ambient occlusion or depth of field. What we are finding is that many of them can be turned into what we call neural shaders; that is, you can teach a very small neural network how to implement those effects, and that can run a lot faster and more efficiently than writing really big multi-pass algorithms in the traditional sense. So those kinds of neural shaders will become ever more common as the AI-integrated capabilities of GPUs continue to improve, and that really will offer power savings and performance benefits, as we can basically teach those small neural networks how to implement a lot of the advanced graphics effects.

Kristof further elaborated on how the E-Series' different aspects, such as accelerated matrix multiplication, ALU pipeline optimization and customer choice, make it suitable for ecosystem computing. The use cases enabled by the E-Series also include AI PCs and cloud gaming, which rely on large multi-core configurations. Additionally, he believes that Imagination's products exceed NVIDIA's capabilities in the automotive space:

As such, E-Series is a collection of technology. From that we drive our different target markets: the automotive side, where we may be very compute-centric for autonomous vehicles, or very baseline graphics-centric for cluster and in-vehicle infotainment screens. The smartphone market, of course: a lot of generative AI, a lot of ecosystem applications that typically end up on the CPU or the GPU, because there's just too much fragmentation on the dedicated hardware side, right. Google has their TPUs, MediaTek has something, Qualcomm has something else. It's just too hard for developers, because all those things have a different programming model, different tool flows, different optimizations. So it's just too hard.
Whereas GPUs all have very much the same capabilities, so it's much more effective to use the GPU for the ecosystem compute rather than dedicated hardware. And that's really what we're enabling here. With our partners in China we're enabling AI PCs as well as cloud gaming, so that's where a lot of the very big multi-cores are coming through: lots of concurrent cloud gaming instances, so many users per graphics core. So you can see quite a stretch of solutions and design points that we enable together with our partners. On the automotive side, you know, the concurrency of all those different jobs, the hardware priority mechanisms, the virtualization, the isolation that we can do between those different domains is extremely powerful, and it actually exceeds what NVIDIA has in the automotive space, as well as that sort of hardware-based safety functionality, where again we're exceeding NVIDIA's capabilities, right. We're not a lockstep solution; you don't need to have two of those chips in your trunk, which a lot of the NVIDIA-based solutions do, deploying two very expensive chips just to deliver functional safety, whereas we are doing all that testing within the hardware itself.

A bit more on the flexibility: so again, we're very flexible, right. We're making the GPU better at AI and compute, but we also recognize that in many usage cases dedicated hardware will win out. If you're fully vertical, you own your own algorithms, you'll likely design your own NPU hardware. What we are offering then is to co-work very efficiently with shared memories and hardware-based handshake mechanisms. Or, if you're in a market where you have a lot of GPU, you want to use some of that for compute usage cases. You know, if you look at smartphones, when you're using your camera the GPU is not actually that busy from a graphics point of view, so there are a lot of cycles available for doing AI processing. The same is true when you're asking Google Gemini something: there are very simple animations on your screen, so there's actually a lot of GPU performance that can be assigned to AI compute at that point in time. When you're doing gaming, it's of course the other way around: you want to use the GPU fully for your graphics processing, and at that point you would not want to use it for AI processing. But again, we're an IP supplier, we're very flexible as to how customers want to build their systems and their solutions, and that's really what we illustrate on this slide. We can go from the very small end, an 8-256 configuration, so 8 pixels per clock and 256 FP32 floating-point operations per clock, you know, and scale all the AI capability from that, all the way to the other end, a much bigger single core with a 64-2048 configuration, so a lot more compute, a lot more fill rate, and then we can do multi-core on top of that to build very high-end solutions. And as it says there, we're delivering those first configurations effectively now to our lead partners, and we're operating in a number of these different market segments: you know, the desktop and data center space, the mobile and consumer markets, automotive, as well as that industrial space, which is very cost-centric.
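For context, the per-clock configuration figures and the headline throughput numbers from earlier in the presentation can be tied together with some simple arithmetic. The clock frequency and core count below are purely illustrative assumptions on our part, not disclosed Imagination figures; the point is only how per-clock width, clock speed, core count and the 8-bit boost factor multiply out.

```c
// Rough arithmetic linking the "64-2048" single-core configuration to the
// quoted ~13 TFLOPS / ~200 TOPS figures. Clock and core count are assumed
// values for illustration only.
#include <stdio.h>

int main(void)
{
    const double fp32_ops_per_clock = 2048.0;   // the "64-2048" single-core config
    const double clock_ghz          = 1.6;      // assumed clock frequency
    const double cores              = 4.0;      // assumed multi-core arrangement

    const double tflops_fp32 = fp32_ops_per_clock * clock_ghz * cores / 1000.0;
    const double tops_int8   = tflops_fp32 * 16.0;  // the quoted 16x boost for 8-bit

    printf("FP32: ~%.1f TFLOPS\n", tflops_fp32);    // ~13 TFLOPS
    printf("INT8/FP8: ~%.0f TOPS\n", tops_int8);    // ~210 TOPS, i.e. the 200 TOPS class
    return 0;
}
```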
Kristof's presentation ended up answering several of our questions, such as the E-Series' applications in AI and the mechanics behind burst processing. Still, we didn't give up and peppered him with a few questions - some prepared and others in response to the presentation. We started by asking how the GPUs were optimized for AI, and Kristof responded by pointing towards matrix operations:

Wccftech: Awesome, I guess your presentation did answer some of my technical questions. So I'd like to dive into the specifics of the GPUs' use cases. For starters, I was wondering how the E-Series GPU is optimized for edge AI use cases, particularly for smartphones and for AI PCs as well.

Kristof: Yeah. The key benefit for the AI processing is really the dedicated matrix pipeline. So you can look at it as the equivalent of starting to add a tensor core into our GPUs. So that means that that very heavy and very specific data flow operation of matrix multiplies can run very efficiently. So rather than running a lot of FMA instructions, right, simple individual multiplies and adds, we're executing a big matrix instruction, right: execute this 64-by-16 matrix multiply in a very efficient batch, keep all the data rippling through and stored very locally. And that gives us the bulk of the processing benefit and the power efficiency benefit, as well as supporting all those new data types. So the new floating-point 8-bit and 4-bit data types, which we are fully natively supporting in that GPU. Now, we combine all this with existing resources. So in every single one of our compute engines we actually have half a megabyte of register space. That's a lot of on-chip storage in each compute unit that we have. So that means it's very effective at loading data into that, reusing it, processing multiple layers of the neural network before we push it back out to memory. So it's really near-memory compute, and the key thing is we don't need new memory; we already have that memory for graphics. And that also makes it multi-purpose, right: it's registers that we can use for running games, and then we can reuse all that same storage for doing AI and compute very efficiently. So that means we don't need to increase the silicon area of the design, we don't need to add an NPU engine with four megabytes of dedicated RAM. We already have four megabytes effectively inside the GPU that we can just reuse. Now, the other key thing is the software optimization and the access. Like I said, one of the problems you see in Android today is that there's lots of marketing about how great the AI is when it's running on the Hexagon, or when it's running on the dedicated MediaTek hardware, or when it's using the TPU. The problem is, as a developer, how do you write content for the Play Store that can access that functionality? And that's where you fail on the standardization, right. You need to use Qualcomm's special tools and libraries, or NVIDIA's, or MediaTek's, or Google's, and in Google's case I'm not even sure it's currently public how you access the TPU. So you don't have access to this great thing. But what you do have access to is the GPU. So the GPU is a lot more logical in that AI market, whether that's mobile or, the same problem, in AI PCs: it's much easier to access the GPU and all of its standard capabilities. And that's what we focused on. We tried to make this very similar to how a lot of the algorithms are optimized for NVIDIA hardware, so it becomes very easy to port and optimize for our architecture as well. So there's both a hardware and a software angle there in terms of optimization.
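The on-chip storage figures Kristof quotes are easy to sanity-check. The sketch below just does the size arithmetic; how the hardware actually partitions its register banks is not public, and the "about eight compute engines" figure is only our inference from the quoted 0.5 MB per engine and ~4 MB total.

```c
// Back-of-the-envelope check on the quoted on-chip storage figures.
#include <stdio.h>

int main(void)
{
    const int rows = 64, cols = 16;                       // "64-by-16 matrix multiply"
    const int fp16_bytes = 2;

    const int tile_bytes = rows * cols * fp16_bytes;      // 2,048 bytes per FP16 tile
    const int regfile    = 512 * 1024;                    // 0.5 MB per compute engine
    const int tiles_fit  = regfile / tile_bytes;          // ~256 such tiles resident

    printf("one 64x16 FP16 tile: %d bytes\n", tile_bytes);
    printf("tiles resident in a 512 KB register file: %d\n", tiles_fit);
    // With roughly 4 MB of such storage across the GPU (about eight compute
    // engines at 0.5 MB each, going by the quoted totals), several consecutive
    // network layers can stay on chip before anything is written back to DRAM.
    return 0;
}
```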
W: Great. So delving deeper into NVIDIA, in which markets do you think your GPUs compete with NVIDIA's products, and how?

K: The key problem that NVIDIA has is that they're a chip maker. Right, they make a chip, and then you have to buy that specific configuration, that specific set of capabilities. So if you're a pure graphics customer and you go to NVIDIA, you have to buy all their tensor cores, you have to buy all the other stuff that maybe you don't care about. If you're a pure AI customer, you're largely buying a full-blown GPU, so you can't say, can you remove all your graphics capabilities. Now, increasingly NVIDIA and AMD are doing more dedicated silicon, but it's still very hard for customers to buy exactly what they want. They may want a little bit of graphics for visualization and a lot of compute; you can't build that, you can't buy that. So the key benefit we have is that customers can go and build exactly what they want. If they want a lot of compute, we can deliver that; if they want a lot of graphics, we can deliver that as well. If you want a mix of the two that's highly optimized, we can deliver that together with those partners. If they have their own processing engines, right, your own special sauce, your own capabilities, NVIDIA doesn't care, right: you buy the NVIDIA chip. With us, you can put accelerators next to this and work very efficiently with them. It's extremely flexible, so it's very much like what you can see in the market: lots of people are building semi-custom ASICs for AI, for compute, for graphics, for data centers. And what we are delivering is one of those GPU solutions, so the GPU-like compute engine or the graphics engine; that's where a lot of our partners are. The Broadcoms, the Marvells, who are building a lot of those solutions together with their end customers. We are one of the suppliers into that chain for compute engines. And I would say automotive has traditionally been our big market there. The visible customers for that are mostly companies like Renesas and TI. But there are a lot of others who really don't want you to know what is in there. And IMG of course respects the NDAs and the confidentiality agreements, so you may have a lot more devices around you that have some of our GPU and compute technology in them than what we can communicate about.

W: Okay, great. What role do manufacturing process nodes play in allowing the E-Series GPUs to achieve peak performance?

K: So again, and this is a bit of a benefit of IP as well, we deliver what is known as soft IP, right, so we deliver effectively full source code; the high-level languages that describe the GPU are what we deliver to our customers. And that means you can target it to any process node. So we have customers in China who are stuck on, you know, maybe 12 or 7 nanometer, and you can perfectly well do that with an E-Series GPU. But it will constrain the size and the clock frequencies, because of course the older the process technology, the more it impacts the power and the sizing. On the other hand, we are working with, you know, high-end customers in the Western world who are targeting two nanometer. But again, because it is soft IP, customers can target that; they can get much higher clock frequencies. They can put in a lot more cores, because it fits within their budget. But fundamentally, we're completely agnostic. We do tune based on feedback from our customers, so critical timing paths, but fundamentally we're agnostic to the process nodes. And even in the different markets, right: a process node is not just one library. So if you're dealing with a mobile customer, they're very power sensitive.
So if you then go to TSMC, they'll say, we have three different sets of libraries: this one is power optimized, this one is performance optimized, and even then you have a lot of cell choices within that library. So the low-leakage cells are typically lower power, but they toggle more slowly. So then you have a lower frequency but, you know, better power efficiency. On the other hand, if you're doing a desktop solution, you don't care as much about power, so you'd be picking the high-performance library from TSMC, you'd be overdriving the voltage levels, and you'd be using much leakier cells that toggle much more quickly. So there's a very wide design space that our customers can apply to our GPUs in terms of the frequencies and power budgets that you can drive through.

W: Cool. Since you mentioned ASICs, what role do you think Imagination Technologies' products can play in the custom ASIC AI chip ecosystem? And what trends do you see for this particular segment moving forward, particularly given the energy constraints for AI data center rollouts and the cooling constraints as well, with regards to liquid versus air cooling?

K: Yeah, the main trend that we've seen is that when a lot of the AI custom ASICs started, it was very dedicated-engine-centric, so they built very large, very specific processing engines. And a lot of them claimed flexibility, but in reality most of them had quite severe limitations. So as new stuff appeared, transformers and other evolutions, or even sparseness, a lot of them really struggled. They didn't have enough flexibility, and they started to get much lower utilization, or they had to jump through all kinds of weird hoops to try to map new things onto a very rigid, configurable architecture. And what we're starting to see is that the modern custom ASICs are much more of a mix. You're likely to have an array of CPU cores, which are extremely flexible decision engines, you know, figuring out what to execute and how to execute it. They then still have those dedicated neural processing engines, but now they have a closely connected array of more GPU-style processing blocks, which is where a lot of the innovation appears and a lot of the quick mapping and quick results can be achieved. So we're really growing in that space, delivering the GPU-focused compute engine, and all that flexibility, while tuning it to work very closely with those customers' dedicated engines. So this means putting an amount of on-chip SRAM down, those dedicated engines writing and streaming data into that on-chip memory, and then handshake signals that tell the GPU the next slice is available: process it, and hand it back over through that memory to the dedicated processing engine. So a lot of the system integration, the system efficiency, the co-working with, you know, dedicated custom solutions is what we've been offering. But we are then delivering that standardized, GPU-style efficient compute engine into the solution. And that's really the growth space for us. The reason for integrating some of that AI and tensor processing is really the efficiency. We don't want to toggle between those engines if we don't have to, because it's very inefficient. So if we can merge more of those layers, more of those processing passes, with sufficient efficiency, that's really the sweet spot.
So some of the really big layers may run on the dedicated hardware, but then a lot of the smaller, more tuned, flexible layers, we can execute very efficiently on the MMA engine and our traditional pipelines.
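The shared-SRAM hand-off Kristof describes can be pictured as a simple double-buffered producer/consumer scheme. The C sketch below is a conceptual software model only: in the real designs the "ready" flags would be hardware handshake or GPIO signals and the buffers would live in on-chip SRAM, and all the names and sizes here are illustrative assumptions rather than Imagination's interface.

```c
// Conceptual model of slice-by-slice hand-off between a dedicated engine
// (producer) and the GPU (consumer): two shared buffers, with "ready" flags
// standing in for the hardware handshake signals. Illustrative only.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SLICE_WORDS 4096
#define NUM_SLOTS   2                        // double buffering

typedef struct {
    float        data[SLICE_WORDS];          // stands in for shared on-chip SRAM
    atomic_bool  ready;                      // producer sets, consumer clears
} Slot;

static Slot slots[NUM_SLOTS];

// Dedicated accelerator side: write the next slice, then raise the flag.
void producer_push(int slot, const float *slice)
{
    while (atomic_load(&slots[slot].ready)) { /* wait until the GPU drained it */ }
    for (size_t i = 0; i < SLICE_WORDS; ++i)
        slots[slot].data[i] = slice[i];
    atomic_store(&slots[slot].ready, true);  // "next slice is available"
}

// GPU side: wait for the flag, process in place, hand the buffer back.
void consumer_process(int slot, void (*kernel)(float *, size_t))
{
    while (!atomic_load(&slots[slot].ready)) { /* wait for the handshake */ }
    kernel(slots[slot].data, SLICE_WORDS);   // e.g. a layer that suits the GPU
    atomic_store(&slots[slot].ready, false); // buffer returned to the producer
}
```

Alternating between the two slots lets one engine fill a buffer while the other drains it, which is how the "merge more layers, avoid toggling between engines" efficiency argument plays out in practice.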
