This is the first article in a series of blog posts on Mir’s and XMir’s performance. The idea is to provide further insights into the overall performance work, point out existing bottlenecks and how the team is addressing them.
Our overall goal for Mir and XMir is to provide an absolutely fluid user experience, both in the case of typical desktop usage as well as in the case of more demanding usage scenarios like 3D gaming. More to this, our efforts to provide a fluent user-experience on the desktop should at most have a minimal impact on overall 3D application performance.
During the last weeks and months, a lot of people have raised the question if and to what degree the introduction of a system-level compositor impacts graphical performance. The short answer is: Yes, any additional layer between the GPU and the actual rendering process has an impact on the overall performance characteristics of the system. However, there are ways to avoid most of the overhead and this blog post is the not-so-short answer to the initial question.
As its name implies, a compositor is responsible for taking multiple buffer streams or surfaces and assembling (a.k.a. compositing) a final image that is then scanned out to the connected monitors. In the general case, composition requires buffering of the final image and it requires GPU resources to render the individual surfaces to the destination buffer in preparation for scanout. Here, the destination buffer is the framebuffer. The overhead of a system-level compositor can be summarized as this additional rendering step in the overall graphic pipeline, for the obvious benefit of being able to control the final output and enabling flicker-free boot, shutdown, resume, suspend and session-switching.
Both internally and externally, people have been measuring the overall performance impact with XMir as available from the archive today. Roughly speaking, people have been reporting a performance impact of ~20% in the Phoronix test suite and the question becomes: How can we significantly decrease the impact in the specific case of XMir while still keeping all the aforementioned benefits in place? The underlying idea to solve the issue is straightforward: If the compositor is clever enough, it could recognize situations where an opaque client surface does cover a complete output (XMir matches exactly this configuration). In that case, composition can be avoided and the client should be provided with a framebuffer as rendering target instead of the usual graphic memory buffer. Moreover, the server-side composition strategy can be smart, and completely skip the final composition step and scan out the framebuffer as soon as the client signals “done”. Luckily, Mir’s composition engine and associated buffer allocation/swapping infrastructure allows for implementing this behavior easily and transparently to the client. The respective implementation has been living in https://code.launchpad.net/~vanvugt/mir/bypass for some time now, and we have been testing it in parallel to trunk. Our primary test and benchmarking platform was Intel, and we haven’t seen any issues with the patch on that platform. There is a graphical glitch present on ATI cards that we are actively working on. Nouveau gives us some headache as it is quite slow both on X and XMir right now. However, we are confident that we won’t see any major issues in XMir once the underlying cause in the Nouveau driver is fixed.
Measuring graphical performance and developing meaningful benchmarks is a complex task on its own. Luckily, we have some pretty capable tooling available in the opensource world. During development and evaluation of the bypass feature, we have been relying on selected test-cases of Phoronix Test Suite and on glmark2 to continuously evaluate performance gains and overall impact. We are going to publish the results across Intel, NVIDIA and AMD GPUs as part of our regular QA reporting at http://reports.qa.ubuntu.com/graphics/ as soon as we hit trunk. In summary, we are able to reduce XMir’s total overhead to ~6% on Nexuiz and OpenArena (see section “Conclusions and Future Work” for reasons for and approaches to further reduce the remaining overhead). Please also note that we are actively investigating into the results for the “QGears2: OpenGL + Image scaling” test case:
GLMark2 numbers are not yet reported via the public dashboard but we are actively working on wiring them up as part of our daily quality efforts, too. However, the numbers are quite promising as can be seen from this preview (Lenovo x220, i7 vPro, Intel(R) HD Graphics 3000):
Conclusions & Future Work
Today, we are landing an important GPU-bound optimization for the XMir use-case with the bypass feature and we see significant performance improvements in our benchmarking scenarios. Everyday users will hardly notice any difference in graphical performance, but notice a decrease in power usage on laptops due to the system-compositor requiring less GPU and CPU cycles to carry out its tasks.
However, this is only the first step and we still see some overhead in the benchmarks. Our GLMark2 benchmark numbers for raw Mir when compared to X as in Saucy today suggest that we still have GPU-bound optimization potential that we should leverage in the XMir case. The unity-system-compositor performance is not the bottleneck in this specific scenario and we need to become more clever on the X side of things. In summary, we need to propagate the bypass approach further down into the X world and its clients with X/Compiz handing out the raw buffer provided by Mir to fullscreen, opaque X clients. Luckily, Compiz already knows about the notion of composite bypass, too and the remaining optimization potential lies mostly within X itself by making it more aware of the fact that it is living in a world of nested compositors now. Quite likely, though, Mir will require adjustments, too, to expose composition bypass end-to-end in the XMir scenario. Stay tuned, we will keep you posted within this series of blog posts.
[Update] Michael of Phoronix found out that some games, when run in fullscreen mode but not at native resolution, do not benefit from composition bypass. As mentioned in one of the comments, we are now starting to investigate into this sort of issues and will come back with updates once we identified the root causes. At any rate: Thanks for bringing it up, we will make sure that the respective benchmark/setup is present in our benchmarking setup, too.
Filed under: Canonical