The Quest for Performance
I'm sure we all know the three basic principles of performant job execution:
(1) Cache: The best job is the one where you don't have to work at all.
(2) Algorithms: If you can't avoid work, then do it as quickly as possible, using the best algorithms available.
(3) Resources: Don't make a mess while doing your job. You'll have to clean up afterwards - and that costs time too.
After applying these principles, the new Layouter implementation is now about three times faster than before (pagination time for 200,000 rows dropped from 400 seconds to 120 seconds). However, this is still bad compared to the 75 seconds of the old layouter. But that lazy beast had mastered the art of caching - it avoided almost all real work by using its extensive caches.
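To make the first principle concrete, here is a minimal sketch of a memoizing cache. It is not code from the engine: the class name `MemoizingCache` and its methods are made up for illustration. The idea is only that an expensive computation (say, measuring a piece of text) runs once per distinct input, and every later request is a map lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of the reporting engine: it remembers the
// result of an expensive computation so that repeated requests for the same
// key cost only a map lookup.
final class MemoizingCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> expensive;
    private int misses;

    MemoizingCache(Function<K, V> expensive) {
        this.expensive = expensive;
    }

    // Returns the cached value, computing and storing it on the first request.
    // A computation that returns null is not cached and will run again.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = expensive.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the expensive computation actually had to run.
    int misses() {
        return misses;
    }
}
```

The price of such a cache is memory and the need to throw entries away when the inputs change, which is exactly where principle (3) comes back to bite.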
Sunny days for the Flow-Engine
It's finally gone public: OpenOffice uses the new Pentaho Reporting Flow Engine as backend for their new Report Designer.
I strongly believe that this is a big win for both communities. OpenOffice was in dire need of a sane reporting system that is fully integrated into the office suite. Sure, with OpenOffice 2.0, they already introduced one - but this one was more or less an extended mail-merge system. And honestly, I was never able to design a report from scratch with that thing.
Ocke Jansen created a modern (and beautiful) UI to design banded reports now. Adding elements to the report is now a lot easier - just grab the mouse and drag'n'drop them where you want them to appear. Simple and easy to use - that's how reporting should be.
And once you've finished the design, you heat up the engine and tell it whether you want a Text-Document or a Spreadsheet-Document. And in the twinkling of an eye, you get the results.
But the interesting part lies hidden deep inside the engine: The Flow-Engine works natively on the OpenOffice document structures. The engine does not waste time converting the OpenOffice-tables, paragraphs and everything else into an internal proprietary representation - we simply take what we get and process it immediately.
And we do it fast. Displaying the results in the OpenOffice-Writer takes more time than the report generation itself.
And we do it conservatively, with as little memory as possible while preserving all incoming report structures. So no matter what valid OpenDocument content you want to process - the engine will happily accept it from you and produce all the reports you always wanted to produce.
A new layout algorithm for the Classic-Engine
When you're afraid of touching your old code, it is a good time for a rewrite. The code that I was most afraid of was buried deep inside the Layout-Controller.
The development of the layout-engine of the Classic-Engine was purely driven by need, with an almost obscene lack of planning. However, it worked surprisingly well - the engine was swift. Even when the layout-manager implementation was already almost out of control (better not touch it, but still working), we were able to squeeze more performance out of it.
Stephane Grenier, founder and author of LandLordMax, was truly amazed by the performance gains we achieved in the last year. To quote a comment on his blog:
In regards to performance, as far as I understood the biggest gains were in upgrading to 0.8.7-10 I even posted a message to the forum about this. Based on the performance differences, and the fact that version 0.8 (based on information from their head developer) will eventually be able to print combined reports (which is great if you want to print multiple invoices directly within the invoice list) we definitely upgraded to the latest 0.8 version.
But in JFreeReport 0.8.8, subreporting caused some major changes in the layouting system. If you add new features, you always pay a price for it.
Fixing this innocent little bug #1681595 is more difficult than expected. A quick and dirty fix would tear down the remaining architecture - in the end, we might have implemented a fix for one bug, but laid the foundation for a whole new series of bugs. No! Heading down this path would have been stupid and a recipe for doom and desperation.
But: We already stole the subreport-implementation from the Flow-Engine. And once you've started stealing, you can't stop halfway. LibLayout, the layouting system of that engine, looks nice too ..
During the last week, I've downported and modified the rendering pipeline of LibLayout. The Classic-Engine has no need for CSS-StyleSheets and it does not (and cannot [for compatibility reasons]) make use of the DOM-oriented architecture of LibLayout. So all we need is the last stage of the rendering process: The part where abstract elements turn into content.
So far, the extracted renderer is working and produces reports with the new layouting system. With this layouter, many of the most hated restrictions of the old engine will be history:
- Bands can span multiple pages.
- Text-Elements can produce formatted text. A single text element can use several different fonts and font-styles.
- Text-Elements can contain inline-images.
- Dynamic-Height Text-Elements will be considerably faster.
- The layouter now has documented rules for how the layout is computed.
For me, the last point is the most important one. It took me several days to understand how the old layouting system really works. It is easy if all elements are static. But as soon as one element uses relative positioning (percentages instead of absolute values), things become very messy. With the new system in place, the report layout finally behaves deterministically.
But you always pay a price ..
As the layouter now uses different principles to perform its job, the internal API has changed massively. The old OutputTargets are almost gone - right now they are dummy classes to maintain some backward compatibility. Nearly all of the old layouter implementation classes are gone; only a few will continue to exist as dummy or backward-compatibility classes.
There will be edge cases where the new layouting engine does not behave exactly like the old one. Everyone who relied on the old behaviour, where bands never span multiple pages, will now have to change their report definitions.
Functions which modify the page-footer or a repeating group-footer during the page-finished event will produce different results now. In the old days, such changes never changed the page-break position. Now, this can happen. If the page-footer grows, it may push the last band to the next page. However, it is still better not to change the footer-structure during this event.
Well, the fact that this two-week-old implementation is, right now, slower than the old one should surprise no one. Once we have the same insane amount of caching and heuristics in place, we shall return to our old performance high.
One final word: Once this implementation is complete, the next release of the classic-engine will be called 'JFreeReport-0.8.9'. One step closer to a '1.0' version.
Pagination kills me
Not that pagination is overly complicated. Pagination is a simple process that involves shifting a couple of boxes downward whenever one of them crosses a pagebreak.
The whole content is already laid out (ignoring all breaks so far) and now all I have to do is the simple shifting. But I can't. It involves logic and mathematics (addition, subtraction, nothing else). Logic and maths trigger something deep inside me - and a simple task becomes a living nightmare.
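For the record, the shifting itself can be sketched in a few lines. This is a toy model, not LibLayout code: the `Box` and `SimplePaginator` classes are invented for illustration, the boxes are assumed to be stacked vertically and sorted top to bottom, and every page has the same height. A box that crosses a break is moved to the top of the next page, and everything below it moves down by the same amount. A box taller than a whole page is left where it is, since it has to span pages anyway.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a laid-out box: only its vertical position and height.
final class Box {
    final long y;
    final long height;

    Box(long y, long height) {
        this.y = y;
        this.height = height;
    }
}

// Hypothetical paginator: shifts boxes downward so that none of them crosses
// a page break, unless the box is taller than a page.
final class SimplePaginator {
    static List<Box> paginate(List<Box> boxes, long pageHeight) {
        List<Box> result = new ArrayList<>();
        long shift = 0; // accumulated downward shift caused by earlier breaks
        for (Box box : boxes) {
            long y = box.y + shift;
            long pageEnd = (y / pageHeight + 1) * pageHeight;
            boolean crossesBreak = y + box.height > pageEnd;
            boolean fitsOnOnePage = box.height <= pageHeight;
            if (crossesBreak && fitsOnOnePage) {
                // Move the box to the start of the next page and remember
                // the gap so that all following boxes move along with it.
                shift += pageEnd - y;
                y = pageEnd;
            }
            result.add(new Box(y, box.height));
        }
        return result;
    }
}
```

With a page height of 100, three boxes at y = 0, 60 and 120 with heights 60, 60 and 30 end up at y = 0, 100 and 160: the second box crosses the first break, moves down by 40, and drags the third one along. Addition and subtraction, nothing else. The real thing gets nasty once nested boxes, headers and footers join the party.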
Now I'm really glad that LibLayout's rendering follows such a strict separation of concerns. At least I can be sure that the common layouting works. This really makes it easy to locate the place where things go wrong.
Ahh, sooner or later, after long nights of coding, this thing will start to work. If everything else fails, I can still resort to other means and summon a daemon or two to help me with that. They still owe me a favor for the 'extended-XML' format. This one most certainly caused more cursing than any rush-hour traffic jam on the A5 in Frankfurt.
Sometimes things are just messed up ..
If you download any of the 0.8.8 versions of Pentaho Reporting Classic, you will notice one thing: The last page-footer does not contain any values. It contains labels, but it never displays function results or other data-items.
After this bug survived undetected for three months (or maybe simply no one cared enough to file a bug-report), I finally received notice of this bug's existence. Five minutes of debugging later, I got that very ugly feeling. No, this was not a simple bug - it came directly from the 7th circle of hell: A logic bug.
When we introduced sub-reporting into the engine, we also had to change the datasource management of the reporting engine. In the old days, there was not much of a datasource management. Everything was static - when the report started we knew exactly what columns would be available, and that number would never change. Oh, life was beautiful and simple.
But then came the subreporting and everything changed. Subreports introduce their own datasources and their own set of functions. Suddenly the simplicity was gone forever - and would never come back again.
As the Flow-Engine already handles such dynamic datasources (and I really, really hate to maintain two independent implementations for the same problem), we simply grabbed the datasource-code from that branch and ported it to the classic engine. On the positive side, it gave us the whole datasource and query-management features of the flow engine along with the very flexible parameter-passing mechanism. But we had to pay a price: We now have to maintain a data-context so that we know when to add and when to remove columns from the datasource.
As the classic engine is very simple, so is the context management. There is only one context at all, and that's the report or subreport.
It could have worked out so nicely.
The page-footer is not printed until the system detects that the page is filled. As outlined before, although we layout the page-footer all the time, we do not copy it to the final page until the end of the page has been reached. This actual copying is done outside of the normal report processing. For the last page, however, the report processing has already finished. The current context is closed and everyone is waiting for the garbage-collector. And at this point, printing the page-footer fails.
How to fix it: Ok, we could start tweaking and start adding workarounds until the system behaves as expected. We did it before, it surely produces a solution - somehow. An unmaintainable solution that eats souls like others eat cornflakes, a nightmare for those who have to maintain the beast and have to fix bugs there.
As chances are good that I'm the one who will be called for that, this patching is no solution. I'm crazy, but not stupid.
And here's my favourite approach: (1) Look at the flow-engine and how they deal with page-footer definitions. (2) Copy the solution. (3) Be happy.
In the Flow engine, page-header and footer are defined together before the content is generated. These footers are then pushed down to the layouter, where they get replicated on each new page. Whenever the header or footer changes, the engine simply replaces the stored template with a new one. (And if you look closer at it, then this approach is not that far away from the repeated layouting of the page-footer to compute the available space, as we did it in the classic engine.) The Classic-Engine even already has a storage layer implemented between the report-engine's element definitions and the output target's content. The 'MetaBands' and the 'MetaPage' classes currently serve as some kind of a spooling mechanism. It should not be too complicated to transform this into a real layouting layer.
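The template idea can be sketched roughly as follows. This is a toy illustration and not the Flow-Engine's actual code: the class `FooterReplicatingLayouter` and its methods are invented, and plain strings stand in for real layout content. The point is that the layouter owns the current footer template, so finishing the last page does not depend on a report context that may already be closed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layouter that keeps the current footer as a template and
// stamps it onto every page it finishes.
final class FooterReplicatingLayouter {
    private String footerTemplate = "";
    private final StringBuilder currentPage = new StringBuilder();
    private final List<String> finishedPages = new ArrayList<>();

    // Called by the engine whenever the footer definition or its values
    // change; the new template replaces the stored one.
    void setFooterTemplate(String footer) {
        this.footerTemplate = footer;
    }

    void addContent(String content) {
        currentPage.append(content).append('\n');
    }

    // Closes the page using the stored template. No access to the report's
    // data-context is needed at this point.
    void finishPage() {
        finishedPages.add(currentPage + footerTemplate);
        currentPage.setLength(0);
    }

    List<String> pages() {
        return finishedPages;
    }
}
```

In this model the engine pushes a fresh footer whenever a function result changes, and the layouter can finish the last page long after the report processing itself is done.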
And yet another step on the long road to get both engines aligned.