What Pentaho Reporting can do for you

Pentaho Reporting allows you to refine your raw data into visually appealing reports that convey all the information you need to make better decisions and to get your job done faster. The open architecture of the reporting system and our Open-Source nature make it a breeze to integrate the reporting engine into your existing systems.

Many of the world's leading enterprises already use our technology to gain a competitive edge. What are you waiting for? Download it now!

Learn more about Pentaho Reporting

Current Stable: Pentaho Reporting 3.8.3

Previous: Pentaho Reporting 3.8.2

In Development: Pentaho Reporting 4.0.0

Development for this version has just started. Relax, it will take a while. Crosstabs are coming ..

Tuesday, April 13, 2010

What is that 'drill-linking'?

On Pentaho's roadmap shown earlier this year, the first item for Pentaho Reporting was a mysterious bullet point called "Drill-Linking", with no further explanation on the slides of what this actually means.

Drill-linking is, very generally speaking, the ability to connect reports with other reports or, more generally, with any web-based system via hyperlinks.

At the moment, drill-linking can be done by adding a style-expression on the href-style-key on any element. There is no big magic to it and it is (in most cases) reasonably simple as long as you know where you link to.

As this is the only roadmap feature for the upcoming 3.7 release, there must be something more to it. (After all, you don't make a release out of nothing, do you? It would be Dilbert-style weasel work if we did. But then again, it's product management and marketing, so everything that is not outright illegal is probably allowed and has been done somewhere in the past.)

So what is our idea of drill-linking?


Foremost, it is a new method of computing URLs for use in the href style-key. For that I created a new formula function called "DRILLDOWN", but more on that later.

It also includes work on the platform components, so that we can ask published reports, XActions, Dashboards and so on for their expected parameter set, so that we can provide a sensible UI for the editor.

And last but not least, I'm keeping an age-old promise and extending the JFreeChart-expressions to generate image-maps. This allows you to put links on elements in the chart-plot, so that you can click on a pie slice to jump to a more detailed report that tells you all about that specific country/product/whatever.

DRILLDOWN function? What's that?


The DRILLDOWN function is a declarative way to define URLs. Whenever we want to build a user interface for a feature, it is most helpful if the values we want to edit exist in a well-defined data structure instead of being hidden like Easter eggs in a programming language.

(This is the same argument we had about XActions vs. PRPT, so if you know the reasons for having a cleanly editable model, you can skip the next parts. But as I still - from time to time - get those questions on why we had to use our own datasources when the XAction system already worked in the platform, I will go through the same arguments for the DRILLDOWN as well. At least I can point everyone to this post and won't have to waste my breath repeating myself again and again.)

Up until now, URLs are computed via a formula that most of the time looks something like this:
="http://localhost/pentaho/content/reporting?solution=steel-wheels&path=&name=report.prpt&parameter=" & [value]
or it looks like one of these:
= [::env::baseURL] & "/content/reporting?solution=steel-wheels&path=&name=report.prpt&parameter=" & [value]

= [::env::baseURL] & "/content/reporting?" & MESSAGE("solution={0}&path={1}&name={2}&parameter={3}"; "steel-wheels"; ""; "report.prpt"; [value])


This stuff is easy and obviously works. But the problem is - as so often - in the details.

I can come up with a million ways of generating the same URL with different formulas. These formulas can be simple or complex, but they won't be easily parseable, so I cannot extract the values for the UI in an automated way. And whenever we fail to extract the values correctly, we will be beaten up by the users for providing a buggy system or a bad user experience.

We _could_ work around that by saying: Dear user, we only parse a very specific format. If you are not following that rule, you are on your own. (In the XAction world, this "workaround" was suggested for our PRD-3.5 as well - only accept XActions that look like the ones generated by PRD. And anyone with non-standard problems would be left out in the rain.)

Nah, such a solution is not a solution at all. First, no one reads manuals, where that limitation and the valid format would be documented. I don't, and I guess you don't either. So the first thing I would receive is bug reports on why this formula is not working in the Drill-Down UI. Bah! I want to spend the rest of my life lazily sitting at the beach, and not answering these reports with "RTFM".

Parsing a program (and a formula or an XAction is a program), predicting what it does, and working out how to change it to do something else is a very interesting research area. But the only research I want to do has to do with beaches at the Irish Sea, not with geeky problems.

(Welcome back to everyone who skipped the previous paragraphs.)

So we need something that splits the various parts of a URL into separate fields. Let's look at what fields we have:

  1. Protocol, Host, port, web-application-base-url:

    This tells us where the web-application sits.

  2. Content-path:

    This points to the reporting plugin, the XAction-servlet and so on. It can also point to a PHP page, if we want to link there.

  3. Parameter:

    In the Pentaho world, we have three standard parameters that are always there: solution, path and name, which (for historical reasons) contain parts of the solution path we link to. When linking to a non-Pentaho service, these can be omitted.

    And then, of course, we have loads of user-defined parameters that the report, XAction or PHP file could accept.

    All parameters are name-value pairs, so a 2D-array seems a natural solution.

So our DRILLDOWN function would look somewhat like this:
DRILLDOWN("http://localhost/pentaho"; "/content/reporting";
{{"solution" | "steel-wheels"} |
{"path" | ""} |
{"name" | "report.prpt" } |
{"parameter" | [value] }})
Not bad for a first throw. Now even a brain-dead monkey could write a UI without having to worry about how to parse complex formulas into a usable editor model.
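To make the point concrete, here is a minimal Python sketch of the same idea. It is not the reporting engine's code; the function name `drilldown`, its signature and the sample value "Classic Cars" are assumptions made for illustration. What it shows is that once the URL parts are kept as separate fields, assembling the link (including escaping the values) becomes mechanical.

```python
from urllib.parse import urlencode


def drilldown(base_url, content_path, parameters):
    """Assemble a link from separate fields (hypothetical helper).

    parameters is a list of (name, value) pairs, mirroring the 2D array
    of the DRILLDOWN sketch above.
    """
    # urlencode escapes names and values, so nobody has to hand-quote them.
    query = urlencode([(name, str(value)) for name, value in parameters])
    link = f"{base_url}{content_path}"
    return f"{link}?{query}" if query else link


url = drilldown(
    "http://localhost/pentaho",
    "/content/reporting",
    [
        ("solution", "steel-wheels"),
        ("path", ""),
        ("name", "report.prpt"),
        ("parameter", "Classic Cars"),  # stands in for [value]
    ],
)
print(url)
```

An editor can read and write the three arguments directly; there is no formula text to parse.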

But there is more to it. First, we know that for reports linking to the same server, we can replace the hardcoded server address with the "env::baseURL" variable to make that report more maintainable.

Next, the "/content/reporting" is only valid for links to PRPT files. For XActions that part looks different (heck, I never can remember the real value!), as it does for dashboard links.

Boring stuff like that screams to be abstracted away from the unsuspecting user.

Every link can be classified into one or more groups. We can ask questions like:
  • Do you want to link to a local report or a report on another server?
  • Do you want to link to a PRPT file, an XAction or something else?
Let me call the combined set of answers to these questions "profiles". And hey, let's give our profiles some descriptive names and hide the technical gibberish that so confuses the poor user. Then our drill-linking function could look somewhat like this:
DRILLDOWN ("remote-prpt"; "http://localhost/pentaho"; {...  parameter-array here ..})
DRILLDOWN ("local-prpt"; "<not> used"; {...  parameter-array here ..})
The various profiles, along with the ugly technical stuff, are then safely tucked away in a configuration file. Now if we make these profiles (admin-)user configurable, we actually open up a totally new playground. Now we can do stuff like
DRILLDOWN ("accounting-prpt"; "<not> used"; {...  parameter-array here ..})
and link to the almost forgotten server in the accounting department and, best of all, we don't have to remember the IP or DNS name. And if they ever move the server, one change in the profile configuration fixes all the reports.

The profiles would be defined in a configuration file in the reporting engine and (as usual) can be overridden by a local configuration. This way, "local" can take up different meanings depending on where the report runs. Now, if the profile configuration also defines how to produce the URL, it is easy to let knowledgeable admins create their own link-profiles to 3rd party systems.
DRILLDOWN ("legacy-erp"; "<not> used"; {...  parameter-array here ..})
could easily produce a link to your ERP system with the customer number or product id passed down. And again your users would not have to worry about the ten-foot-long URL that is needed to make that happen.
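A rough Python sketch of how such a profile lookup could work follows. Everything in it is an assumption for illustration: the `PROFILES` table, the "local" and "argument" markers, the accounting host name and the default local base URL are all made up, and the real configuration format is not described in this post.

```python
from urllib.parse import urlencode

# Hypothetical profile configuration. In practice this would live in a
# configuration file; every name and address below is invented.
PROFILES = {
    "local-prpt": {"base_url": "local", "content_path": "/content/reporting"},
    "remote-prpt": {"base_url": "argument", "content_path": "/content/reporting"},
    "accounting-prpt": {
        "base_url": "http://accounting.example.com/pentaho",
        "content_path": "/content/reporting",
    },
}


def drilldown(profile_name, host, parameters,
              local_base_url="http://localhost:8080/pentaho"):
    """Resolve a profile to a URL prefix and append the parameters."""
    profile = PROFILES[profile_name]
    if profile["base_url"] == "local":
        base = local_base_url       # stands in for the env::baseURL value
    elif profile["base_url"] == "argument":
        base = host                 # the caller supplies the remote server
    else:
        base = profile["base_url"]  # fixed by the admin in the profile
    return f"{base}{profile['content_path']}?{urlencode(parameters)}"


print(drilldown("accounting-prpt", None, [("name", "report.prpt")]))
```

If the accounting server moves, only the `PROFILES` entry changes; the calls in the reports stay the same.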

How would the UI know what parameters and fields to offer?


Well, this is a problem I prefer to solve by not solving it. OK, that was too much Zen. So let's talk about the UI for the Drill-Down system.

In its most simple form, we can present it as a set of input fields, each one corresponding to one slot in the DRILLDOWN function, and a plain table representing the parameter name/value array.

It's generic, it's simple, but it's just not sufficient.

For Pentaho-links, we have to maintain the three special parameters (solution, path and name) internally. I prefer to give users a file-system-like view with a single path to enter. After all, we are long past the DOS drive-letter system (ok, most of us are). So let's make that thing one input and let the UI code figure out how to split that into the three parameters.
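As a sketch of what "let the UI code figure out how to split" could mean, the Python snippet below splits a single path into the three parameters. The rule it uses (first segment is the solution, last segment is the name, everything in between is the path) is my assumption, chosen so that "steel-wheels/report.prpt" gives the empty path seen in the earlier examples; the function name is hypothetical.

```python
def split_solution_path(full_path):
    """Split one file-system-like path into solution, path and name.

    Assumed rule: first segment = solution, last segment = name,
    anything in between = path (possibly empty).
    """
    parts = [segment for segment in full_path.split("/") if segment]
    if len(parts) < 2:
        raise ValueError("expected at least '<solution>/<name>'")
    solution, *folders, name = parts
    return {"solution": solution, "path": "/".join(folders), "name": name}


print(split_solution_path("/steel-wheels/reports/sales/report.prpt"))
print(split_solution_path("steel-wheels/report.prpt"))
```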

For third party links, we have to keep this a single path.

For remote links, we need to show the host-URL inputs, while for local ones this would be dead-confusing.

For PRPTs we probably want to show the system-level properties as well. Not everyone wants to paginate, and not everyone wants HTML as output.

So in short - thinking of a single unified UI makes my brain hurt. And whenever that happens, I delegate (in the worst sense of the word) my problems to someone else. The someone else happens to be you.

In my book, a pluggable profile system needs a pluggable UI as well. We provide sensible UIs for the profiles we define, and a rather generic one for everything else (some sort of fall-back, in case you or your admin is as lazy as I am).
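One way to picture a pluggable UI with a fall-back is a simple registry keyed by profile name. The Python sketch below is only an illustration of that pattern; the class names, the field lists and the `register_editor` function are all hypothetical and say nothing about how the report designer actually does it.

```python
class GenericLinkEditor:
    """Fall-back editor: plain input fields plus a name/value table."""

    def fields(self):
        return ["profile", "host", "parameters"]


class LocalPrptLinkEditor:
    """Editor tailored to local PRPT links (hypothetical field list)."""

    def fields(self):
        return ["report path", "output type", "paginate", "parameters"]


# Registry of profile name -> editor class; plugins add their own entries.
EDITORS = {"local-prpt": LocalPrptLinkEditor}


def register_editor(profile_name, editor_class):
    EDITORS[profile_name] = editor_class


def editor_for(profile_name):
    # Unknown profiles get the generic editor instead of an error.
    return EDITORS.get(profile_name, GenericLinkEditor)()


print(editor_for("local-prpt").fields())
print(editor_for("legacy-erp").fields())
```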

If you have a complex URL building requirement, then at least you can provide your own UI. If you're happy with the ones we provide - fine, then we all can spend our time drinking beer instead of coding. And if the generic solution is good enough to cover common linking problems well, then only system integrators will have to worry about actually writing UI-plugins. But if your users start to nag, and after playing World of Warcraft on the servers got boring, then you have an easy way to provide them the slick made-to-measure UI they always begged for.

Saturday, April 10, 2010

Agile without fast tools ain't agile: Tuning our performance ..

From time to time, we get some .. extraordinary .. requests that reveal how our customers use the reporting system. There's the requirement that we create insanely large reports (400,000 rows resulting in 75,000 pages in a PDF file), or HTML files so large that no browser could ever render them. (Forget about the BORG-virus, just send them one of those files and watch them go down.)

As a general rule, I treat such requests as a nice way to test and optimize the performance of our reporting engine. I focus primarily on making small, well, human-readable reports fast. Making the huge ones faster is ok, and when I get the chance, I happily optimize that as well. But if I have to choose between making smaller reports slower and making the insane ones slightly faster, then I happily resist the change. After all, whether your CPU burns 5 hours or 4 hours does not matter if you are not going to look at the report until 9 hours later. But waiting 10 seconds instead of 20 seconds for a report during your work day surely makes all the difference.

During the last few weeks, I once again had such a case. The customer needed to produce a large-scale report, probably just to fulfil some ill-thought-out government regulations. But for some reason, the report constantly failed with OutOfMemoryErrors.

(Yes, this is the moment where a support contract comes in really handy. ;) )

Memory management is usually a rather critical issue. For our reporting engine, it is even more critical, as this engine is based on the idea that all reporting problems can be solved in the available memory, without making a mess in your temp-directory. Actually, I'm way too old to believe in the "throw more memory/CPU/disk-space/nodes" myth. If you can solve the problem efficiently in an embedded-systems scenario, you can always scale up. But if you assume everyone has a high-end system, your code probably won't scale down that nicely.

So ok, we are an all-in-memory engine, and I want to keep things like that for a while. Therefore I work with an assumed limitation of 128MB for normal reports (<3000 pages) and 512MB for anything else.* Less memory consumption for report runs means you can run more reports on your server at the same time. People seem to like that idea.

After digging through the case, running a sample report, I discovered a couple of conditions, where we started to add up memory during the report processing at a rather unhappy rate. During profiling, I also discovered a bunch of non-optimal (polite for: purely crappy) data-structures I introduced years and years ago, which make the problem even worse. Oh, and the customer uses engine version 0.8.9 - not my favourite place to spend my time either.

After loads of tests, loads of profiling, loads of just waiting for results (ye olde MacBook ain't that fast), we are now at the happy spot of reporting success.

In 0.8.9 and 3.6.1, this report now runs within the 512MB barrier. It is not lightning fast, but it completes within 90 minutes here, and thus it is fast enough for a nightly batch processing run. (In 0.8.9, the table (HTML, CSV etc.) exports need a lot more memory and thus require access to a full 2GB heap. Luckily that condition had been fixed in the 3.5 codeline.)

In the 3.7 codeline, I eliminated the last few memory hogs and there the same report runs within the 128MB corset. As these changes required some non-trivial API changes, this is nothing I could sanely add to a bug-fix release.

An updated build for the 0.8.9-reporting engine can be found in our Hudson system. Be aware that you also have to replace libfonts with the version supplied here, as it contains other performance fixes (+some API changes) we've made earlier on for a different customer.

Hudson job: LEGACY_classic_engine_core_089_bugfix

While working on that issue, PRD-2579 came up. This case reports that report processing has been slower in the 3.5-versions than it has been in the 0.8.9-versions. A bit of investigation showed that this is indeed the case and that we had better fix it before the higher CPU utilization causes more global warming.

The initial tests showed that PDF generation and print(-preview) were about 4 times faster in 3.5 than in 0.8.9-10. But HTML export was slow: 30 seconds in 3.5 versus 10 seconds in 0.8.9. As I tend to work primarily with the Swing-preview or the PDF exports, I never noticed that part. BI-Server users tend to see more HTML exports than anything else, and there the slowdown matters.

Adding smarter caches solved the slowdown, which was originally caused by the fix for the table-export memory consumption problem found in 0.8.9. In combination with some other performance fixes, our table export rendering speed is (nearly) back to where it was in the old days, while the PDF speed is faster than ever. (And ya can't complain about a 4x speedup!)

Right now, I'm busy making more bug-fixes for the 3.6.1 release, which is at the moment scheduled for April 22 (.. this year).

Pentaho Reporting 3.7 with the new drill-linking API should be out in the wild within Q2-2010.

As the 3.7-codeline is currently a bit "funny", you might want to check out the 3.6-branch CI-builds instead.



* Subject to change if I ever get access to the BORG-cluster. You will be assimilated, but I have all the CPU time of the world. :)