What Pentaho Reporting can do for you

Pentaho Reporting allows you to refine your raw data into visually appealing reports that convey all the information you need to make better decisions and to get your job done faster. The open architecture of the reporting system and our Open-Source nature make it a breeze to integrate the reporting engine into your existing systems.

Many of the world's leading enterprises already use our technology to gain a competitive edge. What are you waiting for? Download it now!

Current Stable: Pentaho Reporting 3.8.3

Previous: Pentaho Reporting 3.8.2

In Development: Pentaho Reporting 4.0.0

Development for this version has just started. Relax, it will take a while. Crosstabs are coming ...

Wednesday, September 22, 2010

It's Community time

Two more days and we will all have a blast of a time in Lisbon. This year, with a large release close by, I will concentrate solely on picking the brains of our fellow community members to see what bugs everyone most.

Developing a project as large and complex as Pentaho Reporting is like steering a large container ship. Many times we find ourselves restricted by the existing API and the holy grail of API-level backward compatibility. Letting reports tell the datasources how to sort data? Forget it. Efficient crosstabbing within the corset of the layouting API - difficult. Multi-column layout without exploding memory consumption - you must be kidding!

The next release will be a major-number release called "4.0" - so I'm finally free to get rid of a couple of old mistakes and to comb through the API to bury a few dead bodies. For releases like this we can break up APIs and carve in more efficient ways to funnel information around. The sky is the limit once more.

Hopefully, armed with loads of feedback and real-world needs and wishes, we can then lay the architectural groundwork for the years to come. So don't get curbed by reality - bring in your deepest desires. When I know your wants and wishes for the next few years, I can shape the API so that we can bring them all in.

Right now my personal list looks like this (in no particular order, and not everything will come in the 4.0 release ;) ):

  • Truly Dynamic report parameter - enable/disable based on previous parameters
  • Auto-Sorted data from datasources
  • Native Crosstabbing
  • CSS based styles system with all the perks it brings (selectors, external style definitions)
  • Dynamically loaded subreports
  • CDA integration (yes, Pedro, I know I'm slow)
  • Add ProtoVis and Pentaho Charts to the designer (yes, I know, Pedro, I know ...)
  • Cell-based metadata support
  • OLE-style inline-editor for subreports - see your content instead of a generic blue box
  • Parameter layouting
  • GeoBI components (Maps, visualizations)
  • PDF files as report content
  • JavaScript widget library for HTML and PDF
  • Kill the WAQR
  • A Lightweight Web-Based Report Designer (iWAQR)
  • Better Inline-Subreport performance
  • Multi-Column layout
  • A smarter Report-Wizard

So what does your wishlist look like?

Monday, September 20, 2010

Performance tuning settings for Pentaho Reporting

General thoughts on report processing and performance


Performance of PR is mainly dependent on the amount of content printed. The more content you generate, the more time we will need to perform all layout computations.

Use Inline Subreports with care


Large inline-subreports are the most notorious reason for bad performance. The layouted output of an inline-subreport is always stored in memory. Layouting pauses until the subreport is fully generated; the result is then inserted into the layout model and subsequently printed. Memory consumption for this layouting model is high, as the full layout model is kept in memory until the report is finished. If the amount of content of the subreport is huge, you will run into out-of-memory exceptions in no time.

An inline-subreport that consumes the full width of the root-level band should be converted into a banded subreport. Banded subreports are layouted and all output is generated while the subreport is processed. The memory footprint for that is small, as only the active band or the active page has to be held in memory.

Resource Caching


When images are embedded from servers (HTTP/FTP sources), it is critical for good performance that the server produces a Last-Modified header. We use that header as part of the caching. A missing header means we do not cache the resource and will reload the image every time we access it.

As a general rule of thumb: Caching must be configured properly via a valid EHCache file. If caching is disabled or misconfigured, then we will run into performance trouble when loading reports and resources.

Performance Considerations for Output types


Within PR there are three output types, each with its own memory and CPU consumption characteristics.

(1) Pageable Outputs


A pageable report generates a stream of pages. Each page has the same height, even if the page is not fully filled with content. When a page is filled, the layouted page will be passed over to the output target to render it in either a Graphics2D context or a streaming output (PDF, Plaintext, HTML etc).

Prefer "break-after" over "break-before" pagebreaks.


When the content contains a manual pagebreak, the page will be considered full. If the pagebreak is a "before-print" break, the break will be converted into an "after-break", the internal report states will be rolled back, and parts of the report processing restart to regenerate the layout with the new constraints. A similar roll-back happens if the current band does not fit on the page.

Stored PageStates: Good for Browsing a report, but eat memory


When processing a pageable report, the reporting engine assumes that the report will be run in interactive mode. To make browsing through the pages faster, a number of page-states will be stored to allow us to restart output processing at that point.

Reports that are run to fully export all pages usually do not need to store those pagestates. A series of settings controls the number and frequency of the pagestates stored:

org.pentaho.reporting.engine.classic.core.performance.pagestates.PrimaryPoolSize=20
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolFrequency=4
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolSize=100
org.pentaho.reporting.engine.classic.core.performance.pagestates.TertiaryPoolFrequency=10

The reporting engine uses three lists to store the page-states. The default configuration looks as follows:

The first 20 states (Pages 1 to 20) are stored into the primary pool. All states are stored with strong references and will not be garbage collected.

The next 400 states (pages 21 to 420) are stored into the secondary pool. Of those, every fourth state is stored with a strong reference and cannot be garbage collected as long as the report processor is open.

All subsequent states (pages > 420) are stored in the tertiary pool and every tenth state is stored as a strong reference.

For a 2000-page report, a total of about 270 states will be stored with strong references.

In server mode, the settings could be cut down to:
org.pentaho.reporting.engine.classic.core.performance.pagestates.PrimaryPoolSize=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolFrequency=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolSize=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.TertiaryPoolFrequency=100

This reduces the number of states stored for a 2000-page report to 22, thus cutting the memory consumption for the page states to roughly a tenth.
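The pool arithmetic above can be approximated with a few lines of Java. This is only a back-of-the-envelope sketch, not the engine's code: the class and method names are made up for illustration, and it assumes that the secondary pool spans SecondaryPoolSize times SecondaryPoolFrequency pages and that each pool keeps exactly one strong reference per "frequency" pages. The engine's exact boundary handling may differ by a few states, which is why the results land near, rather than exactly on, the figures quoted above.

```java
public class PageStateEstimate {

    /**
     * Rough estimate of the number of page states held with strong references.
     * Assumptions (not taken from the engine source): the primary pool keeps
     * every state, the secondary pool keeps one state per secondaryFrequency
     * pages for secondarySize entries, and the tertiary pool keeps one state
     * per tertiaryFrequency pages for everything that follows.
     */
    public static int strongStates(int pages, int primarySize, int secondaryFrequency,
                                   int secondarySize, int tertiaryFrequency) {
        int primary = Math.min(pages, primarySize);
        int remaining = pages - primary;

        // Pages covered by the secondary pool.
        int secondarySpan = Math.min(remaining, secondarySize * secondaryFrequency);
        int secondary = secondarySpan / secondaryFrequency;
        remaining -= secondarySpan;

        // Everything beyond the secondary pool lands in the tertiary pool.
        int tertiary = remaining / tertiaryFrequency;
        return primary + secondary + tertiary;
    }

    public static void main(String[] args) {
        // Default settings: 20 + 100 + 158 strong references.
        System.out.println(strongStates(2000, 20, 4, 100, 10));  // prints 278
        // Server-mode settings: 1 + 1 + 19 strong references.
        System.out.println(strongStates(2000, 1, 1, 1, 100));    // prints 21
    }
}
```

Playing with the tertiary frequency shows where the savings come from: for long reports almost all strong references live in the tertiary pool.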

(Note: In PRD 3.7 full exports no longer generate page states and thus these settings will have no effect on such exports. They still affect the interactive mode.)

(2) Table exports


A table export produces a table output from a fully layouted display model. A table export cannot handle overlapping elements and therefore has to remove such elements.

To support the debugging of report layouts, we store a lot of extra information into the layout model. This increases the memory consumption but makes developing reporting solutions easier. These debug settings should never be enabled in production environments. In 3.6 and earlier, the pre-built "classic-engine" has them enabled, as this helps inexperienced developers to find their report-definition errors faster.
org.pentaho.reporting.engine.classic.core.modules.output.table.base.ReportCellConflicts=true
org.pentaho.reporting.engine.classic.core.modules.output.table.base.VerboseCellMarkers=true

Note: With PRD-3.7, the defaults for these settings will change to "false" as we assume that most users use PRD for developing reports now. PRD comes with its own method to detect overlapping elements and does not rely on these settings.

Special Notes on the HTML export


In HTML exports, there are a few settings that can affect export performance.
org.pentaho.reporting.engine.classic.core.modules.output.table.html.CopyExternalImages=true

This controls whether images linked from HTTP(S) or FTP sources are linked from their original source or copied (and possibly recoded) into the output directory. The default is "true" as this ensures that reports always have the same image.

Set to false if the image is dynamically generated and should display the most recent view.
org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles=false
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle=true
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting=true

The style settings and the buffered writing settings control how stylesheets are produced and whether the generated HTML output will be held in a buffer until the report processing is finished.

Style information can either be inlined, stored in an external *.css file, or be contained in the <head> element of the generated HTML file (InlineStyles == false and ExternalStyle == false).

Buffering is forced when styles need to be inserted into the <head> element of a report. Buffering should be set to true if the resulting report is read by a browser, as browsers request all resources they find in the HTML stream. If a browser requests the stylesheet that has not yet been fully generated, the report cannot display correctly.

It is safe to disable buffering if the styles are inlined, as the browser will not need to fetch an external stylesheet in that case.
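The interplay of the three settings can be summed up in a small decision helper. This is a sketch of the rules as described above, not code from the HTML export module: the class, enum and method names are hypothetical, and the precedence used when both InlineStyles and ExternalStyle are set to true is an assumption.

```java
public class HtmlStylePlacement {

    public enum StyleTarget { INLINE, EXTERNAL_FILE, HEAD_ELEMENT }

    /** Maps the InlineStyles / ExternalStyle flags to the place where styles end up. */
    public static StyleTarget styleTarget(boolean inlineStyles, boolean externalStyle) {
        if (inlineStyles) {
            return StyleTarget.INLINE; // assumption: inline wins if both flags are set
        }
        return externalStyle ? StyleTarget.EXTERNAL_FILE : StyleTarget.HEAD_ELEMENT;
    }

    /** Whether the output should be held back until report processing has finished. */
    public static boolean shouldBuffer(StyleTarget target, boolean readByBrowser) {
        switch (target) {
            case HEAD_ELEMENT:
                return true;          // styles are only complete at the end: buffering is forced
            case EXTERNAL_FILE:
                return readByBrowser; // a browser may request the stylesheet before it is complete
            default:
                return false;         // inline styles: safe to stream the HTML as it is generated
        }
    }

    public static void main(String[] args) {
        // InlineStyles=false, ExternalStyle=true, report viewed in a browser.
        StyleTarget target = styleTarget(false, true);
        System.out.println(target + " buffer=" + shouldBuffer(target, true));
    }
}
```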

Buffered content will appear slower to the user than non-buffered content, as browsers render partial HTML pages while data is still received from the server. Buffering delays that rendering until the report is fully processed on the server.

Tuesday, September 14, 2010

Market-Researchers and Consultants with an Agenda

My lovely friend "Google Alerts" regularly brings lovely pieces of writing to my desk - often to my enjoyment and many times to my astonishment at how little research goes into research these days.

Market-Researchers - Tell me whose song I sing


Let's start with Actuate trumpeting funny claims around that BIRT is the leader in Opensource BI. Let's see what the report says:

"Actuate BIRT leads the pack. This is a prime example where one needs to clearly understand what they are getting, as Actuate cannot really be directly compared with the other vendors in this Forrester Wave. As we mentioned earlier, other than a few basic reporting components from the Eclipse BIRT — the community edition of the product — Actuate BIRT is mostly a commercial offering. Also, Actuate BIRT is a “traditional” or “pure play” (reporting, analytics, and dashboards only) BI suite that does not offer advanced analytics, and it only offers limited data integration functionality."
(Pg 8 of "The Forrester Wave™: Open Source Business Intelligence (BI), Q3 2010")


and further down

"Eclipse BIRT and Jaspersoft Community editions lack enterprise BI suite functionality. These community editions of open source projects cannot really stand on their own as enterprise BI suite platforms or solutions."
(Pg 10 of "The Forrester Wave™: Open Source Business Intelligence (BI), Q3 2010")


Ouch! Let me paraphrase that: "BIRT (OpenSource) is crippled. You need the expensive commercial version if you want results." A non-opensource product leads an opensource wave. Hmm ... I bet Crystal Reports also contains bits of opensource components (I would be surprised if not), so by that standard do they qualify as Opensource players too?

And I will probably never understand why they test the EE edition of PRD 3.6 but the CE edition of PRD 3.5. Especially as (except for the documentation and the number of database drivers shipped with it) the CE and EE editions are the same code. The enterprise value for our reporting offering comes with the BI-Server and its ability to easily share reports (and even here CE brings you far). From my point of view, the Enterprise edition's primary argument is the peace of mind that comes with a support contract for your mission-critical business intelligence installation.

Enough of that fun, I do get easily bored by bullying such easy targets.

Clueless Consultants ... you can't make it up


And the BIRT affiliates of course get their own game going by starting comparisons to justify their goals. Well, I don't mind getting told where the weaknesses of our offerings are. But keep it honest, or stay out of the game. Remember, bogus operations always backfire at some point.

Here is one of these comparisons:
Bogus comparison matrix

They claim there is no support for user defined aggregates. Hmm, we can implement functions in Java and they can be used in reports. And functions are stateful. So they can aggregate. Yes, guess we can allow third parties to implement their own aggregations.

The expression API is well documented and it is rather easy to bring in your own aggregation. They even mention Will Gorman's book in the text, so the documentation is there. Ya' know, books must be read!

They claim that BIRT is the only one that supports custom export formats? All our (and JasperReports') export modules are pluggable - as they are in BIRT. Do they really expect that any sane developer would write a bunch of monolithic code blocks for that stuff? (Hmm, maybe they do. There is a distinct tradition of writing code differently when doing a one-off prototype versus doing a real product you have to maintain over years and years to come. And we product developers care about long-term maintenance, as we aren't paid by the hour by our customers.)

The table claims there is no paginated HTML in Pentaho Reporting. Now that is a bit surprising, as the BI-Server's default rendering mode for reports is .. paginated HTML. They either did not start a report on the server or chose to ignore the fact.

No conditional formatting? Now, this one is tricky. How could they not notice the green plus in the tables, how could they not notice the examples? But the real WTF is this snippet from "Birt-vs.Jasper":
"Conditional formatting is much more difficult with iReport than with either BIRT or Pentaho."


So either it is not there, then how can it be easier, or it is there, then the table is wrong.

All styles and most (95%) of all attributes on an element can be computed via an expression. Again, this is explained very well in Will Gorman's book, which must be read to transfer that knowledge from the pages into brains.

And the "BIRT.vs.Pentaho" text claims
"Pentaho’s expression syntax is OpenFormula, which is based on Excel formulas. While this is easy for developers to use and understand, it is often too limiting for real-world reports. "

Bah, use whatever language you want then! I would never force a single language down someone's throat. Want Java/BeanShell/JavaScript/Groovy? It is built in and ready to go. Want Python/Tcl/Rexx? Go get it by adding the Apache Bean Scripting Framework (BSF) jars for it to the lib directory!

We selected OpenFormula, as this language is as close to Excel as it can be while at the same time avoiding the pitfalls of true Excel formulas (like non-regular grammars and really weird behaviour inherited from Excel 2.0 or so). And as Excel and OO-Calc are still the number one tools in offices around the world, it is a safe bet that these folks actually know how to write Excel formulas. If certain consultants don't know Excel, do I care? Shall I care?

And the one thing that really puzzles me: Code hooks in the designer? I mean, sure, there are scenarios where I want to bring in my own elements, my own data-sources or whatever. And sure as hell, there is a plugin for each of those. Again: Monolithic coding is sooooo 1980!

We call this "plugins" and that is how we delivered a Table-of-Contents element to the community and that is how the charting is integrated.

Code-hooks for OEMs - we also offer that! The Swing-Preview has a ReportMouseListener, a ReportHyperlinkListener and a ReportActionListener you can employ to get notified of user-input. For HTML reports, add your code into one of the many HTML-onmouse* event attributes and you should be ready to go. These properties are not hidden by default, they exist openly and are demonstrated in the demo reports.

Spreading false information may have worked as a tactic in the closed source world, where information was limited and no one had easy access to the competition's products and source code. But these days it just looks so sad. I honestly feel for those souls who get so eaten up by their desperation that they have to resort to spreading outright wrong information. Luckily the market these days is swift, so I might not have to see such suffering for long.

Monday, September 13, 2010

BI-Server parameter input: An authoritative voice

Over the last few release cycles, we have spent a tremendous amount of time making the parameter input easy to use. Driven by community and user feedback, the GWT client received new capabilities on when and how to pass parameters around.

We strive for a "release it quick and release it often" approach, as we would rather solve pain in a suboptimal way today than wait for a perfect solution that comes next year, if ever. And thus it is a sad (and rather natural) truth that not all attempts hit the spot.

Especially the auto-submit feature is a creepy little bugger and has seen numerous hacks, tweaks and finally a full make-over in the upcoming 3.7 release.

Architecture: the inner workings of the reporting plugin in 3.7


The reporting plugin is a GWT application, and thus when first invoked, it will initialize itself, inspect all parameters it received from the URL and fire a non-paginating parameter-request back to the server.

On the server, the Parameter-content-handler handles the request. It reads all values from the URL. URLs only transmit strings, so in a second step, those raw values get parsed into Java objects. The parsed values now get passed down into the reporting engine for further validation. The engine validates the parameters and either accepts the values as fully valid or spits out a bunch of error messages.

If all values are valid and the request asked for pagination, we also paginate the report to return the total number of pages for the report.

The validation and pagination results, along with a description of all available parameters and their accepted values, are sent back to the GWT UI.

The first parameter-response now initializes the GWT application and sets the initial state of the auto-submit feature. If the GWT UI detects that all parameters have been accepted as valid, and if auto-submit is enabled, it now fires a full parameter request that paginates the report.

Subsequent updates will then follow a slightly different protocol:

If auto-submit is disabled, any user-input triggers a lightweight parameter validation request. This ensures that list-parameters always get selections based on the actual user input. The response only updates the parameter definitions and values, and at no point is the report executed.

When auto-submit is enabled, we always fire heavyweight parameter requests, which trigger a pagination run on the server. When such a request returns without error, we do show the report content. This finally invokes the report-content-generator and creates the actual report content that is shown below the parameter input pane.
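The update protocol described above boils down to two small decisions, sketched here in Java. The names are hypothetical and this is not the GWT client's or the server's code; it merely restates the rules from the text: the auto-submit state selects the request type, and the server only paginates (so that content can be shown) for a heavyweight request whose parameters all validated.

```java
public class ParameterRequestSketch {

    public enum Request { LIGHTWEIGHT_VALIDATION, HEAVYWEIGHT_WITH_PAGINATION }

    /** Request fired by the UI after any user input, depending on the auto-submit state. */
    public static Request onUserInput(boolean autoSubmit) {
        return autoSubmit ? Request.HEAVYWEIGHT_WITH_PAGINATION
                          : Request.LIGHTWEIGHT_VALIDATION;
    }

    /**
     * The server paginates only if pagination was requested and all parameters
     * validated. Report content is shown only after such a request succeeded.
     */
    public static boolean paginatesAndShowsReport(Request request, boolean allParametersValid) {
        return request == Request.HEAVYWEIGHT_WITH_PAGINATION && allParametersValid;
    }

    public static void main(String[] args) {
        Request withAutoSubmit = onUserInput(true);
        Request withoutAutoSubmit = onUserInput(false);
        System.out.println(withAutoSubmit + " -> report shown: "
                + paginatesAndShowsReport(withAutoSubmit, true));
        System.out.println(withoutAutoSubmit + " -> report shown: "
                + paginatesAndShowsReport(withoutAutoSubmit, true));
    }
}
```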

The BI-Server 3.6.0 Situation



In this version all parameter validation requests were heavy weight requests. So every parameter validation triggered a pagination.

A pagination run is only triggered if all parameters were validated successfully. To turn the heavy weight parameter requests into a 3.7-style lightweight request, you need to add a parameter that always validates to false until the user explicitly selects it. One simple way of doing that is to add a mandatory check-box parameter with the values "Wait" and "Run" and the post-processing formula

=IF([p] = "WAIT"; NA(); [p])


that translates all "Wait" values into an error value (NA), thus making the parameter fail validation until the user selects "Run".

Thursday, September 2, 2010

Fixed rules: More on Parameter processing

In the famous "let's do it right this time" release of Pentaho Reporting 3.5, we introduced the ability to have parameters on a report. Well, it wasn't quite right: with parameters you need to pre- and post-process the data to make it sound. That was release 3.6. At that point, so our theory went, you, dear user, should be happy. But apparently, the beast we created wasn't all pretty.

Well, it's PRD-3.7 now and guess what we are improving: Parameters.

So far, the date parameter processing was not quite right. I still wonder why after 5 years of XAction no one complained about that. Sure, XActions only have strings, and any processing or parsing is up to you - and so is the blame if it does not work.

The various system level options on the parameter UIs were .. sub-optimal. (I'm getting better at phrasing it more positively, don't you think?) The list of supported parameters (aka "the list of parameters you shall not use in your report") grew with every release. The Swing UI and the server UI never quite agreed on what setting to accept and how to behave in border cases. Thus creating formulas that worked in both settings was a chore.

And last but not least: Validating parameters and getting them to run in a consistent way was difficult. Give an Integer where a Long was expected and you are screwed. Without an error message. Thus even for me, working with the parameters was more of an Easter-egg hunt than sane designing.

And last but not least: Even the parameter processing order was a bit funny. It works for simple cases, but behaves rather funny for the not so simple ones.

How Pentaho Reporting Processes Parameters


Each parameter in Pentaho Reporting carries at least two formulas that eventually need to be evaluated.

The default-value-formula is used to produce a valid value if the user provided no value.

The post-processing-formula is used to transform the user's input into something more usable or simply to validate arbitrary business rules (a deposit cannot be negative, for instance).

And last but not least, if you reference another parameter, you expect it to contain the proper post-processed value.

In PRD-3.6, the order of the validation was largely out of sync with those expectations. In fact, post processing was done in blocks, so that parameters were not able to use other post-processed parameter values in their queries. Now that's bad, and I guess Gretchen will be able to share a few unhappy tales in Lisbon about that.

In PRD-3.7, each parameter is now fully processed on its own before the next defined parameter in the chain is processed.

Let's be more formal for a while:

For each parameter defined:
If the value for the parameter is <null>, we compute the parameter's default value and use that as untrusted parameter value. The default-value formula only sees the previously validated parameters.

In a second step, we post process the parameter to get a trusted value. The post processing formula sees the previously validated parameters and the untrusted value. So be careful how you use the untrusted value here, as you cannot trust users, and SQL-injections or cross-site-scripting troubles are never too far away.

If the post-processing formula fails with an error, the trusted value of the parameter will be <null>, a warning message will be issued and, last but not least, we refuse to execute the report. The parameter processing continues with this value set to <null>.

And finally we check the type of the parameter and compare the parameter value against the list of valid key-values. If the value passes this test it becomes a trusted value and will be used in the further parameter processing and ultimately it will be used in the report.

If the parameter fails the test, we report an error, prevent any report processing and continue to validate the remaining parameters using the parameter's default value.
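The steps above can be sketched as a small, self-contained Java simulation. This is emphatically not the engine's API: Param, Result, process and demoParams are hypothetical names, formulas are modelled as plain Java lambdas, and the handling of <null> values in the type check is an assumption. The sketch only illustrates the ordering: each parameter is fully processed before the next one, and formulas only ever see previously validated values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class ParameterProcessingSketch {

    /** Hypothetical parameter description; these names are not the engine's API. */
    public static final class Param {
        final String name;
        final Class<?> type;
        final List<?> validValues; // null means "no list of key-values"
        final Function<Map<String, Object>, Object> defaultFormula;
        final BiFunction<Map<String, Object>, Object, Object> postProcessing;

        public Param(String name, Class<?> type, List<?> validValues,
                     Function<Map<String, Object>, Object> defaultFormula,
                     BiFunction<Map<String, Object>, Object, Object> postProcessing) {
            this.name = name;
            this.type = type;
            this.validValues = validValues;
            this.defaultFormula = defaultFormula;
            this.postProcessing = postProcessing;
        }
    }

    public static final class Result {
        public final Map<String, Object> trusted = new LinkedHashMap<>();
        public final List<String> messages = new ArrayList<>();

        /** Any warning or error prevents the report from being executed. */
        public boolean canExecute() {
            return messages.isEmpty();
        }
    }

    public static Result process(List<Param> params, Map<String, Object> userInput) {
        Result result = new Result();
        for (Param p : params) {
            // Step 1: without user input, the default formula supplies the untrusted value.
            Object untrusted = userInput.get(p.name);
            if (untrusted == null) {
                untrusted = p.defaultFormula.apply(result.trusted);
            }
            // Step 2: post-processing sees validated values plus the untrusted value.
            Object candidate;
            try {
                candidate = p.postProcessing.apply(result.trusted, untrusted);
            } catch (RuntimeException e) {
                result.messages.add("Warning: " + p.name + ": " + e.getMessage());
                result.trusted.put(p.name, null); // processing continues with <null>
                continue;
            }
            // Step 3: type check and comparison against the list of valid key-values.
            boolean typeOk = candidate == null || p.type.isInstance(candidate); // assumption
            boolean keyOk = p.validValues == null || p.validValues.contains(candidate);
            if (typeOk && keyOk) {
                result.trusted.put(p.name, candidate);
            } else {
                result.messages.add("Error: " + p.name + ": value rejected");
                result.trusted.put(p.name, p.defaultFormula.apply(result.trusted));
            }
        }
        return result;
    }

    /** Two made-up parameters; the deposit default depends on the validated country. */
    public static List<Param> demoParams() {
        Param country = new Param("country", String.class, Arrays.asList("DE", "PT"),
                validated -> "DE",
                (validated, value) -> value);
        Param deposit = new Param("deposit", Integer.class, null,
                validated -> "PT".equals(validated.get("country")) ? 100 : 0,
                (validated, value) -> {
                    if (((Integer) value).intValue() < 0) {
                        throw new IllegalArgumentException("a deposit cannot be negative");
                    }
                    return value;
                });
        return Arrays.asList(country, deposit);
    }

    public static void main(String[] args) {
        Map<String, Object> input = new HashMap<>();
        input.put("country", "PT");
        Result ok = process(demoParams(), input);
        System.out.println(ok.trusted + " executable=" + ok.canExecute());

        input.put("deposit", -5);
        Result bad = process(demoParams(), input);
        System.out.println(bad.trusted + " " + bad.messages);
    }
}
```

In the first run the deposit default is computed from the already validated country; in the second run the business rule in the post-processing step fails, so the deposit stays <null> and the report would not be executed.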

Beginning with this version, the parameter validation also creates the set of validated values after the validation is complete. For a report without any parameter values set, this will yield the default values for all parameters.

So what does this mean for you?


The new schema brings a couple of changes to the way the system behaves. Default values are now context sensitive and can change when the selection for the previously declared parameters changes. Our parameter UIs do not directly use that feature for usability reasons. Automatically changing the user's input is not very nice and confuses and/or upsets people. A lot.

The post processing formulas are now executed in a timely manner and before the default-value or selection for a parameter is computed. This way, you are now able to compute the mondrian-role array in a hidden parameter's post-processing formula and be sure that your datasource sees it.

And last but not least, your formulas won't be able to use values that have not been validated, nor would the report ever include them. Especially with the SINGLEVALUEQUERY and MULTIVALUEQUERY formula functions, this is mandatory. Your database is yours and we all want to keep it that way.