Reporting in IdentityIQ
IdentityIQ (IIQ) provides a built-in reporting module that is suitable for basic reporting needs. However, for enterprise-scale reporting, it can present several challenges. Below are the key limitations of relying on the native IIQ reporting module.
Resource Contention with Production Systems
The IIQ reporting engine operates directly on the same application server and database as the production environment. This architecture has inherent limitations, particularly in environments with heavy workloads. Running resource-intensive reports can impact the overall performance of the production instance. Critical governance operations such as access certifications, provisioning, and access reviews run on the same server, so when large or complex reports are generated, they compete for the same CPU, memory, and database I/O resources, potentially slowing those processes down.
From a performance management perspective, it’s not ideal to have reporting workloads on the same server that handles operational workflows. As reporting demands grow, this configuration can lead to bottlenecks, degraded user experience, and longer report generation times.
Data Structure Challenges: Lack of Normalization and XML Storage
IIQ’s data model is not designed in a fully normalized form (3rd normal form). This means that the database schema contains redundant and grouped data, which increases complexity when creating custom reports or running ad-hoc queries.
For example, a normalized structure would separate related data into distinct tables, with minimal redundancy, making queries straightforward. In contrast, IIQ’s denormalized structure requires more complex queries with multiple joins and often results in suboptimal performance when dealing with large data sets.
Additionally, some of the most critical data within IIQ, such as user attributes, role assignments, and entitlement details, are stored as XML blobs in the database. While this allows for flexible and dynamic data storage, it significantly complicates the reporting process. Extracting and parsing XML data requires custom Java logic or XPath queries, which are far more complex to write and maintain than typical SQL queries. This added complexity increases the time and expertise needed to develop and maintain custom reports.
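To give a sense of the extra work involved, here is a minimal Python sketch of parsing such an attributes blob. The XML layout shown is illustrative only; the actual structure of IIQ’s stored XML varies by version and configuration.

```python
import xml.etree.ElementTree as ET

# Illustrative only: an attributes blob roughly in the shape IIQ stores
# in the database; the real schema depends on version and configuration.
identity_xml = """
<Attributes>
  <Map>
    <entry key="department" value="Finance"/>
    <entry key="location" value="Warsaw"/>
  </Map>
</Attributes>
"""

def extract_attributes(xml_blob: str) -> dict:
    """Flatten an attributes XML blob into a dict of key/value pairs."""
    root = ET.fromstring(xml_blob)
    return {
        entry.get("key"): entry.get("value")
        for entry in root.iter("entry")
        if entry.get("key") is not None
    }

print(extract_attributes(identity_xml))
# {'department': 'Finance', 'location': 'Warsaw'}
```

Even this trivial example shows why a plain SQL query is not enough: every report that needs an identity attribute first has to unpack the blob.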
Further complicating matters, changes to the XML structure may occur during upgrades or configuration changes, making reports fragile and prone to breaking unless constantly maintained. This makes the built-in reporting approach less agile and adaptable for business needs that evolve over time.
Limited Capabilities of IIQ’s Reporting Module
The native IIQ reporting module offers a set of out-of-the-box reports that cover basic audit and operational needs. However, compared to dedicated Business Intelligence (BI) solutions, it lacks the flexibility and advanced features needed for comprehensive data analysis. For example:
- Limited Data Visualization: IIQ reporting lacks advanced visualization tools like interactive dashboards, charts, or multi-dimensional analysis. It focuses more on static tabular data, which is often insufficient for management-level reporting or detailed data insights.
- Restricted Customization: While the reporting module allows for basic customization, it falls short when compared to external BI solutions that offer robust data modeling, filtering, and drill-down capabilities. Organizations with complex reporting requirements often find the IIQ reporting framework limited when trying to create reports that span multiple data sources or require complex transformations.
- No Real-time Analytics: The IIQ reports are typically generated from a snapshot of the data. Real-time reporting, which is often required for operational decisions, is either slow or impractical due to the underlying architecture of the IIQ reporting engine.
These limitations drive many organizations to seek external BI platforms that can integrate with IIQ data, providing enhanced reporting, scalability, and better performance without impacting core Identity Governance operations.
Enterprise Reporting Requirements
To design a robust reporting solution for IIQ, the following steps are required:
- Export Data from IIQ: Ensure comprehensive data extraction covering identity, access, and operational information.
- Process and Transform Data: Extract the data stored in XML, normalize it, and apply the transformations needed for efficient reporting.
- Store Data: Save transformed data in a structured format (e.g., relational database or data warehouse) to support fast querying and scalability.
- Connect to a BI Tool: Integrate the data with a Business Intelligence solution (e.g., Power BI, Tableau) to create real-time, visual reports that meet business requirements.
Solution
The decision was made to leverage various Azure services for the reporting solution. The architecture utilized Azure Data Factory to automate the extraction of data from IIQ and manage the overall data workflow. For data processing, Azure Databricks was employed, where Python notebooks were used to extract data from XML, perform transformations, and prepare the data for reporting. The processed data was then stored in Azure Data Lake Storage, providing scalable and efficient storage. Finally, Power BI was chosen as the reporting tool, enabling the design of interactive reports and dashboards, which were made available to end users for real-time access and analysis.
Azure Data Factory configuration
The main coordinator of the whole process is the Data Factory configuration. Let’s look at how the main workflow is designed.
Data processing is done in two different ways: bulk and incremental. In bulk processing, each table from IIQ is exported in full every time, and the existing data in the reporting storage is overwritten. In incremental processing, a delta is calculated, and only data changed since the last refresh is exported and appended to the existing set in the reporting storage.
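Conceptually, the two modes differ only in how the target is written. The PySpark sketch below illustrates the idea; the paths, the modified column, and the watermark value are assumptions for illustration, not the actual pipeline configuration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Timestamp of the previous successful refresh; in the real pipeline such a
# watermark would be tracked by the orchestration, the value here is made up.
last_refresh = "2024-01-01 00:00:00"

# Bulk mode: take the full export and overwrite the reporting copy.
full_df = spark.read.parquet("/mnt/raw/spt_identity")
full_df.write.mode("overwrite").parquet("/mnt/reporting/identity")

# Incremental mode: keep only rows changed since the last refresh and append.
# The 'modified' column is an assumption about the exported table.
delta_df = full_df.filter(F.col("modified") > F.lit(last_refresh))
delta_df.write.mode("append").parquet("/mnt/reporting/identity")
```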
Below I will show only the bulk processing, just to give you an idea of how it works.
The topmost workflow orchestrates everything. It invokes three other workflows: one to process the bulk data, one to process the incremental data, and a final one to refresh the Power BI reports.
It has two parameters: items, a configurable list of tables to read from the IIQ database, and notebooks, a configurable list of Databricks notebook names to run in order to transform the data into the reporting format.
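For illustration, the two parameters could look something like the following (shown here as Python literals for readability; in Data Factory they are plain JSON arrays, and the table and notebook names are made up, not the actual configuration):

```python
# Hypothetical example of the two pipeline parameters.
items = [
    {"table": "spt_identity", "file": "identity.parquet"},
    {"table": "spt_link", "file": "link.parquet"},
    {"table": "spt_certification", "file": "certification.parquet"},
]

notebooks = [
    "transform_identities",
    "transform_accounts",
    "transform_certifications",
]
```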
Let’s look at bulk processing (it’s simpler and makes the overall concept easier to understand).
The first step in bulk data processing is a very simple workflow containing only a single “ForEach” step. For each table in the input parameter, the process reads the data from the configured IIQ table and saves it to a Databricks-readable Parquet file. The name of the file is also part of the input configuration.
Once all the data from the on-prem IIQ database has been copied to cloud storage, a workflow that runs multiple Databricks notebooks is triggered. Each of these notebooks is essentially Python code executed on Databricks: it reads the data copied from IIQ and performs the transformations, such as extracting data from XML, translating it into a relational structure, possibly augmenting it with data from other sources, and finally storing it in tables (files) accessible by Power BI.
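A simplified sketch of such a notebook is shown below; the storage paths, column names, and XML layout are illustrative assumptions rather than the actual implementation.

```python
from pyspark.sql import SparkSession, functions as F, types as T
import xml.etree.ElementTree as ET

spark = SparkSession.builder.getOrCreate()

# Read the raw export produced by the Data Factory copy step.
identities = spark.read.parquet("/mnt/raw/identity.parquet")

# UDF that flattens an attributes XML blob into a map of key/value pairs.
@F.udf(returnType=T.MapType(T.StringType(), T.StringType()))
def parse_attributes(xml_blob):
    if not xml_blob:
        return {}
    root = ET.fromstring(xml_blob)
    return {e.get("key"): e.get("value") for e in root.iter("entry") if e.get("key")}

# Promote selected XML attributes to regular columns for reporting.
report_df = (
    identities
    .withColumn("attrs", parse_attributes(F.col("attributes")))
    .select(
        "id",
        "name",
        F.col("attrs")["department"].alias("department"),
        F.col("attrs")["location"].alias("location"),
    )
)

# Store the result where Power BI can read it.
report_df.write.mode("overwrite").parquet("/mnt/reporting/identity")
```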
A huge benefit of Databricks is that it’s built on top of Apache Spark, so all of the above computations run quickly and efficiently. The details of how the DataFrames (Spark’s primary abstraction) are handled and processed by the notebooks are probably a topic for another blog post.
Once all the data is processed, the next step is to tell all the reports in Power BI to refresh with the new set of data. This is handled by another custom workflow, which takes advantage of Data Factory’s flexibility and uses its available building blocks to perform each action.
First, we need to get a service principal secret from the vault. Using that secret, we obtain an Azure AD token, which we need to access Power BI. With the token, we retrieve all the datasets from Power BI (this is where Power BI stores the data behind the reports), filter them to keep only the refreshable ones, and finally invoke a refresh for each of them.
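For reference, the same sequence can be expressed in a few lines of Python against the Power BI REST API. The credentials and IDs below are placeholders standing in for the secret retrieved from the vault, and the workspace handling is simplified.

```python
import requests

# Placeholder service principal credentials; in the pipeline these come
# from the vault, not from hard-coded values.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"
WORKSPACE_ID = "<workspace-id>"

# 1. Obtain an Azure AD token for the Power BI API.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "https://analysis.windows.net/powerbi/api/.default",
    },
)
headers = {"Authorization": f"Bearer {token_resp.json()['access_token']}"}

# 2. List the datasets in the workspace.
datasets = requests.get(
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets",
    headers=headers,
).json()["value"]

# 3. Keep only refreshable datasets and trigger a refresh for each one.
for ds in datasets:
    if ds.get("isRefreshable"):
        requests.post(
            f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
            f"/datasets/{ds['id']}/refreshes",
            headers=headers,
        )
```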
That finalizes the process.
On top of all that, every configuration and customization is stored in JSON or YAML format in a Git repository, giving us full control over the changes we introduce and an easy way to trace the source of potential bugs in the future.
Once we have the data available in Power BI, we have a plethora of reporting and analytical capabilities to serve the data to the enterprise.
Summary
This architecture gives us great flexibility. We can add new tables and columns if needed, combine data from other sources, and scale the processing resources up or down on demand (which also controls the cost).
And if we ever switch to ISC (SailPoint Identity Security Cloud), we only need to change the source of the data and the reporting stays the same (well, we will need to tweak the data structures).