
Friday, April 22, 2011

Pivot Query


So how to pivot data in a table?
I was asked about this by someone at work today. Let's say you have a table that looks like the following. I have included the create and insert statements so that you can recreate it and follow along.

Create table t1 (a number, b number);

insert into t1 values (1,1);
insert into t1 values (1,2);
insert into t1 values (1,3);
insert into t1 values (2,1);
insert into t1 values (2,2);
insert into t1 values (2,3);
insert into t1 values (3,1);
insert into t1 values (3,2);
insert into t1 values (3,3);

select * from t1 order by 1,2;

####
(A quick aside: you must specify the order in which you want the data returned. It is wrong to assume that Oracle will return rows in the order they were inserted.)
####

A      B
------ ------
1      1
1      2
1      3
2      1
2      2
2      3
3      1
3      2
3      3

Now, let's say you want to retrieve all the records from table t1 in such a way that all the values in column B belonging to a given value of column A are shown side by side. Something like

A  Val1  Val2  Val3
-- ----- ----- -----
1  1     2     3

One way would be to write procedural code that stores the different values of a group in an array or a series of variables and then displays them on screen or returns them to the calling program.
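For illustration, here is a rough PL/SQL sketch of that procedural approach. It is just one way it could be written, and the variable names are made up here.

-- run with SET SERVEROUTPUT ON in SQL*Plus to see the output
declare
  v_col1 t1.b%type;
  v_col2 t1.b%type;
  v_col3 t1.b%type;
begin
  -- one outer loop per group (distinct value of a)
  for grp in (select a from t1 group by a order by a) loop
    v_col1 := null; v_col2 := null; v_col3 := null;
    -- collect the values of b for this group into separate variables
    for val in (select b from t1 where a = grp.a) loop
      if    val.b = 1 then v_col1 := val.b;
      elsif val.b = 2 then v_col2 := val.b;
      elsif val.b = 3 then v_col3 := val.b;
      end if;
    end loop;
    dbms_output.put_line(grp.a || ' ' || v_col1 || ' ' || v_col2 || ' ' || v_col3);
  end loop;
end;
/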

Or

You could just write a simple SQL statement that does it for you. You could write a piece of code that looks like this…

select
a
,max(decode(b,1,b,null)) col1
,max(decode(b,2,b,null)) col2
,max(decode(b,3,b,null)) col3
from t1
group by a;

A      COL1   COL2   COL3
------ ------ ------ ------
1      1      2      3
2      1      2      3
3      1      2      3

3 rows selected

… and you won’t have to worry about supporting an object such as a function or a procedure.

What if you had null values in one of the columns?

update t1 set b=null where a=2 and b=2;

One row updated

select
a
,max(decode(b,1,b,null)) col1
,max(decode(b,2,b,null)) col2
,max(decode(b,3,b,null)) col3
from t1
group by a;

A      COL1   COL2   COL3
------ ------ ------ ------
1      1      2      3
2      1             3
3      1      2      3

3 rows selected

There are many ways to write pivot queries. This is probably the easiest way of doing it.
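If you are on Oracle 11g or later, you can also use the built-in PIVOT clause to get the same result. A minimal sketch against the same t1 table (col1/col2/col3 are just labels I picked):

select *
from (select a, b, b as val from t1)
pivot (max(val) for b in (1 as col1, 2 as col2, 3 as col3))
order by a;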

Using Informatica to perform a pivot operation

You can use the Normalizer transformation to generate rows out of columns, but to pivot values in rows into columns you have to use the Aggregator transformation with the FIRST or LAST function. FIRST or LAST will pick the first or last of the incoming rows if there are duplicates. Here is an example…

Suppose you have a source table with this data that is a record of monthly expenses for each of your departments.
create table deptexp (dept_id number(5), mon char(3), amt number(10,2));

insert into deptexp values (1,'JAN',100);
insert into deptexp values (1,'FEB',120);
insert into deptexp values (1,'MAR',135);
insert into deptexp values (2,'JAN',110);
insert into deptexp values (2,'FEB',130);
insert into deptexp values (2,'MAR',120);
commit;

You want to load this data after denormalizing it into this structure.
create table deptexp1 (DEPT_ID number(5), JAN_AMT number(10,2), FEB_AMT number(10,2), MAR_AMT number(10,2));

The data after our process would look like…

DEPT_ID  JAN_AMT  FEB_AMT  MAR_AMT
-------  -------  -------  -------
1        100      120      135
2        110      130      120

Do the following to accomplish this.
1. Create an Aggregator transformation with the following ports and expressions:
NAME     IN/OUT  EXPRESSION                  GROUP BY
DEPT_ID  IN                                  YES
MONTH    IN                                  NO
AMOUNT   IN                                  NO
JAN_AMT  OUT     FIRST(AMOUNT, MONTH='JAN')
FEB_AMT  OUT     FIRST(AMOUNT, MONTH='FEB')
MAR_AMT  OUT     FIRST(AMOUNT, MONTH='MAR')
APR_AMT  OUT     FIRST(AMOUNT, MONTH='APR')
2. Connect the DEPT_ID, MONTH and the AMOUNT ports from the Source Qualifier to the Aggregator.
3. Connect the JAN_AMT, FEB_AMT, MAR_AMT, etc. ports to the target.
And that would do it…
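For comparison, the same pivot can be written in plain SQL against the source table using the max/decode pattern from the first example. This is just a sketch; you would still use the Aggregator approach above if you want the pivot done inside the mapping.

select dept_id
,max(decode(mon,'JAN',amt)) jan_amt
,max(decode(mon,'FEB',amt)) feb_amt
,max(decode(mon,'MAR',amt)) mar_amt
from deptexp
group by dept_id;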

nvarchar2 & varchar2


Informatica imports all VARCHAR2 fields as NVARCHAR2. This is a big deal, especially when you are using an Update Strategy to perform update-else-insert operations on target columns that are of the VARCHAR2 datatype. You can expect a noticeable performance gain after converting the column definitions for those VARCHAR2 targets back to VARCHAR2 in Informatica.

Informatica Performance Tuning



I have gathered this list over time while working at different client sites. It is not necessarily my list, but it definitely is a good one. If you don't agree with anything below, don't follow it. Anyway, here it goes…

INFORMATICA BASIC TUNING GUIDELINES

The following points are high-level issues on where to go to perform "tuning" in Informatica's products. These are in no way permanent problem solvers, nor are they the end-all solution; they are just some items which, if tuned first, might make a difference. The level of skill available for certain items will cause the results to vary.

To test performance throughput it is generally recommended that the source data set produce about 200,000 rows to process. Beyond this, the performance problems / issues may lie in the database: partitioning tables, dropping / re-creating indexes, striping RAID arrays, etc. Without such a large result set to deal with, your average timings will be skewed by other users on the database, processes on the server, or network traffic. This seems to be an ideal test size for producing mostly accurate averages.

Try tuning your maps with these steps first, then move to tuning the session; iterate this sequence until you are happy, or cannot achieve better performance by continued effort. If the performance is still not acceptable, then the architecture must be tuned (which can mean changes to which maps are created). In this case, you can contact us; we tune the architecture and the whole system from top to bottom.

KEEP THIS IN MIND: In order to achieve optimal performance, it’s always a good idea to strike a balance between the tools, the database, and the hardware resources. Allow each to do what they do best. Varying the architecture can make a huge difference in speed and optimization possibilities.

Utilize a database for significant data handling operations (such as sorts, groups, and aggregates). In other words, staging tables can be a huge benefit to parallelism of operations, and parallel design, almost by definition, nearly always cuts your execution time.

Localize. Localize all target tables onto the SAME instance of Oracle (same SID), or the same instance of Sybase. Try not to use synonyms (remote database links) for anything (including lookups, stored procedures, target tables, sources, functions, privileges, etc.). Utilizing remote links will most certainly slow things down. For Sybase users, remote mounting of databases can definitely be a hindrance to performance.

If you can, localize all target tables, stored procedures, functions, views, and sequences in the SOURCE database. Again, try not to connect across synonyms. Synonyms (or remote database tables) could potentially affect performance by as much as a factor of 3 or more.

Remember that Informatica suggests that each session takes roughly 1 to 1.5 CPUs. In keeping with this, Informatica plays well with RDBMS engines on the same machine, but does NOT get along (performance-wise) with ANY other engine (reporting engine, Java engine, OLAP engine, Java virtual machine, etc.).

Remove any database-based sequence generators; these require a wrapper function / stored procedure call, and utilizing such stored procedures can cause performance to drop by a factor of 3. This slowness is not easily debugged; it can only be spotted in the Write Throughput column. Copy the map and replace the stored procedure call with an internal sequence generator for a test run; that is how fast you COULD run your map. If you must use a database-generated sequence number, use it as part of the source qualifier query. If you're dealing with gigabytes or terabytes of information, this should save you lots of hours of tuning.
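For example, a sketch of pulling the sequence value inside the source qualifier query itself (emp_seq and src_employees are hypothetical names):

select emp_seq.nextval as emp_key
,s.emp_name
,s.dept_id
from src_employees s;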

TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the overall performance of the map. Force an override in the session, setting it to NORMAL logging mode. Unfortunately the logging mechanism is not "parallel" in the internal core; it is embedded directly into the operations.

Turn off 'Collect Performance Statistics'. This also has an impact, although minimal at times; it writes a series of performance data to the performance log. Removing this operation reduces reliance on flat file operations. However, it may be necessary to have this turned on DURING your tuning exercise, since it can reveal a lot about the speed of the reader and writer threads.

Try to eliminate the use of non-cached lookups. By issuing a non-cached lookup, your performance will be impacted significantly, particularly if the lookup table is also a "growing" or "updated" target table; this generally means the indexes are changing during the operation, and the optimizer loses track of the index statistics. Again, utilize staging tables if possible. With staging tables, views can be built in the database which join the data together, or Informatica's Joiner object can be used to join the data; either one will help dramatically increase speed. The thing with Informatica's Joiner transformation is that it needs to cache the data to scan through it and find matches, so in such cases make sure you tweak the cache sizes. Unless your source tables are from different RDBMSs, let the RDBMS join them.
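A minimal sketch of the staging-view idea, assuming two hypothetical staging tables stg_orders and stg_customers:

create or replace view v_orders_enriched as
select o.order_id
,o.order_dt
,o.amt
,c.cust_name
,c.cust_region
from stg_orders o
,stg_customers c
where c.cust_id = o.cust_id;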

Separate complex maps. Try to break the maps out into logical, threaded sections of processing, and re-arrange the architecture if necessary to allow for parallel processing. There may be more, smaller components doing individual tasks; however, the throughput will be proportionate to the degree of parallelism that is applied.

Balance between Informatica and the power of SQL and the database. Try to utilize the DBMS for what it was built for: reading / writing / sorting / grouping / filtering data en masse. Use Informatica for the more complex logic, outer joins, data integration, multiple source feeds, etc. The balancing act is difficult without DBA knowledge. In order to achieve a balance, you must be able to recognize which operations are best in the database, and which ones are best in Informatica. This does not detract from the use of the ETL tool; rather it enhances it, and it's a MUST if you are performance tuning for high-volume throughput.

TUNE the DATABASE. Don't be afraid to estimate small, medium, large, and extra large source data set sizes (in terms of number of rows and average number of bytes per row), expected throughput for each, and turnaround time for the load. Is it a trickle feed? Give this information to your DBAs and ask them to tune the database for the "worst case". Help them assess which tables are expected to be high read / high write, and which operations will sort (order by), etc. Moving disks and assigning the right table to the right disk space could make all the difference. Utilize a Perl script to generate "fake" data for small, medium, large, and extra large data sets, and run each of these through your mappings; in this manner, the DBA can watch or monitor throughput as a real load size occurs.
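If you would rather stay inside Oracle than write a Perl script, here is a rough sketch for generating fake rows with a connect-by query (stage_sales and its columns are made-up names; adjust the row count per data set size):

insert into stage_sales (sale_id, sale_dt, amt)
select level
,sysdate - mod(level, 365)
,round(dbms_random.value(10, 1000), 2)
from dual
connect by level <= 200000;

commit;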

Be sure there is enough SWAP and TEMP space on your PMServer machine. Not having enough disk space could potentially slow down your entire server during processing (in an exponential fashion). Sometimes this means watching the disk space while your session runs; otherwise you may not get a good picture of the space available during operation, particularly if your maps contain aggregators or lookups that spill to the disk cache directory, or if you have a JOINER object with heterogeneous sources.

Place some good server load monitoring tools on your PMServer in development and watch it closely to understand how the resources are being utilized and where the hot spots are. Try to follow the recommendations; it may mean upgrading the hardware to achieve throughput.

TWEAK SESSION SETTINGS. In the session, there is only so much tuning you can do. Balancing the throughput is important; by turning on "Collect Performance Statistics" you can get a good feel for what needs to be set in the session, or what needs to be changed in the database. Read the performance section carefully in the Informatica manuals. Basically what you should try to achieve is: OPTIMAL READ, OPTIMAL THROUGHPUT, and OPTIMAL WRITE. Over-tuning one of these three pieces can ultimately slow down your session.

Try to merge expression objects, set your lookups to unconnected (for re-use if possible), and check your index and data cache settings if you have aggregation or lookups being performed. If you have a slow writer, change the map to a single target table at a time to see which target is causing the "slowness", and tune it. Make copies of the original map and break down the copies. Once the "slower" of the N targets is discovered, talk to your DBA about partitioning the table, updating statistics, removing indexes during load, etc. There are many database things you can do here.
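For instance, the kind of DBA-side statements being talked about might look like this (the schema, table, and index names are hypothetical):

-- refresh optimizer statistics on the slow target
exec dbms_stats.gather_table_stats(ownname => 'DW', tabname => 'FACT_SALES');

-- make an index unusable before the bulk load, then rebuild it afterwards
alter index dw.fact_sales_ix1 unusable;
-- ... run the load ...
alter index dw.fact_sales_ix1 rebuild;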

Remove all other "applications" from the PMServer, except for the database / staging database or data warehouse itself. PMServer plays well with an RDBMS (relational database management system), but doesn't play well with application servers, particularly Java virtual machines, web servers, security servers, application servers, and report servers. All of these items should be broken out to other machines. This is critical to improving performance on the PMServer machine.

Hope this helps.

Kirtan Desai

Updates

Dan Linstedt from Myers-Holum Inc. made a few comments on this post a few days ago. I have added his comments below.

- If you use a Joiner, always use a sorted Joiner; the cost of sorting data coming from the RDBMS staging area is frequently less than the cost of building the caches in place. Furthermore, if you are NOT on 64-bit Informatica, or you don't have (seemingly) unlimited RAM, you have an upper limit to the caching mechanisms in all the cached objects, including the Joiner.
- If the RDBMS is NOT tuned properly, putting more work into the RDBMS will actually slow things down.
- Too many instances of an RDBMS on the SAME MACHINE will actually kill performance rather than help it, and it really is not necessary in order to handle fail-over, as many set it up to do.
- Replacing a Lookup with a SORTED JOINER can improve performance dramatically.
- The manuals have the formula wrong for the Data / Index Cache settings. Even though it's counter-intuitive, you want 100% of the INDEX cached if you can get it, giving up Data Cache to disk. Why? Because if you can't access RAM to check for the existence of data you are actually increasing I/O, and when you increase I/O you slow performance dramatically. Set your index caches to twice the size of your data caches to be safe.

There are many variables which can change the course of one's actions. Feel free to ask questions on this blog.

Constraint based Loading



It is an option available in the session.
If enabled, for each row generated by the source, the Integration Service loads the record into the primary key table first and then loads any foreign key tables. It applies only to inserts; a parent/child example is sketched after the list below.

- All the targets should have the same active source
- You should define the relationships for the targets in the Warehouse Designer
- Targets should be in the same target connection group
- Treat source rows as "Insert"; you cannot use updates with constraint based loading
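For illustration, here is a hypothetical parent/child pair where constraint based load ordering would insert the dept row before the emp rows that reference it (table and column names are made up):

create table dept (
  dept_id   number primary key,
  dept_name varchar2(30));

create table emp (
  emp_id   number primary key,
  emp_name varchar2(30),
  dept_id  number references dept(dept_id));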


FullCBLOSupport = Yes/No
- It is a custom property. If you set it on the Integration Service, it applies to all workflows. You can also set it in the session's Config Object tab under the Custom Properties attribute.
- It allows changed data (updates and deletes) to be loaded to the target in the same transaction control unit (TCU).
- Your mapping should not contain any active transformation that changes the row ID generated by the reader.

Thursday, April 21, 2011

Session Configuration


Advanced -
Constraint based load ordering [Unchecked]
- To load data into the targets based on primary key / foreign key relationships
Cache Lookup() Function - [Checked]
Default buffer block size [Auto]
-
Line Sequential Buffer Length [1024]
- Informatica reads a row from the file and holds it in the buffer. If the row is more than 1024 bytes, then increase the value.
Max Memory Allowed for Auto Memory attributes [512Mb]
Max percentage of Total memory allowed for Auto Mem Attributes [5]
Additional Concurrent pipelines for Lookup Cache Creation [Auto]

Log Options -
Save Sessions Log by [Session runs]
- Session runs - saves the number of session logs specified
- Session timestamp - appends the timestamp to the session log file name.
Save Sessions log for these runs
- will be enabled only for Session Log by Session runs.
- specify the number of log files to store. 0 means it will store only the current log file
- you can use $PMSessionLogCount

Error Handling -
Stop on errors
- Specify the number of non-fatal errors to allow (reader, writer, DTM errors)
- 0 means it won't stop. You can use the $PMSessionErrorThreshold service variable

Override tracing
On Stored Procedures error [Stop] - Stop / Continue
On Pre-session command task error [Stop] - Stop / Continue
On Pre-Post SQL error [Stop] - Stop / Continue


Error Log Type
Error Log DB Connection
Error Log Table Name prefix
Error Log File Directory
Error Log File Name
Log Row Data
Log Source Row Data
Data column delimiter

Partitioning Options -
Dynamic Partitions [Disabled]
- Other options are: Based on number of partitions, Based on source partitioning, Based on number of nodes in the grid
Number of Partitions [1]
- Will be enabled if you choose "Based on number of partitions"
- You can also specify $DynamicPartitionCount session parameter


Sessions on Grid -
Is Enabled [Unchecked]
- To run the sessions in a grid

Session Properties


General Options

Write backward compatible Session log file
Session log file name
Session log file directory
Parameter file name

Enable test load
Number of rows to test
- will be enabled only if you Enable Test Load

$Source connection value
$Target connection value

Treat source rows as
- Insert / Update / Delete / Data driven
- If the mapping has an Update Strategy transformation, the default is Data driven.

Commit Type [Target]
- Source / Target / User defined
Commit Interval
- commit after x number of records
Commit on End of file
Rollback transaction on errors

Recovery Strategy
- Restart task / Fail task and continue workflow / Resume from last checkpoint

Java classpath

Performance -

DTM Buffer size [Auto]
- You can specify Auto or a value. By default the Integration Service allocates 12 MB

Collect performance data
- to collect performance data. You can use the Workflow Monitor to see it

Write performance data to repository
- writes the performance details to the repository. You can use the Workflow Monitor to see them

Incremental Aggregation
Reinitialize Aggregate data
Enable High precision
- to process the Decimal datatype to a precision of 28 digits

Session retry on deadlock
- applicable only for Normal loads
- In the Integration Service you can specify the number of deadlock retries and the deadlock sleep time period.

Session sort order
- If the Integration Service runs in ASCII mode, the sort order is Binary
- If it runs in Unicode mode, it displays sort orders based on the Integration Service code page

Pushdown Optimization
- None / To Source /To Source with view / To Target / Full / Full with view / $PushdownConfig
-

Informatica PowerCenter 9 – New features


Lookup Transformation: Cache updates. We can update the lookup cache based on the results of an expression. When the expression is true, we can add to or update the dynamic lookup cache.

Multiple rows return: We can configure the Lookup transformation to return all rows that match a lookup condition. A Lookup transformation is an active transformation when it can return more than one row for any given input row.

SQL overrides for uncached lookups: In previous versions we could create a SQL override for cached lookups only. Now we can create a SQL override for an uncached lookup as well, and include lookup ports in the SQL query.
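As a rough illustration, the kind of query you might supply as the SQL override of an uncached lookup could look like this (the table and columns are hypothetical):

select cust_id, cust_name, cust_status
from customers
where cust_status = 'ACTIVE'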

Database deadlock resilience: In previous releases, when the Integration Service encountered a database deadlock during a lookup, the session failed. Effective in 9.0, the session will not fail. When a deadlock occurs, the Integration Service attempts to run the last statement in a lookup. We can configure the number of retry attempts and time period between attempts.

SQL transformation: Auto-commit for connections. We can enable auto-commit for each database connection. Each SQL statement in a query defines a transaction. A commit occurs when the SQL statement completes or the next statement is executed, whichever comes first.

Session Log files rollover: We can limit the size of session logs for real-time sessions. We can limit the size by time or by file size. We can also limit the number of log files for a session.

Passive transformation: We can configure the SQL transformation to run in passive mode instead of active mode. When the SQL transformation runs in passive mode, the SQL transformation returns one output row for each input row.

XML transformation: XML Parser buffer validation. The XML Parser transformation can validate an XML document against a schema. The XML Parser transformation routes invalid XML to an error port. When the XML is not valid, the XML Parser transformation routes the XML and the error messages to a separate output group that We can connect to a target.

Model Repository Service: Application service that manages the Model repository. The Model repository is a relational database that stores the metadata for projects created in Informatica Analyst and Informatica Designer. The Model repository also stores run-time and configuration information for applications deployed to a Data Integration Service. Create and enable a Model Repository Service on the Domain tab of Informatica Administrator.

Connection management: Database connections are centralized in the domain. We can create and view database connections in Informatica Administrator, Informatica Developer, or Informatica Analyst. Create, view, edit, and grant permissions on database connections in Informatica Administrator.

Deployment: We can deploy, enable, and configure deployment units in the Informatica Administrator. Deploy Deployment units to one or more Data Integration Services. Create deployment units in Informatica Developer.

Monitoring: We can monitor profile jobs, scorecard jobs, preview jobs, mapping jobs, and SQL Data Services for each Data Integration Service. View the status of each monitored object on the Monitoring tab of Informatica Administrator.

New Features of Informatica-9

1. Informatica 9 supports data integration for the cloud as well as on premise. You can integrate the data in cloud applications, as well as run Informatica 9 on cloud infrastructure.

2. Informatica Analyst is a new tool available in Informatica 9.

3. There are architectural differences in Informatica 9 compared to previous versions.

4. A browser-based tool for business analysts is a new feature.

5. Data steward is a new feature.

6. Allows unified administration with a new admin console that enables you to manage PowerCenter and PowerExchange from the same console.

7. Powerful new capabilities for data quality.

8. Single admin console for Data Quality, PowerCenter, PowerExchange, and Data Services.

9. In Informatica 9, Informatica Data Quality (IDQ) has been further integrated with the Informatica platform, and performance, manageability, and reusability have all been significantly enhanced.

10. The mapping rules are shared between the browser-based tool for analysts and the Eclipse-based developer tool, leveraging unified metadata underneath.

11. The data services capabilities in Informatica 9, both over SQL and web services, can be used for real-time dashboarding.

12. Informatica Data Quality provides worldwide address validation support with integrated geocoding.

13. The ability to define rules and to view and run profiles is available in both Informatica Developer (the thick client) and Informatica Analyst (the browser-based thin client). These tools sit on a unified metadata infrastructure. Both tools incorporate security features like authentication and authorization, ensuring…

14. The developer tool is now Eclipse-based and supports both data integration and data quality for enhanced productivity. There is also a browser-based tool for analysts to support the types of tasks they engage in, such as profiling data, specifying and validating rules, and monitoring data quality.

15. There will be a Velocity methodology; it is soon going to be introduced for Informatica 9.

16. Informatica has the capability to pull data from IMS, DB2 on z-Series and i-Series, and from several other legacy (mainframe) environments like VSAM, Datacom, IDMS, etc.

17. There are separate tools available for different roles. The Mapping Architect for Visio tool is designed for architects and developers to create templates for common data integration patterns, saving developers a tremendous amount of time.

18. Informatica 9 does not include ESB infrastructure.

19. Informatica supports open interfaces such as web services and can integrate with other tools that support these as well, including BPM tools.

20. Informatica 9 complements existing BI architectures by providing immediate access to data through data virtualization, which can supplement the data in existing data warehouses and operational data stores.

21. Informatica 9 supports profiling of mainframe data, leveraging the Informatica platform's connectivity to mainframe sources.

22. Informatica 9 continues to support running the same workflow simultaneously.

23. The Eclipse-based environment is built for developers.

24. The browser-based tool is a fully functional interface for business analysts.

25. Dashboards are designed for business executives.

26. There are three interfaces through which these capabilities can be accessed: the Analyst tool is a browser-based tool for analysts and stewards, developers can use the Eclipse-based Developer tool, and line-of-business managers can view data quality scorecards.