
Monday, March 14, 2011

How Materialized Views Work


Starting with Oracle 8.1.5, introduced in March 1999, you can have a materialized view, also known as a summary. Like a regular view, a materialized view can be used to build a black-box abstraction for the programmer. In other words, the view might be created with a complicated JOIN, or an expensive GROUP BY with sums and averages. With a regular view, this expensive operation would be done every time you issued a query. With a materialized view, the expensive operation is done when the view is created and thus an individual query need not involve substantial computation.

Materialized views consume space because Oracle is keeping a copy of the data, or at least a copy of information derivable from the data. More importantly, a materialized view does not contain up-to-the-minute information. When you query a regular view, your results include changes made up to the last committed transaction before your SELECT. When you query a materialized view, you're getting results as of the time that the view was created or last refreshed. Note that Oracle lets you specify a refresh interval at which the materialized view will automatically be refreshed.

At this point, you'd expect an experienced Oracle user to say "Hey, these aren't new. This is the old CREATE SNAPSHOT facility that we used to keep semi-up-to-date copies of tables on machines across the network!" What is new with materialized views is that you can create them with the ENABLE QUERY REWRITE option. This authorizes the SQL parser to look at a query involving aggregates or JOINs and go to the materialized view instead. Consider the following query, from the ArsDigita Community System's /admin/users/registration-history.tcl page:

 
select 
  to_char(registration_date,'YYYYMM') as sort_key,
  rtrim(to_char(registration_date,'Month')) as pretty_month, 
  to_char(registration_date,'YYYY') as pretty_year, 
  count(*) as n_new
from users
group by 
  to_char(registration_date,'YYYYMM'), 
  to_char(registration_date,'Month'), 
  to_char(registration_date,'YYYY')
order by 1;
 
 
SORT_K PRETTY_MO PRET      N_NEW
------ --------- ---- ----------
199805 May       1998        898
199806 June      1998        806
199807 July      1998        972
199808 August    1998        849
199809 September 1998       1023
199810 October   1998       1089
199811 November  1998       1005
199812 December  1998       1059
199901 January   1999       1488
199902 February  1999       2148

For each month, we have a count of how many users registered at photo.net. To execute the query, Oracle must sequentially scan the users table. If the users table grew large and you wanted the query to be instant, you'd sacrifice some timeliness in the stats with

 
create materialized view users_by_month
   enable query rewrite
   refresh complete
   start with to_date('1999-03-28','YYYY-MM-DD')
   next sysdate + 1 
   as 
   select 
     to_char(registration_date,'YYYYMM') as sort_key,
     rtrim(to_char(registration_date,'Month')) as pretty_month, 
     to_char(registration_date,'YYYY') as pretty_year, 
     count(*) as n_new
   from users
   group by 
     to_char(registration_date,'YYYYMM'), 
     to_char(registration_date,'Month'), 
     to_char(registration_date,'YYYY')
   order by 1;

Oracle will build this view just after midnight on March 28, 1999. The view will be refreshed every 24 hours after that. Because of the enable query rewrite clause, Oracle will feel free to grab data from the view even when a user's query does not mention the view. For example, given the query

 
select count(*) 
from users 
where rtrim(to_char(registration_date,'Month')) = 'January'
and to_char(registration_date,'YYYY') = '1999'

Oracle would ignore the users table altogether and pull information from users_by_month. This would give the same result with much less work. Suppose that the current month is March 1999, though. The query

 
select count(*) 
from users 
where rtrim(to_char(registration_date,'Month')) = 'March'
and to_char(registration_date,'YYYY') = '1999'

will also hit the materialized view rather than the users table and hence will miss anyone who has registered since midnight (i.e., the query rewriting will cause a different result to be returned).
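
If you want to confirm whether the rewrite is happening, here is a minimal sketch (assuming the users_by_month view above and a standard plan_table; dbms_mview.refresh is the documented way to force an off-schedule refresh):

-- Query rewrite must be enabled for the session (or instance-wide).
alter session set query_rewrite_enabled = true;

-- Explain the query; if the rewrite fires, the plan references
-- USERS_BY_MONTH instead of USERS.
explain plan for
select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'March'
and to_char(registration_date,'YYYY') = '1999';

select operation, options, object_name from plan_table;

-- To close the staleness window on demand, rather than waiting for the
-- scheduled refresh, refresh the view manually (from SQL*Plus):
execute dbms_mview.refresh('users_by_month');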

Monday, March 7, 2011

Error Handling Approach

Error handling is one of the most important components of any data warehouse or data integration project. If the process cannot handle and manage errors efficiently, it is very rare that the project will succeed.



The following questions should be answered:

1. Who is the business owner of the process/project?

2. Who will be responsible for error correction?

3. What will be the interface for presenting errors to the business community?

4. Will errors be corrected at the source, or will IT be responsible for re-running error correction?

5. What is the error retention period in the error table?



These are a few points to consider before deciding on the error management process. Any ETL tool provides features to implement error management, and sometimes we go outside the ETL tool for the presentation and correction of errors.



Informatica version 7.x onward provides a feature that captures error information at the transformation level, together with the source record, in an error table or error file. These tables or files can be re-processed to present the error records to the business community. But in my opinion it is always better to have a common error management process for all data integration projects (independent of tool); any ETL tool can support that error management process.
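
As a sketch of what such a common, tool-independent process might store, here is an illustrative error table; every name and column below is an assumption rather than a prescribed standard:

create table etl_error_log (
  error_id        number         not null,
  process_name    varchar2(100)  not null,    -- mapping/workflow that raised the error
  source_system   varchar2(50),               -- where the record came from
  error_ts        date           default sysdate,
  error_type      varchar2(30),               -- e.g. VALIDATION, LOOKUP, DATABASE
  error_message   varchar2(4000),             -- what went wrong
  source_record   varchar2(4000),             -- raw source record, kept for re-processing
  corrected_flag  char(1)        default 'N', -- flipped to 'Y' after correction/re-run
  constraint etl_error_log_pk primary key (error_id)
);

A scheduled purge job that deletes rows older than the agreed retention period would then answer question 5 above.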

Informatica - Architecture

Informatica provides the following components:


  1. PowerCenter Repository
  2. PowerCenter Repository Server
  3. PowerCenter Client
  4. PowerCenter Server

PowerCenter Repository: It maintains the metadata. A set of tables is created within the repository database to hold the Informatica metadata. The PowerCenter Client and Server access the repository to read and write metadata.

PowerCenter Repository Server: It manages connections to the repository from client applications.

PowerCenter Client: This is used to maintain PowerCenter objects such as sources, targets, mappings, etc. It has the following client tools:

  • Designer : To create mappings that contain transformation instructions for the PowerCenter Server.
  • Workflow manager: To create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data.
  • Repository Manager: To administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.
  • Repository Server Administration Console: To administer Repository Servers and repositories.
  • Workflow Monitor: To monitor scheduled and running workflows for each PowerCenter Server.

PowerCenter Server: With the help of the repository and the Repository Server, the PowerCenter Server executes the business logic for extraction, transformation, and loading.

What is a Repository?

The PowerCenter repository is maintained in an RDBMS. It contains the instructions required to extract, transform, and load data. The PowerCenter Server accesses the repository through the Repository Server. Metadata is added to the repository using the PowerCenter Client tools. There are two types of repository:

  1. Global Repository: Objects created in a global repository can be accessed across different repositories.
  2. Local Repository: Objects created in a local repository can only be accessed within that repository. A global repository is registered to a local repository so that global repository objects can be accessed in the local repository using shortcuts.

Saturday, March 5, 2011

Typical Interview Questions

Questions for Data Warehouse Architect

  • What is dimensional modeling, and how is it different from an ERD?
  • What are the benefits and pitfalls of snowflaking?
  • What methodology did you adopt, Kimball or Inmon and why?
  • What are the advantages of the Corporate Information Factory (CIF) architecture vs. the bus architecture with conformed dimensions? What suits best for your environment and why?
  • What is subject area?
  • What is difference between normalization and de-normalization? Can we have a normalized data warehouse schema?
  • Why do we need an ODS?
  • How will you plan to load your data into your data warehouse from an ODS vs. an OLTP environment?
  • Assume that you are designing a data warehouse for an insurance company. What type of SCD would you design for an insurance agent table, where agents keep moving across many regions/states?
  • How would you go about designing a star schema for your business?
  • In your organization, business requirements/practices have changed dramatically, warranting major schema refactoring of your existing data warehouse. How will you handle this situation? For example, you need to add a couple of dimension keys to existing fact tables.
  • Your company has a data warehouse in place and then goes through a major acquisition/merger. How will you go about handling the new data flowing into your setup?
  • What are factless fact tables? Give a scenario where you would use one.
  • What optimization techniques have you applied at the data model level, and why did you choose them?
  • How will you handle data cleansing, validation?
  • What is Master Data Management?
  • How will you handle data rejects in your Data warehouse?
  • Describe common techniques for loading from the staging area/OLTP to the
    warehouse when you only have a small window.
  • How do you load Type 1 and Type 2 dimensions?
  • How would you model unbalanced hierarchies?
  • How would you model cyclic relations?
  • What major elements would you include in an audit model?
  • How would you implement traceability?
  • What steps would you take to improve reporting performance?

Typical Questions (Most of these questions are meant for experienced professionals. Fresh candidates, please do not get overwhelmed by these questions.)

General Questions:

  • Which of these three SQLs produce the same result?
    1. select * from (select '1','2','3','1' from dual ) tab1
    union
    select * from (select '1','2','3','1' from dual ) tab2
    2. select distinct * from (select '1','2','3','1' from dual ) tab1
    union all
    select distinct * from (select '1','2','3','1' from dual ) tab2
    3. select distinct * from (
    select * from (select '1','2','3','1' from dual ) tab1
    union all
    select * from (select '1','2','3','1' from dual ) tab2)
    and why?
  • TableB has 100 rows, and you run create tableA as select * from tableB with each of the following in the WHERE clause:
    1. 1=1
    2. 1=2
    How many rows will be in tableA in each case?

  • You have two tables A and B and you run ‘insert into tableA select * from tableB’ and
    then run ‘create table tableC as select * from tableA’. Now you run ‘delete from
    tableB’ and then execute a rollback. How many rows will be there in each of these
    tables: tableA, tableB, and tableC?
  • You have a table named ‘customer’ with 100 million rows, and you need to add a
    column with a not-null constraint and a predefined default value. What
    is the best approach, and what will happen if you run this SQL in production: ALTER TABLE
    customer ADD (cust_credit_rating number default 1 not null);
  • Your SQL script must exit with an error should anyone other than the specified
    user run it, and it should be a one-line SQL. Hint – WHENEVER SQLERROR; how do you generate this error?
  • Why do we use dimensional modeling instead of ER modeling for data warehousing
    applications?
  • Why can’t we use a copy of our transactional system to meet our data warehousing
    needs? And if we are unable to use a copy of the transaction system as the DW, then why don’t we
    use it as a staging environment?
  • What are the three most important themes in a data warehouse? Give examples?
    Partial Answer - Drilling Down, Drilling Across and Handling Time
  • Why can’t we have just one single fact table instead of more than one fact table?
  • How would you go about deleting 10 million records from a 100-million-record table in
    production after an erroneous load? Hint – try an alternate method other than delete,
    like create table xyz as select * from …, or, if it is a partitioned table, try dropping the
    partition.
  • Primary key update-
    Your company XYZ has acquired a new company. You are asked to load their customer
    information into the data warehouse. Your company’s cust_id is numeric, whereas
    theirs is a string. How can you handle this situation? How can you handle the
    customer records that already exist in your data warehouse? Hint – Use of surrogate
    key
  • What is Changed Data Capture, and how will you handle it in the case of an OLTP
    environment like SAP or Oracle Apps?
  • What is ODS? Why do you need them? Where would you place them?
  • What do you mean by Drilling Up and Drilling Around? Hint – subtracting and sharing of
    related facts
  • When you extract data from source systems during the night, your extract process keeps
    failing with a “snapshot too old” error. How will you fix this issue? Hint – your DBA has
    provided you with the alter session privilege
  • What is the difference between ‘drop index’ and ‘alter index unusable’ plus rebuild?
    Why do we prefer the second method?
  • How do you handle duplicate records from a relational database?
  • You are required to create one million records (sequence numbers) in a table using
    one SQL command. Hint – take advantage of a Cartesian product and union (see the sketch after this list).
  • How do you perform debugging in Stored Procedure?
  • How can you transform a sub-query involving the IN clause to a join (see the sketch after this list)? E.g. select
    emp_no, name from emp where dept_id in (select dept_id from dept where dept_name=’HR’)
  • How can you transform a statement involving an OR condition to a UNION ALL (see the sketch after this list)? E.g.
    select emp_no, name, title from emp where title=’MANAGER’ or title=’DIRECTOR’
  • How do you get a count of the different data values in a column?
  • How can you get a count/sum of ranges of data values in a column?
  • Give me SQL for each of the following (see the sketch after this list):
    1) TOP N rows from a table
    2) EVERY Nth row from a table
    3) Rows X to Y from a table
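
For the million-record question, a minimal sketch of the Cartesian-product approach; it assumes the Cartesian product of all_objects with itself yields at least a million rows, which is true on any typical instance:

-- The Cartesian product supplies far more than a million rows;
-- rownum caps the result and provides the sequence numbers.
create table seq_numbers as
select rownum as seq_no
from all_objects a, all_objects b
where rownum <= 1000000;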
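
For the two rewrite questions, sketches against the emp/dept tables from the examples above; dept_id is assumed unique in dept, and the UNION ALL branches must be mutually exclusive (a title cannot be both) to avoid duplicates:

-- The IN sub-query expressed as a join.
select e.emp_no, e.name
from emp e, dept d
where e.dept_id = d.dept_id
and d.dept_name = 'HR';

-- The OR condition expressed as a UNION ALL.
select emp_no, name, title from emp where title = 'MANAGER'
union all
select emp_no, name, title from emp where title = 'DIRECTOR';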
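
And for the last item, the classic Oracle answers built on rownum (emp, salary, N = 10, N = 5, and rows 11 to 20 are illustrative choices):

-- 1) TOP N rows: order inside an in-line view first, then cut with rownum.
select *
from (select * from emp order by salary desc)
where rownum <= 10;

-- 2) EVERY Nth row: materialize rownum inside the in-line view, then filter on it.
select *
from (select e.*, rownum rn from emp e)
where mod(rn, 5) = 0;

-- 3) Rows X to Y (here rows 11 to 20).
select *
from (select e.*, rownum rn from emp e)
where rn between 11 and 20;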

Informatica PowerCenter related:

  • What are the key differences between PowerCenter 7 and 8? Give one significant change version 8 offers.
  • What changes do we need to make to run the version 7 ‘pmcmd’ command on version 8?
  • What is purpose of ‘infacmd’ command and domains.infa file?
  • Can we create a mapping without using a source qualifier? Hint - Normalizer
  • Can we use a flat file as a lookup?
  • How can you transpose rows to columns and vice versa and what transformations would you choose?
  • In the update strategy, instead of using DD_INSERT I put the value as 1. Will this work?
  • If you check all the ports as group by in the Aggregator transformation, how many rows would be output?
  • How can you limit number of running sessions in a workflow?
  • What is the difference between Union and Joiner Transformation?
  • Your source data is a flat file and it has empid, deptid and 12 columns for salaries of each month (jan-dec). Now your requirement is to create one record for each month, along with a running total of the salary. What types of transformations will you use?
  • What is the use of indirect file list in PowerCenter?
  • In your environment, there are several mappings in version 7 and these mappings use external procedures (AEP). You need to move these mappings to version 8. What steps do you take?
  • Your customer has a requirement that a workflow should not run more than once in a day. How can you achieve this by means of workflow control?
  • You have a requirement to be alerted of any long running sessions in your workflow. How can you create a workflow that will send you an email for sessions running more than 30 minutes? You can use any method: shell script, procedure, or an Informatica mapping or workflow control.
  • You want to attach a file as an email attachment from a particular directory using ‘email task’ in Informatica, How will you do it? Hint –%a
  • You have two sets of sessions (mappings) in a workflow. You wish to run one set and you want to selectively ‘turn-off’ the second set of mappings by means of ‘just’ one parameter so that these sessions/mappings would just run and exit with no data. How can you manage this using a parameter file? Hint 1=2
  • One of the source files you load contains a formula column: emp_id, emp_name,
    dept_id, salary, commission_formula (formula column), salary_date.
    123, King, d01, 4500, 2*3, 01/20/2005
    453, Scott, d02, 8500, 4.5/2*1.4, 01/31/2006
    You need to evaluate the commission formula against the salary (i.e., commission_formula * salary) to calculate the commission. How can you accomplish this?
  • Your mapping is loading the target data into a flat file, but there are many new-line characters that cause your application to fail. How would you handle these characters before they are loaded into the flat file? Hint – ASCII equivalents of CR/LF: CHR(13), CHR(10)
  • How do you delete duplicate records from a source database/flat files? Can we use post-SQL to delete these records? In the case of a flat file, how can you delete duplicates before loading starts? (A sketch follows at the end of this list.)
  • Join two or more tables and then pull out two columns from each table into the source qualifier. Now, pull out one column from the source qualifier into an Expression transformation and then do a "generate SQL" in the source qualifier, how many columns will show up in the generated SQL?
  • You are asked to document source-to-target dependency of all Informatica mappings, and you know there are repository views (REP_TBL_MAPPING) provided by Informatica that can do this job. But these views return no rows. How can you rectify this issue? Hint – MX data; what change would you make to load OPB_TARG_TBL_EXPR, which provides a crucial link to REP_TBL_MAPPING?
  • You are required to perform “bulk loading” using Informatica on Oracle. What actions would you perform at the Informatica and Oracle levels for a successful load? Hint – alter session privilege, environment SQL (where is it?)
  • Your session failed, and when you try to open a log file, it complains that the session details are not available. How would you trace the error? Which log file would you look for? Hint – what other types of logs are available on that server?
  • Your business user says the revenue report does not tally with the SAP (source system) report, though the ETL process did not fail today. How will you identify the exact issue? Hint – reject records, primary key
  • There are around 100 sessions in a workflow. You have specified variables in the sessions, worklets and workflows. These variables are all over the place. You are supposed to change variable names from variable(x) to myvariable(x+1). What approach would you take? Hint – export an XML file and use regular expressions?
  • When you export a workflow from Repository Manager, what does this XML contain? The workflow only? Hint – the question itself contains the hint
  • What is difference between Mapping parameter and variable?
  • How do you set a mapping variable during session run? How do you reset them?
  • What are the pitfalls of using the Sequence Generator transformation, or, in general, why do we avoid it?
  • What precautions do you need to take when you use a reusable Sequence Generator transformation for concurrent sessions?
  • What are alternative methods to the Sequence Generator transformation? How would you go about using them in concurrent sessions?
  • Is a negative increment possible in the Sequence Generator? If yes, how would you accomplish it? Hint – Expression
  • How can you handle sequence generation over 2 billion records?
  • In which directory does Informatica look for the parameter file, and what happens if it is missing when you start the session? Does the session stop after it starts? Hint – does your source qualifier have a dependency on a mapping parameter?
  • What happens when a particular parameter is missing?
  • Informatica complains that the server could not be reached. What steps would you take? Hint – hosts file
  • You observe some sluggishness lately in retrieving repository objects. What action would you perform? Hint – repository objects
  • The Informatica server suddenly stops after starting. How can you rectify this issue? Hint – server variables, path?
  • What is a parameter file, and how would you dynamically create one using a mapping? What transformation would you choose? Hint – Normalizer
  • What is the difference between a connected and an unconnected lookup?
  • If you have more than two sources, how will you use the Joiner transformation to join these sources?
  • If you have more than one pipeline in your mapping, how will you change the order of load?
  • What approach would you take so that your source qualifier SQL override is database independent? Hint – Next question
  • What is ANSI SQL?
  • What is a mapplet?
  • What are an active and a passive transformation?
  • Can we use an active transformation inside a mapplet?
  • What is the difference between Source Qualifier and Joiner?
  • How do you override SQL in Lookup?
  • How do you change order by clause in lookups?
  • How many types of lookups are there?
  • What is a dynamic lookup, and what is the significance of NewLookupRow? How will you use it for rejecting duplicate records?
  • What are the benefits of connected vs. unconnected lookups, or vice versa?
  • Can an unconnected lookup have only an output port? Will the mapping work if you don’t check the return port, and why? Hint – try un-checking the Return port
  • You have more than five mappings that use the same lookup. How can you manage the lookup? Hint – persistence
  • What types of caches are available in lookups?
  • How will you improve the performance of a lookup?
  • What is the sorted ports option in the Source Qualifier?
  • How will you increase the performance of an aggregator transformation other than using sorted input?
  • What is an Incremental Aggregation?
  • What is an Incremental Loading?
  • What will happen if you copy a mapping from one repository to another repository and there is no identical source?
  • What is the difference between the Router and Filter transformations, and where do you need to place the Filter transformation to get better performance?
  • What is joiner?
  • How do you perform recovery? What happens internally?
  • Which process writes the information into repository tables?
  • What is difference between an abort and stop in a session?
  • What is DTM?
  • What is update strategy? What is data driven?
  • An Aggregator transformation has 4 ports (sum(col1), group by col2, col3); which port should be the output?
  • What is a parameter file? Explain a scenario where you would use one.
  • What is the difference between a surrogate key and a primary key?
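
For the duplicate-records question above, one classic Oracle sketch that could also run as post-SQL; customer_stg and cust_id are hypothetical names:

-- Keep one row per business key and delete the rest; rowid addresses
-- each physical row even when every column value is identical.
delete from customer_stg s
where s.rowid not in (
  select min(t.rowid)
  from customer_stg t
  group by t.cust_id
);

For the flat-file case, a Sorter transformation with the Distinct option (or an Aggregator grouping on all ports) removes duplicates before the load.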