Monday, September 26, 2011

Data Warehousing Questions


Hi All,

We would like to share some data warehousing related questions. This post covers the topic from scratch to the end at a high level.


Need for a Data Warehouse
To analyze data and maintain history.
Companies require strategic information to face competition in the market.

Operational systems are not designed to provide strategic information.
A data warehouse maintains the history of data for the whole organization and provides a single place where all of that data is stored.


What is data warehousing? Explain the approaches.

Many companies follow the characteristics defined by either W. H. Inmon or Sean Kelly.
Inmon definition
Subject Oriented, Integrated, Non-Volatile, Time Variant.

Sean Kelly definition
Separate, Available, Integrated, Time Stamped, Subject Oriented, Non-Volatile, Accessible.

DWH Approaches
There are two approaches:
1. Top Down by Inmon

2. Bottom Up by Ralph Kimball

Inmon approach --> The enterprise data warehouse is structured first, and the data marts are created from it next (Top Down).
Ralph Kimball --> The data marts are designed first, and the data warehouse is later built from those data marts (Bottom Up).

What are the responsibilities of a data warehouse consultant/professional?

The basic responsibility of a data warehouse consultant is to ‘publish the right data’.
Some of the other responsibilities of a data warehouse consultant are:

1. Understand the end users by their business area, job responsibilities, and computer
tolerance.

2. Find out the decisions the end users want to make with the help of the data warehouse.

3. Identify the ‘best’ users who will make effective decisions using the data warehouse.

4. Find the potential new users and make them aware of the data warehouse.

5. Determine the grain of the data.

6. Make the end user screens and applications much simpler and more template driven.

What are fundamental stages of Data Warehousing?

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.


Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.


What is a Datamart? Explain its types.

It covers a specific subject area, functionality or task.

It is designed to facilitate end-user analysis.

Wrong answer: "It is a subset of the warehouse." Please don't use this answer.
Types of Datamarts
Dependent, Independent, Logical.
Dependent ---> The warehouse is created first and the datamart is created next.
Independent --> The datamart is created directly from the source systems without depending on the warehouse.
Logical ---> It is a backup or replica of another datamart.


How to create a Datawarehouse and a Datamart?

DWH -----> By applying a data warehouse approach to any database.

DM -------> It is created by using either views or complex tables.

What is Dimensional Modeling?

It defines the relationship between dimension and fact tables with the help of a particular model (Star, Snowflake, etc.).

What do you mean by a Dimension table? Explain the Dimension types.

A dimension table is a collection of attributes that describe a functionality or task.

Features:
1. It contains textual or descriptive information.
2. It does not contain any measurable information.
3. It answers what, where, when and why questions.
4. These tables are master tables and also maintain history.

Types of Dimension
a. Conformed
b. Degenerate
c. Junk
d. Role Playing
e. SCD
f. Dirty

What is a Fact table? Explain the types of Measures.

The fact table is the main table in a dimensional model. It contains two sections:
a. Foreign keys to dimensions
b. Measures or facts

Features
1. A fact table contains measurable or numerical information.
2. It answers "how many" and "how much" questions.
3. These tables are child or transactional tables and also contain history.

Types of Measures

Additive Measure, Semi-Additive Measure, Non-Additive Measure.

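What is a Factless Fact Table?

A fact table that does not contain any meaningful or additive measures. It holds only the foreign keys to its dimensions, for example to record that an event occurred.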


What is a Surrogate key? How do we generate it?

It is a key that contains unique values, like a primary key.
A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key.
It is just a unique identifier or number for each row that can be used as the primary key of the table.

We may generate this key in 2 ways:

System generated
Manual sequence
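
The two ways of generating a surrogate key can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module; the table name dim_customer, its columns and the sample customers are assumptions made for the example, not part of any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key
        customer_id   TEXT NOT NULL,                     -- natural key from the source
        customer_name TEXT
    )
""")

def add_customer(customer_id, name):
    """System generated: the database assigns the next surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, customer_name) VALUES (?, ?)",
        (customer_id, name),
    )
    return cur.lastrowid

def next_manual_key():
    """Manual sequence: take the current maximum key and add one."""
    (max_key,) = conn.execute(
        "SELECT COALESCE(MAX(customer_key), 0) FROM dim_customer"
    ).fetchone()
    return max_key + 1

first_key = add_customer("C-100", "Asha")    # hypothetical source rows
second_key = add_customer("C-200", "Ben")
manual_key = next_manual_key()
```

The natural key (customer_id) is kept as an ordinary column, so the source system can change or reuse it without affecting the key that fact tables point to.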

What is the necessity of having surrogate keys?

1. Production may reuse keys that it has purged but that you are still maintaining.

2. Production might legitimately overwrite some part of a product description or a
customer description with new values but not change the product key or the customer
key to a new value. We are then left wondering what to do about the revised attribute
values (the slowly changing dimension crisis).

3. Production may generalize its key format to handle some new situation in the
transaction system,
e.g. changing the production keys from integers to alphanumeric values,
or the 12-byte keys you are used to may become 20-byte keys.

4. Acquisition of companies.

What are the advantages of using Surrogate Keys?

1. We can save substantial storage space with integer-valued surrogate keys.

2. Eliminate administrative surprises coming from production.

3. Potentially adapt to big surprises like a merger or an acquisition.

4. Have a flexible mechanism for handling slowly changing dimensions.

What is SCD? Explain the SCD types.

SCD ---> Slowly Changing Dimension
A dimension maintains the history of its data. Changes arrive in such a dimension in low volume, so we call it a Slowly Changing Dimension. The process we follow to handle these changes is called the SCD process.

SCD Types
Type 1 ---> No History
The new record replaces the original record. Only one record exists in the database: the current data.

Type 2 ---> History Maintained, using either of two methods:
1. Current/Expired Method
2. Effective Date Range Method
A new record is added to the dimension table (for example, the customer dimension).
Two records exist in the database: the current data and the previous (history) data.

Type 3 ---> Limited History Maintained
The original record is modified to include the new data. One record exists in the database: the new information is stored alongside the old information in the same row.
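
The three SCD types can be sketched in a few lines of Python. This is a simplified illustration on in-memory rows, not an ETL implementation; the column names (customer_key, customer_id, city, is_current, start_date, end_date, previous_city) are assumptions chosen for the example, and the Type 2 function combines the current/expired flag with an effective date range.

```python
from datetime import date

def apply_type1(dim_rows, customer_id, new_city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim_rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city

def apply_type2(dim_rows, customer_id, new_city, change_date):
    """Type 2: expire the current row and add a new current row."""
    current = next(
        r for r in dim_rows
        if r["customer_id"] == customer_id and r["is_current"]
    )
    current["is_current"] = False
    current["end_date"] = change_date
    dim_rows.append({
        "customer_key": max(r["customer_key"] for r in dim_rows) + 1,
        "customer_id": customer_id,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })

def apply_type3(dim_rows, customer_id, new_city):
    """Type 3: keep the old value in an extra column of the same row."""
    for row in dim_rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city

def sample_dimension():
    """One hypothetical customer row to experiment with."""
    return [{
        "customer_key": 1, "customer_id": "C-100", "city": "Pune",
        "start_date": date(2010, 1, 1), "end_date": None, "is_current": True,
    }]

type2_rows = sample_dimension()
apply_type2(type2_rows, "C-100", "Delhi", date(2011, 9, 26))
```

After the Type 2 call, type2_rows holds two rows for the same customer: the expired Pune row and the current Delhi row with a new surrogate key.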

What are the techniques for handling SCDs?

Overwriting
Creating another dimension record
Creating a current value field

What are the different methods of loading Dimension tables?

There are two different ways to load data into dimension tables.

Conventional (Slow):
All the constraints and keys are validated against the data before it is
loaded; this way data integrity is maintained.

Direct (Fast):
All the constraints and keys are disabled before the data is loaded.
Once the data is loaded, it is validated against all the constraints and keys.
If data is found to be invalid or dirty, it is not included in the index and all future
processes skip this data.

What is OLTP?

OLTP is the abbreviation of On-Line Transaction Processing. Such a system is
an application that modifies data the instant it receives it and has a
large number of concurrent users.

What is OLAP?

OLAP is abbreviation of Online Analytical Processing. This system is an
application that collects, manages, processes and presents
multidimensional data for analysis and management purposes.

What is the difference between OLTP and OLAP?

Data Source
OLTP: Operational data; the OLTP system is the original source of the data.

OLAP: Consolidated data that comes from various sources.

Process Goal
OLTP: A snapshot of ongoing business processes; it performs the fundamental business tasks.


OLAP: Multi-dimensional views of business activities for planning and decision making.

Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.

OLAP: Complex, long-running queries run by the system to update the aggregated data.

Database Design
OLTP: Normalized, small database. Speed is not an issue because the
database is smaller, and normalization does not degrade performance.
This adopts the entity relationship (ER) model and an application-oriented
database design.

OLAP: De-normalized, large database. Speed is an issue because the database is larger, and de-normalizing improves performance as there are fewer tables to scan while performing tasks.
This adopts a star, snowflake or fact constellation model and a subject-oriented database
design.

Backup and System Administration

OLTP: Regular database backups and system administration can do the job.

OLAP: Reloading the OLTP data is considered a good backup option.


Describe the foreign key columns in the fact table and dimension table.

Foreign keys of dimension tables are primary keys of entity tables.
Foreign keys of fact tables are primary keys of dimension tables.

What is Data Mining?

Data Mining is the process of analyzing data from different perspectives and summarizing
it into useful information.

What is the difference between view and materialized view?

A view takes the output of a query and makes it appear like a virtual
table and it can be used in place of tables.

A materialized view provides indirect access to table data by storing
the results of a query in a separate schema object.


What is ODS?

ODS is the abbreviation of Operational Data Store. It is a database structure that is a repository
for near real-time operational data rather than long-term trend data.
The ODS may further become the enterprise shared operational database,
allowing operational systems that are being reengineered to use the ODS as their operational databases.

What is VLDB?

VLDB is abbreviation of Very Large DataBase. A one terabyte database would normally be considered to be a VLDB. Typically, these are decision support systems or transaction processing applications serving large numbers of users.

Is an OLTP database design optimal for a Data Warehouse?

No. OLTP database tables are normalized, which adds time for queries to return results. Additionally, an OLTP database is smaller and does not contain the longer-period (many years) data that needs to be analyzed.

An OLTP system is basically an ER model and not a dimensional model.
If a complex query is executed on an OLTP system, it may cause heavy overhead on the OLTP server that will affect normal business processes.

If de-normalization improves data warehouse processes, why is the fact table in normal form?

Foreign keys of fact tables are primary keys of dimension tables. A fact table therefore consists of columns that are primary keys of other tables plus the measures that depend on them, which by itself makes it a normal-form table.


What are lookup tables?

A lookup table is a table that is looked up against the target table based on the primary key of the target.
It lets the load pass only modified (new or updated) records to the target, based on the lookup condition.

What are Aggregate tables?

An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table, which may have millions of records.
Aggregate tables reduce the load on the database server, improve query performance and return results quickly.
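
As a rough sketch of the idea, the snippet below builds a monthly aggregate table from a detailed fact table using Python's sqlite3 module. The table names sales_fact and sales_by_month, the columns and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2011-08-03", "Milk", 10.0),
        ("2011-08-19", "Milk", 5.0),
        ("2011-09-01", "Milk", 7.5),
        ("2011-09-02", "Bread", 3.0),
    ],
)

# Summarise the detail rows to the (month, product) level once,
# so that reports can read the small table instead of the large one.
conn.execute("""
    CREATE TABLE sales_by_month AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           SUM(amount)             AS total_amount
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product
""")

monthly_totals = conn.execute(
    "SELECT month, product, total_amount FROM sales_by_month "
    "ORDER BY month, product"
).fetchall()
```

In a real warehouse the aggregate has to be refreshed whenever the detailed fact table is loaded; that step is left out here.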



What is real time data-warehousing?

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes
available instantly.

What are conformed dimensions?

Conformed dimensions mean exactly the same thing in every fact table to which they are joined. They are common to the cubes and can be used across multiple data marts in combination with multiple fact tables.


What is a conformed fact?

A conformed fact is a measure that has the same definition and units in every fact table in which it appears, so it can be compared and combined across multiple data marts.

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.
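
A loop of this kind might look like the following Python sketch. The column names are assumptions loosely modelled on common time dimension fields, and only a handful of attributes are derived; a real loader would add fiscal periods, holidays and so on.

```python
import calendar
from datetime import date, timedelta

# Fixed English names so the output does not depend on the machine's locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_time_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    time_key = 1
    while day <= end:
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        rows.append({
            "time_key": time_key,
            "full_date": day,
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_number_in_month": day.day,
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "weekday_flag": day.weekday() < 5,
            "last_day_in_month_flag": day.day == days_in_month,
        })
        day += timedelta(days=1)
        time_key += 1
    return rows

rows_2011 = build_time_dimension(date(2011, 1, 1), date(2011, 12, 31))
```

Calling the function with a 100-year range produces roughly 36,500 rows, which is small enough to generate once and load in a single batch.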

What is the level of Granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse, that is, how much detail you are willing to store for each transactional fact.

What are non-additive facts?

Non-additive facts are facts that cannot be summed up across any of the dimensions present in the fact table, for example ratios and percentages. However, they are not useless: if the dimensions change, the same facts can become
useful.

What are Additive Facts? Or what is meant by Additive Fact?

Fact tables are usually very large, and we almost never fetch a single record into our answer set.
We fetch a very large number of records, on which we then do adding, counting, averaging, or
taking the min or max. The most common of these is adding. Applications are simpler if they store facts in an additive format as often as possible.
Thus, in the grocery example, we don’t need to store the unit price.
We compute the unit price by dividing the dollar sales by the unit sales whenever necessary.
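
The grocery example can be sketched with Python's sqlite3 module. The table sales_fact, its columns and the sample figures are assumptions for illustration; the point is that only the additive facts are stored and the unit price is derived after summing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact "
    "(product TEXT, store TEXT, dollar_sales REAL, unit_sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("Milk", "Store A", 30.0, 10),
        ("Milk", "Store B", 45.0, 18),
        ("Bread", "Store A", 20.0, 8),
    ],
)

def unit_price_by_product():
    """Sum the additive facts first, then divide to get the unit price."""
    rows = conn.execute("""
        SELECT product,
               SUM(dollar_sales) / SUM(unit_sales) AS unit_price
        FROM sales_fact
        GROUP BY product
        ORDER BY product
    """).fetchall()
    return dict(rows)

prices = unit_price_by_product()
```

Dividing the two sums gives the correct average price at any level of grouping, whereas a stored unit price could not simply be added up across stores.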


What are the 3 important fundamental themes in a data warehouse?

The 3 most important fundamental themes are:
1. Drilling Down
2. Drilling Across and
3. Handling Time

What is meant by Drilling Down?

Drilling down means nothing more than “give me more detail”.
Drilling Down in a relational database means “adding a row header” to an existing SELECT
statement.

For instance, if you are analyzing the sales of products at a manufacturer level, the
select list of the query reads:

SELECT MANUFACTURER, SUM(SALES).

If you wish to drill down on the list of manufacturers to show the brand sold, you add the BRAND row header:

SELECT MANUFACTURER, BRAND, SUM(SALES).

Now each manufacturer row expands into multiple rows listing all the brands sold. This is the
essence of drilling down.

We often call a row header a “grouping column” because everything in the list that’s not
aggregated with an operator such as SUM must be mentioned in the SQL GROUP BY clause.
So the GROUP BY clause in the second query reads, GROUP BY MANUFACTURER, BRAND.
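
The two queries can be run end to end with Python's sqlite3 module, as in this small sketch. The sales table and its rows are made-up sample data; only the MANUFACTURER, BRAND and SALES column names come from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (manufacturer TEXT, brand TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Acme", "Alpha", 100.0),
        ("Acme", "Beta", 50.0),
        ("Acme", "Alpha", 25.0),
        ("Zenith", "Gamma", 70.0),
    ],
)

# Manufacturer level: one row per manufacturer.
by_manufacturer = conn.execute("""
    SELECT manufacturer, SUM(sales)
    FROM sales
    GROUP BY manufacturer
    ORDER BY manufacturer
""").fetchall()

# Drill down: add BRAND as a row header and as a grouping column.
by_brand = conn.execute("""
    SELECT manufacturer, brand, SUM(sales)
    FROM sales
    GROUP BY manufacturer, brand
    ORDER BY manufacturer, brand
""").fetchall()
```

With this data the Acme row of the first result expands into separate Alpha and Beta rows in the second, while the totals per manufacturer stay the same.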


What is meant by Drilling Across?

Drilling Across adds more data to an existing row. If drilling down is requesting ever finer and
more granular data from the same fact table, then drilling across is the process of linking two or more
fact tables at the same granularity, or, in other words, tables with the same set of grouping
columns and dimensional constraints.

A drill across report can be created by using grouping columns that apply to all the fact tables
used in the report.

The new fact table called for in the drill-across operation must share certain dimensions with the
fact table in the original query. All fact tables in a drill-across query must use conformed
dimensions.
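
A drill-across query can be sketched as follows with Python's sqlite3 module. The two fact tables (orders_fact and shipments_fact), their columns and the sample rows are assumptions for illustration. Each fact table is first aggregated to the same grouping column (product, standing in for a conformed dimension), and the two result sets are then joined on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_fact    (product TEXT, month TEXT, ordered_qty INTEGER);
    CREATE TABLE shipments_fact (product TEXT, month TEXT, shipped_qty INTEGER);

    INSERT INTO orders_fact VALUES ('Widget', '2011-08', 10);
    INSERT INTO orders_fact VALUES ('Widget', '2011-09', 5);
    INSERT INTO orders_fact VALUES ('Gadget', '2011-09', 7);

    INSERT INTO shipments_fact VALUES ('Widget', '2011-08', 8);
    INSERT INTO shipments_fact VALUES ('Widget', '2011-09', 6);
    INSERT INTO shipments_fact VALUES ('Gadget', '2011-09', 7);
""")

# Aggregate each fact table separately, then link the rows that share
# the same value of the common grouping column.
drill_across = conn.execute("""
    SELECT o.product, o.ordered_qty, s.shipped_qty
    FROM (SELECT product, SUM(ordered_qty) AS ordered_qty
          FROM orders_fact GROUP BY product) AS o
    JOIN (SELECT product, SUM(shipped_qty) AS shipped_qty
          FROM shipments_fact GROUP BY product) AS s
      ON s.product = o.product
    ORDER BY o.product
""").fetchall()
```

Aggregating before joining matters: joining the raw fact rows first would multiply rows and inflate the sums.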

What is the significance of handling time?

For example, when a customer moves out of a property, we might want to know:

1. who the new customer is
2. when did the old customer move out
3. when did the new customer move in
4. how long was the property empty etc


What are the important fields in a recommended Time dimension table?

Time_key
Day_of_week
Day_number_in_month
Day_number_overall
Month
Month_number_overall
Quarter
Fiscal_period
Season
Holiday_flag
Weekday_flag
Last_day_in_month_flag

What is the main difference between Data Warehousing and Business Intelligence?


The differences are:

DW - is a way of storing data and creating information by leveraging data marts.
Data marts (DMs) are segments or categories of information and/or data that are grouped together to provide 'information' about that segment or category.
A DW does not require BI to work. Reporting tools can generate reports from the DW.


BI - is the leveraging of the DW to help make business decisions and recommendations.
Information and data rules engines are leveraged here to help make these decisions, along with statistical analysis tools and data mining tools.

What is a Physical data model?

During the physical design process, you convert the data gathered during the logical design
phase into a description of the physical database, including tables and constraints.


What is a Logical data model?

A logical design is a conceptual and abstract design. We do not deal with the physical
implementation details yet;
we deal only with defining the types of information that we need.
The process of logical design involves arranging data into a series of logical relationships called
entities and attributes.


What are an Entity, Attribute and Relationship?

An entity represents a chunk of information. In relational databases, an entity often maps to a
table.
An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.
The entities are linked together using relationships.


What is junk dimension?

A number of very small dimensions might be lumped together to form a single dimension,
a junk dimension - the attributes are not closely related.
Grouping of Random flags and text Attributes in a dimension and moving them to a separate sub dimension is known as junk dimension.

Friday, September 16, 2011

SCENARIOS

1. Consider a scenario where we need to open a PDF file from a link in an SSRS report by checking for it in two different paths in remote locations (if the file is present in path 1, it is opened from there; if not, look in path 2 and open it from there).


First we need to check path 1; if the file exists, open it, otherwise check path 2 and open the file from there.
These two paths are stored in a file.

Steps:
1. Open the report and go to the Layout tab.

2. Go to Report ----> Report Properties ---> click on the Code tab and write the custom code.

Click OK.

  
3. Select the cell that should open the PDF when clicked, then right-click and go to Properties --> Navigation ---> Jump to URL and write the expression "=javascript:void(window.open(....))" to
open the PDF file in a new window --> click OK.

4. Go to the Preview tab --> click View Report. The report opens with a hyperlink on the column; clicking it opens the PDF.
NOTE:
1. When the complete URL of the path is not in one column of the table, concatenate the parts in such a way that all slashes in the URL are the same (\).

2. If the PDF does not open when the hyperlink is clicked because the URL is not populated correctly, replace all single slashes (\) with double slashes (\\) when using JavaScript.

--------------------------------------------------------------------------------------------------
Scenario 2


USAGE OF STORED PROCEDURE IN SSRS

Consider a scenario in which we need to use stored procedures in SSRS reports.
Steps:
While creating a Dataset, we can use 3 kinds of "Query Type".

1. Text
2. Table
3. Stored Procedure
Among the three, stored procedure is the best query type to use with respect to report performance.
Here are the steps to use a procedure in the report:

1. Select the Datasource.

2. Create a Dataset and select the stored procedure that this report refers to.

3. Click on the Query Designer and enter the parameters for this procedure. Click on "OK"
and you will be able to see the output of the procedure.






> Suppose the report requirement is to display only the time, whereas the stored procedure returns a date-time field. One option is to modify the stored procedure to return only the time, for example [TIME] = CAST(MAX(p.[Visit Date Time]) AS TIME) in Report_1, but modifying the stored procedure for such a simple requirement may not be desirable.
The logic below requires no modification to the stored procedure as a whole.


> =Format(Today() + Fields!Time.Value, "HHmm") displays only the time in the desired format (2300 HRS format).























 



Thursday, September 15, 2011

Execution Tree in SSIS

At run time the Data Flow Engine divides the Data Flow Task operations into Execution Trees, which show how the package uses buffers and threads. These execution trees specify how buffers and threads are allocated in the package.

Each tree creates a new buffer and may execute on a different thread. When a new buffer is created, such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread. [TechNet]

Let's take an example to show the Execution Tree in the Data Flow Task. I will create a simple Data Flow Task and have two flows in it.

1) Direct transfer of data from SrcEmployee table to DestEmployee table.
2) SrcDepartment to DestDepartment by having one Sort Component in between to sort on DepName.



 Now when you run the package, SSIS will log an entry for User:PipelineExecutionTrees which describes the trees/paths SSIS has created to run the package.





Message:

Begin Path 0 [Tree 1]
   output "OLE DB Source Output" (11); component "SrcEmployee" (1)
   input "OLE DB Destination Input" (29); component "DestEmployee" (16)
End Path 0

Begin Path 1 [Tree 2]
   output "OLE DB Source Output" (124); component "SrcDepartment" (114)
   input "Sort Input" (147); component "SortDepartment" (146)
End Path 1

Begin Path 2 [Tree 3]
   output "Sort Output" (148); component "SortDepartment" (146)
   input "OLE DB Destination Input" (142); component "DestDepartment" (129)
End Path 2

Select ALL in parameter of SSRS report

Selecting ALL as a parameter value is one of the most common features in reports, and there are a number of ways to implement it.

The driver of the solution shown here is a CASE expression in the WHERE clause of the SELECT dataset query.

        SELECT * FROM TableName
         WHERE
         (
               CASE
                      WHEN  @RepParam <> 'ALL' AND ColName = @RepParam THEN 1
                      WHEN  @RepParam = 'ALL'  THEN 1
               END
         ) = 1 ;
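
The same CASE filter can be tried outside SSRS. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server, so the parameter is written :rep_param instead of @RepParam; the Student rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (Name TEXT, Grade TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?)",
    [("Asha", "X"), ("Ben", "IX"), ("Chen", "X")],
)

def students_for(grade_param):
    """Return student names for one grade, or for every grade when 'ALL'."""
    sql = """
        SELECT Name FROM Student
        WHERE (
            CASE
                WHEN :rep_param <> 'ALL' AND Grade = :rep_param THEN 1
                WHEN :rep_param = 'ALL' THEN 1
            END
        ) = 1
        ORDER BY Name
    """
    return [name for (name,) in conn.execute(sql, {"rep_param": grade_param})]

all_students = students_for("ALL")
grade_x = students_for("X")
```

When neither WHEN branch matches, the CASE expression yields NULL, the comparison with 1 is not true, and the row is filtered out.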


Let's see it through a simple Student table example, where we will select students either on the basis of the grade they are in or select all of them.

1. Records in Student table are

2. Create a simple report with DataSet Student as SELECT * FROM Student;

3. To add the Grade parameter and option for Select ALL

3.a. Create a DataSet for available Grades for Report parameter

3.b. Configure a Report parameter "Grade" as


3.c. Modify query for Student Dataset  to allow filtering on Grades as

4.a. Run the Report for ALL grades

4.b. Run the Report for grade - X



Enable Fast Parse in SSIS, ROW_NUMBER, Use of RS Utility, How to Use SSAS Profiler Scheduler

How to enable Fast Parse in SSIS ?

Solution

The fast parse property is hidden in the depths of the advanced settings of a flat file source. It can drastically improve the performance of an SSIS package that uses a flat file source or a Data Conversion transformation, basically by not validating the columns that you specify.

To enable fast parse on either a flat file source or a data conversion transformation use the following steps.



·          Right-click the Flat File source or Data Conversion transformation, and then click "Show Advanced Editor".

·          In the Advanced Editor dialog box, click the "Input and Output Properties" tab.

·          In the Inputs and Outputs pane, click the column for which you want to enable fast parse.

·          In the Properties window, expand the Custom Properties node, and then set the FastParse property to True.


What is ROW_NUMBER  in SQL Server ?



Solution

It returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.

Syntax:



ROW_NUMBER ( )  OVER ( [ <partition_by_clause> ] <order_by_clause> )
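
To see the numbering restart in each partition, the sketch below runs ROW_NUMBER through Python's sqlite3 module. This assumes the bundled SQLite is version 3.25 or later (which supports window functions) and uses a made-up Employee table; the OVER clause has the same shape as in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, DepName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        ("Ann", "HR", 50),
        ("Bob", "HR", 60),
        ("Cid", "IT", 70),
        ("Dee", "IT", 90),
        ("Eve", "IT", 80),
    ],
)

# Number the employees inside each department, highest salary first.
numbered = conn.execute("""
    SELECT DepName,
           Name,
           ROW_NUMBER() OVER (PARTITION BY DepName ORDER BY Salary DESC) AS RowNum
    FROM Employee
    ORDER BY DepName, RowNum
""").fetchall()
```

Leaving out PARTITION BY would treat the whole result set as a single partition and number the rows 1 to 5.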

       
           
What is the use of RS Utility?


Solution

The RS utility is used to automate report server deployment and administration tasks. Developers and report server administrators can perform operations on a report server through the rs utility (RS.exe). Using this utility, you can programmatically administer a report server using Visual Basic .NET scripts.



  How to use SSAS Profiler Scheduler

Solution -

1. Download SSASTraceScheduler folder in a drive where you have ample space.

2. Run Timev1.exe

3. Type in the AS Server Name against which you want to run Trace and click on Connect.

4. If the connection is successful, it will allow you to schedule traces.

5. Schedule Trace timings

6. Click Start Trace.

7. Trace file will be generated in a folder where ProfilerScheduler.exe is located.

      


      


 

Tuesday, September 13, 2011

Prompting for parameter in SSIS

A question might strike you: if we have parameter prompts in SSRS, then why not in SSIS? Well, I am a great fan of Microsoft, as they provide an alternative for anything under the sun (sometimes Sun itself). In this post I would like to demonstrate how we can prompt for variable values in SSIS.

For the implementation, let me consider a simple table having Name and Email Id.
Through the SSIS package we will prompt for a Name and accept it from the user.

CREATE TABLE [tblSSISPrompt]
(
[Name] varchar(15),
[Email] varchar(50)
)
GO
--Fill table with some data

INSERT INTO [tblSSISPrompt] VALUES ('Rahul', 'Rahul.Kumar@sqlsvrsol.com')
INSERT INTO [tblSSISPrompt] VALUES ('Prashant','Prashant.M@sqlsvrsol.com')
INSERT INTO [tblSSISPrompt] VALUES ('Mark','Mark.W@sqlsvrsol.com')
GO

SELECT * FROM [tblSSISPrompt]




Now we start with SSIS
A) Declare two variables
a) SqlStmt to store SQL query
b) Email to get Email from the resultset from Sql query



B) Use Script Task to prompt and prepare SQL query and store in SqlStmt variable
a) Specify SqlStmt as ReadWriteVariables


b) Code for prompt and prepare SQL query


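Public Sub Main()
    Dim Name As String

    'Prompt the user for a name
    Name = InputBox("Enter Name", "Name Dialog")

    'Prepare SQL query; double any single quotes so the input
    'cannot break out of the string literal
    Dts.Variables("SqlStmt").Value = _
        "SELECT Email FROM tblSSISPrompt WHERE Name = '" & Name.Replace("'", "''") & "'"

    'Show the generated query
    MsgBox(Dts.Variables("SqlStmt").Value.ToString(), , "Query")

End Sub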

C) Use Execute SQL Task to execute the query



D) Execute the Package
a) As you execute the package, it will prompt for Name


b) Script will display query in Message box


c) Finally Execute SQL Task will execute the query
 Congrats! That's prompting for parameters in SSIS. If you want to do the same thing in C#, read the first comment.

Numeric Checking in SSIS

SSIS expressions have no IsNumeric() function.

To solve this we can use a script and check with IsNumeric().
For example, I will use the following table and data:

--create a test table
create table TEST
(
[id] varchar(20),
[name] varchar(20)
)
--Fill some data

Insert into TEST values ( 1,'Mike')
Insert into TEST values ( 2 ,'Rahul')
--error data with non numeric id
Insert into TEST values ( '3a' ,'Worngdata')

Now let's design our package:
1) Extract the data with an OLE DB Source
2) Create a Script Component and specify ID as an input column


2.b) Add an output column named IsNumeric of Boolean data type





2.c) Write below code as Script


If IsNumeric(Row.id) Then
Row.IsNumeric = True
Else
Row.IsNumeric = False
End If

3) Use Conditional Split to separate out bad records and valid records


4) Now when we execute the package it will separate out records with non-numeric data
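
The flow of steps 2 to 4 can be mimicked outside SSIS with a short Python sketch. The is_numeric helper here is only a rough analogue of VB's IsNumeric() built on float(), so edge cases (for example currency symbols, or strings such as "nan") may be classified differently; the rows are the three test rows inserted above.

```python
def is_numeric(value):
    """Rough stand-in for VB's IsNumeric(): True if float() accepts the value."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def conditional_split(rows):
    """Send rows with a numeric id to one output and the rest to another."""
    valid, bad = [], []
    for row in rows:
        (valid if is_numeric(row["id"]) else bad).append(row)
    return valid, bad

test_rows = [
    {"id": "1", "name": "Mike"},
    {"id": "2", "name": "Rahul"},
    {"id": "3a", "name": "Worngdata"},
]

valid_rows, bad_rows = conditional_split(test_rows)
```

Here valid_rows receives the Mike and Rahul rows and bad_rows receives the row with id '3a', matching what the Conditional Split does in the package.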




Full package