Showing posts with label Data Warehouse. Show all posts
Showing posts with label Data Warehouse. Show all posts

Tuesday, June 14, 2011

Data Warehouse QA Technical Screen Questions

I was recently asked to perform a technical phone screen of a data warehouse QA candidate. The project manager wanted to ensure that this person knew enough SQL to be able to efficiently test our data warehouse contents (not end-to-end processing testing).

So that I didn't sound like a bumbling fool when I made the call I put together a list of questions. It's by no means exhaustive but I figured it could be helpful to some . . .


When was the last time you used SQL on a day to day basis (how often do they use it and how rusty might they be)?

How would you rate your SQL from 1-10 (sets expectations for how well they will answer questions below)?

What database platforms/versions have you used (Oracle, SQL Server, MS Acess, MySQL, etc.)?

1) What are the two keywords necessary for any SQL statement?
A1) SELECT
A2) FROM

2) How do you limit the results of a query based one or more criterion?
A1) WHERE clause

  What is a join?
A1) It Combines records from two or more tables.

4)      What types of joins are there and what do they do?
A1) INNER JOIN
A2) LEFT OUTER JOIN
A3) RIGHT OUTER JOIN
A4) CROSS JOIN
A5) FULL OUTER JOIN

5)      What is the syntax for joining two tables?
 A1) FROM tableA
INNER JOIN tableB ON tableA.column1 = tableB.column2

6)      How do you get a single instance of each value of a column in a table?
A1) DISTINCT
7)    
            How do you get the number of occurrences of each distinct value in a table?
A1) COUNT, GROUP BY
8)     
            How do you get the largest value of a column in a table?
A1) MAX
      
9) What is the difference between putting criteria in a WHERE clause or a JOIN clause?
A1) When it is an INNER JOIN it does not affect the results.      
A2) When it is an OUTER JOIN, putting the criteria in the WHERE clause effectively makes it an INNER JOIN which is not usually the intended behavior.
A3) When it is an OUTER JOIN, putting the criteria in the JOIN clause allows the criteria to be applied to only those rows that are outer-joined and does not create the INNER JOIN – effect.
A4) Depending on the database platform and version, the performance could be affected.
     
10)   What is the difference between a WHERE clause and a HAVING clause?
A1) The WHERE clause applies the criteria at the row level whereas the HAVING clause applies it at the aggregate level.

11)   What is a derived table/inline view?
A1) It is a query that is aliased as a table and used like any other table in a FROM clause.

12)   What is a nested subquery?
A1) It is a query that is nested within a SELECT or a WHERE clause.

13)   What is a correlated subquery and when are they useful?
A1) It is a query that is used within a parent query and refers to the parent query.
A2) A correlated subquery is evaluated once for each row processed by the parent statement.
A3) They are useful in providing a set of values to be used in a WHERE clause.


Our data warehouse solution consists of the following.
       1)      Flat files that provide all source data
       2)      A staging database that the flat files are loaded into
       3)      A data warehouse database that contains the transformed data based on a documented set of rules.

1)      If you could assume that the flat files were loaded to the staging database without any errors, how would you test that the data was correctly transformed from the staging database to the data warehouse? What tools, if any, would you need?
A1) Write queries against the staging database that follow the transformation rules. Then write queries to pull the related rows from the
A2) Some kind of a query execution tool is necessary. Toad, RapidSQL, SQL Server Management Studio, etc.
2)      How do you test large sets of data where there could be millions of rows?
A1) Narrow down the results based on specific criteria.
A2) Apply aggregate functions such as SUM to determine if the test query aggregates to the same value as the data warehouse query.
A3) Join the test query/tables to the data warehouse query/tables and apply a WHERE clause that filters on only those values that do not equal each other.

3)      What data would you provide to the development team when you found inconsistencies between the documented transformations and what was in the data warehouse?
A1) The transformation rule
A2) The test query
A3) The data warehouse query (if independent from above)
A4) Explanation of expected versus actual results

Thursday, July 15, 2010

Data Warehouse vs. Data Mart

II'm currently working on a Business Intelligence project and depending on who I am talking to we are either building a Data Mart or a Data Warehouse and everyone's definitions of both seem to be vastly different. Since this makes communication a bit difficult at times I took it up on myself to attempt to clarify the difference between a mart and warehouse.

My project is following the Kimball approach of dimensional modeling so I tailored my definitions to this approach. Keep in mind that the detailed definitions differ depending based on whether you're talking about the top-down (Inmon) approach or the bottom-up (Kimball) approach. The differences are mostly around the modeling techniques and whether or not the data warehouse is in 3NF or de-normalized.

My non-technical description of data mart and data warehouse are as follows.

A data mart is typically a single star schema representing a specific subject area. Sales is a good example of this and Manufacturing is another.  A data warehouse contains the subject specific data marts and attempts to tie an entire organization’s subject data together using conformed dimensions (attribute data that can be used consistently across multiple facts/marts/star schemas).

That is the very simple description I provided the group I am working with. That said, I still needed to determine if we were building a data mart or a data warehouse. From my definition you would think it would be an easy thing to figure out. Since I tend to overthink most things in life I struggled a bit with this.

The database I'm building to support the Business Intelligence (BI) solution consists of subject areas such as Orders, Complaints, Deliveries, and about 4 more subject areas. They will each have their own fact table and share a number of dimensions. Based on that I would say it is a data warehouse because it is a set of data marts/start schemas providing data across numerous subject areas and sharing conformed dimensions. However, this solution is part of a commercial software product that supports Restaurant Order Taking and is not being developed for or within a specific organization. So now I believe it's a question of perspective. From the customers/restauranteur's perspective, the order taking application and the new BI add-on is simply a small part of a much larger picture. The data contained with the mart/warehouse might be considered a single subject area to this organization. The subject area would be Order Taking or more specifically Web and/or Call Center Order Taking. The customer might have other data around accounting, human resources, inventory, supply chain, etc. The solution I'm building could just be part of a bottom-up enterprise data warehouse at some point at which point it would be considered a data mart by the purchasing organization.

So which one is it? I believe you can argue that it's both. However, I believe we must look at it from our perspective of my client who is a software products company. My client sells a suite of order taking products to the restaurant industry. From within the software product company this is a data warehouse that brings together all aspects (subject areas) of their products.

You could argue that it doesn't matter and that this is purely a symantics or philisophical debate. Or possibly I've missed the boat. What do you think?

In relation to the product and the company that owns and sells this product to restauranteurs, this solution can be considered a data warehouse as I stated above but what about looking at it from the customers (restaurant owner) perspective?

will be an add-on to I would say it is a data warehouse because it captures data across the solution and consists of many subject areas.

Monday, August 24, 2009

Great Book for Building a SQL Server BI Solution



While written for SQL Server 2005 and a little out of date for SQL Server 2008, The Microsoft Data Warehouse Toolkit: With SQL Server2005 and the Microsoft Business Intelligence Toolset is a must read for anyone looking to learn and/or apply the Kimball methodology of dimensional modeling while using the Microsoft SQL Server BI stack (SSIS, SSRS, SSAS).



The book starts by describing what the Kimball group has determined to be the best way to execute the requirements gathering phase, it then moves on to designing the dimensional model including the physical relational data warehouse, creating the ETL infrastructure (SSIS), an OLAP database (SSAS), creating standard reports (SSRS), dealing with partitioning, security, maintenance, etc.

Someone with a few years of SQL Server experience can build a full-blown, mid-sized BI solution with this book alone.

The book also has a great, although not physically attractive, companion website. Check out the tools section for some invaluable tools including a dimensional modeling spreadsheet that is perfect for the first few iterations of the relational data warehouse. While not extremely fancy it will generate quality "create" scripts for the initial development phase of your project.