Microsoft SQL Server Business Intelligence Development Beginner’s Guide | Packt

There are two well-known methods for designing the data warehouse: the Kimball and Inmon methodologies, each named after its originator. Both methods are in use today. The main difference between them is that Inmon is top-down and Kimball is bottom-up. In this chapter, we will explain the Kimball method. The reference books on both methodologies are must-reads for BI and DW professionals and are recommended for the bookshelf of every BI team.
This chapter draws on The Data Warehouse Toolkit, so for a detailed discussion, read that book.
To understand data warehouse design and dimensional modeling, it helps to first learn the components and terminology of a DW. A DW consists of Fact tables and dimensions. The relationship between a Fact table and its dimensions is based on foreign keys and primary keys: the primary key of the dimension table is referenced in the Fact table as a foreign key.
Facts are numeric and additive values in the business process. For example, in the sales business, a fact can be a sales amount, discount amount, or quantity of items sold.
All of these measures or facts are numeric values, and they are additive. Additive means that you can add the values of several records together and the result is meaningful.
For example, adding the sales amount for all records is the grand total of sales. Dimension tables are tables that contain descriptive information. Descriptive information, for example, can be a customer’s name, job title, company, and even geographical information of where the customer lives. Each dimension table contains a list of columns, and the columns of the dimension table are called attributes. Each attribute contains some descriptive information, and attributes that are related to each other will be placed in a dimension.
For example, the customer dimension would contain the attributes listed earlier. Each dimension has a primary key, which is called the surrogate key. The surrogate key is usually an auto increment integer value. The primary key of the source system will be stored in the dimension table as the business key. The Fact table is a table that contains a list of related facts and measures with foreign keys pointing to surrogate keys of the dimension tables.
Fact tables usually store a large number of records, and most of the data warehouse space (around 80 percent) is filled by them. Grain is one of the most important terms used in designing a data warehouse. Grain defines the level of detail stored in the Fact table. For example, you could build a data warehouse for sales in which the Grain is the most detailed level of transactions in the retail shop, that is, one record per transaction at a specific date and time for a customer and salesperson.
Understanding Grain is important because it defines which dimensions are required. There are two different schemas for creating a relationship between fact and dimensions: the snow flake and star schema. In the star schema, a Fact table will be at the center as a hub, and dimensions will be connected to the fact through a single-level relationship.
Ideally, no dimension relates to the fact through another dimension. The following diagram shows the different schemas:
The snow flake schema, as you can see in the preceding diagram, contains relationships of some dimensions through intermediate dimensions to the Fact table. If you look more carefully at the snow flake schema, you may find it more similar to the normalized form, and the truth is that a fully snow flaked design of the fact and dimensions will be in the 3NF.
The snow flake schema requires more joins to answer an analytical query, so it responds more slowly. Hence, the star schema is the preferred design for the data warehouse. In practice, you cannot always build a complete star schema, and sometimes a degree of snow flaking will be required.
However, the best practice is to always avoid snow flaking as much as possible. After a quick definition of the most common terminologies in dimensional modeling, it’s now time to start designing a small data warehouse. One of the best ways of learning a concept and method is to see how it will be applied to a sample question. Assume that you want to build a data warehouse for the sales part of a business that contains a chain of supermarkets; each supermarket sells a list of products to customers, and the transactional data is stored in an operational system.
Our mission is to build a data warehouse that is able to analyze the sales information. Before thinking about the design of the data warehouse, the very first question is what is the goal of designing a data warehouse? What kind of analytical reports would be required as the result of the BI system? The answer to these questions is the first and also the most important step.
This step not only clarifies the scope of the work but also provides you with the clue about the Grain. Defining the goal can also be called requirement analysis. Your job as a data warehouse designer is to analyze required reports, KPIs, and dashboards. After requirement analysis, the dimensional modeling phases will start.
Based on Kimball’s best practices, dimensional modeling can be done in four steps: choose the business process, identify the Grain, identify the dimensions, and identify the facts. In our example, there is only one business process, that is, sales. Grain, as we’ve described earlier, is the level of detail that will be stored in the Fact table. Based on the requirement, the Grain is one record per sales transaction and date, per customer, per product, and per store.
Once Grain is defined, it is easy to identify dimensions. Based on the Grain, the dimensions would be date, store, customer, and product. It is useful to name dimensions with a Dim prefix to identify them easily in the list of tables.
The next step is to identify the Fact table, which would be a single Fact table named FactSales. This table will store the defined Grain. After identifying the Fact and dimension tables, it’s time to go more in detail about each table and think about the attributes of the dimensions, and measures of the Fact table. Next, we will get into the details of the Fact table and then into each dimension. There is only one Grain for this business process, and this means that one Fact table would be required.
To connect to each dimension, there would be a foreign key in the Fact table that points to the primary key of the dimension table. The table would also contain measures or facts. For the sales business process, the facts that are numeric and additive are SalesAmount, DiscountAmount, and QuantitySold.
The Fact table would only contain relationships to dimensions and measures. The following diagram shows some columns of FactSales. As you can see, the diagram shows a star schema. We will go through the dimensions in the next step to explore them in more detail.
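Because the diagram is not reproduced here, the following is a minimal sketch of the star schema described so far. It uses Python's built-in sqlite3 module rather than SQL Server, purely so that it runs anywhere. The names FactSales, DimCustomer, CustomerKey, CustomerAlternateKey, and the three measures come from the text; the remaining columns (FullDate, StoreName, ProductName, FullName) and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Dimensions carry an integer surrogate key; the fact holds only FKs and measures.
conn.executescript("""
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, FullDate TEXT);
CREATE TABLE DimStore (StoreKey INTEGER PRIMARY KEY AUTOINCREMENT, StoreName TEXT);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY AUTOINCREMENT, ProductName TEXT);
CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    CustomerAlternateKey TEXT,                      -- business key from the source
    FullName TEXT,
    City TEXT
);
CREATE TABLE FactSales (
    DateKey INTEGER NOT NULL REFERENCES DimDate(DateKey),
    StoreKey INTEGER NOT NULL REFERENCES DimStore(StoreKey),
    CustomerKey INTEGER NOT NULL REFERENCES DimCustomer(CustomerKey),
    ProductKey INTEGER NOT NULL REFERENCES DimProduct(ProductKey),
    SalesAmount REAL,
    DiscountAmount REAL,
    QuantitySold INTEGER
);
""")

# Hypothetical sample rows, dimensions first so the foreign keys resolve.
conn.executemany("INSERT INTO DimDate VALUES (?, ?)",
                 [(20140101, "2014-01-01"), (20140102, "2014-01-02")])
conn.execute("INSERT INTO DimStore (StoreName) VALUES ('Central')")
conn.executemany("INSERT INTO DimProduct (ProductName) VALUES (?)",
                 [("Milk",), ("Bread",)])
conn.executemany(
    "INSERT INTO DimCustomer (CustomerAlternateKey, FullName, City) VALUES (?, ?, ?)",
    [("C-100", "Ann", "Auckland"), ("C-200", "Bob", "Wellington")])
conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (20140101, 1, 1, 1, 10.0, 0.0, 1),
    (20140102, 1, 1, 2, 20.0, 2.0, 2),
    (20140102, 1, 2, 1, 5.0, 0.0, 1),
])

# A typical analytical query: one join per dimension used, then aggregate.
sales_by_city = conn.execute("""
    SELECT c.City, SUM(f.SalesAmount), SUM(f.QuantitySold)
    FROM FactSales AS f
    JOIN DimCustomer AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.City
    ORDER BY c.City
""").fetchall()
print(sales_by_city)
```

Because the measures are additive, summing them over any slice of a dimension attribute (here, City) gives a meaningful total.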
Fact tables usually don’t have too many columns because the number of measures and related tables won’t be that much.
However, Fact tables will contain many records. The Fact table in our example will store one record per transaction. As the Fact table will contain millions of records, you should think about the design of this table carefully. The String data types are not recommended in the Fact table because they won’t add any numeric or additive value to the table.
The relationship between a Fact table and dimensions could also be based on the surrogate key of the dimension. The best practice is to use the integer data type for surrogate keys; this is cost-effective in terms of the disk space required in the Fact table because an integer takes only 4 bytes, while a string typically takes much more.
Using an integer as a surrogate key also speeds up the join between a fact and a dimension because joins and filter criteria are evaluated on integers, which are compared much faster than strings. If you are thinking about adding comments made by a salesperson on the sales transaction as another column of the Fact table, first think about the analysis that you want to do based on those comments.
No one does analysis based on a free text field; if you wish to do an analysis on a free text, you can categorize the text values through the ETL process and build another dimension for that.
Then, add the foreign key-primary key relationship between that dimension and the Fact table. The customer’s information, such as the customer name, customer job, customer city, and so on, will be stored in the customer dimension. You may think that the customer city belongs in another dimension, such as a Geo dimension. But the important point is that our goal in dimensional modeling is not normalization, so resist the tendency to normalize tables. For a data warehouse, it is much better to store more customer-related attributes in the customer dimension itself than to design a snow flake schema.
The following diagram shows sample columns of the DimCustomer table. The DimCustomer dimension may contain many more attributes. The number of attributes in your dimensions is usually high. In fact, a dimension table with a large number of attributes is the strength of your data warehouse because attributes will be your filter criteria in the analysis, and the user can slice and dice data by attributes.
So, it is good to think about all possible attributes for that dimension and add them in this step. As we’ve discussed earlier, you see attributes such as Suburb, City, State, and Country inside the customer dimension.
This is not a normalized design, and this design definitely is not a good design for a transactional database because it adds redundancy, and making changes won’t be consistent. However, for the data warehouse design, not only is redundancy unimportant but it also speeds up analytical queries and prevents snow flaking.
The CustomerKey is the surrogate key and primary key for the dimension in the data warehouse. The CustomerKey is an integer field, which is autoincremented. It is important that the surrogate key won’t be encoded or taken as a string key; if there is something coded somewhere, then it should be decoded and stored into the relevant attributes.
The surrogate key should be different from the primary key of the table in the source system. There are multiple reasons for that; for example, operational systems sometimes recycle their primary keys, which means they reassign the key value of a customer that is no longer in use to a new customer.
CustomerAlternateKey is the primary key of the source system. It is important to keep the primary key of the source system stored in the dimension because it would be necessary to identify changes from the source table and apply them into the dimension. The primary key of the source system will be called the business key or alternate key. The date dimension is one of the dimensions that you will find in most of the business processes. There may be rare situations where you work with a Fact table that doesn’t store date-related information.
Obviously, you could derive all the other columns from the full date column with date functions, but that would add processing time. So, when designing dimensions, don’t worry about disk space, and add as many attributes as required. The following diagram shows sample columns of the date dimension:
It would be useful to store holidays, weekdays, and weekends in the date dimension because in sales figures, a holiday or weekend will definitely affect the sales transactions and amounts. So, the user will require an understanding of why the sale is higher on a specific date rather than on other days. You may also add another attribute for promotions in this example, which states whether that specific date is a promotion date or not.
The date dimension will have a record for each date. The following screenshot shows sample records of the date dimension. As you can see in those records, the surrogate key of the date dimension, DateKey, holds a meaningful value. This is one of the rare exceptions where we keep the surrogate key as an integer type but in the YYYYMMDD format so that it also carries a meaning. In this example, if we store time information, where do you think the time attributes should go?
Inside the date dimension? Definitely not. The date dimension stores one record per day, so it will have about 365 records per year and about 3,650 records for 10 years. Adding time to it, for example at minute-level detail, would multiply that by 1,440 (24 × 60), which gives more than 5 million records. However, 5 million records for a single dimension is too many; dimensions are usually narrow, and they only occasionally have more than one million records.
So in this case, the best practice would be to add another dimension named DimTime and put all time-related attributes in that dimension. The following screenshot shows some example records and attributes of DimTime. Usually, the date and time dimensions are generic and static, so you won’t be required to populate these dimensions through ETL every night; you just load them once and then use them.
I’ve written two general-purpose scripts to create and populate date and time dimensions on my blog that you can use.
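To make the DateKey convention concrete, here is a small sketch that generates date dimension rows in Python. It is not the author's blog script; only DateKey and its YYYYMMDD format come from the text, while the function name and the other column names (FullDate, Quarter, DayOfWeekName, IsWeekend) are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Integer surrogate key that still reads as YYYYMMDD.
            "DateKey": current.year * 10000 + current.month * 100 + current.day,
            "FullDate": current.isoformat(),
            "Year": current.year,
            "Quarter": (current.month - 1) // 3 + 1,
            "Month": current.month,
            # Day name depends on the process locale (English by default).
            "DayOfWeekName": current.strftime("%A"),
            # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
            "IsWeekend": current.weekday() >= 5,
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2014, 1, 1), date(2014, 12, 31))
print(len(dim_date), dim_date[0]["DateKey"], dim_date[-1]["DateKey"])
```

Storing the derived attributes once, instead of computing them with date functions in every query, is the trade-off the text recommends. A holiday or promotion flag would normally come from a business calendar, so it is left out of this sketch.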
The product dimension will have a ProductKey, which is the surrogate key, and a business key, which will be the primary key of the product in the source system (something similar to a product’s unique number). The product dimension will also have information about the product categories. Again, the dimension is denormalized: the product subcategory and category are placed in the product dimension with redundant values. This decision was made in order to avoid snow flaking and to improve the performance of the join between the fact and dimensions.
We are not going to go in detail through the attributes of the store dimension. The most important part of this dimension is that it can have a relationship to the date dimension. For example, a store’s opening date will be a key related to the date dimension. This type of snow flaking is unavoidable because you cannot copy all the date dimension’s attributes in every other dimension that relates to it. On the other hand, the date dimension is in use with many other dimensions and facts.
So, it would be better to have a conformed date dimension. Outrigger is Kimball’s term for a dimension, such as date, that is conformed and may be used in a many-to-one relationship between dimensions for just one layer. In the previous example, you learned about the transactional fact. A transactional fact is a Fact table that has one record per transaction.
This type of fact table usually has the most detailed Grain. There is also another type of fact, which is the snapshot Fact table. In snapshot fact, each record will be an aggregation of some transactional records for a snapshot period of time. For example, consider financial periods; you can create a snapshot Fact table with one record for each financial period, and the details of the transactions will be aggregated into that record.
Transactional facts are a good source for detailed and atomic reports. They are also good for aggregations and dashboards. The Snapshot Fact tables provide a very fast response for dashboards and aggregated queries, but they don’t cover detailed transactional records.
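The relationship between the two kinds of Fact table can be sketched in a few lines of Python. The example below rolls transactional rows up into one record per period and store. The monthly period, the function name, the TransactionCount column, and the sample rows are all assumptions made for illustration; the text itself only speaks of financial periods in general.

```python
from collections import defaultdict

# Hypothetical transactional fact rows (one per sales transaction).
transactions = [
    {"DateKey": 20140105, "StoreKey": 1, "SalesAmount": 10.0, "QuantitySold": 1},
    {"DateKey": 20140120, "StoreKey": 1, "SalesAmount": 20.0, "QuantitySold": 2},
    {"DateKey": 20140203, "StoreKey": 1, "SalesAmount": 5.0, "QuantitySold": 1},
    {"DateKey": 20140210, "StoreKey": 2, "SalesAmount": 7.5, "QuantitySold": 3},
]


def build_monthly_snapshot(rows):
    """Aggregate transactions into one record per (YYYYMM period, store)."""
    totals = defaultdict(
        lambda: {"SalesAmount": 0.0, "QuantitySold": 0, "TransactionCount": 0})
    for row in rows:
        period_key = row["DateKey"] // 100  # YYYYMMDD -> YYYYMM
        bucket = totals[(period_key, row["StoreKey"])]
        bucket["SalesAmount"] += row["SalesAmount"]
        bucket["QuantitySold"] += row["QuantitySold"]
        bucket["TransactionCount"] += 1
    return dict(totals)


snapshot = build_monthly_snapshot(transactions)
for key in sorted(snapshot):
    print(key, snapshot[key])
```

The snapshot answers period-level questions without scanning every transaction, but the individual transactions can no longer be recovered from it.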
Based on your requirement analysis, you can create both kinds of facts or only one of them. There is also another type of Fact table called the accumulating Fact table. This Fact table is useful for storing processes and activities, such as order management. You can read more about different types of Fact tables in The Data Warehouse Toolkit, Ralph Kimball, Wiley, which was referenced earlier in this chapter. We’ve explained that Fact tables usually contain FKs of dimensions and some measures.
However, there are times when you would require a Fact table without any measure. These types of Fact tables are usually used to show the non-existence of a fact. For example, assume that the sales business process does promotions as well, and you have a promotion dimension. So, each entry in the Fact table shows that a customer X purchased a product Y at a date Z from a store S when the promotion P was on (such as the new year’s sales).
This Fact table covers every requirement that queries the information about the sales that happened, or in other words, the transactions that happened. However, there are times when the promotion is on but no transaction happens! This is a valuable analytical report for decision makers because it lets them understand the situation and investigate why that promotion didn’t produce sales.
So, this is an example of a requirement that the existing Fact table with the sales amount and other measures doesn’t fulfill. This Fact table doesn’t have any fact or measure related to it; it just has FKs for dimensions. However, it is very informative because it tells us on which dates there was a promotion at specific stores on specific products.
We call this a factless Fact table or bridge table. Using examples, we’ve explored the usual dimensions such as customer and date.
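The promotion scenario described above for the factless Fact table can be sketched with Python's built-in sqlite3 module. The table name FactPromotionCoverage, the column layout, and the sample rows are hypothetical; the point is only that a table holding nothing but dimension keys lets you find promotions that produced no sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactSales (
    DateKey INTEGER, StoreKey INTEGER, ProductKey INTEGER,
    PromotionKey INTEGER, SalesAmount REAL
);
-- Factless: only dimension keys, no measures.
CREATE TABLE FactPromotionCoverage (
    DateKey INTEGER, StoreKey INTEGER, ProductKey INTEGER, PromotionKey INTEGER
);
""")

# Promotion 1 ran for product 1 on two days and for product 2 on one day.
conn.executemany("INSERT INTO FactPromotionCoverage VALUES (?, ?, ?, ?)", [
    (20140101, 1, 1, 1),
    (20140101, 1, 2, 1),
    (20140102, 1, 1, 1),
])
# Only product 1 actually sold while the promotion was on.
conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?, ?)", [
    (20140101, 1, 1, 1, 10.0),
    (20140102, 1, 1, 1, 12.0),
])

# Coverage rows with no matching sales row: promoted but not sold.
promoted_but_unsold = conn.execute("""
    SELECT p.DateKey, p.StoreKey, p.ProductKey, p.PromotionKey
    FROM FactPromotionCoverage AS p
    LEFT JOIN FactSales AS s
           ON s.DateKey = p.DateKey
          AND s.StoreKey = p.StoreKey
          AND s.ProductKey = p.ProductKey
          AND s.PromotionKey = p.PromotionKey
    WHERE s.DateKey IS NULL
    ORDER BY p.DateKey, p.ProductKey
""").fetchall()
print(promoted_but_unsold)
```

The sales Fact table alone could not answer this question, because a sale that never happened leaves no row behind.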
When a dimension participates in more than one business process and serves different data marts (as date does), it is called a conformed dimension. Sometimes, a dimension needs to be used in the Fact table more than once. For example, in the FactSales table, you may want to store the order date, shipping date, and transaction date. All three of these columns will point to the date dimension. In this situation, we won’t create three separate dimensions; instead, we will reuse the existing DimDate three times under three different names.
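Reusing DimDate under three names, as described above, can be sketched with sqlite3 as follows. The key column names (OrderDateKey, ShippingDateKey, TransactionDateKey), the DimDate attributes, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, FullDate TEXT, DayOfWeekName TEXT);
CREATE TABLE FactSales (
    OrderDateKey INTEGER REFERENCES DimDate(DateKey),
    ShippingDateKey INTEGER REFERENCES DimDate(DateKey),
    TransactionDateKey INTEGER REFERENCES DimDate(DateKey),
    SalesAmount REAL
);
INSERT INTO DimDate VALUES
    (20140101, '2014-01-01', 'Wednesday'),
    (20140102, '2014-01-02', 'Thursday'),
    (20140103, '2014-01-03', 'Friday');
INSERT INTO FactSales VALUES (20140101, 20140103, 20140102, 25.0);
""")

# One physical DimDate table, joined three times under different aliases.
sale_dates = conn.execute("""
    SELECT od.FullDate, sd.FullDate, td.FullDate, f.SalesAmount
    FROM FactSales AS f
    JOIN DimDate AS od ON od.DateKey = f.OrderDateKey
    JOIN DimDate AS sd ON sd.DateKey = f.ShippingDateKey
    JOIN DimDate AS td ON td.DateKey = f.TransactionDateKey
""").fetchone()
print(sale_dates)
```

Each alias behaves like its own dimension in the query, while the date attributes are stored and maintained only once.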
So, the date dimension literally plays the role of more than one dimension. This is why we call such dimensions role-playing dimensions. There are other types of dimensions with some differences, such as the junk dimension and the degenerate dimension. The junk dimension is used for dimensions with very few member values (records) that are used in only about one data mart (not conformed).
For example, status dimensions can be good candidates for a junk dimension. If you create a status dimension for each situation in each data mart, you will probably end up with more than ten status dimensions with fewer than five records in each.
The junk dimension is a solution to combine such narrow dimensions together and create a bigger dimension. You may or may not use a junk dimension in your data mart because using junk dimensions reduces readability, and not using it will increase the number of narrow dimensions. So, the usage of this is based on the requirement analysis phase and the dimensional modeling of the star schema. A degenerate dimension is another type of dimension, which is not a separate dimension table.
In other words, a degenerate dimension doesn’t have a table, and it sits directly inside the Fact table. Assume that you want to store the transaction number (a string value). Where do you think would be the best place to add that information? You may think that you would create another dimension, enter the transaction number there, assign a surrogate key, and use that surrogate key in the Fact table.
This is not an ideal solution because that dimension will have exactly the same Grain as your Fact table, and this indicates that the number of records for your sales transaction dimension will be equal to the Fact table, so you will have a very deep dimension table, which is not recommended. On the other hand, you cannot think about another attribute for that dimension because all attributes related to the sales transaction already exist in other dimensions connected to the fact.
So, instead of creating a dimension with the same Grain as the fact and with only one column, we leave that column, even if it is a string, inside the Fact table. This type of dimension is called a degenerate dimension. Now that you understand dimensions, it is a good time to go into more detail about one of the most challenging concepts of data warehousing: the slowly changing dimension (SCD). A dimension’s attribute values may change, and depending on the requirement,
you will take different actions to respond to that change. As changes in a dimension’s attribute values happen only occasionally, this is called the slowly changing dimension. SCD is split into different types depending on the action taken after the change.
In this section, we only discuss type 0, 1, and 2. Type 0 doesn’t accept any changes.
Many companies and organizations intend to utilize a BI system to solve problems and help decision makers make decisions. This high demand for BI systems has raised the number of job openings in this field. Business Intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access and analysis of information to improve and optimize decisions and performance.
There are various reasons to have a BI system in place, but helping decision makers to make better decisions is one of the main purposes of BI.
As an example, a director of a manufacturing company would like to understand the trend of sales in past months or years on specific products. This trend would be helpful for him to decide any changes in that product or to create some other editions of that product.
A bank director might like to use data mining solutions to distinguish suspicious or fraudulent transactions. BI could help in all the scenarios mentioned here and many more. A BI system usually uses a data warehouse as a core tool. The data warehouse is an integrated dimensional data structure. Data from a variety of sources will be fed into the data warehouse, and data quality and governance rules will be applied to the data.
The dimensional model of data warehousing is optimized for reporting and analysis, so data visualization tools can directly query against the data warehouse. These models will improve data access in terms of speed and performance of queries. BI systems have one or more data visualization frontends that will be the GUI for the end user.
In this book, we will go through the BI architecture and explore the Microsoft technologies that can implement and deliver BI solutions. As a first step, a developer needs to design the data warehouse (DW) and needs an understanding of the key concepts of the design and the methodologies for creating the data warehouse.
Chapter 4, ETL with Integration Services , describes how ETL is an operation of transferring and integrating data from source systems into the data warehouse. ETL needs to be done on a scheduled basis. Chapter 5, Master Data Management , guides readers on how to manage reference data. Chapter 6, Data Quality and Data Cleansing , explains that data quality is one of the biggest concerns of database systems. The data should be cleansed to be reliable through the data warehouse.
In this chapter, readers will learn about data cleansing and how to use Data Quality Services (DQS), which is one of the new services of SQL Server, to apply data cleansing to the data warehouse. In this chapter, readers will understand data mining concepts and how to use data mining algorithms to understand the relationship between historical data, and how to analyze it using Microsoft technologies. In this chapter, readers will become familiar with algorithms that help in prediction, and how to use them and customize them with parameters.
Users will also understand how to compare models together to find the best algorithm for the case. Chapter 9, Reporting Services , explores Reporting Services, one of the key tools of the Microsoft BI toolset, which provides different types of reports with charts and grouping options.
Chapter 10, Dashboard Design , describes how dashboards are one of the most popular and useful methods of visualizing data. In this chapter, readers will learn when to use dashboards, how to visualize data with dashboards, and how to use PerformancePoint and Power View to create dashboards. Chapter 11, Power BI , explains how predesigned reports and dashboards are good for business users, but power users require more flexibility. Power BI is a new self-service BI tool. Chapter 12, Integrating Reports in Applications , begins with the premise that reports and dashboards are always required in custom applications.
.NET applications in web or Metro applications to provide reports on the application side for the users. However, you can also download and install MS SQL Server Evaluation Edition, which has the same functionalities but is free for the first days, from the following link: There are many examples in this book, and all of the examples use the following databases as a source: After downloading the database files, open SQL Server Management Studio and enter the following scripts to create databases from their data files:
This book is very useful for BI professionals (consultants, architects, and developers) who want to become familiar with Microsoft BI tools. It will also be handy for BI program managers and directors who want to analyze and evaluate Microsoft tools for BI system implementation. Instructions often need some extra explanation so that they make sense, so they are followed with:
This heading explains the working of tasks or instructions that you have just completed. You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Expand the Chapter 02 SSAS Multidimensional database and then expand the dimensions.
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: On the Select Destination Location screen, click on Next to accept the default destination. Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked.
Reader feedback is important for us to develop.
Publisher: Packt Publishing. Released: May 26. Format: Book. Written in an easy-to-follow, example-driven format, there are plenty of step-by-step instructions to help get you started!
The book has a friendly approach, with the opportunity to learn by experimenting. This book will give you a good upshot view of each component and scenarios featuring the use of that component in Data Warehousing and Business Intelligence systems.