
Logical Data Model: Logical Normalization


This is the first part of a miniseries highlighting the benefits of building a proper logical data model (LDM), also known as a semantic data model. Each blog post will present a specific feature of the GoodData LDM that enables users to create metrics that are both complex and reusable. Each feature also provides additional information about your data to the system and helps you achieve your overall goal.

The GoodData LDM features covered in this miniseries are logical normalization, which is covered below; attribute labels; and shared datasets. (For those who are not yet familiar with the basics of an LDM, read this introduction.)

Throughout the series, we will illustrate the functionality with the following task: create a reusable metric computing the ratio of order amount at the given report dimensionality to the total order amount for each Region. (This metric is the same one as in my previous article about Multidimensional Analytical Query Language [MAQL]: A Guide to MAQL Metrics Reusability.)

Note: If you want to follow along by building your own LDM, please read Get GoodData.CN and Create a Workspace in the documentation. Once you have created a workspace and GoodData.CN is running, execute the following command:

curl http://localhost:3000/api/entities/dataSources \
 -H "Content-Type: application/vnd.gooddata.api+json" \
 -H "Accept: application/vnd.gooddata.api+json" \
 -H "Authorization: Bearer YWRtaW46Ym9vdHN0cmFwOmFkbWluMTIz" \
 -X POST \
 -d '{
     "data": {
         "attributes": {
             "name": "demo-denormalized",
             "url": "jdbc:postgresql://localhost:5432/demo",
             "schema": "demo_denormalized",
             "type": "POSTGRESQL",
             "username": "demouser",
             "password": "demopass"
         },
         "id": "demo-denormalized",
         "type": "dataSource"
     }
 }' | jq .

This connects the demo-denormalized data source, which you can then use in your workspace. (In our documentation, please read the section on how to build a logical data model; it describes how to create an LDM from the connected data source.)
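To confirm that the data source was registered, you can list data sources with a GET request to the same endpoint (an optional check; the exact response payload shown by your GoodData.CN version may differ):

curl http://localhost:3000/api/entities/dataSources \
 -H "Accept: application/vnd.gooddata.api+json" \
 -H "Authorization: Bearer YWRtaW46Ym9vdHN0cmFwOmFkbWluMTIz" | jq .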

Afterward, you should end up with the LDM pictured below. There are two fact datasets (Order lines and Campaign channels) and a single Date dataset. We will use only Order lines in this first part.

LDM with two fact datasets (Order lines and Campaign channels) and a single Date dataset

Relations in the LDM

One important property of the datasets is an implied 1:N relation between the primary key (e.g., Order line id) and its attributes (e.g., Customer id, Customer name, etc.). The system does not know anything more, and by default, it assumes that the relations between the attributes themselves are all M:N. In other words, it assumes that there are no functional dependencies between attributes and that the dataset is in third normal form. In reality, this might not be true (e.g., each State belongs to exactly one Region).

You can check this against the data sample:

Data sample

The implications of this "relation" will be revealed later. (Note that the data sample is incomplete and is included just for demonstration purposes.)
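If you would rather verify the dependency directly in the database, a quick SQL check against the demo_denormalized schema lists any State that maps to more than one Region; an empty result means the State-Region functional dependency holds:

-- States that map to more than one region (empty result = dependency holds)
SELECT state, COUNT(DISTINCT region) AS region_count
FROM order_lines
GROUP BY state
HAVING COUNT(DISTINCT region) > 1;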

Metric Definition

Let's create the reusable metric Order Amount Ratio to Region in MAQL. As described in our guide to MAQL metrics reusability, set the metric format to % (rounded).

SELECT (SELECT SUM({fact/price})) /
(SELECT SUM({fact/price}) BY {label/region}, ALL {label/state})
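To see what the BY {label/region}, ALL {label/state} clause effectively does, here is a rough SQL equivalent of the computation (a sketch over the denormalized order_lines table, assuming the price fact backs the metric; GoodData generates its own SQL internally):

-- Numerator at (region, state) granularity, denominator at region granularity
SELECT ol.region,
       ol.state,
       SUM(ol.price) / r.region_total AS amount_ratio
FROM order_lines ol
JOIN (
    SELECT region, SUM(price) AS region_total
    FROM order_lines
    GROUP BY region
) r ON r.region = ol.region
GROUP BY ol.region, ol.state, r.region_total;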

If we add our metric to a report with the Region and State dimensions, we get this:

Report with the metric broken down by Region and State

Note that the created metric exhibits the same reusability capabilities as the metric referenced above, even though it is defined over a more straightforward LDM.

Logical Normalization

As stated above, our LDM has an implicit (unexpressed) relation between Region and State. Nevertheless, we were still able to create a valuable and reusable metric computing the ratio between States and Regions. There are two problems here, though:

  1. We always have to include Region in our report even though it is unnecessary from a semantic perspective.
  2. More importantly, we do not have the explicit 1:N relation (between Region and State) specified in the LDM, and we must simulate it in the metric using the BY {label/region}, ALL {label/state} construct. This means extra mental effort for metric creators, and it ultimately leads to logic duplication, because you have to simulate the relation in every metric where the property should apply.

We would ultimately like to have a new metric that remedies these issues: in other words, one that presents the expected results when the report has only the State attribute, even though State is no longer hardcoded in the metric.

To do that, we have to encode the State-Region relation into the LDM. MAQL is smart enough to recognize when the attributes in the report and the attributes in the BY clause come from the same hierarchy. MAQL then replaces the report attributes with the BY attributes: it computes both parts of the metric at different dimensionalities and automatically joins them together.

Currently, the only way to specify an attribute hierarchy in the GoodData LDM is to normalize the physical database. (In the future, we plan to improve this via so-called logical hierarchies.)

We will extract the State and Region attributes/columns into a separate dataset/table and connect it to Order lines via a State foreign key. Here is a simple transformation SQL script performing the normalization of the source data:

CREATE TABLE states AS (SELECT DISTINCT state, region FROM order_lines);
ALTER TABLE states ADD CONSTRAINT pk_states PRIMARY KEY ("state");
ALTER TABLE order_lines ADD CONSTRAINT fk_order_lines_state FOREIGN KEY ("state") REFERENCES "states" ("state");
ALTER TABLE order_lines DROP COLUMN region;

To execute it, log in to the PostgreSQL database embedded in your running GoodData.CN using the following command:

psql -h localhost -p 5432 -d demo -U demouser --password

It will ask you for a password; it is the same one you used during the data source registration: demopass. Before you execute the SQL script, set the search path to the correct schema, demo_denormalized:

SET search_path TO demo_denormalized;

Now you can simply copy and paste the SQL script. Then go to the Data tab, where you should scan the physical database again. Don't forget to select Replace the current model:

Scan the data source

You should end up with the following model:

Data model after scanning data source
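If you want to double-check the split at the database level, a couple of optional queries in the same psql session confirm the new states table looks sensible:

-- How many states and regions ended up in the new table
SELECT COUNT(*) AS states, COUNT(DISTINCT region) AS regions FROM states;

-- Spot-check a few rows
SELECT * FROM states ORDER BY state LIMIT 5;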

Let's now create the new metric Order Amount Ratio to Region 2:

SELECT (SELECT SUM({fact/price})) /
(SELECT SUM({fact/price}) BY {label/region})
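Conceptually, the engine can now use the LDM hierarchy to join the two dimensionalities itself. A rough SQL equivalent over the normalized tables (again a sketch, assuming the price fact) looks like this:

-- Numerator per state; denominator per region, reached through the states table
SELECT s.state,
       SUM(ol.price) / r.region_total AS amount_ratio
FROM order_lines ol
JOIN states s ON s.state = ol.state
JOIN (
    SELECT s2.region AS region, SUM(ol2.price) AS region_total
    FROM order_lines ol2
    JOIN states s2 ON s2.state = ol2.state
    GROUP BY s2.region
) r ON r.region = s.region
GROUP BY s.state, r.region_total;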

We can now add the created metric to the report and remove Region. You can see that it works great with just State, while also retaining the reusability features. (Try putting another attribute in the report, such as Product, Date, etc.)

Report with added created metric and with region removed

Unfortunately, it does not work correctly if we want to compare Customers' (instead of States') order amounts to their Regions, even though Region is functionally dependent on Customer (via State).

Report comparing Customer order amount to Region, returning incorrect results

We already know how to fix that, though: explicitly express the Customer-State-Region hierarchy by changing the physical database model.

CREATE TABLE customers AS (SELECT DISTINCT customer_id, customer_name, state FROM order_lines);
ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY ("customer_id");
ALTER TABLE order_lines DROP CONSTRAINT fk_order_lines_state;
ALTER TABLE order_lines ADD CONSTRAINT fk_order_lines_customer_id FOREIGN KEY ("customer_id") REFERENCES "customers" ("customer_id");
ALTER TABLE customers ADD CONSTRAINT fk_customers_state FOREIGN KEY ("state") REFERENCES "states" ("state");
ALTER TABLE order_lines DROP COLUMN customer_name;
ALTER TABLE order_lines DROP COLUMN state;
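Before running the script, note that the PRIMARY KEY step will fail if customer_id does not functionally determine the other extracted columns. If you want to verify that up front, the following optional query lists offending customers (an empty result means the extraction is safe):

-- Customers with more than one name or state (empty result = safe to extract)
SELECT customer_id
FROM order_lines
GROUP BY customer_id
HAVING COUNT(DISTINCT customer_name) > 1 OR COUNT(DISTINCT state) > 1;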

Scan the physical database again. Don't forget to select Replace the current model. The result should be the following:

The result of changing the physical database model

Now the Order Amount Ratio to Region 2 metric finally works as expected; it allows us to produce the following report with only the Customer id attribute.

Report only with the Customer id attribute

Also, note one important advantage of abstracting your database tables into an LDM: our metrics kept working even though we repeatedly changed the database schema and the corresponding LDM.

Summary

This article described how LDM techniques help you obtain valuable and reusable metrics. However, our improved demo model is not optimal yet; we still cannot answer all possible questions. Stay tuned for the second part.

Learn More About MAQL

In the meantime, if you want to learn more about the GoodData LDM and MAQL, check out our GoodData University courses. Also, see our community forum and community Slack channel for help and further information.

Header photo by John Schnobrich on Unsplash
