Primary Keys Are Not Unique
(How to not mess up your Calculated Insights)
Data Structures
In the Customer Data Platform (CDP) all data streams have a primary key. When a stream is created, the single field primary key is selected, and data is deduplicated within the stream based on that field. Similarly, data model objects (DMOs) have a primary key that can be seen when mapping streams.
The visible primary key on DMOs is not unique though. There are additional fields, primarily Data Source and Data Source Object which make the mapped data unique. If these are not taken into consideration then the result can be invalid or overstated values.
Take for example this sample set of data sources:
- Marketing Cloud Bundle
- Loyalty Customers file on S3
In this scenario, we have 2 data sources that are mapping data to our Individual object.
SFMC Bundle:
Loyalty File:
In both streams we have overlap on our customers with a common identifier. A sample from the Loyalty file:
And one from SFMC:
To also note, these streams are set up with different Source values:
In the data, SFMC Subscriber Key (mapped to Individual Id) of 33645523 is the same value as the Customer ID (also mapped to Individual Id) from the loyalty file.
You might expect that a single record for Individual would be created and updated, but records across streams do not update by design. This is necessary due to how the CDP functions and unifies data, but it's not obvious if you are used to working with simple databases. This can be seen when looking at Data Explorer:
There are now 2 records in the Individual DMO because the Data Source and Data Source Object fields are part of the uniqueness. This unfortunately is not shown in the Metadata API because Individual displays:
{
"name": "ssot__Individual__dlm",
"displayName": "Individual",
"primaryKeys": [
{
"name": "ssot__Id__c",
"displayName": "Individual Id",
"indexOrder": "1"
}
]
}
However, you can see the additional fields when looking at Unified Link Individual:
{
"name": "IndividualIdentityLink__dlm",
"displayName": "Unified Link Individual",
"primaryKeys": [
{
"name": "SourceRecordId__c",
"displayName": "Individual Id",
"indexOrder": "1"
},
{
"name": "ssot__DataSourceId__c",
"displayName": "Data Source",
"indexOrder": "2"
},
{
"name": "ssot__DataSourceObjectId__c",
"displayName": "Data Source Object",
"indexOrder": "3"
}
]
}
Calculated Insights
It's critical to know the above information when building out your calculated insights. If not you may end up with incorrect data as a result.
For example, using the loyalty information above let's say that we want to create a CI that shows us the total loyalty points. Note: in below examples, I'm running the queries through the Query API.
The SQL could be written as:
SELECT ssot__Individual__dlm.ssot__Id__c as IndId,
UnifiedIndividual__dlm.ssot__Id__c as UnifiedId,
sum(ssot__LoyaltyMemberCurrency__dlm.ssot__PointsBalanceNumber__c) as TotalPoints
from UnifiedIndividual__dlm
inner join IndividualIdentityLink__dlm ON UnifiedIndividual__dlm.ssot__Id__c = IndividualIdentityLink__dlm.UnifiedRecordId__c
inner join ssot__Individual__dlm on ssot__Individual__dlm.ssot__Id__c = IndividualIdentityLink__dlm.SourceRecordId__c
inner join ssot__LoyaltyProgramMember__dlm on ssot__LoyaltyProgramMember__dlm.ssot__PartyId__c = ssot__Individual__dlm.ssot__Id__c
inner join ssot__LoyaltyMemberCurrency__dlm on ssot__LoyaltyMemberCurrency__dlm.ssot__LoyaltyProgramMemberId__c = ssot__LoyaltyProgramMember__dlm.ssot__Id__c
group by ssot__Individual__dlm.ssot__Id__c, UnifiedIndividual__dlm.ssot__Id__c
Given the example data earlier in this article, individual 33645523 has 332 points. The results of this query would look like this:
Incorrectly I have a result of 1328 whereas we would have expected 332, which is 4x higher than the amount. This is coming from a cross join scenario where there are 2 records in IndividualIdentityLink__dlm with a SourceRecordId__c of 33645523 and then in ssot__Individual__dlm there are also 2 records with a ssot__Id__c of 33645523. This results in a 2x2 join, so a total of 4 records are returned.
Instead, we need to include the additional keys as the joins. The highlighted conditions has been added. Both joins need to have the extra conditions since there are two joins based on the individual id.
SELECT ssot__Individual__dlm.ssot__Id__c as IndId,
UnifiedIndividual__dlm.ssot__Id__c as UnifiedId,
sum(ssot__LoyaltyMemberCurrency__dlm.ssot__PointsBalanceNumber__c) as TotalPoints
from UnifiedIndividual__dlm inner join IndividualIdentityLink__dlm ON UnifiedIndividual__dlm.ssot__Id__c = IndividualIdentityLink__dlm.UnifiedRecordId__c
inner join ssot__Individual__dlm on ssot__Individual__dlm.ssot__Id__c = IndividualIdentityLink__dlm.SourceRecordId__c
and IndividualIdentityLink__dlm.ssot__DataSourceId__c = ssot__Individual__dlm.ssot__DataSourceId__c
and IndividualIdentityLink__dlm.ssot__DataSourceObjectId__c = ssot__Individual__dlm.ssot__DataSourceObjectId__c
inner join ssot__LoyaltyProgramMember__dlm on ssot__LoyaltyProgramMember__dlm.ssot__PartyId__c = ssot__Individual__dlm.ssot__Id__c
and ssot__Individual__dlm.ssot__DataSourceId__c = ssot__LoyaltyProgramMember__dlm.ssot__DataSourceId__c
and ssot__Individual__dlm.ssot__DataSourceObjectId__c = ssot__LoyaltyProgramMember__dlm.ssot__DataSourceObjectId__c
inner join ssot__LoyaltyMemberCurrency__dlm on ssot__LoyaltyMemberCurrency__dlm.ssot__LoyaltyProgramMemberId__c = ssot__LoyaltyProgramMember__dlm.ssot__Id__c
group by ssot__Individual__dlm.ssot__Id__c, UnifiedIndividual__dlm.ssot__Id__c
Now when we run the query, the correct results are returned. 💥
This same situation may apply for any other table or join where there may be duplicates across data sources based on the unique identifier.