Frequent hits @dbdebunk.com are driven by the question �Are keys mandatory?� Puzzlingly, many data professionals do not seem to understand why duplicates should be prohibited. This should worry analysts.
Entities are distinguishable in the real world, or we would not be able to tell them apart (and count them). Those important enough to track individually, such as legal US residents, cars, or computers, are issued identifiers such as SSN’s, tags, or S/N’s. A real world identifier captures an entity’s identity within a class — all its remaining descriptive (non-identifying) properties are "about the identifier", so to speak. Members of multiple classes have multiple identifiers.
The lack of a real world identifier means that individual tracking is not meaningful. Such entities — cans of cat food, for example — are still distinguishable and countable in the real world, but only by virtue of their visually distinct location in physical space.
Recall, now, that a relational database consists of logical relations. A relation is a set of tuples that represent facts about a class of property-sharing entities.
A real world identifier is represented in the database by a natural key (NK), a relation attribute that has unique values. For example, an EMPLOYEES relation can be pictured as this R-table with the employee number attribute (shown as the EMP# column) representing the real world identifier.
A RDBMS does not permit keyless relations in the database. So what about entities that lack a real world identifier?
SQL DBMS�s permit keyless tables — that do not picture relations — such as this.
But the absence of an identifier means that individual entities are not meaningful, so this representation contradicts the real world. Note: The table represents the facts:
- Can of Cat Food has price 0.39.
- Can of Cat Food has price 0.39.
�Stating the same fact more than once, does not make it truer, only redundant,� as E. F. Codd used to say.
Contradictions produce problems. First, a DBMS is incapable of �visually� discerning a data entry duplication error from "valid" duplicates, which means high risk of inconsistent databases and wrong counts and other query results.
Second, how to interpret query results that differ only in the number of duplicates they contain? Either they are semantically equivalent — in which case, why the difference? Or they are not, in which case, what does the difference mean? (SQL DBMS�s do not treat them as equivalent, which inhibits performance optimization).
Third, keyless tables violate two of Codd’s core 12 rules and rob databases of the practical advantages they are intended to confer.
- The Guaranteed Logical Access (GLA) rule mandates that every data value in a relational database be logically accessible via a combination of relation name + attribute name + key value.
- The Physical Data Independence (PDI) rule mandates that queries and applications should not have to be rewritten just because the database is physically reorganized to maximize performance.
The design consistent with reality is, of course:
If a need arises to track cat food cans individually, they must be assigned some real world identifier:
Only SQL tables with
- unique, unordered rows
- uniquely named, unordered columns
- no missing values
are R-tables that guarantee logically correct query results. Unfortunately, SQL DBMS�s permit tables that violate these features, including keyless ones, so analysts better exercise care.
转载本站任何文章请注明：转载至神刀安全网，谢谢神刀安全网 » Data Fundamentals for Analysts, Not Worth Repeating: Duplicates