Clicky

Introduction to The Relational Language

This section was labeled under programming.

Sequel, The Structured Query Language

IBM developed the original version of SQL, originally called Sequel, as part of the System R project in the early 1970s. The Sequel language has evolved since then, and its name has changed to SQL (Structured Query Language). Many products now support the SQL language. SQL has clearly established itself as the standard relational database language.

The SQL language has several parts:

  • Data-definition language (DDL). The SQL DDL provides commands for defining relation schemas, deleting relations, and modifying relation schemas.
  • Data-manipulation language (DML). The SQL DML provides the ability to query information from the database and to insert tuples into, delete tuples from, and modify tuples in the database.
  • Integrity. The SQL DDL includes commands for specifying integrity constraints that the data stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
  • View definition. The SQL DDL includes commands for destining views.
  • Transaction control. SQL includes commands for specifying the beginning and end points of transactions.
  • Embedded SQL and dynamic SQL. Embedded and dynamic SQL define how SQL statements can be embedded within general-purpose programming languages, such as C, C++, and Java.
  • Authorization. The SQL DDL includes commands for specifying access rights to relations and views.

The set of relations in a database are specified using a data-definition language (DDL). The SQL DDL allows specification of not only a set of relations, but also information about each relation, including:

  • The schema for each relation.
  • The types of values associated with each attribute.
  • The integrity constraints.
  • The set of indices to be maintained for each relation.
  • The security and authorization information for each relation.
  • The physical storage structure of each relation on disk.

Pre-theory

Before going through any SQL flavor, let us take a look at the thoery prerequisites.

Set Theory

Set theory, which originated with the mathematician Georg Cantor, is one of the mathematical branches on which the relational model is based. Cantor’s definition of a set follows:

By a “set” we mean any collection M into a whole of definite, distinct objects m (which are called the “elements” of M) of our perception or of our thought.

Joseph W. Dauben and Georg Cantor (Princeton University Press, 1990)

Every word in the definition has a deep and crucial meaning. The definitions of a set and set membership are axioms that are not supported by proofs. Each element belongs to a universe, and either is or is not a member of the set.

Let’s start with the word whole in Cantor’s definition. A set should be considered a single entity. Your focus should be on the collection of objects as opposed to the individual objects that make up the collection. Later on, when you write T-SQL queries against tables in a database (such as a table of employees), you should think of the set of employees as a whole rather than the individual employees. This might sound trivial and simple enough, but apparently many programmers have difficulty adopting this way of thinking.

The word distinct means that every element of a set must be unique. Jumping ahead to tables in a database, you can enforce the uniqueness of rows in a table by defining key constraints. Without a key, you won’t be able to uniquely identify rows, and therefore the table won’t qualify as a set. Rather, the table would be a multiset or a bag.

Predicate Logic

Predicate logic, is another branch of mathematics on which the relational model is based. Dr. Edgar F. Codd, in creating the relational model, had the insight to connect predicate logic to both management and querying of data. Loosely speaking, a predicate is a property or an expression that either holds or doesn’t hold—in other words, is either true or false. The relational model relies on predicates to maintain the logical integrity of the data and define its structure. One example of a predicate used to enforce integrity is a constraint defined in a table called Employees that allows only employees with a salary greater than zero to be stored in the table. The predicate is “salary greater than zero” (T-SQL expression: \(salary > 0\)).

In set theory, you can use predicates to define sets. This is helpful because you can’t always define a set by listing all its elements (for example, infinite sets), and sometimes for brevity it’s more convenient to define a set based on a property. As an example of an infinite set defined with a predicate, the set of all prime numbers can be defined with the following predicate: “x is a positive integer greater than 1 that is divisible only by 1 and itself.” For any specified value, the predicate is either true or not true. The set of all prime numbers is the set of all elements for which the predicate is true. As an example of a finite set defined with a predicate, the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} can be defined as the set of all elements for which the following predicate holds true: “x is an integer greater than or equal to 0 and smaller than or equal to 9.”

Propositions, Predicates, and Relations

The common belief that the term relational stems from relationships between tables is incorrect. “Relational” actually pertains to the mathematical term relation. In set theory, a relation is a representation of a set. In the relational model, a relation is a set of related information, with the counterpart in SQL being a table—albeit not an exact counterpart. A key point in the relational model is that a single relation should represent a single set (for example, Customers). It is interesting to note that operations on relations (based on relational algebra) result in a relation (for example, a join between two relations).

When you design a data model for a database, you represent all data with relations (tables). You start by identifying propositions that you will need to represent in your database. A proposition is an assertion or a statement that must be true or false. For example, the statement, “Employee Itzik BenGan was born on February 12, 1971, and works in the IT department” is a proposition. If this proposition is true, it will manifest itself as a row in a table of Employees. A false proposition simply won’t manifest itself. This presumption is known as the close world assumption (CWA).

Basic Types

The SQL standard supports a variety of built-in types, including:

  • char \((n)\): A fixed-length character string with user-specified length n. The full form, character, can be used instead. SQL also provides the nvarchar type to store multilingual data using the Unicode representation. However, many databases allow Unicode (in the UTF-8 representation) to be stored even in varchar types.
  • varchar \((n)\): A variable-length character string with user-specified maximum length n. The full form, character varying, is equivalent
  • int: An integer (a finite subset of the integers that is machine dependent). The full form, integer, is equivalent.
  • smallint: A small integer (a machine-dependent subset of the integer type).
  • numeric \((p, d)\): A fixed-point number with user-specified precision. The number consists of \(p\) digits (plus a sign), and \(d\) of the \(p\) digits are to the right of the decimal point. Thus, numeric(3,1) allows 44.5 to be stored exactly, but neither 444.5 nor 0.32 can be stored exactly in a field of this type.
  • float(n): A floating-point number with precision of at least n digits.

Each type may include a special value called the null value. A null value indicates an absent value that may exist but be unknown or that may not exist at all. In certain cases, we may wish to prohibit null values from being entered, as we shall see shortly.

You can understand the difference between char and varchar this way:

Create table temp
(City CHAR(10),
Street VARCHAR(10));

Insert into temp
values('Pune','Oxford');

select length(city), length(street) from temp;

output will be:

length(City)          Length(street)
10                    6

It’s obvious. Very similar to hard disk storage methods.

Each type may include a special value called the null value. A null value indicates an absent value that may exist but be unknown or that may not exist at all. In certain cases, we may wish to prohibit null values from being entered, as we shall see shortly.

Schema Definition

We define an SQL elation by using the create table command. The following command creates a relation department in the database:

create table department
(
    dept_name varchar (20),
    building varchar (15),
    budget numeric (12,2),
    primary key (dept_name)
);

The relation created above has three attributes, dept_name, which is a character string of maximum length 20, building, which is a character string of maximum length 15, and budget, which is a number with 12 digits in total, two of which are after the decimal point. The create table command also specifies that the dept name attribute is the primary key of the department relation.

The general form of writing this kind of schemas looks like this:

create table r
    (A1 D1 ,
    A2 D2 ,
    ...,
    An Dn ,
    ⟨integrity-constraint1 ⟩,
    …,
    ⟨integrity-constraintk ⟩);

where \(r\) is the name of the relation, each \(A_{i}\) is the name of an attribute in the schema of relation \(r\), and \(D_{i}\) is the domain of attribute \(A_{i}\) ; that is, \(D_{i}\) specifies the type of attribute \(A_{i}\) along with optional constraints that restrict the set of allowed values for \(A_{i}\).

The semicolon shown at the end of the create table statements, as well as at the end of other SQL statements later in this chapter, is optional in many SQL implementations. In Microsoft’s ’s implementation, it’s replaced with GO command.

SQL supports a number of different integrity constraints. In this section, we discuss only a few of them:

  • primary key \((A_{j_1} , A_{j_2} , … , A_{ j_{ m } } )\): The primary-key specification says that attributes \((A_{j_1} , A_{j_2} , … , A_{ j_{ m } } )\) form the primary key for the relation. The primary-key attributes. are required to be nonnull and unique; that is, no tuple can have a null value for a primary-key attribute, and no two tuples in the relation can be equal on all the primary-key attributes. Although the primary-key specification is optional, it is generally a good idea to specify a primary key for each relation.
  • foreign key \((A_{k_1} , A_{k_2} , … , A_{ k_{ m } } )\): references s: The foreign key specification says that \((A_{k_1} , A_{k_2} , … , A_{ k_{ m } } )\) for any tuple in the relation must correspond to values of the primary key attributes of some tuple in relation s.
  • not null: The not null constraint on an attribute specifies that the null value is not allowed for that attribute; in other words, the constraint excludes the null value from the domain of that attribute.
  • Check Constraints: A check constraint allows you to define a predicate that a row must meet to be entered into the table or to be modified. For example, the following check constraint ensures that the salary column in the Employees table will support only positive values.
  • Default Constraints: A default constraint is associated with a particular attribute. It is an expression that is used as the default value when an explicit value is not specified for the attribute when you insert a row. For example, the following code defines a default constraint for the orderts attribute (representing the order’s time stamp):

    ALTER TABLE dbo.Orders
          ADD CONSTRAINT DFT_Orders_orderts
          DEFAULT(SYSDATETIME()) FOR orderts;
    

SQL prevents any update to the database that violates an integrity constraint. For example, if a newly inserted or modified tuple in a relation has null values for any primary-key attribute, or if the tuple has the same value on the primary-key attributes as does another tuple in the relation, SQL flags an error and prevents the update. Similarly, an insertion of a course tuple with a dept name value that does not appear in the department relation would violate the foreign-key constraint on course, and SQL prevents such an insertion from taking place.

ALTER TABLE dbo.Employees
      ADD CONSTRAINT CHK_Employees_salary
    CHECK(salary > 0.00);

An attempt to insert or update a row with a non-positive salary value will be rejected by the RDBMS. Note that a check constraint rejects an attempt to insert or update a row when the predicate evaluates to FALSE. The modification will be accepted when the predicate evaluates to either TRUE or UNKNOWN. For example, salary –1000 will be rejected, whereas salaries 50000 and NULL will both be accepted.

Structure of SQL Queries

The basic structure of an SQL query consists of three clauses: select, from, and where. A query takes as its input the relations listed in the from clause, operates on them as specified in the where and select clauses, and then produces a relation as the result. We introduce the SQL syntax through examples, and we describe the general structure of SQL queries later.

Select

Database tables are objects that store all the data in a database. In a table, data is logically organized in a row-and-column format which is similar to a spreadsheet.

To query data from a table, you use the SELECT statement. The following illustrates the most basic form of the SELECT statement:

SELECT
    select_list
FROM
    schema_name.table_name;

Order by

When you use the SELECT statement to query data from a table, the order of rows in the result set is not guaranteed. It means that SQL can return a result set with an unspecified order of rows.

The only way for you to guarantee that the rows in the result set are sorted is to use the ORDER BY clause. The following illustrates the ORDER BY clause syntax:

SELECT
    select_list
FROM
    table_name
ORDER BY
    column_name | expression [ASC | DESC ];
Offset Fetch

The OFFSET and FETCH clauses are the options of the ORDER BY clause. They allow you to limit the number of rows to be returned by a query.

ORDER BY column_list [ASC |DESC]
OFFSET offset_row_count {ROW | ROWS}
FETCH {FIRST | NEXT} fetch_row_count {ROW | ROWS} ONLY

In this syntax:

  • The OFFSET clause specifies the number of rows to skip before starting to return rows from the query. The offset_row_count can be a constant, variable, or parameter that is greater or equal to zero.
  • The FETCH clause specifies the number of rows to return after the OFFSET clause has been processed. The offset_row_count can a constant, variable or scalar that is greater or equal to one.
  • The OFFSET clause is mandatory while the FETCH clause is optional. Also, the FIRST and NEXT are synonyms respectively so you can use them interchangeably. Similarly, you can use the FIRST and NEXT interchangeably.

Note that you must use the OFFSET and FETCH clauses with the ORDER BY clause. Otherwise, you will get an error.

Top

The SELECT TOP clause allows you to limit the number of rows or percentage of rows returned in a query result set.

SELECT TOP (expression) [PERCENT]
    [WITH TIES]
FROM
    table_name
ORDER BY
    column_name;
Percent

The PERCENT keyword indicates that the query returns the first N percentage of rows, where N is the result of the expression.

With Ties

The WITH TIES allows you to return more rows with values that match the last row in the limited result set. Note that WITH TIES may cause more rows to be returned than you specify in the expression.

Distinct

Sometimes, you may want to get only distinct values in a specified column of a table. To do this, you use the SELECT DISTINCT clause as follows:

SELECT DISTINCT
    column_name
FROM
    table_name;
Where

When you use the SELECT statement to query data against a table, you get all the rows of that table, which is unnecessary because the application may only process a set of rows at the time.

To get the rows from the table that satisfy one or more conditions, you use the WHERE clause as follows:

SELECT
    select_list
FROM
    table_name
WHERE
    search_condition;
And

The AND is a logical operator that allows you to combine two Boolean expressions. It returns TRUE only when both expressions evaluate to TRUE.

SELECT
    product_id,
    product_name,
    category_id,
    model_year,
    list_price
FROM
    production.products
WHERE
    category_id = 1 AND model_year = 2018
ORDER BY
    list_price DESC;
Or

The OR is a logical operator that allows you to combine two Boolean expressions. It returns TRUE when either of the conditions evaluates to TRUE.

SELECT
    product_id,
    product_name,
    category_id,
    model_year,
    list_price
FROM
    production.products
WHERE
    list_price > 3000 OR model_year = 2018
ORDER BY
    list_price DESC;
Between

The BETWEEN operator is a logical operator that allows you to specify a range to test.

SELECT
    product_id,
    product_name,
    category_id,
    model_year,
    list_price
FROM
    production.products
WHERE
    list_price BETWEEN 1899.00 AND 1999.99
ORDER BY
    list_price DESC;
Null

In the database world, NULL is used to indicate the absence of any data value. For example, when recording the customer information, the email may be unknown, so you record it as NULL in the database.

Is Null

See the following customers table from the sample database.

customer_id(PK, int, not null),
first_name(varchar, not null),
last_name(varchar, not null),
phone(varchar, null),
email(varchar, not null),
street(varchar, null),
city(varchar, null),
state(varchar, null),
zip_code(varchar, null)

The following statement finds the customers who do not have phone numbers recorded in the customers table:

SELECT
    customer_id,
    first_name,
    last_name,
    phone
FROM
    sales.customers
WHERE
    phone = NULL
ORDER BY
    first_name,
    last_name;

The query returned an empty result set.

The WHERE clause returns rows that cause its predicate evaluates to TRUE. However, the following expression evaluates to UNKNOWN.

phone = NULL;

The right way to do it is to use the following instead:

SELECT
    customer_id,
    first_name,
    last_name,
    phone
FROM
    sales.customers
WHERE
    phone IS NULL
ORDER BY
    first_name,
    last_name;

Which evaluates to:

| customer_id | first_name | last_name | phone |
|-------------+------------+-----------+-------|
|         338 | Abbey      | Pugh      | NULL  |
|          75 | Abby       | Gamble    | NULL  |
|        1224 | Abram      | Copeland  | NULL  |
|         673 | Adam       | Henderson | NULL  |
|        1085 | Adam       | Thornton  | NULL  |
|         195 | Addie      | Hahn      | NULL  |
|        1261 | Adelaida   | Hancock   | NULL  |
|          22 | Adelle     | Larsen    | NULL  |
|        1023 | Adena      | Blake     | NULL  |
|        1412 | Adrien     | Hunter    | NULL  |
Load 10 more... (10/1267)

As you may guess, to check if a value is not NULL, you can use the IS NOT NULL operator.

Joins

The FROM clause of a query is the first clause to be logically processed, and within the FROM clause, table operators operate on input tables. Microsoft SQL Server supports four table opera- tors—JOIN, APPLY, PIVOT, and UNPIVOT. The JOIN table operator is standard, whereas APPLY, PIVOT, and UNPIVOT are T-SQL extensions to the standard. Each table operator acts on tables provided to it as input, applies a set of logical query processing phases, and returns a table result.

A JOIN table operator operates on two input tables. The three fundamental types of joins are cross joins, inner joins, and outer joins. These three types of joins differ in how they apply their logical query processing phases; each type applies a different set of phases. A cross join applies only one phase—Cartesian Product. An inner join applies two phases—Cartesian Product and Filter. An outer join applies three phases—Cartesian Product, Filter, and Add Outer Rows.

Logical query processing describes a generic series of logical steps that for any specified query produces the correct result, whereas physical query processing is the way the query is processed by the RDBMS engine in practice. Some phases of logical query processing of joins might sound inefficient, but the inefficient phases will be optimized by the physical implementation. It’s important to stress the term logical in logical query processing. The steps in the process apply operations to the input tables based on relational algebra. The database engine does not have to follow logical query processing phases literally, as long as it can guarantee that the result that it produces is the same as that dictated by logical query processing. The SQL Server relational engine often applies many shortcuts for optimization purposes when it knows that it can still produce the correct result.

Cross Joins

Logically, a cross join is the simplest type of join. A cross join implements only one logical query processing phase—a Cartesian Product. This phase operates on the two tables provided as inputs to the join and produces a Cartesian product of the two. That is, each row from one input is matched with all rows from the other. So if you have \(m\) rows in one table and \(n\) rows in the other, you get \(m \cdot n\) rows in the result.

ANSI Syntax

T-SQL supports two standard syntaxes for cross joins—the ANSI SQL-92 and ANSI SQL-89 syntaxes. I recommend that you use the ANSI-SQL 92 syntax for reasons that I’ll describe shortly.

SELECT C.custid, E.empid
FROM Sales.Customers AS C
CROSS JOIN HR.Employees AS E;

Because there are 91 rows in the Customers table and 9 rows in the Employees table, this query produces a result set with 819 rows, as shown here in abbreviated form.

Inner Joins

An inner join applies two logical query processing phases—it applies a Cartesian product between the two input tables as in a cross join, and then it filters rows based on a predicate that you specify.

Using the ANSI SQL-92 syntax, you specify the INNER JOIN keywords between the table names. The INNER keyword is optional, because an inner join is the default, so you can specify the JOIN keyword alone. You specify the predicate that is used to filter rows in a designated clause called ON. This predicate is also known as the join condition.

SELECT E.empid, E.firstname, E.lastname, O.orderid
FROM HR.Employees AS E
JOIN Sales.Orders AS O
ON E.empid = O.empid;

This was the SQL-92 syntax, here is 89’s:

SELECT E.empid, E.firstname, E.lastname, O.orderid
FROM HR.Employees AS E, Sa

I strongly recommend that you stick to the ANSI SQL-92 join syntax because it is safer in several ways. Suppose you intend to write an inner join query, and by mistake you forget to specify the join condition. With the ANSI SQL-92 syntax, the query becomes invalid, and the parser generates an error. For example, try to run the following code.

SELECT E.empid, E.firstname, E.lastname, O.orderid
FROM HR.Employees AS E
JOIN Sales.Orders AS O;
Msg 102, Level 15, State 1, Line 3
Incorrect syntax near ';'.
Composite Joins

A composite join is simply a join based on a predicate that involves more than one attribute from each side. A composite join is commonly required when you need to join two tables based on a primary key–foreign key relationship and the relationship is composite; that is, based on more than one attribute. For example, suppose you have a foreign key defined on dbo.Table2, columns col1, col2, referencing dbo.Table1, columns col1, col2, and you need to write a query that joins the two based on a primary key–foreign key relationship. The FROM clause of the query would look like this.

FROM dbo.Table1 AS T1
JOIN dbo.Table2 AS T2
ON T1.col1 = T2.col1
AND T1.col2 = T2.col2

For a more tangible example, suppose that you need to audit updates to column values against the OrderDetails table in the TSQL2012 database. You create a custom auditing table called OrderDetailsAudit.

Multi-Joins

A join table operator operates only on two tables, but a single query can have multiple joins. In general, when more than one table operator appears in the FROM clause, the table operators are logically processed from left to right. That is, the result table of the first table operator is treated as the left input to the second table operator; the result of the second table operator is treated as the left input to the third table operator; and so on. So if there are multiple joins in the FROM clause, the first join operates on two base tables, but all other joins get the result of the preceding join as their left input. With cross joins and inner joins, the database engine can (and often does) internally rearrange join ordering for optimization purposes because it won’t have an impact on the correctness of the result of the query.

As an example, the following query joins the Customers and Orders tables to match customers with their orders, and then it joins the result of the first join with the OrderDetails table to match orders with their order lines.

SELECT
C.custid, C.companyname, O.orderid,
OD.productid, OD.qty
FROM Sales.Customers AS C
JOIN Sales.Orders AS O
ON C.custid = O.custid
JOIN Sales.OrderDetails AS OD
ON O.orderid = OD.orderid;
Outer Joins

Outer joins were introduced in ANSI SQL-92 and, unlike inner joins and cross joins, have only one standard syntax—the one in which the JOIN keyword is specified between the table names, and the join condition is specified in the ON clause. Outer joins apply the two logical processing phases that inner joins apply (Cartesian product and the ON filter), plus a third phase called Adding Outer Rows that is unique to this type of join.

In an outer join, you mark a table as a “preserved” table by using the keywords LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN between the table names. The OUTER keyword is optional. The LEFT keyword means that the rows of the left table are preserved; the RIGHT keyword means that the rows in the right table are preserved; and the FULL keyword means that the rows in both the left and right tables are preserved. The third logical query processing phase of an outer join identifies the rows from the preserved table that did not find matches in the other table based on the ON predicate. This phase adds those rows to the result table produced by the first two phases of the join, and uses NULL marks as placeholders for the attributes from the nonpreserved side of the join in those outer rows.

SELECT
    m.name member,
    p.title project
FROM
    pm.members m
    RIGHT OUTER JOIN pm.projects p
    ON p.id = m.project_id

A common question about outer joins that is the source of a lot of confusion is whether to specify a predicate in the ON or WHERE clause of a query. You can see that with respect to rows from the preserved side of an outer join, the filter based on the ON predicate is not final. In other words, the ON predicate does not determine whether a row will show up in the output, only whether it will be matched with rows from the other side. So when you need to express a predicate that is not final— meaning a predicate that determines which rows to match from the nonpreserved side—specify the predicate in the ON clause. When you need a filter to be applied after outer rows are produced, and you want the filter to be final, specify the predicate in the WHERE clause. The WHERE clause is processed after the FROM clause—specifically, after all table operators have been processed and (in the case of outer joins) after all outer rows have been produced. Also, the WHERE clause is final with respect to rows that it filters out, unlike the ON clause.

Aggregation with Grouping

There are circumstances where we would like to apply the aggregate function not only to a single set of tuples, but also to a group of sets of tuples; we specify this in SQL using the group by clause. The attribute or attributes given in the group by clause are used to form groups. Tuples with the same value on all attributes in the group by clause are placed in one group.

As an illustration, consider the query “Find the average salary in each department.” We write this query as follows

select dept name, avg (salary) as avg salary
from instructor
group by dept name;

Nested Subqueries

SQL provides a mechanism for nesting subqueries. A subquery is a select-from-where expression that is nested within another query. A common use of subqueries is to perform tests for set membership, make set comparisons, and determine set cardinality by nesting subqueries in the where clause.

Set Membership

SQL allows testing tuples for membership in a relation. The in connective tests for set membership, where the set is a collection of values produced by a select clause. The not in connective tests for the absence of set membership.

As an illustration, reconsider the query “Find all the courses taught in the both the Fall 2017 and Spring 2018 semesters.” Earlier, we wrote such a query by intersecting two sets: the set of courses taught in Fall 2017 and the set of courses taught in Spring

  1. We can take the alternative approach of finding all courses that were taught in

Fall 2017 and that are also members of the set of courses taught in Spring 2018. This formulation generates the same results as the previous one did, but it leads us to write our query using the in connective of SQL. We begin by finding all courses taught in Spring 2018, and we write the subquery:

(select course id
from section
where semester = 'Spring' and year= 2018)

We then need to find those courses that were taught in the Fall 2017 and that appear in the set of courses obtained in the subquery. We do so by nesting the subquery in the where clause of an outer query. The resulting query is:

select distinct course id
from section
where semester = 'Fall' and year= 2017 and
course id in (select course id
from section
where semester = 'Spring' and year= 2018);
Set Comparison

Check the MSDOCS; here, here.

Test for Empty Relations

SQL includes a feature for testing whether a subquery has any tuples in its result. The exists construct returns the value true if the argument subquery is nonempty. Using the exists construct, we can write the query “Find all courses taught in both the Fall 2017 semester and in the Spring 2018 semester” in still another way:

select course id
from section as S
where semester = 'Fall' and year= 2017 and
exists (select *
from section as T
where semester = 'Spring' and year= 2018 and
S.course id= T .course id);

SQL includes a Boolean function for testing whether a subquery has duplicate tuples in its result. The unique construct11 returns the value true if the argument subquery contains no duplicate tuples. Using the unique construct, we can write the query “Find all courses that were offered at most once in 2017” as follows:

select T .course id
from course as T
where unique (select R.course id
from section as R
where T .course id= R.course id and
R.year = 2017);

Note that if a course were not offered in 2017, the subquery would return an empty result, and the unique predicate would evaluate to true on the empty set. An equivalent version of this query not using the unique construct is:

select T .course id
from course as T
where 1 >= (select count(R.course id)
from section as R
where T .course id= R.course id and
R.year = 2017);
From Subqueries

SQL allows a subquery expression to be used in the from clause. The key concept ap- plied here is that any select-from-where expression returns a relation as a result and, therefore, can be inserted into another select-from-where anywhere that a relation can appear.

select dept name, avg salary
from (select dept name, avg (salary) as avg salary
from instructor
group by dept name)
where avg salary > 42000;

Gotchas

This section contains several exercise and tricks for various topics.

Generating Five Copies from Each Row   cross_joins

Producing multiple copies of rows can be achieved with a fundamental technique that utilizes a cross join. If you need to produce five copies of each employee row, you need to perform a cross join be- tween the Employees table and a table that has five rows; alternatively, you can perform a cross join between Employees and a table that has more than five rows, but filter only five from that table in the WHERE clause. The Nums table is very convenient for this purpose. Simply cross Employees and Nums, and filter from Nums as many rows as the number of requested copies (five, in this case).

create table temptable
(
    id int
);
insert into temptable
values
    (1),
    (2),
    (3),
    (4),
    (5);

select stf.staff_id empid, stf.first_name firstname, last_name lastname, tmp.id n
from sales.staffs as stf cross join temptable as tmp
order by firstname, n;

drop table temptable;
| empid | firstname  | lastname | n |
|-------+------------+----------+---|
|    10 | Bernardine | Houston  | 1 |
|    10 | Bernardine | Houston  | 2 |
|    10 | Bernardine | Houston  | 3 |
|    10 | Bernardine | Houston  | 4 |
|    10 | Bernardine | Houston  | 5 |
|     1 | Fabiola    | Jackson  | 1 |
|     1 | Fabiola    | Jackson  | 2 |
|     1 | Fabiola    | Jackson  | 3 |
|     1 | Fabiola    | Jackson  | 4 |
|     1 | Fabiola    | Jackson  | 5 |
Load 10 more... (10/50)

General Notes

  • When you use the NOT IN predicate against a subquery that returns at least one NULL, the outer query always returns an empty set. Values from the outer table that are known to appear in the set are not returned because the outer query is supposed to return values that do not appear in the set. Values that do not appear in the set of known values are not returned because you can never tell
  • Suppose that you need to query the Orders table in the TSQL2012 database and return information about the order that has the maximum order ID in the table. You could accomplish the task by using a variable. The code could retrieve the maximum order ID from the Orders table and store the result in a variable. Then the code could query the Orders table and filter the order where the order ID is equal to the value stored in the variable. The following code demonstrates this technique.

    USE TSQL2012;
    DECLARE @maxid AS INT = (SELECT MAX(orderid)
    FROM Sales.Orders);
    SELECT orderid, orderdate, empid, custid
    FROM Sales.Orders
    WHERE orderid = @maxid;
    
  • Suppose that you need to query the Orders table in the TSQL2012 database and return, for each order, information about the current order and also the previous order ID. The concept of “previous” implies logical ordering, but because you know that the rows in a table have no order, you need to come up with a logical equivalent to the concept of “previous” that can be phrased with a T-SQL expression. One example of such a logical equivalent is “the maximum value that is smaller than the current value.” This phrase can be expressed in T-SQL with a correlated subquery like this.

    SELECT orderid, orderdate, empid, custid,
           (SELECT MAX(O2.orderid)
            FROM Sales.Orders AS O2
            WHERE O2.orderid < O1.orderid) AS prevorderid
    FROM Sales.Orders AS O1;
    

Confusions

  • The different between UNION and UNION ALL is that, UNION will remove duplicates from your query, UNION ALL will not.

Table Expressions

A table expression is a named query expression that represents a valid relational table. You can use table expressions in data manipulation statements much like you use other tables.

Table expressions are not physically materialized anywhere—they are virtual. When you query a table expression, the inner query gets un-nested. In other words, the outer query and the inner query are merged into one query directly against the underlying objects. The benefits of using table expressions are typically related to logical aspects of your code and not to performance. For example, table expressions help you simplify your solutions by using a modular approach. Table expressions also help you circumvent certain restrictions in the language, such as the inability to refer to column aliases assigned in the SELECT clause in query clauses that are logically processed before the SELECT clause.

Derived Tables

Derived tables (also known as table subqueries) are defined in the FROM clause of an outer query. Their scope of existence is the outer query. As soon as the outer query is finished, the derived table is gone.

You specify the query that defines the derived table within parentheses, followed by the AS clause and the derived table name. For example, the following code defines a derived table called USACusts based on a query that returns all customers from the United States, and the outer query selects all rows from the derived table.

USE TSQL2012;
SELECT *
FROM (SELECT custid, companyname
FROM Sales.Customers
WHERE country = N'USA') AS USACusts;

A query must meet 2 requirements to be valid to define a table expression of any kind:

  1. Order is not guaranteed. A table expression is supposed to represent a relational table, and the rows in a relational table have no guaranteed order. Recall that this aspect of a relation stems from set theory. For this reason, standard SQL disallows an ORDER BY clause in queries that are used to define table expressions, unless the ORDER BY serves another purpose besides presentation. An example for such an exception is when the query uses the OFFSET-FETCH filter. T-SQL enforces similar restrictions, with similar exceptions—when TOP or OFFSET-FETCH is also specified. In the context of a query with the TOP or OFFSET-FETCH filter, the ORDER BY clause serves as part of the specification of the filter. If you use a query with TOP or OFFSET-FETCH and ORDER BY to define a table expression, ORDER BY is only guaranteed to serve the filtering-related purpose and not the usual presentation purpose. If the outer query against the table expression does not have a presentation ORDER BY, the output is not guaranteed to be returned in any particular order. See the “Views and the ORDER BY Clause” section later in this chapter for more detail on this item.
  2. All columns must have names. All columns in a table must have names; therefore, you

must assign column aliases to all expressions in the SELECT list of the query that is used to define a table expression.

Common Table Expressions

Common table expressions (CTEs) are another standard form of table expression very similar to derived tables, yet with a couple of important advantages. CTEs are defined by using a WITH statement and have the following general form.

WITH <CTE_Name>[(<target_column_list>)]
AS
(
<inner_query_defining_CTE>
)
<outer_query_against_CTE>;

The inner query defining the CTE must follow all requirements mentioned earlier to be valid to define a table expression. As a simple example, the following code defines a CTE called USACusts based on a query that returns all customers from the United States, and the outer query selects all rows from the CTE.

WITH USACusts AS
(
    SELECT custid, companyname
    FROM Sales.Customers
    WHERE country = N'USA'
)
SELECT * FROM USACusts;

With CTE, there are two ways of assigning aliases:

  • The inline form:

    WITH C AS
    (
    SELECT YEAR(orderdate) AS orderyear, custid
    FROM Sales.Orders
    )
    SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
    FROM C
    GROUP BY orderyear;
    
  • The external form:

    WITH C(orderyear, custid) AS
    (
    SELECT YEAR(orderdate), custid
    FROM Sales.Orders
    )
    SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
    FROM C
    GROUP BY orderyear;
    

Recursive Common Table Expressions

CTEs are unique among table expressions because they have recursive capabilities. A recursive CTE is defined by at least two queries (more are possible)—at least one query known as the anchor member and at least one query known as the recursive member. The general form of a basic recursive CTE looks like the following.

WITH <CTE_Name>[(<target_column_list>)]
AS
(
<anchor_member>
UNION ALL
<recursive_member>
)
<outer_query_against_CTE>;

The anchor member is a query that returns a valid relational result table—like a query that is used to define a nonrecursive table expression. The anchor member query is invoked only once.

The recursive member is a query that has a reference to the CTE name. The reference to the CTE name represents what is logically the previous result set in a sequence of executions. The first time that the recursive member is invoked, the previous result set represents whatever the anchor member returned. In each subsequent invocation of the recursive member, the reference to the CTE name represents the result set returned by the previous invocation of the recursive member. The recursive member has no explicit recursion termination check—the termination check is implicit. The recursive member is invoked repeatedly until it returns an empty set or exceeds some limit.

Both queries must be compatible in terms of the number of columns they return and the data types of the corresponding columns.

The reference to the CTE name in the outer query represents the unified result sets of the invocation of the anchor member and all invocations of the recursive member.

The following code demonstrates how to use a recursive CTE to return information about an employee (Don Funk, employee ID 2) and all of the employee’s subordinates in all levels (direct or indirect).

WITH EmpsCTE AS
(
    SELECT empid, mgrid, firstname, lastname
    FROM HR.Employees
     WHERE empid = 2
     UNION ALL
    SELECT C.empid, C.mgrid, C.firstname, C.lastname
     FROM EmpsCTE AS P
    JOIN HR.Employees AS C
     ON C.mgrid = P.empid
)
SELECT empid, mgrid, firstname, lastname
FROM EmpsCTE;

Views

The two types of table expressions discussed so far—derived tables and CTEs—have a very limited scope, which is the single-statement scope. As soon as the outer query against those table expressions is finished, they are gone. This means that derived tables and CTEs are not reusable.

Views and inline table-valued functions (inline TVFs) are two reusable types of table expressions; their definitions are stored as database objects. After they have been created, those objects are permanent parts of the database and are only removed from the database if they are explicitly dropped.

IF OBJECT_ID ('Sales.USACusts') IS NOT NULL DROP VIEW Sales.USACusts;

GO
CREATE VIEW Sales.USACusts
AS
    SELECT
        custid,
        companyname,
        contactname,
        contacttitle,
        address,
        city,
        region,
        postalcode,
        country,
        phone,
        fax
    FROM
        Sales.Customers
    WHERE
  country = N'USA';

GO

Note that the general recommendation to avoid using SELECT * has specific relevance in the context of views. The columns are enumerated in the compiled form of the view, and new table columns will not be automatically added to the view. For example, suppose you define a view based on the query SELECT * FROM dbo.T1, and at the view creation time the table T1 has the columns col1 and col2. SQL Server stores information only on those two columns in the view’s metadata. If you alter the definition of the table to add new columns, those new columns will not be added to the view. You can refresh the view’s metadata by using the stored procedure sp_refreshview or sp_refreshsqlmodule, but to avoid confusion, the best practice is to explicitly list the column names that you need in the definition of the view. If columns are added to the underlying tables and you need them in the view, use the ALTER VIEW statement to revise the view definition accordingly.

Order

The query that you use to define a view must meet all requirements mentioned earlier with respect to table expressions in the context of derived tables. The view should not guarantee any order to the rows, all view columns must have names, and all column names must be unique. In this section, I elaborate a bit about the ordering issue, which is a fundamental point that is crucial to understand.

Remember that a presentation ORDER BY clause is not allowed in the query defining a table expression because there’s no order among the rows of a relational table. An attempt to create an ordered view is absurd because it violates fundamental properties of a relation as defined by the relational model. If you need to return rows from a view sorted for presentation purposes, you shouldn’t try to make the view something it shouldn’t be. Instead, you should specify a presentation ORDER BY clause in the outer query against the view.

Because T-SQL allows an ORDER BY clause in a view when TOP or OFFSET-FETCH is also specified, some people think that they can create “ordered views.” One of the ways to try to achieve this is by using TOP (100) PERCENT, like the following.

ALTER VIEW Sales.USACusts
AS
SELECT TOP (100) PERCENT
custid, companyname, contactname, contacttitle, address,
city, region, postalcode, country, phone, fax
FROM Sales.Customers
WHERE country = N'USA'
ORDER BY region;
GO

Even though the code is technically valid and the view is created, you should be aware that because the query is used to define a table expression, the ORDER BY clause here is only guaranteed to serve the logical filtering purpose for the TOP option. If you query the view and don’t specify an ORDER BY clause in the outer query, presentation order is not guaranteed.

Options

When you create or alter a view, you can specify view attributes and options as part of the view definition. In the header of the view, under the WITH clause, you can specify attributes such as ENCRYPTION and SCHEMABINDING, and at the end of the query you can specify WITH CHECK OPTION.

Encryption

The ENCRYPTION option is available when you create or alter views, stored procedures, triggers, and user-defined functions (UDFs). The ENCRYPTION option indicates that SQL Server will internally store the text with the definition of the object in an obfuscated format. The obfuscated text is not directly visible to users through any of the catalog objects—only to privileged users through special means.

Consider the following definition:

ALTER VIEW Sales.USACusts
AS
    SELECT
        custid, companyname, contactname, contacttitle, address, city, region, postalcode, country, phone, fax
    FROM Sales.Customers
    WHERE
  country = N'USA';

GO

To get the definition of the view, invoke the OBJECT_DEFINITION function like this.

SELECT OBJECT_DEFINITION(OBJECT_ID('Sales.USACusts'));
| (No column name)           |
|----------------------------|
| CREATE VIEW Sales.USACusts |
AS
    SELECT
        custid, companyname, contactname, contacttitle, address, city, region, postalcode, country, phone, fax
    FROM Sales.Customers
    WHERE
  country = N'USA';

The text with the definition of the view is available because the view was created without the ENCRYPTION option.

Next, alter the view definition—only this time, include the ENCRYPTION option.

ALTER VIEW Sales.USACusts
WITH
  ENCRYPTION
AS
  SELECT
    custid,
    companyname,
    contactname,
    contacttitle,
    address,
    city,
    region,
    postalcode,
    country,
    phone,
    fax
  FROM Sales.Customers
  WHERE
  country = N'USA';

GO

You will get:

| (No column name) |
|------------------|
| NULL             |

This type of encryption can be easily decrypted using a plaintext attack.

Check

The purpose of CHECK OPTION is to prevent modifications through the view that conflict with the view’s filter—assuming that one exists in the query defining the view.

The view is currently defined without CHECK OPTION. This means that you can currently insert rows through the view with customers from countries other than the United States, and you can update existing customers through the view, changing their country to one other than the United States. For example, the following code successfully inserts a customer with company name Customer ABCDE from the United Kingdom through the view.

INSERT INTO Sales.USACusts(
companyname, contactname, contacttitle, address,
city, region, postalcode, country, phone, fax)
VALUES(
N'Customer ABCDE', N'Contact ABCDE', N'Title ABCDE', N'Address ABCDE',
N'London', NULL, N'12345', N'UK', N'012-3456789', N'012-3456789');

Similarly, if you update a customer row through the view, changing the country attribute to a country other than the United States, the update makes it to the table. But that customer information doesn’t show up anymore in the view because it doesn’t satisfy the view’s query filter.

If you want to prevent modifications that conflict with the view’s filter, add WITH CHECK OPTION at the end of the query defining the view.

SELECT
custid, companyname, contactname, contacttitle, address,
city, region, postalcode, country, phone, fax
FROM Sales.Customers
WHERE country = N'USA'
WITH CHECK OPTION;
GO

Inline Table-Valued Function

Inline TVFs are reusable table expressions that support input parameters. In all respects except for the support for input parameters, inline TVFs are similar to views. For this reason, I like to think of inline TVFs as parameterized views, even though they are not called this formally.

USE TSQL2012;

IF OBJECT_ID ('dbo.GetCustOrders') IS NOT NULL DROP FUNCTION dbo.GetCustOrders;

GO
CREATE FUNCTION dbo.GetCustOrders (@cid AS int)
  RETURNS TABLE
  AS RETURN
  SELECT
    orderid,
    custid,
    empid,
    orderdate,
    requireddate,
    shippeddate,
    shipperid,
    freight,
    shipname,
    shipaddress,
    shipcity,
    shipregion,
    shippostalcode,
    shipcountry
FROM
    Sales.Orders
WHERE
    custid = @cid;

GO

KILL Apply

The APPLY operator is a very powerful table operator. Like all table operators, this operator is used in the FROM clause of a query. The two supported types of APPLY operator are CROSS APPLY and OUTER APPLY. CROSS APPLY implements only one logical query processing phase, whereas OUTER APPLY implements two.

Sets

Set operators are operators that are applied between two input sets—or, to use the more accurate SQL term, multisets—that result from two input queries. Remember, a multiset is not a true set, because it can contain duplicates. When I use the term multiset I’m referring to the intermediate results from two input queries that might contain duplicates. Although there are two multisets as inputs to an operator, depending on the flavor of the operator, the result is either a set or a multiset. If the operator is a true set operator (a DISTINCT flavor), the result is a set with no duplicates. If the operator is a multiset operator (an ALL flavor), the result is a multiset with possible duplicates.

Precedence

SQL defines precedence among set operators. The INTERSECT operator precedes UNION and EXCEPT, and UNION and EXCEPT are considered equal. In a query that contains multiple set operators, first INTERSECT operators are evaluated, and then operators with the same precedence are evaluated based on order of appearance.

Consider the following query, which shows how INTERSECT precedes EXCEPT.

SELECT country, region, city FROM Production.Suppliers
EXCEPT
SELECT country, region, city FROM HR.Employees
INTERSECT
SELECT country, region, city FROM Sales.Customers;

Because INTERSECT precedes EXCEPT, the INTERSECT operator is evaluated first, even though it appears second. Therefore, the meaning of this query is, “locations that are supplier locations but not (locations that are both employee and customer locations).”

Write a query that returns customer and employee pairs that had order activity in January 2008 but not in February 2008.

Terminology

Subqueries terminology might sound a little bit confusing, this section provides some basic glossary that might help you.

  • Self-contained subqueries. Every subquery has an outer query that it belongs to. Self-contained subqueries are subqueries that are independent of the outer query that they belong to. Self-contained subqueries are very convenient to debug, because you can always highlight the subquery code, run it, and ensure that it does what it’s supposed to do. Logically, it’s as if the subquery code is evaluated only once before the outer query is evaluated, and then the outer query uses the result of the subquery. The following sections take a look at some concrete examples of self-contained subqueries.
    • With Scalar

      USE TSQL2012;
      DECLARE @maxid AS INT = (SELECT MAX(orderid)
                              FROM Sales.Orders);
      
      SELECT orderid, orderdate, empid, custid
      FROM Sales.Orders
      WHERE orderid = @maxid;
      
  • Self-Contained Multivalued Subquery. A multivalued subquery is a subquery that returns multiple values as a single column, regardless of whether the subquery is self-contained. Some predicates, such as the IN predicate, operate on a multivalued subquery.
  • Correlated Subqueries. Correlated subqueries are subqueries that refer to attributes from the table that appears in the outer query. This means that the subquery is dependent on the outer query and cannot be invoked independently. Logically, it’s as if the subquery is evaluated separately for each outer row.
  • Derived Tables. Derived tables (also known as table subqueries) are defined in the FROM clause of an outer query. Their scope of existence is the outer query. As soon as the outer query is finished, the derived table is gone.
  • Common table expressions. (CTEs) are another standard form of table expression very similar to derived tables, yet with a couple of important advantages. CTEs are defined by using a WITH statement and have the following general form.

Footnotes:

1

SQL also provides the nvarchar type to store multilingual data using the Unicode representation. However, many databases allow Unicode (in the UTF-8 representation) to be stored even in varchar types.


I seek refuge in God, from Satan the rejected. Generated by: Emacs 29.4 (Org mode 9.6.17). Written by: Salih Muhammed, by the date of: 2022-04-22 Fri 03:00. Last build date: 2024-07-04 Thu 21:55.