The merge table function lets us query multiple tables in parallel.
It does this by creating a temporary Merge table whose structure is the union of the underlying tables' columns, with a common type derived for each column that appears in more than one table.
We're going to learn how to use this function with help from Jeff Sackmann's tennis dataset.
We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade.
We'll also add a couple of extra columns for the 1990s decade.
The import statements are shown below:
-- 1960s: score stays a plain String and winner_seed is read as Nullable(String)
CREATE OR REPLACE TABLE atp_matches_1960s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1968..1969}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(String), loser_seed Nullable(UInt8)';

-- 1970s: score is split into an Array(String) and the seeds are Nullable(UInt8)
CREATE OR REPLACE TABLE atp_matches_1970s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1970..1979}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt8), loser_seed Nullable(UInt8)';

-- 1980s: same columns as the 1970s, but the seeds are Nullable(UInt16)
CREATE OR REPLACE TABLE atp_matches_1980s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1980..1989}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16)';

-- 1990s: adds the walkover and retirement columns and reads surface as an Enum
CREATE OR REPLACE TABLE atp_matches_1990s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score,
       toBool(arrayExists(x -> position(x, 'W/O') > 0, score))::Nullable(bool) AS walkover,
       toBool(arrayExists(x -> position(x, 'RET') > 0, score))::Nullable(bool) AS retirement
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1990..1999}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16), surface Enum(\'Hard\', \'Grass\', \'Clay\', \'Carpet\')';
We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences.
SELECT * EXCEPT(position) FROM (
    SELECT position, name,
           any(if(table = 'atp_matches_1960s', type, null)) AS 1960s,
           any(if(table = 'atp_matches_1970s', type, null)) AS 1970s,
           any(if(table = 'atp_matches_1980s', type, null)) AS 1980s,
           any(if(table = 'atp_matches_1990s', type, null)) AS 1990s
    FROM system.columns
    WHERE database = currentDatabase() AND table LIKE 'atp_matches%'
    GROUP BY ALL
    ORDER BY position ASC
)
SETTINGS output_format_pretty_max_value_width=25;
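The side-by-side listing shows the per-table types. To see the single structure that merge derives from them, one option is to describe the table function directly. This is a minimal sketch: it assumes the four tables above exist in the current database, and the exact types reported may vary with the ClickHouse version.

```sql
-- Show the unified structure of the temporary Merge table.
-- The argument is a regular expression matched against table names.
DESCRIBE TABLE merge('atp_matches*');

-- Alternatively, inspect only the columns that differ between decades.
SELECT
    toTypeName(winner_seed) AS winner_seed_type,
    toTypeName(score) AS score_type,
    toTypeName(walkover) AS walkover_type
FROM merge('atp_matches*')
LIMIT 1;
```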
Let's write a query to find the matches that John McEnroe won against someone who was seeded #1:
SELECT loser_name, score
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1;
┌─loser_name────┬─score───────────────────────────┐
│ Bjorn Borg    │ ['6-3','6-4']                   │
│ Bjorn Borg    │ ['7-6','6-1','6-7','5-7','6-4'] │
│ Bjorn Borg    │ ['7-6','6-4']                   │
│ Bjorn Borg    │ ['4-6','7-6','7-6','6-4']       │
│ Jimmy Connors │ ['6-1','6-3']                   │
│ Ivan Lendl    │ ['6-2','4-6','6-3','6-7','7-6'] │
│ Ivan Lendl    │ ['6-3','3-6','6-3','7-6']       │
│ Ivan Lendl    │ ['6-1','6-3']                   │
│ Stefan Edberg │ ['6-2','6-3']                   │
│ Stefan Edberg │ ['7-6','6-2']                   │
│ Stefan Edberg │ ['6-2','6-2']                   │
│ Jakob Hlasek  │ ['6-3','7-6']                   │
└───────────────┴─────────────────────────────────┘
Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower, that is, where his seed number was 3 or greater.
This is a bit trickier because winner_seed uses different types across the various tables:
SELECT loser_name, score, winner_seed
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
  AND loser_seed = 1
  AND multiIf(
      variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
      variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
      variantElement(winner_seed, 'String')::UInt16 >= 3
  );
We use the variantType function to check the type of winner_seed for each row and then variantElement to extract the underlying value.
When the type is String, we cast to a number and then do the comparison.
The result of running the query is shown below:
┌─loser_name────┬─score─────────┬─winner_seed─┐
│ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
│ Stefan Edberg │ ['6-2','6-3'] │ 6           │
│ Stefan Edberg │ ['7-6','6-2'] │ 4           │
│ Stefan Edberg │ ['6-2','6-2'] │ 7           │
└───────────────┴───────────────┴─────────────┘
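The merge function also exposes a _table virtual column that records which underlying table each row came from. As a sketch, assuming the tables created above, we can combine it with the walkover column to count the values per table:

```sql
-- Count rows per source table and walkover value.
-- _table is a virtual column holding the name of the underlying table.
SELECT _table, walkover, count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
```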
The walkover column is NULL for every table except atp_matches_1990s, which is the only table that defines it.
To count walkovers across every decade, we need to fall back to checking whether the score column contains the string W/O when the walkover column is NULL:
SELECT _table,
       multiIf(
           walkover IS NOT NULL,
           walkover,
           variantType(score) = 'Array(String)',
           toBool(arrayExists(
               x -> position(x, 'W/O') > 0,
               variantElement(score, 'Array(String)')
           )),
           variantElement(score, 'String') LIKE '%W/O%'
       ),
       count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
If the underlying type of score is Array(String) we have to go over the array and look for W/O, whereas if it has a type of String we can just search for W/O in the string.
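To check which of those two branches applies to each decade, a sketch along these lines counts rows by the stored type of score. It assumes the same tables as above, and the score_type and matches aliases are our own names.

```sql
-- Count rows per source table and per underlying type of the score column.
SELECT _table, variantType(score) AS score_type, count() AS matches
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
```

Given the import statements, we'd expect String only for atp_matches_1960s and Array(String) for the later decades.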