It's not everyday I need to store hierarchical data in SQL tables but when I do, the usual self-join on a table works out fine. For example, if I'm storing file system folders in a SQL database, I'll probably define a table like this:
create table Folder ( Id int Primary Key Identity(1,1), Name varchar(256) Not Null, Path varchar(Max) Not Null, ParentId Int Null ); create index IX_Folder_ParentId on Folder (ParentId) go
If I'm storing my humble music collection the output might look like this:
I can now find out which folder belong to whom. For something trickier like finding the the entire hierarchy below a particular folder, I can use a CTE like this:
-- I want to list all albums of Guns n' Roses declare @folderId int = 335; with FolderHierarchy (id, name, path, parentid, level) as ( select Id, Name, Path, ParentId, 0 as level from Folder where Id = @folderId union all select f.Id, f.Name, f.Path, f.ParentId, level + 1 from Folder f inner join FolderHierarchy h on f.ParentId = h.Id ) select * from FolderHierarchy order by level go
As it turns out this isn't the only way to represent hierarchical data. SQL Server since version 2008 has specific datatype for such scenarios, its the 'hierarchyid' type. If I now introduce a new column for it, the table can be created like this:
if exists(select * from sys.tables where name = 'Folder') drop table Folder go create table Folder ( Id int Primary Key Identity(1,1), Name varchar(256) Not Null, Path varchar(Max) Not Null, ParentId Int Null, -- No longer needed but I've kept it to avoid making two tables for illustration HId HierarchyId Null ); go create index IX_Folder_ParentId on Folder (ParentId) go create unique index UQ_Folder_HId on Folder (HId) go
The HId column is supposed to store hierarchy in a particular format. The format is:
The root item should have the value /. The first child of the root can be /1/, the second child can be /2/. There can be a child inserted between 1 and 2 as '/1.5/'.
If 1 has a child, it can be /1/1/ and so on. These values can computed at the application side and inserted into the table or SQL functions can help you generate these.
In my case I have decided to generate these values in the sample C# application itself.
Once the hierarchy is correctly setup we can perform things like listing all direct or indirect descendant quite easily. For example, instead of using the CTE we can now use:
select *, cast(hid as varchar(max)) from Folder where HId.IsDescendantOf('/4/40/') = 1 go
What about doing the reverse which is listing all direct and indirect parents? The code below shows both approaches.
-- Find all ancestors using CTE with FolderHierarchy (id, name, path, parentid) as ( select Id, Name, Path, ParentId from Folder where ID = 335 union all select f1.Id, f1.Name, f1.Path, f1.ParentId from Folder f1 inner join FolderHierarchy f2 on f1.id = f2.ParentId ) select * from FolderHierarchy -- Find all ancestors using hierarchy select f1.*, cast(f1.hid as varchar(max)) as Path, f1.hid.GetLevel() as Level from Folder f1 inner join Folder f2 on f2.HId.IsDescendantOf(f1.HId) = 1 where f2.Id = 335
So which is better? Well the hierarchyid datatype allows quite a few other operations like moving entire hierarchy under a different node but this should be easy using the traditional approach too. What about performance? Here is screenshot of the plan comparison for a small number (1000 odd) of records.
It looks like using hierarchyid results in a much simpler plan and is 2 times better. However when I tried with greater number of records the difference was not too much.
For the query where we listed direct and indirect ancestors, the difference is reverse, using hierarchyid makes the performance worse.
The sample application which reads a file system folder and populates the Folder table is present in the ZIP file. The ZIP file also contains the database.sql file which contains all the T-SQL code discussed here.