NTFS Directories and Files

Yes, NTFS volumes have directories and files. Isn't that good to know? :^) Well, you probably want to learn a bit more about them than that, I am sure, and in this part of the NTFS guide I will endeavor to do just that. If you are experienced with the FAT file system used in other versions of Windows, then as a user of NTFS partitions, you will find much that is familiar in the way directories and files are used. However, internally, NTFS stores and manages directories and files in a rather different way than FAT does.

In this section I will explore the fundamentals of NTFS directories and files. I will begin with a look at directories and how they are stored on NTFS volumes. I will then discuss user data files in some detail, including a look at how files are stored and named, and what their maximum size can be. I will then describe some of the more common standard attributes associated with files. Finally, I will discuss reparse points, a special enhanced feature present in NTFS 5.0 under Windows 2000.

NTFS Directories (Folders)

From an external, structural perspective, NTFS generally employs the same methods for organizing files and directories as the FAT file system (and most other modern file systems as well). This is usually called the hierarchical or directory tree model. The "base" of the directory structure is the root directory, which is actually one of the key system metadata files on an NTFS volume. Within this root directory, references are stored to files, or to other directories. Each directory can in turn store any combination of files or more sub-directories, allowing you to create an arbitrary tree structure. I describe these general concepts in more detail on this page discussing the FAT file system.

Note: Directories are also often called folders.

While NTFS is similar to FAT in its hierarchical structuring of directories, it is very different in how they are managed internally. One of the key differences is that in FAT volumes, directories are responsible for storing most of the key information about files; the files themselves contain only data. In NTFS, files are collections of attributes, so they contain their own descriptive information, as well as their own data. An NTFS directory pretty much stores only information about the directory itself, not about the files within the directory.

Everything within NTFS is considered a file, and that applies to directories as well. Each directory has an entry in the Master File Table, which serves as the main repository of information for the directory. The MFT record for the directory contains the following information and NTFS attributes:

So in a nutshell, small directories are stored entirely within their MFT entries, just like small files are. Larger ones have their information broken into multiple data records that are referenced from the root entry for the directory in the MFT. However, NTFS stores these index entries in a special way compared to traditional PC file systems. The FAT file system uses a simple linked-list arrangement for storing large directories: the first few files are listed in the first cluster of the directory, and then the next files go into the next cluster, which is linked to the first, and so on. This is simple to implement, but it means that every time you look at the directory you must scan it from start to end and then sort it for presentation to the user. It also makes it time-consuming to locate individual files in the index, especially with very large directories.

To improve performance, NTFS directories use a special data management structure called a B-tree. This is a concept borrowed from database design, where B-trees are widely used for indexes. In brief terms, a B-tree is a balanced tree structure: the entries are kept in sorted order and distributed evenly among the branches of the tree, so finding any one entry takes only a small number of steps. It's kind of hard to explain what B-trees are without getting far afield, so if you want to learn more about them, try this page. (Note that the "B-tree" concept here refers to a tree of storage units that hold the contents of an individual directory; it is a different concept entirely from that of the "directory tree", a logical tree of directories themselves.)

From a practical standpoint, the use of B-trees means that the directories are essentially "self-sorting". There is a bit more overhead involved when adding files to an NTFS directory, because they must be placed in this special structure. However, the payoff occurs when the directories are used. The time required to find a particular file under NTFS is dramatically reduced compared to an unsorted linked-list structure--especially for very large directories.
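To get a feel for the trade-off, here is a rough Python sketch. It does not implement a real B-tree, and it is not how NTFS is coded. It simply contrasts an unsorted listing that must be scanned from start to end with a listing that is kept sorted as names are added and then searched by bisection, which is the "self-sorting" property described above. The function names are made up for this illustration.

```python
import bisect

def linear_find(entries, name):
    """Scan an unsorted listing; return (found, number of comparisons)."""
    for count, entry in enumerate(entries, start=1):
        if entry == name:
            return True, count
    return False, len(entries)

def sorted_insert(entries, name):
    """Add a name while keeping the listing sorted (the extra cost on add)."""
    bisect.insort(entries, name)

def sorted_find(entries, name):
    """Binary-search a sorted listing; return True if the name is present."""
    i = bisect.bisect_left(entries, name)
    return i < len(entries) and entries[i] == name

# An unsorted directory of 10,000 names, with the one we want at the very end.
unsorted_dir = [f"file{n:05d}.txt" for n in range(10000, 0, -1)]

# The same names, added one at a time to a listing that stays sorted.
sorted_dir = []
for entry_name in unsorted_dir:
    sorted_insert(sorted_dir, entry_name)

found, comparisons = linear_find(unsorted_dir, "file00001.txt")
```

In the unsorted case the search has to look at every one of the 10,000 names before it finds the last one; the bisection search on the sorted listing needs only about 14 comparisons for a listing of that size.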

NTFS Files and Data Storage

As with most file systems, the fundamental unit of storage in NTFS from the user's perspective is the file. A file is just a collection of any sort of data, and can contain anything: programs, text files, audio clips, database records--and thousands of other kinds of information. The operating system doesn't distinguish between types of files. The use of a particular file depends on how it is interpreted by applications that use it.

Within NTFS, all files are stored in pretty much the same way: as a collection of attributes. This includes the data in the file itself, which is just another attribute: the "data attribute", technically. Note that to understand how NTFS stores files, one must first understand the basics of NTFS architecture, and in particular, it's good to comprehend what the Master File Table is and how it works. You may also wish to review the discussion of NTFS attributes, because understanding the difference between resident and non-resident attributes is important to making any sense at all of the rest of this page. ;^)

The way that data is stored in files in NTFS depends on the size of the file. The core structure of each file is based on the following information and attributes that are stored for each file:

These are the basic attributes; others may also be associated with a file (see this full discussion of attributes for details). If a file is small enough that all of its attributes can fit within the MFT record for the file, it is stored entirely within the MFT. Whether this happens or not depends largely on the size of the MFT records used on the volume. If the file is too large for all of the attributes to fit in the MFT, NTFS begins a series of "expansions" that move attributes out of the MFT and make them non-resident. The sequence of steps taken is something like this:

  1. First, NTFS will attempt to store the entire file in the MFT entry, if possible. This will generally happen only for rather small files.
  2. If the file is too large to fit in the MFT record, the data attribute is made non-resident. The entry for the data attribute in the MFT contains pointers to data runs (also called extents) which are blocks of data stored in contiguous sections of the volume, outside the MFT.
  3. The file may become so large that there isn't even room in the MFT record for the list of pointers in the data attribute. If this happens, the list of data attribute pointers is itself made non-resident. Such a file will have no data attribute in its main MFT record; instead, a pointer is placed in the main MFT record to a second MFT record that contains the data attribute's list of pointers to data runs.
  4. NTFS will continue to extend this flexible structure if very large files are created. It can create multiple non-resident MFT records if needed to store a great number of pointers to different data runs. Obviously, the larger the file, the more complex the file storage structure becomes.
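
The sequence above can be sketched as a simple decision function. This is an illustration only: the record size, the fixed overhead and the per-run pointer size below are assumed round numbers rather than values read from a real volume, and `describe_storage` is a hypothetical helper, not part of any NTFS tool.

```python
RECORD_SIZE = 1024   # assumed MFT record size, in bytes
OVERHEAD = 300       # assumed space taken by the header and other attributes
POINTER_SIZE = 8     # assumed bytes needed per data-run pointer

def describe_storage(data_size, run_count):
    """Say which of the storage steps would apply to a file.

    data_size is the size of the file's data in bytes; run_count is the
    number of data runs (extents) the data occupies on the volume.
    """
    free = RECORD_SIZE - OVERHEAD
    if data_size <= free:
        return "resident: data stored inside the MFT record"
    needed = run_count * POINTER_SIZE
    if needed <= free:
        return "non-resident: run pointers stored inside the MFT record"
    # Ceiling division: how many further records the pointer list needs.
    extra = (needed + free - 1) // free
    return f"external: run pointers spread over {extra} extra MFT record(s)"
```

With these assumed numbers, a 500-byte file stays resident, a large file in 10 runs keeps its pointers in the main record, and a badly fragmented file in 200 runs spills its pointer list into additional records.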

The data runs (extents) are where most file data in an NTFS volume is stored. These runs consist of blocks of contiguous clusters on the disk. The pointers in the data attribute(s) for the file contain a reference to the start of the run, and also the number of clusters in the run. The start of each run on the volume is identified using a logical cluster number or LCN, while a virtual cluster number or VCN gives the position of the run within the file itself. The use of a "pointer+length" scheme means that under NTFS, it is not necessary to read each cluster of the file in order to determine where the next one in the file is located. Because each run is a contiguous block, this method can also help reduce fragmentation of files compared to the FAT setup.
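
Here is a minimal sketch of how a "pointer+length" list can be used to find any cluster of a file without following a chain. Each run is written as a pair: the cluster number on the volume where the run starts, and the number of clusters in the run. The run values and the name `locate_cluster` are invented for the example; the real on-disk encoding of run lists is more compact and is not shown here.

```python
def locate_cluster(runs, file_cluster):
    """Map a cluster index within the file to a cluster number on the volume.

    runs is a list of (start_cluster_on_volume, length_in_clusters) pairs,
    in file order.
    """
    if file_cluster < 0:
        raise IndexError("cluster index cannot be negative")
    offset = file_cluster
    for start, length in runs:
        if offset < length:
            return start + offset   # the cluster lies inside this run
        offset -= length            # skip the whole run in one step
    raise IndexError("cluster index is past the end of the file")

# A 16-cluster file stored in three runs scattered around the volume.
runs = [(1000, 4), (5000, 2), (200, 10)]
```

For instance, `locate_cluster(runs, 5)` skips the first run in a single step and lands on the second cluster of the second run, cluster 5001 on the volume.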

NTFS File Size

One of the most important limitations on running serious business applications--especially databases--under consumer Windows operating systems and the FAT file system is the relatively small maximum file size. In some situations the maximum file size is 4 GiB, and in others it is 2 GiB. While this seems at first glance to be fairly large, in fact neither is even close to being adequate for the needs of today's business computing environment. Even on my own home PC I occasionally run up against this limit when doing backups to hard disk files.

In the page describing how data is stored in NTFS files, I explained the way that NTFS first attempts to store files entirely within the MFT record for the file. If the file is too big, it extends the file's data using structures such as external attributes and data runs. This flexible system allows files to be extended in size virtually indefinitely. In fact, under NTFS, there is for practical purposes no maximum file size (the architectural limit is on the order of 2^64 bytes, though actual implementations impose lower ceilings). A single file can be made to take up the entire contents of a volume (less the space used for the MFT itself and other internal structures and overhead).

NTFS also includes some features that can be used to more efficiently store very large files. One is file-based compression, which can be used to let large files take up significantly less space. Another is support for sparse files, which is especially well-suited for certain applications that use large files that have non-zero data in only a few locations.
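
The idea behind sparse files can be shown with a toy model. This is not how NTFS implements the feature internally; it only illustrates the principle that regions which were never written read back as zeros and take up no storage. The class `SparseModel` and the 4,096-byte block size are assumptions made for the example.

```python
BLOCK = 4096  # assumed block size for the illustration

class SparseModel:
    """Toy model of a sparse file: only blocks that were written are stored."""

    def __init__(self, logical_size):
        self.logical_size = logical_size
        self.blocks = {}  # block index -> bytes of length BLOCK

    def write_block(self, index, data):
        if len(data) != BLOCK:
            raise ValueError("data must be exactly one block")
        if index < 0 or (index + 1) * BLOCK > self.logical_size:
            raise ValueError("block lies outside the file")
        self.blocks[index] = bytes(data)

    def read_block(self, index):
        # Unwritten regions read back as zeros without occupying storage.
        return self.blocks.get(index, bytes(BLOCK))

    def stored_bytes(self):
        return len(self.blocks) * BLOCK

# A file with a logical size of 1 GiB that holds data in only two places.
f = SparseModel(logical_size=1024 * 1024 * 1024)
f.write_block(0, b"\x01" * BLOCK)
f.write_block(1000, b"\x02" * BLOCK)
```

In this model the file appears to be 1 GiB long, yet only two blocks (8,192 bytes) are actually stored.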

NTFS File Naming

Microsoft's early operating systems were very inflexible when it came to naming files. The DOS convention of eight characters for the file name and three characters for the file extension--the so-called "8.3 standard"--was very restrictive. Compared to the naming abilities of competitors such as UNIX and the Apple Macintosh, 8.3 naming was simply unacceptable. To solve this problem, when NTFS was created, Microsoft gave it greatly expanded file naming capabilities.

The following are the characteristics of regular file names (and directory names as well) in the NTFS file system:

Tip: For more information about Unicode, see this web site.

You may recall that when Windows 95's VFAT file system introduced long file names to Microsoft's consumer operating systems, it provided for an aliasing feature. The file system automatically creates a short file name ("8.3") alias of all long file names, for use by older software written before long file names were introduced. NTFS does something very similar. It also creates a short file name alias for all long file names, for compatibility with older software. (If the file name given to the file or directory is short enough to fit within the "8.3" limits, no alias is created, since it is not needed.) It's important to realize, however, that the similarities between VFAT and NTFS long file names are mostly superficial. Unlike the VFAT file system's implementation of long file names, NTFS's implementation is not a kludge added after the fact. NTFS was designed from the ground up to allow for long file names.
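
To make the aliasing idea concrete, here is a deliberately simplified sketch of how a short alias might be derived from a long name. It is an assumption-laden illustration, not the actual algorithm used by Windows, which has additional rules (for example, for handling many similar names in one directory). The helper names `fits_8_3` and `short_alias` are hypothetical.

```python
def _split(name):
    """Split a name into (base, extension) at the last dot."""
    base, dot, ext = name.rpartition(".")
    if not dot:
        return name, ""
    return base, ext

def _keep(text):
    """Uppercase the text and drop everything except letters and digits."""
    return "".join(c for c in text.upper() if c.isalnum())

def fits_8_3(name):
    """Simplified check: does the name already fit the 8.3 pattern?"""
    base, ext = _split(name)
    return (0 < len(base) <= 8 and len(ext) <= 3
            and " " not in name and "." not in base)

def short_alias(long_name, taken):
    """Return an alias such as LONGFI~1.TEX that is not in `taken`,
    or None if the name needs no alias."""
    if fits_8_3(long_name):
        return None
    base, ext = _split(long_name)
    base, ext = _keep(base), _keep(ext)[:3]
    n = 1
    while True:
        suffix = f"~{n}"
        alias = base[:8 - len(suffix)] + suffix + ("." + ext if ext else "")
        if alias not in taken:
            return alias
        n += 1
```

Under these simplified rules, "Long File Name.text" becomes LONGFI~1.TEX, a second file that would collide with it becomes LONGFI~2.TEX, and a name such as README.TXT gets no alias at all.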

File names are stored in the file name attribute for every file (or directory), in the Master File Table. (No big surprise there!) In fact, NTFS supports the existence of multiple file name attributes within each file's MFT record. One of these is used for the regular name of the file, and if a short MS-DOS alias file name is created, it goes in a second file name attribute. Further, NTFS supports the creation of hard links as part of its POSIX compliance. Hard links represent multiple names for a single file, in different directories. These links are each stored in separate file name attributes. (This is a limited implementation of the very flexible naming system used in UNIX file systems.)
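
Hard links are easy to try out for yourself. The sketch below uses Python's standard `os.link` call to give one file a second name in a different directory, then checks the link count reported by the file system. The directory and file names are arbitrary choices for the demonstration, and it assumes the temporary directory lives on a file system that supports hard links.

```python
import os
import tempfile

def demo_hard_link():
    """Create one file with two names in different directories.

    Returns (link_count, content_read_through_the_second_name).
    """
    with tempfile.TemporaryDirectory() as root:
        dir_a = os.path.join(root, "a")
        dir_b = os.path.join(root, "b")
        os.mkdir(dir_a)
        os.mkdir(dir_b)
        first = os.path.join(dir_a, "report.txt")
        second = os.path.join(dir_b, "alias.txt")
        with open(first, "w") as fh:
            fh.write("same data")
        os.link(first, second)  # add a second name for the same file
        with open(second) as fh:
            content = fh.read()
        return os.stat(first).st_nlink, content

link_count, content = demo_hard_link()
```

Both names refer to the same underlying file, so the data written through the first name can be read back through the second, and the file system reports two links to the file.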

NTFS File Attributes

As I mention in many places in this discussion of NTFS, almost everything in NTFS is a file, and files are implemented as collections of attributes. Attributes are just chunks of information of various sorts--the meaning of the information in an attribute depends on how software interprets and uses the bits it contains. Directories are stored in the same general way as files; they just have different attributes that are used in a different manner by the file system.

All file (and directory) attributes are stored in one of two different ways, depending on the characteristics of the attribute--especially, its size. The following are the methods that NTFS will use to store attributes:

In practice, only the smallest attributes can fit into MFT records, since the records are rather small. Many other attributes will be stored non-resident, especially the data of the file, which is also an attribute. Non-resident storage can itself take two forms. If the attribute doesn't fit in the MFT but pointers to the data do fit, then the data is placed in a data run, also called an extent, outside the MFT, and a pointer to the run is placed in the file's MFT record. In fact, an attribute can be stored in many different runs, each with a separate pointer. If the file has so many extents that even the pointers to them won't fit, the entire data attribute may be moved to an external attribute in a separate MFT record entry, or even multiple external attributes. See the discussion of file storage for more details on this expansion mechanism.

NTFS comes with a number of predefined attributes, sometimes called system defined attributes. Some are associated with only one type of structure, while others are associated with more than one. Here's a list, in alphabetical order, of the most common NTFS system defined attributes:

Note: For more detail on how the attributes associated with files work, see the page on file storage; for directories, the page on directories.

In addition to these system defined attributes, NTFS also supports the creation of "user-defined" attributes. This name is a bit misleading, however, since the term "user" is really given from Microsoft's perspective! A "user" in this context means an application developer--programs can create their own file attributes, but actual NTFS users generally cannot.

NTFS Reparse Points

One of the most interesting new capabilities added to NTFS version 5 with the release of Windows 2000 was the ability to create special file system functions and associate them with files or directories. This enables the functionality of the NTFS file system to be enhanced and extended dynamically. The feature is implemented using objects that are called reparse points.

The use of reparse points begins with applications. An application that wants to use the feature stores data specific to the application--which can be any sort of data at all--into a reparse point. The reparse point is tagged with an identifier specific to the application and stored with the file or directory. A special application-specific filter (a driver of sorts) is also associated with the reparse point tag type and made known to the file system. More than one application can store a reparse point with the same file or directory, each using a different tag. Microsoft themselves reserved several different tags for their own use.

Now, let's suppose that the user decides to access a file that has been tagged with a reparse point. When the file system goes to open the file, it notices the reparse point associated with the file. It then "reparses" the original request for the file, by finding the appropriate filter associated with the application that stored the reparse point, and passing the reparse point data to that filter. The filter can then use the data in the reparse point to do whatever is appropriate based on the reparse point functionality intended by the application. It is a very flexible system; how exactly the reparse point works is left up to the application. The really nice thing about reparse points is that they operate transparently to the user. You simply access the reparse point and the instructions are carried out automatically. This creates seamless extensions to file system functionality.
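
The flow just described can be modeled in a few lines. This is a conceptual sketch only: the dictionary of filters, the tag name "EXAMPLE_REDIRECT" and the entry layout are all invented for the illustration and do not correspond to real Windows structures or APIs.

```python
filters = {}  # tag -> filter function (hypothetical registry)

def register_filter(tag, func):
    """Make a filter known to the (model) file system under its tag."""
    filters[tag] = func

def open_file(entry):
    """Open an entry, 'reparsing' the request if the entry carries a tag."""
    reparse = entry.get("reparse")
    if reparse is None:
        return entry["data"]          # ordinary file: just return its data
    tag, payload = reparse
    return filters[tag](payload)      # hand the stored data to the filter

def redirect_filter(payload):
    """Example filter: treat the stored data as a target to redirect to."""
    return f"redirected to {payload}"

register_filter("EXAMPLE_REDIRECT", redirect_filter)

plain = {"data": "ordinary contents", "reparse": None}
tagged = {"data": "", "reparse": ("EXAMPLE_REDIRECT", "D:\\archive\\big.dat")}
```

The caller uses `open_file` in exactly the same way for both entries; only the tagged one is routed through the filter, which is what makes the mechanism transparent to the user.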

In addition to allowing applications to use reparse points for many types of custom capabilities, Microsoft itself uses them to implement several features within Windows 2000, including the following:

These are just a few examples of how reparse points can be used. As you can see, the functionality is very flexible. Reparse points are a nice addition to NTFS: they allow the capabilities of the file system to be enhanced without requiring any changes to the file system itself.