Expand

The Expand Parallel File System

Motivation

The main motivation of Expand is to build a parallel file system for heterogeneous general purpose distributed environments. To satisfy this goal, we use Network File System (NFS).

NFS supports the NFS protocol, a set of remote procedure call (RPC) that provides the means for clients to perform operations on a remote file server. NFS protocol is operating system independent. Developed originally for being used in networks of UNIX systems, it is widely available today in many systems, as LINUX or Windows 2000, two operating systems very frequently used in clusters.

Figure 1 shows the architecture of Expand. This architecture shows how multiple NFS servers can be used as a single file system. File data are declustered by Expand among all NFS servers, using blocks of different sizes as stripping unit. Processes in clients use an Expand library to access an Expand distributed partition.

Using the former approach offers the following advantages:

No changes to the NFS server are required to run Expand. All aspects of Expand operations are implemented on the clients. In this way, we can use several servers with different operating system to build a distributed partition. Furthermore, declustered partitions can coexist with NFS traditional partitions without problems.

Expand is independent of the operating system used in the client. All operations are implemented using RPC and NFS protocol. Expand is available for Linux and Windows plataforms. Furthermore, we provide an universal Java client.

The parallel file system construction is greatly simplified, because all operations are implemented on the clients. This approach is completely different to that used in many current parallel file system, that implement both client and server sides.

It allows parallel access to both data of different files and data of the same file, reducing the bottleneck that represents the traditional distributed file systems.

It allows the using of servers with different architectures and operating systems, because the NFS protocol hides those differences. Because of this feature, Expand is very suitable for heterogenous systems, as cluster of workstations.

NFS servers use caching, increasing int this way the performance.

It simplifies the configuration, because NFS is very familiar to users. Server only need to export the appropriate directories and clients only need a little configuration file that explain the distributed partition.

Design

Expand provides high performance I/O exploiting parallel accesses to files stripped among several NFS servers. Expand is designed as a client-server system with multiple NFS servers, with each Expand file striped across some of the NFS servers. All operations in Expand clients are based on RPCs and NFS protocol. The first prototype of Expand is a user-level implementation, through a library that must be linked with the applications. Expand provides a global name space in all cluster.

Next sections describe data distribution, file structure, naming, metadata management, and parallel access to files in Expand.

Data Distribution

To provide large storage capacity and to enhance flexibility, Expand combines several NFS servers to provide a generic stripped partition which allows to create several types of file systems on any kind of parallel I/O system.

Each server exports one o more directories that are combined to build a distributed partition. All files in the system are declustered across all NFS servers to facilitate parallel access, with each server storing conceptually a subfile of the parallel file.

A design goal, no implemented in the current prototype, is to allow that a partition in Expand can be expanded adding more servers to an existing distributed partition.This feature increases the scalability of the system and it also allows to increase the size of partitions. When new servers are added to an existent partition, the partition must be rebuilt to accommodate all files. Conceptually, this rebuilding will be made using the following idea:

    rebuild_partition(old_partition, new_partition) {
        for each file in old_partition    {
            copy the file into the new partition
            unlink the file in old_partition
        }
    }

In this algorithm, when a file is copied into the new partition, the new file is declusterd across all NFS server of the new partition. To implement this expansion, algorithms must be designed to allow a redistribution of data without integrally copying files.

Naming

To simplify the naming process and reduce potential bottlenecks, Expand does not use any metadata manager, as the used in PVFS . Figure 3 shows how directory mapping is made in Expand. The Expand tree directory is replicated in all NFS servers. In this way, we can use the lookup operation of NFS without any change to access to all subfiles of a file. This feature also allows access to fault tolerance files when a server node fails.

The Structure of Expand Files

A file in Expand consists in several subfiles, one for each NFS partition. All subfiles are fully transparent to the Expand users. Expand hides this subfiles, offering to the clients a traditional view of the files. This idea is similar to the used in PVFS.

On a distributed partition, the user can create in the current prototype stripped files with cyclic layout. In these files, blocks are distributed across the partition following a round-robin pattern. This structure is shown in Figure 2. Each subfile of a Expand file as a small header at the beginning of the subfile. This header stores the file's metadata. This metadata includes the following information:

Stride size. Each file can use a different stride size. This parameter can be configured in open operation.
Kind of file: cyclic, RAID4, or RAID5. This parameter also can be specified in the open operation.
Base node. This parameter identifies the NFS server where the first block of the file resides.
Distribution pattern. All files in Expand are striped across all NFS servers. In the current prototype, we only use a round-robin pattern. In the future, this field will store the order used to decluster the file across the servers.

All subfiles has a header for metadata, although only one node, called master node (described below) stores the current metadata. The master node can be different from the base node.

Metadata management

The metadata of a file resides in the header of a subfile stored in a NFS server. This NFS server is the master node of the file, similar to the mechanism used in the Vesta Parallel File System. To obtain the master node of a file, the file name is hashed into the number of node:

        hash(namefile) = NFS serverI

The use of this simple approach offers a good distribution of masters. Initially the base node for a file agrees with the master node. The use of this simple scheme allows to distribute the master nodes and the blocks between all NFS servers, balancing the use of all NFS servers and, hence, the I/O load.

Because the determination of the master node is based on the file name, when a user renames a file, the master node for this file is changed. The algorithm used in Expand to rename a file is the following:

rename(oldname, newname) {
    oldmaster = hash(oldname)
    newmaster = hash(newname)
    move the metadata from oldmaster to newmaster
}

This process is shown in Figure 4. Moving the metadata is the only operation needed to maintain the coherence of the base node system for all the Expand files.

Fig 4. Renaming a file

Parallel access

NFS clients use filehandle to access the files. A NFS filehandle is an opaque reference to a file or directory that is independent of the filenane. All NFS operations use a filehandle to identify the file or directory which the operation applies to. Only the server can interpret the data contained within the filehandle. Expand uses a virtual filehandle, that is defined as the union of all filehandleI, where filehandleI is the filehandle used for the NFS server I to reference the subfile I belonging to the Expand file. The virtual filehandle is the reference used in Expand to reference all operations. When Expand needs to access to a subfile, it uses the appropriated filehandle. Because filehandles are opaque to clients, Expand can use different NFS implementations for the same distributed partition.

To enhance I/O, user requests are split by the Expand library into parallel subrequests sent to the involved NFS servers. When a request involves k NFS servers, Expand issues k requests in parallel to the NFS servers, using threads to parallelize the operations. The same criteria is used in all Expand operations. A parallel operation to k servers is divided in k individual operations that use RPC and the NFS protocol to access the corresponding subfile.

User Interface

Expand offers two different interfaces. The first interface is based on POSIX system call. This interface, however, is not appropriate for parallel applications using strided patterns with small access size. Parallel applications can also used Expand with MPI-IO. Expand have been integrated inside ROMIO and can be used with MPICH (see Figure 5). Portability in ROMIO is achieved using an abstract-device interface for IO (ADIO).

Fig 5. Expand inside ROMIO

ADIO

ADIO is a mechanism specifically designed for implementing parallel-I/O APIs portably on multiple file systems. ADIO consists of a small set of basic functions for parallel I/O. The ROMIO implementation of MPI-IO use the ADIO to achieve portability. The ADIO interface implemented for Expand includes:

Open a file. All opens are considered to be collective operations.
Close a file. Close operation is also a collective operation.
Contiguous reads and writes. ADIO provides separate soutines for contiguous and noncontiguous access.
Noncontiguous reads and writes. Parallel applications often need to read or write data that is located in a noncontiguous fashion in files and even in memory. ADIO provides routines for specifying noncontiguous accesses with a single call.
Nonblocking reads and writes}. ADIO provides nonblocking versions of all read and write calls.
Collective reads and writes. A collective routine must be called by all processes in the group that opened the file.
Seek. This function can be used to change the position of the individual file pointer.
Test and wait. These operations are used ti test the completion of nonblocking operations.
File Control. This operation is used to set or get information about an open file.
Miscellaneous. Other operations included in ADIO provides routines for deleting files, resizing files, flushing cached data to disks, and initializing and terminating ADIO.

Configuration

To configure Expand, the user must to specify in a configuration file the Expand partition. An example of configuration file is the following:

/xpn1      8    4
server1        /export/home1
server2        /export/home2
server3        /export/home3
server4        /home

/xpn2        4    2
server1        /users
server3        /export/home/users

This file defines two expand partition. The first partition uses 4 servers (server1, server2, server3, and server4) and it uses by default a stripping unit of 8 KB. The second partition uses 2 servers and it uses a stripping unit of 4 KB. The path /xpn1 is the root path for the first partition, and /xpn2 is the root path for the second partition. So, the expand file /xpn1/dir/data.txt is mapped in the following subflies:

/export/home1/dir/data.txt in server1

/export/home2/dir/data.txt in server2

/export/home3/dir/data.txt in server3

/home/dir/data.txt in server4

Evaluation

In this section we show the results obtained in Expand with MPI-IO using two workloads:

Strided accesses.

An out-core matrix multiplication.

In the strided access tests, multiple noncontiguous regions of the data file were simultaneously accessed by each process. We used a
file of 100 MB and different access size (from 128 bytes to 4MB).

The out-of-core matrix multiplication assumes a NxN matrix. This implementation doesn't do anything fancy with memory management and simply assumes that each processor can hold 3N elements (each element is the type double) in core at one time: 2N elements for matrices A and B, and N elements for matrix C.

The multiplication process is shown in Figure 6. This is the algorithm using MPI-IO. All matrix are stored in disks is row order.

The platform used in the evaluation

The platform used in the evaluation was a cluster with 8 Pentium III biprocessors, each one with 1GB of main memory, connected through a Fast Ethernet, and executing Linux operating system (kernel 2.4.5). All experiments have been executed on Expand (XPN in figures) and PVFS using MPI-IO as interface. For both parallel file system, we have used 8 I/O servers. The block size used in all files, both in Expand as PVFS tests, has been 8 KB. All clients have been executed in the 8 machines, i.e: all machines in the cluster have been used as server and clients.

The results for strided accesses are shown in Figure 7. Graphics show the aggregated bandwidth obtained for write operations and read operations for 4, 8, and 16 processes and different access size.

GETTING EXPAND

AUTORS

PUBLICATIONS

Articles related with Expand
Other articles using Expand

GETTING EXPAND

AUTORS

PUBLICATIONS

Articles related with Expand
Other articles using Expand