Distributed File ServerProject DescriptionStorage architectures are constantly trying to balance various needs. To meet size requirements, storage must be expandable. To meet ease of use and management needs storage must appear centralized. Centralized storage must have high performance in order to support multiple clients that require the centralized data. To meet these needs we need an architecture that can scale in performance and storage size in direct proportion to the number of clients. I propose a distributed file server that appears to clients as a single entity but transparently spreads the file data across multiple servers. Each client should have a library that presents an application programming interface (API) that allows the single entity be connected to and files to be created, read, and written. The client library should be able to quickly discern where the chunks of each file are located and handle all read/write requests. When multiple chunks of a file are read or written, the chunks should be accessed in parallel. DetailsThe system will be available through Java. It will work like the Java File API, described at http://java.sun.com/j2se/1.5.0/docs/api/java/io/package-summary.html. A new class called DFile will represent a distributed file. File, FileInputStream, and FileOutputStream will all be extended or have a parallel class. A distributed file has its contents spread over the servers participating in the storage cluster. Its pathing is similar to UNIX systems but it also has a server name associated with it. An example is testserver /folder1/file1.txt . Here "testserver" is the logical name of the storage cluster, "folder1" is one folder or directory in the path of the file, and finally "file1.txt" is the name of the file. Each DFile* class will be subclass of the appropriate File* class or a parallel class and support much of its API. DFile* objects will work much like file objects with little changes needed in an application to use them. The communication between the client and the server will be handled by Java's RMI. A DFile will gets its meta data from a remote file object living on the servers for that file. Meta data calls like create(), length(), delete() require touching all servers. Calls like exists() and isFile() are accessed on each server in round robin fashion. Each call hits only one server with the next call hitting a different server. DFileInputStreams and DFileOutputStreams with read and write through RMI calls to read and write to remote files. Remote files will be represented by an RFile and RFileInputStream and RFileOutputStream classes that will expose meta data and allow for read and write calls. The algorithm for splitting a file works as follows: the complete name of a file, including its server name are hashed using the String hashCode of the server exclusive or'ed with the String hashCode of the pathname of the file. This code is then mod'ed by the number of servers in the distributed system to find the starting data server. This server holds start of the file. This keeps small files balanced between the servers. The file is then broken up into pieces each a configurable BlockSize long. When writing a file, the first block is written to the starting sever and then the second block is written to the next server in the list. When the end of the server list is reached, the I/O continues with the first sever in the list.
Current StatusFull documentation is available here. The current code implements spreading a file over multiple servers. The I/O is done synchronously. The distributed server cluster is made up of the name passed in and the localhost. The code does balance load on a set of distributed servers by round robin'ing meta data requests and distributing the actual data across the servers. RMI is used to transport requests and data. The ByteArray class allows for data to be returned without extra data copies occurring.Use Cases
|
| Use Case | Description | Preconditions | Postconditions |
|---|---|---|---|
| Open | A file is opened so that information can be read about it, and it can be created or deleted. | The user has access rights to the file. | Handle to a file is returned |
| File Info | File information like length, if a file is a directory, the contents of a directory. | The user has a file handle and has access rights | User gets file information returned for the distributed file |
| Create | File is created if it did not already exist | The user has a file handle and has access rights | User is told if the creation succeeded or not |
| Delete | File is deleted from the system | The user has a file handle to an existing file and has access rights | User is told if the deletion succeeded or not |
| Read | A portion of the file is read, offset from where the last operation completed. | The file was previously opened for reading. | Data is copied from the file and written into the caller's buffer. |
| Write | The caller's buffer is written either at the start of the buffer or after the end of the previous operation. | The file was previously opened for writing. | Data is copied form the caller's buffer and written to the file. |
|
|
|
|
|
|
Lessons LearnedPerformance issues include lack of asynchronous sending of chuncks and the chunck size. The chunck size currently used is 1347 bytes. This is the amount of data that can fit into a TCP/IP RMI packet after headers. The system currently sends one chunck at a time to each server. Each chunck must be acknowledged. Performance could be improved by sending all the chuncks in parallel. Also, the system could send all the chuncks for a particular server in one large message. This would require a form of ByteArray that could include a list of offset/length pairs that would essentially make up a scatter/gather list. When serialized this new ByteArray send all the data as one contiguous stream. One must be careful when using RMI to avoid excessive copies or transmission of data that isn't intended. For example, the File method write(byte[] b, int off, int len) allows the user to write from an offset. A trivial version of this would just have a similar remote method. The problem is that the entire byte array would be sent across the wire, including parts not needed. The ByteArray class allows the user to control the serialization process. FuturesWhen a server receives data it could simultaneously write it to NumBackupServer backup servers with a default of 1. It would then pick the server that is half the number of servers away from it, wrapping around the end of the list back to the beginning. For example, if there are 8 servers in the list and the I/O started at server 0, it would send its data to sever 4 as well. Server 4 knows this is backup data and does not copy it anywhere else. Another example would be there were 9 servers and the configuration was for 2 backups. The first server would pick the server that is 1/3 of the list size away, or 3 away in this case. When server 3 receives the data from server 0, it knows that it is the first backup of 2 total and begins copying the data to the next server, server 6 immediately. Data is copied as soon as it is received. Data could be read and written in parallel by a client by requesting all the chuncks of a file required to serve the request at once with up to one thread per server. Each thread reads/writes all the data require to serve the request into the appropriate places in memory. The result is an image in memory that matches what was written out. Read or write errors are handled by the client on a per server basis. The client knows the backup of each server and can request the data from the backup server. On error a server is banned by the client until it is told things are ok. When a server fails a request, the client attempts to use the backup server. The backup server realizes that the original master server for this segment must have failed and begins to attempt to periodically contact it. When it contacts the server it attempts to synchronize its data. Only after all data has been synchronized does the backup server report to the client that it should begin using the old master server again. Adding and removing servers results in a rebalance of data. Servers move from the old view to the new view. While transitioning the ownership of a block the new owner must report to the client that it should use the original owner until it has all of the data. Meanwhile the original owner must continue to serve requests until the new owner is ready. When the new owner is ready it should then report to the client that it should now use the new owner. Server rebalancing requires large amounts of space as servers must keep there new and old copies of data during the rebalance. |
|
By David Acker Back to Distributed Systems Page |