Description:
®
Two Commodity Scaleable Servers:
a Billion Transactions per Day and
the Terra-Server
A
White Paper
from the Business Systems Division
Two Commodity Scaleable Servers:
a Billion Transactions per Day and
the Terra-Server
A White Paper from the Desktop
and Business Systems Division
Abstract
This paper describes two large
scaleable servers built using Windows NT® Server and Microsoft® SQL
Server⢠running on commodity hardware. One server processes
more than a billion transactions per day â more transactions per day
than any enterprise in the world. It demonstrates a 45-node cluster
running a three-tier DCOM-based SQL application. Another application,
called Terra-Server, stores one terabyte of images in a Microsoft SQL
Server Database on a single node. Terra-Server manages a database
of aerial images. It is the world's largest atlas, and soon it
will be on the Internet. It will use Microsoft Site server to
sell the right to use the imagery. Terra-Server is a classic scaleup
story. Other Windows NT scalability projects are also mentioned:
a large Microsoft Exchange mail server, a hundred-million hits per day
web server, a 64-bit addressing SQL server, and cluster fault-tolerance.
The Microsoft Desktop and
Business Systems Division series of white papers is designed to
educate information technology (IT) professionals about Windows NT Server
and the Microsoft BackOffice⢠family of products. While current technologies
used in Microsoft products are often covered, the real purpose of these
papers is to give readers an idea of how major technologies are evolving,
how Microsoft is using those technologies, and how this information
affects technology planners.
For the latest information
on Microsoft products, check out our World Wide Web site at
http://www.microsoft.com/ or the Windows NT Server Forum on the Microsoft
Network.
Legal Notice
The information contained in this
document represents the current view of Microsoft Corporation on the
issues discussed as of the date of publication. Because Microsoft
must respond to changing market conditions, it should not be interpreted
to be a commitment on the part of Microsoft, and Microsoft cannot guarantee
the accuracy of any information presented after the date of publication.
This document is for informational purposes
only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS
DOCUMENT.
© 1997 Microsoft Corporation. All
rights reserved.
Microsoft, the BackOffice logo, Encarta,
Automap, Visual Basic, and Windows NT are registered trademarks and
BackOffice, Expedia and Sidewalk are trademarks of Microsoft Corporation.
Other product and company names mentioned
herein may be the trademarks of their respective owners.
Part No. xxxxxxxx
Table of Contents
Introduction
Many forces are causing enterprises
to move all their business online. The most obvious force is the
declining cost of hardware. Now almost anyone can afford a computer.
Companies can now afford very powerful systems with huge data stores.
There have been comparable breakthroughs in software tools. Internet
tools, database systems, and new programming systems make it much easier
and less expensive to create applications. The software tools
and resulting applications are themselves becoming commodities.
The Need for Scalability:
When companies deploy application systems, they often worry about
how the system will grow over time. As the database grows and
the application load grows, the system must grow (scale) to handle the
increased load. Scalability has several dimensions:
Increased number
of users
Increased workload
(transaction size)
Increased database
size
Increased number
of sites for a distributed system
A scaleable system must allow
growth in all these dimensions.
SMP Scaleup:
To show that Windows NT can scale to solve large problems, Microsoft
began a series of demonstration projects in 1996. We set out to
build large servers ï 10-way SMPs, terabyte SQL databases,
terabyte Microsoft Exchange mail archives, high-traffic web servers,
and massive-main memory database applications. Other scaleup projects
include:
Moving Windows NT
and SQL Server to support 64-bit addressing, and thus
exploit large main memories.
Building a single
node Microsoft Exchange server that can store a 50 GB of mail and public
folders as well as support 50,000 POP3 clients exchanging 1.8 million
messages per day.
Building a web server
on a single node that can support 100 Million HTTP web hits per day.
Windows applications are portable
to a broad range of platforms. This upwards (and downwards) compatibility
protects customer software investments.
These are all examples of
scaleup - building larger and larger single node-systems. Windows
lets you write applications that run on the spectrum of systems in the
figure at left. As your needs grow, you can use the upwards compatibility
of these nodes to move to a larger box. Scaleup, is very
attractive.
Scaleup protects your software
investment, but it does not protect your hardware investment if you
outgrow your current platform. (1) As you scale to a larger box, your
investment in the smaller box must be written off. (2) Larger
boxes are expensive and so the unit-of-growth tends to be very large
at the high end. (3) Because this growth is so painful, customers buy
over-capacity in advance in anticipation of growth. (4) There
is no simple fault tolerance story. If the core of the single-node
system fails, then the system is unavailable. (5) There is a largest
box. Once your needs exceed that size, you are forced to build a cluster
of systems using several of these large boxes.
Cluster Scaleout:
Designs that use a cluster of processors, disks, and networks allow
customers to buy computers by the slice. We call each slice a
CyberBrick. Ideally you can start by buying just what you need.
Then as demand grows, you can buy more CyberBricks. There are
several examples of cluster architectures: Tandem, Digital VMScluster,
NCR-Teradata, HP-Apollo, IBM-Sysplex and IBM-SP2. In each of these
systems, you can build a cluster of computers by starting with a one
or two node system. You can add nodes as the demand grows.
The novel thing about Microsoft's cluster approach is that it uses commodity
CyberBricks and commodity software. This translates into huge
savings for customers â commodity components are up to ten times less
expensive than proprietary systems.
Scaleup to
a large shared-memory multi-processor: the Terra-Sever
is a good example of this approach
Scaleout
with a
shared-nothing cluster. The billion-transactions
per day system is a cluster of 45 nodes.
Cluster scaleout avoids the scaleup
problems. Clusters
preserve your previous investment. They allow incremental growth
in small and inexpensive steps by adding to the existing processor,
storage, and network base. Clusters do not force you to buy over-capacity
in advance. Clusters offer fault tolerance: if one node fails,
another can take over for it. Properly designed clusters can grow without
limit.
Hardware and economic drivers for
CyberBricks:
The increasing power and declining price of hardware make all these
things possible. The billion transactions per day system is profligate
in its use of hardware (140 Intel PentiumPro⢠processors â about
14 billion instructions per second). Such a system would cost
140 Million dollars if it were built with mainframe components.
In fact, the hardware costs about 2 million dollars when built with
CyberBricks.
Similarly, CyberBrick disks are about 10x less expensive than mainframe
disks â the billion-transactions-per-day
system has almost 900 disks, the one-node Terra-Server has over 300
disks. These inexpensive disks enable commodity clusters to store
much larger databases for the same price.
Software CyberBricks: Similar forces are at work
in software. A few years ago, a "real" database system
rented for 100,000$ per year. A fully configured MVS system rents
for 30,000$/month. This seems high when compared to the lifetime
purchase price of Windows NT and BackOffice. The software for
a fully configured Windows NT system with licenses for 8,000 clients
costs less than 20,000$ [TPC-C]. Windows NT continues to add functionality.
Microsoft recently added distributed objects (DCOM), a transaction-processing
monitor, many web services, and fault-tolerant clustering to Windows
NT.
Windows NT is creating a market for commodity components and applications.
Today you can buy hardware CyberBricks. Microsoft estimates that software
plugins are a $240 million market in 1997. You are now able to
buy software CyberBricks: commodity components that easily snap together.
This is the promise of object-oriented technologies like Microsoft's
Distributed Component Object Model (DCOM).
Most applications are structured
in 3-levels, Presentation, workflow, and business objects.
Three-tier
computing built from software CyberBricks:
Modern applications are typically structured as a three-tiered system.
The three tiers are:
Client workstations
gather inputs and send requests to servers. When the server responds
the client workstation presents the results to the end user.
Sever systems have
a workflow front-end (often called a Transaction Processing Monitor)
which authorizes the requests and dispatches it to an application program.
The application
program in turn makes requests to a database server to perform actions
on the underlying business objects. This architecture applies
to transaction processing systems, application servers, and web-based
applications.
The advent of the Internet
and World Wide Web has changed the terms a bit: browsers send HTTP requests
to web servers like IIS, Netscape, or Apache that perform the TP monitor
function. In the world of objects, the Object Request Brokers
like CORBA and MTS perform the TP-monitor function. This middle
tier then passes the requests to application or database servers that
perform the third-tier functions.
These systems are ideally suited
for software plug-ins (software CyberBricks) that allow you to easily
construct and extend your application. The applications described
here use the three-tier model of computing and use Microsoft's Distributed
Component Object Model (DCOM) as the glue to build the application.
Scalability Projects: Microsoft undertook several scalability
projects to demonstrate the cluster capabilities of Windows NT and Microsoft
BackOffice. We built Terra-Server, a one terabyte Microsoft SQL
Server database, to show that Windows NT can scale to huge databases.
We built fault-tolerant nodes with our Wolfpack clustering technology
as a scaleout project. Microsoft, working closely with Intel and
Compaq, built a 45-node cluster capable of running more than a billion
transactions per day using Microsoft Transaction server (MTS) and a
2.4 terabyte Microsoft SQL Server database.
These projects taught us a
lot. They demonstrate that Windows NT and Microsoft BackOffice
can scale to very large single nodes. We built and managed large
single nodes, and built large clusters from CyberBricks. The projects
benefited from the wonderful new tools for building three-tier applications.
These tools made it possible for a few people do the work of dozens.
In one case, we recast an application that had taken several months
to build with conventional tools. The revised application was
more powerful, more robust, and took three days to implement.
We see similar benefits in many other projects.
This paper describes two of
our scalability projects in detail. First it describes the Billion
transactions per day effort â a three-tier DCOM-based cluster.
Then it describes the Terra-Server, a SQL Server application that serves
aerial images over the Internet. Terra-Server uses Microsoft Site
server to sell the imagery over the Internet. Terra-Server runs
on a single large node and is a classic scaleup example. The paper
concludes with brief descriptions other Windows NT scalability projects:
a large Microsoft Exchange mail server, a hundred-million hits
per day web server, a 64-bit addressing SQL server, and cluster fault-tolerance.
Microsoft has leadership with
its desktop products. Windows NT and the Microsoft BackOffice
products extend that leadership to servers. Windows NT and BackOffice
are maturing, and now Microsoft can demonstrate that PC technology can
solve even the most demanding problems.
One Billion Transactions
per Day
Online transaction processing
is a common use of clusters for scaleout. Application servers are added
as the workload grows. Disks and controllers are added as the database
grows. Network interfaces and servers are added as network traffic grows.
The Application
We wanted to demonstrate that
a three-tier application using Microsoft Transaction Server, DCOM, and
SQL Server can scale to serve huge online applications. To do
this we picked the DebitCredit scaleable transaction application as
our workload [Datamation].
Read input from client
(account, branch, teller, delta)
Update account +delta
Update teller +delta
Update branch +delta
Insert history
(account, branch, teller,
delta, time)
Reply new balance
Commit
DebitCredit transaction
profile: The DebitCredit transaction profile has been the workhorse
of the industry for two decades. The application postulates a
memo-posting banking application with many branches and many customers.
The application is a debit or credit to some account. The application
also updates the teller cash drawer, branch balance, and does a memo-post
to a general ledger. A thirty-day history of this ledger is kept
online.
Scaling rules for one transaction
per second
1 branch (100 byte record)
10 tellers (100 byte record)
100,000 customers
(100 bytes, ~ 10 MB)
2,600,000 30-day history (50
bytes, 130 MB)
DebitCredit transaction
scaling rules: Each of the tellers at a branch collectively generate
one transaction per second. That is, each teller is entering one
transaction every ten seconds. Each branch has ten tellers and
100,000 customer accounts. The system grows by adding more branches,
tellers, and accounts. A thousand-transactions-per-second system
has 1,000 branches, 10,000 tellers, and 100 million accounts.
There is one additional requirement, 15% of all transactions are "remote".
That is, customers do debits and credits at their home branch 85% of
the time, but 15% of the time a customer goes to a non-home branch and
does a debit or credit to his account. In this case, the transaction
involves two branches: the originating branch and the home branch.
This 15% non-local requirement makes the task much harder. It
means that all partitions of the database must cooperate in about
30% of the work (about 30% of the work at each node is incoming or outgoing
transactions.)
DebitCredit transaction
recast as a three-tier object-oriented application: Recasting the
DebitCredit transaction in object terminology, the bank is an object.
It has various methods on it like Create Customer, Create Branch, Create
Account, and so on. These sub-objects have their own methods.
The DebitCredit method, called Update_Account, deposits or withdraws
money from an account.
The Bank.Init() method connects
to the bank database via ODBC and authenticates itself to the database.
Thereafter, the Bank.Update_Account() method uses this ODBC connection
to send a request to an SQL Server. The method parameters are
account_number, branch_number, teller_number, and amount. If the
account is non-local to the branch, then the Update_Account method invokes
a special stored procedure designating the home_branch_number.
The SQL Server has three stored
procedures.
The flows among the processes
in a DebitCredit transaction. 85% of the time, the transaction
involves a single SQL Server. 15% off the time the transaction
involves a second SQL Server and one or two Distributed Transaction
Coordinators.
DebitCreditLocal
handles local transactions.
DebitCreditRemote
handles non-local transactions. It begins a distributed transaction,
then it updates the local branch, teller, and history file. Then
it invokes the UpdateAccountRemote stored procedure at the customer's
home SQL Server. When that stored procedure completes, the SQL
Server commits the distributed transaction.
UpdateAccountRemote
is invoked by DebitCreditRemote to update the specified account.
In the benchmark system, each
SQL Server manages 800 branches and eighty million customer accounts.
So, each CyberBrick is an 800 transaction per second system (tps) managing
eighty million bank accounts. This is a large bank. The
full system has twenty such CyberBricks, for a total of 1.6 billion
bank accounts. We believe this is more than all the bank accounts
in the world.
The "local" message
flows are shown above. 15% of the transactions are non-local.
They need the two-phase-commit mechanisms of DTC. 20% of them
involve two SQL Servers and one DTC (shown at right). 80% of the
distributed DebitCredit transactions involve two DTCs and exchange 35
logical messages. Most of these messages are
âbatchedâ so that DTC has relatively little message traffic per
transaction. Local transactions generate two network messages
while distributed transactions generate
six. These messages are short. The total network
traffic is less than 10 MBps.
Simulating the 160,000 terminals:
We did not configure the 160,000 terminals needed to drive the 20-node
system. Rather the terminals are simulated by 1,100 threads submitting
transactions as fast as they can. We did configure one real ATM
client to let viewers "feel" the response time. One
workflow node drives each server node â a front-end back-end model.
Each workflow node has a driver process with 55 concurrent threads.
Each thread generates transactions as quickly as it can. The transaction
rate per workflow node varies between 600 tps and 800 tps. Servers
slow down as the SQL system checkpoints, and there is some variation
with the random load.
A federated database:
Each node has one twentieth of the database. Yet each node is
an autonomous database system. The collection of all these databases
forms a federated database that manages all the bank branches as one
global database. Partitioning the bank in this way was the
key to building a scaleable system. If this were a manufacturing
application we might partition by geography or by product or by customer.
In general, picking the partitioning strategy is a key step in building
a scaleable application.
The distributed transaction
coordinator (DTC) ties all the nodes together into one bank-wide database.
When a transaction involves a money transfer between two or more nodes,
the DTC makes sure that the transfer has the ACID properties (atomicity,
consistency, isolation, and durability) [Gray & Reuter]. DTC
implements the two-phase commit protocol (sometimes called the XA or
LU 6.2 protocol) to assure that all nodes have a consistent version
of the database.
The Database is two terabytes: The database is spread among twenty
servers running Microsoft SQL Server 6.5. Each server is sized
for 800 bank branches. So each server manages eighty million bank
accounts. The sizes of the tables are as follows:
The
Billion Transactions Per Day Database has space for over 38 billion
records. At the end of the first day, it has almost three billion
records. All storage is RAID protected, so 4.3 TB of raw disk
supports the 2.3 TB of user data.
Table
records
Bytes
RAID
Branch
16,000
2 MB
mirror
Teller
160,000
20
MB
mirror
Account
1,600,000,000
200
GB
mirror
History
(30 day)
30,000,000,000
1.5 TB
mirror
Recovery
log
7,000,000,000
700 GB
RAID5
Total
38 Billion
2.3 TB
Disk
space available before RAID
4.3 TB
Disk
space available after RAID
2.4 TB
Distributed transaction
rate: When a node is running at the 800 tps rate, there are approximately
120 outgoing distributed transactions and 120 incoming distributed transactions
each second at each node. That is 240 distributed transactions per second
at each node (30% of the load on a node). There are about 1,800 distributed
transactions per second overall. A distributed transaction coordinator
(DTC) is configured for every four SQL Server nodes. In the full
20-server case there are 5 DTC nodes. In this case, 85% of the
transactions are local, involving only one node. 15% of the transactions
are distributed involving 2 SQL Servers. Of the distributed transactions,
25% involve only one DTC node, while 75% involve two DTC nodes.
Software Summary: a
Classic 3-tier application.
In summary, this is a classic 3-tier application. The 1,100 threads
simulate a network of 160,000 terminals. The threads make DCOM
DebitCredit calls on the bank object. The DCOM calls are translated
to ODBC calls that are executed by Microsoft SQL server. Most
of the transactions involve only a single node, but about 30% of the
work involves multi-node transactions.
The Hardware
The hardware consists of 20
workflow nodes (the middle tier) driving 20 SQL Server nodes.
Five additional nodes coordinate distributed transactions involving
two or more server nodes. In all, the cluster has 140
processors executing about 14 billion instructions per second on this
workload.
Type
nodes
CPUs
DRAM
ctlrs
disks
RAID
space
Workflow
MTS
20
Compaq Proliant
2500
20x
2
20x
128
20x
1
20x
2
20x
2 GB
SQL
Server
20
Compaq Proliant
5000
20x
4
20x
512
20x
4
20x
36x4.2GB
7x9.1GB
20x
130 GB
Distributed
Transaction Coordinator
5
Compaq Proliant
5000
5x
4
5x
256
5x
1
5x
3
5x
8 GB
TOTAL
45
140
13 GB
105
895
3 TB
Clients: The client
nodes are Compaq Proliant 2500 servers. Each client has 2x200
MHz Intel PentiumPro⢠processors, 128 MB of memory, and two 2GB disks.
Each has a single Ethernet card. The workflow systems are running MTS
version 1.0.
Servers: The server
nodes are Compaq Proliant 5000's with 4x200 MHz Intel PentiumProâ¢
processors, 512 MB of DRAM, and 36 4.2 GB disks
configured as a RAID1 (mirror) sets and a 7x9.1 GB disk RAID5 array
(with one spare drive.) The servers have 2.4 TB of storage managed by
Microsoft SQL Server 6.5.
Transaction Coordinators:
The transaction coordinators are Compaq Proliant 5000's with 4x200 MHz
Intel PentiumPro⢠processors, 128 MB of DRAM, one system disk, and
a duplexed log disk.
The Interconnect: All
the nodes are interconnected via a single Cisco-5000 100-Mbps Ethernet
switch. It is full duplex and so gives each node about 20 megabytes
per second of message bandwidth to the other nodes. The switch
has less than 5% utilization because it carries only request-reply and
transaction commit messages. Simple transactions generate only
two messages. Distributed transactions can involve up to 30 messages,
but many of these are short and are batched by the transaction coordinator.
Summary:
In all there are 20 server processors with, 10 GB of DRAM and 860 disks
holding 2.4 TB of real data. The 30 billion history records in
the general ledger dwarf the 1.6 billion account records. The
system is configured for the one-day log files (700 GB) and almost enough
space for the 1.6TB 30-day history file. The log is protected
by RAID5 while the disks are protected by RAID1 (mirroring).
Management and Monitoring:
In addition to the application, we built a simple application to manage
and monitor the cluster. This application starts the drivers,
ramps up the system load, and monitors system throughput.
Assessment
The system runs approximately
1.2 B transactions per day. That is nearly a million transactions
a minute and 14 thousand transactions a second. During the day-long
run, each of the 20 SQL Servers in the system is checkpointing every
five minutes
A screen shot of the benchmark
monitor at startup. As the 20 server nodes come online over a
one-minute period, the throughput ramps up to a rate of 1.24 billion
transactions per day.
The benchmark run:
The screen shot at left shows the initial ramp-up of the system.
Nodes are brought online one-at-a-time to show the linear growth in
throughput. We believe that a system twice this size would have
twice the throughput - 2.4 billion transactions per day.
Once all twenty workflow systems
and servers have been brought online, the system throughput is fairly
steady. It oscillates a bit due to the SQL Server checkpoint activity,
and due to the stochastic nature of the benchmark.
We have failed disks and repaired
disks during a run. While the disk is being repaired, throughput
drops about 5%. We have experimented with 60% distributed transactions
rather than just 15%. That saturates the DTC nodes and drops throughput
to about 350 million transactions per day. By adding more DTC
nodes, and more servers we could do 100% non-local transactions.
1 Billion transactions per
day is more than any enterprise or industry.
How much is a Billion
Transactions Per day? To put these numbers in perspective, IBM estimates
that all existing IBM systems combined perform about 20 billion transactions
per day [IBM]. The world Wide Web executes about 20 billion hits
per day. The airline and travel reservation industry processes about
250 million transactions per day. Visa handles about 80 million
transactions on a peak day and 20 billion transactions per year.
ATT completes less than 200 M calls a day. UPS and Federal Express
carry less than 50 million parcels a day. Large banks handle about
20 million transactions per day. The New York Stock Exchange handles
less than a million transactions on a peak day. Master Card recently
reported that it does two billion transactions per year.
Micro-dollars per
transaction. The server nodes have a list price of 1.7 m$, the workflow
nodes have a list price of 148 k$, and the switch costs 35 k$.
The software has a list price of 140 k$. So the total system price is
about two million dollars. In three years, the hardware can do
over a trillion transactions (1015), at a cost of about a
micro-dollar each
Easy to Build:
Assembling the hardware for this system took about a month. Installing
the software proceeded in parallel with the hardware install.
Formatting the disks and loading the database takes 5 hours. The
hardware moved to New York for a demo. It was re-assembled it
in four days (in a hotel). During the trip, the Ethernet cables
were lost, and one RAID-protected disk did not restart, but otherwise
the hardware and software survived the transcontinental move.
Easy to Program:
As can be seen from the figures, the programs and stored procedures
are extremely simple. Building the application in MTS and DCOM
took only a week. This does not tell the whole story. This
application was a good stress test for Microsoft Software. The
whole project has been underway for about 18 months. Early experience
resulted in performance enhancements that appear in Windows NT Sever
Enterprise edition.
Easy to Manage: The
cluster is managed from a single console that can view each node individually
and can view the overall system. The cluster is a single protection
domain. It uses the Windows NT, Microsoft SQL Server, and Microsoft
Transaction Sever management tools to configure, tune, and manage all
the nodes from one console.
Summary:
The billion-transactions-per-day DebitCredit system is a classic three-tier
application showing a DCOM application sending a billion transactions
per day to a Microsoft Transaction Server. The DCOM calls are
translated to ODBC requests to SQL Server. The native Windows
NT Distributed Transaction Coordinator gives these transactions the
ACID property. The database size and the application throughput
scales linearly as client and server nodes are added to the system.
Each transaction costs approximately a micro-dollar. This benchmark
should resolve any questions about the scalability of Windows NT Server,
DCOM, and Microsoft SQL Server. Properly designed applications
can scale to solve the largest OLTP problems. This system has
a huge (2.3 TB database) and has huge (1.2 billion transactions per
day) throughput.
Photographs
of the Compaq 45-node cluster as assembled in New York.
The Terra-Serverï¤
Terra-Server has a terabyte
of satellite and aerial images covering most urban areas of the world.
It will serve these images onto the Internet. The application
demonstrates several things:
Information at
your fingertips: This will be the most comprehensive world atlas
anywhere â and it will be available to anyone with access to the Internet.
The Terra-Server hardware
Windows NT and
SQL Server can scale up to huge nodes: The Terra-Server fills four
lager cabinets: one for the Digital Alpha processors, and three cabinets
for the 324 disks. The single-node Terra-Server is has 1/3 as
much storage as the 45-node billion-transaction-per-day cluster.
Windows NT and
SQL Server are excellent for serving multi-media and spatial data onto
the Internet.
Microsoft Site
Server can sell intellectual property over the Internet.
Terra-Server is a multi-media
database that stores both classical text and numeric data, and also
stores multi-media data. In the future, most huge databases will
be comprised primarily of document and image data. The relational
meta-data is a relatively small part of the total database size.
Terra-Server is a good example of this new breed of databases â now
called object-relational databases.
The Application
Terra-Server demonstrates a
huge database running on a huge a single-node Windows NT Server
and Microsoft SQL Server system . The billion-transaction-per-day
cluster shows that Windows NT can support large databases and applications.
A Terabyte Internet Server:
Terra-Server is a real application that will be served onto the Internet
to demonstrate the next-generation Microsoft SQL Server and Microsoft's
Internet tools. For this to be compelling, the application had
to be interesting to almost everyone everywhere, be
offensive to no one, and be relatively inexpensive. It is hard to find
data like that â especially a terabyte of such data. A terabyte
is nearly a billion pages of text â 4 million books. A terabyte holds
250 full-length movies. It is a lot of data.
World population density
Satellite Images of the
Urban World: Pictures are big and have a universal appeal, so it
was natural to pick a graphical application. Aerial images of
the urban world seemed to be a good application. The earth's surface
is about 500 square tera-meters. 75% is water, 20% of the rest
is above 70ï°
latitude. This leaves about 100 square tera-meters. Most
of that is desert, mountains, or farmland. Less than 4% of the
land is urban. The Terra-Server is in the process of loading this
data. Right now, it has nearly three million square kilometers.
Soon it will have over seven million square kilometers
Cooperating with United
States Geologic Survey: The USGS has published aerial imagery of
many parts of the United States. These images are approximately one-meter
resolution (each pixel covers one square meter.) We have loaded all
the published USGS data (214 GB). We have a Cooperative Research
Agreement with the USGS to get access to, and load the remaining data
(3 TB more). This is 30% of the United States. This data
is unencumbered and can be freely distributed to anyone. It is
a wonderful resource for researchers, urban planners, and students.
The picture at left shows a baseball game in progress near San Francisco.
You can see the cars, but one-meter resolution is too coarse to show
people.
A USGS DOQ image.
An SPIN-2 2-meter
image.
Working with Russian Space
Agency and Aerial Images. To be interesting to everyone everywhere,
Terra-Server must have worldwide coverage. The USGS data covers
much of the United States. There is considerable imagery of the
planet but much of it is either of poor quality (10 meter to 1-km resolution),
or has not been digitized, or is heavily encumbered. An Internet
search for high-quality images with worldwide coverage gave a surprising
result. The Russian Space Agency and their representative Aerial Images
have the best data and are the most cooperative. Microsoft agreed
to load the Russian data and serve it onto the Internet. Microsoft
is helping the Aerial Images create an online-store for and is giving
them rights to use the resulting applications. The Russians and Aerial
Images are delivering two square tera-meters of imagery to us (2-meter
resolution). This data is trademarked SPIN-2 meaning satellite-2-meter
imagery. They promise to deliver an additional 2.4 square terra-meters
over the next year.
Terra-Server is the largest
world atlas: The Russian SPIN-2 imagery covers Rome, Paris, Tokyo,
Moscow, Rio, New York, Chicago, Seattle, and most other cities of more
than 50,000 people. Terra-Server has more data in it than ALL
the HTML pages on the Internet. If printed in a paper atlas, with
500 pages per volume, it would be a collection of 2,000 volumes.
It will grow by 10,000 pages per month. Clearly, this atlas must
be stored online. The USGS data (the three square tera-meters)
approximately doubles the database size. It is a world-asset that
will likely change the way geography is taught in schools, the way maps
are published, and the way we think about our planet.
Terra-Server as a business.
Slicing, dicing, and loading the SPIN-2 and USGS data is not yet complete.
Today, the Terra-Server stores a terabyte, but much of the data is duplicated
to exercise the database system. All the images should be loaded
by the late summer. Starting in the fall of 1997, Aerial Images,
Digital, and Microsoft will operate the Terra-Server on the Internet
as a world asset. Microsoft views Terra-Server as a living advertisement
for the scalability of Windows NT Server and Microsoft SQL Server.
Digital views it as an advertisement for their Alpha and StorageWorks
servers. The USGS distributes its data to the public at cost.
The Russian Space Agency and Aerial Images view Terra-Server as a try-and-buy
way to distribute their intellectual property. They will make
coarse-resolution (12-meter and 24-meter) imagery available for free.
The fine-resolution data is viewable in small quantities, but customers
will have to buy the right to use the "good" imagery.
All the SPIN-2 images are watermarked, and the high-resolution images
are lightly encrypted.
Site Server, a new business model for the Internet. Aerial
Images' business model is likely to become a textbook case of Internet
commerce. By using the Internet to sample and distribute their
images, Aerial Images has very low distribution costs. This allows
Aerial Images to sell imagery in small quantities and large volumes.
Microsoft is helping Aerial Images set up a Microsoft Site Server that
will accept credit-card payments for the imagery. A "buy-me"
button on the image page takes the user to this Aerial Images Site server.
In a few months, you should be able to buy a detailed image of your
neighborhood for a few dollars. We will also help the USGS set
up a Microsoft Site Server so that people can easily acquire USGS products.
Navigation can be via name
(left) or via a map (right). In either case, the user can select
the USGS or SPIN-2 images for the place
User Interface to Terra-Server
Navigation via database
searches: The Terra-Server can be accessed from any web browser
(e.g. Internet Explorer, Navigator) supporting ActiveX⢠controls and
Java⢠applets. Navigation can be a spatial point-and-click
map control based on Microsoft's Encarta® World Atlas.
Clients only knowing the place name can navigate textually by presenting
a name to the Encarta World Atlas Gazetteer. The gazetteer
knows the names and locations of 1.1 million places in the world.
For example, "Moscow" finds 28 cities, while "North Pole"
finds 5 cities, a mining district, a lake, and a point-of-interest.
There are 378 San Francisco's in the Gazetteer. The user can select
the appropriate member from the list. The map control displays
the 40-km map of that area. The user can then pan and zoom with
this map application, and can select the USGS and SPIN-2 images for
the displayed area.
Spatial navigation via the
map control. We built a Java applet that runs in the client browser
and talks to a server that has the basic features of Microsoft Encarta
World Atlas and Microsoft Automap Streets. The Java
applet decides what the client wants to see and sends a request for
that map to the Terra-Server. The Terra-Server builds the associated
map image on demand and returns it to the client. Since the applet
is written in Java, it can run on Windows, Macintosh, and UNIX clients.
Spatial access will be especially convenient for those who do not understand
English.
Zooming in and out: The map control allows the browser to zoom
out and see a larger area, or zoom in and see finer detail. The
coarsest view shows the whole planet. The user can "spin"
the globe to see the "other side" and place the point of interest
in the center of the screen. Then the user can zoom in to see
fine detail. Where we have street maps (Microsoft Automap®
Streets), the zoom can go all the way down to a neighborhood.
Encarta, USGS, and SPIN-2
Themes: Terra Server has several different views of the earth: the
Encarta World Atlas view, the USGS image view, and the Russian-SPIN-2
view. We call each of these a theme. The user may
switch from one theme to another: perhaps starting with the Encarta
theme, then the SPIN-2 theme, and then the USGS theme of the same spot.
With time, we expect to have multiple images of the same spot.
Then the user will be able to see each image in turn. Your grandchildren
will be able to see how your neighborhood evolved since 1990.
Moving Around: Once
you find the spot you are looking for, you can see nearby places by
pushing navigation buttons to pan and zoom. Doing this, you can
"drive" cross-country.
Buy-me: If you like the SPIN-2 imagery, you can push the "buy-me"
button. That takes you to the Aerial Images Site Server.
The Site Server (1) allows you to shop for imagery, (2) quotes you a
price, (3) if you want to purchase the right to use the images, Site
Server asks for your credit card, debits it, and (4) downloads the images
you purchased to you. This is a good example of selling soft goods
over the Internet.
Server Design
The Terra-Server design shows
that Microsoft SQL Server, Windows NT, and Internet Information Server
are good at storing huge multi-media databases to be served onto the
Internet. The billion transactions per day project demonstrated
a 2-terabyte database, but those are small-record databases spread across
20 servers. We wanted to build one very big server.
Terra-Server hardware and software
The SQL database has several components:
Gazetteer: The Encarta
World Atlas Gazetteer has over a million entries describing most
places on earth. All these records are stored and indexed by Microsoft
SQL Server. Stored procedures to look up these names and produce an
HTML page describing the top 10 "hits", with hot-links to
the images if they are in the Terra-Server.
Map Control:
The Encarta World Atlas group implemented a map control for us
and for other Microsoft groups (e.g., Expedia⢠and Sidewalkâ¢).
This server program, given the corners of a rectangle and an altitude
generates the view of the earth inside that rectangle. It generates
a GIF image that is downloaded to the Java-based applet in the client
browser. The map server runs on a Digital Prioris computer next to the
Terra-Server.
The Reference Frame:
Terra Server uses the latitude and longitude coordinate system, potentially
with centimeter accuracy, to geo-locate objects. All the
themes are represented in this coordinate system. After several
false starts, Terra-Server settled on a simple latitude-longitude-tiling
scheme. When indexing, it use the Z-transform (interleave latitude
and longitude bits) to give a single clustering key for each spot on
earth [Samet]. All the spatial indices are based on this scheme.
Terra Server uses Sphinx: Terra-Server uses the next version
of Microsoft SQL Server, code-named Sphinx. This version supports
larger page sizes, has better support for multi-media, supports parallelism
within queries, parallel load, backup and restore utilities, and supports
much larger databases. Terra-Server has been a good alpha test
for Sphinx.
Database Design: Each
theme has it's own set of SQL tables. Each image is stored along
with its meta-data as a record in a table of the database. The
data is indexed by Z-transform. The database parameters are summarized
in the following table. We have not received all the DOQ and SPIN-2
data. So, to reach a terabyte database, we replicated some of
the real data into the Pacific Ocean to create the full-sized database.
There is 1.5 real square tera-meters and 3.8 total sq tm. We are
removing this phantom pacific island as more real data arrives.
The
Terra-Server database has 886 GB of data stored in 123 million records.
The remaining space is used for indices, catalogs, recovery logs, and
temporary storage for queries and utilities. The database has
a formatted capacity of 1.001 TB.
Total
Disk Capacity
Unprotected 1.4
TB
after RAID5: 1.1
TB
Database
Size
324 disks x 4.2
GB, 19 volumes
1.001 TB
Million Records
Gazetteer
.16 GB
1.1
USGS
.9 sq tera-meters
@ 1m (JPEG)
99 GB
13.1
USGS
fill
1.2 tera-meters
@ 1m (JPEG)
124 GB
24.2
SPIN-2
.3 sq tera meters
@ 2m (TIFF)
77 GB
8.2
SPIN-2
fill
2.2 sq tera meters
@ 2m (TIFF)
555 GB
77.5
Total
3.8 square
tera meters
886GB
123 million
records
Index,
Catalog, log
89 GB
Temp
Space
50 GB
Total
1.00 TB
SPIN-2 Theme:
The raw SPIN-2 data is scanned at 2-meter resolution. The images
are then geo-rectified (North is up, optical distortion minimized),
and geo-located (pixels are accurate to 2-meters). The Russian
Space Agency and Aerial Images do all this work. The data is then
sent to Microsoft on Digital 7000 40 GB DLT magnetic tapes. The
typical image is 500 MB. It would take five years to download
such an image over a 28.8 modem. We slice-and-dice these images
into 30-KB tiles that can be downloaded within ten seconds. The
slice and dice step produces three products:
Thumbnails:
JPEG compressed images covering a 8 km x 4 km area at 24-meter resolution
Browse:
JPEG compressed images covering a 4 km x 2 km area at 12-meter resolution
Tiles:
TIFF images that cover a 600m x 400m area at 2-meter resolution.
The key property is that these
tiles can be downloaded quickly over a voice-grade telephone line.
All the images are watermarked. The tile images are lightly encrypted.
Each image, along with it's meta-data (time, place, instrument, etc.,...
) is stored in a database record. Each resolution is stored in
a separate table. This data can be cross-correlated with the Gazetteer
and other sources by using the Z-transform. At this time we have
43 million tiles, 670,000 browse, and 74,000 thumbnails loaded.
This totals 632 gigabytes of user data. Loading is continuing
as more data arrives from the Russian Space Agency (we received .5 sq
tm as of May 7.)
USGS Theme: The USGS
images are stored in a similar way â except that they are compressed
using JPEG (about 10x compression). They arrive on CDs from the
USGS. The slice and dice step produces three products:
Thumbnails:
JPEG images covering a 9 km x 6 km area at 24 meter resolution.
Browse:
JPEG compressed images covering a 3 km x 2 km area at 8 meter resolution.
Tiles:
JPEG images that cover a 375 m x 250 m area at one meter resolution.
Each image, along with it's
meta-data (time, place, instrument, ) is stored in a database
record. Each resolution is stored in a separate table. At
this time we have 19 M tiles, 300,000 browse, and 33,000 thumbnails.
This totals 223 gigabytes of user data. The US is about 9.8 million
square kilometers, so this is about 2% of the US. Important areas
have not yet been published on CD (e.g. LA, San Francisco, New York,
etc...). Fortunately, the Russian SPIN-2 data has good coverage of these
areas. We plan to get 3 TB more of data from the USGS in the near
future.
Loading the Database:
The slice and dice process produces a set of files. We wrote a
load manger that consumes these files and feeds the data into the Terra-Server
using the SQL Server loader (BCP). Using several parallel streams,
we are loading at approximately 1 MBps. At this rate, the load
takes 12 days. The load is more constrained by the scan, slice,
and dice process than by the load rate.
Database Access procedures.
Clients send requests to the Terra-Server's Internet Information Server
(IIS) built into NT. These requests are passed on to stored procedures
in the Terra-Server that decide what tiles are wanted and return them
to the client. For browse images, the server first returns the
HTML for the outer frame, and then returns the 9 tiles that correspond
to the 9 quadrants of the image. When the client pans an image,
only the next 3 tiles are returned.
Site Server: If a client
wants to buy some Imagery from Aerial Images, the client pushes the
buy-me button. Site server uses secure HTTP to authenticate the
user, quote the user a price for the requested data. Users can
buy the right to use a few square kilometers, or they can buy the rights
to an entire region. If the user wants to buy the right use the
high-quality images of a certain area, Site Server validates and debits
the user's credit card. Then Site Server sets up a download of
the image data to the user. This data is watermarked, but not encrypted.
Summary: Terra-Server is a new world atlas â far larger than
any seen before. It is a relatively simple database application,
but it demonstrates how to build a real Internet application using NT
and SQL Server running on Digital Alpha and StorageWorks servers.
Terra-Server Hardware
Processors: Terra-Server
runs on a Digital Alpha 4100 system with 2 GB of memory. This system
has four 430 MHz Digital Alpha processors.
Disks: The system has
three additional storage cabinets each holding 108 4.2 GB drives, for
a total of 324 drives. Their total capacity is 1.36 terabytes.
The drives are configured as RAID5 sets using nine HSZ50 dual-ported
disk controllers. Each controller manages 36 disks on six SCSI
strings. Each controller externalizes these 36 disks as three
RAID5 sets. Windows NT file striping (RAID0) is used to collapse
nine of these disks into one stripe set. The resulting 19 logical drives
are each given a drive letter. SQL Server stripes the database
across these 19 logical drives. This mapping is dictated by the
fault-tolerance properties of the StorageWorks array. The design
masks any single disk fault, masks many string failures, and masks some
controller failures. Spare drives are configured to help availability.
In the end, SQL Server has just over 1 TB of RAID5 protected storage.
The StorageWorks array has been trouble-free.
Tapes: The Terra-Server
has a four-station DLT7000 tape robot with a capacity of over 2 terabytes.
This robot is used for backup and recovery. It is also used to
import data from other sources.
Map Server and Microsoft
Site Server: The Map and Site servers run on a dedicated Digital
Prioris 4-way 200 MHz Intel P6 processors, with 512 MB of DRAM, and
twenty 4 GB disks.
Networking: The Terra-Server
will be behind the Microsoft firewall. It will be connected via
dual 100 Mbps Ethernet to high-speed Internet ports.
Slicing and Dicing:
Most of the Slicing and Dicing of the images was done on a second Digital
4100 with 200GB of disk storage. The load processes deliver new
data from this node to the Terra-Server over a switched 100 Mbps Ethernet.
Image processing by the Russian Space Agency, Aerial Images, and the
University of California at Santa Barbara, used Windows NT/Intel systems.
Digital Equipment Corporation provides all the Terra-Server hardware.
This includes a 324 StorageWorks disk array, a 4-processor AphaServer
4100, and a tape archive. Intel-based servers run the Site server
and map server tasks.
Assessment
The Terra-Server is a simple
application, but it involves many tools: HTML, Java, ActiveX, HTTP,
IIS, ISAPI, ODBC, and SQL. As such, it is a showcase web application.
Once the rough design was chosen, it was fairly easy to design and configure
the database.
We were novices at the many
data formats used in geo-spatial data â but we learned quickly as
we did the slicing and dicing. Working with our geo-spatial-data
mentors at Aerial Images, USGS, and the UCSB Alexandria Digital Library
vastly accelerated this process.
Once the first user-interface
was built, it was clear that we needed a better one. We are now
on the forth iteration. The design is converging, but designing
intuitive user interfaces is one of the most difficult aspects of any
system.
Once we understood the process,
the slicing, dicing, and loading went very smoothly. Although
we were using a pre-Alpha copy of the next SQL Server, it gave us very
few problems. The Digital Alpha and StorageWorks equipment performed
flawlessly. Having great tools makes it possible to experiment.
The SQL, IIS, and Windows NT
management tools were a big asset. Overall, the project was relatively
easy. The one area that gave us difficulty was Backup and Restore
speeds. We are working to improve that aspect of SQL Server.
Terra-Server shows that Windows
NT and Microsoft SQL Server can support huge databases on a single node.
If one wanted to store an atlas of the entire landmass of the planet,
it would be 100 times larger. Clearly, one would use larger disks
(23 GB disks rather than 4 GB disks) and would use a cluster of 20 of
these huge nodes in a design similar to the billion-transactions-per-day
cluster.
Other Scalability Projects
Concurrent with the one-billion-transactions-per-day
effort and the Terra-Server, other Microsoft groups undertook scaleup
projects. This is a brief synopsis of those efforts.
100 M web hits per day:
A single dual processor Windows NT Server running Internet Information
Server was configured with 1,000 virtual roots (web sites) all served
by one web server. Interestingly, each web site models a large
city and uses imagery from Terra-Server as a graphic for the root page.
Clients send HTTP requests to drive this web server. The server
sustains an average rate of one hundred million -web hits per day.
That is twice the hit rate of the www.microsoft.com site. This
shows a breakthrough in web-server performance derived from the combination
of Windows NT Sever and its imbedded Internet Information Server (IIS).
50 GB Mail Server:
Most mail servers have a modest limit on the size of the message store.
Microsoft Exchange Server had a limit of 16 GB. That limit has
been extended to several terabytes. A server with a 50 GB message
store was built (by filling it with Internet news groups).
50,000 POP3 Mail users.
Microsoft Exchange was also stress tested with 50,000 clients sending
5 message per day, receiving 15, and checking mail twice an hour using
the standard POP3 protocol. This ran on a single 4-processor node.
64-bit Windows NT and
SQL Server: Windows NT 5.0 supports 64-bit addressing on the Digital
Alpha and Intel Merced processors. SQL Server has been modified
to exploit the power of massive memory. In particular, a data
analysis scenario using SAS running on SQL server completes within 45
seconds using the Digital Alpha 64-bit addressing, but takes 13 minutes
to complete if the system does not exploit massive main memory.
High Availability SQL
and Windows NT: Fault-tolerance becomes very important when a cluster
grows to have thousands of disks, hundreds of processors and thousands
of applications. Windows NT Enterprise edition adds high-availability
support to Windows NT 4.0. Most BackOffice applications have been
made cluster-aware so that they can failover from one node to another
if the node fails. To demonstrate this, Microsoft worked closely
with SAP to make a high-availability version of R/3. With this
version running on two nodes, one node does the R/3 front-end work,
wile the second node does the back-end SQL work (a classic three-tier
application). If one node fails, the other node performs both
the R3 and the SQL tasks. Users can continue using the R/3 system,
even while the failed node is repaired and brought online. One-minute
application failover times are common now.
Summary
Commodity-scalable servers
have arrived. These scaleable demonstrations show that with
proper design, Windows NT Server, Microsoft SQL Server, and the other
Microsoft BackOffice products can be used to solve the most demanding
problems. They demonstrate the key properties one wants of a scaleable
system:
Scalability â
growth without limits.
Manageability â
as easy to manage as a single system, self-tuning.
Programmability
â easy to build applications.
Availability â
tolerates hardware and software faults.
Affordability â
built from commodity hardware and commodity software components.
These applications demonstrate
the extraordinary performance obtainable with commodity components.
Commodity components give these systems excellent price-performance.
They are a breakthrough in the cost of doing business. The cost
of serving a page onto the Internet, delivering a mail message, or transacting
a bank deposit has gone nearly to zero â it costs a micro-dollar
per transaction. One could use advertising to pay for such
transactions â an advertisement pays a thousand times more per impression.
These applications were each
built by a few people in a few weeks. For both systems, we rewrote
the application in a week. The Terra-Server design keeps evolving.
Both systems are modular, so components are easily rewritten and plugged
into the whole. They demonstrate the incredible power of the new
tools created by the Windows platform and by the Internet. The
resulting applications have easy-to-use management interfaces, and have
the manageability and availability properties essential to operating
them.
Microsoft learned a great deal
in building these proof-of-concept systems. We fixed many performance
bugs and eliminated many system limits in the process of doing the scaling
experiments. There is still much more to do to make these products
even more self-tuning and self-managing. That process is in full
swing now. The next releases of Windows NT Server and Microsoft
BackOffice products will reflect these improvements.
Acknowledgments
The project teams wrote this
document to describe the work in some detail. Robert Barnes led
the billion-transaction-per-day effort, Tom Barclay led the Terra-Server
effort. Many others helped. Key contributors within Microsoft
were Graham Anderson, Mohsen Al Ghosein, Robert Eberl, Jim Ewel, Jim
Gray, Ed Muth, John Nordlinger, Guru Raghavendran, Don Slutz, Greg Smith,
Phil Smoot, James Utzschneider, David Vaskevitch, and Rick Vicik.
Many groups helped with these
scalability efforts. The Intel Scaleable Servers lab in
Portland supported the billion-transactions per day project from its
inception. Stu Goossen and Seckin Unlu from Intel's Systems Software
Lab in Hillsboro and Justin Rattner were especially helpful in loaning
us hardware and giving us advice. Once we had completed the proof-of-concept,
Compaq generously provided us the cluster, described here.
The Terra-Server project was
done jointly with Digital Equipment Corporation and Aerial Images.
The Alpha Serverï division and the Storage Worksï
division were especially helpful. The 324 StorageWorks disks
on the Terra-Server worked on day one and continue to work without event.
That is great news. John Hoffman and Nat Robb III of Aerial Images
gave this project enthusiastic support. SPIN-2, along with Mr.
Mikhail Fromchenko and Dr. Victor Lavrov of Sovinformsputnik (the Russian
Space Agency) contributed the Russian imagery to the project.
Jim Frew, Terry Smith, and others at UC Santa Barbara helped us slice
and dice the USGS imagery. Steve Smyth helped us with the
Encarta World Atlas Gazetteer, and Microsoft's Geo Business Unit built
the Java-based map control for us.
References
[Datamation] Anon, A Measure of Transaction Processing
Power, Datamation, April 1, 1985: Vol. 31, No. 7. p. 112-118
[Gray & Reuter] Gray, J., and
Reuter, A., Transaction Processing Concepts and Techniques, Morgan
Kaufmann, San Francisco, CA. 1994
[IBM] International
Business Machines Annual Report, International Business Machines,
Armonk, N.Y., April, 1996, page 5.
[Samet] Samet, H., The
Design and Analysis of Spatial Data Structures, Addison Wesley,
New York, Jan 1994,
[TPC] TPC-C Executive Summary,
Compaq ProLiant 5000 6/200 Model 1X c/s, 3, March, 1997 http://www.tpc.org/results/individual_results/Compaq/compaq.5000.ms.es.pdf
For the latest information
on Microsoft SQL Server, check out our World Wide Web site at
http://www.microsoft.com/sql
or the Microsoft SQL Server Forum on the Microsoft Network (GO
WORD: MSSQL).
Two Commodity Scaleable Servers: a
Billion-Transactions-per-Day and the Terra-Server