% TEMPLATE for Usenix papers, specifically to meet requirements of
% TCL97 committee.
% originally a template for producing IEEE-format articles using LaTeX.
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
% Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
% both the above pioneers
% use at your own risk. Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point
% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document. Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
% adapted for Ottawa Linux Symposium
\documentclass[twocolumn]{article}
\usepackage{ols,epsfig}
\begin{document}
%Remove this next line if your system defaults correctly.
\special{papersize=8.5in,11in}
%don't want date printed
\date{}
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf Linux Kernel SCTP : The Third Transport}
%for single author (just remove % characters)
\author{
La~Monte H.P.\ Yarroll \\
%{\em Your Department} \\
{\em Motorola GTSS}\\
%{\em Your City, State, ZIP}\\
% is there a standard format for email/URLs??
% remember that ~ doesn't do what you expect, use \~{}.
{\normalsize piggy@acm.org} \\
%
% copy the following lines to add more authors
\and
Karl Knutson \\
{\em Motorola GTSS}\\
%% is there a standard format for email/URLs??
{\normalsize karl@athena.chicago.il.us}
%
} % end author
\maketitle
% You have to do this to suppress page numbers. Don't ask.
\thispagestyle{empty}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\subsection*{Abstract}
% nuked italics -- FD
The Stream Control Transmission Protocol (SCTP) is a reliable
message-oriented protocol with transparent support for multihoming.
It allows multiple independent complex exchanges which all share a
single connection and congestion context.
We provide an overview of the protocol, the UDP-style API and the
details of the Linux kernel reference implementation. The brief API
discussion is intended for developers wishing to use SCTP. The
detailed implementation discussion is for developers interested in
contributing to the kernel development effort.
\section{Introduction}
The developers at the Linux 2.5 Kernel Summit in San Jose achieved a
rough consensus that 2.5 should probably support SCTP, a new transport
protocol from the IETF. This paper introduces the ongoing work on
such an implementation, providing some details for both the
application developer and the kernel developer.
The Stream Control Transmission Protocol (SCTP) is a reliable
message-oriented protocol with transparent support for multihoming.
It allows multiple independent complex exchanges which all share a
single connection and congestion context.
\subsection{History of SCTP}
The SIGTRAN (Signalling Transport) Working Group of the IETF is
concerned with the transport of telephony signalling data over IP.
Upon reviewing the available standard transport protocols, they
concluded that none of them met the transport requirements of
signalling data.
SIGTRAN concluded that they needed a new transport protocol which
could provide reliable message delivery, tolerate network failures,
and avoid the head-of-line-blocking problem. We will discuss this
problem later.
The WG selected a proposal from Randall Stewart and Qiaobing Xie of
Motorola as a starting point. Stewart and Xie had developed a
Distributed Processing Environment, Quantix, aimed at telephony
applications. This DPE had been successfully demonstrated at Geneva
Telecom in 1999.
The Working Group took great care in constructing the new protocol,
SCTP, incorporating many lessons learned from TCP, such as congestion
control, selective ACK, message fragmentation and bundling.
The core transport protocol from Quantix brought support for
multihoming, message framing, and streams. We discuss all of these
features at length later.
The IESG decided that the resulting protocol was robust enough to be
elevated from a specialised transport for telephony signalling to a new
general purpose transport to stand beside UDP and TCP. To this end,
they moved the work from SIGTRAN to TSVWG, the general transport
group.
As of this writing, the core specification, \cite{rfc2960}, is at Proposed
Standard. There have been three successful bakeoffs covering over 25
separate implementations. Lessons learned from the most recent
bakeoff are being written up in an ``Implementor's Guide'', \cite{impl}.
\subsection{SCTP in the Linux kernel}
Shortly before the first bakeoff, the IESG asked SIGTRAN to move SCTP
from riding on UDP to riding directly on top of IP. The long term
goal was clearly to move SCTP from user space into the kernel.
Aside from the obvious performance gains, this has the effect of
reducing the number of implementations to roughly one per operating
system. This makes it easier to verify the stability of most of the
implementations which appear on the Internet.
Randall Stewart saw the importance of this and started one of the
authors of this paper working on a port of the user space
implementation to the Linux kernel. This port was intended as a
reference for developers of implementations for other kernels to
examine. The Linux kernel implementation has since diverged
significantly from the user space reference, but maintains the
standards of a reference implementation (see Coding Standards, below).
\subsection{SCTP examples}
SCTP is a reliable message-oriented protocol with transparent support
for multihoming. It allows multiple independent complex exchanges
which all share a single connection and congestion context.
Many network applications operate by continuously exchanging many short,
similar sequences of data at the same time. The traffic produced by these
operations can be characterised as MICE (Multiple Independent Complex
Exchanges). It is also true that many applications which use MICE
also have high network reliability requirements.
\subsubsection{A database app}
One example is a client/server database application. Each request and
each response is a message. Each transaction is a sequence of
dependent request/response pairs.
Implemented over TCP, this application would have to provide its own
message boundaries, since TCP sends bytes, not messages. How do we
implement MICE with TCP? We have two ways of doing this: multiple
connections, or a single multiplexed and reused connection.
With each transaction over a separate TCP connection, we gain the
independence of transactions, but at a cost in performance. Since TCP
(as a general purpose transport protocol) uses congestion control,
each of the connections would have to go through slow-start and if
most transactions were short, they would never get out of slow-start.
With all transactions over a single TCP connection, we make efficient
use of the network bandwidth, but open ourselves up to the
head-of-line blocking problem. This means that if one segment in one
transaction is lost, this blocks all transactions, not just the one
with the lost segment.
If we use SCTP for the same application we gain the benefits of using
TCP, as well as advantages peculiar to SCTP. SCTP directly supports
messages and guarantees TCP-like levels of bandwidth efficiency via
bundling and fragmentation. Each database transaction can be
represented as an ordered stream of messages, which are independent in
SCTP for retransmission purposes. This means that while SCTP has the
same congestion control mechanisms as TCP, it does not have to resort
to multiple connections nor is it vulnerable to the head-of-line
blocking problem.
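The framing burden carried by the TCP variant can be made concrete
with a small sketch. It assumes a hypothetical wire format (a 4-byte
big-endian length followed by the payload). The names
\texttt{frame\_put()} and \texttt{frame\_get()} are illustrative only
and come from neither the application nor any library. With SCTP this
layer is unnecessary because the transport preserves message
boundaries.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing: 4-byte big-endian
 * length, then the payload. */
#define FRAME_HDR 4

/* Append one framed message to out[].
 * Returns bytes written, 0 if it does not fit. */
size_t frame_put(uint8_t *out, size_t cap,
                 const void *msg, uint32_t len)
{
    assert(out != NULL && msg != NULL);
    if (cap < FRAME_HDR || len > cap - FRAME_HDR)
        return 0;
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + FRAME_HDR, msg, len);
    return FRAME_HDR + (size_t)len;
}

/* Extract one message from in[0..avail).
 * Returns bytes consumed, or 0 while the
 * stream does not yet hold a whole frame. */
size_t frame_get(const uint8_t *in, size_t avail,
                 const uint8_t **msg, uint32_t *len)
{
    uint32_t n;

    if (avail < FRAME_HDR)
        return 0;
    n = ((uint32_t)in[0] << 24) |
        ((uint32_t)in[1] << 16) |
        ((uint32_t)in[2] << 8) |
        (uint32_t)in[3];
    if (avail - FRAME_HDR < n)
        return 0;
    *msg = in + FRAME_HDR;
    *len = n;
    return FRAME_HDR + (size_t)n;
}
\end{verbatim}
}
A receiver would call \texttt{frame\_get()} on the bytes accumulated
from the TCP connection and wait for more data whenever it returns 0.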
\subsubsection{A free clinic}
Another example of SCTP use is for a free\footnote{Free as in ``free
beer''.} clinic which needs a reliable way to use its IP-networked
patient monitoring software.
This has many similarities to the example above in that different
monitoring devices would need to send simultaneous
information---multiple independent complex exchanges. The main
difference is in the higher network reliability requirements.
A reasonable way to improve the network reliability is to set up a
parallel network and use multihoming for the client and server
applications. However, if the application is TCP-based, the
multihoming needs to be added to the application. With SCTP, the
multihoming ability is built into the protocol. All that is necessary
is to make the appropriate socket calls and SCTP will take advantage
of the addresses available in the existing network. This also applies
if one side of the connection has more addresses than the other.
\section{The UDP-style API}
Any new protocol needs an API. In particular for an Internet
protocol, it's important to have the API match the API normally used
for IP networks. This is the Berkeley sockets model---the SCTP
version is defined in the Internet Draft ``Sockets API Extensions for
SCTP''~\cite{api}. The API draft defines two complementary interfaces
to SCTP: one for compatibility with older TCP-based applications, and
another for new applications designed expressly to use SCTP. The
Linux Kernel SCTP stack does not yet implement the former, so we
discuss only the UDP-style interface.
The conceptual model of the UDP-style API is (naturally) that of plain
UDP. To send a message in UDP, you create a socket, bind an address
to it and send your message using \texttt{sendmsg()}. To receive a
message in UDP, you create a socket, bind an address to it and use
\texttt{recvmsg()}. It's much the same with the UDP-style API for
SCTP. To send a message, you create a socket, bind \textit{addresses}
to it and use \texttt{sendmsg()}. The SCTP stack underlying the API
handles association startup and shutdown automatically. The same goes
for message reception. To receive a message in UDP-style, you create
a socket, bind \textit{addresses} to it and use \texttt{recvmsg()}.
The important API differences between UDP and UDP-style SCTP are:
multihoming; ancillary data; and the option of notifications from the
SCTP stack.
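The send sequence described above might look roughly as follows. This
is a minimal sketch, not the lksctp interface: the socket type
\texttt{SOCK\_SEQPACKET} is an assumption borrowed from the later
standardised one-to-many API (RFC~6458) and may differ from the draft
discussed here, and \texttt{fill\_msghdr()} and \texttt{send\_one()}
are hypothetical helper names. The call simply fails on a kernel
without SCTP support.
{\tt \small
\begin{verbatim}
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Point a msghdr at one destination and
 * one data buffer. */
void fill_msghdr(struct msghdr *mh,
                 struct iovec *iov,
                 struct sockaddr_in *to,
                 void *buf, size_t len)
{
    assert(mh != NULL && iov != NULL && to != NULL);
    memset(mh, 0, sizeof(*mh));
    iov->iov_base = buf;
    iov->iov_len = len;
    mh->msg_name = to;
    mh->msg_namelen = sizeof(*to);
    mh->msg_iov = iov;
    mh->msg_iovlen = 1;
}

/* Create, bind, send one message, close.
 * Returns -1 on any failure. */
ssize_t send_one(const char *ip, uint16_t port,
                 void *buf, size_t len)
{
    struct sockaddr_in local, to;
    struct sockaddr *sa = (struct sockaddr *)&local;
    struct msghdr mh;
    struct iovec iov;
    ssize_t n = -1;
    int sd;

    /* Socket type is an assumption, see text. */
    sd = socket(AF_INET, SOCK_SEQPACKET,
                IPPROTO_SCTP);
    if (sd < 0)
        return -1;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    memset(&to, 0, sizeof(to));
    to.sin_family = AF_INET;
    to.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &to.sin_addr) == 1 &&
        bind(sd, sa, sizeof(local)) == 0) {
        fill_msghdr(&mh, &iov, &to, buf, len);
        n = sendmsg(sd, &mh, 0);
    }
    close(sd);
    return n;
}
\end{verbatim}
}
Note that the sketch never calls \texttt{connect()} or
\texttt{accept()}; association startup is left to the stack, as the
UDP-style model intends.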
\subsection{Multihoming and \texttt{bindx()}}
There are three ways to work with multihoming with SCTP. One is to
ignore multihoming and use one address. Another way is to bind all
your addresses through the use of \texttt{INADDR\_ANY} or
\texttt{IN6ADDR\_ANY}. This will ``associate the endpoint with the
optimal subset of available local interfaces'' (Section 3.1.2,
\cite{api}). The most flexible way is through the use of
\texttt{sctp\_bindx()}, which allows additional addresses to be
added to a socket after the first one is bound with \texttt{bind()},
but before the socket is used to transfer or receive data. The
function \texttt{sctp\_bindx()} is further described in section 8.1 of
\cite{api}.
\subsection{Ancillary data}
To use streams with the UDP-style API, you use ancillary data in the
\texttt{struct~cmsghdr} part of the \texttt{struct~msghdr} argument to
both \texttt{sendmsg()} and \texttt{recvmsg()}. Ancillary data is
used for initialisation data (\texttt{struct~sctp\_initmsg}) and for
header data (\texttt{struct~sctp\_sndrcvinfo}).
Ancillary data is manipulated with the macros \texttt{CMSG\_FIRSTHDR,
CMSG\_NXTHDR, CMSG\_DATA, CMSG\_SPACE, \textnormal{and} CMSG\_LEN}.
These are all defined in \cite{rfc2292}. \cite{api} provides a nice
example in section 5.4.2.
{\tt \small
\begin{verbatim}
struct sctp_initmsg {
uint16_t sinit_num_ostreams;
uint16_t sinit_max_instreams;
uint16_t sinit_max_attempts;
uint16_t sinit_max_init_timeo;
};
\end{verbatim}
}
The initialisation ancillary data sets information for starting
new associations.
{\tt \small
\begin{verbatim}
struct sctp_sndrcvinfo {
uint16_t sinfo_stream;
uint16_t sinfo_ssn;
uint16_t sinfo_flags;
uint32_t sinfo_ppid;
uint32_t sinfo_context;
uint8_t sinfo_dscp;
sctp_assoc_t sinfo_assoc_id;
};
\end{verbatim}
}
The header ancillary data reports information gleaned from the SCTP
headers. If requested with the \texttt{SCTP\_RECVDATAIOEVNT} socket
option, this ancillary data is provided with every inbound data
message. There is a handy key (\texttt{sinfo\_assoc\_id}) which
identifies the association for this particular message. It also
provides the flags needed to implement partial delivery of very large
messages.
Outbound messages should include an \texttt{sctp\_sndrcvinfo} ancillary
data structure to tell SCTP which SCTP stream to put this datagram
into. It is also possible to set a default stream so that this
ancillary data may be omitted.
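As a rough illustration of how the \texttt{CMSG\_} macros fit
together, the following sketch attaches a header cmsg that selects a
stream and then reads it back. To stay compilable without an SCTP
header it uses stand-ins which are assumptions of this example:
\texttt{struct~x\_sndrcvinfo} is a trimmed copy of the structure shown
above, and \texttt{X\_SCTP\_SNDRCV} is a made-up cmsg type value. Real
code would take both from the SCTP API header.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-ins, see text. */
#define X_SCTP_SNDRCV 1
struct x_sndrcvinfo {
    uint16_t sinfo_stream;
    uint16_t sinfo_ssn;
    uint16_t sinfo_flags;
    uint32_t sinfo_ppid;
    uint32_t sinfo_context;
};

/* Attach one header cmsg selecting `stream`.
 * ctl must be suitably aligned and hold at
 * least CMSG_SPACE(sizeof info) bytes. */
int set_stream(struct msghdr *mh, void *ctl,
               size_t ctl_len, uint16_t stream)
{
    struct x_sndrcvinfo info;
    struct cmsghdr *c;

    assert(mh != NULL && ctl != NULL);
    if (ctl_len < CMSG_SPACE(sizeof(info)))
        return -1;
    memset(ctl, 0, ctl_len);
    mh->msg_control = ctl;
    mh->msg_controllen = CMSG_SPACE(sizeof(info));

    c = CMSG_FIRSTHDR(mh);
    c->cmsg_level = IPPROTO_SCTP;
    c->cmsg_type = X_SCTP_SNDRCV;
    c->cmsg_len = CMSG_LEN(sizeof(info));

    memset(&info, 0, sizeof(info));
    info.sinfo_stream = stream;
    memcpy(CMSG_DATA(c), &info, sizeof(info));
    return 0;
}

/* Walk the ancillary data; return the stream
 * number, or -1 if no header cmsg is found. */
int get_stream(struct msghdr *mh)
{
    struct x_sndrcvinfo info;
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(mh); c != NULL;
         c = CMSG_NXTHDR(mh, c)) {
        if (c->cmsg_level == IPPROTO_SCTP &&
            c->cmsg_type == X_SCTP_SNDRCV) {
            memcpy(&info, CMSG_DATA(c),
                   sizeof(info));
            return info.sinfo_stream;
        }
    }
    return -1;
}
\end{verbatim}
}
The payload is copied with \texttt{memcpy()} rather than accessed
through a cast, which sidesteps alignment questions about the data
area.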
\subsection{Notifications}
SCTP provides for the concept of optional notifications. These are
messages delivered in-band about events inside the SCTP stack, such as
a destination transport address failure or a new association coming
up. The notifications are marked with the \texttt{MSG\_NOTIFICATION}
flag in the \texttt{msg\_flags} field of the \texttt{struct~msghdr}
passed to \texttt{recvmsg()}. The notification itself is delivered as
the body of the returned message.
Figure~\ref{notification} lists the notifications. Each
notification delivers its own data structure which shares the same
name (lower case, naturally) as the notification type itself. The first
field of every notification is a \texttt{uint16\_t} which carries the
notification type.
\begin{figure*}[t]
\begin{center}
{\tt
\begin{tabular}{ l l l }
\hline
\textnormal{Type} & \textnormal{Socket Option} & \textnormal{Description} \\
\hline
SCTP\_ASSOC\_CHANGE & SCTP\_RECVASSOCEVNT & \textnormal{Change of association} \\
SCTP\_PEER\_ADDR\_CHANGE & SCTP\_RECVADDREVNT & \textnormal{Change in status of a given address} \\
SCTP\_REMOTE\_ERROR & SCTP\_RECVPEERERR & \textnormal{An error received from a peer} \\
SCTP\_SEND\_FAILED & SCTP\_RECVSENDFAILEVNT & \textnormal{A failure to send} \\
SCTP\_SHUTDOWN\_EVENT & SCTP\_RECVDOWNEVNT & \textnormal{The reception of a \texttt{SHUTDOWN} chunk} \\
\end{tabular}
}
\end{center}
\caption{\label{notification}Useful notifications for an SCTP socket}
\end{figure*}
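Because every notification begins with a \texttt{uint16\_t} type, a
receiver can dispatch on that field alone. The sketch below shows one
way to do so. The numeric type codes are hypothetical placeholders
(real values come from the API header), and
\texttt{notification\_name()} is an illustrative helper, not part of
the API.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type codes; only the names
 * follow the notification table. */
enum {
    X_SCTP_ASSOC_CHANGE = 1,
    X_SCTP_PEER_ADDR_CHANGE,
    X_SCTP_REMOTE_ERROR,
    X_SCTP_SEND_FAILED,
    X_SCTP_SHUTDOWN_EVENT
};

/* Label a notification body as delivered by
 * recvmsg(). Only the leading type field is
 * examined. */
const char *notification_name(const void *body,
                              size_t len)
{
    uint16_t type;

    assert(body != NULL);
    if (len < sizeof(type))
        return "truncated";
    /* memcpy avoids a misaligned read. */
    memcpy(&type, body, sizeof(type));

    switch (type) {
    case X_SCTP_ASSOC_CHANGE:
        return "assoc change";
    case X_SCTP_PEER_ADDR_CHANGE:
        return "peer addr change";
    case X_SCTP_REMOTE_ERROR:
        return "remote error";
    case X_SCTP_SEND_FAILED:
        return "send failed";
    case X_SCTP_SHUTDOWN_EVENT:
        return "shutdown";
    default:
        return "unknown";
    }
}
\end{verbatim}
}
An application would call this only for messages that
\texttt{recvmsg()} has flagged with \texttt{MSG\_NOTIFICATION}, and
would then cast the body to the matching structure.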
\section{The lksctp Project}
A critical factor in the success of any new IETF protocol is of course
a Linux implementation. Fortunately, key personnel at Motorola
recognised this and encouraged us to tackle such a project. Months
later, we have a core implementation with an ever-expanding feature
set. We now have significant participation from developers at IBM and
Intel and the pace is picking up.
\subsection{Coding standards}
In addition to the usual requirements of kernel code, our code seeks
to be a useful reference for people making their own kernel
implementations of SCTP. If a reader has some question about how to
implement a particular section of the RFC, they need only grep for the
relevant text in our code and they can find an example. As much as
practical, we draw names directly from the RFC. We made the state
machine into an explicit table (see Figure~\ref{states} for an
excerpt) with names that refer directly back to the relevant section
numbers. Clarity is a compelling requirement for our code.
\subsection{Extreme Programming}
As the project grew and we added developers, we clearly needed some
way of coordinating our work. We decided to experiment with Extreme
Programming, \cite{xp}.
XP is a collection of practices aimed at controlling risk in a small
to medium-sized software development project. One important principle
is that you should do the simplest thing that could possibly work. A
second important principle is to take advantage of the fact that
programmers like to code.
We use a range of XP practices, but the practices which are most
visible to anybody who reads or works on lksctp are the tests and the
metaphors.
\section{The Tests}
One of the XP practices we use is code-to-the-test. XP asks, ``If
testing is good, why don't we do it all the time?'' Instead of writing
tests for working code, write tests first, and then write code to pass
the tests. This practice leads to a large automated test suite which
runs several times per day.
We use three kinds of test: unit tests, test frame functional tests,
and live kernel functional tests.
The most basic form of test is the unit test. Unit tests exercise all
the interfaces of a particular object and confirm that it behaves
correctly. They also encode regression checks for fixed bugs. These
tests all have names beginning with \texttt{test\_}.
The second form of test is the test frame functional test. These are
the tests with names beginning with \texttt{ft\_frame\_}. These tests
check for external behaviours of the system, but with a simulated
kernel. The simulated kernel is very lightweight and gives us very
fine control over things like timing and network properties.
Ideally, functional tests should be written by the customer for a
system---they encode the behaviours that the customer expects. In our
case, we play the role of customer on behalf of the RFC. We also use
test frame functional tests to define work items for off-site
development groups. The off-site group writes tests which describe the
feature they intend to implement and submits those tests as a
proposal. This has proven an excellent medium for describing work.
The final form of test we use is the live kernel functional test. We
have many fewer of these than we would like---they are difficult to run
since we must install and boot a kernel to test. This is much more
work than simply running \texttt{make unit\_test}. We are exploring
User-Mode Linux (UML) as
a possible way to automate our kernel functional tests. These tests
have names beginning with \texttt{ft\_kern\_}.
Code-to-the-test is a practice which you can introduce at any point in
a project. When you first start, it seems that you are spending more
time writing tests than writing code, but once you begin to have a
critical mass of interacting tests you begin to see significant
payoffs in both code quality and development velocity.
We have had several incidents where interactions between unit tests
and functional tests have uncovered complementary masking bugs.
Tests are not a substitute for understanding code---they are a
mechanism for encoding that understanding to share with other
developers, including future versions of yourself. You can learn
nearly as much about our code by reading our tests as by reading the
code itself.
Lately, we have begun using functional tests to encode major bugs.
These are among the best of all possible bug reports---they describe
the failure precisely and tell exactly when the problem is gone.
After the bugs are fixed the tests serve as part of the regression
suite.
\section{The Metaphors}
XP projects are built around a unifying metaphor rather than an
elaborate architecture. In our case, we chose two metaphors which
could serve quite well for nearly any protocol development project.
Our metaphors are the state machine and the smart pipe. Most readers
are probably familiar with the state machine, but the smart pipe is a
twist on a familiar concept. The idea behind a smart pipe\footnote{An
alternate term may be ``oven''.} is that raw stuff goes in one end and
cooked stuff comes out the other end.
\subsection{The State Machine}
The state machine in our implementation is quite literal. We have an
explicit state table which keys to specific state functions which are
tied directly back to parts of the RFC. The core of the state machine
(found in \texttt{sctp\_do\_sm()}) is almost purely functional---only header
conversions are permitted. Each state function produces a description
of the side effects (in the form of a \texttt{struct~sctp\_sm\_retval})
needed to handle the particular event. A separate side effect
processor, \texttt{sctp\_side\_effects()}, converts this structure into
actions.
Events fall into four categories. The RFC is very explicit about
state transitions associated with arriving chunks. The RFC discusses
transitions due to primitive requests from upper layers, but many of
these are implementation dependent. The third category of events is
timeouts. The final category is a catch-all for odd events like
queues emptying.
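The separation between a functional core and a side effect processor
can be sketched in miniature. Everything named here is hypothetical
and far smaller than the real \texttt{sctp\_do\_sm()},
\texttt{struct~sctp\_sm\_retval}, and \texttt{sctp\_side\_effects()}.
Only the shape follows the description: state functions return a
description of what should happen, and a separate routine is the only
code that changes the association.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

enum state { ST_CLOSED, ST_COOKIE_WAIT,
             ST_COOKIE_ECHOED, ST_ESTABLISHED,
             ST_MAX };
enum event { EV_PRM_ASSOCIATE, EV_INIT_ACK,
             EV_MAX };
enum action { ACT_NONE, ACT_SEND_INIT,
              ACT_SEND_COOKIE_ECHO,
              ACT_DISCARD, ACT_ERROR };

/* Description of side effects; state
 * functions only fill it in. */
struct sm_retval {
    enum state next;
    enum action action;
};

typedef struct sm_retval (*state_fn)(enum state);

static struct sm_retval sm_associate(enum state s)
{
    (void)s;
    return (struct sm_retval){ ST_COOKIE_WAIT,
                               ACT_SEND_INIT };
}
static struct sm_retval sm_init_ack(enum state s)
{
    (void)s;
    return (struct sm_retval){ ST_COOKIE_ECHOED,
                               ACT_SEND_COOKIE_ECHO };
}
static struct sm_retval sm_discard(enum state s)
{
    return (struct sm_retval){ s, ACT_DISCARD };
}
static struct sm_retval sm_error(enum state s)
{
    return (struct sm_retval){ s, ACT_ERROR };
}

/* Explicit table: one row per event, one
 * column per state. */
static const state_fn table[EV_MAX][ST_MAX] = {
    [EV_PRM_ASSOCIATE] = { sm_associate, sm_error,
                           sm_error, sm_error },
    [EV_INIT_ACK] = { sm_discard, sm_init_ack,
                      sm_discard, sm_discard },
};

/* Pure lookup: nothing is modified here. */
struct sm_retval do_sm(enum state cur, enum event ev)
{
    assert(cur < ST_MAX && ev < EV_MAX);
    return table[ev][cur](cur);
}

struct assoc { enum state state; int packets_sent; };

/* The only place that changes the association. */
void side_effects(struct assoc *a, struct sm_retval r)
{
    a->state = r.next;
    if (r.action == ACT_SEND_INIT ||
        r.action == ACT_SEND_COOKIE_ECHO)
        a->packets_sent++; /* stand-in for sending */
}
\end{verbatim}
}
Keeping \texttt{do\_sm()} free of side effects means a test can check
any table entry by comparing the returned structure, without setting
up sockets or timers.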
\begin{figure*}[t]
\begin{center}
{\tt
\begin{tabular}{ l l l l l }
\hline
\textnormal{State:} & CLOSED & COOKIE-WAIT & COOKIE-ECHOED & ESTABLISHED \\
\hline
\hline
\textnormal{Chunks} & & & & \\
\hline
INIT & do\_5\_1B\_init & do\_5\_2\_1\_siminit & do\_5\_2\_1\_siminit & do\_5\_2\_2\_dupinit \\
INIT ACK & discard(5.2.3) & do\_5\_1C\_ack & discard(5.2.3) & discard(5.2.3) \\
COOKIE ECHO & do\_5\_1D\_ce & do\_5\_2\_4\_dupcook& do\_5\_2\_4\_dupcook& do\_5\_2\_4\_dupcook \\
COOKIE ACK & discard & discard(5.2.5) & do\_5\_1E\_ca & discard(5.2.5) \\
DATA & tabort\_8\_4\_8 & discard(6.0) & discard(6.0) & eat\_data\_6\_2 \\
SACK & tabort\_8\_4\_8 & discard(6.0) & eat\_sack\_6\_2\_1 & eat\_sack\_6\_2\_1 \\
\hline
\textnormal{Timeouts} & & & & \\
\hline
T1-INIT TO & bug & do\_4\_2\_reinit & bug & bug \\
T3-RTX TO & bug & bug & do\_6\_3\_3\_retx & do\_6\_3\_3\_retx \\
\hline
\textnormal{Primitives} & & & & \\
\hline
PRM\_ASSOCIATE & do\_PRM\_ASOC & error & error & error \\
PRM\_SEND & error & do\_PRM\_SENDQ6.0 & do\_PRM\_SENDQ6.0 & do\_PRM\_SEND \\
\end{tabular}
}
\end{center}
\caption{\label{states}Portion of SCTP state table showing association initialisation}
\end{figure*}
In order to create an explicit state machine, it was necessary to
first create an explicit state table. The process of creating this
table uncovered a few minor contradictions in one of the drafts of the
RFC. These mostly involved conflicting catch-all cases. In Figure~\ref{states}
we have an excerpt which shows the state functions involved in
initialising a new association.
\subsection{The Smart Pipes}
Each smart pipe has one or more structures which define its internal
data, and a set of functions which define its external interactions.
In this respect these smart pipes can be considered a type of object,
in the OO sense. All of these definitions can be found in the include
file \texttt{<net/sctp/sctpStructs.h>}.
Most of our smart pipes have push inputs---external objects explicitly
put things in by calling methods directly. A pull input is
possible---the smart pipe would need to have a way to register a
callback function which can fetch more input in response to some other
stimulus.
Some of our pipes use pull outputs. For example, \texttt{SCTP\_ULPqueue} passes
data and notifications up the protocol stack through explicit calls to
the socket functions, usually \texttt{recvmsg(2)}. Some of our smart
pipes use push outputs. For example, \texttt{SCTP\_outqueue} has a set of
callback functions which it invokes when it needs to send chunks out
toward the wire.
There are four smart pipes in lksctp. They are
\texttt{SCTP\_inqueue}, \texttt{SCTP\_ULPqueue},
\texttt{SCTP\_outqueue}, and \texttt{SCTP\_packet}. The first two
carry information up the stack from the wire to the user; the second
two carry information back down the stack.
\subsubsection{\texttt{SCTP\_inqueue}}
\texttt{SCTP\_inqueue} accepts packets and provides chunks. It is
responsible for reassembling fragments, unbundling, tracking received
TSNs for acknowledgement, and managing rwnd for congestion control.
There is an \texttt{SCTP\_inqueue} for each endpoint (to handle chunks
not related to a specific association) and one for each association.
The function \texttt{sctp\_v4\_rcv()} (which is the receiving function
for SCTP registered with IPv4) calls \texttt{sctp\_push\_inqueue()} to
push packets into the input queue for the appropriate association or
endpoint. The function \texttt{sctp\_push\_inqueue()} schedules
either \texttt{sctp\_bh\_rcv\_asoc()} or \texttt{sctp\_bh\_rcv\_ep()}
on the immediate queue to complete delivery. These functions call
\texttt{sctp\_pop\_inqueue()} to pull data out of the
\texttt{SCTP\_inqueue}. This function does most of the work for this
smart pipe.
The functions \texttt{sctp\_bh\_rcv\_ep()} and
\texttt{sctp\_bh\_rcv\_asoc()} run the state machine on incoming
chunks. Among many other side effects, the state machine can generate
events for an upper-layer-protocol (ULP), and/or chunks to go back out
on the wire.
\subsubsection{\texttt{SCTP\_ULPqueue}}
\texttt{SCTP\_ULPqueue} is the smart pipe which accepts events (either
user data messages or notifications) from the state machine and
delivers them to the ULP through the sockets layer. It is responsible
for delivering streams of messages in order. There is one
\texttt{SCTP\_ULPqueue} for every endpoint, but this is likely to
change at some point to one \texttt{SCTP\_ULPqueue} for each socket.
This smart pipe uses a data structure distributed between the
\texttt{struct~SCTP\_endpoint} and the
\texttt{struct~SCTP\_association}.
The state machine, \texttt{sctp\_do\_sm()}, pushes data into an
\texttt{SCTP\_ULPqueue} by calling
\texttt{sctp\_push\_chunk\_ULPqueue()}. It pushes notifications with
\texttt{sctp\_push\_event\_ULPqueue()}. The sockets layer extracts
events from an \texttt{SCTP\_ULPqueue} with
\texttt{sctp\_pop\_ULPqueue()}.
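In-order delivery per stream can be illustrated with sequence numbers
alone. The following sketch is an assumption-laden simplification, not
the lksctp data structure: it tracks which stream sequence numbers
have arrived inside a hypothetical 32-message window and reports how
many messages become deliverable, whereas the real queue holds the
messages themselves. The names \texttt{mini\_stream} and
\texttt{mini\_stream\_push()} are invented for this example.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINI_WINDOW 32

struct mini_stream {
    uint16_t next_ssn; /* next SSN to deliver */
    uint32_t held;     /* bit i: next_ssn + i
                        * has arrived */
};

/* Record the arrival of `ssn` on stream s.
 * Returns the number of messages that are
 * now deliverable in order, or -1 if ssn
 * lies outside the window (too old or too
 * far ahead). */
int mini_stream_push(struct mini_stream *s,
                     uint16_t ssn)
{
    uint16_t off;
    int delivered = 0;

    assert(s != NULL);
    /* Modulo-65536 distance handles wrap. */
    off = (uint16_t)(ssn - s->next_ssn);
    if (off >= MINI_WINDOW)
        return -1;
    s->held |= (uint32_t)1 << off;
    while (s->held & 1u) {
        s->held >>= 1;
        s->next_ssn++;
        delivered++;
    }
    return delivered;
}
\end{verbatim}
}
Each stream keeps its own \texttt{struct~mini\_stream}, so a gap on
one stream holds back only that stream. This is the property that
lets SCTP avoid the head-of-line blocking described earlier.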
\subsubsection{\texttt{SCTP\_outqueue}}
\texttt{SCTP\_outqueue} is responsible for bundling logic, transport
selection, outbound congestion control, fragmentation, and any
necessary data queueing. It knows whether or not data can go out onto
the wire yet. With one exception noted below, every outbound chunk
goes through an \texttt{SCTP\_outqueue} attached to an association.
The state machine injects chunks into an \texttt{SCTP\_outqueue} with
\texttt{sctp\_push\_outqueue()}. They automatically push out the other
end through a small set of callbacks which are normally attached to an
\texttt{SCTP\_packet}.
The state machine is capable of putting a fully-formed packet directly
on the wire. At this point only \texttt{ABORT} uses this feature. It is
likely that we will refactor \texttt{INIT ACK} generation again to use
this feature.
\subsubsection{\texttt{SCTP\_packet}}
An \texttt{SCTP\_packet} is a lazy packet transmitter associated with a
specific transport. The upper layer pushes data into the packet,
usually with \texttt{sctp\_transmit\_chunk()}. The packet blindly
bundles the chunks. If it fills (hits the PMTU for its transport),
it transmits the packet to make room for the new chunk.
\texttt{SCTP\_packet} rejects packets which need fragmenting. It is
possible to force a packet to transmit immediately with
\texttt{sctp\_transmit\_packet()}. \texttt{SCTP\_packet} tracks the
congestion counters, but handles none of the congestion logic.
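The lazy bundling behaviour can be sketched with byte counts only.
This is a hypothetical miniature, not the lksctp code: the
\texttt{mini\_} names and result codes are invented, header overhead
is ignored, and we assume that a chunk larger than the PMTU is what
gets rejected for fragmentation.
{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

struct mini_packet {
    size_t pmtu;     /* room for this transport */
    size_t used;     /* bytes bundled so far */
    int transmitted; /* packets put on the wire */
};

enum xmit_result { XMIT_OK, XMIT_PMTU_FULL,
                   XMIT_TOO_BIG };

/* Force out whatever is bundled so far. */
void mini_transmit_packet(struct mini_packet *p)
{
    assert(p != NULL);
    if (p->used > 0) {
        p->transmitted++;
        p->used = 0;
    }
}

/* Bundle one chunk. A packet without room is
 * flushed first; a chunk that could never
 * fit is rejected so the caller can
 * fragment it. */
enum xmit_result
mini_transmit_chunk(struct mini_packet *p,
                    size_t chunk_len)
{
    enum xmit_result r = XMIT_OK;

    assert(p != NULL);
    if (chunk_len > p->pmtu)
        return XMIT_TOO_BIG;
    if (p->used + chunk_len > p->pmtu) {
        mini_transmit_packet(p);
        r = XMIT_PMTU_FULL;
    }
    p->used += chunk_len;
    return r;
}
\end{verbatim}
}
The packet makes no decision about whether sending is allowed. In
this sketch, as in the description above, that logic belongs to the
caller.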
\section{More Data Structures}
Not everything is a state table or a smart pipe---after all, this is
the kernel and we ARE programming in C. Here again, we have followed
the RFC very closely. Most of the key concepts in the RFC manifest
themselves as explicit data structures. For convenience, we refer to
these data structures as ``nouns''.
Nearly all of the ``noun'' structures are designed for use with the
\texttt{sk\_buff} macros for list manipulation. These macros provide a
doubly-linked list with locking.
\subsection{\texttt{struct~SCTP\_proto}}
The entire lksctp universe is grounded in an instance of
\texttt{struct~SCTP\_proto} accessible through \texttt{sctp\_get\_protocol()}.
This structure holds system-wide defaults for things like the maximum
number of permitted retransmissions. It contains a list of all
endpoints on the system.
\subsection{\texttt{struct~SCTP\_endpoint}}
Each UDP-style SCTP socket has an endpoint, represented as a
\texttt{struct~SCTP\_endpoint}. Once we implement high-bandwidth sockets and
TCP-style sockets, it will be possible for multiple sockets to share a
single endpoint structure. The endpoint structure contains a local
SCTP port number and a list of local IP addresses. These two items
define the endpoint uniquely. In addition to endpoint-wide default
values and statistics, the endpoint maintains a list of associations.
\subsection{\texttt{struct~SCTP\_association}}
Each association structure (\texttt{struct~SCTP\_association}) is defined
by a local endpoint (a pointer to a \texttt{struct~SCTP\_endpoint}), and
a remote endpoint (an SCTP port number and a list of transport
addresses). This is one of the most complicated structures in the
implementation as it includes a great deal of information mandated by
the RFC. Among many other things, this structure holds the state of
the state machine. The list of transport addresses for the remote
endpoint is more elaborate than the simple list of IP addresses in the
local endpoint data structure since SCTP needs to maintain congestion
information about each of the remote transport addresses.
\subsection{\texttt{struct~SCTP\_transport}}
A \texttt{struct~SCTP\_transport} is defined by a remote SCTP port number
and an IP address. The structure holds congestion and reachability
information for the given address. This is also where we get the list
of functions to call to manipulate the specific address family. For
TCP you would find this information way up in the socket, but this is
not possible for SCTP.
\subsection{\texttt{struct~SCTP\_chunk}}
Possibly the most fundamental data structure in lksctp is
\texttt{struct~SCTP\_chunk}. This holds SCTP chunks both inbound and
outbound. It is essentially an extension to \texttt{struct~sk\_buff}.
It adds pointers to the various possible SCTP subheaders and a few
flags needed specifically for SCTP. One strict convention is that
\texttt{chunk->skb->data} is the demarcation line between headers in
network byte order and headers in host byte order. All outbound
chunks are ALWAYS in network byte order. The first function which
needs a field from an inbound chunk converts that full header to host
byte order {\it in situ}.
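Converting a header in place, exactly once, might be sketched as
follows. The chunk header layout (type, flags, length) is the one
defined in RFC~2960. The wrapper structure and its
\texttt{hdr\_is\_host} flag are assumptions made for this example; as
described above, lksctp tracks the boundary with
\texttt{chunk->skb->data} instead of a flag.
{\tt \small
\begin{verbatim}
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chunk header as laid out in RFC 2960. */
struct mini_chunkhdr {
    uint8_t type;
    uint8_t flags;
    uint16_t length; /* only multi-byte field */
};

/* Hypothetical wrapper that remembers whether
 * the header has been converted yet. */
struct mini_chunk {
    struct mini_chunkhdr *hdr;
    int hdr_is_host;
};

/* Convert the header in place, at most once,
 * and return the length in host byte order. */
uint16_t mini_chunk_length(struct mini_chunk *c)
{
    assert(c != NULL && c->hdr != NULL);
    if (!c->hdr_is_host) {
        c->hdr->length = ntohs(c->hdr->length);
        c->hdr_is_host = 1;
    }
    return c->hdr->length;
}
\end{verbatim}
}
Guarding the conversion matters because \texttt{ntohs()} applied
twice on a little-endian host would swap the bytes back.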
\section{Acknowledgements}
The authors are members of a team at Motorola dedicated to producing
open source implementations in support of IETF standardisation. We
would like to thank the people who make these efforts possible,
specifically Maureen~Govern, Stephen~Spear, Qiaobing~Xie, and
Irfan~Ali. We are of course deeply indebted to Randall Stewart and
Qiaobing Xie for having created SCTP and for starting the Linux Kernel
SCTP Implementation Project. We wish to recognise the ongoing and
significant contributions from developers outside Motorola, especially
Jon Grimm and Daisy Chang of IBM, and Xingang Guo of Intel.
\section{Availability}
All the code discussed in this paper is available from the lksctp
project on SourceForge:
\begin{center}
\texttt{http://sourceforge.net/projects/lksctp/}
\end{center}
\begin{thebibliography}{2001}
\bibitem[RFC2960]{rfc2960} R.~Stewart, Q.~Xie, K.~Morneault, C.~Sharp,
H.~J.~Schwarzbauer, T.~Taylor, I.~Rytina, M.~Kalla, L.~Zhang, and
V.~Paxson, {\em Stream Control Transmission Protocol}, RFC~2960 (Oct~2000).
\bibitem[RFC2292]{rfc2292} W.~Stevens and M.~Thomas, {\em Advanced
Sockets API for IPv6}, RFC~2292 (Feb~1998).
\bibitem[SCTPAPI]{api} R.~Stewart, Q.~Xie, L.~H.~P.~Yarroll, J.~Wood,
K.~Poon, and K.~Fujita, {\em Sockets API Extensions for SCTP}, Work In
Progress, \texttt{draft-ietf-tsvwg-sctpsocket-00.txt} (Jun~2001).
\bibitem[SCTPIMPL]{impl} R.~Stewart {\it et al.},
{\em SCTP Implementor's Guide}, Work In Progress,
\texttt{draft-ietf-tsvwg-sctpimpguide-00.txt} (Jun~2001).
\bibitem[SCTPMIB]{mib} J.~Pastor and M.~Belinchon, {\em Stream Control
Transmission Protocol Management Information Base using SMIv2}, Work
In Progress, \texttt{draft-ietf-sigtran-sctp-mib-03.txt} (Feb~2001).
\bibitem[XP]{xp} K.~Beck, {\em Extreme Programming Explained: Embrace
Change}, Addison-Wesley Publishers (2000).
\bibitem[SCTPORG]{sctporg} {\em Randall Stewart's SCTP site},\\
\texttt{http://www.sctp.org} (2001).
\bibitem[SCTPDE]{sctpde} {\em T\"uxen/Jungmeier SCTP site},\\
\texttt{http://www.sctp.de} (2001).
\end{thebibliography}
\end{document}