% TEMPLATE for Usenix papers, specifically to meet requirements of
% TCL97 committee.
% originally a template for producing IEEE-format articles using LaTeX.
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
% Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
% both the above pioneers
% use at your own risk. Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point

% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document. Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.

% adapted for Ottawa Linux Symposium

\documentclass[twocolumn]{article}
\usepackage{ols,epsfig}
\begin{document}

%Remove this next line if your system defaults correctly.
\special{papersize=8.5in,11in}

%don't want date printed
\date{}

%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf Linux Kernel SCTP : The Third Transport}

%for single author (just remove % characters)
\author{
La~Monte H.P.\ Yarroll \\
%{\em Your Department} \\
{\em Motorola GTSS}\\
%{\em Your City, State, ZIP}\\
% is there a standard format for email/URLs??
% remember that ~ doesn't do what you expect, use \~{}.
{\normalsize piggy@acm.org} \\
%
% copy the following lines to add more authors
\and
Karl Knutson \\
{\em Motorola GTSS}\\
%% is there a standard format for email/URLs??
{\normalsize karl@athena.chicago.il.us}
%
} % end author

\maketitle

% You have to do this to suppress page numbers. Don't ask.
\thispagestyle{empty}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\subsection*{Abstract}
% nuked italics -- FD

The Stream Control Transmission Protocol (SCTP) is a reliable
message-oriented protocol with transparent support for multihoming.
It allows multiple independent complex exchanges which all share a
single connection and congestion context.

We provide an overview of the protocol, the UDP-style API and the
details of the Linux kernel reference implementation. The brief API
discussion is intended for developers wishing to use SCTP. The
detailed implementation discussion is for developers interested in
contributing to the kernel development effort.

\section{Introduction}

The developers at the Linux 2.5 Kernel Summit in San Jose achieved a
rough consensus that 2.5 should probably support SCTP, a new transport
protocol from the IETF. This paper introduces the ongoing work on
such an implementation, providing some details for both the
application developer and the kernel developer.

The Stream Control Transmission Protocol (SCTP) is a reliable
message-oriented protocol with transparent support for multihoming.
It allows multiple independent complex exchanges which all share a
single connection and congestion context.

\subsection{History of SCTP}
The SIGTRAN (Signalling Transport) Working Group of the IETF is
concerned with the transport of telephony signalling data over IP.
Upon reviewing the available standard transport protocols, they
concluded that none of them met the transport requirements of
signalling data.

SIGTRAN concluded that they needed a new transport protocol which
could provide reliable message delivery, tolerate network failures,
and avoid the head-of-line-blocking problem. We will discuss this
problem later.

The WG selected a proposal from Randall Stewart and Qiaobing Xie of
Motorola as a starting point. Stewart and Xie had developed a
Distributed Processing Environment, Quantix, aimed at telephony
applications. This DPE had been successfully demonstrated at Geneva
Telecom in 1999.

The Working Group took great care in constructing the new protocol,
SCTP, incorporating many lessons learned from TCP, such as congestion
control, selective ACK, message fragmentation and bundling.

The core transport protocol from Quantix brought support for
multihoming, message framing, and streams. We discuss all of these
features at length later.

The IESG decided that the resulting protocol was robust enough to be
elevated from a specialised transport for telephony signalling to a new
general purpose transport to stand beside UDP and TCP. To this end,
they moved the work from SIGTRAN to TSVWG, the general transport
group.

As of this writing, the core specification, \cite{rfc2960}, is at Proposed
Standard. There have been three successful bakeoffs covering over 25
separate implementations. Lessons learned from the most recent
bakeoff are being written up in an ``Implementor's Guide'', \cite{impl}.

\subsection{SCTP in the Linux kernel}

Shortly before the first bakeoff, the IESG asked SIGTRAN to move SCTP
from riding on UDP to riding directly on top of IP. The long term
goal was clearly to move SCTP from user space into the kernel.

Aside from the obvious performance gains, this has the effect of
reducing the number of implementations to roughly one per operating
system. This makes it easier to verify the stability of most of the
implementations which appear on the Internet.

Randall Stewart saw the importance of this and started one of the
authors of this paper working on a port of the user space
implementation to the Linux kernel. This port was intended as a
reference for developers of implementations for other kernels to
examine. The Linux kernel implementation has since diverged
significantly from the user space reference, but maintains the
standards of a reference implementation (see Coding Standards, below).

\subsection{SCTP examples}

SCTP is a reliable message-oriented protocol with transparent support
for multihoming. It allows multiple independent complex exchanges
which all share a single connection and congestion context.

Many network applications operate by continuously and simultaneously
exchanging short, similar sequences of data. The traffic produced by
these operations can be characterised as MICE (Multiple Independent
Complex Exchanges). Many applications which use MICE also have high
network reliability requirements.

\subsubsection{A database app}
One example is a client/server database application. Each request and
each response is a message. Each transaction is a sequence of
dependent request/response pairs.

Implemented over TCP, this application would have to provide its own
message boundaries, since TCP sends bytes, not messages. How do we
implement MICE with TCP? We have two ways of doing this: multiple
connections, or a single multiplexed and reused connection.

With each transaction over a separate TCP connection, we gain the
independence of transactions, but at a cost in performance. Since TCP
(as a general purpose transport protocol) uses congestion control,
each connection would have to go through slow-start, and if most
transactions were short, they would never get out of it.

With all transactions over a single TCP connection, we make efficient
use of the network bandwidth, but open ourselves up to the
head-of-line blocking problem. This means that if one segment in one
transaction is lost, this blocks all transactions, not just the one
with the lost segment.

If we use SCTP for the same application we gain the benefits of using
TCP, as well as advantages peculiar to SCTP. SCTP directly supports
messages and guarantees TCP-like levels of bandwidth efficiency via
bundling and fragmentation. Each database transaction can be
represented as an ordered stream of messages, and streams are
independent of one another in SCTP for retransmission purposes. This
means that while SCTP has the same congestion control mechanisms as
TCP, it does not have to resort to multiple connections, nor is it
vulnerable to the head-of-line blocking problem.

\subsubsection{A free clinic}

Another example of SCTP use is for a free\footnote{Free as in ``free
beer''.} clinic which needs a reliable way to use its IP-networked
patient monitoring software.

This has many similarities to the example above in that different
monitoring devices would need to send simultaneous
information---multiple independent complex exchanges. The main
difference is in the higher network reliability requirements.

A reasonable way to improve the network reliability is to set up a
parallel network and use multihoming for the client and server
applications. However, if the application is TCP-based, the
multihoming needs to be added to the application. With SCTP, the
multihoming ability is built into the protocol. All that is necessary
is to make the appropriate socket calls and SCTP will take advantage
of the addresses available in the existing network. This also applies
if one side of the connection has more addresses than the other.

\section{The UDP-style API}

Any new protocol needs an API. In particular for an Internet
protocol, it's important to have the API match the API normally used
for IP networks. This is the Berkeley sockets model---the SCTP
version is defined in the Internet Draft ``Sockets API Extensions for
SCTP''~\cite{api}. The API draft defines two complementary interfaces
to SCTP: a TCP-style interface for compatibility with older TCP-based
applications, and a UDP-style interface for new applications designed
expressly to use SCTP. The Linux Kernel SCTP stack does not yet
implement the former, so we discuss only the UDP-style interface.

The conceptual model of the UDP-style API is (naturally) that of plain
UDP. To send a message in UDP, you create a socket, bind an address
to it and send your message using \texttt{sendmsg()}. To receive a
message in UDP, you create a socket, bind an address to it and use
\texttt{recvmsg()}. It's much the same with the UDP-style API for
SCTP. To send a message, you create a socket, bind \textit{addresses}
to it and use \texttt{sendmsg()}. The SCTP stack underlying the API
handles association startup and shutdown automatically. The same goes
for message reception. To receive a message in UDP-style, you create
a socket, bind \textit{addresses} to it and use \texttt{recvmsg()}.

The important API differences between UDP and UDP-style SCTP are
multihoming, ancillary data, and the option of notifications from the
SCTP stack.

\subsection{Multihoming and \texttt{bindx()}}

There are three ways to handle multihoming in SCTP. One is to
ignore multihoming and use one address. Another way is to bind all
your addresses through the use of \texttt{INADDR\_ANY} or
\texttt{IN6ADDR\_ANY}. This will ``associate the endpoint with the
optimal subset of available local interfaces'' (Section 3.1.2,
\cite{api}). The most flexible way is through the use of
\texttt{sctp\_bindx()}, which allows additional addresses to be
added to a socket after the first one is bound with \texttt{bind()},
but before the socket is used to transfer or receive data. The
function \texttt{sctp\_bindx()} is further described in section 8.1 of
\cite{api}.

\subsection{Ancillary data}

To use streams with the UDP-style API, you use ancillary data in the
\texttt{struct~cmsghdr} part of the \texttt{struct~msghdr} argument to
both \texttt{sendmsg()} and \texttt{recvmsg()}. Ancillary data is
used for initialisation data (\texttt{struct~sctp\_initmsg}) and for
header data (\texttt{struct~sctp\_sndrcvinfo}).

Ancillary data is manipulated with the macros \texttt{CMSG\_FIRSTHDR,
CMSG\_NXTHDR, CMSG\_DATA, CMSG\_SPACE, \textnormal{and} CMSG\_LEN}.
These are all defined in \cite{rfc2292}. \cite{api} provides a nice
example in section 5.4.2.

{\tt \small
\begin{verbatim}
  struct sctp_initmsg {
      uint16_t sinit_num_ostreams;
      uint16_t sinit_max_instreams;
      uint16_t sinit_max_attempts;
      uint16_t sinit_max_init_timeo;
  };
\end{verbatim}
}

The initialisation ancillary data sets information for starting
new associations.

{\tt \small
\begin{verbatim}
  struct sctp_sndrcvinfo {
      uint16_t sinfo_stream;
      uint16_t sinfo_ssn;
      uint16_t sinfo_flags;
      uint32_t sinfo_ppid;
      uint32_t sinfo_context;
      uint8_t sinfo_dscp;
      sctp_assoc_t sinfo_assoc_id;
  };
\end{verbatim}
}

The header ancillary data reports information gleaned from the SCTP
headers. If requested with the \texttt{SCTP\_RECVDATAIOEVNT} socket
option, this ancillary data is provided with every inbound data
message. There is a handy key (\texttt{sinfo\_assoc\_id}) which
identifies the association for this particular message. It also
provides the flags needed to implement partial delivery of very large
messages.

Outbound messages should include an \texttt{sctp\_sndrcvinfo} ancillary
data structure to tell SCTP which SCTP stream to put this datagram
into. It is also possible to set a default stream so that this
ancillary data may be omitted.

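The following is a minimal sketch of how the \texttt{CMSG\_*} macros fit together when selecting a stream for an outbound message. It only builds and re-reads the control buffer; no socket is opened. The names \texttt{example\_sndrcvinfo}, \texttt{example\_assoc\_t}, \texttt{EXAMPLE\_SCTP\_SNDRCV}, \texttt{attach\_stream()} and \texttt{find\_stream()} are assumptions made for illustration: a real application would take the structure and the \texttt{cmsg\_type} constant from the SCTP headers described in \cite{api}.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical cmsg type value; the real constant comes from the SCTP headers. */
#define EXAMPLE_SCTP_SNDRCV 1

typedef uint32_t example_assoc_t; /* stand-in for sctp_assoc_t */

/* Local copy of the struct sctp_sndrcvinfo layout shown in the text. */
struct example_sndrcvinfo {
    uint16_t sinfo_stream;
    uint16_t sinfo_ssn;
    uint16_t sinfo_flags;
    uint32_t sinfo_ppid;
    uint32_t sinfo_context;
    uint8_t  sinfo_dscp;
    example_assoc_t sinfo_assoc_id;
};

/* Attach one ancillary data item selecting `stream` to `msg`, using the
 * caller-supplied control buffer. Returns 0, or -1 if the buffer is too small. */
int attach_stream(struct msghdr *msg, void *buf, size_t buflen, uint16_t stream)
{
    struct example_sndrcvinfo info;
    struct cmsghdr *cmsg;

    if (buflen < CMSG_SPACE(sizeof(info)))
        return -1;
    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(info));

    cmsg = CMSG_FIRSTHDR(msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = IPPROTO_SCTP;
    cmsg->cmsg_type = EXAMPLE_SCTP_SNDRCV;
    cmsg->cmsg_len = CMSG_LEN(sizeof(info));

    memset(&info, 0, sizeof(info));
    info.sinfo_stream = stream;
    memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
    return 0;
}

/* Walk the ancillary data, as a receiver would after recvmsg(), and return
 * the stream number of the first sndrcvinfo item, or -1 if there is none. */
int find_stream(struct msghdr *msg)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == IPPROTO_SCTP &&
            cmsg->cmsg_type == EXAMPLE_SCTP_SNDRCV) {
            struct example_sndrcvinfo info;
            memcpy(&info, CMSG_DATA(cmsg), sizeof(info));
            return info.sinfo_stream;
        }
    }
    return -1;
}
```

The \texttt{msghdr} prepared by \texttt{attach\_stream()} is what would be handed to \texttt{sendmsg()}; \texttt{find\_stream()} shows the matching walk over ancillary data on the receive side.
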
\subsection{Notifications}

SCTP supports optional notifications. These are
messages delivered in-band about events inside the SCTP stack, such as
a destination transport address failure or a new association coming
up. The notifications are marked with the \texttt{MSG\_NOTIFICATION}
flag in the \texttt{msg\_flags} field of the \texttt{struct~msghdr}
filled in by \texttt{recvmsg()}. The notification is delivered as the
body of the message returned by \texttt{recvmsg()}.

In Figure~\ref{notification} we find a table of notifications. Each
notification delivers its own data structure which shares the same
name (lower case, naturally) as the notification type itself. The first
field of every notification is a \texttt{uint16\_t} which carries the
notification type.

\begin{figure*}[t]
\begin{center}
{\tt
\begin{tabular}{ l l l }
 \hline
 \textnormal{Type} & \textnormal{Socket Option} & \textnormal{Description} \\
 \hline
 SCTP\_ASSOC\_CHANGE & SCTP\_RECVASSOCEVNT & \textnormal{Change of association} \\
 SCTP\_PEER\_ADDR\_CHANGE & SCTP\_RECVADDREVNT & \textnormal{Change in status of a given address} \\
 SCTP\_REMOTE\_ERROR & SCTP\_RECVPEERERR & \textnormal{An error received from a peer} \\
 SCTP\_SEND\_FAILED & SCTP\_RECVSENDFAILEVNT & \textnormal{A failure to send} \\
 SCTP\_SHUTDOWN\_EVENT & SCTP\_RECVDOWNEVNT & \textnormal{The reception of a \texttt{SHUTDOWN} chunk} \\
\end{tabular}
}
\end{center}
\caption{\label{notification}Useful notifications for an SCTP socket}
\end{figure*}

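As a hedged sketch of how a receiver might tell notifications from user data, the function below checks the notification flag and then dispatches on the leading \texttt{uint16\_t} type field. The numeric values of \texttt{EX\_MSG\_NOTIFICATION} and of the \texttt{EX\_SCTP\_*} constants are placeholders invented for this example, as is the function name \texttt{classify\_message()}; the real values come from the SCTP headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only; not taken from any real header. */
#define EX_MSG_NOTIFICATION 0x8000

enum ex_notification_type {
    EX_SCTP_ASSOC_CHANGE = 1,
    EX_SCTP_PEER_ADDR_CHANGE,
    EX_SCTP_REMOTE_ERROR,
    EX_SCTP_SEND_FAILED,
    EX_SCTP_SHUTDOWN_EVENT
};

/* Classify a message body returned by recvmsg(), given the msg_flags
 * value that recvmsg() stored in the struct msghdr. */
const char *classify_message(int msg_flags, const void *body, size_t len)
{
    uint16_t type;

    if (!(msg_flags & EX_MSG_NOTIFICATION))
        return "user data";
    if (len < sizeof(type))
        return "truncated notification";

    /* The first field of every notification is its uint16_t type. */
    assert(body != NULL);
    memcpy(&type, body, sizeof(type));

    switch (type) {
    case EX_SCTP_ASSOC_CHANGE:     return "association change";
    case EX_SCTP_PEER_ADDR_CHANGE: return "peer address change";
    case EX_SCTP_REMOTE_ERROR:     return "remote error";
    case EX_SCTP_SEND_FAILED:      return "send failed";
    case EX_SCTP_SHUTDOWN_EVENT:   return "shutdown event";
    default:                       return "unknown notification";
    }
}
```

After identifying the type, an application would cast the body to the matching notification structure to read the remaining fields.
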
\section{The lksctp Project}

A critical factor in the success of any new IETF protocol is of course
a Linux implementation. Fortunately, key personnel at Motorola
recognised this and encouraged us to tackle such a project. Months
later, we have a core implementation with an ever-expanding feature
set. We now have significant participation from developers at IBM and
Intel and the pace is picking up.

\subsection{Coding standards}

In addition to the usual requirements of kernel code, our code seeks
to be a useful reference for people making their own kernel
implementations of SCTP. If a reader has some question about how to
implement a particular section of the RFC, they need only grep for the
relevant text in our code and they can find an example. As much as
practical, we draw names directly from the RFC. We made the state
machine into an explicit table (see Figure~\ref{states} for an
excerpt) with names that refer directly back to the relevant section
numbers. Clarity is a compelling requirement for our code.

\subsection{Extreme Programming}

As the project grew and we added developers, we clearly needed some
way of coordinating our work. We decided to experiment with Extreme
Programming, \cite{xp}.

XP is a collection of practices aimed at controlling risk in a small
to medium-sized software development project. One important principle
is that you should do the simplest thing that could possibly work. A
second important principle is to take advantage of the fact that
programmers like to code.

We use a range of XP practices, but the practices which are most
visible to anybody who reads or works on lksctp are the tests and the
metaphors.

\section{The Tests}

One of the XP practices we use is code-to-the-test. XP asks, ``If
testing is good, why don't we do it all the time?'' Instead of writing
tests for working code, write tests first, and then write code to pass
the tests. This practice leads to a large automated test suite which
runs several times per day.

We use three kinds of test: unit tests, test frame functional tests,
and live kernel functional tests.

The most basic form of test is the unit test. Unit tests exercise all
the interfaces of a particular object and confirm that it behaves
correctly. They also encode regression checks for fixed bugs. These
tests all have names beginning with \texttt{test\_}.

The second form of test is the test frame functional test. These are
the tests with names beginning with \texttt{ft\_frame\_}. These tests
check for external behaviours of the system, but with a simulated
kernel. The simulated kernel is very lightweight and gives us very
fine control over things like timing and network properties.

Ideally, functional tests should be written by the customer for a
system---they encode the behaviours that the customer expects. In our
case, we play the role of customer on behalf of the RFC. We also use
test frame functional tests to define work items for off-site
development groups. The off-site group writes tests which describe the
feature they intend to implement and submits those tests as a
proposal. This has proven an excellent medium for describing work.

The final form of test we use is the live kernel functional test. We
have many fewer of these than we would like---they are difficult to run
since we must install and boot a kernel to test. This is much more
work than simply running \texttt{make unit\_test}. We are exploring
User-Mode Linux (UML) as a possible way to automate our kernel
functional tests. These tests have names beginning with
\texttt{ft\_kern\_}.

Code-to-the-test is a practice which you can introduce at any point in
a project. When you first start, it seems that you are spending more
time writing tests than writing code, but once you begin to have a
critical mass of interacting tests you begin to see significant
payoffs in both code quality and development velocity.

We have had several incidents where interactions between unit tests
and functional tests have uncovered complementary masking bugs.

Tests are not a substitute for understanding code---they are a
mechanism for encoding that understanding to share with other
developers, including future versions of yourself. You can learn
nearly as much about our code by reading our tests as by reading the
code itself.

Lately, we have begun using functional tests to encode major bugs.
These are among the best of all possible bug reports---they describe
the failure precisely and tell exactly when the problem is gone.
After the bugs are fixed the tests serve as part of the regression
suite.

\section{The Metaphors}

XP projects are built around a unifying metaphor rather than an
elaborate architecture. In our case, we chose two metaphors which
could serve quite well for nearly any protocol development project.

Our metaphors are the state machine and the smart pipe. Most readers
are probably familiar with the state machine, but the smart pipe is a
twist on a familiar concept. The idea behind a smart pipe\footnote{An
alternate term may be ``oven''.} is that raw stuff goes in one end and
cooked stuff comes out the other end.

\subsection{The State Machine}

The state machine in our implementation is quite literal. We have an
explicit state table which keys to specific state functions which are
tied directly back to parts of the RFC. The core of the state machine
(found in \texttt{sctp\_do\_sm()}) is almost purely functional---only header
conversions are permitted. Each state function produces a description
of the side effects (in the form of a \texttt{struct~sctp\_sm\_retval})
needed to handle the particular event. A separate side effect
processor, \texttt{sctp\_side\_effects()}, converts this structure into
actions.

Events fall into four categories. The first is arriving chunks; the
RFC is very explicit about the state transitions associated with them.
The second is primitive requests from upper layers; the RFC discusses
these transitions, but many of them are implementation dependent. The
third category of events is timeouts. The final category is a
catch-all for odd events like queues emptying.

\begin{figure*}[t]
\begin{center}
{\tt
\begin{tabular}{ l l l l l }
 \hline
 \textnormal{State:} & CLOSED & COOKIE-WAIT & COOKIE-ECHOED & ESTABLISHED \\
 \hline
 \hline
 \textnormal{Chunks} & & & & \\
 \hline

 INIT & do\_5\_1B\_init & do\_5\_2\_1\_siminit & do\_5\_2\_1\_siminit & do\_5\_2\_2\_dupinit \\
 INIT ACK & discard(5.2.3) & do\_5\_1C\_ack & discard(5.2.3) & discard(5.2.3) \\
 COOKIE ECHO & do\_5\_1D\_ce & do\_5\_2\_4\_dupcook & do\_5\_2\_4\_dupcook & do\_5\_2\_4\_dupcook \\
 COOKIE ACK & discard & discard(5.2.5) & do\_5\_1E\_ca & discard(5.2.5) \\
 DATA & tabort\_8\_4\_8 & discard(6.0) & discard(6.0) & eat\_data\_6\_2 \\
 SACK & tabort\_8\_4\_8 & discard(6.0) & eat\_sack\_6\_2\_1 & eat\_sack\_6\_2\_1 \\

 \hline
 \textnormal{Timeouts} & & & & \\
 \hline

 T1-INIT TO & bug & do\_4\_2\_reinit & bug & bug \\
 T3-RTX TO & bug & bug & do\_6\_3\_3\_retx & do\_6\_3\_3\_retx \\

 \hline
 \textnormal{Primitives} & & & & \\
 \hline

 PRM\_ASSOCIATE & do\_PRM\_ASOC & error & error & error \\
 PRM\_SEND & error & do\_PRM\_SENDQ6.0 & do\_PRM\_SENDQ6.0 & do\_PRM\_SEND \\

\end{tabular}
}
\end{center}
\caption{\label{states}Portion of SCTP state table showing association initialisation}
\end{figure*}

In order to create an explicit state machine, it was necessary to
first create an explicit state table. The process of creating this
table uncovered a few minor contradictions in one of the drafts of the
RFC. These mostly involved conflicting catch-all cases.
Figure~\ref{states} is an excerpt which shows the state functions
involved in initialising a new association.

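The split between a purely functional state machine and a separate side effect processor can be sketched as follows. This is a toy reduction of two rows of the state table (\texttt{INIT ACK} and \texttt{COOKIE ACK}), not the lksctp code: every \texttt{ex\_} name, the \texttt{ex\_retval} layout and the two counters are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

enum ex_state { EX_CLOSED, EX_COOKIE_WAIT, EX_COOKIE_ECHOED, EX_ESTABLISHED,
                EX_NUM_STATES };
enum ex_event { EX_EV_INIT_ACK, EX_EV_COOKIE_ACK, EX_NUM_EVENTS };
enum ex_action { EX_ACT_NONE, EX_ACT_SEND_COOKIE_ECHO, EX_ACT_NOTIFY_ULP };

/* Description of side effects, standing in for struct sctp_sm_retval. */
struct ex_retval {
    enum ex_state next_state;
    enum ex_action action;
};

typedef struct ex_retval (*ex_state_fn)(enum ex_state current);

/* State functions only describe what should happen; they change nothing. */
static struct ex_retval ex_discard(enum ex_state current)
{
    struct ex_retval rv = { current, EX_ACT_NONE };
    return rv;
}

static struct ex_retval ex_do_init_ack(enum ex_state current)
{
    struct ex_retval rv = { EX_COOKIE_ECHOED, EX_ACT_SEND_COOKIE_ECHO };
    (void)current;
    return rv;
}

static struct ex_retval ex_do_cookie_ack(enum ex_state current)
{
    struct ex_retval rv = { EX_ESTABLISHED, EX_ACT_NOTIFY_ULP };
    (void)current;
    return rv;
}

/* Explicit table: one row per event, one column per state. */
static const ex_state_fn ex_table[EX_NUM_EVENTS][EX_NUM_STATES] = {
    [EX_EV_INIT_ACK]   = { ex_discard, ex_do_init_ack, ex_discard, ex_discard },
    [EX_EV_COOKIE_ACK] = { ex_discard, ex_discard, ex_do_cookie_ack, ex_discard },
};

struct ex_assoc {
    enum ex_state state;
    int cookie_echoes_sent;
    int ulp_notifications;
};

/* Functional core: look up the state function and return its description. */
struct ex_retval ex_do_sm(enum ex_event ev, enum ex_state st)
{
    assert(ev < EX_NUM_EVENTS && st < EX_NUM_STATES);
    return ex_table[ev][st](st);
}

/* Side effect processor: the only place where the association changes. */
void ex_side_effects(struct ex_assoc *a, struct ex_retval rv)
{
    a->state = rv.next_state;
    switch (rv.action) {
    case EX_ACT_SEND_COOKIE_ECHO: a->cookie_echoes_sent++; break;
    case EX_ACT_NOTIFY_ULP:       a->ulp_notifications++;  break;
    case EX_ACT_NONE:             break;
    }
}
```

Because the state functions have no side effects, a unit test can call \texttt{ex\_do\_sm()} for any table cell and inspect the returned description without setting up any other machinery.
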
\subsection{The Smart Pipes}

Each smart pipe has one or more structures which define its internal
data, and a set of functions which define its external interactions.
In this respect these smart pipes can be considered a type of object,
in the OO sense. All of these definitions can be found in the include
file \texttt{<net/sctp/sctpStructs.h>}.

Most of our smart pipes have push inputs---external objects explicitly
put things in by calling methods directly. A pull input is
possible---the smart pipe would need to have a way to register a
callback function which can fetch more input in response to some other
stimulus.

Some of our pipes use pull outputs. For example,
\texttt{SCTP\_ULPqueue} passes data and notifications up the protocol
stack through explicit calls to the socket functions, usually
\texttt{recvmsg(2)}. Some of our smart pipes use push outputs. For
example, \texttt{SCTP\_outqueue} has a set of callback functions which
it invokes when it needs to send chunks out toward the wire.

There are four smart pipes in lksctp. They are
\texttt{SCTP\_inqueue}, \texttt{SCTP\_ULPqueue},
\texttt{SCTP\_outqueue}, and \texttt{SCTP\_packet}. The first two
carry information up the stack from the wire to the user; the second
two carry information back down the stack.

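To make the push and pull terminology concrete, here is a small sketch of a pipe with a push input and a choice of output style. It is an illustration under invented names (\texttt{ex\_pipe} and friends), and the ``cooking'' step simply doubles an integer; none of this is taken from the lksctp sources.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PIPE_CAP 8

typedef void (*ex_output_fn)(int cooked, void *ctx);

struct ex_pipe {
    int items[EX_PIPE_CAP];   /* cooked items waiting for a pull */
    size_t head;
    size_t count;
    ex_output_fn output;      /* non-NULL selects a push output */
    void *ctx;
};

void ex_pipe_init(struct ex_pipe *p, ex_output_fn output, void *ctx)
{
    assert(p != NULL);
    p->head = 0;
    p->count = 0;
    p->output = output;
    p->ctx = ctx;
}

/* Push input: raw stuff goes in. With a push output the cooked item is
 * handed straight to the callback; otherwise it waits to be pulled.
 * Returns 0 on success, -1 if the queue is full. */
int ex_pipe_push(struct ex_pipe *p, int raw)
{
    int cooked = raw * 2;     /* placeholder for the pipe's real work */

    if (p->output != NULL) {
        p->output(cooked, p->ctx);
        return 0;
    }
    if (p->count == EX_PIPE_CAP)
        return -1;
    p->items[(p->head + p->count) % EX_PIPE_CAP] = cooked;
    p->count++;
    return 0;
}

/* Pull output: the consumer asks for the next cooked item in FIFO order.
 * Returns 0 on success, -1 if nothing is waiting. */
int ex_pipe_pop(struct ex_pipe *p, int *out)
{
    if (p->count == 0)
        return -1;
    *out = p->items[p->head];
    p->head = (p->head + 1) % EX_PIPE_CAP;
    p->count--;
    return 0;
}

/* Example push-output callback: accumulate cooked items into *ctx. */
void ex_sum_sink(int cooked, void *ctx)
{
    *(int *)ctx += cooked;
}
```

A pipe initialised with a \texttt{NULL} callback behaves like the pull-output case described above, while one initialised with \texttt{ex\_sum\_sink} behaves like the push-output case.
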
\subsubsection{\texttt{SCTP\_inqueue}}

\texttt{SCTP\_inqueue} accepts packets and provides chunks. It is
responsible for reassembling fragments, unbundling, tracking received
TSNs for acknowledgement, and managing rwnd for congestion control.
There is an \texttt{SCTP\_inqueue} for each endpoint (to handle chunks
not related to a specific association) and one for each association.

The function \texttt{sctp\_v4\_rcv()} (which is the receiving function
for SCTP registered with IPv4) calls \texttt{sctp\_push\_inqueue()} to
push packets into the input queue for the appropriate association or
endpoint. The function \texttt{sctp\_push\_inqueue()} schedules
either \texttt{sctp\_bh\_rcv\_asoc()} or \texttt{sctp\_bh\_rcv\_ep()}
on the immediate queue to complete delivery. These functions call
\texttt{sctp\_pop\_inqueue()} to pull data out of the
\texttt{SCTP\_inqueue}. This function does most of the work for this
smart pipe.

The functions \texttt{sctp\_bh\_rcv\_ep()} and
\texttt{sctp\_bh\_rcv\_asoc()} run the state machine on incoming
chunks. Among many other side effects, the state machine can generate
events for an upper-layer-protocol (ULP), and/or chunks to go back out
on the wire.

\subsubsection{\texttt{SCTP\_ULPqueue}}

\texttt{SCTP\_ULPqueue} is the smart pipe which accepts events (either
user data messages or notifications) from the state machine and
delivers them to the ULP through the sockets layer. It is responsible
for delivering streams of messages in order. There is one
\texttt{SCTP\_ULPqueue} for every endpoint, but this is likely to
change at some point to one \texttt{SCTP\_ULPqueue} for each socket.
This smart pipe uses a data structure distributed between the
\texttt{struct~SCTP\_endpoint} and the
\texttt{struct~SCTP\_association}.

The state machine, \texttt{sctp\_do\_sm()}, pushes data into an
\texttt{SCTP\_ULPqueue} by calling
\texttt{sctp\_push\_chunk\_ULPqueue()}. It pushes notifications with
\texttt{sctp\_push\_event\_ULPqueue()}. The sockets layer extracts
events from an \texttt{SCTP\_ULPqueue} with
\texttt{sctp\_pop\_ULPqueue()}.

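In-order delivery per stream is what lets SCTP avoid head-of-line blocking between streams. The sketch below shows one way to hold back out-of-order messages on a single stream while other streams keep flowing, using the stream sequence number (SSN). The names, the fixed window of eight messages and the return convention are all assumptions for illustration; this does not describe how \texttt{SCTP\_ULPqueue} is actually implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_STREAMS 4
#define EX_WINDOW  8

struct ex_ulpq {
    uint16_t next_ssn[EX_STREAMS];        /* next SSN to deliver per stream */
    uint8_t  held[EX_STREAMS][EX_WINDOW]; /* 1 if that SSN slot is waiting */
};

void ex_ulpq_init(struct ex_ulpq *q)
{
    memset(q, 0, sizeof(*q));
}

/* Accept the message (stream, ssn) and return how many messages on that
 * stream become deliverable to the upper layer as a result. Messages
 * outside the small reordering window are simply ignored in this sketch. */
int ex_ulpq_push(struct ex_ulpq *q, uint16_t stream, uint16_t ssn)
{
    int delivered = 0;

    assert(stream < EX_STREAMS);
    if ((uint16_t)(ssn - q->next_ssn[stream]) >= EX_WINDOW)
        return 0;
    q->held[stream][ssn % EX_WINDOW] = 1;

    /* Deliver the longest in-order run now available on this stream only. */
    while (q->held[stream][q->next_ssn[stream] % EX_WINDOW]) {
        q->held[stream][q->next_ssn[stream] % EX_WINDOW] = 0;
        q->next_ssn[stream]++;
        delivered++;
    }
    return delivered;
}
```

A gap on one stream delays only that stream: a message arriving on any other stream is delivered immediately, which is the behaviour the database example relies on.
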
\subsubsection{\texttt{SCTP\_outqueue}}

\texttt{SCTP\_outqueue} is responsible for bundling logic, transport
selection, outbound congestion control, fragmentation, and any
necessary data queueing. It knows whether or not data can go out onto
the wire yet. With one exception noted below, every outbound chunk
goes through an \texttt{SCTP\_outqueue} attached to an association.
The state machine injects chunks into an \texttt{SCTP\_outqueue} with
\texttt{sctp\_push\_outqueue()}. They automatically push out the other
end through a small set of callbacks which are normally attached to an
\texttt{SCTP\_packet}.

The state machine is capable of putting a fully-formed packet directly
on the wire. At this point only \texttt{ABORT} uses this feature. It is
likely that we will refactor \texttt{INIT ACK} generation again to use
this feature.

\subsubsection{\texttt{SCTP\_packet}}

An \texttt{SCTP\_packet} is a lazy packet transmitter associated with a
specific transport. The upper layer pushes data into the packet,
usually with \texttt{sctp\_transmit\_chunk()}. The packet blindly
bundles the chunks. If it fills (hits the PMTU for its transport),
it transmits the packet to make room for the new chunk.
\texttt{SCTP\_packet} rejects chunks which need fragmenting. It is
possible to force a packet to transmit immediately with
\texttt{sctp\_transmit\_packet()}. \texttt{SCTP\_packet} tracks the
congestion counters, but handles none of the congestion logic.

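The lazy bundling behaviour can be sketched in a few lines. This is a simplified model under stated assumptions: it counts only chunk lengths against the PMTU (ignoring the common header and padding), ``transmitting'' just increments a counter, and all \texttt{ex\_packet} names are hypothetical rather than taken from lksctp.

```c
#include <assert.h>
#include <stddef.h>

struct ex_packet {
    size_t pmtu;         /* path MTU of the transport this packet is tied to */
    size_t used;         /* bytes of chunks bundled so far */
    int nchunks;         /* chunks bundled so far */
    int packets_sent;    /* stand-in for putting packets on the wire */
};

void ex_packet_init(struct ex_packet *p, size_t pmtu)
{
    assert(pmtu > 0);
    p->pmtu = pmtu;
    p->used = 0;
    p->nchunks = 0;
    p->packets_sent = 0;
}

/* Force transmission of whatever is bundled. Returns the number of
 * chunks that went out (0 if the packet was empty). */
int ex_packet_transmit(struct ex_packet *p)
{
    int sent = p->nchunks;

    if (sent > 0)
        p->packets_sent++;
    p->used = 0;
    p->nchunks = 0;
    return sent;
}

/* Bundle a chunk. If it does not fit, transmit the current packet first
 * to make room. Chunks larger than the PMTU would need fragmenting and
 * are rejected with -1. */
int ex_packet_push_chunk(struct ex_packet *p, size_t chunk_len)
{
    if (chunk_len > p->pmtu)
        return -1;
    if (p->used + chunk_len > p->pmtu)
        ex_packet_transmit(p);
    p->used += chunk_len;
    p->nchunks++;
    return 0;
}
```

Nothing is sent until a chunk fails to fit or the caller forces transmission, which is the sense in which the transmitter is lazy.
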
\section{More Data Structures}

Not everything is a state table or a smart pipe---after all, this is
the kernel and we ARE programming in C. Here again, we have followed
the RFC very closely. Most of the key concepts in the RFC manifest
themselves as explicit data structures. For convenience, we refer to
these data structures as ``nouns''.

Nearly all of the ``noun'' structures are designed for use with the
\texttt{sk\_buff} macros for list manipulation. These macros provide a
doubly-linked list with locking.

\subsection{\texttt{struct~SCTP\_proto}}

The entire lksctp universe is grounded in an instance of
\texttt{struct~SCTP\_proto} accessible through \texttt{sctp\_get\_protocol()}.
This structure holds system-wide defaults for things like the maximum
number of permitted retransmissions. It contains a list of all
endpoints on the system.

\subsection{\texttt{struct~SCTP\_endpoint}}

Each UDP-style SCTP socket has an endpoint, represented as a
\texttt{struct~SCTP\_endpoint}. Once we implement high-bandwidth sockets and
TCP-style sockets, it will be possible for multiple sockets to share a
single endpoint structure. The endpoint structure contains a local
SCTP port number and a list of local IP addresses. These two items
define the endpoint uniquely. In addition to endpoint-wide default
values and statistics, the endpoint maintains a list of associations.

\subsection{\texttt{struct~SCTP\_association}}

Each association structure (\texttt{struct~SCTP\_association}) is defined
by a local endpoint (a pointer to a \texttt{struct~SCTP\_endpoint}), and
a remote endpoint (an SCTP port number and a list of transport
addresses). This is one of the most complicated structures in the
implementation as it includes a great deal of information mandated by
the RFC. Among many other things, this structure holds the state of
the state machine. The list of transport addresses for the remote
endpoint is more elaborate than the simple list of IP addresses in the
local endpoint data structure since SCTP needs to maintain congestion
information about each of the remote transport addresses.

\subsection{\texttt{struct~SCTP\_transport}}

A \texttt{struct~SCTP\_transport} is defined by a remote SCTP port number
and an IP address. The structure holds congestion and reachability
information for the given address. This is also where we get the list
of functions to call to manipulate the specific address family. For
TCP you would find this information way up in the socket, but this is
not possible for SCTP.

\subsection{\texttt{struct~SCTP\_chunk}}

Possibly the most fundamental data structure in lksctp is
\texttt{struct~SCTP\_chunk}. This holds SCTP chunks both inbound and
outbound. It is essentially an extension to \texttt{struct~sk\_buff}.
It adds pointers to the various possible SCTP subheaders and a few
flags needed specifically for SCTP. One strict convention is that
\texttt{chunk->skb->data} is the demarcation line between headers in
network byte order and headers in host byte order. All outbound
chunks are ALWAYS in network byte order. The first function which
needs a field from an inbound chunk converts that full header to host
byte order {\it in situ}.

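The byte order convention can be illustrated with a small user-space sketch. It assumes that headers already pulled (before the demarcation offset) are in host order and everything after it is still in network order; that reading, the \texttt{ex\_chunk} buffer standing in for an \texttt{sk\_buff}, and the function name are assumptions for this example. The header layout (type, flags, 16-bit length) follows the chunk format in \cite{rfc2960}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Chunk header layout: 8-bit type, 8-bit flags, 16-bit length. */
struct ex_chunkhdr {
    uint8_t  type;
    uint8_t  flags;
    uint16_t length;
};

struct ex_chunk {
    uint8_t buf[64];
    size_t data_off;   /* demarcation line, playing the role of skb->data */
};

/* Convert the header at the demarcation line to host byte order in place,
 * then advance the line past it. Returns the chunk length in host order. */
uint16_t ex_pull_chunkhdr(struct ex_chunk *c)
{
    struct ex_chunkhdr hdr;

    assert(c->data_off + sizeof(hdr) <= sizeof(c->buf));
    memcpy(&hdr, c->buf + c->data_off, sizeof(hdr));
    hdr.length = ntohs(hdr.length);       /* only multi-byte fields change */
    memcpy(c->buf + c->data_off, &hdr, sizeof(hdr));
    c->data_off += sizeof(hdr);
    return hdr.length;
}
```

Because the conversion happens exactly once, when the line moves past a header, later code can read any header before the line without asking which byte order it is in.
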
\section{Acknowledgements}

The authors are members of a team at Motorola dedicated to producing
open source implementations in support of IETF standardisation. We
would like to thank the people who make these efforts possible,
specifically Maureen~Govern, Stephen~Spear, Qiaobing~Xie, and
Irfan~Ali. We are of course deeply indebted to Randall Stewart and
Qiaobing Xie for having created SCTP and for starting the Linux Kernel
SCTP Implementation Project. We wish to recognise the ongoing and
significant contributions from developers outside Motorola, especially
Jon Grimm and Daisy Chang of IBM, and Xingang Guo of Intel.

\section{Availability}

All the code discussed in this paper is available from the lksctp
project on SourceForge:

\begin{center}
\texttt{http://sourceforge.net/projects/lksctp/}
\end{center}

\begin{thebibliography}{2001}

\bibitem[RFC2960]{rfc2960} R.~Stewart, Q.~Xie, K.~Morneault, C.~Sharp,
H.~J.~Schwarzbauer, T.~Taylor, I.~Rytina, M.~Kalla, L.~Zhang, and
V.~Paxson, {\em Stream Control Transmission Protocol}, RFC~2960 (Oct~2000).

\bibitem[RFC2292]{rfc2292} W.~Stevens, M.~Thomas, {\em Advanced Sockets
API for IPv6}, RFC~2292 (Feb~1998).

\bibitem[SCTPAPI]{api} R.~Stewart, Q.~Xie, L.~H.~P.~Yarroll, J.~Wood,
K.~Poon, K.~Fujita, {\em Sockets API Extensions for SCTP}, Work In
Progress, \texttt{draft-ietf-tsvwg-sctpsocket-00.txt} (Jun~2001).

\bibitem[SCTPIMPL]{impl} R.~Stewart {\it et al.},
{\em SCTP Implementor's Guide}, Work In Progress,
\texttt{draft-ietf-tsvwg-sctpimpguide-00.txt} (Jun~2001).

\bibitem[SCTPMIB]{mib} J.~Pastor, M.~Belinchon, {\em Stream Control
Transmission Protocol Management Information Base using SMIv2}, Work
In Progress, \texttt{draft-ietf-sigtran-sctp-mib-03.txt} (Feb~2001).

\bibitem[XP]{xp} K.~Beck, {\em Extreme Programming Explained: Embrace
Change}, Addison-Wesley Publishers (2000).

\bibitem[SCTPORG]{sctporg}{\em Randall Stewart's SCTP site},\\
\texttt{http://www.sctp.org}, (2001).

\bibitem[SCTPDE]{sctpde}{\em T\"uxen/Jungmeier SCTP site},\\
\texttt{http://www.sctp.de}, (2001).

\end{thebibliography}

\end{document}